Issues & Discussions

Please go to our new website and blog at: https://www.maer-net.org/blog/towards-a-credibility-revolution for updates and to comment. Please join the conversation.

Towards a Credibility Revolution:
Why successful replication remains unlikely

Power Failure

Recent meta-science studies find that psychology is typically 4 times more powerful than medical research, and its median power is twice as large as that of economics.¹,²,³ Yet, only 8% of psychological studies are adequately powered. Statistical power is the probability that a study of a given precision (or sample size) will find a statistically significant effect when a genuine effect exists. For a half century following Cohen, adequate power (80%) has been deemed a prerequisite of reliable research (see, for example, the APA Publication Manual). With statistical power so low, how is it possible that the majority of published findings are statistically significant?⁴ Something does not add up.
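To make the 80% convention concrete, here is a minimal sketch of a power calculation for a two-sided z-test at the 5% level. It assumes the estimate is approximately normally distributed; the function name `power_two_sided` and the example numbers are ours, chosen for illustration, and are not taken from the studies cited above.

```python
from statistics import NormalDist


def power_two_sided(true_effect: float, se: float, alpha: float = 0.05) -> float:
    """Probability that |estimate / se| exceeds the critical value.

    Assumes the estimate is normally distributed around `true_effect`
    with standard error `se` (an illustrative simplification).
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    shift = abs(true_effect) / se     # true effect measured in SE units
    # A two-sided test rejects in either tail.
    return (1 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


# A true effect 2.8 standard errors from zero gives roughly 80% power;
# a true effect of zero gives the nominal 5% rejection rate.
print(round(power_two_sided(0.28, 0.10), 2))
print(round(power_two_sided(0.0, 0.10), 2))
```

Read the other way, adequate power under these assumptions requires a standard error no larger than about 1/2.8 of the true effect.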

The Incredible Shrinking Effect

When 100 highly-regarded psychological experiments were replicated by the Open Science Collaboration, the average effect size shrank by half.⁵ Effect sizes again shrank by half when 21 experiments published in Nature and Science were replicated.⁶ Size matters. In economics, a simple weighted average of adequately-powered results is typically one-half the size of the average reported economic effect, and one-third of all estimates are exaggerated by a factor of 4.³ However, low power and research inflation are the least of social sciences’ replication problems.

On the Unreliability of Science—Heterogeneity

“What meta-analyses reveal about the replicability of psychological research” demonstrates that high heterogeneity is the more stubborn barrier to successful replication in psychology. Even if a replication study were huge, involving millions of experimental subjects and, thereby, having 100% power, typical heterogeneity (74%) makes close replication unlikely. In that case, the probability that the replicated experiment will roughly reproduce some previous study’s ‘small’ effect (that is, a standardized mean difference, SMD, between 0.2 and 0.5) is still less than 50%.¹ Heterogeneity is the variation among ‘true’ effects; in other words, it measures the differences in experimental results not attributable to sampling error. Supporters of the status quo are likely to point out that the high heterogeneity that this survey uncovers includes ‘conceptual’ replication as well as ‘direct’ replication. True enough, but large-scale replication efforts that closely control experimental and methodological factors (e.g. the Registered Replication Reports and the Many Labs projects) still report sufficient heterogeneity to make close replication unlikely.¹,⁷
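The argument can be illustrated with a short calculation. The sketch below assumes that ‘true’ effects are normally distributed around a mean with standard deviation `tau`, and that the replication has no sampling error at all (the 100% power case). The mean of 0.35 (the midpoint of the ‘small’ range, the most favourable case) and `tau` of 0.35 SMD are hypothetical values chosen for illustration, not figures quoted from the survey.

```python
from statistics import NormalDist


def prob_true_effect_in_range(mean: float, tau: float,
                              low: float = 0.2, high: float = 0.5) -> float:
    """P(low < true effect < high) when true effects ~ Normal(mean, tau).

    With an infinite sample the replication estimate equals its own
    true effect, so heterogeneity (tau) is the only source of variation.
    """
    dist = NormalDist(mu=mean, sigma=tau)
    return dist.cdf(high) - dist.cdf(low)


# Hypothetical inputs: mean at the midpoint of the 'small' range and a
# heterogeneity standard deviation of 0.35 SMD (assumed for illustration).
p = prob_true_effect_in_range(mean=0.35, tau=0.35)
print(f"P(0.2 < effect < 0.5) = {p:.2f}")
```

Under these assumptions the probability stays below one-half whenever `tau` exceeds about 0.22 SMD, however large the replication study is.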

This is not to argue that large-scale, careful replication should not be undertaken. Indeed, it should, because such replications often provide the best scientific evidence available to the social and medical sciences. Unfortunately, large-scale multi-lab replication projects are feasible for only a relatively few areas of research where studies can be conducted cheaply and quickly.

Enter Meta-Analysis

For some decades, meta-analyses that collect and analyze all relevant research evidence were seen to be the best summaries of research evidence and the very foundation of evidence-based practice (think the Cochrane and Campbell Collaborations). As reported in a recent Science article, meta-analysis has also been dragged into the credibility crisis and can no longer be relied upon to settle all disputes. After all, that’s a pretty high bar! Unfortunately, conventional meta-analysis is easily overwhelmed by high heterogeneity when accompanied by some degree of selective reporting for statistical significance. Even when the investigated social science phenomenon does not truly exist, conventional meta-analysis is virtually guaranteed to report a false positive.⁸ And no single publication bias correction method is entirely satisfactory.⁸,⁹

The Way Forward

With crisis comes opportunity. In a recent authoritative survey of the credibility of economics research, Christensen and Miguel (2018) emphasize transparency and replication as the way forward.¹⁰ We believe that the current discussion of ‘crisis’ can be transformed into a credibility revolution if a consensus can be formed about taking a few feasible steps that harden and clarify our research practices. For the sake of brevity, permit us to sketch such steps:

1. Carefully distinguish between exploratory and confirmatory research studies.
Both types of investigations are quite valuable. The central problem of the decades-long statistical significance controversy is that exploratory research is presented in terms of statistical hypothesis testing as if it were confirmatory. Yet, early research that identifies where, how, and under which conditions some new phenomenon is expressed is essential. If only it could be presented and published for what it is without the pretense of hypothesis testing. After some years of exploration, a meta-analysis could be used to assess whether the phenomenon in question merits further confirmatory study. If so, a confirmatory research stage should be undertaken where adequately-powered and pre-registered studies that employ classical hypothesis testing are highly valued and encouraged. During the confirmatory research stage, transparency would be quite helpful.

2. Support large-scale, pre-registered replications of mature areas of research.
Large-scale, pre-registered replications are especially valuable during the confirmatory stage of social science research. These efforts have already begun and need to be more highly encouraged and supported through greater funding and by the prestigious publication of multiple-authored reports in our best scholarly journals.

3. Emphasize practical significance over statistical significance.
Many of the debates across the social sciences would disappear if researchers agreed upon how large an effect needs to be in order to be worthy of scientific or practical notice, i.e. ‘practical significance.’ The problem is that the combination of high heterogeneity and some selective reporting of statistically significant findings (because the current paradigm values them) makes it impossible for social science research, no matter how rigorous and well-conducted, to distinguish some quite small effect from nothing. Identifying ‘very small’ effects reliably is simply beyond social science. However, meta-analysis can often reliably distinguish a ‘practically significant’ effect (say, 0.1 SMD or 0.1 elasticity) from a zero effect even under the severe challenges of high heterogeneity and notable selective reporting bias.

With a few modest, but real, changes, genuine scientific progress can be made.

Researchers of the World, unite.

—T.D. Stanley and Chris Doucouliagos

References:

1. Stanley, T.D., Carter, E. and Doucouliagos, H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin. http://psycnet.apa.org/doi/10.1037/bul0000169
2. Lamberink et al. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology, 102: 123–128.
3. Ioannidis, J. P. A, Stanley, T. D., & Doucouliagos, C(H). (2017). The power of bias in economics research. The Economic Journal, 127: F236-265. doi:10.1111/ecoj.12461
4. Brodeur, A., Le, M., Sangnier, M., and Zylberberg, Y. (2016). Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8:1-32.
5. Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716–aac4716. doi:10.1126/science.aac4716
6. Camerer et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour. https://www.nature.com/articles/s41562-018-0399-z
7. McShane et al. (2018). Large scale replication projects in contemporary psychological research. The American Statistician, forthcoming.
8. Stanley, T. D. (2017). Limitations of PET-PEESE and other meta-analysis methods. Social Psychological and Personality Science, 8: 581–591.
9. McShane, B. B., Böckenholt, U. & Hansen, K. T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11: 730–749.
10. Christensen, G. and Miguel, E. (2018). Transparency, reproducibility, and the credibility of economics research. Journal of Economic Literature, 56: 920–80.

Past Discussions on Common Pitfalls in conducting Meta-Regression Analysis in Economics

using t-values as effect sizes
reducing economic effects or tests to categories of statistical significance for the purpose of probit (or logit) meta-regression analysis (MRA).

There is a consensus among MAER-Net members that these are ‘pitfalls’ in the sense they are often misinterpreted and/or poorly modelled. MAER-Net does not wish to ‘prohibit’ the use of logit/probit or t-values in meta-analysis. We merely caution those who choose to do so to exercise greater care interpreting the results from their MRAs.

Why issue this caution? A full justification is beyond the scope of any internet post; however, a brief sketch might look something like this:

Probit/Logit MRAs:

reducing any statistical effect or test to crude categories (such as statistically significant and positive, statistically insignificant, statistically significant and negative, or similar ones) will necessarily lose much of the information that is needed to reliably identify the main drivers of reported research findings. This loss of information is often fatal and almost always unnecessary.
doing so inextricably conflates selective reporting bias with evidence of a genuine economic effect. It is not possible to separate out whether a statistically significant result is due to the researchers’ desire to find such an effect or to some underlying genuine economic phenomenon. Logit/probit MRAs are just as likely to be identifying factors related to bad science as they are to be revealing the economic phenomenon under investigation. However, this is not how logit/probit MRAs are interpreted; rather, they are claimed to identify structure in the underlying economic phenomenon.
using better statistical methods is almost always possible whenever the research that is being systematically reviewed is the result of a statistical test or estimate.
conducting these logit/probit MRAs amounts to little more than sophisticated ‘vote-counting,’ which is considered to be bad practice in the broader community of meta-analysts. For example, Hedges and Olkin (1985) prove that vote counts are more likely to come to the wrong conclusion as more research accumulates, just the opposite of the desirable statistical property, consistency.
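The vote-counting point can be sketched numerically. Assuming every study has the same power and that studies are independent (simplifications of ours, not part of the original argument), the number of significant results is binomial, and the chance that significant results form a majority can be computed exactly. The power of 0.3 and the function name are hypothetical.

```python
from math import comb


def prob_majority_significant(power: float, k: int) -> float:
    """P(more than half of k independent studies are significant)."""
    need = k // 2 + 1  # smallest strict majority
    return sum(comb(k, s) * power**s * (1 - power)**(k - s)
               for s in range(need, k + 1))


# A genuine effect studied with 30% power (an assumed value): the vote
# count 'finds' the effect less and less often as studies accumulate.
for k in (11, 51, 201):
    print(k, prob_majority_significant(0.3, k))
```

Whenever the assumed power is below one-half, adding studies pushes this probability toward zero, which is the opposite of consistency.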

t-values

When t-values are used as the dependent variable, all the moderator variables need to be divided by SE. If they are not, their MRA coefficients reflect differential publication bias, not some genuine economic effect.
t-values cannot be considered to be an ‘effect size.’ Doing so inevitably runs into any number of paradoxes or problems of interpretation. As long as the underlying economic effect is anything other than 0, t-values must increase proportionally with sqrt(n) and precision (1/SE). So which value of precision or sqrt(n) should the meta-analyst choose? The perfect study has precision and sqrt(n) approaching infinity. But here, the t-value will also approach infinity, even when the effect is tiny. Nor is the average t-value a meaningful summary of a research literature. For example, suppose the average t-value of the price elasticity of prescription drugs is -2 (or -1, -3, or any number). Can we infer that prescription drugs are highly sensitive (or insensitive) to prices? Depending on the typical sample size, any of these average t-values is consistent with an elastic or an inelastic demand for prescription drugs. Worse still, any average absolute t-value a little larger or smaller than 2 is compatible with a perfectly inelastic demand for prescription drugs and some degree of selection for a statistically significant price effect. Nothing about this important economic phenomenon can be inferred from the typical, or the ideal, t-value.
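A small sketch of the scaling argument: for a fixed effect, the expected t-value is the effect divided by its standard error, and the standard error shrinks with sqrt(n). The simplification SE = sigma / sqrt(n), the value of `sigma`, and the elasticities below are all hypothetical choices made for illustration.

```python
from math import sqrt


def expected_t(effect: float, sigma: float, n: int) -> float:
    """Expected t-value when SE = sigma / sqrt(n) (a simplifying assumption)."""
    return effect / (sigma / sqrt(n))


# Two very different hypothetical price elasticities, one t-value:
inelastic = expected_t(effect=-0.1, sigma=5.0, n=10_000)  # tiny effect, large sample
elastic = expected_t(effect=-2.0, sigma=5.0, n=25)        # large effect, small sample
print(round(inelastic, 2), round(elastic, 2))

# The same tiny effect produces an ever larger |t| as n grows.
for n in (100, 10_000, 1_000_000):
    print(n, round(expected_t(-0.1, 5.0, n), 1))
```

Because the two cases are indistinguishable by their t-values alone, the elasticity and its standard error need to be analyzed directly.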