The efforts in psychology to improve the believability of our science can be boiled down to simple and easy changes to our standard research practices. As a field we should:
- Provide more information with each paper, such as the study materials, hypotheses, data, and analysis syntax, so others can double-check our work (through the OSF or improved journal reporting practices)
- Design our studies so they have adequate power or precision to evaluate the theories we are purporting to test (i.e., use larger sample sizes)
- Provide more information about effect sizes in each report, such as the effect size for each analysis and its respective confidence interval (a brief sketch follows this list)
- Reward the production and publication of direct replications
When confronted with these recommendations, many researchers balk. This surprises me, because most of the recommendations are mundane and easy to implement. Why would researchers choose not to embrace them as a means of improving the quality of their work?
I believe the reason for the passion behind the protests is that the proposed changes undermine the incentive structure on which the field is built. What is that incentive structure?
In my opinion, our incentive system rewards four qualities: 1) finding p-values less than .05, 2) running small-N conceptual replications, 3) discovering counter-intuitive findings, and 4) producing a clean narrative.
p < .05
The first, and seemingly most valued, component of psychological science is that your findings must be “statistically significant,” indicated concretely by results where the probability of obtaining data at least as extreme as yours, assuming the null hypothesis is true, is less than 5%. Researchers must attain a p-value below .05 to be a success in psychological science. If your p-value is greater than .05, you have no finding and nothing to say, because your work will not be published, discussed in the press, or net you a TED talk.
Because the p-value is the primary key to the domain of scientific success, we do almost anything we can to find the desired small p-value. We root around in our data, digging up p-values by cherry-picking studies, selectively reporting outcomes, or running some arcane statistical model. It is clear from reviews of psychological science that we not only value p-values less than .05, but have also been remarkably successful in keeping other p-values out of print. In our published literature, psychology confirms 95% of its hypotheses (Fanelli, 2012).
Even worse, we punish those who try to publish null effects by considering them “second stringers” or incompetent—especially if they fail to replicate already published (and, by default, statistically significant) effects. Of course, if you successfully emerge from your graduate training still possessing the view that the ideal of psychological science is the pursuit of truth, maybe you deserve to be punished. The successful, eminent scientists in our field know better. They know that “the game” is not to pursue truth, but to produce something with p < .05. If you don’t figure that out early, you are destined to be unsuccessful, because the people in control of resources are the ones who’ve succeeded at the game as it is played now.
Small-N Conceptual Replications
Under the right circumstances, conceptual replications are an excellent device in the researcher’s tool kit. The problem, of course, is that the “right circumstances” are those in which an effect is reproducible—as in directly replicable. In the absence of evidence that an effect can be directly replicated, conceptual replications might as well be a billboard screaming that the effect cannot be directly reproduced and that the author was left sifting through multiple studies, or multiple outcomes across studies, to find a statistically significant effect.
And, for seemingly good reasons, the ideal conceptual replication is a small-N replication. Despite decades of smart methodologists pointing out that our research is poorly designed to detect such subtle things as between-subjects 2x2 interaction effects, researchers continue to plug away with sample sizes well south of 100 when they should be using samples in excess of 400 (Cohen, 1990; Simonsohn, 2014).
Of course, it is easy to rationalize the value of small-N studies and conceptual replications. Small-N studies are quick and easy to run. They incur little cost and therefore carry small consequences if the findings don’t work out. Given our tolerance for tossing null effects in the wastebasket, there is really no incentive to run larger studies. And, given that the modal paper in journals like the Journal of Personality and Social Psychology is a multi-study package of five or more small-N studies, conceptual replications have become the gold standard. Unfortunately, it is a package of irreproducibility.
Counter-Intuitive Findings
The third ideal in psychological science is to be the creative destroyer of widely held assumptions. One of the primary routes to success in psychology, for example, is to be surprising. The best way to be surprising is to be the counter-intuitive innovator—identifying ways in which human behavior is irrational, unpredictable, or downright startling (Ross, Lepper, & Ward, 2010). Now that media mentions and alternative metrics, like number of Twitter followers, are being used to evaluate scholars, it seems the incentive to publish clickbait-worthy research is only increasing.
In one respect it is hard to argue with this motive. We hold those scientists who bring unique discoveries to their field in the highest esteem. And, every once in a while, someone actually does do something truly innovative. In the meantime, we get caught up in the pursuit of cutesy counter-intuitiveness in the hope that our predilection will become the next big innovation. To be clear, it is really cool when researchers identify something counter-intuitive about human behavior. But the singular pursuit of such goals often leads us to ignore the enduring questions of the human dilemma.
The Tyranny of the Clean Narrative
The last piece of the incentive structure is quite possibly the most insidious because everyone pushes it—authors, reviewers, and editors alike. To be successful, your research must provide a clean narrative. The research story must have a consistent beginning, middle, and end. This means the introduction must correspond perfectly to the method section, which must correspond perfectly to the findings, which must all be statistically significant and confirm the hypotheses stated in the introduction. The powerful incentive for a clean narrative promotes many of the questionable research practices we use. We HARK (Hypothesizing After the Results are Known) so as to make a clean narrative. We throw out null findings, or are told to throw out null findings by reviewers and editors, in order to achieve a clean narrative. We avoid admitting to failures to replicate, again because doing so would undermine a clean narrative.
The bias towards a clean narrative is especially prominent at our most prestigious journals. Our top journals envision themselves as repositories for impressive new discoveries, and new discoveries cannot possess blemishes. Prioritizing a clean narrative leads reviewers and editors to act as gatekeepers and to recommend, mistakenly, against publishing studies that contain null effects. For that matter, when we as researchers fail to put together a consistent package of studies, we usually self-select the paper into a lower-tier journal because we know it won’t be received well at the top outlets. That means our most honest science is most likely found in our “worst” journals, because they tend to be more forgiving of messy narratives.
Summary
In sum, these four pillars of perverse incentives stand strong against efforts to make our science more transparent and reproducible. Arguments against these changes, by their nature conservative arguments for the status quo, only help to perpetuate a system that has rewarded individuals and individual careers but has undermined the integrity and reliability of our science. Reporting only statistically significant findings yields a literature that does not represent the truth. Pushing small-N conceptual replications aids and abets the hiding of inelegant findings that do not conform perfectly to the theories we test. Overvaluing counter-intuitive findings undermines the development of the cumulative knowledge that might be relied on for social policy. Policing studies so that they report only “clean findings,” and thus have a clean narrative, further promotes a depiction of science that is too good to be true.
Reasons for Pessimism
For some, the open science movement and the efforts of specific journals to change the parameters of the publication process give rise to the hope that our scientific products will become more reliable (Vazire, 2014). I am pessimistic about our ability to change the existing system for one overarching reason: the proposed changes of the open science and reproducibility movement are largely perceived as punitive. They do not provide an alternative, compelling reward structure; instead, they impose a largely corrective check on existing practices.
And, it is difficult to see how it can be any other way. The existing system has maximized our success. It provides a weird, p-value-driven utopia. With the infinite flexibility of our current incentive system we can, as Simmons, Nelson, and Simonsohn (2011) showed, provide empirical evidence for any idea, no matter how absurd. All it takes is enough data and analyses. I fear these incentives have produced a system in which many, many more people have succeeded with ideas that will not stand the test of time or replication, whichever comes first. In other words, we have an excess of success.
The problem with the current push for methodological reform is that it is hard, unrewarding, and will result in a science that is a lot uglier than the one our current system produces. The truth is less elegant than what we publish in our scientific journals. Adopting a sounder approach to our methods, which is critical for our long-term viability as a science, will inevitably curtail our excess of success. There will be fewer famous psychologists, fewer book contracts, and fewer TED talks.
This is one reason why people fight so strongly against the benign reforms being proposed. We’ve had a good gig for a long time and the future will be less bright if we do things with transparency and reproducibility in mind. Of course, it is quite possible that our future will be less bright either way.
Relatedly, my pessimism is deepened by watching a critical mass of the senior leadership in psychology protest the proposed changes of the reform movement. On one hand, I can understand their reluctance to change. The current system has obviously served them well. After all, they are some of the most eminent, successful researchers in our guild.
On the other hand, I simply cannot fathom leading scientists arguing against things like making their research more transparent, increasing the power of their studies, and making sure their effects are replicable. Keeping the status quo is like keeping the leftover holiday roast to rot in the back of the refrigerator. It festers, grows mold, and stinks up the entire refrigerator, sullying all of the other items contained therein. Why not just throw it out? Holding on to the old ways makes everyone’s work reek and does unknown harm to the entire field. The reform movement may result in a refrigerator that is less full, but at least the food that remains will be edible.
My fear at this juncture is that the punitive nature of the reform movement, combined with the lukewarm reception by senior leaders in psychology, will maroon an entire cohort of young scholars. We’ve given young scholars an impossible choice: do things according to the existing reward structure and produce, in the ideal case, an exciting and provocative, if ephemeral, set of findings; or do things according to the reform movement and produce something sober, maybe a little messy, and real. And, of course, this vision ignores the fact that psychology is part of a larger network of scientists, funders, and benefactors, who are also free to look into the refrigerator and find things too smelly for their liking. Without a quick, decisive change in our approach to conducting science, from top to bottom, I fear we will cause our field irreparable harm.
Footnotes
This essay is based, in part, on a previous blog post (http://wp.me/p1b8ZP-3x). I would also like to express my gratitude to Chris Fraley and the graduate students who reviewed earlier versions of this essay and provided invaluable comments intended to improve the document.
Small grammatical changes were made to the article by the author after the article had been posted. The current version reflects those changes.
References
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Fanelli, D. (2012). Negative results are disappearing from most disciplines and countries. Scientometrics, 90, 891-904.
Ross, L., Lepper, M., & Ward, A. (2010). History of social psychology: Insights, challenges, and contributions to theory and application. In S. T. Fiske, D. T. Gilbert, & G. Lindzey (Eds.), Handbook of social psychology (5th ed., Vol. 1). Hoboken, NJ: Wiley.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.
Simonsohn, U. (2014). Small telescopes: Detectability and the evaluation of replication results. Available at SSRN: http://ssrn.com/abstract=2259879 or http://dx.doi.org/10.2139/ssrn.2259879
Vazire, S. (2014). Why I’m optimistic. http://sometimesimwrong.typepad.com/wrong/2014/12/why-i-am-optimistic.html