Jul 10

Crimes and Misdemeanors: Reforming Social Psychology

The recent news of Dirk Smeesters’ resignation is certainly not good news for social psychology, particularly so soon after the Diederik Stapel case, but I believe it can serve as an opportunity for the field to take important steps towards reform. The reforms that are needed the most, however, are not restricted to preventing or detecting the few instances of fraud by unscrupulous researchers who are intentionally falsifying data. What we should be more concerned about are the far less egregious, but much more common offenses that many of us commit, often unknowingly or unintentionally, and almost never with fraudulent intent.

In an excerpt from his new book that appeared in the Wall Street Journal last month, Dan Ariely (@danariely) relates an anecdote about a locksmith who explains that, “locks are on doors to keep honest people honest.” Ariely goes on to say that:

One percent of people will always be honest and never steal. Another 1% will always be dishonest and always try to pick your lock and steal your television; locks won’t do much to protect you from the hardened thieves, who can get into your house if they really want to. The purpose of locks, the locksmith said, is to protect you from the 98% of mostly honest people who might be tempted to try your door if it had no lock.


The same is true of psychologists. A very small fraction of us are “hardened thieves” who are intent on falsifying their data. This is a problem and we should do our best to identify these individuals and expunge their publications from the field. But the vast majority of psychologists are honest people who – while motivated to produce publications and to get jobs, promotions, and prestige – are genuinely committed to the pursuit of the truth. It has become increasingly clear, however, that it is easy for the pursuit of the truth to be derailed by seemingly minor methodological mistakes that yield significant effects, even where none exist. These mistakes – or doors, to stick with the locksmith’s metaphor – are usually entered into without fraudulent or malicious intent. That’s why we need locks: to keep honest people from wandering through these doors.

So why would someone honest walk through a door with no lock on it in the context of psychology research? I believe there are two reasons: rationalization and ignorance – they are able to convince themselves that it’s not a major problem, or they simply don’t know any better. We could take a very significant step towards improving both of these issues by adopting the recommendations put forward by Joe Simmons, Leif Nelson, and Uri Simonsohn in their recent False Positive Psychology (PDF) paper.

The paper explains how we increase the likelihood of false positive results through seemingly minor methodological problems, including flexibility in sample size based on interim data analysis (data peeking), failure to report all the variables and conditions in a study, and lack of transparency in reporting the use of covariates and exclusion of outliers. I’ve reproduced their table of proposed solutions to these problems below, and I strongly encourage everybody to read the entire paper.

For me, “data peeking” is a perfect example of how easy it can be to rationalize practices that lead to an increased rate of false positive findings. Most people don’t realize that looking at the data before collecting all of it is much of a problem. Neither did I until recently. In actuality, peeking at your data substantially raises the probability of a false positive finding (for more discussion read Why Data Peeking is Evil by @talyarkoni).

Imagine you’ve meticulously researched and designed a study and planned to collect data from 80 participants. After running 80 participants you analyze your data and you find that the effect you were looking for comes in at p = .08. What to do? You could stop there and report your findings, but you know that when you submit the paper for publication reviewers are likely to reject it because the findings aren’t quite significant at the conventional p < .05 level. Alternatively, you could collect just a little more data. It’s not hard to see how this could be rationalized: the effect is probably real, so what’s the harm in adding a few more data points? The real crime would be to waste all the money, time, and effort that you’ve invested in this study. So you collect data from 20 more subjects and your p value drops below .05 and, voilà! Significant results!

The problem is that your finding isn’t really significant at the .05 level. Because your decision to terminate data collection was conditional on the significance level of the effect, you’re much more likely to get significant results than if you were to stop at a pre-determined point. When these results are published, however, there is no way for a reader to know that you didn’t plan to run 100 subjects all along.
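To make that concrete, here is a minimal simulation sketch of the scenario above. It is my own illustration, not something taken from the False Positive Psychology paper, and it rests on simplifying assumptions: two groups drawn from the same normal distribution (so there is no true effect), a z-test with a known standard deviation of 1, and a single peek at 80 participants before going on to 100. The function names and the specific numbers are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def two_sided_p(group_a, group_b):
    """Two-sample z-test p-value, assuming both groups have a known SD of 1."""
    n_a, n_b = len(group_a), len(group_b)
    z = (fmean(group_a) - fmean(group_b)) / (1 / n_a + 1 / n_b) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=20000, first_look=40, final_n=50, alpha=0.05, seed=1):
    """Compare false positive rates with and without one interim peek.

    first_look and final_n are per-group sizes (40 + 40 = 80, 50 + 50 = 100).
    Both groups come from the same distribution, so every "significant"
    result counted here is a false positive.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peek_hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(final_n)]
        b = [rng.gauss(0, 1) for _ in range(final_n)]
        # Honest design: analyze once, at the planned final sample size.
        sig_final = two_sided_p(a, b) < alpha
        # Peeking design: analyze at 80; stop if significant, otherwise
        # add 20 more participants and analyze again.
        sig_interim = two_sided_p(a[:first_look], b[:first_look]) < alpha
        fixed_hits += sig_final
        peek_hits += sig_interim or sig_final
    return fixed_hits / n_sims, peek_hits / n_sims


fixed_rate, peek_rate = simulate()
print(f"False positive rate, one planned analysis: {fixed_rate:.3f}")
print(f"False positive rate, with a single peek:   {peek_rate:.3f}")
```

Under these assumptions, the single planned analysis should come out close to the nominal .05, while the version with one peek comes out noticeably higher. More peeks, or peeks after every few subjects, would push the rate higher still.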

It’s not hard to see how many perfectly honest researchers might peek at their data once in a while. Many researchers don’t realize the consequences of peeking at their data. Others may understand that you’re not supposed to peek in after every few subjects and stop collecting data the moment you get significant results, but fail to appreciate that even peeking a few times can make a big difference. Returning to Ariely’s locksmith’s analogy, they’re walking through a door with no lock.

The solutions suggested by Simmons, Nelson, and Simonsohn are a big step in the right direction. Simply making it clear to people that some of their practices are problematic removes ignorance as an excuse. Knowing what’s wrong also makes rationalization that much harder. As Dan Ariely explained in a recent interview, ambiguous rules present more opportunities for rationalization. But no solution is perfect, so I thought I’d share a few things that have been on my mind as I wrestle with these problems.

Unilateral disarmament. It’s easier to get significant results by ignoring the rules rather than following them. Therefore, without an enforcement mechanism, such as the adoption of these (or similar) solutions by institutions such as journals, there is a risk that we could create a situation in which only the most honest researchers would improve their behavior while many others would not. The perverse result would be that our journals would be disproportionately filled with publications by poorly behaved researchers who find it easier to produce significant results.

Having said that, there are reasons for optimism. Even if journals do not adopt the proposed solutions as requirements, reviewers can still bring them up. This possibility should discourage researchers from trying to sneak questionable practices past reviewers. In addition, as Joe Simmons noted in a recent presentation at APS, authors who adopt the solutions can explicitly note that they have done so in their papers, which not only serves to signal the integrity of the paper but, over time, may help to establish a positive social norm.

Make reforms less threatening. Ultimately, the success of reforms like those that Simmons, Nelson, and Simonsohn propose depends on the researchers themselves. Will they take the problems facing the field seriously, or will they try to ignore them and sweep questionable research practices under the rug? I have great faith in the integrity of most researchers, but it’s very important to acknowledge that reform poses a serious threat and that the more we are able to mitigate that threat, the more likely people will be to adopt reforms. If we are too aggressive in our reforms, or if we target individuals who have behaved no differently than their colleagues, then we risk triggering a defensive reaction that will make successful reform slower and more difficult.

I want to be clear that I think Simmons and his colleagues have gone to considerable lengths to make the adoption of the reforms they propose less threatening, but they have very little control over how others seek to enforce the solutions they suggest. For example, one commendable guideline they suggest is for reviewers to be “more tolerant of imperfections in results.” For the long term success of reform, I believe that it’s just as important that this recommendation be put into practice as any of the proposals that would minimize the rate of false positive findings.

Start now, but go slow. Another source of potential threat to researchers is the question of when we start. The obvious answer is “now!” We have a problem that needs fixing and the problem is not going to go away on its own. But researchers have a pipeline of completed but as-yet-unpublished studies that can reach back several years. If we can’t somehow mitigate the threat that existing research will be considered compromised then we will make it much more likely that researchers will rationalize away the need for reform.

Take, for example, the graduate student who has been working on her thesis for 4 years only to realize that she peeked at her data in one or two of her 7 studies, though she doesn’t remember which ones or how many times, since she didn’t realize it was a problem at the time. Do we expect her to discard that data? Or what about the young professor who realizes that, after years of collecting data and submitting revisions upon revisions of a paper, his methods are not completely above reproach? It doesn’t seem reasonable to reject his paper on that basis if we’re not also going to go back and retract hundreds of other papers in the literature.

Perhaps we can take a page from Dick Thaler (@R_Thaler) and Shlomo Benartzi’s “Save More Tomorrow” plan (PDF) to nudge people in the right direction. The key to Save More Tomorrow’s effectiveness is that it doesn’t require people to make the hard choices today, when the perceived costs are the highest, but allows them to commit to making changes in the future. For methodological reform, this would entail a “starting now” approach. Any new research would follow the reformed standards, while existing research would be treated more leniently, with an effort to increase transparency, for a certain period of time, say 2, 3, or 5 years. The solution is not a perfect one, but it aims at maximizing the number of researchers who would commit to the reform process rather than avoiding reform or putting it off.

One final thing to note: Simmons’ and his colleagues’ proposals are a great starting point, but there are other reforms that have been proposed that also deserve attention and I hope to return to them in the future. Among these reforms is the need to be more open with our data and the possibility of pre-registering studies before running them to help keep researchers honest (for more discussion see this post by @Neuro_Skeptic). The most important of these, in my opinion, is the need for replication. Many of the problems of false positive psychology would be reduced considerably if there were more replication in the field of psychology. Although an extensive discussion is beyond the scope of the current article, I recommend an upcoming paper (PDF) on the topic by Brian Nosek and his colleagues and this piece by Ed Yong.

The discussion about reform is an important one for us to be having. I hope that perhaps a silver lining to the Smeesters resignation is that it will underscore the urgency of this discussion. According to Martin Enserink at Science, Smeesters is convinced that his offenses were no different than those of many of his colleagues in marketing and social psychology, who he claims, “consciously leave out data to reach significance without saying so.” Maybe I’m being naïve, but I don’t think that’s true. While I believe that many researchers make methodological mistakes that they don’t recognize as problematic, very few cross the line and knowingly produce falsified results. If you share my assumption that most researchers are honest then what we need to do is to make the rules clear and unambiguous as soon as we can. There may be some hardened thieves among us, but what we really need are more locks on our doors to help keep us honest.


Photo Credits:

Locked Door by Swastiverma (Own work) [CC-BY-SA-3.0], via Wikimedia Commons

Peeking Cat by Kevin Dooley [CC-BY-2.0], via Wikimedia Commons

Danger Sign by Interzone00 (Flickr: DSCN2044) [CC-BY-2.0], via Wikimedia Commons


  1. Chris Said

    Dave – I agree that some rules will keep the honest people honest. But in addition to (or perhaps instead of) a complicated list of rules and regulations, I think a more promising solution is to simply change the incentive structure itself, so that scientists don’t feel any pressure to manipulate their results. http://filedrawer.wordpress.com/2012/04/17/its-the-incentives-structure-people-why-science-reform-must-come-from-the-granting-agencies/

    1. Dave Nussbaum

      Hi Chris, thanks for the comment! I agree that the incentive structure often points us in the wrong direction and I think trying to change that structure is worth doing. Obviously pressure plays a big role in why people manipulate their results and then rationalize those manipulations — Stapel basically says exactly that.

      At the same time I think that the two approaches are largely complementary. I think many of the problems outlined in False Positive Psychology are going to persist even if incentives were corrected. For instance, people peek at data because they don’t realize it’s wrong, or rationalize excluding outliers because they believe in their effect. The added pressure certainly contributes to the problem, but it’s not the whole problem.


  2. Hal

    Great points!

    But when you say:

    “The perverse result would be that our journals would be disproportionately filled with publications by poorly behaved researchers who find it easier to produce significant results.”

    Just thumb through any issue of Psych Science and you can see that this is the world we already live in. :-(

    1. Dave Nussbaum

      Hi Hal, I hope that’s not entirely true, but it would explain why they won’t publish my research :)

  3. Stacey

    Great points, Dave! I especially like the idea of using a nudge a la Save More Tomorrow and agree that such an approach is likely to reduce defensive concerns that arise if/when the norm for publishing changes.

    1. Dave Nussbaum

      Thanks Stacey, I think that’s often a good approach with big, possibly threatening changes. Sometimes we really have very good intentions, but when it comes time to put them into action we chicken out. Committing now to something “tomorrow” can make that transition a lot easier.

  4. Daniel Nadolny

    Great points! As a side note, we as a field almost always use two-tailed hypothesis tests, even though it is (statistically) acceptable to use one-tailed tests with directional hypotheses. I’d be curious to get a sense for how much flexibility in reporting boosts significance versus how much our disregard of directional hypotheses costs us.

    I definitely agree that reviewers should be more accepting of imperfections in the data. Without this step, following some of these guidelines would lead to reducing creativity and innovation in the field; it would be too risky to try new measures or manipulations, or look for low chance but interesting-if-found effects.

    I really do think that the best solution to all of this is that all data should ultimately be available online, regardless of whether it’s a truly failed study, or something that’s published. With some good search features, I think this would be feasible… that way, articles can be kept within reasonable size, meta-analyses become easy, moderators are found more quickly, less time is wasted attempting to try things that don’t work, new effects can be found in all data, and none of the issues (except perhaps data-peeking) are problems anymore.

    1. Dave Nussbaum

      Thanks for the comments! One thing to add — I’m a proponent of publishing all data and making it openly accessible. The tricky thing is that we don’t always know why a study failed — sometimes it’s because there’s no underlying effect there and sometimes it’s poor execution or bad luck. We would have to figure out how to integrate failed studies into our understanding of an effect.

      For example, if you run 3 pilot studies trying to figure out how to best test for an effect, it may not make sense to include those studies in a meta-analysis of the effect. On the other hand, you can’t get significant results on “take 4” and run off and try to publish those as though it was the first time you’d ever tried it.

      Having said that, just because it might be a little tricky is not a good reason not to try.

    2. Lauren Meyer

      “a sense for how much flexibility in reporting boosts significance versus how much our disregard of directional hypotheses costs us” >> The Simmons paper linked above includes examples of exactly how different types of “flexibility” lead to inflated “significance” and by how much. It’s way, way more than double the “true” p value.

      “it is (statistically) acceptable to use one-tailed tests with directional hypotheses” >> True. However, I contend that many “hypotheses” are only directional after the researcher finds out the direction of the effect. Unless there is a very strong theory in place, researchers often make their after-the-fact interpretation into their supposed a priori hypothesis. One stats professor I had said that in order to do a one-tailed test, you have to do an honest “look in the mirror test.” If it would have been significant in the opposite direction, would you have reported it as such?

  5. kl

    I disagree that the false positive authors have gone to any lengths to be seen as less threatening. If Simonsohn wants to be less threatening he should stop using this new statistical process, whatever it is, to accuse people of academic misconduct before it has been published and scrutinized. The two cases so far seem guilty but, without full disclosure, how can we be sure? Probably the first few people accused by the House Unamerican Activities Committee were guilty too but we know how quickly that went off the rails. They want accountability and transparency from us but haven’t (yet) provided it themselves.

    1. Hal

      Uri is working expeditiously to prepare and publish an account of his involvement in the current situations. These are each very unique cases and do _not_ involve the “new statistical process” I think you are referring to (the p-curve methodology for assessing the likelihood of p-hacking in groups of papers which Uri has given public talks about at SPSP and APS). Instead, these cases involve specific odd patterns which appeared in particular published datasets and which suggested the possibility of flat-out research misconduct–nothing in any ethical gray area. Also keep in mind that Uri’s role in these cases was to bring his observations to the attention of official campus misconduct authorities, who would have had access to a great deal more information in their deliberations, and who presumably have procedures that provide due process to the accused. Uri did not make a public accusation against any individual based on what he observed in their papers. The appropriateness of the ultimate resolution of these cases is the responsibility of campus authorities, not Uri. But I believe that when everyone learns the details of what led Uri to take the actions he took, there will be a consensus that his actions were commendable.

      1. Dave Nussbaum

        Thanks for the update, Hal!

      2. kl

        This account of the event specifically says: “The whistleblower, a U.S. scientist, used a new and unpublished statistical method to search for suspicious patterns in the data…”

        So if it wasn’t this p-hacking method then the media is reporting wrongly (which wouldn’t necessarily be surprising I suppose). I do acknowledge that Smeesters’ university didn’t rely on that method alone, just that it raised initial suspicions. And to be clear, I condemn his actions as much as the next person.

        I just think that the existence of a mysterious, unpublished statistical process–which may or may not have been used to accuse people, and which might as well be wizardry as far as we know so far–is creating rumors and contributing to a nervous climate. Is somebody combing through old papers looking for suspicious patterns using this process? What if there are false positives? Without disclosure it’s impossible to say, so I look forward to reading the papers.

        1. Dave Nussbaum

          I share your concerns both about the possibility of false positives and the climate that unfounded rumors create. As I argue in the post, I think we need to take a judicious approach if we want to be successful in making the reasonable reforms that the field needs. So I don’t want to dismiss your concerns at all.

          What we’ve got going on here, as far as I can tell, is that the exigencies of the situation have required action to be taken faster than the technique could be published. Unfortunately the publication process isn’t fast enough to keep up with events as they unfold in real time.

          So at this point, for people looking at the situation from the outside, it certainly seems mysterious and opaque, although all the details will eventually come to light, so if there was anything amiss about the process — and from my vantage point I don’t believe there was — then people will have a chance to examine it soon.

    2. Dave Nussbaum

      I think Simonsohn can speak best for himself, but I will say two things. One, I think the false positive authors do go to some lengths to avoid accusing anyone of fraudulent intent. Two, as far as I understand, in the Smeesters case Simonsohn brought the issue to the attention of the university only after Smeesters could not produce any of the original data — the university then conducted their own independent investigation. I agree that it would be best if we could see Simonsohn’s methods, but I think it’s fair to allow him to try to publish his findings in a peer reviewed journal before sharing them.

      At the same time, I understand where your concerns are coming from and I think that they go to the heart of the approach that I think we should be taking. As soon as researchers feel like they may be unfairly or arbitrarily targeted there will understandably be a lot of resistance, which will certainly slow the pace of reforms. That’s a big part of why I wrote the article — I think it’s important that we improve the way we do research in the field, but I think if we don’t approach things right then we won’t accomplish much.

      Thanks for your comment!

  6. Rich

    I wonder if it would be possible for scientists to submit their experimental plans (aims, methods, significance) to journals, which could then review the plan and tentatively accept it for publication once the results are completed? Similar to a clinical trial registry but journal-centric. It would force journals to consider research based on aims and significance, rather than the sexy results post hoc. And scientists could use the tentative acceptance in their grant applications. Imagine if you could ask for money to complete the research which has already been tentatively accepted by Nature! If such a system was possible, it might solve a number of the current issues, such as false positive bias and data peeking.

    1. Dave Nussbaum

      As you allude to, some areas of research employ the sort of registry you’re suggesting to good effect. I think it’s particularly useful in clinical trials when there are big money reasons to hide results that people (or companies) don’t like. I think there are some obstacles to pre-acceptance in a journal, mostly it would be a pretty major shift, but I think it could at least play a role.

      Without giving it a lot of thought yet, I could imagine that journals may become conservative in the types of research they pre-approved, and more “innovative” research could be harder to publish in top tier journals, but I could be missing something important.

      In general I think it’s a good idea worth exploring. I think Hal, who commented above, has suggested this sort of system for running replications of existing studies, which I think would work well, as long as there was enough incentive to do the replications.

      Thanks for the comment!

      1. Rich

        I can imagine journals might be conservative – on the other hand, there may also be an incentive to accept innovative research for the sexy impact factor.

        Thinking about it a bit more: in the system I’m imagining, the results could distinguish between registered (primary) aims and unregistered (secondary or exploratory) aims. Meeting the primary aims would naturally be considered more robust and reliable; while the exploratory, unregistered results might be novel (and interpretable), but considered less robust until further replication as a primary aim in a follow-up experiment. I feel like this is the direction clinical trials are already taking.

        Such a two-tier system might also help address a real concern raised by a comment below that strict requirements to pre-register experiments would end up wasting a lot of grant time and money. In this two-tier system, I imagine if the primary aims are not met, then the null result would be reported alongside any significant exploratory results. In this way, secondary results will partially compensate for the lost value of the null primary result.

  7. Vick

    Great post. My only concern is I still haven’t seen any good discussion of what to do with the data when something is trending, or requires controlling for some demographic variable that is obvious now, but you didn’t consider when you “pre-registered” the experiment, etc. etc.

    Simmons and others seem to be implicitly advocating that you throw out all the data (it didn’t work, so don’t collect more!) and possibly never again test the same hypothesis (you might have just got lucky, and publishing the new study that worked would only increase the file drawer problem). This seems absurd, and yet I haven’t seen a single good suggestion for what to do with the data if you did follow these recommendations to the letter but didn’t get a significant result after testing your initial hypothesis. Other than waste a lot of grant time and money. I think the experimenter degrees of freedom problem is real and can be abused, but I think more work needs to be done on how we can continue current practices (exploring data, testing more than one hypothesis, collecting more data to be sure it wasn’t a power issue, revising the experiment, etc.), which are often done with the best of intentions, while simultaneously protecting against inflated p-values.

    1. Dave Nussbaum

      Hi Vick, I think that’s a great question. There are a couple of answers, at least partial ones:

      The first one is technical: you can correct for data peeking statistically. For example, Strube has a way of doing this explained in this article from 2007. Of course you give up some power to do this, but it’s better than starting over.

      The second answer is that the basic solution is transparency. If you want to include a covariate in your analysis that is obvious in retrospect, then do it. But report the results with and without the covariate and be honest. The biggest thing that stands in the way of an answer like this is that until reviewers become more accommodating of disclosed imperfections in the data, researchers will have more incentive to rationalize not being transparent.

      Lastly, sometimes you do have to start again. For example, if you have a hypothesis you like and you run a study which doesn’t work, then revise it a couple of times until it does work, it’s not really fair to take the first version of the study that works and pretend that it’s the only one that exists. If you’re not going to include the 4 failed pilot studies then it’s only fair not to include the one successful one, otherwise the likelihood that you’re capitalizing on chance is much higher than .05. The good news is that if it’s a real effect and you’ve honed your study then it has a very high probability of being replicated and you can have a lot of confidence that you’ve got something real.

  8. s klein

    I think your points are very on-target. But regrettably, I see a bigger issue at work. And it does not reside exclusively in the social psychological domain.

    Specifically, psychology as a “science” is too often either careless or unprepared to deal with conceptual issues. I work in social, cognitive, neuro, evolutionary psychology as well as philosophy. Of those areas the only ones that appear genuinely concerned about what their terms and tests refer to are philosophy and (surprise, perhaps) evolutionary psychology. The remainder are happy to spin yarns and perform tests on concepts that are ill-defined, undefined, or admit to no social consensus in their respective fields, using testing situations that, if they even (accidentally) manage to split nature at a conceptually meaningful joint, are more due to luck than clearly conceived intention. I could give umpteen examples and will if you want, but hopefully you have noticed the too often meaningless nature of what passes for experimental work.

    If I am correct, then why? Because (a) our edict to publish or perish brings out the worst, (b) students and their advisers are not rigorously trained in the philosophy of scientific logic or conceptual analysis. So, they tend to fit into established streams of current research endeavor (for publication’s sake) and “go with the flow” despite not being aware that the flow is heading nowhere in particular. It is a sad state of affairs.

    And the desire to be a science not only is thwarted to a large degree, but I think psychology needs to rethink whether that stance is desirable. It is far from clear to me (or to others in the “hard” sciences, e.g., Bohr, Eddington, Stapp, Heisenberg) whether a science of person is possible or desirable. By adopting such a stance we run the real risk of excluding much of “reality” that matters in being human.
