Reproducibility in cognitive neuroscience: What is the problem?

Russ Poldrack

Stanford University

A search for “fMRI” for FY2024 NIH grants finds:

1816 matching grants

$941M total costs

What is the brain dysfunction in major depression?

Meta-analysis of 99 published studies

Müller et al., 2017, JAMA Psychiatry

We seem to have created quite a mess.

How can we fix it?

“Sunlight is said to be the best disinfectant”

(Louis Brandeis)

Towards an ecosystem for open and reproducible neuroscience

Designing a more reproducible scientific enterprise

82 | NATURE | VOL 526 | 8 OCTOBER 2015

Improving the choice architecture of science

Choice architecture
- particular set of features that drive people toward or away from particular choices
Nudges
- Improving incentives
- Using the power of defaults
- Providing feedback
- Expecting and preventing errors

The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.

Neuroscience research is badly underpowered

Low power -> unreliable science

Positive Predictive Value (PPV): The probability that a positive result is true
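As a rough illustration of the PPV definition above, the sketch below uses the formula PPV = (power x R) / (power x R + alpha), where R is the pre-study odds that a tested effect is real. The specific values of R, power, and alpha are illustrative assumptions, not estimates for any particular field.

```python
def positive_predictive_value(power, alpha, prior_odds):
    """Probability that a significant result reflects a true effect.

    power      : probability of detecting a true effect (1 - beta)
    alpha      : false positive rate of the test
    prior_odds : pre-study odds R that a tested effect is real (assumed)
    """
    true_positives = power * prior_odds
    return true_positives / (true_positives + alpha)


# Illustrative assumption: one in five tested hypotheses is true (R = 0.25).
for power in (0.8, 0.5, 0.2):
    ppv = positive_predictive_value(power, alpha=0.05, prior_odds=0.25)
    print(f"power={power:.1f}  PPV={ppv:.2f}")
```

With everything else held fixed, lowering power lowers the PPV, which is the sense in which low power makes a literature less reliable.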

Winner’s Curse: overestimation of effect sizes for significant results

Button et al, 2013
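The winner's curse can be seen in a small simulation. This is a minimal sketch under simplifying assumptions that are not from the talk: two groups with known unit variance, a z-test at the two-sided 5% level, and an arbitrary true effect of d = 0.3 with 20 subjects per group.

```python
import random
from statistics import mean


def winners_curse(true_d=0.3, n_per_group=20, n_studies=5000, seed=0):
    """Simulate many small studies and average the effect sizes that reach significance."""
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5  # standard error of the mean difference (SD assumed = 1)
    significant = []
    for _ in range(n_studies):
        treated = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        observed_d = mean(treated) - mean(control)
        if abs(observed_d / se) > 1.96:  # two-sided z-test, alpha = 0.05
            significant.append(observed_d)
    return mean(significant), len(significant) / n_studies


mean_significant_d, fraction_significant = winners_curse()
print(f"true d = 0.30, mean d among significant studies = {mean_significant_d:.2f}")
print(f"fraction of studies reaching significance = {fraction_significant:.2f}")
```

Only studies that happen to overestimate the effect cross the significance threshold, so the significant subset is biased upward.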

Small samples = high instability of statistical estimates

Schönbrodt & Perugini, 2013
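The instability of small-sample estimates can be sketched by simulating correlations at two sample sizes. The true correlation (0.2), the sample sizes, and the number of simulations are illustrative assumptions.

```python
import random
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r_spread(n, rho=0.2, n_sims=500, seed=0):
    """Standard deviation of sample correlations across simulated studies of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # Construct y so that its population correlation with x equals rho.
        y = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x]
        estimates.append(pearson_r(x, y))
    return stdev(estimates)


spread_small = r_spread(n=20)
spread_large = r_spread(n=250)
print(f"SD of r across studies: n=20 -> {spread_small:.2f}, n=250 -> {spread_large:.2f}")
```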

Marek et al., 2022

Small samples + publication bias: the case of candidate genes

Candidate gene associations fail in well-powered GWAS

Jason Stein et al. for the ENIGMA Consortium

“In general, previously identified polymorphisms associated with hippocampal volume showed little association in our meta-analysis (BDNF, TOMM40, CLU, PICALM, ZNF804A, COMT, DISC1, NRG1, DTNBP1), nor did SNPs previously associated with schizophrenia or bipolar disorder”

How well powered are fMRI studies?

Median study in 2024 (n=36/group) was powered to find a single 200 voxel activation with d~0.67
Is that plausible?

Updated from Poldrack et al., 2017

Estimating realistic effect sizes

Unbiased effect size estimate

What are realistic effect sizes for fMRI?

Poldrack et al., 2017, Nature Reviews Neuroscience

Depression studies from Müller et al.

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification. This requirement offers extra protection for the first requirement. Samples smaller than 20 per cell are simply not powerful enough to detect most effects, and so there is usually no good reason to decide in advance to collect such a small number of observations. Smaller samples, it follows, are much more likely to reflect interim data analysis and a flexible termination rule. (Simmons et al., 2011)

Small samples -> variable estimates of predictive accuracy

Varoquaux, 2018

Small samples + publication bias -> inflated accuracy estimates

Varoquaux, 2018

Doing well-powered science as a trainee

Underpowered science is futile, but many ECRs don’t have resources to do sufficiently powered studies
“if you can’t answer the question you love, love the question you can” (Kanwisher, 2017)
Pivots:
- Collaborate
- Use shared data
- Focus on theory/computational modeling

The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.

Poldrack et al., 2017

“data collection and analysis methods were highly flexible across studies, with nearly as many unique analysis pipelines as there were studies in the sample [241].”

Machine learning can make it worse

“In this article, we use Support Vector Machine (SVM) classifiers, and genetic algorithms to demonstrate the ease by which overfitting can occur, despite the use of cross validation. We demonstrate that comparable and non-generalizable results can be obtained on informative and non-informative (i.e. random) data by iteratively modifying hyperparameters in seemingly innocuous ways.”
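The general mechanism in the quote can be sketched without an SVM or a genetic algorithm. The example below swaps in a much simpler setup, stated here as an assumption: a nearest-centroid classifier, leave-one-out cross-validation, and random feature subsets standing in for "hyperparameter" choices. The data are pure noise, so any accuracy above chance comes from selecting the best configuration after seeing its cross-validated score.

```python
import random


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    n = len(y)
    correct = 0
    for held_out in range(n):
        dist = {}
        for label in (0, 1):
            # Centroid computed from training samples only (held-out sample excluded).
            rows = [X[i] for i in range(n) if i != held_out and y[i] == label]
            centroid = [sum(r[f] for r in rows) / len(rows) for f in features]
            dist[label] = sum(
                (X[held_out][f] - c) ** 2 for f, c in zip(features, centroid)
            )
        predicted = 0 if dist[0] <= dist[1] else 1
        correct += predicted == y[held_out]
    return correct / n


def tuning_on_noise(n_samples=30, n_features=50, n_configs=200, subset_size=5, seed=0):
    """Try many configurations on random data; report best and mean CV accuracy."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # balanced labels, unrelated to X
    accuracies = []
    for _ in range(n_configs):
        features = rng.sample(range(n_features), subset_size)
        accuracies.append(loo_accuracy(X, y, features))
    return max(accuracies), sum(accuracies) / len(accuracies)


best_acc, mean_acc = tuning_on_noise()
print(f"mean CV accuracy over configurations: {mean_acc:.2f}")
print(f"best CV accuracy after 'tuning':      {best_acc:.2f}")
```

Reporting the best configuration's cross-validated accuracy is optimistic because the same folds were used both to choose the configuration and to score it; an untouched held-out set or nested cross-validation avoids this.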

It’s not just fMRI

The purpose of this paper is to demonstrate how common and seemingly innocuous methods for quantifying and analyzing ERP effects can lead to very high rates of significant but bogus effects, with the likelihood of obtaining at least one such bogus effect exceeding 50% in many experiments.

Improving reproducibility through pre-registration

Register analysis plan prior to accessing data
- Preferably with code based on analysis of simulated data
This does not prevent exploratory analysis
- But planned and exploratory analyses should be clearly delineated in the paper
If the preregistration commits you to something that you learn is bad, you can always deviate
- as long as you are explicit in the paper

http://www.russpoldrack.org/2016/09/why-preregistration-no-longer-makes-me.html

The requirement for clinical trial registration was associated with many more null effects
This is a “cost” under the current incentives to publish

Kaplan & Irvin, 2015

Pre-registering neuroimaging studies is hard!

Specify as much as possible
- How will sample size be determined?
- Inclusion/exclusion criteria
- Primary hypotheses to be tested
- Anatomical regions of interest
- Analysis plan
  - Preferably with code tested on existing or simulated data
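One way to act on the checklist above is to write the planned test as a function before any data are accessed, then run it on simulated data to choose a sample size. This is a minimal sketch, not a template from the talk: the one-sample test on a single ROI value per subject, the normal approximation to the t distribution, and the assumed effect size of 0.4 are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev

ALPHA = 0.05  # significance threshold, fixed before seeing any data


def planned_test(values, alpha=ALPHA):
    """Pre-specified analysis: two-sided test of the mean against zero.

    Uses a normal approximation to the t distribution (an assumption made
    here to keep the sketch dependency-free).
    """
    n = len(values)
    z = mean(values) / (stdev(values) / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def simulated_power(n, effect_size, n_sims=2000, seed=0):
    """Fraction of simulated studies of size n in which planned_test is significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(effect_size, 1.0) for _ in range(n)]
        hits += planned_test(sample)
    return hits / n_sims


for n in (20, 40, 80):
    print(f"n={n:>2}  power at d=0.4: {simulated_power(n, effect_size=0.4):.2f}")
print(f"false positive rate at n=40: {simulated_power(40, effect_size=0.0):.3f}")
```

Running the same function on simulated null data, as in the last line, is a cheap check that the planned analysis controls false positives before it is applied to real data.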

https://prereg-psych.org/create/

Pre-registration prevents p-hacking but does not eliminate analytic variability

How variable are neuroimaging analysis workflows in the wild?

What is the effect on scientific inferences?