Saturday, May 5, 2012

Observational Studies

[O]bservational studies, researchers say, are especially prone to methodological and statistical biases that can render the results unreliable. Their findings are much less replicable than those drawn from controlled research. Worse, few of the flawed findings are spotted -- or corrected -- in the published literature.
That's from a WSJ front page article, "Analytical Trend Troubles Scientists," published Thursday, May 3rd, criticizing the increased use of observational studies (over true experiments) in the hard sciences over the last dozen years. The whole article is really good. I am glad the WSJ devoted a front page article to this issue.

What can go wrong with interpreting data from observational studies? Lots. For example, when running an experiment, the scientist gets to control all the variables of interest, and by varying one of them can get a clear picture of how that variable affects the outcome of interest (usually the biggest analytical issue with experimental data is measurement error). In an observational study, many things change at once, which makes it much harder to interpret correlations correctly (is A correlated with B because A causes B, B causes A, or C causes both A and B?).
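
The third possibility, a common cause C driving both A and B, is easy to illustrate with simulated data. The following is a minimal Python sketch using made-up numbers, not data from any study: the variables `a`, `b`, and `c`, the noise scale, and the helper `pearson` are all assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical data-generating process: C drives both A and B.
# A has no effect on B, and B has no effect on A.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]
b = [ci + rng.gauss(0, 1) for ci in c]

# A and B are clearly correlated even though neither causes the other.
raw_corr = pearson(a, b)

# Because we built the data, we know C enters each variable with coefficient 1,
# so subtracting it removes the common cause. Real data offers no such shortcut.
a_net = [ai - ci for ai, ci in zip(a, c)]
b_net = [bi - ci for bi, ci in zip(b, c)]
net_corr = pearson(a_net, b_net)

print(f"corr(A, B) = {raw_corr:.2f}")
print(f"corr(A, B) net of C = {net_corr:.2f}")
```

In this setup the theoretical correlation between A and B is 0.5, and it disappears once C is removed. In an observational study C is often unmeasured, so the raw correlation is all the researcher sees.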

In addition, yesterday at the Becker Friedman Institute Conference on the Biological Basis of Economic Preferences and Behavior, David Cesarini, one of the presenters, spent some time talking about this issue as it relates to the molecular genetics literature (a literature that in large part tries to determine which genes do what). This is another sub-field where there are concerns about the veracity of much of the research. Basically, his argument boiled down to another common concern with observational studies: data mining. Even with small p-values, when running hundreds of thousands of regressions of an outcome on a SNP (single nucleotide polymorphism), one is bound to find spurious correlations, especially since the regressions have such low power (he cited an average R-squared of .004). So, what was his advice to economists and other scientists who want to use genetic data in their work? "Stop recapitulating the mistakes of medical genetics and set high standards."
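
To get a feel for why running that many regressions guarantees spurious hits, here is a rough Python sketch of the arithmetic. It assumes the tests are independent and that every null hypothesis is true, which is a simplification (real SNPs are correlated with one another). The figure of 500,000 regressions is a hypothetical stand-in for "hundreds of thousands", and the function names are my own.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, familywise_alpha):
    """Per-test cutoff that keeps the familywise error rate at or below the target."""
    return familywise_alpha / n_tests


# Hypothetical scan: 500,000 SNP regressions, none with a real effect.
n_snps = 500_000
print(expected_false_positives(n_snps, 0.05))  # about 25,000 spurious hits
print(familywise_error_rate(20, 0.05))         # about 0.64 with only 20 tests
print(bonferroni_threshold(n_snps, 0.05))      # about 1e-07 per test
```

The Bonferroni cutoff used here is only one way to adjust for multiple tests, and a conservative one. It is shown because it is the simplest to compute by hand.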

Economists really do have a comparative advantage here over other hard scientists. Economists have had to deal with the lack of experiments since the invention of the field. As a result, we are trained to always be on the lookout for what could go wrong when interpreting survey and observational data (or even when we don't have the data we need at all), and we learn a lot of statistical techniques and econometric methods to deal with the problems that crop up all the time in observational studies. Most medical researchers and other hard scientists haven't had this training because all their data work has come from experiments where all the manipulated and controlled variables are known. Because of this, they are bound to make lots of mistakes.


  1. Did Cesarini mean that because of low power - probability of detecting a meaningful deviation from the null hypothesis - the fraction of statistically significant results that represent something of meaning is much less? Or, was his issue primarily with multiple testing? I view these as slightly different problems that each make the other problem worse.

    Imagine that you have power = 1, but that you still have the five percent Type I error rate. 20 hypothesis tests of true nulls will, on average, yield one false positive. Setting aside the issue of power, that level of certainty about obtaining a false result (despite not knowing which one it will be) is a problem. My impression is that researchers adjust their individual tests so that the groupwise Type I error rate is 5 percent (depending on the number of tests, something like 1 or 0.1 percent for each test), but there are other ways to deal with this.

    Low power exacerbates the problem. For example, a test that rejects at the 5 percent level will produce a false positive in, on average, 1 of every 20 tests where the null is true, but if that same test detects only 20 percent of the true deviations from the null hypothesis, a rejection of the null might not be as meaningful. Imagine a setting with 20 independent hypothesis tests in which, in truth, 5 have a meaningful deviation from the null hypothesis. With power = 0.2, you'll detect one of these on average. Match that with roughly one false positive (0.75 on average across the 15 true nulls), and you've got (very loosely) a 50 percent chance that your statistical result is spurious.

    To take this one step further, and infer causation, we need even more (i.e., to rule out confounding variables, and reverse causation).

    1. Yes, I see my "especially" in the post is confusing.

      It seemed to be a little of both (he was talking fast, and I may not have gotten his point exactly right). According to my notes: He mentioned the standard in the literature is to set p=5x10^-8, but this represents a loosening of a standard that was in place that was not yielding any statistically significant results. Then he talked about problems with running hundreds of thousands of regressions per paper searching for significant relationships. Then he mentioned: there are "too many false positives because of underpowered studies." Then he advocated solutions: use larger samples when possible, introduce groupwise testing, and split the sample into two pieces (a discovery sample (80% of data) and a validation sample (20% of data)). I think your third paragraph captures what he was trying to convey.
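
The back-of-the-envelope calculation in the first comment (20 tests, 5 real effects, power of 0.2) can be written out as a short Python sketch. The function name `spurious_share` and the comparison case with power of 0.8 are my own additions for illustration. The calculation takes a ratio of expected counts, so treat it as a rough guide rather than an exact result.

```python
def spurious_share(n_tests, n_true_effects, power, alpha):
    """Expected share of rejections that are false positives.

    Uses the ratio of expected counts, which is a rough guide rather than
    an exact expectation of the realized share.
    """
    n_true_nulls = n_tests - n_true_effects
    expected_true_hits = n_true_effects * power
    expected_false_hits = n_true_nulls * alpha
    return expected_false_hits / (expected_true_hits + expected_false_hits)


# Numbers from the comment: 20 tests, 5 real effects, power 0.2, alpha 0.05.
low_power = spurious_share(20, 5, power=0.2, alpha=0.05)

# Hypothetical comparison: same setting with power raised to 0.8.
high_power = spurious_share(20, 5, power=0.8, alpha=0.05)

print(f"{low_power:.2f}")
print(f"{high_power:.2f}")
```

Counting only the 15 true nulls gives 0.75 expected false positives rather than one, so the spurious share comes out near 43 percent, close to the loose 50 percent figure in the comment. Raising power shrinks that share without touching the significance level, which is consistent with the advice to use larger samples.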