‘Correlation does not imply causation’ is an adage students from all social sciences are made to recite from a very early age. What is less often systematically discussed is what could actually be going on when two phenomena are correlated but not causally related. Let’s try to make a list:
1a) The correlation might be due to coincidence. This is essentially the problem of mistaking noise for signal, but with a focus on time series. It is especially easy to mistake pure noise (randomness) for patterns (relationships) when one looks at two variables over time. If you look at the numerous ‘correlation is not causation’ jokes and cartoons on the internet, you will note that most concern the spurious correlation between two variables over time (e.g. number of pirates and global warming): it is just easier to find such examples in time series than in cross-sectional data.
1b) Another reason to distrust correlations is the so-called ‘ecological inference’ problem. The problem arises when data is available at several levels of observation (e.g. people nested in municipalities nested in states). A correlation between two variables aggregated at a higher level (e.g. states) cannot be used to infer a correlation between these variables at a lower level (e.g. people). Read as a statement about the lower level, the higher-level correlation is a statistical artifact, although not necessarily one due to mistaking ‘noise’ for ‘signal’.
2) The correlation might be due to a third variable being causally related to the two correlated variables we observe. This is the well-known omitted variable problem. Note that statistical significance tests have nothing to contribute to the solution of this potential problem. Statistical significance of the correlation (or of the regression coefficient, etc.) is not sufficient to guarantee causality. Another point that gets overlooked is that it is actually pretty uncommon for a ‘third’ (omitted) variable to be so highly correlated with both variables of interest as to induce a high correlation between them which would disappear entirely once we account for the omitted variable. Are there any prominent examples from the history of social science where a purported causal relationship was later discovered to be completely spurious due to an omitted variable (not counting time series studies)?
3) Even if a correlation is statistically significant and not spurious in the sense of 2), there is still nothing in the correlation that establishes the direction of causality. Additional information is needed to ascertain in which way the causal relationship flows. Lagging variables and process-tracing case studies can be helpful.
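To make the non-causal mechanisms in the list above concrete, here is a small simulation sketch using only the Python standard library. Everything in it is an illustrative assumption rather than anything from real data: the function names, the sample sizes, the 0.5 threshold, and the choice of random walks as a stand-in for trending time series. It compares how often independent series produce a large correlation when they are white noise versus random walks (point 1a), builds grouped data where the group-level and individual-level correlations have opposite signs (point 1b), and shows a correlation induced by a common cause shrinking once that cause is partialled out (point 2).

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def white_noise(rng, n):
    """n independent standard normal draws (no time dependence)."""
    return [rng.gauss(0, 1) for _ in range(n)]


def random_walk(rng, n):
    """Cumulative sum of standard normal shocks (a trending series)."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        path.append(level)
    return path


def share_large(rng, make_series, trials=500, n=100, threshold=0.5):
    """Share of independent pairs of series with |r| above the threshold."""
    hits = 0
    for _ in range(trials):
        x, y = make_series(rng, n), make_series(rng, n)
        if abs(pearson(x, y)) > threshold:
            hits += 1
    return hits / trials


def ecological(rng, groups=30, per_group=200):
    """Return (group-level r, individual-level r) for nested data.

    Assumed setup: a shared group effect g pushes x and y in the same
    direction, while an individual effect u pushes them in opposite ones.
    """
    xs, ys, mean_x, mean_y = [], [], [], []
    for _ in range(groups):
        g = rng.gauss(0, 1)
        u = [rng.gauss(0, 3) for _ in range(per_group)]
        gx = [g + ui for ui in u]
        gy = [g - ui + rng.gauss(0, 1) for ui in u]
        xs.extend(gx)
        ys.extend(gy)
        mean_x.append(sum(gx) / per_group)
        mean_y.append(sum(gy) / per_group)
    return pearson(mean_x, mean_y), pearson(xs, ys)


def confounder(rng, n=5000):
    """Return (raw r, partial r given z) when z causes both x and y."""
    z = white_noise(rng, n)
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + rng.gauss(0, 1) for zi in z]
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    partial = (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
    return rxy, partial


rng = random.Random(0)  # fixed seed so the run is repeatable
noise_share = share_large(rng, white_noise)
walk_share = share_large(rng, random_walk)
group_r, individual_r = ecological(rng)
raw_r, partial_r = confounder(rng)

print(f"1a) share with |r| > 0.5: noise={noise_share:.3f}, walks={walk_share:.3f}")
print(f"1b) group-level r={group_r:.2f}, individual-level r={individual_r:.2f}")
print(f"2)  raw r={raw_r:.2f}, partial r given z={partial_r:.2f}")
```

Under these assumptions, large correlations should be rare between independent white-noise series but common between independent random walks, the group means should correlate positively while the individuals correlate negatively, and the raw correlation between x and y should largely vanish after conditioning on z. None of this addresses point 3: a simulation like this fixes the direction of causality by construction, which is exactly the information a correlation alone does not carry.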
All in all, that’s it: a correlation does not imply causation, but unless the correlation is due to noise, a statistical artifact, or a confounder (omitted variable), correlation is pretty suggestive of causation. Of course, causation here means that a variable is a contributing factor to variation in the outcome, rather than that the variable can account for all the changes in the outcome. See my posts on the difference here and here.
Am I missing something?