Tuesday, September 13, 2011

Things Everyone Should Know About Statistics

I recently read How to Lie with Statistics and it solidified the need for this post.

Sample Size Matters

OK Cupid does some cool things with their data. A while back, the site published a blog post comparing their homosexual and heterosexual users that debunked some common, bigoted myths: namely, that homosexuals lust after and seek to convert heterosexuals, and that homosexuals are likely to have had numerous sexual partners. The results were striking: only 0.6% of gay men ever searched for straight matches, and only 0.1% of lesbians did the same. Heterosexuals and homosexuals had the exact same median number of sexual partners. But what really struck me was one of the hundreds of comments on this particular post (btw, don't ever read comments) that said something along the lines of “Your study must be wrong, because I've met two gay guys and they both said they had hundreds of sexual partners.” It was hardly the only comment that drew a ridiculous conclusion from a sample size far smaller than OK Cupid's enormous user base. But this is how people think: this is my experience, and my experience is representative, so it must be universally true. Needless to say, thousands of varied data points are far superior to any individual's anecdotes. But…
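A quick aside before getting to that "but": the sketch below shows why two anecdotes tell you almost nothing about a median. Everything in it is an assumption for illustration. The population is simulated from a skewed (lognormal) distribution with made-up parameters, it is not OK Cupid's data, and the function names are my own. It draws many samples of different sizes and reports how far the sample medians wander.

```python
import random
import statistics

def simulate_population(size, seed=0):
    """Made-up, right-skewed population of counts (not real data)."""
    rng = random.Random(seed)
    # Lognormal draws: most values are small, a few are very large.
    return [int(rng.lognormvariate(1.8, 1.0)) for _ in range(size)]

def sample_medians(population, sample_size, trials, seed=1):
    """Medians of `trials` independent random samples of the given size."""
    rng = random.Random(seed)
    return [statistics.median(rng.sample(population, sample_size))
            for _ in range(trials)]

population = simulate_population(100_000)
true_median = statistics.median(population)

for n in (2, 20, 2000):
    medians = sample_medians(population, n, trials=200)
    print(f"n={n:>4}: sample medians range from {min(medians)} to {max(medians)} "
          f"(population median: {true_median})")
```

With two "people I've met", the estimate can land almost anywhere, including far out in the tail. With a couple of thousand, it should barely move from the population median.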

Fully Randomized Sampling is a Fiction

…even OK Cupid's massive sample is inherently flawed. Why? Because it is not a cross-section of all gay people. It's only the gay people who subscribed to an online dating site based in America. So we can expect geographic, national, racial, income, educational, technological, and (most important in this case) relationship-status biases. And I'm not singling out OK Cupid here: I have yet to encounter a research study without some kind of sampling bias. I love Pew's Internet and American Life Project, for instance, but most of their surveys are delivered over the phone, which skews things in a significant way given their topics. People like me, who have not owned a landline for almost a decade and have never been listed in a phonebook, are invisible to their methodology. The only studies that come close to true randomization are done in scientific laboratories, where extraneous variables are carefully controlled. But such experiments hardly represent life in the wild, where causality becomes absurdly complex, and so their results are hard to extrapolate. And speaking of causality…
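First, though, a rough simulation of the phone-survey problem. This is a sketch with invented numbers, not anything measured by Pew or anyone else: I assume 60% of a hypothetical population has a landline, and that people without one spend more hours online. The point is only that a huge sample drawn from the wrong pool can be further from the truth than a small sample drawn at random.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: every number here is invented for illustration.
population = []
for _ in range(200_000):
    has_landline = rng.random() < 0.6
    # Assumption: people without landlines spend more time online on average.
    mean_hours = 10 if has_landline else 25
    hours_online = max(0.0, rng.gauss(mean_hours, 5))
    population.append((has_landline, hours_online))

true_mean = statistics.fmean(h for _, h in population)

# A huge "phone survey": 50,000 respondents, but only landline owners can answer.
reachable = [h for has_landline, h in population if has_landline]
phone_survey = rng.sample(reachable, 50_000)

# A small but genuinely random sample of the whole population.
random_sample = [h for _, h in rng.sample(population, 500)]

print(f"true mean:            {true_mean:.1f}")
print(f"phone survey (50000): {statistics.fmean(phone_survey):.1f}")
print(f"random sample (500):  {statistics.fmean(random_sample):.1f}")
```

More respondents shrink the random noise, but they do nothing about who was never reachable in the first place.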

Correlation Does Not Imply Causation

I hesitated a bit before talking about this, because C≠C has become a thoughtless soundbite. I frequently hear it misused in totally irrelevant contexts, and taking the maxim too seriously leads to an insurmountable, Humean skepticism. But it must be said: just because two variables appear to be aligned does not mean there is any causal connection between them. Perhaps the best demonstration of this was produced just recently, with Google's tongue-in-cheek Correlate, which finds extremely strong correlations between arbitrary search terms, or matches a user-drawn curve to the popularity of different search terms over time. Most of these matches are pure coincidence, but some are quite meaningful: Google has famously been able to track flu outbreaks ahead of the CDC's reports by studying where flu-related searches occur.
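To see how easily strong correlations turn up by chance, here is a minimal sketch. It is not how Google Correlate works internally (I don't know that); it just generates 50 random walks that are unrelated by construction, compares every pair, and reports the strongest Pearson correlation it finds. The series count, length, and seed are arbitrary choices of mine.

```python
import itertools
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def random_walk(steps, rng):
    """A series whose every step is independent random noise."""
    position, path = 0.0, []
    for _ in range(steps):
        position += rng.gauss(0, 1)
        path.append(position)
    return path

rng = random.Random(2011)
series = [random_walk(100, rng) for _ in range(50)]

# Compare every pair of series and keep the strongest correlation.
best_r, best_pair = max(
    (abs(pearson(series[i], series[j])), (i, j))
    for i, j in itertools.combinations(range(len(series)), 2)
)
print(f"strongest |r| among {len(series)} unrelated series: "
      f"{best_r:.2f} (pair {best_pair})")
```

Fifty series means 1,225 pairs, and with that many comparisons a near-perfect match is all but guaranteed even though nothing here causes anything else. Search across millions of search terms and the effect only gets stronger.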

The message is: approach all things statistical with a healthy level of skepticism, or be persistently deceived.