Most online statistics that you read are bullshit. They are weak and misleading numbers that persist in infographics, unsound “studies,” and our everyday business storytelling. Not only can these statistics mislead, but acting on a misreading of them can hurt your business.
In this post I’m going to discuss the relationship between correlation and causation, and why these are important when deciding what data to believe. At the end I’ll describe a way to look at comparative social data and make better decisions around its results.
Correlation and causation defined
Let’s start with a very straightforward definition of correlation and causation:
Correlation is the degree to which two or more quantities are linearly associated. (source: Wolfram Alpha)
Causation is the act or process of causing something to happen or exist. (source: Merriam Webster)
It’s important to understand the distinction between the two. Correlation shows an association between variables but can never show that one thing causes another. Here are a couple of graphic representations of correlation between two variables.
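The definition of correlation above can also be made concrete numerically. Here is a minimal sketch in Python of the standard Pearson correlation coefficient, which runs from -1 (perfect inverse linear relationship) to +1 (perfect linear relationship). The data and variable names are made up for illustration and don't come from any real study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical numbers, purely illustrative
posts_per_week = [1, 2, 3, 4, 5]
visits_up = [2, 4, 6, 8, 10]     # rises in lockstep with posts
visits_down = [10, 8, 6, 4, 2]   # falls in lockstep with posts

print(pearson_r(posts_per_week, visits_up))
print(pearson_r(posts_per_week, visits_down))
```

Notice that nothing in the calculation knows or cares which variable, if either, is driving the other. It only measures how closely the two move together.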
Online statistics will rarely ever show causation
You’ve probably heard the phrase “correlation doesn’t necessarily equal causation.” It’s an overused expression that people generally use right before they inappropriately infer cause from correlation. Put more simply, just because two things happen at the same time doesn’t mean that one is causing the other.
As an example, my wife and I have two kids. We had the same (horrible) nurse in the delivery room for both of their deliveries. Although there is 100% correlation between the nurse’s presence and our kids’ births, neither caused the other.
Online statistics tend to infer causation from correlation, despite brandishing the “correlation is not causation” cliche. This is wrong to write and wrong to accept.
Consider how difficult it is to determine what causes anything else. All variables would have to be isolated and then one would have to be explicitly shown to cause a specific outcome. In science this can be done through well-constructed, rigorous, expensive trials. For the type of data that informs infographics and most “studies” it is close to impossible.
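To see why isolating variables matters, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: a hidden variable I'm calling `quality` drives both `shares` and `rank`, while shares have no effect on rank at all. The two still come out clearly correlated. When shares are instead assigned at random (a crude stand-in for a controlled trial), the correlation goes away.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate(n=5000, seed=42):
    """Return (observed_r, randomized_r) for a made-up hidden-variable setup."""
    rng = random.Random(seed)

    # Hidden variable that we never get to measure directly (hypothetical)
    quality = [rng.gauss(0, 1) for _ in range(n)]

    # Both outcomes depend on quality plus noise; neither depends on the other
    shares = [q + rng.gauss(0, 1) for q in quality]
    rank = [q + rng.gauss(0, 1) for q in quality]

    # "Trial": shares are assigned at random, ignoring quality entirely
    randomized_shares = [rng.gauss(0, 1) for _ in range(n)]

    return pearson_r(shares, rank), pearson_r(randomized_shares, rank)

observed_r, randomized_r = simulate()
print(f"observed correlation between shares and rank: {observed_r:.2f}")
print(f"correlation once shares are randomized:       {randomized_r:.2f}")
```

In the real world you rarely get to run the randomized version, which is exactly why the observed number alone can't tell you what causes what.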
Which is not to say that correlation is a bad thing….
Some correlation is actionable
We wouldn’t take the time to observe and measure correlation unless it was helpful to us.
Consider the Google algorithm. There are many variables and conditions that inform which results come back to you on a SERP. Searchmetrics and Moz (and I’m sure others) do thorough analyses of SERPs to gauge how much correlation there is between certain variables and search engine rank. Let’s take a look at the top twelve ranking factors and their correlation coefficients from Searchmetrics’ 2014 data:
- Clickthrough rate (.67)
- Relevant terms (.34)
- Google +1 (.33)
- Number of backlinks (.31)
- Facebook shares (.28)
- Facebook total (.28)
- Facebook comments (.27)
- Pinterest (.27)
- SEO-visibility of backlinking URL (.26)
- Facebook Likes (.25)
- Tweets (.24)
- % backlinks = “rel=nofollow” (.23)
You may be tempted to look at the list and determine that you need to focus more on Google +1s and less on Facebook posts in order to raise your SERP position. That conclusion assumes that this is a report of causality, but the relationships these studies establish are not causal at all. From a statistical standpoint, the correlation coefficients in this study are weak for every factor other than clickthrough rate. Recall also that Google can’t see a large portion of Facebook and has said explicitly that +1s don’t influence SERPs, so they can’t cause anything to happen. These results show (weak) correlation where there isn’t a high probability of causation.
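One rough way to see how weak these numbers are is to square them. For a Pearson-style coefficient, r squared is the share of variation in one variable that moves in line with the other. The sketch below applies that to the coefficients listed above. Treating them as Pearson-style values is my assumption for illustration; the study's exact method isn't described here, so read the results as a rule of thumb and nothing more.

```python
# Correlation coefficients as listed above (Searchmetrics 2014 data)
coefficients = {
    "Clickthrough rate": 0.67,
    "Relevant terms": 0.34,
    "Google +1": 0.33,
    "Number of backlinks": 0.31,
    "Facebook shares": 0.28,
    "Facebook total": 0.28,
    "Facebook comments": 0.27,
    "Pinterest": 0.27,
    "SEO-visibility of backlinking URL": 0.26,
    "Facebook Likes": 0.25,
    "Tweets": 0.24,
    "% backlinks = rel=nofollow": 0.23,
}

# r squared: share of variation that lines up with rank,
# assuming the coefficients can be read like Pearson's r
r_squared = {factor: r ** 2 for factor, r in coefficients.items()}

for factor, value in r_squared.items():
    print(f"{factor:<36} r = {coefficients[factor]:.2f}   r^2 = {value:.0%}")
```

Under that assumption, clickthrough rate lines up with about 45% of the variation in rank, and every other factor on the list lines up with less than 12%. Even then, "lines up with" is still not "causes."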
An alternative way to look at this study of correlation is to ask yourself how many of these items you can improve for your website. Is there a way to devote your resources that might impact a few of these items? Since we really can’t understand the specifics of the Google algorithm (which x causes y), we can look at correlation here as a list of things that, when improved, may improve our SERP position to some degree. This is why many SEO practitioners will focus more effort on link building (#4) than on soliciting +1s.
If you consider how hard it would be to isolate a specific variable from others, you can safely assume that comparative data measures correlation. By disregarding the statistical possibility of causation, you can be properly critical of the data and make better decisions around the data.
Many writers pay lip service to correlation and causality and then continue to incorrectly infer causality from their data. Don’t fall down this rabbit hole.
If you take nothing else away from this post, understand this: all comparative social data and statistics will be correlative. You’re unlikely to ever isolate an explicit cause for your outcomes. And while I know this is unsettling, treating your comparative data as correlative allows you to make better decisions about it.
Let me know your thoughts on this one.