Despite its enormous potential, big data also comes with important limitations largely hidden in the current hype. Two ‘big data’ myths stand out: the confusion of correlation with causation, and the illusion that a big sample is the same thing as the population.
First, big data mining advocates claim that correlations suffice and that the quest for causal interpretation should be abandoned. The real danger is that you will be “fooled by association”, as explained in Freakonomics. I consulted for a car company whose managers were upset because, while profits were up, a ‘key performance indicator’ was down. After causal analysis, it became clear that this indicator neither caused nor led profits; the correlation was merely a coincidence and had reversed in recent periods. As a result of the causal analysis, managers could refocus their energies on moving those indicators that actually do lead sales, as shown in chapter 8 of www.notsizedata.com.
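To make the "coincidental correlation" point concrete, here is a minimal sketch in Python. The numbers are invented for illustration and are not the car company's data; the names `kpi`, `profit` and `pearson` are hypothetical. It only shows how a correlation that looks strong in one period can reverse in the next, which is a warning sign that the indicator does not drive the outcome. It is not a causal analysis in itself.

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical toy data: 12 periods of a KPI and of profits.
# The KPI rises for six periods and then falls; profits keep rising.
kpi = [1, 2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 1]
profit = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 25, 27]

early = pearson(kpi[:6], profit[:6])    # first six periods
recent = pearson(kpi[6:], profit[6:])   # last six periods
overall = pearson(kpi, profit)          # all twelve periods

print(f"early:   {early:+.3f}")
print(f"recent:  {recent:+.3f}")
print(f"overall: {overall:+.3f}")
```

In this toy series the early-period correlation is close to +1 and the recent-period correlation is close to -1, while profits rise throughout. A manager who trusted the early association would have chased the wrong indicator.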
Second, big data sometimes gives the illusion that sampling bias is no longer an issue (as it is for small data) because the data capture the entire population. However, “N = all is often an assumption rather than a fact about the data” (Kaiser Fung, Numbersense). For example, your social media data may accurately capture online sentiment, but only for those consumers who are online and care enough about your brand and product category to comment through the online channel. In recent research across 15 product categories, we compared the power of representative offline survey metrics (awareness, consideration, liking) and online behavior metrics (paid ad clicks, site visits and social media conversations) to explain and predict sales. We found that online behavior metrics excelled in short-term predictions, but that offline survey metrics excelled in medium-term predictions.
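The self-selection problem can be sketched with a few lines of Python. Everything below is an assumption made for illustration: the sentiment scale, the group sizes, and the rule that only consumers with strong feelings post online. It is not the data from the research described above. The point is simply that the average of the "online" records can differ sharply from the population average, however many records there are.

```python
# Hypothetical population of 100 consumers as (sentiment, posts_online) pairs.
# Sentiment runs from -2 (dislikes the brand) to +2 (loves the brand).
# Assumption: only consumers with strong feelings bother to comment online.
consumers = (
    [(2, True)] * 30      # enthusiasts, vocal online
    + [(0, False)] * 50   # indifferent majority, silent
    + [(-1, False)] * 15  # mildly negative, silent
    + [(-2, True)] * 5    # strongly negative, vocal online
)

def mean_sentiment(records):
    """Average sentiment over a list of (sentiment, posts_online) pairs."""
    return sum(sentiment for sentiment, _ in records) / len(records)

online_only = [record for record in consumers if record[1]]

population_mean = mean_sentiment(consumers)
online_mean = mean_sentiment(online_only)

print(f"population mean sentiment: {population_mean:.2f}")
print(f"online-only mean sentiment: {online_mean:.2f}")
```

Here the online records overstate brand sentiment by a wide margin. Collecting ten times as many posts from the same vocal group would not close that gap, because the bias comes from who posts, not from how many posts are collected.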
What have we learned? Blowing up data size does not absolve us of the challenges of drawing meaningful inferences from the data. The recent review of the Google Flu Trends “success” story illustrates the importance of both causal inference and sampling bias, as excellently described by Tim Harford (http://www.ft.com/intl/cms/s/2/21a6e7d8-b479-11e3-a09a-00144feabdc0.html#axzz2xN8T2nIp). In his words: “Big data has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.”