Prior to the 2016 US presidential election, The Upshot—a website run by The New York Times—conducted a survey of 867 likely Florida voters that showed Hillary Clinton leading Donald Trump by a 1% margin. The Upshot then shared the raw data with four well-respected pollsters and asked them to predict the result. Three predicted a Clinton victory, by margins of 4%, 3% and 1% respectively, while the fourth predicted a Trump win by 1%. The five estimates thus spanned a full 5 percentage points, even though all were based on the same data. Are opinion polls, then, a bit of an inexact science?
“In God we trust; all others must bring data.” This quote, attributed to statistician William Edwards Deming, has become a mantra for modern-day data science—not only for the importance of data measurement and analysis in business but also because it points to the promise of finding objective answers. Interestingly enough, though, data-based conclusions are often not unique. Different experts often draw different, even contradictory, sets of conclusions from the same data. Economists, for instance, may debate whether a recovery is V-shaped, W-shaped or K-shaped, all on the basis of the same data.
A landmark 2020 research paper in Nature shocked the scientific community by reporting the variability in the analyses of the same neuroimaging dataset by 70 teams worldwide. The teams were asked to analyse a dataset on 108 subjects who could either make or lose money by playing a type of gambling game while placed in MRI scanners.
The hypotheses concerned what happens in the brain during such a task. The teams reached even partial agreement on only four of the nine hypotheses; on the other five, their conclusions were largely divided.
The above study is no exception. Very different conclusions can be drawn from the same data, as Raphael Silberzahn of IESE Business School, Barcelona, and Eric L Uhlmann of INSEAD, Singapore, showed in their 2015 Nature paper. They observed that significant variation in the results of analyses of complex data may be difficult to avoid, even among experts with honest intentions.
Their study involved 29 teams comprising 61 analysts, each given the same dataset, collected by a sports-statistics firm, on 2,053 soccer players from the first male divisions of England, Germany, France and Spain in the 2012–13 season. The research question was whether referees are more likely to give red cards to players with a dark skin tone than to those with a light skin tone. Analytic approaches varied widely across the teams: 20 teams (69%) found a statistically significant positive effect, while nine (31%) did not observe a significant relationship.
Well, different datasets may yield different conclusions; that much is easy to understand. But how can the same data lead to contradictory conclusions? It is possible because statistics is not a natural science. In physics, whenever an apple falls from a tree, the acceleration due to gravity remains the same. In statistics, however, measures and models are built on the judgement and wisdom of the statisticians concerned. Different measures use the data in different ways and are thus bound to differ. The common man, however, struggles to understand which analysis should be trusted. And, from a broader perspective, how should the data-driven policies of a society be set?
The analysis of a dataset should be based on an understanding of the causality behind a phenomenon. The 2018 book, The Book of Why: The New Science of Cause and Effect, co-authored by Turing Award-winning computer scientist Judea Pearl, shows how understanding causality has revolutionised science. In one data example from the book, the hit rates of two American baseball players—David Justice and Derek Jeter—are compared over three consecutive years: 1995, 1996 and 1997. Hits/At Bats for Justice were 104/411 = 0.253, 45/140 = 0.321 and 163/495 = 0.329, while those for Jeter were 12/48 = 0.250, 183/582 = 0.314 and 190/654 = 0.291. Jeter’s hit rate is thus lower in every single year, yet an overall rate obtained by pooling the three years’ data shows the opposite.
The underlying reason is that Jeter had a large number of ‘At Bats’ in 1996, when his hit rate was highest, while Justice had a large number of them in 1995, when his hit rate was lowest. Pooled over the three years, Jeter’s rate is 385/1284 = 0.300 against Justice’s 312/1046 = 0.298. Weighted analyses can certainly produce different results; this reversal is a classic instance of Simpson’s paradox.
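The reversal can be verified in a few lines. Here is a minimal sketch in plain Python, using only the figures quoted above (no external data or libraries assumed):

```python
# Hit rates of Justice vs Jeter, per Pearl's example: Jeter trails in
# every single year, yet leads when the three years are pooled.

# (hits, at_bats) per year, taken from the figures quoted in the article
justice = {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)}
jeter   = {1995: (12, 48),   1996: (183, 582), 1997: (190, 654)}

def rate(hits, at_bats):
    return hits / at_bats

# Year by year, Justice is ahead every season
for year in (1995, 1996, 1997):
    jus, jet = rate(*justice[year]), rate(*jeter[year])
    print(f"{year}: Justice {jus:.3f} vs Jeter {jet:.3f}")

# Pooled over the three years, the ordering flips
def pooled(d):
    return rate(sum(h for h, _ in d.values()), sum(b for _, b in d.values()))

print(f"Pooled: Justice {pooled(justice):.3f} vs Jeter {pooled(jeter):.3f}")
```

The flip happens purely because the yearly rates are weighted by very different numbers of ‘At Bats’ when pooled; no new information enters the data.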
In reality, of course, one cannot afford to engage tens of analysts on a single problem. The culture of crowdsourcing data is thus gaining momentum, though it is not yet widespread. Researchers who painstakingly collect a dataset are often notoriously guarded about their data, and companies usually are too. And the problem is that “once a finding has been published in a journal, it becomes difficult to challenge. Ideas become entrenched too quickly, and uprooting them is more disruptive than it ought to be,” wrote Silberzahn and Uhlmann in their Nature paper.
Well, the history of science would vouch that even conclusions drawn from the simple observation that the sun crosses the sky every day can be varied and contradictory. Overall, it is quite tricky to extract objective information, let alone the ‘truth’, from data. We really don’t know which analysis to believe; the (possibly masked) data and the method of analysis are often not provided. And even your own analysis may well be biased, by an unknown amount, depending on the choice of variables, models and methodologies.
Professor of Statistics, Indian Statistical Institute, Kolkata