Originally published in The Reasoner, Volume 10, Number 4, March 2016
The Oscar-winning documentary Citizenfour brought the concept of metadata to the attention of general audiences. As one scene of the film explains, we leave, mostly unwillingly, many digital traces of our daily activities. Most Londoners, for instance, use an Oyster card to travel across the city. When they top up their Oyster online or opt in for the convenient auto top-up, they effectively allow whoever has access to the data to track their routine. (And the recent introduction of contactless payment on the London transport system clearly made this even simpler.) This can then be linked to what people buy, what they read on the internet, what they post on social networks, and indeed, to what other people do. That’s metadata.
It goes without saying that metadata is syntax with no semantics. There are many reasons why people do what they do, and there are many people travelling independently on the same journey. Quite obviously then, the dots representing their digital traces can be joined in a number of distinct ways, and may end up drawing specific but wrong pictures. That’s why the Orwellian idea that someone possesses a wealth of metadata about us is indeed frightening. But knowing that governments may kill on that basis is rather hard to accept.
The opening of this recent piece by C. Grothoff and J.M. Porup on Ars Technica UK is chilling:
In 2014, the former director of both the CIA and NSA proclaimed that “we kill people based on metadata.” Now, a new examination of previously published Snowden documents suggests that many of those people may have been innocent.
The article refers to the US National Security Agency’s SKYNET programme, which conducts mass monitoring of Pakistan’s mobile phone networks to obtain metadata. The goal is to quantify the likelihood of any particular individual being a terrorist. Data scientist and human rights activist Patrick Ball describes the method used by the NSA as “ridiculously optimistic” and “completely bullshit.” The reported result is appalling:
thousands of innocent people in Pakistan may have been mislabelled as terrorists by that “scientifically unsound” algorithm, possibly resulting in their untimely demise.
As the piece then explains, the methods used by the NSA are very similar to those used by Big Data business applications and spam filters. With a twist: instead of a product recommendation, the output of the machine learning algorithm is a death sentence for those it labels “terrorists”. (Needless to say, the details are politically quite involved, so I refer interested readers to Grothoff and Porup’s rich list of links to find out more.) Whilst an irrelevant suggestion to buy a certain book or an email wrongly labelled as spam can be at most annoying, giving the wrong label to a target of the SKYNET programme may have dreadful consequences. And yet, all those mistakes boil down to nothing more sophisticated than the base-rate fallacy.
In a nutshell, this very well-known problem in the calculus of probability shows that in testing for a property which is rarely observed in a population, even very accurate tests may produce a large proportion of false positives, i.e. individuals who are wrongly attributed the property tested for. This fallacy is so well-known that it features in textbooks, with the typical example being the disproportionate number of false positives which arises from a 99% accurate HIV test run on randomly selected samples. An example from the SKYNET programme mentioned in the Grothoff and Porup article is that of the Al-Jazeera journalist and longtime bureau chief in Islamabad, who scores very high on the NSA terrorist ranking because of his frequent journeys to areas known for terrorist activities.
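To make the arithmetic behind the base-rate fallacy concrete, here is a minimal Python sketch of the textbook calculation. All the figures in it are hypothetical and chosen only for illustration: a property held by 1 person in 10,000, a test that detects 99% of true cases and wrongly flags 1% of everyone else, and a population of one million. None of these numbers come from the SKYNET documents, and the function names are my own.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a flagged individual really has the property (Bayes' theorem)."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


def expected_false_positives(population, prevalence, false_positive_rate):
    """Expected number of individuals wrongly flagged in a population of the given size."""
    return population * (1.0 - prevalence) * false_positive_rate


# Hypothetical illustration only: a rare property and a "99% accurate" test.
ppv = positive_predictive_value(
    prevalence=1e-4,           # assumed: 1 in 10,000 has the property
    sensitivity=0.99,          # assumed: 99% of true cases are detected
    false_positive_rate=0.01,  # assumed: 1% of everyone else is wrongly flagged
)
wrongly_flagged = expected_false_positives(
    population=1_000_000, prevalence=1e-4, false_positive_rate=0.01
)

print(f"Chance that a flagged individual has the property: {ppv:.4f}")
print(f"Expected wrongly flagged individuals: {wrongly_flagged:.0f}")
```

Under these assumed figures, fewer than 1 in 100 of the flagged individuals actually has the property, while roughly ten thousand people out of a million are flagged wrongly. The rarer the property, the worse this ratio becomes, however accurate the test may sound.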
It is quite unbelievable that such a macroscopically flawed piece of reasoning is being used in SKYNET, thereby threatening the lives of thousands. For the fact that terrorists are a tiny minority of the (Pakistani) population doesn’t require proof. And even an otherwise remarkably low rate of false positives can potentially lead to thousands and thousands of false-positive executions. Indeed, many are reluctant to believe that no one at the NSA is able to spot this gigantic mistake; see for instance the discussion on Andrew Gelman’s blog. This might suggest that the latest instalment of the Snowden documents is just showing one very incomplete fragment of the story. Be that as it may, it’s certainly a story which shouldn’t have existed in the first place.
(Many thanks to Teddy Groves for pointing this out to me.)