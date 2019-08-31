Over a decade ago, AOL decided to give away its users’ search histories so that academics could use them for research. AOL thought it had anonymized the data: no usernames or other identifying information were included with the search histories. But it didn’t take long for researchers to start identifying users based on their searches.
What you search for gives away clues as to who you are. Whether you are looking for local restaurants, plumbers or cars, each search is a clue, and it didn’t take much to identify many of the AOL users.
In today’s world of data science and machine learning, robust datasets are essential for progress to be made. The more accurate the dataset the better, so datasets from real-world applications are often used for research. To make them available, the data is anonymized first, only small samples are taken, and those samples are then given to researchers.
Stripped of personally identifying information, the data is no longer considered protected by data protection laws and can be freely shared or sold. This applies to all types of data, from financial to medical and everything in between. Such datasets will include things such as birthdates (because the age of a person can be useful), zip codes (because the general location of a person can be useful), marital status, gender and a number of other data points, while omitting anything that directly identifies the person (such as name, phone number, social security number, and so on).
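As a rough illustration of what that kind of anonymization amounts to, here is a minimal Python sketch. The field names and the set of direct identifiers are assumptions made up for this example; real datasets and legal definitions will differ.

```python
# Hypothetical set of fields treated as direct identifiers (an assumption,
# not a legal definition).
DIRECT_IDENTIFIERS = {"name", "phone", "ssn"}


def strip_direct_identifiers(record):
    """Return a copy of the record without the directly identifying fields."""
    return {key: value for key, value in record.items()
            if key not in DIRECT_IDENTIFIERS}


# A made-up record for illustration only.
record = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "ssn": "000-00-0000",
    "birthdate": "1980-04-12",
    "zip": "12345",
    "gender": "F",
    "marital_status": "married",
}

released = strip_direct_identifiers(record)
print(released)
```

The released record no longer has a name, phone number or social security number, but the birthdate, zip code, gender and marital status all survive untouched.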
Recently, a study was published in which researchers set out to determine just how many data points it takes to identify someone in an anonymized dataset. They found that they could correctly identify 99.98 percent of Americans in any dataset using just 15 demographic data points.
With only three data points, they could still correctly identify 83 percent of the people in a dataset. Of course, which three data points were used mattered a lot: they were date of birth, gender and zip code. The other 12 mattered as well, and there’s no guarantee that every dataset will contain these specific 15 data points. But there are plenty of datasets out there with thousands of data points.
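To get a feel for why a few ordinary fields can single people out, here is a small Python sketch that counts how many records in a dataset have a combination of values that nobody else shares. It is a simple count over a made-up sample, not the method the study used, and the function name and field names are assumptions for the example.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0


# Made-up records for illustration only.
records = [
    {"birthdate": "1980-04-12", "gender": "F", "zip": "12345"},
    {"birthdate": "1980-04-12", "gender": "M", "zip": "12345"},
    {"birthdate": "1975-09-30", "gender": "F", "zip": "12345"},
    {"birthdate": "1975-09-30", "gender": "F", "zip": "12345"},
]

print(unique_fraction(records, ["zip"]))                         # 0.0
print(unique_fraction(records, ["gender", "zip"]))               # 0.25
print(unique_fraction(records, ["birthdate", "gender", "zip"]))  # 0.5
```

Each added field splits the records into smaller groups, so the share of people who stand alone in their group goes up. Anyone who stands alone can be matched to a name by someone who already knows those few facts about them.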
The recent Cambridge Analytica scandal showed us that the firm had more than 5,000 data points per U.S. voter, though there was no attempt to anonymize that data. Facebook gave it away freely to a researcher, who then gave it to Cambridge Analytica. There’s a good documentary on Netflix called ‘The Great Hack’ if you want more information about the Cambridge Analytica scandal.
I find it a bit chilling that someone with my anonymized medical records could identify me so easily. So what do we do about it? Well, right now there’s not much you can do about it.
You can reduce the number of data points you provide that could end up in a dataset. Ceasing use, or most of your use, of free services such as Facebook (where you are the product) and moving your online activities toward companies that take a strong stance on privacy can help. The U.S. doesn’t have strong data protection laws like the European Union’s General Data Protection Regulation (GDPR). But even data anonymized to GDPR standards allows for identification such as this.
We need stronger laws regarding data sharing and the kinds of information that shared datasets are allowed to contain. That would be a good start.