Making Perfect De-Identification the Enemy of Good De-Identification

This week, Ann Cavoukian and Dan Castro waded into the de-identification debate with a new whitepaper, arguing that the risk of re-identification has been greatly exaggerated and that de-identification will play a central role in the age of big data. FPF has repeatedly called for informed conversations about what practical de-identification requires. Part of the challenge is that terms like “de-identification” or “anonymization” have come to mean very different things to different stakeholders, but privacy advocates have also effectively made perfection the enemy of the good when it comes to de-identifying data.

Cavoukian and Castro highlight the oft-cited re-identification of Netflix users as an example of how re-identification risks have been overblown. Researchers were able to compare data released by Netflix with records available on the Internet Movie Database in order to uncover the identities of Netflix users.  While this example highlights the challenges facing organizations when they release large public datasets, it is easy to ignore that only two out of 480,189 Netflix users were successfully identified in this fashion. That’s a 0.0004 percent re-identification rate – that’s only a little bit worse than anyone’s odds of being struck by lightning.*

De-identification’s limitations are often conflated with a lack of trust in how organizations handle data in general. Most of the big examples of re-identification, like the Netflix example, focus on publicly-released datasets. When data is released into the wild, organizations need to be extremely careful; once data is out there, anyone with the time, energy, or technological capability has the opportunity to try to re-identify the dataset. There’s no question that companies have made mistakes when it comes to making their data widely available to the public.

But focusing on publicly-released information does not describe the entire universe of data that exists today. In reality, much data is never released publicly. Instead, de-identification is often paired with a variety of administrative and procedural safeguards that govern how individuals and organizations can use data. When used in combination, bad actors must (1) circumvent administrative restraints and (2) then re-identify any data before getting any value from their malfeasance. As a matter of simple statistics, the probability of breaching both sets of controls and successfully re-identifying data in a non-public database is low.
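To make the layered-safeguards point concrete, here is a minimal sketch. It treats a successful attack as two steps in sequence: getting past the administrative controls, and then re-identifying the data once inside. The function name and the example probabilities are hypothetical, chosen purely for illustration; they are not estimates from the whitepaper or from any study.

```python
def joint_breach_probability(p_circumvent: float, p_reidentify_given_access: float) -> float:
    """Probability that an attacker both circumvents administrative controls
    and then re-identifies the data (chain rule: P(A and B) = P(A) * P(B | A))."""
    for p in (p_circumvent, p_reidentify_given_access):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_circumvent * p_reidentify_given_access


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    p_circumvent = 0.05               # assumed chance of defeating administrative controls
    p_reidentify_given_access = 0.01  # assumed chance of re-identifying once inside
    combined = joint_breach_probability(p_circumvent, p_reidentify_given_access)
    print(f"Combined probability: {combined:.6f}")
```

Whatever the real inputs are, the product can never exceed the smaller of the two, which is why pairing de-identification with procedural safeguards lowers the overall risk.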

De-identification critics remain skeptical. Some have argued that any potential ability to reconnect information to an individual’s personal identity suggests inadequate de-identification. Perfect unlinkability may be an impossible standard, but this argument is less an attack on the efficacy of de-identification than it is a manifestation of a lack of trust. When some suggest we ignore privacy, it makes it easier for critics to not trust how businesses protect data. Fights about de-identification thus become a proxy for how much to trust industry.

In the process, discussions about how to advance practical de-identification are lost. As a privacy community, we should fight over exactly what de-identification means. FPF is currently engaged in just such a scoping project. Recognizing that there are many different standards for how academics, advocates, and industry understand “de-identified” data should be the start of a serious discussion about what we expect out of de-identification, not a reason to cast aside the concept altogether. Perfect de-identification may be impossible, but good de-identification isn’t.

-Joseph Jerome, Policy Counsel

* Daniel Barth-Jones notes that I’ve compared the Netflix re-identification study to the annual risk of being hit by lightning and responds as follows:

This was an excellent and timely piece, but there’s a fact that should be corrected because this greatly diminishes the actual impact of the statistic you’ve cited. The article cites the fact that only two out of 480,189 Netflix users were successfully identified using the IMDb data, which rounds to a 0.0004 percent (i.e., 0.000004 or 1/240,000) re-identification risk. This is correct, but then the piece goes on to say “that’s only a little bit worse than anyone’s odds of being struck by lightning.” Which, without further explanation, is likely to be misconstrued.

The blog author cites the annual risk for being hit by lightning (which is, of course, exceedingly small). However, the way most people probably think about lightning risk is not “what’s the risk of being hit in the next year”, but rather “what’s my risk of ever being hit by lightning”? While estimates of the lifetime risk of being hit by lightning vary slightly (according to the precision of the formulas used to calculate this estimate), one’s lifetime odds of being hit by lightning are somewhere between 1 in 6,250 and 1 in 10,000, so even if you went with the more conservative number here, the risk of being re-identified by the Netflix attack was only 1/24 of your lifetime risk of being hit by lightning (assuming you’ll make it to age 80 without something else getting you). This is truly a risk at a magnitude that no one rationally worries about.

Although the evidence-base provided by the Netflix re-identification was extremely thin, the algorithm is intelligently designed and it will be helpful to the furtherance of sound development of public policy to see what the re-identification potential is for such an algorithm with a real-world sparse dataset (perhaps medical data?) for a randomly selected data sample when examined with some justifiable starting assumptions regarding the extent of realistic data intruder background knowledge (which should reasonably account for practical data divergence issues).
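The arithmetic in this exchange is easy to check. The short sketch below uses only the figures quoted above (2 of 480,189 users identified, and lifetime lightning odds of 1 in 6,250 to 1 in 10,000). The function and variable names are illustrative and not taken from any source.

```python
from fractions import Fraction

# Figures quoted in the post and in the response above.
IDENTIFIED_USERS = 2
TOTAL_USERS = 480_189
LIFETIME_LIGHTNING_LOW = Fraction(1, 10_000)   # the more conservative estimate
LIFETIME_LIGHTNING_HIGH = Fraction(1, 6_250)


def reidentification_rate() -> Fraction:
    """Share of Netflix users re-identified in the study, as an exact fraction."""
    return Fraction(IDENTIFIED_USERS, TOTAL_USERS)


def lightning_multiple(lifetime_lightning_risk: Fraction) -> float:
    """How many times larger the lifetime lightning risk is than the re-identification rate."""
    return float(lifetime_lightning_risk / reidentification_rate())


if __name__ == "__main__":
    rate = reidentification_rate()
    print(f"Re-identification rate: {float(rate) * 100:.4f} percent")
    print(f"Roughly 1 in {round(1 / float(rate)):,}")
    print(f"Lightning (1 in 10,000) is about {lightning_multiple(LIFETIME_LIGHTNING_LOW):.0f}x more likely")
    print(f"Lightning (1 in 6,250) is about {lightning_multiple(LIFETIME_LIGHTNING_HIGH):.0f}x more likely")
```

With the 1-in-10,000 estimate, the lifetime lightning risk comes out at roughly 24 times the re-identification rate, which matches the 1/24 figure above. The less conservative 1-in-6,250 estimate widens the gap further.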
