Making Perfect De-Identification the Enemy of Good De-Identification

Making Perfect De-Identification the Enemy of Good De-Identification

This week, Ann Cavoukian and Dan Castro waded into the de-identification debate with a new whitepaper, arguing that the risk of re-identification has been greatly exaggerated and that de-identification will play a central role in the age of big data. FPF has repeatedly called for the need for informed conversations about what practical de-identification requires, and while part of the challenge is that terms like de-identification or “anonymization” have come to mean very different things to different stakeholders, privacy advocates have effectively made perfection the enemy of the good when it comes to de-identifying data.

Cavoukian and Castro highlight the oft-cited re-identification of Netflix users as an example of how re-identification risks have been overblown. Researchers were able to compare data released by Netflix with records available on the Internet Movie Database in order to uncover the identities of Netflix users.  While this example highlights the challenges facing organizations when they release large public datasets, it is easy to ignore that only two out of 480,189 Netflix users were successfully identified in this fashion. That’s a 0.0004 percent re-identification rate – that’s only a little bit worse than anyone’s odds of being struck by lightning.*

De-identification’s limitations are often conflated with a lack of trust in how organization’s handle data in general. Most of the big examples of re-identification, like the Netflix example, focus on publicly-released datasets. When data is released into the wild, organizations need to be extremely careful; once data is out there anyone with the time, energy, or technological capability has the opportunity to try to re-identify the dataset. There’s no question that companies have made mistakes when it comes to making their data widely available to the public.

But focusing on publicly-released information does not describe the entire universe of data that exists today. In reality, much data is never released publicly. Instead, de-identification is often paired with a variety of administrative and procedural safeguards that govern how individuals and organizations can use data. When used in combination, bad actors must (1) circumvent administrative restraints and (2) then re-identify any data before getting any value from their malfeasance. As a matter of simple statistics, the probability of breaching both sets of controls and successfully re-identifying data in a non-public database is low.

De-identification critics remain skeptical. Some have argued that any potential ability to reconnect information to an individual’s personal identify suggests inadequate de-identification. Perfect unlinkability may be an impossible standard, but this argument is less an attack on the efficacy of de-identification than it is a manifestation of a lack of trust. When some suggest we ignore privacy, it makes it easier for critics to not trust how businesses protect data. Fights about de-identification thus became a proxy for how much to trust industry.

In the process, discussions about how to advance practical de-identification are lost. As a privacy community, we should fight over exactly what de-identification means. FPF is currently engaged in just such a scoping project. Recognizing that there are many different standards for how academics, advocates, and industry understand “de-identified” data should be the start of a serious discussion about what we expect out of de-identification, not casting aside the concept altogether. Perfect de-identification may be impossible, but good de-identification isn’t.

-Joseph Jerome, Policy Counsel

* Daniel Barth-Jones notes that I’ve compared the Netflix re-identification study to the annual risk of being hit by lightning and responds as follows:

This was an excellent and timely piece, but there’s a fact that should be corrected because this greatly diminishes the actual impact of the statistic you’ve cited. The article cites the fact that only two out of 480,189 Netflix users were successfully identified using the IMDb data, which rounds to a 0.0004 percent (i.e., 0.000004 or 1/240,000) re-identification risk. This is correct, but then the piece goes on to say “that’s only a little bit worse than anyone’s odds of being struck by lightning.” Which, without further explanation, is likely to be misconstrued.

The blog author cites the annual risk for being hit by lightning (which is, of course, exceedingly small). However, the way most people probably think about lightning risk is not “what’s the risk of being hit in the next year”, but rather “what’s my risk of ever being hit by lightning”? While estimates of the lifetime risk of being hit by lightning vary slightly (according to the precision of the formulas used to calculate this estimate), one’s lifetime odds of being hit by lightning is somewhere between 1 in 6,250 and 1 in 10,000, so even if you went with the more conservative number here, the risk being re-identified by the Netflix attack was only 1/24 of your lifetime risk of being hit by lighting (assuming you’ll make to age 80 without something else getting you). This is truly a risk at a magnitude that no one rationally worries about.

Although the evidence-base provided by the Netflix re-identification was extremely thin, the algorithm is intelligently designed and it will be helpful to the furtherance of sound development of public policy to see what the re-identification potential is for such an algorithm with a real-world sparse dataset (perhaps medical data?) for a randomly selected data sample when examined with some justifiable starting assumptions regarding the extent of realistic data intruder background knowledge (which should reasonably account for practical data divergence issues).

Leave a Reply


Privacy Calendar

Nov
7
Fri
all-day George Washington Law Review 201... @ George Washington University Law School
George Washington Law Review 201... @ George Washington University Law School
Nov 7 – Nov 8 all-day
Save the date for the GW Law Review‘s Annual Symposium, The FTC at 100: Centennial Commemorations and Proposals for Progress, which will be held on Saturday, November 8, 2014, in Washington, DC. This year’s symposium, hosted in[...]
Nov
11
Tue
10:15 am You Are Here: GPS Location Track... @ Mauna Lani Bay Hotel & Bungalows
You Are Here: GPS Location Track... @ Mauna Lani Bay Hotel & Bungalows
Nov 11 @ 10:15 am
EFF Staff Attorney Hanni Fakhoury will present twice at the Oregon Criminal Defense Lawyers Association’s Annual Sunny Climate Seminar. He will give a presentation on government location tracking issues and then participate in a panel[...]
Nov
12
Wed
all-day PCLOB Public Meeting on “Definin... @ Washington Marriott Hotel
PCLOB Public Meeting on “Definin... @ Washington Marriott Hotel
Nov 12 all-day
The Privacy and Civil Liberties Oversight Board will conduct a public meeting with industry representatives, academics, technologists, government personnel, and members of the advocacy community, on the topic: “Defining Privacy.”   While the Board will[...]
Nov
20
Thu
all-day W3C Workshop on Privacy and User... @ Berlin, Germany
W3C Workshop on Privacy and User... @ Berlin, Germany
Nov 20 – Nov 21 all-day
The Workshop on User Centric App Controls intents to further the discussion among stakeholders of the mobile web platform, including researchers, developers and service providers. This workshop serves to investigate strategies toward better privacy protection[...]
Dec
2
Tue
all-day IAPP Practical Privacy Series 2014
IAPP Practical Privacy Series 2014
Dec 2 – Dec 3 all-day
Government and FTC and Consumer Privacy return to Washington, DC. For more information, click here.
Dec
11
Thu
9:00 am Progress of the EU Data Protecti...
Progress of the EU Data Protecti...
Dec 11 @ 9:00 am
The EU Member States have agreed to conclude the negotiations on the EU Data Protection draft Regulation in 2015. The process will have arrived at a critical point by the end of this year. The[...]

View Calendar