Making Perfect De-Identification the Enemy of Good De-Identification

This week, Ann Cavoukian and Dan Castro waded into the de-identification debate with a new whitepaper, arguing that the risk of re-identification has been greatly exaggerated and that de-identification will play a central role in the age of big data. FPF has repeatedly called for informed conversations about what practical de-identification requires. Part of the challenge is that terms like de-identification or “anonymization” have come to mean very different things to different stakeholders, but privacy advocates have also effectively made perfection the enemy of the good when it comes to de-identifying data.

Cavoukian and Castro highlight the oft-cited re-identification of Netflix users as an example of how re-identification risks have been overblown. Researchers were able to compare data released by Netflix with records available on the Internet Movie Database in order to uncover the identities of Netflix users. While this example highlights the challenges facing organizations when they release large public datasets, it is easy to overlook that only two out of 480,189 Netflix users were successfully identified in this fashion. That’s a 0.0004 percent re-identification rate – that’s only a little bit worse than anyone’s odds of being struck by lightning.*
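The arithmetic behind that rate is easy to check. The following is a minimal sketch in Python that uses only the two figures quoted above; the variable names are illustrative and do not come from the whitepaper.

```python
# Figures quoted in the post: 2 re-identified users out of 480,189.
reidentified_users = 2
total_users = 480_189

rate = reidentified_users / total_users        # fraction of users re-identified
rate_percent = rate * 100                      # the same figure as a percentage
one_in_n = total_users / reidentified_users    # the same figure in "1 in N" form

print(f"re-identification rate: {rate_percent:.4f} percent")
print(f"roughly 1 in {one_in_n:,.0f} users")
```

The percentage rounds to 0.0004, which is about 1 in 240,000.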

De-identification’s limitations are often conflated with a lack of trust in how organizations handle data in general. Most of the big examples of re-identification, like the Netflix example, focus on publicly released datasets. When data is released into the wild, organizations need to be extremely careful; once data is out there, anyone with the time, energy, or technological capability has the opportunity to try to re-identify the dataset. There’s no question that companies have made mistakes when it comes to making their data widely available to the public.

But publicly released information does not represent the entire universe of data that exists today. In reality, much data is never released publicly. Instead, de-identification is often paired with a variety of administrative and procedural safeguards that govern how individuals and organizations can use data. When the two are used in combination, bad actors must (1) circumvent the administrative restraints and (2) then re-identify the data before getting any value from their malfeasance. As a matter of simple statistics, the probability of breaching both sets of controls and successfully re-identifying data in a non-public database is low.
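The “simple statistics” here amount to multiplying probabilities: an attacker must first get past the administrative controls and then, given that access, succeed at re-identification. The sketch below illustrates the point with two purely hypothetical probabilities chosen for the example; neither figure comes from the whitepaper or from any measured dataset, and the function name is likewise an assumption.

```python
def joint_breach_probability(p_circumvent: float, p_reidentify_given_access: float) -> float:
    """Return P(circumvent AND re-identify) = P(circumvent) * P(re-identify | circumvent)."""
    for p in (p_circumvent, p_reidentify_given_access):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_circumvent * p_reidentify_given_access


# Hypothetical values, for illustration only.
p_circumvent = 0.01               # chance of getting past administrative safeguards
p_reidentify_given_access = 0.05  # chance of re-identifying a record once inside

p_both = joint_breach_probability(p_circumvent, p_reidentify_given_access)
print(f"probability of both steps succeeding: {p_both:.4%}")
```

Because each factor is at most 1, the combined probability can never exceed the smaller of the two, which is why layering safeguards on top of de-identification lowers the overall risk.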

De-identification critics remain skeptical. Some have argued that any potential ability to reconnect information to an individual’s personal identity suggests inadequate de-identification. Perfect unlinkability may be an impossible standard, but this argument is less an attack on the efficacy of de-identification than it is a manifestation of a lack of trust. When some suggest we ignore privacy, it becomes easier for critics to distrust how businesses protect data. Fights about de-identification thus become a proxy for how much to trust industry.

In the process, discussions about how to advance practical de-identification are lost. As a privacy community, we should fight over exactly what de-identification means. FPF is currently engaged in just such a scoping project. Recognizing that academics, advocates, and industry apply many different standards to “de-identified” data should be the start of a serious discussion about what we expect out of de-identification, not a reason to cast aside the concept altogether. Perfect de-identification may be impossible, but good de-identification isn’t.

-Joseph Jerome, Policy Counsel

* Daniel Barth-Jones notes that I’ve compared the Netflix re-identification study to the annual risk of being hit by lightning and responds as follows:

This was an excellent and timely piece, but there’s a fact that should be corrected because this greatly diminishes the actual impact of the statistic you’ve cited. The article cites the fact that only two out of 480,189 Netflix users were successfully identified using the IMDb data, which rounds to a 0.0004 percent (i.e., 0.000004 or 1/240,000) re-identification risk. This is correct, but then the piece goes on to say “that’s only a little bit worse than anyone’s odds of being struck by lightning.” Which, without further explanation, is likely to be misconstrued.

The blog author cites the annual risk for being hit by lightning (which is, of course, exceedingly small). However, the way most people probably think about lightning risk is not “what’s the risk of being hit in the next year”, but rather “what’s my risk of ever being hit by lightning”? While estimates of the lifetime risk of being hit by lightning vary slightly (according to the precision of the formulas used to calculate this estimate), one’s lifetime odds of being hit by lightning are somewhere between 1 in 6,250 and 1 in 10,000, so even if you went with the more conservative number here, the risk of being re-identified by the Netflix attack was only 1/24 of your lifetime risk of being hit by lightning (assuming you’ll make it to age 80 without something else getting you). This is truly a risk at a magnitude that no one rationally worries about.

Although the evidence-base provided by the Netflix re-identification was extremely thin, the algorithm is intelligently designed and it will be helpful to the furtherance of sound development of public policy to see what the re-identification potential is for such an algorithm with a real-world sparse dataset (perhaps medical data?) for a randomly selected data sample when examined with some justifiable starting assumptions regarding the extent of realistic data intruder background knowledge (which should reasonably account for practical data divergence issues).
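Barth-Jones’s lightning comparison can be reproduced with exact fractions. This is a minimal sketch that uses only the figures he gives (a 1/240,000 re-identification risk and lifetime lightning odds between 1 in 6,250 and 1 in 10,000); the variable names are illustrative.

```python
from fractions import Fraction

# Figures from the response above.
reid_risk = Fraction(1, 240_000)      # rounded Netflix re-identification risk
lightning_low = Fraction(1, 10_000)   # more conservative lifetime lightning odds
lightning_high = Fraction(1, 6_250)   # less conservative lifetime lightning odds

# How large the re-identification risk is relative to each lightning estimate.
ratio_vs_low = reid_risk / lightning_low
ratio_vs_high = reid_risk / lightning_high

print(ratio_vs_low)   # 1/24
print(ratio_vs_high)  # 5/192
```

Against the 1 in 10,000 estimate the ratio is the 1/24 cited in the response; against the 1 in 6,250 estimate it is smaller still, about 1/38.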
