Making Perfect De-Identification the Enemy of Good De-Identification

This week, Ann Cavoukian and Dan Castro waded into the de-identification debate with a new whitepaper, arguing that the risk of re-identification has been greatly exaggerated and that de-identification will play a central role in the age of big data. FPF has repeatedly called for informed conversations about what practical de-identification requires. Part of the challenge is that terms like “de-identification” or “anonymization” have come to mean very different things to different stakeholders, but privacy advocates have also effectively made perfection the enemy of the good when it comes to de-identifying data.

Cavoukian and Castro highlight the oft-cited re-identification of Netflix users as an example of how re-identification risks have been overblown. Researchers were able to compare data released by Netflix with records available on the Internet Movie Database in order to uncover the identities of Netflix users. While this example highlights the challenges facing organizations when they release large public datasets, it is easy to ignore that only two out of 480,189 Netflix users were successfully identified in this fashion. That’s a 0.0004 percent re-identification rate – that’s only a little bit worse than anyone’s odds of being struck by lightning.*
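For readers who want to check the arithmetic behind that percentage, here is a minimal sketch. It uses only the two counts cited above; the variable names are illustrative and not drawn from the whitepaper.

```python
# Counts as cited in the paragraph above.
identified = 2        # Netflix users reported as re-identified
released = 480_189    # Netflix users in the released dataset

# Re-identification rate as a plain proportion and as a percentage.
rate = identified / released
rate_percent = rate * 100

# The same rate expressed as "1 in N".
one_in_n = released / identified

print(f"proportion: {rate:.6f}")
print(f"percent:    {rate_percent:.4f}")
print(f"roughly 1 in {one_in_n:,.0f}")
```

Note that the rate is a share of everyone in the released dataset, which is how the post frames it.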

De-identification’s limitations are often conflated with a lack of trust in how organizations handle data in general. Most of the big examples of re-identification, like the Netflix example, focus on publicly released datasets. When data is released into the wild, organizations need to be extremely careful; once data is out there, anyone with the time, energy, or technological capability has the opportunity to try to re-identify the dataset. There’s no question that companies have made mistakes when it comes to making their data widely available to the public.

But focusing on publicly released information does not describe the entire universe of data that exists today. In reality, much data is never released publicly. Instead, de-identification is often paired with a variety of administrative and procedural safeguards that govern how individuals and organizations can use data. When these protections are used in combination, bad actors must (1) circumvent administrative restraints and (2) then re-identify any data before getting any value from their malfeasance. As a matter of simple statistics, the probability of breaching both sets of controls and successfully re-identifying data in a non-public database is low.
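The "simple statistics" point can be sketched as a product of two probabilities: the chance of getting past the administrative controls, and the chance of re-identifying a record once inside. The function name and both input values below are hypothetical, chosen only to show the shape of the calculation; the post gives no such figures.

```python
def joint_success_probability(p_bypass_controls: float,
                              p_reidentify_given_access: float) -> float:
    """Chance an attacker both bypasses the controls and then re-identifies data.

    The second argument is conditional on the first step having succeeded.
    """
    for p in (p_bypass_controls, p_reidentify_given_access):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    return p_bypass_controls * p_reidentify_given_access


# Hypothetical inputs for illustration only; neither number comes from the post.
p_bypass = 0.05
p_reid = 0.01

p_both = joint_success_probability(p_bypass, p_reid)
print(f"chance of both steps succeeding: {p_both:.4%}")
```

Whatever the real inputs are, the product can never exceed the smaller of the two probabilities, which is why layering safeguards on top of de-identification lowers overall risk.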

De-identification critics remain skeptical. Some have argued that any potential ability to reconnect information to an individual’s personal identity suggests inadequate de-identification. Perfect unlinkability may be an impossible standard, but this argument is less an attack on the efficacy of de-identification than it is a manifestation of a lack of trust. When some suggest we ignore privacy, it becomes easier for critics to distrust how businesses protect data. Fights about de-identification have thus become a proxy for how much to trust industry.

In the process, discussions about how to advance practical de-identification are lost. As a privacy community, we should fight over exactly what de-identification means. FPF is currently engaged in just such a scoping project. Recognizing that academics, advocates, and industry hold many different standards for “de-identified” data should be the start of a serious discussion about what we expect out of de-identification, not a reason to cast aside the concept altogether. Perfect de-identification may be impossible, but good de-identification isn’t.

-Joseph Jerome, Policy Counsel

* Daniel Barth-Jones notes that I’ve compared the Netflix re-identification study to the annual risk of being hit by lightning and responds as follows:

This was an excellent and timely piece, but there’s a fact that should be corrected because this greatly diminishes the actual impact of the statistic you’ve cited. The article cites the fact that only two out of 480,189 Netflix users were successfully identified using the IMDb data, which rounds to a 0.0004 percent (i.e., 0.000004 or 1/240,000) re-identification risk. This is correct, but then the piece goes on to say “that’s only a little bit worse than anyone’s odds of being struck by lightning.” Which, without further explanation, is likely to be misconstrued.

The blog author cites the annual risk for being hit by lightning (which is, of course, exceedingly small). However, the way most people probably think about lightning risk is not “what’s the risk of being hit in the next year”, but rather “what’s my risk of ever being hit by lightning”? While estimates of the lifetime risk of being hit by lightning vary slightly (according to the precision of the formulas used to calculate this estimate), one’s lifetime odds of being hit by lightning are somewhere between 1 in 6,250 and 1 in 10,000, so even if you went with the more conservative number here, the risk of being re-identified by the Netflix attack was only 1/24 of your lifetime risk of being hit by lightning (assuming you’ll make it to age 80 without something else getting you). This is truly a risk at a magnitude that no one rationally worries about.

Although the evidence-base provided by the Netflix re-identification was extremely thin, the algorithm is intelligently designed and it will be helpful to the furtherance of sound development of public policy to see what the re-identification potential is for such an algorithm with a real-world sparse dataset (perhaps medical data?) for a randomly selected data sample when examined with some justifiable starting assumptions regarding the extent of realistic data intruder background knowledge (which should reasonably account for practical data divergence issues).
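The 1/24 comparison in the comment above can be reproduced with a short calculation. This is a sketch that uses only the figures quoted in the comment (2 of 480,189 users, and lifetime lightning odds between 1 in 6,250 and 1 in 10,000); the variable names are illustrative.

```python
# Re-identification risk from the Netflix figures quoted above.
reid_risk = 2 / 480_189

# Lifetime lightning-strike odds quoted in the comment.
lifetime_risk_low = 1 / 10_000    # the estimate the comment uses for the 1/24 figure
lifetime_risk_high = 1 / 6_250    # the other end of the quoted range

# How many times smaller is the re-identification risk than each estimate?
times_smaller_low = lifetime_risk_low / reid_risk
times_smaller_high = lifetime_risk_high / reid_risk

print(f"vs 1 in 10,000: about 1/{times_smaller_low:.0f} of the lifetime risk")
print(f"vs 1 in 6,250:  about 1/{times_smaller_high:.0f} of the lifetime risk")
```

Using the 1 in 6,250 estimate instead makes the re-identification risk look smaller still relative to lightning, so the 1/24 figure is the less favorable of the two comparisons.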
