New Study Shows Need for De-identification Best Practices

Publicly releasing sensitive information is risky. In 1997, Latanya Sweeney used full date of birth, 5-digit ZIP code, and gender to show that seemingly anonymous medical data could be linked to an actual person when she uncovered the health information of William Weld, the former governor of Massachusetts. In a new study, Sweeney analyzes the data available in the Personal Genome Project (PGP) and shows once again that many people can be re-identified by using date of birth, ZIP code, and gender when other data, such as a voter registration list, is available.
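
To make the linkage idea concrete, here is a minimal sketch in Python. It is not Sweeney's actual method or data: every record, name, and field name below is invented for illustration, and the matching rule (a record counts as re-identified only when exactly one person on the public list shares its date of birth, ZIP code, and gender) is a simplifying assumption.

```python
# Toy linkage sketch: every record below is invented for illustration.
from collections import defaultdict

QUASI_IDENTIFIERS = ("dob", "zip", "gender")

# "De-identified" health records: names removed, quasi-identifiers kept.
health_records = [
    {"dob": "1950-01-15", "zip": "02100", "gender": "M", "diagnosis": "condition A"},
    {"dob": "1962-07-04", "zip": "02139", "gender": "F", "diagnosis": "condition B"},
    {"dob": "1962-07-04", "zip": "02139", "gender": "F", "diagnosis": "condition C"},
]

# Public list (for example, a voter roll) that carries names.
voter_list = [
    {"name": "Voter One", "dob": "1950-01-15", "zip": "02100", "gender": "M"},
    {"name": "Voter Two", "dob": "1962-07-04", "zip": "02139", "gender": "F"},
    {"name": "Voter Three", "dob": "1962-07-04", "zip": "02139", "gender": "F"},
]


def key(record):
    """Build the (dob, zip, gender) tuple used to join the two datasets."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(records, public_list):
    """Return (name, diagnosis) pairs for records whose key matches exactly one listed person."""
    names_by_key = defaultdict(list)
    for person in public_list:
        names_by_key[key(person)].append(person["name"])

    matches = []
    for record in records:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # an ambiguous key is not counted as a match
            matches.append((candidates[0], record["diagnosis"]))
    return matches


reidentified = link(health_records, voter_list)
print(reidentified)  # [('Voter One', 'condition A')]
```

In this toy data, only the first record is linked to a name. The other two share their date of birth, ZIP code, and gender with two listed people, so the match is ambiguous.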

Sweeney’s work is important, but we don’t think it should be considered an indictment of de-identification. The cases so often cited as proof that de-identification doesn’t work – the AOL search data release, the Netflix Prize, the Weld example, and the PGP data – are all examples of barely or very poorly de-identified data. De-identification experts do NOT consider a publicly disclosed database with full date of birth, 5-digit ZIP code, and gender to be de-identified. In fact, those three data points divide the US population into over 3 billion possible combinations: full date of birth yields over 36 thousand separate groups, ZIP code over 43 thousand, and gender doubles the product. That is roughly ten times the number of US residents, so many individuals fall into a combination of their own. Publicly releasing a database with so many possible combinations lets attackers link it with additional databases and gives them all the time in the world to examine the data. Thus, public disclosure greatly increases the risk of identifying individuals from a database.
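
The arithmetic behind the "over 3 billion" figure can be checked with a short Python sketch. The post gives only the rounded group counts, so the inputs here are our assumptions: a 100-year span of birth dates at 365.25 days per year, 43,000 ZIP codes, two recorded gender values, and a rough US population of 316 million.

```python
# Back-of-the-envelope check; all inputs are assumptions, not figures from the study.
BIRTH_YEARS = 100            # assumed span of birth years in the data
DAYS_PER_YEAR = 365.25       # average year length, counting leap days
ZIP_CODES = 43_000           # "over 43 thousand" in the text
GENDERS = 2                  # assumed number of recorded gender values
US_POPULATION = 316_000_000  # rough figure, assumption

birth_dates = int(BIRTH_YEARS * DAYS_PER_YEAR)
combinations = birth_dates * ZIP_CODES * GENDERS

print(f"{birth_dates:,} possible birth dates")      # 36,525
print(f"{combinations:,} possible combinations")    # 3,141,150,000
print(f"{combinations / US_POPULATION:.1f} combinations per resident")
```

Real populations are not spread evenly across these combinations, so this is an upper bound on the number of groups and not a measure of how many people are actually unique.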

Sweeney’s study shows the importance of very strong de-identification practices when data is disclosed publicly. With public data, organizations should use very strong de-identification tools and techniques, such as the Privacy Analytics Risk Assessment Tool developed by Dr. Khaled El Emam or differential privacy as proposed by Dr. Cynthia Dwork.
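
For readers unfamiliar with differential privacy, the sketch below shows the Laplace mechanism, one standard way to achieve it for a simple counting query. It is an illustration under stated assumptions, not a production design: the function names, the true count of 120, and the privacy parameter epsilon = 0.5 are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws.

    Each exponential draw has mean `scale` (rate 1 / scale); the difference of
    two such independent draws follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Release a count with noise calibrated to the privacy parameter epsilon.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # seeded only so the sketch is repeatable
released = noisy_count(true_count=120, epsilon=0.5, rng=rng)
print(f"released count: {released:.1f}")
```

A smaller epsilon means more noise and stronger privacy. A larger epsilon keeps the released count closer to the true value, which is the utility trade-off that any release has to weigh.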

For nonpublic databases, however, very strong de-identification techniques may not strike the right balance between data utility and privacy. When nonpublic databases are protected by both technical and administrative controls, reasonable de-identification techniques, as opposed to very strong ones, may be appropriate. Attackers do not have unlimited time to attempt to break the technical de-identification protection, third-party data is not available for linking, and legal commitments are in place. Data breaches can occur, of course, but we need to recognize the very different status of protected versus unprotected data and appreciate the range of protections that can support a de-identification promise.

FPF staff are conducting research on the different risk profiles of nonpublic and publicly released databases and on the relevant best practices for “pretty good” de-identification of restricted databases. Please contact us if you are interested.

 
