De-Identification: A Critical Debate

Ann Cavoukian and Dan Castro recently published a report titled Big Data and Innovation, Setting the Record Straight: De-Identification Does Work. Arvind Narayanan and Edward Felten wrote a critique of this report, which they highlighted on Freedom to Tinker. Today Khaled El Emam and Luk Arbuckle respond on the FPF blog with this guest post.


Why de-identification is a key solution for sharing data responsibly

Khaled El Emam (University of Ottawa, CHEO Research Institute & Privacy Analytics Inc.)

Luk Arbuckle (CHEO Research Institute, Privacy Analytics Inc.)

Arvind Narayanan and Edward Felten have responded to a recent report by Ann Cavoukian and Dan Castro (Big Data and Innovation, Setting the Record Straight: De-Identification Does Work) by claiming that de-identification is “not a silver bullet” and “still does not work.” The authors are misleading on both counts. First, no one, certainly not Cavoukian or Castro, claims that de-identification is a silver bullet, if by that you mean that de-identification is the modern equivalent of the medieval, magic weapon that could always and inexplicably defeat otherwise unconquerable foes like werewolves and vampires. Second, and to get away from unhelpful metaphors, de-identification does work, both in theory and in practice, and there is ample evidence that that’s true. Done properly, de-identification is a reliable and indispensable technique for sharing data in a responsible way that protects individuals.

Narayanan and Felten assert viewpoints that are not shared by the larger disclosure control community. Assuming the reader has already read both reports, we’ll respond to some of Narayanan’s and Felten’s claims and look at the evidence.

It’s important to highlight that we take an evidence-based approach—we support our statements with evidence and systematic reviews rather than expressing opinions. This matters because the evidence does not support the Narayanan and Felten perspective on de-identification.

Real-world evidence shows that the risk of re-identifying properly anonymized data is very small

Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small [1]. In that systematic review (which is the gold standard methodology for summarizing evidence on a given topic) we found that there were 14 known re-identification attacks. Two of those were conducted on data sets that were de-identified with methods that would be defensible (i.e., they followed existing standards). The success rate of the re-identification for these two was very small.

It is possible to de-identify location data

The authors claim that there are no good methods for de-identifying location data. In fact, there is relevant work on the de-identification of different types of location data [2]–[4]. The challenge we are facing is that many of these techniques are not being deployed in practice. We have a knowledge dissemination problem rather than a knowledge problem – i.e., sound techniques are known and available, but not often enough used. We should be putting our energy into translating best practices within the analytics community.

Computing re-identification probabilities is not only possible, but necessary

The authors criticize the computation of re-identification probabilities and characterize it as “silly”. They ignore the well-established literature on the computation of re-identification risk [5], [6]. These measurement and estimation techniques have been used for decades to share census as well as other population data and national surveys. For example, the Journal of Official Statistics has been publishing papers on risk measurement for a few decades. There is no evidence that these published risk probabilities were “silly” or, more importantly, that any of the data anonymized in reliance upon such risk measurements was re-identified.
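As a concrete (and deliberately simplified) illustration of how such risk measurements work—this is our own sketch, not code from the cited literature—a common approach groups records into equivalence classes on their quasi-identifiers and treats each record’s re-identification risk as 1/k, where k is the size of its class:

```python
from collections import Counter

def reid_risks(records, quasi_identifiers):
    """Compute simple re-identification risk metrics for a data set.

    Each record's risk is 1/k, where k is the size of its equivalence
    class (the group of records sharing the same quasi-identifier values).
    Returns (maximum risk, average risk) over all records.
    """
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    per_record = [1 / classes[tuple(r[q] for q in quasi_identifiers)]
                  for r in records]
    return max(per_record), sum(per_record) / len(per_record)

# Toy data set: age band and 3-digit ZIP prefix as quasi-identifiers.
data = [
    {"age": "30-39", "zip3": "100"},
    {"age": "30-39", "zip3": "100"},
    {"age": "30-39", "zip3": "100"},
    {"age": "30-39", "zip3": "100"},
    {"age": "40-49", "zip3": "100"},
    {"age": "40-49", "zip3": "100"},
]
max_risk, avg_risk = reid_risks(data, ["age", "zip3"])
```

On this toy data the maximum risk is 1/2 (the smallest class has two records) and the average risk is 1/3. Whether the maximum or the average is the right metric depends on the release context—a distinction the risk-based methodologies referenced above make explicitly.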

Second, the authors argue that a demonstration attack where a single individual in a database is re-identified is sufficient to show that a whole database can be re-identified. There is a basic fault here. Re-identification is probabilistic. If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.

The authors then go on to compare hacking the security of a system to re-identification by saying that if they hack one instance of a system (i.e., a demonstration of the hack) then all instances are hackable. But there is a fundamental difference. Hacking a system is deterministic. Re-identification is not deterministic – re-identifying a record does not mean that all records in the data set are re-identifiable. For example, in clinical research, if we demonstrate that we can cure a single person by giving him a drug (i.e., a demonstration) that does not mean that the drug will cure every other person—that would be nonsense. An effect on an individual patient is just that—an effect on an individual person. As another analogy, an individual being hit by lightning does not mean that everyone else in the same city is going to be hit by lightning. Basically, demonstrating an effect on a single person or a single record does not mean that the same effect will be replicated with certainty for all the others.

We should consider realistic threats

The authors emphasize the importance of considering realistic threats and give some examples of considering acquaintances as potential adversaries. We have developed a methodology that addresses the exact realistic threats that Narayanan and Felten note [4], [7]. Clearly everyone should be using such a robust methodology to perform a proper risk assessment—we agree. Full methodologies for de-identification have been developed (please see our O’Reilly book on this topic [4]) – the failure to use them broadly is the challenge society should be tackling.

The NYC Taxi data set was poorly de-identified – it is not an example of practices that anyone should follow

The re-identification attack on the NYC taxi data was cited as an example of how easy it is to re-identify data. That data set was poorly de-identified, which makes it a great example of the need for a robust de-identification methodology. The NYC taxi data used a one-way hash without a salt, which is simply poor practice, and takes us back to the earlier point that known methods need to be better disseminated. Using the NYC taxi example to make a general point about the discipline of de-identification is just misleading.
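To see why an unsalted hash offers essentially no protection for a small, structured identifier space, consider this simplified sketch (the toy medallion format here is illustrative, not the actual NYC format): an attacker simply enumerates the whole space and builds a lookup table.

```python
import hashlib

def md5_hex(s):
    """Unsalted one-way hash, as used on the NYC taxi medallions."""
    return hashlib.md5(s.encode()).hexdigest()

# Toy medallion space: one digit, a letter, two digits (900 values).
# Real medallion numbers follow different patterns, but any small,
# structured ID space can be exhaustively enumerated the same way.
medallions = [f"{d}A{n:02d}" for d in range(1, 10) for n in range(100)]

# The attacker precomputes hashes for every possible medallion.
lookup = {md5_hex(m): m for m in medallions}

# A "de-identified" record released with an unsalted hash...
released_hash = md5_hex("5A42")

# ...is trivially reversed by the lookup table.
recovered = lookup[released_hash]
```

Here `recovered` is the original medallion, `"5A42"`. The lesson is not that hashing is useless, but that an unsalted hash over an enumerable space is not de-identification at all.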

Computing correct probabilities for the Heritage Health Prize data set

One example that is mentioned by the authors is the Heritage Health Prize (HHP). This was a large clinical data set that was de-identified and released to a broad community [8]. To verify that the data set had been properly and securely de-identified, HHP’s sponsor commissioned Narayanan to perform a re-identification attack on the HHP data before it was released. It was based on the results of that unsuccessful attack that the sponsor made the decision to release the data for the competition.

In describing his re-identification attack on the HHP data set, Narayanan estimated the risk of re-identification to be 12.5%, using very conservative assumptions. This was materially different from the approximately 1% risk that was computed in the original de-identification analysis [8]. To get to 12.5%, he had to assume that the adversary would know seven different diagnosis codes (not common colloquial terms, but ICD-9 codes) that belong to a particular patient. He states “roughly half of members with 7 or more diagnosis codes are unique if the adversary knows 7 of their diagnosis codes. This works out to be half of 25% or 12.5% of members” (A. Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset”, 2011). That, by most standards, is quite a conservative assumption, especially when he also notes that diagnosis codes are not correlated in this data set – i.e., seven unrelated conditions for a patient! It’s not realistic to assume that an adversary knows so much medical detail about a patient. Most patients themselves do not know many of the diagnosis codes in their own records. But even if such an adversary does exist, he would learn very little from the data (i.e., the more the adversary already knows, the smaller the information gain from a re-identification). None of the known re-identification attacks that used diagnosis codes had that much detailed background information.
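For clarity, the arithmetic behind the quoted 12.5% figure is simply the product of the two fractions Narayanan states:

```python
# Figures as quoted from Narayanan's 2011 report.
frac_with_7plus_codes = 0.25  # members with 7 or more diagnosis codes
frac_unique_given_7 = 0.50    # of those, roughly half are unique on 7 codes

estimated_risk = frac_with_7plus_codes * frac_unique_given_7  # 0.125
```

Note that this estimate is conditional on the adversary actually knowing seven ICD-9 codes for the target—the assumption we argue above is unrealistic.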

The re-identification attack made some other broad claims without supporting evidence—for example, that it would be easy to match the HHP data with the California hospital discharge database. We did that! We matched the individual records in the de-identified HHP data set with the California State Inpatient Database over the relevant period, and demonstrated empirically that the match rate was very small.

It should also be noted that this data set was released with terms-of-use attached. All individuals who have access to the data have to agree to these terms-of-use. An adversary who knows a lot about a patient is likely to be living in the US or Canada (i.e., an acquaintance), and therefore the terms-of-use would be enforceable if there was a deliberate re-identification.

The bottom line from the HHP is that the commissioned re-identification attack (whose purpose was to re-identify individuals in the de-identified data) did not re-identify a single person. You could therefore argue that Narayanan made the empirical case for sound de-identification!

The authors do not propose alternatives

The process of re-identification is probabilistic. There is no such thing as zero risk. If relevant data holders deem any risk to be unacceptable, it will not be possible to share data. That would not make sense – we make risk-based decisions in our personal and business lives every day. Asking for consent or authorization for all data sharing is not practical, and consent introduces bias in the data because specific groups will not provide consent [9], [10]. For the data science community, the line of argument that any risk is too much risk is dangerous and should be very worrisome because it will adversely affect the flow of data.

The authors pose a false dichotomy for the future

The authors conclude that the only alternatives are (a) the status quo, where one de-identifies and, in their words, “hopes for the best”; or (b) using emerging technologies that involve some trade-offs in utility and convenience and/or using legal agreements to limit the use and disclosure of sensitive data.

We strongly disagree with that presentation of the alternatives.  First, the overall concept of trade-offs between data utility and privacy is already built into sound de-identification methodologies [7]. What is acceptable in a tightly controlled, contractually bound situation is quite different from what is acceptable when data will be released publicly – and such trade-offs are and should be quantified.

Second, de-identification is definitely not an alternative to using contracts to protect data. To the contrary, contractual protections are one part (of many) of the risk analyses done in contemporary de-identification methodologies. The absence of a contract always means that more changes to the data are required to achieve responsible de-identification (e.g., generalization, suppression, sub-sampling, or adding noise).

Most of all, we strongly object to the idea that proper de-identification means “hoping for the best.”  We ourselves are strongly critical of any aspect of the status quo whereby data holders use untested, sloppy methods to anonymize sensitive data.  We agree with privacy advocates that such an undisciplined approach is doomed to result in successful re-identification attacks and the growing likelihood of real harm to individuals if badly anonymized data becomes re-identified. Instead, we maintain, on the basis of decades of both theory and real-world evidence, that careful, thorough de-identification using well-tested methodologies achieves crucial data protection and produces a very small risk of re-identification. The challenge that we, as a privacy community, need to rise up to is to transition these approaches into practice and increase the maturity level of de-identification in the real world.

A call to action

It is important to encourage data custodians to use best current practices to de-identify their data. Repeatedly attacking poorly de-identified data captures attention, and it can be constructive if the lesson learned is that better de-identification methods should be used.



[1]          K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, p. e28071, Dec. 2011.

[2]          Anna Monreale, Gennady L. Andrienko, Natalia V. Andrienko, Fosca Giannotti, Dino Pedreschi, Salvatore Rinzivillo, and Stefan Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, vol. 3, no. 2, pp. 91–121, 2010.

[3]          S. C. Wieland, C. A. Cassa, K. D. Mandl, and B. Berger, “Revealing the spatial distribution of a disease while preserving privacy,” Proc. Natl. Acad. Sci. U.S.A., vol. 105, no. 46, pp. 17608–17613, Nov. 2008.

[4]          K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.

[5]          L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996.

[6]          L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001.

[7]          K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.

[8]          K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research, vol. 14, no. 1, p. e33, Feb. 2012.

[9]          K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association, vol. 16, no. 5, pp. 670–682, 2009.

[10]        K. El Emam, E. Jonker, E. Moher, and L. Arbuckle, “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics, vol. 13, no. 4, pp. 42–44, 2013.


Posted On
Jul 25, 2014
Posted By
D. Kellus Pruitt DDS

Those advocating de-identification, such as Khaled El Emam and Luk Arbuckle, are currently fending off attacks from opponents who argue that de-ID is not secure enough. The fact is, a person’s medical record contains so much information about the individual that just removing names, addresses and phone numbers is not enough precaution to protect patients’ privacy. It takes expert skill to determine what must be removed from medical records before they can be safely employed for secondary uses such as research.

Pardon me for intruding, but I am trying my best to draw attention to the fact that dentistry is different.

Outside the morgue, it is exceedingly unlikely that a person’s electronic dental records would be of aid in re-identifying their owner… even if someone wanted to. Because of the limited nature of dental records, it seems to me that de-identification of dentists’ primary records might offer a convenient, low-risk niche where the principles of de-identification (perhaps combined with tokenization to digitally re-unite the dental record with ePHI as needed) could be relatively safely applied.

It would be sort of like learning to swim in the shallow end.

D. Kellus Pruitt DDS

Posted On
Jul 28, 2014
Posted By
D. Kellus Pruitt DDS

I might have long ago lost interest in the security offered by de-identified dental records, except nobody has said it will not work. In fact, over a year ago, following Dr. Khaled El Emam’s article “Perspectives on Health Data De-identification,” I asked him:

“If it were possible to de-identify dentists’ primary electronic dental records according to the expert method of determination – in combination with employing tokenization to re-populate volatile identifiers as needed – wouldn’t this theoretically be more secure, as well as perhaps more convenient than a dentist’s DIY, on-site encryption?”

Khaled El Emam:
“While we have not done any work with dentistry data, the large amounts of health data we have analyzed indicate that Safe Harbor has critical weaknesses as described in the articles, and that is why we recommend a risk based approach following the statistical method (or expert determination method). Also, a risk based approach to de-identification would take the sensitivity of the data and potential harm to patients into account. So I think there is no inconsistency with what Dr Pruitt is advocating. In fact, a risk based approach may result in more data being disclosed and hence advancing evidence-based practices much faster.”

It’s common sense, actually. If patients’ identities are unavailable, they cannot be stolen.

The nature of dentistry lends itself to a de-ID solution, in my opinion. However, changing the minds of dental leaders who have committed to Meaningful Use requirements is difficult – even though electronic dental records are both more expensive and more dangerous than paper dental records.

Posted On
Jul 26, 2014
Posted By
Steve Wilson

El Emam and Arbuckle write:
“If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.”

No, but are we to assume that if one record in a set of 100 is re-identified, then no more will be? That’s not how probabilities work either. If the probability of re-identification is 0.01, then repeated applications of a probabilistic process may well eventually crack the whole set.
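As a rough illustration of that accumulation—assuming, for simplicity, independent attempts with a fixed per-attempt success probability—the chance of at least one success grows quickly with the number of attempts:

```python
p = 0.01  # per-attempt re-identification probability
n = 500   # number of independent attempts

# Probability that at least one of n attempts succeeds.
p_at_least_one = 1 - (1 - p) ** n
```

With p = 0.01 and 500 attempts, `p_at_least_one` is over 99%. Whether attempts are truly independent is itself an open modelling question, which is part of my point about undocumented methodology.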

One problem in this debate is that there are no accepted ways to characterise re-identification probability. It’s a bit like the lack of standards for rating identification errors in biometrics. Advocates for a position rely on isolated figures like “X percent” to convey an impression that something is very secure or very insecure. Readers are given no information on how the probabilities were worked out. Are they for one-off random identification attempts, or for a concerted series of attacks over time? Are they for a given known attack vector, or for the sum of all known attack vectors? How will the likelihood of re-identification change over time, as more data linkages become available?

My reading of Narayanan and Felten is that, while they did go for some flourish in their headline “Still Not Working”, their advocacy is of caution. We must not assume the best from a given small probability figure for re-identification. The stakes are high; we must not be complacent.

On the other hand, I feel that Cavoukian and Castro are advocating more for Big Data than for the scientific study of de-identification. And remember, it was they who started the dueling rhetoric with their headline “De-identification Does *Work*” (emphasis in the original).

Posted On
Jul 28, 2014
Posted By
Daniel Castro

Hi Steve,

Just for the record, I wrote a paper more than two years ago discussing the need to develop a privacy R&D roadmap, specifically to advance research in areas such as de-identification, and I also co-organized a workshop on this topic that brought together some of the best computer scientists and researchers in academia, government, and industry. So I’d say that my think tank has done as much as almost anyone in pushing for more scientific study of de-identification.

You can see my response to Narayanan and Felten here:

Yes, de-identification does work when done right (I think it’s fairly obvious that it doesn’t work when done wrong, but if the title doesn’t make that clear, I think the rest of the paper does). To me, this is like saying “yes, planes can fly” (not that I am denying the existence of plane crashes). In contrast, Narayanan and Felten argue that de-identification doesn’t just fail sometimes, but that they think it never works. Ever. Under any condition (or to take the analogy, they are saying “planes have never, and will never, be capable of flight”). That position is simply not supported by the facts, and I suspect they will eventually recant it.

Posted On
Jul 28, 2014
Posted By
Steve Wilson

Hi Daniel.
Thanks for the details. I’ll leave Ed and Arvind to defend themselves!
On the question of the science of de-identification, I am sure your think tank is very good, but my point is that saying “the chances of your record being re-identified are 1 in 10,000” is a problematic statement on many levels.

Firstly and mundanely, lay people are notoriously unable to conceptualise probabilities. So we need a better presentation of the facts if we are to hope for informed consent.

More importantly, the context and methodology behind any given probability is rarely set out. In particular, I am uneasy about calculations if they relate to the chances of anyone at all being randomly re-identified by the application of some algorithm, because re-identification as an attack could be a targeted exercise. To go back to the biometric comparison I made: fingerprint recognition vendors like to say the False Accept Rate is 0.1% or something, but those figures are usually computed under “Zero Effort Imposter” assumptions. That is, the false accepts being considered are the random errors. Biometric accuracy figures never reflect the probability of a concerted attack being successful. So that’s my worry with re-id probabilities as quoted. It might be 1 in 10,000 for a random exposure of any one person’s record, but if an attacker is looking for me in particular, what are the odds? I suspect that is less knowable. From another angle, it must be true that for a set of, say, 1,000,000 de-identified records, some people will be easier to re-identify than others. So to whom does a quote like “1 in 10,000” apply?

Finally the probability of re-identification is one dimension in a risk assessment; the other is the impact of re-identification (if we treat re-id as an adverse event then conventional risk assessment would have us compute the Risk as a sort of product of likelihood and impact). I took Luk Arbuckle to task on Twitter yesterday for casually opining (so it seemed) that the re-id “risk is low and not harmful”. The harm that results from re-id cannot be characterised like that for everyone in the population. It worries me that some proponents of de-identification conflate likelihood and impact like that.

Posted On
Aug 16, 2014
Posted By
Khaled El Emam

The other key thing to note, again about the HHP study which Narayanan & Felten refer to, is that the estimator that was used to get that 12.5% number (you’ll have to go back through the thread to get what I am talking about) was never validated. They made up an estimator which seemed to make sense, used it, and drew strong conclusions from that. Usually statistical estimators need to be validated to show that they are somewhat accurate. People build careers on developing and validating risk estimators. You can’t just invent one and start using it. Therefore, we have no idea if that 12.5% is too high, too low, or off the charts. For conclusions to have scientific credibility, the estimators that are used in the analysis need to be validated first.

Posted On
Aug 22, 2014
Posted By
Joe Moore

As an individual concerned about the accelerating decline in privacy fostered by the same interests that are defending de-identification as an interest-balancing mechanism, I am left wondering how the “if it was successfully attacked, it wasn’t properly de-identified in the first place” defense is supposed to reassure me.

If I see in a privacy policy that data will be collected but de-identified, how am I as a layman supposed to know whether it will be “properly de-identified”? Certainly I don’t get to choose de-identification methods on the open market. I am left to choose a provider and hope for the best. I certainly don’t have any rights if my provider chooses poorly.
