Author Archive

FPF Statement on Today’s Safe Harbor Complaint

Today, the Center for Digital Democracy filed a complaint with the Federal Trade Commission, alleging that companies are violating the U.S.-EU Safe Harbor agreement. CDD’s filing came with a report criticizing the practices of thirty companies.

“We are carefully reviewing the report’s claims, but the dozen we have examined so far seem to reflect the authors’ distaste for marketing rather than legal Safe Harbor violations,” said Jules Polonetsky, Executive Director, Future of Privacy Forum.

The Future of Privacy Forum has long focused on the value of the Safe Harbor agreement, and issued a comprehensive report on the framework last fall.

Cross Border Privacy Rules Advance at Beijing Meetings

APEC’s Data Privacy Subgroup concluded its 2014 meetings in Beijing, China, earlier this week. The Future of Privacy Forum participated in these meetings as a member of the U.S. delegation. The biggest development of the week was Canada’s submission of its Notice of Intent to participate in the Cross Border Privacy Rules (CBPR) system. After a favorable determination by APEC’s Joint Oversight Panel, Canada will become the fourth country to join the system, along with the United States, Mexico and Japan. In addition, TRUSTe, an APEC-approved Accountability Agent, announced that 14 companies are in the process of seeking certification. Taken together, these developments, along with Mexico’s recent steps toward interoperability, have provided promising momentum toward the establishment of an international privacy framework.

Still, much work remains before the true potential of the system can be fully realized. In July, FPF hosted officials from Privacy Thailand, a university-based consortium that advises the Thai Prime Minister’s office on data privacy and security issues. During their week-long visit, FPF and Privacy Thailand met with representatives from the Department of Commerce, the Federal Trade Commission and the U.S. Department of State to consider Thailand’s accession to the system. FPF will continue to work with interested APEC members to provide capacity-building assistance.

On August 8, APEC Economies and representatives from the EU’s Article 29 Working Party met to discuss next steps on the jointly developed Common Referential.  This document identifies points of commonality between the CBPR system and the EU’s system of Binding Corporate Rules (BCRs).  APEC members agreed to take this work forward by developing case studies that demonstrate the practical interoperability of these two systems and a checklist outlining the combined obligations for a company seeking certification under both.

On August 10, APEC Economies agreed to establish a working group to consider the applicability of the APEC Privacy Framework to Big Data.  This group will consider, among other things, appropriate administrative and policy safeguards when de-identifying personal information.  FPF plans to participate in this working group.

Participants continued the development of a CBPR certification system for data processors. In July, FPF hosted a meeting of this working group to develop the program requirements for this certification. Completion of this project is expected in advance of the next APEC Data Privacy Subgroup meetings in Clark, Philippines, in January 2015.

Comments to NTIA on Big Data

Today, FPF submitted comments to the NTIA as it begins its exploration of how big data impacts the Consumer Privacy Bill of Rights. While the NTIA sought comment on over a dozen key questions, our filing focuses largely on four issues: (1) the need for additional clarity surrounding the flexible application of the Consumer Privacy Bill of Rights’ privacy principles, (2) challenges to the “notice and choice” model and using context to inform a use-based approach to data use, (3) practical de-identification, and (4) what internal review boards might look like and consider in the age of big data.

Much of our filing builds upon FPF’s thinking on how to develop a benefit-risk analysis for data projects, with big data concerns of particular importance. Industry increasingly faces ethical considerations over how to minimize data risks while maximizing benefits to all parties. As the White House’s earlier Big Data Report acknowledged, there is a potential tension between socially beneficial and privacy-invasive uses of information in everything from educational technology to consumer-generated health data. The advent of big data requires active engagement by both internal and external stakeholders to increase transparency, accountability and trust.

FPF believes that a documented review process could serve as an important tool to infuse ethical considerations into data analysis without requiring radical changes to the business practices of innovators or industry in general. Institutional review boards (IRBs), which remain the chief regulatory response to decades of questionable ethical decisions in the field of human subject testing, provide a useful precedent for focusing on good process controls as a way to address potential privacy concerns. While IRBs have become a rigid compliance device and would be inappropriate for wholesale use in big data decision-making, they could provide a useful template for how projects can be evaluated based on prevailing community standards and subjective determinations of risks and benefits, particularly in cases involving greater privacy risks. Using the IRB model as inspiration, big data may warrant the creation of new advisory processes within organizations to more fully consider the ethical questions it poses.

Moving forward, broader big data ethics panels could provide a commonsense response to public concerns about data misuse. While these institutions could further expand the role of privacy professionals within organizations, they might also provide a forum for a diversity of viewpoints inside and outside of organizations. Ethics reviews could include members with different backgrounds, training, and experience, and could seek input from outside actors including consumer groups and regulators. While these panels will vary between the public and private sectors, and between businesses and researchers, they could provide an important check on data misuse.

Organizations and privacy professionals have become experienced at evaluating risk, but they should also engage in a rigorous data benefit analysis in conjunction with traditional privacy risk assessments. FPF suggests that organizations could develop procedures to assess the “raw value” of a data project, which would require organizations to identify the nature of a project, its potential beneficiaries, and the degree to which those beneficiaries would benefit from the project. Our guidance for this process is included in our filing for the first time.

Of course, big data hasn’t changed all the rules. And not every use of big data implicates our privacy. Many uses of big data are machine-to-machine or highly aggregated. Many new uses of data pose only marginal risks, which our current processes for mitigating risk can address well.

De-Identification: A Critical Debate

Ann Cavoukian and Dan Castro recently published a report titled Big Data and Innovation, Setting the Record Straight: De-Identification Does Work. Arvind Narayanan and Edward Felten wrote a critique of this report, which they highlighted on Freedom to Tinker. Today Khaled El Emam and Luk Arbuckle respond on the FPF blog with this guest post.


Why de-identification is a key solution for sharing data responsibly

Khaled El Emam (University of Ottawa, CHEO Research Institute & Privacy Analytics Inc.)

Luk Arbuckle (CHEO Research Institute, Privacy Analytics Inc.)

Arvind Narayanan and Edward Felten have responded to a recent report by Ann Cavoukian and Dan Castro (Big Data and Innovation, Setting the Record Straight: De-Identification Does Work) by claiming that de-identification is “not a silver bullet” and “still does not work.” The authors are misleading on both counts. First, no one, certainly not Cavoukian or Castro, claims that de-identification is a silver bullet, if by that you mean that de-identification is the modern equivalent of the medieval magic weapon that could always and inexplicably defeat otherwise unconquerable foes like werewolves and vampires. Second, and to get away from unhelpful metaphors, de-identification does work, both in theory and in practice, and there is ample evidence that this is true. Done properly, de-identification is a reliable and indispensable technique for sharing data in a responsible way that protects individuals.

Narayanan and Felten assert viewpoints that are not shared by the larger disclosure control community. Assuming the reader has already read both reports, we’ll respond to some of Narayanan’s and Felten’s claims and look at the evidence.

It’s important to highlight that we take an evidence-based approach—we support our statements with evidence and systematic reviews, rather than expressing opinions. This is important because the evidence does not support the Narayanan and Felten perspective on de-identification.

Real-world evidence shows that the risk of re-identifying properly anonymized data is very small

Established, published, and peer-reviewed evidence shows that following contemporary good practices for de-identification ensures that the risk of re-identification is very small [1]. In that systematic review (the gold-standard methodology for summarizing evidence on a given topic), we found 14 known re-identification attacks. Two of those were conducted on data sets that had been de-identified with defensible methods (i.e., methods that followed existing standards). The success rate of re-identification in those two attacks was very small.

It is possible to de-identify location data

The authors claim that there are no good methods for de-identifying location data. In fact, there is relevant work on the de-identification of different types of location data [2]–[4]. The challenge we are facing is that many of these techniques are not being deployed in practice. We have a knowledge dissemination problem rather than a knowledge problem – i.e., sound techniques are known and available, but not used often enough. We should be putting our energy into translating best practices within the analytics community.

Computing re-identification probabilities is not only possible, but necessary

The authors criticize the computation of re-identification probabilities and characterize it as “silly”. They ignore the well-established literature on the computation of re-identification risk [5], [6]. These measurement and estimation techniques have been used for decades to share census data as well as other population data and national surveys. For example, the Journal of Official Statistics has been publishing papers on risk measurement for decades. There is no evidence that these published risk probabilities were “silly” or, more importantly, that any data anonymized in reliance upon such risk measurements was re-identified.
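
To make this kind of measurement concrete, here is a minimal sketch, using entirely hypothetical quasi-identifier values, of one common style of estimate from that literature: take each record’s re-identification probability to be one over the size of its equivalence class on the quasi-identifiers, then report the maximum and average across the data set.

```python
# Minimal illustration of an equivalence-class-based re-identification risk estimate.
# The quasi-identifiers and values below are hypothetical, chosen only for illustration.
from collections import Counter

records = [
    # (year_of_birth, gender, region)
    (1968, "F", "East"),
    (1968, "F", "East"),
    (1968, "F", "East"),
    (1975, "M", "West"),   # unique on its quasi-identifiers
    (1975, "M", "East"),   # also unique
]

class_sizes = Counter(records)                 # size of each equivalence class
risks = [1 / class_sizes[r] for r in records]  # per-record probability estimate

print(f"maximum risk: {max(risks):.2f}")       # 1.00 for the two unique records
print(f"average risk: {sum(risks) / len(risks):.2f}")
```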

Second, the authors argue that a demonstration attack where a single individual in a database is re-identified is sufficient to show that a whole database can be re-identified. There is a basic fault here. Re-identification is probabilistic. If the probability of re-identification is 1 in 100, the re-identification of a single record does not mean that it is possible to re-identify all hundred records. That’s not how probabilities work.

The authors then go on to compare hacking the security of a system to re-identification by saying that if they hack one instance of a system (i.e., a demonstration of the hack) then all instances are hackable. But there is a fundamental difference. Hacking a system is deterministic. Re-identification is not deterministic – re-identifying a record does not mean that all records in the data set are re-identifiable. For example, in clinical research, if we demonstrate that we can cure a single person by giving him a drug (i.e., a demonstration) that does not mean that the drug will cure every other person—that would be nonsense. An effect on an individual patient is just that—an effect on an individual person. As another analogy, an individual being hit by lightning does not mean that everyone else in the same city is going to be hit by lightning. Basically, demonstrating an effect on a single person or a single record does not mean that the same effect will be replicated with certainty for all the others.

We should consider realistic threats

The authors emphasize the importance of considering realistic threats and give some examples of considering acquaintances as potential adversaries. We have developed a methodology that addresses the exact realistic threats that Narayanan and Felten note [4], [7]. Clearly everyone should be using such a robust methodology to perform a proper risk assessment—we agree. Full methodologies for de-identification have been developed (please see our O’Reilly book on this topic [4]) – the failure to use them broadly is the challenge society should be tackling.

The NYC Taxi data set was poorly de-identified – it is not an example of practices that anyone should follow

The re-identification attack on the NYC taxi data was cited as an example of how easy it is to re-identify data. That data set was poorly de-identified, which makes it a great example of the need for a robust de-identification methodology. The NYC taxi data used a one-way hash without a salt, which is simply poor practice, and takes us back to the earlier point that known methods need to be better disseminated. Using the NYC taxi example to make a general point about the discipline of de-identification is just misleading.
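
For readers unfamiliar with why a bare hash is such poor practice here, the sketch below illustrates the general problem; the simplified medallion format is invented for the example, and MD5 is used only because it is the hash reported in that attack. When the space of possible identifiers is small enough to enumerate, anyone can hash every candidate value and read the originals straight back off the published pseudonyms.

```python
# Illustrative only: reversing an unsalted hash of identifiers from a small, enumerable space.
# The "digit-letter-digit-digit" format below is a simplification invented for this example.
import hashlib
from string import ascii_uppercase, digits

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

# Precompute the hash of every possible identifier (~26,000 values: trivial to enumerate).
lookup = {
    md5_hex(f"{d1}{letter}{d2}{d3}"): f"{d1}{letter}{d2}{d3}"
    for d1 in digits for letter in ascii_uppercase for d2 in digits for d3 in digits
}

published_pseudonym = md5_hex("5X55")   # what a "de-identified" record would contain
print(lookup[published_pseudonym])      # -> "5X55": the original identifier is recovered instantly
```

A keyed transformation or randomly assigned pseudonyms, rather than a bare hash of the identifier, would have closed off this kind of brute-force lookup.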

Computing correct probabilities for the Heritage Health Prize data set

One example that is mentioned by the authors is the Heritage Health Prize (HHP). This was a large clinical data set that was de-identified and released to a broad community [8]. To verify that the data set had been properly and securely de-identified, HHP’s sponsor commissioned Narayanan to perform a re-identification attack on the HHP data before it was released. It was based on the results of that unsuccessful attack that the sponsor made the decision to release the data for the competition.

In describing his re-identification attack on the HHP data set, Narayanan estimated the risk of re-identification to be 12.5%, using very conservative assumptions. This was materially different from the approximately 1% risk that was computed in the original de-identification analysis [8]. To get to 12.5%, he had to assume that the adversary would know seven different diagnosis codes (not common colloquial terms, but ICD-9 codes) that belong to a particular patient. He states “roughly half of members with 7 or more diagnosis codes are unique if the adversary knows 7 of their diagnosis codes. This works out to be half of 25% or 12.5% of members” (A. Narayanan, “An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset”, 2011). That, by most standards, is quite a conservative assumption, especially when he also notes that diagnosis codes are not correlated in this data set – i.e., seven unrelated conditions for a patient! It’s not realistic to assume that an adversary knows so much medical detail about a patient. Most patients themselves do not know many of the diagnosis codes in their own records. But even if such an adversary does exist, he would learn very little from the data (i.e., the more the adversary already knows, the smaller the information gain from a re-identification). None of the known re-identification attacks that used diagnosis codes had that much detailed background information.
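
For clarity, the 12.5% figure quoted above is simply the product of the two proportions Narayanan reports:

```python
# Reproducing the 12.5% figure from the two proportions quoted above.
p_seven_or_more_codes = 0.25  # share of members with 7 or more diagnosis codes
p_unique_given_seven = 0.50   # share of those members who are unique on 7 known codes
print(p_seven_or_more_codes * p_unique_given_seven)  # 0.125
```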

The re-identification attack made some other broad claims without supporting evidence—for example, that it would be easy to match the HHP data with the California hospital discharge database. We did that! We matched the individual records in the de-identified HHP data set with the California State Inpatient Database over the relevant period and demonstrated empirically that the match rate was very small.

It should also be noted that this data set was subject to terms of use. All individuals who had access to the data had to agree to those terms. An adversary who knows a lot about a patient is likely to be living in the US or Canada (i.e., an acquaintance), and therefore the terms of use would be enforceable if there were a deliberate re-identification.

The bottom line from the HHP is that the commissioned re-identification attack (whose purpose was to re-identify individuals in the de-identified data) did not re-identify a single person. You could therefore argue that Narayanan made the empirical case for sound de-identification!

The authors do not propose alternatives

The process of re-identification is probabilistic. There is no such thing as zero risk. If relevant data holders deem any risk to be unacceptable, it will not be possible to share data. That would not make sense – we make risk-based decisions in our personal and business lives every day. Asking for consent or authorization for all data sharing is not practical, and consent introduces bias in the data because specific groups will not provide consent [9], [10]. For the data science community, the line of argument that any risk is too much risk is dangerous and should be very worrisome because it will adversely affect the flow of data.

The authors pose a false dichotomy for the future

The authors conclude that the only alternatives are (a) the status quo, where one de-identifies and, in their words, “hopes for the best”; or (b) using emerging technologies that involve some trade-offs in utility and convenience, and/or using legal agreements to limit the use and disclosure of sensitive data.

We strongly disagree with that presentation of the alternatives.  First, the overall concept of trade-offs between data utility and privacy is already built into sound de-identification methodologies [7]. What is acceptable in a tightly controlled, contractually bound situation is quite different from what is acceptable when data will be released publicly – and such trade-offs are and should be quantified.

Second, de-identification is definitely not an alternative to using contracts to protect data. To the contrary, contractual protections are one part (of many) of the risk analyses done in contemporary de-identification methodologies. The absence of a contract always means that more changes to the data are required to achieve responsible de-identification (e.g., generalization, suppression, sub-sampling, or adding noise).
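
For readers who have not seen these transformations in practice, here is a minimal toy sketch of what they can look like; the field names, bin widths, and noise scale are arbitrary choices made for illustration, not a recommended recipe.

```python
# Toy illustration of generalization, suppression, added noise, and sub-sampling.
# Field names, bin widths, and the noise scale are arbitrary choices for this example.
import random

random.seed(0)

records = [
    {"age": 34, "zip": "90210", "income": 52000},
    {"age": 36, "zip": "90211", "income": 48500},
    {"age": 71, "zip": "10003", "income": 61000},
]

def transform(rec):
    decade = (rec["age"] // 10) * 10
    return {
        "age_band": f"{decade}-{decade + 9}",                    # generalization
        "zip3": rec["zip"][:3],                                  # generalization
        "income": round(rec["income"] + random.gauss(0, 1000)),  # added noise
    }

generalized = [transform(r) for r in records]

# Suppression: drop the one record that remains distinctive even after generalization.
kept = [r for r in generalized if r["age_band"] != "70-79"]

# Sub-sampling: release only a random subset of what remains.
released = random.sample(kept, k=max(1, len(kept) // 2))
print(released)
```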

Most of all, we strongly object to the idea that proper de-identification means “hoping for the best.” We ourselves are strongly critical of any aspect of the status quo whereby data holders use untested, sloppy methods to anonymize sensitive data. We agree with privacy advocates that such an undisciplined approach is doomed to result in successful re-identification attacks and a growing likelihood of real harm to individuals if badly anonymized data becomes re-identified. Instead, we maintain, on the basis of decades of both theory and real-world evidence, that careful, thorough de-identification using well-tested methodologies achieves crucial data protection and produces a very small risk of re-identification. The challenge that we, as a privacy community, need to rise to is to transition these approaches into practice and increase the maturity level of de-identification in the real world.

A call to action

It is important to encourage data custodians to use best current practices to de-identify their data. Repeatedly attacking poorly de-identified data captures attention, and it can be constructive if the lesson learned is that better de-identification methods should be used.



[1]          K. El Emam, E. Jonker, L. Arbuckle, and B. Malin, “A Systematic Review of Re-Identification Attacks on Health Data,” PLoS ONE, vol. 6, no. 12, p. e28071, Dec. 2011.

[2]          A. Monreale, G. L. Andrienko, N. V. Andrienko, F. Giannotti, D. Pedreschi, S. Rinzivillo, and S. Wrobel, “Movement Data Anonymity through Generalization,” Transactions on Data Privacy, vol. 3, no. 2, pp. 91–121, 2010.

[3]          S. C. Wieland, C. A. Cassa, K. D. Mandl, and B. Berger, “Revealing the spatial distribution of a disease while preserving privacy,” Proc. Natl. Acad. Sci. U.S.A., vol. 105, no. 46, pp. 17608–17613, Nov. 2008.

[4]          K. El Emam and L. Arbuckle, Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly, 2013.

[5]          L. Willenborg and T. de Waal, Statistical Disclosure Control in Practice. New York: Springer-Verlag, 1996.

[6]          L. Willenborg and T. de Waal, Elements of Statistical Disclosure Control. New York: Springer-Verlag, 2001.

[7]          K. El Emam, Guide to the De-Identification of Personal Health Information. CRC Press (Auerbach), 2013.

[8]          K. El Emam, L. Arbuckle, G. Koru, B. Eze, L. Gaudette, E. Neri, S. Rose, J. Howard, and J. Gluck, “De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset,” Journal of Medical Internet Research, vol. 14, no. 1, p. e33, Feb. 2012.

[9]          K. El Emam, F. Dankar, R. Issa, E. Jonker, D. Amyot, E. Cogo, J.-P. Corriveau, M. Walker, S. Chowdhury, R. Vaillancourt, T. Roffey, and J. Bottomley, “A Globally Optimal k-Anonymity Method for the De-identification of Health Data,” Journal of the American Medical Informatics Association, vol. 16, no. 5, pp. 670–682, 2009.

[10]        K. El Emam, E. Jonker, E. Moher, and L. Arbuckle, “A Review of Evidence on Consent Bias in Research,” American Journal of Bioethics, vol. 13, no. 4, pp. 42–44, 2013.

Privacy Chutzpah: A Story for the Onion?

I recently received an email promoting a campaign by a group called Some of Us, an organization that generates petitions opposing various activities of large companies. This campaign was directed at Facebook, calling on the social network not to sell user data to advertisers. Facebook recently announced plans to allow advertisers to target ads to Facebook users based on the web sites those users have visited. Facebook is not selling user data to advertisers, but I can understand the confusion. Behavioral advertising is complicated, and although selling user data to advertisers is very different from choosing ads for users based on their web surfing, it’s not uncommon for critics to use broad language to blast targeted ads in general.

[Video: How Ads Work on Facebook, from Facebook on Vimeo]

The surprise was what I found when I examined the privacy policy for the Some of Us site. In a move worthy of an Onion fake news story, the Some of Us policy discloses that it works with ad networks to retarget ads to users around the web after they visit the Some of Us site. Yup! Some of Us does exactly what it is calling on users to protest to Facebook. A quick scan of the site using the popular tracking cookie scanner Ghostery finds the code of several ad companies, including leading data broker Acxiom.

Some of Us also complains that the Facebook opt-out process, in which Facebook links users to the industry’s central opt-out site, is too tedious. But Some of Us doesn’t even bother to provide its visitors with a link or a URL to opt out, as the behavioral advertising code enforced by the Better Business Bureau requires. Some of Us just tells visitors they can visit the Network Advertising Initiative opt-out page, leaving them to figure out on their own how to find it.

It gets better. Some of Us solicits users’ emails and names for petitions, but only if you read the site’s privacy policy will you learn that signing a petition adds you to the email list for future emails from Some of Us about other causes. The privacy policy also explains the use of email web bugs that enable Some of Us to track whether and when individual recipients open and read its emails.

I am used to reading media stories blasting behavioral ads, published on newspaper home pages embedded with dozens of web trackers. Reporters don’t run newspapers’ web sites, and although they might want to consider whether the ad tracking they find odious is funding their salaries, they can credibly argue that the business side of media and reporting are separate worlds. But how can an advocacy group blast behavioral ads while targeting behavioral ads at users who come to sign a petition against behavioral ads?!!!

I signed the petition and was immediately taken to a page where Some of Us encouraged me to share the news with my friends on Facebook.

-Jules Polonetsky, Executive Director

This post originally appeared on LinkedIn.

Privacy Calendar

Sep 15, all day – Big Data: A Tool for Inclusion or Exclusion? @ Constitution Center
The Federal Trade Commission will host a public workshop entitled “Big Data: A Tool for Inclusion or Exclusion?” in Washington on September 15, 2014, to [...]

Sep 17 – Sep 19, all day – IAPP Privacy Academy and CSA Congress 2014 @ San Jose Convention Center
This fall, the International Association of Privacy Professionals (IAPP) and Cloud Security Alliance (CSA) are bringing together the IAPP Privacy Academy and the CSA Congress [...]

Oct 21, 6:00 pm – 8:00 pm – Consumer Action’s 43rd Annual Awards Reception @ Google
To mark its 43rd anniversary, Consumer Action’s Annual Awards Reception on October 21, 2014, will celebrate the theme of “Train the Trainer.” Through the power of [...]

Jan 28, all day – Data Privacy Day
“Data Privacy Day began in the United States and Canada in January 2008, as an extension of the Data Protection Day celebration in Europe. The [...]