This post is part one in a series examining privacy and transparency issues in the context of public access to digital court records, building on my essay “Digital Court Records Access, Social Justice and Judicial Balancing: What Judge Coffin Can Teach Us.”
In its proposed electronic court records access rules, the Maine Supreme Judicial Court (SJC) imposes new and extensive filing obligations on litigants, including a requirement to redact certain categories of sensitive personal information.
Regardless of what one might think about the wisdom of placing this burden on litigants, it is important to ask what the SJC hopes to achieve by this requirement. Even assuming full compliance, which is doubtful, redaction as a de-identification technique, without more, would be wholly inadequate to protect the privacy of Maine citizens.
In today’s big data world, given the sophistication of data handlers, it is well-recognized that de-identification alone is not enough to prevent re-identification of individuals, and the SJC’s reliance on it promotes a false sense of security. The risk of re-identification of individuals from purportedly de-identified databases is significant.
As pointed out in my essay,
As long ago as 2010, Paul Ohm, a leading privacy scholar, brought attention to the fact that computer scientists “have demonstrated that they can often ‘reidentify’ or ‘deanonymize’ individuals hidden in anonymized data with astonishing ease.” In his groundbreaking article examining this research, Ohm described in detail three spectacular failures of anonymization to reinforce his point that “we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed.” Each of these incidents – the 2006 AOL data release, the Massachusetts Group Insurance Commission’s release of “de-identified” medical records, and the 2006 Netflix prize data study – has been widely publicized.
In each of these incidents, said Ohm, “even though administrators had removed any data fields they thought might uniquely identify individuals, researchers . . . unlocked identity by discovering pockets of surprising uniqueness remaining in the data.”
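The linkage technique behind incidents like the Massachusetts Group Insurance Commission release is conceptually simple: a "de-identified" record is matched to an identified public record (such as a voter roll) on the quasi-identifiers both share. The following is a minimal illustrative sketch using entirely fabricated data; the records, names, and the `reidentify` helper are hypothetical, not drawn from any actual dataset.

```python
# Hypothetical illustration of a linkage attack: records whose direct
# identifiers (names) were redacted are re-identified by joining on
# quasi-identifiers (ZIP code, birth date, sex) against an identified
# public dataset, e.g., a voter registration list.

# "De-identified" records: name removed, quasi-identifiers retained.
deidentified_records = [
    {"zip": "04101", "birth_date": "1954-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "04401", "birth_date": "1971-02-14", "sex": "M", "diagnosis": "diabetes"},
]

# Publicly available, identified dataset (fabricated for illustration).
public_records = [
    {"name": "Jane Doe", "zip": "04101", "birth_date": "1954-07-31", "sex": "F"},
    {"name": "John Roe", "zip": "04401", "birth_date": "1971-02-14", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def reidentify(deidentified, public):
    """Link each redacted record to any public record sharing all quasi-identifiers."""
    matches = []
    for record in deidentified:
        key = tuple(record[q] for q in QUASI_IDENTIFIERS)
        for person in public:
            if tuple(person[q] for q in QUASI_IDENTIFIERS) == key:
                matches.append((person["name"], record["diagnosis"]))
    return matches

print(reidentify(deidentified_records, public_records))
```

Because the combination of ZIP code, birth date, and sex is unique for a large share of the population, redacting names alone leaves exactly the "pockets of surprising uniqueness" Ohm describes.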
The Federal Trade Commission (FTC), in its privacy framework published in 2012, likewise concluded that “[t]here is significant evidence demonstrating that technological advances and the ability to combine disparate pieces of data can lead to identification of a consumer, computer, or device even if the individual pieces of data do not constitute [personally identifiable information (PII)].” Moreover, continued the FTC, “not only is it possible to re-identify non-PII data through various means, businesses have strong incentives to actually do so.”
The FTC’s privacy framework, which requires organizations to implement three significant protections for data to minimize the risk of re-identification, established a best practices standard that is widely accepted. First, the organization “must take reasonable measures to ensure that the data is de-identified. This means that the [organization] must achieve a reasonable level of justified confidence that the data cannot reasonably be used to infer information about, or otherwise be linked to, a particular [individual], computer, or other device.” Second, the organization must “publicly commit to maintain and use the data in a de-identified fashion, and not to attempt to re-identify the data.” Third, if the organization “makes such de-identified data available to other [persons], it should contractually prohibit such persons from attempting to re-identify the data.”
Echoing the FTC, in 2014 the President’s Council of Advisors on Science and Technology (PCAST) concluded that “[a]nonymization remains somewhat useful as an added safeguard, but it is not robust against near-term future re-identification methods. PCAST does not see it as being a useful basis for policy.”
Closer to home, the Maine Health Data Organization (MHDO), an independent executive branch agency, maintains a comprehensive database of health care information about Maine citizens collected from health care facilities and payors. The MHDO has established rules designed to make its data publicly available and accessible to the broadest extent consistent with the laws protecting individual privacy and proprietary information.
Recognizing the ease with which purportedly de-identified data can be re-identified, the MHDO rules define “direct patient identifiers” as “[i]nformation such as name, social security number, and date of birth, that uniquely identifies an individual or that can be combined with other readily available information to uniquely identify an individual.” The MHDO definition mirrors the safe harbor definition of “de-identified” health data under HIPAA, which mandates removal of 18 categories of identifiers from a data file, in addition to requiring that the data file “[cannot without actual knowledge of the covered entity] be used alone or in combination with other information to identify an individual who is a subject of the information.”
The MHDO rules, both in design and practice, essentially implement each of the protections for data established as a best practices standard by the FTC to minimize the risk of re-identification. For example, under the MHDO rules, release to the public of any de-identified data or limited data sets is made conditional on the recipient’s agreement to abide by the terms of the MHDO’s standard data use agreement which, among other things, requires that the recipient only use the released data in ways that maintain patient anonymity and prohibits the recipient from “link[ing] these data to other records or data bases” in an attempt to identify any individuals.
Two recent trial court orders in…