Open, pseudonymised and personal data: what Government should do next

For some weeks we’ve been hearing more about the Public Data Corporation idea: commercial partnerships to monetise new categories of government data sets, and a possible expo to sell public data this autumn. Sir Bonar even dictated some Tweets on the matter to his ever-efficient secretary Patricia.

This matter prompts more thoughts, after I was today invited to quite possibly the wrong meeting in Whitehall. The agenda was about boosting the economy, and the question was: what government data sets does business want?

I made a series of sincere contributions, but they seemed to strike a dissonant note. So I’ll share some reflections here (though the meeting was held under the Chatham House rule, so I shall be scrupulous to mention no names or affiliations).

The release of public data since the Power of Information work and Rufus Pollock’s seminal report on the economic benefits has been a rare old government IT success story and a credit to all involved: previous administration (Tom Watson again!), present administration, and officials throughout. But there’s still more to do.

1. What public data do we need released as open data?

Today’s meeting heard that data.gov.uk has “0.01%” of government’s data holdings. That’s no sort of real measure, because no-one knows how much data government holds. An audit is said to be under way but evidently not yet complete.

What do we still need? The lovely @hadleybeeman did a straw poll with the LinkedGov team and came up with:

Infrastructure stuff

timetables for transport and indeed anything

locations of stuff

restaurant health ratings

Keying data/indices

structure of local authorities/NHS trusts

weather data

more detailed mapping data (under open licence)

Companies House data

contracts/procurement/tendering data (with reliable feeds)

Ofcom data re radio frequencies.

Cheers Hadley! There’s plenty more data about stuff, facts, inanimate objects, statistics and numbers we still need to make open in a structured way. There’s loads more financial detail needed, far deeper than COINS goes (good start as that is).

2. What’s the problem with releasing anonymised data?

But the offer from government to big data companies (and this is why I say I was in the wrong meeting) is substantially for data about people. The message is: which data sets that government possesses would most improve your business; tell us what you want, because we’re in a mood to clear obstacles out of the way.

This included talk of anonymised or pseudonymised primary care records, individual-claimant-level benefit data (anonymised); individual pupil level data about attainment and attendance (ditto). This is what the big data industry wants, with the caveat: “if it’s pseudonymised, we can’t append our own granular data”.

This was not the place, we were told, to have a philosophical discussion about why releasing “pseudonymised” data is problematic. But… but… but… This is like being told by lemmings, as they rush towards the edge of a cliff, that this is no time to have a philosophical talk about the nature of gravity. “Deanonymisation is just theory,” we were told.

Certain Tweeps and others take issue with that. @craigperko, for example, who works in the power sector, sees deanonymisation every week: “If you have a limited number of participants, even small details can say who is who.” He refers to this paper and to Schneier’s writeup of it in Wired.

The classic paper on this is by Paul Ohm: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. @futureidentity, a respected analyst, points out: “the Netflix stuff didn’t look very theoretical to me…” @craigperko stops work to fire off a series of 140-character blasts:

The big issue we find is when a small number of people have a specific detail. Anonymizing fails if rare data remains id-able. Aggregate data only anonymises if the aggregation erases that rare/semi-unique data COMPLETELY. Pseudonymized data is not anonymized. Not even vaguely. It’s a matter of how many deanonymizing attacks there may be, and how important it is to prevent them. Pseudonyms are a weak defense which is suitable for protected internal databases with low leak probabilities….Sorry, you kind of got me ranting. If a psuedonym-based database is leaked, many of the pseudonyms will be easy to crack.

Thank you @craigperko. Her Majesty’s Government needed to hear that: rant on Sir! And he does:

In essence, pseudonimity is suitable for limited-release data in secure situations. (However, in my opinion, that level of protection is too weak for medical or legal situations.) (And certainly not suitable for public data sets.) Oh, uh, forgot: deanonymizing for us is mostly the issue of home owners being away or having particular habits. That’s our common experience with it. It’s on our minds a lot. Getting someone robbed (or screwed by their power company) because we’re showing their data is… not a great plan.

Good man.
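To make @craigperko’s point concrete, here is a minimal sketch of how rare combinations of details single people out even after names have been swapped for pseudonyms. Everything in it is made up for illustration: the records, the field names (`district`, `tariff`, `away_aug`) and the threshold `k` are assumptions, not anything from a real data set. The idea is simply to count how many records share each combination of "harmless" attributes and flag the ones that are unique or nearly so.

```python
from collections import Counter

# Hypothetical pseudonymised energy records: names replaced by opaque IDs,
# but quasi-identifiers (district, tariff, away-in-August flag) left in place.
RECORDS = [
    {"pid": "a91f", "district": "SW1", "tariff": "standard", "away_aug": False},
    {"pid": "07c2", "district": "SW1", "tariff": "standard", "away_aug": False},
    {"pid": "e5b0", "district": "SW1", "tariff": "standard", "away_aug": True},
    {"pid": "3d4e", "district": "N16", "tariff": "economy7", "away_aug": False},
    {"pid": "bb18", "district": "N16", "tariff": "economy7", "away_aug": False},
    {"pid": "c6a3", "district": "N16", "tariff": "standard", "away_aug": False},
]

QUASI_IDENTIFIERS = ("district", "tariff", "away_aug")


def group_sizes(records, keys):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[key] for key in keys) for r in records)


def singled_out(records, keys, k=2):
    """Return the pseudonyms whose combination of values is shared by fewer than k records."""
    sizes = group_sizes(records, keys)
    return [r["pid"] for r in records if sizes[tuple(r[key] for key in keys)] < k]


if __name__ == "__main__":
    # With all three attributes published, some pseudonyms stand alone.
    print(singled_out(RECORDS, QUASI_IDENTIFIERS))
    # Publishing only the district leaves every record hidden in a group.
    print(singled_out(RECORDS, ("district",)))
```

Anyone who already knows one neighbour in SW1 was away in August can pick out that record, pseudonym or no pseudonym. Coarsening the data until no group is smaller than `k` is the sort of aggregation that "erases that rare/semi-unique data"; whether any particular `k` is enough is a separate argument.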

So. Even if you’re trying to do something good, like achieve better health outcomes, create a longitudinal study of drug effects, reduce welfare mispayments and improve public services such as education, the present government would be ill-advised to rush into releasing individual-level data sets. The mild-mannered Chris Graham would have a thing or two to say. Caspar and all the best-informed folk in the digital rights world would be apoplectic.

Open data is an area of real progress. There’s potential for real progress on personal data too. But if the Coalition wants to bring a No2ID-like campaign down on its head with the added twist of privatising to unaccountable organisations what it was unacceptable for government to do, this wholesale release of anonymised data to big data corps is the fast track.

3. What data does Government need to release to make the new personal data ecosystem work?

So what is the correct way to release personal data, and what will it achieve?

Answer: we need to link the data policy discussed today with Cabinet Office’s ID Assurance, and with BIS’s Mydata (of which key players present today were not really aware). And we need to take account of the new emerging personal data ecosystem. It goes like this.

With ID Assurance individuals can log in securely to online services. Under BIS’s Mydata initiative, they’ll be able to download records from businesses. They need to be able to do that from public services also: health records, education records, job-seeking etc. They also need to be able to acquire tokens: proof from DVLA that they own the car or can drive, proof from IPS that they have a passport, or proof from DWP that they’re on welfare so they can easily get the best energy tariff.
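For the technically minded, here is a rough sketch of what such a token could look like: a small statement ("this person receives a benefit") that an issuer signs, so a third party can check it without ever seeing the underlying record. This is purely illustrative. The claim fields, the key and the function names are hypothetical, and nothing here describes how DVLA, IPS or DWP actually work or plan to work. It uses an HMAC from Python’s standard library for brevity; a real scheme would more likely use public-key signatures so that the checker never holds the signing secret.

```python
import hashlib
import hmac
import json

# Hypothetical issuer secret, for illustration only.
ISSUER_KEY = b"demo-key-not-for-real-use"


def issue_token(claim: dict, key: bytes = ISSUER_KEY) -> dict:
    """Serialise a claim deterministically and attach an HMAC-SHA256 tag."""
    payload = json.dumps(claim, sort_keys=True, separators=(",", ":"))
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_token(token: dict, key: bytes = ISSUER_KEY) -> bool:
    """Recompute the tag over the payload and compare in constant time."""
    expected = hmac.new(key, token["payload"].encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["tag"])


if __name__ == "__main__":
    # The token carries one yes/no fact, not the person's benefit record.
    token = issue_token({"issuer": "DWP", "claim": "receives_benefit", "value": True})
    print(verify_token(token))

    # Any change to the payload invalidates the tag.
    tampered = dict(token, payload=token["payload"].replace("true", "false"))
    print(verify_token(tampered))
```

The design point is that the individual hands over a single yes/no fact of their choosing, which is a very different thing from an organisation being handed the whole data set.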

That’s the data we need from government to make the new personal data ecosystem work. The market will be ready to equip the individual to deal with all this, whether it’s new big company services or personal data ecosystem startups like reputation.com, personal.com or the work we’re doing at Mydex.org. Then individuals can control whether they share data for longitudinal health studies, statistical or research purposes or whatever else. And this will save huge sums through improved data logistics, and also unleash a huge new – disruptive – torrent of economic activity. We’ll get there.