
How randomized response can help collect sensitive information responsibly

Giant datasets are revealing new patterns in cancer, income inequality and other important areas. However, the widespread availability of fast computers that can cross-reference public data is making it harder to collect sensitive information without inadvertently violating people's privacy. Modern randomization techniques can help preserve anonymity.

Anonymous Data

Let's pretend we're analysts at a small college, looking at anonymous survey data about plagiarism.

We've gotten responses from the entire student body, reporting whether or not they've ever plagiarized. To encourage honest responses, names were not collected.

The data here has been randomly generated.

On the survey students also report several bits of information about themselves, like their age...

...and what state they're from.

This additional information is critical to finding potential patterns in the data—why have so many first-years from New Hampshire plagiarized?

Revealed Information

But granular information comes with a cost.

One student has a unique age/home state combination. By searching another student database for a 19-year-old from Vermont, we can identify one of the plagiarists from supposedly anonymous survey data.

Increasing granularity exacerbates the problem. If the students reported slightly more about their ages by including what season they were born in, we'd be able to identify about a sixth of them.
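To make that concrete, here's a minimal sketch of the kind of check an analyst could run. The field names and data are made up (like the article's dataset, they're randomly generated); the point is just counting how many students are uniquely identified by their combination of attributes.

```python
import random
from collections import Counter

random.seed(0)

# Synthetic stand-in for the survey: each student has an age and a home state.
states = ["VT", "NH", "ME", "MA", "NY", "CT", "RI"]
students = [
    {"age": random.randint(17, 22), "state": random.choice(states)}
    for _ in range(300)
]

# Count how many students share each age/state combination.
combo_counts = Counter((s["age"], s["state"]) for s in students)

# A combination held by exactly one student is uniquely identifying:
# anyone who can cross-reference it against another database learns
# that student's answer.
unique = sum(1 for n in combo_counts.values() if n == 1)
print(f"{unique} of {len(students)} students have a unique age/state combination")
```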

This isn't just a hypothetical: A birthday / gender / zip code combination uniquely identifies 83% of the people in the United States.

With the spread of large datasets, it is increasingly difficult to release detailed information without inadvertently revealing someone's identity. A week of a person's location data could reveal a home and work address—possibly enough to find a name using public records.

Randomization

One solution is to randomize responses so each student has plausible deniability. This lets us buy privacy at the cost of some uncertainty in our estimation of plagiarism rates.

Step 1: Each student flips a coin and looks at it without showing anyone.

Step 2: Students who flip heads report plagiarism, even if they haven't plagiarized.

Students who flipped tails report the truth, secure in the knowledge that even if their response is linked back to their name, they can claim they flipped heads.

With a little bit of math, we can approximate the rate of plagiarism from these randomized responses. We'll skip the algebra, but doubling the reported non-plagiarism rate gives a good estimate of the actual non-plagiarism rate.
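Here's a minimal sketch of the whole protocol. The true plagiarism rate (30%) and the student count (300) are made-up assumptions for the simulation; the skipped algebra appears as a comment.

```python
import random

random.seed(1)

TRUE_PLAGIARISM_RATE = 0.30  # assumed for this sketch; unknown in practice
N_STUDENTS = 300             # assumed size of the student body

def randomized_response(has_plagiarized):
    """Step 1: flip a fair coin in private. Step 2: heads -> report
    plagiarism no matter what; tails -> report the truth."""
    heads = random.random() < 0.5
    return heads or has_plagiarized

truths = [random.random() < TRUE_PLAGIARISM_RATE for _ in range(N_STUDENTS)]
reports = [randomized_response(t) for t in truths]

# The skipped algebra: a "no" is only ever reported by a student who
# flipped tails AND hasn't plagiarized, so
#   P(report no) = 0.5 * P(no)   =>   P(no) = 2 * P(report no).
reported_no_rate = reports.count(False) / N_STUDENTS
estimated_rate = 1 - 2 * reported_no_rate
print(f"true rate: {TRUE_PLAGIARISM_RATE:.2f}   estimate: {estimated_rate:.2f}")
```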


How far off can we be?

If we simulate this coin flipping lots of times, we can see the distribution of errors.

The estimates are close most of the time, but errors can be quite large.
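Here's a minimal sketch of that experiment, rerunning the assumed 300-student survey 200 times and summarizing how far the estimates land from the assumed true rate:

```python
import random
import statistics

random.seed(2)

TRUE_RATE = 0.30   # assumed true plagiarism rate for the simulation
N_STUDENTS = 300

def simulate_survey():
    """Run one full randomized-response survey and return the estimate."""
    reported_no = 0
    for _ in range(N_STUDENTS):
        plagiarized = random.random() < TRUE_RATE
        heads = random.random() < 0.5
        reported_no += not (heads or plagiarized)  # "no" needs tails + no plagiarism
    return 1 - 2 * reported_no / N_STUDENTS

errors = [simulate_survey() - TRUE_RATE for _ in range(200)]
print(f"mean error:    {statistics.mean(errors):+.3f}")
print(f"error std dev: {statistics.stdev(errors):.3f}")
print(f"worst error:   {max(abs(e) for e in errors):.3f}")
```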


Reducing the random noise (by weighting the coin so fewer students flip heads) increases the accuracy of our estimate, but risks leaking information about students.

If the coin is heavily weighted towards tails, identified students can't credibly claim they reported plagiarizing because they flipped heads.
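To see the tradeoff numerically, this sketch (same assumed parameters as above) reruns the survey with increasingly tails-weighted coins. The estimate tightens as p(heads) falls, even as the deniability it provides erodes:

```python
import random
import statistics

random.seed(3)

TRUE_RATE = 0.30   # assumed
N_STUDENTS = 300   # assumed

def estimate(p_heads):
    """One survey where the coin lands heads with probability p_heads."""
    reported_no = 0
    for _ in range(N_STUDENTS):
        plagiarized = random.random() < TRUE_RATE
        heads = random.random() < p_heads
        reported_no += not (heads or plagiarized)
    # P(report no) = (1 - p_heads) * P(no), so divide instead of doubling.
    return 1 - (reported_no / N_STUDENTS) / (1 - p_heads)

for p_heads in [0.5, 0.25, 0.05]:
    errs = [abs(estimate(p_heads) - TRUE_RATE) for _ in range(200)]
    print(f"p(heads) = {p_heads:.2f}   mean |error| = {statistics.mean(errs):.3f}")
```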

One surprising way out of this accuracy-privacy tradeoff: carefully collect information from even more people.

If we got students from other schools to fill out this survey, we could accurately measure plagiarism while protecting everyone's privacy. With enough students, we could even start comparing plagiarism across different age groups again—safely this time.
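A rough sketch of why this works: with a fair coin throughout, the estimate's error shrinks roughly like 1/sqrt(n), so adding respondents buys accuracy back without weakening anyone's deniability. (The student counts here are made up.)

```python
import random
import statistics

random.seed(4)

TRUE_RATE = 0.30   # assumed

def estimate(n_students, p_heads=0.5):
    """Fair-coin randomized response over a survey of n_students."""
    reported_no = 0
    for _ in range(n_students):
        plagiarized = random.random() < TRUE_RATE
        heads = random.random() < p_heads
        reported_no += not (heads or plagiarized)
    return 1 - (reported_no / n_students) / (1 - p_heads)

# Pooling more schools shrinks the error while keeping the fair coin.
for n in [300, 3_000, 30_000]:
    errs = [abs(estimate(n) - TRUE_RATE) for _ in range(200)]
    print(f"n = {n:>6}   mean |error| = {statistics.mean(errs):.3f}")
```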


Conclusion

Aggregate statistics about private information are valuable, but can be risky to collect. We want researchers to be able to study things like the connection between demographics and health outcomes without revealing our entire medical history to our neighbors. The coin-flipping technique in this article, called randomized response, makes it possible to study private information safely.

You might wonder if coin flipping is the only way to do this. It's not: differential privacy can add targeted bits of random noise to a dataset and guarantee privacy. It's more flexible than randomized response, and the 2020 Census will use it to protect respondents' privacy. In addition to randomizing responses, differential privacy also limits the impact any one response can have on the released data.
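For a taste of what that looks like, here's a minimal sketch of the textbook Laplace mechanism applied to a counting query. This is the simplest differentially private release, not the Census's actual (far more elaborate) system, and the example numbers are made up.

```python
import random

random.seed(5)

def dp_count(true_count, epsilon):
    """Release a count plus Laplace(0, 1/epsilon) noise. A counting query
    has sensitivity 1 (adding or removing one person changes it by at
    most 1), which is what yields epsilon-differential privacy."""
    # The difference of two i.i.d. exponentials with rate epsilon is
    # Laplace-distributed with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# e.g. publishing "number of plagiarists" with a privacy budget of 0.5
print(dp_count(true_count=120, epsilon=0.5))
```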

Credits

Adam Pearce and Ellen Jiang // September 2020

Thanks to Carey Radebaugh, Fernanda Viégas, Emily Reif, Hal Abelson, Jess Holbrook, Kristen Olson, Mahima Pushkarna, Martin Wattenberg, Michael Terry, Miguel Guevara, Rebecca Salois, Yannick Assogba, Zan Armstrong and our other colleagues at Google for their help with this piece.
