
How randomized response can help collect sensitive information responsibly

Giant datasets are revealing new patterns in cancer, income inequality and other important areas. However, the widespread availability of fast computers that can cross-reference public data is making it harder to collect sensitive information without inadvertently violating people's privacy. Modern randomization techniques can help preserve anonymity.

Anonymous Data

Let's pretend we're analysts at a small college, looking at anonymous survey data about plagiarism.

We've gotten responses from the entire student body, reporting whether or not they've ever plagiarized. To encourage honest responses, names were not collected.

The data here has been randomly generated.

On the survey students also report several bits of information about themselves, like their age...

...and what state they're from.

This additional information is critical to finding potential patterns in the data—why have so many first-years from New Hampshire plagiarized?

Revealed Information

But granular information comes with a cost.

One student has a unique age/home state combination. By searching another student database for a 19-year-old from Vermont, we can identify one of the plagiarists from supposedly anonymous survey data.

Increasing granularity exacerbates the problem. If the students reported slightly more about their ages by including what season they were born in, we'd be able to identify about a sixth of them.
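To make that concrete, here's a minimal sketch of the kind of check an analyst could run. The field names and data are made up (like the article's dataset, they're randomly generated); the point is just counting how many students are uniquely identified by their combination of attributes.

```python
import random
from collections import Counter

random.seed(0)

# Synthetic stand-in for the survey: each student has an age and a home state.
states = ["VT", "NH", "ME", "MA", "NY", "CT", "RI"]
students = [
    {"age": random.randint(17, 22), "state": random.choice(states)}
    for _ in range(300)
]

# Count how many students share each age/state combination.
combo_counts = Counter((s["age"], s["state"]) for s in students)

# A combination held by exactly one student is uniquely identifying:
# anyone who can cross-reference it against another database learns
# that student's answer.
unique = sum(1 for n in combo_counts.values() if n == 1)
print(f"{unique} of {len(students)} students have a unique age/state combination")
```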

This isn't just a hypothetical: A birthday / gender / zip code combination uniquely identifies 83% of the people in the United States.

With the spread of large datasets, it is increasingly difficult to release detailed information without inadvertently revealing someone's identity. A week of a person's location data could reveal a home and work address—possibly enough to find a name using public records.

Randomization

One solution is to randomize responses so each student has plausible deniability. This lets us buy privacy at the cost of some uncertainty in our estimation of plagiarism rates.

Step 1: Each student flips a coin and looks at it without showing anyone.

Step 2: Students who flip heads report plagiarism, even if they haven't plagiarized.

Students who flipped tails report the truth, secure in the knowledge that even if their response is linked back to their name, they can claim they flipped heads.

With a little bit of math, we can approximate the rate of plagiarism from these randomized responses. We'll skip the algebra, but doubling the reported non-plagiarism rate gives a good estimate of the actual non-plagiarism rate.
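Here's a minimal sketch of the whole protocol. The true plagiarism rate (30%) and the student count (300) are made-up assumptions for the simulation; the skipped algebra appears as a comment.

```python
import random

random.seed(1)

TRUE_PLAGIARISM_RATE = 0.30  # assumed for this sketch; unknown in practice
N_STUDENTS = 300             # assumed size of the student body

def randomized_response(has_plagiarized):
    """Step 1: flip a fair coin in private. Step 2: heads -> report
    plagiarism no matter what; tails -> report the truth."""
    heads = random.random() < 0.5
    return heads or has_plagiarized

truths = [random.random() < TRUE_PLAGIARISM_RATE for _ in range(N_STUDENTS)]
reports = [randomized_response(t) for t in truths]

# The skipped algebra: a "no" is only ever reported by a student who
# flipped tails AND hasn't plagiarized, so
#   P(report no) = 0.5 * P(no)   =>   P(no) = 2 * P(report no).
reported_no_rate = reports.count(False) / N_STUDENTS
estimated_rate = 1 - 2 * reported_no_rate
print(f"true rate: {TRUE_PLAGIARISM_RATE:.2f}   estimate: {estimated_rate:.2f}")
```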


How far off can we be?

If we simulate this coin flipping lots of times, we can see the distribution of errors.

The estimates are close most of the time, but errors can be quite large.
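Here's a minimal sketch of that experiment, rerunning the assumed 300-student survey 200 times and summarizing how far the estimates land from the assumed true rate:

```python
import random
import statistics

random.seed(2)

TRUE_RATE = 0.30   # assumed true plagiarism rate for the simulation
N_STUDENTS = 300

def simulate_survey():
    """Run one full randomized-response survey and return the estimate."""
    reported_no = 0
    for _ in range(N_STUDENTS):
        plagiarized = random.random() < TRUE_RATE
        heads = random.random() < 0.5
        reported_no += not (heads or plagiarized)  # "no" needs tails + no plagiarism
    return 1 - 2 * reported_no / N_STUDENTS

errors = [simulate_survey() - TRUE_RATE for _ in range(200)]
print(f"mean error:    {statistics.mean(errors):+.3f}")
print(f"error std dev: {statistics.stdev(errors):.3f}")
print(f"worst error:   {max(abs(e) for e in errors):.3f}")
```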


Reducing the random noise (by weighting the coin so fewer students flip heads) increases the accuracy of our estimate, but risks leaking information about students.

If the coin is heavily weighted towards tails, identified students can't credibly claim they reported plagiarizing because they flipped heads.
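To see the tradeoff numerically, this sketch (same assumed parameters as above) reruns the survey with increasingly tails-weighted coins. The estimate tightens as p(heads) falls, even as the deniability it provides erodes:

```python
import random
import statistics

random.seed(3)

TRUE_RATE = 0.30   # assumed
N_STUDENTS = 300   # assumed

def estimate(p_heads):
    """One survey where the coin lands heads with probability p_heads."""
    reported_no = 0
    for _ in range(N_STUDENTS):
        plagiarized = random.random() < TRUE_RATE
        heads = random.random() < p_heads
        reported_no += not (heads or plagiarized)
    # P(report no) = (1 - p_heads) * P(no), so divide instead of doubling.
    return 1 - (reported_no / N_STUDENTS) / (1 - p_heads)

for p_heads in [0.5, 0.25, 0.05]:
    errs = [abs(estimate(p_heads) - TRUE_RATE) for _ in range(200)]
    print(f"p(heads) = {p_heads:.2f}   mean |error| = {statistics.mean(errs):.3f}")
```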

One surprising way out of this accuracy-privacy tradeoff: carefully collect information from even more people.

If we got students from other schools to fill out this survey, we could accurately measure plagiarism while protecting everyone's privacy. With enough students, we could even start comparing plagiarism across different age groups again—safely this time.
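A rough sketch of why this works: with a fair coin throughout, the estimate's error shrinks roughly like 1/sqrt(n), so adding respondents buys accuracy back without weakening anyone's deniability. (The student counts here are made up.)

```python
import random
import statistics

random.seed(4)

TRUE_RATE = 0.30   # assumed

def estimate(n_students, p_heads=0.5):
    """Fair-coin randomized response over a survey of n_students."""
    reported_no = 0
    for _ in range(n_students):
        plagiarized = random.random() < TRUE_RATE
        heads = random.random() < p_heads
        reported_no += not (heads or plagiarized)
    return 1 - (reported_no / n_students) / (1 - p_heads)

# Pooling more schools shrinks the error while keeping the fair coin.
for n in [300, 3_000, 30_000]:
    errs = [abs(estimate(n) - TRUE_RATE) for _ in range(200)]
    print(f"n = {n:>6}   mean |error| = {statistics.mean(errs):.3f}")
```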


Conclusion

Aggregate statistics about private information are valuable, but can be risky to collect. We want researchers to be able to study things like the connection between demographics and health outcomes without revealing our entire medical history to our neighbors. The coin-flipping technique in this article, called randomized response, makes it possible to study private information safely.

You might wonder if coin flipping is the only way to do this. It's not: differential privacy can add targeted bits of random noise to a dataset and guarantee privacy. It's more flexible than randomized response, and the 2020 Census will use it to protect respondents' privacy. In addition to randomizing responses, differential privacy also limits the impact any one response can have on the released data.
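For a taste of what that looks like, here's a minimal sketch of the textbook Laplace mechanism applied to a counting query. This is the simplest differentially private release, not the Census's actual (far more elaborate) system, and the example numbers are made up.

```python
import random

random.seed(5)

def dp_count(true_count, epsilon):
    """Release a count plus Laplace(0, 1/epsilon) noise. A counting query
    has sensitivity 1 (adding or removing one person changes it by at
    most 1), which is what yields epsilon-differential privacy."""
    # The difference of two i.i.d. exponentials with rate epsilon is
    # Laplace-distributed with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# e.g. publishing "number of plagiarists" with a privacy budget of 0.5
print(dp_count(true_count=120, epsilon=0.5))
```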

Credits

Adam Pearce and Ellen Jiang // September 2020

Thanks to Carey Radebaugh, Fernanda Viégas, Emily Reif, Hal Abelson, Jess Holbrook, Kristen Olson, Mahima Pushkarna, Martin Wattenberg, Michael Terry, Miguel Guevara, Rebecca Salois, Yannick Assogba, Zan Armstrong and our other colleagues at Google for their help with this piece.
