Introduction to Differential Privacy

Ka Mo Lau
April 15, 2019

Privacy has become a hot topic in ad tech. From GDPR to ITP 2.0, marketers are increasingly conscious of the importance of privacy, which they now have to actively balance against the need for transparency and accountability. Recently, industry leaders have started talking about differential privacy and how this technology could help balance privacy with transparency. Digiday provides a good introduction here.

Before diving into differential privacy, it’s helpful to keep in mind how marketers actually consume data. It may seem counterintuitive, but a savvy data-driven marketer doesn’t actually care about any specific individual in their campaigns. Rather, the marketer is optimizing for the behavior (and results) of the entire group or segment being targeted. (If you are a marketer, ask yourself this question: in your last analysis, did you care that User #123 converted, or did you care how many users in your target population spent money?) This insight helps us realize that a system that hides the behavior of any given individual but still reports accurate behavior for the group as a whole can strike the balance between user privacy and transparency. Does this solution exist? It can with differential privacy.

Differential privacy is a set of statistical techniques that introduce noise into a data set in order to protect user anonymity without materially changing the conclusions you draw from the data as a whole. Does it sound too good to be true?

Here’s an oversimplified example of differential privacy principles at work. If you asked a group of people a sensitive question such as “Have you cheated on your spouse?”, few of them would be likely to tell you the truth. However, imagine that before answering, each person is told to privately flip a coin. If the coin lands heads, they tell the truth, yes or no. If the coin lands tails, they privately flip the coin again. If the second flip is heads, they say “yes” no matter what the truth is. If it is tails, they say “no,” again regardless of the truth.

As a result of this basic obfuscation, an outsider who looks at the data won’t know whether an individual participant’s recorded answer is the truth, because it could just as easily have been an arbitrary answer. That said, thanks to the random coin flips, there is a known statistical distribution of truthful answers (50% of answers) versus arbitrary answers (25% no, 25% yes). For a large enough sample, you can then estimate the true rate of spousal cheating without exposing any individual’s real answer!
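For readers who want to see the arithmetic, here is a minimal Python sketch of the coin-flip survey. Since half of the recorded answers are truthful and a quarter are a forced “yes,” the expected share of “yes” answers is 0.5 × (true rate) + 0.25, so the true rate can be estimated as 2 × (observed “yes” share − 0.25). The function names, the 20% true rate and the sample size are all made up for illustration; this is a sketch of the idea, not a production implementation.

```python
import random
from typing import Sequence


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Return one participant's recorded answer under the two-coin protocol."""
    if rng.random() < 0.5:
        # First flip is heads: answer truthfully.
        return truth
    # First flip is tails: the second flip decides the answer.
    # Heads -> "yes", tails -> "no", regardless of the truth.
    return rng.random() < 0.5


def estimate_true_rate(answers: Sequence[bool]) -> float:
    """Invert expected_yes = 0.5 * p + 0.25 to estimate the true rate p."""
    observed_yes = sum(answers) / len(answers)
    return 2 * (observed_yes - 0.25)


def simulate(true_rate: float, n: int, seed: int = 0) -> float:
    """Survey n hypothetical participants and estimate the true 'yes' rate."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n)]
    answers = [randomized_response(t, rng) for t in truths]
    return estimate_true_rate(answers)


if __name__ == "__main__":
    # Illustrative assumption: 20% of 100,000 participants would truthfully say "yes".
    print(simulate(true_rate=0.2, n=100_000))
```

With a sample this large, the printed estimate should land close to 0.2, even though no single recorded answer can be trusted to be true. With only a handful of participants the estimate gets noisy (it can even fall below zero), which is why the technique depends on a large population.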

This example, of course, oversimplifies the actual mechanics of differential privacy. In reality, more complex techniques can be applied to each data set to provide stronger privacy protection and greater transparency. But that discussion is better left for the Data Privacy 201 course…