Chapter 2. Privacy in the Classroom

This chapter describes a scenario in which a seemingly innocuous action has unintended consequences for the privacy of individuals in a classroom. Building on this example, the chapter highlights several key concepts that will be important throughout the remainder of the book.

By the end of this chapter, you should understand what makes two datasets adjacent, and what it means to define the sensitivity of a mechanism.

Privacy and the Mean

Imagine you are in a class with 10 students total. You’ve just taken an exam, and are eagerly awaiting the results. One day, the professor walks into the room and writes a number on the board: 85%. She announces that this was the average grade on the exam, and that she will be passing back papers shortly. Suddenly, you get a text from your good friend Ari: “Big news! Just got a job with the circus and had to drop the class. See you this summer!” You mention to the professor that your friend is no longer enrolled in the class. With this information, she opens her computer, types for a moment, then walks over to the board, erases the old average, and writes 84%.

This scenario is innocent enough, right? After all, she didn’t write your friend’s score on the board.

“Well, in a way, she just did”, says a voice behind you.

Who said that? You turn around to find a demon sitting in the back row (yes, this connects to privacy, bear with us for a second). You are curious about his claim that the professor just essentially wrote your friend’s score on the board.

“What do you mean?” you ask.

He begins to explain that you know how to calculate an average in general. Suppose each student's score is denoted by $x_i$, the set of all such scores is $X$, and the size of this set is $n$. Then

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

And you know there were 10 people in the class before your friend dropped it:

$$\bar{X}_{\text{before}} = \frac{1}{10}\sum_{i=1}^{10} x_i = 85$$

where each $x_i$ is an exam score. Further, you know there were 9 people after he left, and that the average score was 84:

$$\bar{X}_{\text{after}} = \frac{1}{9}\sum_{i=1}^{9} x_i = 84$$

Subtracting the second equation from the first:

$$\bar{X}_{\text{before}} - \bar{X}_{\text{after}} = \frac{1}{10}\sum_{i=1}^{10} x_i - \frac{1}{9}\sum_{i=1}^{9} x_i = 1$$

Let's call your friend's score $x_{10}$, so that $\sum_{i=1}^{10} x_i = \sum_{i=1}^{9} x_i + x_{10}$. Splitting the first sum accordingly and grouping the terms that multiply $\sum_{i=1}^{9} x_i$:

$$\left(\frac{1}{10} - \frac{1}{9}\right)\sum_{i=1}^{9} x_i + \frac{1}{10} x_{10} = 1$$

Simplifying the subtraction term to $-\frac{1}{90}$ and using the definition of $\bar{X}_{\text{after}}$ to write $\sum_{i=1}^{9} x_i = 9\,\bar{X}_{\text{after}}$:

$$-\frac{1}{90} \cdot 9 \cdot \bar{X}_{\text{after}} + \frac{1}{10} x_{10} = 1$$

and reducing the resulting fraction $\frac{9}{90}$ to lowest terms:

$$-\frac{1}{10}\bar{X}_{\text{after}} + \frac{1}{10} x_{10} = 1$$

Now you just need to isolate $x_{10}$, your friend's exam score:

$$x_{10} = 10\left(1 + \frac{1}{10}\bar{X}_{\text{after}}\right) = 10 + \bar{X}_{\text{after}}$$

You already know $\bar{X}_{\text{after}}$; it is simply the updated average written on the board:

$$x_{10} = 10 + 84 = 94$$

Hold on - that’s your friend’s exam grade. What just happened? This demon just determined your friend’s grade using several pieces of information: the mean before he left, the mean after he left, and the number of students in the class. As you’ve just learned, releasing statistics about a dataset can also release sensitive information about individuals in the data.
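
To see the demon's arithmetic as a computation, here is a short Python sketch of the same reconstruction (the variable names are ours, chosen for illustration): given the class size and the two announced averages, it recovers the dropped student's score.

# Reconstructing a dropped student's score from two announced averages.
n_before = 10       # number of students before the drop
mean_before = 85    # average announced before the drop
mean_after = 84     # average announced after the drop

total_before = n_before * mean_before         # 10 * 85 = 850
total_after = (n_before - 1) * mean_after     # 9 * 84 = 756

# The missing score is exactly the difference between the two totals.
missing_score = total_before - total_after
print(missing_score)                          # 94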

How Could This be Prevented?

What could the professor have done? She could have written “80-90” on the board and then not updated the value after your friend dropped. After all, 85 and 84 are both between 80 and 90. However, this doesn't give the students enough specificity: if someone scores an 80, they may think they are at the mean, when the true mean could be as high as 90. It also doesn't protect privacy in all circumstances. What if the average had dropped to 79, and she had updated the board to read “70-80”? Then you could still have used the definition of the mean to reconstruct an estimate of your friend's grade. All in all, this approach is suboptimal for both the students and their privacy.

Alternatively, what if she samples a value from a distribution and adds it to the mean before announcing it? Suppose she also tells you that the reported mean has noise added to it. Then you can reason about the probabilities of different scores without ever knowing the true score with certainty. The question becomes: how do you add noise in such a way that you strike a balance between protecting privacy and keeping the statistic as useful as possible? If you know that your friend is the only person who might leave, you can choose the noise so that it sufficiently protects his score.
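
As a rough sketch of this idea (and only a sketch: the Laplace distribution and the noise scale below are arbitrary choices made for illustration, not a recommendation), the professor could announce a perturbed mean instead of the exact one:

import numpy as np

def noisy_mean(scores, scale=1.0):
    # The noise scale here is a placeholder; choosing it correctly is
    # exactly the question the rest of this chapter begins to answer.
    true_mean = np.mean(scores)
    return true_mean + np.random.laplace(loc=0.0, scale=scale)

# Ten illustrative exam scores averaging 85.
scores = [94, 80, 80, 81, 82, 85, 95, 100, 83, 70]
print(noisy_mean(scores))   # close to 85, but not exactly 85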

Yet this raises an important point - you aren’t just protecting privacy for one scenario (your friend leaving) but for all such scenarios where one person joins or leaves the class. You can’t know who will leave the class or when, which means it is insufficient to look at only one outcome; you must instead prepare for all possible changes to the class roster.

What if Someone Else Had Dropped?

So back to this demon; let's call him Friedrich. He comes to you and says, “You will live this day out over and over, but each time, a different person will drop the class. When this happens, I will use the definition of the mean to uncover their score.” Now, if you have put sufficient privacy protections in place, you can shrug this demon off, knowing that he will either be frustrated or mistakenly believe he has computed the true scores. However, if you have only protected against the scenario where your friend leaves the class, everyone else is left vulnerable.

Now suddenly you are sitting in a room, exactly like the classroom in the previous example, but with one major difference - your friend is sitting next to you. His application to the circus was tragically rejected, but he is still happy to be with you in class. Right before class starts, you get a text from another classmate, Bobby, who is going on tour with his band and has to drop the class. At least he promises you VIP passes next time they come through town. You are relieved that Ari’s score is private, but in the back row, you see Friedrich the demon ready to calculate Bobby’s score, and understand that it will be leaked.

The professor comes in and writes the mean exam score on the board, and you point out that Bobby has dropped the class. Just like last time, she updates the score based on this new information. But this is a different score being removed, so the mean changes by a different amount; this time the average drops by almost 2 points instead of 1! Regardless of the change in the average, Friedrich the demon still sits in the back row calculating the score of whoever dropped the class. There is nothing special about this particular parallel universe; in fact, you need to prepare for all of them. In a classroom of 10 students, there are 10 possible “adjacent classrooms” where exactly one person drops the class. These are the types of events that can cause privacy leakage, as was shown earlier in the chapter. This concept of adjacency is crucial to providing rigorous privacy guarantees.

Adjacent Dataset

Two datasets X and X' are adjacent if they differ by a single individual. In the differential privacy literature, and in common use, the term neighboring is used interchangeably with adjacent.

“Differ” is intentionally vague: there are many senses in which two datasets may differ by a single individual. In the classroom example, we consider two datasets to be adjacent if they differ by the addition or removal of a single individual.

Let's recap what you know so far. Releasing a statistic can “leak” private information about individuals in a dataset. In particular, if you know that one data point has been removed, and you know how much a given statistic has changed, you can reconstruct the value of the missing data point. This means that you have to protect against every scenario of this kind. Further, you know that adding noise to a statistic can prevent an attacker from learning anything with perfect certainty through this kind of reconstruction. However, you don't yet know how much noise to add.

How much can the mean exam score change across these possible classrooms? To answer this, you need the concept of sensitivity. You know that the mean changes when a person leaves the class: in one case it dropped by 1 point, and in the other by almost 2 points. First, think about the largest possible change that can occur when any one person leaves this particular class. This is the local sensitivity.

Local Sensitivity of the Mean

For a dataset X, the local sensitivity of the mean is the smallest number S such that $|\bar{X} - \bar{X}'| \leq S$ for any adjacent dataset X'.
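
As a small illustration (a sketch using only the removal side of the adjacency notion from the classroom, with score lists and a function name of our own choosing), you can compute the local sensitivity of the mean by trying every single-student removal and keeping the largest change:

def local_sensitivity_of_mean(scores):
    # Largest change in the mean over all neighbors formed by removing one student.
    full_mean = sum(scores) / len(scores)
    changes = []
    for i in range(len(scores)):
        neighbor = scores[:i] + scores[i + 1:]          # drop student i
        neighbor_mean = sum(neighbor) / len(neighbor)
        changes.append(abs(full_mean - neighbor_mean))
    return max(changes)

classroom_a = [94, 80, 80, 81, 82, 85, 95, 100, 83, 70]   # mean 85
classroom_b = [100] * 9 + [0]                             # mean 90
print(local_sensitivity_of_mean(classroom_a))   # about 1.67
print(local_sensitivity_of_mean(classroom_b))   # 10.0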

Notice that the local sensitivity varies depending on the scores of other students in the class. Unfortunately, this means that the local sensitivity itself can reveal information about the dataset. Because of this, the local sensitivity is not actually strong enough to defend against Friedrich.

Obviously, we are more concerned with human actors, but he serves as a useful foil for what information someone might have about a dataset and what they could do with it.

We instead need the global sensitivity. You can think of the global sensitivity as the greatest possible local sensitivity over any initial dataset. Intuitively, it can be found by identifying a dataset that maximizes the local sensitivity. Based on this intuition, consider the most extreme case: your friend got a 0, and everyone else in the class got a 100. (How embarrassing!) In this situation, what does his leaving do to the mean?

$$\bar{X}_{\text{before}} = \frac{1}{10}\sum_{i=1}^{10} x_i = \frac{900}{10} = 90$$

and after:

$$\bar{X}_{\text{after}} = \frac{1}{9}\sum_{i=1}^{9} x_i = 100$$

Which means the change in the mean is

$$\left|\bar{X}_{\text{before}} - \bar{X}_{\text{after}}\right| = 10$$

No matter what scores you plug in for the class, you will find that the change in the mean never exceeds 10, provided each score lies between 0 and 100. For the mean over a class of this size, this number is special: it is the global sensitivity.

Global Sensitivity of the Mean

The global sensitivity of the mean is the smallest number S such that $|\bar{X} - \bar{X}'| \leq S$ for any pair of adjacent datasets X, X'.

The global sensitivity tells us how much the statistic can change, for any possible pair of initial datasets.

For the classroom mean, the global sensitivity is 10, since in the most distinguishable case, the exam score average changes by 10 points. Note that the justification given above is somewhat ad hoc: it relies on correctly identifying the choice of X and X' for which the difference is greatest. This kind of argument is prone to human error, which is why such analysis is better handled by a mathematical proof. You will see examples of such proofs as you progress through this book.
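
A proof is the right tool for establishing this bound. Still, as a quick sanity check (an illustration of the claim, not a substitute for the proof), you can sample many random classrooms with scores between 0 and 100 and confirm that no single removal ever shifts the mean by more than 10:

import random

def max_change_from_one_removal(scores):
    # Largest shift in the mean caused by removing any single score.
    full_mean = sum(scores) / len(scores)
    return max(
        abs(full_mean - (sum(scores) - s) / (len(scores) - 1))
        for s in scores
    )

random.seed(0)
worst = 0.0
for _ in range(100_000):
    classroom = [random.uniform(0, 100) for _ in range(10)]
    worst = max(worst, max_change_from_one_removal(classroom))

print(worst)                                           # never exceeds 10
print(max_change_from_one_removal([100] * 9 + [0]))    # exactly 10.0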

Now that you’ve seen this scenario from the perspective of the students, let’s look at the professor’s role as the only person with access to the full dataset.

Since the students don’t have access to the full dataset, the change in a statistic value is the only signal they can use to learn anything about the underlying data. On the other hand, those with access to the complete dataset don’t need to infer anything, because they can look directly at each individual row.

What is a Trusted Curator?

Let’s examine this example from the professor’s point of view. In this scenario, the professor has total access to all the student data, while the students are only given access to aggregate statistics. This means that the professor is a trusted curator. This is a key term throughout the remainder of the book!

Trusted Curator

An individual or organization with complete access to a sensitive dataset.

The students are the individuals in the dataset, and the school staff and students represent the public that will receive the data analysis.

Suppose the professor has the student scores in a database:

exam_scores
-----------
94
80
80
81
82
85
95
100
83
70

The professor then queries the student database using a mean function to get the mean test score for the class:

exam_scores.mean()
>>> 85

As before, student 0 joins the circus and drops the class. The professor computes a new mean and updates the score board:

del exam_scores[0]
exam_scores.mean()
>>> 84

You now know that anyone who knows how many students were in the class, together with the averages before and after the drop, could easily calculate the departed student's grade: 94.
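
For completeness, here is a self-contained version of the professor's two queries. The listing above does not say which library backs exam_scores; a pandas Series is assumed here so that the .mean() call and the deletion mirror the snippets above.

import pandas as pd

# The full class roster of exam scores, indexed by student number.
exam_scores = pd.Series([94, 80, 80, 81, 82, 85, 95, 100, 83, 70])

print(exam_scores.mean())   # 85.0

# Student 0 joins the circus and drops the class; remove their row.
del exam_scores[0]
print(exam_scores.mean())   # 84.0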

Conclusion

As you've just seen, disclosing statistics about a dataset can leak information about individuals in that dataset. The professor certainly wasn't trying to disclose sensitive information to her students, but the combination of a statistic and knowledge of how the dataset had changed was enough to single out an individual's data. This will be a common theme throughout the book: how can you responsibly release statistics about sensitive data without disclosing data about individuals in the dataset?

You should now understand data privacy scenarios as situations where a trusted curator releases information about a sensitive dataset in such a way that the public cannot learn anything about any individual in the data. You should also have an intuitive understanding of dataset adjacency and sensitivity. In particular, access to statistics computed on two adjacent datasets can be enough to learn sensitive data about an individual. The role of sensitivity will become more apparent in the next chapter, which ties these three key concepts together and shows how to effectively privatize statistics.

Exercises

  1. Which of the following database pairs are adjacent?

    1. X = [1, 2, 3], Y = [1, 2, 3, 4]

    2. X = [A, B, C], Y = [C, A, D, A]

    3. X = [A, B], Y = [B, A, C]

    4. X = [D, E, F], Y = [E, F, D, G]

  2. Write a program that takes as input two databases and returns whether they are adjacent.

  3. Consider a COUNT function that takes as input a database and returns the number of rows in the database.

    1. What is the sensitivity of COUNT?

    2. Do the same, but for the median.

  4. Imagine the exam from the previous example had an extra credit question, and the maximum score is actually 110.

    1. What is the sensitivity of the mean exam score?

    2. Now do the same for several max scores: {120, 130, 140, …}.

    3. What is the relationship between sensitivity and the max exam score?

    4. If there is no limit to the exam score, is the sensitivity still well-defined?
