Chapter 4. Dataset Transformations, Dataset Distance, and Stability

Data transformations are at the heart of data analysis tasks. In the context of transformations, it will aid your understanding to consider a dataset to be any form of data that has not been made private. That is, computed statistics (such as sum or mean estimates) are also datasets. Transformations encompass any function from a dataset to a dataset.

Some commonly used data transformations compute statistics - when you have a dataset and want to understand a property of it. A statistic aggregates many records into a summary, like a sum, mean, standard deviation, or even regression parameters.

Some transformations modify each record in a dataset one at a time, thereby preserving the dimensions of the data - for example, applying a mathematical function to each element in a column, or assigning bin numbers to each record. Still other transformations compute a dataset join, or load data from a CSV file or database.

In order to establish privacy guarantees later, it is first necessary to consider how an individual’s influence on the data may change once a transformation is applied. This change in influence is characterized by a function’s stability.

This chapter discusses the stability of a number of dataset transformations that are typically used in differentially private releases.

In DP, it is valuable to look at statistics as special cases of more abstract data transformations. A broader perspective means that you can use more general tools to understand the sensitivity of a transformation and ensure that your data analysis process has the expected privacy guarantees.

One of the most common dataset transformations is the sum. In many situations, each element in the data you are privatizing has no natural bounds - values could approach positive or negative infinity without limitation. Naturally, unbounded data implies that many queries, like the sum, have unbounded sensitivity. When the sensitivity is unbounded, you can’t construct a DP mechanism that will privatize your statistic, since a change in a single value could cause an arbitrarily large change in the statistic, therefore leaking information about the dataset. This kind of scenario highlights the importance of having a well-defined sensitivity when performing a private data analysis.

Just like a statistic has an associated sensitivity, a more general data transformation has a notion of stability. Stability measures how much a transformation changes, given a change in the input data. By the end of this chapter, you should understand the notion of stability with respect to dataset transformations. Using this knowledge, you should be able to identify stable dataset transformations that complement differentially private mechanisms. In particular, you will be able to determine when it is appropriate to use clipping/clamping when performing a DP analysis.

Stability of Data Transformations

A transformation is a computation that maps a dataset to another dataset, or maps a dataset to an aggregate. Consider a transformation that squares each element in a dataset: f(x) = x². Then given the dataset X = {1, 2, 3}, the transformed data is:

f(X) = {1, 4, 9}

This transformation operates on each row independently and returns a transformed dataset with the same number of rows; it is called a row-by-row transformation. Row-by-row transformations are rather common; you will find them useful as you learn more about DP.

The mean is another example of a transformation - one that maps a dataset to a value computed from an aggregate of the data. Taking the same dataset, we can compute:

μ(X) = (1 + 2 + 3) / 3 = 2

In this case, the result of the transformation is a single value, not a dataset. This single value takes the data in aggregate and returns a value that describes a property of the dataset. We’ll generally refer to such transformations as aggregators.
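These two kinds of transformations can be sketched in a few lines of Python (the NumPy usage here is purely illustrative):

```python
import numpy as np

X = np.array([1., 2., 3.])

# row-by-row transformation: squares each element independently,
# so the output has the same number of rows as the input
f_X = X ** 2       # [1., 4., 9.]

# aggregator: collapses the entire dataset into a single value
mu_X = X.mean()    # 2.0
```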

In this chapter, we are interested in transformations with bounded stability. Stability is a property of a transformation that guarantees that similar inputs map to similar outputs. If a transformation maps similar inputs to similar outputs, then we say it is stable.

You might be wondering what we mean by similar here - great question. Mathematically speaking, you will always want to be precise with your notions of similarity and closeness. Much like you say a release is "ϵ-private,” you will also say that a transformation T is "c-stable.” Relative to the absolute distance, this means that whenever |u − v| ≤ d_in, then it is also true that |T(u) − T(v)| ≤ c · d_in. In other words, inputs that are within d_in distance from each other will result in outputs that are no further apart than c · d_in from each other.
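As an illustrative sketch (this function is not from the text), consider T(x) = 2x + 3, which is 2-stable with respect to the absolute distance: shifting by a constant leaves distances between outputs unchanged, while scaling by 2 doubles them.

```python
def T(x):
    # shifting by 3 does not affect distances between outputs;
    # scaling by 2 doubles them, so T is 2-stable
    return 2 * x + 3

u, v = 1.5, 4.0
d_in = abs(u - v)          # distance between inputs: 2.5
d_out = abs(T(u) - T(v))   # distance between outputs: 5.0 = 2 * d_in
```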

Stability may already feel familiar to you, because stability is a generalization of sensitivity. Recall the definition of sensitivity from the previous chapter:

Δf = max_{A∼B} d_Abs(f(A), f(B)) = max_{A∼B} |f(A) − f(B)|

Recall that A∼B requires that A and B are similar; the distance between them is bounded. The sensitivity is a form of stability - it guarantees that, when any two input datasets A and B are similar, the outputs are similar.

Sensitivity of the Sum Aggregator

We can use the definition of sensitivity to compute the sensitivity of a sum aggregator function, f. This is not as daunting as it may sound! Define the function as a sum of the inputs: f(X) = Σᵢ Xᵢ, where Xᵢ is the i-th element of X, and substitute this form of f(X) into the definition of sensitivity:

Δf = max_{A∼B} d_Abs(f(A), f(B)) = max_{A∼B} |Σᵢ Aᵢ − Σᵢ Bᵢ|

If you know that A and B differ by at most one record, then Σᵢ Bᵢ = Σᵢ Aᵢ ± B_n, where B_n is the value of the record by which the two datasets differ. With this observation, the formula collapses:

max_{A∼B} |Σᵢ Aᵢ − Σᵢ Bᵢ| = max_{A∼B} |Σᵢ Aᵢ − (Σᵢ Aᵢ ± B_n)| = max_{A∼B} |B_n|

This quantity is almost useful, but the magnitude of B_n is unbounded, so the sensitivity is infinite. Therefore, the sum transformation is not stable. At least, not in its current state. All you need is an assumption that B_n is bounded.

If you know that B_n is no smaller than L and no greater than U, then max_{A∼B} |B_n| ≤ max(|L|, U). Therefore, you can say that the sum transformation on a dataset with bounded elements is "max(|L|, U)-stable.”
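A quick empirical sketch (an illustration, not a proof) shows the bound in action: after clamping every element to [L, U], removing one record changes the sum by at most max(|L|, U).

```python
# clamping bounds: every element will be forced into [L, U]
L, U = -2.0, 10.0

A = [-5.0, 3.0, 12.0, 7.0]
A_clamped = [min(max(x, L), U) for x in A]   # [-2.0, 3.0, 10.0, 7.0]

# B is a neighboring dataset: A with its last record removed
B_clamped = A_clamped[:-1]

change = abs(sum(A_clamped) - sum(B_clamped))  # 7.0, at most max(|L|, U) = 10.0
```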

Sensitivity and Clipping

Let’s think back to the classroom example - every score in the class was between 0 and 100. Why is that? Simply by circumstance: there was no extra credit on the exam, so no score could exceed 100, and there was no (negative) penalty for guessing incorrectly, so the minimum score was 0.

What happens in a scenario where we don’t have such a natural domain for the data? Let’s consider financial situations where the maximum is unknown or undefined. What is the maximum price of a house? The maximum income possible? Such scenarios do not give us natural bounds on the data.

Now remember how we calculated sensitivity as the maximum change in the final statistic that is possible from adding or removing a single data point. We knew that the lowest possible score was 0 and the highest single score was 100, which gave the mean over the class of 10 a sensitivity of 10. What happens if we don’t have a maximum score? We can no longer compute the sensitivity, since it becomes (0 + ∞) / 10, which is undefined. In situations like this, we need to “clip” the data so that we can have a well-defined sensitivity.

Clipping

Clipping is a transformation that replaces each value outside of a set of pre-defined bounds with the nearest value within the bounds. NumPy’s np.clip does this for arrays. The OpenDP Library similarly has a constructor for clamping data:

import numpy as np
import opendp.prelude as dp

dp.enable_features("contrib")

input_space = dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance()

# clamp each element of a float vector into [0, 10]
clip_trans = input_space >> dp.t.then_clamp((0., 10.))

mock_dataset = np.random.normal(loc=0., scale=10., size=100)
transformed_dataset = clip_trans(list(mock_dataset))

Now you have the tools to make your sensitivity value well-defined, even in cases where the data is unbounded. By using a clipping function like the ones above, you can ensure that your dataset lies within chosen lower and upper bounds, resulting in a sensitivity value that is meaningful for constructing DP statistics.

Example: DP Analysis of Income Data

Suppose you have a dataset of individuals and their annual incomes. The least amount they can be paid per year is $0, but it is unclear what the maximum is. For this reason, your data is on the domain [0, ∞).

In this case, you would need to use outside domain knowledge to estimate a good maximum given the dataset. If you set it very high, you are almost guaranteed to not cut off any of the values, but your sensitivity will suffer. Conversely, if the maximum is set too low, then you will clip too many values and lose utility in your final statistic.

Choosing an optimal clipping value

There are several approaches you can take when selecting clipping bounds. Consider the DP mean process - for data without any natural bounds, the sensitivity is undefined. In order to guarantee DP in a mean release, you need to somehow provide bounds to the data.

There are two ways to deal with extreme values: trimming, and clipping. Trimming means removing any values outside the range [min, max], while clipping means replacing values outside the range with min or max.

Percentiles are a natural starting point. If you define a percentile value p, you can trim (remove) the p/2% smallest and p/2% largest values. Trimming a total of 10% of the data (below the 5th and above the 95th percentile) is a relatively common starting point.

This has an effect on the final statistic. As an example, consider the set of numbers [1,2,8,9,11,22,25,27,35,47,58,68,68,74,75,89,90,94,98,100] with mean 50.05. This dataset has 20 values, meaning that each value represents 5% of the data. If you trim this dataset based on the 5th and 95th percentiles, you end up removing the smallest and largest value: [2,8,9,11,22,25,27,35,47,58,68,68,74,75,89,90,94,98]. This has mean 50.00.
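You can reproduce this trimming step with NumPy (the percentile-based filtering below is one reasonable way to carry out the procedure):

```python
import numpy as np

data = np.array([1, 2, 8, 9, 11, 22, 25, 27, 35, 47, 58,
                 68, 68, 74, 75, 89, 90, 94, 98, 100])
print(data.mean())  # 50.05

# trim: drop values below the 5th or above the 95th percentile
lower, upper = np.percentile(data, [5, 95])
trimmed = data[(data >= lower) & (data <= upper)]
print(trimmed.mean())  # 50.0
```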

Python Functions for Bound Estimation

Thankfully, you don’t have to go through all of these steps every time you want to establish optimal clipping parameters.

import numpy as np
import opendp.prelude as dp

dp.enable_features("contrib")

# mock dataset standing in for the incomes (any list of floats works here)
data = np.random.uniform(low=0., high=100., size=1_000).tolist()

def release_dp_mean(bounds, contributions, epsilon):
    context = dp.Context.compositor(
        data=data,
        privacy_unit=dp.unit_of(contributions=contributions),
        privacy_loss=dp.loss_of(epsilon=epsilon),
        split_evenly_over=2,
    )

    numerator = context.query().clamp(bounds).sum().laplace().release()
    denominator = context.query().count().laplace().release()
    return numerator / denominator

true_mean = sum(data) / len(data)
print(f"True mean: {true_mean}\n")

print("Mean\tUtility")
for upper in [50., 75., 90., 100.]:
    clipped_dp_mean = release_dp_mean((0., upper), contributions=1, epsilon=1.)
    utility = abs(clipped_dp_mean - true_mean)
    print(f"{clipped_dp_mean}\t{utility}")

Categorical Data Transformations

Categorical data is, as its name suggests, data that can be divided into different categories.

Table 4-1. Vehicle database

License Plate    Vehicle Type
ABC123           Car
JCO549           Truck
OFJ295           Car
EMF494           Motorcycle
QMC583           Truck

In this case, you can aggregate the vehicles by the column “Vehicle Type” and generate a count: 2 cars, 2 trucks, 1 motorcycle. The possible categories (car, truck, motorcycle) are also known as the keys.
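Aggregating by key can be sketched with Python’s standard library (the vehicle data here mirrors Table 4-1):

```python
from collections import Counter

# the Vehicle Type column from Table 4-1
vehicle_types = ["Car", "Truck", "Car", "Motorcycle", "Truck"]

counts = Counter(vehicle_types)   # 2 cars, 2 trucks, 1 motorcycle
keys = set(counts)                # the key set: {'Car', 'Truck', 'Motorcycle'}
```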

When dealing with categorical data, the set of keys might be entirely known or only partially known. In some cases, the set of keys is clear and well known. For example, consider a census survey. If there is a question related to citizenship status, there are only two possible keys: citizen or non-citizen. If a data analyst decides to create a histogram of citizenship status, the data analyst will query the number of individuals whose status is “citizen” and the number of individuals whose status is “non-citizen.”

The data analyst does not know how many individuals are in each category, so they will query the count of citizens and the count of non-citizen individuals. In this query, an individual belongs to either the category “citizen” or “non-citizen”, so all neighboring databases will have the same set of categories.

Next, consider an illustrative example where a data analyst queries citizenship status statistics from a database:

Table 4-2. Survey of individual characteristics of employees in a company

First Name    Last Name    Budget           Citizenship Status
Caryl         Baptie       $898,031.59      Citizen
Moyra         Leverson     $847,791.81      non-Citizen
Elinore       Gillbard     $729,605.84      Citizen
Farleigh      Crampton     $9,742,235.31    Citizen
Sebastien     Marples      $3,677,589.94    non-Citizen
Baxy          Doohan       $1,044,044.63    Citizen
Henrie        Whawell      $4,670,377.71    Citizen
Dorothy       Drummer      $2,641,401.28    Citizen
Meaghan       Clinnick     $989,042.50      non-Citizen
Tildy         Gutans       $5,986,640.99    Citizen

The data analyst queries the count of individuals with Citizenship Status == Citizen and the count of individuals where Citizenship Status == non-Citizen. These two queries represent the citizenship status histogram. In this case, adding or subtracting individuals from this database will not change the key set in the histogram, since the only two options are “citizen” and “non-citizen.”

If the analyst builds a histogram of citizenship counts from this table, it will appear like so:

Figure 4-1. Histogram of citizenship status (non-private)

This dataset contains an example of a categorical variable with a known key set. Because the key set is known, the set of keys is consistent over all possible neighboring databases. There is no third possible key that can be introduced here - the category is binary. If the data analyst wants to release a differentially-private histogram of citizenship statuses of individuals in a database, they can use a differentially private count query on the database. Because the keys are known, the DP count query will provide a differentially private data release.

Recall from chapter 3 that DP has the parallel composition property: this means that performing an ϵ-DP operation on different (non-overlapping) parts of a dataset is still ϵ-DP. Further, the total ϵ does not increase with the number of subsets. Using this property, we can apply a DP count function to each bin in the histogram, adding noise separately to each bin.
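A minimal sketch of this per-bin noise addition, using NumPy’s Laplace sampler purely for illustration (a production release should use calibrated mechanisms such as those in the OpenDP Library). Each count has sensitivity 1, so the Laplace scale is 1/ϵ:

```python
import numpy as np

epsilon = 1.0
true_counts = {"citizen": 7, "non-citizen": 3}

# Each bin is a disjoint subset of the data, so by parallel composition,
# releasing an epsilon-DP count per bin is epsilon-DP overall.
rng = np.random.default_rng(0)
dp_counts = {
    key: count + rng.laplace(loc=0.0, scale=1.0 / epsilon)
    for key, count in true_counts.items()
}
```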

An example of a DP count with noise addition could look like this:

Figure 4-2. Histogram of citizenship status (private)

Note that the counts are close to the true values of 7 and 3.

Adding or subtracting individuals from this database will not affect the key set, since there are only two fixed options. The possible key set always remains the same independently of the individuals in the dataset.

When categorical variables have a wider variety of possible categories, you may encounter a scenario in which the key set of the categorical variable is unknown. Consider an example where a data analyst is building a histogram from the same dataset, but this time based on job titles. The job titles are self-reported by the individuals, meaning that the set of possible job titles is not known ahead of time.

Table 4-3. Job title associated with each user in the data

First Name    Last Name    Budget           Job Title         Citizenship Status
Caryl         Baptie       $898,031.59      Accountant        Citizen
Moyra         Leverson     $847,791.81      Teacher           non-Citizen
Elinore       Gillbard     $729,605.84      Actor             Citizen
Farleigh      Crampton     $9,742,235.31    Engineer          Citizen
Sebastien     Marples      $3,677,589.94    Teacher           non-Citizen
Baxy          Doohan       $1,044,044.63    Medical Doctor    Citizen
Henrie        Whawell      $4,670,377.71    Accountant        Citizen
Dorothy       Drummer      $2,641,401.28    Teacher           Citizen
Meaghan       Clinnick     $989,042.50      Accountant        non-Citizen
Tildy         Gutans       $5,986,640.99    Engineer          Citizen

Suppose the data analyst decides to make a GROUP BY query, getting the counts of individuals for each job title:

Figure 4-3. Histogram of job titles (non-private)

And the differentially private version of the histogram is

Figure 4-4. Histogram of job titles (private)

Now, suppose that Elinore Gillbard requests that her data be redacted from the database. The new database without Elinore Gillbard is a neighboring database. The new histogram produced by the data analyst will look like so:

Figure. Histogram of job titles of a neighboring database (non-private)

Applying differential privacy to the counts will result in the following histogram

Figure 4-5. Histogram of job titles of a neighboring database (private)

When comparing the histogram from Figure 4-4 with the histogram from Figure 4-5, it is easy to notice that there is a privacy violation happening in the data release. Both histograms should be differentially private, and yet they are leaking information. Based on a quick inspection, you can infer that the job title of the person who left the database is “Actor.”

To overcome this issue, the data analyst should pre-define the key set before making the analysis. In the case of the job titles variable, the data analyst can create a set of job titles based on their previous knowledge of possible answers to the question of “what is your job title?” In addition to the job titles that the data analyst includes in the pre-defined set, the data analyst can define a category “other” that will include new job titles and outliers.

The following example shows how a pre-defined key set is constructed and used to make differentially private queries on categorical data.

Example: Histogram Queries

Let’s consider the database described in Table 4-3. To query a differentially private histogram of job title, the data analyst will:

  • Define the key set of the categories in the histogram based on previous information

  • Count the number of individuals in each of the pre-defined categories

The data analyst starts by defining the following job titles categories as the key set for the data analysis:

  • Accountant

  • Teacher

  • Engineer

  • Other

The histogram maps the job titles Accountant, Teacher, and Engineer to equivalent categories in the key set, while the job titles Actor and Medical Doctor are mapped to the category Other. The non-private histogram of job counts appears like so:

Figure 4-6. Histogram of job titles with a pre-defined key set (non-private)

The private histogram has a very similar appearance, with noise added to the counts using a Laplace mechanism with ϵ=1:

Figure 4-7. Histogram of job titles with a pre-defined key set (private)

Notice that with this configuration, any individual, with any job title, can be added or deleted from the database without resulting in privacy violations. In contrast to the previous example, any unexpected job title will simply be mapped to Other.
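This mapping can be sketched directly (the key set and job titles mirror the example above; the helper to_key is a hypothetical name):

```python
# pre-defined key set, including the catch-all "Other" category
KEY_SET = ["Accountant", "Teacher", "Engineer", "Other"]

def to_key(job_title):
    # any title outside the pre-defined key set maps to "Other"
    return job_title if job_title in KEY_SET else "Other"

# job titles from Table 4-3
titles = ["Accountant", "Teacher", "Actor", "Engineer", "Teacher",
          "Medical Doctor", "Accountant", "Teacher", "Accountant", "Engineer"]

counts = {key: sum(to_key(t) == key for t in titles) for key in KEY_SET}
print(counts)  # {'Accountant': 3, 'Teacher': 3, 'Engineer': 2, 'Other': 2}
```

An unexpected title like “Architect” would simply increment Other, so the key set of the released histogram never depends on who is in the data.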

Suppose Marta James, an architect, is added to the database:

Table 4-4. Job title associated with each user in the data

First Name    Last Name    Budget           Job Title         Citizenship Status
Caryl         Baptie       $898,031.59      Accountant        Citizen
Moyra         Leverson     $847,791.81      Teacher           non-Citizen
Elinore       Gillbard     $729,605.84      Actor             Citizen
Farleigh      Crampton     $9,742,235.31    Engineer          Citizen
Sebastien     Marples      $3,677,589.94    Teacher           non-Citizen
Baxy          Doohan       $1,044,044.63    Medical Doctor    Citizen
Henrie        Whawell      $4,670,377.71    Accountant        Citizen
Dorothy       Drummer      $2,641,401.28    Teacher           Citizen
Meaghan       Clinnick     $989,042.50      Accountant        non-Citizen
Tildy         Gutans       $5,986,640.99    Engineer          Citizen
Marta         James        $9,480,446.38    Architect         non-Citizen

The pre-defined key set prevents the new histogram from gaining an Architect key, which would otherwise leak the addition of Marta to the dataset.

Summary

Exercises

  1. Identify the following transformations as either aggregators or general transformations. If they are not aggregators, specify their range. Construct a limitation on their input so that the stability is defined.

    1. f(x, min, max) = x if x in [min, max]

    2. f(x) = (1/|x|) Σᵢ xᵢ

    3. f(x) = x²

    4. f(x) = Ax, where A is the 3×3 matrix with rows (0, 1, 0), (1, 0, 0), (0, 1, 0)

  2. Generate an array of 1000 random floats using a library like NumPy.

    1. Calculate the mean of the data

    2. Calculate the 5th and 95th percentiles for the data

    3. Clip these values

    4. Recompute the mean - how has it changed?

    5. Repeat c and d with trimming

  3. Prove that DP observes parallel composition
