Data transformations are at the heart of data analysis tasks. In the context of transformations, it will aid your understanding to consider a dataset to be any form of data that has not been made private. That is, computed statistics (such as sum or mean estimates) are also datasets. Transformations encompass any function from a dataset to a dataset.
Some commonly used data transformations compute statistics: when you have a dataset and want to understand a property of it, a statistic aggregates many records into a summary, like a sum, mean, standard deviation, or even regression parameters.
Some transformations modify each record in a dataset one at a time, therefore preserving the dimensions of the data, like when applying mathematical transformations to each element in a column, or when assigning bin numbers to each record. Still other transformations may compute a dataset join, or load data from a CSV file or database.
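For instance, assigning bin numbers to each record can be sketched with NumPy's `digitize`; the ages and bin edges below are purely illustrative:

```python
import numpy as np

# illustrative ages and bin edges
ages = np.array([23, 57, 41, 16, 72])
bin_ids = np.digitize(ages, bins=[20, 40, 60])  # one bin number per record
# -> array([1, 2, 2, 0, 3])
```

Note that the output has exactly one bin number per input record, so the dimensions of the data are preserved.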
In order to establish privacy guarantees later, it is first necessary to consider how an individual’s influence on the data may change once a transformation is applied. This change in influence is characterized by a function’s stability.
This chapter discusses the stability of a number of dataset transformations that are typically used in differentially private releases.
In DP, it is valuable to look at statistics as special cases of more abstract data transformations. A broader perspective means that you can use more general tools to understand the sensitivity of a transformation and ensure that your data analysis process has the expected privacy guarantees.
One of the most common dataset transformations is the sum. In many situations, each element in the data you are privatizing has no natural bounds: values could approach positive or negative infinity without limitation. Naturally, unbounded data implies that many queries, like the sum, have unbounded sensitivity. When the sensitivity is unbounded, you can’t construct a DP mechanism that will privatize your statistic, since a change in a single value could cause an arbitrarily large change in the statistic, thereby leaking information about the dataset. This kind of scenario highlights the importance of having a well-defined sensitivity when performing a private data analysis.
Just like a statistic has an associated sensitivity, a more general data transformation has a notion of stability. Stability measures how much a transformation changes, given a change in the input data. By the end of this chapter, you should understand the notion of stability with respect to dataset transformations. Using this knowledge, you should be able to identify stable dataset transformations that complement differentially private mechanisms. In particular, you will be able to determine when it is appropriate to use clipping/clamping when performing a DP analysis.
A transformation is a computation that maps a dataset to another dataset, or maps a dataset to an aggregate.
Consider a transformation that squares each element in a dataset:
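A minimal sketch in Python, with illustrative values:

```python
dataset = [1.0, -2.0, 3.0]           # illustrative values
squared = [x ** 2 for x in dataset]  # [1.0, 4.0, 9.0]
```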
This transformation operates on each row independently and returns a transformed dataset with the same number of rows. Such a transformation is called a row-by-row transformation. Row-by-row transformations are rather common; you will find them useful as you learn more about DP.
The mean is another example of a transformation, one that maps a dataset to a value computed from an aggregate of the data. Taking the same dataset, we can compute the mean.
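Using the same illustrative dataset as before, the mean aggregates all records into one value:

```python
dataset = [1.0, -2.0, 3.0]               # illustrative values
mean = sum(dataset) / len(dataset)       # 2.0 / 3, roughly 0.667
```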
In this case, the result of the transformation is a single value, not a dataset. This single value takes the data in aggregate and returns a value that describes a property of the dataset. We’ll generally refer to such transformations as aggregators.
In this chapter, we are interested in transformations with bounded stability. Stability is a property of a transformation that guarantees that similar inputs map to similar outputs; a transformation with this property is called stable.
You might be wondering what we mean by similar here, and that is a great question. Mathematically speaking, you will always want to be precise with your notions of similarity and closeness. Much like you say a release is "differentially private" only with respect to a precise definition of neighboring datasets, you say a transformation is stable only with respect to precise definitions of closeness on its inputs and outputs.
Stability may already feel familiar to you, because stability is a generalization of sensitivity. Recall the definition of sensitivity from the previous chapter: the sensitivity of a function f is the greatest possible change in its output over any pair of neighboring datasets x and x′, written Δf = max |f(x) − f(x′)|. We can use this definition to compute the sensitivity of a sum aggregator. If you know that neighboring datasets differ by the addition or removal of a single record, then the sum can change by at most the magnitude of that one record. This quantity is almost useful, but the magnitude of a single record is unbounded when the data itself is unbounded. If you know that every value lies within bounds [L, U], however, then the sensitivity of the sum is max(|L|, |U|).
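Under the add/remove-one definition of neighboring datasets, this reasoning can be sketched as a small helper (the function name is ours, not a library API):

```python
def sum_sensitivity(lower, upper):
    """Sensitivity of a sum over data clamped to [lower, upper]."""
    # adding or removing one record changes the sum by at most the
    # largest magnitude any single record can take
    return max(abs(lower), abs(upper))

sum_sensitivity(0., 100.)   # 100.0
sum_sensitivity(-50., 20.)  # 50.0
```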
Let’s think back to the classroom example: every score in the class was between 0 and 100. Why is that? Simply by circumstance, there was no extra credit on the exam, and there was no (negative) penalty for guessing incorrectly, so the minimum score was 0.
What happens in a scenario where we don’t have such a natural domain for the data? Let’s consider financial situations where the maximum is unknown or undefined. What is the maximum price of a house? The maximum income possible? Such scenarios do not give us natural bounds on the data.
Now remember how we calculated sensitivity as the maximum change in the final statistic that is possible from adding or removing a single data point. We knew that the lowest average score was 0 and the highest single score was 100, which gave us a sensitivity of 10.
What happens if we don’t have a maximum score? We can no longer compute a finite sensitivity, since it becomes unbounded.
Clipping is a transformation that replaces each value outside of a set of predefined bounds with the nearest value inside the bounds. NumPy's np.clip does this for scalars and, element-wise, for arrays.
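For example, with illustrative values:

```python
import numpy as np

data = np.array([-5.2, 3.1, 7.8, 12.4, 0.6])
clipped = np.clip(data, 0., 10.)
# values below 0 become 0, values above 10 become 10:
# -> array([ 0. ,  3.1,  7.8, 10. ,  0.6])
```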
The OpenDP Library similarly has a constructor for clamping data:
```python
import numpy as np
import opendp.prelude as dp

input_space = dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance()
clip_trans = input_space >> dp.t.then_clamp((0., 10.))

# illustrative data: draws from a normal distribution
mock_dataset = np.random.normal(0., scale=10., size=100).tolist()
transformed_dataset = clip_trans(mock_dataset)
```
Now you have the tools to make your sensitivity well-defined, even in cases where the data is unbounded. By using a clipping function like the ones above, you can ensure that your dataset lies within the chosen lower and upper bounds, resulting in a sensitivity value that is meaningful for constructing DP statistics.
Suppose you have a dataset of individuals and their annual incomes. The least amount they can be paid per year is $0, but it is unclear what the maximum is. For this reason, your data is on the domain [0, ∞).
In this case, you would need to use outside domain knowledge to estimate a good maximum given the dataset. If you set it very high, you are almost guaranteed to not cut off any of the values, but your sensitivity will suffer. Conversely, if the maximum is set too low, then you will clip too many values and lose utility in your final statistic.
There are several approaches you can take when selecting clipping bounds. Consider the DP mean process: for data without any natural bounds, the sensitivity is undefined. In order to guarantee DP in a mean release, you need to somehow provide bounds to the data.
There are two ways to deal with extreme values: trimming and clipping. Trimming means removing any values outside the range [min, max], while clipping means replacing values outside the range with min or max.
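The difference is easy to see with NumPy (the data and bounds below are illustrative):

```python
import numpy as np

data = np.array([-3., 1., 5., 8., 42.])
lo, hi = 0., 10.

# clipping preserves the number of records
clipped = np.clip(data, lo, hi)              # [0., 1., 5., 8., 10.]

# trimming removes records outside the range
trimmed = data[(data >= lo) & (data <= hi)]  # [1., 5., 8.]
```

Note that trimming changes the dataset size, and the resulting size itself depends on the data; clipping keeps the size fixed.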
Percentiles are a natural starting point. If you define a percentile value, such as the 95th percentile, you can clip the data at that value, bounding the influence of the most extreme records.
This has an effect on the final statistic: values beyond the bound are pulled inward, shifting aggregates like the mean toward the center of the data.
As an illustrative example, consider the numbers 1, 2, 3, 4, and 100: clipping to the range [0, 10] replaces 100 with 10, moving the mean from 22 down to 4.
Thankfully, you don’t have to go through all of these steps every time you want to establish optimal clipping parameters.
```python
import numpy as np
import opendp.prelude as dp

def release_dp_mean(bounds, contributions, epsilon):
    # `data` is assumed to be a list of floats defined earlier
    context = dp.Context.compositor(
        data=data,
        privacy_unit=dp.unit_of(contributions=contributions),
        privacy_loss=dp.loss_of(epsilon=epsilon),
        split_evenly_over=2,
    )
    # the mean is released as a noisy sum divided by a noisy count
    numerator = context.query().clamp(bounds).sum().laplace().release()
    denominator = context.query().count().laplace().release()
    return numerator / denominator

true_mean = sum(data) / len(data)
print(f"True mean: {true_mean}")

print("Mean  Utility")
for upper in [50., 75., 90., 100.]:
    clipped_dp_mean = release_dp_mean((0., upper), contributions=1, epsilon=1.)
    utility = abs(clipped_dp_mean - true_mean)
    print(f"{clipped_dp_mean}  {utility}")
```
Categorical data is, as its name suggests, data that can be divided into different categories.
| License Plate | Vehicle Type |
|---|---|
| ABC123 | Car |
| JCO549 | Truck |
| OFJ295 | Car |
| EMF494 | Motorcycle |
| QMC583 | Truck |
In this case, you can aggregate the vehicles by the column “Vehicle Type” and generate a count: 2 cars, 2 trucks, 1 motorcycle. The possible categories (car, truck, motorcycle) are also known as the keys.
When dealing with categorical data, the set of keys might be entirely known or only partially known. In some cases, the set of keys is clear and well known. For example, consider a census survey. If there is a question related to citizenship status, there are only two possible keys: citizen or non-citizen. If a data analyst decides to create a histogram of citizenship status, the data analyst will query the number of individuals whose status is “citizen” and the number of individuals whose status is “non-citizen.”
The data analyst does not know how many individuals are in each category, so they will query the count of citizens and the count of non-citizen individuals. In this query, an individual belongs to either the category “citizen” or “non-citizen,” so all neighboring databases will have the same set of categories.
Next, consider an illustrative example where a data analyst queries citizenship status statistics from a database:
| First Name | Last Name | Budget | Citizenship Status |
|---|---|---|---|
| Caryl | Baptie | $898,031.59 | Citizen |
| Moyra | Leverson | $847,791.81 | non-Citizen |
| Elinore | Gillbard | $729,605.84 | Citizen |
| Farleigh | Crampton | $9,742,235.31 | Citizen |
| Sebastien | Marples | $3,677,589.94 | non-Citizen |
| Baxy | Doohan | $1,044,044.63 | Citizen |
| Henrie | Whawell | $4,670,377.71 | Citizen |
| Dorothy | Drummer | $2,641,401.28 | Citizen |
| Meaghan | Clinnick | $989,042.50 | non-Citizen |
| Tildy | Gutans | $5,986,640.99 | Citizen |
The data analyst queries the count of individuals with Citizenship Status == Citizen and the count of individuals where Citizenship Status == non-Citizen. These two queries represent the citizenship status histogram. In this case, adding or subtracting individuals from this database will not change the key set in the histogram, since the only two options are “citizen” and “non-citizen.”
If the analyst builds a histogram of citizenship counts from this table, it will appear like so:
This dataset contains an example of a categorical variable with a known key set. Because the key set is known, the set of keys is consistent over all possible neighboring databases. There is no third possible key that can be introduced here, since the category is binary. If the data analyst wants to release a differentially private histogram of citizenship statuses of individuals in a database, they can use a differentially private count query on the database. Because the keys are known, the DP count query will provide a differentially private data release.
Recall from chapter 3 that DP has the parallel composition property: this means that performing a DP query on each of several disjoint partitions of the data incurs only the maximum of the per-partition privacy losses, not their sum.
An example of a DP count with noise addition could look like this:
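A minimal sketch with NumPy's Laplace sampler (the ε value here is an assumption for illustration, not from the text):

```python
import numpy as np

rng = np.random.default_rng()
epsilon = 1.0  # illustrative privacy-loss parameter

true_counts = {"Citizen": 7, "non-Citizen": 3}
# a count has sensitivity 1: adding or removing one individual changes
# each count by at most 1, so Laplace noise with scale 1/epsilon suffices
dp_counts = {
    key: count + rng.laplace(scale=1.0 / epsilon)
    for key, count in true_counts.items()
}
```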
Note that the counts are close to the true values of 7 and 3.
Adding or subtracting individuals from this database will not affect the key set, since there are only two fixed options. The possible key set always remains the same independently of the individuals in the dataset.
When categorical variables have a wider variety of possible categories, you may encounter a scenario in which the key set of the categorical variable is unknown. Consider an example where a data analyst is building a histogram from the same dataset, but this time based on job titles. The job titles are self-reported by the individuals, meaning that the set of possible job titles is not known ahead of time.
| First Name | Last Name | Budget | Job Title | Citizenship Status |
|---|---|---|---|---|
| Caryl | Baptie | $898,031.59 | Accountant | Citizen |
| Moyra | Leverson | $847,791.81 | Teacher | non-Citizen |
| Elinore | Gillbard | $729,605.84 | Actor | Citizen |
| Farleigh | Crampton | $9,742,235.31 | Engineer | Citizen |
| Sebastien | Marples | $3,677,589.94 | Teacher | non-Citizen |
| Baxy | Doohan | $1,044,044.63 | Medical Doctor | Citizen |
| Henrie | Whawell | $4,670,377.71 | Accountant | Citizen |
| Dorothy | Drummer | $2,641,401.28 | Teacher | Citizen |
| Meaghan | Clinnick | $989,042.50 | Accountant | non-Citizen |
| Tildy | Gutans | $5,986,640.99 | Engineer | Citizen |
Suppose the data analyst decides to make a GROUP BY query, getting the counts of individuals for each job title:
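The non-private version of this query can be sketched with pandas, using just the job-title column from the table above:

```python
import pandas as pd

# job titles from the table above
df = pd.DataFrame({"job_title": [
    "Accountant", "Teacher", "Actor", "Engineer", "Teacher",
    "Medical Doctor", "Accountant", "Teacher", "Accountant", "Engineer",
]})
job_counts = df.groupby("job_title").size()
# Accountant: 3, Teacher: 3, Engineer: 2, Actor: 1, Medical Doctor: 1
```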
And the differentially private version of the histogram is:
Now, suppose that Elionore Gillbard requests that her data be redacted from the database. The new database without Elionore Gillbard is a neighboring database. The new histogram produced by the data analyst will look like so:
[[histogram_of_job_titles_of_a_neighboring_database]]
.Histogram of job titles of a neighboring database (non-private)
image::images/ch4_fig4.png["query database without Elinore"]
Applying differential privacy to the counts will result in the following histogram:
When comparing the histogram from Figure 4-4 with the histogram from Figure 4-6, it is easy to notice that there is a privacy violation in the data release. Both histograms should be differentially private, and yet they are leaking information. Based on a quick inspection, you can infer that the job title of the person who left the database is “Actor.”
To overcome this issue, the data analyst should predefine the key set before making the analysis. In the case of the job titles variable, the data analyst can create a set of job titles based on their previous knowledge of possible answers to the question of “what is your job title?” In addition to the job titles that the data analyst includes in the predefined set, the data analyst can define a category “other” that will include new job titles and outliers.
Now let’s see how a predefined key set is constructed and used to make differentially private queries on categorical data in the following example.
Let’s consider the database described in Table 44. To query a differentially private histogram of job title, the data analyst will:
Define the key set of the categories in the histogram based on previous information
Count the number of individuals in each of the predefined categories
The data analyst starts by defining the following job titles categories as the key set for the data analysis:
Accountant
Teacher
Engineer
Other
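The mapping into this predefined key set can be sketched as follows (the job titles come from the table above):

```python
from collections import Counter

key_set = {"Accountant", "Teacher", "Engineer"}
job_titles = [
    "Accountant", "Teacher", "Actor", "Engineer", "Teacher",
    "Medical Doctor", "Accountant", "Teacher", "Accountant", "Engineer",
]

# any title outside the predefined key set is mapped to "Other"
stable_titles = [t if t in key_set else "Other" for t in job_titles]
counts = Counter(stable_titles)
# counts: Accountant 3, Teacher 3, Engineer 2, Other 2
```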
The histogram maps the job titles Accountant, Teacher, and Engineer to equivalent categories in the key set, while the job titles Actor and Medical Doctor are mapped to the category Other. The non-private histogram of job counts appears like so:
The private histogram has a very similar appearance, with noise added to the counts using a Laplace mechanism whose scale is calibrated to the privacy budget.
Notice that with this configuration, any individual, with any job title, can be added or deleted from the database without resulting in privacy violations. In contrast to the previous example, any unexpected job title will simply be mapped to Other.
Suppose Marta James, an architect, is added to the database:
| First Name | Last Name | Budget | Job Title | Citizenship Status |
|---|---|---|---|---|
| Caryl | Baptie | $898,031.59 | Accountant | Citizen |
| Moyra | Leverson | $847,791.81 | Teacher | non-Citizen |
| Elinore | Gillbard | $729,605.84 | Actor | Citizen |
| Farleigh | Crampton | $9,742,235.31 | Engineer | Citizen |
| Sebastien | Marples | $3,677,589.94 | Teacher | non-Citizen |
| Baxy | Doohan | $1,044,044.63 | Medical Doctor | Citizen |
| Henrie | Whawell | $4,670,377.71 | Accountant | Citizen |
| Dorothy | Drummer | $2,641,401.28 | Teacher | Citizen |
| Meaghan | Clinnick | $989,042.50 | Accountant | non-Citizen |
| Tildy | Gutans | $5,986,640.99 | Engineer | Citizen |
| Marta | James | $9,480,446.38 | Architect | non-Citizen |
The predefined key set prevents the histogram from gaining a new Architect key, which would otherwise leak the addition of Marta to the dataset.
Identify the following transformations as either aggregates or general transformations. If they are not aggregates, specify their range. Construct a limitation on their input so that the stability is defined.
f(x, min, max) = x if x in [min, max]
a. Generate an array of 1000 random floats using a library like NumPy.
b. Calculate the mean of the data.
c. Calculate the 5th and 95th percentiles of the data, and clip the data to these values.
d. Recompute the mean. How has it changed?
e. Repeat c and d with trimming.
Prove that DP observes parallel composition