This chapter builds on concepts discussed in previous chapters to construct a higher-level perspective on differential privacy. This perspective will help you gain a more mathematical understanding of differential privacy, as well as unify concepts discussed in previous chapters.
The goal of this chapter is to give you a consistent framework under which you can feel confident in constructing your own differentially private algorithms.
To accomplish this objective, this chapter will:
lay a theoretical framework for understanding differential privacy
adapt topics discussed in previous chapters to the framework
introduce new stable transformations and private mechanisms within the framework
The high-level idea is that differential privacy is a system for relating distances. The system is composed of a set of functions, each of which is stable. Stable functions are functions for which close inputs result in close outputs. If you know how much your private dataset could possibly change, and then apply a stable function to it, then you also know how much the function’s output could possibly change. Knowing how much the output could possibly change can be equivalent to knowing how private the output is.
To help familiarize yourself with this perspective, the chapter will then walk through a series of examples, each of which will demonstrate how differentially private algorithms break down into provably stable functions.
To get started, it will be helpful to familiarize yourself with the concept of a metric space first.
A metric space is a pair consisting of a domain and a metric, where the metric may be used to compute the distance between any two members of the domain. That is, a metric space is a set of values, equipped with a way of computing the distance between said values. In Chapter 1, the claim was made that differentially private algorithms consist of a pipeline of smaller functions. Each function in the pipeline maps from one metric space to another.
A domain is the set of possible values that a dataset may take on. The set may consist of virtually anything, and is usually defined with some descriptors, which are italicized below. The descriptors may be very restrictive-- like the set of booleans. Alternatively, the set may contain an infinite number of values-- like the set of all real numbers, or the set of all vectors of bounded integers. The domain may even contain the set of all dataframes, or the set of graphs of bounded degree.
A metric is a kind of function that computes the distance between any two members of a domain.
Common metrics in differential privacy are the absolute distance, the symmetric distance, the Hamming distance, and the L1 and L2 distances, each of which will appear later in this chapter.
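As a minimal sketch (not tied to any particular DP library), the absolute distance can be written as a one-line Python function:

```python
def absolute_distance(a, b):
    """The absolute distance between two numbers."""
    return abs(a - b)

# for example, datasets consisting of the single numbers 3 and 5 are at distance 2
assert absolute_distance(3, 5) == 2
```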
One of the first steps towards making a DP release is to define the initial metric space that your sensitive dataset resides in.
If you have made a DP release in the past, you may not have even recognized this step!
Many DP libraries make an implicit assumption about the metric space.
For the guarantees of any DP library to hold, the metric space of your dataset must match the metric space the DP library assumes.
In this step, you should describe the set of possible values your dataset may take on, choose a metric that characterizes the distance between neighboring datasets, and then set an upper bound on the distance, called the input distance.
For instance, say your dataset is a single integer-- the number of individuals in your household.
The dataset resides in the domain of integers, and the distance between neighboring datasets is measured by the absolute distance. You know intuitively that any two neighboring datasets differ by at most one (this is the input distance).
Any two datasets are close (or adjacent/neighboring) if the distance between them is no greater than the bound. In the above example, any two datasets/integers are close if they differ by no more than one. This same concept of closeness can be applied to any choice of metric space, dataset, and distance.
DP releases follow a known probability distribution. Privacy loss parameters, like epsilon and delta, bound how much that output distribution may change when an individual's data changes. Notice how privacy loss parameters can be interpreted similarly to distances: they are upper bounds on the distance between the output distributions of neighboring datasets. These privacy loss parameters complement the input distance, which bounds how much the input dataset itself may change.
Together, the input and output distance fully characterize the privacy guarantee of a differentially private algorithm. The input distance is a bound on how much an individual can influence a dataset, and the output distance is a bound on how much an individual can influence the output distribution, that is, how private the data release is. The next section will discuss how to relate these two distances.
A function defines a relationship between two metric spaces. That is, a function maps a member of the first metric space to a member of the second metric space. A function encompasses any private or non-private computation you might perform on a dataset. Of particular interest are stable functions.
A function is considered stable if close inputs result in close outputs.
“Close” means there is an upper bound on the distance between any two values.
Restated, a stable function is a function for which, if you know that any two inputs differ by no more than some upper bound, then you also know that the corresponding outputs differ by no more than some derived upper bound.
To get started, assume both the input metric and the output metric are the absolute distance. Consider the following function:
```python
def double(x):
    return x * 2
```
This function is stable because it can be proven that, if the inputs to the function differ by at most some bound, then the outputs differ by at most twice that bound.
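For example, writing $d_{in}$ for the input distance, any two inputs $a$ and $b$ with $|a - b| \le d_{in}$ satisfy:

$$|\mathrm{double}(a) - \mathrm{double}(b)| = |2a - 2b| = 2\,|a - b| \le 2 \cdot d_{in}$$

so the output distance is bounded by twice the input distance.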
Stable functions don’t necessarily use the absolute distance metric!
Other common choices of metric are the symmetric distance, the Hamming distance, and the L1 and L2 distances.
Differentially private algorithms naturally break down into a sequence of functions, where each function can be shown to be stable. You can think about differential privacy as a system for relating distances. Under the perspective of function stability, differential privacy is a special kind of stability constraint on a function where the upper bound on output distances is computed via a privacy measure.
Data analysts start their analysis with a limit on how much an individual can influence a dataset (an input distance), and if all of the functions in the pipeline are stable, then that bound can be translated all the way to the end of the pipeline. If the final function in the pipeline measures distances with a privacy measure, then the output distance is a privacy guarantee.
We’ll show more examples of metric spaces and stable functions in the following sections.
Often in the literature of differential privacy, a distinction is made between bounded and unbounded differential privacy. This distinction is based on the existence of one piece of public information: the dataset size.
When the dataset size is unknown, then you are operating under unbounded differential privacy. In unbounded differential privacy, the data domain consists of all datasets of any size. The distance between any two datasets is computed by counting the fewest number of records that must be added or removed to convert any one dataset into another dataset. A metric that can express unbounded DP is the symmetric distance. In unbounded DP, the user must consider the greatest number of additions or removals an individual may make to a dataset.
When the dataset size is known, then you are operating under bounded differential privacy. In bounded differential privacy, the data domain consists of all datasets that have exactly as many records as the known dataset size. In this regime, the distance between any two datasets is computed by counting the number of rows that must be edited to convert one dataset into the other. In bounded DP, you must consider the greatest number of edits an individual may make to a dataset.
Unbounded differential privacy counts additions and removals of individuals' records, whereas bounded differential privacy counts edits instead. Bounded differential privacy is a special case of unbounded differential privacy, where the data domain is restricted to datasets of a known size. Notice that each edit constitutes one removal and one addition, so a bound of k edits corresponds to a bound of 2k additions and removals. This relationship makes it possible to convert between the two notions of neighboring datasets.
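To make the two notions concrete, here is a rough sketch, assuming datasets are represented as Python lists of records (the function names are illustrative, not from any library):

```python
from collections import Counter

def symmetric_distance(x, y):
    """Fewest additions and removals needed to turn multiset x into multiset y
    (the metric used under unbounded DP)."""
    x, y = Counter(x), Counter(y)
    return sum(((x - y) + (y - x)).values())

def edit_distance(x, y):
    """Number of positions at which two equal-length datasets differ
    (a metric suitable for bounded DP)."""
    assert len(x) == len(y)
    return sum(xi != yi for xi, yi in zip(x, y))

# one edit corresponds to one removal plus one addition
assert edit_distance([1, 2, 3], [1, 2, 4]) == 1
assert symmetric_distance([1, 2, 3], [1, 2, 4]) == 2
```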
Restrictions placed on the data domain reduce the sensitivity because the restrictions exclude extreme dataset pairs.
Even clamping each individual datum is a form of restricted sensitivity: clamping restricts the domain of possible datasets to those whose elements lie within the clamping bounds L and U, and under this restriction to the domain, the unbounded-DP sensitivity of the sum becomes finite. A size descriptor on the domain may also reduce the sensitivity, as shown in the following table.
In the following table, an individual may add or remove at most one record. Each cell gives the resulting sensitivity of the statistic, and ∞ indicates that the sensitivity is unbounded.

| Statistic | any numeric vector | with L, U | with L, U, N |
|---|---|---|---|
| Count | 1 | 1 | 0 |
| Sum | ∞ | max(abs(L), abs(U)) | max(abs(L), abs(U)) |
| Mean | ∞ | ∞ | ≈ (U - L) / N |
| Variance | ∞ | ∞ | ≈ (U - L)² / N |
Under bounded DP, the dataset size N is public information, so the size descriptor always holds; neighboring datasets then differ by edits, and each edit counts as one addition plus one removal when deriving sensitivities from the table.
Any kind of descriptor based on what is considered "public information" can reduce the sensitivity, including a lower bound on the dataset size, the data types the data takes on, or limitations on the norm of each data row.
These descriptors, or restrictions, on the domain may come from two sources:
descriptors may be public information about the dataset
descriptors may be established via data processing
In the first case, descriptors may only be safely considered public information about a dataset if, for all potential datasets under consideration, that descriptor holds invariant. In many real-world applications, descriptors can be derived from common-sense relationships, like how the age of parents may not be less than that of their children.
Restricting the domain by incorporating public knowledge you have about your data, and judiciously applying preprocessing, can significantly reduce the sensitivity of your queries.
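For instance, here is a minimal sketch of clamping as a preprocessing step (the bounds L and U are hypothetical public knowledge, not values from this chapter):

```python
import numpy as np

def clamp(x, L, U):
    """Establish the descriptor that every element of x lies in [L, U]."""
    return np.clip(x, L, U)

L, U = 0, 100                      # hypothetical public bounds on each datum
x = np.array([3, 250, 47, -8])     # raw data may fall outside the bounds

# after clamping, adding or removing one record changes the sum by at most
# max(abs(L), abs(U)), so the unbounded-DP sensitivity of the sum is finite
clamped_sum = clamp(x, L, U).sum()
```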
In the previous section, one kind of data descriptor was a bound on the L1 or L2 norm of each row in a dataset. This is useful for efficiently releasing a query result that spans multiple columns simultaneously. In this context, the sensitivity of the sum scales according to the norm bound: when each row of a dataset is of bounded norm (no greater than some bound, call it C), then adding or removing any one row changes the vector-valued sum by at most C in that norm. That is, the L1 (or L2) sensitivity of the column-wise sum is the row norm bound itself.
Note that, while this definition gives an example where datasets are considered neighboring under the symmetric distance, this choice of metric is arbitrary.
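As a sketch of such preprocessing, assuming each row is a numeric vector and a hypothetical norm bound C, rows can be rescaled so that no row exceeds the bound:

```python
import numpy as np

def clip_row_norms(data, C):
    """Scale down any row whose L2 norm exceeds C, establishing the descriptor
    that every row has L2 norm at most C."""
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return data * scale

data = np.array([[3.0, 4.0], [0.5, 0.5]])
clipped = clip_row_norms(data, C=1.0)
# adding or removing any one row now changes the column-wise sums
# by at most C in the L2 norm
```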
The L1 distance between two vectors is the sum of the absolute differences between their corresponding elements.
A very common situation where this arises is in computing histograms.
Say you were asked to prepare a differentially private release of fitness data to show the effectiveness of a product.
You want to share counts of the number of unique individuals using your product, grouped by binned ages and binned activity levels.
Given the structure of this query, it is safe to claim that any one individual may influence at most one partition: any one individual cannot be of multiple ages, or multiple activity levels, at once!
This makes it possible to bound the L1 sensitivity of the entire count query.
Adding or removing any one individual can influence at most one of the counts, and by at most one, so the L1 sensitivity of the histogram is one.
A very simple adjustment of the query can cause this very nice property to degrade. When also binning by time range (say there are k time ranges), a single individual may now appear in up to k partitions, one per time range, so the L1 sensitivity grows to k.
In the context of relating distances, a function that computes the histogram (without noise addition) is a stable transformation: a bound on the symmetric distance between input datasets translates into a bound on the L1 distance between output histograms.
Stable transformations with bounded L1 sensitivity pair naturally with the Laplace mechanism, which adds independent Laplace noise with scale b to each element of the output. To show this for pure differential privacy, where the privacy loss is measured by epsilon, you bound the ratio of the probability densities of observing any particular output on two neighboring datasets. In this case, the density of Laplace noise centered on the true answer decays exponentially with the absolute distance from that answer, at a rate of 1/b. Consider the ratio of densities at an arbitrary output when the true answers differ by at most the L1 sensitivity. This may seem like too much, but the math quickly simplifies: by the triangle inequality, the log of the ratio is at most the L1 sensitivity divided by b, so choosing b equal to the L1 sensitivity divided by epsilon yields epsilon-differential privacy. At this stage, observe how well the Laplace mechanism complements queries with known L1 sensitivity.
In the overall scheme of relating distances, this function that samples Laplace noise is a private mechanism: it maps a bound on the L1 distance between aggregates to a bound (epsilon) on the distance between output distributions. Together with the previous section, applying both the histogram and Laplace functions sequentially is also stable, so the end-to-end algorithm satisfies epsilon-differential privacy.
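Here is a minimal sketch of the full pipeline, not taken from any particular library: a counting transformation followed by Laplace noise calibrated to the L1 sensitivity (the function names and parameters are illustrative):

```python
import numpy as np

def histogram(bin_indices, num_bins):
    """Stable transformation: count how many records fall into each bin.
    If each individual contributes at most one record, the L1 sensitivity is 1."""
    return np.bincount(bin_indices, minlength=num_bins)

def laplace_mechanism(counts, sensitivity, epsilon):
    """Private mechanism: add Laplace noise with scale sensitivity / epsilon
    to each count, satisfying epsilon-DP."""
    scale = sensitivity / epsilon
    return counts + np.random.laplace(scale=scale, size=counts.shape)

# hypothetical bin index per individual (one record each)
bin_indices = np.array([0, 2, 2, 1, 3, 0])
release = laplace_mechanism(histogram(bin_indices, num_bins=4), sensitivity=1, epsilon=1.0)
```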
Another common choice of metric is the L2 distance. The L2 distance between two vectors is the square root of the sum of squared differences between their corresponding elements. Similarly to the case of the L1 distance, the L2 sensitivity of a transformation is the greatest L2 distance between its outputs on any two neighboring datasets.
Again, the choice of dataset metric (the symmetric distance) is arbitrary. Any dataset metric may be used, and the choice influences the resulting, derived sensitivity. This section will continue using the symmetric distance.
A significant benefit of using the L2 distance is that the sensitivity can be much smaller when an individual influences many partitions, each by a small amount. First, recall how the L1 sensitivity of the histogram grew to k when the query was also binned over k time ranges. Under the L2 distance, the sensitivity of that same query is only the square root of k.
For this choice of query, on this dataset, you derive significant benefit from the knowledge that an individual may influence each partition by at most one.
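To see why, consider the difference vector $v$ between the histograms on two neighboring datasets: it has at most $k$ nonzero entries (one per time range), each of magnitude at most one, so

$$\|v\|_1 = \sum_{i=1}^{k} |v_i| \le k, \qquad \|v\|_2 = \sqrt{\sum_{i=1}^{k} v_i^2} \le \sqrt{k}.$$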
As you can see, this sensitivity is far tighter than before.
Unfortunately, not all queries satisfy the constraint that an individual may influence each partition by at most one!
If, in the fitness data example, you were to partition on only binned ages and binned activity levels, then you would have no way of knowing how many records from an individual may fall into each bin. Therefore, the sensitivity of this query, under the L2 distance, is no smaller than under the L1 distance: all of an individual's records may land in the same bin, so the L2 distance offers no savings unless the per-partition contribution is bounded.
A similar relationship can be shown between the Gaussian mechanism and the L2 distance: the Gaussian mechanism adds independent Gaussian noise to each element of the output, with the noise scale calibrated to the L2 sensitivity. The proof is somewhat lengthy, but simplifies to a guarantee of the same shape: the privacy parameters are determined by the ratio of the L2 sensitivity to the noise scale.
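As a minimal sketch of the Gaussian mechanism, using the classic calibration from the broader literature (an assumption here, valid for epsilon at most 1) rather than this chapter's own derivation:

```python
import numpy as np

def gaussian_mechanism(values, l2_sensitivity, epsilon, delta):
    """Add Gaussian noise calibrated to the L2 sensitivity.
    With this classic calibration, satisfies (epsilon, delta)-DP for epsilon <= 1."""
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return values + np.random.normal(scale=sigma, size=np.shape(values))

# e.g., a histogram where an individual influences each of up to k = 4 counts
# by at most one has L2 sensitivity sqrt(4) = 2
release = gaussian_mechanism(np.array([12., 41., 7., 5.]), l2_sensitivity=2.0,
                             epsilon=0.5, delta=1e-6)
```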
Together with the stable transformation, since each function is stable, the entire function is stable. Starting from a bound on the distance between the input datasets, you can bound the distance between aggregates (sensitivity). Then, you can use this sensitivity to bound the distance between any two output distributions. This gives the final bound on privacy.
The exponential mechanism is used for private selection from a set of candidates. In this setting, you know what the set of possible outputs can be (your candidate set), and you want to choose one (or more) candidates from that set that best answers a query on the private dataset.
A motivating example (similar to the seminal paper) involves pricing: Imagine that you have decided to sell off each of your baseball cards, and want to choose the best price point to maximize profit.
If you set the price too high, you will fail to find buyers, whereas setting the price too low will also lose money. In this setting, assume that prospective buyers will tell you what they are willing to spend, so long as this information isn’t leaked to other buyers.
To use the exponential mechanism, you must define a set of potential price points before conferring with the buyers. This set of potential price points forms the candidate set. The exponential mechanism selects one of those candidates based on the private information from the buyers.
Just like in the prior examples, you can divide the algorithm into a stable transformation and a private mechanism. The stable transformation is the scoring function, which maps the private dataset to a vector of scores. The private mechanism is the exponential mechanism, which selects one of the candidates based on the scores.
You can think of the scoring as a dataset aggregation, just like you might think of a dataset sum as an aggregation.
The scoring function assigns one score to each candidate; a higher score means the candidate better answers the query on the private dataset.
The sensitivity of this score vector is based on the greatest amount any one score may change.
That is, the sensitivity of the exponential mechanism is based on the L-infinity distance between score vectors. The L-infinity distance between two vectors is the greatest absolute difference between their corresponding elements.
You can use this norm, just as in the prior examples, to define the sensitivity of the scoring function.
This time, the aggregation is a scoring function that measures how close each candidate is to the alpha-quantile of the dataset. In the following score function, a candidate's score is the negation of the absolute difference between the number of records no greater than the candidate and alpha times the total number of records.
An equivalent way to write this function in Python is as follows:
```python
import numpy as np

def score_quantile_discrete(x, candidates, alpha=0.5):
    """Scores each candidate (a numpy array) according to its rank distance
    from the `alpha`-quantile of the dataset `x` (also a numpy array)."""
    # count the number of records in `x` no greater than each candidate
    num_leq = (x[None] <= candidates[:, None]).sum(axis=1)
    # greater scores are better: zero when the candidate sits exactly at the
    # alpha-quantile, increasingly negative as it moves away
    return -np.abs(num_leq - alpha * len(x))
```
The score function is based on the number of records in the dataset that are less than (or equal to) each candidate, compared against the rank of the ideal alpha-quantile. If you'd like to find the median, then let alpha be 0.5.
The scoring function is stable because, when the distance between the input datasets is bounded, the distance between the output vectors is also bounded.
This bound skips over many steps for brevity, but if you were to substitute the definition of the score function, you would find that adding or removing any one record changes each candidate's score by at most one, so the sensitivity (under the L-infinity distance) is at most one.
The exponential mechanism selects the candidate with likelihood proportional to the exponentiated score.
In other words, the likelihood of selecting the i-th candidate is proportional to exp(epsilon * score_i / (2 * sensitivity)). The factor of two in the denominator accounts for the fact that, between neighboring datasets, both the selected candidate's score and the normalization term can shift by up to the sensitivity.
To show that this is private, again appeal to the definition of pure differential privacy.
The exponential in the likelihood lends nice utility guarantees. Relatively small distances between scores can result in a much higher probability of selecting the better candidate.
When you substitute the probability of selecting candidate i on two neighboring datasets into that definition, the ratio of the unnormalized likelihoods is at most exp(epsilon/2), and the ratio of the normalization terms is also at most exp(epsilon/2). Since the math simplifies to a total ratio of at most exp(epsilon), the exponential mechanism satisfies epsilon-differential privacy.
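Spelled out, write $s_i$ and $s'_i$ for the scores of candidate $i$ on neighboring datasets $x$ and $x'$, and $\Delta$ for the sensitivity, so that $|s_i - s'_i| \le \Delta$ for every $i$. The ratio of selection probabilities is then

$$\frac{\Pr[M(x) = i]}{\Pr[M(x') = i]} = \frac{\exp\left(\frac{\varepsilon s_i}{2\Delta}\right)}{\exp\left(\frac{\varepsilon s'_i}{2\Delta}\right)} \cdot \frac{\sum_j \exp\left(\frac{\varepsilon s'_j}{2\Delta}\right)}{\sum_j \exp\left(\frac{\varepsilon s_j}{2\Delta}\right)} \le e^{\varepsilon/2} \cdot e^{\varepsilon/2} = e^{\varepsilon}.$$

The following implementation follows this recipe directly: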
```python
import numpy as np

def mechanism_exponential_discrete(x, candidates, epsilon, scorer, sensitivity):
    """Return an `epsilon`-DP sample from the set of discrete `candidates`.

    :param x: 1d dataset for which to release the estimate
    :param candidates: a list of candidate values to select from
    :param epsilon: privacy parameter
    :param scorer: accepts `data` and `candidates` and returns scores
    :param sensitivity: the greatest that `scorer` can change
    :returns a sample from `candidates` with prob proportional to `scorer`
    """
    # score each candidate
    scores = scorer(x, candidates)

    # for numerical stability; omitting this line gives same probabilities
    scores -= scores.max()

    # compute likelihood of selecting each candidate
    likelihoods = np.exp(epsilon * scores / (2 * sensitivity))

    # normalize to a probability
    probabilities = likelihoods / likelihoods.sum()

    # sample one index wrt the selection probabilities
    cdf = probabilities.cumsum()
    index = np.argmax(cdf >= np.random.uniform())

    # return the respective candidate
    return candidates[index]
```
As is shown, this mechanism privately selects the single candidate with approximately the highest score. Together with the scorer transformation, since each function is stable, the entire function is stable. Starting from a bound on the distance between the input datasets, you can bound the distance between score vectors (sensitivity). Then, you can use this sensitivity to bound the distance between the probabilities of selecting each candidate. This gives the final bound on privacy.
The previous sections have shown how to construct differentially private algorithms from stable functions, under the framework of relating distances between metric spaces. This perspective is particularly powerful, because it unifies many concepts of differential privacy under a single framework. You have the power to isolate stable functions mapping from one metric space to another, where each metric space represents common differentially private concepts-- like unbounded-DP, bounded-DP, function sensitivity, and restricted sensitivity.
The system allows you to reason about the stability, or privacy, of any subset of functions in a differentially private algorithm. This can be useful in many situations, like when you need to start your reasoning about the closeness of two datasets after already applying a dataset transformation in an external system, or when you need to tailor an existing differentially private algorithm to your use-case.
There are many more examples of stable functions that can be used to construct differentially private algorithms, which will be shown in the following chapters.
Similar to the `double` function discussed earlier in the chapter, consider the following function that duplicates each record in a list:
```python
def duplicate(x):
    assert isinstance(x, list)
    return x * 2
```
Under what metric could you argue that this is a stable function? What would the stability of this transformation be?
If you don’t know the size of a dataset, are there other domain descriptors you can use to reduce the sensitivity of a mean?
Is it possible to transform a dataset with an unknown number of records to a dataset with a known number of records? What would the stability of this transformation be?
The following questions involve dataset domains corresponding to other kinds of datasets, such as those consisting of text or graphs.
You have a graph dataset where each individual can contribute one node. Show that the count of nodes in a graph is a stable transformation from the domain of graphs to the domain of integers.
You have a dataset consisting of tweets and want to conduct sentiment analysis. Can you claim that a transformation from the domain of tweets to a categorical domain of sentiments (where each individual is assigned one sentiment) is stable?
Recall that a dataset domain may consist of any medium of data. Take a moment to visualize the domains of datasets you may want to conduct a DP analysis over. What transformations would you want to apply to them? Are these transformations stable? Are there domain descriptors that would make them stable? If so, are there transformations that would establish these descriptors?