Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 7. Machine Learning and Differential Privacy

Machine learning is a term that describes a variety of tools to find patterns in data. In this chapter, you will learn the basics of machine learning, using linear regression as an entry point. Building upon this knowledge, you will learn about the relationship between the training process, model parameters, and potential privacy violations. To defend against different attack scenarios you will learn about different ways to privatize machine learning models during the training process.

Broadly speaking, machine learning is the process of a computer receiving a data set and learning relationships and patterns in it. The process of learning these patterns is called training. Given a function with some unknown parameters, the training process tries different parameter values until it finds the values that allow it to most closely match the data. In order to understand how accurate the model is, you perform testing - this is the part of the process where you use your parameters to make predictions based on the input data, and compare the predictions to the true values. If the tests show that your model is accurate, then congratulations, you’ve trained the computer to learn a model about the data. If not, it is back to the drawing board, but there are always other model types and training processes to try.

For example, simple linear regression is a process that estimates the slope and intercept between two columns in a dataset. These two parameters uniquely determine a line in the plane that estimates a linear relationship in your data set. This seems innocuous enough, however, there are situations in which you will need to privatize this model in order to protect the privacy of individuals in the training data.

Why would you need to do this? Consider a membership attack ¹ where a malicious actor observes that a model is more confident about its prediction on certain data points. The actor can then conclude that these points are similar to the points that the model was trained on. By repeatedly tuning these data points to maximize the model’s confidence, the actor can converge on points that are very close (or identical) to real data points from the training set. In cases where a model was trained on sensitive data, this constitutes a privacy violation. ².

Supervised vs. Unsupervised Learning

Machine learning is often broken into two broad categories: supervised, and unsupervised. Supervised learning describes a scenario where the data comes with labels to learn from. These labels aid the model in learning patterns between the different categories. For example, the following table shows a data set of fruit measurements.

Table 7-1. Fruits in a grocery store
Height (cm)	Diameter (cm)	Type
6.2	6.7	Apple
7.0	6.9	Orange
9.0	6.2	Pear
6.3	6.5	Apple
7.1	6.8	Orange
6.4	6.8	Apple
9.1	6.1	Pear
9.2	6.2	Pear
7.2	7.4	Orange

Now imagine you are given a task - you know the height and circumference of a fruit, and want to predict its type. In this case, the column Type is the label of the data set - it is what your model wants to predict. This is an example of supervised learning - the word supervised is used here because you are “supervising” the model’s learning process by giving it labels to learn from. Supervised learning is common in scenarios where you have large amounts of pre-existing data that was labelled by humans.

If, on the other hand, you are given the dataset without this column, then your learning objective is different.

Table 7-2. Fruits in a grocery store
Height (cm)	Diameter (cm)
6.2	6.7
7.0	6.9
9.0	6.2
6.3	6.5
7.1	6.8
6.4	6.8
9.1	6.1
9.2	6.2
7.2	7.4

You know that each row represents a fruit, but do not know which type it is. In fact, this data doesn’t even tell you how many types of fruit there are in the data set. With a data set like this, you can perform an unsupervised learning process, which will try to infer how many categories there are in the data, and which category each member belongs to. Thus, it may notice that there are three fruits in the data set where the ratio of height to diameter is large - the fruit is taller than it is wide. These rows will be clustered together and the model will tell you “these objects seem similar.” The model can therefore identify pears based on height and diameter without even knowing what a pear is, it simply looks at patterns in the data and makes its best estimate.

Unsupervised learning is more common in cases where you have data that hasn’t been human reviewed. Sometimes, you will want to understand the global structure of the data as a first pass before doing further analysis on it. In a case like this, unsupervised learning can be a helpful first step to estimate how many clusters exist in the data. In other cases, the unsupervised learning is the main event - a classic example being handwritten digit detection that clusters images of digits into like categories, without any pre-existing labels or human guidance.

Simple Linear Regression

If training a model is the process of making it represent the data as accurately as possible, then how do you measure the correctness of its estimate? You need a mathematical formulation of correctness for the model to compare itself against. Such a formulation is often called an objective function - a function that takes a data point as input, and returns how close the model’s prediction is to the true value.

A simple linear regression model transforms a single argument $x$ to an output $y$ via two parameters, an intercept $beta 0$ and slope $beta 1$ :

y = β_{0} + β_{1} x

Given a set of points $left-parenthesis x 0 comma y 0 right-parenthesis comma left-parenthesis x 1 comma y 1 right-parenthesis comma period period period period comma left-parenthesis x Subscript upper N Baseline comma y Subscript upper N Baseline right-parenthesis$ , you want to learn the y intercept $beta 0$ and slope $beta 1$ that describes a linear relationship between x and y. In other words, you want to learn the parameters that make the function $y = β_{0} + β_{1} x$ fit the data as close as possible. For any guess of $beta 0$ and $beta 1$ , the model needs to understand the difference between its estimate and the true value. Since we are learning a linear model, the estimate for each $y_{i}$ is $beta 0 plus beta 1 x$ , and the difference gives you the error of the estimate for $y_{i}$ : $e r r (y_{i}) = y_{i} - (β_{0} + β_{1} x)$ . To understand the overall error for the dataset, compute the mean squared error:

$\sum_{i = 1}^{N} e r r {(y_{i})}^{2} = \sum_{i = 1}^{N} {(y_{i} - β_{0} - β_{1} x_{i})}^{2}$

Unfortunately, like any other statistic, regression parameters contain private information about the underlying dataset.

DP Approach 1: Sufficient Statistics

When the data is centered around zero, the regression parameters become relatively simple: $β_{1} = \frac{\sum_{i} x_{i} y_{i}}{\sum_{i} x_{i}^{2}}$ , and $β_{0} = \sum_{i} y_{i} / n - β_{1} \sum_{i} x_{i}$ .

The naive approach is to try to add noise directly to $beta 1$ , with noise calibrated according to the sensitivity of $beta 1$ . This breaks down when you try to derive the sensitivity of the regression coefficient $beta 1$ . The sensitivity of $beta 1$ is no smaller than in the case where X’ differs from X by one record:

Δ β_{1} \geq max_{X \sim X'} \frac{\sum_{i} x_{i} y_{i}}{\sum_{i} x_{i}^{2}} - \frac{\sum_{i} x_{i} y_{i} + x_{i}' y_{i}'}{\sum_{i} x_{i}^{2} + x_{i}'^{2}}|

If all $x_{i}$ are zero, then the quantity $StartFraction sigma-summation Underscript i Endscripts x Subscript i Baseline y Subscript i Baseline Over sigma-summation Underscript i Endscripts x Subscript i Superscript 2 Baseline EndFraction$ is undefined, and therefore so is the sensitivity.

A common approach to avoid the issue of undefined sensitivity is to replace individual point estimates in the formula, and then postprocess to get an estimate of the final statistic.

Denote the DP estimate for $upper S Subscript x y Baseline equals sigma-summation Underscript i Endscripts x Subscript i Baseline y Subscript i$ as ${\tilde{S}}_{x y}$ , and similarly for ${\tilde{S}}_{x x}$ . These point estimates can be post-processed to approximate $beta 1$ :

β_{1} \approx \frac{{\tilde{S}}_{x y}}{{\tilde{S}}_{x x}}

The sensitivity of ${\tilde{S}}_{x y}$ is now a direct substitution and simplification from the definition of sensitivity.

We first consider the case where one record is added:

\begin{matrix} Δ S_{x y} & = max_{X \sim X^{'}} S_{x y} - S_{x y}^{'}| \\ = max_{X \sim X^{'}} \sum_{i} x_{i} y_{i}) - \sum_{i} x_{i} y_{i} + x â € ™ y^{'})| \\ = max_{X \sim X^{'}} x â € ™ y â € ™| \\ = max_{X \sim X^{'}} b \cdot b| \\ = b^{2} \end{matrix}

This derivation makes use of an assumption that all $x_{i}$ and $y_{i}$ are bounded between $left-bracket negative b comma b$ ]. A similar derivation for the case where one record is removed works out to the same sensitivity, meaning the overall sensitivity is $b^{2}$ . Another similar derivation for $S_{x x}$ results in the same sensitivity in the add/remove model.

Now that you have the sensitivities of each of your point estimates $S_{x y}$ and $S_{x x}$ , you can use your knowledge to make DP releases of these quantities with an additive noise mechanism, and post-process these estimates into $beta 1$ .

import numpy as np
from opendp.measurements import make_base_laplace
from opendp.mod import enable_features
enable_features("contrib")

def dp_beta1_via_plugin(x, y, b, epsilon):
    x, y = np.clip(x, -b, b), np.clip(y, -b, b)

    # noise scale is multiplied by 2 to account for the composition
    # both dp estimators happen to share the same noise distribution
    laplace_mech = make_base_laplace(scale=2.0 * b ** 2 / epsilon)

    # make the point releases
    dp_Sxy = laplace_mech(sum(x_i * y_i for x_i, y_i in zip(x, y)))
    dp_Sxx = laplace_mech(sum(x_i**2 for x_i in x))

    # post-process
    return dp_Sxy / dp_Sxx

The total privacy expenditure is the basic composition of the privacy expenditure for releasing $S_{x y}$ and the privacy expenditure for releasing $S_{x x}$ . Privately estimating unknown quantities in $beta 0$ follows the same routine, making use of prior DP algorithms for individual point estimates, and then performing post-processing.

Multivariate Regression

The concepts behind simple linear regression can be extended for when you have multiple predictors, letting $x_{i j}$ denote the predictors of y. This scenario is called multivariate regression. In this section, you will learn to extend the “plug-in approach” for multivariate regression, and also loosen the assumption that the data is centered around zero.

This time, consider a private dataset consisting of m predictors for y. You would like to solve for $beta equals left-bracket beta 0 comma period period period beta Subscript m Baseline$ ], such that $β = {argmin}_{β} \sum_{i = 1}^{n} {(β_{i} \cdot x_{i} - y_{i})}^{2}$ .

The ordinary least squares solution for estimating a linear regression model is $β = {(X^{T} X)}^{- 1} X y$ .

This formula is a generalization of the univariate case above, where X is the design matrix, with two columns:

a column of ones to estimate the bias term
the input vector x

X = \begin{matrix} 1 & x_{0} \\ 1 & x_{1} \\ ⋮ & ⋮ \\ 1 & x_{N} \end{matrix}]

Then the quantity $X^{T} X$ is

\begin{matrix} X^{T} X = \begin{matrix} 1 & 1 & \dots \\ x_{0} & x_{1} & \dots \end{matrix}] \begin{matrix} 1 & x_{0} \\ 1 & x_{1} \\ ⋮ & ⋮ \\ 1 & x_{N} \end{matrix}] \end{matrix}

This statement simplifies, since each value in the resulting matrix can be expressed as a sum:

X^{T} X = \begin{matrix} \sum_{i}^{N} 1 & \sum_{i}^{N} x_{i} \\ \sum_{i}^{N} x_{i} & \sum_{i}^{N} x_{i} x_{i} \end{matrix}]

Recall that the inverse of a 2x2 matrix $\begin{matrix} a & b \\ c & d \end{matrix}]$ can be found using the rule

X^{- 1} = \begin{matrix} d & - b \\ - c & a \end{matrix}] \frac{1}{a b - b c}

and we can use this statement to calculate the inverse of $X^{T} X$ :

{(X^{T} X)}^{- 1} = {\begin{matrix} N & S_{x} \\ S_{x} & S_{x x} \end{matrix}]}^{- 1} = \begin{matrix} S_{x x} & - S_{x} \\ - S_{x} & N \end{matrix}] \frac{1}{S_{x x} N - S_{x}^{2}}

To complete the estimator, we need to calculate $X y$ :

X y = \begin{matrix} 1 & x_{0} \\ 1 & x_{1} \\ ⋮ & ⋮ \end{matrix}] \begin{matrix} y_{0} & y_{1} & \dots \end{matrix}] = \begin{matrix} y_{0} + y_{1} \dots \\ x_{0} y_{0} + x_{1} y_{1} \dots \end{matrix}]

Using the previous definitions of $S_{y}$ and $S_{x y}$ , we can express this quantity as

$X y = \begin{matrix} S_{y} \\ S_{x y} \end{matrix}]$

This simplifies to:

{(X^{T} X)}^{- 1} X y = \begin{matrix} S_{x x} & - S_{x} \\ - S_{x} & N \end{matrix}] \begin{matrix} S_{y} \\ S_{x y} \end{matrix}] \frac{1}{S_{x x} N - S_{x}^{2}}

DP Approach 2: Theil-Sen Estimator

Another approach for estimating DP regression coefficients is via a privatized version of the Theil-Sen estimator. The Theil-Sen estimator first makes a random pairing of each point with another point in the dataset. A slope is computed for each pairing, and the median of this set of slopes is used as the final estimate of the slope.

This algorithm is amenable to differentially private techniques, by breaking it into three pieces

a transformation from your dataset to a dataset of pairs
a transformation from a dataset of pairs to a vector of slopes
a DP median mechanism on the vector of slopes

If you plot the dataset of pairs, it looks like this:

import matplotlib.pyplot as plt
import numpy as np

dataset_size = 20
x = np.random.normal(size=dataset_size)
y = x * np.random.normal(loc=1., scale=0.4, size=dataset_size) + 3
dataset = np.stack([x, y]).T

# Randomly construct pairings
np.random.shuffle(dataset)

# This assumes that dataset size is even:
dataset_of_pairs = np.array(np.array_split(dataset, 2))

for p1, p2 in zip(*dataset_of_pairs):
    slope = (p1[1] - p2[1]) / (p1[0] - p2[0])
    plt.plot([p1[0], p2[0]], [p1[1], p2[1]], label=f"{slope:.2f}")
plt.legend()
plt.show()

The legend shows the corresponding dataset of slopes.

First, consider the stability of a dataset transformation to a dataset of pairs. If an individual may influence at most k records in the input dataset, then in the worst case, the individual may influence up to k pairs.

The transformation from a dataset of pairs to a dataset of slopes is a stable row-by-row transformation. Therefore, it can be argued formally that if an individual can influence at most k records in the input dataset, then the individual can influence at most k slopes.

The final step to estimate $beta 1$ is an application of the DP median mechanism covered previously.

$beta 0$ can be estimated with plug-in point estimates and postprocessing, as in Approach 1.

Gradient Descent

Training a machine learning model is fundamentally about minimization. At each step of the training process, the model tries a set of predictions, and calculates error from the true values. It then modifies its parameters and tries again. How does it modify the parameters? By taking a step in the direction of fastest change.

Think of yourself at the top of a mountain. If you want to get to the bottom as quickly as possible, do you zigzag left and right as you do, or do you follow the pull of gravity and walk straight down? Conversely, if you are at the edge of a canyon and want to find the lowest point, you will also walk straight down the side of the canyon, rather than running laps around its edge as you descend.

What is the connection between canyons and machine learning? In this case, if you construct an error function that gives you the difference between your model output and true values, you want to find the parameters that minimize the error. This is a crucial concept moving forward - machine learning is about optimizing parameters that minimize the error between the model prediction and the true values in the data. Ideally, this function should be convex, that is, have only one minimum value. In practice, this isn’t always the case, and there may be multiple local minima in the error function.

For example, consider the function $f (x, y) = {(x - 2)}^{4} + {(y + 3)}^{2}$ . If you are given the point (1,1) and asked to find the quickest path from the point to the minimum of $f left-parenthesis x comma y right-parenthesis$ , how will you proceed? First, calculate the gradient $\nabla f (x, y) = (4 {(x - 2)}^{3}, 2 (y + 3))$ . This vector will point in the direction of the steepest descent. At a given point $left-parenthesis x Subscript i Baseline comma y Subscript i Baseline right-parenthesis$ , take a step by calculating $(x_{i + 1}, y_{i + 1}) = (x_{i}, y_{i}) - γ \nabla f (x, y)$ . Here, $γ$ determines the size of the step taken in each iteration and is called the learning rate.

An implementation of this algorithm is included below. In this case, the code will only stop iterating once the difference in f changes by an amount specified by the variable tol. In other cases, you may want to set a fixed number of iterations in a for loop.

import numpy as np


def f(x,y):
    return (x - 2)**4 + (y + 3)**2


def gradient(x,y):
    return np.array([4 * (x - 2) ** 3, 2 * (y + 3)])


if __name__ == '__main__':
    x_old = 10
    y_old = 10
    gamma = 0.01
    tol = 1e-16

    x_new, y_new = np.array([x_old, y_old]) - gamma * gradient(x_old, y_old)

    while np.abs(f(x_new, y_new) - f(x_old, y_old)) > tol:
        x_new, y_new = x_old, y_old
        x_old, y_old = np.array([x_old, y_old]) - gamma * gradient(x_old, y_old)
        print(x_new, y_new, f(x_new, y_new))

If you run this code, you will see the x and y values descend from 10 down to values increasingly close to 0. Change the gamma value - what do you see?

The following table demonstrates gradient descent for the function, as it searches from the point (1,1) to approximate the minimum at (1.806, -2.994).

Table 7-3. Gradient Descent
header	x	y	f(x,y)
0	1	1	17
1	1.4	0.19999999999999996	10.369600000000002
2	1.4864	-0.44000000000000017	6.623182505122199
3	1.5405919821824	-0.9520000000000001	4.23884851982989
4	1.5793762594281457	-1.3616000000000001	2.7156567789467383
5	1.6091436890118227	-1.6892800000000001	1.741325180606942
6	1.6330279263142091	-1.951424	1.1176472334399676
7	1.6527957582168578	-2.1611392	0.71821993366425
8	1.669538055848909	-2.32891136	0.4622857158456387
9	1.6839733072533936	-2.463129088	0.2982049664192858
10	1.6965983044406683	-2.5705032704	0.1929411198544475
11	1.7077698690579728	-2.6564026163200003	0.12535205723685655
12	1.7177522690816889	-2.725122093056	0.0819041818260867
13	1.7267462377198413	-2.7800976744448	0.05393228605201927
14	1.7349075208010498	-2.82407813955584	0.03588693922437874
15	1.7423591667431255	-2.8592625116446717	0.024213185580110395
16	1.7491999223731172	-2.8874100093157375	0.016633001413099688
17	1.7555101204372556	-2.90992800745259	0.011686050476040176
18	1.761355902914583	-2.927942405962072	0.008435713837441053
19	1.7667923113881525	-2.9423539247696575	0.006280888062175426
20	1.7718655885006172	-2.953883139815726	0.0048354790560833725
21	1.7766149189414302	-2.9631065118525806	0.0038512287335502717
22	1.7810737650811392	-2.9704852094820646	0.0031682827917619717
23	1.7852709046944812	-2.9763881675856516	0.0026835201826356826
24	1.789231246537579	-2.981110534068521	0.0023302563822229865
25	1.7929764780799733	-2.984888427254817	0.0020652311076308997
26	1.7965255848863149	-2.9879107418038537	0.0018602623208747828
27	1.7998952707629081	-2.990328593443083	0.0016968900736663364
28	1.803100300399119	-2.9922628747544664	0.0015629365939676355
29	1.8061537809071124	-2.993810299803573	0.0014502949681062403

These numbers are in close agreement with our expectation! Since the function is differentiable, we can know that $4 {(x - 2)}^{3} + 3 {(y + 3)}^{2} = 0$ is a local minimum and can simply check that this is true when $x equals 2$ and $y = - 3$ . This is a luxury you will rarely (if ever) have: to check the minimum of a function using its derivative.

Using Gradient Descent to Learn Parameters

In the previous example, you found the point (x, y) that minimized a function f. In this case, the data points (x, y) were your data - the inputs to your function. A more useful scenario is finding parameters for a function that minimize error relative to a data set. This means that instead of finding a point (x, y) that minimizes a function, you will find the parameters $α$ and $β$ that minimize a function f relative to a data set $left-parenthesis x 0 comma y 0 right-parenthesis comma left-parenthesis x 1 comma y 1 right-parenthesis comma period period period period comma left-parenthesis x Subscript upper N Baseline comma y Subscript upper N Baseline right-parenthesis$ .

Consider simple linear regression, where you want to learn the best parameters for $f (x) = β_{0} + β_{1} x_{1} + β_{2} x_{2}$ . Here, you aren’t minimizing this function itself - you are instead minimizing the function that calculates the error between the regression’s estimation and the observed data. For example, let’s use the Mean Squared Error as a starting point:

$E = \frac{1}{n} \sum_{i = 0}^{n} {(y_{i} - (β_{0} + β_{1} x_{1}^{i} + β_{2} x_{2}^{i}))}^{2}$

where $y_{i}$ is the true observed value.

Since you want to find the best parameters to match the data, you will take the gradient relative to the parameters themselves:

$\frac{\partial E}{\partial β_{0}} = \frac{- 2}{n} \sum_{i = 0}^{n} (y_{i} - β_{0} - β_{1} x_{1}^{i} - β_{2} x_{2}^{i})$

$\frac{\partial E}{\partial β_{1}} = \frac{- 2}{n} \sum_{i = 0}^{n} x_{1}^{i} (y_{i} - β_{0} - β_{1} x_{1}^{i} - β_{2} x_{2}^{i})$

$\frac{\partial E}{\partial β_{2}} = \frac{- 2}{n} \sum_{i = 0}^{n} x_{2}^{i} (y_{i} - β_{0} - β_{1} x_{1}^{i} - β_{2} x_{2}^{i})$

Then at each step, you update the parameters $beta 0 comma beta 1 comma beta 2$ using the now-familiar gradient descent:

$(β_{0}, β_{1}, β_{2}) \to (β_{0}, β_{1}, β_{2}) - γ (\frac{\partial E}{\partial β_{0}}, \frac{\partial E}{\partial β_{1}}, \frac{\partial E}{\partial β_{2}})$

import numpy as np

# Prepare a dataset
ideal_params = [4., 2., -3.]
dataset_size = 100  # assuming known dataset size

data_x = np.random.normal(size=[dataset_size, 2])
data_y = np.hstack([np.ones(dataset_size), data_x]) @ ideal_params

# Discover params via gradient descent
num_steps = 10

params = np.zeros(3)
for _ in range(num_steps):
    instance_level_grads = gradient(data_x, data_y) # n x 2 array
    df_dx = dp_mean(instance_level_grads[:, 0])
    df_dy = dp_mean(instance_level_grads[:, 1])

    params -= gamma * np.array([df_dx, df_dy])

Stochastic Gradient Descent

While gradient descent is a powerful tool for finding the minimum of a function, it has its limitations. Look back to the gradient calculation step in the previous chapter, and note that it requires iterating over the entire data set for each of the parameters. This is only practical for small data sets over a low number of dimensions. Otherwise, the process can become computationally burdensome.

Imagine you are trying to minimize the error over hundreds of parameters and a data set with millions of entries - this would quickly become untenable. For example, a problem with 100 parameters and a data set of 100,000 rows would mean 100 * 100,000 = 10,000,000 calculations to compute the gradient. Further, this step will often be carried out hundreds or thousands of times in order to converge to an approximation of the minimum. Even 100 steps would mean doing 100 * 10,000,000 = 1 billion calculations! We clearly need something more efficient.

Thankfully, there is a technique called stochastic gradient descent, where the gradient is approximated via a random shuffling of the data. Rather than iterate over the entire data set, this algorithm first randomly permutes the data set, then takes a random sampling of the permuted data. Gradient descent is then performed on this random sample, leading to an approximation of the true gradient. Consider the efficiency challenge mentioned in the previous paragraph - if you have a problem with 100 parameters and 100,000 data points, the total number of calculations needed becomes unreasonable. However, if you take a random sample of 1000 data points and approximate the gradient, your total number of calculations drops from 1 billion to (100 parameters) * 1000 (data points) * 100 (iterations) = 10 million, a substantial improvement in speed over regular gradient descent.

When using multiple small subsets of the data (often called batches), the process is called mini batch gradient descent. In mini batch gradient descent, each iteration involves splitting the data set into multiple small subsets and calculating the gradient for each batch. This technique lends itself easily to parallelization - imagine you have 10 minibatches to calculate gradients over. You can have each calculate in parallel, and then average the results together.

Privatizing the Gradient Descent Process

You are already familiar with the DP concepts necessary to privatize the gradient descent algorithm. In this case, you will just be using the DP mean on each component of the gradient to privatize the training process.

The following code snippet does exactly this - for each iteration of gradient descent, it takes the DP mean of the x component and the DP mean of the y component of the gradient. Then, instead of taking a step in the direction of the gradient, it takes a step in the direction of a vector of these DP means. Think of this like taking little random steps to the left and right as you walk down a hill - it will no longer be the fastest route from top to bottom, but predicting your path will be harder for someone trying to follow behind you.

Minimum Viable DP Gradient Descent

This section demonstrates a rudimentary DP gradient descent algorithm. The presentation here focuses on simplicity at the expense of utility. The next chapter will make adjustments to the algorithm that make it practically useful, by increasing the complexity.

Gradient descent iteratively updates each weight (parameter) in the network:

$w_{t + 1} = w_{t} - g r a d_{t} \cdot γ$

Where $w_{t}$ is the weight parameter after $t$ steps.

Based on this formula, each of the parameters in models trained with gradient descent can be decomposed into the sum of some initial random value $w 0$ and a set of updates from each training step: $w_{t} = w_{0} + \sum_{t} - g r a d_{t} \cdot γ$ . The training process can be made differentially private by replacing each $g r a d_{t}$ with $D P g r a d_{t}$ . This holds because of the closure of privacy guarantees under postprocessing: Since the initial random noise $w 0$ is chosen independently from the data, and each update is a postprocessing of a DP release, then $w_{t}$ is a postprocessing of multiple private gradient releases. The total privacy expenditure to train $w_{t}$ is the composition of the privacy losses used to release each $D P g r a d_{t}$ , for all gradient releases up to $t$ .

Now take a closer look at how to compute a single DP gradient. In each training step, the gradient is a dataset transformation from a batch of $k$ sensitive records to a batch of $k$ gradients, where each gradient corresponds to one record. Computing $D P g r a d$ is straightforward: compute the DP mean of this batch of gradients using techniques you are already familiar with.

import opendp.prelude as dp
import numpy as np

dp.enable_features("contrib")

data_size = 1_000
x = np.random.normal(loc=0, scale=2, size=data_size)
y = x * 2 + 3

w_slope, w_scale = 0.0, 0.0
learning_rate = 0.01
num_steps = 500

input_space = (
    dp.vector_domain(dp.atom_domain(T=float), size=data_size),
    dp.symmetric_distance(),
)
dp_mean = (
    input_space
    >> dp.t.then_clamp((-1.0, 1.0))
    >> dp.t.then_mean()
    >> dp.m.then_laplace(scale=0.5)
)
per_query_epsilon = dp_mean.map(2)

for i in range(num_steps):
    y_pred = w_slope * x + w_scale
    error = y_pred - y
    w_slope -= learning_rate * dp_mean(2 * x * error)
    w_scale -= learning_rate * dp_mean(2 * error)

print("params:", w_slope, w_scale)
print("total epsilon:", per_query_epsilon * 2 * num_steps)

Unfortunately, implementing this in practice with modern deep learning frameworks is more difficult because they typically do not provide a way to access the gradients corresponding to each individual input (before aggregation). Access to this more granular gradient is necessary for the DP mean because the DP mean needs to clip the gradient before aggregating in order to keep the sensitivity finite. This is one of the motivations for bespoke libraries for differentially private training, like Tensorflow-Privacy and Opacus.

Private Aggregations of Teacher Ensembles

Private aggregations of teacher ensembles (PATE) is a framework to transform any supervised machine learning classification model into a differentially private classification. The PATE framework trains a model on sensitive data without revealing the data itself. There are three main components to PATE:

an ensemble of multiple teacher models
an aggregation method
a student model

PATE involves training multiple models (called teachers) on the same task, and then combining their predictions such that the prediction output is differentially private. The PATE framework starts by dividing the dataset that is available for model training into several data partitions. Each individual data partition is used as a training dataset in a supervised machine learning task. These trained models are called teacher models. When making prediction about an instance x, PATE does the following:

For each trained model, get the predictions for the instance x.
Count the number of teacher models that voted for each class, creating a histogram of votes for each class.
Apply the Laplace or Gaussian Mechanism to the histogram created in the previous step.
The output is selected by choosing the class with the highest number of votes from the DP-histogram generated in the previous step.

The following diagram illustrates these steps: the data is partitioned into disjoint subsets, and each is sent to train a different teacher model.

Differentially private classification with PATE framework

First, multiple models are trained on different disjoint subsets of the data - these are called teacher models. These teacher models are combined via an aggregation, and then a student model is trained via knowledge transfer. The teacher models are never released, the only result of the process is the student model, which never encounters the data directly. Further, the aggregation of the teacher models incorporates calibrated Laplacian noise to the votes of the teacher models in order to guarantee that the results are DP.

The following diagram shows the the prediction process for PATE: a data point is sent to multiple teacher models, and each predicts a label. Those labels form a histogram of votes, which is then privatized (DP-votes).

PATE, as described so far, can be used as a differentially private API. Differential privacy guarantees come from the fact that each label is predicted by the noisy aggregation mechanism. Notice that the addition or removal of an individual from the dataset would affect the output of only one of the teacher models. This means that, by applying a differential privacy mechanism to the counts of votes, we guarantee that the output of the prediction is differentially private.

Using PATE to train a differentially private model

The PATE framework is easy to implement and provides differential privacy during inference for any machine learning technique you apply it to. However, there is one significant limitation: each prediction made spends part of the total privacy budget. This means that there is a limit to how many times this model can be used to make predictions.

This problem can be solved by creating a student model that learns from the teacher models predictions. Since the student model is post-processing the differentially private data, the final student model is differentially private, and can be used as many times as desired without affecting the privacy budget.

Let’s assume the availability of a private labelled dataset and a public unlabelled dataset. The student model is trained as follows:

Label the public unlabelled dataset utilizing PATE framework, trained on the private labelled dataset.
Use the dataset labelled in the previous step to train a student model.

The following diagram shows this process: the public unlabeled data is sent to the teacher models, which generate a DP histogram of labels. From this histogram, a label is added to the data and sent to train the student model.

Differentially private learning with PATE framework

The student model can be deployed to respond to any prediction queries from end users. Most importantly, the private data and teacher models can safely be discarded. Notice that the overall privacy budget is now fixed to a constant value once the student has completed its training. An additional important characteristic of the student model is that an adversary with access to the student’s model weights and parameters could in the worst-case only recover the labels on which the student model was trained. The labels in this case are all differentially private, which is guaranteed by the noisy aggregation mechanism.

In this chapter, you have learned about approaches for building differentially private machine learning models. This often involves modifying existing models, in particular, transforming linear regression and stochastic gradient descent algorithms to be differentially private. Another approach, highlighted in this chapter, is PATE, a general framework for transforming any learning task into a differentially private learning task. On the implementation side, you’ve learned about Opacus as a tool for implementing DP-SGD and PATE.

Now that you are familiar with the basics of DP machine learning, you are ready to learn about the nuances of implementing DP machine learning models in the real world. You will need to take a variety of factors into account during the model training process. In the next chapter, you will learn about budget composition, batching and hyperparameter tuning. The next chapter also provides several code examples for the techniques introduced in this chapter. By the end of next chapter, you will be able to train your own differentially private learning model.

Exercises

a. Given the function $f left-parenthesis x right-parenthesis equals x squared$ , and the starting point x = 10, use gradient descent to find the minimum.
1. How does the speed of the program change with changes in gamma?
a. Starting with the mean squared error as for linear regression, calculate the gradient.
1. Generate linear data with noise added to it (via numpy or a similar package). Write code that uses the gradient you just calculated to find the minimum of the error function.
Implement a DP version of simple linear regression using the Theil-Sen mechanism. How do your model parameters compare to a non-DP implementation?

¹ R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership Inference Attacks against Machine Learning Models.” arXiv, Mar. 31, 2017. Accessed: Jun. 09, 2023. Available: http://arxiv.org/abs/1610.05820

² For more information on attacks, see Ch. 10 (Chapter 10)

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
7. Machine Learning and Differential Privacy

Chapter 7. Machine Learning and Differential Privacy

Supervised vs. Unsupervised Learning

Simple Linear Regression

DP Approach 1: Sufficient Statistics

Multivariate Regression

DP Approach 2: Theil-Sen Estimator

Gradient Descent

Using Gradient Descent to Learn Parameters

Stochastic Gradient Descent

Privatizing the Gradient Descent Process

Minimum Viable DP Gradient Descent

Private Aggregations of Teacher Ensembles

Figure 7-1. Differentially private classification with PATE framework.

Figure 7-2. Differentially private classification with PATE framework.

Using PATE to train a differentially private model

Figure 7-3. Differentially private learning with PATE framework for classification.

Exercises

Table of Contents for 7. Machine Learning and Differential Privacy

Create new playlist

Sign In

Sign Up

Chapter 7. Machine Learning and Differential Privacy

Supervised vs. Unsupervised Learning

Simple Linear Regression

DP Approach 1: Sufficient Statistics

Multivariate Regression

DP Approach 2: Theil-Sen Estimator

Gradient Descent

Using Gradient Descent to Learn Parameters

Stochastic Gradient Descent

Privatizing the Gradient Descent Process

Minimum Viable DP Gradient Descent

Private Aggregations of Teacher Ensembles

Figure 7-1. Differentially private classification with PATE framework.

Figure 7-2. Differentially private classification with PATE framework.

Using PATE to train a differentially private model

Figure 7-3. Differentially private learning with PATE framework for classification.

Exercises

Table of Contents for
7. Machine Learning and Differential Privacy