In the previous chapter, you read that machine learning is a process for inferring patterns about data. The process of fitting a model to data is called training. When you train a model, you find parameters that should optimally predict outcomes based on the given data. Sometimes, the training process can leak information about the training data, even if you never see the data directly.
In this chapter, you will learn about ways to make DP machine learning more useful while still preserving privacy. You will also learn about SGD in greater detail, including more ways that it can potentially leak sensitive information. One important outcome of this discussion is the introduction of alternative formulations of differential privacy.
The chapter ends with a discussion and examples of frameworks and tools that will help you create DP machine learning models.
The previous chapter gave a minimally-functional DP gradient descent algorithm using primitives you are already familiar with. Indeed, it is possible to privately train neural networks with simple, scalar DP sums that utilize Laplace noise and basic composition. Unfortunately, this is incredibly inefficient! This section shows how to make key adjustments to the algorithm that make it practically useful.
Instead of releasing each scalar partial individually, you can instead release all gradients together as part of one vector-valued query.
In a machine learning context, recall that the computation of the gradient is a row-by-row transformation.
The resulting dataset consists of instance-level gradients,
where each row corresponds to one training example,
and each column holds the partial derivatives with respect to one of the parameters in your model.
That is, if you had a batch size of B and a model with P parameters, the instance-level gradient dataset would be a matrix with B rows and P columns.
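To make the shape concrete, here is a minimal sketch (with a hypothetical linear model under squared-error loss) of computing instance-level gradients with NumPy:

import numpy as np

B, P = 4, 3                          # batch size and number of parameters
X = np.random.normal(size=(B, P))    # features, one row per training example
y = np.random.normal(size=B)         # targets
theta = np.zeros(P)                  # current model parameters

residuals = X @ theta - y            # prediction error of each example, shape (B,)

# gradient of (x @ theta - y)^2 with respect to theta, one row per example
instance_gradients = 2 * residuals[:, None] * X
print(instance_gradients.shape)      # (4, 3): one row per example, one column per parameter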
If you applied the tools covered thus far in this book, you might first clamp each column individually,
then compute the sum of each column, and add noise to each sum.
In each training step, the privacy budget is divided among all of these per-parameter queries, so the noise added to each sum grows with the number of parameters.
Shouldn’t there be a more efficient way to do this? Luckily, there is. It is more efficient to compute a differentially private estimate of all of these sums at once. The strategy is to first clamp the norm of each row, and then sum over the row axis. In the case where one individual can contribute at most one row, then the distance between any two sum vectors on neighboring datasets can be no greater than the clamping norm.
import numpy as np

def clamp(X, norm_bound):
    # scale down any row whose norm exceeds norm_bound
    row_norms = np.linalg.norm(X, axis=1)
    return X / np.maximum(1.0, row_norms / norm_bound)[:, None]

# mock dataset of instance-level gradients
num_rows, num_params = 1_000, 10
gradients = np.random.normal(size=(num_rows, num_params))

# clamp the row norms
gradients = clamp(gradients, norm_bound=1.0)

# compute the vector-valued sum: one sum per parameter
gradients = np.sum(gradients, axis=0)
Now that you have a sum vector with sensitivity bounded by the clamping norm, privatize it by adding noise. The Laplace distribution naturally supports vector queries for which the sensitivity is expressed via an L1 norm, and the Gaussian distribution naturally supports vector queries for which the sensitivity is expressed via an L2 norm. These distributions inform our choice of clamping norm.
Let's take a closer look at the vector-valued Laplace mechanism, which adds i.i.d. Laplace noise (with scale $b$) to each element of the sum vector. The probability density function of the vector-Laplace distribution is the product of the densities of each component:

$$p(x) = \prod_{i=1}^{d} \frac{1}{2b} \exp\left(-\frac{|x_i - \mu_i|}{b}\right) = \left(\frac{1}{2b}\right)^d \exp\left(-\frac{\|x - \mu\|_1}{b}\right)$$

In this equation, $\mu$ is the exact sum vector being privatized, $d$ is its length, and $b$ is the noise scale. This boils down to nearly the same form as the scalar Laplace mechanism, but this time the sensitivity is measured in terms of the L1-norm: choosing $b = \Delta_1 / \epsilon$, where $\Delta_1$ bounds the L1 distance between sum vectors on neighboring datasets, yields an $\epsilon$-DP release:
from opendp.measurements import make_base_laplace

# b is the noise scale; for an epsilon-DP release, b = L1 sensitivity / epsilon
# (here assuming an L1 clamping norm of 1.0 and epsilon = 1.0)
b = 1.0
vec_lap_mech = make_base_laplace(b, D="VectorDomain<AllDomain<float>>")
private_gradients = vec_lap_mech(gradients)
In practice, a number of small variations are made on this. While a single instance-level gradient matrix was useful for the analysis, real-world implementations don't construct one: norms may be computed piecewise, and noise is added directly to each gradient matrix in your model without flattening or concatenation.
It is also typical to use Gaussian noise, since the L2 norm is more permissive than the L1 norm. The privacy analysis follows in a manner similar to the Laplace derivation.
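For instance, here is a minimal sketch of a Gaussian-noise release of the gradient sum, assuming the classical Gaussian mechanism calibration $\sigma = \Delta_2 \sqrt{2 \ln(1.25/\delta)} / \epsilon$ (valid for $\epsilon \le 1$); tighter calibrations exist, but this conveys the idea:

import numpy as np

def gaussian_sum_release(gradients, norm_bound, epsilon, delta):
    # clamp each row to an L2 norm of at most norm_bound (as in the clamp function above)
    row_norms = np.linalg.norm(gradients, axis=1)
    clamped = gradients / np.maximum(1.0, row_norms / norm_bound)[:, None]

    # when each individual contributes at most one row, the L2 sensitivity
    # of the sum is the clamping norm
    sensitivity = norm_bound

    # classical Gaussian mechanism calibration (assumed; valid for epsilon <= 1)
    sigma = sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon

    return clamped.sum(axis=0) + np.random.normal(scale=sigma, size=clamped.shape[1])

# example usage on a mock instance-level gradient matrix
noisy_sum = gaussian_sum_release(np.random.normal(size=(100, 10)),
                                 norm_bound=1.0, epsilon=1.0, delta=1e-6)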
Differentially private gradient descent involves repeated releases of noisy gradients. Assuming each release consumes the same privacy loss parameters, the tools for composition covered thus far account for a privacy loss that scales linearly in the number of releases.
This can add up quickly: consider an SGD scenario with 1,000 steps, each involving a DP release.
You would only be able to spend a thousandth of your overall privacy budget on each release, leaving each noisy gradient with very little signal.
Practitioners instead use an alternative measure of privacy
that more tightly characterizes the privacy loss when making DP releases with Gaussian noise variates.
A differentially private release can be made in terms of privacy loss parameters in any parameter space,
so long as the measure of privacy provides immunity from post-processing.
The task at hand is to find a parameter space for which the basic linear composition is more favorable,
when translated back to the familiar (ε, δ) privacy loss parameters.
A common choice of privacy measure is based on the Renyi divergence.
Given two probability distributions $P$ and $Q$, the Renyi divergence of order $\alpha > 1$ is defined as

$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^\alpha\right].$$

Using this definition, we can construct an alternative form of differential privacy called Renyi differential privacy (RDP). This formulation uses the Renyi divergence to measure the distance between the outputs of a randomized mechanism on adjacent datasets.
A randomized mechanism $M$ satisfies $(\alpha, \bar\epsilon)$-RDP if, for every pair of adjacent datasets $x$ and $x'$,

$$D_\alpha(M(x) \| M(x')) \le \bar\epsilon.$$

This is a generalization of the differential privacy that you've learned so far in this book.
Consider what happens as $\alpha \to \infty$: the Renyi divergence approaches the max divergence $\max_y \log \frac{\Pr[M(x) = y]}{\Pr[M(x') = y]}$, and bounding that quantity by $\epsilon$ is equivalent to the definition of differential privacy we have been using.
Any mechanism that is $(\alpha, \bar\epsilon)$-RDP is also $(\epsilon, \delta)$-differentially private for any $\delta > 0$, with $\epsilon = \bar\epsilon + \frac{\log(1/\delta)}{\alpha - 1}$.
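As a rough sketch of why this helps, assume two standard facts: the Gaussian mechanism with noise standard deviation $\sigma \cdot \Delta_2$ satisfies $(\alpha, \alpha / (2\sigma^2))$-RDP for every $\alpha > 1$, and RDP of a fixed order composes additively across releases. Under those assumptions (and ignoring privacy amplification by subsampling, which practical accountants also exploit), a hypothetical accountant for repeated Gaussian releases might look like this:

import numpy as np

def rdp_to_dp(alpha, rdp_epsilon, delta):
    # standard conversion: (alpha, rdp_epsilon)-RDP implies
    # (rdp_epsilon + log(1/delta) / (alpha - 1), delta)-DP
    return rdp_epsilon + np.log(1 / delta) / (alpha - 1)

def gaussian_sgd_epsilon(noise_multiplier, steps, delta, alphas=range(2, 64)):
    # each Gaussian release with noise scale noise_multiplier * sensitivity
    # satisfies (alpha, alpha / (2 * noise_multiplier**2))-RDP, and RDP of a
    # fixed order composes additively over the steps
    best = float("inf")
    for alpha in alphas:
        rdp = steps * alpha / (2 * noise_multiplier ** 2)
        best = min(best, rdp_to_dp(alpha, rdp, delta))
    return best

# 1,000 releases with noise multiplier 10.0 and delta = 1e-6
print(gaussian_sgd_epsilon(noise_multiplier=10.0, steps=1000, delta=1e-6))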
In most practical applications, each step of gradient descent is performed on a small subset of the data. This is called mini-batching.
At the same time, the privacy analysis may be improved by making releases on random subsets of the data. This is called privacy amplification by subsampling.
The intuition is that the set of individuals in the mini-batch is unknown, meaning that each individual may only influence the release with a reduced probability.
Those two concepts complement each other to improve the utility of DP gradient descent.
The most common approach for privacy amplification is to use Poisson sampling.
To take a Poisson sample, iterate through each record in your dataset and, with some probability p, include it in the mini-batch.
Each row in the dataset then has only probability p of appearing in any given mini-batch, so any one individual influences a given release with correspondingly lower probability, which amplifies the privacy guarantee.
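As a minimal sketch (the sampling rate and array shape here are arbitrary), Poisson sampling can be implemented with one draw of independent Bernoulli variables:

import numpy as np

def poisson_sample(data, sampling_rate):
    # include each row independently with probability sampling_rate;
    # the size of the resulting mini-batch is itself random
    mask = np.random.uniform(size=len(data)) < sampling_rate
    return data[mask]

data = np.random.normal(size=(1_000, 5))
mini_batch = poisson_sample(data, sampling_rate=0.01)  # about 10 rows on average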
In the previous chapter, you learned that training a machine learning model means minimizing the error of an estimate
relative to your data.
This minimization happens incrementally: you estimate the parameters of the model, and then take a step in the direction
of the steepest change.
The size of this step is called the learning rate.
The learning rate is a hyperparameter: a value that affects the model training process but does not appear as a parameter in the trained model.
As you might suspect, the choice of hyperparameters is an important step of the model training process. You will often want to perform hyperparameter optimization to find the optimal set of hyperparameters to train the desired model.
In a differentially private setting, the hyperparameters themselves can leak information about the data. This means that you may not always be able to do hyperparameter optimization the way you would on a non-private dataset. One approach for this situation is to find a non-private dataset that is likely to be structurally similar to the private data in question. You can then optimize the hyperparameters on the public dataset as an estimate of the ideal hyperparameters for the private dataset.
For example, DPTheilSen requires two hyperparameters: the upper and lower bounds of the data. If you were to grid search over possible bounds on the private data and release only the best-performing model, the chosen bounds would likely reveal the smallest and largest values in the data, yet the hyperparameters used to train the model would not be accounted for in the privacy calculus.
If you have a public dataset with similar distributional properties as your private dataset, the easiest way to combat the challenges of selecting hyperparameters is to use the public dataset to inform your choice of hyperparameters for the DP algorithm.
Unfortunately, this doesn't always apply! In the case of DP-SGD, even when the public data has the same distribution, a learning rate that works well for non-private SGD is not necessarily a useful learning rate for DP-SGD.
Ideally, you would want to try several different hyperparameters, and only release the model with the best score. Unfortunately, the naive privacy analysis is incredibly unforgiving. You could view the selection of a single model as postprocessing, so the overall privacy budget is the composition of the privacy budgets used to train all models.
A tighter privacy analysis can be conducted in this case using private selection from a set of private candidates.
The setup for this involves creating a function (a queryable) Q(D) that, each time it is called, produces one candidate as a (score, value) pair, where the value is a DP-trained model and the score is a DP estimate of its utility.
The tighter privacy guarantee comes from repeatedly calling this function until either a utility threshold is met or the algorithm stops at random.
In this first example, the algorithm repeatedly draws private samples from Q(D), keeps the best candidate seen so far, and stops at random:
def private_selection_random_stop(stop_probability):
    assert 0 < stop_probability < 1
    # queryable is Q(D), a python generator of (score, value) pairs
    queryable = queryable_builder(dataset)
    best_score = -float("inf")
    best_y = None
    while True:
        score, y = next(queryable)
        if score > best_score:
            best_score = score
            best_y = y
        # bernoulli(p) is assumed to return True with probability p
        if bernoulli(stop_probability):
            return best_score, best_y
When it costs ε to draw a single sample from Q(D), this random-stopping algorithm releases the best candidate for a total cost of roughly 3ε, rather than composing the cost of every candidate drawn.
A modification of this algorithm can give a tighter privacy bound, if you set a threshold for the utility score beforehand:
def private_selection_threshold(stop_probability, threshold, epsilon_selection, steps=None):
    assert 0 < stop_probability < 1
    min_steps = int(np.ceil(max(
        np.log(2 / epsilon_selection) / stop_probability,
        1 + 1 / np.exp(stop_probability))))
    steps = steps or min_steps
    assert steps >= min_steps

    queryable = queryable_builder(dataset)
    for _ in range(steps):
        score, *y = next(queryable)
        if score >= threshold:
            return score, *y
        if bernoulli(stop_probability):
            return
This algorithm breaks the privacy consumption into two parameters: the per-candidate cost ε, paid each time a sample is drawn from Q(D), and ε_selection, a small additional budget that accounts for the threshold-based stopping rule.
When applied in the machine learning context, these algorithms are very useful for privately selecting among privately trained models. They also provide robustness against poor initial choices of hyperparameters that may cause a model to fail to converge.
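As a hypothetical sketch of how this fits together (train_dp_model, dp_accuracy, and private_dataset are assumed stand-ins for your own DP training routine, DP utility estimate, and data), the queryable could wrap one DP training run per draw:

import random

dataset = private_dataset  # the sensitive data used by private_selection_random_stop

def queryable_builder(data):
    # each draw trains one DP model with randomly chosen hyperparameters
    # and yields a (DP utility score, trained model) pair
    hyperparameter_grid = [
        {"learning_rate": 0.05, "clipping_norm": 1.0},
        {"learning_rate": 0.5, "clipping_norm": 0.5},
    ]
    while True:
        params = random.choice(hyperparameter_grid)
        model = train_dp_model(data, **params)  # assumed to satisfy epsilon-DP
        score = dp_accuracy(model, data)        # assumed DP estimate of utility
        yield score, model

# releases the best candidate at a total cost of roughly a small constant
# multiple of the per-candidate budget, rather than composing over all draws
best_score, best_model = private_selection_random_stop(stop_probability=0.1)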
We start from the UCI Adult dataset, used in previous chapters, to train a differentially private tabular data classification model, and we walk through two frameworks for model privatization: DP-SGD and PATE. The UCI Adult dataset is a widely used dataset that contains demographic information about individuals, such as age, education, and occupation, as well as their income level (whether they make more or less than $50,000 per year). The dataset was compiled from the 1994 US Census Bureau data and contains over 32,000 instances.
One of the main advantages of using the UCI Adult dataset as a benchmark dataset is its widespread use in the machine learning community. The UCI Adult dataset is relevant to real-world applications, especially in scenarios where privacy is a concern. The dataset contains information about income, which is an important factor in many decision-making processes, such as credit approvals or hiring decisions.
We will start by training a model to predict whether an individual has a high or a low income. Once we go through the non-private training process, we show how to modify the training process to make it differentially private. We will walk through the transformation from a non-private model to a differentially private model using two distinct frameworks:
The first framework uses DP-SGD as the optimization function in a neural network architecture. Opacus is a library that provides DP-SGD implementations and can be used seamlessly with PyTorch. In our example, we will show how to transform a neural network model implemented in PyTorch into a differentially private neural network using Opacus.
The second framework is PATE.
In this example, we will utilize pandas, scikit-learn, and NumPy in addition to PyTorch and Opacus for tabular data classification.
To install the necessary libraries, run the following command:

pip install numpy pandas opacus scikit-learn torch
To import all necessary libraries, add the following to your Python file or Jupyter notebook:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
To verify the available devices, run the following command.
>>> import torch
>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> device
device(type='cpu')
We consider the description of the UCI Adult dataset to be public knowledge. If a dataset description is considered public knowledge, this means not only that the metadata of this dataset is publicly available, but also that every possible neighboring dataset has the same metadata description.
Because the data description is public, no privacy-motivated preprocessing steps, such as clamping, are necessary. The only preprocessing required is what is needed to feed the data to the PyTorch model: encoding the categorical features and scaling the numerical values. In addition to encoding and scaling, the following code also partitions the dataset into training data and testing data:
# Load the Adult dataset
header = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
          'marital_status', 'occupation', 'relationship', 'race', 'sex',
          'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
          'income']

train_data = pd.read_csv('adult.data', header=None, names=header,
                         sep=r',\s*', na_values=['?'], engine='python')
test_data = pd.read_csv('adult.test', header=None, names=header,
                        sep=r',\s*', na_values=['?'], skiprows=1, engine='python')

# Preprocess the data
data = pd.concat([train_data, test_data], ignore_index=True)
data = data.dropna()
data = data.reset_index(drop=True)

categorical_columns = ['workclass', 'education', 'marital_status', 'occupation',
                       'relationship', 'race', 'sex', 'native_country']
numerical_columns = ['age', 'fnlwgt', 'education_num', 'capital_gain',
                     'capital_loss', 'hours_per_week']

# Encode categorical features
for column in categorical_columns:
    encoder = LabelEncoder()
    data[column] = encoder.fit_transform(data[column])

# Normalize numerical features
for column in numerical_columns:
    scaler = StandardScaler()
    data[column] = scaler.fit_transform(data[column].values.reshape(-1, 1))

# Split data into input (X) and output (y)
X = data.drop(columns=['income'])
y = data['income'].apply(lambda x: 1 if x == '>50K' else 0)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)
This next part of the code defines the model architecture for data classification.
This model architecture is a simple neural network classifier for
predicting whether an adult’s income is low or high.
It takes in an input tensor with size input_size
and passes it through a fully
connected layer (fc1) with a ReLU activation function.
The output of fc1 is then passed through another fully connected layer
(fc2) with a sigmoid activation function.
The final output of the model is a probability estimate for the input being in the positive class
(high income):
# Define the model
class AdultClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(AdultClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x
The next part of the code is training and evaluating the binary classification model
for predicting whether an adult’s income is low or high.
The first few lines of code set the model parameters,
including the input size (which is inferred from the shape of the input data X),
the size of the hidden fully connected layer (hidden_size),
and the output size (which is always 1 for binary classification).
Next, the code defines the loss function (nn.BCELoss())
and the optimizer (torch.optim.SGD())
for training the model.
The model is then trained for a specified number of epochs
(num_epochs
) using mini-batch gradient descent.
The training loop iterates over the training data in batches of size batch_size
,
computes the forward and backward pass through the model,
and updates the model parameters using the optimizer.
The loss is printed every 10 epochs for monitoring the training progress.
After training, the model is evaluated on the test data:
# Set the model parameters
input_size = X.shape[1]
hidden_size = 64
output_size = 1
model = AdultClassifier(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Train the model
num_epochs = 100
batch_size = 128

for epoch in range(num_epochs):
    for i in range(0, len(X_train), batch_size):
        X_batch = X_train_tensor[i:i + batch_size]
        y_batch = y_train_tensor[i:i + batch_size]

        # Forward pass
        output = model(X_batch)
        loss = criterion(output, y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    test_output = model(X_test_tensor)
    test_output = (test_output > 0.5).float()
    accuracy = torch.sum(test_output == y_test_tensor).item() / len(y_test_tensor)
    print(f'Accuracy: {accuracy * 100:.2f}%')
Running the preceding code produces the following output:
Epoch [10/100], Loss: 0.3022
Epoch [20/100], Loss: 0.3027
Epoch [30/100], Loss: 0.2929
Epoch [40/100], Loss: 0.2831
Epoch [50/100], Loss: 0.2779
Epoch [60/100], Loss: 0.2765
Epoch [70/100], Loss: 0.2749
Epoch [80/100], Loss: 0.2750
Epoch [90/100], Loss: 0.2748
Epoch [100/100], Loss: 0.2744
Accuracy: 84.57%
It is well known that adding differential privacy to the training process usually reduces the model's utility. In the non-private model training, we obtained an accuracy of 84.57%. In the next example, we will make small modifications to the code just discussed to transform it into a differentially private model.
The main difference between the non-private model above and the differentially private model presented next is that we use DP-SGD as the optimization function. The Opacus library's PrivacyEngine provides the necessary mechanisms and transformations for making SGD differentially private.
The data preprocessing steps and model definition remain identical to the non-DP version of model training:
from opacus import PrivacyEngine

# Set the model parameters
input_size = X.shape[1]
hidden_size = 64
output_size = 1
model = AdultClassifier(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
So up to this point the process of defining model architecture, loss function, and optimizer is identical to the non-private version of the model.
The following code creates the privacy engine, specifying the model to be privatized and the privacy loss parameters to use, and attaches it to the optimizer:
# Attach a differential privacy engine to the optimizer
privacy_engine = PrivacyEngine(
    model,
    batch_size=32,
    alphas=range(2, 32),
    noise_multiplier=0.8,
    max_grad_norm=1.0,
)
privacy_engine.attach(optimizer)
In this specific example, we define the model, batch_size, alphas, noise_multiplier, and max_grad_norm parameters for the PrivacyEngine class.
The PrivacyEngine also accepts other parameters. From the Opacus documentation, the following parameters are available for defining the differentially private learning process:

model: PyTorch model to be used for training

optimizer: Optimizer to be used for training

noise_multiplier (float): The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (how much noise to add)

max_grad_norm (Union[float, List[float]]): The maximum norm of the per-sample gradients. Any gradient with a norm higher than this will be clipped to this value.

batch_first (bool): Flag to indicate whether the input tensor to the corresponding module has the batch as its first dimension. If set to True, dimensions of the input tensor are expected to be [batch_size, ...], otherwise [K, batch_size, ...]

loss_reduction (str): Indicates whether the loss reduction (for aggregating the gradients) is a sum or a mean operation. Can take the values "sum" or "mean"

poisson_sampling (bool): True if you want to use the standard sampling required for DP guarantees. Setting this to False will leave the provided data_loader unchanged. Technically this doesn't fit the assumptions made by the privacy accounting mechanism, but it can be a good approximation when using Poisson sampling is unfeasible.

clipping (str): Per-sample gradient clipping mechanism ("flat", "per_layer", or "adaptive"). Flat clipping calculates the norm of the entire gradient over all parameters, per-layer clipping sets individual norms for every parameter tensor, and adaptive clipping updates the clipping bound per iteration. Flat clipping is usually preferred, but using per-layer clipping in combination with distributed training can provide notable performance gains.

alphas: The Renyi differential privacy orders the privacy engine should use for tracking privacy expenditure
Once the privacy engine is defined, the training and evaluation process proceeds as follows:
for epoch in range(num_epochs):
    delta = 1e-5
    for i in range(0, len(X_train), batch_size):
        X_batch = X_train_tensor[i:i + batch_size]
        y_batch = y_train_tensor[i:i + batch_size]

        # Forward pass
        output = model(X_batch)
        loss = criterion(output, y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    epsilon, best_alpha = optimizer.privacy_engine.get_privacy_spent(delta)

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    test_output = model(X_test_tensor)
    test_output = (test_output > 0.5).float()
    accuracy = torch.sum(test_output == y_test_tensor).item() / len(y_test_tensor)
    print(f'Accuracy: {accuracy * 100:.2f}%')
Running the differentially private training code produces the following output:

Epoch [10/100], Loss: 0.9130
Epoch [20/100], Loss: 0.8918
Epoch [30/100], Loss: 0.8797
Epoch [40/100], Loss: 0.8781
Epoch [50/100], Loss: 0.8845
Epoch [60/100], Loss: 0.8789
Epoch [70/100], Loss: 0.8830
Epoch [80/100], Loss: 0.8801
Epoch [90/100], Loss: 0.8799
Epoch [100/100], Loss: 0.8669
Accuracy: 83.08%
The accuracy of the differentially private model is 83.08%, which is not a significant reduction when we compare with the non-private model accuracy of 84.57%.
When defining parameters, one important parameter to be aware of is the batch size: the peak memory requirement is proportional to the batch size. The preceding training loop uses a batch size of 128 (note that the batch_size reported to the PrivacyEngine should match the batch size actually used during training); depending on the data and available memory, smaller batch sizes may be necessary.
Let's now train a differentially private model using the PATE framework, this time for digit classification on the MNIST dataset. Walking through the full training process a second time will allow a comparative analysis of the two frameworks.
First, load the dataset. As seen in Chapter 7, the PATE framework utilizes two distinct datasets when training a DP model: a private dataset that trains the teacher models, and a public dataset that trains the student model.
The public dataset is labeled by a DP aggregation of the teacher models' predictions and is then used to train the student model.
The following example loads the data as train and test splits; the train split serves as the private dataset of the framework, and the test split serves as the public dataset:
import torch
from torchvision import datasets, transforms
from torch.utils.data import Subset

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))])

train_data = datasets.MNIST('../mnist', train=True, transform=transform,
                            target_transform=None, download=True)
test_data = datasets.MNIST('../mnist', train=False, transform=transform,
                           target_transform=None, download=True)

num_teachers = 20
batch_size = 128

def get_data_loaders(train_data, num_teachers):
    """ Function to create data loaders for the Teacher classifier """
    teacher_loaders = []
    data_size = len(train_data) // num_teachers
    # one disjoint partition (and one data loader) per teacher
    for i in range(num_teachers):
        indices = list(range(i * data_size, (i + 1) * data_size))
        subset_data = Subset(train_data, indices)
        loader = torch.utils.data.DataLoader(subset_data, batch_size=batch_size)
        teacher_loaders.append(loader)
    return teacher_loaders

teacher_loaders = get_data_loaders(train_data, num_teachers)
As seen in that code, the first step is to partition train_data into one subset per teacher; these partitions are used to train the teacher models. Next, the MNIST test data is split into training and test sets for the student model:
student_train_data = Subset(test_data, list(range(9000)))
student_test_data = Subset(test_data, list(range(9000, 10000)))

student_train_loader = torch.utils.data.DataLoader(student_train_data, batch_size=128)
student_test_loader = torch.utils.data.DataLoader(student_test_data, batch_size=128)
The MNIST test data will be used to train and test the student models. For this experiment, the data split is 90% train, 10% test.
Defining the model architecture is the next step. The same architecture is used in two parts of the framework: training the teacher models and training the student model. The following code defines a small convolutional network suited to classifying MNIST digits, along with its optimizer:
model = torch.nn.Sequential(
    torch.nn.Conv2d(1, 16, 8, 2, padding=3),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2, 1),
    torch.nn.Conv2d(16, 32, 4, 2),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(2, 1),
    torch.nn.Flatten(),
    torch.nn.Linear(32 * 4 * 4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10))

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
Next, the process involves training the teacher models and scoring the student training data with the teacher models.
import copy

def train(model, trainloader, criterion, optimizer, epochs=10):
    running_loss = 0
    for e in range(epochs):
        model.train()
        for images, labels in trainloader:
            optimizer.zero_grad()
            output = model.forward(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

def predict(model, dataloader):
    outputs = torch.zeros(0, dtype=torch.long)
    model.eval()
    for images, labels in dataloader:
        output = model.forward(images)
        ps = torch.argmax(torch.exp(output), dim=1)
        outputs = torch.cat((outputs, ps))
    return outputs

def train_models(num_teachers, model):
    models = []
    for i in range(num_teachers):
        # each teacher gets its own copy of the model, trained only on its partition
        teacher = copy.deepcopy(model)
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(teacher.parameters(), lr=0.05)
        train(teacher, teacher_loaders[i], criterion, optimizer)
        models.append(teacher)
    return models

models = train_models(num_teachers, model)
The first step to transform a machine learning model into a differentially private machine learning model with PATE is to evaluate the student training data with the teacher models and aggregate the results using a differential privacy mechanism.
In the next bit of code, the predictions from the teacher models are aggregated using the Laplace mechanism: for each student training example, the teachers' votes are tallied per label, Laplace noise is added to each count, and the label with the largest noisy count is released:
import numpy as np

epsilon = 0.2

def aggregated_teacher(models, dataloader, epsilon):
    # one row of predictions per teacher, one column per student training example
    preds = torch.zeros((len(models), 9000), dtype=torch.long)
    for i, model in enumerate(models):
        results = predict(model, dataloader)
        preds[i] = results

    labels = np.array([]).astype(int)
    for image_preds in np.transpose(preds):
        # tally the teacher votes for each of the 10 digit classes
        label_counts = np.bincount(image_preds, minlength=10).astype(float)
        beta = 1 / epsilon
        for i in range(len(label_counts)):
            label_counts[i] += np.random.laplace(0, beta, 1)
        # the label with the largest noisy count becomes the student label
        new_label = np.argmax(label_counts)
        labels = np.append(labels, new_label)

    return preds.numpy(), labels

teacher_models = models
preds, student_labels = aggregated_teacher(teacher_models, student_train_loader, epsilon)
In the final part of the code, a student model is trained using the student training data. Notice that because the student data now has differentially private labels, the resulting student model is differentially private. Chapter 3 discusses an important characteristic of differential privacy: post-processing immunity.
Training a model on the differentially private labels can be seen as post-processing of a differentially private release, and this is what guarantees the differential privacy of the student model.
The code below illustrates the process of training the student model.
def student_loader(student_train_loader, labels, model):
    for i, (data, _) in enumerate(iter(student_train_loader)):
        yield data, torch.from_numpy(labels[i * len(data): (i + 1) * len(data)])

student_model = model
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
epochs = 10
steps = 0
running_loss = 0

for e in range(epochs):
    student_model.train()
    train_loader = student_loader(student_train_loader, student_labels, model)
    for images, labels in train_loader:
        steps += 1

        optimizer.zero_grad()
        output = student_model.forward(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

        if steps % 50 == 0:
            test_loss = 0
            accuracy = 0
            student_model.eval()
            with torch.no_grad():
                for images, labels in student_test_loader:
                    log_ps = student_model(images)
                    test_loss += criterion(log_ps, labels).item()
                    ps = torch.exp(log_ps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor))
            student_model.train()
            print(f"Epoch: {e + 1}/{epochs}.. ",
                  f"Train Loss: {running_loss / len(student_train_loader):.3f}.. ",
                  f"Test Loss: {test_loss / len(student_test_loader):.3f}.. ",
                  f"Accuracy: {accuracy / len(student_test_loader):.3f}")
            running_loss = 0
In this chapter, you learned how to implement different frameworks for differentially private learning. Budget composition, batching, and hyperparameter tuning were introduced. The chapter also provided examples of how to implement DP-SGD using the Opacus library and how to implement the PATE framework in PyTorch. In the next chapter, you will learn about parameter decision making. Choosing privacy loss parameters depends on several factors, including data privacy regulations, frequency of data publication, data utility, and others. The process of identifying all important privacy factors and addressing the privacy versus utility trade-off is discussed in Chapter 9.
Using the code presented in Example 1, modify the following PrivacyEngine parameters and analyze the privacy-utility trade-off.
noise_multiplier
max_grad_norm
Implement a differentially private version of a decision tree. Explain the following steps of your design:
How was the model sensitivity calculated?
Which mechanisms were utilized to transform your decision tree into a differentially-private decision tree?