Chapter 8. Machine Learning Part II

In the previous chapter, you read that machine learning is a process for inferring patterns about data. The process of fitting a model to data is called training. When you train a model, you find parameters that should optimally predict outcomes based on the given data. Sometimes, the training process can leak information about the training data, even if you never see the data directly.

In this chapter, you will learn about ways to make DP machine learning more useful while still preserving privacy. You will also learn about SGD in greater detail, including more ways that it can potentially leak sensitive information. One important outcome of this discussion is the introduction of alternative formulations of differential privacy.

The chapter ends with a discussion and examples of frameworks and tools that will help you create DP machine learning models.

Making DP Gradient Descent Practical

The previous chapter gave a minimally-functional DP gradient descent algorithm using primitives you are already familiar with. Indeed, it is possible to privately train neural networks with simple, scalar DP sums that utilize Laplace noise and basic composition. Unfortunately, this is incredibly inefficient! This section shows how to make key adjustments to the algorithm that make it practically useful.

Vector-Valued Queries

Instead of releasing each scalar partial individually, you can instead release all gradients together as part of one vector-valued query.

In a machine learning context, recall that the computation of the gradient is a row-by-row transformation. The resulting dataset consists of instance-level gradients, where each row corresponds to one training example and each column contains the partial derivatives with respect to one of the parameters in your model. That is, if you had a batch size of N and M parameters in your model, then the privacy analysis reasons about the N×M matrix of instance-level gradients.

If you applied the tools covered thus far in this book, you might first clamp each column individually, then compute the sum of each column, and add noise to each sum. In each training step, the privacy budget is divided among M DP gradient estimates.
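As a rough illustration, here is a minimal sketch of that column-by-column approach, assuming mock sizes, a total per-step budget of ϵ = 1, a single symmetric clamping bound shared by every column, and add/remove-one-row neighboring datasets:

import numpy as np

num_rows, num_params = 1000, 10   # mock sizes
epsilon = 1.0                     # total per-step budget, split across the M columns
bound = 1.0                       # assumed per-column clamping bound

# mock instance-level gradients
gradients = np.random.normal(size=(num_rows, num_params))

# clamp each column individually, sum each column, then add Laplace noise per column;
# under add/remove-one-row neighbors, each clamped column sum has L1 sensitivity `bound`
clamped = np.clip(gradients, -bound, bound)
column_sums = clamped.sum(axis=0)

scale = bound / (epsilon / num_params)   # each column only receives epsilon / M
noisy_sums = column_sums + np.random.laplace(scale=scale, size=num_params)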

Shouldn’t there be a more efficient way to do this? Luckily, there is. It is more efficient to compute a differentially private estimate of all of these sums at once. The strategy is to first clamp the norm of each row, and then sum over the row axis. In the case where one individual can contribute at most one row, then the distance between any two sum vectors on neighboring datasets can be no greater than the clamping norm.

import numpy as np

def clamp(X, norm_bound):
    """Scale down any row whose L2 norm exceeds norm_bound."""
    row_norms = np.linalg.norm(X, axis=1)
    return X / np.maximum(1.0, row_norms / norm_bound)[:, None]

# mock dataset of instance-level gradients
num_rows, num_params = 1000, 10
gradients = np.random.normal(size=(num_rows, num_params))

# clamp the row norms
gradients = clamp(gradients, norm_bound=1.0)

# compute the vector-valued sum over the row axis (one entry per parameter)
gradients = np.sum(gradients, axis=0)

Now that you have a sum vector with sensitivity bounded by the clamping norm, privatize it by adding noise. The Laplace distribution naturally supports vector queries for which the sensitivity is expressed via an L1 norm, and the Gaussian distribution naturally supports vector queries for which the sensitivity is expressed via an L2 norm. These distributions inform our choice of clamping norm.

Let’s take a closer look at the vector-valued Laplace mechanism, which adds i.i.d. Laplace noise Y_i to each element in x, our vector of gradient sums:

$$M_{\mathrm{Lap}}(x) = x + (Y_1, \ldots, Y_m)$$

The probability density function of the vector-Laplace distribution is the product of the densities of each Y_i, shifted by x_i:

$$\Pr[M_{\mathrm{Lap}}(x) = z] = \prod_{i=1}^m \frac{1}{2b}\exp\left(-\frac{|x_i - z_i|}{b}\right)$$

In this equation, b denotes the noise scale parameter:

$$
\begin{aligned}
\epsilon &\ge \ln\left(\frac{\Pr[M(x) \in T]}{\Pr[M(x') \in T]}\right) && \text{definition of privacy}\\
&= \ln\left(\frac{\prod_{i=1}^m \frac{1}{2b}\exp\left(-\frac{|x_i - z_i|}{b}\right)}{\prod_{i=1}^m \frac{1}{2b}\exp\left(-\frac{|x'_i - z_i|}{b}\right)}\right) && \text{substitute } M_{\mathrm{Lap}}\\
&= \sum_{i=1}^m \ln\left(\frac{\exp\left(-\frac{|x_i - z_i|}{b}\right)}{\exp\left(-\frac{|x'_i - z_i|}{b}\right)}\right)\\
&= \sum_{i=1}^m \frac{|x'_i - z_i| - |x_i - z_i|}{b}\\
&\le \sum_{i=1}^m \frac{|x_i - x'_i|}{b} = \frac{\lVert x - x' \rVert_1}{b} = \frac{\Delta_1}{b} && \text{where } \Delta_1 = \lVert x - x' \rVert_1
\end{aligned}
$$

This formula boils down to nearly the same form as in the scalar Laplace mechanism, but this time the sensitivity is measured in terms of the L1-norm:

from opendp.measurements import make_base_laplace

# b is the noise scale parameter: the L1 sensitivity (the clamping norm) divided by epsilon
vec_lap_mech = make_base_laplace(b, D="VectorDomain<AllDomain<float>>")
private_gradients = vec_lap_mech(gradients)

In practice, implementations make a number of small variations on this approach. While the single instance-level gradient matrix was useful for the analysis, real-world implementations don’t construct it explicitly: norms may be computed piecewise, and noise is added directly to each gradient tensor in your model without flattening or concatenation.

It is also typical to use Gaussian noise, as the L2 norm is more permissive. The privacy analysis follows in a similar manner as in the Laplacian derivation.
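Here is a minimal NumPy sketch of this per-tensor style, assuming mock per-example gradients for two parameter tensors, an assumed clipping norm of 1.0, and a Gaussian noise scale expressed as a multiple of that norm; the specific values are illustrative only:

import numpy as np

clip_norm = 1.0         # assumed clipping norm (the L2 sensitivity of the sum)
noise_multiplier = 4.0  # assumed ratio of the Gaussian noise scale to clip_norm

# mock per-example gradients: one array per parameter tensor, first axis indexes examples
per_example_grads = [
    np.random.normal(size=(32, 16, 8)),  # e.g., gradients of a weight matrix
    np.random.normal(size=(32, 8)),      # e.g., gradients of a bias vector
]

# compute each example's global L2 norm piecewise across tensors (no flattening)
sq_norms = sum((g.reshape(g.shape[0], -1) ** 2).sum(axis=1) for g in per_example_grads)
scales = np.minimum(1.0, clip_norm / np.sqrt(sq_norms))  # per-example clipping factors

# clip, sum over examples, and add Gaussian noise directly to each tensor
noisy_sums = []
for g in per_example_grads:
    clipped = g * scales.reshape(-1, *([1] * (g.ndim - 1)))
    summed = clipped.sum(axis=0)
    noise = np.random.normal(scale=noise_multiplier * clip_norm, size=summed.shape)
    noisy_sums.append(summed + noise)

This is the same pattern that Opacus automates later in the chapter through its max_grad_norm and noise_multiplier parameters.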

Composition

Differentially private gradient descent involves repeated releases of noisy gradients. Assuming each release consumes the same privacy loss parameters, the tools for composition covered thus far account for a privacy loss that scales linearly in the number of releases.

This can add up quickly: consider an SGD scenario with 1,000 steps, each involving a DP release. You would only be able to use ϵ/1000 for each release, giving poor utility. Significant improvements in composition can be achieved by recognizing that the tails of the Gaussian distribution are thinner than those of the Laplace distribution, meaning the chance of adding large amounts of noise is smaller when using Gaussian noise. Unfortunately, this useful information is not captured by the privacy loss parameters (ϵ,δ), so the basic composition of approximate DP is “loose.” A key observation is that this inefficiency does not come from the basic composition, which is itself optimal, but from the characterization of the privacy loss in terms of (ϵ,δ).

Practitioners instead use an alternative measure of privacy that more tightly characterizes the privacy loss when making DP releases with Gaussian noise variates. A differentially private release can be made in terms of privacy loss parameters in any parameter space, so long as the measure of privacy provides immunity from post-processing. The task at hand is to find a parameter space for which the basic linear composition is more favorable, when translated to (ϵ,δ). In fact, many privacy analyses choose to forego the conversion back to (ϵ,δ), in favor of representing the final privacy loss in terms of the new parameters.

A common choice of privacy measure is based on the Renyi divergence.

Renyi divergence

Given two probability distributions P and Q, the Renyi divergence of order α is

$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1}\log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]$$

Using the previous definition, we can construct an alternative form of differential privacy called Renyi Differential Privacy. This formulation uses the Renyi divergence to measure the distance between outputs of a randomized mechanism over adjacent data sets.

Renyi differential privacy

A randomized mechanism M(·) is (α,ϵ)-RDP (or “has ϵ-Renyi differential privacy”) provided that, for all adjacent datasets u and v:

$$D_\alpha(M(u) \,\|\, M(v)) \le \epsilon$$

The measure of privacy is now in terms of the Renyi divergence defined as such:

$$D_\alpha(M(u) \,\|\, M(v)) = \frac{1}{\alpha - 1}\log \mathbb{E}_{x \sim M(v)}\left[\left(\frac{\Pr[M(u) = x]}{\Pr[M(v) = x]}\right)^{\alpha}\right]$$

This is a generalization of the differential privacy you’ve learned so far in this book. Consider what happens as α goes to infinity: the statement becomes

$$D_\infty(M(u) \,\|\, M(v)) = \sup_{x \in \mathcal{X}} \log\left(\frac{\Pr[M(u) = x]}{\Pr[M(v) = x]}\right) \le \epsilon$$

which is equivalent to the definition of differential privacy we have been using.

Any mechanism that is (α,ϵ)-RDP also satisfies approximate DP. More specifically, a mechanism that is (α,ϵ)-RDP is also necessarily (ϵ + log(1/δ)/(α−1), δ)-DP. That is to say, RDP isn’t a weaker privacy guarantee; it is rather a different way of characterizing the privacy guarantee. This characterization can always be transformed into an (ϵ′,δ)-DP expression, where ϵ′ = ϵ + log(1/δ)/(α−1).
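To see how this helps, the following sketch composes the RDP guarantees of repeated Gaussian releases and converts the total back into (ϵ,δ)-DP. It relies on two standard facts: the Rényi divergence of order α between the Gaussian mechanism’s output distributions on neighboring datasets is α/(2σ²) when the noise scale σ is expressed as a multiple of the L2 sensitivity, and RDP parameters at a fixed order simply add under composition. The step count, noise multiplier, and δ below are arbitrary values chosen for illustration:

import numpy as np

def gaussian_rdp(alpha, noise_multiplier):
    # RDP of the Gaussian mechanism: alpha * Delta^2 / (2 * sigma^2), which is
    # alpha / (2 * noise_multiplier^2) when sigma = noise_multiplier * Delta
    return alpha / (2 * noise_multiplier ** 2)

def rdp_to_approx_dp(alpha, rdp_epsilon, delta):
    # (alpha, eps)-RDP implies (eps + log(1/delta) / (alpha - 1), delta)-DP
    return rdp_epsilon + np.log(1 / delta) / (alpha - 1)

steps = 1000            # assumed number of noisy gradient releases
noise_multiplier = 4.0  # assumed ratio of noise scale to L2 sensitivity
delta = 1e-5

# at a fixed order alpha, RDP parameters add across releases, so search over a
# grid of orders for the order that translates to the smallest (epsilon, delta)
alphas = range(2, 64)
epsilons = [rdp_to_approx_dp(alpha, steps * gaussian_rdp(alpha, noise_multiplier), delta)
            for alpha in alphas]
best_epsilon, best_alpha = min(zip(epsilons, alphas))
print(f"epsilon = {best_epsilon:.2f} at alpha = {best_alpha}")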

Batching

In most practical applications, each step of gradient descent is performed on a small subset of the data. This is called mini-batching.

At the same time, the privacy analysis may be improved by making releases on random subsets of the data. This is called privacy amplification by subsampling.

The intuition is that the set of individuals in the mini-batch is unknown, meaning that each individual may only influence the release with a reduced probability.

Those two concepts complement each other to improve the utility of DP gradient descent.

The most common approach for privacy amplification is to use Poisson sampling. To take a Poisson sample, iterate through each record in your dataset and, with probability λ, include it in the mini-batch.

Each row in the dataset has only a λ chance of being used in the release, so the privacy guarantee is amplified: the effective privacy loss parameters shrink by a factor roughly proportional to λ.
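Here is a minimal sketch of Poisson sampling over the rows of a mock dataset, with an assumed sampling probability lam:

import numpy as np

lam = 0.01               # sampling probability (assumed)
num_rows = 10_000        # mock dataset size
data = np.random.normal(size=(num_rows, 5))

# flip one biased coin per record; the included rows form the mini-batch
mask = np.random.uniform(size=num_rows) < lam
mini_batch = data[mask]

print(len(mini_batch))   # unlike fixed-size batching, the batch size is itself random

Note that the size of a Poisson-sampled batch is random; privacy accountants that rely on amplification by subsampling (for example, Opacus with poisson_sampling=True) assume this sampling scheme.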

Hyperparameter Tuning

In the previous chapter, you learned that training a machine learning model means minimizing the error of an estimate relative to your data. This minimization happens incrementally: you estimate the parameters of the model, and then take a step in the direction of steepest descent of the loss. The size of this step is called the learning rate γ. Note that this is not a parameter of the model: once the model is trained, this value is not used in any way to make predictions. Instead, it controls how the training process proceeds. For a small value of γ, the training process will take much longer to converge to an optimal set of parameters. On the other hand, with a large value of γ you could overshoot and miss the optimal value entirely. Values like γ are called hyperparameters.

Hyperparameter

A value that affects the model training process and does not appear as a parameter in the trained model

As you might suspect, the choice of hyperparameters is an important step of the model training process. You will often want to perform hyperparameter optimization to find the optimal set of hyperparameters to train the desired model.

In a differentially private setting, the hyperparameters can leak information about the data. This means that you may not always be able to do hyperparameter optimization like you would on a non-private data set. One approach for this situation is to find a non-private data set that is thought to likely be structurally similar to the private data in question. You can then optimize the hyperparameters on the public data set as an estimate for the ideal hyperparameters for the private data set.

For example, DPTheilSen requires two hyperparameters: the upper and lower bounds of the data. If you were to grid search over possible bounds on the private data and release only the best-performing model, you would effectively learn the smallest and largest values in the data, yet the hyperparameters used to train that model would not be accounted for in the privacy calculus.

Public holdout

If you have a public dataset with similar distributional properties as your private dataset, the easiest way to combat the challenges of selecting hyperparameters is to use the public dataset to inform your choice of hyperparameters for the DP algorithm.

Unfortunately, this doesn’t always apply! In the case of DP-SGD, even when the public data has the same distribution, a useful learning rate for SGD is not necessarily a useful learning rate for DP-SGD.

Private selection from private candidates

Ideally, you would want to try several different hyperparameters, and only release the model with the best score. Unfortunately, the naive privacy analysis is incredibly unforgiving. You could view the selection of a single model as postprocessing, so the overall privacy budget is the composition of the privacy budgets used to train all models.

A tighter privacy analysis can be conducted in this case using private selection from a set of private candidates. The setup involves creating a function Q(D) from which you can sample a private score and private candidate. Each time you call the function, it returns a DP estimate of the utility and a DP model: (q, x) ∼ Q(D).

The tighter privacy guarantee comes from repeatedly calling this function until either a utility threshold is met, or by random stopping. In this first example, the algorithm draws private samples from Q(D) until a biased coin lands heads, and then returns the best candidate:

def private_selection_random_stop(stop_probability):
    assert 0 < stop_probability < 1
    # queryable is Q(D), a python generator of (score, value) pairs;
    # queryable_builder, dataset, and bernoulli (a single biased coin flip)
    # are assumed to be defined elsewhere
    queryable = queryable_builder(dataset)
    best_score = -float("inf")
    best_y = None
    while True:
        score, y = next(queryable)
        if score > best_score:
            best_score = score
            best_y = y

        if bernoulli(stop_probability):
            return best_score, best_y

When it costs ϵ to draw a sample from Q(D), this algorithm costs 3·ϵ to run. Although this approach does incur a factor of 3 penalty on the privacy budget, often it is worth paying this penalty for the benefit of taking the best candidate from a pool of samples.

A modification of this algorithm can give a tighter privacy bound, if you set a threshold for the utility score beforehand:

import numpy as np

def private_selection_threshold(stop_probability, threshold,
                                epsilon_selection, steps=None):
    assert 0 < stop_probability < 1
    # minimum number of draws required for the privacy analysis to hold
    min_steps = int(
        np.ceil(max(np.log(2 / epsilon_selection) / stop_probability,
            1 + 1 / np.exp(stop_probability)))
        )

    steps = steps or min_steps
    assert steps >= min_steps

    queryable = queryable_builder(dataset)
    for _ in range(steps):
        score, *y = next(queryable)

        # return the first candidate whose (DP) score clears the threshold
        if score >= threshold:
            return score, *y

        # otherwise stop early at random, returning nothing
        if bernoulli(stop_probability):
            return

This algorithm breaks the privacy consumption into two parameters: drawing a sample from Q(D) incurs a privacy spend of ϵ_0, and the private selection itself incurs a spend of ϵ_selection (the epsilon_selection argument above). The resulting privacy spend is 2·ϵ_0 + ϵ_selection.

When applied in the machine learning context, these algorithms become very useful for privately selecting a private machine learning model. They also provide robustness against poor initial choices of hyperparameters that may cause the model to fail to converge.
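To make this concrete, here is one possible sketch of the queryable Q(D) for hyperparameter tuning. The helpers sample_hyperparameters, train_dp_model, and dp_validation_score are hypothetical stand-ins for your own DP training and DP scoring routines, each of which would consume a fixed privacy budget per call:

import random

# Hypothetical stand-ins for DP training and DP scoring; in a real pipeline
# each call to Q(D) would consume a fixed privacy budget epsilon_0.
def sample_hyperparameters():
    return {"learning_rate": random.choice([0.01, 0.05, 0.1])}

def train_dp_model(dataset, hparams):
    return {"hparams": hparams}   # placeholder for a model trained with DP-SGD

def dp_validation_score(model, dataset):
    return random.random()        # placeholder for a DP estimate of utility

def bernoulli(p):
    return random.random() < p

def queryable_builder(dataset):
    """Q(D): a generator of (DP utility score, DP model candidate) pairs."""
    while True:
        hparams = sample_hyperparameters()
        model = train_dp_model(dataset, hparams)
        score = dp_validation_score(model, dataset)
        yield score, model

dataset = None   # placeholder; the selection functions above read this global
best_score, best_model = private_selection_random_stop(stop_probability=0.1)

With these pieces in place, private_selection_random_stop keeps drawing candidate models until its biased coin lands heads, and then returns the best-scoring one.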

Training Differentially Private Models with PyTorch

This section demonstrates two frameworks for model privatization: DP-SGD and PATE. We start with the UCI Adult dataset, used in previous chapters, to train a differentially private tabular data classification model. The UCI Adult dataset is a widely used dataset that contains demographic information about individuals, such as age, education, and occupation, as well as their income level (whether they make more or less than $50,000 per year). The dataset was compiled from 1994 US Census Bureau data and contains over 32,000 instances.

One of the main advantages of using the UCI Adult dataset as a benchmark dataset is its widespread use in the machine learning community. The UCI Adult dataset is relevant to real-world applications, especially in scenarios where privacy is a concern. The dataset contains information about income, which is an important factor in many decision-making processes, such as credit approvals or hiring decisions.

We will start by training a model to predict whether an individual has a high or a low income. Once we go through the non-private training process, we show how to modify the training process to make it differentially private. We will walk through the model transformation from non-private to differentially private utilizing two distinct frameworks:

  • The first framework uses DP-SGD as the optimization function in a neural network architecture. Opacus is a library that provides DP-SGD implementations and can be seamlessly used with PyTorch. In our example, we will show how to transform a neural network model implemented with PyTorch into a differentially private neural network using Opacus.

  • The second framework is PATE, which we demonstrate on the MNIST digit classification task.

Example: Predicting Income

In this example, we will utilize pandas, scikit-learn, and NumPy, in addition to PyTorch and Opacus, for tabular data classification.

To install the necessary libraries, run the following command:

pip install numpy pandas opacus scikit-learn torch

To import all necessary libraries, add the following imports to your Python file or Jupyter notebook:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim

To verify the available devices, run the following commands:

>>> import torch
>>> device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
>>> device
device(type='cpu')

We consider the description of the UCI Adult dataset to be public knowledge. When a dataset description is public knowledge, not only is the metadata of the dataset publicly available, but every possible neighboring dataset also shares the same metadata description.

Because the data description is public, no privacy-motivated preprocessing steps, such as clamping, are necessary. The only required preprocessing relates to loading the data into the PyTorch model: encoding the categorical features and scaling the numerical values. In addition to data encoding and numerical scaling, the following code also partitions the dataset into training data and testing data:

# Load the Adult dataset
header = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
            'marital_status', 'occupation', 'relationship',
            'race', 'sex', 'capital_gain', 'capital_loss',
            'hours_per_week', 'native_country', 'income']
train_data = pd.read_csv('adult.data', header=None, names=header,
            sep=r',\s*', na_values=['?'], engine='python')
test_data = pd.read_csv('adult.test', header=None, names=header,
            sep=r',\s*', na_values=['?'], skiprows=1, engine='python')

# Preprocess the data
data = pd.concat([train_data, test_data], ignore_index=True)
data = data.dropna()
data = data.reset_index(drop=True)

categorical_columns = ['workclass', 'education', 'marital_status',
                        'occupation', 'relationship', 'race', 'sex',
                       'native_country']
numerical_columns = ['age', 'fnlwgt', 'education_num', 'capital_gain',
                        'capital_loss', 'hours_per_week']

# Encode categorical features
for column in categorical_columns:
    encoder = LabelEncoder()
    data[column] = encoder.fit_transform(data[column])

## Normalize numerical features
for column in numerical_columns:
    scaler = StandardScaler()
    data[column] = scaler.fit_transform(data[column].values.reshape(-1, 1))

# Split data into input (X) and output (y)
X = data.drop(columns=['income'])
# the adult.test labels carry a trailing period ('>50K.'), so strip it before comparing
y = data['income'].apply(lambda x: 1 if x.rstrip('.') == '>50K' else 0)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                    test_size=0.2, random_state=42)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).view(-1, 1)

This next part of the code defines the model architecture for data classification. This model architecture is a simple neural network classifier for predicting whether an adult’s income is low or high. It takes in an input tensor with size input_size and passes it through a fully connected layer (fc1) with a ReLU activation function. The output of fc1 is then passed through another fully connected layer (fc2) with a sigmoid activation function. The final output of the model is a probability estimate for the input being in the positive class (high income):

# Define the model
class AdultClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(AdultClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

The next part of the code trains and evaluates the binary classification model for predicting whether an adult’s income is low or high. The first few lines of code set the model parameters, including the input size (which is inferred from the shape of the input data X), the size of the hidden fully connected layer (hidden_size), and the output size (which is always 1 for binary classification).

Next, the code defines the loss function (nn.BCELoss()) and the optimizer (torch.optim.SGD()) for training the model. The model is then trained for a specified number of epochs (num_epochs) using mini-batch gradient descent. The training loop iterates over the training data in batches of size batch_size, computes the forward and backward pass through the model, and updates the model parameters using the optimizer. The loss is printed every 10 epochs for monitoring the training progress.

After training, the model is evaluated on the test data:

# Set the model parameters
input_size = X.shape[1]
hidden_size = 64
output_size = 1

model = AdultClassifier(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Train the model
num_epochs = 100
batch_size = 128

for epoch in range(num_epochs):
    for i in range(0, len(X_train), batch_size):
        X_batch = X_train_tensor[i:i + batch_size]
        y_batch = y_train_tensor[i:i + batch_size]

        # Forward pass
        output = model(X_batch)
        loss = criterion(output, y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    test_output = model(X_test_tensor)
    test_output = (test_output > 0.5).float()
    accuracy = torch.sum(test_output == y_test_tensor).item()/len(y_test_tensor)
    print(f'Accuracy: {accuracy * 100:.2f}%')

Running the preceding code produces the following output:

Epoch [10/100], Loss: 0.3022
Epoch [20/100], Loss: 0.3027
Epoch [30/100], Loss: 0.2929
Epoch [40/100], Loss: 0.2831
Epoch [50/100], Loss: 0.2779
Epoch [60/100], Loss: 0.2765
Epoch [70/100], Loss: 0.2749
Epoch [80/100], Loss: 0.2750
Epoch [90/100], Loss: 0.2748
Epoch [100/100], Loss: 0.2744
Accuracy: 84.57%

It is well known that adding differential privacy to the training process usually reduces model utility. In the non-private model training, we obtained an accuracy of 84.57%. In the next example, we will make small modifications to the code just discussed to transform it into a differentially private model.

Example: Predicting Income Privately

The main difference between the non-private model above and the differentially private model presented next is that we use DP-SGD as the optimization function. The PrivacyEngine from the Opacus library provides the necessary mechanisms and transformations for making SGD differentially private.

The data preprocessing steps and model definition remain identical to the non-DP version of model training:

from opacus import PrivacyEngine

# Set the model parameters
input_size = X.shape[1]
hidden_size = 64
output_size = 1

model = AdultClassifier(input_size, hidden_size, output_size)

# Define the loss function and optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

So up to this point the process of defining model architecture, loss function, and optimizer is identical to the non-private version of the model.

The following code attaches the privacy engine to the optimizer, specifying the model to be used and the privacy loss parameters of the training process:

# Step 4: Attaching a Differential Privacy Engine to the Optimizer

privacy_engine = PrivacyEngine(model,
                               batch_size=128,            # must match the batch size in the training loop
                               sample_size=len(X_train),  # number of training examples, used to compute the sampling rate
                               alphas=range(2, 32),
                               noise_multiplier=0.8,
                               max_grad_norm=1.0)

privacy_engine.attach(optimizer)

In this specific example, we define the model, batch_size, sample_size, alphas, noise_multiplier, and max_grad_norm parameters for the PrivacyEngine class. The PrivacyEngine also accepts other parameters. From the Opacus documentation,1 the following parameters are available for defining the differentially private learning process:

  • model: PyTorch model to be used for training

  • optimizer: Optimizer to be used for training

  • noise_multiplier (float): The ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added (how much noise to add)

  • max_grad_norm (Union[float, List[float]]): The maximum norm of the per-sample gradients. Any gradient with norm higher than this will be clipped to this value.

  • batch_first (bool): Flag to indicate if the input tensor to the corresponding module has the first dimension representing the batch. If set to True, dimensions on input tensor are expected be [batch_size, ...], otherwise [K, batch_size, ...]

  • loss_reduction (str): Indicates if the loss reduction (for aggregating the gradients) is a sum or a mean operation. Can take values "sum" or "mean"

  • poisson_sampling (bool): True if you want to use standard sampling required for DP guarantees. Setting False will leave provided data_loader unchanged. Technically this doesn’t fit the assumptions made by privacy accounting mechanism, but it can be a good approximation when using Poisson sampling is unfeasible.

  • clipping (str): Per sample gradient clipping mechanism ("flat" or "per_layer" or "adaptive"). Flat clipping calculates the norm of the entire gradient over all parameters, per layer clipping sets individual norms for every parameter tensor, and adaptive clipping updates clipping bound per iteration. Flat clipping is usually preferred, but using per layer clipping in combination with distributed training can provide notable performance gains.

  • alphas: The alphas parameter instructs the privacy engine what Renyi differential privacy orders to use for tracking privacy expenditure.

Once the privacy engine is defined, the training and evaluation process proceeds as follows:

for epoch in range(num_epochs):
    delta=1e-5
    for i in range(0, len(X_train), batch_size):
        X_batch = X_train_tensor[i:i + batch_size]
        y_batch = y_train_tensor[i:i + batch_size]

        # Forward pass
        output = model(X_batch)
        loss = criterion(output, y_batch)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # cumulative privacy spend so far, as reported by the RDP accountant
    epsilon, best_alpha = optimizer.privacy_engine.get_privacy_spent(delta)

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model
model.eval()
with torch.no_grad():
    test_output = model(X_test_tensor)
    test_output = (test_output > 0.5).float()
    accuracy = torch.sum(test_output == y_test_tensor).item() / len(y_test_tensor)
    print(f'Accuracy: {accuracy * 100:.2f}%')

Running the preceding code produces output similar to the following:

Epoch [10/100], Loss: 0.9130
Epoch [20/100], Loss: 0.8918
Epoch [30/100], Loss: 0.8797
Epoch [40/100], Loss: 0.8781
Epoch [50/100], Loss: 0.8845
Epoch [60/100], Loss: 0.8789
Epoch [70/100], Loss: 0.8830
Epoch [80/100], Loss: 0.8801
Epoch [90/100], Loss: 0.8799
Epoch [100/100], Loss: 0.8669
Accuracy: 83.08%

The accuracy of the differentially private model is 83.08%, which is not a significant reduction when we compare with the non-private model accuracy of 84.57%.

When defining parameters, one important parameter to be aware of is the batch size: the peak memory requirement is proportional to the batch size. The preceding code sample uses a batch size of 128, but depending on the data, the batch size can be set to smaller values.

Let’s now train a differentially private model for digit classification using the PATE framework, this time using the MNIST dataset of handwritten digits and a small convolutional network. This will allow a comparative analysis of how the two frameworks are applied.

First, load the dataset. As seen in Chapter 7, the PATE framework utilizes two distinct datasets when training a DP model: a private dataset that trains the teacher models, and a public dataset that trains the student model.

The public dataset is labeled by the DP aggregation of the classifications from the teacher models, and then used to train the student model.

The following example divides the data into train and test data. The train data will be used as the private dataset of the framework, and the test data will be utilized as the public dataset:

import torch

from torchvision import datasets, transforms
from torch.utils.data import Subset

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])
train_data = datasets.MNIST('../mnist', train=True, transform=transform,
                                 target_transform=None, download=True)
test_data = datasets.MNIST('../mnist', train=False, transform=transform,
                               target_transform=None, download=True)

num_teachers = 20
batch_size = 128  # reuse the batch size from the earlier example

def get_data_loaders(train_data, num_teachers):
    """ Function to create data loaders for the Teacher classifier """
    teacher_loaders = []
    data_size = len(train_data) // num_teachers

    # one disjoint partition of the private data per teacher
    for i in range(num_teachers):
        indices = list(range(i*data_size, (i+1)*data_size))
        subset_data = Subset(train_data, indices)
        loader = torch.utils.data.DataLoader(subset_data, batch_size=batch_size)
        teacher_loaders.append(loader)

    return teacher_loaders

teacher_loaders = get_data_loaders(train_data, num_teachers)

As seen in that code, the first step is to partition the train_data. The snippet above prepares the data partitions that will be used to train the teacher models:

student_train_data = Subset(test_data, list(range(9000)))
student_test_data = Subset(test_data, list(range(9000, 10000)))

student_train_loader = torch.utils.data.DataLoader(student_train_data, batch_size=128)
student_test_loader = torch.utils.data.DataLoader(student_test_data, batch_size=128)

The MNIST test data will be used to train and test the student models. For this experiment, the data split is 90% train, 10% test.

Defining the model architecture is the next step. The model architecture is used in two parts of the framework: training the teacher models and training the student model. The following code defines a small convolutional network suited to the 28×28 MNIST images:

model = torch.nn.Sequential(torch.nn.Conv2d(1, 16, 8, 2, padding=3),
                            torch.nn.ReLU(),
                            torch.nn.MaxPool2d(2, 1),
                            torch.nn.Conv2d(16, 32, 4, 2),
                            torch.nn.ReLU(),
                            torch.nn.MaxPool2d(2, 1),
                            torch.nn.Flatten(),
                            torch.nn.Linear(32 * 4 * 4, 32),
                            torch.nn.ReLU(),
                            torch.nn.Linear(32, 10))

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

Next, the process involves training the teacher models and scoring the student training data with the teacher models:

import copy

def train(model, trainloader, criterion, optimizer, epochs=10):
    model.train()
    for e in range(epochs):
        for images, labels in trainloader:
            optimizer.zero_grad()
            output = model(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()


def predict(model, dataloader):
    outputs = torch.zeros(0, dtype=torch.long)
    model.eval()

    with torch.no_grad():
        for images, labels in dataloader:
            output = model(images)
            ps = torch.argmax(torch.exp(output), dim=1)
            outputs = torch.cat((outputs, ps))

    return outputs


def train_models(num_teachers, model):
    models = []
    for i in range(num_teachers):
        # each teacher is an independent copy of the architecture,
        # trained only on its own partition of the private data
        teacher = copy.deepcopy(model)
        criterion = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(teacher.parameters(), lr=0.05)
        train(teacher, teacher_loaders[i], criterion, optimizer)
        models.append(teacher)
    return models

models = train_models(num_teachers, model)

The first step to transform a machine learning model into a differentially private machine learning model with PATE is to evaluate the student training data with the teacher models and aggregate the results using a differential privacy mechanism.

In the next bit of code, the predictions from the teacher models are aggregated using a Laplace mechanism:

import numpy as np

epsilon = 0.2
def aggregated_teacher(models, dataloader, epsilon):
    """Label each student training image by a noisy argmax over the teacher votes."""
    preds = torch.zeros((len(models), 9000), dtype=torch.long)
    for i, model in enumerate(models):
        results = predict(model, dataloader)
        preds[i] = results

    labels = np.array([]).astype(int)
    for image_preds in preds.numpy().T:
        # count the votes for each of the 10 classes
        label_counts = np.bincount(image_preds, minlength=10).astype(float)
        beta = 1 / epsilon

        # add Laplace noise to every count before taking the argmax
        label_counts += np.random.laplace(0, beta, len(label_counts))

        new_label = np.argmax(label_counts)
        labels = np.append(labels, new_label)

    return preds.numpy(), labels


teacher_models = models
preds, student_labels = aggregated_teacher(
    teacher_models, student_train_loader, epsilon)

In the final part of the code, a student model is trained using the student training data. Notice that because the student data now has differentially private labels, the resulting student model is differentially private. Chapter 3 discusses an important characteristic of differential privacy: post-processing immunity.

Training a model using the differentially private data can be seen as a post-processing step of differentially private data. This is what guarantees the differential privacy of the student model.

The code below illustrates the process of training the student model.

def student_loader(student_train_loader, labels):
    # pair each batch of public images with its noisy teacher-voted labels
    for i, (data, _) in enumerate(iter(student_train_loader)):
        yield data, torch.from_numpy(labels[i*len(data): (i+1)*len(data)])


student_model = model
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(student_model.parameters(), lr=0.05)
epochs = 10
steps = 0
running_loss = 0
for e in range(epochs):
    student_model.train()
    train_loader = student_loader(student_train_loader, student_labels)
    for images, labels in train_loader:
        steps += 1

        optimizer.zero_grad()
        output = student_model.forward(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        if steps % 50 == 0:
            test_loss = 0
            accuracy = 0
            student_model.eval()
            with torch.no_grad():
                for images, labels in student_test_loader:
                    log_ps = student_model(images)
                    test_loss += criterion(log_ps, labels).item()

                    ps = torch.exp(log_ps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor))
            student_model.train()
            print(f"Epoch: {e+1}/{epochs}.. ",
                  f"Train Loss: {running_loss/len(student_train_loader):.3f}.. ",
                  f"Test Loss: {test_loss/len(student_test_loader):.3f}.. ",
                  f"Accuracy: {accuracy/len(student_test_loader):.3f}")
            running_loss = 0

Summary

In this chapter, you learned how to implement different frameworks for differentially private learning, along with budget composition, batching, and hyperparameter tuning. The chapter also provided examples of how to implement DP-SGD using the Opacus library and how to implement the PATE framework directly in PyTorch. In the next chapter, you will learn about parameter decision making. Choosing privacy loss parameters depends on several factors, including data privacy regulations, frequency of data publication, data utility, and others. The process of identifying all important privacy factors and how to address the privacy versus utility trade-off is discussed in Chapter 9.

Exercises

  1. Using the code presented in Example 1, modify the following PrivacyEngine parameters and analyze the privacy-utility trade-off.

    1. noise_multiplier

    2. max_grad_norm

  2. Implement a differentially private version of a decision tree. Explain the following steps of your design:

    1. How was the model sensitivity calculated?

    2. Which mechanisms were utilized to transform your decision tree into a differentially-private decision tree?
