Chapter 9. Defining Privacy Loss Parameters of a Data Release: How to Choose Epsilon, Delta and Other Greek Letters

Differential privacy delineates a trade-off between privacy and utility, with both attributes influenced by various parameters. Taking advantage of this tradeoff requires understanding how to set these parameters to achieve your data curation and analysis goals. These choices must be informed by contextual needs around privacy and accuracy — as a choice of parameters may be appropriate in one situation and ineffective in another — and potentially with input from various parties who have a stake in the data.

Some of the decisions that are described in this chapter will be made by the data curator, an individual or organization that is responsible for making decisions around access and disclosure limitation for the data set in question. Other decisions must be made by a data analyst, who must work within the constraints set by the data curator to make the best possible uses of the data for their analysis goals.

Understanding the various parameters that affect privacy and utility is central to making good decisions as a DP data curator, and for creating useful analyses within these constraints as a data analyst. Furthermore, it is important to be able to communicate these decisions to other stakeholders to allow them to work appropriately with the data. Fortunately, there are numerous technical methods that facilitate choosing these parameters. Like all matters in differential privacy, there are benefits and challenges to every parameter selection process, but understanding your unique situation will allow you to make the best decision possible.

One of the main decisions the data owner must make is bounding the privacy loss. Privacy loss can be thought of as a random variable that is defined as the following log ratio:

log ( Pr[M(x) ∈ S] / Pr[M(y) ∈ S] )

Under the standard “approximate-DP” definition, the parameters that bound this privacy loss are called epsilon and delta. Experts recommend setting epsilon to a small value between 0.01 and 1, although many deployments choose a much higher value. As you’ve seen in previous chapters, epsilon bounds the difference between the distributions over outputs for any computation run on neighboring datasets.

Delta, on the other hand, should be set much smaller than 1/n, where n is the size of the dataset; a standard setting of delta is 10⁻⁶. You can think of delta as bounding the “edge case” differences between these shifted distributions, for example when the support of the distribution is slightly different for two neighboring datasets.
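To build intuition for how epsilon bounds this log ratio, here is a minimal sketch of our own (not from the text), using a Laplace mechanism with sensitivity 1: the privacy-loss random variable never exceeds epsilon over any output, which is why pure Laplace noise needs no delta.

import numpy as np

epsilon = 0.5
sensitivity = 1.0                  # neighboring datasets change the true answer by at most 1
scale = sensitivity / epsilon

true_x, true_y = 40.0, 41.0        # hypothetical query answers on neighboring datasets x and y

def laplace_pdf(output, mu, b):
    return np.exp(-np.abs(output - mu) / b) / (2 * b)

outputs = np.linspace(0, 80, 1_000)
privacy_loss = np.log(laplace_pdf(outputs, true_x, scale) /
                      laplace_pdf(outputs, true_y, scale))

print(f"max |privacy loss| over outputs: {np.abs(privacy_loss).max():.3f}")  # ~0.500
print(f"epsilon:                         {epsilon:.3f}")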

In this chapter, you will learn how to:

  • choose privacy-loss parameters (e.g. epsilon and delta) to suit the deployment’s context and goals

  • make decisions about sampling method and metadata parameters with privacy in mind

  • create organizational practices around data that enable clearer decision-making

  • solicit feedback and communicate about the choice of parameters

Sampling

The privacy loss parameter is one of the most important policy decisions that the data curator1 must make about a deployment. But there are other parameters that affect privacy, some in the data collection stage, such as whether the data comes from a “secret sample,” where secrecy means that the set of data subjects chosen from the broader population is confidential knowledge. The sample size, population size, and sampling method can all affect the privacy-utility tradeoff. A secret, simple random sample, for example, can “amplify” the privacy guarantee, allowing one to gain more utility for the same privacy loss.

Other types of sampling methods may not provide additional privacy guarantees. Consider cluster sampling, where a population is divided into disjoint subgroups (called clusters), and then one of these clusters is randomly selected as the sample. Such an approach can be advantageous in market research, for example when geographical clusters are constructed to minimize travel time for in-person interviews.

For small cluster sizes, you can achieve privacy amplification via cluster sampling. However, as the cluster sizes grow, the amplification degrades, to the point that eventually there is no additional privacy advantage. Further, if the clusters are sufficiently different from one another, then a private algorithm can infer which data points are in the sample, meaning that the secrecy of the sample provides negligible additional privacy. 2 This is still an active area of research at the time of publication. Privacy amplification by subsampling also appears in Chapter 7, which discusses the mathematical justification for why this technique improves privacy in the case of a simple random sample.
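As a concrete illustration of how a secret simple random sample “amplifies” the guarantee, the sketch below (our own example, not from the text) applies the standard amplification-by-subsampling bound for pure ε-DP: sampling a fraction q of the population shrinks the effective epsilon to ln(1 + q(e^ε − 1)).

import math

def amplified_epsilon(epsilon, sampling_rate):
    # Amplification by subsampling for a secret simple random sample:
    # an epsilon-DP mechanism run on a q-fraction sample satisfies
    # ln(1 + q * (e^epsilon - 1))-DP with respect to the full population.
    return math.log(1 + sampling_rate * (math.exp(epsilon) - 1))

for q in (1.0, 0.1, 0.01):
    print(f"sampling rate {q:5.2f}: epsilon 1.0 becomes {amplified_epsilon(1.0, q):.3f}")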

Metadata Parameters

There are decisions that affect utility as well as privacy, and these can be made by both the data curator and the data analyst. One example is metadata parameters, such as the ranges of numerical variables and the categories of categorical variables.

Why are data curators and analysts asked to input these metadata parameters? Datasets in the wild may have an infinite range or set of categories; this means the sensitivity of statistics computed on this data could be unbounded, which can pose a problem for maintaining the privacy guarantee. Therefore, many DP algorithms require the inputs to be bounded, or for the curator/analyst to set a bound to which the inputs can be clipped, in order to bound the privacy loss of the release.

However, these bounds and categories must be data-independent; they cannot be taken straight from the dataset, but rather should come from knowledge about the data domain and its collection. For example, the range for an “age” variable should not be the min and max of ages in the dataset, but rather [0,110] (for a dataset potentially containing any human) or [0,18] (for a dataset on children). Similarly, the categories for an “education” variable should not simply be those present in the dataset, but should come from a codebook of educational levels for this data domain.

Selecting these parameters independently of the dataset is critical for maintaining the desired privacy guarantee. However, these parameters also impact data utility: ranges or categories that are too narrow may introduce bias into the statistic, while parameters that are too broad may introduce more variance. The data curator is responsible for choosing these parameters, given their deep knowledge of the data domain, but data analysts are typically also able to tune them to achieve their own bias-variance goals.
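To make this bias-variance effect concrete, here is a minimal sketch of our own (assuming a Laplace-noised mean with a publicly known record count n): narrow bounds clip more values and bias the estimate, while wide bounds increase the sensitivity and therefore the noise.

import numpy as np

rng = np.random.default_rng(0)

def dp_mean(data, lower, upper, epsilon):
    # Clamp to data-independent bounds so one record changes the mean
    # by at most (upper - lower) / n; n is assumed to be public here.
    n = len(data)
    clipped = np.clip(data, lower, upper)
    sensitivity = (upper - lower) / n
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

ages = rng.integers(0, 90, size=1_000)      # hypothetical age column

print(dp_mean(ages, 0, 50, epsilon=0.1))    # too narrow: clipping bias, less noise
print(dp_mean(ages, 0, 110, epsilon=0.1))   # wide: no clipping bias, more noise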

Allocating Privacy Loss Budget

Once the data curator has decided on privacy-loss parameters for the release, i.e. the global privacy-loss budget, the curator can then allocate this budget amongst different releases and analysts. For example, the curator may want to use 50% of the budget to do an initial DP release of basic summary statistics about the data set, such as histograms of key variables. The other 50% can then be allocated amongst analysts at different institutions, who can request additional statistics beyond the initial release.

Analysts, too, are responsible for allocating their portion of the privacy-loss budget amongst different statistics. There are different ways to do this: divide the portion equally across the statistics; spend more of the budget on statistics that provide richer information (such as sufficient statistics) or context (such as CDFs of variables); or allocate the budget to satisfy accuracy goals. For the latter option, some tools allow the analyst to “fix” the desired accuracy of a statistic, such as the size of the 95% confidence interval around a DP mean, and will automatically compute the allocation of privacy-loss budget needed to achieve this goal.
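As a sketch of how such an accuracy target can be translated into a privacy-loss parameter by hand (our own example, based on the tail bound of the Laplace distribution rather than on any particular tool’s API):

import math

def laplace_epsilon_for_accuracy(sensitivity, accuracy, alpha=0.05):
    # For Laplace noise with scale b, P(|noise| > a) = exp(-a / b).
    # To guarantee |error| <= accuracy with probability 1 - alpha,
    # we need b <= accuracy / ln(1/alpha), i.e.
    # epsilon >= sensitivity * ln(1/alpha) / accuracy.
    return sensitivity * math.log(1 / alpha) / accuracy

# Hypothetical: a mean over n = 10,000 records clamped to [0, 100],
# with a target error of +/- 0.5 at 95% confidence.
eps = laplace_epsilon_for_accuracy(sensitivity=100 / 10_000, accuracy=0.5)
print(f"epsilon needed: {eps:.4f}")   # ~0.06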

There are three key challenges with allocating a privacy-loss budget. First, curators and analysts must contend with the inevitable tradeoff between privacy and accuracy; the more the budget is split across analysts and statistics, the lower the accuracy will be for each individual analysis.

Second, the allocation of the budget must be done prior to the release / computation of the DP statistics. This may make it hard to do exploratory data analysis, as it may be unknown a priori which statistics are more important and informative for the goals of the analysis. We will discuss this later in “Making these Decisions in the Context of Exploratory Data Analysis”.

Third, the privacy-loss budget is a finite resource - it must be shared across data users. This means that the allocation of the budget is a visible policy choice on the part of the data curator. This transparency makes it easier to solicit feedback around which data users may benefit from a larger share of the budget, but it may also make these decisions more contentious as stakeholders begin to view allocation as a zero-sum game. We will discuss this more in the conclusion of the chapter.

Practices that Aid Decision-Making

As we’ve discussed, there are many decisions that go into the DP data curation and analysis process. Making these decisions can feel overwhelming, especially because you will get only one shot (or a limited number of shots) to analyze the data. Remember that any do-overs will result in additional privacy loss. In this section, we therefore discuss data practices that will help you make the best decisions possible. These practices start even before the process of data collection and extend throughout data analysis and distribution.

Codebook and Data Annotation

First, for the data curator, we recommend creating a codebook even before collecting data. The codebook should also make use of and delineate other public information that is useful for this release, such as (1) information that is invariant across all potential input datasets, (2) information that is publicly available from other sources, and (3) information from other DP releases. See Section 3 of the OpenDP Documentation for more on this and an example.

In particular, the codebook should detail the sampling information - such as the sampling frame, sampling method, and what steps may or may not be taken to protect the secrecy of the sample. Next, it should contain all the variables (i.e., column names) that will be collected and their metadata parameters, such as variable type, variable bounds, and variable categories. Third, the codebook should specify how the data curator will deal with missing or erroneous values. It is important that, to the extent possible, this codebook is created prior to data collection, as that will ensure that the information included is not derived from the dataset itself and therefore does not leak individual-level information. The curator should still take care not to include any group-level sensitive information if the codebook is to be shared publicly.

After the data is collected and processed, we recommend that the curator go back and further annotate the dataset. For this, we recommend using the template of “datasheets for datasets” 3 or something similar to collate relevant information. While it may not be obvious how some of these questions relate to DP, all of this information is critical for ethical, transparent data curation and providing context for the data analyst.

Of the many categories of questions in this template - motivation, composition, collection, preprocessing/cleaning/labeling, uses, distribution, and maintenance - questions around cleaning and uses are particularly important for facilitating data analysis. When using DP to analyze a dataset, data analysts struggle with not being able to check how the data is cleaned 4, so including the steps taken to do so will allow the analyst to feel more confident about working with the data and allow them to save their privacy-loss budget for more interesting questions.

Second, having a clear idea of the intended use-cases of the dataset will allow the curator to tailor their initial release towards these use-cases. We recommend that the curator spend some of the privacy-loss budget on a “default DP release” that includes key contextual information. This will depend on the curator’s knowledge of the dataset, but may include: number of observations, CDFs of key variables, and statistics that capture anything odd or unexpected about the release. In addition to this information, the curator should also include any statistics relevant to the key use-cases they have outlined for this dataset.

Translating Contextual Norms into Parameters

Part of your job as a data curator will be choosing these privacy-loss parameters based on a complex variety of social and technical factors. This choice is complicated by the discrepancy between the value users claim to place on privacy and their actual behavior: while many users say they value privacy strongly, research has demonstrated that their actions often contradict their expressed opinions. This finding is called the privacy paradox. 5 Recent research has also demonstrated that privacy preferences cannot be understood in a vacuum, without considering complex social contexts and scenarios. 6

Once you are clear about the collection, composition, and use-cases of the dataset, it is important to think about the contextual norms around the dataset and to choose parameters that respect these norms. One systematic approach to this is contextual integrity. In this approach, data is subject to information norms, which are social, legal, or moral standards of how information should flow or be distributed in a given context. These norms can be analyzed by looking at five different parameters:

  1. data subject

  2. data sender

  3. data recipient

  4. information type

  5. transmission principle

Given the informational norms and context, contextual integrity conceptualizes privacy as information gathering and dissemination that obeys the appropriate flow of information according to contextual norms.

For example, one context may be the monitoring of an athlete’s health by their trusted sports coach. In this context, there are norms about how information should flow or be distributed. For example, the athlete may expect and trust that the coach collects, monitors, and shares information only with the athlete’s best (personal and competitive) interests in mind. The coach should keep sensitive health information private from other players and fans; however, the coach may disclose any information with the athlete’s consent to other coaches and trainers who would help the athlete improve. Privacy, therefore, is defined not just by what information is shared, but by looking at the subject, sender, recipient, type, and transmission of information with respect to these informational norms and context.

The contextual integrity approach helps diagnose potential disruptions to these norms, and therefore, potential violations of privacy. For example, consider an app that helps athletes train, and replaces a traditional human trainer. The app is paired with a wearable fitness tracker that collects thousands of data points about athletes, including resting heart rate and temperature changes. These datapoints are used to compute metrics such as sleep quality, readiness, recovery rate, and overall fitness. This information is stored in a centralized database, and aggregated statistics are shared with third parties such as advertisers for profit motives. An analysis using contextual integrity may reasonably find that the information norms with respect to the traditional athlete-coach relationship are significantly disrupted by the ways in which this app collects, stores, and shares information.

[ Diagram of information flows and where the norms are disrupted ]

Some disruptions that arise may include:

  • The app is fundamentally a data-collecting operation, and it collects at a much larger scale and granularity (thousands of athletes 24 hours a day, versus one athlete for a few hours a day) than a human coach might.

  • The app has different motives than a trusted coach: it aims to collect information for profit well beyond the benefits to each athlete, while a coach is motivated by having their clients be more successful.

  • The app shares data with advertisers, while a coach would mostly share data to directly improve the athlete’s performance.

Differential privacy does not solve all of these disruptions, but it may be helpful in mitigating the impact of some disruptive information flows. The app may still be violating informational norms by sharing statistics with advertisers, even if these statistics are protected using DP, because this flow is different from a traditional athlete-coach relationship. At the same time, using DP here does offer athletes protection from the specific privacy harms of re-identification, reconstruction, and membership attacks. It is up to you as a data curator to decide when DP is necessary to prevent attacks in an otherwise appropriate information flow, or whether it will simply mitigate the most harmful impacts of data sharing when the information flow itself is a privacy violation.

Once you have considered the different CI parameters of your deployment to understand how DP should be used, it is time to translate these contextual analyses into privacy-loss parameters for DP. This is not an easy task; research over the past few years has started to consider ways of doing so. 7 8

Some strategies for choosing parameters based on your deployment include:

  1. calibrating to known attacks while envisioning future attacks

  2. consulting an epsilon registry and comparing to other deployments

  3. conducting a user study to understand privacy preferences

The first strategy is to choose parameters based on how the sensitive data in your deployment has been, or may be, attacked. More broadly, you can consider the disclosure allowed by existing mechanisms, and the threats posed by existing attacks, in order to choose an appropriate and acceptable epsilon. This is helpful for testing parameters, building intuition around how well these parameters protect against existing attacks, and communicating with stakeholders. However, calibrating parameters solely to known attacks is not recommended, because it will not keep your deployment robust to future attacks.9 For example, consider randomized response, where the probability of returning the true answer is modulated by the privacy-loss parameter, epsilon. This probability can be related back to the contextual analysis - how would a 75% versus a 95% probability of returning the correct answer violate the information norm? You could use this heuristic to choose an appropriate epsilon for your scenario. Similarly, you could use an existing attack to reason about privacy-loss parameters, such as setting an acceptable success rate for a membership attack and running the attack against candidate parameter settings (see Chapter 10 for more on this). It is important to remember, however, that choosing privacy-loss parameters based on this level of disclosure does not necessarily protect your deployment from much greater levels of disclosure from algorithms and attacks that you are not aware of at present. If you do use this approach, do so with caution and additional conservatism.
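The mapping between epsilon and the probability of a truthful answer is easy to compute for binary randomized response. A minimal sketch of our own: the 75% and 95% probabilities mentioned above correspond to roughly ε ≈ 1.1 and ε ≈ 2.9.

import math

def rr_truth_probability(epsilon):
    # Binary randomized response: report the true bit with probability
    # e^eps / (1 + e^eps), otherwise flip it; this mechanism is eps-DP.
    return math.exp(epsilon) / (1 + math.exp(epsilon))

def rr_epsilon_for_probability(p):
    # Inverse: the epsilon implied by a given probability of truth-telling.
    return math.log(p / (1 - p))

for eps in (0.1, 0.5, 1.0, 2.0, 3.0):
    print(f"epsilon = {eps:3.1f} -> truthful with probability {rr_truth_probability(eps):.2%}")

print(f"75% truthful corresponds to epsilon = {rr_epsilon_for_probability(0.75):.2f}")
print(f"95% truthful corresponds to epsilon = {rr_epsilon_for_probability(0.95):.2f}")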

The second strategy is to look to similar deployments and their choices of privacy-loss parameters in order to make decisions for your deployment. This was proposed by Dwork, Kohli and Mulligan as an “Epsilon Registry,” which they describe as “a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.” The proposed registry contains information such as information flows and use-cases, granularity of protection (individual-level or event-level), epsilon per datum, rate of privacy loss, privacy loss budget, variant of DP that is used, and justification for these and other implementation choices. Using the parameters from your contextual integrity analysis will allow you to determine how to compare your deployment to others in the registry, so that you can make an informed choice.

Finally, you may consider doing user studies on data subjects, users, and other stakeholders of your deployment to reason about the appropriate privacy-utility tradeoff at a more granular level. When doing this, you should again be mindful that any surveys you use to elicit preferences should include as much detailed context as possible. Otherwise, as has been demonstrated by countless research studies, you may obtain answers from participants that do not actually reflect their desires and needs around this privacy-utility tradeoff. 10 In addition, you may want to go even beyond contextual information norms and privacy preferences to consider vulnerabilities - the ways in which people, especially those at the margins of your data population, are more vulnerable to privacy attacks and harms than many - including themselves - may realize. For example, while many consider the US decennial census to collect relatively non-sensitive information about US residents, the citizenship and housing information it collects can leave those who are vulnerable with respect to these attributes open to severe harm. Even when faced with pressure from data users and other stakeholders to weigh utility more heavily than privacy, it is the responsibility of the data curator to choose parameters that adequately protect those who are most vulnerable.

Legal and moral requirements should also factor into your choice of parameters. For example, in a company that works with user data, data subjects may not have visibility into how their data will be used. Therefore, you shouldn’t choose your parameters based solely on user preferences; you should also take into account legal requirements and moral standards - what you (the company) think is responsible. The legal standard is often lax enough that protection from re-identification, reconstruction, and membership attacks won’t be sufficiently guaranteed. You should also keep in mind that differential privacy is only one point on a spectrum of privacy-protection techniques; DP can guarantee strong protection against these vulnerabilities, but other points of attack need to be defended using other technical strategies, such as multiparty computation and encryption. There is also an administrative side to protection, since defending against social engineering attacks is critical to guaranteeing user privacy. All of these aspects factor into your decision around parameters, as well as your choice of processes beyond DP to ensure robust, multifaceted protections.

Ultimately, using a combination of these three approaches - reasoning using known and unknown attacks, comparing across deployments, and understanding user needs and vulnerabilities - will give you a multifaceted way to make robust decisions around parameters.

Making these Decisions in the Context of Exploratory Data Analysis

The sections above assume that you, as a data curator or analyst, have detailed information about the context, but you may still end up asking yourself, “how do I even start the analysis?” It is a challenge to perform exploratory data analysis (EDA) safely and release summary information early on, while still rationing your budget so that you have enough left for the remainder of the analysis. In many ways, this is the hardest part of a DP project.

There are two main challenges of selecting parameters in the context of exploratory data analysis. First, you may not know what statistics to release, and the choice of statistics itself may be disclosive. Second, you may have little information about metadata parameters, such as ranges or categories, for the statistics you would like to release.

Let us consider the first problem of choosing statistics in a privacy-preserving way. In many cases, the goal of working with the data will be to explore what is even interesting or relevant to release. Making these data-dependent decisions about which statistics to release in the first place can require a dedicated privacy-loss budget. The problem here is to permit “access to the raw data, while ensuring that the choice of statistics is not disclosive.” 11 If this choice of statistics is differentially private, then you “can release these privately chosen statistics using privacy-preserving algorithms.”12 One solution, proposed by Dwork and Ullman, is for an analyst to (forgetfully) perform EDA on independent slices of the data to come up with the statistics they would like to release. Here, “forgetfully” means they run the same EDA process on each slice, trying their best to forget what they learned in between. Then, they apply a DP mechanism, such as the exponential mechanism, over all the sets of statistics to choose which ones to release.
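For reference, here is a minimal sketch of the exponential mechanism itself (our own example; the candidate statistics, the “votes across EDA slices” utility score, and its sensitivity of 1 are hypothetical illustrations, not the cited paper’s exact construction):

import numpy as np

rng = np.random.default_rng(0)

def exponential_mechanism(candidates, utilities, epsilon, sensitivity):
    # Select a candidate with probability proportional to
    # exp(epsilon * utility / (2 * sensitivity)); this selection is epsilon-DP.
    scores = epsilon * np.asarray(utilities, dtype=float) / (2 * sensitivity)
    scores -= scores.max()              # stabilize before exponentiating
    probabilities = np.exp(scores)
    probabilities /= probabilities.sum()
    return candidates[rng.choice(len(candidates), p=probabilities)]

# Hypothetical utilities: how many EDA slices "voted" for each statistic.
statistics = ["mean income", "income CDF", "age histogram", "count by region"]
votes = [7, 3, 9, 1]
print(exponential_mechanism(statistics, votes, epsilon=1.0, sensitivity=1))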

Second, let us consider how to choose metadata parameters when you have little context about what they might be. There are two approaches for doing so.

First, you can train on public data. For example, say you have a dataset that contains incomes of employees in an industry that you are unfamiliar with. You want to release the mean value of these incomes, but when using DP, this would require providing bounds on the range of the incomes. If you had no idea what these bounds might be, you could turn to publicly available data on historical incomes from this industry and use the inflation-adjusted estimates to supply as your bounds.

Second, there are algorithms that use privacy budget to estimate parameters. You should only use these algorithms if you feel that your own estimates may be way off base. These algorithms often use iterative approaches. For example, in the income example above, you could use a small portion of your privacy budget, say 1/10, to run a DP binary search algorithm to estimate the 5th and 95th quantiles of the distribution of incomes. Then, you can use these estimated quantiles as the input bounds for your DP mean. A second approach is to create a DP histogram over an arbitrarily large set of values. The recommendation is to choose the bins and set of values based on the data type - e.g., floating-point numbers. In this case, the bins are chosen with spacings that increase exponentially as they move away from 0. Noise is added to each count, and then the approximate maximum of the input values is chosen by selecting the most significant bin whose count exceeds some threshold t, where t depends on your desired probability of not selecting a false positive. 13 However, we still recommend that you use your own estimates for the bounds instead of splitting the budget whenever possible.
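Here is a minimal sketch of that second approach (our own simplified version of the exponentially spaced DP histogram idea, not the cited paper’s exact algorithm): noisy counts over log-spaced bins, with the highest bin that clears a threshold taken as the approximate upper bound.

import numpy as np

rng = np.random.default_rng(0)

def dp_approximate_max(data, epsilon, threshold):
    # Exponentially spaced bin edges from 1 up to 1e9 (log10 steps of 0.1).
    edges = np.logspace(0, 9, num=91)
    counts, _ = np.histogram(data, bins=edges)
    # Each record lands in exactly one bin, so Laplace(1/epsilon) noise
    # per count suffices for an epsilon-DP histogram.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    # Return the upper edge of the highest bin whose noisy count clears the
    # threshold; the threshold controls the chance of picking an empty bin.
    above = np.nonzero(noisy > threshold)[0]
    return edges[above.max() + 1] if above.size else None

incomes = rng.lognormal(mean=10.5, sigma=0.6, size=5_000)   # hypothetical incomes
print(dp_approximate_max(incomes, epsilon=0.5, threshold=20))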

Using these algorithms still requires some decisions on your end. For example, using the binary search algorithm requires choosing which quantiles to even estimate. For this, we recommend following rules-of-thumb, such as estimating the 5th and 95th quantile, or using logarithmically spaced bins for histograms.

Adaptively Choosing Privacy Parameters

Oftentimes the analyst does not know what queries they would like to submit until after doing some exploratory DP analysis. In this setting, the analyst does not know how much budget to allocate to future queries until they’ve already made some DP releases. Unfortunately, the composition of DP releases requires the privacy parameters to be known ahead of time. When the privacy parameters for each query can be chosen adaptively, as the analysis progresses, the worst-case privacy expenditure increases. This limitation of DP compositors motivates the use of privacy odometers and privacy filters in exploratory data analysis.

A privacy odometer tracks the total privacy expenditure of a series of adaptively-chosen queries. Odometers assume that the analyst is able to tailor their next query based on the previous release. A privacy filter is very similar to a privacy odometer, but once the privacy expenditure exceeds a pre-set budget, the privacy filter refuses to answer any more queries about the dataset. Odometers and filters allow privacy parameters to be set adaptively, but incur a loss of utility and an increase in the complexity of the privacy analysis.
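A minimal sketch of a privacy filter follows (our own illustration, using naive summation of epsilons as the running total; real odometers and filters for adaptively chosen parameters require a more careful composition analysis, which is part of the cost mentioned above):

class BasicPrivacyFilter:
    """Tracks the epsilon spent by adaptively chosen queries and refuses
    any query that would push the running total past a pre-set budget."""

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0              # the running total acts as an odometer reading

    def request(self, epsilon):
        if self.spent + epsilon > self.budget:
            raise RuntimeError("privacy budget exhausted: query refused")
        self.spent += epsilon         # caller may now run an epsilon-DP query

filter_ = BasicPrivacyFilter(budget=1.0)
filter_.request(0.4)    # exploratory histogram
filter_.request(0.5)    # follow-up mean
# filter_.request(0.2)  # would raise: only 0.1 of the budget remains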

Potential (Unexpected) Consequences of Transparent Parameter Selection

One of the benefits of DP is that it enables transparency around the method used for noise addition, including the many parameters we’ve discussed above that are used within the process. This is a great improvement over previous “security-through-obscurity” approaches, where the method itself had to remain secret in order to guarantee protection. However, transparency brings its own challenges. Many deployments of DP have disregarded the additional work needed to support transparency and have suffered controversy and critique as a result. As a data curator, you should be aware of the issues that may arise so that you can prepare in advance of your deployment.

First, DP makes the tradeoff between privacy and utility explicit, and the finite privacy-loss budget can turn allocation into a zero-sum game. When you explain DP to data users, you will have to make it clear that there is a strict privacy-utility tradeoff, as well as a finite privacy-loss budget. Be aware that many data users may not have thought of privacy and utility this way before, so this may be a shocking revelation. Data users who may have worked together before will now be wary that allocating a larger share of the budget to one group diminishes the share of the budget available to other groups. Therefore, asking data users to come to a consensus on how the privacy-loss budget should be distributed may be a contentious process. This is exactly what happened during the use of DP in the 2020 Decennial Census, where coalitions of data users fell apart because of the zero-sum game they were suddenly asked to participate in through the use of DP. 14 15

This should not discourage you from being transparent about DP. Transparency and accountability are important benefits of using DP compared to prior approaches to disclosure avoidance. Nonetheless, you should be prepared to justify how all the decisions are made regarding the setting of the privacy-loss budget and the allocation of the budget amongst data users. Before advertising that data will be made available to data users using DP, you may consider setting a policy around how much budget will be used for internal research and/or trusted partners. Then, to allocate the rest of the budget, you might consider using a formal process where data users submit anonymous applications asking for a portion of the budget, and you use a third-party service to make these decisions as objectively as possible. On the other hand, if you would like collective engagement from data users on these decisions, you should set up modes of communication that mitigate the antagonistic aspects of this decision, such as building these agreements based on personal relationships and emphasizing shared goals.

Throughout this process, you will have to communicate to stakeholders what the privacy-loss parameters actually mean for their contexts. This is no easy task. The challenges of explaining these parameters are well documented 16; there are many ways in which stakeholders may misunderstand them. Data users who have a more sophisticated statistical background may be better positioned than those without, which can make it hard for the group to come to a consensus. Across the board, stakeholders may not be used to reasoning about probabilistic risks.

Although research into communicating privacy-loss parameters is still nascent, emerging research 17 suggests that “odds-based explanation” methods can be effective in communicating about these parameters. This may look like the following text:

“If a data subject does respond to a true/false survey question, [x] out of 100 potential DP outputs will lead an analyst to believe that the data subject responded with [true].”

“If a data subject does not respond to a true/false survey question, [y] out of 100 potential DP outputs will lead an analyst to believe that the data subject responded with [true].”

Other approaches include frequency visualization and sets of sample DP outputs. Visualizations, in particular, are underexplored in the DP literature and may be incredibly beneficial in communicating with data subjects and data users. Hands-on explainers (such as this notebook by OpenDP: https://docs.opendp.org/en/stable/user/programming-framework/a-framework-to-understand-dp.html#Distance-Between-Distributions---Divergence) can also be valuable for data scientists. Regardless of what approaches you use, you should remain sensitive to how challenging it is for stakeholders to evaluate these parameters and set expectations accordingly.

As a data curator, you will need to understand the relationship between your data’s privacy needs and the privacy-loss parameter values you choose. The choice of privacy-loss parameter values is key to protecting the privacy of subjects in the data set. As a data curator, this is also your responsibility, and may exist in tension with the desire to extract utility from the data for research or commercial purposes. Further, you will need to understand how to choose metadata parameters without regard for the values in the underlying data set. This decision also impacts the utility of the DP data analysis - for example, clamping values to an inappropriate range can lead to flawed statistics. By now, you should have a solid enough theoretical understanding of DP as a practice to be able to curate sensitive data in a way that results in useful DP statistics.

However, DP data curation doesn’t exist in a vacuum; it is part of a network of researchers and institutions hoping to glean useful information from sensitive data. Since differential privacy imposes a trade-off between utility and privacy, this practice can lead to friction between competing interests. Beyond understanding the theory of DP, it is important that you, as a data curator, can also navigate the personal and professional implications of allocating a DP budget. In this chapter, you have learned several techniques for mitigating such friction - including effective communication of privacy budget allocation and the use of third-party organizations to anonymously process DP budget requests. Remember that a significant part of the data curation process may rest on personal relationships; allocating budget with a commitment to shared goals, and publishing contextual DP statistics as metadata, can help to minimize contentiousness and make the DP data analysis process more productive for everyone.

In the next chapter, you will take your new knowledge of privacy-loss parameters, metadata parameter selection, and the organizational challenges of DP data curation, and learn how to secure your sensitive data against common privacy attacks. In particular, when choosing privacy loss parameters and allocating budget, you should have vulnerabilities in mind. Understanding these attacks will help you make responsible decisions as a data curator and serve a useful purpose as cautionary tales when communicating with data analysts.

Exercises

  1. Consider a personal trainer app that tracks exercise and health habits. How could this app violate contextual integrity, when compared to a human trainer?

  2. Take the following function and generate a dataset with unknown bounds

import random

def random_distribution():
    # Draw 200 samples uniformly at random from a very wide range,
    # simulating a dataset whose bounds are unknown to the analyst.
    return [random.uniform(-1e9, 1e9) for _ in range(200)]
  3. Suppose you have a maximum privacy loss budget of ϵ=4.

    1. How would you define and explain your choice of δ?

    2. How do you divide your budget between data exploration and data analysis?

    3. If this data set is referring to markers in blood tests, how do you define your data bounds?

    4. Explain how your choice of sampling method can affect the privacy-utility tradeoff.

1 See Chapter 2 for a definition and introduction to data curators and other DP roles.

2 M. Bun, J. Drechsler, M. Gaboardi, A. McMillan, and J. Sarathy, “Controlling Privacy Loss in Sampling Schemes: An Analysis of Stratified and Cluster Sampling,” 2022.

3 T. Gebru et al., “Datasheets for Datasets.” arXiv, Dec. 01, 2021. doi: 10.48550/arXiv.1803.09010.

4 M. Bun, J. Drechsler, M. Gaboardi, A. McMillan, and J. Sarathy, “Controlling Privacy Loss in Sampling Schemes: An Analysis of Stratified and Cluster Sampling,” in 3rd Symposium on Foundations of Responsible Computing (FORC 2022), L. E. Celis, Ed., in Leibniz International Proceedings in Informatics (LIPIcs), vol. 218. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022, p. 1:1-1:24. doi: 10.4230/LIPIcs.FORC.2022.1.

5 S. Barth and M. D. T. de Jong, “The privacy paradox – Investigating discrepancies between expressed privacy concerns and actual online behavior – A systematic literature review,” Telematics and Informatics, vol. 34, no. 7, pp. 1038–1058, Nov. 2017, doi: 10.1016/j.tele.2017.04.013.

6 K. Martin and H. Nissenbaum, “Measuring Privacy: An Empirical Test Using Context to Expose Confounding Variables,” Colum. Sci. & Tech. L. Rev., vol. 18, p. 176, 2016.

7 S. Benthall, S. Gürses, and H. Nissenbaum, Contextual integrity through the lens of computer science. Now Publishers, 2017.

8 P. Nanayakkara, M.A. Smart, R. Cummings, G. Kaptchuk, and E. Redmiles, “What Are the Chances? Explaining the Epsilon Parameter in Differential Privacy.” 2023

9 See Chapter 10 for more about protecting against such attacks.

10 K. Martin and H. Nissenbaum, “Measuring Privacy: An Empirical Test Using Context to Expose Confounding Variables,” Colum. Sci. & Tech. L. Rev., vol. 18, p. 176, 2016.

11 C. Dwork and J. Ullman, “The Fienberg Problem: How to Allow Human Interactive Data Analysis in the Age of Differential Privacy,” Journal of Privacy and Confidentiality, vol. 8, no. 1, Art. no. 1, Dec. 2018, doi: 10.29012/jpc.687.

12 Ibid.

13 R. J. Wilson, C. Y. Zhang, W. Lam, D. Desfontaines, D. Simmons-Marengo, and B. Gipson, “Differentially Private SQL with Bounded User Contribution.” arXiv, Nov. 25, 2019. doi: 10.48550/arXiv.1909.01917.

14 M. Hawes. “Implementing differential privacy: Seven lessons from the 2020 United States Census.” Harvard Data Science Review, 2020.

15 d. boyd, J. Sarathy. “Differential Perspectives: Epistemic Disconnects Surrounding the US Census Bureau’s Use of Differential Privacy.” Harvard Data Science Review, 2022.

16 R. Cummings, G. Kaptchuk, and E. Redmiles, “‘I need a better description’: An Investigation Into User Expectations For Differential Privacy,” Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, 2021.

17 P. Nanayakkara, M. A. Smart, R. Cummings, G. Kaptchuk, E. Redmiles. “What Are the Chances? Explaining the Epsilon Parameter in Differential Privacy.” Proceedings of the 32nd USENIX Security Symposium, 2023.
