Introduction
In a previous post on Learnability and Uniform Convergence, we saw that finite VC dimension of a hypothesis class is both a necessary and sufficient condition for learnability in classification and regression problems. This would seem to be a complete characterization of learnability for such problems, but there are important cases which it does not cover. For example, the characterization implicitly assumes that we are learning a function from a fixed (prespecified) hypothesis class, $\mathcal{H}$. What if we want to pose a different constraint? For example, consider the following one: “We will consider up to $k$ hypotheses, adaptively chosen from the space of all functions from $\mathcal{X}$ to $\mathcal{Y}$.” Is learning possible in this context? Without pre-specifying anything beyond the number of hypotheses which we will consider, are we not bound to overfit?
The answer to this question is, indeed, yes, at least if we stick with empirical risk minimization (ERM) algorithms. When ERM is used under such a constraint, overfitting arises because the hypotheses chosen by the analyst lose independence from the data used to validate them. Without this independence, we cannot easily upper-bound the probability of large deviations between the true and estimated accuracy of any given hypothesis.
In fact, there are many settings, such as scientific studies in the social sciences or machine learning competitions, in which it is unnatural or constraining to prespecify a range of hypotheses. Data analysts naturally prefer to interact with the data before deciding anything about it. The downfalls of poorly accounting for the resulting dependencies have become something of a cultural phenomenon in the form of the “reproducibility crisis,” with notable examples including studies that follow standard methodological rigor and yet purport to prove the existence of Extra-Sensory Perception.
A recent line of work has attempted to address this sweeping problem of adaptivity in data analysis by linking it to a separate but related problem known as differential privacy. Differential privacy addresses the question of how to allow nearly unrestricted access to a finite database through a restricted space of queries (such as statistical summaries) without revealing individual entries (i.e. private information).
In this post, we consider a key paper within this line of work, “Algorithmic Stability for Adaptive Data Analysis” by Bassily et al. [1], which gives a general transfer theorem showing that a differentially private mechanism can be used to avoid overfitting. At the time of the paper, there was one other major work, “Preserving Statistical Validity in Adaptive Data Analysis” by Dwork et al. [2], addressing the connection between differential privacy and generalization. Bassily et al. improve the guarantees which were first presented by Dwork et al., and extend them to a broader class of queries.
Aside from being applicable in a broader context than uniform-convergence-like bounds, the results of this paper are also interesting because they establish a different perspective on active mechanisms that introduce a form of distortion in order to bring about better generalization properties. Usually such mechanisms, when used e.g. for training neural networks, are viewed as a form of (Tikhonov) regularization. It is an interesting question whether differentially private mechanisms might provide a richer set of tools (specifically, alternatives to ERM) for achieving a better bias-variance trade-off in machine learning contexts.
In the following sections, we will outline the major contributions of the paper, and provide both a high-level and detailed overview of the most important proofs.
Main Contributions
We will begin by defining the important components. We assume that there exists a distribution $\mathcal{P}$ over a domain $\mathcal{X}$. We denote a sample of size $n$ drawn i.i.d. from $\mathcal{P}$ as $S = (x_1, \ldots, x_n) \sim \mathcal{P}^n$.
Definition (Statistical Query): A statistical query is a function $q: \mathcal{X} \to [0, 1]$. For brevity, we use the following notation to refer to the expectation and empirical average of $q$, respectively:
$$q(\mathcal{P}) = \mathop{\mathbb{E}}_{x \sim \mathcal{P}}\left[ q(x) \right], \qquad q(S) = \frac{1}{n} \sum_{i=1}^{n} q(x_i).$$
The main result of the paper applies to a strict generalization of a statistical query called a $\Delta$-sensitive query:
Definition ($\Delta$-sensitive Query): For $n \in \mathbb{N}$ and $\Delta \in (0, 1]$, a $\Delta$-sensitive query is a function $q: \mathcal{X}^n \to \mathbb{R}$ satisfying $|q(S) - q(S')| \le \Delta$ for every pair $S, S' \in \mathcal{X}^n$ differing in only one entry. In this case, we define $q(\mathcal{P}) = \mathbb{E}_{S \sim \mathcal{P}^n}[q(S)]$.
Notice that a statistical query is a $1/n$-sensitive query.
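To make these objects concrete, here is a minimal numerical sketch in Python (the variable names and the particular choice of $q$ and $\mathcal{P}$ are mine, not the paper's). It computes the empirical average $q(S)$ of a statistical query, compares it to the population value $q(\mathcal{P})$, and checks the $1/n$-sensitivity claim by modifying a single element of the sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(x):
    """A statistical query q: X -> [0, 1]; here, the indicator that x exceeds 0.5."""
    return (x > 0.5).astype(float)

def empirical_value(q, S):
    """q(S): the empirical average of q over the sample S."""
    return q(S).mean()

n = 1000
S = rng.uniform(0, 1, size=n)   # a sample of size n from P = Uniform(0, 1)
population_value = 0.5          # q(P) = Pr[x > 0.5] for this choice of q and P

print("q(S) =", empirical_value(q, S), "  q(P) =", population_value)

# A statistical query is 1/n-sensitive: changing one element of S moves q(S) by at most 1/n.
S_neighbor = S.copy()
S_neighbor[0] = 0.0
print("effect of one changed element:",
      abs(empirical_value(q, S) - empirical_value(q, S_neighbor)), "<= 1/n =", 1.0 / n)
```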
The setting in question is one in which an algorithm or analyst $\mathcal{A}$ issues a sequence of queries $q_1, q_2, \ldots, q_k$, which are answered one at a time by a mechanism $\mathcal{M}$. $\mathcal{M}$ bases its answers $a_1, a_2, \ldots, a_k$ on a sample $S$, which remains fixed throughout the sequence of interactions. Both $\mathcal{A}$ and $\mathcal{M}$ are stateful, meaning that $q_j$ and $a_j$ may depend on the entire preceding transcript $q_1, a_1, \ldots, q_{j-1}, a_{j-1}$. The error of an answer $a$, generated by the mechanism when provided with a query $q$, is measured with respect to the sample as $a - q(S)$ and with respect to the distribution as $a - q(\mathcal{P})$.
Generally speaking, there will be very little we can do if the mechanism is not at least accurate with respect to its sample, $S$. This requirement is formalized via the following definition, which says that the maximum error, over the $k$ queries, between the answers returned by the mechanism and the actual values of the queries on the sample must be small with high probability.
Definition (Sample Accuracy): A mechanism $\mathcal{M}$ is $(\alpha, \beta)$-accurate with respect to samples of size $n$ from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$ if for every adversary $\mathcal{A}$ and every sample $S \in \mathcal{X}^n$,
$$\Pr\left[ \max_{j \in [k]} \left| a_j - q_j(S) \right| \le \alpha \right] \ge 1 - \beta.$$
This may seem like an overly easy requirement to satisfy. After all, if the mechanism simply returns the empirical value $q_j(S)$ for each query, then the sample error will always be zero. But we remind the reader that doing so would almost certainly lead to overfitting. To avoid overfitting, we also need the mechanism to satisfy a form of differential privacy, which naturally prevents the mechanism from having perfect sample accuracy on every query.
Definition (Differential Privacy): Let $\mathcal{M}: \mathcal{X}^n \to \mathcal{R}$ be a randomized algorithm. We say that $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private if for every pair of samples $S, S' \in \mathcal{X}^n$ that differ in exactly one element, and every event $T \subseteq \mathcal{R}$,
$$\Pr\left[ \mathcal{M}(S) \in T \right] \le e^{\varepsilon} \Pr\left[ \mathcal{M}(S') \in T \right] + \delta.$$
This definition of differential privacy functions somewhat like a constraint on the distance between the output distributions of an algorithm when its inputs differ by just a single element. It is important to know that differential privacy is closed under post-processing: if $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private, then the composition $f \circ \mathcal{M}$ is $(\varepsilon, \delta)$-differentially private for any (possibly randomized) $f$. Note that the definition above cannot be applied directly to the interactive mechanism $\mathcal{M}$ of our setting, since it takes as input both a sample from $\mathcal{X}^n$ and queries from $Q$. This will be addressed in the proof details section.
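As a concrete (and entirely standard) example of an algorithm satisfying this definition, the Laplace mechanism adds noise calibrated to a query's sensitivity: the empirical average of a statistical query changes by at most $1/n$ when one sample element changes, so adding $\mathrm{Laplace}(1/(n\varepsilon))$ noise gives an $(\varepsilon, 0)$-differentially private answer. The sketch below is only illustrative; the paper's results apply to any mechanism meeting the definitions, not to this one specifically. Clipping the noisy answer afterwards is an example of post-processing, which preserves the guarantee.

```python
import numpy as np

rng = np.random.default_rng(1)

def laplace_mechanism(S, q, epsilon):
    """Answer the statistical query q on sample S with (epsilon, 0)-differential privacy.

    The empirical average of a [0, 1]-valued query moves by at most 1/len(S) when a
    single element of S changes, so Laplace noise of scale 1/(len(S) * epsilon) suffices.
    """
    sensitivity = 1.0 / len(S)
    return q(S).mean() + rng.laplace(scale=sensitivity / epsilon)

def q(x):
    return (x > 0.5).astype(float)

S = rng.uniform(0, 1, size=1000)
a = laplace_mechanism(S, q, epsilon=0.5)

# Post-processing: any function of the private answer (here, clipping to [0, 1])
# remains (epsilon, 0)-differentially private.
a_clipped = min(max(a, 0.0), 1.0)
print(a, a_clipped)
```

Answering many queries this way requires splitting the privacy budget across them (composition), which is where the more sophisticated mechanisms referenced later come in.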
Our goal is to show that a mechanism satisfying both sample accuracy and differential privacy also satisfies an analogous accuracy requirement with respect to the population, such that overfitting is precluded:
Definition (Population Accuracy): A mechanism $\mathcal{M}$ is $(\alpha, \beta)$-accurate with respect to the population for $k$ adaptively chosen queries from $Q$ if for every adversary $\mathcal{A}$,
$$\Pr\left[ \max_{j \in [k]} \left| a_j - q_j(\mathcal{P}) \right| \le \alpha \right] \ge 1 - \beta,$$
where the sample $S$ held by the mechanism is drawn from $\mathcal{P}^n$.
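The gap between sample accuracy and population accuracy is exactly what adaptive overfitting exploits. The following toy simulation (a common illustration, not taken from the paper; all names are mine) has an analyst ask for the empirical value of $k$ statistical queries, one per random feature, and then form a single adaptive follow-up query from the features that looked most correlated with a label. With exact empirical answers, the final query looks impressive on the sample even though its population value is exactly $1/2$, as a fresh sample reveals.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, top_m = 100, 1000, 20

# Each individual is d random +/-1 "features" plus a +/-1 "label"; everything is
# independent, so no feature is truly correlated with the label.
features = rng.choice([-1, 1], size=(n, d))
labels = rng.choice([-1, 1], size=n)

# Round 1 (k = d statistical queries): the analyst asks for the empirical value of
# q_j(x) = (1 + x_j * y) / 2 and the mechanism answers exactly (perfect sample accuracy).
answers = ((1 + features * labels[:, None]) / 2).mean(axis=0)

# Round 2 (one adaptive query): combine the top_m features that looked most correlated,
# signed by the direction of their apparent correlation.
top = np.argsort(np.abs(answers - 0.5))[-top_m:]
signs = np.sign(answers[top] - 0.5)

def q_final(feats, labs):
    # A valid statistical query with range [0, 1]; its population value is exactly 0.5.
    return ((1 + (feats[:, top] * signs).mean(axis=1) * labs) / 2).mean()

fresh_features = rng.choice([-1, 1], size=(n, d))
fresh_labels = rng.choice([-1, 1], size=n)

print("adaptive query on the original sample:", q_final(features, labels))              # well above 0.5
print("adaptive query on a fresh sample:     ", q_final(fresh_features, fresh_labels))  # close to 0.5
print("true population value:                 0.5")
```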
The main result of the paper is a transfer theorem which gives the desired guarantee:
Theorem (Main Transfer Theorem): Let $Q$ be a family of $\Delta$-sensitive queries on $\mathcal{X}$. Assume that, for some $\alpha, \beta \in (0, 1/2)$, $\mathcal{M}$ is
- $(\varepsilon, \delta)$-differentially private for $k$ adaptively chosen queries from $Q$*, with $\varepsilon$ on the order of $\alpha / (n\Delta)$ and $\delta$ on the order of $\alpha\beta / (n\Delta)$ (see the paper for the exact constants), and
- $(\alpha', \beta')$-accurate with respect to its sample for $n$ samples from $\mathcal{X}$ for $k$ adaptively chosen queries from $Q$, with $\alpha'$ a constant fraction of $\alpha$ and $\beta'$ a constant fraction of $\alpha\beta$.
Then $\mathcal{M}$ is $(\alpha, \beta)$-accurate with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples from $\mathcal{P}$.
* This is a variant of the definition given above, which I will highlight in the proof details section.
When combined with existing differentially private mechanisms, this theorem guarantees the existence of an efficient mechanism that can answer a number of queries quadratic in the size of the sample without overfitting:
Corollary: There is a mechanism $\mathcal{M}$ that is $(\alpha, \beta)$-accurate with respect to the population for $k$ adaptively chosen statistical queries given $n$ samples from $\mathcal{P}$, provided that, up to logarithmic factors, $n \gtrsim \sqrt{k}/\alpha^{2}$ (see the paper for the precise bound). The mechanism runs in time $\mathrm{poly}(n, \log|\mathcal{X}|)$ per query.
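To see where the claim of quadratically many queries comes from, it helps to invert the sample-complexity requirement. Ignoring constants and the logarithmic factors hidden above,
$$n \;\gtrsim\; \frac{\sqrt{k}}{\alpha^{2}} \quad \Longleftrightarrow \quad k \;\lesssim\; n^{2}\alpha^{4},$$
so for any fixed target accuracy $\alpha$, the number of adaptively chosen statistical queries that can be answered grows quadratically with the sample size $n$.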
Overview of the Proof
A key maneuver shared between the paper and the preceding work is to combine the mechanism $\mathcal{M}$ and the analyst $\mathcal{A}$ into a single non-interactive randomized wrapper algorithm, $\mathcal{W}$, which satisfies differential privacy. The objective is then to show that any query output by this wrapper must exhibit strong concentration toward its expected value. The previous paper showed this by directly showing that a differentially private mechanism must preserve statistical moments. The paper under review provides an innovative proof which achieves tighter guarantees while being expositionally simpler.
The first step is a technical lemma, which applies to a general mechanism $\mathcal{W}$ that is presented with a collection of samples $\vec{S} = (S_1, \ldots, S_T)$ and outputs a pair $(q, t)$ consisting of a query and a sample index. The lemma guarantees that the empirical value of the query on the selected sample, $q(S_t)$, must be close to the population value $q(\mathcal{P})$ if $\mathcal{W}$ is differentially private.
The next step is to specify a form of $\mathcal{W}$ that is designed to amplify differences between $q(S_t)$ and $q(\mathcal{P})$ by picking the query-sample pair for which this difference is the largest. It is then shown that if $\mathcal{M}$ does not satisfy population accuracy, then the previous technical lemma fails for this $\mathcal{W}$, which is differentially private by construction. This contradiction gives the desired result.
Details of the Proof
Technical Lemma
The most technical portion of the proof is the following lemma:
Lemma (Main Technical Lemma): Let $\mathcal{W}: (\mathcal{X}^n)^T \to Q \times [T]$ be $(\varepsilon, \delta)$-differentially private, where $Q$ is the class of $\Delta$-sensitive queries $q: \mathcal{X}^n \to \mathbb{R}$. Let $\mathcal{P}$ be a distribution on $\mathcal{X}$ and let $\vec{S} = (S_1, \ldots, S_T) \sim (\mathcal{P}^n)^T$. Then, writing $(q, t) = \mathcal{W}(\vec{S})$,
$$\left| \mathop{\mathbb{E}}_{\vec{S},\, \mathcal{W}} \left[ q(\mathcal{P}) - q(S_t) \right] \right| \;\le\; 2\left( e^{\varepsilon} - 1 + T\delta \right) n \Delta.$$
Before attempting the proof, it is useful to state and prove a much simpler lemma which, while not in the paper, illustrates the general approach for relating the definition of differential privacy to an expectation bound of this kind. Having this destination in mind will guide us in proving the main lemma.
Lemma (Differential Privacy to Stability): Let $F$ be a family of functions from $\mathcal{X}^n$ to $[0, B]$, and let $\mathcal{W}$ be an $(\varepsilon, \delta)$-differentially private algorithm that takes a sample as input and outputs a function $q \in F$. Let $S$ and $S'$ be identically distributed inputs to $\mathcal{W}$ which are constrained to be neighbors, and let $R$ be a sample in $\mathcal{X}^n$ which may depend on both $S$ and $S'$. Then,
$$\mathop{\mathbb{E}}_{q = \mathcal{W}(S)}\left[ q(R) \right] \;\le\; e^{\varepsilon} \mathop{\mathbb{E}}_{q = \mathcal{W}(S')}\left[ q(R) \right] + \delta B.$$
For use later on, we will define $Z$ as $q(R)$ where $q = \mathcal{W}(S)$, and $Z'$ as $q(R)$ where $q = \mathcal{W}(S')$. Then we need to show that $\mathbb{E}[Z] \le e^{\varepsilon}\, \mathbb{E}[Z'] + \delta B$.
Since $Z$ is non-negative, we can write the expectation as
$$\mathbb{E}[Z] = \int_0^{\infty} \Pr\left[ Z > z \right] dz = \int_0^{B} \Pr\left[ Z > z \right] dz.$$
But from the differential privacy of $\mathcal{W}$ (applied, conditionally on $S$, $S'$ and $R$, to the event that the returned function $q$ satisfies $q(R) > z$), we have
$$\Pr\left[ Z > z \right] \le e^{\varepsilon} \Pr\left[ Z' > z \right] + \delta.$$
Then, using the bound $B$ on the range of $q$,
$$\mathbb{E}[Z] \le \int_0^{B} \left( e^{\varepsilon} \Pr\left[ Z' > z \right] + \delta \right) dz = e^{\varepsilon}\, \mathbb{E}[Z'] + \delta B.$$
The other direction follows similarly.
We now proceed to a condensed proof of the main lemma. Our treatment restructures the original proof and provides additional commentary in order to highlight the purpose of each major step.
Let us begin by noting that in the stability lemma it is the input to the differentially private algorithm that differs ($S$ versus $S'$), whereas on the left-hand side of the main technical lemma it is the argument of $q$ that differs ($\mathcal{P}$ versus $S_t$). So an important question is how it will be possible to make use of the stability lemma at all.
The secret is to try to rewrite the left-hand side of the main lemma as a difference of terms in which only the input to $\mathcal{W}$ changes, and changes between two samples that are identically distributed. This will allow us to swap the roles played by the input to $\mathcal{W}$ and the argument of $q$, thus enabling the use of the previous lemma. Most of the work in the proof involves a sequence of steps toward this end, most of which can be seen as types of symmetrization.
The first symmetrization comes at no cost. We simply introduce a second sample $\vec{S}' \sim (\mathcal{P}^n)^T$, independent of $\vec{S}$, and observe that because of this independence $\mathbb{E}[q(\mathcal{P})] = \mathbb{E}[q(S'_t)]$, where $(q, t) = \mathcal{W}(\vec{S})$.
Then, from linearity of expectation, the left-hand side of the main lemma becomes
$$\left| \mathop{\mathbb{E}}_{\vec{S}, \vec{S}',\, (q, t) = \mathcal{W}(\vec{S})} \left[ q(S'_t) - q(S_t) \right] \right|.$$
This doesn’t yet admit expression in the form of the stability lemma, since we haven’t constrained the two samples to be neighbors. To move forward we need some notation. For $u \in \{0, 1, \ldots, n\}$ and $\ell \in [T]$, let the entries of the hybrid sample $\vec{S}^{u,\ell}$ be given by
$$\left( \vec{S}^{u,\ell} \right)_{i,j} = \begin{cases} \left( \vec{S}' \right)_{i,j} & \text{if } j < \ell, \text{ or } j = \ell \text{ and } i \le u, \\ \left( \vec{S} \right)_{i,j} & \text{otherwise,} \end{cases}$$
where $i$ indexes rows (positions within a sample) and $j$ indexes columns (the $T$ samples).
That is, $\vec{S}^{u,\ell}$ is equal to $\vec{S}'$ up to the $u$'th row of the $\ell$'th column, and equal to $\vec{S}$ beyond that. Notice that, while $\vec{S}^{u,\ell}$ depends on both $\vec{S}$ and $\vec{S}'$, it is identically distributed to both of them, namely according to $(\mathcal{P}^n)^T$. Also, by construction, $\vec{S}^{u-1,\ell}$ and $\vec{S}^{u,\ell}$ differ by just a single element.
Letting $S^{u,\ell}_\ell$ denote the $\ell$'th column of $\vec{S}^{u,\ell}$, we notice that (a) if $u = n$ then $S^{u,\ell}_\ell = S'_\ell$, and (b) if $u = 0$ then $S^{u,\ell}_\ell = S_\ell$. Telescoping, we have that for any $\ell$,
$$\mathbb{E}\left[ \left( q(S'_\ell) - q(S_\ell) \right) \mathbb{1}[t = \ell] \right] = \sum_{u=1}^{n} \mathbb{E}\left[ \left( q(S^{u,\ell}_\ell) - q(S^{u-1,\ell}_\ell) \right) \mathbb{1}[t = \ell] \right],$$
where $(q, t) = \mathcal{W}(\vec{S})$ throughout; summing over $\ell \in [T]$ recovers the quantity in the previous display.
We are now quite close to our destination. Since $\vec{S}$ and $\vec{S}^{u,\ell}$ form an exchangeable pair (swapping them does not change their joint distribution), we have, for example,
$$\mathop{\mathbb{E}}_{(q, t) = \mathcal{W}(\vec{S})}\left[ q(S^{u,\ell}_\ell)\, \mathbb{1}[t = \ell] \right] = \mathop{\mathbb{E}}_{(q, t) = \mathcal{W}(\vec{S}^{u,\ell})}\left[ q(S_\ell)\, \mathbb{1}[t = \ell] \right],$$
and similarly with $\vec{S}^{u-1,\ell}$ in place of $\vec{S}^{u,\ell}$. In each term of the telescoping sum it is now the input to $\mathcal{W}$ that differs, and it differs between the two neighboring, identically distributed samples $\vec{S}^{u,\ell}$ and $\vec{S}^{u-1,\ell}$, which is exactly the situation of the previous lemma.
The problem is that, unlike in the previous lemma, we have not required that $q$ be bounded, and thus the quantity being averaged is not necessarily non-negative or upper-bounded. To handle this, we need to be more careful. Instead of working with $q(S^{u,\ell}_\ell)$ directly, we work with the recentered quantity $q(S^{u,\ell}_\ell) - q(\tilde{S}^{u,\ell}_\ell)$, where $\tilde{S}^{u,\ell}_\ell$ is $S^{u,\ell}_\ell$ with its $u$'th element replaced by a fixed element of $\mathcal{X}$. By $\Delta$-sensitivity this recentered quantity always lies in $[-\Delta, \Delta]$, and it is possible to make the replacement at the cost of only an additive term proportional to $\Delta$. (Notice that $\tilde{S}^{u,\ell}_\ell = \tilde{S}^{u-1,\ell}_\ell$, since the only element in which the original samples differ is the one replaced by the fixed element.)
Thus, combining our progress so far, the left-hand side of the main lemma has been rewritten (up to small additive terms) as a sum over $\ell \in [T]$ and $u \in [n]$ of differences in which a bounded quantity determined by the output of $\mathcal{W}$ is averaged under two neighboring, identically distributed inputs, $\vec{S}^{u,\ell}$ and $\vec{S}^{u-1,\ell}$.
We’ve now achieved the destination that we defined in the stability lemma. By recalling that the recentered quantities have range at most of order $\Delta$ (playing the role of the bound $B$), we can work through the steps of the previous lemma to arrive at the bound stated in the main technical lemma (for exact details, see the paper).
Main Transfer Theorem Proof via Monitoring Algorithm
We now have the proper equipment to prove the main theorem. To this end, we introduce two algorithms. The first is an algorithm which simulates the interaction between the analyst $\mathcal{A}$ and the mechanism $\mathcal{M}$ for $k$ iterations.
Algorithm (Non-interactive wrapper $\mathcal{W}[\mathcal{M}, \mathcal{A}]$):
Input: A sample $S \in \mathcal{X}^n$.
For $j = 1, \ldots, k$:
Feed $a_{j-1}$ to $\mathcal{A}$ and get a query $q_j$ (for $j = 1$, $\mathcal{A}$ produces $q_1$ with no input).
Feed $q_j$ to $\mathcal{M}(S)$ and get an answer $a_j$.
Output: $(q_1, a_1, \ldots, q_k, a_k)$.
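The following is a minimal sketch of the wrapper in Python, with hypothetical `mechanism` and `analyst` interfaces (the paper treats $\mathcal{M}$ and $\mathcal{A}$ abstractly); the point is simply that the entire interaction is packaged into a single non-interactive function of the sample.

```python
def wrapper(sample, mechanism, analyst, k):
    """Simulate k rounds of interaction between a stateful analyst and a stateful
    mechanism, and return the full transcript. The transcript depends only on the
    sample and the two parties' internal randomness, so it makes sense to ask
    whether `wrapper` as a whole is differentially private.
    """
    transcript = []
    answer = None
    for _ in range(k):
        query = analyst.next_query(answer)        # the analyst may depend on all past answers
        answer = mechanism.answer(sample, query)  # the mechanism sees the fixed sample
        transcript.append((query, answer))
    return transcript
```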
The definition of differential privacy given above does not apply directly to $\mathcal{M}$, but we can use this algorithm to give a suitable definition. We say that an interactive mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-differentially private for $k$ adaptively chosen queries from $Q$ if for every adversary $\mathcal{A}$, the algorithm $\mathcal{W}[\mathcal{M}, \mathcal{A}]$ is $(\varepsilon, \delta)$-differentially private. This fills in the missing definition in the statement of the main theorem.
The next algorithm is a so-called monitoring algorithm, which runs the previous algorithm independently on each of $T$ samples, and then returns the query-sample pair displaying the greatest population error.
Algorithm (Monitoring Algorithm $\mathcal{W}_M$):
Input: $\vec{S} = (S_1, \ldots, S_T) \in (\mathcal{X}^n)^T$.
For $t = 1, \ldots, T$:
Let $(q_{t,1}, a_{t,1}, \ldots, q_{t,k}, a_{t,k}) = \mathcal{W}[\mathcal{M}, \mathcal{A}](S_t)$.
Let $j_t \in \arg\max_{j \in [k]} \left| a_{t,j} - q_{t,j}(\mathcal{P}) \right|$.
Let $t^* \in \arg\max_{t \in [T]} \left| a_{t,j_t} - q_{t,j_t}(\mathcal{P}) \right|$ and $q^* = \pm q_{t^*, j_{t^*}}$, $a^* = \pm a_{t^*, j_{t^*}}$, where the sign is chosen to make $q^*(\mathcal{P}) - a^* \ge 0$.
Output: $(q^*, t^*)$.
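The monitor is an analysis device rather than something one would deploy, since it needs the population values $q(\mathcal{P})$, which are unknown in practice. Still, it is easy to transcribe; the sketch below reuses the hypothetical `wrapper` from above and takes a `population_value` oracle standing in for $q(\mathcal{P})$.

```python
def monitor(samples, mechanism, analyst, k, population_value):
    """Run the wrapper on each of the T sub-samples and return the (signed) query and
    sample index whose answer had the largest population error. The sign is flipped
    when necessary so that the selected population error is non-negative.
    """
    best = None  # (error magnitude, signed query, sample index)
    for t, sample in enumerate(samples):
        transcript = wrapper(sample, mechanism, analyst, k)
        for query, answer in transcript:
            err = answer - population_value(query)
            if best is None or abs(err) > best[0]:
                sign = -1.0 if err > 0 else 1.0  # flip so that q*(P) - a* >= 0
                best = (abs(err), (sign, query), t)
    return best[1], best[2]
```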
The monitoring algorithm should be seen as a way of checking whether $\mathcal{M}$ is accurate with respect to the population. It is also quite straightforward to show that, since $\mathcal{W}_M$ merely performs post-processing on the output of $\mathcal{W}[\mathcal{M}, \mathcal{A}]$, which is assumed to be $(\varepsilon, \delta)$-differentially private, it must also be $(\varepsilon, \delta)$-differentially private by the post-processing property of differential privacy.
Applying the monitoring algorithm with $T$ on the order of $1/\beta$, the proof of the main theorem follows from two main observations:
1. If $\mathcal{M}$ is not $(\alpha, \beta)$-accurate with respect to the population, then the deviation between $\mathbb{E}[q^*(S_{t^*})]$ and $\mathbb{E}[q^*(\mathcal{P})]$ is large. In particular, since with $T$ of order $1/\beta$ independent runs at least one run exhibits population error greater than $\alpha$ with constant probability, the expected deviation is itself on the order of $\alpha$.
2. But, since $\mathcal{W}_M$ is differentially private, applying the technical lemma directly gives an upper bound on this same deviation which, for the parameter settings assumed in the theorem, is strictly smaller.
From the contradiction, we see that $\mathcal{M}$ must be $(\alpha, \beta)$-accurate with respect to the population, thus proving the main transfer theorem.
The key idea in showing observation 1 is that, since $\mathcal{M}$ is not population-accurate, $a^*$ is far from $q^*(\mathcal{P})$ with probability bounded away from zero, while, since by assumption $\mathcal{M}$ is sample-accurate, $a^*$ is close to $q^*(S_{t^*})$ with high probability. Combining the two (with expectations implicitly conditioned on the relevant events, and with the sign of $q^*$ chosen by the monitor so that the population error is non-negative) yields the claimed lower bound: the failure of population accuracy lower-bounds $\mathbb{E}[q^*(\mathcal{P}) - a^*]$, and sample accuracy ensures that replacing $a^*$ by $q^*(S_{t^*})$ changes this expectation by only a small amount.
Corollaries
The paper provides several corollaries to the main transfer theorem which arise by applying the theorem to guarantees provided by existing differentially private mechanisms. The corollary stated in the main contributions section above is an example of one of these, which makes use of an algorithm proposed in “Between Pure and Approximate Differential Privacy” by Steinke and Ullman [3]. Unfortunately, the corollaries are simply stated without derivations, which is problematic since some of the cited papers give their guarantees under slightly different assumptions.
For example, Theorem 1.3 of “Between Pure and Approximate Differential Privacy” states that there is an $(\varepsilon, \delta)$-differentially private mechanism for $k$ adaptively chosen queries whose required number of samples is expressed in terms of $k$, the privacy parameters, and an accuracy parameter $\alpha$. The accuracy there is measured with respect to the sample, which is problematic since we would like to be able to specify the accuracy independently of the other parameters appearing in the bound. Thus, it appears that, without greater familiarity with differential privacy, it would likely be necessary to open up the proofs provided in each paper in order to obtain an appropriate sample complexity bound for use with the main theorem.
Further Reading
[1] Raef Bassily et al. Algorithmic Stability for Adaptive Data Analysis. 2015. http://arxiv.org/abs/1511.02513.
[2] Cynthia Dwork et al. Preserving Statistical Validity in Adaptive Data Analysis. 2016. http://arxiv.org/abs/1411.2664
[3] Thomas Steinke and Jonathan Ullman. Between Pure and Approximate Differential Privacy. 2015. http://arxiv.org/abs/1501.06095