Introduction
In a previous post on Learnability and Uniform Convergence, we saw that finiteness of the VC-dimension of a hypothesis class $\H$ is both a necessary and sufficient condition for learnability of $\H$ for classification and regression problems, where $\H \subseteq \Y^\X$. This would seem to be a complete characterization of learnability for such problems, but there are important cases which it does not cover. For example, the characterization implicitly assumes that we are learning a function from a fixed (prespecified) hypothesis class, $\H$. What if we want to pose a different constraint? For example, consider the following: “We will consider up to $N$ hypotheses, adaptively chosen from the space of all functions from $\X$ to $\Y$.” Is learning possible in this context? Without pre-specifying anything beyond the number of hypotheses we will consider, are we not bound to overfit?
The answer to this question is indeed yes, at least if we stick with empirical risk minimization (ERM) algorithms. When ERM is used under such a constraint, overfitting arises because the hypotheses chosen by the analyst lose independence from the data used to validate them. Without this independence, we cannot easily upper-bound the probability of large deviations between the true and estimated accuracy of any given hypothesis.
In fact, there are many settings, such as scientific studies in the social sciences or machine learning competitions, in which it is unnatural or constraining to prespecify a range of hypotheses. Data analysts naturally prefer to interact with the data before deciding anything about it. The downfalls of poorly accounting for dependencies in such analyses have become something of a cultural phenomenon in the form of the “reproducibility crisis,” with notable examples including studies that follow standard methodological rigor yet appear to prove the existence of Extra-Sensory Perception.
A recent line of work has attempted to address this wide-sweeping problem of adaptivity in data analysis by linking it to a separate but related problem known as differential privacy. Differential privacy answers the question of how to allow nearly-unrestricted access to a finite database from a restricted space of queries (such as statistical summaries) without revealing individual entries (i.e. private information).
In this post, we consider a key paper within this line of work, “Algorithmic Stability for Adaptive Data Analysis” by Bassily et al. [1], which gives a general transfer theorem showing that a differentially private mechanism can be used to avoid overfitting. At the time of the paper, there was one other major work, “Preserving Statistical Validity in Adaptive Data Analysis” by Dwork et al. [2], addressing the connection between differential privacy and generalization. Bassily et al. improve the guarantees which were first presented by Dwork et al., and extend them to a broader class of queries.
Aside from being applicable in a broader context than uniform-convergence-like bounds, the results of this paper are also interesting because they establish a different perspective concerning active mechanisms that introduce a form of distortion in order to bring about better generalization properties. Usually such mechanisms, when used e.g. for training neural networks, are viewed as a form of (Tikhonov) regularization. It is an interesting question whether differentially private mechanisms might provide a richer set of tools (specifically, alternatives to ERM) for achieving a better bias-variance trade-off in machine learning contexts.
In the following sections, we will outline the major contributions of the paper, and provide both a high-level and detailed overview of the most important proofs.
Main Contributions
We will begin by defining the important components. We assume that there exists a distribution $\D$ over a domain $\X$. We denote a sample of size $n$ from $\X$ as $\x = (x_1,…,x_n) \in \X^n$.
Definition (Statistical Query): A statistical query is a function $q : \X \to [0,1]$. For brevity, we use the following notations to refer to the expectation and empirical average of $q$, respectively: $$q(\D) = \E{x \sim \D}[q(x)] \quad \text{and} \quad q(\x) = \frac{1}{n}\sum_{i \in [n]}q(x_i)$$
The main result of the paper applies to a strict generalization of a statistical query called a $\Delta$-sensitive query:
Definition ($\Delta$-sensitive Query): For $\Delta$ $\in [0,1]$, $n \in \N$, a $\Delta$-sensitive query is a function $q : \X^n \to \R$ satisfying $|q(\x)-q(\x’)| \leq \Delta$ for every pair $\x,\x’ \in \X^n$ differing by only one entry. In this case, we have $q(\D) = \E{\x \sim \D^n}[q(\x)]$.
Notice that a statistical query is a $1/n$-sensitive query.
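To make these definitions concrete, here is a small numerical sketch (not from the paper; the particular query and distribution are arbitrary choices for illustration) of a statistical query, its empirical and population values, and a check that it is indeed $1/n$-sensitive:

```python
import numpy as np

rng = np.random.default_rng(0)

# A statistical query q : X -> [0,1]; here X = [0,1] and q is an arbitrary
# threshold indicator chosen purely for illustration.
def q(points):
    return (points > 0.5).astype(float)

n = 1000
sample = rng.uniform(0, 1, size=n)   # x ~ D^n with D = Uniform[0,1]

q_sample = q(sample).mean()          # q(x): the empirical average over the sample
q_population = 0.5                   # q(D) = E[q(x)] = P[x > 0.5] for Uniform[0,1]

# Changing a single entry of the sample moves q(x) by at most 1/n,
# i.e. a statistical query is a 1/n-sensitive query.
perturbed = sample.copy()
perturbed[0] = 0.0
assert abs(q(perturbed).mean() - q_sample) <= 1.0 / n
print(q_sample, q_population)
```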
The setting in question is one in which an algorithm or analyst $\A$ issues a sequence of queries $(q_i)_{i \in [k]}$, which are answered one at a time $((a_i)_{i \in [k]})$ by a mechanism $\M$. $\M$ bases its answers on a sample $\x \sim \D^n$, which remains fixed throughout the sequence of interactions. Both $\A$ and $\M$ are stateful, meaning that $q_i$ and $a_i$ may depend on $q_1,a_1,…,q_{i-1},a_{i-1}$. The error of an answer $a$, generated by the mechanism when provided with a query $q$, is measured with respect to the sample $\x$ as $\errx(q,a) = a - q(\x)$ and with respect to the distribution $\D$ as $\errp(q,a) = a - q(\D)$.
Generally speaking, there will be very little we can do if the mechanism is not at least accurate with respect to its sample, $\x$. This requirement is formalized via the following definition, which says that over $k$ queries, the maximum deviation between the answers returned by the mechanism and the actual values of the queries on the sample must be small with high probability.
Definition (Sample Accuracy): A mechanism $\M$ is $(\alpha,\beta)$-accurate with respect to samples of size $n$ from $\X$ for $k$ adaptively chosen queries from $Q$ if for every adversary $\A$ and sample $\x \sim \D^n$,
$$ \P \left[ \max_{j \in [k]} \big|\errx(q_j,a_j)\big| \le \alpha \right] \ge 1 - \beta$$
This may seem like a very weak requirement. After all, if the mechanism simply returns the empirical quantity, $q(\x)$, then of course $\errx(q,a)$ will always be zero. But we remind the reader that doing so would almost certainly lead to overfitting. To avoid overfitting, we also need the mechanism to satisfy a form of differential privacy, which naturally prevents the mechanism from achieving perfect sample accuracy on any given query.
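Before giving that definition, it may help to see how quickly exact empirical answers lead to overfitting. The following toy simulation (my own construction, not from the paper) has an analyst adaptively combine exact answers to $d$ statistical queries on data that contains no signal at all; the resulting final query looks accurate on the sample even though its population value is essentially $1/2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 500  # sample size and number of candidate attributes

# Each point has d random +/-1 attributes plus an independent +/-1 "label";
# by construction, no attribute carries any information about the label.
X = rng.choice([-1.0, 1.0], size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)

# Round 1: d statistical queries q_i(x) = 1[attribute_i == label], answered
# exactly with their empirical averages (the naive, non-private mechanism).
empirical_agreement = (X == y[:, None]).mean(axis=0)

# Adaptive step: keep the direction in which each attribute "looked" predictive
# on the sample, and ask one final query: does the weighted vote of the
# attributes agree with the label?
selected = np.sign(empirical_agreement - 0.5)
votes = np.sign(X @ selected)

q_final_sample = (votes == y).mean()  # q*(x): typically well above chance
q_final_population = 0.5              # q*(D) ~ 1/2, since the label is independent of X
print(q_final_sample, q_final_population)
```

The gap between the two printed values is exactly the kind of large population error, despite zero sample error, that the results below are designed to rule out.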
Definition (Differential Privacy): Let $\W : \X^n \to \mathcal{R}$ be a randomized algorithm. We say that $\W$ is $(\epsilon,\delta)$-differentially private if for every pair of samples $\x$, $\x’$ that differ by exactly one element, and every $R \subseteq \mathcal{R}$,
$$\P[\W(\x) \in R] \le e^\epsilon \cdot \P[\W(\x’) \in R] + \delta$$
This definition of differential privacy functions somewhat like a constraint on the distance between the output distributions of an algorithm on two input vectors that differ by just a single element. It is important to note that differential privacy is closed under post-processing, so that if $\W$ is $(\epsilon,\delta)$-differentially private, then the composition $f \circ \W$ is $(\epsilon,\delta)$-differentially private for any (possibly random) $f$. Note that the definition above cannot be applied directly to $\M$, since $\M$ takes as input both a sample $\x \in \X^n$ and queries from $\A$. This will be addressed in the proof details section.
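For reference, the standard Laplace mechanism gives a simple concrete example of the kind of object this definition describes: adding Laplace noise of scale $\Delta/\epsilon$ to the value of a single $\Delta$-sensitive query yields an $(\epsilon,0)$-differentially private answer. This is a textbook mechanism, not the specific one used in the paper's corollary; the sketch below is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def laplace_mechanism(sample, q, sensitivity, epsilon):
    """Answer a single sensitivity-bounded query on the sample with Laplace noise.

    Adding Laplace(sensitivity / epsilon) noise to q(sample) is the standard way
    to obtain an (epsilon, 0)-differentially private answer to one query; for a
    statistical query the sensitivity is 1/n.
    """
    scale = sensitivity / epsilon
    return q(sample) + rng.laplace(loc=0.0, scale=scale)

# Example: a statistical query answered privately on a sample of size n.
n, epsilon = 1000, 0.1
sample = rng.uniform(0, 1, size=n)
q = lambda s: float((s > 0.5).mean())  # empirical average of an indicator
print(laplace_mechanism(sample, q, sensitivity=1.0 / n, epsilon=epsilon))
```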
Our goal is to show that a mechanism satisfying both sample accuracy and differential privacy also satisfies an analogous accuracy requirement with respect to the population, such that overfitting is precluded:
Definition (Population Accuracy): A mechanism $\M$ is $(\alpha,\beta)$-accurate with respect to the population for $k$ adaptively chosen queries from $Q$ if for every adversary $\A$, $$ \P \left[ \max_{j \in [k]} \big|\errp(q_j,a_j)\big| \le \alpha \right] \ge 1 - \beta$$
where the sample $\x$ held by the mechanism is drawn from $\D^n$.
The main result of the paper is a transfer theorem which gives the desired guarantee:
Theorem (Main Transfer Theorem): Let $Q$ be a family of $\Delta$-sensitive queries on $\X$. Assume that for some $\alpha,\beta \in (0,.1)$, $\M$ is
- $(\epsilon = \alpha/60\Delta n,\ \delta = \alpha\beta/32\Delta n)$-differentially private for $k$ adaptively chosen queries from $Q$* and
- $(\alpha’=\alpha/8,\ \beta’=\alpha\beta/16\Delta n)$-accurate with respect to samples of size $n$ from $\X$ for $k$ adaptively chosen queries from $Q$
Then $\M$ is $(\alpha,\beta)$-accurate with respect to the population for $k$ adaptively chosen queries from $Q$ given $n$ samples from $\X$.
* This is a variant of the definition given above, which I will highlight in the proof details section.
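To get a feel for the quantities involved, here is a quick computation (with arbitrary illustrative values of $\alpha$, $\beta$, and $n$, and $\Delta = 1/n$ as for statistical queries) of the privacy and sample-accuracy parameters the theorem asks of $\M$:

```python
# Parameters the main transfer theorem asks of the mechanism, for statistical
# queries (Delta = 1/n) and an illustrative choice of alpha, beta, and n.
alpha, beta = 0.05, 0.01
n = 10_000
Delta = 1.0 / n

epsilon = alpha / (60 * Delta * n)       # required differential-privacy epsilon
delta = alpha * beta / (32 * Delta * n)  # required differential-privacy delta
alpha_sample = alpha / 8                 # required sample accuracy
beta_sample = alpha * beta / (16 * Delta * n)

# With Delta = 1/n these reduce to alpha/60, alpha*beta/32, alpha/8, alpha*beta/16.
print(epsilon, delta, alpha_sample, beta_sample)
```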
When combined with existing differentially private mechanisms, this theorem guarantees the existence of an efficient mechanism that can answer roughly quadratically many queries in the size of the sample $\x$ without overfitting:
Corollary: There is a mechanism $\M$ that is $(\alpha,\beta)$-accurate with respect to the population for $k$ adaptively chosen queries from $Q_\Delta$ where $\Delta = O(1/n)$ given $n$ samples from $\X$ for
$$n \ge O\left(\frac{\sqrt{k\log\log k} \log^{3/2}(1/\alpha \beta)}{\alpha^2}\right) $$
The mechanism runs in time $\text{poly}(n,\log|\X|,\log(1/\beta))$ per query.
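Ignoring the constant hidden in the $O(\cdot)$, the corollary's sample-size requirement is easy to evaluate. The sketch below uses a placeholder constant of $1$, which is purely an assumption for illustration:

```python
import math

def corollary_sample_size(k, alpha, beta, c=1.0):
    """Sample size suggested by the corollary, up to the unspecified constant c
    hidden in the O(.) notation (c = 1 is an arbitrary placeholder)."""
    return (c * math.sqrt(k * math.log(math.log(k)))
            * math.log(1.0 / (alpha * beta)) ** 1.5 / alpha ** 2)

# e.g. k = 10_000 adaptively chosen queries at alpha = 0.1, beta = 0.05:
print(corollary_sample_size(k=10_000, alpha=0.1, beta=0.05))
```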
Overview of the Proof
A key maneuver shared between the paper and the preceding work is to combine the mechanism $\M$ and the algorithm $\A$ into a non-interactive randomized wrapper algorithm, $\W[\M,\A]$, which satisfies differential privacy. The objective is then to show that any query output by this wrapper must exhibit strong concentration toward its expected value. The previous paper showed this by arguing directly that a differentially private mechanism must preserve statistical moments. The paper which we review provides an innovative proof which achieves tighter guarantees while being expositionally simpler.
The first step is a technical lemma, which applies to a general mechanism $\W$ that is presented with a set of samples $X = (\x_1,…,\x_T)$ and outputs a pair $(q,\x_t)$. The lemma guarantees that the empirical value of the query on the selected sample, $q(\x_t)$, must be close to the population value $q(\D)$ if $\W$ is differentially private.
The next step is to specify a form of $\W[\M,\A]$ that is designed to amplify differences between $q(\x_t)$ and $q(\D)$ by picking the query-sample pair where $\errp(q,a)$ is largest. It is then shown that if $\M$ does not satisfy population accuracy, then the previous technical lemma fails for $\W[\M,\A]$, which is differentially private by construction. This contradiction gives the desired result.
Details of the Proof
Technical Lemma
The most technical portion of the proof is the following lemma:
Lemma (Main Technical Lemma): Let $\W : (\X^n)^T \to Q_\Delta \times [T]$ be $(\epsilon,\delta)$-differentially private where $Q_\Delta$ is the class of $\Delta$-sensitive queries $q : \X^n \to \R$. Let $\D$ be a distribution on $\X$ and let $X = (\x_1,…,\x_T) \sim \D^{nT}$. Then
$$ \tag{\htmlId{eq-main-lem}{}} \left|\E{X,\W} [q(\D)|(q,t) = \W(X)] - \E{X,\W}[q(\x_t)|(q,t) = \W(X)] \right| \le 2(e^\epsilon - 1 + T\delta)\Delta n $$
Before attempting the proof, it is useful to state and prove a much simpler lemma which, while not in the paper, illustrates the general approach for relating the definition of differential privacy to an expression similar to $\htmlClass{eq-ref eq-main-lem}{}$. Having this destination will guide us in proving the main lemma.
Lemma (Differential Privacy to Stability): Let $Q$ be a family of functions from $\X$ to $[0,d]$ and let $\W: \X \to Q$ be $(\epsilon,\delta)$-differentially private. Let $X$ and $X’$ be identically distributed but constrained to be neighbors, and let $Z$ be a random element of $\X$ that may depend on both $X$ and $X’$. Then,
$$ \tag{\htmlId{eq-dp-stability}{}} \left| \E{X,Z}[q(Z) | q = \W(X)] - \E{X’,Z}[q(Z) | q = \W(X’)] \right| \le (e^\epsilon - 1 + \delta)d $$
For use later on, we will define $F : \X \times \X \to [0,d]$ as $F(Z,X) = q(Z)$ where $q = \W(X)$. Then we need to show that $$\left|\E{X,Z}[F(Z,X)] - \E{X’,Z}[F(Z,X’)] \right| \le (e^\epsilon - 1 + \delta)d$$
Since $F$ is non-negative, we can write the expectation as
$$ \begin{aligned} \E{X,Z}[F(Z,X)] &= \int_{0}^{d} \underset{X,Z}{\P}[F(Z,X) \ge s]ds \cr &= \int_{0}^{d} \underset{X,Z}{\P}[q(Z) \ge s| q = \W(X)]ds. \end{aligned} $$
But from the differential privacy of $\W$, we have
$$ \begin{aligned} &\le \int_{0}^{d} (e^\epsilon\underset{X’,Z}{\P}[q(Z) \ge s| q = \W(X’)] + \delta)ds \cr &= \int_{0}^{d} e^\epsilon\underset{X’,Z}{\P}[F(Z,X’) \ge s]ds + \delta d \cr &= e^\epsilon\E{X’,Z}F(Z,X’) + \delta d \end{aligned} $$
Then, using the bound on the range of $F$,
$$ \E{X,Z}[F(Z,X)] - \E{X’,Z}[F(Z,X’)] \le (e^\epsilon - 1)\E{X’,Z}[F(Z,X’)] + \delta d \le (e^\epsilon - 1 + \delta)d $$
The other direction follows similarly.
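The proof above leans on the fact that the expectation of a non-negative random variable bounded by $d$ can be written as the integral of its tail probabilities over $[0,d]$. The following throwaway check (not from the paper, using an arbitrary bounded distribution) confirms the identity numerically:

```python
import numpy as np

rng = np.random.default_rng(3)

# Check that E[F] equals the integral over [0, d] of the tail probability
# P[F >= s], for a non-negative random variable bounded by d.
d = 2.0
F = d * rng.beta(2.0, 5.0, size=50_000)  # arbitrary [0, d]-valued example

s_grid = np.linspace(0.0, d, 401)
tail = np.array([(F >= s).mean() for s in s_grid])
ds = s_grid[1] - s_grid[0]
integral = float(tail.sum() * ds)        # simple Riemann-sum approximation

print(F.mean(), integral)                # the two agree up to Monte Carlo / grid error
```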
We now proceed to a condensed proof of the main lemma. Our treatment restructures the original proof and provides additional commentary in order to highlight the purpose of each major step.
Let us begin by noting that on the left-hand side of $\htmlClass{eq-ref eq-dp-stability}{}$, it is the input to the differentially private mechanism that differs between the two terms, whereas in $\htmlClass{eq-ref eq-main-lem}{}$, it is the argument of $q$ that differs. So an important question is how it will be possible to make use of the lemma.
The secret is to try to rewrite the left-hand side of $\htmlClass{eq-ref eq-main-lem}{}$ in terms of $F : \X \times \X \to [0,d]$, similarly to the proof of the previous lemma, $$ \tag{\htmlId{eq-dest}{}} \left|\E{X,Z}[F(Z,X)] - \E{X,Z’}[F(Z’,X)]\right| $$ but where $Z$, $Z’$, and $X$ are all identically distributed. This will allow us to swap $Z$ and $Z’$ into the roles played by $X$ and $X’$ in the previous lemma, thus enabling its use. Most of the work in the proof involves a sequence of steps toward this end, most of which can be seen as types of symmetrizations.
The first symmetrization comes at no cost. We simply introduce an $X’ \sim \D^{nT}$, independent of $X$, and observe that because of this independence
$$\left|\E{X,\W} [q(\D)|(q,t) = \W(X)] \right| = \left|\E{X,X’,\W} [q(\x’_t)|(q,t) = \W(X)] \right|.$$
Then, from linearity of expectation, the left-hand side of $\htmlClass{eq-ref eq-main-lem}{}$ becomes
$$\left|\E{X,X’,\W} [q(\x’_t) - q(\x_t)|(q,t) = \W(X)] \right|$$
This doesn’t yet admit expression in the form of $\htmlClass{eq-ref eq-dest}{}$ since we haven’t constrained $X$ and $X’$ to be neighbors. To move forward we need some notation. For $\ell \in \{0,1,…,n\}$ and $m \in \{0,1,…,T\}$, let the entries of $X^{\ell,m}$ be given by
$$X_{i,t}^{\ell,m} = \begin{cases} X’_{i,t} & t > m \text{ or } (t = m \text{ and } i \le \ell) \\ X_{i,t} & \text{otherwise} \end{cases}.$$
That is, $X^{\ell,m}$ agrees with $X’$ on the first $\ell$ entries of the $m$’th column and on every column after the $m$’th, and agrees with $X$ everywhere else. Notice that, while $X^{\ell,m}$ may in general depend on both $X$ and $X’$, it is identically distributed to both $X’$ and $X$ $(\text{i.e. } X’ \sim X \sim X^{\ell,m})$. Also, by construction, $X^{\ell,m}$ and $X^{\ell-1,m}$ differ by just a single element.
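Since the indexing here is easy to get turned around, the following sketch (with made-up entries and deliberately tiny $n$ and $T$) constructs the hybrids and checks the two properties used next:

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 4, 3
X = rng.integers(0, 100, size=(n, T))     # rows i in [n], columns t in [T]
Xp = rng.integers(100, 200, size=(n, T))  # an independent copy playing the role of X'

def hybrid(ell, m):
    """X^{ell,m}: agrees with X' on the first ell rows of column m and on all
    later columns, and with X everywhere else (1-indexed to match the text)."""
    H = X.copy()
    for t in range(1, T + 1):
        for i in range(1, n + 1):
            if t > m or (t == m and i <= ell):
                H[i - 1, t - 1] = Xp[i - 1, t - 1]
    return H

for m in range(1, T + 1):
    # (a) in column t = m, the hybrids interpolate between x_t and x'_t:
    assert np.array_equal(hybrid(0, m)[:, m - 1], X[:, m - 1])
    assert np.array_equal(hybrid(n, m)[:, m - 1], Xp[:, m - 1])
    for ell in range(1, n + 1):
        diff = hybrid(ell, m) != hybrid(ell - 1, m)
        # (b) for t != m, column t is unchanged between consecutive hybrids ...
        assert diff[:, [t for t in range(T) if t != m - 1]].sum() == 0
        # ... and consecutive hybrids are neighbors (differ in at most one entry).
        assert diff.sum() <= 1
print("hybrid checks passed")
```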
Letting $\x^{\ell,m}_t$ denote the $t$’th column of $X^{\ell,m}$, notice that (a) if $t = m$ then $\x_t^{0,m} = \x_t$ and $\x_t^{n,m} = \x’_t$, and (b) if $t \neq m$ then $\x_t^{\ell,m} = \x_t^{\ell-1,m}$. It follows that for any $t \in [T]$,
$$ \begin{aligned} &q(\x’_t) - q(\x_t) \cr \text{from (a)} \quad &= q(\x_t^{n,t}) - q(\x_t^{0,t}) \cr \text{telescoping sum} \quad &= \sum_{\ell \in [n]}q(\x_t^{\ell,t}) - q(\x_t^{\ell-1,t}) \cr \text{from (b)} \quad &= \sum_{m \in[T]}\sum_{\ell \in [n]}q(\x_t^{\ell,m}) - q(\x_t^{\ell-1,m}) \end{aligned} $$
We are now quite close to our destination since, for example, we could define $$ \hat{F}(Z,X) = q(\mathbf{z}_t) \quad \text{where} \quad (q,t) = \W(X) $$ and we would have $$ \tag{\htmlId{eq-Fhat}{}}\E{X,X’,\W} [q(\x_t^{\ell,m}) - q(\x_t^{\ell-1,m})|(q,t) = \W(X)] = \E{X,X’,\W}[\hat{F}(X^{\ell,m},X) - \hat{F}(X^{\ell-1,m},X)]. $$
The problem is that, unlike in the previous lemma, we have not required that $q$ be bounded, and thus $\hat{F}$ is not necessarily non-negative or upper-bounded. To handle this, we need to be more careful. Instead, we define $F^{\ell,m}: (\X^n)^T \times (\X^n)^T \to [0,2\Delta]$ as $$ F^{\ell,m}(Z,X) = \begin{cases} q(\mathbf{z}_t) - q(\mathbf{z}_{t,-\ell}) + \Delta & t = m \cr 0 & t \neq m \end{cases} $$ where $\mathbf{z}_{t,-\ell}$ is $\mathbf{z}_t$ with its $\ell$’th element replaced by a fixed element of $\X$. It is actually possible to replace $\hat{F}$ with $F^{\ell,m}$ by adding $0 = \big(\Delta - q(\x_{t,-\ell}^{\ell,m}) \big) - \big(\Delta - q(\x_{t,-\ell}^{\ell-1,m})\big)$ to the left-hand side of $\htmlClass{eq-ref eq-Fhat}{}$. (Notice that $\x_{t,-\ell}^{\ell,m} = \x_{t,-\ell}^{\ell-1,m}$, since the only differing element in the original samples is replaced by a fixed element.)
Thus, combining our progress so far, we have that $$ \E{X,X’,\W} [q(\x’_t) - q(\x_t)|(q,t) = \W(X)] = \sum_{m \in[T]}\sum_{\ell \in [n]}\E{X,X’,\W}[F^{\ell,m}(X^{\ell,m},X) - F^{\ell,m}(X^{\ell-1,m},X)] $$
We’ve now achieved the destination that we defined in $\htmlClass{eq-ref eq-dest}{}$. By recalling that $X \sim X^{\ell,m} \sim X^{\ell-1,m}$, we can work through the steps of the previous lemma to achieve the bound (for exact details, see the paper).
Main Transfer Theorem Proof via Monitoring Algorithm
We now have the proper equipment to prove the main theorem. To this end, we introduce two algorithms. The first is an algorithm which simulates the interaction between the analyst $\A$ and the mechanism $\M$ for $k$ iterations.
Algorithm (Non-interactive wrapper): $\W[\M,\A] : \X^n \to (Q \times \mathcal{R})^k$
Input: A sample $\x \in \X^n$
For $j= 1,…,k$
$\quad$ Feed $a_{j-1}$ to $\A$ and get a query $q_j \in Q$.
$\quad$ Feed $q_j$ to $\M$ and get an answer $a_j \in \mathcal{R}$.
Output: $((q_1,a_1),…,(q_k,a_k))$
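Here is a minimal sketch of this wrapper in code, under the assumption (mine, not the paper's) that the analyst and the mechanism are stateful objects exposing `next_query` and `answer` methods:

```python
def non_interactive_wrapper(analyst, mechanism, sample, k):
    """Simulate k rounds of interaction between an analyst A and a mechanism M
    on a fixed sample, returning the transcript ((q_1, a_1), ..., (q_k, a_k))."""
    transcript = []
    answer = None
    for _ in range(k):
        query = analyst.next_query(answer)        # A sees the previous answer (and
                                                  # remembers earlier ones internally)
        answer = mechanism.answer(query, sample)  # M sees the sample and the query
        transcript.append((query, answer))
    return transcript
```

Because the loop folds the entire interaction into a single non-interactive function of the sample, the standard definition of differential privacy can be applied to it directly, which is exactly how privacy for $\M$ is defined next.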
The definition of differential privacy given above does not apply directly to $\M$, but we can use this algorithm to give one. We say that an interactive mechanism $\M$ is $(\epsilon,\delta)$-differentially private for $k$ queries from $Q$ if for every adversary $\A$, the algorithm $\W[\M,\A]$ is $(\epsilon,\delta)$-differentially private. This fills in the missing definition in the statement of the main theorem.
The next algorithm is a so-called monitoring algorithm which runs the previous algorithm repeatedly on $T$ samples, and then returns the query-sample pair displaying the greatest population error.
Algorithm (Monitoring Algorithm): $\W_M[\M,\A] : \X^{nT} \to Q \times \N$
Input: $X = (\x_1,…,\x_T) \in \X^{nT}$
For $t= 1,…,T$
$\quad$ Let $((q_{t,1},a_{t,1}),…,(q_{t,k},a_{t,k})) = \W[\M,\A](\x_t)$
Let
$$
(j^\star,t^\star) = \underset{j \in [k],\, t \in [T]}{\arg\max}\left|\errp(q_{t,j},a_{t,j})\right|
$$
Let $q^\star = \pm q_{t^\star,j^\star}$, where the sign is chosen to make $a_{t^\star,j^\star} - q^\star(\D) \ge 0$
Output: $(q^\star,t^\star)$
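And a matching sketch of the monitor, reusing `non_interactive_wrapper` from above. Here `q_pop` is a hypothetical stand-in for access to $q(\D)$, which is permissible only because the monitor is an analysis device rather than a real mechanism; the sign normalization $q^\star = \pm q$ is omitted for simplicity:

```python
def monitor(analyst_factory, mechanism_factory, samples, k, q_pop):
    """Run the wrapper on each of the T sub-samples and return the query and
    sub-sample index with the largest population error |a - q(D)|."""
    worst_err, worst_q, worst_t = -1.0, None, None
    for t, sample in enumerate(samples, start=1):  # t = 1, ..., T
        transcript = non_interactive_wrapper(
            analyst_factory(), mechanism_factory(), sample, k)
        for query, answer in transcript:
            err = abs(answer - q_pop(query))       # |err_D(q, a)|
            if err > worst_err:
                worst_err, worst_q, worst_t = err, query, t
    return worst_q, worst_t                        # (q*, t*)
```

Fresh analyst and mechanism instances are created for each sub-sample (via the factories) so that no state leaks between the $T$ simulated interactions.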
The monitoring algorithm should be seen as a way of checking whether $\M$ is accurate with respect to the population. It is also quite straightforward to show that, since $\W_M[\M,\A]$ merely performs post-processing on the output of $\W[\M,\A]$, which is assumed to be $(\epsilon,\delta)$-differentially private, it must itself be $(\epsilon,\delta)$-differentially private by the post-processing property of differential privacy.
Applying the monitoring algorithm with $T = \lfloor 1/\beta \rfloor$, the proof of the main theorem follows from two main observations:
1. If $\M$ is not $(\alpha,\beta)$-accurate, then the deviation between $q^\star(\D)$ and $q^\star(\x_{t^\star})$ is large. In particular, $$ \left| \E{X,W}[q^\star(\D) - q^\star(\x_{t^\star})| (q^\star,t^\star) = \W_M(X)] \right| \ge \alpha/4 $$
2. But, since $\W_M$ is differentially private, application of the technical lemma directly gives $$ \left| \E{X,W}[q^\star(\D) - q^\star(\x_{t^\star})| (q^\star,t^\star) = \W_M(X)] \right| \le \alpha/8 $$
From the contradiction, we see that $\M$ must be $(\alpha,\beta)$-accurate, thus proving the main transfer theorem.
The key idea in showing 1. is that, since $\M$ is not population-accurate, $q^\star(\D)$ is far from $a_{q^\star}$ with high probability. But since $\M$ is by assumption sample-accurate, $a_{q^\star}$ is close to $q^\star(\x_{t^\star})$ with high probability. That is, allowing all expectations to be implicitly conditioned on $(q^\star,t^\star) = \W_M(X)$, $$ \begin{aligned} &\left| \E{X,W}[q^\star(\D) - q^\star(\x_{t^\star})] \right| \cr = &\left| \E{X,W}[q^\star(\D) - a_{q^\star} + a_{q^\star} - q^\star(\x_{t^\star})] \right| \cr \ge &\underbrace{\left|\E{X,W}[q^\star(\D) - a_{q^\star}]\right|}_{(a) \text{ lower-bounded}} - \underbrace{\left|\E{X,W}[a_{q^\star} - q^\star(\x_{t^\star})]\right|}_{(b) \text{ upper-bounded}} \end{aligned} $$ where the lower bound for (a) comes from the fact that $\M$ is not population-accurate, and the upper bound for (b) comes from the fact that $\M$ is sample-accurate.
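As a quick sanity check on the constants, we can verify numerically (for arbitrary illustrative values of $\alpha$ and $\beta$, with $\Delta n = 1$) that the technical lemma's bound $2(e^\epsilon - 1 + T\delta)\Delta n$ really is at most $\alpha/8$ when $\epsilon$, $\delta$, and $T$ are chosen as in the theorem and the proof:

```python
import math

alpha, beta = 0.05, 0.01
Delta_n = 1.0                    # Delta * n, e.g. statistical queries with Delta = 1/n

epsilon = alpha / (60 * Delta_n)             # privacy parameters demanded by the theorem
delta = alpha * beta / (32 * Delta_n)
T = math.floor(1 / beta)                     # number of sub-samples used by the monitor

lemma_bound = 2 * (math.exp(epsilon) - 1 + T * delta) * Delta_n
print(lemma_bound, alpha / 8)                # e.g. ~0.0048 <= 0.00625
assert lemma_bound <= alpha / 8
```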
Corollaries
The paper provides several corollaries to the main transfer theorem, which arise by applying the theorem to guarantees provided by existing differentially private mechanisms. The corollary stated in the main contributions section above is an example of one of these, which makes use of an algorithm proposed in “Between Pure and Approximate Differential Privacy” by Steinke and Ullman [3]. Unfortunately, the results are simply stated without derivations, which is problematic since some of the cited papers give their guarantees under slightly different assumptions.
For example, Theorem 1.3 of “Between Pure and Approximate Differential Privacy” states that there is an $(\epsilon,\delta)$-differentially private mechanism for $k$ adaptively chosen queries which requires $n$ samples, with $$ n \ge O\left(\frac{\sqrt{k\log\log k}\,\log^{1/2}(1/\delta)}{\epsilon \alpha}\right)$$
The mechanism is said to be $(\alpha,1/k^{O(1)})$-accurate with respect to the sample, which is problematic since we would like to be able to specify the accuracy independently of $k$. Thus, it appears that, without greater familiarity with differential privacy, it would likely be necessary to open up the proofs provided in each paper in order to obtain an appropriate sample complexity bound for use with the main theorem.
Further Reading
[1] Raef Bassily et al. Algorithmic Stability for Adaptive Data Analysis. 2015. http://arxiv.org/abs/1511.02513.
[2] Cynthia Dwork et al. Preserving Statistical Validity in Adaptive Data Analysis. 2016. http://arxiv.org/abs/1411.2664.
[3] Thomas Steinke and Jonathan Ullman. Between Pure and Approximate Differential Privacy. 2015. http://arxiv.org/abs/1501.06095.