Differential Privacy and Generalization
How differential privacy can be used to help solve problems of overfitting in adaptive data analysis.
\gdef\R{\mathbb{R}} \gdef\E#1{\underset{#1}{\mathbb{E}}} \gdef\P{\mathbb{P}} \gdef\D{\mathcal{D}} \gdef\X{\mathcal{X}} \gdef\Y{\mathcal{Y}} \gdef\Z{\mathcal{Z}} \gdef\H{\mathcal{H}} \gdef\A{\mathcal{A}} \gdef\M{\mathcal{M}} \gdef\W{\mathcal{W}} \gdef\I{\mathbb{I}} \gdef\errx{\text{err}_\x} \gdef\errp{\text{err}^\D} \gdef\x{\mathbf{x}}

Introduction

In a previous post on Learnability and Uniform Convergence, we saw that the finite VC-dimension of a hypothesis class H\H was both a necessary and sufficient condition for learnability of H\H for classification and regression problems, where HYX\H \subseteq \Y^\X. This would seem to be a complete characterization of learnability for such problems, but there are important cases which it does not cover. For example, the characterization implicitly assumes that we are learning a function from a fixed (prespecified) hypothesis class, H\H. What if we want to pose a different constraint? For example, consider the following constraint: “We will consider up to NN hypotheses, adaptively chosen from the space of all functions from X\X to Y\Y. Is learning possible in this context? Without pre-specifying anything beyond the number of hypotheses which we will consider, are we not bound to overfit?

The answer to this question is indeed, yes. That is, if we stick with empirical risk minimization (ERM) algorithms. When ERM is used under such a constraint, overfitting arises as the hypotheses chosen by the analyst lose independence from the data used to validate the hypotheses. Without this independence, we can not easily upper-bound the probability of large deviations between the true and estimated accuracy of any given hypothesis.

In fact, there are many settings, such as scientific studies in the social sciences or machine learning competitions, in which it is unnatural or constraining to prespecify a range of hypotheses. Data analysts naturally prefer to interact with the data before deciding anything about it. The downfalls of poorly accounting for dependencies in such analyses have become something a cultural phenomenon in the form of the “reproducibility crisis,” with notable examples including studies following standard methodological rigor, yet proving the existence of Extra-Sensory Perception

A recent line of work has attempted to address this wide-sweeping problem of adaptivity in data analysis by linking it to a separate but related problem known as differential privacy. Differential privacy answers the question of how to allow nearly-unrestricted access to a finite database from a restricted space of queries (such as statistical summaries) without revealing individual entries (i.e. private information).

In this post, we consider a key paper within this line of work, “Algorithmic Stability for Adaptive Data Analysis” by Bassily et al. [1], which gives a general transfer theorem showing that a differentially private mechanism can be used to avoid overfitting. At the time of the paper, there was one other major work, “Preserving Statistical Validity in Adaptive Data Analysis” by Dwork et al. [2], addressing the connection between differential privacy and generalization. Bassily et al. improve the guarantees which were first presented by Dwork et al., and extend them to a broader class of queries.

Aside from being applicable in a broader context than uniform-convergence-like bounds, the results of this paper are also interesting because they establish a different perspective concerning active mechanisms that introduce a form of distortion in order to bring about better generalization properties. Usually such mechanisms, when used e.g. for training neural networks, are viewed as a form of (Tikhonov) regularization. It is an interesting question whether differentially private mechanisms might provide a richer set of tools (specifically alternatives to ERM) for achieving a more optimal bias-variance trade-off in machine learning contexts.

In the following sections, we will outline the major contributions of the paper, and provide both a high-level and detailed overview of the most important proofs.

Main Contributions

We will begin by defining the important components. We assume that there exists a distribution D\D over a domain X\X. We denote a sample of size nn from X\X as x=(x1,,xn)Xn\x = (x_1,…,x_n) \in \X^n

Definition (Statistical Query): A statistical query is a function q:X[0,1]q : \X \to [0,1]. For brevity, we use the following notations to refer to the expectation and empirical average of qq, respectively: q(D)=ExD[q(x)]andq(x)=1ni[n]q(xi)q(\D) = \E{x \sim \D}[q(x)] \quad \text{and} \quad q(\x) = \frac{1}{n}\sum_{i \in [n]}q(x_i)

The main result of the paper applies to a strict generalization of a statistical query called a Δ\Delta-sensitive query:

Definition (Δ\Delta-sensitive Query): For Δ\Delta [0,1]\in [0,1], nNn \in \N, a Δ\Delta-sensitive query is a function q:XnRq : \X^n \to \R satisfying q(x)q(x)Δ|q(\x)-q(\x’)| \leq \Delta for every pair x,xXn\x,\x’ \in \X^n differing by only one entry. In this case, we have q(D)=ExDn[q(x)]q(\D) = \E{\x \sim \D^n}[q(\x)].

Notice that a statistical query is a 1/n1/n-sensitive query.

The setting in question is one in which an algorithm or analyst A\A issues a sequence of queries (qi)i[k](q_i)_{i \in [k]}, which are answered one at a time ((ai)i[k])((a_i)_{i \in [k]}) by a mechanism M\M. M\M bases its answers on a sample xDn\x \sim D^n, which remains fixed throughout the sequence of interactions. Both A\A and M\M are stateful, meaning that qiq_i and aia_i may depend on q1,a1,,qj1,aj1q_1,a_1,…,q_{j-1},a_{j-1}. The error of an answer, aa, generated by a mechanism when provided with a query qq is measured with respect to the sample x\x as errx(q,a)=aq(x)\errx(q,a) = a - q(\x) and with respect to the distribution D\D as errD(q,a)=aq(D)\errp(q,a) = a - q(\D)

Obviously, there is little we can do if the mechanism is not accurate with respect to its sample, xx. This requirement is formalized via the following definition:

Generally speaking, there will be very little we can do if the mechanism is not at least accurate with respect to its sample, x\x. This requirement is formalized via the following definition, which says that the maximum error in the answers returned by the mechanism and the actual value of the query on the sample over kk queries must be small with high probability.

Definition (Sample Accuracy): A mechanism M\M is (α,β)(\alpha,\beta)-accurate with respect to samples of size nn from X\X for kk adaptively chosen queries from QQ if for every adversary A\A and xDn\x \in \D^n

P[maxj[k]errx(qj,aj)α]1β \P \left[ \max_{j \in [k]} \big|\errx(q_j,a_j)\big| \le \alpha \right] \ge 1 - \beta

This may seem like an overly general allowance. After all, if the mechanism simply returns the empirical quantity, q(x)q(x), then of course errx(q,a)\errx(q,a) will always be zero. But we remind the reader that doing so would almost certainly lead to overfitting. To avoid overfitting, we also need to mechanism to satisfy a form of differential privacy, which naturally restricts the mechanism from having perfect sample accuracy given any query.

Definition (Differential Privacy): Let W:XnR\W : \X^n \to \mathcal{R} be a randomized algorithm. We say that W\W is (ϵ,δ)(\epsilon,\delta)-differentially private if for every pair of samples x\x, x\x’ that differ by exactly one element, and every RRR \subseteq \mathcal{R},

P[W(x)R]eϵP[W(x)R]+δ\P[\W(\x) \in R] \le e^\epsilon \cdot \P[\W(\x’) \in R] + \delta

This definition of differential privacy functions somewhat like a constraint on the distance between the output distribution of an algorithm when the input vector is different by just a single element. It is important to know that differential privacy enjoys a closure under post-processing, so that if W\W is (ϵ,δ)(\epsilon,\delta)-differentially private, then the composition fWf \circ \W is (ϵ,δ)(\epsilon,\delta)-differentially private for any (possibly random) ff. Note that the definition above cannot be applied directly to M\M, since M\M takes as input both a sample from xXn\x \in \X^n and queries from A\A. This will be addressed in the proof detail section.

Our goal is to show that a mechanism satisfying both sample accuracy and differential privacy also satisfies an analogous accuracy requirement with respect to the population, such that overfitting is precluded:

Definition (Population Accuracy): A mechanism M\M is (α,β)(\alpha,\beta)-accurate with respect to the population for kk adaptively chosen queries from QQ if for every adversary A\A, P[maxj[k]errD(qj,aj)α]1β \P \left[ \max_{j \in [k]} \big|\errp(q_j,a_j)\big| \le \alpha \right] \ge 1 - \beta

where the sample x\x held by the mechanism is drawn from Dn\D^n.

The main result of the paper is a transfer theorem which gives the desired guarantee:

Theorem (Main Transfer Theorem): Let QQ be a family of Δ\Delta-sensitive queries on X\X. Assume that for some α,β(0,.1)\alpha,\beta \in (0,.1), M\M is

  1. (ϵ=α/60Δn,δ=αβ/32Δn)(\epsilon = \alpha/60\Delta n,\delta = \alpha\beta/32\Delta n)-differentially private for k adaptively chosen queries from QQ* and
  2. (α=α/8,β=αβ/16Δn)(\alpha’=\alpha/8,\beta’=\alpha\beta/16\Delta n)-accurate with respect to its sample for n samples from X\X for kk adaptively chosen queries from QQ

Then M\M is (α,β)(\alpha,\beta)-accurate with respect to the population for kk adaptively chosen queries from QQ given n samples from X\X.

* This is a variant of the definition given above, which I will highlight in the proof details section.

When combined with existing differentially private mechanisms, this theorem guarantees the existence of an efficient mechanism that can answer quadratically-many queries in the size of xx without overfitting:

Corollary: There is a mechanism M\M that is (α,β)(\alpha,\beta)-accurate with respect to the population for k adaptively chosen queries from QΔQ_\Delta where Δ=O(1/n)\Delta = O(1/n) given nn samples from X\X for

nO(kloglogklog3/2(1/αβ)α2)n \ge O\left(\frac{\sqrt{k\log\log k} \log^{3/2}(1/\alpha \beta)}{\alpha^2}\right)

The mechanism runs in time poly(n,logX,log(1/β))\text{poly}(n,\log|\X|,\log(1/\beta)) per query.

Overview of the Proof

A key maneuver shared between the paper and preceding work is to combine the mechanism M\M and the algorithm A\A into a non-interacting randomized wrapper algorithm, W[M,A]\W[\M,\A], which observes differential privacy. The objective is then to show that any query output by this mechanism must exhibit strong concentration toward its expected value. The previous paper showed this by directly showing that a differentially private mechanism must preserve statistical moments. The paper which we review provides an innovative proof which achieves tighter guarantees while being expositionally simpler.

The first step is a technical lemma, which applies to a general mechanism W\W that is presented with a set of samples X=(x1,xT)X = (\x_1,…\x_T) and outputs a pair (q,xt)(q,\x_t). The lemma gaurantees that the empirical value of the query on the selected sample, q(xt)q(\x_t), must be close to the population value q(D)q(\D) if W\W is differentially private.

The next step is to specify a form of W[M,A]\W[\M,\A] that is designed to amplify differences betweenq (xt\x_t) and q(D)q(\D) by picking the query sample pair where errD(q,a)\errp(q,a) is the largest. It is then shown that that if M\M does not satisfy population accuracy, then the previous technical lemma fails for W[M,A]\W[M,A] which is differentiallyprivate by construction. This contradiction gives the desired result.

Details of the Proof

Technical Lemma

The most technical portion of the proof is the following lemma:

Lemma (Main Technical Lemma): Let W:(Xn)TQΔ×[T]\W : (\X^n)^T \to Q_\Delta \times [T] be (ϵ,δ)(\epsilon,\delta)-differentially private where QΔQ_\Delta is the class of Δ\Delta-sensitive queries q:XnRq : \X^n \to \R. Let D\D be a distribution on X\X and let X=(x1,,xT)DnTX = (\x_1,…,\x_T) \sim \D^{nT}. Then

EX,W[q(D)(q,t)=W(X)]EX,W[q(xt)(q,t)=W(X)]2(eϵ1+Tδ)Δn() \tag{\htmlId{eq-main-lem}{}} \left|\E{X,\W} [q(\D)|(q,t) = \W(X)] - \E{X,\W}[q(\x_t)|(q,t) = \W(X)] \right| \le 2(e^\epsilon - 1 + T\delta)\Delta n

Before attempting the proof, it is useful to state and prove a much simpler lemma which, while not in the paper, illustrates the general approach for relating the definition of differential privacy to an expression similar to \htmlClass{eq-ref eq-main-lem}{}. Having this destination will guide us in proving the main lemma.

Lemma (Differential Privacy to Stability): Let QQ be a family of functions from X\X to [0,d][0,d] and let W:XQ\W: \X \to Q be (ϵ,δ)(\epsilon,\delta)-differentially private. Let XX and XX’ be i.i.d. but constrained to be neighbors. ZZ may depend on both XX and XX’. Then,

EX,Z[q(Z)q=W(X)]EX,Z[q(Z)q=W(X)](eϵ1+δ)d() \tag{\htmlId{eq-dp-stability}{}} \left| \E{X,Z}[q(Z) | q = \W(X)] - \E{X’,Z}[q(Z) | q = \W(X’)] \right| \le (e^\epsilon - 1 + \delta)d

▸ Proof

For use later on, we will define F:X×X[0,d]F : \X \times \X \to [0,d] as F(Z,X)=q(Z)F(Z,X) = q(Z) where q=W(X)q = \W(X). Then we need to show that EX,Z[F(Z,X)]EX,Z[F(Z,X)](eϵ1+δ)d\left|\E{X,Z}[F(Z,X)] - \E{X’,Z}[F(Z,X’)] \right| \le (e^\epsilon - 1 + \delta)d

Since FF is non-negative, we can write the expectation as

EX,Z[F(Z,X)]=0dPX,Z[F(Z,X)s]ds=0dPX,Z[q(Z)sq=W(X)]ds. \begin{aligned} \E{X,Z}[F(Z,X)] &= \int_{0}^{d} \underset{X,Z}{\P}[F(Z,X) \ge s]ds \cr &= \int_{0}^{d} \underset{X,Z}{\P}[q(Z) \ge s| q = \W(X)]ds. \end{aligned}

But from the differential privacy of W\W, we have

0d(eϵPX,Z[q(Z)sq=W(X)]+δ)ds=0deϵPX,Z[F(Z,X)s]ds+δd=eϵEX,ZF(Z,X)+δd \begin{aligned} &\le \int_{0}^{d} (e^\epsilon\underset{X’,Z}{\P}[q(Z) \ge s| q = \W(X’)] + \delta)ds \cr &= \int_{0}^{d} e^\epsilon\underset{X’,Z}{\P}[F(Z,X’) \ge s]ds + \delta d \cr &= e^\epsilon\E{X’,Z}F(Z,X’) + \delta d \end{aligned}

Then, using the bound on the range of FF,

EX,Z[F(Z,X)]EX,Z[F(Z,X)](eϵ1)EX,Z[F(Z,X)]+δd(eϵ1+δ)d \E{X,Z}[F(Z,X)] - \E{X’,Z}[F(Z,X’)] \le (e^\epsilon - 1)\E{X’,Z}[F(Z,X’)] + \delta d \le (e^\epsilon - 1 + \delta)d

The other direction follows similarly.

We now proceed to a condensed proof of the main lemma. Our treatment restructures the original proof and provides additional commentary in order to highlight the purpose of each major step.

▸ Proof

Let us begin by noting that on the left-hand side of \htmlClass{eq-ref eq-dp-stability}{}, the input to the differentially private mechanism differs, whereas in \htmlClass{eq-ref eq-main-lem}{}, it is the argument of qq that differs. So an important question is how it will be possible to make use of the lemma?

The secret to try to rewrite the left hand side of \htmlClass{eq-ref eq-main-lem}{} in terms of F:X×X[0,d]F : \X \times \X \to [0,d], similarly to the proof of the previous lemma, EX,Z[F(Z,X)]EX,Z[F(Z,X)]() \tag{\htmlId{eq-dest}{}} \left|\E{X,Z}[F(Z,X)] - \E{X,Z’}[F(Z’,X)]\right| but where ZZ and XX are identically distributed. This will allow us to swap the role of ZZ with XX and XX’ respectively, thus enabling the use of the previous lemma. Most of the work in the proof involves a sequence of steps toward this end, most of which can be seen as types of symmetrizations.

The first symmetrization comes at no cost. We simply introduce an XDnTX’ \sim \D^{nT}, independent of XX, and observe that because of this independence

EX,W[q(D)(q,t)=W(X)]=EX,X,W[q(xt)(q,t)=W(X)].\left|\E{X,\W} [q(\D)|(q,t) = \W(X)] \right| = \left|\E{X,X’,\W} [q(\x_t)|(q,t) = \W(X)] \right|.

Then, from linearity of expecation, the left hand size of \htmlClass{eq-ref eq-main-lem}{} becomes

EX,X,W[q(xt)q(xt)(q,t)=W(X)]\left|\E{X,X’,\W} [q(\x’_t) - q(\x_t)|(q,t) = \W(X)] \right|

This doesn’t yet admit expression in the form of \htmlClass{eq-ref eq-dest}{} since we haven’t constrained XX and XX’ to be neighbors. To move forward we need some notation. For {0,1,,n}\ell \in \{0,1,…,n\} and m{0,1,,T}m \in \{0,1,…,T\}, let the entries of X,mX^{\ell,m} be given by

Xi,t,m={Xi,tt>m or (t=m and i)Xi,totherwise.X_{i,t}^{\ell,m} = \begin{cases} X’_{i,t} & t > m \text{ or } (t = m \text{ and } i \le \ell) \\ X_{i,t} & \text{otherwise} \end{cases}.

That is, X,mX^{\ell,m} is equal to XX’ up to \ell‘th row of the mm‘th column, and equal to XX beyond that. Notice that, while X,mX^{\ell,m} may in general depend on both XX and XX’, it is identically distributed to both XX’ and XX, (i.e. XXX,m)(\text{i.e. }X’ \sim X \sim X^{\ell,m}). Also, by construction X,mX^{\ell,m} and X1,mX^{\ell-1,m} differ by just a single element.

Letting xt,m\x^{\ell,m}_t denote the tt‘th column of X,mX^{\ell,m}, we notice that (a) if t=mt = m then xt0,m=xt\x_t^{0,m} = \x_t and xtn,m=xt\x_t^{n,m} = \x’_t, and (b) if tmt \neq m then xt,m=xt1,m\x_t^{\ell,m} = \x_t^{\ell-1,m}, we have that for any t[T]t \in [T]

q(xt)q(xt)from (a)=q(xtn,t)q(xt0,t)telescoping sum=[n]q(xt,t)q(xt1,t)from (b)=m[T][n]q(xt,m)q(xt1,m)() \begin{aligned} &q(\x’_t) - q(\x_t) \cr \text{from (a)} \quad &= q(\x_t^{n,t}) - q(\x_t^{0,t}) \cr \text{telescoping sum} \quad &= \sum_{\ell \in [n]}q(\x_t^{\ell,t}) - q(\x_t^{\ell-1,t}) \cr \tag{}\text{from (b)} \quad &= \sum_{m \in[T]}\sum_{\ell \in [n]}q(\x_t^{\ell,m}) - q(\x_t^{\ell-1,m}) \cr %\text{adding 0} \quad &= \sum_{m \in[T]}\sum_{\ell \in [n]}\big(q(\x_t^{\ell,m}) - q(\x_{t,-\ell}^{\ell,m}) + \Delta \big) - \big( q(\x_t^{\ell-1,m}) - q(\x_{t,-\ell}^{\ell-1,m}) + \Delta \big) \end{aligned}

We are now quite close to our destination since, for example, we could define F^(Z,X)=q(zt)where(q,t)=W(X) \hat{F}(Z,X) = q(\mathbf{z}_t) \quad \text{where} \quad (q,t) = \W(X) and we would have EX,X,W[q(xt,m)q(xt1,m)(q,t)=W(X)]=EX,X,W[F^(X,m,X)F^(X1,m,X)].() \tag{\htmlId{eq-Fhat}{}}\E{X,X’,\W} [q(\x_t^{\ell,m}) - q(\x_t^{\ell-1,m})|(q,t) = \W(X)] = \E{X,X’,\W}[\hat{F}(X^{\ell,m},X) - \hat{F}(X^{\ell-1,m},X)].

The problem is that, unlike in the previous lemma, we have not required that qq be bounded, and thus F^\hat{F} is not necessarily non-negative or upper-bounded. To handle this, we need to be more careful. Instead, we define F,m:X×X[0,2Δ]F^{\ell,m}: \X \times \X \to [0,2\Delta] as F,m(Z,X)={q(zt)q(zt,)+Δt=m0tm F^{\ell,m}(Z,X) = \begin{cases} q(\mathbf{z}_t) - q(\mathbf{z}_{t,-\ell}) + \Delta & t = m \cr 0 & t \neq m \end{cases} where zt,\mathbf{z}_{t,-\ell} is zt\mathbf{z}_t with it’s \ell‘th element replaced by a fixed element of X\X. It is actually possible to replace F^\hat{F} with F,mF^{\ell,m} by adding 0=(Δq(xt,,m))(Δq(xt,1,m))0 = \big(\Delta - q(\x_{t,-\ell}^{\ell,m}) \big) - \big(\Delta - q(\x_{t,-\ell}^{\ell-1,m})\big) to the left hand side of \htmlClass{eq-ref eq-Fhat}{}. (Notice that xt,,m=xt,1,m\x_{t,-\ell}^{\ell,m} = \x_{t,-\ell}^{\ell-1,m}, since the only differing element in the original samples is replaced by a fixed element.)

Thus, combining our progress so far, we have that EX,X,W[q(xt)q(xt)(q,t)=W(X)]=m[T][n]EX,X,W[F,m(X,m,X)F,m(X1,m,X)] \E{X,X’,\W} [q(\x’_t) - q(\x_t)|(q,t) = \W(X)] = \sum_{m \in[T]}\sum_{\ell \in [n]}\E{X,X’,\W}[F^{\ell,m}(X^{\ell,m},X) - F^{\ell,m}(X^{\ell-1,m},X)]

We’ve now achieved the destination that we defined in \htmlClass{eq-ref eq-dest}{}. By recalling that XX,mX1,mX \sim X^{\ell,m} \sim X^{\ell-1,m}, we can work through the steps of the previous lemma to achieve the bound (for exact details, see the paper).

Main Transfer Theorem Proof via Monitoring Algorithm

We are now have the proper equipment to prove the main theorem. To this end, we introduce two algorithms. The first is an algorithm which simulates the interaction between the analyst A\A and the mechanism M\M for kk iterations.

Algorithm (Non-interactive wrapper): W[M,A]:Xn(Q×R)k\W[\M,\A] : \X^n \to (Q \times \mathcal{R})^k

Input: A sample xXn\x \in \X^n
For j=1,,kj= 1,…,k
\quad Feed aj1a_{j-1} to A\A and get a query qjQq_j \in Q.
\quad Feed qjq_j to M\M and get an answer qjRq_j \in \mathcal{R}.

Output: ((q1,a1),,(qk,ak))((q_1,a_1),…,(q_k,a_k))

The above definition for differential privacy does not apply directly to M\M, but we can use this algorithm to give a definition. We say that an interactive mechanism is (ϵ,δ)(\epsilon,\delta)-differentially private for kk queries from QQ if for every adversary A\A, the algorithm W[A,M]\W[\A,\M] is (ϵ,δ)(\epsilon,\delta)-differentially private. This fills in the missing definition in the statement of the main theorem.

The next algorithm is a so-called monitoring algorithm which runs the previous algorithm repeatedly on TT samples, and then returns the query-sample pair displaying the greatest population error.

Algorithm (Monitoring Algorithm): WM[M,A]:XnTQ×N\W_M[\M,\A] : \X^{nT} \to Q \times \N

Input: X=(x1,,xT)XnTX = (\x_1,…,\x_T) \in \X^{nT}
For t=1,,Tt= 1,…,T
\quad Let (qt,1,at,1),,qt,k,at,k)=W[M,A](xt)(q_{t,1},a_{t,1}),…,q_{t,k},a_{t,k}) = \W[\M,\A](\x_t)
Let
(j,t)=arg,maxj[k],t[T]errD(qt,j,at,j) (j^\star,t^\star) = \underset{j \in [k],t \in [T]}{\text{arg,max}}\left|\errp(q_{t,j},a_{t,j})\right| Let q=±qt,jq^\star = \pm q_{t^\star,j^\star}, where the sign is chosen to make at,jq(D)0a_{t^\star,j^\star} - q^\star(\D) \ge 0

Output: (q,t)(q^\star,t^\star)

The monitoring algorithm should be seen a a way of checking if M\M is accurate with respect to the population. It is also quite straightforward to show that, since WM[M,A]\W_M[\M,\A] merely performs post-processing on the output of W[M,A]\W[\M,A], which is assumed to be (ϵ,δ)(\epsilon,\delta)-differentially private, it must also be (ϵ,δ)(\epsilon,\delta)-differentially private by the post-processing property of differential privacy.

Applying the monitoring algorithm with T=1/βT = \lfloor 1/\beta \rfloor, the proof of the main theorem follows from two main observations:

  1. If M\M is not (α,β)(\alpha,\beta)-accurate then the deviation between q(D)q^\star(\D) and q(xt)q^\star(\x_{t^\star}) is large. In particular, EX,W[q(D)q(xt)(q,t)=WM(X)]α/4 \left| \E{X,W}[q^\star(\D) - q^\star(\x_{t^\star})| (q^\star,t^\star) = \W_M(X)] \right| \ge \alpha/4

  2. But, since WM\W_M is differentially private, application of the technical lemma directly gives EX,W[q(D)q(xt)(q,t)=WM(X)]α/8 \left| \E{X,W}[q^\star(\D) - q^\star(\x_{t^\star})| (q^\star,t^\star) = \W_M(X)] \right| \le \alpha/8

From the contradiction, we see that M\M must be (α,β)(\alpha,\beta)-accurate, thus proving the main transfer theorem.

The key idea in showing 1. is that, since M\M is not population-accurate, q(D)q^\star(\D) is far from aqa_{q^\star} with high probability. But since M\M by assumption is sample-accurate, aqa_{q^\star} is close to from q(x)q^\star(\x^\star) with high probability. That is, allowing all expectations be implicitly conditioned on (q,t)=WM(X)(q^\star,t^\star) = \W_M(X), EX,W[q(D)q(xt)]=EX,W[q(D)aq+aqq(xt)]EX,W[q(D)aq](a) lower-boundedEX,W[aqq(xt)](b) upper-bounded \begin{aligned} &\left| \E{X,W}[q^\star(\D) - q^\star(\x_{t^\star})] \right| \cr = &\left| \E{X,W}[q^\star(\D) - a_{q^\star} + a_{q^\star} - q^\star(\x_{t^\star})] \right| \cr \ge &\underbrace{\left|\E{X,W}[q^\star(\D) - a_{q^\star}]\right|}_{(a) \text{ lower-bounded}} - \underbrace{\left|\E{X,W}[a_{q^\star} - q^\star(\x_{t^\star})]\right|}_{(b) \text{ upper-bounded}} \end{aligned} where the upper bounded for (a) comes from the fact that M\M is not population-accurate, and the lower bound for (b) comes from the fact that M\M is sample-accurate.

Corollaries

The paper provides several corollaries to the main transfer theorem which arise by applying the theorem to guarantees provided by existing differentially private mechanisms. The corollary in the main results area is an example of one of these, which makes use of an algorithm proposed in “Between Pure and Approximate Differential Privacy” by Steinke and Ullman [3]. Unfortunately, the results are simply stated without derivations, which is problematic since some of the papers give guarantees for slightly different assumptions.

For example, theorem 1.3 of “Between Pure and Approximate Differential Privacy” states that there is an (ϵ,δ)(\epsilon,\delta)-differentially private mechanism for kk adaptively chosen queries which requires nn samples, with n(kloglogklog1/2(1/δϵα) n \ge \left(\frac{\sqrt{k\log\log k}\log^{1/2}(1/\delta}{\epsilon \alpha}\right)

The mechanism is said to be (α,1/kO(1))(\alpha,1/k^{O(1)}) with respect to the sample, which is problematic since we would like to be able to specify the accuracy independently of kk. Thus, it appears that without greater familiarity with differential privacy, it would likely be necessary to open up the proofs provided in each paper in order to obtain an appropriate sample complexity bound for use with the main theorem.

Further Reading

[1] Raef Bassily et al. Algorithmic Stability for Adaptive Data Analysis. 2015. http://arxiv.org/abs/1511.02513.

[2] Cynthia Dwork et al. Preserving Statistical Validity in Adaptive Data Analysis. 2016. http://arxiv.org/abs/1411.2664

[3] Thomas Steinke and Jonathan Ullman. Between Pure and Approximate Differential Privacy. 2015. http://arxiv.org/abs/1501.06095