Learning and Uniform Convergence
First post in a series on generalization in machine learning.
$ \gdef\R{\mathbb{R}} \gdef\E{\mathbb{E}} \gdef\P{\mathbb{P}} \gdef\N{\mathbb{N}} \gdef\D{\mathcal{D}} \gdef\X{\mathcal{X}} \gdef\Y{\mathcal{Y}} \gdef\Z{\mathcal{Z}} \gdef\H{\mathcal{H}} \gdef\A{\mathcal{A}} \gdef\I{\mathbb{I}} $

Introduction

A general machine learning problem starts with a known domain of interest, $\Z$, and an unknown distribution over that domain, $\D$. Our access to $\D$ consists only of the ability to generate a finite number of i.i.d. samples from it. In addition to the distribution, we have a class of hypotheses, $\H$, from which we hope to choose an element which characterizes the distribution. Finally, we have a risk functional, $f: \H \times \Z \to \R$, which quantifies the goodness of fit between a hypothesis and an individual sample from $\D$. We wish to find the hypothesis which minimizes the population risk, defined as

$$\tag{}R(h) = \mathbb{E}_{z\sim\mathcal{D}}[f(h;z)]$$

We denote the risk of this optimal solution as $R(h^\star) = \inf_{h \in \mathcal{H}}R(h)$. For the special cases of classification and regression, we let $\Z = \X \times \Y$, where $\X \subseteq \R^d$ and $\Y \subseteq \R$ are respectively the input and output domains. Then, naturally $\H \subseteq \Y^\X$ and $f(h,z) \equiv \ell(h(x),y)$ for a suitable loss function $\ell : \Y \times \Y \to \R$.

A machine learning task may be identified with the pair $(\H,f)$. The distribution $\D$ is excluded from the task definition since, formally, we know nothing about the distribution before we begin to sample it. Thus, the questions that we seek to ask and answer with respect to the learnability of a task must apply over all possible distributions. (When a practitioner does have access to some knowledge about the structure of $\D$, she can make use of this by judiciously choosing $\H$.)

Given a machine learning task, a learning algorithm is a function $\mathcal{A}: \bigcup_{n=1}^\infty \Z^n \to \mathcal{H}$, from a set of points $S = \{z_1,…,z_n\}$ to a hypothesis. Ideally, the learning algorithm outputs a hypothesis whose risk is close to optimal. An algorithm producing hypotheses whose risks converge toward the minimum risk is called a consistent algorithm, and a problem for which a consistent algorithm exists is called a learnable problem.

Definition (Learnability and Consistency): An algorithm is called consistent if for any $\epsilon > 0$, $$\tag{}\lim_{m \to \infty}\P_{S \sim \D^m} [R(\A[S]) - R(h^\star) > \epsilon] = 0$$ and a task is called learnable if a consistent algorithm exists for that task.

In this series, we will use the weak notion of convergence in probability, as defined above. However, in the literature, stronger notions such as convergence in mean can also be found.

In this post, we will explore a series of sufficient conditions for learnability, which will ultimately lead us to a condition, finiteness of the VC dimension, that can be calculated directly from $\H$. Finally, we will see that this condition is in fact necessary for learnability.

The roadmap of topics is as follows:

  • Empirical Risk Minimization (ERM)
  • Uniform Convergence
  • VC Dimension and Related Concepts

We will show that

$$ \begin{matrix}\text{Finite VC} \\ \text{Dimension}\end{matrix} \implies \begin{matrix}\text{Uniform} \\ \text{Convergence}\end{matrix} \implies \begin{matrix}\text{Learnable} \\ \text{with ERM}\end{matrix} \implies \text{Learnable}$$

Uniform Convergence and Empirical Risk Minimization

First, let us introduce the notion of empirical risk minimization (ERM). ERM is a learning strategy which selects the member of $\H$ with the smallest empirical risk, $R_S(h)$, on the training set $S$:

$$\tag{} R_S(h) = \frac{1}{|S|}\sum_{z \in S} f(h;z)$$
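To make the ERM recipe concrete, here is a minimal sketch in Python; the distribution, the threshold hypothesis class, and all function names are illustrative choices for this post, not part of the formal setup. It selects the empirical risk minimizer over a small finite class and compares its empirical risk with a Monte Carlo estimate of its population risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m i.i.d. pairs z = (x, y) from an illustrative distribution D:
    x ~ Uniform[0, 1], y = 1{x > 0.5} with the label flipped w.p. 0.1."""
    x = rng.uniform(0, 1, size=m)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=m) < 0.1
    return x, np.where(flip, 1 - y, y)

def empirical_risk(theta, x, y):
    """R_S(h_theta) under the 0-1 loss, for the threshold hypothesis
    h_theta(x) = 1{x > theta}."""
    return np.mean((x > theta).astype(int) != y)

def erm(x, y, thetas):
    """Empirical risk minimization over a finite grid of thresholds."""
    return min(thetas, key=lambda t: empirical_risk(t, x, y))

thetas = np.linspace(0.0, 1.0, 101)      # a finite hypothesis class H
x_train, y_train = sample(50)            # the training set S
theta_hat = erm(x_train, y_train, thetas)

x_test, y_test = sample(100_000)         # Monte Carlo estimate of R(h)
print("empirical risk R_S:", empirical_risk(theta_hat, x_train, y_train))
print("population risk R (approx):", empirical_risk(theta_hat, x_test, y_test))
```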

Since ERM is simply a specific type of learning algorithm, clearly if a task is learnable with ERM, then it is learnable. Note that the law of large numbers (LLN) directly gives us a form of convergence: for any fixed $h \in \H$, as $m \to \infty$, $R_S(h) \to R(h)$ in probability. Sadly, this isn’t enough: for the output of ERM, each $f(h;z)$ in the sum depends on $h$, and $h$ in turn depends on $S$, so we lose statistical independence in the sum and the LLN no longer applies. Thus, we need to consider the stronger uniform convergence guarantee that as $m \to \infty$

$$\tag{\htmlId{eq-uniform}{}}\P_{S \sim \D^m}( \exists h \in \H,\text{ s.t. }|R_S(h) - R(h)| > \epsilon) = \P_{S \sim \D^m}\left( \sup_{h \in \H}|R_S(h) - R(h)| > \epsilon\right) \to 0$$

We now have our first basic result:

Proposition (Uniform Convergence implies ERM Learnability): If uniform convergence holds for a given hypothesis class, $\H$, then that class is ERM learnable.

Proof: Let $h_S$ denote $\A[S]$, the output of ERM. Since $h_S$ is a minimizer of $R_S$, it follows that $R_S(h^\star) - R_S(h_S) \ge 0$. Thus

$$ \tag{}R(h_S) - R(h^\star) \leq R(h_S) - R_S(h_S) + R_S(h^\star) - R(h^\star) \leq 2 \sup_{h\in \H} |R(h) - R_S(h)| $$

Uniform convergence sends the right hand side to zero in probability, so ERM is consistent. $\blacksquare$

Notice that if $\H$ is of finite cardinality, then uniform convergence follows from the LLN directly via the union bound. This motivates the strategy of considering the effective size of the hypothesis class.
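To make the finite case quantitative (a standard refinement, assuming the loss takes values in $[0,1]$): Hoeffding’s inequality applied to each fixed $h$, together with the union bound, gives

$$\P_{S \sim \D^m}\left( \sup_{h \in \H}|R_S(h) - R(h)| > \epsilon\right) \le \sum_{h \in \H}\P_{S \sim \D^m}\left( |R_S(h) - R(h)| > \epsilon\right) \le 2|\H|e^{-2m\epsilon^2},$$

which tends to zero as $m \to \infty$ for any fixed $\epsilon$, but degrades as $|\H|$ grows.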

Bounded Hypothesis Class Complexity and Uniform Convergence

We now specialize to the case of binary classification; other cases can be handled with variations of the techniques presented here. See [3] for a thorough treatment.

Binary Classification

In the case of binary classification, we progress toward uniform convergence by noting that there are always only finitely many ways to classify the elements of a sample $S$ (even for an uncountable $\H$). By upper bounding this number, and employing a statistical trick known as symmetrization, we are again able to use the union bound to show uniform convergence, as in the case of finite $\H$.

First, let us formalize the above number:

Definition (Growth Function): For a given sample $S = \{z_1,…,z_m\}$ with $z_i = (x_i,y_i)$, let $\H(S)$ (also written $\H_S$) be the set of ‘projections’ of $h \in \H$ onto $S$:

$$\tag{} \H(S) = \{(h(x_1),…,h(x_m)) : h \in \H\}$$

The growth function is defined to be the maximum size of $\H(S)$ over all samples $S$:

$$\tag{} S_{\H}(m) = \sup_{S \in \Z^m} |\H(S)|$$
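For intuition, here is a small sketch (again in Python, with the one-dimensional threshold class as an illustrative choice) that counts the projections $|\H(S)|$ directly. Although the class contains uncountably many thresholds, a sample of $m$ distinct points admits only $m+1$ distinct projections, so for this class $S_\H(m) = m + 1$.

```python
import numpy as np

rng = np.random.default_rng(0)

def num_projections(xs):
    """|H(S)| for the threshold class h_theta(x) = 1{x > theta}: any theta
    produces the same labels as some representative threshold chosen below the
    smallest point, between two consecutive points, or above the largest point."""
    xs = np.sort(xs)
    reps = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    projections = {tuple((xs > t).astype(int)) for t in reps}
    return len(projections)

m = 10
print(num_projections(rng.uniform(0, 1, size=m)))   # prints 11 == m + 1
```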

The following theorem, due to Vapnik and Chervonenkis, shows that the growth function can serve as the effective size of the hypothesis class.

Theorem (Vapnik and Chervonenkis): For any $\delta > 0$, with probability at least $1 - \delta$ over $S \sim \D^m$:

$$\tag{\htmlId{eq-gf-bound}{}} \sup_{h \in \H}|R_S(h) - R(h)| \le 2\sqrt{2\frac{\log(S_\H(2m)) + \log(2/\delta)}{m}}$$

▸ Proof

The quantity $\sup_{h \in \H}|R_S(h) - R(h)|$ seems to depend on the full class $\H$, which could be uncountable; this isn’t appropriate for the use of the union bound. We need to upper bound this quantity (probabilistically) in a way that depends only on a finite “sampling” or “projection” of $\H$, i.e., on something like $\H_S$. We do this by using the symmetrization lemma:

Lemma (Symmetrization): For any $t > 0$ such that $mt^2 \ge 2$, $$\tag{\htmlId{eq-symmetric}{}} \P_{S \sim \D^m} \left[\sup_{h \in \H}R(h) - R_S(h) > t\right] \le 2\P_{S,S' \sim \D^m} \left[\sup_{h \in \H}R_{S'}(h) - R_S(h) > t/2\right]$$ where $S'$ is an independent “ghost” sample of the same size.

Proof of lemma: Fix $S$, and let $h'$ be a hypothesis achieving the supremum on the left hand side (assume for simplicity that it is attained); note that $h'$ depends on $S$ but not on $S'$. Then

$$ \I_{R(h') - R_S(h') > t}\I_{R(h') - R_{S'}(h') < t/2} = \I_{R(h') - R_S(h') > t \; \land \; R_{S'}(h') - R(h') > -t/2} \le \I_{R_{S'}(h') - R_S(h') > t/2}$$

Taking expectation with respect to the second sample gives

$$ \I_{R(h') - R_S(h') > t}\P_{S'}(R(h') - R_{S'}(h') < t/2) \le \E_{S'}[\I_{R_{S'}(h') - R_S(h') > t/2}]$$

But each loss term $f(h';z) = \I_{h'(x) \neq y}$ takes values in $\{0,1\}$ and therefore has variance at most $1/4$, so that $\textrm{var}(R_{S'}(h')) \le 1/(4m)$. Chebyshev’s inequality then yields

$$\P_{S'}(R(h') - R_{S'}(h') \ge t/2) \le \frac{\textrm{var}(R_{S'}(h'))}{(t/2)^2} \le \frac{1}{mt^2} \le \frac{1}{2}$$

so that $\P_{S'}(R(h') - R_{S'}(h') < t/2) \ge 1/2$.

Substituting this bound into the previous display and taking expectation with respect to $S$, we have

$$\tag{} \P_S(R(h') - R_S(h') > t) \le 2 \P_{S,S'}(R_{S'}(h') - R_S(h') > t/2)$$

But by the choice of $h'$, the left hand side of $\htmlClass{eq-ref eq-symmetric}{}$ is equal to the left hand side of this equation, and the right hand side of this equation is at most the right hand side of $\htmlClass{eq-ref eq-symmetric}{}$. $\blacksquare$

With the symmetrization lemma in hand, our desired result follows quite quickly. We need only note that because $\H_{S \cup S'}$ contains all possible projections of $h \in \H$ onto the points of $S$ and $S'$,

$$ \sup_{h \in \H}R_{S'}(h) - R_S(h) = \sup_{h \in \H_{S \cup S'}}R_{S'}(h) - R_S(h)$$

But $\H_{S \cup S'}$ has size at most $S_\H(2m)$! We will also make use of a two-sample form of Hoeffding’s inequality for losses bounded in $[0,1]$ (this follows from the usual form, applied to the differences $f(h;z_i') - f(h;z_i) \in [-1,1]$): for any fixed $h$,

$$\P_{S,S' \sim \D^m} [R_{S'}(h) - R_S(h) > \epsilon] \le e^{-m\epsilon^2/2}$$

With these two facts, along with the symmetrization lemma, we have,

$$ \begin{aligned} \P_S \left[\sup_{h \in \H}R(h) - R_S(h) > \epsilon \right] &\le 2 \P_{S,S'} \left[\sup_{h \in \H}R_{S'}(h) - R_S(h) > \epsilon/2 \right] \\ &= 2 \P_{S,S'} \left[\sup_{h \in \H_{S\cup S'}}R_{S'}(h) - R_S(h) > \epsilon/2 \right] \\ &\le 2 S_\H(2m) \P_{S,S'} \left[R_{S'}(h) - R_S(h) > \epsilon/2 \right] \\ &\le 2 S_\H(2m)e^{-m\epsilon^2/8} \end{aligned} $$

Setting the right hand side equal to $\delta$ and solving for $\epsilon$ gives the stated bound for the one-sided deviation; the absolute value in the theorem is handled by repeating the argument for deviations in the other direction, which changes the constants only slightly. $\blacksquare$

How does this result help us? Notice that clearly $S_\H(m) \le 2^m$, since there are only $2^m$ distinct classifications of $m$ objects. But plugging this trivial bound into $\htmlClass{eq-ref eq-gf-bound}{}$ does not give uniform convergence: we need $S_\H(m)$ to grow sub-exponentially for the bound to be non-vacuous. This is where the VC dimension comes into play: the growth function of a hypothesis class with bounded VC dimension is bounded by a polynomial in $m$.
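To see the issue concretely, substituting the trivial bound $S_\H(2m) = 2^{2m}$ into $\htmlClass{eq-ref eq-gf-bound}{}$ gives

$$2\sqrt{2\,\frac{\log(2^{2m}) + \log(2/\delta)}{m}} = 2\sqrt{4\log 2 + \frac{2\log(2/\delta)}{m}} \;\xrightarrow{\;m \to \infty\;}\; 4\sqrt{\log 2} \approx 3.3,$$

a constant which never becomes small (and is vacuous anyway, since the deviation of a $[0,1]$-valued loss is at most $1$).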

When $S_\H(m) = 2^m$, there exists a set $S$ of size $m$ such that all possible classifications of $S$ are realized by $\H$. In this situation, $\H$ is said to shatter $S$. The VC dimension of $\H$ is the size of the largest set which is shattered by $\H$. Formally:

Definition (VC Dimension): The VC dimension of a class $\H$ is the largest $m$ such that

$$S_\H(m) = 2^m$$
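As a toy illustration of shattering (a sketch; the interval class and helper names are choices made for this example), the snippet below checks by brute force whether every one of the $2^{|C|}$ labelings of a finite set $C \subset \R$ is realized by the indicator of a closed interval. Pairs of points are shattered, but no triple is, so this class has VC dimension 2.

```python
def interval_labels(points, a, b):
    """Projection of the hypothesis h_{a,b}(x) = 1{a <= x <= b} onto the points."""
    return tuple(int(a <= x <= b) for x in points)

def is_shattered(points):
    """Brute-force shattering check for the interval class: it suffices to try
    endpoints at the points themselves, between consecutive points, and just
    outside the range of the points."""
    pts = sorted(points)
    mids = [(u + v) / 2 for u, v in zip(pts, pts[1:])]
    candidates = [pts[0] - 1] + pts + mids + [pts[-1] + 1]
    realized = {interval_labels(points, a, b) for a in candidates for b in candidates}
    return len(realized) == 2 ** len(points)

print(is_shattered([0.2, 0.7]))        # True:  any pair of distinct points is shattered
print(is_shattered([0.2, 0.5, 0.7]))   # False: the labeling (1, 0, 1) is unrealizable
```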

The importance of the VC dimension comes from a lemma, often known as Sauer’s lemma:

Lemma (Sauer, Vapnik and Chervonenkis): Let $\H$ be a class of functions with finite VC dimension $h$. Then for all $m \in \N$,

$$S_\H(m) \le \sum_{i=0}^{h}{m \choose i}$$

and for $m \ge h$,

$$S_\H(m) \le \left(\frac{em}{h}\right)^h$$

We recommend [1] for the proof.
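A quick numerical sketch of the gap these bounds create (the particular values of $h$, $m$, and $\delta$ are arbitrary): compared with the trivial $2^m$, the Sauer bound is astronomically smaller, and plugging it into the bound of the theorem of Vapnik and Chervonenkis yields a non-vacuous deviation bound.

```python
from math import comb, e, log, sqrt

def sauer_bound(m, h):
    """Binomial-sum bound on the growth function S_H(m) from Sauer's lemma."""
    return sum(comb(m, i) for i in range(h + 1))

h, m, delta = 10, 10_000, 0.01
print(len(str(2 ** m)))              # the trivial bound 2^m has ~3000 digits
print(sauer_bound(m, h))             # ~2.7e33, polynomial in m
print((e * m / h) ** h)              # ~2.2e34, the looser closed form

# Plug the Sauer bound into the Vapnik-Chervonenkis deviation bound above:
eps = 2 * sqrt(2 * (log(sauer_bound(2 * m, h)) + log(2 / delta)) / m)
print(eps)                           # ~0.27, and it shrinks as m grows
```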

Learnability and Bounded Hypothesis Complexity

We have now seen that a finite VC dimension implies learnability, via uniform convergence and the consistency of ERM. The final step is to show that

$$\text{Learnability} \implies \text{Finite VC Dimension}$$

This follows as a corollary of a particular version of the “No Free Lunch” theorem given in [1].

Theorem (No Free Lunch): Let $\A$ be any learning algorithm for the task of binary classification with respect to the 0-1 loss over domain $\X$. Let $m$ be any number smaller than $|\X|/2$, representing a training set size. Then, there exists a distribution $\D$ over $\X \times \{0,1\}$ such that $\P_{S\sim \D^m}(R(\A[S]) \ge 1/8) \ge 1/7$.

▸ Proof

The proof depends on constructing a family of distributions with finite support, having the property ($P$) that, on average, an “unseen” point is equally likely to belong to either class. We then note that the maximum loss over this family of distributions is at least the average loss, and thus there must exist a distribution in the family which forces a large loss.

Recall that since our context is binary classification, $\Y = \{0,1\}$, and $\ell(h(x),y) = \I_{h(x) \neq y}$.

Let $C \subseteq \X$ with $|C| = 2m$. We let $\Y^C = \{h_1,…,h_T\}$ denote the $T = 2^{2m}$ possible functions from $C$ to $\Y$. With each $i \in [T]$ we associate a distribution $\D_i$ such that

$$ \D_i(\{(x,y)\}) \equiv \P_{(s,t) \sim \D_i}(s=x,t=y) = \begin{cases} 1/|C| &\text{if } x \in C \land y=h_i(x) \\ 0 &\text{o.w.} \end{cases}$$

Notice that under this family of distributions, the learning algorithm will never see points outside of $C$.

We now consider the quantity

$$\tag{} \max_{i \in [T]} \E_{S \sim D_i^m} [ R_{D_i}(A[S])] \ge \frac{1}{T} \sum_{i \in [T]} \E_{S \sim D_i^m} [ R_{D_i}(A[S])], $$

which we want to show to be large. Our goal is to make use of the property $P$, to be formalized later. But to do this, we will need to bring the summation over $i$ inside of the expectations that depend on it. We progress in this direction by expanding these expectations.

Notice that there are $k = (2m)^m$ possible sequences of length $m$ of elements of $C$. We denote these sequences $X_j = (x^j_1,…,x^j_m)$, for $j \in [k]$. For a given distribution $\D_i$, we let $Z_j^i = ((x^j_1,h_i(x^j_1)),…,(x^j_m,h_i(x^j_m)))$ denote the labeled sequence corresponding to $X_j$. Each such sequence has equal probability $1/k$ under $\D_i^m$. Thus, we have

$$\tag{} \E_{S \sim D_i^m} [ R_{D_i}(A[S])] = \frac{1}{k}\sum_{j\in[k]}R_{D_i}(\A[Z_j^i])$$

We also have that for each $i$,

$$\tag{} R_{D_i}(h) = \frac{1}{2m}\sum_{x\in C}\I_{h(x)\neq h_i(x)}$$

Combining the last two displays with the max-versus-average inequality above, and interchanging the order of the summations, we have

$$\tag{} \begin{aligned} \max_{i \in [T]} \E_{S \sim D_i^m} [ R_{D_i}(A[S])] &\ge \frac{1}{k}\sum_{j\in[k]}\frac{1}{2m}\sum_{x\in C}\frac{1}{T} \sum_{i \in [T]}\I_{\A[Z_j^i](x)\neq h_i(x)} \\ &\ge \frac{1}{k}\sum_{j\in[k]}\frac{1}{2m}\sum_{x\in C\setminus X_j}\frac{1}{T} \sum_{i \in [T]}\I_{\A[Z_j^i](x)\neq h_i(x)} \end{aligned}$$

In the second inequality, we discard the seen points, so that $\A[Z_j^i]$ is only evaluated on unseen points. Now we are ready to formally state property $P$: for any $\A$ and $x \notin X_j$,

$$\tag{} \frac{1}{T}\sum_{i\in[T]} \I_{\A[Z_j^i](x) \neq h_i(x)} = \frac{1}{2}$$

To see this, note that there exists a bijection $\pi : [T] \to [T]$ such that $h_i$ and $h_{\pi(i)}$ differ only at $x$. Since $x \notin X_j$, the labeled sequences coincide, $Z_j^i = Z_j^{\pi(i)}$, and so $\A[Z_j^i](x) = \A[Z_j^{\pi(i)}](x)$. Thus, we must have

$$\I_{\A[Z_j^i](x) \neq h_i(x)} + \I_{\A[Z_j^{\pi(i)}](x) \neq h_{\pi(i)}(x)} = 1$$

or

$$2\sum_{i\in[T]} \I_{\A[Z_j^i](x) \neq h_i(x)} = \sum_{i\in[T]}\I_{\A[Z_j^i](x) \neq h_i(x)} + \sum_{i\in[T]}\I_{\A[Z_j^{\pi(i)}](x) \neq h_{\pi(i)}(x)} = T$$

which gives property $P$. Combining property $P$ with the lower bound above, we have

$$\tag{} \max_{i \in [T]} \E_{S \sim D_i^m} [ R_{D_i}(A[S])] \ge \frac{1}{k}\sum_{j\in[k]}\frac{1}{2m}\sum_{x\in C\setminus X_j}\frac{1}{2} \ge \frac{1}{4} $$

since $|C\setminus X_j| \ge m$.

We have shown that there exists a distribution $\D$ such that $\E_{S \sim D^m} [ R(A[S])] \ge \frac{1}{4}$. To get the statement of the theorem, we need only apply a suitable form of Markov’s inequality. Since $R(A[S]) \in [0,1]$, $Y = 1 - R(A[S])$ is a nonnegative random variable with $\E[Y] \le 3/4$. We thus have

$$P(R(A[S]) \ge 1/8) = P(Y \le 7/8) \ge 1 - P(Y \ge 7/8) \ge 1 - \frac{\E[Y]}{7/8} \ge 1 - \frac{3/4}{7/8} = \frac{1}{7}$$
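The construction in this proof is small enough to check numerically. The sketch below (illustrative; the memorize-then-predict-zero algorithm is just one concrete choice of $\A$) enumerates all labelings of a set $C$ with $|C| = 2m = 4$, computes $\E_{S \sim \D_i^m}[R_{\D_i}(\A[S])]$ exactly for each, and confirms that the average, and hence the maximum, is at least $1/4$.

```python
from itertools import product

C = [0, 1, 2, 3]                 # |C| = 2m, with m = 2
m = 2
labelings = list(product([0, 1], repeat=len(C)))     # the T = 2^{2m} functions h_i

def learn(sample):
    """An illustrative algorithm A: memorize seen labels, predict 0 elsewhere."""
    seen = dict(sample)
    return lambda x: seen.get(x, 0)

def risk(h, labels):
    """R_{D_i}(h): average 0-1 loss over the 2m equally likely points of C."""
    return sum(h(x) != labels[x] for x in C) / len(C)

def expected_risk(labels):
    """E_{S ~ D_i^m}[R_{D_i}(A[S])], computed exactly over all (2m)^m sequences."""
    seqs = list(product(C, repeat=m))                # each has probability 1/(2m)^m
    return sum(risk(learn([(x, labels[x]) for x in xs]), labels) for xs in seqs) / len(seqs)

values = [expected_risk(l) for l in labelings]
print(sum(values) / len(values))   # 0.28125 >= 1/4, as the proof guarantees
print(max(values))                 # 0.5625: some D_i forces an even larger expected risk
```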

We can use the no free lunch theorem to show that a hypothesis class having unbounded VC dimension is not learnable.

Corollary: Let $\H$ be a class of functions with unbounded VC dimension. Then $\H$ is not learnable.

Proof: Suppose to the contrary that $\H$ is learnable and has unbounded VC dimension. Since $\H$ is learnable, there exists an algorithm $\A$ and $m \in \N$ such that for $\epsilon = 1/8$ and $\delta = 1/7$, and for every distribution $\D$,

$$\P_{S \sim \D^m}[R(\A[S]) - R(h^\star) > \epsilon] < \delta$$

Since $\H$ has unbounded VC dimension, there exists a set $C \subseteq \X$ of size $2m$ which $\H$ shatters. But by the no free lunch theorem, there exists a distribution $\D$ supported on labeled points of $C$ such that

$$\P_{S \sim \D^m}[R(\A[S]) > \epsilon] \ge \delta$$

Furthermore, since $\H$ shatters $C$ and $\D$ labels the points of $C$ by a fixed function of $x$, we have $R(h^\star) = 0$. Thus, we have

$$\P_{S \sim \D^m}[R(\A[S]) - R(h^\star) > \epsilon] \ge \delta$$

which gives the contradiction.

An Interpretation

The final implication of the previous section is incredibly important, since it tells us that finite VC dimension, uniform convergence, learnability by ERM, and learnability in general are all equivalent notions for the learning problems we have considered (though we only showed the results for binary classification, similar results hold for general classification and regression). We can use this characterization to build an intuitive interpretation of what learning means and how it works.

Consider the ERM algorithm: it works by testing each hypothesis on the training data and selecting the hypothesis which fits the data best. We now know that this approach fails when we do not have enough samples to ensure uniform convergence. What happens then? From $\htmlClass{eq-ref eq-uniform}{}$, the answer is that there may exist a sample drawn from $\D^m$ such that for some $h \in \H$, $R_S(h)$ is far from $R(h)$. Indeed, we might find that $R_S(\A[S])$ is small while $R(\A[S]) \gg R(h^\star)$. In such an event, we say that $\A$ has overfit to the training data.

The bound given in $\htmlClass{eq-ref eq-gf-bound}{}$ further tells us that there are essentially two requirements to ensure uniform convergence and to minimize the probability of overfitting:

  • We need enough samples to ensure that each time we independently test a hypothesis, we get a good estimate of loss with high probability. This requirement corresponds to the second term in the radical on the right hand side of $\htmlClass{eq-ref eq-gf-bound}{}$.
  • We need enough samples to ensure that, among the “independent” hypotheses that we wish to test, there is not a coincidental match between a hypothesis and our sample. We can roughly equate this requirement to the first term in the radical.

It is interesting to observe that either of these terms may dominate the sample complexity bounds implied by $\htmlClass{eq-ref eq-gf-bound}{}$. For example, the VC dimension of the set of half-spaces in $\R^d$ is $d+1$. Taking $\delta = 0.001$ and $d = 5$, we have $d \approx \log(2/\delta)$, so we might suggest that for $d < 5$, most of the sample complexity requirement comes from concentration requirements on a single hypothesis, whereas for $d \gg 5$, the need to avoid overfitting is the biggest driver of sample complexity.
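As a rough numerical check of this comparison: with $\delta = 0.001$,

$$\log(2/\delta) = \log(2000) \approx 7.6,$$

which is comparable to the VC dimension $d + 1 = 6$ of half-spaces in $\R^5$. (This comparison ignores the extra $\log m$ factor that Sauer’s lemma introduces into $\log S_\H(2m)$, so it should be read only as a heuristic.)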

Further Reading

[1] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge: Cambridge University Press, 2014.

[2] O. Bousquet, S. Boucheron, and G. Lugosi, Introduction to Statistical Learning Theory, vol. 3176. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004.

[3] M. Anthony and P. L. Bartlett, Neural Network Learning: Theoretical Foundations, 1st ed. Cambridge University Press, 1999.