# The fundamental theorem of statistical learning

We state and prove the fundamental theorem of statistical learning, which says that a class of functions is learnable if and only if it has finite VC dimension.

This post uses all the notations of the previous post.

Theorem:

- If $H$ has infinite VC dimension, then it is not PAC-learnable.
- If $H$ has finite VC dimension, then it is PAC-learnable.

## First implication: no free lunch

We prove that if $H$ has infinite VC dimension, then it is not PAC-learnable, which is sometimes called the “no free lunch theorem”.

Fix $m$. Assume that $H$ has infinite VC dimension, in particular it has dimension at least $2m$, so there exists $Y$ of size $2m$ which is shattered by $H$. Recall that to be PAC-learnable, there must exist an algorithm working against any distributions of the inputs. Here we will restrict ourselves with the uniform distribution over $Y$, so in particular we can only consider hypotheses .

We will show the following: if we are given $m$ samples, then there is no way that with high probability an algorithm can correctly pick an almost correct function. More precisely, we show that any algorithm fails on a proportion at least $\frac{1}{8}$ of the inputs with probability at least $\frac{1}{7}$.

These strange constants come from the following lemma.

Lemma:Let $X$ be a random variable taking values in $[0,1]$ such that . Then .

This is a simple application of Markov’s inequality: , with double complementation.

We write $U$ for the uniform distribution over $Y$.

An algorithm is a function $A$ taking as input the labelled samples and outputting . We show that any algorithm fails on at least one function. To this end we draw a function uniformly at random, and show that on expectations the algorithm cannot do too well.

Lemma:In the following quantity, we draw a function uniformly at random. .

**Proof:**
Recall that is defined by
.

We lower bound it by

.

We plug in the second term and remark that

,

since for $x \notin S$ the value of $f(x)$ is either $1$ or $0$ each with probability $\frac{1}{2}$.

This concludes the proof of this lemma.

We now wrap up the proof of the no free lunch theorem. Fix an algorithm $A$. It follows from the second lemma that there exists a function $f$ such that . Thanks to the first lemma, we obtain .

## Converse implication: finite VC dimension implies PAC-learnability

Let $d$ be the VC dimension of $H$. We combine a few ingredients.

The first ingredient is Sauer’s lemma.

Sauer’s Lemma:.

There are different proofs of Sauer’s lemma, some of them are very beautiful. My favourite is the second one in this lecture notes. Note that it is a purely combinatorial statement.

The main technical piece of work to be done is to obtain generalization bounds using Rademacher complexity. What is a generalization bound? It’s an inequality relating the empirical loss, i.e. $L_{S,f}(h)$, and the actual loss, i.e. $L_{D,f}(h)$. We say that an hypothesis $h$ generalises if is small.

Lemma:.

**Proof:**
By definition and
.

To compare these two quantities we first make them look similar: observe that . We write .

We obtain .

Recall that the goal is to upper bound by the Rademacher complexity. We claim that the quantity above is equal to

Indeed, fix $\sigma$ and $\sigma’$. For each $i$ there are four cases:

- either $\sigma_i = \sigma’_i = +1$, then this is the term as in the quantity above,
- or $\sigma_i = \sigma’_i = -1$, this is the term as in the quantity above once $x_i$ and $x’_i$ are swapped in $S$ and $S’$,
- the other two terms cancel out ($\sigma_i = +1$ and $\sigma’_i = -1$ with $\sigma_i = -1$ and $\sigma’_i = +1$).

This quantity is smaller than , which concludes the proof of this lemma.

#### Proof wrap up

We now combine this with Massart’s Theorem (see the previous post), yielding .

It follows using Markov’s inequality that for $\delta > 0$, .

Thanks to Sauer’s lemma, $\pi_H(2m)$ grows polynomially in $m$ (since $d$ is constant), so given $\varepsilon > 0$ and $\delta > 0$, one can choose $m$ such that .

An Empirical Risk Minimisation (ERM) algorithm is one that outputs an hypothesis exactly matching the training set, i.e. such that $L_{S,f}(A(S,f)) = 0$.

Hence what we proved is that any ERM algorithm ensures for $m$ chosen as above. Hence $H$ is PAC-learnable. Note that we did not say anything about how to construct an ERM algorithm.