# VC dimension, Rademacher complexity, and growth function

We define the VC dimension, the Rademacher complexity, and the growth function of a class of functions.

Let $X$ be the set of inputs. The general learning question we consider is the following: we want to learn a target function \(f : X \to \left\{0,1\right\}\) from a number of samples $(x_i,f(x_i))_{i \in [1,m]}$. The function $f$ is unknown, but we assume some background knowledge in the form of a set of hypotheses $H$, whose elements are functions \(h : X \to \left\{0,1\right\}\).
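
To make the setup concrete, here is a minimal sketch in Python; the finite domain, the class of threshold functions, and the particular target are illustrative assumptions, not anything imposed by the definitions above.

```python
# Toy instantiation of the learning setup (all choices here are assumptions).
X = [0.1 * i for i in range(10)]  # a small finite input set

def threshold(t):
    """The hypothesis h_t : X -> {0,1} labelling x positive iff x >= t."""
    return lambda x: 1 if x >= t else 0

H = [threshold(t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]  # hypothesis class

f = threshold(0.5)                    # target; realizability holds since f behaves like a member of H
samples = [(x, f(x)) for x in X[:5]]  # labelled samples (x_i, f(x_i))
```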

We make a strong assumption called **realizability**:
\(f \in H.\)
This assumption can be lifted to obtain more general results, which we will not do here.

We would like to quantify **how hard** it is to learn a function from $H$.
More precisely, we are interested in the PAC-learnability of $H$; see below for a definition.
There are two approaches:

- a combinatorial approach, called the VC dimension,
- a probabilistic approach, called the Rademacher complexity.

The beautiful and fundamental result of statistical learning theory is that these two approaches are in some sense equivalent; we will prove this equivalence result in the next post.

## PAC learning

We let $D$ denote a distribution over the inputs, and write $x \sim D$ to mean that $x$ is sampled according to $D$. If the sample set has size $m$, we assume that the inputs are sampled independently and identically according to $D$, written $S \sim D^m$, where $S = (x_i)_{i \in [1,m]}$. The distribution $D$ is unknown; in particular, we do not want the learning algorithm to depend on it.

The first notion we need is the loss function: how good is a hypothesis $h$ if the target function is $f$?

- The zero-one loss is \(L_f(h,x) = 1 \text{ if } h(x) \neq f(x), \text{ and } 0 \text{ otherwise}\).
- The quality of $h$ is measured by its true loss $L_{D,f}(h) = E_{x \sim D}[L_f(h,x)]$.
- The empirical quality of $h$ on the samples $S$ is measured by the empirical loss \(L_{S,f}(h) = \frac{1}{m} \sum_{i = 1}^m L_f(h,x_i)\).
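
Concretely, the empirical loss is just the fraction of samples on which $h$ and $f$ disagree. A minimal sketch, reusing the toy `threshold` class and domain `X` from the snippet above:

```python
def zero_one_loss(h, f, x):
    """L_f(h, x): 1 if h and f disagree on x, 0 otherwise."""
    return 1 if h(x) != f(x) else 0

def empirical_loss(h, f, S):
    """L_{S,f}(h): average zero-one loss of h over the samples S."""
    return sum(zero_one_loss(h, f, x) for x in S) / len(S)

# threshold(0.75) disagrees with threshold(0.5) exactly on [0.5, 0.75):
print(empirical_loss(threshold(0.75), threshold(0.5), X))  # 0.3
```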

For \(f : X \to \left\{0,1\right\}\) we write $S_f = (x_i,f(x_i))_{i \in [1,m]}$ for the labelled samples.

Definition: We say that $H$ is PAC-learnable if

- for all $\varepsilon > 0$ (precision),
- for all $\delta > 0$ (confidence),
- there exists an algorithm $A$,
- there exists a number of samples $m$,
- for all target functions $f \in H$,
- for all distributions $D$ over $X$:

\(P_{S \sim D^m} \left( L_{D,f}(A(S_f)) \le \varepsilon \right) \ge 1 - \delta,\) where we let $A(S_f) \in H$ denote the hypothesis picked by the algorithm $A$ when receiving the labelled samples $S_f$.
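
Under realizability, a natural choice of algorithm $A$ is the consistent learner (empirical risk minimisation): return any hypothesis agreeing with all labelled samples. A sketch for a finite class, reusing the toy objects above; taking $D$ uniform on $[0,1)$ is an assumption of the example, and the learner itself never looks at $D$:

```python
import random

def erm(H, labelled_samples):
    """A(S_f): return some h in H with zero empirical loss.
    Under realizability such an h always exists."""
    for h in H:
        if all(h(x) == y for x, y in labelled_samples):
            return h
    raise ValueError("no consistent hypothesis: realizability violated")

f = threshold(0.5)
S = [random.random() for _ in range(20)]  # S ~ D^m with D uniform on [0,1)
h = erm(H, [(x, f(x)) for x in S])        # the learned hypothesis A(S_f)
```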

## VC dimension

We say that a subset $Y$ of $X$ is **shattered** by $H$ if by considering the restrictions of the functions in $H$ to $Y$ we obtain all functions \(Y \to \left\{0,1 \right\}\).
Formally, let $H_{\mid Y}$ denote \(\left\{ h_{\mid Y} : Y \to \left\{0,1\right\} \mid h \in H \right\}\).
The subset $Y$ is shattered by $H$ if \(\text{Card} (H_{\mid Y}) = 2^{\text{Card}(Y)}\).

Intuitively, $Y$ being shattered by $H$ means that to learn how the function $f$ behaves on $Y$ one needs the values of $f$ on each element of $Y$. Hence if $H$ shatters large sets, it is difficult to learn a function from $H$.

Definition: The VC dimension of $H$ is the size of the largest set $Y$ which is shattered by $H$ (it is infinite if $H$ shatters arbitrarily large sets).

We unravel the definition: the VC dimension of $H$ is the size of the *largest* set $Y$ for which *all* possible functions are realised.
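
On a finite domain, shattering and the VC dimension can be checked by brute force. A sketch, assuming as before that $H$ is given as a finite list of Python functions:

```python
from itertools import combinations

def restrictions(H, Y):
    """H restricted to Y, as the set of label tuples (h(y) for y in Y)."""
    return {tuple(h(y) for y in Y) for h in H}

def shatters(H, Y):
    """Y is shattered by H iff the restrictions realise all 2^|Y| labellings."""
    return len(restrictions(H, Y)) == 2 ** len(Y)

def vc_dimension(H, X):
    """Size of the largest subset of the finite domain X shattered by H."""
    d = 0
    for m in range(1, len(X) + 1):
        if any(shatters(H, Y) for Y in combinations(X, m)):
            d = m
        else:
            break  # subsets of shattered sets are shattered, so no larger set works
    return d
```

For the toy threshold class, `vc_dimension(H, X)` returns $1$: every single point is shattered, but no pair $x_1 < x_2$ can receive the labelling $(1,0)$.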

The beauty here is that this is a purely combinatorial definition, which abstracts away probabilities and in particular the probability distribution on the inputs. We will see that it nevertheless says something about the PAC-learnability of $H$.

## Rademacher complexity

The Rademacher complexity quantifies how well a set of functions fits random noise. Let $P$ be a set of functions \(p : X \to \mathbb{R}\).

Random noise is a vector $\sigma = (\sigma_i)_{i \in [1,m]}$ where each \(\sigma_i \in \left\{-1,+1\right\}\) is drawn uniformly and independently at random. To measure how well a function $p$ correlates with the random noise we use the quantity

\[\frac{1}{m} \sum_{i = 1}^m \sigma_i p(x_i).\]

The function $p$ defined by $p(x_i) = \sigma_i$ exactly fits the random noise $\sigma$. More generally, the function $p \in P$ which best fits $\sigma$ maximises the above quantity. The empirical Rademacher complexity of $P$ against the samples $S$ is

\[E_{\sigma} \left[ \sup_{p \in P} \frac{1}{m} \sum_{i = 1}^m \sigma_i p(x_i) \right]\]
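
The expectation over $\sigma$ is straightforward to estimate by sampling. A Monte Carlo sketch for a finite class $P$, where the supremum becomes a maximum; the number of trials is an arbitrary assumption:

```python
import random

def empirical_rademacher(P, S, trials=1000):
    """Monte Carlo estimate of the empirical Rademacher complexity of P on S."""
    m = len(S)
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice((-1, 1)) for _ in range(m)]  # uniform random noise
        total += max(sum(s * p(x) for s, x in zip(sigma, S)) / m for p in P)
    return total / trials
```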

Definition: The Rademacher complexity $R_P(m)$ of $P$ is \(E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{p \in P} \frac{1}{m} \sum_{i = 1}^m \sigma_i p(x_i) \right] \right]\).

We now specialise the definition of the Rademacher complexity to a class of hypotheses $H$ and a target function $f$:
it is the Rademacher complexity of the class of loss functions $L_f(h,\cdot) : X \to [0,1]$ for $h \in H$.
So the Rademacher complexity $R_{H,f}(m)$ of $H$ **and $f$** is

\(E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i) \right] \right]\).

To define the Rademacher complexity of $H$ we consider the supremum over all target functions $f \in H$, leading to the following definition.

Definition: The Rademacher complexity $R_H(m)$ of $H$ is \(\sup_{f \in H} E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i) \right] \right]\).
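
For a finite class and a distribution we can sample from, the same Monte Carlo idea gives an estimate of $R_H(m)$. In this sketch `sample_D` is an assumed sampler for $D$, and `random` and `zero_one_loss` are reused from the earlier snippets:

```python
def rademacher_H(H, sample_D, m, trials=200):
    """Monte Carlo estimate of R_H(m) for a finite hypothesis class H."""
    best = 0.0
    for f in H:  # sup over target functions f in H
        total = 0.0
        for _ in range(trials):
            S = [sample_D() for _ in range(m)]                  # S ~ D^m
            sigma = [random.choice((-1, 1)) for _ in range(m)]  # random noise
            total += max(  # sup over hypotheses h in H
                sum(s * zero_one_loss(h, f, x) for s, x in zip(sigma, S)) / m
                for h in H
            )
        best = max(best, total / trials)
    return best

# Example: rademacher_H(H, random.random, m=20)
```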

## Growth function

The notion of growth function will be a bridge between VC dimension and Rademacher complexity.

Definition: The growth function $\tau_H$ of $H$ associates with $m$ the number \(\max_{Y \subseteq X, |Y| = m} |H_{\mid Y}|\).

Remark that by definition, if $H$ has VC dimension $d$, then for $m \le d$ we have $\tau_H(m) = 2^m$, while for $m > d$ we have $\tau_H(m) < 2^m$.
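
On finite instances the growth function is again computable by brute force, reusing `restrictions` and `combinations` from the VC-dimension sketch; this also lets one check the remark numerically:

```python
def growth_function(H, X, m):
    """tau_H(m): the maximum of |H restricted to Y| over all size-m subsets Y of X."""
    return max(len(restrictions(H, Y)) for Y in combinations(X, m))

# Sanity check of the remark: tau_H(m) == 2**m exactly when m <= vc_dimension(H, X).
```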