# VC dimension, Rademacher complexity, and growth function

We define the VC dimension, the Rademacher complexity, and the growth function of a class of functions.

Let $X$ be the set of inputs. The general learning question we consider is the following: we want to learn a target function through a number of samples $(x_i,f(x_i))_{i \in [1,m]}$. The function $f$ is unknown, but we assume some background knowledge in the form of a set of hypotheses $H$, which are functions .

We make a strong assumption called **realizability**:
This assumption can be lifted to obtain more general results, which we will not do here.

We would like to quantify **how hard** it is to learn a function from $H$.
More precisely, we are interested in the PAC-learnability of $H$; see below for a definition.
There are two approaches:

- a combinatorial approach, called the VC dimension,
- a probabilistic approach, called the Rademacher complexity.

## PAC learning

We let $D$ denote distributions over the inputs, and write $x \sim D$ to mean that $x$ is sampled according to $D$. If the sample set has size $m$ we assume that the inputs are identically independently sampled according to $D$, written $S \sim D^m$, where $S = (x_i)_{i \in [1,m]}$. The distribution $D$ is unknown, in particular we do not want the learning algorithm to depend on it.

The first notion we need is the loss function: how good is a hypothesis $h$ if the target function is $f$?

- The zero-one loss is .
- The quality of $h$ is measured by $L_{D,f}(h) = E_{x \sim D}[L_f(h,x)]$.
- The empirical quality of $h$ against the samples $S$ is .

Definition:We say that $H$ is PAC-learnable if

for all $\varepsilon > 0$ (precision),

for all $\delta > 0$ (confidence),

there exists an algorithm $A$,

there exists $m$ a number of samples,

for all objective function $f \in H$,

for all distributions $D$ over $X$:

where we let $h$ denote the hypothesis picked by the algorithm $A$ when receiving the samples $S = (x_i,f(x_i))_{i \in [1,n]}$.

## VC dimension

We say that a subset $Y$ of $X$ is **shattered** by $H$ if by considering the restrictions of the functions in $H$ to $Y$ we obtain all functions .
Formally, let $H_{\mid Y}$ denote .
The subset $Y$ is shattered by $H$ if .

Intuitively, $Y$ being shattered by $H$ means that to learn how the function $f$ behaves on $Y$ one needs the values of $f$ on each element of $Y$. Hence the if $H$ shatters large sets, it is difficult to learn a function from $H$.

Definition:The VC dimension of $H$ is the size of the largest set $Y$ which is shattered by $H$.

We unravel the definition: the VC dimension of $H$ is the size of the *largest* set $Y$ for which *all* possible functions are realised.

The beauty here is that this is a purely combinatorial definition, which abstracts away probabilities and in particular the probabilistic distribution on the inputs. We will see that it anyway says something about the PAC-learnability of $H$.

## Rademacher complexity

The Rademacher complexity quantifies how a set of functions fits random noise. To define it for $H$, we consider the loss functions induced by $H$ and $f$.

Hence the question is how much do the loss functions induced by $H$ and $f$ fit random noise. A random noise is $\sigma = (\sigma_i)_{i \in [1,m]}$ where each is drawn uniformly and independently at random. To measure how well $h$ correlates with random noise we use the quantity

.

Hence the function $h$ which best fits random noise maximises the above quantity.

The empirical Rademacher complexity of $H$ and $f$ against the samples $S$ is

Definition:The Rademacher complexity $R_H(m)$ of $H$ is .

## Growth function

The notion of growth function will be a bridge between VC dimension and Rademacher complexity.

Definition:The grow function $\tau_H$ of $H$ associates with $m$ the number .

Remark that by definition, if $H$ has VC dimension $d$, then for $m \le d$, we have $\tau_H(m) = 2^m$.

The following theorem relates the Rademacher complexity to the growth function:

Massart’s Theorem:.

The proof of analytical and relies on Hoeffding’s inequality. See p39/40 of the Foundations of Machine Learning book for a proof.