# VC dimension, Rademacher complexity, and growth function

March 15, 2018 | by

We define the VC dimension, the Rademacher complexity, and the growth function of a class of functions.

Let $X$ be the set of inputs. The general learning question we consider is the following: we want to learn a target function $f : X \to \left\{0,1\right\}$ through a number of samples $(x_i,f(x_i))_{i \in [1,m]}$. The function $f$ is unknown, but we assume some background knowledge in the form of a set of hypotheses $H$, which are functions $h : X \to \left\{0,1\right\}$.

We make a strong assumption called realizability: $f \in H.$ This assumption can be lifted to obtain more general results, which we will not do here.

We would like to quantify how hard it is to learn a function from $H$. More precisely, we are interested in the PAC-learnability of $H$; see below for a definition. There are two approaches:

• a combinatorial approach, called the VC dimension,
• a probabilistic approach, called the Rademacher complexity.

## PAC learning

We let $D$ denote distributions over the inputs, and write $x \sim D$ to mean that $x$ is sampled according to $D$. If the sample set has size $m$ we assume that the inputs are identically independently sampled according to $D$, written $S \sim D^m$, where $S = (x_i)_{i \in [1,m]}$. The distribution $D$ is unknown, in particular we do not want the learning algorithm to depend on it.

The first notion we need is the loss function: how good is a hypothesis $h$ if the target function is $f$?

• The zero-one loss is $L_f(h,x) = 1 \text{ if } h(x) \neq f(x), \text{ and } 0 \text{ otherwise}$.
• The quality of $h$ is measured by $L_{D,f}(h) = E_{x \sim D}[L_f(h,x)]$.
• The empirical quality of $h$ against the samples $S$ is $L_{S,f}(h) = \frac{1}{m} \sum_{i = 1}^m L_f(h,x_i)$.

Definition: We say that $H$ is PAC-learnable if
for all $\varepsilon > 0$ (precision),
for all $\delta > 0$ (confidence),
there exists an algorithm $A$,
there exists $m$ a number of samples,
for all objective function $f \in H$,
for all distributions $D$ over $X$:
$P_{S \sim D^m} ( L_D(h,f) \le \varepsilon ) \ge 1 - \delta,$ where we let $h$ denote the hypothesis picked by the algorithm $A$ when receiving the samples $S = (x_i,f(x_i))_{i \in [1,n]}$.

## VC dimension

We say that a subset $Y$ of $X$ is shattered by $H$ if by considering the restrictions of the functions in $H$ to $Y$ we obtain all functions $Y \to \left\{0,1 \right\}$. Formally, let $H_{\mid Y}$ denote $\left\{ h_{\mid Y} : Y \to \left\{0,1\right\} \mid h \in H \right\}$. The subset $Y$ is shattered by $H$ if $\text{Card} (H_{\mid Y}) = 2^Y$.

Intuitively, $Y$ being shattered by $H$ means that to learn how the function $f$ behaves on $Y$ one needs the values of $f$ on each element of $Y$. Hence the if $H$ shatters large sets, it is difficult to learn a function from $H$.

Definition: The VC dimension of $H$ is the size of the largest set $Y$ which is shattered by $H$.

We unravel the definition: the VC dimension of $H$ is the size of the largest set $Y$ for which all possible functions are realised.

The beauty here is that this is a purely combinatorial definition, which abstracts away probabilities and in particular the probabilistic distribution on the inputs. We will see that it anyway says something about the PAC-learnability of $H$.

The Rademacher complexity quantifies how a set of functions fits random noise. To define it for $H$, we consider the loss functions induced by $H$ and $f$.

Hence the question is how much do the loss functions induced by $H$ and $f$ fit random noise. A random noise is $\sigma = (\sigma_i)_{i \in [1,m]}$ where each $\sigma_i \in \left\{-1,+1\right\}$ is drawn uniformly and independently at random. To measure how well $h$ correlates with random noise we use the quantity

$\frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i)$.

Hence the function $h$ which best fits random noise maximises the above quantity.

The empirical Rademacher complexity of $H$ and $f$ against the samples $S$ is $E_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i) \right]$

Definition: The Rademacher complexity $R_H(m)$ of $H$ is $\sup_{f \in H} E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i) \right] \right]$.

## Growth function

The notion of growth function will be a bridge between VC dimension and Rademacher complexity.

Definition: The grow function $\tau_H$ of $H$ associates with $m$ the number $\max_{Y \subseteq X, |Y| = m} |H_{\mid Y}|$.

Remark that by definition, if $H$ has VC dimension $d$, then for $m \le d$, we have $\tau_H(m) = 2^m$.

The following theorem relates the Rademacher complexity to the growth function:

Massart’s Theorem: $R_H(m) \le \sqrt{\frac{2 \log(\pi_H(m))}{m}}$.

The proof of analytical and relies on Hoeffding’s inequality. See p39/40 of the Foundations of Machine Learning book for a proof.