VC dimension, Rademacher complexity, and growth function
We define the VC dimension, the Rademacher complexity, and the growth function of a class of functions.
Let $X$ be the set of inputs. The general learning question we consider is the following: we want to learn a target function \(f : X \to \left\{0,1\right\}\) through a number of samples $(x_i,f(x_i))_{i \in [1,m]}$. The function $f$ is unknown, but we assume some background knowledge in the form of a set of hypotheses $H$, which are functions \(h : X \to \left\{0,1\right\}\).
We make a strong assumption called realizability: \(f \in H.\) This assumption can be lifted to obtain more general results, which we will not do here.
We would like to quantify how hard it is to learn a function from $H$. More precisely, we are interested in the PAC-learnability of $H$; see below for a definition. There are two approaches:
- a combinatorial approach, called the VC dimension,
- a probabilistic approach, called the Rademacher complexity.
The beautiful and fundamental result of statistical learning theory is that these two approaches are in some sense equivalent; we will prove this equivalence result in the next post.
PAC learning
We let $D$ denote a distribution over the inputs, and write $x \sim D$ to mean that $x$ is sampled according to $D$. If the sample set has size $m$, we assume that the inputs are sampled independently and identically according to $D$, written $S \sim D^m$, where $S = (x_i)_{i \in [1,m]}$. The distribution $D$ is unknown; in particular, we do not want the learning algorithm to depend on it.
The first notion we need is the loss function: how good is a hypothesis $h$ if the target function is $f$?
- The zero-one loss is \(L_f(h,x) = 1 \text{ if } h(x) \neq f(x), \text{ and } 0 \text{ otherwise}\).
- The quality of $h$ is measured by $L_{D,f}(h) = E_{x \sim D}[L_f(h,x)]$.
- The empirical quality of $h$ against the samples $S$ is \(L_{S,f}(h) = \frac{1}{m} \sum_{i = 1}^m L_f(h,x_i)\).
For \(f : X \to \left\{0,1\right\}\) we write $S_f = (x_i,f(x_i))_{i \in [1,m]}$ for the labelled samples.
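To make these definitions concrete, here is a minimal Python sketch of the zero-one loss and the empirical loss; the helper names (`zero_one_loss`, `empirical_loss`) and the threshold example are illustrative choices, not part of the setup above.

```python
def zero_one_loss(h, f, x):
    """L_f(h, x): 1 if h disagrees with f on x, 0 otherwise."""
    return 1 if h(x) != f(x) else 0

def empirical_loss(h, f, samples):
    """L_{S,f}(h): average zero-one loss of h over the sample set S."""
    return sum(zero_one_loss(h, f, x) for x in samples) / len(samples)

# Illustrative example: threshold functions on the integers 0..9.
f = lambda x: 1 if x >= 3 else 0   # unknown target
h = lambda x: 1 if x >= 5 else 0   # candidate hypothesis
S = [0, 2, 3, 4, 7, 9]             # sampled inputs
print(empirical_loss(h, f, S))     # h and f disagree on 3 and 4 -> 2/6
```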
Definition: We say that $H$ is PAC-learnable if
for all $\varepsilon > 0$ (precision),
for all $\delta > 0$ (confidence),
there exists an algorithm $A$,
there exists $m$ a number of samples,
for all target functions $f \in H$,
for all distributions $D$ over $X$:
\(P_{S \sim D^m} \left( L_{D,f}(A(S_f)) \le \varepsilon \right) \ge 1 - \delta,\) where we let $A(S_f) \in H$ denote the hypothesis picked by the algorithm $A$ when receiving the samples $S_f$.
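The definition does not prescribe any particular algorithm. As an illustration only, here is a hedged Python sketch of the simplest learner one could try for a finite hypothesis class under realizability: return any hypothesis consistent with the labelled samples. The name `consistent_learner` and the threshold class are hypothetical.

```python
def consistent_learner(H, labelled_samples):
    """Return some h in H with empirical loss zero, assuming realizability."""
    for h in H:
        if all(h(x) == y for x, y in labelled_samples):
            return h
    raise ValueError("no consistent hypothesis: realizability violated")

# Hypothetical finite class: thresholds t on the integers 0..10.
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]
f = H[3]                                  # the target is in H (realizability)
S_f = [(x, f(x)) for x in [0, 2, 4, 7]]   # labelled samples S_f
A_Sf = consistent_learner(H, S_f)         # the hypothesis A(S_f) picked by the algorithm
```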
VC dimension
We say that a subset $Y$ of $X$ is shattered by $H$ if by considering the restrictions of the functions in $H$ to $Y$ we obtain all functions \(Y \to \left\{0,1 \right\}\). Formally, let $H_{\mid Y}$ denote \(\left\{ h_{\mid Y} : Y \to \left\{0,1\right\} \mid h \in H \right\}\). The subset $Y$ is shattered by $H$ if \(\text{Card} (H_{\mid Y}) = 2^{\text{Card}(Y)}\).
Intuitively, $Y$ being shattered by $H$ means that to learn how the function $f$ behaves on $Y$ one needs the values of $f$ on each element of $Y$. Hence if $H$ shatters large sets, it is difficult to learn a function from $H$.
Definition: The VC dimension of $H$ is the size of the largest set $Y$ which is shattered by $H$.
We unravel the definition: the VC dimension of $H$ is the size of the largest set $Y$ on which all possible functions \(Y \to \left\{0,1\right\}\) are realised as restrictions of hypotheses in $H$.
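For a finite input set and a finite hypothesis class, shattering and the VC dimension can be checked by brute force. The sketch below is only an illustration of the definitions, with hypothetical helper names, and does not scale beyond tiny examples.

```python
from itertools import combinations

def is_shattered(H, Y):
    """Y is shattered by H if restricting H to Y yields all 2^|Y| labellings."""
    return len({tuple(h(x) for x in Y) for h in H}) == 2 ** len(Y)

def vc_dimension(H, X):
    """Size of the largest subset of X shattered by H (brute force)."""
    d = 0
    for k in range(1, len(X) + 1):
        if any(is_shattered(H, Y) for Y in combinations(X, k)):
            d = k
    return d

# Thresholds on 0..9 shatter every single point but no pair, so the VC dimension is 1.
X = list(range(10))
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]
print(vc_dimension(H, X))  # -> 1
```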
The beauty here is that this is a purely combinatorial definition, which abstracts away probabilities and in particular the probability distribution on the inputs. We will see that it nevertheless says something about the PAC-learnability of $H$.
Rademacher complexity
The Rademacher complexity quantifies how well a set of functions fits random noise. Let $P$ be a set of functions \(p : X \to \mathbb{R}\).
A random noise is $\sigma = (\sigma_i)_{i \in [1,m]}$ where each \(\sigma_i \in \left\{-1,+1\right\}\) is drawn uniformly and independently at random. To measure how well a function $p$ correlates with random noise we use the quantity
\[\frac{1}{m} \sum_{i = 1}^m \sigma_i p(x_i).\]
The function $p$ defined by $p(x_i) = \sigma_i$ exactly fits the random noise $\sigma$. More generally, the function $p \in P$ which best fits $\sigma$ maximises the above quantity. The empirical Rademacher complexity of $P$ against the samples $S$ is
\[E_{\sigma} \left[ \sup_{p \in P} \frac{1}{m} \sum_{i = 1}^m \sigma_i p(x_i) \right].\]
Definition: The Rademacher complexity $R_P(m)$ of $P$ is \(E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{p \in P} \frac{1}{m} \sum_{i = 1}^m \sigma_i p(x_i) \right] \right]\).
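When $P$ is finite, the empirical Rademacher complexity can be estimated by Monte Carlo sampling of the sign vectors $\sigma$. The following sketch is one hypothetical way to do this; the function name and the number of trials are arbitrary choices.

```python
import random

def empirical_rademacher(P, S, trials=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_{p in P} (1/m) sum_i sigma_i p(x_i) ]."""
    rng = random.Random(seed)
    m = len(S)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * p(x) for s, x in zip(sigma, S)) / m for p in P)
    return total / trials
```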
We now specialise the definition of the Rademacher complexity to a class of hypotheses $H$ and a target function $f$: it is the Rademacher complexity of the class of loss functions \(\left\{ L_f(h,\cdot) : X \to [0,1] \mid h \in H \right\}\). So the Rademacher complexity $R_{H,f}(m)$ of $H$ and $f$ is
\(E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i) \right] \right]\).
To define the Rademacher complexity of $H$ we take the supremum over all target functions $f \in H$, leading to the following definition.
Definition: The Rademacher complexity $R_H(m)$ of $H$ is \(\sup_{f \in H} E_{S \sim D^m} \left[ E_{\sigma} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i = 1}^m \sigma_i L_f(h,x_i) \right] \right]\).
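Putting the two expectations and the supremum over $f$ together, here is a hedged Monte Carlo sketch of $R_H(m)$ for a finite class; for illustration only it takes $D$ to be the uniform distribution over a finite $X$, an assumption made just for this example.

```python
import random

def rademacher_H(H, X, m, n_S=50, n_sigma=200, seed=0):
    """Monte Carlo estimate of R_H(m), with D taken (for illustration only) uniform on X."""
    rng = random.Random(seed)
    best = 0.0
    for f in H:                                    # sup over target functions f in H
        est = 0.0
        for _ in range(n_S):
            S = [rng.choice(X) for _ in range(m)]  # S ~ D^m
            for _ in range(n_sigma):
                sigma = [rng.choice((-1, 1)) for _ in range(m)]
                est += max(
                    sum(s * (1 if h(x) != f(x) else 0) for s, x in zip(sigma, S)) / m
                    for h in H
                )
        best = max(best, est / (n_S * n_sigma))
    return best
```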
Growth function
The notion of growth function will be a bridge between VC dimension and Rademacher complexity.
Definition: The growth function $\tau_H$ of $H$ associates with $m$ the number \(\max_{Y \subseteq X, |Y| = m} |H_{\mid Y}|\).
Remark that by definition, if $H$ has VC dimension $d$, then for $m \le d$, we have $\tau_H(m) = 2^m$.
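For finite $X$ and $H$ the growth function can also be computed by brute force, which makes the remark above easy to check on the hypothetical threshold class used earlier (VC dimension 1).

```python
from itertools import combinations

def growth_function(H, X, m):
    """tau_H(m): the largest number of distinct restrictions of H to a set of size m."""
    return max(len({tuple(h(x) for x in Y) for h in H}) for Y in combinations(X, m))

X = list(range(10))
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]
print(growth_function(H, X, 1))  # -> 2 = 2^1, as predicted for m <= VC dimension
print(growth_function(H, X, 2))  # -> 3 < 2^2
```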