Games automata play

VC dimension, Rademacher complexity, and growth function

March 15, 2018 | by Nathanaël Fijalkow

We define the VC dimension, the Rademacher complexity, and the growth function of a class of functions.

Let $X$ be the set of inputs. The general learning question we consider is the following: we want to learn a target function through a number of samples $(x_i,f(x_i))_{i \in [1,m]}$. The function $f$ is unknown, but we assume some background knowledge in the form of a set of hypotheses $H$, which are functions .

We make a strong assumption called realizability: This assumption can be lifted to obtain more general results, which we will not do here.

We would like to quantify how hard it is to learn a function from $H$. More precisely, we are interested in the PAC-learnability of $H$; see below for a definition. There are two approaches:

  • a combinatorial approach, called the VC dimension,
  • a probabilistic approach, called the Rademacher complexity.

PAC learning

We let $D$ denote distributions over the inputs, and write $x \sim D$ to mean that $x$ is sampled according to $D$. If the sample set has size $m$ we assume that the inputs are identically independently sampled according to $D$, written $S \sim D^m$, where $S = (x_i)_{i \in [1,m]}$. The distribution $D$ is unknown, in particular we do not want the learning algorithm to depend on it.

The first notion we need is the loss function: how good is a hypothesis $h$ if the target function is $f$?

  • The zero-one loss is .
  • The quality of $h$ is measured by $L_{D,f}(h) = E_{x \sim D}[L_f(h,x)]$.
  • The empirical quality of $h$ against the samples $S$ is .

Definition: We say that $H$ is PAC-learnable if
for all $\varepsilon > 0$ (precision),
for all $\delta > 0$ (confidence),
there exists an algorithm $A$,
there exists $m$ a number of samples,
for all objective function $f \in H$,
for all distributions $D$ over $X$:
where we let $h$ denote the hypothesis picked by the algorithm $A$ when receiving the samples $S = (x_i,f(x_i))_{i \in [1,n]}$.

VC dimension

We say that a subset $Y$ of $X$ is shattered by $H$ if by considering the restrictions of the functions in $H$ to $Y$ we obtain all functions . Formally, let $H_{\mid Y}$ denote . The subset $Y$ is shattered by $H$ if .

Intuitively, $Y$ being shattered by $H$ means that to learn how the function $f$ behaves on $Y$ one needs the values of $f$ on each element of $Y$. Hence the if $H$ shatters large sets, it is difficult to learn a function from $H$.

Definition: The VC dimension of $H$ is the size of the largest set $Y$ which is shattered by $H$.

We unravel the definition: the VC dimension of $H$ is the size of the largest set $Y$ for which all possible functions are realised.

The beauty here is that this is a purely combinatorial definition, which abstracts away probabilities and in particular the probabilistic distribution on the inputs. We will see that it anyway says something about the PAC-learnability of $H$.

Rademacher complexity

The Rademacher complexity quantifies how a set of functions fits random noise. To define it for $H$, we consider the loss functions induced by $H$ and $f$.

Hence the question is how much do the loss functions induced by $H$ and $f$ fit random noise. A random noise is $\sigma = (\sigma_i)_{i \in [1,m]}$ where each is drawn uniformly and independently at random. To measure how well $h$ correlates with random noise we use the quantity


Hence the function $h$ which best fits random noise maximises the above quantity.

The empirical Rademacher complexity of $H$ and $f$ against the samples $S$ is

Definition: The Rademacher complexity $R_H(m)$ of $H$ is .

Growth function

The notion of growth function will be a bridge between VC dimension and Rademacher complexity.

Definition: The grow function $\tau_H$ of $H$ associates with $m$ the number .

Remark that by definition, if $H$ has VC dimension $d$, then for $m \le d$, we have $\tau_H(m) = 2^m$.

The following theorem relates the Rademacher complexity to the growth function:

Massart’s Theorem: .

The proof of analytical and relies on Hoeffding’s inequality. See p39/40 of the Foundations of Machine Learning book for a proof.

In the next post we show how these notions relate to PAC-learnability of $H$.