Recap

由於 lecture 6 被跳過了，所以這裡先講講 lecture 6 的結論。

當 $m_{\mathcal{H}}(N)$ 有 break point $k$ 時 ( $N \ge 2, k \ge 3$ )，我們會得到 upper bound:

${\color{blue} m_{\mathcal{H}}(N)} \le B(N,k) = \sum^{k-1}_{i=0}\dbinom{N}{i} \le {\color{blue} N^{k-1}}$

也就是說我們有夠小的 $\mathcal{H}$
透過一串騷操作，我們可以將原本「得到 bad data 的機會」的 upper bound 換成 (which is called VC bound):

$\begin{aligned}&\mathbb{P}_{\mathcal{D}}[|E_{\text{in}}(g) - E_{\text{out}}(g)| \gt \epsilon]\\\le\quad&\mathbb{P}_{\mathcal{D}}[\exists\ h \in \mathcal{H}\ \text{s.t.}\ |E_{\text{in}}(h) - E_{\text{out}}(h)| \gt \epsilon]\\\le\quad&4{\color{blue}m_{\mathcal{H}}(2N)}\exp\left(-{1\over8}\epsilon^2N\right)\\\le\quad&4{\color{blue}(2N)^{k+1}}\exp\left(-{1\over8}\epsilon^2N\right)\end{aligned}$

綜合以上兩點，當 == $m_{\mathcal{H}}(N)$ 有 break point $k$ 且 $N$ 足夠大的時候，我們就能說 $E_{\text{out}} \approx E_{\text{in}}$ is possible。==

然後當 $\mathcal{A}$ 可以從 $\mathcal{H}$ 裡挑出一個好的 $g$ (有小的 $E_{\text{in}}$ )，我們就能說 $E_{\text{out}}$ 應該也是小的，因此 $g$ 可能相當接近理想中的 $f$ 。

VC Dimension

VC Dimension $d_{\text{vc}}(\mathcal{H})$ 的定義為: largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$ (也就是 maximum non-break point， $\mathcal{H}$ 可以 shatter 的最大 input 數)

VC Dimension and Learning

從定義我們可以得到: if $N \ge 2, d_{\text{vc}} \ge 2$ , $m_{\mathcal{H}}(N) \le 2^{d_{\text{vc}}}$

因此當我們能找到 finite 的 $d_{\text{vc}}$ 時，就代表任何 $g$ 都會符合 $E_{\text{in}}(g) \approx E_{\text{out}}(g)$

此時我們需要考慮的就只剩下 $\mathcal{H}$ 了:

|500 source: https://www.csie.ntu.edu.tw/~htlin/course/ml24fall/doc/07u_handout.pdf p.6

VC Dimension of Perceptrons

到目前為止，我們已經證明了 2D PLA 是可行的:

|500 source: https://www.csie.ntu.edu.tw/~htlin/course/ml24fall/doc/07u_handout.pdf p.8

接下來，我們要來找出當 $\bf x$ 的有超過兩個 features 時， $d_{\text{vc}}$ 應該長怎樣。

根據經驗，我們可以假設 $d$ -D perceptrons 的 $d_{\text{vc}} = d+1$ ，我們分成兩部分來證明這個式子。

證明 $d_{\text{vc}} \ge d+1$

也就是說，我們可以 shatter 某些 $d+1$ inputs 組合。

我們先假設一個特殊的有 $d+1$ 個點的 input $X$ :

$X = \begin{bmatrix}&&&-{\bf x}^T_1-&&&\\&&&-{\bf x}^T_2-&&&\\&&&-{\bf x}^T_3-&&&\\&&&\vdots&&&\\&&&-{\bf x}^T_{d+1}-&&&\end{bmatrix}=\begin{bmatrix}&1 &0 &0 &\cdots &0\\&1 &1 &0 &\cdots &0\\&1 &0 &1 & &0\\&\vdots\ &\vdots\ & &\ddots &0\\&1 &0 &\cdots &0 &1\end{bmatrix}$

如果很難想像的話，在 2D ( $d = 2$ ) 的時候，這個 $X$ 在空間中會長這樣:

|400 source: my ipad :)

由於這個 $X$ 是可逆的，所以不管 $y$ 長怎樣，我們都能夠讓 ${\bf w} = X^{-1}y$ ，這樣一來 $\text{sign}(X{\bf w}) = y$ ，也就是說這個 $X$ 是可以被 shatter 的，因此 $d_{\text{vc}} \ge d+1$ 。

證明 $d_{\text{vc}} \le d+1$

考慮有 $d+2$ 個點的 general $X$ :

$X = \begin{bmatrix}&&&-{\bf x}^T_1-&&&\\&&&-{\bf x}^T_2-&&&\\&&&-{\bf x}^T_3-&&&\\&&&\vdots&&&\\&&&-{\bf x}^T_{d+1}-&&&\\&&&-{\bf x}^T_{d+2}-&&&\end{bmatrix}$

由於 rows ( $d+2$ ) 比 columns ( $d+1$ ) 還多，所以我們要考慮 linear dependence 的問題:

${\bf x}_{d+2} = a_1{\bf x}_{1} + a_2{\bf x}_{2} + \cdots + a_{d+1}{\bf x}_{d+1}$

其中 $a_i$ 代表該 ${\bf x}_i$ 的 sign。

現在假設只有 $a_1$ 是 $+1$ ，其他 $a_2$ 到 $a_{d+2}$ 都是 $-1$ 的情況，我們會發現如果有個 $\bf w$ 可以準確分類 ${\bf x}_{1}$ 到 ${\bf x}_{d+1}$ 的話，則該 $\bf w$ 就不可能讓 ${\bf x}_{d+2}$ 被分類正確:

${\bf w}^T{\bf x}_{d+2} = a_1{\bf w}^T{\bf x}_{1} + a_2{\bf w}^T{\bf x}_{2} + \cdots + a_{d+1}{\bf w}^T{\bf x}_{d+1} \gt 0$

因此，在有 $d+2$ 個點的情況下，不管 $X$ 長怎樣，一定都會有一個 $y$ 找不到對應的 $\bf w$ ，因此不能被 shatter，也就是說 $d_{\text{vc}} \le d+1$ 。

解讀 VC Dimension

Degrees of Freedom

degrees of freedom (DOF) 代表在 hypothesis function $h$ 內的 independent parameters 的數量，通常這會和 $h$ 的 parameters ${\bf w}$ 的維度相近，但不一定相等。

舉例來說，假設模型有三個參數 $\beta_1, \beta_2, \beta_3$ ，但我們加了一個約束條件:

$\beta_1 + \beta_2 + \beta_3 = 1$

因此 $\beta_3$ 必須根據 $\beta_1, \beta_2$ 的選擇來計算，也就是說我們不能「自由」的調整它，因此雖然 $\bf w$ 的維度是 3，但是 DOF 卻是 2。

DOF 越高，代表模型越靈活，越能夠擬合資料，但有可能 over-fitting。

Powerfulness of H

有了 DOF 的概念，(在 binary classification 上) $d_{\text{vc}}(\mathcal{H})$ 的意義相當於是一個 $\mathcal{H}$ 的effective DOF，於是，我們可以用 $d_{\text{vc}}(\mathcal{H})$ 來衡量一個 $\mathcal{H}$ (在 binary classification 上) 的能力有多強。

所謂的 effective DOF 是指最後真正對模型產生影響的 independent parameters 的數量，通常會和 DOF 相近，但不一定相等。

當然，在 PLA 的例子裡， $\bf w$ 的維度 = DOF = Effective DOF = $d_{\text{vc}}$ = $d+1$

Trade-off on VC Dimension

Lecture 5 的一開始我們有提到兩個問題:

如果確保 $E_{\text{in}}$ 和 $E_{\text{out}}$ 足夠接近？
如何讓 $E_{\text{in}}$ 足夠小？

這兩個問題除了和 $M$ (size of $\mathcal{H}$ )，也和 $d_{\text{vc}}$ 有關:

small $d_{\text{vc}}$ $d_{vc}$ :
1. Yes: 誤差 $\le 4(2N)^{d_{\text{vc}}}\exp(\cdots)$
2. No: 能力不夠強
large $d_{\text{vc}}$ $d_{vc}$ :
- No
- Yes: 能力很強

因此選擇合適的 $d_{\text{vc}}$ (or $\mathcal{H}$ ) 相當重要。

Penalty for Model Complexity

回到這個式子:

$\mathbb{P}_{\mathcal{D}}[|E_{\text{in}}(g) - E_{\text{out}}(g)| \gt \epsilon]\le4(2N)^{d_{\text{vc}}}\exp\left(-{1\over8}\epsilon^2N\right)$

我們現在將右項替換成 $\color{red}\delta$ (confidence level)，然後做一些轉換可以得到 $E_{\text{in}}$ 和 $E_{\text{out}}$ 的誤差實際上會等於:

$\epsilon = \sqrt{{8 \over N}\ln\left({4(2N)^{d_{\text{vc}}}\over{\color{red}\delta}}\right)}$

這誤差又叫做 generalization error，而我們又再給他一個更簡單的代號:

$\sqrt{{8 \over N}\ln\left({4(2N)^{\color{blue}{d_{\text{vc}}}}\over{\color{red}\delta}}\right)} = \Omega(N, {\color{blue}\mathcal{H}}, {\color{red}\delta})$

而這個 $\Omega$ 被稱為 penalty for model complexity。

The VC Message

|500 source: https://www.csie.ntu.edu.tw/~htlin/course/ml24fall/doc/07u_handout.pdf p.22

Sample Complexity

現在假設條件 $\epsilon = 0.1, \delta = 0.1, d_{\text{vc}} = 3$ ，則我們需要多少 sample 才能讓 $4(2N)^{d_{\text{vc}}}\exp\left(-{1\over8}\epsilon^2N\right) \le \delta$ (誤差的 upper bound 在信任水準內)。

理論上而言，我們需要 $N = 29300$ 才能讓 upper bound 為 $9.99 \times 10^{-2}$ ，也就是需要 $N \approx 10000d_{\text{vc}}$ 。

不過，實務上大概只要抓 $N \approx 10d_{\text{vc}}$ 就夠了，為什麼？

|500 source: https://www.youtube.com/live/3zbhL1Q7nu0 p.24