Linear Regression
MSE
\[J=(y-X^\top \omega)^\top(y-X^\top \omega)\]
\[\nabla_\omega J=-2X(y-X^\top \omega)\]
Setting it to \(0\):
\[Xy=XX^\top \omega\]
\[\omega = (XX^\top)^{-1} Xy\]
\[MSE=\frac 1n\sum_{i=1}^n(y_i-f(x_i,\omega))^2\]
Coefficients with very large absolute values are a sign of overfitting.
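A minimal NumPy sketch of the closed form above (toy data and names are my own; X follows the \(d\times n\), column-per-sample convention used in the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))                   # d x n: one column per sample
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.1 * rng.normal(size=n)   # y = X^T w + noise

# w = (X X^T)^{-1} X y; solve() avoids forming the explicit inverse
w = np.linalg.solve(X @ X.T, X @ y)
mse = np.mean((y - X.T @ w) ** 2)
print(w, mse)   # w is close to w_true, mse close to 0.01
```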
Statistical Model
\(y=f(x,\omega) + \epsilon\)
\(\epsilon\) is noise, typically \(\epsilon \sim \mathcal N(0, \sigma^2)\)
Use \(\epsilon\) to model the distribution of \(y\):
\(p(y|x, \omega, \sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y-f(x,\omega))^2}{2\sigma^2}}\)
MLE
\[L(\mathcal D, \omega, \sigma) = \prod_{i=1}^n p(y_i|x_i,\omega, \sigma)\]
The product is awkward to work with, so take the logarithm to obtain the log-likelihood:
\[l(\mathcal D, \omega, \sigma) = \sum_{i=1}^n \log p(y_i|x_i,\omega, \sigma)\]
For a Gaussian \(p\), taking the \(\log\) gives (up to a constant in \(\omega\)) the negative of the MSE form:
\[l(\mathcal D, \omega, \sigma)=-\frac {1}{2\sigma^2}\sum_{i=1}^n(y_i-f(x_i,\omega))^2+C(\sigma)\]
so maximizing the likelihood over \(\omega\) is the same as minimizing the MSE.
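As a numerical sanity check (my own, not part of the notes): maximizing the Gaussian log-likelihood over \(\omega\) recovers the closed-form OLS solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, n, sigma = 3, 100, 0.1
X = rng.normal(size=(d, n))
y = X.T @ np.array([1.0, -2.0, 0.5]) + sigma * rng.normal(size=n)

def neg_log_likelihood(w):
    r = y - X.T @ w
    # -l(D, w, sigma) = n*log(sigma*sqrt(2*pi)) + ||r||^2 / (2*sigma^2)
    return n * np.log(sigma * np.sqrt(2 * np.pi)) + r @ r / (2 * sigma**2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
w_ols = np.linalg.solve(X @ X.T, X @ y)
print(np.allclose(w_mle, w_ols, atol=1e-3))   # True: same minimizer
```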
Regularization
Ridge Regression
\[J=(y-X^\top \omega)^\top(y-X^\top \omega)+\lambda\omega^\top \omega\]
\[\nabla_\omega J=-2X(y-X^\top \omega)+2\lambda\omega=0\]
\[Xy=XX^\top \omega+\lambda \omega\]
\[\omega=(XX^\top+\lambda I)^{-1}Xy\]
\(XX^\top+\lambda I\) is always invertible for \(\lambda>0\): \(XX^\top\) is positive semidefinite, so every eigenvalue of \(XX^\top+\lambda I\) is at least \(\lambda>0\).
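A NumPy sketch of the ridge closed form (\(\lambda\) and the toy data are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 5, 50, 0.5
X = rng.normal(size=(d, n))
y = rng.normal(size=n)

# w = (X X^T + lam * I)^{-1} X y; always solvable for lam > 0
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
print(w_ridge)
```

Larger \(\lambda\) shrinks \(\omega\) toward zero, trading variance for bias.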
Lasso
\[\omega^{*}=\arg\min_\omega\frac 1n\sum_{i=1}^n(y_i-f(x_i,\omega))^2+\lambda ||\omega||_p\]
\(p=1\) gives LASSO (Least Absolute Shrinkage and Selection Operator).
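The \(\ell_1\) term is not differentiable at \(0\), so there is no closed form. One standard solver is proximal gradient descent (ISTA); a sketch under the \((1/n)\)-scaled objective above (step size, iteration count, and data are my own choices):

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, lr=1e-2, iters=5000):
    # minimizes (1/n) * ||y - X^T w||^2 + lam * ||w||_1, X being d x n
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = -(2.0 / n) * X @ (y - X.T @ w)   # gradient of the smooth part
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 200))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                   # sparse ground truth
y = X.T @ w_true + 0.05 * rng.normal(size=200)
print(lasso_ista(X, y, lam=0.1))                # trailing coefficients are exactly 0
```

The soft-thresholding step is what drives small coefficients to exactly zero, which is the "selection" in LASSO.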
Expected Prediction Error and Bias-Variance Decomposition
\[\text{EPE}=\iint(y-f(x))^2\,p(x,y)\,\text dx\,\text dy\]
Note that \((y-f(x))^2=((y-\mathbb E[y\mid x])+(\mathbb E[y\mid x]-f(x)))^2\), and the cross term vanishes because \(\mathbb E[y-\mathbb E[y\mid x]\mid x]=0\):
\[\iint(y-\mathbb E[y\mid x])(\mathbb E[y\mid x]-f(x))\,p(x,y)\,\text dx\,\text dy=0\]
Then replace \(f(x)\) with the dataset-dependent \(f(x;\mathcal D)\) and average over \(\mathcal D\):
\[\text{EPE}=(\text{Bias})^2+\text{Variance}+\text{Noise}\]
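A simulation sketch of the decomposition (the model, estimator, and constants are my own choices): resample datasets \(\mathcal D\), fit \(f(\cdot;\mathcal D)\) each time, and estimate the three terms at a fixed test point.

```python
import numpy as np

# Ground truth y = sin(x) + eps; the estimator is an (underfitting) line.
rng = np.random.default_rng(3)
x0, sigma, n, trials = 1.0, 0.3, 20, 2000

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(-np.pi, np.pi, size=n)      # draw a fresh dataset D
    y = np.sin(x) + sigma * rng.normal(size=n)
    coef = np.polyfit(x, y, deg=1)              # fit f(.; D)
    preds[t] = np.polyval(coef, x0)             # its prediction at x0

bias2 = (preds.mean() - np.sin(x0)) ** 2        # (Bias)^2
variance = preds.var()                          # Variance
noise = sigma**2                                # Noise
print(bias2, variance, noise)                   # EPE at x0 is their sum
```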
Classification
Sigmoid
\(\sigma(z)=\frac{1}{1+e^{-z}}\)
Logistic Regression
\[p(y_i=\pm1\mid x_i,a)=\sigma(y_ia^\top x_i)=\frac{1}{1+e^{-y_ia^\top x_i}}\]
MLE
\[\text{MLE}=\sum_{i}\log p(y_i=\pm1\mid x_i,a)=\sum_i\log \sigma(y_ia^\top x_i)=-\sum_i\log(1+e^{-y_ia^\top x_i})\]
\(E(a)=\sum_i\log(1+e^{-y_ia^\top x_i})\) is a convex function of \(a\)
Gradient Descent
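Since \(E(a)\) is convex, plain gradient descent reaches the global minimum. A minimal sketch (learning rate, iteration count, and toy data are my own choices; \(X\) is \(d\times n\) as before, \(y\in\{-1,+1\}^n\)):

```python
import numpy as np

def logistic_gd(X, y, lr=0.01, iters=5000):
    # gradient descent on E(a) = sum_i log(1 + exp(-y_i a^T x_i))
    d, n = X.shape
    a = np.zeros(d)
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(y * (X.T @ a)))   # sigma(-y_i a^T x_i)
        grad = -X @ (y * s)                       # dE/da
        a -= lr * grad
    return a

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 100))
y = np.sign(X[0] - X[1] + 0.1 * rng.normal(size=100))   # noisy linear labels
a = logistic_gd(X, y)
print(a, np.mean(np.sign(X.T @ a) == y))   # weights and training accuracy
```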
Inductive bias
SVM
\[x=x_p+r\frac{\omega}{||\omega||}=x_p+\gamma y\frac{\omega}{||\omega||}\]
\(\gamma\) is nonnegative and \(y=\pm 1\); \(x_p\) is the projection of \(x\) onto the hyperplane \(f(x)=\omega^\top x+b=0\)
\[f(x_p)=0\Rightarrow\omega^\top\left(x-\gamma y\frac{\omega}{||\omega||}\right)+b=0\]
\[\gamma=\frac{\omega^\top x+b}{y||\omega||}=y\frac{\omega^\top x+b}{||\omega||}\]
Maximize the minimum distance, so that the chance of misclassification is minimized.
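Written out, "maximize the minimum distance" is the standard hard-margin primal; rescaling \((\omega,b)\) so that \(\min_i y_i(\omega^\top x_i+b)=1\) turns it into a quadratic program:
\[\max_{\omega,b}\min_i\frac{y_i(\omega^\top x_i+b)}{||\omega||}\iff\min_{\omega,b}\frac 12||\omega||^2\quad\text{s.t. } y_i(\omega^\top x_i+b)\ge 1,\ i=1,\dots,n\]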