References

Classical Machine Learning

Definition by Tom Mitchell (1998):

Machine Learning is the study of algorithms that

  • improve their performance P
  • at some tasks T
  • with experience E

A well-defined learning task is given by <P,T,E>

Programming: 1 + 2 = ?

Machine Learning: 1 + x = 3, x = ?

Application areas

Recommender system

Computer vision

E-commerce

Space exploration

Robotics

Information extraction

Social networks

Debugging software

Why we need machine learning

  • Recognizing handwritten digits
    • The programming approach
    • The machine learning approach, which performs better

Machine learning workflow

  • The Dataset is split into a Training Set and a Testing Set

  • Deployment of the Model

The workflow involves two main parts: the Model and the Dataset
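A minimal sketch, assuming numpy and a random 80/20 split (the function name and ratio are illustrative), of how a Dataset might be divided into a Training Set and a Testing Set:

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Randomly split a dataset (numpy arrays) into training and testing sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # shuffle sample indices
    n_test = int(len(X) * test_ratio)    # size of the testing set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# usage: X_train, y_train, X_test, y_test = train_test_split(X, y)
```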

Model

  • Model Training

    • Model Selection: choose a model suited to the current task
      • Selection based on human experience
        • Linear model for continuous value regression
        • Tree-based model for interpretable prediction
      • Selection based on learning strategies
        • Bayesian optimization
        • Bayesian model selection
    • Parameter Learning
      • Gradient descent (see the sketch after the model list):
        • full
        • batch
        • stochastic
  • Model Evaluation

  • Models

    • Linear regression

    • Decision Tree

    • Logistic Regression

    • Naive Bayes

    • K-Nearest Neighbor

    • Artificial Neural Network / Deep Learning

    • Support Vector Machine

    • Ensemble Methods:

      Random Forest, Bagging, AdaBoost
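A minimal sketch, assuming numpy and the simple linear model y = w·x used in the linear regression section below, of how full, batch, and stochastic gradient descent differ only in how many samples are used per parameter update (learning rate, epochs, and batch size are illustrative):

```python
import numpy as np

def gradient_descent(x, y, mode="full", lr=0.01, epochs=100, batch_size=16, seed=0):
    """Fit y ≈ w*x by minimizing squared error with different update granularities."""
    rng = np.random.default_rng(seed)
    w, n = 0.0, len(x)
    for _ in range(epochs):
        if mode == "full":            # one update per epoch, using all samples
            batches = [np.arange(n)]
        elif mode == "batch":         # several updates per epoch, one per mini-batch
            idx = rng.permutation(n)
            batches = np.array_split(idx, max(1, n // batch_size))
        else:                         # "stochastic": one update per single sample
            batches = [[i] for i in rng.permutation(n)]
        for b in batches:
            grad = -2.0 * np.mean(x[b] * (y[b] - w * x[b]))  # d/dw of mean squared error
            w -= lr * grad
    return w
```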

Supervised Learning

  • Given a training set (x1,y1),(x2,y2),…,(xn,yn)
  • Learn a function F(x) from the training set that predicts well on the testing set (xn+1, yn+1),(xn+2,yn+2),…,(xn+m,yn+m).

Linear regression

In linear regression we assume that y and x are related by the following equation:

\begin{align} y_i = F(x_i) = w x_i \end{align}

where w is a parameter.

Determine the model parameters

  • Our goal is to estimate w from the training data of (x1,y1),(x2,y2),…,(xn,yn)

  • Optimization goal: minimize squared error (least squares):

    \begin{align}min_{w} \sum^{n}_{i = 1} (y_i - wx_i)^2\end{align}

    • minimizes squared distance between the true value and predicted value.
    • has a nice probabilistic interpretation
  • Proof: set the partial derivative with respect to w to zero and solve for w, as in the derivation below.
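Setting the partial derivative of the squared-error objective to zero gives the closed-form least-squares solution:

\begin{align} \frac{\partial}{\partial w} \sum^{n}_{i = 1} (y_i - wx_i)^2 = -2 \sum^{n}_{i = 1} x_i (y_i - wx_i) = 0 \quad \Rightarrow \quad w = \frac{\sum^{n}_{i = 1} x_i y_i}{\sum^{n}_{i = 1} x_i^2} \end{align}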

Evaluation of the learned model

Learning is based on the training set (x1,y1),(x2,y2),…,(xn,yn).

Evaluation is based on the testing set (xn+1, yn+1),(xn+2,yn+2),…,(xn+m,yn+m).

  • Root Mean Squared Error (RMSE)

    \begin{align} RMSE = \sqrt{\frac{1}{m} \sum^{n+m}_{i = n+1}(y_i - wx_i)^2} \end{align}

  • Mean Absolute Error (MAE)

    \begin{align} MAE = \frac{1}{m} \sum^{n+m}_{i = n+1}|y_i - wx_i| \end{align}
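A minimal sketch, assuming numpy arrays for the training and testing sets (the data values are illustrative), of fitting w by least squares and evaluating the learned model with RMSE and MAE:

```python
import numpy as np

def fit_w(x_train, y_train):
    """Closed-form least-squares solution w = sum(x*y) / sum(x^2)."""
    return np.sum(x_train * y_train) / np.sum(x_train ** 2)

def evaluate(w, x_test, y_test):
    """Compute RMSE and MAE of the learned model y = w*x on the testing set."""
    pred = w * x_test
    rmse = np.sqrt(np.mean((y_test - pred) ** 2))
    mae = np.mean(np.abs(y_test - pred))
    return rmse, mae

# usage with illustrative data
x_train, y_train = np.array([1.0, 2.0, 3.0]), np.array([2.1, 3.9, 6.2])
x_test, y_test = np.array([4.0, 5.0]), np.array([8.1, 9.8])
w = fit_w(x_train, y_train)
print(evaluate(w, x_test, y_test))
```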

Decision Tree

  • Dataset D= {(x1,y1),(x2,y2),(x3,y3),…(xn,yn)}

For example, suppose we have the following sequence of (xi, yi) values:

[figure: table of the example data (xi, yi)]

S denotes a split point (starting from x1) that divides the data into a left part and a right part; c1 is the mean of the y values on the left and c2 is the mean of the y values on the right. The quality of a split is measured by the squared error

\begin{align} m(s) = \sum_{x_i \le s}(y_i - c_1)^2 + \sum_{x_i > s}(y_i - c_2)^2 \end{align}

The resulting c1 and c2 are the c1 and c2 in the formula above; computing them for every possible S in turn gives the following table:

[figure: table of c1 and c2 for each candidate split S]

Substituting all the c1 and c2 values into the formula yields the following table:

[figure: table of m(s) for each candidate split S]

We therefore take S = 5, the split with the smallest m(s), as the split point.

Finally we obtain a decision tree: inputs greater than 5 go to the right branch and inputs less than 5 go to the left branch.
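A minimal sketch, assuming numpy arrays x and y of the same length, of searching for the split point S that minimizes m(s):

```python
import numpy as np

def best_split(x, y):
    """Return the split point s minimizing squared error m(s) for a regression stump."""
    best_s, best_m = None, np.inf
    for s in np.unique(x)[:-1]:            # candidates that keep both sides non-empty
        left, right = y[x <= s], y[x > s]
        c1, c2 = left.mean(), right.mean() # c1, c2: means of the left / right part
        m = np.sum((left - c1) ** 2) + np.sum((right - c2) ** 2)
        if m < best_m:
            best_s, best_m = s, m
    return best_s, best_m

# usage: s, m = best_split(x, y)
```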

Nearest-Neighbor Classifiers

  • Requires three things

    • The set of stored records
    • Distance Metric to compute distance between records
    • The value of K, the number of nearest neighbors to retrieve
  • To classify an unknown record:

    • Compute distance to other training records
    • Identify k nearest neighbors
    • Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
  • Compute distance between two points:

    • Euclidean distance

      \begin{align} d(p,q) = \sqrt{\sum_i (p_i - q_i)^2} \end{align}

    • Manhattan distance

      \begin{align} d(p,q) = \sum_i |p_i - q_i| \end{align}

    • q-norm distance

      \begin{align} d(p,q) = \left(\sum_i |p_i - q_i|^q\right)^{\frac{1}{q}} \end{align}

  • Choosing the value of k:

    • if k is too small, the classifier is sensitive to noise points
    • if k is too large, the neighborhood may include points from other classes
  • Determine the class from nearest neighbor list

    • take the majority vote of class labels among the k-nearest neighbors

      \begin{align} \gamma' = argmax_v \sum_{(x_i,y_i)\in D_z} I(v = y_i) \end{align}

      where $D_z$ is the set of the k closest training examples to z.

    • Weigh the vote according to distance

      \begin{align} \gamma' = argmax_v \sum_{(x_i,y_i)\in D_z} w_i \times I(v = y_i) \end{align}

      • weight factor: $w_i = 1/d^2$
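A minimal sketch, assuming numpy arrays and Euclidean distance, of classifying one unknown record by majority vote among its k nearest neighbors:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training records."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                             # indices of k closest records
    votes = Counter(y_train[i] for i in nearest)                # count class labels
    return votes.most_common(1)[0][0]

# usage with a tiny illustrative dataset
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["a", "a", "b", "b"])
print(knn_predict(X_train, y_train, np.array([4.9, 5.1]), k=3))  # -> "b"
```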

Logistic regression

  • The optimal decisions are based on the posterior class probabilities $P(y|x)$. For binary classification problems, we can write these decisions as

    \begin{align}y = 1 \space if \space log\frac{P(y = 1|X)}{P(y=0|X)} > 0 \end{align}

    and y = 0 otherwise.

  • We generally don't know $P(y|x)$ but we can parameterize the possible decisions according to

    \begin{align} log\frac{P(y = 1|X)}{P(y=0|X)} =f(x;w) = w_0 + x^T w_1 \end{align}

  • Our log-odds model

    \begin{align} log\frac{P(y = 1|X)}{P(y=0|X)} = w_0 + x^T w_1 \end{align}

    gives rise to a specific form for the conditional probability over the labels( the logistic model):

    \begin{align} P(y = 1|x,w) =g(w_0 + x^T w_1) \end{align}

    where

    \begin{align} g(z) = (1 + exp(-z))^{-1} \end{align} is a logistic "squashing function" (activation function) that turns linear predictions into probabilities.

    • As with the linear regression models we can fit the logistic models using the maximum (conditional) log-likelihood criterion.

      \begin{align}l(D;w) = \sum_{i=1}^n log P(y_i|x_i,w)\end{align}

      where

      \begin{align} P(y = 1|x,w) = g(w_0 + x^Tw_1) \end{align}

    • The log-likelihood function $ l(D;w) $ is a jointly concave function of the parameters w; a number of optimization techniques are available for finding the maximizing parameters.

    • If we set the derivatives of the log-likelihood with respect to the parameters to zero, the optimality conditions again require that the prediction errors corresponding to the optimal setting of the parameters are uncorrelated with any linear function of the inputs.

  • Testing Set and Training Set

    • their distributions should be kept as consistent as possible
  • Dataset and input data

  • How many training samples do we need to achieve successful learning?

    • The learned model should approximate the real model with high probability
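A minimal sketch, assuming numpy and plain gradient ascent on the conditional log-likelihood (the step size and iteration count are illustrative), of fitting the logistic model P(y = 1|x, w) = g(w0 + x^T w1):

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function g(z) = (1 + exp(-z))^(-1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Maximize the conditional log-likelihood by gradient ascent.
    X: (n, d) inputs, y: (n,) labels in {0, 1}. Returns (w0, w1)."""
    n, d = X.shape
    w0, w1 = 0.0, np.zeros(d)
    for _ in range(iters):
        p = sigmoid(w0 + X @ w1)     # P(y = 1 | x, w) for every sample
        err = y - p                  # prediction errors
        w0 += lr * np.mean(err)      # gradient of the log-likelihood w.r.t. w0
        w1 += lr * X.T @ err / n     # gradient of the log-likelihood w.r.t. w1
    return w0, w1
```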

Deep learning

  • Multi-layer Perceptron and Back Propagation

  • Multilayer neural networks

  • Both supervised and unsupervised

  • Activation functions
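A minimal sketch, assuming numpy and a single hidden layer with logistic activations (all sizes and weight names are illustrative), of a multilayer perceptron forward pass; back propagation would then push gradients of a loss through these same layers in reverse:

```python
import numpy as np

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer multilayer perceptron."""
    h = sigmoid(W1 @ x + b1)      # hidden layer: linear transform + activation
    return sigmoid(W2 @ h + b2)   # output layer

# usage with random illustrative weights
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
print(mlp_forward(x, W1, b1, W2, b2))
```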