CS485/685 Foundations of Machine Learning

参考资料

B站视频
官方课本
katex 参考
这样做笔记还是比较吃力，建议还是卖一本原版书对着看。
英文四级刚过，水平有限，有错误欢迎指正😂

What is Machine Learning?

从experience到expertise
expertise usually will be another program
nature
- experiments 举了一个老鼠被毒死的例子，老鼠被毒过一次，就会知道某种看起来像食物的东西是不能吃的。老鼠学习的关键是记忆和如何将经验转化为一种普遍和推广的认识。
- pigeon superstation experiments
  - B.F.Skinner 心理学家
  - 行为心理学认为所有动物都相当于一个黑箱，我们无法理解内部
  - 收集了很多鸽子放到一个巨大的屋子里，屋子的顶部有机械装置可以喷洒谷粒，and put some toys around the fool.
  - Skinner不断的调整间隔intervals to have the food rain, pigeon will peck the toy usually, for more times ,pigeon will find out that some toys are relation with the food rain.
silly generalization
what distinguishes between these two types of learning (two phenomena )?
- Garcia 博士 1996 尝试理解第一个实验，做了相同的事，给了老鼠类似食物的毒药诱饵，但是不同的是，在老鼠尝试去触碰有毒的食物时，他将会响铃。
  
  讨论时有人说，食物并不是足以致死的食物，不然实验无法进行多轮
- 老鼠是否会学会响铃和毒药事物的关系，大概说了一下巴甫洛夫的狗和这个实验的区别和联系。
- 老鼠是否可以通过一次的experiment 得到 association between the ringing of a bell and the sickness five hours later.
结果是老鼠最终失败了
prior knowledge
why not pay attention to every thing ?
- 当老鼠在想毒药和食物的因果关系的时候，来自食物的信息是特殊的，如果接受了所有的信息，那么所有信息对老鼠来说都是特殊的。
prior knowledge is the key of what going to do in machine learning .
poor prior knowledge means more training data
a lots of prior knowledge,means I already knows how to do ,even don’t need collect data.
机器学习介于上述两者之间，需要首先判断我们有多少先验知识，然后再收集数据进行训练。

What do we need Machine learning ?

如果我们不使用machine learning的话自己编写一个程序，区分spam和not spam 是比较困难的
Tasks that require experience with amounts of date that are beyond capabilities

需要处理超出能力范围的大量数据的经验的任务
- 举了一个Google的例子，google 收集大数据来分析用户的在页面点击的数据

Many types of Machine learning

supervised VS unsupervised
supervised监督学习
- 通过人工标记垃圾邮件和正常邮件来训练模型
unsupervised无监督学习
- 没有需要人工标记的地方，需要通过整个模型来找到outliers(离群值)，或者叫找到特征
- scenarios 情节
reinforce learning 强化学习
- 介于监督学习和无监督学习之间
Batch learning 批处理学习
online learning 在线学习，不断地观察周围进行学习
Cooperative -> indifferent -> adversarial teacher
被动学习主动学习
relationship courses
- AI programming
- algorithms 涉及计算复杂度
- linear algebra
- statistics
- combinatorics
- optimization

Outline of the course

principles
- supervised batch
Algorithmic paradigms

Papaya Tasting

在一个贫瘠的小岛上有很多木瓜树，在岛上的人不得不学会如何判断树上的木瓜熟没熟
挑选特征
color
softness hard or mushy
Papaya -> (soft,color)
training data 就是二维的所有的点
domain space 是(0,1)
label {Tasty,N}
output, $f:[0,1]^2 - > {T,N} $
assumption about data generation
- the data is randomly generated
- Realizability by rectangles 可实现的矩形

How do I evaluate the success of such a function

measure of success

papayas
- Domain set X，$ [0,1]^2$
- Label set Y，$ (T,N)$
- Training input ，set of already
- sample，S=((x1,y1)…(xm,ym))S= ((x_1,y_1)…(x_m,y_m))S=((x1,y1)…(xm,ym))，tasted papaya
- learners output,h:X->Y
- get the prediction rule for tastiness
the quality of such h is determined with respect to some data generating distribution and labeling rule.

这些h的质量是根据一些数据生成分布和标注规则来决定的。

LD,f(h)=D[dx:h(x)≠f(x)]L_{D,f}(h) = D[dx: h(x) \ne f(x)] LD,f(h)=D[dx:h(x)=f(x)]

L : lost
D: distribution that generates papaya in the world.(分布)
F: function，the true function X->Y. （其中x表示一个点，（softness，color）,a strong assumption）
this is kind of the probability of error (h失败的可能性)
the goal of learner is given s to come up with h of small loss.学习者的目标是给定s来想出h的小损失
Basic learner strategy
Empirical risk Minimization (ERM) ,经验风险最小化
- 我们并不知道样本的分布，我们只知道这些sample，样品
define Ls(h)L_s(h)Ls(h),s 表示 sample，empirical loss of h,下面表示损失函数是如何定义的

\begin{align} L_s(h) = \frac{|di: h(x_i)\ne y_i|}{|S|} \end{align}
A very simple rule for finding h with small ER

\begin{align} h_S(x) \triangleq \begin{cases}y_i \quad if, x=x_i \quad for \quad some(x_i,y_i)\in S, \\N \quad otherwise\end{cases}\end{align}
We really to amend this empirical risk minimization technique.

我们真的要修正这个经验风险最小化技术 (ERM)
overfitting 过拟合

How to overcome the overfitting?

to guard against overfitting we introduce some prior knowledge(inductive bias)

为了防止过拟合，我们引入了一些先验知识(归纳偏差)
the exist a good prediction rule that is some axis some aligned rectangle.

存在一个很好的预测规则，即某一轴某一对齐矩形。
Let H denote a fixed collection of potential (candidate) prediction rules.

设H表示一个固定的潜在(候选)预测规则集合。

H⊆{f:f:x→y}=XyH \subseteq \lbrace f :f:x \to y \rbrace = X^y H⊆{f:f:x→y}=Xy
call it hypothesis class h

Revised learning rule

ERMhERM_hERMh —— pick h $ h \in H $ that minimizes of Ls(h)L_s(h)Ls(h)

ERMH(s)∈Argmin{Ls(h),h∈H}ERM_H(s) \in Argmin \lbrace L_s(h),h \in H\rbraceERMH(s)∈Argmin{Ls(h),h∈H}

ERM表示一个算法，上面的意思是，我的算法将会选择一个最小的损失(a minimum loss)
可以保证在众多的假设中，这是一个成功的学习规则(learning rule)

arg min 就是使后面这个式子达到最小值时的x，t的取值。

arg的意思：aargument of a complex number 复数的辐角

来源：https://www.cnblogs.com/Ph-one/p/12811442.html

Theorem

Theorem : Let x be any set , y = {0,1}, and let H be a finite set of functions from X to Y.

设x为任意集合，y = {0,1}， H为从x到y的有限函数集。
Assume : the training sample S is generated by some probability distribution d over x and labeled by some $ f \in H$
and the elements of S are picked i.i.d ( indentically independent distributed 独立分布相同,意思是每一个xi独立的取出，不考虑先前的取出， indentically所有的取出基于相同的分布)
Then $ERM_h $ is guaranteed to come up with an h that has small true loss , given sufficients large sample S. 在足够大样本S的情况下，$ERM_h $保证得到一个真实损失小的h。

Proof

coufusing samples are those on which ERMhERM_hERMh may come up with a bad h.
Fix some success parameter ϵ>0\epsilon > 0ϵ>0
the set of congusing S’s is $ \lbrace S ; L_{D,f}(h_s) > \epsilon\rbrace$
We wish to upper bound the probability of fetting such a bad sample.

我们希望把得到这样一个坏样品的可能性设个上限。

Dm[{S∣x:LD,f(hs)>ϵ}]D^m [ \lbrace S|_x : L_{D,f}(h_s)> \epsilon \rbrace] Dm[{S∣x:LD,f(hs)>ϵ}]
- D[A] 表示命中A的概率，上面的A为Bad samples