Ablative analysis

Suppose that, by adding the following features to logistic regression:

  • Spelling correction
  • Sender host features
  • Email header features
  • Email text parser features
  • Javascript parser
  • Features from embedded images

we have improved a spam classifier whose accuracy was originally only 94.0% to an accuracy of 99.9%. We would now like to know how much each component contributed to the overall gain in accuracy.
Ablative analysis answers this question by removing one component from the system at a time and measuring how much the accuracy drops:
[Figure 1]

From the table above, we can see that the email text parser features contribute the most to the accuracy gain. If we were considering removing components to improve efficiency, this is the one that should least be removed.
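As a rough illustration of how such an ablation loop might be coded (this is my own sketch, not the course's code; the synthetic data and the mapping from components to feature columns are made up):

```python
# Minimal sketch of an ablative analysis loop (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 12))
y = (X[:, :4].sum(axis=1) + 0.5 * rng.normal(size=2000) > 0).astype(int)

# Map each (hypothetical) pipeline component to the feature columns it contributes.
groups = {
    "spelling correction": [0, 1],
    "sender host": [2, 3],
    "email header": [4, 5],
    "email text parser": [6, 7],
    "javascript parser": [8, 9],
    "embedded images": [10, 11],
}

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def accuracy(cols):
    """Train logistic regression on the given columns and return test accuracy."""
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return model.score(X_te[:, cols], y_te)

all_cols = sum(groups.values(), [])
baseline = accuracy(all_cols)
print(f"all components: {baseline:.3f}")
for name, cols in groups.items():
    kept = [c for c in all_cols if c not in cols]
    print(f"  without {name}: {accuracy(kept):.3f}")
```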

Regularization and Bias/Variance

When training a model, we usually use regularization to avoid overfitting. However, the choice of the regularization parameter λ needs careful consideration.

Previously, when choosing the regularization parameter λ, we only considered the single-variable case. Now we consider how to choose λ for a polynomial model.

[Figure 2]

For example: suppose we apply regularization to some polynomial model, with λ = 0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10. We want to find the best value of λ.

First, we split the dataset into three parts: a training set, a cross-validation set, and a test set.

Then, for each value λ = 0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10, we compute Jtrain(θ) and JCV(θ).

Finally, we take the value of λ for which JCV(θ) is smallest and evaluate the corresponding hypothesis on the test set, computing its Jtest(θ).
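For reference, the error measures used above are the usual unregularized squared errors on each split (written in the course's standard notation for a regression hypothesis):

```latex
J_{\mathrm{train}}(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2,\qquad
J_{\mathrm{cv}}(\theta) = \frac{1}{2m_{\mathrm{cv}}}\sum_{i=1}^{m_{\mathrm{cv}}}\bigl(h_\theta(x_{\mathrm{cv}}^{(i)}) - y_{\mathrm{cv}}^{(i)}\bigr)^2,\qquad
J_{\mathrm{test}}(\theta) = \frac{1}{2m_{\mathrm{test}}}\sum_{i=1}^{m_{\mathrm{test}}}\bigl(h_\theta(x_{\mathrm{test}}^{(i)}) - y_{\mathrm{test}}^{(i)}\bigr)^2
```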

[Figure 3]

In the figure, suppose JCV(θ) is smallest when λ = 0.08.

To make this easier to understand, and to make it easier to find the best value of λ, we can plot the figure below:

[Figure 4]


1. Take the default loss function for granted

Many practitioners train and pick the best model using the default loss function (e.g., squared error). In practice, an off-the-shelf loss function rarely aligns with the business objective. Take fraud detection as an example: when trying to detect fraudulent transactions, the business objective is to minimize the fraud loss. The off-the-shelf loss function of a binary classifier weighs false positives and false negatives equally; to align with the business objective, the loss function should not only penalize false negatives more than false positives, but also penalize each false negative in proportion to the dollar amount. In addition, fraud-detection datasets usually contain highly imbalanced labels. In these cases, bias the loss function in favor of the rare case (e.g., through up/down sampling).

This section is Andrew's advice on applying machine learning. Although it contains no mathematical formulas, it is a very important lesson.

Regularization and Bias/Variance

[Figure 5]

In the figure above, we see that as λ increases, our fit becomes more
rigid. On the other hand, as λ approaches 0, we tend to overfit the
data. So how do we choose our parameter λ to get it 'just right'? In
order to choose the model and the regularization term λ, we need to:

  1. Create a list of lambdas (i.e.
    λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
  2. Create a set of models with different degrees or any other variants.
  3. Iterate through the λs and for each λ go through all the models to
    learn some Θ.
  4. Compute the cross validation error using the learned Θ (computed
    with λ) on the JCV(Θ) without regularization or λ = 0.
  5. Select the best combo that produces the lowest error on the cross
    validation set.
  6. Using the best combo Θ and λ, apply it on Jtest(Θ) to see
    if it has a good generalization of the problem.
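A compact sketch of steps 1-6, assuming ridge-regularized polynomial regression as the family of model variants (the data, the λ list and the scikit-learn classes here are just one concrete way to realize the procedure):

```python
# Sketch of the lambda / model-selection loop described above (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x).ravel() + 0.3 * rng.normal(size=300)

# training / cross-validation / test split
x_tr, y_tr = x[:180], y[:180]
x_cv, y_cv = x[180:240], y[180:240]
x_te, y_te = x[240:], y[240:]

lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
degrees = range(1, 9)

best = None
for lam in lambdas:
    for d in degrees:
        poly = PolynomialFeatures(d, include_bias=False)
        model = Ridge(alpha=lam).fit(poly.fit_transform(x_tr), y_tr)
        # the cross-validation error is computed *without* the regularization term
        j_cv = mean_squared_error(y_cv, model.predict(poly.transform(x_cv))) / 2
        if best is None or j_cv < best[0]:
            best = (j_cv, lam, d, model, poly)

j_cv, lam, d, model, poly = best
j_test = mean_squared_error(y_te, model.predict(poly.transform(x_te))) / 2
print(f"best lambda={lam}, degree={d}, J_cv={j_cv:.4f}, J_test={j_test:.4f}")
```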


4. Use a high variance model when n << p

SVM is one of the most popular off-the-shelf modeling algorithms, and one
of its most powerful features is the ability to fit the model with
different kernels. SVM kernels can be thought of as a way to
automatically combine existing features to form a richer feature space.
Since this powerful feature comes almost for free, most practitioners use
a kernel by default when training an SVM model. However, when the data has
n << p (far fewer samples than features), as is common in fields such as
medical data, the richer feature space implies a much higher risk of
overfitting. In fact, high variance models should be avoided entirely
when n << p.

1. High variance vs. high bias

Typical learning curve for high variance:

[Figure 6]

  • As the training set size grows, the test error keeps decreasing (so adding more training data will help improve performance).
  • There is a large gap between the training error and the test error.

Typical learning curve for high bias:

[Figure 7]

  • Even the training error is unacceptably high.
  • The gap between the training error and the test error is small.

Therefore, among the remedies listed above:

  • Collect more training examples: fixes a high variance problem
  • Try a smaller set of features: fixes a high variance problem
  • Add more features: fixes a high bias problem
  • Change the features (consider email header/body features): fixes a high bias problem

Learning Curves

Training an algorithm on a very small number of data points (such as 1, 2
or 3) will easily yield 0 error, because we can always find a quadratic
curve that passes exactly through that many points. Hence:

  • As the training set gets larger, the error for a quadratic function
    increases.
  • The error value will plateau out after a certain m, or training set
    size.

Experiencing high bias:

Low training set size: causes Jtrain(Θ) to be low and
JCV(Θ) to be high.

Large training set size: causes both Jtrain(Θ) and
JCV(Θ) to be high with
Jtrain(Θ)≈JCV(Θ).

If a learning algorithm is suffering from high bias, getting more
training data will not (by itself) help much.

[Figure 8]

Experiencing high variance:

Low training set size: Jtrain(Θ) will be low and
JCV(Θ) will be high.

Large training set size: Jtrain(Θ) increases with training
set size and JCV(Θ) continues to decrease without leveling
off. Also, Jtrain(Θ) < JCV(Θ) but the
difference between them remains significant.

If a learning algorithm is suffering from high variance, getting more
training data is likely to help.

[Figure 9]
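A sketch of how such a learning curve can be generated: train on progressively larger subsets of the training data and record both errors at each size (the quadratic data and the deliberately too-simple linear model below are made up to reproduce the high-bias shape):

```python
# Sketch of computing a learning curve (illustrative data and model).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=(200, 1))
y = 1.0 + 2.0 * x.ravel() + x.ravel() ** 2 + 0.5 * rng.normal(size=200)

x_tr, y_tr = x[:120], y[:120]
x_cv, y_cv = x[120:], y[120:]

for m in range(2, len(x_tr) + 1, 10):
    # a straight line on quadratic data: deliberately too simple -> high bias
    model = LinearRegression().fit(x_tr[:m], y_tr[:m])
    j_train = mean_squared_error(y_tr[:m], model.predict(x_tr[:m])) / 2
    j_cv = mean_squared_error(y_cv, model.predict(x_cv)) / 2
    print(f"m={m:3d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```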


4. Use a high variance model when the number of samples is far smaller than the number of features (n << p)

SVM is one of the most popular off-the-shelf modeling algorithms, and one of its most powerful features is the ability to fit the model with different kernels. SVM kernels can be thought of as a way to automatically combine existing features to form a richer feature space. Since this powerful feature comes almost for free, most practitioners use a kernel by default when training an SVM model. However, when the number of samples is far smaller than the number of features (n << p), as is common in fields such as medical data, the richer feature space implies a much higher risk of overfitting. In fact, high variance models should be avoided entirely when n << p.

Getting started on a learning problem

Approach #1: Careful design

  • Spend a long time designing exactly the right features, collecting
    the right dataset, and designing the right algorithmic architecture.
  • Implement it and hope it works.

Benefit: Nicer, perhaps more scalable algorithms. May come up with
new, elegant learning algorithms; contribute to basic research in
machine learning.

Approach #2: Build-and-fix

  • Implement something quick-and-dirty.
  • Run error analyses and diagnostics to see what’s wrong with it, and
    fix its errors.

Benefit: Will often get your application problem working more
quickly. Faster time to market.

The first approach is suited to theoretical research.
In day-to-day work we should use the second approach: avoid premature optimization (learning a pile of knowledge we may never need, or spending a lot of effort on things that yield only marginal gains) and get the application working sooner.
Finally, Andrew says he often spends a third or more of his time designing diagnostics to figure out what is working and what is going wrong, and that this time is well spent.

Diagnosing Bias vs. Variance

In this section we examine the relationship between the degree of the
polynomial d and the underfitting or overfitting of our hypothesis.

  • We need to distinguish whether bias or variance is the problem
    contributing to bad predictions.
  • High bias is underfitting and high variance is overfitting. Ideally,
    we need to find a golden mean between these two.

The training error will tend to decrease as we increase the degree d of
the polynomial.

At the same time, the cross validation error will tend to decrease as we
increase d up to a point, and then it will increase as d is increased,
forming a convex curve.

High bias (underfitting): both Jtrain(Θ) and
JCV(Θ) will be high. Also,
JCV(Θ)≈Jtrain(Θ).

High variance (overfitting): Jtrain(Θ) will be low and
JCV(Θ) will be much greater than Jtrain(Θ).

This is summarized in the figure below:

[Figure 10]
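A sketch of how the curve above can be produced: fit polynomials of increasing degree d and record the training and cross-validation errors (illustrative data only):

```python
# Sketch: training vs. cross-validation error as the polynomial degree d grows.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=80)
y = np.sin(x) + 0.3 * rng.normal(size=80)
x_tr, y_tr, x_cv, y_cv = x[:50], y[:50], x[50:], y[50:]

for d in range(1, 11):
    coeffs = np.polyfit(x_tr, y_tr, d)
    j_train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2) / 2
    j_cv = np.mean((np.polyval(coeffs, x_cv) - y_cv) ** 2) / 2
    print(f"d={d:2d}  J_train={j_train:.3f}  J_cv={j_cv:.3f}")
```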

 

6. Use a linear model without considering multicollinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose the ground-truth model is Y = X1 + X2. Ideally, if the data is observed with a small amount of noise, the linear regression solution would recover the ground truth. However, if X1 and X2 are collinear, then as far as most optimization algorithms are concerned, Y = 2*X1, Y = 3*X1 - X2 or Y = 100*X1 - 99*X2 are all equally good. The problem might not be harmful, since it does not bias the estimate; however, it does make the problem ill-conditioned and makes the coefficient weights uninterpretable.

Error Analysis

Suppose we have a face-recognition machine learning application made up of several different machine learning components.

[Figure 11]

The overall accuracy of the current system is 85%. We can use error analysis to determine which component, if improved, would raise the overall accuracy the most:

[Figure 12]

The concrete procedure is: replace one machine learning component at a time with a perfect stand-in (a human, or some other means of supplying ground truth), and record the resulting system accuracy. From the table above we can see that improving face detection would raise the system's accuracy significantly (so it should be the focus of our next step), whereas the preprocessing step (background removal) would only improve overall accuracy marginally.

When m = 1, our hypothesis hθ(x) fits the training set perfectly, so Jtrain(θ) ≈ 0, but it generalizes poorly to the cross-validation set, so JCV(θ) is large. When m = 2, the hypothesis still fits the training set well; Jtrain(θ) increases slightly, while JCV(θ) decreases slightly compared with before. As m keeps growing, Jtrain(θ) increases to some value and then levels off, JCV(θ) decreases to some value and then levels off, and the two end up very close to each other.

Therefore, when the learning algorithm suffers from high bias, increasing the number of training examples is of little use.

[Figure 13]

In the figure above, our hypothesis is hθ(x) = θ0 + θ1x + θ2x² + … + θ100x^100, this time with regularization and a very small value of the regularization parameter λ. When m = 5, the hypothesis fits the training set well, so Jtrain(θ) is small, but it generalizes poorly, so JCV(θ) is large. When m = 12, the hypothesis still fits the training set well, but Jtrain(θ) grows slightly and JCV(θ) decreases slightly. As m keeps growing, Jtrain(θ) keeps increasing gradually and JCV(θ) keeps decreasing gradually.

Therefore, when the learning algorithm suffers from high variance, increasing the number of training examples is likely to help.

Note: when m is large enough, Jtrain(θ) keeps increasing and JCV(θ) keeps decreasing; whether the two curves eventually meet is not made clear in the video.


5. L1/L2/… regularization without standardization

Applying L1 or L2 penalties to large coefficients is a common way to
regularize linear or logistic regression. However, many practitioners
are not aware of the importance of standardizing features before
applying such regularization.

Returning to fraud detection, imagine a linear regression model with a
transaction amount feature. Without regularization, if the unit of
transaction amount is in dollars, the fitted coefficient is going to be
around 100 times larger than the fitted coefficient if the unit were in
cents. With regularization, as L1 / L2 penalize larger coefficients
more, the transaction amount will get penalized more if the unit is in
dollars. Hence, the regularization is biased and tends to penalize
features on smaller scales. To mitigate the problem, standardize all the
features and put them on an equal footing as a preprocessing step.
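A minimal sketch of that preprocessing step, assuming scikit-learn's StandardScaler and Ridge (the data is synthetic and only illustrates the idea):

```python
# Standardize features before applying an L2-penalized model (illustrative).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
amount_dollars = rng.exponential(100, size=500)   # transaction amount, in dollars
other_feature = rng.normal(size=500)
X = np.column_stack([amount_dollars, other_feature])
y = 0.01 * amount_dollars + other_feature + 0.1 * rng.normal(size=500)

# Without standardization, the penalty a feature receives depends on the
# arbitrary unit it is measured in (dollars vs. cents); scaling first puts
# all features on an equal footing.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)
```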

Summary

From Chinese New Year (the first note was published on February 24) to the Qingming holiday (today, April 5), it took about 40 days to work through the supervised learning part of CS229. Along the way, countless long-standing questions about machine learning got answered, which feels great. Thanks to NetEase Open Courseware for the translation and for thoughtfully providing the lecture notes for download, and of course thanks to Andrew for the excellent lectures. The next step is to get a good toolbox in hand (perhaps Mahout?), and then put the advice into practice: implement something quick-and-dirty first, then use error analysis to improve the parts that matter most. Fight like an ML expert!

Supplementary notes


7. Interpreting absolute value of coefficients from linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each
coefficient, many practitioners believe that for linear models, the
bigger the absolute value of the coefficient, the more important the
corresponding feature is. This is rarely true, as (a) changing the scale
of a variable changes the absolute value of its coefficient, and (b) if
features are multicollinear, coefficients can shift from one feature to
others. Also, the more features the data set has, the more likely the
features are to be multicollinear, and the less reliable it is to
interpret feature importance by the coefficients.
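A small sketch of point (a): merely changing a feature's scale changes its fitted coefficient, without changing the model at all (synthetic data, my own illustration):

```python
# Rescaling a feature changes its coefficient without changing the fit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = x1 + x2 + 0.1 * rng.normal(size=300)

coef_orig = LinearRegression().fit(np.column_stack([x1, x2]), y).coef_
coef_scaled = LinearRegression().fit(np.column_stack([x1 / 100, x2]), y).coef_
print(coef_orig)    # roughly [1, 1]
print(coef_scaled)  # roughly [100, 1]: same model, very different "importance"
```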

So there you go: 7 common mistakes when doing ML in practice. This list
is not meant to be exhaustive but merely to provoke the reader to
consider modeling assumptions that may not be applicable to the data at
hand. To achieve the best model performance, it is important to pick the
modeling algorithm that makes the most fitting assumptions — not just
the one you’re most familiar with.

Original article URL:

==================================================

Translation:

2. Optimization algorithm vs. optimization objective

Returning to the spam-classification example: suppose Bayesian logistic regression has a 2% error rate on spam and also a 2% error rate on non-spam (we do not want too many legitimate emails to be filtered out), while an SVM with a linear kernel has a 10% error rate on spam but only a 0.01% error rate on non-spam. For computational efficiency, however, you would still prefer to use logistic regression. How should you tune it?
At this point we care about two questions:

  1. Has the gradient ascent for logistic regression converged?
  2. Are we optimizing the right function?

In this problem, the function we actually care about is a weighted accuracy (non-spam should carry more weight than spam):

[Figure 14]
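Figure 14 is not reproduced here; a plausible form of such a weighted accuracy, with larger weights w(i) on non-spam examples, is:

```latex
a(\theta) \;=\; \frac{\sum_{i} w^{(i)}\,\mathbf{1}\{h_\theta(x^{(i)}) = y^{(i)}\}}{\sum_{i} w^{(i)}},
\qquad w^{(i)} \text{ larger when example } i \text{ is non-spam}
```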

Bayesian logistic regression and the SVM each maximize their own objective function J(θ), and we need to check whether suitable parameters have been chosen:

[Figure 15]

From the problem setup we already know a(θSVM) > a(θBLR). The diagnostic is to check whether J(θSVM) > J(θBLR). If J(θSVM) > J(θBLR), then θBLR has not maximized J(θ), i.e. the algorithm has not converged, and we need to improve the optimization algorithm. If J(θSVM) ≤ J(θBLR), then J(θ) is the wrong optimization objective: even though J(θ) has been maximized, the objective we actually care about has not, so we need to change the objective function.
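The same diagnostic, written out as a small sketch (J here stands for whatever objective the logistic-regression optimizer maximizes; the function and names are illustrative, not the course's code):

```python
# Diagnostic: optimization-algorithm problem or optimization-objective problem?
def diagnose(J, theta_blr, theta_svm):
    """J: the objective the BLR optimizer maximizes; thetas: fitted parameters."""
    if J(theta_svm) > J(theta_blr):
        # The SVM's parameters score higher on our own objective, so gradient
        # ascent has not actually maximized J: improve the optimization algorithm.
        return "optimization algorithm problem"
    else:
        # J is maximized, yet the weighted accuracy we care about is still worse,
        # so J is the wrong objective: change it (e.g. adjust lambda, or use an SVM).
        return "optimization objective problem"
```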
Therefore, among the remedies listed above:

  • Run gradient ascent for more iterations: fixes an optimization-algorithm problem
  • Try Newton's method: fixes an optimization-algorithm problem
  • Use a different value of λ: fixes an optimization-objective problem
  • Switch to an SVM: fixes an optimization-objective problem

As the two questions above show, tuning at random without first diagnosing the root cause of the problem can easily lead to a lot of work with no improvement at all. Beyond that, you will often need to invent your own diagnostics to figure out what is going wrong with an algorithm. "Solving a really important problem using learning algorithms, one of the most valuable things is just your own personal intuitive understanding of the problem." Diagnosing what is wrong in a machine learning application is a good way to build that intuition.

Supplementary notes


3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve
special attention or should be completely ignored. Take the example of
revenue forecasting. If unusual spikes of revenue are observed, it’s
probably a good idea to pay extra attention to them and figure out what
caused the spike. But if the outliers are due to mechanical error,
measurement error or anything else that’s not generalizable, it’s a good
idea to filter out these outliers before feeding the data to the
modeling algorithm.

Some models are more sensitive to outliers than others. For instance,
AdaBoost might treat those outliers as “hard” cases and put tremendous
weights on outliers while decision tree might simply count each outlier
as one false classification. If the data set contains a fair amount of
outliers, it’s important to either use modeling algorithm robust against
outliers or filter the outliers out.
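One simple way to implement the "filter them out" option is a z-score rule; the 3-standard-deviation threshold below is an arbitrary choice, not the article's prescription:

```python
# Filter rows whose value is more than 3 standard deviations from the mean.
import numpy as np

rng = np.random.default_rng(6)
revenue = np.concatenate([rng.normal(100, 10, size=500), [400.0, 550.0]])  # two spikes

z = (revenue - revenue.mean()) / revenue.std()
filtered = revenue[np.abs(z) < 3]
print(len(revenue), "->", len(filtered), "rows after removing outliers")
```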

Debugging Learning Algorithms

Suppose we want to build a spam-classification model and have already selected a small subset of the vocabulary (100 words) out of a huge word list as features.
We implement Bayesian logistic regression with gradient ascent, but the test error turns out to be 20%, which is clearly too high.
[Figure 16]

How can we fix this?

  • Collect more training examples
  • Try a smaller set of features
  • Add more features
  • Change the features (consider email header/body features)
  • Run gradient ascent for more iterations
  • Try Newton's method
  • Use a different value of λ
  • Switch to an SVM

"The people in industry and in research that I see that are really good would not go and try to change a learning algorithm randomly." That is Andrew's point: with this many candidate fixes, picking one at random wastes time and amounts to trial and error. A better approach is to first diagnose where the problem actually lies, and then choose the appropriate fix.

Diagnosing Bias and Variance

High bias and high variance are essentially the underfitting and overfitting problems of a learning model.

[Figure 17]

To diagnose high bias versus high variance, i.e. underfitting versus overfitting, we usually plot a chart like the one below:

[Figure 18]

High bias (underfitting)

  • Jtrain(Θ) error is large
  • JCV(Θ) error ≈ Jtrain(Θ) error

High variance (overfitting)

  • Jtrain(Θ) error is small
  • JCV(Θ) error >> Jtrain(Θ) error


1. Take default loss function for granted

Many practitioners train and pick the best model using the default loss
function (e.g., squared error). In practice, an off-the-shelf loss function
rarely aligns with the business objective. Take fraud detection as an
example. When trying to detect fraudulent transactions, the business
objective is to minimize the fraud loss. The off-the-shelf loss function
of binary classifiers weighs false positives and false negatives
equally. To align with the business objective, the loss function should
not only penalize false negatives more than false positives, but also
penalize each false negative in proportion to the dollar amount. Also,
data sets in fraud detection usually contain highly imbalanced labels.
In these cases, bias the loss function in favor of the rare case (e.g.,
through up/down sampling).
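A sketch of aligning the loss with the business objective by weighting each fraud example by its dollar amount; the use of scikit-learn's sample_weight and the weighting scheme itself are my assumptions, not the article's exact recipe:

```python
# Cost-sensitive training: weight fraud examples by their transaction amount.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(5000, 5))
amount = rng.exponential(50, size=5000)                        # dollars per transaction
y = (X[:, 0] + 0.5 * rng.normal(size=5000) > 2).astype(int)    # rare "fraud" label

# Missing a fraud (false negative) should cost in proportion to its dollar amount,
# so give positive examples a weight proportional to their value.
weights = np.where(y == 1, amount, 1.0)

model = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)
```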

Deciding What to Do Next

At the beginning of the first post on advice for applying machine learning, we listed the following remedies for the case where predictions have unacceptably large errors:

  • Get more training examples
  • Try a smaller set of features
  • Try getting additional features
  • Try adding polynomial features
  • Try decreasing the regularization parameter λ
  • Try increasing the regularization parameter λ

Examining each of these in turn, we reached the following conclusions:

  • Get more training examples: fixes high variance (overfitting)
  • Try a smaller set of features: fixes high variance (overfitting)
  • Try getting additional features: fixes high bias (underfitting)
  • Try adding polynomial features: fixes high bias (underfitting)
  • Try decreasing λ: fixes high bias (underfitting)
  • Try increasing λ: fixes high variance (overfitting)

For neural network models, a "small" model is prone to high bias (underfitting) but is computationally cheaper; a "large" model (more hidden units per layer, or more hidden layers) is prone to high variance (overfitting) and is computationally more expensive. In general, however, a larger neural network with regularization performs better.

[Figure 19]

We usually start with a neural network with a single hidden layer. But in other situations a single hidden layer is not the best choice, so we can split the data into training, cross-validation and test sets, train networks with different numbers of hidden layers, and pick the one with the smallest JCV(Θ).


7. Interpreting the absolute value of coefficients from a linear or logistic regression as feature importance

Because many off-the-shelf linear regressors return a p-value for each coefficient, many practitioners believe that, for linear models, the larger the absolute value of a coefficient, the more important the corresponding feature. This is rarely true, because (a) changing the scale of a variable changes the absolute value of its coefficient, and (b) if features are multicollinear, coefficients can shift from one feature to another. In addition, the more features a dataset has, the more likely the features are to be multicollinear, and the less reliable it is to interpret feature importance from the coefficients.

So there you have it: seven common mistakes when doing machine learning in practice. The list is not meant to be exhaustive, but merely to prompt the reader to consider modeling assumptions that may not apply to the data at hand. To achieve the best model performance, it is important to pick the modeling algorithm that makes the most fitting assumptions, not just the one you are most familiar with.

Supplementary notes


Machine Learning Done Wrong

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value storage, and
each design makes a different set of assumptions about the usage
pattern. In statistical modeling, there are various algorithms to build
a classifier, and each algorithm makes a different set of assumptions
about the data.

When dealing with small amounts of data, it’s reasonable to try as many
algorithms as possible and to pick the best one since the cost of
experimentation is low. But as we hit “big data”, it pays off to analyze
the data upfront and then design the modeling pipeline (pre-processing,
modeling, optimization algorithm, evaluation, productionization)
accordingly.

As pointed out in my previous post,
there are dozens of ways to solve a given modeling problem. Each model
assumes something different, and it’s not obvious how to navigate and
identify which assumptions are reasonable. In industry, most
practitioners pick the modeling algorithm they are most familiar with
rather than pick the one which best suits the data. In this post, I
would like to share some common mistakes (the don’t-s). I’ll save some
of the best practices (the do-s) in a future post.

Learning Curves

Plotting learning curves can help us understand whether a learning algorithm is working properly. A learning curve plots the training error and the cross-validation error as a function of the training set size m.

[Figure 20]

In the figure above, the hypothesis is hθ(x) = θ0 + θ1x + θ2x², and regularization is not considered here. Consider how the two errors behave as the training set size m grows.


2. Use plain linear models for non-linear interaction

When building a binary classifier, many practitioners immediately jump
to logistic regression because it’s simple. But, many also forget that
logistic regression is a linear model and the non-linear interaction
among predictors need to be encoded manually. Returning to fraud
detection, high order interaction features like “billing address =
shipping address and transaction amount < $50” are required for good
model performance. So one should prefer non-linear models like SVM with
kernel or tree based classifiers that bake in higher-order interaction
features.
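A sketch of what "encoded manually" means for the example above (the column names are hypothetical); a kernel SVM or a tree-based model would not need this step:

```python
# Manually encode a high-order interaction feature for a linear model (illustrative).
import pandas as pd

df = pd.DataFrame({
    "billing_address": ["a", "b", "c"],
    "shipping_address": ["a", "x", "c"],
    "transaction_amount": [30.0, 20.0, 80.0],
})

# "billing address = shipping address AND amount < $50" as an explicit 0/1 feature
df["same_addr_small_amount"] = (
    (df["billing_address"] == df["shipping_address"])
    & (df["transaction_amount"] < 50)
).astype(int)

print(df)
# A kernel SVM (e.g. sklearn.svm.SVC with an RBF kernel) or a tree-based classifier
# such as GradientBoostingClassifier can pick up interactions like this on its own.
```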

Deciding What to Do Next Revisited

Our decision process can be broken down as follows:

  • Getting more training examples: Fixes high variance
  • Trying smaller sets of features: Fixes high variance
  • Adding features: Fixes high bias
  • Adding polynomial features: Fixes high bias
  • Decreasing λ: Fixes high bias
  • Increasing λ: Fixes high variance.

Diagnosing Neural Networks

  • A neural network with fewer parameters is prone to underfitting. It
    is also computationally cheaper.
  • A large neural network with more parameters is prone to overfitting.
    It is also computationally expensive. In this case you can use
    regularization (increase λ) to address the overfitting.

Using a single hidden layer is a good starting default. You can train
your neural network on a number of hidden layers using your cross
validation set. You can then select the one that performs best.
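A sketch of that selection loop, assuming scikit-learn's MLPClassifier as the network and synthetic data (the layer sizes are placeholders):

```python
# Pick the number of hidden layers by cross-validation accuracy (illustrative).
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_cv, y_tr, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

for hidden in [(25,), (25, 25), (25, 25, 25)]:   # 1, 2, 3 hidden layers
    net = MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=0)
    net.fit(X_tr, y_tr)
    print(hidden, "CV accuracy:", round(net.score(X_cv, y_cv), 3))
```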

Model Complexity Effects:

  • Lower-order polynomials (low model complexity) have high bias and
    low variance. In this case, the model fits poorly consistently.
  • Higher-order polynomials (high model complexity) fit the training
    data extremely well and the test data extremely poorly. These have
    low bias on the training data, but very high variance.
  • In reality, we would want to choose a model somewhere in between,
    that can generalize well but also fits the data reasonably well.


6. Use linear model without considering multi-collinear predictors

Imagine building a linear model with two variables X1 and X2, and suppose
the ground truth model is Y = X1 + X2. Ideally, if the data is observed with
a small amount of noise, the linear regression solution will recover the
ground truth. However, if X1 and X2 are collinear, then as far as most
optimization algorithms are concerned, Y = 2*X1, Y = 3*X1 - X2 or
Y = 100*X1 - 99*X2 are all equally good. The problem might not be detrimental,
as it doesn't bias the estimation. However, it does make the problem
ill-conditioned and makes the coefficient weights uninterpretable.
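A small sketch of the ill-conditioning: when X2 is almost exactly a multiple of X1, the condition number of the design matrix explodes and the fitted weights become unstable, even though the fit itself stays good (synthetic data, my own illustration):

```python
# Collinear predictors: unstable, hard-to-interpret coefficients, same fit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)        # X2 is almost exactly X1
y = x1 + x2 + 0.01 * rng.normal(size=200)

X = np.column_stack([x1, x2])
print("condition number:", np.linalg.cond(X))      # enormous for collinear columns
model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_)                 # often huge and of opposite signs
print("R^2:", model.score(X, y))                    # yet the in-sample fit stays ~1.0
```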

Supplementary notes


3. Forget about outliers

Outliers are interesting. Depending on the context, they either deserve special attention or should be completely ignored. Take revenue forecasting as an example: if unusual spikes in revenue are observed, it is probably a good idea to pay extra attention to them and figure out what caused them. But if the outliers are due to mechanical error, measurement error, or anything else that does not generalize, it is a good idea to filter them out before feeding the data to the modeling algorithm.

Some models are more sensitive to outliers than others. For instance, while a decision tree might simply count each outlier as one misclassification, AdaBoost will treat those outliers as "hard" cases and assign them enormous weights. If a dataset contains a fair number of outliers, it is important either to use a modeling algorithm that is robust to outliers or to filter the outliers out.


7 Common Mistakes When Doing Machine Learning in Practice

Statistical modeling is a lot like engineering.

In engineering, there are various ways to build a key-value store, and each design makes a different set of assumptions about the usage pattern. In statistical modeling, there are various algorithms for building a classifier, and each algorithm makes a different set of assumptions about the data.

When dealing with small amounts of data, it is reasonable to try as many algorithms as possible and pick the best one, because the cost of experimentation is low. But when we hit "big data", it pays to analyze the data up front and then design the modeling pipeline (preprocessing, modeling, optimization algorithm, evaluation, productionization) accordingly.

As pointed out in my previous post, there are many ways to solve a given modeling problem. Each model makes different assumptions, and it is not obvious how to navigate them and decide which assumptions are reasonable. In industry, most practitioners pick the modeling algorithm they are most familiar with rather than the one best suited to the data. In this post I would like to share some common mistakes (the don'ts), and save some best practices (the dos) for a future post.


The following list of pitfalls in applying machine learning algorithms cleared up a lot of my own confusion; I recommend reading it:

For the contents of Andrew Ng's course, see Appendix 1. The basic structure of this post is as follows:

2. Use a plain linear model for non-linear interactions

When building a binary classifier, many practitioners immediately jump to logistic regression because it is simple. But many also forget that logistic regression is a linear model, and non-linear interactions among the predictors have to be encoded manually. Returning to fraud detection: to get good model performance, high-order interaction features such as "billing address = shipping address and transaction amount < $50" are required. So one should prefer non-linear models, such as an SVM with a kernel or tree-based classifiers, that bake in higher-order interaction features.




5. L1/L2/… regularization without standardization

Applying L1 or L2 penalties to large coefficients is a common way to regularize linear or logistic regression. However, many practitioners are not aware of how important it is to standardize the features before applying such regularization.

Returning to fraud detection, imagine a linear regression model with a transaction-amount feature. Without regularization, if the transaction amount is measured in dollars, the fitted coefficient will be roughly 100 times larger than if it were measured in cents. With regularization, since L1/L2 penalize larger coefficients more, the transaction amount will be penalized more when the unit is dollars. The regularization is therefore biased, and tends to penalize features measured on smaller scales. To mitigate the problem, standardize all features and put them on an equal footing as a preprocessing step.

Author: 高雪松
