(24) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)
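I. Regression
1. Consider a regularized regression problem. The figure below shows the log-likelihood (mean log-probability) on the training set and the test set for different values of the regularization parameter C, where the penalty is a quadratic regularizer. (10 points)
(1) Is the statement "as C increases, the training-set log-likelihood in Figure 2 never increases" correct? Explain why.
(2) Explain why the test-set log-likelihood in Figure 2 decreases when C takes large values.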
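The behavior asked about in problem 1 can be sketched numerically. The example below is only an illustration under stated assumptions: it uses Gaussian linear regression with a fixed noise variance of 1 and a penalty of (C/2)·||w||², and the data sizes, seed, and grid of C values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 30 training points, 1000 test points, 10 features.
n_train, n_test, d = 30, 1000, 10
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + rng.normal(size=n_train)
y_test = X_test @ w_true + rng.normal(size=n_test)

def fit_penalized(X, y, C):
    """Maximize the Gaussian log-likelihood (sigma^2 = 1) minus (C/2)*||w||^2."""
    return np.linalg.solve(X.T @ X + C * np.eye(X.shape[1]), X.T @ y)

def mean_log_prob(X, y, w, sigma2=1.0):
    """Average log-density of y under N(Xw, sigma2)."""
    resid = y - X @ w
    return float(np.mean(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2)))

Cs = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
weights = [fit_penalized(X_train, y_train, C) for C in Cs]
train_ll = [mean_log_prob(X_train, y_train, w) for w in weights]
test_ll = [mean_log_prob(X_test, y_test, w) for w in weights]

for C, tr, te in zip(Cs, train_ll, test_ll):
    print(f"C={C:8.1f}  train={tr:8.3f}  test={te:8.3f}")
```

The training log-likelihood cannot increase along the grid because a larger C moves the solution further from the unpenalized maximizer, while a very large C over-shrinks the weights and also lowers the test log-likelihood.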
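2. Consider the linear regression model $y \sim N(w_0 + w_1 x,\ \sigma^2)$, with the training data shown in the figure below. (10 points)
(1) Estimate the parameters by maximum likelihood and draw the resulting model in Figure (a). (3 points)
(2) Estimate the parameters by regularized maximum likelihood, i.e., add the regularization penalty $-\frac{C}{2} w_1^2$ to the log-likelihood objective, and draw in Figure (b) the model obtained when C takes a very large value. (3 points)
(3) After regularization, does the variance $\sigma^2$ of the Gaussian become larger, smaller, or stay the same? (4 points)
Figure (a) Figure (b)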
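A minimal sketch of the regularized fit in problem 2, assuming the penalty is $-\frac{C}{2} w_1^2$ added to the log-likelihood and that $\sigma^2$ is estimated jointly with the weights. The data, the function name `fit_penalized_line`, and the alternating update scheme are illustrative assumptions, not part of the original problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training data with a clear linear trend.
x = rng.uniform(-1.0, 1.0, size=50)
y = 0.5 + 2.0 * x + rng.normal(scale=0.3, size=50)

def fit_penalized_line(x, y, C, n_iter=100):
    """Alternate exact updates of (w0, w1) and sigma^2 for the objective
    sum_i log N(y_i | w0 + w1*x_i, sigma^2) - (C/2) * w1^2."""
    X = np.column_stack([np.ones_like(x), x])
    penalty = np.diag([0.0, C])  # only the slope w1 is penalized
    sigma2 = 1.0
    for _ in range(n_iter):
        # Weights maximizing the objective for the current sigma^2.
        w = np.linalg.solve(X.T @ X / sigma2 + penalty, X.T @ y / sigma2)
        # sigma^2 maximizing the objective for the current weights.
        sigma2 = float(np.mean((y - X @ w) ** 2))
    return w, sigma2

w_ml, sigma2_ml = fit_penalized_line(x, y, C=0.0)
w_reg, sigma2_reg = fit_penalized_line(x, y, C=1e6)

print("ML:         w0=%.3f w1=%.3f sigma^2=%.3f" % (w_ml[0], w_ml[1], sigma2_ml))
print("large C:    w0=%.3f w1=%.3f sigma^2=%.3f" % (w_reg[0], w_reg[1], sigma2_reg))
```

Under these assumptions the slope is driven toward zero for large C, so the fitted line becomes nearly horizontal at the mean of y, and the estimated variance grows because the residuals are no longer explained by x.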
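3. Consider a regression problem on points $x = [x_1, x_2]^T$ in a two-dimensional input space, where $x_j \in [-1, 1]$, $j = 1, 2$, lies within the unit square.
Training and test samples are uniformly distributed in the unit square, and the outputs are generated by
$y \sim N(x_1^3 x_2^5 - 10 x_1 x_2 + 7 x_1^2 + 5 x_2 - 3,\ 1)$.
We learn the relationship between x and y with linear regression models on polynomial features of order 1 to 10 (a model with higher-order features includes all lower-order features), using squared-error loss.
(1) With $n = 20$ samples, we train models with 1st-, 2nd-, 8th-, and 10th-order features and then test them on a large independent test set. Mark the appropriate models in the three columns below (more than one choice may apply), and explain why the model you chose in the third column has a small test error. (10 points)
Columns: lowest training error, highest training error, lowest test error
Linear model with 1st-order features: X
Linear model with 2nd-order features: X
Linear model with 8th-order features: X
Linear model with 10th-order features: X
(2) With $n = 10^6$ samples, we train models with 1st-, 2nd-, 8th-, and 10th-order features and then test them on a large independent test set. Mark the appropriate models in the three columns below (more than one choice may apply), and explain why the model you chose in the third column has a small test error. (10 points)
Columns: lowest training error, highest training error, lowest test error
Linear model with 1st-order features: X
Linear model with 2nd-order features:
Linear model with 8th-order features: X X
Linear model with 10th-order features: X
(3) The approximation error of a polynomial regression model depends on the number of training points. (T)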
(4) The structural error of a polynomial regression model depends on the number of training points. (F)
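The small-sample case of problem 3 can be explored with a short simulation. This is a sketch under assumptions: the signs of the terms in the target mean, the seed, the test-set size, and the helper names are all illustrative. With 20 samples, the 8th- and 10th-order models have 45 and 66 features, so least squares can interpolate the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_mean(X):
    # Signs of the terms are an assumption of this sketch.
    x1, x2 = X[:, 0], X[:, 1]
    return x1**3 * x2**5 - 10 * x1 * x2 + 7 * x1**2 + 5 * x2 - 3

def sample(n):
    """Inputs uniform on [-1, 1]^2, outputs with unit-variance Gaussian noise."""
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    return X, target_mean(X) + rng.normal(size=n)

def poly_features(X, order):
    """All monomials x1^i * x2^j with i + j <= order (includes the constant)."""
    cols = [X[:, 0]**i * X[:, 1]**j
            for i in range(order + 1) for j in range(order + 1 - i)]
    return np.column_stack(cols)

X_tr, y_tr = sample(20)
X_te, y_te = sample(20_000)

train_mse, heldout_mse = {}, {}
for order in (1, 2, 8, 10):
    Phi_tr = poly_features(X_tr, order)
    # lstsq returns the minimum-norm solution when there are more features than samples.
    w, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    train_mse[order] = float(np.mean((y_tr - Phi_tr @ w) ** 2))
    heldout_mse[order] = float(np.mean((y_te - poly_features(X_te, order) @ w) ** 2))
    print(f"order {order:2d}: train MSE = {train_mse[order]:.3g}, "
          f"test MSE = {heldout_mse[order]:.3g}")
```

Because the feature sets are nested, the training error can only go down as the order increases; which model has the lowest test error depends on the trade-off between the bias of low-order models and the variance of high-order ones at this sample size.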
4. We are trying to learn regression parameters for a dataset which we know was generated from a polynomial of a certain degree, but we do not know what this degree is. Assume the data was actually generated from a polynomial of degree 5 with some added Gaussian noise, that is,
$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5 + \epsilon,\quad \epsilon \sim N(0, 1).$
For training we have 100 {x,y} pairs and for testing we are using an additional set of 100 {x,y} pairs. Since we do not know the degree of the polynomial we learn two models from the data. Model A learns parameters for a polynomial of degree 4 and model B learns parameters for a polynomial of degree 6. Which of these two models is likely to fit the test data better?
Answer: The degree-6 polynomial. Since the true model is a degree-5 polynomial and we have enough training data, the degree-6 model will likely learn a very small coefficient for $x^6$. Thus, even though it is a degree-6 polynomial, it will behave very much like a degree-5 polynomial, which is the correct model, leading to a better fit to the test data. The degree-4 model, by contrast, cannot represent the $x^5$ term at all.
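The argument in the answer can be checked with a small experiment. The coefficients, input range, and seed below are hypothetical choices for illustration; only the degrees (5 for the data, 4 and 6 for the models) and the 100/100 train/test split come from the problem.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

# Hypothetical true coefficients w0..w5 (ascending powers of x).
w_true = np.array([1.0, -2.0, 0.5, 1.0, -0.5, 1.0])

def sample(n):
    """Draw n (x, y) pairs from the degree-5 polynomial plus N(0, 1) noise."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = P.polyval(x, w_true) + rng.normal(size=n)
    return x, y

x_tr, y_tr = sample(100)
x_te, y_te = sample(100)

def heldout_mse(degree):
    """Fit a polynomial of the given degree on the training set, score on the test set."""
    coefs = P.polyfit(x_tr, y_tr, degree)
    pred = P.polyval(x_te, coefs)
    return float(np.mean((y_te - pred) ** 2)), coefs

mse_a, coefs_a = heldout_mse(4)  # model A
mse_b, coefs_b = heldout_mse(6)  # model B

print(f"model A (degree 4): test MSE = {mse_a:.3f}")
print(f"model B (degree 6): test MSE = {mse_b:.3f}, x^6 coefficient = {coefs_b[6]:.3f}")
```

In this setup model B pays only a small variance cost for its extra parameter, whereas model A carries a systematic bias from the missing $x^5$ term.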
5. Input-dependent noise in regression
Ordinary least-squares regression is equivalent to assuming that each data point is generated according to a linear function of the input plus zero-mean, constant-variance Gaussian noise. In many systems, however, the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e., x >= 0).
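The noise model described above can be simulated directly. The sketch below assumes the simplest case, in which the noise variance is $\sigma^2 x$; the true coefficients, sample size, and input range are hypothetical. It fits both ordinary least squares and least squares weighted by $1/x$, the latter being the maximum-likelihood fit for the weights under this assumed noise model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth and a large sample to approximate the infinite-data limit.
n = 200_000
w0_true, w1_true, sigma2 = 1.0, 2.0, 1.0
x = rng.uniform(0.1, 5.0, size=n)  # strictly positive inputs
y = w0_true + w1_true * x + rng.normal(size=n) * np.sqrt(sigma2 * x)

X = np.column_stack([np.ones(n), x])

# Ordinary least squares: ignores the input-dependent variance.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: rescale each row by 1/sqrt(x_i) so the noise has constant variance.
scale = 1.0 / np.sqrt(x)
w_wls, *_ = np.linalg.lstsq(X * scale[:, None], y * scale, rcond=None)

print("OLS estimate:", w_ols)
print("WLS estimate:", w_wls)
```

With this much data both estimators land close to the true coefficients; they differ in efficiency on finite samples rather than in what they converge to.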
a) Which of the following families of probability models correctly describes this situation in the
univariate case? (Hint: only one of them does.)
(iii) is correct. In a Gaussian distribution over y, the variance is determined by the coefficient of $y^2$; so by replacing $\sigma^2$ by $x\sigma^2$, we get a variance that increases linearly with x. (Note also the change to the normalization "constant.") (i) has quadratic dependence on x; (ii) does not change the variance at all, it just renames $w_1$.
b) Circle the plots in Figure 1 that could plausibly have been generated by some instance of the model family(ies) you chose.
(ii) and (iii). (Note that (iii) works for $\sigma^2 = 0$.) (i) exhibits a large variance at x = 0, and the variance appears independent of x.
c) True/False: Regression with input-dependent noise gives the same solution as ordinary regression for an infinite data set generated according to the corresponding model.
True. In both cases the algorithm will recover the true underlying model.
d) For the model you chose in part (a), write down the derivative of the negative log likelihood with respect to $w_1$.
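A worked derivation for part d), offered as a sketch. It assumes that the family chosen in part (a) is a Gaussian with mean $w_0 + w_1 x$ and variance $x\sigma^2$, as the answer to part (a) describes; the exact form of the density below is reconstructed from that description and the sample index $i$ is notation introduced here.

$$P(y \mid x) = \frac{1}{\sigma\sqrt{2\pi x}} \exp\!\left(-\frac{\bigl(y - (w_0 + w_1 x)\bigr)^2}{2 x \sigma^2}\right)$$

$$-\log L = \sum_i \left[ \tfrac{1}{2}\log\bigl(2\pi x_i \sigma^2\bigr) + \frac{\bigl(y_i - (w_0 + w_1 x_i)\bigr)^2}{2 x_i \sigma^2} \right]$$

$$\frac{\partial(-\log L)}{\partial w_1} = -\sum_i \frac{x_i \bigl(y_i - (w_0 + w_1 x_i)\bigr)}{x_i \sigma^2} = -\frac{1}{\sigma^2} \sum_i \bigl(y_i - (w_0 + w_1 x_i)\bigr)$$

Under this assumption the factor $x_i$ from the chain rule cancels the $x_i$ in the variance, so the gradient with respect to $w_1$ is an unweighted sum of residuals.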
II. Classification
1. Generative models vs. discriminative models
(a) [ points] Your billionaire friend needs your help. She needs to classify job
applications into good/bad categories, and also to detect job applicants who lie in their applications using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or generative classifier? Why? [final_sol_s07]
Generative model, because the density $p(x \mid y)$ has to be estimated.
(b) [ points] Your billionaire friend also wants to classify software applications to detect bug-prone applications using features of the source code. This pilot project only has a few applications to be used as training data, though. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
Discriminative model.
With few training samples, classifying directly with a discriminative model usually works better.
(d) [ points] Finally, your billionaire friend also wants to classify companies to decide which one to acquire. This project has lots of training data based on several decades of research. To create the most accurate classifier, do you recommend using a discriminative or generative classifier? Why?
Generative model.
With a large number of samples, the correct generative model can be learned.
2. Logistic regression