Machine Learning Question Bank (3)


Figure 2: Log-probability of labels as a function of regularization parameter C

Here we use a logistic regression model to solve a classification problem. In Figure 2, we have plotted the mean log-probability of labels in the training and test sets after having trained the classifier with quadratic regularization penalty and different values of the regularization parameter C.

(1) In training a logistic regression model by maximizing the likelihood of the labels given the inputs, we have multiple locally optimal solutions. (F)

Answer: The log-probability of labels given examples under the logistic regression model is a concave function of the weights. The (only) locally optimal solution is therefore also globally optimal.
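This concavity is easy to check numerically. The sketch below uses a small made-up 1-D dataset (not the exam's data) and verifies that the log-likelihood curve has non-positive discrete second differences along a grid of weight values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy 1-D dataset: (input x, label y) with y in {-1, +1}
data = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1)]

def log_likelihood(w):
    # log-probability of the labels under a no-bias 1-D logistic model
    return sum(math.log(sigmoid(y * w * x)) for x, y in data)

# evaluate on a grid; a concave function has non-positive second differences
ws = [i * 0.1 for i in range(-50, 51)]
vals = [log_likelihood(w) for w in ws]
second_diffs = [vals[i-1] - 2 * vals[i] + vals[i+1] for i in range(1, len(vals) - 1)]
print(all(d <= 1e-9 for d in second_diffs))  # prints: True
```

The same check fails for non-concave objectives (e.g. neural-network losses), which is exactly why those models do have multiple local optima.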

(2) A stochastic gradient algorithm for training logistic regression models with a fixed learning rate will find the optimal setting of the weights exactly. (F)

Answer: A fixed learning rate means that we always take a finite step towards improving the log-probability of whichever single training example appears in the update equation. Unless the examples are somehow "aligned", we will keep jumping from one side of the optimal solution to the other and will not get arbitrarily close to it. The learning rate has to approach zero over the course of the updates for the weights to converge.
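This behaviour can be reproduced with a minimal sketch. For verifiability it uses two simple per-example quadratic losses (a stand-in for per-example log-probabilities; the dataset and rates are our own choices): a fixed rate settles into a cycle around the optimum, while a 1/t rate converges to it.

```python
targets = [0.0, 2.0]   # per-example optima of the losses (w - a)^2 / 2; overall optimum is 1.0

def sgd(rate_fn, steps=1000):
    """Minimize the average of (w - a)^2 / 2 by cycling through the examples."""
    w = 5.0
    for t in range(1, steps + 1):
        a = targets[t % 2]
        w -= rate_fn(t) * (w - a)          # gradient of (w - a)^2 / 2 is (w - a)
    return w

w_fixed = sgd(lambda t: 0.5)               # fixed rate: ends in a 2-cycle (2/3 <-> 4/3)
w_decay = sgd(lambda t: 1.0 / t)           # rate 1/t: w becomes the running mean -> 1.0
print(round(abs(w_fixed - 1.0), 3), round(abs(w_decay - 1.0), 3))  # prints: 0.333 0.0
```

With the 1/t schedule the iterate is exactly the running average of the per-example optima seen so far, which is the classic Robbins–Monro behaviour the answer alludes to.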

(3) The average log-probability of training labels as in Figure 2 can never increase as we increase C. (T)

Stronger regularization means more constraints on the solution and thus the (average) log-probability of the training examples can only get worse.
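The claim can be illustrated by actually fitting the model over a range of C values. The sketch below uses an invented 1-D dataset and plain gradient ascent on the quadratically penalized log-likelihood; the training mean log-probability never increases as C grows:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy training set: (input x, label y) with y in {-1, +1}
data = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1)]

def fit(C, iters=5000, lr=0.05):
    """Maximize sum log P(y|x,w) - C*w^2/2 by gradient ascent (concave objective)."""
    w = 0.0
    for _ in range(iters):
        grad = sum(y * x * sigmoid(-y * w * x) for x, y in data) - C * w
        w += lr * grad
    return w

def mean_log_prob(w):
    return sum(math.log(sigmoid(y * w * x)) for x, y in data) / len(data)

Cs = [0.0, 0.1, 0.5, 1.0, 2.0, 5.0]
lps = [mean_log_prob(fit(C)) for C in Cs]
print(all(lps[i+1] <= lps[i] + 1e-9 for i in range(len(lps) - 1)))  # prints: True
```

The sweep mirrors the training curve in Figure 2: larger C means a more constrained optimum, hence a training log-probability that can only get worse.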

(4) Explain why in Figure 2 the test log-probability of labels decreases for large values of C.

Answer: As C increases, we give more weight to constraining the predictor, and thus less flexibility to fitting the training set. The increased regularization guarantees that the test performance gets closer to the training performance, but as we over-constrain the allowed predictors we can no longer fit the training set at all; although the test performance is then very close to the training performance, both are low.

(5) The log-probability of labels in the test set would decrease for large values of C even if we had a large number of training examples. (T)

The above argument still holds, but the value of C at which we would observe such a decrease scales up with the number of examples.

(6) Adding a quadratic regularization penalty for the parameters when estimating a logistic regression model ensures that some of the parameters (weights associated with the components of the input vectors) vanish. (F)

A regularization penalty that performs feature selection must have a non-zero derivative at zero. Otherwise the regularization has no effect at zero, and the weights will tend to be slightly non-zero even when this does not improve the log-probability by much.
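The difference can be demonstrated directly. In the sketch below (an invented, nearly uninformative feature), gradient ascent with a quadratic penalty leaves a small but non-zero weight, while proximal soft-thresholding steps for the one-norm penalty drive the weight exactly to zero:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# the feature is almost uninformative: it barely helps the log-likelihood
data = [(1.0, 1), (-1.0, -1), (0.8, -1), (-0.8, 1)]

def dll(w):  # derivative of the log-likelihood
    return sum(y * x * sigmoid(-y * w * x) for x, y in data)

def fit_l2(C, iters=5000, lr=0.1):
    """Gradient ascent on  sum log P(y|x,w) - C*w^2/2."""
    w = 0.5
    for _ in range(iters):
        w += lr * (dll(w) - C * w)
    return w

def fit_l1(C, iters=5000, lr=0.1):
    """Proximal gradient on  sum log P(y|x,w) - C*|w|."""
    w = 0.5
    for _ in range(iters):
        w += lr * dll(w)
        w = math.copysign(max(abs(w) - lr * C, 0.0), w)  # soft-threshold step
    return w

print(fit_l2(1.0) != 0.0, fit_l1(1.0) == 0.0)  # prints: True True
```

The quadratic penalty's derivative vanishes at zero, so the tiny likelihood gain always wins and the weight stays slightly non-zero; the one-norm penalty's constant-magnitude subgradient overwhelms that gain and zeroes the weight exactly.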

3. Regularized Logistic Regression

In this problem we refer to the binary classification task depicted in Figure 1(a), which we attempt to solve with the simple linear logistic regression model

P(y = 1 | x, w) = g(w1 x1 + w2 x2), where g(z) = 1 / (1 + e^(-z))

(for simplicity we do not use the bias parameter w0). The training data can be separated with zero training error; see line L1 in Figure 1(b) for instance.
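As a small sketch of this model (the function and argument names are our own, matching the no-bias form above):

```python
import math

def p_label(y, x, w):
    """P(y | x, w) for the linear logistic model without a bias term;
    x and w are 2-vectors, y is +1 or -1."""
    z = w[0] * x[0] + w[1] * x[1]
    return 1.0 / (1.0 + math.exp(-y * z))

# the two possible labels of any example always have probabilities summing to 1
x, w = (0.5, -1.0), (2.0, 1.0)
print(abs(p_label(1, x, w) + p_label(-1, x, w) - 1.0) < 1e-12)  # prints: True
```

Writing the probability as g(y·(w·x)) for y in {+1, -1} is the standard trick that makes the two-class case symmetric.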

(1) Consider a regularization approach where we try to maximize

sum_i log P(y_i | x_i, w) - (C/2) w2^2

for large C. Note that only w2 is penalized. We would like to know which of the four lines in Figure 1(b) could arise as a result of such regularization. For each potential line L2, L3 or L4, determine whether it can result from regularizing w2. If not, explain very briefly why not.

L2: No. When we regularize w2, the resulting boundary can rely less on the value of x2 and therefore becomes more vertical. L2 here seems to be more horizontal than the unregularized solution, so it cannot come as a result of penalizing w2.

Figure 1: (a) The 2-dimensional data set used in Problem 2. (b) The points can be separated by L1 (solid line); possible other decision boundaries are shown by L2, L3, L4.

L3: Yes. Here w2^2 is small relative to w1^2 (as evidenced by high slope), and even though it would assign a rather low log-probability to the observed labels, it could be forced by a large regularization parameter C.

L4: No. For very large C, we get a boundary that is entirely vertical (the line x1 = 0, i.e. the x2 axis). L4 here is reflected across the x2 axis and represents a poorer solution than its counterpart on the other side. For moderate regularization we have to find the best solution we can while keeping w2 small; L4 is not the best and thus cannot come as a result of regularizing w2.

(2) If we change the form of regularization to the one-norm (absolute value) and also regularize w1, we get the following penalized log-likelihood:

sum_i log P(y_i | x_i, w) - C (|w1| + |w2|)

Consider again the problem in Figure 1(a) and the same linear logistic regression model. As we increase the regularization parameter C which of the following scenarios do you expect to observe (choose only one):

( x ) First w1 will become 0, then w2.
( ) w1 and w2 will become zero simultaneously.
( ) First w2 will become 0, then w1.
( ) None of the weights will become exactly zero, only smaller as C increases.

The data can be classified with zero training error, and therefore also with high log-probability, by looking at the value of x2 alone, i.e. making w1 = 0. Initially we might prefer a non-zero value for w1, but it will go to zero rather quickly as we increase the regularization. Note that we pay a regularization penalty for a non-zero value of w1, and if it doesn't help classification, why would we pay the penalty? The absolute-value regularization ensures that w1 will indeed go to exactly zero. As C increases further, even w2 will eventually become zero: we pay a higher and higher cost for setting w2 to a non-zero value, and eventually this cost overwhelms the gain in the log-probability of labels achievable with a non-zero w2. Note that when w1 = w2 = 0, the log-probability of labels is still the finite value n log(0.5).
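The predicted order can be checked on a stand-in dataset (Figure 1's actual points are not reproduced here, so the coordinates below are invented: x2 alone separates the classes and x1 is only weakly informative). Coordinate-wise proximal gradient on the one-norm-penalized log-likelihood shows w1 vanishing already at a moderate C, and both weights vanishing for large C:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical stand-in for Figure 1(a): x2 separates the classes on its own
data = [((0.5, 1.0), 1), ((0.2, 1.2), 1), ((-0.4, -1.0), -1), ((-0.1, -1.1), -1)]

def fit_l1(C, iters=5000, lr=0.1):
    """Coordinate-wise proximal gradient on  sum log P(y|x,w) - C(|w1|+|w2|)."""
    w = [0.5, 0.5]
    for _ in range(iters):
        for j in range(2):
            g = sum(y * x[j] * sigmoid(-y * (w[0] * x[0] + w[1] * x[1]))
                    for x, y in data)
            wj = w[j] + lr * g
            w[j] = math.copysign(max(abs(wj) - lr * C, 0.0), wj)  # soft-threshold
    return w

w1, w2 = fit_l1(0.5)
print(w1 == 0.0, w2 > 0.0)     # moderate C: w1 is exactly zero, w2 survives
w1, w2 = fit_l1(5.0)
print(w1 == 0.0, w2 == 0.0)    # large C: both weights are driven to zero
```

The soft-thresholding step is what makes "exactly zero" attainable, matching the argument that the one-norm (unlike the quadratic penalty) actually performs feature selection.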

1. SVM

Figure 4: Training set, maximum margin linear separator, and the support vectors (in bold).

(1) What is the leave-one-out cross-validation error estimate for maximum margin separation in Figure 4? (we are asking for a number) (0)

Based on the figure we can see that removing any single point would not change the resulting maximum margin separator. Since all the points are initially classified correctly, the leave-one-out error is zero.

(2) We would expect the support vectors to remain the same in general as we move from a linear kernel to higher-order polynomial kernels. (F)

There are no guarantees that the support vectors remain the same. The feature vectors corresponding to polynomial kernels are non-linear functions of the original input vectors and thus the support points for maximum margin separation in the feature space can be quite different.

(3) Structural risk minimization is guaranteed to find the model (among those considered) with the lowest expected loss. (F)

We are guaranteed to find only the model with the lowest upper bound on the expected loss.

(4) What is the VC-dimension of a mixture of two Gaussians model in the plane with equal covariance matrices? Why?

A mixture of two Gaussians with equal covariance matrices has a linear decision boundary, and linear separators in the plane have VC-dimension exactly 3.

4. SVM

Classify the following data points:

(a) Plot these six training points. Are the classes {+, -} linearly separable? Yes.

(b) Construct the weight vector of the maximum margin hyperplane by inspection and identify the support vectors.

The maximum margin hyperplane should have a slope of -1 and should pass through the point (x1, x2) = (3/2, 0). Therefore its equation is x1 + x2 = 3/2, and the weight vector is (1, 1)^T.

(c) If you remove one of the support vectors does the size of the optimal margin decrease, stay the same, or increase?

In this specific dataset the optimal margin increases when we remove the support vectors (1, 0) or (1, 1), and stays the same when we remove the other two.

(d) (Extra Credit) Is your answer to (c) also true for any dataset? Provide a counterexample or give a short proof.

When we drop some constraints in a constrained maximization problem, we get an optimal value that is at least as good as the previous one. This is because the set of candidates satisfying the original (larger, stronger) set of constraints is a subset of the candidates satisfying the new (smaller, weaker) set of constraints. So, under the weaker constraints, the old optimal solution is still available, and there may be additional solutions that are even better. In mathematical form: if S is a subset of S', then max over x in S' of f(x) >= max over x in S of f(x).
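As a tiny numeric illustration of this relaxation principle (the objective and constraints here are invented for the example):

```python
# maximize f(x) = -(x - 3)^2 over integer candidates, first under two
# constraints, then with one constraint dropped: the optimum can only improve
def f(x):
    return -(x - 3) ** 2

candidates = range(-10, 11)
both = max(f(x) for x in candidates if x <= 1 and x % 2 == 0)  # best is x=0 -> -9
relaxed = max(f(x) for x in candidates if x <= 1)              # best is x=1 -> -4
print(both <= relaxed)  # prints: True
```

Dropping the evenness constraint enlarges the feasible set, so the maximum moves from -9 up to -4; it could also have stayed the same, but it can never get worse.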

Finally, note that in SVM problems we are maximizing the margin subject to the constraints given by the training points. When we drop any of the constraints, the margin can increase or stay the same, depending on the dataset. In general problems with realistic datasets, the margin is expected to increase when we drop support vectors. The data in this problem are constructed to demonstrate that, depending on the geometry, removing some constraints can leave the margin unchanged or increase it.

2. SVM

Classify the following set of three data points:

(a) Are the classes {+, -} linearly separable? No.

(b) Consider mapping each point to 3-D using the new feature vectors Phi(x) = (1, sqrt(2)*x, x^2). Are the classes now linearly separable? If so, find a separating hyperplane.

The points are mapped to (1, 0, 0), (1, -sqrt(2), 1) and (1, sqrt(2), 1) respectively, and are now separable in the 3-dimensional space. A separating hyperplane is given by the weight vector (0, 0, 1) in the new space, as seen in the figure.
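The separation can be verified in code. The 1-D labels below are an assumption consistent with the mapped points given in the text (x = -1 and x = +1 in one class, x = 0 in the other, which is indeed not separable on the line), and the offset -1/2 is our addition to make the decision strict:

```python
import math

# assumed 1-D training set: (input x, label y)
points = [(-1.0, 1), (0.0, -1), (1.0, 1)]

def phi(x):
    """Feature map Phi(x) = (1, sqrt(2)*x, x^2)."""
    return (1.0, math.sqrt(2.0) * x, x * x)

# hyperplane with normal w = (0, 0, 1) in the mapped space; the offset -1/2
# places the boundary strictly between the two mapped classes
w, b = (0.0, 0.0, 1.0), -0.5

def predict(x):
    z = phi(x)
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + b > 0 else -1

print(all(predict(x) == y for x, y in points))  # prints: True
```

The weight vector (0, 0, 1) simply reads off the x^2 coordinate, which is 0 for the middle point and 1 for the outer two, so a threshold anywhere in between separates the classes.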

