Stochastic gradient descent (SGD) is a simple but widely applicable optimization technique. For example, we can use it to train a Support Vector Machine. The objective function in this case is given by:

$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \text{Loss}_h\big(y^{(i)}\, \theta \cdot x^{(i)}\big),$$

where $\text{Loss}_h(z) = \max\{0,\, 1 - z\}$ is the hinge loss function, $(x^{(i)}, y^{(i)})$ for $i = 1, \ldots, n$ are the training examples, and $y^{(i)} \in \{-1, +1\}$ is the label for the vector $x^{(i)}$.

For simplicity, we ignore the offset parameter $\theta_0$ in all problems on this page.
The stochastic gradient update rule involves the gradient of $\text{Loss}_h(y\,\theta \cdot x)$ with respect to $\theta$.

(Hint: Recall that for a $d$-dimensional vector $\theta$, the gradient of $f(\theta)$ w.r.t. $\theta$ is $\big[\tfrac{\partial f}{\partial \theta_1}, \ldots, \tfrac{\partial f}{\partial \theta_d}\big]^T$.)

Find $\nabla_\theta \text{Loss}_h(y\,\theta \cdot x)$ in terms of $x$.

(Enter y for $y$ and x for the vector $x$. Use * for multiplication between scalars and vectors, or for dot products between vectors. Use 0 for the zero vector.)

For $y\,\theta \cdot x \leq 1$:
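To make the objective above concrete, here is a minimal NumPy sketch (the toy data and the helper names `hinge_loss` and `svm_objective` are illustrative, not part of the problem) that evaluates $J(\theta)$ as the average hinge loss over the training set:

```python
import numpy as np

def hinge_loss(z):
    # Loss_h(z) = max{0, 1 - z}
    return np.maximum(0.0, 1.0 - z)

def svm_objective(theta, X, y):
    # J(theta) = (1/n) * sum_i Loss_h(y^(i) * theta . x^(i)), offset theta_0 ignored
    margins = y * (X @ theta)          # y^(i) * theta . x^(i) for every example
    return hinge_loss(margins).mean()

# Toy data: 4 examples in R^2 with labels in {-1, +1} (illustrative only)
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5], [-2.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
theta = np.array([0.5, 0.5])

print(svm_objective(theta, X, y))
```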

proofread your posts - this one is missing vital information.

To find the gradient of the objective function with respect to $\theta$, we need to calculate the derivative of the hinge loss function.

The hinge loss function is defined as:

$\text{Loss}_h(y\,\theta \cdot x) = \max\{0,\, 1 - y\,\theta \cdot x\}$, where $y$ is the label for the vector $x$.

Now, let's differentiate $\text{Loss}_h(y\,\theta \cdot x)$ with respect to $\theta$.

Since $\text{Loss}_h(y\,\theta \cdot x) = \max\{0,\, 1 - y\,\theta \cdot x\}$, we have two regimes.

When $y\,\theta \cdot x > 1$, the loss is identically $0$, so its gradient with respect to $\theta$ is the zero vector.

When $y\,\theta \cdot x \leq 1$, the loss equals $1 - y\,\theta \cdot x$. Since the gradient of the dot product $\theta \cdot x$ with respect to $\theta$ is simply $x$, we can expand the gradient as follows:

$\nabla_\theta (1 - y\,\theta \cdot x) = -y\, \nabla_\theta (\theta \cdot x) = -y\,x$.

Therefore, the gradient of the hinge loss for a single training example with respect to $\theta$ is:

$\nabla_\theta \text{Loss}_h(y\,\theta \cdot x) = \begin{cases} 0 & \text{if } y\,\theta \cdot x > 1, \\ -y\,x & \text{if } y\,\theta \cdot x \leq 1. \end{cases}$

Thus, $\nabla_\theta \text{Loss}_h(y\,\theta \cdot x)$ in terms of $x$ is $0$ (the zero vector) for $y\,\theta \cdot x > 1$ and $-y * x$ for $y\,\theta \cdot x \leq 1$.
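As a quick sanity check on this result (not part of the original problem), here is a small Python sketch that compares the closed-form gradient above with a central finite-difference estimate at a point safely away from the kink $y\,\theta \cdot x = 1$; the function names are illustrative:

```python
import numpy as np

def hinge_loss(theta, x, y):
    # Loss_h(y * theta . x) = max{0, 1 - y * theta . x}
    return max(0.0, 1.0 - y * np.dot(theta, x))

def hinge_loss_grad(theta, x, y):
    # Closed form derived above: -y * x when y * theta . x <= 1, else the zero vector
    if y * np.dot(theta, x) <= 1.0:
        return -y * x
    return np.zeros_like(theta)

# Deterministic test point with y * theta . x = -1.5 < 1, away from the kink
theta = np.array([1.0, -2.0])
x = np.array([0.5, 1.0])
y = 1.0

eps = 1e-6
numeric = np.array([
    (hinge_loss(theta + eps * e, x, y) - hinge_loss(theta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(len(theta))
])
print(numeric)                       # finite-difference estimate of the gradient
print(hinge_loss_grad(theta, x, y))  # closed-form gradient: -y * x = [-0.5, -1.0]
```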

To find the gradient of the objective function with respect to $\theta$, we need to compute the derivative of the hinge loss function with respect to $\theta$ for each training example $(x^{(i)}, y^{(i)})$.

The hinge loss function is defined as:

$\text{Loss}_h(y\,\theta \cdot x) = \max\{0,\, 1 - y\,\theta \cdot x\},$

where $y$ is the label for the vector $x$.

To find the derivative of the hinge loss function with respect to $\theta$, we need to consider two cases: when $y\,\theta \cdot x > 1$ and when $y\,\theta \cdot x \leq 1$.

1. When $y\,\theta \cdot x > 1$ (i.e., the example is classified correctly with a margin larger than 1):

In this case the loss is $0$ and remains $0$ under small changes of $\theta$, so no update is needed. The derivative of the hinge loss function with respect to $\theta$ is then $0$ (the zero vector).

2. When $y\,\theta \cdot x \leq 1$ (i.e., the example is misclassified or lies inside the margin):

In this case the loss equals $1 - y\,\theta \cdot x$, and the gradient of $\theta \cdot x$ with respect to $\theta$ is $x$. The derivative of the hinge loss function with respect to $\theta$ is then $-y\,x$.
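For a concrete (made-up) illustration of the two cases: take $\theta = (1, 2)$ and $x = (0.5, 0.5)$, so $\theta \cdot x = 1.5$. If $y = +1$, then $y\,\theta \cdot x = 1.5 > 1$ and the gradient is the zero vector (case 1). If $y = -1$, then $y\,\theta \cdot x = -1.5 \leq 1$ and the gradient is $-y\,x = (0.5, 0.5)$ (case 2).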

Now, let's compute $\nabla_\theta \text{Loss}_h(y\,\theta \cdot x)$ in terms of $x$.

For a training example $(x, y)$:

If $y\,\theta \cdot x > 1$, then:

$\nabla_\theta \text{Loss}_h(y\,\theta \cdot x) = 0$ (the zero vector).

If $y\,\theta \cdot x \leq 1$, then:

$\nabla_\theta \text{Loss}_h(y\,\theta \cdot x) = -y * x.$

Therefore, the expression for $\nabla_\theta \text{Loss}_h(y\,\theta \cdot x)$ in terms of $x$ is:

$\nabla_\theta \text{Loss}_h(y\,\theta \cdot x) = \begin{cases} 0 & \text{if } y\,\theta \cdot x > 1, \\ -y\,x & \text{if } y\,\theta \cdot x \leq 1. \end{cases}$

Note: The expression above assumes that $y$ is a scalar constant and $x$ is the $d$-dimensional vector.

The gradient of $\text{Loss}_h(y\,\theta \cdot x)$ with respect to $\theta$ is:

$\nabla_\theta \text{Loss}_h(y\,\theta \cdot x) = \begin{cases} 0 & \text{if } y\,\theta \cdot x > 1, \\ -y\,x & \text{if } y\,\theta \cdot x \leq 1. \end{cases}$
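Putting the pieces together, here is a minimal sketch (assuming NumPy, a made-up toy dataset, and a hypothetical learning rate `eta`) of the resulting stochastic gradient update $\theta \leftarrow \theta - \eta\, \nabla_\theta \text{Loss}_h(y^{(i)}\, \theta \cdot x^{(i)})$ applied one example at a time:

```python
import numpy as np

def hinge_loss_grad(theta, x, y):
    # Gradient from above: -y * x if y * theta . x <= 1, otherwise the zero vector
    if y * np.dot(theta, x) <= 1.0:
        return -y * x
    return np.zeros_like(theta)

def sgd_svm(X, y, epochs=100, eta=0.1, seed=0):
    # Plain SGD on the average hinge loss (offset theta_0 ignored, as in the problem)
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # visit examples in random order
            theta -= eta * hinge_loss_grad(theta, X[i], y[i])
    return theta

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.5, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

theta = sgd_svm(X, y)
print(theta)
print(np.sign(X @ theta))   # should match the labels y
```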