Stochastic gradient descent (SGD) is a simple but widely applicable optimization technique. For example, we can use it to train a Support Vector Machine. The objective function in this case is given by:

J(θ) = (1/n) ∑_{i=1}^{n} Loss_h(y(i) θ⋅x(i)) + (λ/2) ∥θ∥²
where Loss_h(z) = max{0, 1−z} is the hinge loss function, (x(i), y(i)) for i = 1, …, n are the training examples, and y(i) ∈ {1, −1} is the label for the feature vector x(i).

For simplicity, we ignore the offset parameter θ_0 in all problems on this page.
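As a concrete reference for the objective above, here is a minimal NumPy sketch of how J(θ) could be computed (the names svm_objective, X, y, theta, and lam are illustrative, not part of the problem):

import numpy as np

def svm_objective(theta, X, y, lam):
    # Average hinge loss over the n examples: max(0, 1 - y(i) * theta . x(i))
    margins = y * (X @ theta)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    # Regularization term: (lambda / 2) * ||theta||^2
    reg = 0.5 * lam * np.dot(theta, theta)
    return hinge + reg

Here X is an n×d matrix whose rows are the x(i), y is the length-n vector of labels, and theta has no offset component, matching the convention above.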

1.) The stochastic gradient update rule involves the gradient ∇_θ Loss_h(y(i)θ⋅x(i)) of Loss_h(y(i)θ⋅x(i)) with respect to θ.

Hint: Recall that for a k-dimensional vector θ = [θ_1 θ_2 ⋯ θ_k]^T, the gradient of f(θ) with respect to θ is ∇_θ f(θ) = [∂f/∂θ_1 ∂f/∂θ_2 ⋯ ∂f/∂θ_k]^T.

Find ∇_θ Loss_h(yθ⋅x) in terms of x.

(Enter lambda for λ, y for y, and x for the vector x. Use * both for multiplication between a scalar and a vector and for dot products between vectors. Use 0 for the zero vector.)

For yθ⋅x ≤ 1:
For yθ⋅x > 1:

2.) Let θ be the current parameters. What is the stochastic gradient update rule, where η > 0 is the learning rate? (Choose all that apply.)

θ←

θ + η ∇_θ[Loss_h(y(i)θ⋅x(i))] + ηλθ   for random x(i) with label y(i)

θ − η ∇_θ[Loss_h(y(i)θ⋅x(i))] − ηλθ   for random x(i) with label y(i)

θ + η ∇_θ[Loss_h(y(i)θ⋅x(i))] + η ∇_θ[(λ/2)∥θ∥²]   for random x(i) with label y(i)

θ − η ∇_θ[Loss_h(y(i)θ⋅x(i))] − η ∇_θ[(λ/2)∥θ∥²]   for random x(i) with label y(i)

θ + η (1/n) ∑_{i=1}^{n} ∇_θ[Loss_h(y(i)θ⋅x(i))] + η ∇_θ[(λ/2)∥θ∥²]

θ − η (1/n) ∑_{i=1}^{n} ∇_θ[Loss_h(y(i)θ⋅x(i))] − η ∇_θ[(λ/2)∥θ∥²]

1.) The gradient of Loss_h(yθ⋅x) with respect to θ, denoted ∇_θ Loss_h(yθ⋅x), is piecewise:

For yθ⋅x ≤ 1:
∇_θ Loss_h(yθ⋅x) = −y * x

For yθ⋅x > 1:
∇_θ Loss_h(yθ⋅x) = 0

(Strictly, the hinge loss is not differentiable at yθ⋅x = 1; the convention here is to use the subgradient −y * x at that point, which is why the boundary is grouped with the first case.)
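A minimal NumPy sketch of this piecewise answer (the function name hinge_loss_grad is illustrative; the boundary yθ⋅x = 1 is grouped with the first case, as above):

import numpy as np

def hinge_loss_grad(theta, x, y):
    # (Sub)gradient of Loss_h(y * theta . x) = max(0, 1 - y * theta . x) w.r.t. theta
    if y * np.dot(theta, x) <= 1:
        return -y * x               # loss is active: gradient is -y * x
    return np.zeros_like(theta)     # loss is zero and flat: gradient is the zero vector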

2.) Since ∇_θ[(λ/2)∥θ∥²] = λθ, two of the listed choices describe the same, correct stochastic gradient update:

θ ← θ − η ∇_θ[Loss_h(y(i)θ⋅x(i))] − ηλθ   for random x(i) with label y(i)

and, equivalently,

θ ← θ − η ∇_θ[Loss_h(y(i)θ⋅x(i))] − η ∇_θ[(λ/2)∥θ∥²]   for random x(i) with label y(i)

The update steps against the gradient of the per-example regularized loss. The choices with + signs move in the direction of increasing loss, and the choices that sum over all n examples describe full (batch) gradient descent rather than stochastic gradient descent.
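Putting both answers together, here is a minimal sketch of the full SGD training loop for this SVM. It reuses the hinge_loss_grad sketch above; the function name sgd_svm, the fixed learning rate eta, and the hyperparameter defaults are illustrative assumptions, not part of the problem:

import numpy as np

def sgd_svm(X, y, lam=0.01, eta=0.1, num_epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                # no offset parameter, per the convention above
    for _ in range(num_epochs):
        for i in rng.permutation(n):   # visit the examples in random order
            # theta <- theta - eta * grad[Loss_h(y(i) theta . x(i))] - eta * lambda * theta
            theta = theta - eta * (hinge_loss_grad(theta, X[i], y[i]) + lam * theta)
    return theta

In practice the learning rate η is often decayed over iterations, but a fixed η keeps the sketch aligned with the update rule stated above.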