Let \ell _ n({\boldsymbol \beta }) be the likelihood function as a function of {\boldsymbol \beta } for a given \mathbb {X},\mathbf{Y}. Recall from Lecture 9 the gradient of a real-valued function f(\mathbf x), \mathbf x\in \mathbb {R}^ d.

We can use gradient descent to find a local minimum of the negative of the log-likelihood function. The gradient descent optimization algorithm, in general, is used to find a local minimum of a given function f(\mathbf x) starting from an initial point \mathbf x_0.

Let \ell _{n,1}({\boldsymbol \beta }) = -\ell _ n({\boldsymbol \beta }).

Given a starting point {\boldsymbol \beta }, repeat the following steps:

1. Compute the descent direction \Delta {\boldsymbol \beta }= -\nabla \ell _{n,1}({\boldsymbol \beta }).

2. Choose a step size t.

3. Update {\boldsymbol \beta }:= {\boldsymbol \beta }+ t \Delta {\boldsymbol \beta },

until a stopping criterion is satisfied.

The stopping criterion for gradient descent is usually of the form \left\| \nabla \ell _{n,1}({\boldsymbol \beta }) \right\| \le \epsilon for some very small \epsilon > 0 (equivalently, \left\| \nabla \ell _ n({\boldsymbol \beta }) \right\| \le \epsilon, since the two gradients differ only in sign).

The analysis of gradient descent and the choice of step size t in every iteration are beyond the scope of this class, but implementing this algorithm requires computing gradients of the function \ell _ n({\boldsymbol \beta }) at various points, as in Step 1 of the algorithm. Hence, the computational complexity of gradient descent boils down to the complexity of evaluating the gradient of \ell _ n({\boldsymbol \beta }).

Note: The above algorithm is a descent algorithm that minimizes a given function and finds a local minimum. This is why we used the sign flip \ell _{n,1}({\boldsymbol \beta }) = -\ell _ n({\boldsymbol \beta }). If one were to rewrite the algorithm without this conversion, we would maximize \ell _ n({\boldsymbol \beta }), and Step 1 of the algorithm would become \Delta {\boldsymbol \beta }= \nabla \ell _{n}({\boldsymbol \beta }). Such an algorithm is called a gradient ascent algorithm. It is more common in the optimization literature to use the descent version rather than the ascent version.
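To make the loop concrete, here is a minimal Python sketch of the descent iteration for a generic objective, given only a routine that evaluates \nabla \ell _{n,1}({\boldsymbol \beta }). The function name grad, the constant step size t, and the tolerance eps are illustrative choices and not part of the lecture.

```python
import numpy as np

def gradient_descent(grad, beta0, t=0.1, eps=1e-6, max_iter=10_000):
    """Minimize a function given its gradient `grad` (here, the gradient
    of ell_{n,1} = -ell_n), starting from the initial point beta0."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        g = grad(beta)
        if np.linalg.norm(g) <= eps:   # stopping criterion ||grad|| <= eps
            break
        beta = beta - t * g            # Step 3: beta := beta + t * (-grad)
    return beta
```

Gradient ascent on \ell _ n({\boldsymbol \beta }) would be the same loop with the update beta = beta + t * g, where g is now \nabla \ell _ n({\boldsymbol \beta }).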

One Step of Gradient Ascent for the Poisson GLM
Let {\boldsymbol \beta }\in \mathbb {R}^1 and let \ell _2({\boldsymbol \beta }) = \sum _{i=1}^2 \frac{Y_ i X_ i^ T {\boldsymbol \beta }- e^{X_ i^ T {\boldsymbol \beta }}}{\phi } + c, for some constant c. For the Poisson GLM, recall that \phi =1 and we have b(\theta ) = e^\theta.

What is \nabla \ell _2(\beta ) for any \beta?

Since \beta \in \mathbb{R}^1, the gradient \nabla \ell_2(\beta) is simply the derivative of \ell_2(\beta) with respect to \beta. Taking this derivative, we have:

\frac{d \ell_2(\beta)} {d \beta} = \frac{d} {d \beta} \left( \sum_{i=1}^2 \frac{Y_i X_i^T \beta - e^{X_i^T \beta}}{\phi} + c \right)

Since differentiation is linear and the constant c has derivative zero, we can move the derivative inside the summation:

= \sum_{i=1}^2 \frac{d} {d \beta} \left( \frac{Y_i X_i^T \beta - e^{X_i^T \beta}}{\phi} \right)

Now, let's compute the derivative of the first term inside the summation:

\frac{d} {d \beta} \left( \frac{Y_i X_i^T \beta}{\phi} \right) = \frac{Y_i X_i^T}{\phi}

Next, let's compute the derivative of the second term inside the summation:

\frac{d} {d \beta} \left( \frac{-e^{X_i^T \beta}}{\phi} \right) = \frac{-X_i e^{X_i^T \beta}}{\phi}

Now, combining the derivatives of the two terms, we have:

\frac{d \ell_2(\beta)} {d \beta} = \sum_{i=1}^2 \left( \frac{Y_i X_i^T}{\phi} - \frac{X_i e^{X_i^T \beta}}{\phi} \right)

Since \phi = 1 in the Poisson GLM, we can simplify the equation to:

\frac{d \ell_2(\beta)} {d \beta} = \sum_{i=1}^2 \left( Y_i X_i^T - X_i e^{X_i^T \beta} \right)

Therefore, \nabla \ell_2(\beta) = \sum_{i=1}^2 \left( Y_i X_i^T - X_i e^{X_i^T \beta} \right).
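As a quick numerical sanity check of this formula, and to illustrate the single gradient-ascent step in the problem title, here is a short Python sketch. The data values X_1, X_2, Y_1, Y_2, the starting point \beta = 0, and the step size t = 0.1 are made up for illustration and are not given in the problem.

```python
import numpy as np

# Illustrative data for the two observations (not given in the problem).
X = np.array([0.5, -1.0])   # X_1, X_2
Y = np.array([2.0, 1.0])    # Y_1, Y_2

def ell_2(beta):
    # Log-likelihood up to the constant c, with phi = 1.
    return np.sum(Y * X * beta - np.exp(X * beta))

def grad_ell_2(beta):
    # Derived gradient: sum_i (Y_i X_i - X_i e^{X_i beta}).
    return np.sum(Y * X - X * np.exp(X * beta))

beta, h = 0.0, 1e-6
finite_diff = (ell_2(beta + h) - ell_2(beta - h)) / (2 * h)
print(grad_ell_2(beta), finite_diff)      # the two values should agree closely

t = 0.1                                   # arbitrary step size
beta_next = beta + t * grad_ell_2(beta)   # one gradient-ascent step
print(beta_next)
```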