Consider a 2-layer feed-forward neural network that takes an input x ∈ R^2 and has two ReLU hidden units, as defined in the figure below. Note that the hidden units have no offset parameters in this problem.

5. (1)

The values of the weights in the hidden layer are set such that they result in the z1 and z2 "classifiers" shown in (x1, x2)-space in the figure below:

The z1 "classifier" with normal w1 = [w11, w21]^T is the line given by z1 = x⋅w1 = 0.
Similarly, the z2 "classifier" with normal w2 = [w12, w22]^T is the line given by z2 = x⋅w2 = 0.
The arrows labeled w1 and w2 point in the positive directions of the respective normal vectors.
The regions labeled I, II, III, IV are the four regions defined by these two lines, not including the boundaries.

Choose the region(s) in (x1,x2) space which are mapped into each of the following regions in (f1,f2)-space, the 2-dimensional space of hidden unit activations f(z1) and f(z2). (For example, for the second column below, choose the region(s) in (x1,x2) space which are mapped into the f1-axis in (f1,f2)-space.)

(Choose all that apply for each column.)

{(f1, f2): f1 > 0, f2 > 0}:
I --> True
II
III
IV
None of the above

f1-axis:
I --> True
II --> True
III
IV
None of the above

f2-axis:
I --> True
II
III
IV --> True
None of the above

the origin (f1, f2) = (0, 0):
I
II
III
IV
None of the above --> True

5. (2)
If we keep the hidden layer parameters above fixed but add and train additional hidden layers (applied after this layer) to further transform the data, could the resulting neural network solve this classification problem?
Yes
No

Suppose we stick to the 2-layer architecture but add many more ReLU hidden units, all of them without offset parameters. Would it be possible to train such a model to perfectly separate these points?

Note: Assume that no two data points lie on the same line through the origin.

Yes

No

5. (3)
Which of the following statements is correct?

The gradient calculated in the backpropagation algorithm consists of the partial derivatives of the loss function with respect to each network weight.

True

False
Initialization of the parameters is often important when training large feed-forward neural networks.

If weights in a neural network with sigmoid units are initialized to close to zero values, then during early stochastic gradient descent steps, the network represents a nearly linear function of the inputs.

True

False
On the other hand, if we randomly set all the weights to very large values, or don't scale them properly with the number of units in the layer below, then the sigmoid units would behave like sign units. Here, "behave like sign units" allows for shifting or rescaling of the sign function.

(Note that a sign unit is a unit with activation function f(z) = +1 if z > 0 and f(z) = -1 if z < 0. For the purpose of this question, it does not matter what f(0) is.)

True

False
If we use only sign units in a feedforward neural network, then the stochastic gradient descent update will

almost never change any of the weights

change the weights by large amounts at random
Stochastic gradient descent differs from (true) gradient descent by updating only one network weight during each gradient descent step.

True

False

5. (4)
There are many good reasons to use convolutional layers in CNNs as opposed to replacing them with fully connected layers. Please check T or F for each statement.

Since we apply the same convolutional filter throughout the image, we can learn to recognize the same feature wherever it appears.

True

False
A fully connected layer for a reasonably sized image would simply have too many parameters.

True

False
Grading Note: The intended answer was True, because it is a justification for using CNNs over fully connected layers, but in fact the fully connected network used in the MNIST project did have quite good accuracy and was trainable. Since the claim that such a layer would "simply have too many parameters" is debatable, full credit is given to all. (The intended answer will still show as the correct answer, but you will see the credit in your score.)

A fully connected layer can learn to recognize features anywhere in the image even if the features appeared preferentially in one location during training.

True

False

5. (1)

{(f1,f2):f1>0,f2>0}:
I --> True

f1-axis:
I --> True
II --> True

f2-axis:
I --> True
IV --> True

the origin (f1,f2)=(0,0):
None of the above --> True

5. (2)
If we keep the hidden layer parameters above fixed but add and train additional hidden layers (applied after this layer) to further transform the data, could the resulting neural network solve this classification problem?
Yes

Suppose we stick to the 2-layer architecture but add many more ReLU hidden units, all of them without offset parameters. Would it be possible to train such a model to perfectly separate these points?
Yes

5. (3)
The gradient calculated in the backpropagation algorithm consists of the partial derivatives of the loss function with respect to each network weight.
True

Initialization of the parameters is often important when training large feed-forward neural networks.
True

If weights in a neural network with sigmoid units are initialized to close to zero values, then during early stochastic gradient descent steps, the network represents a nearly linear function of the inputs.
True

On the other hand, if we randomly set all the weights to very large values, or don't scale them properly with the number of units in the layer below, then the sigmoid units would behave like sign units.
True

If we use only sign units in a feedforward neural network, then the stochastic gradient descent update will
almost never change any of the weights

Stochastic gradient descent differs from (true) gradient descent by updating only one network weight during each gradient descent step.
False

5. (4)

Since we apply the same convolutional filter throughout the image, we can learn to recognize the same feature wherever it appears.
True

A fully connected layer for a reasonably sized image would simply have too many parameters.
True

A fully connected layer can learn to recognize features anywhere in the image even if the features appeared preferentially in one location during training.
False

5. (1)

The regions in (x1, x2)-space mapped into each set in (f1, f2)-space follow from the sign pattern of (z1, z2). Each hidden unit applies a ReLU to its pre-activation, so f1 = max(0, z1) and f2 = max(0, z2): a region where both z1 and z2 are positive maps into the open quadrant {f1 > 0, f2 > 0}, a region where z1 > 0 and z2 < 0 maps onto the positive f1-axis, a region where z1 < 0 and z2 > 0 maps onto the positive f2-axis, and a region where both are negative collapses to the origin. Reading off the signs of z1 and z2 in regions I-IV of the figure gives the selections listed in the answer key above.
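
A minimal NumPy sketch of this mapping, assuming hypothetical weight vectors w1 and w2 (the actual values are fixed by the figure and are not reproduced here):

import numpy as np

# Hypothetical hidden-layer weights; the true values come from the figure.
w1 = np.array([1.0, -1.0])   # normal of the z1 "classifier" line
w2 = np.array([1.0, 1.0])    # normal of the z2 "classifier" line

def hidden_activations(x):
    """Map a point x in (x1, x2)-space to (f1, f2) = (ReLU(x.w1), ReLU(x.w2))."""
    z1, z2 = x @ w1, x @ w2
    return np.maximum(z1, 0.0), np.maximum(z2, 0.0)

# Points on different sides of the two lines land in different parts of
# (f1, f2)-space: the open quadrant, one of the axes, or the origin.
for x in [np.array([2.0, 0.0]), np.array([0.0, 2.0]),
          np.array([-2.0, 0.0]), np.array([0.0, -2.0])]:
    print(x, "->", hidden_activations(x))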

5. (2)
If we keep the hidden layer parameters fixed but add and train additional hidden layers to further transform the data, the resulting neural network could solve this classification problem, so the answer is Yes. Similarly, staying with the 2-layer architecture but adding many more ReLU hidden units (all without offset parameters) also makes perfect separation possible, using the stated assumption that no two data points lie on the same line through the origin.
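
A minimal PyTorch sketch of the second architecture, a single hidden layer of many ReLU units with no offsets (bias=False); the data tensors X and y below are hypothetical placeholders, not the points from the figure:

import torch
import torch.nn as nn

# Hidden units have no offset parameters, so bias=False in the hidden layer.
model = nn.Sequential(
    nn.Linear(2, 64, bias=False),  # many ReLU hidden units instead of just two
    nn.ReLU(),
    nn.Linear(64, 1),              # output layer producing a single logit
)

# Hypothetical 2-D points and binary labels standing in for the data set in the figure.
X = torch.randn(100, 2)
y = (X[:, 0] * X[:, 1] > 0).float().unsqueeze(1)

loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()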

5. (3)
The gradient calculated in the backpropagation algorithm consists of the partial derivatives of the loss function with respect to each network weight. This statement is True: backpropagation is an efficient application of the chain rule that produces ∂L/∂w for every weight w in the network.
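
A small PyTorch sketch illustrating this on an arbitrary toy network and input; after backward(), every parameter tensor carries a .grad holding the partial derivative of the loss with respect to that parameter:

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 1))
x = torch.randn(1, 2)
loss = (net(x) - 1.0).pow(2).mean()
loss.backward()  # backpropagation: the chain rule applied layer by layer

# Each parameter (weight or bias) now has the partial derivative of the loss w.r.t. it.
for name, p in net.named_parameters():
    print(name, p.grad.shape)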

Initialization of the parameters is often important when training large feed-forward neural networks. This statement is True.

If weights in a neural network with sigmoid units are initialized to close to zero values, then during early stochastic gradient descent steps, the network represents a nearly linear function of the inputs. This statement is True: near zero the sigmoid is approximately linear (σ(z) ≈ 1/2 + z/4), so a composition of layers with near-zero weights behaves approximately like a linear map of the inputs.

On the other hand, if we randomly set all the weights to very large values, or don't scale them properly with the number of units in the layer below, then the sigmoid units would behave like sign units. Here, "behave like sign units" allows for shifting or rescaling of the sign function. This statement is True: with very large weights the pre-activations are typically far from zero, where the sigmoid saturates near 0 or 1, so each unit effectively outputs a shifted and rescaled sign of its pre-activation.
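
A short NumPy check of both regimes (the scale factors 0.01 and 100 are arbitrary illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-1, 1, 5)

# Tiny weights: pre-activations stay near 0, where sigmoid(z) ~ 0.5 + z/4 (nearly linear).
small = sigmoid(0.01 * z)
print(np.max(np.abs(small - (0.5 + 0.01 * z / 4))))  # very close to 0

# Huge weights: pre-activations are far from 0, so the unit saturates like a rescaled sign.
large = sigmoid(100.0 * z)
print(large)  # approximately 0 for negative z and 1 for positive z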

If we use only sign units in a feedforward neural network, then the stochastic gradient descent update will almost never change any of the weights: the sign function is flat everywhere except at zero, so the gradient of the loss with respect to the weights is zero almost everywhere and the SGD step leaves the weights unchanged.
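
A quick PyTorch illustration of this; torch.sign stands in here for the hard sign activation, and its backward pass is zero, matching the argument above:

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)
out = torch.sign(w @ x)   # a single "sign unit"
loss = (out - 1.0).pow(2)
loss.backward()
print(w.grad)             # all zeros: an SGD step would not move the weights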

Stochastic gradient descent differs from (true) gradient descent by updating only one network weight during each gradient descent step. This statement is False: SGD updates all of the weights at every step; what distinguishes it from (true) gradient descent is that each step uses the gradient computed on a single training example (or a small mini-batch) rather than on the full training set.
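
A minimal NumPy sketch of the distinction on a least-squares problem; note that both gradients have one entry per weight, and a single-example SGD step still changes every component of w:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

w = np.zeros(3)
lr = 0.05

# (True) gradient descent: one step uses the gradient averaged over the whole data set.
full_grad = -2 * X.T @ (y - X @ w) / len(X)

# Stochastic gradient descent: one step uses the gradient from a single example,
# but it still updates every component of w, not just one weight.
i = 0
sgd_grad = -2 * X[i] * (y[i] - X[i] @ w)
w = w - lr * sgd_grad
print(full_grad.shape, sgd_grad.shape, w)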

5. (4)
There are many good reasons to use convolutional layers in CNNs as opposed to replacing them with fully connected layers:

Since we apply the same convolutional filter throughout the image, we can learn to recognize the same feature wherever it appears. This statement is True: weight sharing means the same filter parameters are applied at every spatial location, so a feature learned in one place is detected everywhere.

A fully connected layer for a reasonably sized image would simply have too many parameters. This statement is True.
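
A rough parameter-count comparison in PyTorch for a 28x28 grayscale image; the layer sizes are arbitrary illustrative choices, not the layers from the MNIST project:

import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3)  # 32 shared 3x3 filters
fc = nn.Linear(28 * 28, 32 * 26 * 26)  # fully connected layer with a comparable output size

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(conv))  # 32*1*3*3 + 32 = 320
print(n_params(fc))    # 784*21632 + 21632 = 16,981,120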

A fully connected layer can learn to recognize features anywhere in the image even if the features appeared preferentially in one location during training. This statement is False: without weight sharing, the weights tied to locations where the feature never appeared during training receive no training signal, so the layer does not generalize to those locations.
