Calculate gradients for a neural network with one hidden layer

Chenxiao Ma | February 9, 2018

I have always struggled with calculating gradients in back propagation, especially when matrices are involved. Here I will talk about how I solved the question below as an exercise. It is question 2(c) in Assignment 1 of CS224n, Winter 2017 from Stanford. I found it mentally quite difficult to do back propagation directly in terms of vectors, so I had to calculate the partial derivatives with respect to an element of some vector and then vectorize the result, which means I would first calculate $\frac{\partial J}{\partial x_i}$, and then derive $\frac{\partial J}{\partial x}$. I'm not sure this is the right way to do it, but I got the same results as the official solution. I hope it helps you!

Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$, where $J$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)

Recall that the forward propagation is as follows:

$$h = \sigma(x W_1 + b_1)$$

$$\hat{y} = \operatorname{softmax}(h W_2 + b_2)$$

Note that here we’re assuming that the input vector (and thus the hidden variables and output probabilities) is a row vector, to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we apply it to each element of that vector. $W_1, W_2$ and $b_1, b_2$ are the weights and biases, respectively, of the two layers.
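To make the shapes concrete, here is a minimal numpy sketch of this forward pass. The dimension names `Dx`, `H`, `Dy` and the random parameters are my own assumptions for illustration; they are not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical sizes: input Dx, hidden H, output Dy.
Dx, H, Dy = 10, 5, 3
rng = np.random.RandomState(42)

x = rng.randn(1, Dx)                        # input, a row vector
W1, b1 = rng.randn(Dx, H), rng.randn(1, H)
W2, b2 = rng.randn(H, Dy), rng.randn(1, Dy)

# Forward propagation, keeping every intermediate as a row vector.
z1 = x.dot(W1) + b1                         # (1, H)
h = sigmoid(z1)                             # (1, H)
z2 = h.dot(W2) + b2                         # (1, Dy)
y_hat = softmax(z2)                         # (1, Dy)
```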

Let $z_2 = h W_2 + b_2$, then $\hat{y} = \operatorname{softmax}(z_2)$, and $J = CE(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$.

We already know that $\frac{\partial J}{\partial z_2} = \hat{y} - y$, which means $\frac{\partial J}{\partial (z_2)_j} = \hat{y}_j - y_j$.
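As a sanity check, this can be verified numerically. Continuing with the variables from the sketch above, and using a small finite-difference helper and an arbitrary one-hot label that I made up for illustration:

```python
# An arbitrary one-hot label, and the cross-entropy cost for it.
y = np.eye(Dy)[[1]]                         # (1, Dy)
J = -np.sum(y * np.log(y_hat))

def numerical_grad(f, a, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to the array a.
    f is a zero-argument callable that depends on a; a is perturbed in place
    and restored before returning."""
    grad = np.zeros_like(a)
    it = np.nditer(a, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = a[idx]
        a[idx] = old + eps
        plus = f()
        a[idx] = old - eps
        minus = f()
        a[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    return grad

# The numerical gradient of J with respect to z2 should match y_hat - y.
cost_from_z2 = lambda: -np.sum(y * np.log(softmax(z2)))
print(np.allclose(numerical_grad(cost_from_z2, z2), y_hat - y, atol=1e-6))  # True
```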

Now we go back.

Let $z_1 = x W_1 + b_1$, so that $h = \sigma(z_1)$. Element by element,

$$\frac{\partial J}{\partial h_i} = \sum_j \frac{\partial J}{\partial (z_2)_j} \frac{\partial (z_2)_j}{\partial h_i} = \sum_j (\hat{y}_j - y_j)\,(W_2)_{ij},$$

which vectorizes to $\frac{\partial J}{\partial h} = (\hat{y} - y)\,W_2^\top$.
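This intermediate result can be checked the same way, continuing with the sketch and the `numerical_grad` helper above (again, my own sanity check, not part of the assignment):

```python
# Vectorized dJ/dh from the element-wise derivation above.
dJ_dh = (y_hat - y).dot(W2.T)               # (1, H)

# Recompute the cost from h so the numerical gradient treats h as the variable.
cost_from_h = lambda: -np.sum(y * np.log(softmax(h.dot(W2) + b2)))
print(np.allclose(dJ_dh, numerical_grad(cost_from_h, h), atol=1e-6))  # True
```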

I then tried to do the calculation with matrices:

$$\frac{\partial J}{\partial z_1} = \frac{\partial J}{\partial h}\,\frac{\partial h}{\partial z_1} = \big((\hat{y} - y)\,W_2^\top\big)\,\sigma'(z_1)$$

However, I can't multiply them together, because $\frac{\partial J}{\partial h}$ is a row vector of length $H$ (the hidden size), $\sigma'(z_1)$ is also a row vector of length $H$, and $\frac{\partial J}{\partial z_1}$ should be a row vector of length $H$ as well; there is no way to matrix-multiply two row vectors of the same shape and get a row vector back. Now it would be tempting (and even correct) to multiply them element-wise, but it does not make sense! Can someone please explain this to me?

Updated on Feb 23, 2018

Hats off to Rex Ying! It is so nice of you to explain this to me! It turns out that $\frac{\partial h}{\partial z_1}$ is a matrix, but it is diagonal. The elements on the diagonal are the same as those in $\sigma'(z_1)$. (The sigmoid function is an element-wise function, which means every element in the output vector is determined solely by the corresponding element in the input vector, so all off-diagonal entries of this Jacobian are zero.) Now element-wise multiplication makes sense, because multiplying element-wise by $\sigma'(z_1)$ is the same as multiplying by this diagonal matrix!
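Here is a tiny numerical illustration of Rex's point, continuing with the variables above: multiplying the row vector $\frac{\partial J}{\partial h}$ by the diagonal Jacobian gives exactly the element-wise product with $\sigma'(z_1)$.

```python
# Derivative of sigmoid, written in terms of sigmoid itself.
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# dh/dz1 is an H x H diagonal matrix because sigmoid acts element-wise.
dh_dz1 = np.diag(sigmoid_grad(z1)[0])

via_diag_matrix = dJ_dh.dot(dh_dz1)          # row vector times diagonal matrix
via_elementwise = dJ_dh * sigmoid_grad(z1)   # element-wise product
print(np.allclose(via_diag_matrix, via_elementwise))  # True
```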

Now it is easy to see that (writing $\circ$ for the element-wise product):

$$\frac{\partial J}{\partial z_1} = \frac{\partial J}{\partial h} \circ \sigma'(z_1) = \big((\hat{y} - y)\,W_2^\top\big) \circ \sigma'(z_1)$$

and, vectorizing $\frac{\partial J}{\partial x_k} = \sum_i \frac{\partial J}{\partial (z_1)_i}\,(W_1)_{ki}$ just as we did for $\frac{\partial J}{\partial h}$,

$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z_1}\,W_1^\top = \Big(\big((\hat{y} - y)\,W_2^\top\big) \circ \sigma'(z_1)\Big)\,W_1^\top$$

Finally!
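To close the loop, here is one last numerical check of the final formula, continuing with the variables and the `numerical_grad` and `sigmoid_grad` helpers from the sketches above (all of which are my own additions, not the assignment's starter code):

```python
# dJ/dx from the formula just derived.
dJ_dz1 = (y_hat - y).dot(W2.T) * sigmoid_grad(z1)   # (1, H)
dJ_dx = dJ_dz1.dot(W1.T)                            # (1, Dx)

# Recompute the cost all the way from x so the numerical gradient sees the whole network.
cost_from_x = lambda: -np.sum(
    y * np.log(softmax(sigmoid(x.dot(W1) + b1).dot(W2) + b2)))
print(np.allclose(dJ_dx, numerical_grad(cost_from_x, x), atol=1e-6))  # True
```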