Calculate gradients for a neural network with one hidden layer

Chenxiao Ma | February 9, 2018

I have always struggled with calculating gradients in back propagation, especially when matrices are involved. Here I will talk about how I solved the question below as an exercise. It is question 2(c) in Assignment 1 of CS224n, Winter 2017 from Stanford. I found it mentally quite difficult to do back propagation directly in terms of vectors, so I had to calculate the partial derivatives with respect to an element of some vector and then vectorize the result, which means I would first calculate $\frac{\partial J}{\partial x_i}$, and then derive $\frac{\partial J}{\partial x}$. I'm not sure this is the right way to do it, but I got the same results as the official solution. I hope it helps you!

Derive the gradients with respect to the inputs $x$ to a one-hidden-layer neural network (that is, find $\frac{\partial J}{\partial x}$, where $J$ is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer and softmax for the output layer. Assume the one-hot label vector is $y$, and cross entropy cost is used. (Feel free to use $\sigma'(x)$ as the shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)

Recall that the forward propagation is as follows:

$$h = \sigma(x W_1 + b_1)$$

$$\hat{y} = \operatorname{softmax}(h W_2 + b_2)$$

Note that here we’re assuming that the input vector (and thus the hidden variables and output probabilities) is a row vector, to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we apply it to each element of that vector. $W_1, W_2$ and $b_1, b_2$ are the weights and biases, respectively, of the two layers.
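To make the shapes concrete, here is a minimal numpy sketch of this forward pass. The dimension names `Dx`, `H`, `Dy` and the random parameters are my own assumptions for illustration; they are not part of the assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical sizes: input Dx, hidden H, output Dy.
Dx, H, Dy = 10, 5, 3
rng = np.random.RandomState(42)

x = rng.randn(1, Dx)                        # input, a row vector
W1, b1 = rng.randn(Dx, H), rng.randn(1, H)
W2, b2 = rng.randn(H, Dy), rng.randn(1, Dy)

# Forward propagation, keeping every intermediate as a row vector.
z1 = x.dot(W1) + b1                         # (1, H)
h = sigmoid(z1)                             # (1, H)
z2 = h.dot(W2) + b2                         # (1, Dy)
y_hat = softmax(z2)                         # (1, Dy)
```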

Let $z_2 = h W_2 + b_2$, then $\hat{y} = \operatorname{softmax}(z_2)$, and $J = CE(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i$.

We already know that $\frac{\partial J}{\partial z_2} = \hat{y} - y$, which means $\frac{\partial J}{\partial (z_2)_j} = \hat{y}_j - y_j$.
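As a sanity check, this can be verified numerically. Continuing with the variables from the sketch above, and using a small finite-difference helper and an arbitrary one-hot label that I made up for illustration:

```python
# An arbitrary one-hot label, and the cross-entropy cost for it.
y = np.eye(Dy)[[1]]                         # (1, Dy)
J = -np.sum(y * np.log(y_hat))

def numerical_grad(f, a, eps=1e-6):
    """Central-difference gradient of the scalar f() with respect to the array a.
    f is a zero-argument callable that depends on a; a is perturbed in place
    and restored before returning."""
    grad = np.zeros_like(a)
    it = np.nditer(a, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old = a[idx]
        a[idx] = old + eps
        plus = f()
        a[idx] = old - eps
        minus = f()
        a[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
        it.iternext()
    return grad

# The numerical gradient of J with respect to z2 should match y_hat - y.
cost_from_z2 = lambda: -np.sum(y * np.log(softmax(z2)))
print(np.allclose(numerical_grad(cost_from_z2, z2), y_hat - y, atol=1e-6))  # True
```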

Now we go back.

Let $z_1 = x W_1 + b_1$, so that $h = \sigma(z_1)$. Element by element,

$$\frac{\partial J}{\partial h_i} = \sum_j \frac{\partial J}{\partial (z_2)_j} \frac{\partial (z_2)_j}{\partial h_i} = \sum_j (\hat{y}_j - y_j)\,(W_2)_{ij},$$

which vectorizes to $\frac{\partial J}{\partial h} = (\hat{y} - y)\,W_2^\top$.
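This intermediate result can be checked the same way, continuing with the sketch and the `numerical_grad` helper above (again, my own sanity check, not part of the assignment):

```python
# Vectorized dJ/dh from the element-wise derivation above.
dJ_dh = (y_hat - y).dot(W2.T)               # (1, H)

# Recompute the cost from h so the numerical gradient treats h as the variable.
cost_from_h = lambda: -np.sum(y * np.log(softmax(h.dot(W2) + b2)))
print(np.allclose(dJ_dh, numerical_grad(cost_from_h, h), atol=1e-6))  # True
```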

I then tried to do the calculation with matrices:

$$\frac{\partial J}{\partial z_1} = \frac{\partial J}{\partial h}\,\frac{\partial h}{\partial z_1} = \big((\hat{y} - y)\,W_2^\top\big)\,\sigma'(z_1)$$

However, I can't multiply them together, because $\frac{\partial J}{\partial h}$ is a row vector of length $H$ (the hidden size), $\sigma'(z_1)$ is also a row vector of length $H$, and $\frac{\partial J}{\partial z_1}$ should be a row vector of length $H$ as well; there is no way to matrix-multiply two row vectors of the same shape and get a row vector back. Now it would be tempting (and even correct) to multiply them element-wise, but it does not make sense! Can someone please explain this to me?

Updated on Feb 23, 2018

Hats off to Rex Ying! It is so nice of you to explain this to me! It turns out that $\frac{\partial h}{\partial z_1}$ is a matrix, but it is diagonal. The elements on the diagonal are the same as those in $\sigma'(z_1)$. (The sigmoid function is an element-wise function, which means every element in the output vector is determined solely by the corresponding element in the input vector, so all off-diagonal entries of this Jacobian are zero.) Now element-wise multiplication makes sense, because multiplying element-wise by $\sigma'(z_1)$ is the same as multiplying by this diagonal matrix!
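Here is a tiny numerical illustration of Rex's point, continuing with the variables above: multiplying the row vector $\frac{\partial J}{\partial h}$ by the diagonal Jacobian gives exactly the element-wise product with $\sigma'(z_1)$.

```python
# Derivative of sigmoid, written in terms of sigmoid itself.
sigmoid_grad = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

# dh/dz1 is an H x H diagonal matrix because sigmoid acts element-wise.
dh_dz1 = np.diag(sigmoid_grad(z1)[0])

via_diag_matrix = dJ_dh.dot(dh_dz1)          # row vector times diagonal matrix
via_elementwise = dJ_dh * sigmoid_grad(z1)   # element-wise product
print(np.allclose(via_diag_matrix, via_elementwise))  # True
```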

Now it is easy to see that (writing $\circ$ for the element-wise product):

$$\frac{\partial J}{\partial z_1} = \frac{\partial J}{\partial h} \circ \sigma'(z_1) = \big((\hat{y} - y)\,W_2^\top\big) \circ \sigma'(z_1)$$

and, vectorizing $\frac{\partial J}{\partial x_k} = \sum_i \frac{\partial J}{\partial (z_1)_i}\,(W_1)_{ki}$ just as we did for $\frac{\partial J}{\partial h}$,

$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z_1}\,W_1^\top = \Big(\big((\hat{y} - y)\,W_2^\top\big) \circ \sigma'(z_1)\Big)\,W_1^\top$$

Finally!
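To close the loop, here is one last numerical check of the final formula, continuing with the variables and the `numerical_grad` and `sigmoid_grad` helpers from the sketches above (all of which are my own additions, not the assignment's starter code):

```python
# dJ/dx from the formula just derived.
dJ_dz1 = (y_hat - y).dot(W2.T) * sigmoid_grad(z1)   # (1, H)
dJ_dx = dJ_dz1.dot(W1.T)                            # (1, Dx)

# Recompute the cost all the way from x so the numerical gradient sees the whole network.
cost_from_x = lambda: -np.sum(
    y * np.log(softmax(sigmoid(x.dot(W1) + b1).dot(W2) + b2)))
print(np.allclose(dJ_dx, numerical_grad(cost_from_x, x), atol=1e-6))  # True
```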