Backpropagation with Softmax / Cross-Entropy


I am trying to understand how backpropagation works for a softmax/cross-entropy output layer.

The cross-entropy error function is

$$E(t, o) = -\sum_j t_j \log o_j$$

with $t$ and $o$ as the target and output at neuron $j$, respectively. The sum is over each neuron in the output layer. $o_j$ itself is the result of the softmax function:

$$o_j = \mathrm{softmax}(z_j) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$

Again, the sum is over each neuron in the output layer, and $z_j$ is the input to neuron $j$:

$$z_j = \sum_i o_i w_{ij} + b$$

That is, the sum over all neurons in the previous layer, with their corresponding output $o_i$ and the weight $w_{ij}$ towards neuron $j$, plus a bias $b$.
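To make the setup concrete, here is a minimal NumPy sketch of this forward pass (the variable names and the numeric values are mine, purely for illustration):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) improves numerical stability without changing the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(t, o):
    # E = -sum_j t_j * log(o_j)
    return -np.sum(t * np.log(o))

# Illustrative values: previous-layer outputs o_i, weights w_ij, bias b
o_prev = np.array([0.5, 0.2, 0.9])
w = np.array([[0.1, 0.4],
              [0.3, 0.2],
              [0.6, 0.7]])
b = 0.1

z = o_prev @ w + b           # z_j = sum_i o_i * w_ij + b
o = softmax(z)               # output layer
t = np.array([0.0, 1.0])     # one-hot target
E = cross_entropy(t, o)
print(o, E)
```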

Now, to update a weight $w_{ij}$ that connects a neuron $i$ in the previous layer to a neuron $j$ in the output layer, I need to calculate the partial derivative of the error function using the chain rule:

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ij}}$$

with $z_j$ as the input to neuron $j$.

The last term is quite simple. Since there is only one weight between $i$ and $j$, the derivative is:

$$\frac{\partial z_j}{\partial w_{ij}} = o_i$$
The first term is the derivative of the error function with respect to the output $o_j$:

$$\frac{\partial E}{\partial o_j} = \frac{-t_j}{o_j}$$
The middle term, the derivative of the softmax function with respect to its input $z_j$, is harder:

$$\frac{\partial o_j}{\partial z_j} = \frac{\partial}{\partial z_j} \frac{e^{z_j}}{\sum_k e^{z_k}}$$

Let's say we have three output neurons corresponding to the classes $a, b, c$. Then $o_b = \mathrm{softmax}(b)$ is:

$$o_b = \frac{e^{z_b}}{e^{z_a} + e^{z_b} + e^{z_c}}$$

and its derivative, using the quotient rule:

$$\frac{\partial o_b}{\partial z_b} = \frac{e^{z_b} \left( e^{z_a} + e^{z_b} + e^{z_c} \right) - e^{z_b} e^{z_b}}{\left( e^{z_a} + e^{z_b} + e^{z_c} \right)^2} = o_b - o_b^2 = o_b (1 - o_b)$$

Back to the middle term for backpropagation, this means:

$$\frac{\partial o_j}{\partial z_j} = o_j (1 - o_j)$$

Putting it all together, I get

$$\frac{\partial E}{\partial w_{ij}} = \frac{-t_j}{o_j} \cdot o_j (1 - o_j) \cdot o_i = -t_j (1 - o_j) \, o_i$$

which means that if the target for this class is $t_j = 0$, then I will not update the weights for it at all. That does not sound right.
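As a quick numerical sanity check (my own sketch, not part of the original question; all values are made up), one can compare the formula derived here, $\partial E / \partial w_{ij} = -t_j (1 - o_j) \, o_i$, against a central-difference gradient. The $t_j = 1$ column agrees, but the $t_j = 0$ columns of the derived formula are zero while the numerical gradient there is not, which confirms the suspicion:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(w):
    return -np.sum(t * np.log(softmax(o_prev @ w)))

o_prev = np.array([0.5, 0.2, 0.9])       # outputs of the previous layer
w = np.array([[0.1, 0.4],
              [0.3, 0.2],
              [0.6, 0.7]])
t = np.array([1.0, 0.0])                 # one-hot target

# Gradient as derived in the question: dE/dw_ij = -t_j * (1 - o_j) * o_i
o = softmax(o_prev @ w)
grad_derived = -np.outer(o_prev, t * (1 - o))

# Central-difference gradient of the actual loss
eps = 1e-6
grad_num = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        wp, wm = w.copy(), w.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        grad_num[i, j] = (loss(wp) - loss(wm)) / (2 * eps)

print(np.abs(grad_derived - grad_num).max())   # large: the t_j = 0 column is wrong
```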

Investigating this, I found that people use two variants for the softmax derivative, one for $i = j$ and the other for $i \ne j$, like here or here.

But I can't make any sense of this. Also, I'm not even sure whether this is the cause of my error, which is why I'm posting all of my calculations. I hope someone can point out where I am missing something or going wrong.

The links you have given are calculating the derivative relative to the input, whilst you're calculating the derivative relative to the weights.



Note: I am not an expert on backprop, but now having read a bit, I think the following caveat is appropriate. When reading papers or books on neural nets, it is not uncommon for derivatives to be written using a mix of the standard summation/index notation, matrix notation, and multi-index notation (including a hybrid of the last two for tensor-tensor derivatives). Typically the intent is that this should be "understood from context", so you have to be careful!

I noticed a couple of inconsistencies in your derivation. I do not do neural networks really, so the following may be incorrect. However, here is how I would go about the problem.

First, you need to take account of the summation in $E$, and you cannot assume that each term depends on only one weight. So, taking the gradient of $E$ with respect to component $k$ of $z$, we have

$$\frac{\partial E}{\partial z_k} = -\sum_j t_j \frac{\partial \log o_j}{\partial z_k}$$

which gives

$$\frac{\partial \log o_j}{\partial z_k} = \frac{1}{o_j} \frac{\partial o_j}{\partial z_k}$$

or, expanding the log,

$$\frac{\partial \log o_j}{\partial z_k} = \frac{\partial}{\partial z_k} \left( z_j - \log \sum_i e^{z_i} \right) = \delta_{jk} - o_k$$

Note that the derivative is with respect to $z_k$, an arbitrary component of $z$, which gives the $\delta_{jk}$ term ($= 1$ only when $k = j$).

So the gradient of $E$ with respect to $z$ is then

$$\frac{\partial E}{\partial z_k} = -\sum_j t_j (\delta_{jk} - o_k) = o_k \left( \sum_j t_j \right) - t_k = o_k \tau - t_k$$

where $\tau = \sum_j t_j$ is constant (for a given $t$ vector).

This shows a first difference from your result: the $t_k$ no longer multiplies $o_k$. Note that for the typical case where $t$ is "one-hot", we have $\tau = 1$ (as noted in your first link).
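A small numerical check of this expression, $\partial E / \partial z_k = o_k \tau - t_k$ (my own sketch with made-up values; the finite-difference gradient of the loss matches it):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.65, 0.87, -0.30])
t = np.array([0.0, 1.0, 0.0])           # one-hot, so tau = 1
o = softmax(z)
tau = t.sum()

grad_analytic = tau * o - t             # dE/dz_k = o_k * tau - t_k

# Central-difference check of dE/dz
def E(z):
    return -np.sum(t * np.log(softmax(z)))

eps = 1e-6
grad_num = np.zeros_like(z)
for k in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[k] += eps
    zm[k] -= eps
    grad_num[k] = (E(zp) - E(zm)) / (2 * eps)

print(np.abs(grad_analytic - grad_num).max())   # tiny: the two agree
```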

A second inconsistency, if I understand correctly, is that the "$o$" that is input to $z$ seems unlikely to be the "$o$" that is output from the softmax. I would think that it makes more sense for this to be "further back" in the network architecture.

Calling this vector $y$, we then have

$$z_k = \sum_i w_{ik} y_i + b_k$$

Finally, to get the gradient of $E$ with respect to the weight matrix $w$, we use the chain rule

$$\frac{\partial E}{\partial w_{ij}} = \sum_k \frac{\partial E}{\partial z_k} \frac{\partial z_k}{\partial w_{ij}} = \left( o_j \tau - t_j \right) y_i$$

giving the final expression (assuming a one-hot $t$, i.e. $\tau = 1$)

$$\frac{\partial E}{\partial w_{ij}} = (o_j - t_j) \, y_i$$

where $y$ is the input on the lowest level (of your example).

So this shows a second difference from your result: the "$o_i$" should presumably be taken from the level below $z$, which I call $y$, rather than the level above $z$ (which is $o$).
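The final expression $\partial E / \partial w_{ij} = (o_j - t_j) \, y_i$ can also be verified numerically (again a sketch of mine with arbitrary values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

y = np.array([0.5, 0.2, 0.9])           # input from the layer below
w = np.array([[0.1, 0.4],
              [0.3, 0.2],
              [0.6, 0.7]])
t = np.array([1.0, 0.0])                # one-hot target

def E(w):
    return -np.sum(t * np.log(softmax(y @ w)))

o = softmax(y @ w)
grad_analytic = np.outer(y, o - t)      # dE/dw_ij = (o_j - t_j) * y_i

eps = 1e-6
grad_num = np.zeros_like(w)
for i in range(w.shape[0]):
    for j in range(w.shape[1]):
        wp, wm = w.copy(), w.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        grad_num[i, j] = (E(wp) - E(wm)) / (2 * eps)
```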

Hopefully this helps. Does this result seem more consistent?

Update: In response to a query from the OP in the comments, here is an expansion of the first step. First, note that the vector chain rule requires summations (see here). Second, to be certain of getting all the gradient components, you should always introduce a new subscript letter for the component in the denominator of the partial derivative. So, to fully write out the gradient with the full chain rule, we have

$$\frac{\partial E}{\partial w_{ab}} = \sum_j \sum_k \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial z_k} \frac{\partial z_k}{\partial w_{ab}}$$

In practice the full summations reduce, because you get a lot of $\delta_{ab}$ terms. Although it involves a lot of perhaps "extra" summations and subscripts, using the full chain rule will ensure you always get the correct result.

I am not certain how the "Backprop/AutoDiff" community does these problems, but I find that any time I try to take shortcuts, I am liable to make errors. So I end up doing as here: writing everything out in terms of summations with full subscripting, and always introducing new subscripts for every derivative. (Similar to my answer here ... I hope I am at least giving correct results in the end!)

I personally find that writing everything down makes it much easier to follow. The results look correct to me.

Although I'm still trying to fully understand each of your steps, I got some valuable insights that helped me with the overall picture. I guess I need to read more on the topic of derivatives and sums. But taking your advice to take account of the summation in $E$, I came up with this:

For two outputs $o_{j_1} = \frac{e^{z_{j_1}}}{\Omega}$ and $o_{j_2} = \frac{e^{z_{j_2}}}{\Omega}$ with

$$\Omega = e^{z_{j_1}} + e^{z_{j_2}}$$

the cross-entropy error is

$$E = -\left( t_{j_1} \log o_{j_1} + t_{j_2} \log o_{j_2} \right)$$

Then the derivative is

$$\frac{\partial E}{\partial z_{j_1}} = o_{j_1} \left( t_{j_1} + t_{j_2} \right) - t_{j_1}$$

which conforms with your result, taking into account that you didn't have the minus sign before the error sum.
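The two-output case is small enough to check directly in a few lines (my own sketch; the values are arbitrary). It confirms $\partial E / \partial z_{j_1} = o_{j_1}(t_{j_1} + t_{j_2}) - t_{j_1}$:

```python
import numpy as np

z1, z2 = 0.3, -1.2
t1, t2 = 1.0, 0.0                        # one-hot target

def E(z1):
    Omega = np.exp(z1) + np.exp(z2)
    o1, o2 = np.exp(z1) / Omega, np.exp(z2) / Omega
    return -(t1 * np.log(o1) + t2 * np.log(o2))

Omega = np.exp(z1) + np.exp(z2)
o1 = np.exp(z1) / Omega

# Derivative from the comment: dE/dz_j1 = o_j1 * (t_j1 + t_j2) - t_j1
grad_analytic = o1 * (t1 + t2) - t1

# Central-difference check
eps = 1e-6
grad_num = (E(z1 + eps) - E(z1 - eps)) / (2 * eps)
```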

But a further question I have is: instead of

$$\frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial o_j} \frac{\partial o_j}{\partial z_j}$$

which is generally what you're introduced to with backpropagation, you calculated

$$\frac{\partial E}{\partial z_k} = -\sum_j t_j \frac{\partial \log o_j}{\partial z_k}$$

as if to cancel out the $o_j$. Why does this way lead to the right result?


While @GeoMatt22's answer is correct, I personally found it very useful to reduce the problem to a toy example and draw a picture:

Graphical model.

I then defined the operations each node computes, treating the $h$'s and $w$'s as inputs to a "network" ($t$ is a one-hot vector representing the class label of the data point):

$$y_1 = h_1 w_{11} + h_2 w_{21} \qquad\qquad y_2 = h_1 w_{12} + h_2 w_{22}$$

$$o_1 = \frac{e^{y_1}}{e^{y_1} + e^{y_2}} \qquad\qquad o_2 = \frac{e^{y_2}}{e^{y_1} + e^{y_2}}$$

$$L = -\left( t_1 \log o_1 + t_2 \log o_2 \right)$$
Say I want to calculate the derivative of the loss with respect to $w_{21}$. I can just use my picture to trace back the path from the loss to the weight I'm interested in (I removed the second column of $w$'s for clarity):

Graphical model with highlighted backwards path.

Then, I can just calculate the desired derivatives. Note that there are two paths through $y_1$ that lead to $w_{21}$, so I need to sum the derivatives that go through each of them:

$$\frac{\partial L}{\partial o_1} = -\frac{t_1}{o_1} \qquad\qquad \frac{\partial L}{\partial o_2} = -\frac{t_2}{o_2}$$

$$\frac{\partial o_1}{\partial y_1} = o_1 (1 - o_1) \qquad\qquad \frac{\partial o_2}{\partial y_1} = -o_2 \, o_1$$

$$\frac{\partial y_1}{\partial w_{21}} = h_2$$

Finally, putting the chain rule together:

$$\frac{\partial L}{\partial w_{21}} = \left( \frac{\partial L}{\partial o_1} \frac{\partial o_1}{\partial y_1} + \frac{\partial L}{\partial o_2} \frac{\partial o_2}{\partial y_1} \right) \frac{\partial y_1}{\partial w_{21}} = \left( -t_1 (1 - o_1) + t_2 \, o_1 \right) h_2 = \left( o_1 (t_1 + t_2) - t_1 \right) h_2 = (o_1 - t_1) \, h_2$$

Note that in the last step, $t_1 + t_2 = 1$ because the vector $t$ is a one-hot vector.
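The toy network is small enough to verify end to end; this sketch (the values are my own) checks $\partial L / \partial w_{21} = (o_1 - t_1) \, h_2$ against a finite difference:

```python
import numpy as np

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

h = np.array([0.8, -0.5])
W = np.array([[0.2, -0.3],   # W[i-1, j-1] holds w_ij, so W[1, 0] is w_21
              [0.7,  0.1]])
t = np.array([1.0, 0.0])     # one-hot label

def L(W):
    o = softmax(h @ W)       # y_j = sum_i h_i * w_ij, then softmax
    return -np.sum(t * np.log(o))

o = softmax(h @ W)
grad_analytic = (o[0] - t[0]) * h[1]    # (o_1 - t_1) * h_2

eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[1, 0] += eps
Wm[1, 0] -= eps
grad_num = (L(Wp) - L(Wm)) / (2 * eps)
```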

This is what finally cleared this up for me! Excellent and elegant explanation!

I’m glad you both enjoyed and benefited from reading my post! It was also helpful for me to write it out and explain it.
Vivek Subramanian

@VivekSubramanian should it be
instead?

You’re right - it was a typo! I will make the change.
Vivek Subramanian

The thing I do not understand here is that you also assign logits (unscaled scores) to some neurons ($o$ is the softmaxed logits, i.e. the predictions, and $y$ is the logits in your case). However, this is not normally the case, is it? Look at this picture (o_out1 is the prediction and o_in1 is the logits): in that case, how can you find the partial derivative of $o_2$ with respect to $y_1$?


In place of the $\{o_i\}$, I want a letter whose uppercase is visually distinct from its lowercase, so let me substitute $\{y_i\}$. Also, let's use the variable $\{p_i\}$ to designate the $\{o_i\}$ from the previous layer.

Let $Y$ be the diagonal matrix whose diagonal equals the vector $y$, i.e.

$$Y = \mathrm{Diag}(y)$$

Using this new matrix variable and the Frobenius inner product, we can calculate the gradient of $E$ with respect to $W$. With $z = Wp + b$, $y = \mathrm{softmax}(z)$, and $E = -t : \log y$ (where $A : B = \sum_{ij} A_{ij} B_{ij}$ denotes the Frobenius product):

$$dy = \left( Y - y y^T \right) dz$$

$$dE = -t : Y^{-1} dy = -t : \left( I - \mathbf{1} y^T \right) dz = \left( \left( \mathbf{1}^T t \right) y - t \right) : dz = (y - t) : dW \, p$$

$$\frac{\partial E}{\partial W} = (y - t) \, p^T$$

where the last two steps use $\mathbf{1}^T t = 1$ for a one-hot $t$ and $dz = dW \, p$.

Here is one of the cleanest and best-written notes I came across on the web explaining the calculation of derivatives in the backpropagation algorithm with the cross-entropy loss function.

In the given pdf, how did equation 22 become equation 23? That is, how did the $\sum_{k \ne i}$ term get a negative sign? Shouldn't it get a positive sign? According to my understanding, $\sum_k F_k = F_i + \sum_{k \ne i} F_k$.


Here's a link explaining the softmax and its derivative.

It explains the reason for using $i = j$ and $i \ne j$.
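The two cases combine into the Jacobian $J = \mathrm{Diag}(s) - s s^T$, which is easy to verify numerically (my own sketch, with arbitrary input values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, -0.5, 0.3])
s = softmax(z)

# ds_i/dz_j = s_i * (1 - s_i)  if i == j   (the i = j case)
#           = -s_i * s_j       otherwise   (the i != j case)
J_analytic = np.diag(s) - np.outer(s, s)

# Central-difference Jacobian, one input component at a time
eps = 1e-6
J_num = np.zeros((z.size, z.size))
for j in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    J_num[:, j] = (softmax(zp) - softmax(zm)) / (2 * eps)
```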

It is recommended to provide a minimal, stand-alone answer in case that link gets broken in the future. Otherwise, this answer might no longer help other users.


Other answers have provided the correct way of calculating the derivative, but they do not point out where you went wrong. In fact, $t_j$ is always $1$ in your last equation, because you have assumed that $o_j$ is the output of the node whose target is $1$; the $o_j$ of the other nodes have different forms as probability functions, and thus lead to different forms of the derivative. That is why other people treat $i = j$ and $i \ne j$ differently.

Licensed under cc by-sa 3.0 with attribution required.