How exactly do you compute the Deep Q-Learning loss function?

I have a question about how exactly the loss function of a Deep Q-Learning network is computed during training. I am using a 2-layer feedforward network with a linear output layer and ReLU hidden layers.

Let's suppose I have 4 possible actions. Thus, the output of my network for the current state $s_t$ is $Q(s_t) \in \mathbb{R}^4$. To make it more concrete, let's assume $Q(s_t) = [1.3, 0.4, 4.3, 1.5]$.
Now I take the action corresponding to the value $4.3$, i.e. the third action, $a_t = 2$, and reach a new state $s_{t+1}$.
Next, I compute the forward pass with state $s_{t+1}$ and, let's say, obtain the following values at the output layer: $Q(s_{t+1}) = [9.1, 2.4, 0.1, 0.3]$. Let's also say the reward is $r_t = 2$ and $\gamma = 1.0$.
Is the loss given by:

$\mathcal{L} = (11.1- 4.3)^2$

OR

$\mathcal{L} = \frac{1}{4}\sum_{i=0}^3 ([11.1, 11.1, 11.1, 11.1] - [1.3, 0.4, 4.3, 1.5])^2$

OR

$\mathcal{L} = \frac{1}{4}\sum_{i=0}^3 ([11.1, 4.4, 2.1, 2.3] - [1.3, 0.4, 4.3, 1.5])^2$

Thank you, and sorry I had to write this out in a very basic way... I am confused by all the notation. (I think the correct answer is the second one...)

— A.D

This question, with its clear example, made me understand deep Q-learning more than any Medium article I've read in the past week.

— dhruvm

After reviewing the equations a few more times, I think the correct loss is the following:

$\mathcal{L} = (11.1 - 4.3)^2$

My reasoning is that the Q-learning update rule, in the general case, updates the Q-value of only one specific (state, action) pair:

$Q(s,a) = r + \gamma \max_{a*}Q(s',a*)$

This equation means that the update happens for only one specific (state, action) pair. For the neural Q-network, that means the loss is calculated for only the one output unit that corresponds to the action taken.

In the example provided, $Q(s,a) = 4.3$ and the target is $r + \gamma \max_{a*}Q(s',a*) = 2 + 1.0 \cdot 9.1 = 11.1$.
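
To double-check the arithmetic, here is a minimal NumPy sketch of the example above. The variable names are my own, and the numbers are hard-coded from the question rather than produced by a real network, so treat it as an illustration of the calculation and not as a training loop.

```python
import numpy as np

# Values taken from the question (not produced by an actual network)
q_s = np.array([1.3, 0.4, 4.3, 1.5])      # Q(s_t) for the 4 actions
q_next = np.array([9.1, 2.4, 0.1, 0.3])   # Q(s_{t+1}) for the 4 actions
reward, gamma = 2.0, 1.0

# Greedy action in s_t: index of the largest Q-value (the third action)
action = int(np.argmax(q_s))

# Target for the chosen action: r + gamma * max_a Q(s_{t+1}, a)
target = reward + gamma * q_next.max()

# Loss on the single output unit that corresponds to the action taken
loss = (target - q_s[action]) ** 2

# Same loss written over the whole output vector: the targets for the
# other actions are set to the network's own predictions, so their
# error terms are zero and only the chosen action contributes.
target_vec = q_s.copy()
target_vec[action] = target
loss_vec = np.sum((target_vec - q_s) ** 2)

print(action, target, loss, loss_vec)
```

The vector form is handy when a framework expects a target of the same shape as the network output. If a mean over the four outputs is used instead of a sum, the value is scaled by 1/4, which only rescales the gradient.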

— A.D