Deriving the Bellman equation in reinforcement learning

32

I see the following equation in "Reinforcement Learning: An Introduction", but I don't quite follow the step I have highlighted in blue below. How exactly is this step derived?

expected-value reinforcement-learning

— Amelio Vazquez-Reina

7

This is the answer for everybody who wonders about the clean, structured math behind it (i.e. if you belong to the group of people who know what a random variable is and that you must show or assume that a random variable has a density, then this is the answer for you ;-)):

First of all, we need the Markov decision process to have only a finite number of $L^1$-rewards, i.e. there must exist a finite set $E$ of densities, each belonging to an $L^1$ variable, i.e. $\int_{\mathbb{R}}x \cdot e(x) dx < \infty$ for all $e \in E$, and a map $F : A \times S \to E$ such that

$p(r_t|a_t, s_t) = F(a_t, s_t)(r_t)$

(i.e. in the automaton behind the MDP there may be infinitely many states, but there are only finitely many $L^1$-reward distributions attached to the possibly infinite transitions between the states).

Theorem 1: Let $X \in L^1(\Omega)$ (i.e. an integrable real random variable) and let $Y$ be another random variable such that $X,Y$ have a common density. Then

$E[X|Y=y] = \int_\mathbb{R} x p(x|y) dx$

Proof: Essentially proven here by Stefan Hansen.

Theorem 2: Let $X \in L^1(\Omega)$ and let $Y,Z$ be further random variables such that $X,Y,Z$ have a common density. Then

$E[X|Y=y] = \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz$

where $\mathcal{Z}$ is the range of $Z$.

Proof:

$\begin{align*} E[X|Y=y] &= \int_{\mathbb{R}} x p(x|y) dx \\ &~~~~\text{(by Thm. 1)}\\ &= \int_{\mathbb{R}} x \frac{p(x,y)}{p(y)} dx \\ &= \int_{\mathbb{R}} x \frac{\int_{\mathcal{Z}} p(x,y,z) dz}{p(y)} dx \\ &= \int_{\mathcal{Z}} \int_{\mathbb{R}} x \frac{ p(x,y,z) }{p(y)} dx dz \\ &= \int_{\mathcal{Z}} \int_{\mathbb{R}} x p(x|y,z)p(z|y) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) \int_{\mathbb{R}} x p(x|y,z) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz \\ &~~~~\text{(by Thm. 1)} \end{align*}$
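As a sanity check of Theorem 2, the identity can be verified numerically in the discrete case, where sums replace the integrals. This is only a sketch: the ranges `xs`, `ys`, `zs` and the random joint pmf are made-up assumptions, and it does not touch the measure-theoretic part of the statement.

```python
import itertools
import random

random.seed(0)

# Hypothetical finite ranges for X, Y, Z (the theorem is stated for densities;
# here sums replace integrals).
xs, ys, zs = [0.0, 1.0, 2.5], [0, 1], [0, 1, 2]

# Random strictly positive joint pmf p(x, y, z), normalised to sum to 1.
weights = {k: random.random() + 0.1 for k in itertools.product(xs, ys, zs)}
total = sum(weights.values())
p = {k: w / total for k, w in weights.items()}

def p_y(y):
    return sum(p[(x, y, z)] for x in xs for z in zs)

def p_yz(y, z):
    return sum(p[(x, y, z)] for x in xs)

def e_x_given_y(y):
    # Left-hand side: E[X | Y = y] = sum_x x p(x | y)
    return sum(x * p[(x, y, z)] for x in xs for z in zs) / p_y(y)

def e_x_given_yz(y, z):
    # Inner term: E[X | Y = y, Z = z]
    return sum(x * p[(x, y, z)] for x in xs) / p_yz(y, z)

def rhs(y):
    # Right-hand side: sum_z p(z | y) E[X | Y = y, Z = z]
    return sum(p_yz(y, z) / p_y(y) * e_x_given_yz(y, z) for z in zs)

max_gap = max(abs(e_x_given_y(y) - rhs(y)) for y in ys)
print(max_gap)  # difference between both sides, up to rounding
```

Both sides agree up to floating-point rounding for every value of `y`, which is what the chain of equalities in the proof predicts.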

Put $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k}$ and $G_t^{(K)} = \sum_{k=0}^K \gamma^k R_{t+k}$. Then one can show (using the fact that the MDP has only finitely many $L^1$-rewards) that $G_t^{(K)}$ converges and, since the function $\sum_{k=0}^\infty \gamma^k |R_{t+k}|$ is still in $L^1(\Omega)$ (i.e. integrable), one can also show (using the usual combination of the monotone convergence theorem and then dominated convergence on the defining equations for [the factorizations of] the conditional expectation) that

$\lim_{K \to \infty} E[G_t^{(K)} | S_t=s_t] = E[G_t | S_t=s_t]$

Now one shows that

$E[G_t^{(K)} | S_t=s_t] = E[R_{t} | S_t=s_t] + \gamma \int_S p(s_{t+1}|s_t) E[G_{t+1}^{(K-1)} | S_{t+1}=s_{t+1}] ds_{t+1}$

using $G_t^{(K)} = R_t + \gamma G_{t+1}^{(K-1)}$, Thm. 2 above, then Thm. 1 on $E[G_{t+1}^{(K-1)}|S_{t+1}=s', S_t=s_t]$, and then, by a straightforward marginalization argument, one shows that $p(r_q|s_{t+1}, s_t) = p(r_q|s_{t+1})$ for all $q \geq t+1$. Now we need to apply the limit $K \to \infty$ to both sides of the equation. In order to pull the limit into the integral over the state space $S$ we need to make some additional assumptions:

Either the state space is finite (then $\int_S = \sum_S$ and the sum is finite), or all the rewards are positive (then we use monotone convergence), or all the rewards are negative (then we put a minus sign in front of the equation and use monotone convergence again), or all the rewards are bounded (then we use dominated convergence). Then (by applying $\lim_{K \to \infty}$ to both sides of the partial / finite Bellman equation above) we obtain

$E[G_t | S_t=s_t] = \lim_{K \to \infty} E[G_t^{(K)} | S_t=s_t] = E[R_{t} | S_t=s_t] + \gamma \int_S p(s_{t+1}|s_t) E[G_{t+1} | S_{t+1}=s_{t+1}] ds_{t+1}$

and then the rest is the usual density manipulation.
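To make the limit $K \to \infty$ concrete, here is a minimal numerical sketch for the finite-state case. It assumes a made-up, time-homogeneous 3-state chain with the policy already folded into the transition matrix `P`, and bounded expected rewards `r`; none of these numbers come from the argument above.

```python
GAMMA = 0.9

# Hypothetical 3-state Markov reward process (policy already folded in):
# P[s][s2] = p(s_{t+1} = s2 | s_t = s), r[s] = E[R_t | S_t = s].
P = [
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.2, 0.8],
]
r = [1.0, 0.0, -2.0]

def truncated_values(K):
    """Return v_K with v_K[s] = E[G_t^(K) | S_t = s], built by the finite recursion."""
    v = list(r)  # K = 0: G_t^(0) = R_t
    for _ in range(K):
        v = [r[s] + GAMMA * sum(P[s][s2] * v[s2] for s2 in range(3)) for s in range(3)]
    return v

v_200 = truncated_values(200)
v_400 = truncated_values(400)

# Residual of the limiting Bellman equation v = r + gamma * P v at v_400.
residual = max(
    abs(v_400[s] - (r[s] + GAMMA * sum(P[s][s2] * v_400[s2] for s2 in range(3))))
    for s in range(3)
)
gap = max(abs(a - b) for a, b in zip(v_200, v_400))
print(v_400, gap, residual)
```

The truncated values stop changing as $K$ grows, and the limit satisfies the full Bellman equation up to rounding. In this finite, bounded setting the interchange of limit and sum is unproblematic; the assumptions listed above are what is needed to justify it in general.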

REMARK: Even in very simple tasks the state space can be infinite! One example would be the "balancing a pole" task. The state is essentially the angle of the pole (a value in $[0, 2\pi)$, an uncountably infinite set!).

REMARK: People might comment "come on, this proof can be shortened much more if you just use the density of $G_t$ directly and show that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$"... BUT... my questions would be:

How do you even know that $G_{t+1}$ has a density?
How do you even know that $G_{t+1}$ has a common density together with $S_{t+1}, S_t$?
How do you infer that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$? This is not only the Markov property: the Markov property only tells you something about the marginal distributions, but these do not necessarily determine the whole distribution; see, for example, multivariate Gaussians!

— Fabian Werner

10

Let the total sum of discounted rewards after time $t$ be:
$G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...$

The utility value of starting in state $s$ at time $t$ is equivalent to the expected sum of discounted rewards $R$ of executing policy $\pi$ starting from state $s$ onwards.
$U_\pi(S_t=s) = E_\pi[G_t|S_t = s]$
$= E_\pi[(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...)|S_t = s]$ (by definition of $G_t$)
$= E_\pi[(R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3}+...))|S_t = s]$
$= E_\pi[(R_{t+1}+\gamma (G_{t+1}))|S_t = s]$
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[ G_{t+1}|S_t = s]$ (by the law of linearity)
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[E_\pi(G_{t+1}|S_{t+1} = s')|S_t = s]$ (by the law of total expectation)
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[U_\pi(S_{t+1}= s')|S_t = s]$ (by definition of $U_\pi$)
$= E_\pi[R_{t+1} + \gamma U_\pi(S_{t+1}= s')|S_t = s]$ (by the law of linearity)

Assuming that the process satisfies the Markov property:
the probability $Pr$ of ending up in state $s'$, having started from state $s$ and taken action $a$, is
$Pr(s'|s,a) = Pr(S_{t+1} = s' \mid S_t=s,A_t = a)$
and the reward $R$ of ending up in state $s'$, having started from state $s$ and taken action $a$, is
$R(s,a,s') = E[R_{t+1}|S_t = s, A_t = a, S_{t+1}= s']$

Therefore, we can rewrite the utility equation above as
$U_\pi(S_t=s) = \sum_a \pi(a|s) \sum_{s'} Pr(s'|s,a)[R(s,a,s')+ \gamma U_\pi(S_{t+1}=s')]$

where $\pi(a|s)$ is the probability of taking action $a$ when in state $s$ for a stochastic policy. For a deterministic policy, $\sum_a \pi(a|s)= 1$.
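The final equation is linear in $U_\pi$, so for a small model it can be solved exactly and checked term by term. The sketch below assumes a made-up two-state, two-action model (`PR`, `R`, `PI` are illustrative values, not taken from the answer) and solves the $2 \times 2$ system with Cramer's rule.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

# Hypothetical model: PR[s][a][s2] = Pr(s2 | s, a), R[s][a][s2] = R(s, a, s2).
PR = [[[0.8, 0.2], [0.3, 0.7]],
      [[0.5, 0.5], [0.1, 0.9]]]
R = [[[1.0, 0.0], [0.0, 2.0]],
     [[-1.0, 0.5], [0.0, 3.0]]]
# Stochastic policy: PI[s][a] = pi(a | s); each row sums to 1.
PI = [[0.6, 0.4], [0.25, 0.75]]

def backup(U, s):
    """Right-hand side of the final utility equation for state s."""
    return sum(
        PI[s][a] * sum(PR[s][a][s2] * (R[s][a][s2] + GAMMA * U[s2]) for s2 in STATES)
        for a in ACTIONS
    )

# Policy-averaged transition matrix and expected one-step reward.
P = [[sum(PI[s][a] * PR[s][a][s2] for a in ACTIONS) for s2 in STATES] for s in STATES]
r = [sum(PI[s][a] * PR[s][a][s2] * R[s][a][s2] for a in ACTIONS for s2 in STATES)
     for s in STATES]

# Solve (I - GAMMA * P) U = r for two states with Cramer's rule.
a11, a12 = 1 - GAMMA * P[0][0], -GAMMA * P[0][1]
a21, a22 = -GAMMA * P[1][0], 1 - GAMMA * P[1][1]
det = a11 * a22 - a12 * a21
U = [(r[0] * a22 - a12 * r[1]) / det, (a11 * r[1] - a21 * r[0]) / det]

errors = [abs(U[s] - backup(U, s)) for s in STATES]
print(U, errors)
```

Setting one entry of a row of `PI` to 1 and the other to 0 gives the deterministic-policy case, in which the outer sum over actions collapses to a single term.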

— Ntabgoba

Just a few notes: the sum over $\pi$ is equal to 1 even for a stochastic policy, but in a deterministic policy there is just one action that receives the full weight (i.e., $\pi(a|s) = 1$) and the rest receive 0 weight, so that term is removed from the equation. Also, in the line where you used the law of total expectation, the order of the conditionals is reversed.

— Gilad Peleg

1

I am pretty sure that this answer is incorrect: let us follow the equations just until the line involving the law of total expectation. Then the left-hand side does not depend on $s'$ while the right-hand side does... I.e. if the equations are correct, then for which $s'$ are they correct? You must have some kind of integral over $s'$ already at that stage. The reason is probably your misunderstanding of the difference between $E[X|Y]$ (a random variable) and its factorization $E[X|Y=y]$ (a deterministic function!)...

— Fabian Werner

@FabianWerner I agree that this is not correct. The answer from Jie Shi is the right answer.

— teucer 9/01

@teucer This answer can be fixed because there is just some "symmetrization" missing, i.e. $E[A|C=c] = \int_{\text{range}(B)} p(b|c) E[A|B=b, C=c] dP_B(b)$, but still, the question is the same as in Jie Shi's answer: why is $E[G_{t+1}|S_{t+1}=s_{t+1}, S_t=s_t] = E[G_{t+1}|S_{t+1}=s_{t+1}]$? This is not only the Markov property, because $G_{t+1}$ is a really complicated RV: does it even converge? If so, where? What is the common density $p(g_{t+1}, s_{t+1}, s_t)$? We know this expression only for finite sums (a complicated convolution), but what about the infinite case?

— Fabian Werner

@FabianWerner Not sure if I can answer all the questions. Below are some pointers. For the convergence of $G_{t+1}$: given that it is the sum of discounted rewards, it is reasonable to assume that the series converges (the discount factor is $<1$, and where it converges to does not really matter). I am not concerned with the density (one can always define a joint density as long as we have random variables); it only matters whether it is well defined, and in this case it is.

— teucer 10/01

8

Here is my proof. It is based on the manipulation of conditional distributions, which makes it easier to follow. I hope this one helps you.

$\begin{align} v_{\pi}(s)&=E{\left[G_t|S_t=s\right]} \nonumber \\ &=E{\left[R_{t+1}+\gamma G_{t+1}|S_t=s\right]} \nonumber \\ &= \sum_{s'}\sum_{r}\sum_{g_{t+1}}\sum_{a}p(s',r,g_{t+1}, a|s)(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r,g_{t+1} |a, s)(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r|a, s)p(g_{t+1}|s', r, a, s)(r+\gamma g_{t+1}) \nonumber \\ &\text{Note that $p(g_{t+1}|s', r, a, s)=p(g_{t+1}|s')$ by assumption of MDP} \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)\sum_{g_{t+1}}p(g_{t+1}|s')(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)(r+\gamma\sum_{g_{t+1}}p(g_{t+1}|s')g_{t+1}) \nonumber \\ &=\sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)\left(r+\gamma v_{\pi}(s')\right) \label{eq2} \end{align}$

This is the famous Bellman equation.
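One way to see that the last line agrees with the first line $v_\pi(s) = E[G_t|S_t=s]$ is to estimate the expectation by simulation and compare it with the fixed point of the final formula. The following is a rough sketch under stated assumptions: the dynamics `DYN`, the policy `PI`, the discount `GAMMA = 0.5` and the truncation horizon are all invented for illustration, and the Monte Carlo estimate is only accurate up to sampling noise.

```python
import random

random.seed(0)
GAMMA = 0.5

# Hypothetical dynamics: DYN[(s, a)] lists (s_next, reward, probability),
# i.e. the joint p(s', r | s, a); probabilities sum to 1 for each (s, a).
DYN = {
    (0, 0): [(0, 0.0, 0.5), (1, 1.0, 0.5)],
    (0, 1): [(1, 2.0, 1.0)],
    (1, 0): [(0, -1.0, 0.3), (1, 0.0, 0.7)],
    (1, 1): [(0, 1.0, 0.9), (1, 1.0, 0.1)],
}
PI = {0: [0.5, 0.5], 1: [0.2, 0.8]}  # PI[s][a] = p(a | s)

def backup(v, s):
    """sum_a p(a|s) sum_{s',r} p(s',r|a,s) (r + gamma v(s'))."""
    return sum(
        PI[s][a] * sum(p * (r + GAMMA * v[s2]) for s2, r, p in DYN[(s, a)])
        for a in (0, 1)
    )

# Fixed point of the Bellman equation by repeated substitution.
v = [0.0, 0.0]
for _ in range(100):
    v = [backup(v, 0), backup(v, 1)]

def sample_return(s, horizon=30):
    """One sampled G_t, truncated after `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices((0, 1), weights=PI[s])[0]
        outcomes = DYN[(s, a)]
        s, r, _ = random.choices(outcomes, weights=[o[2] for o in outcomes])[0]
        g += discount * r
        discount *= GAMMA
    return g

N = 5000
mc = [sum(sample_return(s) for _ in range(N)) / N for s in (0, 1)]
print(v, mc)  # the two lists should agree up to sampling noise
```

The simulation never uses the conditional independence $p(g_{t+1}|s', r, a, s)=p(g_{t+1}|s')$ explicitly; it holds here because the sampler only looks at the current state when drawing the next step.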

— Jie Shi

Do you mind explaining the comment "Note that ..." a little more? Why do the random variable $G_{t+1}$ and the state and action variables even have a common density? If they do, how do you know the property that you are using? I can see that it is true for a finite sum, but if the random variable is a limit... ???

— Fabian Werner

To Fabian: First let us recall what $G_{t+1}$ is: $G_{t+1}=R_{t+2}+R_{t+3}+\cdots$. Note that $R_{t+2}$ only directly depends on $S_{t+1}$ and $A_{t+1}$, since $p(s', r|s, a)$ captures all the transition information of an MDP (more precisely, $R_{t+2}$ is independent of all states, actions and rewards before time $t+1$, given $S_{t+1}$ and $A_{t+1}$). Similarly, $R_{t+3}$ only depends on $S_{t+2}$ and $A_{t+2}$. As a result, $G_{t+1}$ is independent of $S_t$, $A_t$ and $R_t$ given $S_{t+1}$, which explains that line.

— Jie Shi

Sorry, that only "motivates" it; it does not actually explain anything. For example: what is the density of $G_{t+1}$? Why are you sure that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$? Why do these random variables even have a common density? You know that a sum turns into a convolution of densities, so what... should $G_{t+1}$ have an infinite number of integrals in its density??? There is absolutely no candidate for the density!

— Fabian Werner

To Fabian: I do not get your question. 1. Do you want the exact form of the marginal distribution $p(g_{t+1})$? I do not know it, and we do not need it in this proof. 2. Why is $p(g_{t+1}|s_{t+1}, s_t)=p(g_{t+1}|s_{t+1})$? Because, as I mentioned earlier, $g_{t+1}$ and $s_t$ are independent given $s_{t+1}$. 3. What do you mean by "common density"? Do you mean joint distribution? Do you want to know why these random variables have a joint distribution? All random variables in this universe can have a joint distribution. If that is your question, I suggest you find a probability theory book and read it.

— Jie Shi

Let us continue this discussion in chat: chat.stackexchange.com/rooms/88952/bellman-equation

— Fabian Werner

2

What about the following approach?

$\begin{align} v_\pi(s) & = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ & = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \cdot \,\\ & \qquad \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s, A_{t} = a, S_{t+1} = s', R_{t+1} = r\right] \\ & = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]. \end{align}$

The sums are introduced in order to retrieve $a$, $s'$ and $r$ from $s$. With these extra conditions, the linearity of expectation leads to the result almost directly.

I am not sure how rigorous my argument is mathematically, though. I am open to improvements.

— Tsjolder

The last line only works because of the MDP property.

— teucer

2

This is just a comment / addition to the accepted answer.

I was confused at the line where the law of total expectation is applied. I do not think the main form of the law of total expectation can help here. A variant of it is in fact needed.

If $X,Y,Z$ are random variables and assuming all the expectations exist, then the following identity holds:

$E[X|Y] = E[E[X|Y,Z]|Y]$

In this case, $X= G_{t+1}$, $Y = S_t$ and $Z = S_{t+1}$. Then

$E[G_{t+1}|S_t=s] = E[E[G_{t+1}|S_t=s, S_{t+1}=s']|S_t=s]$, which by the Markov property equals $E[E[G_{t+1}|S_{t+1}=s']|S_t=s]$

From there, one can follow the rest of the proof from the answer.

— Mehdi Golari

1

Welcome to CV! Please use answers only to answer the question. Once you have enough reputation (50), you can add comments.

— Frans Rodenburg 28/09

Thank you. Yes, since I could not comment for lack of reputation, I thought it might be useful to add the explanation as an answer. But I will keep that in mind.

— Mehdi Golari 28/09

Upvoted, but still, this answer is missing details: even if $E[X|Y]$ satisfies this crazy relationship, nobody guarantees that this is also true for the factorizations of the conditional expectations! I.e. as in the case of Ntabgoba's answer: the left-hand side does not depend on $s'$ while the right-hand side does. This equation cannot be correct!

— Fabian Werner

1

$\mathbb{E}_\pi(\cdot)$ usually denotes the expectation assuming the agent follows policy $\pi$. In this case $\pi(a|s)$ seems non-deterministic, i.e. it returns the probability that the agent takes action $a$ when in state $s$.

It looks like $r$, lower-case, is replacing $R_{t+1}$, a random variable. The second expectation replaces the infinite sum, to reflect the assumption that we continue to follow $\pi$ for all future $t$. $\sum_{s',r} r \cdot p(s′,r|s,a)$ is then the expected immediate reward on the next time step; the second expectation, which becomes $v_\pi$, is the expected value of the next state, weighted by the probability of winding up in state $s'$ having taken $a$ from $s$.

Thus, the expectation accounts for the policy probability as well as the transition and reward functions, here expressed together as $p(s', r|s,a)$.

— Sean Easter

Thanks. Yes, what you mentioned about $\pi(a|s)$ is correct (it is the probability of the agent taking action $a$ when in state $s$).

— Amelio Vazquez-Reina

What I do not follow is which terms exactly get expanded into which terms in the second step (I am familiar with probability factorization and marginalization, but not so much with RL). Is $R_t$ the term being expanded? I.e. what exactly in the previous step equals what exactly in the next step?

— Amelio Vazquez-Reina

1

It looks like $r$, lower-case, is replacing $R_{t+1}$, a random variable, and the second expectation replaces the infinite sum (probably to reflect the assumption that we continue to follow $\pi$ for all future $t$). $\Sigma p(s',r|s,a)r$ is then the expected immediate reward on the next time step, and the second expectation, which becomes $v_\pi$, is the expected value of the next state, weighted by the probability of winding up in state $s'$ having taken $a$ from $s$.

— 31519 Easter Sean

1

Even though the correct answer has already been given and some time has passed, I thought the following step-by-step guide might be useful:
By linearity of the expected value we can split $E[R_{t+1} + \gamma G_{t+1}|S_{t}=s]$ into $E[R_{t+1}|S_t=s]$ and $\gamma E[G_{t+1}|S_{t}=s]$.
I will outline the steps only for the first part, as the second part follows by the same steps combined with the law of total expectation.

$\begin{align} E[R_{t+1}|S_t=s]&=\sum_r{ r P[R_{t+1}=r|S_t =s]} \\ &= \sum_a{ \sum_r{ r P[R_{t+1}=r, A_t=a|S_t=s]}} \\ &=\sum_a{ \sum_r{ r P[R_{t+1}=r| A_t=a, S_t=s] P[A_t=a|S_t=s]}} \qquad \text{(III)} \\ &= \sum_{s^{'}}{ \sum_a{ \sum_r{ r P[S_{t+1}=s^{'}, R_{t+1}=r| A_t=a, S_t=s] P[A_t=a|S_t=s] }}} \\ &=\sum_a{ \pi(a|s) \sum_{s^{'},r}{p(s^{'},r|s,a)\, r} } \end{align}$

where (III) follows from:

$\begin{align} P[A,B|C]&=\frac{P[A,B,C]}{P[C]} \\ &= \frac{P[A,B,C]}{P[C]} \frac{P[B,C]}{P[B,C]}\\ &= \frac{P[A,B,C]}{P[B,C]} \frac{P[B,C]}{P[C]}\\ &= P[A|B,C] P[B|C] \end{align}$
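The chain for $E[R_{t+1}|S_t=s]$ can be traced on a tiny example. The sketch below fixes one state $s$ and assumes made-up values for $\pi(a|s)$ and $p(s',r|s,a)$; the action names, state labels and probabilities are purely illustrative.

```python
# Hypothetical one-state slice: PI[a] = P[A_t = a | S_t = s],
# P_SR[a] maps (s_next, r) -> P[S_{t+1} = s_next, R_{t+1} = r | A_t = a, S_t = s].
PI = {"left": 0.3, "right": 0.7}
P_SR = {
    "left": {("s1", 0.0): 0.4, ("s2", 1.0): 0.6},
    "right": {("s1", 1.0): 0.5, ("s2", 1.0): 0.25, ("s2", 4.0): 0.25},
}

# Product rule (III), then marginalising s':
# P[R = r | S = s] = sum_a pi(a|s) sum_{s'} p(s', r | s, a)
p_r = {}
for a, table in P_SR.items():
    for (_, r), p in table.items():
        p_r[r] = p_r.get(r, 0.0) + PI[a] * p

first_line = sum(r * p for r, p in p_r.items())  # sum_r r P[R = r | S = s]
last_line = sum(PI[a] * sum(p * r for (_, r), p in P_SR[a].items()) for a in PI)
print(first_line, last_line)  # both are 1.405 up to rounding
```

The first and last lines of the derivation give the same number, because the intermediate steps only regroup the same products of probabilities.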

— Adsertor Justitia

1

I know there is already an accepted answer, but I wish to provide a probably more concrete derivation. I would also like to mention that although @Jie Shi's trick somewhat makes sense, it makes me feel very uncomfortable :(. We need to consider the time dimension to make this work. It is important to note that the expectation is actually taken over the entire infinite horizon, rather than just over $s$ and $s'$. Let us assume we start from $t=0$ (in fact, the derivation is the same regardless of the starting time; I do not want to clutter the equations with another subscript $k$).

$\begin{align} v_{\pi}(s_0)&=\mathbb{E}_{\pi}[G_{0}|s_0]\\ G_0&=\sum_{t=0}^{T-1}\gamma^tR_{t+1}\\ \mathbb{E}_{\pi}[G_{0}|s_0]&=\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\\ &\times\Big(\sum_{t=0}^{T-1}\gamma^tr_{t+1}\Big)\bigg)\\ &=\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\\ &\times\Big(r_1+\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\Big)\bigg) \end{align}$

Note that the above equation holds even if $T\rightarrow\infty$; in fact it will be true until the end of the universe (maybe a bit exaggerated :) ).
At this stage, I believe most of us should already have in mind how the above leads to the final expression: we just need to apply the sum-product rule ($\sum_a\sum_b\sum_cabc\equiv\sum_aa\sum_bb\sum_cc$) painstakingly. Let us apply the linearity of expectation to each term inside $\Big(r_{1}+\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\Big)$.

Part 1

$\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\times r_1\bigg)$

Well, this is rather trivial: all probabilities disappear (actually sum to 1) except those related to $r_1$. Therefore, we have

$\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times r_1$

Part 2
Guess what, this part is even more trivial: it only involves rearranging the sequence of summations.

$\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\bigg)\\=\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\bigg(\sum_{a_1}\pi(a_1|s_1)\sum_{a_{2},...a_{T}}\sum_{s_{2},...s_{T}}\sum_{r_{2},...r_{T}}\bigg(\prod_{t=0}^{T-2}\pi(a_{t+2}|s_{t+2})p(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\bigg)\bigg)$

And Eureka! We recover a recursive pattern inside the big parentheses. Let us combine it with $\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}$, and we obtain $v_{\pi}(s_1)=\mathbb{E}_{\pi}[G_1|s_1]$:

$\gamma\mathbb{E}_{\pi}[G_1|s_1]=\sum_{a_1}\pi(a_1|s_1)\sum_{a_{2},...a_{T}}\sum_{s_{2},...s_{T}}\sum_{r_{2},...r_{T}}\bigg(\prod_{t=0}^{T-2}\pi(a_{t+2}|s_{t+2})p(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\bigg)\bigg(\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\bigg)$
and part 2 becomes

$\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times \gamma v_{\pi}(s_1)$

Part 1 + Part 2

$v_{\pi}(s_0) =\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times \Big(r_1+\gamma v_{\pi}(s_1)\Big)$

And now we can tuck in the time dimension and recover the general recursive formula

$v_{\pi}(s) =\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\times \Big(r+\gamma v_{\pi}(s')\Big)$

Final confession: I laughed when I saw people above mention the use of the law of total expectation. So here I am.
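For a finite horizon $T$ the sum over all trajectories can be carried out by brute force, which gives a direct check of "Part 1 + Part 2". The following is a minimal sketch, assuming a made-up two-state, two-action model (`DYN` and `PI` are hypothetical); it leaves out the final action $a_T$, whose probabilities sum to 1 anyway, and it says nothing about the case $T \to \infty$.

```python
GAMMA = 0.9

# Hypothetical model, same layout as p(s', r | s, a): DYN[(s, a)] = [(s_next, r, prob), ...]
DYN = {
    (0, 0): [(0, 1.0, 0.7), (1, 0.0, 0.3)],
    (0, 1): [(1, 2.0, 1.0)],
    (1, 0): [(0, 0.0, 0.4), (1, -1.0, 0.6)],
    (1, 1): [(0, 3.0, 0.5), (1, 1.0, 0.5)],
}
PI = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}  # PI[s][a] = pi(a | s)

def trajectories(s, steps):
    """Yield (probability, [r_1, ..., r_steps]) for every path starting in s."""
    if steps == 0:
        yield 1.0, []
        return
    for a, pa in PI[s].items():
        for s2, r, p in DYN[(s, a)]:
            for tail_p, tail_r in trajectories(s2, steps - 1):
                yield pa * p * tail_p, [r] + tail_r

def v_brute(s, T):
    """E[G_0 | s_0 = s] for horizon T by summing over all trajectories."""
    return sum(
        prob * sum(GAMMA ** t * r for t, r in enumerate(rewards))
        for prob, rewards in trajectories(s, T)
    )

def v_one_step(s, T):
    """Part 1 + Part 2: one-step look-ahead using the horizon-(T-1) value."""
    return sum(
        pa * p * (r + GAMMA * v_brute(s2, T - 1))
        for a, pa in PI[s].items()
        for s2, r, p in DYN[(s, a)]
    )

T = 6
gaps = [abs(v_brute(s, T) - v_one_step(s, T)) for s in (0, 1)]
print(gaps)  # differences between the two computations, up to rounding
```

Note that for finite $T$ the value inside the parentheses is the horizon-$(T-1)$ value, so the same symbol $v_\pi$ on both sides of the final formula is only recovered in the limit.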

— Karlsson Yu

Erm... what is the symbol "$\sum_{a_0, ..., a_{\infty}}$" supposed to mean? There is no $a_\infty$...

— Fabian Werner

Another question: why is the very first equation true? I know $E[f(X)|Y=y] = \int_{\mathcal{X}} f(x) p(x|y) dx$, but in our case $X$ would be an infinite sequence of random variables $(R_0, R_1, R_2, ........)$, so we would need to compute the density of this variable (consisting of an infinite number of variables of which we know the density) together with something else (namely the state)... how exactly do you do that? I.e. what is $p(r_0, r_1, ....)$?

— Fabian Werner

@FabianWerner. Take a deep breath to calm your brain first :). Let me answer your first question: $\sum_{a_0,...,a_{\infty}} \equiv \sum_{a_0}\sum_{a_1},...,\sum_{a_{\infty}}$. If you recall the definition of the value function, it is actually a summation of discounted future rewards. If we consider an infinite horizon for our future rewards, we then need to sum an infinite number of times. A reward is the result of taking an action from a state; since there is an infinite number of rewards, there should be an infinite number of actions, hence $a_{\infty}$.

— Karlsson Yu

1

Let us assume that I agree that there is some weird $a_\infty$ (which I still doubt; students in the very first semester of math tend to confuse the limit with some construction that actually involves an infinite element)... I still have one simple question: how is "$\sum_{a_1} ... \sum_{a_\infty}$" defined? I know what this expression is supposed to mean with a finite number of sums... but infinitely many of them? What do you understand this expression to do?

— Fabian Werner

1

internet. Could you refer me to a page or any place that defines your expression? If not then you actually defined something new and there is no point in discussing that because it is just a symbol that you made up (but there is no meaning behind it)... you agree that we are only able to discuss about the symbol if we both know what it means, right? So, I do not know what it means, please explain...

— Fabian Werner

1

There are already a great many answers to this question, but most involve few words describing what is going on in the manipulations. I'm going to answer it using way more words, I think. To start,

$G_{t} \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_{k}$

is defined in equation 3.11 of Sutton and Barto, with a constant discount factor $0 \leq \gamma \leq 1$, and we can have $T = \infty$ or $\gamma = 1$, but not both. Since the rewards, $R_{k}$, are random variables, so is $G_{t}$, as it is merely a linear combination of random variables.

$\begin{align} v_\pi(s) & \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ & = \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] + \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] \end{align}$

That last line follows from the linearity of expectation values. $R_{t+1}$ is the reward the agent gains after taking an action at time step $t$. For simplicity, I assume that it can take on a finite number of values $r \in \mathcal{R}$.

Work on the first term. In words, I need to compute the expectation value of $R_{t+1}$ given that we know that the current state is $s$. The formula for this is

$\begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] = \sum_{r \in \mathcal{R}} r p(r|s). \end{align}$

In other words, the probability of the appearance of reward $r$ is conditioned on the state $s$; different states may have different rewards. This $p(r|s)$ distribution is a marginal distribution of a distribution that also contained the variables $a$ and $s'$, the action taken at time $t$ and the state at time $t+1$ after the action, respectively:

$\begin{align} p(r|s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s',a,r|s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a|s) p(s',r | a,s). \end{align}$

Here I have used $\pi(a|s) \doteq p(a|s)$, following the book's convention. If that last equality is confusing, forget the sums, suppress the $s$ (the probability now looks like a joint probability), use the law of multiplication and finally reintroduce the condition on $s$ in all the new terms. It is now easy to see that the first term is

$\begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} r \pi(a|s) p(s',r | a,s), \end{align}$

as required. On to the second term, where I assume that $G_{t+1}$ is a random variable that takes on a finite number of values $g \in \Gamma$. Just like the first term:

$\begin{align} \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] = \sum_{g \in \Gamma} g p(g|s). \qquad\qquad\qquad\qquad (*) \end{align}$
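Before continuing with the second term, the first-term result above can be sanity-checked numerically. The following is only a minimal sketch on a made-up MDP: the state, action, and reward sets and the random probabilities are assumptions for illustration, not taken from the book. It computes $\mathbb{E}_{\pi}[R_{t+1} | S_t = s]$ once through the marginal $p(r|s)$ and once through the triple sum; the two numbers should agree up to floating-point error.

```python
import random

random.seed(0)
STATES = range(3)            # hypothetical state set
ACTIONS = range(2)           # hypothetical action set
REWARDS = [-1.0, 0.0, 2.0]   # hypothetical finite reward set

def normalised(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# policy[s][a] = pi(a | s), drawn at random for illustration
policy = [normalised([random.random() for _ in ACTIONS]) for _ in STATES]

# dynamics[s][a][(s_next, r)] = p(s_next, r | a, s), also random
outcomes = [(s_next, r) for s_next in STATES for r in REWARDS]
dynamics = [
    [dict(zip(outcomes, normalised([random.random() for _ in outcomes])))
     for _ in ACTIONS]
    for _ in STATES
]

def p_r_given_s(r, s):
    """p(r | s) = sum_{s'} sum_a pi(a | s) p(s', r | a, s)."""
    return sum(policy[s][a] * dynamics[s][a][(s_next, r)]
               for s_next in STATES for a in ACTIONS)

def expected_reward_marginal(s):
    """sum_r r p(r | s)."""
    return sum(r * p_r_given_s(r, s) for r in REWARDS)

def expected_reward_triple_sum(s):
    """sum_r sum_{s'} sum_a r pi(a | s) p(s', r | a, s)."""
    return sum(r * policy[s][a] * dynamics[s][a][(s_next, r)]
               for r in REWARDS for s_next in STATES for a in ACTIONS)

for s in STATES:
    print(s, expected_reward_marginal(s), expected_reward_triple_sum(s))
```

Both columns print the same value for every state, which is just the statement that summing over $r$ last or first makes no difference.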

Once again, I "un-marginalize" the probability distribution (law of multiplication again) by writing

$\begin{align} p(g|s) & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s',r,a,g|s) = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s', r, a, s) p(s', r, a | s) \\ & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s', r, a, s) p(s', r | a, s) \pi(a | s) \\ & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s') p(s', r | a, s) \pi(a | s) \qquad\qquad\qquad\qquad (**) \end{align}$

The last line follows from the Markov property. Remember that $G_{t+1}$ is the sum of all the future (discounted) rewards that the agent receives after reaching state $s'$. The Markov property says that the process is memoryless with regard to previous states, actions, and rewards. Future actions (and the rewards they reap) depend only on the state in which the action is taken, so $p(g | s', r, a, s) = p(g | s')$ by assumption. The second term in the proof is now

$\begin{align} \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] & = \gamma \sum_{g \in \Gamma} \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} g p(g | s') p(s', r | a, s) \pi(a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathbb{E}_{\pi}\left[ G_{t+1} | S_{t+1} = s' \right] p(s', r | a, s) \pi(a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} v_{\pi}(s') p(s', r | a, s) \pi(a | s) \end{align}$

as required, once again. Combining the two terms completes the proof:

$\begin{align} v_\pi(s) & \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \sum_{a \in \mathcal{A}} \pi(a | s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r | a, s) \left[ r + \gamma v_{\pi}(s') \right]. \end{align}$
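The final equation can also be checked numerically. What follows is a minimal sketch, not a definitive implementation: the two-state MDP, the policy, $\gamma = 0.9$, and the truncation of sampled returns at 80 steps are all assumptions made for illustration. It finds $v_\pi$ as the fixed point of the right-hand side above and compares it with a Monte Carlo average of sampled returns, which estimates the definition $\mathbb{E}_\pi[G_t \mid S_t = s]$ directly.

```python
import random

GAMMA = 0.9
STATES = (0, 1)
# Hypothetical policy: POLICY[s][a] = pi(a | s)
POLICY = {0: [0.5, 0.5], 1: [0.3, 0.7]}
# Hypothetical dynamics: DYNAMICS[(s, a)] lists tuples (p(s', r | a, s), s', r)
DYNAMICS = {
    (0, 0): [(0.8, 0, 1.0), (0.2, 1, 0.0)],
    (0, 1): [(0.5, 1, 2.0), (0.5, 0, 0.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(0.6, 1, 1.0), (0.4, 0, -1.0)],
}

def bellman_rhs(s, v):
    """sum_a pi(a|s) sum_{s', r} p(s', r | a, s) [r + gamma v(s')]."""
    return sum(
        pi_a * prob * (reward + GAMMA * v[s_next])
        for a, pi_a in enumerate(POLICY[s])
        for prob, s_next, reward in DYNAMICS[(s, a)]
    )

def evaluate_policy(sweeps=500):
    """Apply the Bellman right-hand side repeatedly until it stops changing."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {s: bellman_rhs(s, v) for s in STATES}
    return v

def sample_return(s, rng, horizon=80):
    """One sampled discounted return from S_t = s, truncated at `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choices(range(len(POLICY[s])), weights=POLICY[s])[0]
        outcomes = DYNAMICS[(s, a)]
        _, s, reward = rng.choices(outcomes, weights=[p for p, _, _ in outcomes])[0]
        g += discount * reward
        discount *= GAMMA
    return g

rng = random.Random(0)
v_bellman = evaluate_policy()
v_monte_carlo = {
    s: sum(sample_return(s, rng) for _ in range(3000)) / 3000 for s in STATES
}
for s in STATES:
    print(s, round(v_bellman[s], 3), round(v_monte_carlo[s], 3))
```

The two columns should be close but not identical, since the Monte Carlo column carries sampling noise and a small truncation bias.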

UPDATE

I want to address what might look like a sleight of hand in the derivation of the second term. In the equation marked with $(*)$, I use a term $p(g|s)$, and then later, in the equation marked $(**)$, I claim that $g$ doesn't depend on $s$ by invoking the Markov property. So you might say that if this is the case, then $p(g|s) = p(g)$. But this is not true. I can take $p(g | s', r, a, s) \rightarrow p(g | s')$ because the left side of that statement is the probability of $g$ conditioned on $s'$, $a$, $r$, and $s$. Because we either know or assume the state $s'$, none of the other conditions matter, by the Markov property. If you do not know or assume the state $s'$, then the future rewards (the meaning of $g$) will depend on which state $s$ you begin in, because that determines (through the policy) the distribution of the state $s'$ you start from when computing $g$.

If that argument doesn't convince you, try to compute what $p(g)$ is:

$\begin{align} p(g) & = \sum_{s' \in \mathcal{S}} p(g, s') = \sum_{s' \in \mathcal{S}} p(g | s') p(s') \\ & = \sum_{s' \in \mathcal{S}} p(g | s') \sum_{s,a,r} p(s', a, r, s) \\ & = \sum_{s' \in \mathcal{S}} p(g | s') \sum_{s,a,r} p(s', r | a, s) p(a, s) \\ & = \sum_{s \in \mathcal{S}} p(s) \sum_{s' \in \mathcal{S}} p(g | s') \sum_{a,r} p(s', r | a, s) \pi(a | s) \\ & \doteq \sum_{s \in \mathcal{S}} p(s) p(g|s) = \sum_{s \in \mathcal{S}} p(g,s) = p(g). \end{align}$

As the last line shows, $p(g)$ is the average of $p(g|s)$ over the starting state $s$, weighted by $p(s)$, so in general it is not true that $p(g|s) = p(g)$. The expected value of $g$ depends on which state you start in (i.e. the identity of $s$) if you do not know or assume the state $s'$.
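To see this dependence on $s$ concretely, here is a minimal sketch under stated assumptions: the two-state MDP and policy are made up, and $G_{t+1}$ is truncated to three rewards so that it takes only finitely many values $g$. It builds $p(g|s')$ by exact enumeration and then forms $p(g|s)$ from equation $(**)$.

```python
from collections import defaultdict

GAMMA = 0.9
HORIZON = 3  # assumption: truncate G_{t+1} to 3 rewards so it has finitely many values
# Hypothetical policy: POLICY[s][a] = pi(a | s)
POLICY = {0: [0.5, 0.5], 1: [0.3, 0.7]}
# Hypothetical dynamics: DYNAMICS[(s, a)] lists tuples (p(s', r | a, s), s', r)
DYNAMICS = {
    (0, 0): [(0.8, 0, 1.0), (0.2, 1, 0.0)],
    (0, 1): [(0.5, 1, 2.0), (0.5, 0, 0.0)],
    (1, 0): [(1.0, 0, 0.0)],
    (1, 1): [(0.6, 1, 1.0), (0.4, 0, -1.0)],
}

def return_dist(state, steps):
    """Exact distribution {g: p(g | state)} of the return truncated to `steps` rewards."""
    if steps == 0:
        return {0.0: 1.0}
    dist = defaultdict(float)
    for a, pi_a in enumerate(POLICY[state]):
        for prob, s_next, reward in DYNAMICS[(state, a)]:
            for g_next, p_next in return_dist(s_next, steps - 1).items():
                # Round so that equal returns reached by different paths share one key.
                dist[round(reward + GAMMA * g_next, 10)] += pi_a * prob * p_next
    return dict(dist)

def p_g_given_s(s):
    """Equation (**): p(g | s) = sum_{r, s', a} p(g | s') p(s', r | a, s) pi(a | s)."""
    dist = defaultdict(float)
    for a, pi_a in enumerate(POLICY[s]):
        for prob, s_next, _reward in DYNAMICS[(s, a)]:
            for g, p_g in return_dist(s_next, HORIZON).items():
                dist[g] += pi_a * prob * p_g
    return dict(dist)

def mean(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(g * p for g, p in dist.items())

for s in (0, 1):
    print(s, round(mean(p_g_given_s(s)), 4))
```

The two printed expectations differ, so in this toy example $p(g|s)$ cannot be equal to a single $s$-independent $p(g)$, even though $p(g|s')$ was used inside the sum.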

— Finncent Price
source