Matrix form of backpropagation with batch normalization

Batch normalization has been credited with substantial performance improvements in deep neural nets. Plenty of material on the internet shows how to implement it on an activation-by-activation basis. I've already implemented backprop using matrix algebra, and given that I'm working in high-level languages (while relying on Rcpp (and eventually GPUs) for dense matrix multiplication), ripping everything out and resorting to for-loops would probably slow my code substantially, in addition to being a huge pain.

The batch normalization function is

$b(x_p) = \gamma \left(x_p - \mu_{x_p}\right) \sigma^{-1}_{x_p} + \beta$

where

$x_p$ is the $p$th node, before it gets activated
$\gamma$ and $\beta$ are scalar parameters
$\mu_{x_p}$ and $\sigma_{x_p}$ are the mean and SD of $x_p$. (Note that the square root of the variance plus a fudge factor is normally used; let's assume nonzero elements for compactness.)

In matrix form, batch normalization for a whole layer would be

$b(\mathbf{X}) = \left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X} - \mu_{\mathbf{X}}\right) \odot\sigma^{-1}_{\mathbf{X}} + \left(\beta\otimes\mathbf{1}_N\right)$

where

$\mathbf{X}$ is $N\times p$
$\mathbf{1}_N$ is a column vector of ones
$\gamma$ and $\beta$ are now row $p$-vectors of the per-layer normalization parameters
$\mu_{\mathbf{X}}$ and $\sigma_{\mathbf{X}}$ are $N \times p$ matrices, where each column is an $N$-vector of columnwise means and standard deviations
$\otimes$ is the Kronecker product and $\odot$ is the elementwise (Hadamard) product
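
As a minimal sketch of the matrix form above in base R: the Kronecker products simply repeat each parameter down its column, so the result should agree with a column-by-column computation. The toy sizes and the names `gam`, `bet`, `ones_N`, `bX` and `bX_loop` are illustrative assumptions, not part of the setup used later in the post.

```r
# Hypothetical toy sizes and parameters, for illustration only
N <- 5; p <- 3
set.seed(42)
X <- matrix(rnorm(N * p), nrow = N)
gam <- c(2, 3, 4)        # row p-vector of scale parameters
bet <- c(-1, 0, 1)       # row p-vector of shift parameters
ones_N <- matrix(1, nrow = N, ncol = 1)

# gamma (x) 1_N and beta (x) 1_N: N x p matrices with constant columns
gk <- t(gam) %x% ones_N
bk <- t(bet) %x% ones_N

# mu_X and sigma_X as N x p matrices with constant columns
mu    <- ones_N %*% t(colMeans(X))
sigma <- ones_N %*% t(apply(X, 2, sd))

# matrix form: everything is elementwise once the parameters are expanded
bX <- gk * (X - mu) / sigma + bk

# column-by-column version for comparison
bX_loop <- sapply(1:p, function(j) {
  gam[j] * (X[, j] - mean(X[, j])) / sd(X[, j]) + bet[j]
})
```

Note that `sd()` uses the $N-1$ denominator, which is also what `scale()` does in the code further down.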

A very simple one-layer neural net with no batch normalization and a continuous outcome is

$y = a\left(\mathbf{X\Gamma}_1\right)\Gamma_2 + \epsilon$

where

$\Gamma_1$ is $p_1 \times p_2$
$\Gamma_2$ is $p_2 \times 1$
$a(.)$ is the activation function

If the loss is $R = N^{-1}\displaystyle\sum\left(y - \hat{y}\right)^2$, then the gradients (dropping the constant factor $N^{-1}$) are

$\begin{array}{lr} \frac{\partial R}{\partial \Gamma_2} = -2\mathbf{V}^T \hat\epsilon\\ \frac{\partial R}{\partial \Gamma_1} = \mathbf{X}^T \left(a'(\mathbf{X}\mathbf{\Gamma}_1) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T\right) \\ \end{array}$

where

$\mathbf{V} = a\left(\mathbf{X}\Gamma_1\right)$
$\hat{\epsilon} = y-\hat{y}$
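
A finite-difference check is a cheap way to confirm matrix gradients like these. The sketch below is an assumption-laden toy, not the setup used later: it uses `tanh` as a smooth stand-in activation, the summed squared error (the loss without the $N^{-1}$ factor), and a hypothetical helper `num_grad`. It compares the $p_2 \times 1$ gradient for $\Gamma_2$ and the $p_1 \times p_2$ gradient for $\Gamma_1$ against central differences.

```r
set.seed(1)
N <- 6; p1 <- 3; p2 <- 2                      # hypothetical toy sizes
X  <- matrix(rnorm(N * p1), N)
G1 <- matrix(rnorm(p1 * p2), p1)
G2 <- matrix(rnorm(p2), p2)
y  <- matrix(rnorm(N), N)
act  <- tanh                                   # smooth activation (assumption)
actp <- function(v) 1 - tanh(v)^2              # its derivative

# summed squared error, i.e. the loss without the 1/N factor
loss <- function(G1, G2) sum((y - act(X %*% G1) %*% G2)^2)

# analytic gradients in matrix form
V   <- act(X %*% G1)
eps <- y - V %*% G2
grad_G2 <- -2 * t(V) %*% eps
grad_G1 <- t(X) %*% (actp(X %*% G1) * (-2 * eps %*% t(G2)))

# central finite differences, one element of M at a time
num_grad <- function(f, M, h = 1e-6) {
  g <- M * 0
  for (k in seq_along(M)) {
    Mp <- M; Mm <- M
    Mp[k] <- Mp[k] + h
    Mm[k] <- Mm[k] - h
    g[k] <- (f(Mp) - f(Mm)) / (2 * h)
  }
  g
}
num_G1 <- num_grad(function(M) loss(M, G2), G1)
num_G2 <- num_grad(function(M) loss(G1, M), G2)

max(abs(grad_G1 - num_G1))   # should be close to zero
max(abs(grad_G2 - num_G2))   # should be close to zero
```

The same `num_grad` idea can be reused for every gradient derived below, which is more informative than comparing two algebraically identical formulas.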

Under batch normalization, the net becomes

$y = a\left(b\left(\mathbf{X}\Gamma_1\right)\right)\Gamma_2$

or

$y = a\Big(\left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X\Gamma_1} - \mu_{\mathbf{X\Gamma_1}}\right) \odot\sigma^{-1}_{\mathbf{X\Gamma_1}} + \left(\beta\otimes\mathbf{1}_N\right)\Big)\mathbf{\Gamma_2}$

I have no idea how to compute the derivatives of Hadamard and Kronecker products. On the subject of Kronecker products, the literature gets fairly arcane.

Is there a practical way of computing $\partial R/\partial \gamma$, $\partial R/\partial \beta$, and $\partial R/\partial \mathbf{\Gamma_1}$ within the matrix framework? A simple expression, without resorting to node-by-node computation?

Update 1:

I've figured out $\partial R/\partial \beta$, sort of. It is:

$\mathbf{1}_{N}^T \left(a'(b(\mathbf{X}\mathbf{\Gamma}_1)) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T\right)$

Some R code demonstrates that this is equivalent to the looping way of doing it. First, set up the fake data:

set.seed(1)
library(dplyr)
library(foreach)

# numbers of observations, input variables, and hidden units
N <- 10
p1 <- 7
p2 <- 4
a <- function (v) {
  v[v < 0] <- 0
  v
}
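# intended as the ReLU derivative; as written, negatives are zeroed first,
# so the second assignment sets every element to 1
ap <- function (v) {
  v[v < 0] <- 0
  v[v >= 0] <- 1
  v
}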

# parameters
G1 <- matrix(rnorm(p1*p2), nrow = p1)
G2 <- rnorm(p2)
gamma <- 1:p2+1
beta <- (1:p2+1)*-1
# error
u <- rnorm(N)

# matrix batch norm function
b <- function(x, bet = beta, gam = gamma){
  xs <- scale(x)
  gk <- t(matrix(gam)) %x% matrix(rep(1, N))
  bk <- t(matrix(bet)) %x% matrix(rep(1, N))
  gk*xs+bk
}
# activation-wise batch norm function
bi <- function(x, i){
  xs <- scale(x)
  gk <- t(matrix(gamma[i]))
  bk <- t(matrix(beta[i]))
  suppressWarnings(gk*xs[,i]+bk)
}

X <- round(runif(N*p1, -5, 5)) %>% matrix(nrow = N)
# the neural net
y <- a(b(X %*% G1)) %*% G2 + u
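
A small aside on the ReLU derivative used in this setup. The sketch below copies the definition under the hypothetical name `ap_orig` and compares it with a possible alternative, `ap_alt` (also a hypothetical name), which uses the common convention that the derivative is 0 at and below zero.

```r
# the definition from the setup, copied under a different name
ap_orig <- function(v) {
  v[v < 0] <- 0      # negatives become 0 ...
  v[v >= 0] <- 1     # ... so every element now satisfies v >= 0
  v
}

# one possible alternative: 1 where the input is positive, 0 elsewhere;
# multiplying by 1 converts the logical result to numeric and keeps dimensions
ap_alt <- function(v) (v > 0) * 1

z <- c(-2, -0.5, 0, 3)
ap_orig(z)   # 1 1 1 1
ap_alt(z)    # 0 0 0 1
```

With a derivative that is identically 1, a loop-versus-matrix comparison cannot reveal whether the argument passed to the derivative is the right one, so it may be worth re-running the checks below with something like `ap_alt`.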

Then compute the derivatives:

# drdbeta -- the matrix way
drdb <- matrix(rep(1, N*1), nrow = 1) %*% (-2*u %*% t(G2) * ap(b(X%*%G1)))
drdb
           [,1]      [,2]    [,3]        [,4]
[1,] -0.4460901 0.3899186 1.26758 -0.09589582
# the looping way
foreach(i = 1:4, .combine = c) %do%{
  sum(-2*u*matrix(ap(bi(X[,i, drop = FALSE]%*%G1[i,], i)))*G2[i])
}
[1] -0.44609015  0.38991862  1.26758024 -0.09589582
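
The $\partial R/\partial \beta$ expression can also be checked against finite differences rather than against another closed form. This is a self-contained sketch under assumptions that differ from the setup above: toy sizes, a smooth `tanh` activation, summed squared error, and hypothetical helpers `bn` and `loss_beta`.

```r
set.seed(2)
N <- 8; p1 <- 3; p2 <- 2                       # hypothetical toy sizes
X  <- matrix(rnorm(N * p1), N)
G1 <- matrix(rnorm(p1 * p2), p1)
G2 <- matrix(rnorm(p2), p2)
y  <- matrix(rnorm(N), N)
gam <- c(1.5, 0.5); bet <- c(0.2, -0.3)
act  <- tanh                                    # smooth activation (assumption)
actp <- function(v) 1 - tanh(v)^2

# batch norm: standardize each column, then scale by gam and shift by bet
bn <- function(Z, gam, bet) {
  Zt <- (t(Z) - colMeans(Z)) / apply(Z, 2, sd)  # p x N, one row per column of Z
  t(gam * Zt + bet)
}
loss_beta <- function(bet) sum((y - act(bn(X %*% G1, gam, bet)) %*% G2)^2)

# analytic gradient: 1_N^T (a'(b(X G1)) * -2 eps G2^T), i.e. column sums of the stub
B    <- bn(X %*% G1, gam, bet)
eps  <- y - act(B) %*% G2
stub <- actp(B) * (-2 * eps %*% t(G2))
grad_beta <- as.vector(matrix(1, 1, N) %*% stub)

# central finite differences in each element of bet
h <- 1e-6
num_beta <- sapply(seq_along(bet), function(k) {
  e <- replace(numeric(length(bet)), k, h)
  (loss_beta(bet + e) - loss_beta(bet - e)) / (2 * h)
})

max(abs(grad_beta - num_beta))   # should be close to zero
```

This works because $\beta$ enters $b(\cdot)$ additively, so its gradient is just the column sums of the stub.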

They match. But I'm still confused, because I don't really know why this works. The MatCalc notes referenced by @Mark L. Stone say that the derivative of $\beta \otimes \mathbf{1}_N$ should be

$\frac{\partial A \otimes B}{\partial A} = \left(I_{nq} \otimes T_{mp}\right)\left(I_n\otimes vec(B) \otimes I_m\right)$

where the subscripts $m$, $n$ and $p$, $q$ are the dimensions of $A$ and $B$. $T$ is the commutation matrix, which is just 1 here because both inputs are vectors. I try this and get a result that doesn't seem helpful:

# playing with the Kronecker derivative rule
A <- t(matrix(beta)) 
B <- matrix(rep(1, N))
diag(rep(1, ncol(A) *ncol(B))) %*% diag(rep(1, ncol(A))) %x% (B) %x% diag(nrow(A))
     [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    1    0    0    0
 snip
[13,]    0    1    0    0
[14,]    0    1    0    0
snip
[28,]    0    0    1    0
[29,]    0    0    1    0
snip
[39,]    0    0    0    1
[40,]    0    0    0    1

This isn't conformable. Clearly I'm not understanding those Kronecker derivative rules. Help with those would be great. I'm still totally stuck on the other derivatives, for $\gamma$ and $\mathbf{\Gamma_1}$ -- those are harder because they don't enter additively like $\beta \otimes \mathbf{1}$ does.
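
One way to read the $40 \times 4$ matrix above: it has the pattern of $I_p \otimes \mathbf{1}_N$, which is the Jacobian of $vec(\beta \otimes \mathbf{1}_N)$ with respect to $\beta$. It is not meant to be multiplied with an $N \times p$ matrix directly; it conforms with the vectorized stub. A small sketch in base R, where `stub` is a random stand-in (an assumption, not the stub computed above) and the other names are illustrative:

```r
N <- 10; p <- 4
set.seed(3)
stub <- matrix(rnorm(N * p), N)       # stand-in for the N x p "stub" (assumption)
beta_row <- c(-2, -3, -4, -5)
ones_N <- matrix(1, N, 1)

# Jacobian of vec(beta (x) 1_N) with respect to beta: I_p (x) 1_N, which is Np x p
J <- diag(p) %x% ones_N

# vec(beta (x) 1_N) is linear in beta, so it equals J %*% beta exactly
lhs <- as.vector(t(beta_row) %x% ones_N)
rhs <- as.vector(J %*% beta_row)

# chain rule on the vectorized stub: J^T vec(stub) is the same as 1_N^T stub
g_vec <- as.vector(t(J) %*% as.vector(stub))
g_mat <- as.vector(t(ones_N) %*% stub)

all.equal(lhs, rhs)
all.equal(g_vec, g_mat)
```

So the Kronecker rule and the $\mathbf{1}_N^T(\cdot)$ shortcut agree; the shortcut just avoids building the $Np \times p$ matrix.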

Update 2

Reading textbooks, I'm pretty sure that $\partial R/\partial \Gamma_1$ and $\partial R/\partial \gamma$ will require use of the vec() operator. But I'm apparently unable to follow the derivations sufficiently to translate them into code. For example, $\partial R/\partial \Gamma_1$ is going to involve taking the derivative of $w\odot\mathbf{X\Gamma_1}$ with respect to $\mathbf{\Gamma_1}$, where $w \equiv (\gamma \otimes \mathbf{1}) \odot \sigma_{\mathbf{X\Gamma_1}}^{-1}$ (which we can treat as a constant matrix for the moment).

My instinct is to simply say "the answer is $w\odot\mathbf{X}$", but that obviously doesn't work because $w$ isn't conformable with $\mathbf{X}$.

I know that

$\partial(A \odot B) = \partial A \odot B + A \odot \partial B$

and from this, that

$\frac{\partial vec(w \odot \mathbf{X\Gamma_1})}{\partial vec(\mathbf{\Gamma_1})^T} = vec(\mathbf{X\Gamma_1})I\frac{\partial vec(w)}{\partial vec(\mathbf{\Gamma_1})^T} + vec(w)I\frac{\partial vec(\mathbf{X\Gamma_1})}{\partial vec(\mathbf{\Gamma_1})^T}$
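
For the term in which $w$ really is held constant, the vec form can be written out with two standard identities, $vec(A \odot B) = diag(vec(A))\,vec(B)$ and $vec(AXB) = (B^T \otimes A)\,vec(X)$. A sketch under that fixed-$w$ assumption, where $S$ is shorthand introduced here for the $N \times p_2$ matrix that multiplies the Jacobian in the chain rule:

```latex
\begin{aligned}
\mathrm{vec}(w \odot \mathbf{X\Gamma_1})
  &= \mathrm{diag}(\mathrm{vec}(w))\,(I_{p_2} \otimes \mathbf{X})\,\mathrm{vec}(\mathbf{\Gamma_1}) \\
\frac{\partial\, \mathrm{vec}(w \odot \mathbf{X\Gamma_1})}{\partial\, \mathrm{vec}(\mathbf{\Gamma_1})^T}
  &= \mathrm{diag}(\mathrm{vec}(w))\,(I_{p_2} \otimes \mathbf{X}) \\
(I_{p_2} \otimes \mathbf{X})^T\,\mathrm{diag}(\mathrm{vec}(w))\,\mathrm{vec}(S)
  &= (I_{p_2} \otimes \mathbf{X}^T)\,\mathrm{vec}(w \odot S)
   = \mathrm{vec}\!\left(\mathbf{X}^T (w \odot S)\right)
\end{aligned}
```

The Jacobian is $Np_2 \times p_1 p_2$, but it never has to be formed: the last line collapses back to an ordinary matrix product. Because $w$ depends on $\mathbf{\Gamma_1}$ through $\sigma_{\mathbf{X\Gamma_1}}$ (and the centering depends on it through $\mu_{\mathbf{X\Gamma_1}}$), this covers only the fixed-$w$ part of the product rule.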

Update 3

Making progress here. I woke up at 2AM last night with this idea. Math is not good for sleep.

Here is $\partial R/\partial \mathbf{\Gamma_1}$, with some shorthand:

$w \equiv (\gamma \otimes \mathbf{1}) \odot \sigma_{\mathbf{X\Gamma_1}}^{-1}$
$\text{"stub"} \equiv a'(b(\mathbf{X\Gamma}_1)) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T$

The chain rule gets you as far as

$\frac{\partial R}{\partial \Gamma_1} = \frac{\partial w \odot \mathbf{X\Gamma}_1}{\partial \Gamma_1}\left(\text{"stub"}\right)$

Doing this the looping way, with $i$ and $j$ subscripting columns and $\mathbf{I}$ a conformable identity matrix (so that $\mathbf{I} w_j$ stands for the diagonal matrix built from $w_j$):

$\frac{\partial R}{\partial \Gamma_{ij}} = \left(w_j \odot \mathbf{X_i}\right)^T\left(\text{"stub"}_j\right)$

$\frac{\partial R}{\partial \Gamma_{ij}} = \left(\mathbf{I} w_j \mathbf{X_i}\right)^T\left(\text{"stub"}_j\right)$

$\frac{\partial R}{\partial \Gamma_{ij}} = \mathbf{X_i}^T\mathbf{I} w_j\left(\text{"stub"}_j\right)$

tl;dr you're basically pre-multiplying the stub by the batchnorm scale factors. This should be equivalent to:

$\frac{\partial R}{\partial \Gamma} = \mathbf{X}^T\left(\text{"stub"}\odot w\right)$

And, in fact it is:

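# "stub": the N x p2 matrix a'(b(X G1)) * (-2 u G2^T)
stub <- (-2*u %*% t(G2) * ap(b(X%*%G1)))
# w: (gamma %x% 1_N) times the columnwise SDs of X G1, expanded to N x p2.
# Note: the definition of w above uses sigma^{-1}; this line multiplies by sd,
# and the output printed below was produced with it as written.
w <- t(matrix(gamma)) %x% matrix(rep(1, N)) * (apply(X%*%G1, 2, sd) %>% t %x% matrix(rep(1, N)))
# matrix form of the gradient, treating w as fixed
drdG1 <- t(X) %*% (stub*w)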

loop_drdG1 <- drdG1*NA
for (i in 1:7){
  for (j in 1:4){
    loop_drdG1[i,j] <- t(X[,i]) %*% diag(w[,j]) %*% (stub[,j])
  }
}

> loop_drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965
> drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965
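
The two results above agree by construction, since both treat $w$ as fixed. For comparison, here is a hedged sketch of a matrix-form gradient that also differentiates through the columnwise mean and SD, checked against finite differences. Everything in it is an assumption made for illustration: toy sizes, a smooth `tanh` activation, summed squared error, `sd()` with its $N-1$ denominator, and the hypothetical helpers `standardize`, `row_rep` and `num_grad`.

```r
set.seed(4)
N <- 8; p1 <- 3; p2 <- 2                         # hypothetical toy sizes
X  <- matrix(rnorm(N * p1), N)
G1 <- matrix(rnorm(p1 * p2), p1)
G2 <- matrix(rnorm(p2), p2)
y  <- matrix(rnorm(N), N)
gam <- c(1.5, 0.5); bet <- c(0.2, -0.3)
act  <- tanh                                      # smooth activation (assumption)
actp <- function(v) 1 - tanh(v)^2

standardize <- function(Z) t((t(Z) - colMeans(Z)) / apply(Z, 2, sd))
row_rep <- function(v) matrix(v, N, length(v), byrow = TRUE)   # v (x) 1_N
loss <- function(G1) {
  B <- row_rep(gam) * standardize(X %*% G1) + row_rep(bet)
  sum((y - act(B) %*% G2)^2)
}

# forward pass
Z    <- X %*% G1
s    <- apply(Z, 2, sd)
Zhat <- standardize(Z)
B    <- row_rep(gam) * Zhat + row_rep(bet)
eps  <- y - act(B) %*% G2
stub <- actp(B) * (-2 * eps %*% t(G2))            # gradient w.r.t. b(X G1)

# backward pass through the normalization, all in matrix form
dZhat <- stub * row_rep(gam)
dZ <- (dZhat - row_rep(colMeans(dZhat)) -
       Zhat * row_rep(colSums(dZhat * Zhat) / (N - 1))) / row_rep(s)
grad_G1 <- t(X) %*% dZ

# fixed-w shortcut, for comparison
grad_G1_fixed_w <- t(X) %*% (stub * row_rep(gam / s))

# central finite differences
num_grad <- function(f, M, h = 1e-6) {
  g <- M * 0
  for (k in seq_along(M)) {
    Mp <- M; Mm <- M
    Mp[k] <- Mp[k] + h
    Mm[k] <- Mm[k] - h
    g[k] <- (f(Mp) - f(Mm)) / (2 * h)
  }
  g
}
num_G1 <- num_grad(loss, G1)

max(abs(grad_G1 - num_G1))           # should be close to zero
max(abs(grad_G1_fixed_w - num_G1))   # generally not zero
```

The two correction terms in `dZ` come from the dependence of the column means and SDs on $\mathbf{X\Gamma_1}$; they are column-wise reductions, so no node-by-node loop is needed.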

Update 4

Here, I think, is $\partial R / \partial \gamma$ . First

$\widetilde{\mathbf{X\Gamma}} \equiv \left(\mathbf{X\Gamma} - \mu_{\mathbf{X\Gamma}}\right)\odot \sigma^{-1}_\mathbf{X\Gamma}$
$\tilde\gamma \equiv \gamma \otimes\mathbf{1}_N$

Similar to before, the chain rule gets you as far as

$\frac{\partial R}{\partial \tilde\gamma} = \frac{\partial \tilde\gamma \odot \widetilde{\mathbf{X\Gamma}}}{\partial \tilde\gamma}\left(\text{"stub"}\right)$

Looping gives you

$\frac{\partial R}{\partial \tilde\gamma_i} = (\widetilde{\mathbf{X\Gamma}})_i^T \mathbf{I}\tilde\gamma_i \left(\text{"stub"}_i\right)$

which, like before, is basically pre-multiplying the stub. It should therefore be equivalent to:

$\frac{\partial R}{\partial \tilde\gamma} = (\widetilde{\mathbf{X\Gamma}})^T \left(\text{"stub"} \odot \tilde\gamma \right)$

It sort of matches:

drdg <- t(scale(X %*% G1)) %*% (stub * t(matrix(gamma)) %x% matrix(rep(1, N)))

loop_drdg <- foreach(i = 1:4, .combine = c) %do% {
  t(scale(X %*% G1)[,i]) %*% (stub[,i, drop = F] * gamma[i])  
}

> drdg
           [,1]      [,2]       [,3]       [,4]
[1,]  0.8580574 -1.125017  -4.876398  0.4611406
[2,] -4.5463304  5.960787  25.837103 -2.4433071
[3,]  2.0706860 -2.714919 -11.767849  1.1128364
[4,] -8.5641868 11.228681  48.670853 -4.6025996
> loop_drdg
[1]   0.8580574   5.9607870 -11.7678486  -4.6025996

The diagonal on the first is the same as the vector on the second. But really since the derivative is with respect to a matrix -- albeit one with a certain structure, the output should be a similar matrix with the same structure. Should I take the diagonal of the matrix approach and simply take it to be $\gamma$ ? I'm not sure.
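
A finite-difference check can help settle questions like this one. The sketch below is an illustration under assumptions (toy sizes, smooth `tanh` activation, summed squared error, hypothetical helper names), not the setup above. It compares central differences in $\gamma$ with the column sums of the standardized pre-activations times the stub, which is the same as the diagonal of their cross-product and carries no extra factor of $\gamma$.

```r
set.seed(6)
N <- 8; p1 <- 3; p2 <- 2                         # hypothetical toy sizes
X  <- matrix(rnorm(N * p1), N)
G1 <- matrix(rnorm(p1 * p2), p1)
G2 <- matrix(rnorm(p2), p2)
y  <- matrix(rnorm(N), N)
gam <- c(1.5, 0.5); bet <- c(0.2, -0.3)
act  <- tanh                                      # smooth activation (assumption)
actp <- function(v) 1 - tanh(v)^2

standardize <- function(Z) t((t(Z) - colMeans(Z)) / apply(Z, 2, sd))
row_rep <- function(v) matrix(v, N, length(v), byrow = TRUE)   # v (x) 1_N
loss_gamma <- function(gam) {
  B <- row_rep(gam) * standardize(X %*% G1) + row_rep(bet)
  sum((y - act(B) %*% G2)^2)
}

Zhat <- standardize(X %*% G1)
B    <- row_rep(gam) * Zhat + row_rep(bet)
eps  <- y - act(B) %*% G2
stub <- actp(B) * (-2 * eps %*% t(G2))

# candidate gradient: one number per column, no factor of gamma
grad_gamma <- colSums(Zhat * stub)
# the same numbers sit on the diagonal of the p2 x p2 cross-product
grad_gamma_diag <- diag(t(Zhat) %*% stub)

# central finite differences in each element of gam
h <- 1e-6
num_gamma <- sapply(seq_along(gam), function(k) {
  e <- replace(numeric(length(gam)), k, h)
  (loss_gamma(gam + e) - loss_gamma(gam - e)) / (2 * h)
})

max(abs(grad_gamma - num_gamma))   # should be close to zero
```

The off-diagonal entries of the cross-product pair column $i$ of the normalized data with column $j$ of the stub, which is why only the diagonal corresponds to $\partial R/\partial \gamma$.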

It seems that I have answered my own question but I am unsure whether I am correct. At this point I will accept an answer that rigorously proves (or disproves) what I've sort of hacked together.

while(not_answered){
  print("Bueller?")
  Sys.sleep(1)
}

— generic_user

Chapter 9 section 14 of "Matrix Differential Calculus with Applications in Statistics and Econometrics" by Magnus and Neudecker, 3rd edition janmagnus.nl/misc/mdc2007-3rdedition covers differentials of Kronecker products and concludes with an exercise on the differential of the Hadamard product. "Notes on Matrix Calculus" by Paul L. Fackler www4.ncsu.edu/~pfackler/MatCalc.pdf has a lot of material on differentiating Kronecker products

— Mark L. Stone

Thanks for the references. I've found those MatCalc notes before, but they don't cover Hadamard, and anyway I'm never certain whether a rule from non-matrix calculus applies or doesn't apply to the matrix case. Product rules, chain rules, etc. I'll look into the book. I'd accept an answer that points me to all of the ingredients I need to pencil it out myself...

— generic_user

Why are you doing this? Why not use frameworks such as Keras/TensorFlow? It's a waste of productive time to implement these low-level algorithms, time that you could spend on solving actual problems

— Aksakal

More precisely, I'm fitting networks that exploit known parametric structure -- both in terms of linear-in-parameters representations of input data, as well as longitudinal/panel structure. Established frameworks are so heavily optimized as to be beyond my ability to hack/modify. Plus math is helpful generally. Plenty of codemonkeys have no idea what they're doing. Likewise learning enough Rcpp to implement it efficiently is useful.

— generic_user

@MarkL.Stone not only is it theoretically sound, it's practically easy! A more or less mechanical process! &%#$!

— generic_user

Not a complete answer, but to demonstrate what I suggested in my comment: if

$b(X)=(X-e_N\mu_X^T)\Gamma\Sigma_X^{-1/2}+e_N\beta^T$

where $\Gamma=\mathop{\mathrm{diag}}(\gamma)$, $\Sigma_X^{-1/2}=\mathop{\mathrm{diag}}(\sigma_{X_1}^{-1},\sigma_{X_2}^{-1},\dots)$ and $e_N$ is a vector of ones, then by the chain rule

$\nabla_\beta R=[-2\hat{\epsilon}(\Gamma_2^T\otimes I)J_X(a)(I\otimes e_N)]^T$

Noting that

$-2\hat{\epsilon}(\Gamma_2^T\otimes I)=\mathop{\mathrm{vec}}(-2\hat{\epsilon}\Gamma_2^T)^T$

and

$J_X(a)=\mathop{\mathrm{diag}}(\mathop{\mathrm{vec}}(a^\prime(b(X\Gamma_1))))$,

we see that

$\nabla_\beta R=(I\otimes e_N^T)\mathop{\mathrm{vec}}(a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T)=e_N^T(a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T)$

via the identity $\mathop{\mathrm{vec}}(AXB)=(B^T\otimes A)\mathop{\mathrm{vec}}(X)$. Similarly,

$\begin{aligned}\nabla_\gamma R&=[-2\hat{\epsilon}(\Gamma_2^T\otimes I)J_X(a)(\Sigma_{X\Gamma_1}^{-1/2}\otimes (X\Gamma_1-e_N\mu_{X\Gamma_1}^T))K]^T\\&=K^T\mathop{\mathrm{vec}}((X\Gamma_1-e_N\mu_{X\Gamma_1}^T)^TW\Sigma^{-1/2}_{X\Gamma_1})\\&=\mathop{\mathrm{diag}}((X\Gamma_1-e_N\mu_{X\Gamma_1}^T)^TW\Sigma^{-1/2}_{X\Gamma_1})\end{aligned}$

where $W=a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T$ (the "stub") and $K$ is an $Np\times p$ binary matrix that selects the columns of the Kronecker product corresponding to the diagonal elements of a square matrix. This follows from the fact that $d\Gamma_{i\neq j}=0$. Unlike the first gradient, this expression is not equivalent to the expression you derived. Considering that $b$ is a linear function w.r.t. $\gamma_i$, there should not be a factor of $\gamma_i$ in the gradient. I leave the gradient of $\Gamma_1$ to the OP, but I will say that a derivation with fixed $w$ creates the "explosion" the writers of the article seek to avoid. In practice, you will also need to find the Jacobians of $\Sigma_X$ and $\mu_X$ w.r.t. $X$ and use the product rule.
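
The two identities this derivation leans on are easy to sanity-check numerically. A minimal sketch in base R with arbitrary small matrices (all names and sizes here are illustrative assumptions):

```r
set.seed(5)
A  <- matrix(rnorm(6), 2, 3)
Xm <- matrix(rnorm(12), 3, 4)
Bm <- matrix(rnorm(8), 4, 2)

# vec(A X B) = (B^T (x) A) vec(X); as.vector() stacks columns, which is vec()
lhs_vec <- as.vector(A %*% Xm %*% Bm)
rhs_vec <- as.vector((t(Bm) %x% A) %*% as.vector(Xm))

# diag(vec(P)) vec(Q) = vec(P * Q): a diagonal Jacobian acts as a Hadamard product
P <- matrix(rnorm(12), 3, 4)
Q <- matrix(rnorm(12), 3, 4)
lhs_had <- as.vector(diag(as.vector(P)) %*% as.vector(Q))
rhs_had <- as.vector(P * Q)

all.equal(lhs_vec, rhs_vec)
all.equal(lhs_had, rhs_had)
```

The second identity is what turns $J_X(a)$, a diagonal matrix of activation derivatives, into the elementwise product that appears in the stub.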

— deasmhumnha