Default starting values when fitting a logistic regression with glm

I am wondering how the default starting values are specified in glm.

This post suggests that the default values are set to zeros. This one says that there is an algorithm behind them, but the relevant link is broken.

I tried to fit a simple logistic regression model while tracing the algorithm:

set.seed(123)

x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)

# to see the parameter estimates at each step
# (the `at` index points into the body of glm.fit and may differ between R versions)
trace(glm.fit, quote(print(coefold)), at = list(c(22, 4, 8, 4, 19, 3)))

First, without specifying starting values:

glm(y ~ x, family = "binomial")

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508

In the first step, the starting values are NULL.

Second, I set the starting values to zeros:

glm(y ~ x, family = "binomial", start = c(0, 0))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0 0
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3177530 0.9097521
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3909975 1.1397163
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3994147 1.1666173
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995191 1.1669518

We can see that the iterations differ between the first and the second approach.

To see the starting values chosen by glm, I tried to fit the model with only one iteration:

glm(y ~ x, family = "binomial", control = list(maxit = 1))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
NULL

Call:  glm(formula = y ~ x, family = "binomial", control = list(maxit = 1))

Coefficients:
(Intercept)            x  
     0.3864       1.1062  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      134.6 
Residual Deviance: 115  AIC: 119

The parameter estimates (not surprisingly) match the estimates from the second iteration of the first approach, i.e. [1] 0.386379 1.106234. Setting these values as the starting values leads to the same sequence of iterations as in the first approach:

glm(y ~ x, family = "binomial", start = c(0.386379, 1.106234))

Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.386379 1.106234
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3991135 1.1653971
Tracing glm.fit(x = structure(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  .... step 22,4,8,4,19,3 
[1] 0.3995188 1.1669508
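
The maxit trick used above can also be looped to collect the iterates without relying on trace(). This is only a sketch: coef_after is a hypothetical helper name, and it assumes that stopping glm after k iterations returns the k-th iterate, which is what the single-iteration fit above suggests.

```r
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-x)))

# Hypothetical helper: coefficients after at most k IRLS iterations.
# The non-convergence warning is expected for small k, so it is suppressed.
coef_after <- function(k, start = NULL) {
  suppressWarnings(
    coef(glm(y ~ x, family = binomial(), start = start,
             control = glm.control(maxit = k)))
  )
}

# One row per iteration, one column per coefficient
iterates_default <- t(sapply(1:4, coef_after))
iterates_zero    <- t(sapply(1:4, coef_after, start = c(0, 0)))

iterates_default
iterates_zero
```

The first rows of the two matrices should correspond to the first non-NULL values printed by the trace in each approach.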

So the question is: how are these values calculated?

r logistic-regression glm

— Adela

It's complicated. If you supply start values, they are used in computing what gets passed to the C_Cdqrls routine. If you don't, the values that get passed are calculated (including a call to eval(binomial()$initialize)), but glm.fit never explicitly calculates values for start. Take an hour or two and study the glm.fit code.

— Roland

Thanks for the comment. I tried to study the glm.fit code, but I still have no idea how the starting values are calculated.

— Adela

TL;DR

start=c(b0,b1) initializes eta to b0+x*b1 (mu to 1/(1+exp(-eta))).
start=c(0,0) initializes eta to 0 (mu to 0.5), regardless of the value of y or x.
start=NULL initializes eta = 1.098612 (mu = 0.75) if y = 1, regardless of the value of x.
start=NULL initializes eta = -1.098612 (mu = 0.25) if y = 0, regardless of the value of x.
Once eta (and consequently mu and var(mu)) has been calculated, w and z are calculated and sent to a QR solver, in the spirit of qr.solve(cbind(1,x) * w, z*w).
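
As a quick sanity check of the list above, here is a minimal sketch that does the start=NULL case by hand and compares it with glm stopped after one iteration. It assumes the logit link, unit prior weights and no offset. The names by_hand and via_glm are purely illustrative.

```r
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-x)))

# start = NULL: mu is initialised from y alone (0.75 if y = 1, 0.25 if y = 0)
mu     <- (y + 0.5) / 2
eta    <- qlogis(mu)              # logit link: log(mu / (1 - mu))
mu_eta <- mu * (1 - mu)           # d(mu)/d(eta) for the logit link
var_mu <- mu * (1 - mu)           # binomial variance function
z <- eta + (y - mu) / mu_eta      # working response (offset = 0)
w <- sqrt(mu_eta^2 / var_mu)      # square root of the working weights

# Weighted least squares via QR, as in the TL;DR
by_hand <- qr.solve(cbind(1, x) * w, z * w)

# The same quantity via glm(), stopped after one iteration
via_glm <- suppressWarnings(
  coef(glm(y ~ x, family = binomial(), control = glm.control(maxit = 1)))
)

rbind(by_hand = by_hand, via_glm = via_glm)
```

If the description above is right, the two rows should agree up to numerical tolerance.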

Long form

Building off Roland's comment: I made a glm.fit.truncated(), where I took glm.fit down to the C_Cdqrls call and then commented that call out. glm.fit.truncated outputs the z and w values (as well as the values of the quantities used to calculate z and w) that would be passed to the C_Cdqrls call:

## call Fortran code via C wrapper
fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
             min(1e-7, control$epsilon/1000), check=FALSE)

More about C_Cdqrls can be read here. Luckily, the function qr.solve in base R taps directly into the LINPACK routines that are called in glm.fit().

So we run glm.fit.truncated for the different starting value specifications, then make a call to qr.solve with the resulting w and z values, and see how the "starting values" (or the first displayed iteration values) are calculated. As Roland indicated, specifying start=NULL or start=c(0,0) in glm() affects the calculations for w and z, not for start.

For start=NULL: z is a vector whose elements have the value 2.431946 or -2.431946, and w is a vector whose elements are all 0.4330127:

start.is.null <- glm.fit.truncated(x,y,family=binomial(), start=NULL)
start.is.null
w <- start.is.null$w
z <- start.is.null$z
## if start is NULL, the first displayed values are:
qr.solve(cbind(1,x) * w, z*w)  
# > qr.solve(cbind(1,x) * w, z*w)  
#                 x 
# 0.386379 1.106234

For start=c(0,0): z is a vector whose elements have the value 2 or -2, and w is a vector whose elements are all 0.5:

## if start is c(0,0)
## (x is passed here as a single column, so start = 0 has the right length; either way eta = 0)
start.is.00 <- glm.fit.truncated(x,y,family=binomial(), start=0)
start.is.00
w <- start.is.00$w
z <- start.is.00$z
## if start is c(0,0), the first displayed values are:    
qr.solve(cbind(1,x) * w, z*w)  
# > qr.solve(cbind(1,x) * w, z*w)  
#                   x 
# 0.3177530 0.9097521

So that is all well and good, but how do we calculate w and z? Near the bottom of glm.fit.truncated() we see

z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
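
To make these two formulas concrete, here is a small sketch that evaluates them for a single observation using the functions stored in the binomial family object. It assumes offset = 0 and a prior weight of 1. working_values is a hypothetical helper, not part of glm.fit.

```r
fam <- binomial()

# Hypothetical helper: working response z and weight factor w for one
# observation, assuming offset = 0 and prior weight = 1.
working_values <- function(y, mu) {
  eta    <- fam$linkfun(mu)       # logit(mu)
  mu_eta <- fam$mu.eta(eta)       # d(mu)/d(eta)
  c(z = eta + (y - mu) / mu_eta,
    w = sqrt(mu_eta^2 / fam$variance(mu)))
}

null_y1 <- working_values(y = 1, mu = 0.75)  # start = NULL, y = 1: z =  2.431946, w = 0.4330127
null_y0 <- working_values(y = 0, mu = 0.25)  # start = NULL, y = 0: z = -2.431946, w = 0.4330127
zero_y1 <- working_values(y = 1, mu = 0.5)   # start = c(0,0), y = 1: z =  2, w = 0.5
zero_y0 <- working_values(y = 0, mu = 0.5)   # start = c(0,0), y = 0: z = -2, w = 0.5

rbind(null_y1, null_y0, zero_y1, zero_y0)
```

For the logit link, mu.eta(eta) and variance(mu) both equal mu * (1 - mu), so w reduces to sqrt(mu * (1 - mu)).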

Note the following comparisons between the output values of the quantities used to calculate z and w:

cbind(y, start.is.null$mu, start.is.00$mu)
cbind(y, start.is.null$eta, start.is.00$eta)
cbind(start.is.null$var_mu, start.is.00$var_mu)
cbind(start.is.null$mu.eta.val, start.is.00$mu.eta.val)

Note that in start.is.00 the vector mu contains only the value 0.5, because eta is set to 0 and mu(eta) = 1/(1 + exp(-0)) = 0.5. start.is.null sets those with y = 1 to mu = 0.75 (which corresponds to eta = 1.098612) and those with y = 0 to mu = 0.25 (which corresponds to eta = -1.098612), and thus var_mu = 0.75 * 0.25 = 0.1875.
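
The 0.75 / 0.25 values come from the family's initialize code, which Roland's comment refers to. A minimal sketch of evaluating it in isolation, assuming it only needs y, weights and nobs to be present (init_env is just an illustrative name):

```r
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-x)))

# The initialize code is evaluated in place by glm.fit; here it is run in
# a scratch environment that holds the variables it expects to find.
init_env <- list2env(list(y = y,
                          weights = rep(1, length(y)),
                          nobs = length(y)))
eval(binomial()$initialize, envir = init_env)

mustart <- init_env$mustart
table(y, mustart)   # every y = 1 maps to 0.75, every y = 0 to 0.25
```

With unit weights this agrees with (y + 0.5) / 2, which does not involve x at all.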

It is interesting to note, however, that I changed the seed and reran everything, and still got mu = 0.75 for y = 1 and mu = 0.25 for y = 0 (and thus the other quantities stayed the same). That is to say, start=NULL gives rise to the same values of w and z (with the sign of z following y) regardless of what y and x are, because it initializes eta = 1.098612 (mu = 0.75) if y = 1 and eta = -1.098612 (mu = 0.25) if y = 0.

So it appears that no starting value is set for the Intercept coefficient or for the x coefficient when start=NULL. Instead, initial values are given to eta, depending on the value of y and independent of the value of x. From there w and z are calculated, then sent along with x to qr.solve.
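
One way to check this conclusion from the outside is to hand glm the same initial mu through its mustart argument and compare with the default. This is a sketch under the assumption that the default initial mu is (y + 0.5) / 2 for unit weights. The object names are illustrative.

```r
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = 1 / (1 + exp(-x)))

one_step <- glm.control(maxit = 1)   # stop after the first IRLS iteration
mu0 <- (y + 0.5) / 2                 # assumed default initial mu

default_fit  <- suppressWarnings(
  glm(y ~ x, family = binomial(), control = one_step))
explicit_fit <- suppressWarnings(
  glm(y ~ x, family = binomial(), mustart = mu0, control = one_step))
zero_fit     <- suppressWarnings(
  glm(y ~ x, family = binomial(), start = c(0, 0), control = one_step))

rbind(default          = coef(default_fit),
      explicit_mustart = coef(explicit_fit),
      start_zero       = coef(zero_fit))
```

The first two rows should coincide, while the start = c(0, 0) row differs, which is consistent with the default being an initialization of mu (and eta) rather than of the coefficients.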

Code to run before the chunks above:

set.seed(123)

x <- rnorm(100)
p <- 1/(1 + exp(-x))
y <- rbinom(100, size = 1, prob = p)


glm.fit.truncated <- function(x, y, weights = rep.int(1, nobs), 
start = 0,etastart = NULL, mustart = NULL, 
offset = rep.int(0, nobs),
family = binomial(), 
control = list(), 
intercept = TRUE,
singular.ok = TRUE
){
control <- do.call("glm.control", control)
x <- as.matrix(x)
xnames <- dimnames(x)[[2L]]
ynames <- if(is.matrix(y)) rownames(y) else names(y)
conv <- FALSE
nobs <- NROW(y)
nvars <- ncol(x)
EMPTY <- nvars == 0
## define weights and offset if needed
if (is.null(weights))
  weights <- rep.int(1, nobs)
if (is.null(offset))
  offset <- rep.int(0, nobs)

## get family functions:
variance <- family$variance
linkinv  <- family$linkinv
if (!is.function(variance) || !is.function(linkinv) )
  stop("'family' argument seems not to be a valid family object", call. = FALSE)
dev.resids <- family$dev.resids
aic <- family$aic
mu.eta <- family$mu.eta
unless.null <- function(x, if.null) if(is.null(x)) if.null else x
valideta <- unless.null(family$valideta, function(eta) TRUE)
validmu  <- unless.null(family$validmu,  function(mu) TRUE)
if(is.null(mustart)) {
  ## calculates mustart and may change y and weights and set n (!)
  eval(family$initialize)
} else {
  mukeep <- mustart
  eval(family$initialize)
  mustart <- mukeep
}
if(EMPTY) {
  eta <- rep.int(0, nobs) + offset
  if (!valideta(eta))
    stop("invalid linear predictor values in empty model", call. = FALSE)
  mu <- linkinv(eta)
  ## calculate initial deviance and coefficient
  if (!validmu(mu))
    stop("invalid fitted means in empty model", call. = FALSE)
  dev <- sum(dev.resids(y, mu, weights))
  w <- sqrt((weights * mu.eta(eta)^2)/variance(mu))
  residuals <- (y - mu)/mu.eta(eta)
  good <- rep_len(TRUE, length(residuals))
  boundary <- conv <- TRUE
  coef <- numeric()
  iter <- 0L
} else {
  coefold <- NULL
  eta <-
    if(!is.null(etastart)) etastart
  else if(!is.null(start))
    if (length(start) != nvars)
      stop(gettextf("length of 'start' should equal %d and correspond to initial coefs for %s", nvars, paste(deparse(xnames), collapse=", ")),
           domain = NA)
  else {
    coefold <- start
    offset + as.vector(if (NCOL(x) == 1L) x * start else x %*% start)
  }
  else family$linkfun(mustart)
  mu <- linkinv(eta)
  if (!(validmu(mu) && valideta(eta)))
    stop("cannot find valid starting values: please specify some", call. = FALSE)
  ## calculate initial deviance and coefficient
  devold <- sum(dev.resids(y, mu, weights))
  boundary <- conv <- FALSE

  ##------------- THE Iteratively Reweighting L.S. iteration -----------
  for (iter in 1L:control$maxit) {
    good <- weights > 0
    varmu <- variance(mu)[good]
    if (anyNA(varmu))
      stop("NAs in V(mu)")
    if (any(varmu == 0))
      stop("0s in V(mu)")
    mu.eta.val <- mu.eta(eta)
    if (any(is.na(mu.eta.val[good])))
      stop("NAs in d(mu)/d(eta)")
    ## drop observations for which w will be zero
    good <- (weights > 0) & (mu.eta.val != 0)

    if (all(!good)) {
      conv <- FALSE
      warning(gettextf("no observations informative at iteration %d",
                       iter), domain = NA)
      break
    }
    z <- (eta - offset)[good] + (y - mu)[good]/mu.eta.val[good]
    w <- sqrt((weights[good] * mu.eta.val[good]^2)/variance(mu)[good])
    # ## call Fortran code via C wrapper
    # fit <- .Call(C_Cdqrls, x[good, , drop = FALSE] * w, z * w,
    #              min(1e-7, control$epsilon/1000), check=FALSE)
    # 

    #print(iter)
    #print(z)
    #print(w)
  }


  }
  return(list(z=z, w=w, mustart=mustart, etastart=etastart, eta=eta, offset=offset, mu=mu, mu.eta.val=mu.eta.val,
              weight=weights, var_mu=variance(mu)))

}

— Swihart

Thank you for your excellent answer, this is far beyond what I was hoping for :)

— Adela