Determining whether a process with a heavy-tailed distribution has improved significantly

12

I observe the processing times of a process before and after a change in order to find out whether the process has improved as a result of the change. The process has improved if the processing time is reduced. The distribution of the processing time is fat-tailed, so comparing based on the mean is not sensible. Instead, I would like to know whether the probability of observing a shorter processing time after the change is significantly above 50%.

Let $X$ be the random variable for the processing time after the change and $Y$ the one before. If $P(X < Y)$ is significantly above $0.5$, I would say the process has improved.

Now I have $n$ observations $x_i$ of $X$ and $m$ observations $y_j$ of $Y$. The observed probability of $P(X < Y)$ is $\hat p = \frac{1}{n m} \sum_i \sum_j 1_{x_i < y_j}$.

What can I say about $P(X < Y)$ given the observations $x_i$ and $y_j$?
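For concreteness, the estimate $\hat p$ defined above can be computed in a couple of lines of R. This is only a minimal sketch: the data are simulated, and the variable names and the Cauchy-based generator are assumptions made for illustration, not part of the original problem.

```r
# Hypothetical heavy-tailed processing times (illustrative values only)
set.seed(1)
x <- abs(rcauchy(40))        # after the change, n = 40 observations
y <- abs(rcauchy(50)) + 0.5  # before the change, m = 50 observations

# Proportion of all (i, j) pairs with x_i < y_j
p_hat <- mean(outer(x, y, "<"))
p_hat
```

`outer(x, y, "<")` builds the full $n \times m$ matrix of pairwise comparisons, so its mean is exactly the double sum divided by $nm$.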

sampling nonparametric

— Christian

12

Your estimate $\hat{p}$ is equal to the Mann-Whitney $U$ statistic divided by $mn$ (thanks, Glen!), and is therefore equivalent to the Wilcoxon rank-sum statistic $W$ (also known as the Wilcoxon-Mann-Whitney statistic): $W = U + \frac{n(n+1)}{2}$, where $n$ is the sample size of $y$ (assuming there are no ties). You can therefore use tables / software for the Wilcoxon test and transform the results back to $U$ to obtain a confidence interval or a $p$-value.
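The relationship between $\hat p$, $U$ and $W$ can be checked numerically. The sketch below uses simulated continuous data (so there are no ties); the variable names are assumptions for illustration. It relies on the fact that R's `wilcox.test` reports the Mann-Whitney count for its first argument, so the $y$ sample is passed first.

```r
set.seed(2)
x <- rcauchy(30)  # sample of X, size m in this answer's notation
y <- rcauchy(45)  # sample of Y, size n
m <- length(x); n <- length(y)

p_hat <- mean(outer(x, y, "<"))

# W: sum of the ranks of the y sample within the pooled data
W <- sum(rank(c(x, y))[m + seq_len(n)])
U <- W - n * (n + 1) / 2

# The statistic reported by wilcox.test(y, x) is the same count
U_r <- unname(wilcox.test(y, x)$statistic)

c(p_hat = p_hat, U_over_mn = U / (m * n), U_r_over_mn = U_r / (m * n))
```

All three values should coincide when there are no ties.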

Let $m$ be the sample size of $x$, and $N = m+n$. Then, asymptotically, under the null hypothesis that $X$ and $Y$ have the same distribution,

$W^* = \frac{W-\frac{n(N+1)}{2}}{\sqrt{\frac{mn(N+1)}{12}}} \sim \text{N}(0,1)$

Source: Hollander and Wolfe, Nonparametric Statistical Methods, roughly p. 117, but most nonparametric statistics books will probably get you there.
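A minimal sketch of the normal approximation, written in terms of $U$ rather than $W$: the two differ only by the constant $n(n+1)/2$, so standardising $U$ by its null mean $mn/2$ gives the same $z$-value. The data are simulated and the names are assumptions; the one-sided alternative is $P(X<Y) > 0.5$.

```r
set.seed(3)
x <- rcauchy(60)      # after the change
y <- rcauchy(80) + 1  # before the change (shifted so X tends to be smaller)
m <- length(x); n <- length(y); N <- m + n

U <- sum(outer(x, y, "<"))  # number of pairs with x_i < y_j
z <- (U - m * n / 2) / sqrt(m * n * (N + 1) / 12)
p_value <- pnorm(z, lower.tail = FALSE)  # one-sided p-value

# Same test via wilcox.test, with the exact computation and the
# continuity correction switched off so the two results are comparable
p_r <- wilcox.test(y, x, alternative = "greater",
                   exact = FALSE, correct = FALSE)$p.value

c(p_hat = U / (m * n), z = z, p_value = p_value, p_r = p_r)
```

In practice one would probably leave the continuity correction on (the default in `wilcox.test`); it is disabled here only to show that the hand computation matches.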

— jbowman

@Glen_b - thanks, I've updated the answer. That was a very generous guess you made about the cause of the error!

— jbowman

13

@jbowman provides a (nice) standard solution to the problem of estimating $\theta=P(X<Y)$, which is known as the stress-strength model.

Another nonparametric alternative was proposed in Baklizi and Eidous (2006) for the case where $X$ and $Y$ are independent. It is described below.

By definition, we have that

$\theta=P(X<Y)=\int_{-\infty}^{\infty}F_X(y)f_Y(y)dy,$

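where $F_X$ is the CDF of $X$ and $f_Y$ is the density of $Y$. Then, using the samples of $X$ and $Y$, we can obtain kernel estimators of $F_X$ and $f_Y$ and, consequently, an estimator of $\theta$: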

$\hat\theta=\int_{-\infty}^{\infty}\hat F_X(y)\hat f_Y(y)dy.$

This is implemented in the following R code using a Gaussian kernel.

# Optimal bandwidth
h = function(x){
  n = length(x)
  return((4*sqrt(var(x))^5/(3*n))^(1/5))
}

# Kernel estimators of the density and the distribution
kg = function(x,data){
  hb = h(data)
  k = length(x)
  r = numeric(k)
  for(i in 1:k) r[i] = mean(dnorm((x[i]-data)/hb))/hb
  return(r)
}

KG = function(x,data){
  hb = h(data)
  k = length(x)
  r = numeric(k)
  for(i in 1:k) r[i] = mean(pnorm((x[i]-data)/hb))
  return(r)
}

# Baklizi and Eidous (2006) estimator
nonpest = function(dat1B,dat2B){
  return(as.numeric(integrate(function(x) KG(x,dat1B)*kg(x,dat2B),-Inf,Inf)$value))
}

# Example when X and Y are Cauchy
datx = rcauchy(100,0,1)
daty =  rcauchy(100,0,1)

nonpest(datx,daty)

In order to obtain a confidence interval for $\theta$ you can get a bootstrap sample of this estimator as follows.

# bootstrap
B=1000
p = rep(0,B)

for(j in 1:B){
  dat1 = sample(datx,length(datx),replace=TRUE)
  dat2 = sample(daty,length(daty),replace=TRUE)
  p[j] = nonpest(dat1,dat2)
}

# histogram of the bootstrap sample
hist(p)

# A confidence interval (quantile type)
c(quantile(p,0.025),quantile(p,0.975))

Other sorts of bootstrap intervals might be considered as well.
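As one example of such an alternative, here is a hedged sketch of the "basic" (reverse percentile) bootstrap interval next to the quantile-type one. To keep it self-contained and fast it uses the simple plug-in estimator $\hat p$ from the question instead of the kernel estimator; that substitution, and all names in the block, are assumptions made for illustration.

```r
set.seed(4)
x <- rcauchy(100)
y <- rcauchy(100)

# Simple plug-in estimator of P(X < Y), standing in for the kernel estimator
est <- function(a, b) mean(outer(a, b, "<"))

theta_hat <- est(x, y)
B <- 1000
boot <- replicate(B, est(sample(x, replace = TRUE), sample(y, replace = TRUE)))

q <- unname(quantile(boot, c(0.025, 0.975)))
percentile_ci <- q
# Basic interval: reflect the bootstrap quantiles around the point estimate
basic_ci <- c(2 * theta_hat - q[2], 2 * theta_hat - q[1])

rbind(percentile = percentile_ci, basic = basic_ci)
```

The two intervals have the same width by construction; they differ only when the bootstrap distribution is not symmetric around the point estimate.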

2

Interesting and a good paper reference (+1). I'll add it to my repertoire!

— jbowman

0

Consider the paired differences $X_i-Y_i$ with $P(X_i-Y_i<0) = p$. Then the indicators $I\{X_i-Y_i<0\}$ for $i=1,2,\ldots,n$ are iid Bernoulli random variables, so the number $X$ of pairs with $X_i < Y_i$ is binomial with parameters $n$ and $p=P(X_i-Y_i<0)$. Then $X/n$ is an unbiased estimate of the probability, and confidence intervals and hypothesis tests can be based on the binomial distribution.
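If the observations really were paired, this binomial approach (a sign test) could be sketched as follows. The pairing, the simulated data and the variable names are all assumptions made for illustration.

```r
set.seed(5)
n <- 50
before <- abs(rcauchy(n)) + 1                # hypothetical Y_i
after  <- before * exp(rnorm(n, -0.2, 0.5))  # hypothetical paired X_i

k <- sum(after < before)  # number of pairs with X_i < Y_i

# Exact binomial test of H0: p = 0.5 against H1: p > 0.5
res <- binom.test(k, n, p = 0.5, alternative = "greater")
res$estimate  # k / n
res$conf.int  # one-sided Clopper-Pearson interval for p
res$p.value
```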

— Michael R. Chernick

2

What is the basis of the pairing, Michael?

— whuber

The OP said "Let X be the random variable for the processing time after the change and Y the one before." So $X_i$ is after the intervention and $Y_i$ is before.

— Michael R. Chernick

Did you notice that the counts (potentially) differ? You appear to assume $m=n$. My reading is that a "process" is temporal and that the $X_i$ sample it before an event and the $Y_j$ sample it after an event.

— whuber

1

You're right. I guess some sort of two-sample test, such as the Wilcoxon test suggested by jbowman above, would be appropriate. It is interesting that the Mann-Whitney form of the test counts the number of $X_i$s that are less than the $Y_j$s.

— Michael R. Chernick