What does "independent observations" mean?

I'm trying to understand what the assumption of independent observations means. Some definitions are:

"Two events are independent if and only if $P(a \cap b) = P(a) * P(b)$." (Dictionary of Statistical Terms)
"the occurrence of one event doesn't change the probability of another" (Wikipedia).
"sampling of one observation does not affect the choice of the second observation" (David M. Lane).

An example of dependent observations that is often given is students nested within teachers, as below. Let's assume that teachers influence students but students don't influence one another.

So how are these definitions violated for these data? Sampling [grade = 7] for [student_id = 1] does not affect the probability distribution of the grade that will be sampled next. (Or does it? And if so, what does observation 1 predict regarding the next observation?)

Why would the observations be independent if I had measured gender instead of teacher_id? Don't they affect the observations in the same way?

teacher_id   student_id   grade
         1            1       7
         1            2       7
         1            3       6
         2            4       8
         2            5       8
         2            6       9

— RubenGeert

One could suggest that the distribution of grades for teacher 1 has a lower "mean" value than that for teacher 2, and hence the students of teacher 1 would tend to have lower grades, on average, than the students of teacher 2. In other words, the distributions of students/grades for the two teachers could very well be different distributions. That would be sufficient to make the observations dependent.

— Reinstate Monica - G. Simpson

@GavinSimpson: I have been thinking about that exact line of reasoning. However, what if I replace teacher with gender? Gender is present in most social-science data and correlates with nearly everything to some extent.

— RubenGeert 22/09

Surely it must depend on the response. If we were analysing the grades of science students in the UK, perhaps there would be an effect, with different distributions of attainment for the two sexes, on average, in the populations you study. In any case, all of this matters (in a statistical model) only for the residuals or, put differently, for the responses conditional on the fitted model. In other words, if the observations are not independent, that is fine as long as the model accounts for it, so that the residuals are independent.

— Reinstate Monica - G. Simpson

You cannot take (1) or (2) as definitions of (statistical) independence, because independence can be defined without reference to causality. All three quotations are just efforts at providing informal, intuitive examples. ((3) could possibly be taken as a definition, provided you have access to a rigorous, quantitative definition of amount of information.) It would therefore be a good idea to refer to an actual definition, such as those appearing under the heading "Definition" in the Wikipedia article you reference.

— whuber

No, you can make the residuals independent (or at least reduce the dependence to the point that the residuals appear independent). This comes from the assumptions of the linear model, $\varepsilon \sim N(0, \sigma^2 \Lambda)$, where $\Lambda$ is a correlation matrix. The usual assumption is that $\Lambda$ is an identity matrix; hence the off-diagonals are zero, and so the assumption of independence is on the residuals. In other words, this is a statement about $y$ conditional on the fitted model.

— Reinstate Monica - G. Simpson

Answers:

In probability theory, statistical independence (which is not the same as causal independence) is defined as your property (3), but (1) follows as a consequence $\dagger$. The events $\mathcal{A}$ and $\mathcal{B}$ are said to be statistically independent if and only if:

$\mathbb{P}(\mathcal{A} \cap \mathcal{B}) = \mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B}) .$

If $\mathbb{P}(\mathcal{B}) > 0$, then it follows that:

$\mathbb{P}(\mathcal{A} |\mathcal{B}) = \frac{\mathbb{P}(\mathcal{A} \cap \mathcal{B})}{\mathbb{P}(\mathcal{B})} = \frac{\mathbb{P}(\mathcal{A}) \cdot \mathbb{P}(\mathcal{B})}{\mathbb{P}(\mathcal{B})} = \mathbb{P}(\mathcal{A}) .$

This means that statistical independence implies that the occurrence of one event does not affect the probability of the other. Another way of saying this is that the occurrence of one event should not change your beliefs about the other. The concept of statistical independence is generally extended from events to random variables in a way that allows analogous statements to be made for random variables, including continuous random variables (which have zero probability of any particular outcome). The treatment of independence for random variables basically involves the same definitions applied to distribution functions.
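As a small illustration of the product rule and its conditional-probability consequence, here is a sketch that enumerates the outcomes of two fair dice exactly. The dice and the three events are assumptions chosen for the example, not something taken from the question's data.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space: ordered outcomes of two fair six-sided dice,
# all 36 of them equally likely.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on (first, second)."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def first_is_even(o):
    return o[0] % 2 == 0

def second_is_six(o):
    return o[1] == 6

def total_is_twelve(o):
    return o[0] + o[1] == 12

p_a = prob(first_is_even)
p_b = prob(second_is_six)
p_c = prob(total_is_twelve)

# A and B concern different dice, so the product rule holds exactly.
p_ab = prob(lambda o: first_is_even(o) and second_is_six(o))
independent_ab = p_ab == p_a * p_b

# The consequence derived above: P(A | B) equals P(A).
p_a_given_b = p_ab / p_b

# A and C are not independent: a total of twelve forces the first die to be six.
p_ac = prob(lambda o: first_is_even(o) and total_is_twelve(o))
independent_ac = p_ac == p_a * p_c

print(independent_ab, p_a_given_b == p_a, independent_ac)
```

Using exact fractions rather than simulation means the equality is checked exactly, with no sampling noise.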

It is crucial to understand that independence is a very strong property: if events are statistically independent then (by definition) we cannot learn about one from observing the other. For this reason, statistical models generally involve assumptions of conditional independence, given some underlying distribution or parameters. The exact conceptual framework depends on whether one is using Bayesian methods or classical methods. The former involves explicit dependence between observable values, while the latter involves a (complicated and subtle) implicit form of dependence. Understanding this issue properly requires a bit of understanding of classical versus Bayesian statistics.

Statistical models will often say that they use an assumption that sequences of random variables are "independent and identically distributed (IID)". For example, you might have an observable sequence $X_1, X_2, X_3, ... \sim \text{IID N} (\mu, \sigma^2)$, which means that each observable random variable $X_i$ is normally distributed with mean $\mu$ and standard deviation $\sigma$. Each of the random variables in the sequence is "independent" of the others in the sense that its outcome does not change the stated distribution of the other values. In this kind of model we use the observed values of the sequence to estimate the parameters in the model, and we can then in turn predict unobserved values of the sequence. This necessarily involves using some observed values to learn about others.

Bayesian statistics: Everything is conceptually simple. We assume that $X_1, X_2, X_3, ...$ are conditionally IID given the parameters $\mu$ and $\sigma$, and we treat those unknown parameters as random variables. Given any non-degenerate prior distribution for these parameters, the values in the observable sequence are (unconditionally) dependent, generally with positive correlation. Hence, it makes perfect sense that we use observed outcomes to predict later unobserved outcomes: they are conditionally independent, but unconditionally dependent.
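The positive correlation can be seen in a simplified special case. As a sketch only, assume that $\sigma$ is known and that the prior for $\mu$ has a finite variance $\tau^2 > 0$ (both assumptions are mine, made to keep the algebra short). Conditional independence gives $\text{Cov}(X_1, X_2 \mid \mu) = 0$, and $\mathbb{E}(X_i \mid \mu) = \mu$, so the law of total covariance yields

$\text{Cov}(X_1, X_2) = \mathbb{E}\left[\text{Cov}(X_1, X_2 \mid \mu)\right] + \text{Cov}\left(\mathbb{E}(X_1 \mid \mu), \mathbb{E}(X_2 \mid \mu)\right) = 0 + \text{Var}(\mu) = \tau^2 .$

Since $\text{Var}(X_i) = \mathbb{E}\left[\text{Var}(X_i \mid \mu)\right] + \text{Var}\left(\mathbb{E}(X_i \mid \mu)\right) = \sigma^2 + \tau^2$, the unconditional correlation is

$\text{Corr}(X_1, X_2) = \frac{\tau^2}{\sigma^2 + \tau^2} > 0 .$

The correlation vanishes only in the degenerate case $\tau^2 = 0$, that is, when $\mu$ is known with certainty.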

Classical statistics: This is quite complicated and subtle. We assume that $X_1, X_2, X_3, ...$ are IID given the parameters $\mu$ and $\sigma$, but we treat those parameters as "unknown constants". Since the parameters are treated as constants, there is no clear difference between conditional and unconditional independence in this case. Nevertheless, we still use the observed values to estimate the parameters and make predictions of the unobserved values. Hence, we use the observed outcomes to predict later unobserved outcomes even though they are notionally "independent" of each other. This apparent incongruity is discussed in detail in O'Neill, B. (2009) Exchangeability, Correlation and Bayes' Effect. International Statistical Review 77(2), pp. 241-250.

Applying this to your student grades data, you would probably model something like this by assuming that grade is conditionally independent given teacher_id. You would use the data to make inferences about the grading distribution for each teacher (which would not be assumed to be the same), and this would allow you to make predictions about the unknown grade of another student. Because the grade variable is used in the inference, it will affect your predictions of any unknown grade variable for another student. Replacing teacher_id with gender does not change this; in either case you have a variable that you might use as a predictor of grade.

If you use the Bayesian method you will have an explicit assumption of conditional independence and a prior distribution for the teachers' grade distributions, and this leads to an unconditional (predictive) dependence of grades, allowing you to rationally use one grade in your prediction of another. If you are using classical statistics you will have an assumption of independence (based on parameters that are "unknown constants") and you will use classical statistical prediction methods that allow you to use one grade to predict another.

$\dagger$ There are some foundational presentations of probability theory that define independence via the conditional probability statement and then give the joint probability statement as a consequence. This is less common.

— Reinstate Monica

Statistical independence is pretty much what you describe in the first part of your answer. But your sentence "... if events are statistically independent then (by definition) we cannot learn about one from observing the other." is blatantly wrong. The world is full of statistically independent but similar events and random variables.

— Alecos Papadopoulos

Wouldn't "learning" mean changing our beliefs about one thing based on observation of another? If so, doesn't independence (by definition) preclude that?

— Reinstate Monica

I would make a comment similar to that of @Alecos. The overall impression one gets is that you are claiming that observing one realization of a random variable tells us nothing about its distribution $F$, so that you cannot predict anything about a second independent realization. If that were the case, it would be impossible to develop most of the theory of sampling and estimation. But you are correct in the sense that if we know $F$ and observe one realization, that gives us no additional information about any other independent realization.

— whuber

I think the issue here is that the standard IID model with distribution $F$ is implicitly using an assumption of conditional independence given knowledge of $F$. Conditional on knowledge of $F$, the observations are independent, but unconditionally you have a situation where each observation gives information about $F$, which then affects your beliefs about the other observations.

— Reinstate Monica

The difficulty in this question is that classical statistics treats the distribution and the underlying parameters as "unknown constants", and so it makes no explicit distinction between conditional and unconditional independence in this case. In Bayesian statistics everything is very simple.

— Reinstate Monica

Let $\mathbb x=(X_1,...,X_j,...,X_k)$ be a $k$-dimensional random vector, i.e. a fixed-position collection of random variables (measurable real functions).

Consider many such vectors, say $n$, and index these vectors by $i=1,...,n$, so, say

$\mathbb x_i=(X_{1i},...,X_{ji},...,X_{ki})$

and regard them as a collection called "the sample", $S=(\mathbb x_1,...,\mathbb x_i,...,\mathbb x_n)$. Then we call each $k$-dimensional vector an "observation" (although it really becomes one only once we measure and record the realizations of the random variables involved).

Let's first treat the case where either a probability mass function (PMF) or a probability density function (PDF) exists, and also joint such functions. Denote by $f_i(\mathbb x_i),\;i=1,...,n$ the joint PMF or joint PDF of each random vector, and by $f(\mathbb x_1,...,\mathbb x_i,...,\mathbb x_n)$ the joint PMF or joint PDF of all these vectors together.

The sample $S$ is called an "independent sample" if the following equality holds:

$f(\mathbb x_1,...,\mathbb x_i,...,\mathbb x_n) = \prod_{i=1}^{n}f_i(\mathbb x_i),\;\;\; \forall (\mathbb x_1,...,\mathbb x_i,...,\mathbb x_n) \in D_S$

where $D_S$ is the joint domain created by the $n$ random vectors/observations.

This means that the "observations" are "jointly independent", (in the statistical sense, or "independent in probability" as was the old saying that is still seen today sometimes). The habit is to simply call them "independent observations".

Note that the statistical independence property here is over the index $i$ , i.e. between observations. It is unrelated to what are the probabilistic/statistical relations between the random variables in each observation (in the general case we treat here where each observation is multidimensional).

Note also that in cases where we have continuous random variables with no densities, the above can be expressed in terms of the distribution functions.

This is what "independent observations" means. It is a precisely defined property expressed in mathematical terms. Let's see some of what it implies.

SOME CONSEQUENCES OF HAVING INDEPENDENT OBSERVATIONS

A. If two observations are part of a group of jointly independent observations, then they are also "pair-wise independent" (statistically),

$f(\mathbb x_i,\mathbb x_m) = f_i(\mathbb x_i)f_m(\mathbb x_m)\;\;\; \forall i\neq m, \;\;\; i,m =1,...,n$

This in turn implies that the conditional PMFs/PDFs equal the "marginal" ones

$f(\mathbb x_i \mid \mathbb x_m) = f_i(\mathbb x_i)\;\;\; \forall i\neq m, \;\;\; i,m =1,...,n$

This generalizes to many arguments, conditioned or conditioning, say

$f(\mathbb x_i , \mathbb x_{\ell}\mid \mathbb x_m) = f(\mathbb x_i , \mathbb x_{\ell}),\;\;\;\; f(\mathbb x_i \mid \mathbb x_m, \mathbb x_{\ell}) = f_i(\mathbb x_i)$

etc, as long as the indexes to the left are different to the indexes on the right of the vertical line.

This implies that if we actually observe one observation, the probabilities characterizing any other observation of the sample do not change. So as regards prediction, an independent sample is not our best friend. We would prefer to have dependence so that each observation could help us say something more about any other observation.

B. On the other hand, an independent sample has maximum informational content. Every observation, being independent, carries information that cannot be inferred, wholly or partly, by any other observation in the sample. So the sum total is maximum, compared to any comparable sample where there exists some statistical dependence between some of the observations. But of what use is this information, if it cannot help us improve our predictions?

Well, this is indirect information about the probabilities that characterize the random variables in the sample. The more these observations have common characteristics (common probability distribution in our case), the more we are in a better position to uncover them, if our sample is independent.

In other words if the sample is independent and "identically distributed", meaning

$f_i(\mathbb x_i) = f_m(\mathbb x_m) = f(\mathbb x),\;\;\; i\neq m$

it is the best possible sample in order to obtain information about not only the common joint probability distribution $f(\mathbb x)$ , but also for the marginal distributions of the random variables that comprise each observation, say $f_j(x_{ji})$ .

So even though $f(\mathbb x_i \mid \mathbb x_m) = f_i(\mathbb x_i)$ , so zero additional predictive power as regards the actual realization of $\mathbb x_i$ , with an independent and identically distributed sample, we are in the best position to uncover the functions $f_i$ (or some of its properties), i.e. the marginal distributions.

Therefore, as regards estimation (which is sometimes used as a catch-all term, but here it should be kept distinct from the concept of prediction), an independent sample is our "best friend", if it is combined with the "identically distributed" property.

C. It also follows that an independent sample of observations where each is characterized by a totally different probability distribution, with no common characteristics whatsoever, is as worthless a collection of information as one can get (of course every piece of information on its own is worthy, the issue here is that taken together these cannot be combined to offer anything useful). Imagine a sample containing three observations: one containing (quantitative characteristics of) fruits from South America, another containing mountains of Europe, and a third containing clothes from Asia. Pretty interesting information pieces all three of them -but together as a sample cannot do anything statistically useful for us.

Put in another way, a necessary and sufficient condition for an independent sample to be useful, is that the observations have some statistical characteristics in common. This is why, in Statistics, the word "sample" is not synonymous to "collection of information" in general, but to "collection of information on entities that have some common characteristics".

APPLICATION TO THE OP'S DATA EXAMPLE

Responding to a request from user @gung, let's examine the OP's example in light of the above. We reasonably assume that we are in a school with more than two teachers and more than six pupils. So a) we are sampling both pupils and teachers, and b) we include in our data set the grade that corresponds to each teacher-pupil combination.

Namely, the grades are not "sampled", they are a consequence of the sampling we did on teachers and pupils. Therefore it is reasonable to treat the random variable $G$ (=grade) as the "dependent variable", while pupils ( $P$ ) and teachers $T$ are "explanatory variables" (not all possible explanatory variables, just some). Our sample consists of six observations which we write explicitly, $S = (\mathbb s_1, ..., \mathbb s_6)$ as

$\begin{aligned} \mathbb s_1 &=(T_1, P_1, G_1) \\ \mathbb s_2 &=(T_1, P_2, G_2) \\ \mathbb s_3 &=(T_1, P_3, G_3) \\ \mathbb s_4 &=(T_2, P_4, G_4) \\ \mathbb s_5 &=(T_2, P_5, G_5) \\ \mathbb s_6 &=(T_2, P_6, G_6) \end{aligned}$

Under the stated assumption "pupils do not influence each other", we can consider the $P_i$ variables as independently distributed. Under a non-stated assumption that "all other factors" that may influence the Grade are independent of each other, we can also consider the $G_i$ variables to be independent of each other.
Finally under a non-stated assumption that teachers do not influence each other, we can consider the variables $T_1, T_2$ as statistically independent between them.

But irrespective of what causal/structural assumption we make regarding the relation between teachers and pupils, the fact remains that observations $\mathbb s_1, \mathbb s_2, \mathbb s_3$ contain the same random variable ( $T_1$ ), while observations $\mathbb s_4, \mathbb s_5, \mathbb s_6$ likewise contain the same random variable ( $T_2$ ).

Note carefully the distinction between "the same random variable" and "two distinct random variables that have identical distributions".

So even if we assume that "teachers do NOT influence pupils", then still, our sample as defined above is not an independent sample, because $\mathbb s_1, \mathbb s_2, \mathbb s_3$ are statistically dependent through $T_1$ , while $\mathbb s_4, \mathbb s_5, \mathbb s_6$ are statistically dependent through $T_2$ .
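To make the "same random variable" point concrete, here is a toy enumeration. It is only a sketch under assumptions of my own: the teacher variable and the grades are replaced by three mutually independent fair coins, and the pupil component is dropped for brevity. Even though the teacher coin has no influence on the grade coins, the two observations fail the product rule because they share it.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Hypothetical toy model: T (the shared teacher variable), G1 and G2 (grades)
# are three mutually independent fair coins taking the values 0 or 1.
# The two observations are s1 = (T, G1) and s2 = (T, G2).
joint = {}
for t, g1, g2 in product((0, 1), repeat=3):
    joint[((t, g1), (t, g2))] = half ** 3

def marginal(index, value):
    """Marginal probability that observation number `index` equals `value`."""
    return sum(p for obs, p in joint.items() if obs[index] == value)

s1_value = (1, 1)
s2_value = (1, 1)

# Joint probability that both observations equal (1, 1).
p_joint = joint.get((s1_value, s2_value), Fraction(0))

# Product of the two marginal probabilities.
p_product = marginal(0, s1_value) * marginal(1, s2_value)

print(p_joint, p_product, p_joint == p_product)
```

The joint probability is larger than the product of the marginals because knowing that the first observation has $T = 1$ tells us with certainty that the second one does too.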

Assume now that we exclude the random variable "teacher" from our sample. Is the (Pupil, Grade) sample of six observations, an independent sample? Here, the assumptions we will make regarding what is the structural relationship between teachers, pupils, and grades does matter.

First, do teachers directly affect the random variable "Grade", through perhaps, different "grading attitudes/styles"? For example $T_1$ may be a "tough grader" while $T_2$ may be not. In such a case "not seeing" the variable "Teacher" does not make the sample independent, because it is now the $G_1, G_2, G_3$ that are dependent, due to a common source of influence, $T_1$ (and analogously for the other three).

But say that teachers are identical in that respect. Then under the stated assumption "teachers influence students" we have again that the first three observations are dependent with each other, because teachers influence pupils who influence grades, and we arrive at the same result, albeit indirectly in this case (and likewise for the other three). So again, the sample is not independent.

THE CASE OF GENDER

Now, let's make the (Pupil, Grade) six-observation sample "conditionally independent with respect to teacher" (see other answers) by assuming that all six pupils have in reality the same teacher. But in addition let's include in the sample the random variable " $Ge$ =Gender" that traditionally takes two values ( $M,F$ ), while recently has started to take more. Our once again three-dimensional six-observation sample is now

$\begin{aligned} \mathbb s_1 &=(Ge_1, P_1, G_1) \\ \mathbb s_2 &=(Ge_2, P_2, G_2) \\ \mathbb s_3 &=(Ge_3, P_3, G_3) \\ \mathbb s_4 &=(Ge_4, P_4, G_4) \\ \mathbb s_5 &=(Ge_5, P_5, G_5) \\ \mathbb s_6 &=(Ge_6, P_6, G_6) \end{aligned}$

Note carefully that what we included in the description of the sample as regards Gender, is not the actual value that it takes for each pupil, but the random variable "Gender". Look back at the beginning of this very long answer: the Sample is not defined as a collection of numbers (or fixed numerical or not values in general), but as a collection of random variables (i.e. of functions).

Now, does the gender of one pupil influence (structurally or statistically) the gender of another pupil? We could reasonably argue that it doesn't. So in that respect, the $Ge_i$ variables are independent. Does the gender of pupil $1$, $Ge_1$, directly affect some other pupil ( $P_2, P_3,...$ ) in some other way? Hmm, there are battling educational theories on the matter, if I recall. If we assume that it does not, then another possible source of dependence between observations disappears. Finally, does the gender of a pupil directly influence the grades of another pupil? If we argue that it doesn't, we obtain an independent sample (conditional on all pupils having the same teacher).

— Alecos Papadopoulos

I do not agree with your point B. For some purposes, such as estimating a mean, negative correlation is better than independence.

— Kjetil b halvorsen

@kjetil Better in what sense?

— Alecos Papadopoulos

It would help if you could connect this concretely to the OP's questions in the text. Given this, how do we understand that the listed observations are not independent? & how does leaving out teacher differ from leaving out sex?

— gung - Reinstate Monica

@gung I included some elaboration along the lines you suggested.

— Alecos Papadopoulos

Better in the sense of reducing the variance

— kjetil b halvorsen

The definitions of statistical independence that you give in your post are all essentially correct, but they don't get to the heart of the independence assumption in a statistical model. To understand what we mean by the assumption of independent observations in a statistical model, it will be helpful to revisit what a statistical model is on a conceptual level.

Statistical models as approximations to "nature's dice"

Let's use a familiar example: we collect a random sample of adult humans (from a well-defined population--say, all adult humans on earth) and we measure their heights. We wish to estimate the population mean height of adult humans. To do this, we construct a simple statistical model by assuming that people's heights arise from a normal distribution.

Our model will be a good one if a normal distribution provides a good approximation to how nature "picks" heights for people. That is, if we simulate data under our normal model, does the resulting dataset closely resemble (in a statistical sense) what we observe in nature? In the context of our model, does our random-number generator provide a good simulation of the complicated stochastic process that nature uses to determine the heights of randomly selected human adults ("nature's dice")?

The independence assumption in a simple modeling context

When we assumed that we could approximate "nature's dice" by drawing random numbers from a normal distribution, we didn't mean that we would draw a single number from the normal distribution, and then assign that height to everybody. We meant that we would independently draw numbers for everybody from the same normal distribution. This is our independence assumption.

Imagine now that our sample of adults wasn't a random sample, but instead came from a handful of families. Tallness runs in some families, and shortness runs in others. We've already said that we're willing to assume that the heights of all adults come from one normal distribution. But sampling from the normal distribution wouldn't provide a dataset that looks much like our sample (our sample would show "clumps" of points, some short, others tall--each clump is a family). The heights of people in our sample are not independent draws from the overall normal distribution.

The independence assumption in a more complicated modeling context

But not all is lost! We might be able to write down a better model for our sample--one that preserves the independence of the heights. For example, we could write down a linear model where heights arise from a normal distribution with a mean that depends on what family the subject belongs to. In this context, the normal distribution describes the residual variation, AFTER we account for the influence of family. And independent samples from a normal distribution might be a good model for this residual variation.
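A rough simulation of this idea, under assumptions of my own: the numbers (a 170 cm overall mean, a 6 cm spread of family means, a 4 cm residual spread) and the two-siblings-per-family design are invented for illustration. Note that the simulation subtracts the true family mean, which a real analysis would have to estimate.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def correlation(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

N_FAMILIES = 20000
FAMILY_SD = 6.0    # hypothetical spread of family means, in cm
RESIDUAL_SD = 4.0  # hypothetical spread around the family mean, in cm

heights_a, heights_b = [], []  # raw heights of two siblings per family
resid_a, resid_b = [], []      # heights minus the (true) family mean

for _ in range(N_FAMILIES):
    family_mean = random.gauss(170.0, FAMILY_SD)
    # The residual draws are independent of each other: this is the
    # independence assumption of the model.
    height_a = family_mean + random.gauss(0.0, RESIDUAL_SD)
    height_b = family_mean + random.gauss(0.0, RESIDUAL_SD)
    heights_a.append(height_a)
    heights_b.append(height_b)
    resid_a.append(height_a - family_mean)
    resid_b.append(height_b - family_mean)

# Raw sibling heights are correlated; theory gives 36 / (36 + 16), about 0.69.
raw_corr = correlation(heights_a, heights_b)
# After removing the family mean the correlation should be close to zero.
resid_corr = correlation(resid_a, resid_b)

print(round(raw_corr, 3), round(resid_corr, 3))
```

The raw heights "clump" by family, while the residual variation behaves like independent draws, which is the sense in which the better model preserves independence.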

Overall here, what we have done is to write down a more sophisticated model of how we expect nature's dice to behave in the context of our study. By writing down a good model, we might still be justified in assuming that the random part of the model (i.e. the random variation around the family means) is independently sampled for each member of the population.

The (conditional) independence assumption in a general modeling context

In general, statistical models work by assuming that data arises from some probability distribution. The parameters of that distribution (like the mean of the normal distribution in the example above) might depend on covariates (like family in the example above). But of course endless variations are possible. The distribution might not be normal, the parameter that depends on covariates might not be the mean, the form of the dependence might not be linear, etc. ALL of these models rely on the assumption that they provide a reasonably good approximation to how nature's dice behave (again, that data simulated under the model will look statistically similar to actual data obtained by nature).

When we simulate data under the model, the final step will always be to draw a random number according to some modeled probability distribution. These are the draws that we assume to be independent of each other. The actual data that we get might not look independent, because covariates or other features of the model might tell us to use different probability distributions for different draws (or sets of draws). But all of this information must be built into the model itself. We are not allowed to let the final random draw depend on what values we drew for other data points. Thus, the events that need to be independent are the rolls of "nature's dice" in the context of our model.

It is useful to refer to this situation as conditional independence, which means that the data points are independent of each other given (i.e. conditioned on) the covariates. In our height example, we assume that my height and my brother's height, conditioned on my family, are independent of each other, and are also independent of your height and your sister's height, conditioned on your family. Once we know someone's family, we know which normal distribution to draw from to simulate their height, and the draws for different individuals are independent regardless of their family (even though our choice of which normal distribution to draw from depends on family). It is also possible that even after dealing with the family structure of our data, we still do not achieve good conditional independence (maybe it is also important to model gender, for example).

Ultimately, whether it makes sense to assume conditional independence of the observations is a decision that must be made in the context of a particular model. This is why, for example, in linear regression we do not check that the data come from a normal distribution, but we do check that the RESIDUALS come from a normal distribution (and from the SAME normal distribution across the full range of the data). Linear regression assumes that, after accounting for the influence of the covariates (the regression line), the data are independently sampled from a normal distribution, according to the strict definition of independence in the original post.

In the context of your example

"Teacher" in your data might be like "family" in the height example.

A final spin on it

Plenty of familiar models assume that the residuals arise from a normal distribution. Imagine that I gave you some data that very clearly were not normal. Maybe they are strongly skewed, or maybe they are bimodal. And I told you "these data come from a normal distribution."

"No way," you say, "it's obvious that they aren't normal!"

"Who said anything about the data being normal?" I say. "I only said that they come from a normal distribution."

"One and the same!" you say. "We know that a histogram of a reasonably large sample from a normal distribution will tend to look approximately normal!"

"But," I say, "I never said that the data were independently sampled from the normal distribution. They DO come from a normal distribution, but they are not independent draws."

The assumption of (conditional) independence in statistical modeling is there to prevent smart-alecks like me from ignoring the distribution of the residuals and mis-applying the model.

Two final notes

1) The term "nature's dice" is not mine originally, but despite consulting a couple of references, I cannot figure out where I got it in this context.

2) Some statistical models (e.g. autoregressive models) do not require independence of observations in quite this way. In particular, they allow the sampling distribution for a given observation to depend not only on fixed covariates, but also on the data that came before it.
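As a rough illustration of note 2, here is a sketch of a first-order autoregressive series. The coefficient 0.8, the unit-variance normal innovations, and the series length are all assumptions picked for the example. The observations themselves are strongly correlated with their predecessors, while the innovations (the final random draws) remain independent.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def correlation(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

PHI = 0.8        # hypothetical autoregressive coefficient
N_STEPS = 50000  # hypothetical series length

series = [0.0]
innovations = []
for _ in range(N_STEPS - 1):
    shock = random.gauss(0.0, 1.0)  # independent draw of "nature's dice"
    innovations.append(shock)
    # Each observation depends on the one that came before it.
    series.append(PHI * series[-1] + shock)

# Correlation between each observation and its predecessor; for a stationary
# AR(1) process the theoretical value equals PHI.
lag1_series = correlation(series[:-1], series[1:])
# Correlation between consecutive innovations; it should be close to zero.
lag1_innovations = correlation(innovations[:-1], innovations[1:])

print(round(lag1_series, 3), round(lag1_innovations, 3))
```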

— Jacob Socolar

Thank you for this. I like that it is put in a very accessible way. You address the question of how this plays out for teacher. Can you extend the discussion to also address the idea of sex as a covariate?

— gung - Reinstate Monica