A technical lemma
I'm not sure how intuitive this is, but the main technical result underlying your statement of the Halmos-Savage Theorem is the following:
Lemma.
Let $\mu$ be a $\sigma$-finite measure on $(S, \mathcal{A})$. Suppose that $\aleph$ is a set of measures on $(S, \mathcal{A})$ such that $\nu \ll \mu$ for every $\nu \in \aleph$. Then there exist a sequence of nonnegative numbers $\{c_i\}_{i=1}^\infty$ and a sequence $\{\nu_i\}_{i=1}^\infty$ of elements of $\aleph$ such that $\sum_{i=1}^\infty c_i = 1$ and $\nu \ll \sum_{i=1}^\infty c_i \nu_i$ for every $\nu \in \aleph$.
This is taken verbatim from Theorem A.78 in Schervish's Theory of Statistics (1995). There, he attributes it to Lehmann's Testing Statistical Hypotheses (1986) (there is also a third edition), where the result is attributed to Halmos and Savage themselves (see Lemma 7). Another good reference is Shao's Mathematical Statistics (second edition, 2003), where the relevant results are Lemma 2.1 and Theorem 2.2.
The lemma above says that if you start with a family of measures dominated by a $\sigma$-finite measure, then you can in fact replace the dominating measure by a countable convex combination of measures from within the family. Schervish writes before stating Theorem A.78,
"In statistical applications, we will often have a class of measures, each of which is absolutely continuous with respect to a single $\sigma$-finite measure. It would be nice if the single dominating measure were in the original class or could be constructed from the class. The following theorem addresses this problem."
A concrete example
Suppose we take a measurement of a quantity $X$ that we believe to be uniformly distributed on the interval $[0, \theta]$ for some unknown $\theta > 0$. In this statistical problem, we are implicitly considering the set $\mathcal{P}$ of Borel probability measures on $\mathbb{R}$ consisting of the uniform distributions on all intervals of the form $[0, \theta]$. That is, if $\lambda$ denotes Lebesgue measure and, for $\theta > 0$, $P_\theta$ denotes the $\mathrm{Uniform}([0, \theta])$ distribution (i.e.,
$$P_\theta(A) = \frac{1}{\theta} \lambda(A \cap [0, \theta]) = \int_A \frac{1}{\theta} \mathbf{1}_{[0, \theta]}(x) \, dx$$
for every Borel $A \subseteq \mathbb{R}$), then we simply have
$$\mathcal{P} = \{P_\theta : \theta > 0\}.$$
This is the set of candidate distributions for our measurement $X$.
The family $\mathcal{P}$ is clearly dominated by Lebesgue measure $\lambda$ (which is $\sigma$-finite), so the lemma above (with $\aleph = \mathcal{P}$) guarantees the existence of a sequence $\{c_i\}_{i=1}^\infty$ of nonnegative numbers summing to $1$ and a sequence $\{Q_i\}_{i=1}^\infty$ of uniform distributions in $\mathcal{P}$ such that
$$P_\theta \ll \sum_{i=1}^\infty c_i Q_i$$
for each $\theta > 0$.
In this example, we can construct such sequences explicitly!
First, let $(\theta_i)_{i=1}^\infty$ be an enumeration of the positive rational numbers (this can be done explicitly), and let $Q_i = P_{\theta_i}$ for each $i$.
Next, let $c_i = 2^{-i}$, so that $\sum_{i=1}^\infty c_i = 1$.
I claim that this combination of $\{c_i\}_{i=1}^\infty$ and $\{Q_i\}_{i=1}^\infty$ works.
To see this, fix $\theta > 0$ and let $A$ be a Borel subset of $\mathbb{R}$ such that $\sum_{i=1}^\infty c_i Q_i(A) = 0$.
We need to show that $P_\theta(A) = 0$.
Since $\sum_{i=1}^\infty c_i Q_i(A) = 0$ and each summand is non-negative, it follows that $c_i Q_i(A) = 0$ for each $i$.
Moreover, since each $c_i$ is positive, it follows that $Q_i(A) = 0$ for each $i$.
That is, for all $i$ we have
$$Q_i(A) = P_{\theta_i}(A) = \frac{1}{\theta_i} \lambda(A \cap [0, \theta_i]) = 0.$$
Since each $\theta_i$ is positive, it follows that $\lambda(A \cap [0, \theta_i]) = 0$ for each $i$.
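Now choose a subsequence $\{\theta_{i_k}\}_{k=1}^\infty$ of $\{\theta_i\}_{i=1}^\infty$ which decreases monotonically to $\theta$ from above (this can be done since $\mathbb{Q}$ is dense in $\mathbb{R}$).
Then $A \cap [0, \theta_{i_k}] \downarrow A \cap [0, \theta]$ as $k \to \infty$, and these sets have finite Lebesgue measure, so by continuity of measure we conclude that
$$\lambda(A \cap [0, \theta]) = \lim_{k \to \infty} \lambda(A \cap [0, \theta_{i_k}]) = 0,$$
and so $P_\theta(A) = 0$.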
This proves the claim.
Thus, in this example we were able to explicitly construct a countable convex combination of probability measures from our dominated family which still dominates the entire family.
The Lemma above guarantees that this can be done for any dominated family (at least as long as the dominating measure is σ-finite).
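As a numerical sanity check of this construction, here is a minimal Python sketch. It assumes one particular enumeration of the positive rationals (the Calkin-Wilf order, generated by Newman's recurrence); that choice is mine, and the argument above does not depend on it. The helper names `positive_rationals` and `mixture_density` are hypothetical, and the infinite sum is truncated at `n_terms`, so the weights add up to slightly less than $1$.

```python
from fractions import Fraction
from itertools import islice


def positive_rationals():
    """Yield every positive rational exactly once (Calkin-Wilf order).

    The sequence starts 1, 1/2, 2, 1/3, 3/2, 2/3, 3, ...
    """
    q = Fraction(1)
    while True:
        yield q
        # Newman's recurrence: next = 1 / (2*floor(q) - q + 1)
        q = 1 / (2 * (q.numerator // q.denominator) - q + 1)


def mixture_density(x, n_terms=40):
    """Lebesgue density at x of the truncated mixture
    sum_{i=1}^{n_terms} 2^{-i} * Uniform([0, theta_i]).
    """
    total = Fraction(0)
    thetas = islice(positive_rationals(), n_terms)
    for i, theta in enumerate(thetas, start=1):
        if 0 <= x <= theta:
            total += Fraction(1, 2**i) / theta
    return total


if __name__ == "__main__":
    for x in (Fraction(1, 10), Fraction(1), Fraction(5, 2)):
        print(x, float(mixture_density(x)))
```

The density is strictly positive at every $x \ge 0$ that lies below some $\theta_i$ included in the truncation. In the untruncated mixture that is every $x \ge 0$, which is another way to see that a set with mixture measure zero must have $\lambda(A \cap [0, \theta]) = 0$ for every $\theta > 0$.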
The Halmos-Savage Theorem
So now on to the Halmos-Savage Theorem (for which I will use slightly different notation than in the question due to personal preference).
Given the Halmos-Savage Theorem, the Fisher-Neyman factorization theorem is just one application of the Doob-Dynkin lemma and the chain rule for Radon-Nikodym derivatives away!
Halmos-Savage Theorem.
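Let $(\mathcal{X}, \mathcal{B}, \mathcal{P})$ be a dominated statistical model (meaning that $\mathcal{P}$ is a set of probability measures on $\mathcal{B}$ and there is a $\sigma$-finite measure $\mu$ on $\mathcal{B}$ such that $P \ll \mu$ for all $P \in \mathcal{P}$).
Let $T : (\mathcal{X}, \mathcal{B}) \to (\mathcal{T}, \mathcal{C})$ be a measurable function, where $(\mathcal{T}, \mathcal{C})$ is a standard Borel space.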
Then the following are equivalent:
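1. $T$ is sufficient for $\mathcal{P}$ (meaning that there is a probability kernel $r : \mathcal{B} \times \mathcal{T} \to [0, 1]$ such that $r(B, T)$ is a version of $P(B \mid T)$ for all $B \in \mathcal{B}$ and $P \in \mathcal{P}$).
2. There exist a sequence $\{c_i\}_{i=1}^\infty$ of nonnegative numbers such that $\sum_{i=1}^\infty c_i = 1$ and a sequence $\{P_i\}_{i=1}^\infty$ of probability measures in $\mathcal{P}$ such that $P \ll P^*$ for all $P \in \mathcal{P}$, where $P^* = \sum_{i=1}^\infty c_i P_i$, and for each $P \in \mathcal{P}$ there exists a $T$-measurable version of $dP/dP^*$.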
Proof.
By the lemma above, we may immediately replace $\mu$ by $P^* = \sum_{i=1}^\infty c_i P_i$ for some sequence $\{c_i\}_{i=1}^\infty$ of nonnegative numbers such that $\sum_{i=1}^\infty c_i = 1$ and a sequence $\{P_i\}_{i=1}^\infty$ of probability measures in $\mathcal{P}$.
(1. implies 2.)
Suppose T is sufficient.
Then we must show that there are $T$-measurable versions of $dP/dP^*$ for all $P \in \mathcal{P}$.
Let r be the probability kernel in the statement of the theorem.
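For each $A \in \sigma(T)$ and $B \in \mathcal{B}$ we have
$$P^*(A \cap B) = \sum_{i=1}^\infty c_i P_i(A \cap B) = \sum_{i=1}^\infty c_i \int_A P_i(B \mid T) \, dP_i = \sum_{i=1}^\infty c_i \int_A r(B, T) \, dP_i = \int_A r(B, T) \, dP^*.$$
Thus $r(B, T)$ is a version of $P^*(B \mid T)$ for all $B \in \mathcal{B}$.
For each $P \in \mathcal{P}$, let $f_P$ denote a version of the Radon-Nikodym derivative $dP/dP^*$ on the measurable space $(\mathcal{X}, \sigma(T))$ (so in particular $f_P$ is $T$-measurable).
Then for all $B \in \mathcal{B}$ and $P \in \mathcal{P}$ we have
$$P(B) = \int_{\mathcal{X}} P(B \mid T) \, dP = \int_{\mathcal{X}} r(B, T) \, dP = \int_{\mathcal{X}} r(B, T) f_P \, dP^* = \int_{\mathcal{X}} P^*(B \mid T) f_P \, dP^* = \int_{\mathcal{X}} E_{P^*}[\mathbf{1}_B f_P \mid T] \, dP^* = \int_B f_P \, dP^*.$$
Thus in fact $f_P$ is a $T$-measurable version of $dP/dP^*$ on $(\mathcal{X}, \mathcal{B})$.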
This proves that the first condition of the theorem implies the second.
(2. implies 1.)
Suppose one can choose a $T$-measurable version $f_P$ of $dP/dP^*$ for each $P \in \mathcal{P}$.
For each $B \in \mathcal{B}$, let $r(B, t)$ denote a particular version of $P^*(B \mid T = t)$ (i.e., $r(B, t)$ is a function such that $r(B, T)$ is a version of $P^*(B \mid T)$).
Since $(\mathcal{T}, \mathcal{C})$ is a standard Borel space, we may choose $r$ in a way that makes it a probability kernel (see, e.g., Theorem B.32 in Schervish's Theory of Statistics (1995)).
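We will show that $r(B, T)$ is a version of $P(B \mid T)$ for any $P \in \mathcal{P}$ and any $B \in \mathcal{B}$.
Thus, let $A \in \sigma(T)$ and $B \in \mathcal{B}$ be given.
Then for all $P \in \mathcal{P}$ we have
$$P(A \cap B) = \int_A \mathbf{1}_B f_P \, dP^* = \int_A E_{P^*}[\mathbf{1}_B f_P \mid T] \, dP^* = \int_A P^*(B \mid T) f_P \, dP^* = \int_A r(B, T) f_P \, dP^* = \int_A r(B, T) \, dP.$$
This shows that $r(B, T)$ is a version of $P(B \mid T)$ for any $P \in \mathcal{P}$ and any $B \in \mathcal{B}$, and the proof is done.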
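To make the two conditions concrete, here is a small Python sketch on a finite sample space, where every quantity can be computed exactly. The setup is my own illustrative assumption and is not part of the theorem: three independent Bernoulli trials, a hypothetical three-element family of success probabilities, equal weights $c_i = 1/3$, and $T$ equal to the number of successes. It checks that $dP/dP^*$ is constant on the level sets of $T$, and that conditioning on $T$ gives the same answer under $P^*$ as under each member of the family.

```python
from fractions import Fraction
from itertools import product

n = 3
sample_space = list(product([0, 1], repeat=n))  # X = {0, 1}^3
# Hypothetical finite family {P_p} and weights c_i (illustrative choices).
params = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
weights = [Fraction(1, 3)] * len(params)


def T(x):
    """Candidate sufficient statistic: the number of successes."""
    return sum(x)


def pmf(p, x):
    """P_p({x}) for n independent Bernoulli(p) trials."""
    k = T(x)
    return p**k * (1 - p) ** (n - k)


def p_star(x):
    """P*({x}), the convex combination sum_i c_i P_i({x})."""
    return sum(c * pmf(p, x) for c, p in zip(weights, params))


def density(p, x):
    """dP_p/dP* evaluated at x (a ratio of point masses here)."""
    return pmf(p, x) / p_star(x)


def conditional(measure, x):
    """measure({x} | T = T(x)) for a function giving point masses."""
    level_set_mass = sum(measure(y) for y in sample_space if T(y) == T(x))
    return measure(x) / level_set_mass


# Condition 2: dP_p/dP* takes a single value on each level set {T = t}.
density_is_T_measurable = all(
    len({density(p, x) for x in sample_space if T(x) == t}) == 1
    for p in params
    for t in range(n + 1)
)

# Condition 1: P*(. | T) serves as a conditional distribution for every P_p.
common_conditional = all(
    conditional(lambda y, p=p: pmf(p, y), x) == conditional(p_star, x)
    for p in params
    for x in sample_space
)

print(density_is_T_measurable, common_conditional)
```

Using `Fraction` keeps the comparisons exact, so the equalities are checked without floating-point tolerance.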
Summary.
The important technical result underlying the Halmos-Savage theorem as presented here is the fact that a dominated family of probability measures is actually dominated by a countable convex combination of probability measures from that family.
Given that result, the rest of the Halmos-Savage theorem is mostly just manipulations with basic properties of Radon-Nikodym derivatives and conditional expectations.