How to find confidence intervals for ratings?

32

Evan Miller's "How Not To Sort By Average Rating" proposes using the lower bound of a confidence interval to get a sensible aggregate "score" for rated items. However, it works with a Bernoulli model: ratings are either positive or negative.

What's a reasonable confidence interval to use for a rating model which assigns a discrete score of $1$ to $k$ stars, assuming that the number of ratings for an item might be small?

I think I can see how to adapt the centre of the Wilson and Agresti-Coull intervals as

$\tilde{p} = \frac{\sum_{i=1}^n{x_i} + z_{\alpha/2}^2\; p_0}{n + z_{\alpha/2}^2}$

where either $p_0 = \frac{k+1}{2}$ or (probably better) it is the average rating over all items. However, I'm not sure how to adapt the width of the interval. My (revised) best guess would be

$\tilde{p} \pm \frac{z_{\alpha/2}}{\tilde{n}} \sqrt{\frac{\sum_{i=1}^n{(x_i - \tilde{p})^2} + z_{\alpha/2}(p_0-\tilde{p})^2}{\tilde{n}}}$

with $\tilde{n} = n + z_{\alpha/2}^2$, but I can't justify it with more than hand-waving as an analogy of Agresti-Coull, taking that as

$\text{Estimate}(\bar{X}) \pm \frac{z_{\alpha/2}}{\tilde{n}} \sqrt{\text{Estimate}(\text{Var}(X))}$
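
To make the guess concrete, here is a minimal Python sketch that simply transcribes the centre and width formulas above. The function name `tentative_interval` and its parameters are hypothetical, and nothing here shows that the interval has the nominal coverage; it only evaluates the expressions as written.

```python
from math import sqrt
from statistics import NormalDist


def tentative_interval(ratings, p0, alpha=0.05):
    """Return (lower, centre, upper) for the tentative interval above.

    ratings: list of star ratings x_i (may be empty)
    p0:      prior centre, e.g. (k + 1) / 2 or the mean over all items
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    n = len(ratings)
    n_tilde = n + z ** 2
    centre = (sum(ratings) + z ** 2 * p0) / n_tilde
    # Numerator under the square root, exactly as in the guessed formula.
    spread = sum((x - centre) ** 2 for x in ratings) + z * (p0 - centre) ** 2
    half_width = (z / n_tilde) * sqrt(spread / n_tilde)
    return centre - half_width, centre, centre + half_width


# Three 5-star ratings on a 1..5 scale, with p0 = (k + 1) / 2 = 3.
print(tentative_interval([5, 5, 5], p0=3.0))
```

With only three ratings the centre is pulled well below 5, towards $p_0$, which is the behaviour the adapted centre is meant to give.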

Are there standard confidence intervals which apply? (Note that I don't have subscriptions to any journals or easy access to a university library; by all means give proper references, but please supplement with the actual result!)

confidence-interval estimation

— Peter Taylor

4

Because the current replies have (perhaps out of politeness) skirted around this issue, I would like to point out that this application is a terrible abuse of confidence limits. There is no theoretical justification for using the LCL to rank means (and plenty of reasons why the LCL is actually worse than the mean itself for ranking purposes). Thus this question is predicated on a badly flawed approach, which may be why it has attracted relatively little attention.

— whuber

2

A nice feature of this particular question is that it contains sufficient context for us to ignore the actual question and focus on what appeared to be the more important underlying one.

— Karl

1

I'm glad you modified the changed title to your liking, Peter. My original edit was made not to be self-serving, but to make the title reflect the text of the question. You are the final arbiter of what you really mean.

— whuber

23

Like Karl Broman said in his answer, a Bayesian approach would likely be a lot better than using confidence intervals.

The Problem With Confidence Intervals

Why might using confidence intervals not work too well? One reason is that if you don't have many ratings for an item, then your confidence interval is going to be very wide, so the lower bound of the confidence interval will be small. Thus, items without many ratings will end up at the bottom of your list.

Intuitively, however, you probably want items without many ratings to be near the average item, so you want to wiggle your estimated rating of the item toward the mean rating over all items (i.e., you want to push your estimated rating toward a prior). This is exactly what a Bayesian approach does.

Bayesian Approach I: Normal Distribution over Ratings

One way of moving the estimated rating toward a prior is, as in Karl's answer, to use an estimate of the form $w*R + (1-w)*C$ :

$R$ is the mean over the ratings for the item.
$C$ is the mean over all items (or whatever prior you want to shrink your rating to).
Note that the formula is just a weighted combination of $R$ and $C$.
$w = \frac{v}{v+m}$ is the weight assigned to $R$, where $v$ is the number of ratings for the item and $m$ is some kind of constant "threshold" parameter.
Note that when $v$ is very large, i.e., when we have a lot of ratings for the current item, then $w$ is very close to 1, so our estimated rating is very close to $R$ and we pay little attention to the prior $C$ . When $v$ is small, however, $w$ is very close to 0, so the estimated rating places a lot of weight on the prior $C$ .
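
As a rough illustration of how the weight behaves, here is a minimal Python sketch of the estimate $wR + (1-w)C$. The function name `shrunk_rating` and the example numbers are made up for illustration, and $m$ is assumed to be positive.

```python
def shrunk_rating(item_mean, n_ratings, prior_mean, m):
    """Weighted combination w*R + (1-w)*C with w = v / (v + m).

    item_mean:  R, the mean rating of this item
    n_ratings:  v, the number of ratings for this item
    prior_mean: C, the mean over all items (or another prior)
    m:          positive "threshold" constant (assumed > 0)
    """
    w = n_ratings / (n_ratings + m)
    return w * item_mean + (1 - w) * prior_mean


# Same item mean R = 5.0 and prior C = 3.0, with m = 10 (hypothetical values).
print(shrunk_rating(5.0, 2, 3.0, 10))    # few ratings: about 3.33, close to C
print(shrunk_rating(5.0, 200, 3.0, 10))  # many ratings: about 4.90, close to R
```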

This estimate can, in fact, be given a Bayesian interpretation as the posterior estimate of the item's mean rating when individual ratings come from a normal distribution centered around that mean.

However, assuming that ratings come from a normal distribution has two problems:

A normal distribution is continuous, but ratings are discrete.
Ratings for an item don't necessarily follow a unimodal Gaussian shape. For example, maybe your item is very polarizing, so people tend to either give it a very high rating or give it a very low rating.

Bayesian Approach II: Multinomial Distribution over Ratings

So instead of assuming a normal distribution for ratings, let's assume a multinomial distribution. That is, given some specific item, there's a probability $p_1$ that a random user will give it 1 star, a probability $p_2$ that a random user will give it 2 stars, and so on.

Of course, we have no idea what these probabilities are. As we get more and more ratings for this item, we can guess that $p_1$ is close to $\frac{n_1}{n}$ , where $n_1$ is the number of users who gave it 1 star and $n$ is the total number of users who rated the item, but when we first start out, we have nothing. So we place a Dirichlet prior $Dir(\alpha_1, \ldots, \alpha_k)$ on these probabilities.

What is this Dirichlet prior? We can think of each $\alpha_i$ parameter as being a "virtual count" of the number of times some virtual person gave the item $i$ stars. For example, if $\alpha_1 = 2$ , $\alpha_2 = 1$ , and all the other $\alpha_i$ are equal to 0, then we can think of this as saying that two virtual people gave the item 1 star and one virtual person gave the item 2 stars. So before we even get any actual users, we can use this virtual distribution to provide an estimate of the item's rating.

[One way of choosing the $\alpha_i$ parameters would be to set $\alpha_i$ equal to the overall proportion of votes of $i$ stars. (Note that the $\alpha_i$ parameters aren't necessarily integers.)]

Then, once actual ratings come in, simply add their counts to the virtual counts of your Dirichlet prior. Whenever you want to estimate the rating of your item, simply take the mean over all of the item's ratings (both its virtual ratings and its actual ratings).
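
A minimal Python sketch of this bookkeeping, assuming ratings are stored as per-star counts; the function name `dirichlet_mean_rating` is hypothetical, and the prior is the virtual-count example from above ($\alpha_1 = 2$, $\alpha_2 = 1$, the rest 0) on a 5-star scale.

```python
def dirichlet_mean_rating(counts, prior_counts):
    """Mean rating over actual plus virtual (Dirichlet prior) counts.

    counts[i] and prior_counts[i] are the actual and virtual numbers of
    (i + 1)-star ratings; prior counts need not be integers.
    """
    if len(counts) != len(prior_counts):
        raise ValueError("counts and prior_counts must have the same length")
    posterior = [n + a for n, a in zip(counts, prior_counts)]
    total = sum(posterior)
    if total <= 0:
        raise ValueError("need at least some actual or virtual ratings")
    # Weighted average of the star values 1..k under the combined counts.
    return sum(stars * c for stars, c in enumerate(posterior, start=1)) / total


prior = [2, 1, 0, 0, 0]  # two virtual 1-star votes, one virtual 2-star vote
print(dirichlet_mean_rating([0, 0, 0, 0, 0], prior))  # prior only: 4/3
print(dirichlet_mean_rating([0, 0, 0, 0, 3], prior))  # after three 5-star votes: 19/6
```

Keeping the full vector of counts, rather than just the mean, is what would let you compute other summaries of the posterior later.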

— raegtin

1

Approach 2 works out as identical to approach 1, doesn't it, but with a different justification?

— Peter Taylor

2

@Peter: oh, true! Didn't realize that until you mentioned it =). (If all you want to do is take the mean of the posterior, they're identical. I guess having a Dirichlet posterior might be useful if you want to compute a different kind of score, e.g., some kind of polarity measure, though that might be kind of rare.)
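
A sketch of why the means coincide, assuming $n > 0$ actual ratings and writing $n_i$ for the number of actual $i$-star ratings, $n = \sum_i n_i$, and $A = \sum_i \alpha_i > 0$ (the symbol $A$ is introduced here only for illustration). The mean over actual plus virtual ratings is

$\frac{\sum_{i=1}^{k} i\,(n_i + \alpha_i)}{n + A} = \frac{n}{n + A} \cdot \frac{\sum_{i=1}^{k} i\, n_i}{n} + \frac{A}{n + A} \cdot \frac{\sum_{i=1}^{k} i\, \alpha_i}{A}$

which has the form $wR + (1-w)C$ with $v = n$, $m = A$, $R$ the item's own mean rating, and $C$ the mean of the virtual ratings.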

— raegtin

1

In approach 1, how do you typically choose $m$?

— Jason C

15

This situation cries out for a Bayesian approach. There are simple approaches for Bayesian rankings of ratings here (pay particular attention to the comments, which are interesting) and here, and then a further commentary on these here. As one of the comments in the first of these links points out:

The Best of BeerAdvocate (BA) ... uses a Bayesian estimate:

weighted rank (WR) = (v / (v+m)) × R + (m / (v+m)) × C

where:
R = review average for the beer
v = number of reviews for the beer
m = minimum reviews required to be listed (currently 10)
C = the mean across the list (currently 2.5)

— Karl

2

A disadvantage of the Beer Advocate method is that it does not take account of variability. Nevertheless, I prefer this line of thinking to the lower confidence limit idea.

— Karl