Classificador Akinator.com e Naive Bayes

Contexto: Sou um programador com alguma experiência (meio esquecida) em estatística de cursos da uni. Recentemente me deparei com http://akinator.com e passei algum tempo tentando fazê-lo falhar. E quem não era? :)

Eu decidi descobrir como isso poderia funcionar. Depois de pesquisar no Google e ler postagens de blog relacionadas e adicionar um pouco do meu conhecimento (limitado) ao mix resultante, crio o seguinte modelo (tenho certeza de que usarei a notação errada, por favor, não me mate por isso):

Existem assuntos (s) e perguntas (Q). O objetivo do preditor é selecionar o assunto S que tem a maior probabilidade posterior de ser o assunto em que o usuário está pensando, com perguntas e respostas coletadas até o momento.

Seja o jogo G um conjunto de perguntas e respostas dadas: . $\{q_1, a_1\}, \{q_2, a_2\} ... \{q_n, a_n\}$

Então o preditor está procurando por . $P(S|G) = \frac{P(G|S) * P(S)}{P(G)}$

Anterior para sujeitos ( $P(S)$ ), poderia ser apenas o número de vezes que o sujeito foi adivinhado, dividido pelo número total de jogos.

Assumindo que todas as respostas são independentes, poderíamos calcular a probabilidade do sujeito S, dado o jogo G da seguinte forma:

$P(G|S) = \prod_{i=1..n} P(\{q_i, a_i\} | S)$

Poderíamos calcular o se mantivermos o controle de quais perguntas e respostas foram dadas quando os usados têm o assunto em questão: $P(\{q_i, a_i\} | S)$

$P({q, a} | S) = \frac{answer\ a\ was\ given\ to\ question\ q\ in\ the\ game\ when\ S\ was\ the\ subject}{number\ of\ times\ q\ was\ asked\ in\ the\ games\ involving\ S}$

Now, $P(S|G)$ defines a probability distribution over subjects and when we need to select the next question we have to select the one for which the expected change in the entropy of this distribution is maximal:

$argmax_j (H[P(S|G)] - \sum_{a=yes,no,maybe...} H[P(S|G \vee \{q_j, a\})]$

I've tried to implement this and it works. But, obviously, as the number of subjects goes up, performance degrades due to the need to recalculate the $P(S|G)$ after each move and calculate updated distribution $P(S|G \vee \{q_j, a\})$ for question selection.

I suspect that I simply have chosen the wrong model, being constrained by the limits of my knowledge. Or, maybe, there is an error in the math. Please enlighten me: what should I make myself familiar with, or how to change the predictor so that it could cope with millions of subjects and thousands of questions?

machine-learning naive-bayes

— ADEpt
fonte

I doubt it is a Naive Bayes, rather some decision tree extended each time it fails to recognize someone.

Wouldn't such decision tree be too unwieldy to update? Plus, I see no easy way to account for accidentally-wrong/honest-mistake answers and still get it right in the end with decision tree

— ADEpt

Looks like a reincarnation of the twenty year old 20-questions guesser, now at 20q.net. Here's a popular explanation how it works mentalfloss.com/blogs/archives/13725

— Yaroslav Bulatov

Excuse me, but I think that using "artificial intelligence" and "neural networks" without any context hardy counts as explanation. And I cannot see how one could use neural net for this kind of thing - what would be the output function, for example?

— ADEpt

Hi @ADEpt, It has been a while since the question was asked, but can you share the source code for the implementation you had back there?

— prikha

Respostas:

This game looks similar to 20 questions at http://20q.net, which the creator reports is based on a neural network. Here's one way to structure such network, similar to the neural network described in Concept description vectors and the 20 question game.

You'd have

A fixed number of questions, with some questions marked as "final" questions.
One input unit per question, where 0/1 represents no/yes answer. Initially set to 0.5
One output unit per question, sigmoid squished into 0..1 range
Hidden layer connecting all input units to all output units.

Input units for questions that have been answered are set to 0 or 1, and the assumption is that neural network has been trained to make output units output values close to 1 for questions that have "Yes" answer given a set of existing answers.

At each stage you would pick the question which NN is the least sure about, ie, corresponding output unit is close to 0.5, ask the question, and set corresponding input unit to the answer. At the last stage you pick an output unit from the "final question" list with value closest to 1.

Each game of 20 questions gives 20 datapoints which you can use to update NN's weights with back-propagation, ie, you update the weights to make the outputs of current neural network match the true answer given all the previous questions asked.

— Yaroslav Bulatov
fonte

I don’t think it is really a classification problem. 20 questions is often characterized as a compression problem. This actually matches better with the last part of your question where you talk about entropy.

See Chapter 5.7 (Google books) of

Cover, T.M. and Joy, A.T. (2006) Elements of Information Theory

and also Huffman coding. This paper I found on arXiv may be of interest as well.

Gill, J.T. and Wu, W. (2010) "Twenty Questions Games Always End With Yes” http://arxiv.org/abs/1002.4907

For simplicity assume yes/no questions (whereas akinator.com allows allows maybe, don’t know). Assume that every possible subject (what akinator.com knows) can be uniquely identified by a sequence of yes/no questions and answers — essentially a binary vector.

The questions that are (and their answers) asked define a recursive partitioning of the space of subjects. This partitioning also corresponds to a tree structure. The interior vertices of the tree correspond to questions, and the leaves correspond to subjects. The depth of the leaves is exactly the number of questions required to uniquely identify the subject. You can (trivially) identify every known subject by asking every possible question. That’s not interesting because there are potentially hundreds of questions.

The connection with Huffman coding is that it gives an optimal way (under a certain probabilistic model) to construct the tree so that the average depth is (nearly) minimal. In other words, it tells you how to arrange the sequence of questions (construct a tree) so that the number of questions you need to ask is on average small. It uses a greedy approach.

There is, of course, more to akinator.com than this, but the basic idea is that you can think of the problem in terms of a tree and trying to minimize the average depth of its leaves.

— vqv
fonte

That's a good start, but I think there is more to the problem. For example: how do they get the answers to the questions? Presumably they use the answers given by previous players (a reinforcement learning problem), in which case we face the tradeoff of choosing a question that (a) solves the problem for the current player, and (b) provides information for future players.

— Simon Byrne

At the first glance, ability to draw analogy between 20 questions and Huffman coding hinges on the ability to ask "range questions". That is, instead of "Have you character ever been to space?" we are asking "blanket" questions like "Has he ever been to space, or is a male, or is bald, or was in a movie, or ... (100500 other options)?" Am I right? If so, then I should probably edit my question to make it clear that I am interested in "ask one by one" variety

— ADEpt

Plus, most of the articles that use Huffman codes as model for 20 questions, imply that questioner is free to make up his own questions, which in essence boil down to "Does bit

i

$i$ in codeword for the object is

0

$0$ "? However, in my case (or, rather, akinator.com's case) the set of questions is predefined, and it (obviously) has nothing to with Huffman codewords. Right now I can't see how to make transition from my question to Huffman codes. Perhaps you could elaborate?

— ADEpt

@vqv: Re: "I don’t think it is really a classification problem. 20 questions is often characterized as a compression problem." Aren't statistical inference and information compression directly related / the same problem?

— Yang

@Yang Are you referring to Jorma Rissannen’s argument? Statistical inference and information compression both make use of probabilistic models to describe uncertainty, however their perspectives and those if the researchers in those areas are generally very different. What I mean to say above is that 20 questions can be more naturally formulated as a compression (specifically, source coding) problem rather than a classification problem. … continued below

— vqv