The script below uses the lsa package with a thresholded IDF weighting to strip the parameters out of the templates. The idea is that every term whose IDF is higher than some threshold is treated as a parameter and its frequency is set to zero. The threshold can be approximated by the average number of times a template occurs in the corpus. Once the parameters are eliminated, the distance between records that share the same template is zero.
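To make the threshold concrete, here is the IDF formula that, as far as I understand it, lsa's gw_idf computes (the symbols N for the number of documents and df_t for the number of documents containing term t are my own notation, not the package's). With the nine records used below, a term that appears in only one record scores above 3, while terms shared by a whole template score below it:

```latex
\[
\mathrm{idf}(t) = \log_2\!\left(\frac{N}{\mathrm{df}_t}\right) + 1
\]
\[
N = 9:\qquad
\mathrm{df}_t = 1 \;\Rightarrow\; \mathrm{idf}(t) = \log_2 9 + 1 \approx 4.17,\qquad
\mathrm{df}_t = 3 \;\Rightarrow\; \mathrm{idf}(t) = \log_2 3 + 1 \approx 2.58,\qquad
\mathrm{df}_t = 6 \;\Rightarrow\; \mathrm{idf}(t) = \log_2 1.5 + 1 \approx 1.58
\]
```

So with a threshold of 3, only the terms unique to a single record (the numbers and the colours) are zeroed out.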
library(tm)
library(lsa)
df <- data.frame(
  TEMPLATE = c(rep("A", 3), rep("B", 3), rep("C", 3)),
  TEXT = c(
    paste("Temperature today is", c(28, 24, 20), "centigrades"),
    paste("Temperature today is", c(82, 75, 68), "Fahrenheit"),
    paste("Her eyes are", c("blue", "black", "green"), "and hair", c("grey", "brown", "white"))),
  stringsAsFactors = FALSE)
> df
TEMPLATE TEXT
1 A Temperature today is 28 centigrades
2 A Temperature today is 24 centigrades
3 A Temperature today is 20 centigrades
4 B Temperature today is 82 Fahrenheit
5 B Temperature today is 75 Fahrenheit
6 B Temperature today is 68 Fahrenheit
7 C Her eyes are blue and hair grey
8 C Her eyes are black and hair brown
9 C Her eyes are green and hair white
corpus <- Corpus(VectorSource(df$TEXT))
td <- as.matrix(TermDocumentMatrix(corpus,control=list(wordLengths = c(1, Inf)) ))
> td
             Docs
Terms         1 2 3 4 5 6 7 8 9
  20          0 0 1 0 0 0 0 0 0
  24          0 1 0 0 0 0 0 0 0
  28          1 0 0 0 0 0 0 0 0
  68          0 0 0 0 0 1 0 0 0
  75          0 0 0 0 1 0 0 0 0
  82          0 0 0 1 0 0 0 0 0
  and         0 0 0 0 0 0 1 1 1
  are         0 0 0 0 0 0 1 1 1
  black       0 0 0 0 0 0 0 1 0
  blue        0 0 0 0 0 0 1 0 0
  brown       0 0 0 0 0 0 0 1 0
  centigrades 1 1 1 0 0 0 0 0 0
  eyes        0 0 0 0 0 0 1 1 1
  fahrenheit  0 0 0 1 1 1 0 0 0
  green       0 0 0 0 0 0 0 0 1
  grey        0 0 0 0 0 0 1 0 0
  hair        0 0 0 0 0 0 1 1 1
  her         0 0 0 0 0 0 1 1 1
  is          1 1 1 1 1 1 0 0 0
  temperature 1 1 1 1 1 1 0 0 0
  today       1 1 1 1 1 1 0 0 0
  white       0 0 0 0 0 0 0 0 1
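## suppress terms with idf higher than the template frequency;
## those terms are considered parameters
template_freq <- 3
idf <- gw_idf(td)  # one idf value per term
## binary term presence, weighted by idf; parameter terms get weight 0
tdw <- lw_bintf(td) * ifelse(idf > template_freq, 0, idf)
## Euclidean distance between documents (columns of tdw)
dist <- dist(t(tdw))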
> dist
         1        2        3        4        5        6        7        8
2 0.000000
3 0.000000 0.000000
4 3.655689 3.655689 3.655689
5 3.655689 3.655689 3.655689 0.000000
6 3.655689 3.655689 3.655689 0.000000 0.000000
7 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341
8 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000
9 6.901341 6.901341 6.901341 6.901341 6.901341 6.901341 0.000000 0.000000
The distance matrix clearly shows that records 1, 2 and 3 come from the same template (distance = 0 with this synthetic data; in a real case a small threshold should be used instead of an exact zero). The same holds for records 4, 5, 6 and for records 7, 8, 9.
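One possible way to apply such a threshold is to cut a single-linkage clustering of the distances at a small height. The sketch below is only an illustration, not part of the script above: it rebuilds the weighted matrix with base R so that it runs without tm or lsa, it assumes the IDF is log2(N / df) + 1, and the names eps and groups as well as the value 0.5 are hypothetical choices.

```r
# Same synthetic records as in the example above
texts <- c(
  paste("Temperature today is", c(28, 24, 20), "centigrades"),
  paste("Temperature today is", c(82, 75, 68), "Fahrenheit"),
  paste("Her eyes are", c("blue", "black", "green"), "and hair", c("grey", "brown", "white"))
)

# Binary term-document matrix: rows are terms, columns are documents
tokens <- strsplit(tolower(texts), "\\s+")
terms <- sort(unique(unlist(tokens)))
td <- sapply(tokens, function(tok) as.numeric(terms %in% tok))
rownames(td) <- terms

# IDF per term (assumption: same formula as lsa::gw_idf)
idf <- log2(ncol(td) / rowSums(td)) + 1

# Zero out terms whose idf exceeds the threshold (treated as parameters)
template_freq <- 3
weights <- ifelse(idf > template_freq, 0, idf)
tdw <- td * weights  # the weight vector is recycled down each column

# Euclidean distance between documents
d <- dist(t(tdw))

# Records closer than eps end up in the same group (eps is a hypothetical value)
eps <- 0.5
groups <- cutree(hclust(d, method = "single"), h = eps)
print(groups)
```

With the synthetic data this should put records 1 to 3, 4 to 6 and 7 to 9 into three separate groups. On real data eps would have to be tuned, since residual noise terms keep same-template distances slightly above zero.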