Como fazer regressão linear por partes com vários nós desconhecidos?

14

Existem pacotes para fazer regressão linear por partes, que podem detectar os vários nós automaticamente? Obrigado. Quando eu uso o pacote strucchange. Não consegui detectar os pontos de mudança. Não faço ideia de como ele detecta os pontos de mudança. Pelas parcelas, pude ver que há vários pontos que quero que possam me ajudar a selecioná-los. Alguém poderia dar um exemplo aqui?

regression change-point

— Honglang Wang
fonte

1

Esta parece ser a mesma pergunta que stats.stackexchange.com/questions/5700/… . Se diferir de maneira substancial, informe-nos editando sua pergunta para refletir as diferenças; caso contrário, vamos fechá-lo como duplicado.

— whuber

1

Eu editei a pergunta.

— Honglang Wang

1

Eu acho que você pode fazer isso como um problema de otimização não linear. Basta escrever a equação da função a ser ajustada, com os coeficientes e a localização do nó como parâmetros.

— mark999

1

Eu acho que o segmentedpacote é o que você está procurando.

— AlefSin 24/11/11

1

Eu tive um problema idêntico, resolvi-o com o segmentedpacote de R : stackoverflow.com/a/18715116/857416

— um ben diferente

8

Será que MARS ser aplicável? R possui o pacote earthque o implementa.

— Wayne
fonte

8

Em geral, é um pouco estranho querer ajustar algo como linear por partes. No entanto, se você realmente deseja fazer isso, o algoritmo MARS é o mais direto. Ele criará uma função, um nó de cada vez; e, em seguida, geralmente reduz o número de nós para combater árvores de decisão ala excessivas. Você pode acessar o algotitmo MARS em R via earthou mda. Em geral, é adequado ao GCV, que não está tão distante do outro critério de informação (AIC, BIC etc.)

O MARS não oferece realmente um ajuste "ideal", pois os nós crescem um de cada vez. Seria realmente difícil ajustar um número verdadeiramente "ideal" de nós, já que as possíveis permutações da colocação dos nós explodiriam rapidamente.

Geralmente, é por isso que as pessoas se voltam para suavizar splines. A maioria das splines de suavização é cúbica, para que você possa enganar um olho humano e perder as descontinuidades. No entanto, seria bem possível fazer uma spline de suavização linear. A grande vantagem de suavizar splines é seu único parâmetro para otimizar. Isso permite que você alcance rapidamente uma solução verdadeiramente "ideal" sem precisar pesquisar por várias permutações. No entanto, se você realmente deseja procurar pontos de inflexão e tem dados suficientes para fazê-lo, então algo como MARS provavelmente seria sua melhor aposta.

Aqui está um código de exemplo para splines de suavização linear penalizadas em R:

require(mgcv);data(iris);
gam.test <- gam(Sepal.Length ~ s(Petal.Width,k=6,bs='ps',m=0),data=iris)
summary(gam.test);plot(gam.test);

Os nós reais escolhidos não necessariamente se correlacionariam com quaisquer pontos de inflexão verdadeiros.

— Shea Parkes
fonte

3

Programei isso do zero uma vez há alguns anos e tenho um arquivo Matlab para fazer regressão linear por partes no meu computador. Cerca de 1 a 4 pontos de interrupção são computacionalmente possíveis para cerca de 20 pontos de medição. 5 ou 7 pontos de interrupção começam a ser realmente demais.

A abordagem matemática pura, na minha opinião, é tentar todas as combinações possíveis, conforme sugerido pelo usuário mbq na pergunta vinculada ao comentário abaixo da sua pergunta.

Como as linhas ajustadas são todas consecutivas e adjacentes (sem sobreposições), a combinatória seguirá o triângulo Pascal. Se houvesse sobreposições entre os pontos de dados usados pelos segmentos de linha, acredito que a combinatória seguiria os números Stirling do segundo tipo.

A melhor solução em minha mente é escolher a combinação de linhas ajustadas que tem o menor desvio padrão dos valores de correlação R ^ 2 das linhas ajustadas. Vou tentar explicar com um exemplo. Lembre-se, porém, de que perguntar quantos pontos de interrupção devemos encontrar nos dados é semelhante a perguntar "Quanto tempo dura a costa da Grã-Bretanha?" como em um dos artigos de Benoit Mandelbrots (matemático) sobre fractais. E há uma troca entre o número de pontos de interrupção e a profundidade da regressão.

Agora para o exemplo.

$y$ $x$ $x$ $y$

\begin{array}{cccccc} x & y & R^{2} l i n e 1 & R^{2} l i n e 2 & s u m o f R^{2} v a l u e s & s t a n d a r d d e v i a t i o n o f R^{2} \\ 1 & 1 & 1, 000 & 0, 0400 & 1, 0400 & 0, 6788 \\ 2 & 2 & 1, 000 & 0, 0118 & 1, 0118 & 0, 6987 \\ 3 & 3 & 1, 000 & 0, 0004 & 1, 0004 & 0, 7067 \\ 4 & 4 & 1, 000 & 0, 0031 & 1, 0031 & 0, 7048 \\ 5 & 5 & 1, 000 & 0, 0135 & 1, 0135 & 0, 6974 \\ 6 & 6 & 1, 000 & 0, 0238 & 1, 0238 & 0, 6902 \\ 7 & 7 & 1, 000 & 0, 0277 & 1, 0277 & 0, 6874 \\ 8 & 8 & 1, 000 & 0, 0222 & 1, 0222 & 0, 6913 \\ 9 & 9 & 1, 000 & 0, 0093 & 1, 0093 & 0, 7004 \\ 10 & 10 & 1, 000 & - 1, 978 & 1, 000 & 0, 7071 \\ 11 & 9 & 0, 9709 & 0, 0271 & 0, 9980 & 0, 6673 \\ 12 & 8 & 0, 8951 & 0, 1139 & 1, 0090 & 0, 5523 \\ 13 & 7 & 0, 7734 & 0, 2558 & 1, 0292 & 0, 3659 \\ 14 & 6 & 0, 6134 & 0, 4321 & 1, 0455 & 0, 1281 \\ 15 & 5 & 0, 4321 & 0, 6134 & 1, 0455 & 0, 1282 \\ 16 & 4 & 0, 2558 & 0, 7733 & 1, 0291 & 0, 3659 \\ 17 & 3 & 0, 1139 & 0, 8951 & 1, 0090 & 0, 5523 \\ 18 & 2 & 0, 0272 & 0, 9708 & 0, 9980 & 0, 6672 \\ 19 & 1 & 0 & 1, 000 & 1, 000 & 0, 7071 \\ 20 & 2 & 0, 0094 & 1, 000 & 1, 0094 & 0, 7004 \\ 21 & 3 & 0, 0222 & 1, 000 & 1, 0222 & 0, 6914 \\ 22 & 4 & 0, 0278 & 1, 000 & 1, 0278 & 0, 6874 \\ 23 & 5 & 0, 0239 & 1, 000 & 1, 0239 & 0, 6902 \\ 24 & 6 & 0, 0136 & 1, 000 & 1, 0136 & 0, 6974 \\ 25 & 7 & 0, 0032 & 1, 000 & 1, 0032 & 0, 7048 \\ 26 & 8 & 0, 0004 & 1, 000 & 1, 0004 & 0, 7068 \\ 27 & 9 & 0, 0118 & 1, 000 & 1, 0118 & 0, 6987 \\ 28 & 10 & 0, 04 & 1, 000 & 1, 04 & 0, 6788 \end{array}

$\begin{array}{|c|c|c|c|c|c|} \hline &x &y &R^2 line 1 &R^2 line 2 &sum of R^2 values &standard deviation of R^2 \\ \hline &1 &1 &1,000 &0,0400 &1,0400 &0,6788 \\ \hline &2 &2 &1,000 &0,0118 &1,0118 &0,6987 \\ \hline &3 &3 &1,000 &0,0004 &1,0004 &0,7067 \\ \hline &4 &4 &1,000 &0,0031 &1,0031 &0,7048 \\ \hline &5 &5 &1,000 &0,0135 &1,0135 &0,6974 \\ \hline &6 &6 &1,000 &0,0238 &1,0238 &0,6902 \\ \hline &7 &7 &1,000 &0,0277 &1,0277 &0,6874 \\ \hline &8 &8 &1,000 &0,0222 &1,0222 &0,6913 \\ \hline &9 &9 &1,000 &0,0093 &1,0093 &0,7004 \\ \hline &10 &10 &1,000 &-1,978 &1,000 &0,7071 \\ \hline &11 &9 &0,9709 &0,0271 &0,9980 &0,6673 \\ \hline &12 &8 &0,8951 &0,1139 &1,0090 &0,5523 \\ \hline &13 &7 &0,7734 &0,2558 &1,0292 &0,3659 \\ \hline &14 &6 &0,6134 &0,4321 &1,0455 &0,1281 \\ \hline &15 &5 &0,4321 &0,6134 &1,0455 &0,1282 \\ \hline &16 &4 &0,2558 &0,7733 &1,0291 &0,3659 \\ \hline &17 &3 &0,1139 &0,8951 &1,0090 &0,5523 \\ \hline &18 &2 &0,0272 &0,9708 &0,9980 &0,6672 \\ \hline &19 &1 &0 &1,000 &1,000 &0,7071 \\ \hline &20 &2 &0,0094 &1,000 &1,0094 &0,7004 \\ \hline &21 &3 &0,0222 &1,000 &1,0222 &0,6914 \\ \hline &22 &4 &0,0278 &1,000 &1,0278 &0,6874 \\ \hline &23 &5 &0,0239 &1,000 &1,0239 &0,6902 \\ \hline &24 &6 &0,0136 &1,000 &1,0136 &0,6974 \\ \hline &25 &7 &0,0032 &1,000 &1,0032 &0,7048 \\ \hline &26 &8 &0,0004 &1,000 &1,0004 &0,7068 \\ \hline &27 &9 &0,0118 &1,000 &1,0118 &0,6987 \\ \hline &28 &10 &0,04 &1,000 &1,04 &0,6788 \\ \hline \end{array}$

These y values have the graph:

idealized data

Which clearly has two break points. For the sake of argument we will calculate the R^2 correlation values (with the Excel cell formulas (European dot-comma style)):

=INDEX(LINEST(B1:$B$1;A1:$A$1;TRUE;TRUE);3;1)
=INDEX(LINEST(B1:$B$28;A1:$A$28;TRUE;TRUE);3;1)

for all possible non-overlapping combinations of two fitted lines. All the possible pairs of R^2 values have the graph:

R^2 values

The question is which pair of R^2 values should we choose, and how do we generalize to multiple break points as asked in the title? One choice is to pick the combination for which the sum of the R-square correlation is the highest. Plotting this we get the upper blue curve below:

sum of R squared and standard deviation of R squared

The blue curve, the sum of the R-squared values, is the highest in the middle. This is more clearly visible from the table with the value $1,0455$ as the highest value. However it is my opinion that the minimum of the red curve is more accurate. That is, the minimum of the standard deviation of the R^2 values of the fitted regression lines should be the best choice.

Piece wise linear regression - Matlab - multiple break points

— Mats Granvik
fonte

1

There is a pretty nice algorithm described in Tomé and Miranda (1984).

The proposed methodology uses a least-squares approach to compute the best continuous set of straight lines that fit a given time series, subject to a number of constraints on the minimum distance between breakpoints and on the minimum trend change at each breakpoint.

The code and a GUI are available in both Fortran and IDL from their website: http://www.dfisica.ubi.pt/~artome/linearstep.html

— arkaia
fonte

0

... first of all you must to do it by iterations, and under some informative criterion, like AIC AICc BIC Cp; because you can get an "ideal" fit, if number of knots K = number od data points N, ok. ... first put K = 0; estimate L = K + 1 regressions, calculate AICc, for instance; then assume minimal number of data points at a separate segment, say L = 3 or L = 4, ok ... put K = 1; start from L-th data as the first knot, calculate SS or MLE, ... and step by step the next data point as a knot, SS or MLE, up to the last knot at the N - L data; choose the arrangement with the best fit (SS or MLE) calculate AICc ... ... put K = 2; ... use all previous regressions (that is their SS or MLE), but step by step divide a single segment into all possible parts ... choose the arrangement with the best fit (SS or MLE) calculate AICc ... if the last AICc occurs greater then the previous one: stop the iterations ! This is an optimal solution under AICc criterion, ok

— Maciek
fonte

AIC, BIC can't be used because they penalised for extra parameters, which is clearly not the case here.

— HelloWorld

0

I once came across a program called Joinpoint. On their website they say it fits a joinpoint model where "several different lines are connected together at the 'joinpoints'". And further: "The user supplies the minimum and maximum number of joinpoints. The program starts with the minimum number of joinpoint (e.g. 0 joinpoints, which is a straight line) and tests whether more joinpoints are statistically significant and must be added to the model (up to that maximum number)."

The NCI uses it for trend modelling of cancer rates, maybe it fits your needs as well.

— psj
fonte

0

In order to fit to data a piecewise function :

where $a_1 , a_2 , p_1 , q_1, p_2 , q_2 , p_3 , q_3$ are unknown parameters to be approximately computed, there is a very simple method (not iterative, no initial guess, easy to code in any math computer language). The theory given page 29 in paper : https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf and from page 30 :

For example, with the exact data provided by Mats Granvik the result is :

Without scattered data, this example is not very signifiant. Other examples with scattered data are shown in the referenced paper.

— JJacquelin
fonte

0

You can use the mcp package if you know the number of change points to infer. It gives you great modeling flexibility and a lot of information about the change points and regression parameters, but at the cost of speed.

The mcp website contains many applied examples, e.g.,

library(mcp)

# Define the model
model = list(
  response ~ 1,  # plateau (int_1)
  ~ 0 + time,    # joined slope (time_2) at cp_1
  ~ 1 + time     # disjoined slope (int_3, time_3) at cp_2
)

# Fit it. The `ex_demo` dataset is included in mcp
fit = mcp(model, data = ex_demo)

Then you can visualize:

plot(fit)

Or summarise:

summary(fit)

Family: gaussian(link = 'identity')
Iterations: 9000 from 3 chains.
Segments:
  1: response ~ 1
  2: response ~ 1 ~ 0 + time
  3: response ~ 1 ~ 1 + time

Population-level parameters:
    name match  sim  mean lower  upper Rhat n.eff
    cp_1    OK 30.0 30.27 23.19 38.760    1   384
    cp_2    OK 70.0 69.78 69.27 70.238    1  5792
   int_1    OK 10.0 10.26  8.82 11.768    1  1480
   int_3    OK  0.0  0.44 -2.49  3.428    1   810
 sigma_1    OK  4.0  4.01  3.43  4.591    1  3852
  time_2    OK  0.5  0.53  0.40  0.662    1   437
  time_3    OK -0.2 -0.22 -0.38 -0.035    1   834

Disclaimer: I am the developer of mcp.

— Jonas Lindeløv
fonte

The use of "detect" in the question indicates the number--and even the existence--of changepoints are not known beforehand.

— whuber