After the excellent post by JD Long in this thread, I looked for a simple example, and the R code necessary to produce the PCA and then go back to the original data. It gave me some first-hand geometric intuition, and I want to share what I got. The dataset and the code can be directly copied and pasted into R from Github.
I used a dataset I found online on semiconductors here, and trimmed it to just two dimensions - "atomic number" and "melting point" - to facilitate plotting.
As a caveat, the idea is purely illustrative of the computational process: PCA is used to reduce more than two variables to a few derived principal components, or to identify collinearity also in the case of multiple features. So it would not find much application in the case of two variables, nor would there be a need to calculate eigenvectors of correlation matrices, as pointed out by @amoeba.
Further, I truncated the observations from 44 to 15 to ease the task of tracking individual points. The ultimate result was a skeleton data frame (dat1):
compounds  atomic.no  melting.point
AIN               10          498.0
AIP               14          625.0
AIAs              23         1011.5
...              ...            ...
The "compounds" column indicates the chemical constitution of the semiconductor, and plays the role of the row name.
This can be reproduced as follows (ready to copy and paste into the R console):
dat <- read.csv(url("http://rinterested.github.io/datasets/semiconductors"))
colnames(dat)[2] <- "atomic.no"      # rename the second column
dat1 <- subset(dat[1:15, 1:3])       # keep the first 15 rows and the first 3 columns
row.names(dat1) <- dat1$compounds    # use the compound names as row names...
dat1 <- dat1[, -1]                   # ...and drop the "compounds" column
The data were then scaled:
X <- apply(dat1, 2, function(x) (x - mean(x)) / sd(x))
# This centers the data points around the mean and standardizes by dividing by the SD.
# It is equivalent to `X <- scale(dat1, center = TRUE, scale = TRUE)`
The linear algebra steps then followed:
C <- cov(X) # Covariance matrix (centered data)
$$\begin{bmatrix}
 & \text{at\_no} & \text{melt\_p} \\
\text{at\_no} & 1 & 0.296 \\
\text{melt\_p} & 0.296 & 1
\end{bmatrix}$$
The correlation function cor(dat1) gives the same output on the non-scaled data as the function cov(X) on the scaled data.
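A quick check of this equivalence, using the objects defined above (just a sanity check, not part of the original derivation):
# cor() on the raw data equals cov() on the standardized data, up to floating-point tolerance
all.equal(cor(dat1), cov(X))   # should return TRUE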
lambda <- eigen(C)$values # Eigenvalues
lambda_matrix <- diag(2) * eigen(C)$values # Diagonal matrix of eigenvalues (same as diag(eigen(C)$values))
$$\begin{bmatrix}
\lambda_{\text{PC1}} & 0 \\
0 & \lambda_{\text{PC2}}
\end{bmatrix}
=
\begin{bmatrix}
1.2964217 & 0 \\
0 & 0.7035783
\end{bmatrix}$$
e_vectors <- eigen(C)$vectors # Eigenvectors
$$\frac{1}{\sqrt{2}}
\begin{bmatrix}
\text{PC1} & \text{PC2} \\
1 & 1 \\
1 & -1
\end{bmatrix}$$
Since the first eigenvector initially returns as ∼[−0.7, −0.7], we choose to change it to [0.7, 0.7] to make it consistent with the built-in formulas, through:
e_vectors[, 1] <- -e_vectors[, 1]; colnames(e_vectors) <- c("PC1", "PC2")
The resultant eigenvalues were 1.2964217 and 0.7035783. Under less minimalistic conditions, this result would have helped decide which eigenvectors to include (largest eigenvalues). For instance, the relative contribution of the first eigenvalue is 64.8% (eigen(C)$values[1]/sum(eigen(C)$values) * 100), meaning that it accounts for ∼65% of the variability in the data. The variability in the direction of the second eigenvector is 35.2%. This is typically shown on a scree plot depicting the values of the eigenvalues:
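A minimal sketch of such a scree plot, built from the lambda vector computed above (the graphical settings are only illustrative):
# Scree plot: one point per principal component, eigenvalues in decreasing order
plot(lambda, type = "b", pch = 19, xaxt = "n",
     xlab = "Principal component", ylab = "Eigenvalue", main = "Scree plot")
axis(1, at = seq_along(lambda))
round(lambda / sum(lambda) * 100, 1)   # percent of variance per component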
We'll include both eigenvectors given the small size of this toy data set example, understanding that excluding one of the eigenvectors would result in dimensionality reduction - the idea behind PCA.
The score matrix was determined as the matrix multiplication of the scaled data (X) by the matrix of eigenvectors (or "rotations"):
score_matrix <- X %*% e_vectors
# Identical to the often found operation: t(t(e_vectors) %*% t(X))
The concept entails a linear combination of each entry (row / subject / observation / semiconductor in this case) of the centered (and in this case scaled) data, weighted by the entries of each eigenvector, so that in each of the final columns of the score matrix we'll find a contribution from every variable (column) of the data (the entire X), BUT only the corresponding eigenvector will have taken part in the computation (i.e. the first eigenvector $[0.7, 0.7]^T$ will contribute to PC1 (Principal Component 1), and $[0.7, -0.7]^T$ to PC2).
Therefore each eigenvector will influence each variable differently, and this will be reflected in the "loadings" of the PCA. In our case, the negative sign in the second component of the second eigenvector [0.7,−0.7] will change the sign of the melting point values in the linear combinations that produce PC2, whereas the effect of the first eigenvector will be consistently positive:
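For instance, forming these two weighted sums by hand for the first row of X (AIN in this dataset) reproduces the first row of the score matrix; a small check using the objects already defined:
x_row <- X[1, ]                       # scaled values for the first compound (AIN)
c(PC1 = sum(x_row * e_vectors[, 1]),  # dot product with the first eigenvector
  PC2 = sum(x_row * e_vectors[, 2]))  # dot product with the second eigenvector
score_matrix[1, ]                     # same two numbers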
The eigenvectors are scaled to 1:
> apply(e_vectors, 2, function(x) sum(x^2))
PC1 PC2
1 1
whereas the loadings are the eigenvectors scaled by the eigenvalues (despite the confusing terminology in the built-in R functions displayed below). Consequently, the loadings can be calculated as:
> e_vectors %*% lambda_matrix
[,1] [,2]
[1,] 0.9167086 0.497505
[2,] 0.9167086 -0.497505
> prcomp(X)$rotation %*% diag(princomp(covmat = C)$sd^2)
[,1] [,2]
atomic.no 0.9167086 0.497505
melting.point 0.9167086 -0.497505
It is interesting to note that the rotated data cloud (the score plot) will have variance along each component (PC) equal to the eigenvalues:
> apply(score_matrix, 2, function(x) var(x))
      PC1       PC2 
1.2964217 0.7035783 
> lambda
[1] 1.2964217 0.7035783
Utilizing the built-in functions the results can be replicated:
# For the SCORE MATRIX:
prcomp(X)$x
# or...
princomp(X)$scores # The signs of the PC 1 column will be reversed.
# and for EIGENVECTOR MATRIX:
prcomp(X)$rotation
# or...
princomp(X)$loadings
# and for EIGENVALUES:
prcomp(X)$sdev^2
# or...
princomp(covmat = C)$sd^2
Alternatively, the singular value decomposition ($U\Sigma V^T$) method can be applied to manually calculate PCA; in fact, this is the method used in prcomp(). The steps can be spelled out as:
svd_scaled_dat <- svd(scale(dat1))                             # SVD of the centered and scaled data
eigen_vectors  <- svd_scaled_dat$v                             # right singular vectors = eigenvectors
eigen_values   <- (svd_scaled_dat$d / sqrt(nrow(dat1) - 1))^2  # singular values converted to eigenvalues
scores         <- scale(dat1) %*% eigen_vectors                # score matrix
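As a sanity check (a sketch; the signs of individual columns may differ between the SVD and eigendecomposition routes):
all.equal(eigen_values, lambda)                                      # same eigenvalues
all.equal(abs(scores), abs(score_matrix), check.attributes = FALSE)  # same scores up to sign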
The result is shown below: first, the distances from the individual points to the first eigenvector, and, on a second plot, the orthogonal distances to the second eigenvector:
If instead we plotted the values of the score matrix (PC1 and PC2) - no longer "melting.point" and "atomic.no", but really a change of basis of the point coordinates with the eigenvectors as basis - these distances would be preserved, but would naturally become perpendicular to the x and y axes:
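A minimal sketch of that score plot (the plotting parameters are illustrative only):
plot(score_matrix, pch = 19, xlab = "PC1", ylab = "PC2",
     main = "Scores with xy Axis = Eigenvectors")   # each point is one compound
text(score_matrix, labels = rownames(dat1), pos = 3, cex = 0.7)
abline(h = 0, v = 0, lty = 2)                       # the new (eigenvector) axes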
The trick was now to recover the original data. The points had been transformed through a simple matrix multiplication by the eigenvectors. Now the data was rotated back by multiplying by the inverse of the matrix of eigenvectors with a resultant marked change in the location of the data points. For instance, notice the change in pink dot "GaN" in the left upper quadrant (black circle in the left plot, below), returning to its initial position in the left lower quadrant (black circle in the right plot, below).
Now we finally had the original data restored in this "de-rotated" matrix:
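In code, the de-rotation amounts to multiplying the score matrix by the inverse of the eigenvector matrix (its transpose, since the eigenvectors are orthonormal) and then undoing the scaling; a sketch:
X_back    <- score_matrix %*% t(e_vectors)              # t(e_vectors) == solve(e_vectors) here
dat1_back <- sweep(X_back, 2, apply(dat1, 2, sd), "*")  # multiply back by the SDs...
dat1_back <- sweep(dat1_back, 2, colMeans(dat1), "+")   # ...and add back the means
all.equal(as.matrix(dat1), dat1_back, check.attributes = FALSE)  # should return TRUE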
Beyond the change of coordinates brought about by the rotation of the data in PCA, the results must be interpreted, and this process tends to involve a biplot, on which the data points are plotted with respect to the new eigenvector coordinates, and the original variables are superimposed as vectors. It is interesting to note the equivalence in the position of the points between the plots in the second row of rotation graphs above ("Scores with xy Axis = Eigenvectors") (to the left in the plots that follow) and the biplot (to the right):
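For reference, the built-in biplot can be produced directly from prcomp() (a minimal sketch; graphical tuning omitted, and the PCA object is also the one referenced as PCA$rotation below):
PCA <- prcomp(X)        # X is already centered and scaled
biplot(PCA, cex = 0.7)  # scores as points, original variables as (red) arrows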
The superimposition of the original variables as red arrows offers a path to the interpretation of PC1 as a vector in the direction of (and positively correlated with) both atomic no and melting point, and of PC2 as a component along increasing values of atomic no but negatively correlated with melting point, consistent with the values of the eigenvectors:
PCA$rotation
PC1 PC2
atomic.no 0.7071068 0.7071068
melting.point 0.7071068 -0.7071068
This interactive tutorial by Victor Powell gives immediate feedback as to the changes in the eigenvectors as the data cloud is modified.