2 Exercise 1: Tasks in lecture notes

In the lecture notes, we looked at the wine dataset and explained how to apply PCA to it. You can re-run the following code to import the data and perform PCA.

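# Import the wine data
wine <- read.csv("wine.data.csv")
# Drop row 122 and the first column before running PCA
wine.new <- wine[-122, -1]
# PCA on the correlation matrix (cor = TRUE standardises the variables)
wine.pca <- princomp(wine.new, cor = TRUE)
summary(wine.pca)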

## Importance of components:
##                        Comp.1 Comp.2 Comp.3  Comp.4  Comp.5  Comp.6  Comp.7
## Standard deviation     2.1861 1.5899 1.1601 0.96171 0.92826 0.80043 0.74398
## Proportion of Variance 0.3676 0.1945 0.1035 0.07114 0.06628 0.04928 0.04258
## Cumulative Proportion  0.3676 0.5621 0.6656 0.73675 0.80303 0.85232 0.89489
##                         Comp.8  Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation     0.57942 0.53979 0.50244 0.48134 0.40738 0.29864
## Proportion of Variance 0.02583 0.02241 0.01942 0.01782 0.01277 0.00686
## Cumulative Proportion  0.92072 0.94313 0.96255 0.98037 0.99314 1.00000

Task 2 (in the lecture notes)

How many components should we have chosen if we wanted 80% of the original variability explained?

Answer: ____ components

What about 95%?

Answer: ____ components

Look at the "Cumulative Proportion" row and find the first column whose value is at least the required proportion of variance explained; the number of that component is the number of components needed.
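The same lookup can be done programmatically. The following is a minimal sketch rather than part of the lecture notes: the helper name `n_components_for` is hypothetical, and the `sdev` vector is copied from the rounded "Standard deviation" rows printed above so that the snippet runs without the data file.

```r
# Standard deviations copied (rounded) from the summary(wine.pca) output above;
# in a live session, wine.pca$sdev holds the unrounded values.
sdev <- c(2.1861, 1.5899, 1.1601, 0.96171, 0.92826, 0.80043, 0.74398,
          0.57942, 0.53979, 0.50244, 0.48134, 0.40738, 0.29864)

# Hypothetical helper: smallest number of leading PCs whose cumulative
# proportion of variance reaches `target` (a value between 0 and 1).
n_components_for <- function(sdev, target) {
  variance <- sdev^2                             # variance of each PC
  cum_prop <- cumsum(variance) / sum(variance)   # "Cumulative Proportion" row
  which(cum_prop >= target)[1]                   # first column reaching target
}

n_components_for(sdev, 0.80)
n_components_for(sdev, 0.95)
```

With the PCA object available, `n_components_for(wine.pca$sdev, 0.80)` should give the same count as reading it off the summary table.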

Task 3

Some social scientists use Jolliffe’s rule, which says that for a PCA run on the correlation matrix, only those PCs with variance above 0.6 should be retained. How many PCs should be retained according to this rule?

Answer: ____ components

Use wine.pca$sdev to find the standard deviation of each PC, then square it to obtain the variance.
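As a rough illustration of this hint, the sketch below squares the standard deviations and counts how many variances exceed the 0.6 threshold stated in the task. The `sdev` values are again copied (rounded) from the summary output above so that the snippet is self-contained; the variable names `pc_variance` and `retained` are illustrative choices only.

```r
# Standard deviations copied (rounded) from the summary(wine.pca) output;
# with the data loaded, use wine.pca$sdev instead.
sdev <- c(2.1861, 1.5899, 1.1601, 0.96171, 0.92826, 0.80043, 0.74398,
          0.57942, 0.53979, 0.50244, 0.48134, 0.40738, 0.29864)

pc_variance <- sdev^2            # variance of each PC
retained <- pc_variance > 0.6    # TRUE for PCs above the threshold in the task

round(pc_variance, 3)            # inspect the variances
sum(retained)                    # number of PCs retained under the rule
```

Because the PCA was run on the correlation matrix, the variances sum to the number of variables (13 here, up to rounding), which is a quick sanity check on the squared values.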

Task 4

Looking at the loadings of the PCA, how would you interpret the third principal component?
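One way to approach this kind of interpretation is to sort the loadings of a component by absolute size: variables with large absolute loadings drive the component, and opposite signs indicate a contrast between variables. The sketch below is an assumption-laden illustration. It uses R's built-in `USArrests` data as a stand-in so that it runs without wine.data.csv, and the object names `demo.pca`, `pc3` and `pc3_sorted` are hypothetical.

```r
# Stand-in data so the snippet runs without wine.data.csv (assumption);
# with the wine data loaded, use wine.pca in place of demo.pca.
demo.pca <- princomp(USArrests, cor = TRUE)

# Loadings of the third PC as a named numeric vector
pc3 <- demo.pca$loadings[, 3]

# Order the variables by absolute loading, largest first
pc3_sorted <- pc3[order(abs(pc3), decreasing = TRUE)]
round(pc3_sorted, 3)
```

Applied to the wine data, the same two lines with `wine.pca$loadings[, 3]` list the variables that contribute most to the third component.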

Task 5

Calculate the first component score for the new observation $(12,4,3,25,100,2,1,0.4,2,4,1,2,600)$ by hand (using R as a calculator).

Answer: The first component score for the new observation is ____.

Step 1. princomp automatically mean-centres each variable. You therefore have to centre the new observation by subtracting the centring vector, which is stored in wine.pca$center.

Step 2. Because we used the correlation matrix, we were working with standardised data. You therefore have to scale the centred vector by dividing it by the scale vector, which is stored in wine.pca$scale.

Step 3. By definition, the first PC score is the inner product of the new observation (after mean-centring and standardisation) and the first principal component loading vector; the first PC loadings are stored in wine.pca$loadings[,1].
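The three steps can be sketched as follows. This is an illustration under stated assumptions, not the worked answer: it uses R's built-in `USArrests` data as a stand-in so that it runs without wine.data.csv, and the new observation `new.obs` and the intermediate names are made up for the example.

```r
# Stand-in PCA on the correlation matrix (assumption: USArrests replaces the
# wine data so the snippet is self-contained).
demo.pca <- princomp(USArrests, cor = TRUE)

# Hypothetical new observation, in the column order of USArrests
# (Murder, Assault, UrbanPop, Rape)
new.obs <- c(10, 200, 60, 20)

centred <- new.obs - demo.pca$center        # Step 1: subtract the centring vector
standardised <- centred / demo.pca$scale    # Step 2: divide by the scale vector
score1 <- sum(standardised * demo.pca$loadings[, 1])  # Step 3: inner product

score1
```

For the task itself, the same three lines apply with `wine.pca` in place of `demo.pca` and the 13 values of the new observation in place of `new.obs`.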

Calculate the first component score for the new observation $(12,4,3,25,100,2,1,0.4,2,4,1,2,600)$ by using the predict command.

Answer: The first component score for the new observation is ____.
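A minimal sketch of the predict approach is given below. As an assumption, it again uses the built-in `USArrests` data and a made-up observation so that it runs without wine.data.csv. Note that predict on a princomp object matches columns by name, so the new observation has to be supplied as a matrix or data frame with the same column names as the data used to fit the PCA.

```r
# Stand-in PCA on the correlation matrix (assumption: USArrests replaces the
# wine data so the snippet is self-contained).
demo.pca <- princomp(USArrests, cor = TRUE)

# Hypothetical new observation with the same column names as the fitted data
new.obs <- data.frame(Murder = 10, Assault = 200, UrbanPop = 60, Rape = 20)

# predict() centres, scales and projects the new observation in one call;
# the result is a 1 x 4 matrix with one column per component.
scores <- predict(demo.pca, newdata = new.obs)
scores[1, "Comp.1"]
```

For the wine data, one option is to build the observation with `as.data.frame(t(c(...)))`, copy the column names across with `names(new.obs) <- names(wine.new)`, and then call `predict(wine.pca, new.obs)`; the first entry should agree with the value calculated by hand.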