3 Exercise 2: Employment in Europe

In this example we are going to look at a data set on employment in 26 European countries and perform PCA.

The data give, for each of 26 European countries, the percentage of the total workforce employed in nine different industries in 1979 (Hand et al., 1994).

Variable name   Description
Agriculture     % employed in agriculture
Mining          % employed in mining
Manufacture     % employed in manufacturing
Power           % employed in power supply industries
Construction    % employed in construction
Service         % employed in service industries
Finance         % employed in finance
Social          % employed in social and personal services
Transport       % employed in transport & communications
employ <- read.table("eurojob.txt", header = TRUE, row.names = 1)
head(employ, 4)
##         AGRIC MINING MANU POWER CONSTR SERVICE FINANCE SOCIAL TRANS
## Belgium   3.3    0.9 27.6   0.9    8.2    19.1     6.2   26.6   7.2
## Denmark   9.2    0.1 21.8   0.6    8.3    14.6     6.5   32.2   7.1
## France   10.8    0.8 27.5   0.9    8.9    16.8     6.0   22.6   5.7
## WGerm     6.7    1.3 35.8   0.9    7.3    14.4     5.0   22.3   6.1

Task

  1. Perform exploratory data analysis (i.e. numerical summaries and some sensible plots) for these data. Comment on the results.

For numerical summaries, you could use summary or skim. For plots, you could use pairs.
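As a minimal sketch of this kind of exploratory analysis, the code below uses R's built-in `USArrests` data as a stand-in (an assumption made so the example runs without `eurojob.txt`; it is not part of the exercise). The object name `dat` is illustrative; with the exercise data you would use `employ` instead.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests

# Minimum, quartiles, mean and maximum for every variable
print(summary(dat))

# Standard deviations, to see which variables vary most
sds <- sapply(dat, sd)
print(round(sds, 2))

# Scatterplot matrix of all pairs of variables
pairs(dat, pch = 19, main = "Scatterplot matrix")
```

When commenting, look for variables with very different spreads, for outlying countries, and for strong linear relationships between pairs of variables.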

  2. Produce two important numerical summaries for deciding on how to run PCA and to tell how successful it is likely to be. Comment on these.

Think about which statistics you will need in order to decide whether PCA might be useful for this data, and whether to use covariance-based PCA or correlation-based PCA.
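One possible pair of summaries is the variable variances (to choose between covariance and correlation) and the correlation matrix (to judge whether PCA is likely to help). The sketch below shows how these could be computed, again on the built-in `USArrests` data as an assumed stand-in for `employ`; the names `dat`, `ratio` and `max_abs_cor` are illustrative.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests

# Variances of each variable: very unequal variances suggest
# that correlation-based PCA is the safer choice
vars <- sapply(dat, var)
print(round(vars, 2))
ratio <- max(vars) / min(vars)
print(ratio)

# Correlation matrix: strong correlations suggest that a few
# components can summarise much of the variability
R <- cor(dat)
print(round(R, 2))

# Largest absolute correlation between two different variables
max_abs_cor <- max(abs(R[upper.tri(R)]))
print(max_abs_cor)
```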

  3. Run PCA on the appropriate matrix and look at the output.

Use the princomp command to implement PCA.
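A minimal sketch of a `princomp` call, shown on the built-in `USArrests` data as an assumed stand-in for `employ`. The argument `cor = TRUE` requests correlation-based PCA; omit it (the default is `cor = FALSE`) for covariance-based PCA. Which of the two is appropriate for the employment data is for you to decide.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests

# Correlation-based PCA; use cor = FALSE for covariance-based PCA
pca <- princomp(dat, cor = TRUE)

# Standard deviations, proportion and cumulative proportion of variance
print(summary(pca))
```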

  4. Assuming we are most concerned with preserving information, how many components should we retain if we want to keep 90% of the original variability?

Answer: ____ components
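The cumulative proportion of variance can also be computed directly from the component standard deviations. The sketch below does this on the built-in `USArrests` data (an assumed stand-in for `employ`); `n_keep` is an illustrative name for the smallest number of components reaching the 90% threshold.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests
pca <- princomp(dat, cor = TRUE)

# Variance explained by each component, as a proportion of the total
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Running total of the proportion of variance explained
cum_var <- cumsum(prop_var)
print(round(cum_var, 3))

# Smallest number of components whose cumulative proportion is at least 0.90
n_keep <- which(cum_var >= 0.90)[1]
print(n_keep)
```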

  5. Assuming we want to use Cattell's method, how many components would we retain?

Use plot on your PCA model to produce a scree plot and look for the bend/elbow.

Answer: ____ components
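A sketch of a scree plot, using the built-in `USArrests` data as an assumed stand-in for `employ`. `screeplot` with `type = "lines"` draws the component variances as a line, which can make the elbow easier to see than the default bar plot. The vector `drops` is an illustrative extra showing the decrease in variance from one component to the next.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests
pca <- princomp(dat, cor = TRUE)

# Scree plot: component variances against component number
screeplot(pca, type = "lines", main = "Scree plot")

# Successive drops in variance; the elbow is where these level off
eig <- pca$sdev^2
drops <- -diff(eig)
print(round(drops, 3))
```

Locating the elbow remains a judgement call, so state which bend you chose and why.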

  6. Assuming we want to use Kaiser's method, how many components would we retain?

Answer: ____ components
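Kaiser's criterion, in its usual form, retains the components whose variance exceeds the average component variance; for correlation-based PCA that average is 1. A minimal sketch on the built-in `USArrests` data (an assumed stand-in for `employ`; `n_kaiser` is an illustrative name):

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests
pca <- princomp(dat, cor = TRUE)

# Component variances (eigenvalues of the correlation matrix)
eig <- pca$sdev^2
print(round(eig, 3))

# Number of components with variance above the average variance
n_kaiser <- sum(eig > mean(eig))
print(n_kaiser)
```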

  7. Assuming we have decided to retain 2 components, is there any useful interpretation to be had for these?

Look at the loadings by using $loadings.
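As a sketch, the loadings of the first two components can be extracted as a plain matrix, which shows every value (the default print method for loadings blanks out small entries). The built-in `USArrests` data are an assumed stand-in for `employ`, and `L2` is an illustrative name.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests
pca <- princomp(dat, cor = TRUE)

# Loadings of the first two components as an ordinary numeric matrix
L2 <- unclass(pca$loadings)[, 1:2]
print(round(L2, 2))
```

When interpreting, look at which variables have large loadings of the same sign and which have large loadings of opposite sign within each column.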

  8. Produce a scatterplot of the data's scores for the first 2 PCs and comment.

PC scores are stored in $scores and a scatterplot can be produced by using plot.
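One way to draw such a plot is to label each point with its row name, so that individual observations can be identified. The sketch below uses the built-in `USArrests` data as an assumed stand-in for `employ`; `sc` is an illustrative name.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests
pca <- princomp(dat, cor = TRUE)

# Scores on the first two principal components
sc <- pca$scores[, 1:2]

# Empty plot frame, then row names drawn at the score positions
plot(sc[, 1], sc[, 2], type = "n",
     xlab = "PC1", ylab = "PC2", main = "Scores on the first two PCs")
text(sc[, 1], sc[, 2], labels = rownames(dat), cex = 0.7)

# Reference lines through the origin
abline(h = 0, v = 0, lty = 2)
```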

  9. Say we have the following two new observations.
obs1 <- c(5.1, 0.5, 32.3, 0.8, 8.1, 16.7, 4.3, 21.2, 6.3)
obs2 <- c(4.2, 0.7, 25.4, 0.7, 9.3, 15.0, 5.8, 31.0, 6.9)

Calculate their scores on the second principal component.

Answer: The second component score for obs1 is ____ and the second component score for obs2 is ____.

You will need to mean-centre and standardise both observations first and then multiply them by the second PC loadings.
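The calculation can be sketched as follows, on the built-in `USArrests` data as an assumed stand-in for `employ` and with a made-up observation `new_obs` (its values are purely illustrative). A `princomp` fit stores the means in `$center` and the scalings in `$scale`; with `cor = TRUE` the scalings are standard deviations computed with divisor n rather than n - 1, so using them directly avoids a mismatch with `sd()`. The result is cross-checked against `predict`.

```r
# Stand-in data: built-in USArrests (assumption; replace with `employ`)
dat <- USArrests
pca <- princomp(dat, cor = TRUE)

# Hypothetical new observation, in the same variable order as `dat`
new_obs <- c(Murder = 8.0, Assault = 170, UrbanPop = 65, Rape = 21)

# Mean-centre and standardise using the values stored in the fit
z <- (new_obs - pca$center) / pca$scale

# Score on PC2: inner product with the second column of loadings
score_manual <- sum(z * pca$loadings[, 2])
print(score_manual)

# Cross-check: predict() applies the same centring, scaling and loadings
new_df <- as.data.frame(t(new_obs))
score_predict <- predict(pca, newdata = new_df)[, 2]
print(score_predict)
```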