4 Exercise: Haberman

The Haberman dataset contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer.

The dataset contains information about four variables:

  • V1. Age of patient at time of operation (numerical)

  • V2. Patient's year of operation (year - 1900, numerical)

  • V3. Number of positive axillary nodes detected (numerical)

  • V4. Survival status (class label)

    • 1 = the patient survived 5 years or longer
    • 2 = the patient died within 5 years

To load the data into R and convert V4 (the class label) from a numeric variable to a categorical variable (a factor), use

haberman <- read.table("haberman.data",sep = ",")
haberman$V4 <- as.factor(haberman$V4)
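The default names V1 to V4 are easy to mix up during exploration. As an optional variation, the sketch below reads a few rows in the same comma-separated layout and attaches descriptive names and class labels. The rows in `toy.text` are made up for illustration and are not taken from `haberman.data`; the names `age`, `year`, `nodes` and `status` are assumptions, and the rest of this exercise keeps using V1 to V4.

```r
# Hypothetical rows in the same layout as haberman.data
# (made up for illustration; not taken from the real file)
toy.text <- "35,60,0,1
42,63,5,2
51,66,1,1
58,59,12,2"

# col.names replaces the default V1..V4 with descriptive (assumed) names
toy <- read.table(text = toy.text, sep = ",",
                  col.names = c("age", "year", "nodes", "status"))

# Convert the class label to a factor with readable level labels
toy$status <- factor(toy$status, levels = c(1, 2),
                     labels = c("survived", "died"))

str(toy)            # check the variable types
table(toy$status)   # check the class balance
```

With the real file, the same `col.names` argument could be passed to the `read.table()` call shown above.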

QUESTION

  1. Perform exploratory analysis for this data. What have you observed?

For numerical summaries, the following may be attempted

library(skimr)   # provides skim()
summary(haberman[,1:3])
skim(haberman)
# variance of each numerical variable within each survival class
aggregate(. ~ V4, data = haberman, FUN = var)

For graphical summaries, try

library(GGally)  # provides ggpairs()
apply(haberman[,1:3],2,hist)
ggpairs(haberman[,1:3])
ggpairs(haberman, columns=1:3, ggplot2::aes(colour=V4, alpha=0.2))

  1. Based on your findings in (a), comment on whether it is appropriate to apply \(k\)-NN, LDA and QDA to this data.

It is ______ to apply \(k\)-NN on this data.

It is ______ to apply LDA on this data.

It is ______ to apply QDA on this data.
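LDA assumes that the predictors are normally distributed within each class with a common covariance matrix, while QDA allows a separate covariance matrix per class. One way to examine these assumptions is to compare class-wise covariance matrices and run a normality check per class. The sketch below does this on synthetic stand-in data, not on the Haberman data; the names `sim`, `x1`, `x2` and `g` are hypothetical.

```r
set.seed(2)

# Synthetic stand-in data (NOT the Haberman data): two classes of 100 points,
# where class 2 has a much larger spread in x1 than class 1
sim <- data.frame(
  x1 = c(rnorm(100), rnorm(100, sd = 3)),
  x2 = c(rnorm(100), rnorm(100, mean = 1)),
  g  = factor(rep(1:2, each = 100))
)

# Covariance matrix of the predictors within each class;
# clearly different matrices speak against the common-covariance assumption
cov.by.class <- lapply(split(sim[, c("x1", "x2")], sim$g), cov)
cov.by.class

# Shapiro-Wilk p-value for x1 within each class;
# small p-values suggest departure from normality
sw.p <- sapply(split(sim$x1, sim$g), function(v) shapiro.test(v)$p.value)
sw.p
```

For the exercise itself, the same calls could be applied to `haberman[, 1:3]` split by `haberman$V4`, alongside the histograms and pair plots from part (a).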

  1. Use 10-fold cross-validation to select the optimal value of \(k\) in \(k\)-NN.

See page 9 of the week 3 lecture notes for how to perform cross-validation.

Remember that you should still split the data into training and test sets before using cross-validation.
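The procedure can be sketched as follows: hold out a test set, assign the training observations to 10 folds, and for each candidate \(k\) average the misclassification rate over the folds. This is a minimal sketch on synthetic stand-in data, not the Haberman data and not necessarily the approach in the lecture notes; the variable names, the 80/20 split and the grid of odd \(k\) values are all assumptions.

```r
library(class)  # provides knn()

set.seed(1)

# Synthetic stand-in data (NOT the Haberman data): two predictors, two classes
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(X$x1 + X$x2 + rnorm(n, sd = 0.5) > 0, 1, 2))

# Hold out a test set first; cross-validation uses the training part only
train.idx <- sample(n, size = 0.8 * n)
train.X <- X[train.idx, ]
train.Y <- y[train.idx]

# Randomly assign each training observation to one of 10 folds
n.folds <- 10
folds <- sample(rep(1:n.folds, length.out = nrow(train.X)))

# For each candidate k, average the misclassification rate over the folds
k.grid <- seq(1, 25, by = 2)
cv.error <- sapply(k.grid, function(k) {
  fold.error <- sapply(1:n.folds, function(f) {
    held.out <- folds == f
    pred <- knn(train.X[!held.out, , drop = FALSE],
                train.X[held.out, , drop = FALSE],
                train.Y[!held.out], k = k)
    mean(pred != train.Y[held.out])
  })
  mean(fold.error)
})

# Candidate k with the smallest cross-validated error
best.k <- k.grid[which.min(cv.error)]
best.k
```

Because \(k\)-NN relies on distances, predictors measured on different scales are usually standardised first, using the centre and scale of the training data only.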

  1. Suppose we use \(3\)-NN as the classifier and obtain the following prediction results. Calculate the sensitivity, specificity, positive predictive rate, negative predictive rate and accuracy, treating class 1 (the survival class) as the positive class and class 2 as the negative class. Comment on your results.

library(class)  # provides knn()
test.pred <- knn(train.X, test.X, train.Y, k=3)
table(test.Y, test.pred)
##       test.pred
## test.Y  1  2
##      1 40  4
##      2 14  4

Round all answers to two decimal places.

  • Sensitivity =

  • Specificity =

  • Positive predictive rate =

  • Negative predictive rate =

  • Accuracy =
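To check hand calculations, the five measures can be computed directly from the cells of the confusion matrix. The sketch below re-enters the table printed above (rows are true classes, columns are predicted classes) and applies the standard definitions with class 1 as the positive class. The names `conf`, `TP`, `FN`, `FP`, `TN` and `metrics` are introduced here for illustration; `ppv` and `npv` correspond to the positive and negative predictive rates.

```r
# Confusion matrix from the exercise: rows = true class, columns = predicted class
conf <- matrix(c(40, 4,
                 14, 4),
               nrow = 2, byrow = TRUE,
               dimnames = list(test.Y = c("1", "2"), test.pred = c("1", "2")))

# Class 1 (survived) is the positive class, class 2 the negative class
TP <- conf["1", "1"]  # true 1, predicted 1
FN <- conf["1", "2"]  # true 1, predicted 2
FP <- conf["2", "1"]  # true 2, predicted 1
TN <- conf["2", "2"]  # true 2, predicted 2

metrics <- c(
  sensitivity = TP / (TP + FN),         # proportion of true positives detected
  specificity = TN / (TN + FP),         # proportion of true negatives detected
  ppv         = TP / (TP + FP),         # positive predictive rate
  npv         = TN / (TN + FN),         # negative predictive rate
  accuracy    = (TP + TN) / sum(conf)   # overall proportion correct
)

round(metrics, 2)
```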