Use of this document

This is a study note combining course material from IE 583 by Siggi Olafsson at ISU with some additional material. The following topics will be discussed:

For more details on the study material see:

0. Prerequisites

# Classification and Regression Training
library(caret)
library(tidyverse)

0.1 Four Types of Machine Learning

Four Types of Machine Learning
Machine Learning Description
Supervised learning (SML) the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative.
Unsupervised learning (UML) no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
Semi-supervised learning approaches use labelled data to inform unsupervised learning on the unlabelled data, to identify and annotate new classes in the dataset (also called novelty detection).
Reinforcement learning the learning algorithm performs a task using feedback from operating in a real or synthetic environment.

0.2 Parallel Processing

To tune a predictive model using multiple workers, the function syntax of the \(caret\) package functions (e.g. train, rfe or sbf) does not change. A separate function is used to “register” the parallel processing technique and specify the number of workers to use.

# Parallel Processing
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
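
When the parallel work is finished, the cluster should be shut down and sequential processing restored; a minimal sketch:

# Shut down the workers and return to sequential processing
stopCluster(cl)
registerDoSEQ()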

1 Model Training and Tuning

In this chapter, we will use functions and tools from the \(caret\) R package. The \(caret\) package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation.

# data
dataset <- read.csv("credit_g.csv")
dataset$class <- as.factor(dataset$class)
head(as_tibble(dataset))
## # A tibble: 6 x 21
##   checking_status duration credit_history purpose credit_amount
##   <fct>              <int> <fct>          <fct>           <int>
## 1 <0                     6 'critical/oth… radio/…          1169
## 2 0<=X<200              48 'existing pai… radio/…          5951
## 3 'no checking'         12 'critical/oth… educat…          2096
## 4 <0                    42 'existing pai… furnit…          7882
## 5 <0                    24 'delayed prev… 'new c…          4870
## 6 'no checking'         36 'existing pai… educat…          9055
## # … with 16 more variables: savings_status <fct>, employment <fct>,
## #   installment_commitment <int>, personal_status <fct>,
## #   other_parties <fct>, residence_since <int>, property_magnitude <fct>,
## #   age <int>, other_payment_plans <fct>, housing <fct>,
## #   existing_credits <int>, job <fct>, num_dependents <int>,
## #   own_telephone <fct>, foreign_worker <fct>, class <fct>

For more details about this section, check the Link

Note: remember to also load the specific package for the learning method so that the train() function works.

The \(caret\)-Related R Packages
R Package Description
caret classification and regression training (model tuning framework)
ggplot2 plotting and visualisation
mlbench benchmark datasets (e.g. Sonar)
class k-nearest neighbour classification
caTools ROC curves and AUC (colAUC)
randomForest random forests
impute imputation of missing values
ranger fast implementation of random forests
kernlab kernel-based methods (e.g. SVM)
glmnet regularised (lasso/ridge) regression
naivebayes naive Bayes classifiers
rpart recursive partitioning decision trees
rpart.plot plotting rpart trees

1.1 Function createDataPartition

The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.

  • times: create multiple splits at once; the data indices are returned in a list of integer vectors.
  • createResample: make simple bootstrap samples.
  • createFolds: generate balanced cross-validation groupings from a set of data.
trainIndex <- createDataPartition(dataset$class, p = 0.67, list = FALSE, times = 1)
Train_set <- dataset[trainIndex, ]
Test_set <- dataset[-trainIndex, ]
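
The two helper functions mentioned above work the same way, returning lists of row-index vectors; a minimal sketch:

# Other splitting helpers: each returns a list of integer index vectors
folds <- createFolds(dataset$class, k = 10)         # 10 cross-validation folds
boots <- createResample(dataset$class, times = 5)   # 5 bootstrap samples
str(folds[1])                                       # indices of the first fold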

1.2 Function trainControl

The function trainControl can be used to specify the type of resampling. It also sets parameters that further control how models are created:

  • allowParallel = TRUE: a logical that governs whether train should use the parallel backend registered earlier (if one is available).
  • search = "random": use a random search over the tuning grid instead of a regular grid search.
  • method: the resampling method (see the table: List of resampling method arguments)
List of resampling method arguments.
Resampling method Description
boot the usual bootstrap
boot632 the 0.632 bootstrap estimator
optimism_boot the optimism bootstrap estimator
boot_all all of boot, boot632 and optimism_boot (for efficiency, but boot will be used for calculations)
cv cross-validation
repeatedcv repeated cross-validation
LOOCV leave-one-out cross-validation
LGOCV leave-group-out cross-validation (for repeated training/test splits)
timeslice rolling-origin resampling for time series data
adaptive_cv, adaptive_boot, adaptive_LGOCV adaptive resampling versions of cv, boot and LGOCV
oob out-of-bag estimates (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models)
none only fits one model to the entire training set
ctrl <- trainControl(method = "boot632", savePredictions = TRUE, classProbs = TRUE, number = 10)

1.3 Function expand.grid

\(caret\) automates the tuning of hyperparameters using a grid search, controlled by one of two arguments to train():

  • tuneLength: sets the number of hyperparameter values to test.
  • tuneGrid: directly defines the hyperparameter values to evaluate, which requires knowledge of the model; find the tunable hyperparameters under train Models By Tag. A small sketch of building such a grid follows.
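
A minimal sketch of building a tuning grid with expand.grid (alpha and lambda are the tuning parameters of method = "glmnet", used here purely as an illustration):

# expand.grid returns a data frame with one row per combination of values
searchGrid <- expand.grid(alpha = c(0, 0.5, 1),
                          lambda = 10^seq(-3, 0, by = 1))
nrow(searchGrid)   # 12 candidate hyperparameter combinations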

1.4 Function train

The train function can be used to:

  • preProcess: allows single or multiple pre-processing methods
    • preProcess = "medianImpute": imputation using the median of each feature. This method works well if the data are missing at random.
    • preProcess = "knnImpute": kNN imputation fills missing values using other, similar non-missing rows. The default number of neighbours is 5.
    • preProcess = "scale": division by the standard deviation
    • preProcess = "center": subtraction of the mean
    • preProcess = "pca": PCA can be used as a pre-processing method, generating a set of high-variance, orthogonal predictors and preventing collinearity.
  • trControl: controls the resampling used to estimate model performance from the training set
  • tuneGrid: evaluates, using resampling, the effect of the model tuning parameters on performance
  • method: selects the learning algorithm; check the train Model List. A combined sketch follows this list.
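
A minimal sketch combining these arguments on the credit data split from Section 1.1; the choice of method = "glmnet" and of the pre-processing steps is illustrative only:

# Illustrative train() call: pre-processing, resampling and tuning in one step
model <- train(class ~ ., data = Train_set,
               method = "glmnet",                                  # illustrative model choice
               preProcess = c("medianImpute", "center", "scale"),  # applied to the predictors
               trControl = ctrl,                                   # resampling from Section 1.2
               tuneLength = 5)                                     # 5 values per tuning parameter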

2 Classic Examples

Assumptions of three simple classification methods
Method Data used Other assumption
Naive Bayes All attributes and all instances Attribute independence
Decision tree Only a few attributes but all the instances -
K-NN All the attributes but only a few instances -

2.1 Naive Bayes

2.1.1 Pseudocode

1. Read the Train_set dataset.
2. Calculate the mean and standard deviation of the predictor variables in each class.
3. Using the Gaussian density equation, calculate the probability of each predictor variable in each class; repeat until the probabilities of all predictor variables have been calculated.
4. Calculate the likelihood for each class.
5. Predict the class with the greatest likelihood.
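
A minimal sketch of these steps for a single numeric predictor (credit_amount, picked arbitrarily); a full naive Bayes model would multiply the densities of all predictors:

# Gaussian naive Bayes "by hand" for one predictor and one test observation
mu    <- tapply(Train_set$credit_amount, Train_set$class, mean)  # class means
sdv   <- tapply(Train_set$credit_amount, Train_set$class, sd)    # class standard deviations
prior <- table(Train_set$class) / nrow(Train_set)                # class priors
x     <- Test_set$credit_amount[1]                               # one test observation
lik   <- prior * dnorm(x, mean = mu, sd = sdv)                   # prior x density, per class
names(which.max(lik))                                            # class with greatest likelihood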

2.1.2 Example Code

For more info, click the link

library(klaR)
# Tuning parameters
searchGrid <- expand.grid(fL = 1:3, usekernel = c(FALSE), adjust = seq(0, 10, by = 1))
# Train model
NBmodel <- train(class ~ ., data = dataset, method = "nb", trControl = ctrl, tuneGrid = searchGrid)

# plot search grid results
plot(NBmodel)

2.2 Entropy-based Decision Trees

A great advantage of decision trees is that they make a complex decision simpler by breaking it down into smaller, simpler decisions using a divide-and-conquer strategy. They basically identify a set of if-else conditions that split the data according to the values of the features. Decision trees choose the splits that produce the most homogeneous partitions, leading to smaller and more homogeneous partitions over their iterations.

An issue with single decision trees is that they can grow large and complex with many branches, which corresponds to over-fitting. Over-fitting models noise rather than general patterns in the data, focusing on subtle patterns (outliers) that won’t generalise.

2.2.1 Pseudocode

1. Check for base cases (e.g. all instances belong to the same class, or no attributes remain to split on).
2. For each attribute a, find the normalized information gain ratio from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of node.
6. End
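
A minimal sketch of the entropy and information-gain calculation behind step 2, for one categorical attribute of the credit data (the helper functions are ours, not part of caret):

# Entropy of the class labels and information gain from splitting on attribute x
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}
info_gain <- function(x, y) {
  weighted <- sapply(split(y, x), function(s) length(s) / length(y) * entropy(s))
  entropy(y) - sum(weighted)
}
info_gain(Train_set$checking_status, Train_set$class)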

2.2.2 Example Code

The plot function can be used to examine the relationship between the estimates of performance and the tuning parameters

library(rpart)
# Tuning parameters
searchGrid <- expand.grid(cp = (0:15) * 0.003)
# Train model
DTmodel <- train(class ~ ., data = dataset, method = "rpart", trControl = ctrl, tuneGrid = searchGrid)
# print(DTmodel)
# summary(DTmodel)
# plot search grid results (plot.train draws one lattice plot per call)
plot(DTmodel, metric = "Accuracy")

plot(DTmodel, metric = "Kappa")

# plot the final decision tree and label its nodes
plot(DTmodel$finalModel)
text(DTmodel$finalModel)

2.3 K-Nearest Neighbor

K-nearest neighbours works by directly measuring the (Euclidean) distance between observations and inferring the class of unlabelled data from the class of its nearest neighbours.

2.3.1 Pseudocode

1. Load the training and test data 
2. Choose the value of K 
3. For each point in test data:
4. find the Euclidean distance to all training data points
5. store the Euclidean distances in a list and sort it 
6. choose the first k points 
7. assign a class to the test point based on the majority of classes present in the chosen points
8. End
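
A minimal sketch of these steps using class::knn on two numeric features of the credit data (the feature choice and k = 5 are arbitrary); scaling matters because kNN is distance-based:

library(class)
# Scale the training features and apply the same scaling to the test features
tr <- scale(Train_set[, c("duration", "credit_amount")])
te <- scale(Test_set[, c("duration", "credit_amount")],
            center = attr(tr, "scaled:center"),
            scale  = attr(tr, "scaled:scale"))
pred <- knn(train = tr, test = te, cl = Train_set$class, k = 5)
mean(pred == Test_set$class)   # hold-out accuracy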

2.3.2 Example Code

Note that kNN can over-fit the data in high dimensions: with many features, distances become less informative.

# Tuning parameters
searchGrid <- expand.grid(k = 1:10)
# Train model (centre and scale the features, since kNN is distance-based)
KNNmodel <- train(class ~ ., data = dataset, method = "knn",
                  preProcess = c("center", "scale"),
                  trControl = ctrl, tuneGrid = searchGrid)
# plot search grid results
plot(KNNmodel)

3 Evaluation of Classification Results

This section applies only to classification.

In supervised machine learning, we have a desired output and thus know precisely what is to be computed. It thus becomes possible to directly evaluate a model using a quantifiable and objective metric. The training process seeks to minimise the prediction error.

In-sample Error vs Out-of-sample Error

  • in-sample error: leads to an optimistic assessment of the model. The model has already seen these data during construction and is therefore optimised for these observations in particular; it is said to over-fit the data.
  • out-of-sample error (preferred): computed on new data, to gain a better idea of how the model performs on unseen data and to estimate how well it generalises.
List of Evaluation Methods
Evaluation Method Concept Outcome Data size
Independent test dataset single split into training/testing sets estimate of the out-of-sample error large data size
Cross-validation multiple splits into training/testing sets better estimate of the out-of-sample error moderate data size
Bootstrap random sampling from the dataset with replacement gives a sense of the distribution small data size
List of Ensemble Learning
Ensemble Learning Concept Outcome
Bagging (bootstrap aggregation) average multiple models trained on bootstrap samples reduces variance (minimises RMSE loss)
Boosting consecutively train models, each one correcting the net error of the prior model improves accuracy, with some small risk of over-fitting

3.1 Independent test dataset

Randomly split the data into a training set (Train_set) and a test set (Test_set).

3.1.1 Pseudocode

1. Create a random split (e.g. 80/20) to define the training and test subsets.
2. Train a regression or classification model on the Train_set data.
3. Test the model on the Test_set data.
4. Calculate the out-of-sample RMSE or prediction accuracy.

3.1.2 Example Code

set.seed(1234)
trainIndex <- createDataPartition(dataset$class, p = 0.67, list = FALSE, times = 1)
Train_set <- dataset[trainIndex, ]
Test_set <- dataset[-trainIndex, ]
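
A minimal sketch of steps 2-4, reusing the decision-tree setup from Section 2.2 (DTmodel_holdout is our own name; the model choice is illustrative):

# Fit on the training split only, then score the held-out test split
DTmodel_holdout <- train(class ~ ., data = Train_set, method = "rpart", trControl = ctrl)
pred <- predict(DTmodel_holdout, Test_set)
mean(pred == Test_set$class)   # out-of-sample prediction accuracy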

3.2 Cross-validation

Instead of doing a single training/testing split, we can systematise this process and produce multiple, different out-of-sample train/test splits, which leads to a better estimate of the out-of-sample error.

Schematic of 3-fold cross-validation producing three training (blue) and testing (white) splits.

3.2.1 Example Code

set.seed(42)
ctrl = trainControl(method = "cv", number = 10, verboseIter = FALSE)
LMmodel_10cv <- train(price ~ ., diamonds, method = "lm", trControl= ctrl)
p <- predict(LMmodel_10cv, diamonds)
error <- p - diamonds$price
rmse_xval <- sqrt(mean(error^2)) ## xval RMSE
rmse_xval
## [1] 1129.843

3.2.2 Error by cross-validation

\[e_{cv} = \frac{\sum_{i=1}^{k_{fold}} e_i}{k_{fold}}\]
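
In caret, the per-fold errors entering this average are stored in the resample element of the train object; a quick look at the 10-fold example above:

# One row of metrics per fold; the mean of the fold RMSEs is the e_cv of the formula
head(LMmodel_10cv$resample)
mean(LMmodel_10cv$resample$RMSE)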

3.3 Bootstrap

The idea is to draw random samples with replacement of size N from the training data with size M (M > N). This process is repeated B times to get B bootstrap datasets.


3.3.1 Error by 0.632-bootstrap

\[E_{boot} = weight_{train} \times e_{train} + weight_{test} \times e_{test}\] \[e_{0.632} = 0.368 \times e_{train} + 0.632 \times e_{test}\]
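
A quick numeric illustration of the 0.632 estimator with made-up error values:

# Hypothetical in-sample and out-of-sample errors
e_train <- 0.10
e_test  <- 0.25
0.368 * e_train + 0.632 * e_test   # 0.632-bootstrap estimate: 0.1948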

3.4 Bagging (Bootstrap aggregating)

  • In regression, it averages the prediction over a collection of bootstrap samples, thus reducing the variance in prediction.
  • For classification, a committee (or ensemble) of classifiers each casts a vote for the predicted class.
Schematic of the RF algorithm based on the bagging (Bootstrap + Aggregating) method by Xiaogang HE (original source: https://www.researchgate.net/publication/309031320_Spatial_downscaling_of_precipitation_using_adaptable_random_forests).

3.4.2 Random forest

Building a random forest starts by generating a high number of individual decision trees. A single decision tree isn’t very accurate, but many different trees built from different inputs (bootstrapped observations and subsets of features) explore a broad search space and, once combined, produce accurate models; this technique is called bootstrap aggregation or bagging.

library(mlbench)
data(Sonar)
library(e1071)
library(ranger)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
myGrid <- expand.grid(mtry = c(5, 10, 20, 40, 60),
                      min.node.size = c(5, 10, 15),
                      splitrule = c("gini", "extratrees"))
RFmodel <- train(Class ~ .,
               data = Sonar,
               method = "ranger", 
               tuneGrid = myGrid,
               trControl = ctrl)
# RFmodel
plot(RFmodel)

3.5 Confusion Matrix & contingency table

Reference Yes Reference No
Predicted Yes TP FP
Predicted No FN TN

3.5.1 Example Code

# use DTmodel as an example
prediction <- predict(DTmodel, Test_set)
confusionMatrix(prediction,Test_set$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction bad good
##       bad   56   20
##       good  43  211
##                                           
##                Accuracy : 0.8091          
##                  95% CI : (0.7625, 0.8501)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 4.5e-06         
##                                           
##                   Kappa : 0.5131          
##                                           
##  Mcnemar's Test P-Value : 0.005576        
##                                           
##             Sensitivity : 0.5657          
##             Specificity : 0.9134          
##          Pos Pred Value : 0.7368          
##          Neg Pred Value : 0.8307          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1697          
##    Detection Prevalence : 0.2303          
##       Balanced Accuracy : 0.7395          
##                                           
##        'Positive' Class : bad             
## 
  • accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • sensitivity (recall, TP rate): \(\frac{TP}{TP + FN}\)
  • specificity: \(\frac{TN}{TN + FP}\)
  • positive predictive value (precision): \(\frac{TP}{TP + FP}\)
  • negative predictive value: \(\frac{TN}{TN + FN}\)
  • FP rate (fall-out): \(\frac{FP}{FP + TN}\)
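
As a quick check, the reported values can be reproduced by hand from the confusion matrix above ('bad' is the positive class):

# Counts taken from the confusion matrix printed above
TP <- 56; FP <- 20; FN <- 43; TN <- 211
(TP + TN) / (TP + TN + FP + FN)   # accuracy    = 0.8091
TP / (TP + FN)                    # sensitivity = 0.5657
TN / (TN + FP)                    # specificity = 0.9134
TP / (TP + FP)                    # precision   = 0.7368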

3.6 Receiver operating characteristic (ROC) curve

This illustrates the need to adequately balance the TP and FP rates. We need a way to do a cost-benefit analysis, and the solution will often depend on the question/problem.

# ROC curves need class probabilities rather than regression predictions;
# here we score the credit test set with the decision-tree model from Section 2.2
prob <- predict(DTmodel, Test_set, type = "prob")
caTools::colAUC(prob[, "good"], Test_set$class, plotROC = TRUE)
  • x: FP rate (1 - specificity)
  • y: TP rate (sensitivity)
  • each point along the curve represents a confusion matrix for a given threshold
  • In addition, the colAUC function returns the area under the curve (AUC), a single-number metric summarising the model performance across all possible thresholds:
    • an AUC of 0.5 corresponds to a random model
    • values > 0.5 do better than a random guess
    • a value of 1 represents a perfect model
    • a value of 0 represents a model that is always wrong