Use of this document

This is a study note combining course material from IE 583 by Siggi Olafsson at ISU with some additional material. The following topics will be discussed:

For more details on the study material see:

0. Prerequisites

# Classification and Regression Training
library(caret)
library(tidyverse)

0.1 Four Types of Machine Learning

Four Types of Machine Learning
Machine Learning Description
Supervised learning (SML) the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative.
Unsupervised learning (UML) no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
Semi-supervised learning approaches use labelled data to inform unsupervised learning on the unlabelled data, to identify and annotate new classes in the dataset (also called novelty detection).
Reinforcement learning the learning algorithm performs a task using feedback from operating in a real or synthetic environment.

0.2 Parallel Processing

To tune a predictive model using multiple workers, the function syntax of the \(caret\) package functions (e.g. train, rfe or sbf) does not change. A separate function is used to “register” the parallel processing technique and specify the number of workers to use.

# Parallel Processing
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
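
When the parallel work is finished, the cluster should be shut down and sequential processing restored; a minimal sketch:

# Shut down the workers and return to sequential processing
stopCluster(cl)
registerDoSEQ()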

1 Model Training and Tuning

In this chapter, we will use functions and tools from the \(caret\) R package. The \(caret\) package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation.

# data
dataset <- read.csv("credit_g.csv")
dataset$class <- as.factor(dataset$class)
head(as_tibble(dataset))
## # A tibble: 6 x 21
##   checking_status duration credit_history purpose credit_amount
##   <fct>              <int> <fct>          <fct>           <int>
## 1 <0                     6 'critical/oth… radio/…          1169
## 2 0<=X<200              48 'existing pai… radio/…          5951
## 3 'no checking'         12 'critical/oth… educat…          2096
## 4 <0                    42 'existing pai… furnit…          7882
## 5 <0                    24 'delayed prev… 'new c…          4870
## 6 'no checking'         36 'existing pai… educat…          9055
## # … with 16 more variables: savings_status <fct>, employment <fct>,
## #   installment_commitment <int>, personal_status <fct>,
## #   other_parties <fct>, residence_since <int>, property_magnitude <fct>,
## #   age <int>, other_payment_plans <fct>, housing <fct>,
## #   existing_credits <int>, job <fct>, num_dependents <int>,
## #   own_telephone <fct>, foreign_worker <fct>, class <fct>

For more details about this section, check the Link

Note: remember to also load the specific package for the learning method so that the train() function works.

The \(caret\)-Related R Packages
R Package Description
caret classification and regression training (model tuning framework)
ggplot2 plotting and visualisation
mlbench benchmark datasets (e.g. Sonar)
class k-nearest neighbour classification
caTools ROC curves and AUC (colAUC)
randomForest random forests
impute imputation of missing values
ranger fast implementation of random forests
kernlab kernel-based methods (e.g. SVM)
glmnet regularised (lasso/ridge) regression
naivebayes naive Bayes classifiers
rpart recursive partitioning decision trees
rpart.plot plotting rpart trees

1.1 Function createDataPartition

The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.

  • times: create multiple splits at once; the data indices are returned in a list of integer vectors.
  • createResample: make simple bootstrap samples.
  • createFolds: generate balanced cross-validation groupings from a set of data.
trainIndex <- createDataPartition(dataset$class, p = 0.67, list = FALSE, times = 1)
Train_set <- dataset[trainIndex, ]
Test_set <- dataset[-trainIndex, ]
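
The two helper functions mentioned above work the same way, returning lists of row-index vectors; a minimal sketch:

# Other splitting helpers: each returns a list of integer index vectors
folds <- createFolds(dataset$class, k = 10)         # 10 cross-validation folds
boots <- createResample(dataset$class, times = 5)   # 5 bootstrap samples
str(folds[1])                                       # indices of the first fold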

1.2 Function trainControl

The function trainControl can be used to specify the type of resampling. It also sets parameters that further control how models are created:

  • allowParallel = TRUE: a logical that governs whether train should use the parallel backend registered earlier (if one is available).
  • search = "random": use a random search over the tuning grid instead of a regular grid search.
  • method: the resampling method (see the table: List of resampling method arguments)
List of resampling method arguments.
Resampling method Description
boot the usual bootstrap
boot632 the 0.632 bootstrap estimator
optimism_boot the optimism bootstrap estimator
boot_all all of boot, boot632 and optimism_boot (for efficiency, but boot will be used for calculations)
cv cross-validation
repeatedcv repeated cross-validation
LOOCV leave-one-out cross-validation
LGOCV leave-group-out cross-validation (for repeated training/test splits)
timeslice rolling-origin resampling for time series data
adaptive_cv, adaptive_boot, adaptive_LGOCV adaptive resampling versions of cv, boot and LGOCV
oob out-of-bag estimates (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models)
none only fits one model to the entire training set
ctrl <- trainControl(method = "boot632", savePredictions = TRUE, classProbs = TRUE, number = 10)

1.3 Function expand.grid

\(caret\) automates the tuning of hyperparameters using a grid search, controlled by one of two arguments to train():

  • tuneLength: sets the number of hyperparameter values to test.
  • tuneGrid: directly defines the hyperparameter values to evaluate, which requires knowledge of the model; find the tunable hyperparameters under train Models By Tag. A small sketch of building such a grid follows.
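
A minimal sketch of building a tuning grid with expand.grid (alpha and lambda are the tuning parameters of method = "glmnet", used here purely as an illustration):

# expand.grid returns a data frame with one row per combination of values
searchGrid <- expand.grid(alpha = c(0, 0.5, 1),
                          lambda = 10^seq(-3, 0, by = 1))
nrow(searchGrid)   # 12 candidate hyperparameter combinations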

1.4 Function train

The train function can be used to:

  • preProcess: allows single or multiple pre-processing methods
    • preProcess = "medianImpute": imputation using the median of each feature. This method works well if the data are missing at random.
    • preProcess = "knnImpute": kNN imputation fills missing values using other, similar non-missing rows. The default number of neighbours is 5.
    • preProcess = "scale": division by the standard deviation
    • preProcess = "center": subtraction of the mean
    • preProcess = "pca": PCA can be used as a pre-processing method, generating a set of high-variance, orthogonal predictors and preventing collinearity.
  • trControl: controls the resampling used to estimate model performance from the training set
  • tuneGrid: evaluates, using resampling, the effect of the model tuning parameters on performance
  • method: selects the learning algorithm; check the train Model List. A combined sketch follows this list.
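
A minimal sketch combining these arguments on the credit data split from Section 1.1; the choice of method = "glmnet" and of the pre-processing steps is illustrative only:

# Illustrative train() call: pre-processing, resampling and tuning in one step
model <- train(class ~ ., data = Train_set,
               method = "glmnet",                                  # illustrative model choice
               preProcess = c("medianImpute", "center", "scale"),  # applied to the predictors
               trControl = ctrl,                                   # resampling from Section 1.2
               tuneLength = 5)                                     # 5 values per tuning parameter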

2 Classic Examples

Assumptions of three simple classification methods
Method Data used Other assumption
Naive Bayes All attributes and all instances Attribute independence
Decision tree Only a few attributes but all the instances -
K-NN All the attributes but only a few instances -

2.1 Naive Bayes

2.1.1 Pseudocode

1. Read the Train_set dataset.
2. Calculate the mean and standard deviation of the predictor variables in each class.
3. Using the Gaussian density equation, calculate the probability of each predictor variable in each class; repeat until the probabilities of all predictor variables have been calculated.
4. Calculate the likelihood for each class.
5. Predict the class with the greatest likelihood.
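
A minimal sketch of these steps for a single numeric predictor (credit_amount, picked arbitrarily); a full naive Bayes model would multiply the densities of all predictors:

# Gaussian naive Bayes "by hand" for one predictor and one test observation
mu    <- tapply(Train_set$credit_amount, Train_set$class, mean)  # class means
sdv   <- tapply(Train_set$credit_amount, Train_set$class, sd)    # class standard deviations
prior <- table(Train_set$class) / nrow(Train_set)                # class priors
x     <- Test_set$credit_amount[1]                               # one test observation
lik   <- prior * dnorm(x, mean = mu, sd = sdv)                   # prior x density, per class
names(which.max(lik))                                            # class with greatest likelihood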

2.1.2 Example Code

For more info, click the link

library(klaR)
# Tuning parameters
searchGrid <- expand.grid(fL = 1:3, usekernel = c(FALSE), adjust = seq(0, 10, by = 1))
# Train model
NBmodel <- train(class ~ ., data = dataset, method = "nb", trControl = ctrl, tuneGrid = searchGrid)

# plot search grid results
plot(NBmodel)

2.2 Entropy-based Decision Trees

A great advantage of decision trees is that they make a complex decision simpler by breaking it down into smaller, simpler decisions using a divide-and-conquer strategy. They basically identify a set of if-else conditions that split the data according to the values of the features. Decision trees choose the splits that produce the most homogeneous partitions, leading to smaller and more homogeneous partitions over their iterations.

An issue with single decision trees is that they can grow large and complex with many branches, which corresponds to over-fitting. Over-fitting models noise rather than general patterns in the data, focusing on subtle patterns (outliers) that won’t generalise.

2.2.1 Pseudocode

1. Check for base cases (e.g. all instances belong to the same class, or no attributes remain to split on).
2. For each attribute a, find the normalized information gain ratio from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of node.
6. End
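
A minimal sketch of the entropy and information-gain calculation behind step 2, for one categorical attribute of the credit data (the helper functions are ours, not part of caret):

# Entropy of the class labels and information gain from splitting on attribute x
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p[p > 0] * log2(p[p > 0]))
}
info_gain <- function(x, y) {
  weighted <- sapply(split(y, x), function(s) length(s) / length(y) * entropy(s))
  entropy(y) - sum(weighted)
}
info_gain(Train_set$checking_status, Train_set$class)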

2.2.2 Example Code

The plot function can be used to examine the relationship between the estimates of performance and the tuning parameters

library(rpart)
# Tuning parameters
searchGrid <- expand.grid(cp = (0:15) * 0.003)
# Train model
DTmodel <- train(class ~ ., data = dataset, method = "rpart", trControl = ctrl, tuneGrid = searchGrid)
# print(DTmodel)
# summary(DTmodel)
# plot search grid results (plot.train draws one lattice plot per call)
plot(DTmodel, metric = "Accuracy")

plot(DTmodel, metric = "Kappa")

# plot the final decision tree and label its nodes
plot(DTmodel$finalModel)
text(DTmodel$finalModel)

2.3 K-Nearest Neighbor

K-nearest neighbours works by directly measuring the (Euclidean) distance between observations and inferring the class of unlabelled data from the class of its nearest neighbours.

2.3.1 Pseudocode

1. Load the training and test data 
2. Choose the value of K 
3. For each point in test data:
4. find the Euclidean distance to all training data points
5. store the Euclidean distances in a list and sort it 
6. choose the first k points 
7. assign a class to the test point based on the majority of classes present in the chosen points
8. End
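
A minimal sketch of these steps using class::knn on two numeric features of the credit data (the feature choice and k = 5 are arbitrary); scaling matters because kNN is distance-based:

library(class)
# Scale the training features and apply the same scaling to the test features
tr <- scale(Train_set[, c("duration", "credit_amount")])
te <- scale(Test_set[, c("duration", "credit_amount")],
            center = attr(tr, "scaled:center"),
            scale  = attr(tr, "scaled:scale"))
pred <- knn(train = tr, test = te, cl = Train_set$class, k = 5)
mean(pred == Test_set$class)   # hold-out accuracy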

2.3.2 Example Code

Note that kNN can over-fit the data in high dimensions: with many features, distances become less informative.

# Tuning parameters
searchGrid <- expand.grid(k = 1:10)
# Train model (centre and scale the features, since kNN is distance-based)
KNNmodel <- train(class ~ ., data = dataset, method = "knn",
                  preProcess = c("center", "scale"),
                  trControl = ctrl, tuneGrid = searchGrid)
# plot search grid results
plot(KNNmodel)

3 Evaluation of Classification Results

This section applies only to classification.

In supervised machine learning, we have a desired output and thus know precisely what is to be computed. It thus becomes possible to directly evaluate a model using a quantifiable and objective metric. The training process seeks to minimise the prediction error.

In-sample Error vs Out-of-sample Error

  • in-sample error: leads to an optimistic assessment of the model. The model has already seen these data during construction and is therefore optimised for these observations in particular; it is said to over-fit the data.
  • out-of-sample error (preferred): computed on new data, to gain a better idea of how the model performs on unseen data and to estimate how well it generalises.
List of Evaluation Methods
Evaluation Method Concept Outcome Data size
Independent test dataset single split into training/testing sets estimate of the out-of-sample error large data size
Cross-validation multiple splits into training/testing sets better estimate of the out-of-sample error moderate data size
Bootstrap random sampling from the dataset with replacement gives a sense of the distribution small data size
List of Ensemble Learning
Ensemble Learning Concept Outcome
Bagging (bootstrap aggregation) average multiple models trained on bootstrap samples reduces variance (minimises RMSE loss)
Boosting consecutively train models, each one correcting the net error of the prior model improves accuracy, with some small risk of over-fitting

3.1 Independent test dataset

Randomly split the data into a training set (Train_set) and a test set (Test_set).

3.1.1 Pseudocode

1. Create a random split (e.g. 80/20) to define the training and test subsets.
2. Train a regression or classification model on the Train_set data.
3. Test the model on the Test_set data.
4. Calculate the out-of-sample RMSE or prediction accuracy.

3.1.2 Example Code

set.seed(1234)
trainIndex <- createDataPartition(dataset$class, p = 0.67, list = FALSE, times = 1)
Train_set <- dataset[trainIndex, ]
Test_set <- dataset[-trainIndex, ]
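
A minimal sketch of steps 2-4, reusing the decision-tree setup from Section 2.2 (DTmodel_holdout is our own name; the model choice is illustrative):

# Fit on the training split only, then score the held-out test split
DTmodel_holdout <- train(class ~ ., data = Train_set, method = "rpart", trControl = ctrl)
pred <- predict(DTmodel_holdout, Test_set)
mean(pred == Test_set$class)   # out-of-sample prediction accuracy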

3.2 Cross-validation

Instead of doing a single training/testing split, we can systematise this process and produce multiple, different out-of-sample train/test splits, which leads to a better estimate of the out-of-sample error.

Schematic of 3-fold cross-validation producing three training (blue) and testing (white) splits.

3.2.1 Example Code

set.seed(42)
ctrl = trainControl(method = "cv", number = 10, verboseIter = FALSE)
LMmodel_10cv <- train(price ~ ., diamonds, method = "lm", trControl= ctrl)
p <- predict(LMmodel_10cv, diamonds)
error <- p - diamonds$price
rmse_xval <- sqrt(mean(error^2)) ## xval RMSE
rmse_xval
## [1] 1129.843

3.2.2 Error by cross-validation

\[e_{cv} = \frac{\sum_{i=1}^{k_{fold}} e_i}{k_{fold}}\]
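
In caret, the per-fold errors entering this average are stored in the resample element of the train object; a quick look at the 10-fold example above:

# One row of metrics per fold; the mean of the fold RMSEs is the e_cv of the formula
head(LMmodel_10cv$resample)
mean(LMmodel_10cv$resample$RMSE)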

3.3 Bootstrap

The idea is to draw random samples with replacement of size N from the training data with size M (M > N). This process is repeated B times to get B bootstrap datasets.


3.3.1 Error by 0.632-bootstrap

\[E_{boot} = weight_{train} \times e_{train} + weight_{test} \times e_{test}\] \[e_{0.632} = 0.368 \times e_{train} + 0.632 \times e_{test}\]
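
A quick numeric illustration of the 0.632 estimator with made-up error values:

# Hypothetical in-sample and out-of-sample errors
e_train <- 0.10
e_test  <- 0.25
0.368 * e_train + 0.632 * e_test   # 0.632-bootstrap estimate: 0.1948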

3.4 Bagging (Bootstrap aggregating)

  • In regression, it averages the prediction over a collection of bootstrap samples, thus reducing the variance in prediction.
  • For classification, a committee (or ensemble) of classifiers each casts a vote for the predicted class.
Schematic of the RF algorithm based on the bagging (Bootstrap + Aggregating) method by Xiaogang HE (original source: https://www.researchgate.net/publication/309031320_Spatial_downscaling_of_precipitation_using_adaptable_random_forests).

3.4.2 Random forest

Building a random forest starts by generating a high number of individual decision trees. A single decision tree isn’t very accurate, but many different trees built from different inputs (bootstrapped observations and subsets of features) explore a broad search space and, once combined, produce accurate models; this technique is called bootstrap aggregation or bagging.

library(mlbench)
data(Sonar)
library(e1071)
library(ranger)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
myGrid <- expand.grid(mtry = c(5, 10, 20, 40, 60),
                      min.node.size = c(5, 10, 15),
                      splitrule = c("gini", "extratrees"))
RFmodel <- train(Class ~ .,
               data = Sonar,
               method = "ranger", 
               tuneGrid = myGrid,
               trControl = ctrl)
# RFmodel
plot(RFmodel)

3.5 Confusion Matrix & contingency table

Reference Yes Reference No
Predicted Yes TP FP
Predicted No FN TN

3.5.1 Example Code

# use DTmodel as an example
prediction <- predict(DTmodel, Test_set)
confusionMatrix(prediction,Test_set$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction bad good
##       bad   56   20
##       good  43  211
##                                           
##                Accuracy : 0.8091          
##                  95% CI : (0.7625, 0.8501)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 4.5e-06         
##                                           
##                   Kappa : 0.5131          
##                                           
##  Mcnemar's Test P-Value : 0.005576        
##                                           
##             Sensitivity : 0.5657          
##             Specificity : 0.9134          
##          Pos Pred Value : 0.7368          
##          Neg Pred Value : 0.8307          
##              Prevalence : 0.3000          
##          Detection Rate : 0.1697          
##    Detection Prevalence : 0.2303          
##       Balanced Accuracy : 0.7395          
##                                           
##        'Positive' Class : bad             
## 
  • accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • sensitivity (recall, TP rate): \(\frac{TP}{TP + FN}\)
  • specificity: \(\frac{TN}{TN + FP}\)
  • positive predictive value (precision): \(\frac{TP}{TP + FP}\)
  • negative predictive value: \(\frac{TN}{TN + FN}\)
  • FP rate (fall-out): \(\frac{FP}{FP + TN}\)
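
As a quick check, the reported values can be reproduced by hand from the confusion matrix above ('bad' is the positive class):

# Counts taken from the confusion matrix printed above
TP <- 56; FP <- 20; FN <- 43; TN <- 211
(TP + TN) / (TP + TN + FP + FN)   # accuracy    = 0.8091
TP / (TP + FN)                    # sensitivity = 0.5657
TN / (TN + FP)                    # specificity = 0.9134
TP / (TP + FP)                    # precision   = 0.7368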

3.6 Receiver operating characteristic (ROC) curve

This illustrates the need to adequately balance the TP and FP rates. We need a way to do a cost-benefit analysis, and the solution will often depend on the question/problem.

# ROC curves need class probabilities rather than regression predictions;
# here we score the credit test set with the decision-tree model from Section 2.2
prob <- predict(DTmodel, Test_set, type = "prob")
caTools::colAUC(prob[, "good"], Test_set$class, plotROC = TRUE)
  • x: FP rate (1 - specificity)
  • y: TP rate (sensitivity)
  • each point along the curve represents a confusion matrix for a given threshold
  • In addition, the colAUC function returns the area under the curve (AUC), a single-number metric summarising the model performance across all possible thresholds:
    • an AUC of 0.5 corresponds to a random model
    • values > 0.5 do better than a random guess
    • a value of 1 represents a perfect model
    • a value of 0 represents a model that is always wrong