This is a study note combining course material from IE 583 by Siggi Olafsson at ISU with some additional material. The following topics will be discussed:
- classification
- clustering
- association rule mining
- advanced classification
For more details on the study material see:
# Classification and Regression Training
library(caret)
library(tidyverse)
Machine Learning | Description |
---|---|
Supervised learning (SML) | the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative. |
Unsupervised learning (UML) | no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data. |
Semi-supervised learning | approaches use labelled data to inform unsupervised learning on the unlabelled data to identify and annotate new classes in the dataset (also called novelty detection). |
Reinforcement learning | the learning algorithm performs a task using feedback from operating in a real or synthetic environment. |
To tune a predictive model using multiple workers, the syntax of the \(caret\) package functions (e.g. train, rfe or sbf) does not change. A separate function is used to "register" the parallel processing technique and specify the number of workers to use.
# Parallel Processing
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)
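When the parallel computations are done, the workers can be released again (assuming the cl object created above is still in scope):
# Release the parallel workers created above
stopCluster(cl)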
In this chapter, we will use functions and tools from the \(caret\) R package. The \(caret\) package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. The package contains tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation.
# data
dataset <- read.csv("credit_g.csv")
dataset$class <- as.factor(dataset$class)
head(as_tibble(dataset))
## # A tibble: 6 x 21
## checking_status duration credit_history purpose credit_amount
## <fct> <int> <fct> <fct> <int>
## 1 <0 6 'critical/oth… radio/… 1169
## 2 0<=X<200 48 'existing pai… radio/… 5951
## 3 'no checking' 12 'critical/oth… educat… 2096
## 4 <0 42 'existing pai… furnit… 7882
## 5 <0 24 'delayed prev… 'new c… 4870
## 6 'no checking' 36 'existing pai… educat… 9055
## # … with 16 more variables: savings_status <fct>, employment <fct>,
## # installment_commitment <int>, personal_status <fct>,
## # other_parties <fct>, residence_since <int>, property_magnitude <fct>,
## # age <int>, other_payment_plans <fct>, housing <fct>,
## # existing_credits <int>, job <fct>, num_dependents <int>,
## # own_telephone <fct>, foreign_worker <fct>, class <fct>
For more details about this section, check the link.
Note: remember to also load the specific package for the learning method so that the train() function works.
R Package | Description |
---|---|
caret | streamlines model training, tuning and evaluation |
ggplot2 | data visualisation |
mlbench | benchmark data sets (e.g. Sonar) |
class | k-nearest neighbour classification |
caTools | utilities, including colAUC for ROC curves and AUC |
randomForest | random forest models |
impute | imputation of missing values |
ranger | fast implementation of random forests |
kernlab | kernel-based methods such as support vector machines |
glmnet | regularised (lasso, ridge, elastic-net) generalised linear models |
naivebayes | naive Bayes classifier |
rpart | recursive partitioning decision trees |
rpart.plot | plotting rpart decision trees |
createDataPartition
The function createDataPartition can be used to create balanced splits of the data. If the y argument to this function is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.
- times: create multiple splits at once; the data indices are returned in a list of integer vectors.
- createResample: make simple bootstrap samples.
- createFolds: generate balanced cross-validation groupings from a set of data (a short sketch follows the code chunk below).
trainIndex <- createDataPartition(dataset$class, p = 0.67, list = FALSE, times = 1)
Train_set <- dataset[trainIndex, ]
Test_set <- dataset[-trainIndex, ]
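As a minimal sketch (the numbers of resamples and folds below are arbitrary), the related functions are called in the same way on the class labels:
# 5 simple bootstrap resamples of the row indices (a list of integer vectors)
bootSamples <- createResample(dataset$class, times = 5)
# 10 balanced cross-validation folds
cvFolds <- createFolds(dataset$class, k = 10, list = TRUE)
str(cvFolds[1:2])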
trainControl
The function trainControl can be used to specify the type of resampling. It also generates parameters that further control how models are created:
- allowParallel = TRUE: a logical that governs whether train should use parallel processing (if available).
- search = "random": use a random search (instead of a grid search) over the tuning parameters.
- method: the resampling method (see the table of resampling methods below).
Resampling method | Description |
---|---|
boot | the usual bootstrap |
boot632 | the 0.632 bootstrap estimator |
optimism_boot | the optimism bootstrap estimator |
boot_all | all of boot, boot632 and optimism_boot (for efficiency, but boot will be used for calculations) |
cv | cross-validation |
repeatedcv | repeated cross-validation |
LOOCV | leave-one-out cross-validation |
LGOCV | leave-group-out cross-validation (for repeated Train_set/test splits) |
timeslice | cross-validation for time series data |
oob | out-of-bag estimates (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models) |
adaptive_cv, adaptive_boot or adaptive_LGOCV | adaptive resampling versions of the corresponding methods |
none | only fits one model to the entire Train_set set |
ctrl <- trainControl(method = "boot632", savePred = T, classProb = T, number = 10)
expand.grid
\(caret\) automates the tuning of hyperparameters using a grid search, which can be specified in two ways:
- tuneLength: sets the number of hyperparameter values to test (see the short sketch below).
- tuneGrid: directly defines the hyperparameter values, which requires knowledge of the model. Find the hyperparameters from the train Models By Tag page.
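As a short sketch (the method and the tuneLength value are arbitrary choices, not part of the original note), tuneLength lets \(caret\) build the candidate grid automatically:
# Hypothetical example: let caret pick 5 candidate values of k for kNN
KNNauto <- train(class ~ ., data = dataset, method = "knn",
                 trControl = ctrl, tuneLength = 5)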
train
The train function can be used to fit a model and tune its hyperparameters; its main arguments are described below:
preProcess
The preProcess argument allows single and multiple pre-processing methods (a short sketch follows this list):
- preProcess = "medianImpute": imputation using the median of each feature. This method works well if the data are missing at random.
- preProcess = "knnImpute": kNN imputation imputes missing values based on other, similar non-missing rows. The default number of neighbours is 5.
- preProcess = "scale": division by the standard deviation.
- preProcess = "center": subtraction of the mean.
- preProcess = "pca": PCA can be used as a pre-processing method, generating a set of high-variance and orthogonal predictors, preventing collinearity.
Other train arguments:
- trControl: estimate model performance from a training set.
- tuneGrid: evaluate, using resampling, the effect of model tuning parameters on performance.
- method: choose the "optimal" model across these parameters; check the train Model List.
Method | Data used | Other Assumption |
---|---|---|
Naive Bayes | All | Attribute Independence |
Decision tree | Only a few attributes but all the instances | - |
K-NN | All the attributes but only a few instances | - |
1. Read the Train_set dataset.
2. Calculate the mean and standard deviation of the predictor variables in each class.
3. Repeatedly calculate the probability for each class using the Gaussian density equation (shown below), until the probability of every predictor variable has been calculated.
4. Calculate the likelihood for each class.
5. Get the class with the greatest likelihood.
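Assuming the density in step 3 is the usual Gaussian (normal) density, the class-conditional probability of a numeric predictor \(x_i\) uses the class mean \(\mu_c\) and standard deviation \(\sigma_c\) from step 2, and the class likelihood in step 4 combines these under the attribute-independence assumption:
\[P(x_i \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\!\left(-\frac{(x_i-\mu_c)^2}{2\sigma_c^2}\right)\]
\[P(c \mid x_1,\dots,x_p) \propto P(c)\prod_{i=1}^{p} P(x_i \mid c)\]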
For more info, click the link.
library(klaR)
# Tuning parameters
searchGrid <- expand.grid(fL = 1:3, usekernel = c(FALSE), adjust = seq(0, 10, by = 1))
# Train model
NBmodel <- train(class ~ ., data = dataset, method = "nb", trControl = ctrl, tuneGrid = searchGrid)
# plot search grid results
plot(NBmodel)
A great advantage of decision trees is that they make a complex decision simpler by breaking it down into smaller, simpler decisions using a divide-and-conquer strategy. They essentially identify a set of if-else conditions that split the data according to the values of the features. Decision trees choose the splits that produce the most homogeneous partitions, leading to smaller and more homogeneous partitions over successive iterations.
An issue with single decision trees is that they can grow large and complex, with many branches, which corresponds to over-fitting. Over-fitting models noise, rather than general patterns in the data, focusing on subtle patterns (outliers) that won't generalise.
1. Check for the above base cases.
2. For each attribute a, find the normalized information gain ratio from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recur on the sublists obtained by splitting on a_best, and add those nodes as children of node.
6. End
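For reference, steps 2 and 3 use the standard entropy-based quantities (standard definitions, stated here in case the notation differs from the course slides):
\[H(S) = -\sum_{c} p_c \log_2 p_c\]
\[\mathrm{Gain}(S, a) = H(S) - \sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|}\, H(S_v)\]
\[\mathrm{GainRatio}(S, a) = \frac{\mathrm{Gain}(S, a)}{-\sum_{v \in \mathrm{values}(a)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}\]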
The plot function can be used to examine the relationship between the estimates of performance and the tuning parameters.
library(rpart)
# Tuning parameters
searchGrid <- expand.grid(cp = (0:15) * 0.003)
# Train model
DTmodel <- train(class ~ ., data = dataset, method = "rpart", trControl = ctrl, tuneGrid = searchGrid)
# print(DTmodel)
# summary(DTmodel)
# plot search grid results
par(mfrow=c(1,2))
plot(DTmodel, metric = "Accuracy")
plot(DTmodel, metric = "Kappa")
# plot DT
plot(DTmodel$finalModel)
text(DTmodel$finalModel)  # add the split labels to the tree plot
K nearest neighbours works by directly measuring the (Euclidean) distance between observations and inferring the class of unlabelled data from the class of its nearest neighbours.
1. Load the training and test data
2. Choose the value of K
3. For each point in test data:
4. find the Euclidean distance to all training data points
5. store the Euclidean distances in a list and sort it
6. choose the first k points
7. assign a class to the test point based on the majority of classes present in the chosen points
8. End
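Step 4 uses the Euclidean distance between a test point \(x\) and a training point \(y\) with \(p\) features:
\[d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}\]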
Note that this approach can overfit the data in high dimensions.
# Tuning parameters
searchGrid <- expand.grid(k = 1:10)
# Train model
KNNmodel <- train(class ~ ., data = dataset, method = "knn", trControl = ctrl, tuneGrid = searchGrid)
# plot search grid results
plot(KNNmodel)
This section focuses mainly on classification.
In supervised machine learning, we have a desired output and thus know precisely what is to be computed. It thus becomes possible to directly evaluate a model using a quantifiable and objective metric. The training process seeks to optimise:
- root mean squared error (RMSE) for regression.
- prediction accuracy for classification.
The error can be estimated in two ways:
- in-sample error: leads to an optimistic assessment of our model. Indeed, the model has already seen these data during construction and is therefore optimised for these particular observations; it is said to over-fit the data.
- out-of-sample error (preferred): computed on new data, to gain a better idea of how the model performs on unseen data and to estimate how well the model generalises.
Evaluation Method | Concept | Outcome | Data size |
---|---|---|---|
Independent test dataset | single split into training/testing set | estimate of the out-of-sample error | Large data size |
cross validation | multiple splits into training/testing sets | better estimate of the out-of-sample error | Moderate data size |
Bootstrap | random sampling from dataset with replacement | gives a sense of the distribution | Small data size |
Ensemble Learning | Concept | Outcome | Data size |
---|---|---|---|
Bagging or Bootstrap aggregation | average multiple models trained on bootstrap samples | minimizes RMSE loss | |
Boosting | consecutively train models, each solving for the net error of the prior model | improves accuracy with some small risk of less coverage | |
Randomly select a subset of the data to be the Train_set data and the rest to be the test data:
1. Create a random (e.g. 80/20; the code below uses 67/33) split to define the test and train subsets.
2. Train a regression or classification model on the Train_set data.
3. Test the model on the Test_set data.
4. Calculate the out-of-sample RMSE or prediction accuracy (a short sketch follows the code below).
set.seed(1234)
trainIndex <- createDataPartition(dataset$class ,p=.67,list=FALSE,times=1)
Train_set <- dataset[trainIndex,]
Test_set <- dataset[-trainIndex,]
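As a sketch of step 4, assuming a fitted classification model such as the DTmodel trained earlier in this note (which, strictly speaking, was fitted on the full dataset, so the estimate below is only illustrative):
# Predict on the held-out Test_set and compute the out-of-sample prediction accuracy
pred <- predict(DTmodel, Test_set)
mean(pred == Test_set$class)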
Instead of doing a single training/testing split, we can systematise this process and produce multiple, different out-of-sample train/test splits, which leads to a better estimate of the out-of-sample error.
set.seed(42)
ctrl = trainControl(method = "cv", number = 10, verboseIter = FALSE)
LMmodel_10cv <- train(price ~ ., diamonds, method = "lm", trControl= ctrl)
p <- predict(LMmodel_10cv, diamonds)
error <- p - diamonds$price
rmse_xval <- sqrt(mean(error^2)) ## xval RMSE
rmse_xval
## [1] 1129.843
\[e_{cv} = \frac{\sum_{i=1}^{k_{fold}} e_i}{k_{fold}}\]
The idea is to draw random samples with replacement of size N from the training data with size M (M > N). This process is repeated B times to get B bootstrap datasets.
\[E_{boot} = weight_{train} \times e_{train} + weight_{test} \times e_{test}\] \[e_{0.632} = 0.368 \times e_{train} + 0.632 \times e_{test}\]
Building a random forest starts by generating a large number of individual decision trees. A single decision tree isn't very accurate, but many different trees built using different inputs (with bootstrapped observations and features) allow us to explore a broad search space and, once combined, produce accurate models; this technique is called bootstrap aggregation, or bagging.
library(mlbench)
data(Sonar)
library(e1071)
library(ranger)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
myGrid <- expand.grid(mtry = c(5, 10, 20, 40, 60),
min.node.size = c(5, 10, 15),
splitrule = c("gini", "extratrees"))
RFmodel <- train(Class ~ .,
data = Sonar,
method = "ranger",
tuneGrid = myGrid,
trControl = ctrl)
# RFmodel
plot(RFmodel)
 | Reference Yes | Reference No |
---|---|---|
Predicted Yes | TP | FP |
Predicted No | FN | TN |
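From these four counts, the main metrics reported by confusionMatrix below are computed as:
\[\mathrm{Accuracy} = \frac{TP+TN}{TP+FP+FN+TN}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP+FN}\]
\[\mathrm{Specificity} = \frac{TN}{TN+FP}, \qquad \mathrm{Pos\;Pred\;Value} = \frac{TP}{TP+FP}\]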
# use DTmodel as an example
prediction <- predict(DTmodel, Test_set)
confusionMatrix(prediction,Test_set$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction bad good
## bad 56 20
## good 43 211
##
## Accuracy : 0.8091
## 95% CI : (0.7625, 0.8501)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 4.5e-06
##
## Kappa : 0.5131
##
## Mcnemar's Test P-Value : 0.005576
##
## Sensitivity : 0.5657
## Specificity : 0.9134
## Pos Pred Value : 0.7368
## Neg Pred Value : 0.8307
## Prevalence : 0.3000
## Detection Rate : 0.1697
## Detection Prevalence : 0.2303
## Balanced Accuracy : 0.7395
##
## 'Positive' Class : bad
##
This illustrates the need to adequately balance TP and FP rates. We need to have a way to do a cost-benefit analysis, and the solution will often depend on the question/problem.
# ROC curve / AUC for the decision tree's class probabilities on the test set
prob <- predict(DTmodel, Test_set, type = "prob")
caTools::colAUC(prob[, "good"], Test_set$class, plotROC = TRUE)