Use of this document

This is a study note from the course material of IE 583 by Siggi Olafsson at ISU, with some additional material.

0 Prerequisites

library(factoextra)
library(dplyr)

0.1 Four Types of Machine Learning

Four Types of Machine Learning

  • Supervised learning (SML): the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative.
  • Unsupervised learning (UML): no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
  • Semi-supervised learning: approaches that use labelled data to inform unsupervised learning on the unlabelled data, in order to identify and annotate new classes in the dataset (also called novelty detection).
  • Reinforcement learning: the learning algorithm performs a task using feedback from operating in a real or synthetic environment.

1 Finding the Number of Clusters

(Figure: types of clustering)

Use Petal.Length and Sepal.Length in the iris dataset as the example.

library(datasets)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

1.1 Assessing clustering tendency

To assess the clustering tendency, the Hopkins statistic and a visual approach can be used. Both are available through the function get_clust_tendency() in the \(factoextra\) package, which also creates an ordered dissimilarity image (ODI).

  • Hopkins statistic: with the version of get_clust_tendency() used here, a value close to 0 (far below 0.5) indicates that the dataset is significantly clusterable. (Note that some references define the Hopkins statistic the other way around, so that values close to 1 indicate clusterability; check the documentation of your \(factoextra\) version.)
  • Visual approach: the visual approach detects the clustering tendency by counting the number of square-shaped dark (or colored) blocks along the diagonal of the ordered dissimilarity image.
iris[, -5] %>%    # Remove column 5 (Species)
  scale() %>%     # Scale variables
  get_clust_tendency(n = 50, gradient = list(low = "steelblue",  high = "white"))
## $hopkins_stat
## [1] 0.2002686
## 
## $plot

A distance matrix is necessary for hierarchical clustering. In bioinformatics, distance matrices (often visualized as heat maps) are used to represent protein structures in a coordinate-independent manner, as well as the pairwise distances between two sequences in sequence space. They are used in structural and sequence alignment, and for the determination of protein structures from NMR or X-ray crystallography.

res.dist <- get_dist(iris[,1:4], stand = TRUE, method = "euclidean")
fviz_dist(res.dist, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# or use heatmap() from `stats` package
heatmap(as.matrix(iris[,1:4]), scale = "column")

1.2 Optimal Number of Clusters

In the above analysis we (correctly) assumed that there were 3 clusters. In practice this is usually unknown, but we can use cluster validation to determine the number of clusters with the \(NbClust\) package. What NbClust() actually does is apply different clustering methods to obtain a clustering, and then evaluate its quality using different quality measures (here called indices). Try another clustering method by changing to method = "single"; the results are very different (see the sketch after the output below).

  • distance: the notion of similarity is a key concept in clustering, used to decide which clusters should be combined or divided. Choosing an appropriate metric is strategic for achieving a good clustering, because it directly influences the shape of the clusters. The dissimilarity (or distance) matrix is used in many density-based and hierarchical clustering algorithms. (see reference)
  • min.nc: minimal number of clusters, between 1 and (number of objects - 1)
  • max.nc: maximal number of clusters, between 2 and (number of objects - 1), greater or equal to min.nc. By default, max.nc=15.
  • method: the cluster analysis method to be used. This should be one of: “ward.D”, “ward.D2”, “single”, “complete”, “average”, “mcquitty”, “median”, “centroid”, “kmeans”.
  • index: the index to be calculated. This should be one of : “kl”, “ch”, “hartigan”, “ccc”, “scott”, “marriot”, “trcovw”, “tracew”, “friedman”, “rubin”, “cindex”, “db”, “silhouette”, “duda”, “pseudot2”, “beale”, “ratkowsky”, “ball”, “ptbiserial”, “gap”, “frey”, “mcclain”, “gamma”, “gplus”, “tau”, “dunn”, “hubert”, “sdindex”, “dindex”, “sdbw”, “all” (all indices except GAP, Gamma, Gplus and Tau), “alllong” (all indices with Gap, Gamma, Gplus and Tau included).
  • See the reference for details on the available methods and indices.
library(NbClust)
# All indices are used to determine the optimal number of clusters
valClust <- NbClust(data = iris[,1:4], distance = "euclidean", method = "ward.D", index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 3 proposed 6 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
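
As suggested above, the same call can be repeated with single linkage to see how strongly the proposed number of clusters depends on the clustering method; a minimal sketch (output omitted):

# Rerun NbClust with single linkage instead of Ward's method
valClust.single <- NbClust(data = iris[,1:4], distance = "euclidean",
                           method = "single", index = "all")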

2 Five Types of Clustering Methods (not completed)

Clustering methods are used to identify groups of similar objects in multivariate data sets collected from fields such as marketing, biomedicine, and geospatial analysis. There are several different types of clustering methods (see Reference):

In the \(factoextra\) package, the eclust() function makes it easier to visualize clusters using the first two principal components. It also allows us to call multiple clustering methods through one function (similar to train() in \(caret\)). See the reference Practical Guide To Cluster Analysis in R under Related Books.

Not sure yet what FUNcluster = c("agnes", "diana") does (agnes and diana are the agglomerative and divisive hierarchical methods from the \(cluster\) package); need to check in the future.

2.1 Partitioning methods

Partitioning algorithms are clustering techniques that subdivide the data set into k groups, where k is the number of groups pre-specified by the analyst. The options for the clustering function argument, FUNcluster, for partitioning methods are:

  • kmeans: K-means clustering (MacQueen 1967), in which each cluster is represented by the center (mean) of the data points belonging to the cluster. The K-means method is sensitive to anomalous data points and outliers.
  • pam: K-medoids clustering or PAM (Partitioning Around Medoids; Kaufman & Rousseeuw, 1990), in which each cluster is represented by one of the objects in the cluster. PAM is less sensitive to outliers than k-means (a quick PAM sketch follows this list).
  • clara: the CLARA algorithm (Clustering Large Applications), an extension of PAM adapted for large data sets.
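
For example, PAM can be run through eclust() in the same way as k-means is run below; a minimal sketch, assuming k = 3 as before (output not shown):

clusters.pam <- eclust(iris[,1:4], FUNcluster = "pam", k = 3, graph = FALSE)
fviz_cluster(clusters.pam, geom = "point")   # clusters on the first two principal components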

2.1.1 k-Means

(Figure: k-means calculation)

1. Choose the number of clusters (K) and obtain the data points.
2. Place the centroids $c_1$, $c_2$, ..., $c_k$ randomly.
3. Repeat until convergence or until the end of a fixed number of iterations:
    a) for each data point $x_i$:
        - find the nearest centroid ($c_1$, $c_2$, ..., $c_k$)
        - assign the point to that cluster
    b) for each cluster j = 1, ..., k:
        - new centroid = mean of all points assigned to that cluster
4. End
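The loop above can be written out directly in a few lines of R. This is only an illustrative sketch on two iris variables (with a fixed number of iterations and no empty-cluster handling); in practice use kmeans() or eclust() as shown below.

set.seed(123)
X <- scale(iris[, c("Sepal.Length", "Petal.Length")])
k <- 3
centroids <- X[sample(nrow(X), k), , drop = FALSE]        # step 2: random initial centroids
for (iter in 1:10) {                                      # step 3: fixed number of iterations
  d <- as.matrix(dist(rbind(centroids, X)))[1:k, -(1:k)]  # distances centroid -> point
  assign <- apply(d, 2, which.min)                        # step 3a: nearest centroid per point
  centroids <- t(sapply(1:k, function(j)                  # step 3b: recompute each centroid
    colMeans(X[assign == j, , drop = FALSE])))
}
table(assign, iris$Species)                               # compare clusters with species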
# Visualizing partitional clustering
clusters.km <- eclust(iris[,1:4], FUNcluster = "kmeans", k = 3, nstart = 25, graph = FALSE)
fviz_cluster(clusters.km, geom = "point")

2.2 Hierarchical clustering

Hierarchical clustering is an alternative to partitioning clustering for identifying groups in the dataset. It does not require pre-specifying the number of clusters to be generated.

  • dendrogram: the result of hierarchical clustering is a tree-based representation of the objects, known as a dendrogram. Observations can be subdivided into groups by cutting the dendrogram at a desired similarity level.

The options for the clustering function argument, FUNcluster, for hierarchical clustering are:

  • single: the distance between two clusters is set equal to the minimum of the distances between all pairs of instances in the two clusters. It tends to produce very elongated clusters, so it is good for flexible cluster shapes.
  • complete: the distance between two clusters is set equal to the maximum of the distances between all pairs of instances in the two clusters. It results in tightly bound, compact clusters.

See also the reference Comparing Cluster Dendrograms in R.
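
Neither linkage has actually been run in this note so far; below is a minimal sketch of hierarchical clustering through eclust(), assuming complete linkage, Euclidean distance, and k = 3 (output not shown):

clusters.hc <- eclust(iris[,1:4], FUNcluster = "hclust", k = 3,
                      hc_metric = "euclidean", hc_method = "complete", graph = FALSE)
fviz_dend(clusters.hc, show_labels = FALSE, palette = "jco")   # plot the dendrogram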

2.3 Fuzzy clustering

Fuzzy clustering is also known as soft clustering. Standard clustering approaches (K-means, PAM) produce partitions in which each observation belongs to exactly one cluster; this is known as hard clustering. In fuzzy clustering, items can be members of more than one cluster: each item has a set of membership coefficients corresponding to its degree of belonging to each cluster. The fuzzy c-means method is the most popular fuzzy clustering algorithm. See Reference

# Visualizing fuzzy clustering
clusters.fa <- eclust(iris[,1:4], FUNcluster = "fanny", k = 3, graph = FALSE)
fviz_cluster(clusters.fa, geom = "point")
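
The membership coefficients mentioned above can be inspected directly on the fitted object; a small sketch, assuming the membership matrix returned by fanny() is carried over by eclust():

round(head(clusters.fa$membership), 3)   # one row per observation, one column per cluster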

2.4 Density-based clustering

A cluster is a dense area of instances, where each dense area is separated from the others by low-density areas. DBSCAN is a density-based partitioning method introduced in Ester et al. (1996). It can find clusters of different shapes and sizes in data containing noise and outliers (Ester et al. 1996). The basic idea behind the density-based clustering approach is derived from an intuitive human notion of clusters. See Reference

1. For each point xi, compute the distance between xi and the other points. Find all neighbor points within distance eps of the starting point (xi). Each point with a neighbor count greater than or equal to MinPts is marked as a core point or visited.
2. For each core point, if it's not already assigned to a cluster, create a new cluster. Find recursively all its density connected points and assign them to the same cluster as the core point.
3. Iterate through the remaining unvisited points in the data set.
4. Those points that do not belong to any cluster are treated as outliers or noise.
# Load the data 
data("multishapes", package = "factoextra")
df <- multishapes[, 1:2]

# Compute DBSCAN using fpc package
library(fpc)
set.seed(123)
db <- fpc::dbscan(df, eps = 0.15, MinPts = 5)

# Plot DBSCAN results
library("factoextra")
fviz_cluster(db, data = df, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point", palette = "jco", ggtheme = theme_classic())
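
The value eps = 0.15 used above is typically chosen by looking for the knee in a k-nearest-neighbor distance plot; a minimal sketch, assuming the \(dbscan\) package is installed:

dbscan::kNNdistplot(df, k = 5)   # sorted distances to each point's 5th nearest neighbor
abline(h = 0.15, lty = 2)        # the eps value used in the DBSCAN call above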

2.5 Model-based clustering

Reference
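
This section is not written yet; as a placeholder, here is a minimal sketch of Gaussian mixture model clustering, assuming the \(mclust\) package is installed (output not shown):

library(mclust)
mc <- Mclust(iris[,1:4])            # number of clusters and covariance model chosen by BIC
summary(mc)
fviz_mclust(mc, "classification")   # factoextra visualization of the classification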

3 Clustering Validation Methods

In section 1.2 Optimal Number of Clusters, we used a vote over existing criteria with NbClust() in the \(NbClust\) package to determine the number of clusters. However, the different criteria have their own strengths and weaknesses. Depending on the dataset, knowing the nature of each criterion helps in choosing the appropriate validation method.

Criteria of validation methods (and their indices):

  • Compactness: Hart, kl, hartigan, Clest, Sil, Gap
  • Disconnectivity: Lee
  • Compactness (elbow): hartigan (Hartigan 1975), kl (Krzanowski and Lai 1988)
  • Compactness and separation: ch (Calinski and Harabasz 1974), Clest, Sil
  • Reference distribution or resampling: Gap, Clest, BH

To evaluate the goodness of the clustering results, plot the silhouette coefficient as follows:

fviz_silhouette(clusters.fa, palette = "jco",
                ggtheme = theme_minimal())
##   cluster size ave.sil.width
## 1       1   50          0.79
## 2       2   45          0.38
## 3       3   55          0.43
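
For comparison, the same silhouette plot can be drawn for the k-means result from section 2.1.1, assuming eclust() stored the silhouette information for that object as well (output not shown):

fviz_silhouette(clusters.km, palette = "jco", ggtheme = theme_minimal())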