Use of this document

This is a study note from the course material for IE 583 by Siggi Olafsson at ISU, with some additional material.

0. Prerequisites

library(arules) 
library(arulesViz)
library(dplyr)
library(tibble)

0.1 Four Types of Machine Learning

Four Types of Machine Learning

Supervised learning (SML): the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative.
Unsupervised learning (UML): no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
Semi-supervised learning: uses labelled data to inform unsupervised learning on the unlabelled data, in order to identify and annotate new classes in the dataset (also called novelty detection).
Reinforcement learning: the learning algorithm performs a task using feedback from operating in a real or synthetic environment.

1. Association Rule

Association rule discovery aims to find interesting correlations or other relationships between attributes in large databases, expressed as rules of the form \(X \Rightarrow Y\): whenever \(X\) occurs in a record, \(Y\) tends to occur as well.

In practice, association rules are useful for analyzing and predicting customer behavior. They play an important part in customer analytics, market basket analysis, product clustering, catalog design and store layout.

1.1 Support and Confidence

Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships.

  • \(Support(X \Rightarrow Y) = P(X \cap Y)\): an indication of how frequently the items appear together in the data.
  • \(Confidence(X \Rightarrow Y) = P(Y | X) = \frac{P(X \cap Y)}{P(X)}\): indicates how often the if-then statement is found to be true.
  • \(Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{P(Y)} = \frac{P(X \cap Y)}{P(X)P(Y)}\): can be used to compare confidence with expected confidence. It measures the correlation, and can be interpreted as how much more likely \(Y\) is in the subpopulation (given \(X\)) than in the general population.
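
These three measures are easy to verify by hand. Below is a minimal base-R sketch (the five toy transactions are made up for illustration) that computes support, confidence, and lift for the rule \(\{a\} \Rightarrow \{b\}\):

```r
# Five toy transactions over the items a, b, c (made-up data)
trans <- list(c("a", "b"), c("a", "b", "c"), c("a", "c"),
              c("b", "c"), c("a", "b"))

# Fraction of transactions containing all of the given items
supp <- function(items) {
  mean(sapply(trans, function(t) all(items %in% t)))
}

# Measures for the rule {a} => {b}
support    <- supp(c("a", "b"))              # P(X and Y) = 3/5
confidence <- supp(c("a", "b")) / supp("a")  # P(Y | X)   = 3/4
lift       <- confidence / supp("b")         # 0.75 / 0.8 = 0.9375
```

Note that the lift here is below 1, indicating a slight negative association between a and b despite the high confidence.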

1.2 Pseudocode of Association Rule Mining

1. Find all item sets that meet minimum coverage (support).
2. Find all rules that meet minimum accuracy (confidence).
3. * Find all rules that meet minimum lift.
4. Prune by:
    * retaining the more precise rule when it has the same/smaller support, same/higher accuracy, or same/higher confidence;
    * retaining the simpler rule when it has higher accuracy or higher confidence.

* Step 3 can be skipped when only comparing rules with a common rhs, \(Y\), because \(Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{P(Y)} \propto Confidence(X \Rightarrow Y)\) when \(P(Y)\) is fixed.
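
The first two steps can be sketched in base R with a brute-force enumeration (no Apriori-style level-wise pruning, so this only scales to toy data; the transactions and thresholds below are made up):

```r
# Toy transactions and thresholds (made up for illustration)
trans <- list(c("a", "b"), c("a", "b", "c"), c("a", "c"),
              c("b", "c"), c("a", "b"))
min_support    <- 0.4
min_confidence <- 0.7

items <- sort(unique(unlist(trans)))
supp  <- function(x) mean(sapply(trans, function(t) all(x %in% t)))

# Step 1: all itemsets that meet minimum coverage (support)
all_sets <- unlist(lapply(seq_along(items),
                          function(k) combn(items, k, simplify = FALSE)),
                   recursive = FALSE)
frequent <- Filter(function(x) supp(x) >= min_support, all_sets)

# Step 2: all rules X => {y} (single-item rhs) that meet minimum
# accuracy (confidence)
rules <- list()
for (s in Filter(function(x) length(x) > 1, frequent)) {
  for (y in s) {
    x    <- setdiff(s, y)
    conf <- supp(s) / supp(x)
    if (conf >= min_confidence) {
      rules[[length(rules) + 1]] <-
        list(lhs = x, rhs = y, support = supp(s), confidence = conf)
    }
  }
}
```

On this toy data the frequent itemsets are the three singletons and three pairs, and the only rules that survive the confidence threshold are {a} => {b} and {b} => {a}, with confidence 0.75 each.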

2. Examples

vote <- read.csv("vote.csv")
head(as_tibble(vote))
## # A tibble: 6 x 17
##   handicapped_inf… water_project_c… adoption_of_the… physician_fee_f…
##   <fct>            <fct>            <fct>            <fct>           
## 1 n                y                n                y               
## 2 n                y                n                y               
## 3 ?                y                y                ?               
## 4 n                y                y                n               
## 5 y                y                y                n               
## 6 n                y                y                n               
## # … with 13 more variables: el_salvador_aid <fct>,
## #   religious_groups_in_schools <fct>, anti_satellite_test_ban <fct>,
## #   aid_to_nicaraguan_contras <fct>, mx_missile <fct>, immigration <fct>,
## #   synfuels_corporation_cutback <fct>, education_spending <fct>,
## #   superfund_right_to_sue <fct>, crime <fct>, duty_free_exports <fct>,
## #   export_administration_act_south_africa <fct>, Class <fct>

The Apriori algorithm, implemented in apriori(), was developed in the early 1990s and remains the most commonly used approach.

2.1 Whole-set of Rules

findRules <- function(rules, c = 0.9, s = 0.35, l = 1.8){
  # Keep only rules exceeding the confidence, support, and lift thresholds
  good_rules <- rules[quality(rules)$confidence > c]
  good_rules <- good_rules[quality(good_rules)$support > s]
  good_rules <- good_rules[quality(good_rules)$lift > l]
  good_rules
}
rules <- apriori(vote)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 43 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [482817 rule(s)] done [0.07s].
## creating S4 object  ... done [0.12s].
good_rules <- findRules(rules) # customized filter function defined above
plotly_arules(good_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
inspect(head(good_rules, n=10, by = "lift"))
##      lhs                                rhs                        support confidence     lift count
## [1]  {el_salvador_aid=y,                                                                            
##       Class=republican}              => {physician_fee_freeze=y} 0.3586207  0.9936306 2.441973   156
## [2]  {crime=y,                                                                                      
##       Class=republican}              => {physician_fee_freeze=y} 0.3563218  0.9810127 2.410963   155
## [3]  {physician_fee_freeze=y,                                                                       
##       el_salvador_aid=y}             => {Class=republican}       0.3586207  0.9285714 2.404337   156
## [4]  {physician_fee_freeze=y,                                                                       
##       crime=y}                       => {Class=republican}       0.3563218  0.9226190 2.388924   155
## [5]  {Class=republican}              => {physician_fee_freeze=y} 0.3747126  0.9702381 2.384483   163
## [6]  {physician_fee_freeze=y}        => {Class=republican}       0.3747126  0.9209040 2.384483   163
## [7]  {physician_fee_freeze=y,                                                                       
##       mx_missile=n}                  => {el_salvador_aid=y}      0.3517241  0.9935065 2.038563   153
## [8]  {aid_to_nicaraguan_contras=n,                                                                  
##       crime=y}                       => {el_salvador_aid=y}      0.3770115  0.9879518 2.027165   164
## [9]  {religious_groups_in_schools=y,                                                                
##       aid_to_nicaraguan_contras=n,                                                                  
##       crime=y}                       => {el_salvador_aid=y}      0.3540230  0.9871795 2.025581   154
## [10] {anti_satellite_test_ban=y,                                                                    
##       aid_to_nicaraguan_contras=y,                                                                  
##       education_spending=n,                                                                         
##       Class=democrat}                => {el_salvador_aid=n}      0.3586207  0.9570552 2.001534   156

2.2 Sub-set of Rules

Alternatively, we can subset the rule output from apriori() by constraining the lhs or rhs.

# subset for rule with specification on lhs
def_rules <- subset(rules, subset = lhs %in% "Class=democrat") 
# subset for rule with specification on rhs
Dem_rules <- subset(rules, subset = rhs %in% "Class=democrat") 
good_Dem_rules <- findRules(Dem_rules, c = 0.9, s = 0.43, l = 1.5)
plotly_arules(good_Dem_rules)

2.3 Closed Item Set

The apriori() function can return all itemsets, frequent itemsets, closed itemsets, association rules, or hyperedges. By default it returns rules based on frequent itemsets, and unfortunately there doesn’t seem to be a way to generate rules based on closed itemsets directly. Instead, you can start by generating the closed itemsets via the target parameter. Once you have those, you can generate the corresponding rules with the ruleInduction() function, which takes the dataset reformatted as transactional data as an argument. From there, you can work with the rules just as before, but the non-closed sets have already been filtered out (about half of them for the vote dataset).
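
The defining property is simple to state: a frequent itemset is closed if no proper superset has exactly the same support. A base-R sketch of that filter on made-up toy transactions (this is not the arules implementation, which uses a far more efficient algorithm):

```r
# Toy transactions (made up); note that a never occurs without b
trans <- list(c("a", "b"), c("a", "b", "c"), c("b", "c"),
              c("b"), c("a", "b"))
items <- sort(unique(unlist(trans)))
supp  <- function(x) mean(sapply(trans, function(t) all(x %in% t)))

# All frequent itemsets at a low support threshold
all_sets <- unlist(lapply(seq_along(items),
                          function(k) combn(items, k, simplify = FALSE)),
                   recursive = FALSE)
frequent <- Filter(function(x) supp(x) >= 0.2, all_sets)

# Closed: no proper superset with identical support
is_closed <- function(x) {
  for (y in frequent) {
    if (length(y) > length(x) && all(x %in% y) && supp(y) == supp(x)) {
      return(FALSE)
    }
  }
  TRUE
}
closed <- Filter(is_closed, frequent)
```

Here {a} is frequent but not closed, because its superset {a, b} has the same support (a never occurs without b); the 7 frequent itemsets reduce to 4 closed ones, the same kind of reduction seen for the vote dataset above.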

itemsets = apriori(vote, parameter=list(target="closed frequent itemsets"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen                   target   ext
##      10 closed frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 43 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.11s].
## filtering closed item sets ... done [0.01s].
## writing ... [46577 set(s)] done [0.01s].
## creating S4 object  ... done [0.02s].
vote_trans = as(vote, "transactions")
rules = ruleInduction(itemsets, vote_trans, .8, control=list(method="apriori"))
rules
## set of 246588 rules