Use of this document

This is a study note from the course material for IE 583 by Siggi Olafsson at ISU, with some additional material.

0. Prerequisites

library(arules) 
library(arulesViz)
library(dplyr)
library(tibble)

0.1 Four Types of Machine Learning

Four Types of Machine Learning

Supervised learning (SML): the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative.
Unsupervised learning (UML): no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
Semi-supervised learning: uses labelled data to inform unsupervised learning on the unlabelled data, in order to identify and annotate new classes in the dataset (also called novelty detection).
Reinforcement learning: the learning algorithm performs a task using feedback from operating in a real or synthetic environment.

1. Association Rule

Association rule discovery aims to find interesting correlations or other relationships between attributes in large databases, expressed as rules of the form \(X \Rightarrow Y\): whenever \(X\) occurs in a record, \(Y\) tends to occur as well.

In practice, association rules are useful for analyzing and predicting customer behavior. They play an important part in customer analytics, market basket analysis, product clustering, catalog design and store layout.

1.1 Support and Confidence

Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships.

  • \(Support(X \Rightarrow Y) = P(X \cap Y)\): an indication of how frequently the items appear together in the data.
  • \(Confidence(X \Rightarrow Y) = P(Y | X) = \frac{P(X \cap Y)}{P(X)}\): indicates how often the if-then statement is found to be true.
  • \(Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{P(Y)} = \frac{P(X \cap Y)}{P(X)P(Y)}\): can be used to compare confidence with expected confidence. It measures the correlation, and can be interpreted as how much more likely \(Y\) is in the subpopulation (given \(X\)) than in the general population.
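
These three measures are easy to verify by hand. Below is a minimal base-R sketch (the five toy transactions are made up for illustration) that computes support, confidence, and lift for the rule \(\{a\} \Rightarrow \{b\}\):

```r
# Five toy transactions over the items a, b, c (made-up data)
trans <- list(c("a", "b"), c("a", "b", "c"), c("a", "c"),
              c("b", "c"), c("a", "b"))

# Fraction of transactions containing all of the given items
supp <- function(items) {
  mean(sapply(trans, function(t) all(items %in% t)))
}

# Measures for the rule {a} => {b}
support    <- supp(c("a", "b"))              # P(X and Y) = 3/5
confidence <- supp(c("a", "b")) / supp("a")  # P(Y | X)   = 3/4
lift       <- confidence / supp("b")         # 0.75 / 0.8 = 0.9375
```

Note that the lift here is below 1, indicating a slight negative association between a and b despite the high confidence.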

1.2 Pseudocode of Association Rule Mining

1. Find all item sets that meet minimum coverage (support).
2. Find all rules that meet minimum accuracy (confidence).
3. * Find all rules that meet minimum lift.
4. Prune by:
    * retaining the more precise rule when it has the same/smaller support, same/higher accuracy, or same/higher confidence;
    * retaining the simpler rule when it has higher accuracy or higher confidence.

* Step 3 can be skipped when only comparing rules with a common rhs, \(Y\), because \(Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{P(Y)} \propto Confidence(X \Rightarrow Y)\) when \(P(Y)\) is fixed.
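
The first two steps can be sketched in base R with a brute-force enumeration (no Apriori-style level-wise pruning, so this only scales to toy data; the transactions and thresholds below are made up):

```r
# Toy transactions and thresholds (made up for illustration)
trans <- list(c("a", "b"), c("a", "b", "c"), c("a", "c"),
              c("b", "c"), c("a", "b"))
min_support    <- 0.4
min_confidence <- 0.7

items <- sort(unique(unlist(trans)))
supp  <- function(x) mean(sapply(trans, function(t) all(x %in% t)))

# Step 1: all itemsets that meet minimum coverage (support)
all_sets <- unlist(lapply(seq_along(items),
                          function(k) combn(items, k, simplify = FALSE)),
                   recursive = FALSE)
frequent <- Filter(function(x) supp(x) >= min_support, all_sets)

# Step 2: all rules X => {y} (single-item rhs) that meet minimum
# accuracy (confidence)
rules <- list()
for (s in Filter(function(x) length(x) > 1, frequent)) {
  for (y in s) {
    x    <- setdiff(s, y)
    conf <- supp(s) / supp(x)
    if (conf >= min_confidence) {
      rules[[length(rules) + 1]] <-
        list(lhs = x, rhs = y, support = supp(s), confidence = conf)
    }
  }
}
```

On this toy data the frequent itemsets are the three singletons and three pairs, and the only rules that survive the confidence threshold are {a} => {b} and {b} => {a}, with confidence 0.75 each.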

2. Examples

vote <- read.csv("vote.csv")
head(as_tibble(vote))
## # A tibble: 6 x 17
##   handicapped_inf… water_project_c… adoption_of_the… physician_fee_f…
##   <fct>            <fct>            <fct>            <fct>           
## 1 n                y                n                y               
## 2 n                y                n                y               
## 3 ?                y                y                ?               
## 4 n                y                y                n               
## 5 y                y                y                n               
## 6 n                y                y                n               
## # … with 13 more variables: el_salvador_aid <fct>,
## #   religious_groups_in_schools <fct>, anti_satellite_test_ban <fct>,
## #   aid_to_nicaraguan_contras <fct>, mx_missile <fct>, immigration <fct>,
## #   synfuels_corporation_cutback <fct>, education_spending <fct>,
## #   superfund_right_to_sue <fct>, crime <fct>, duty_free_exports <fct>,
## #   export_administration_act_south_africa <fct>, Class <fct>

The Apriori algorithm, implemented in apriori(), was developed in the early 1990s and remains the most commonly used approach.

2.1 Whole-set of Rules

findRules <- function(rules, c = 0.9, s = 0.35, l = 1.8){
  # Keep only rules exceeding the confidence, support, and lift thresholds
  good_rules <- rules[quality(rules)$confidence > c]
  good_rules <- good_rules[quality(good_rules)$support > s]
  good_rules <- good_rules[quality(good_rules)$lift > l]
  good_rules
}
rules <- apriori(vote)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 43 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [482817 rule(s)] done [0.07s].
## creating S4 object  ... done [0.12s].
good_rules <- findRules(rules) # customized filter function defined above
plotly_arules(good_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
inspect(head(good_rules, n=10, by = "lift"))
##      lhs                                rhs                        support confidence     lift count
## [1]  {el_salvador_aid=y,                                                                            
##       Class=republican}              => {physician_fee_freeze=y} 0.3586207  0.9936306 2.441973   156
## [2]  {crime=y,                                                                                      
##       Class=republican}              => {physician_fee_freeze=y} 0.3563218  0.9810127 2.410963   155
## [3]  {physician_fee_freeze=y,                                                                       
##       el_salvador_aid=y}             => {Class=republican}       0.3586207  0.9285714 2.404337   156
## [4]  {physician_fee_freeze=y,                                                                       
##       crime=y}                       => {Class=republican}       0.3563218  0.9226190 2.388924   155
## [5]  {Class=republican}              => {physician_fee_freeze=y} 0.3747126  0.9702381 2.384483   163
## [6]  {physician_fee_freeze=y}        => {Class=republican}       0.3747126  0.9209040 2.384483   163
## [7]  {physician_fee_freeze=y,                                                                       
##       mx_missile=n}                  => {el_salvador_aid=y}      0.3517241  0.9935065 2.038563   153
## [8]  {aid_to_nicaraguan_contras=n,                                                                  
##       crime=y}                       => {el_salvador_aid=y}      0.3770115  0.9879518 2.027165   164
## [9]  {religious_groups_in_schools=y,                                                                
##       aid_to_nicaraguan_contras=n,                                                                  
##       crime=y}                       => {el_salvador_aid=y}      0.3540230  0.9871795 2.025581   154
## [10] {anti_satellite_test_ban=y,                                                                    
##       aid_to_nicaraguan_contras=y,                                                                  
##       education_spending=n,                                                                         
##       Class=democrat}                => {el_salvador_aid=n}      0.3586207  0.9570552 2.001534   156

2.2 Sub-set of Rules

Alternatively, we can subset the rule output from apriori() by constraining the lhs or rhs.

# subset for rule with specification on lhs
def_rules <- subset(rules, subset = lhs %in% "Class=democrat") 
# subset for rule with specification on rhs
Dem_rules <- subset(rules, subset = rhs %in% "Class=democrat") 
good_Dem_rules <- findRules(Dem_rules, c = 0.9, s = 0.43, l = 1.5)
plotly_arules(good_Dem_rules)

2.3 Closed Item Set

The apriori() function can return all itemsets, frequent itemsets, closed itemsets, association rules, or hyperedges. By default it returns rules based on frequent itemsets, and unfortunately there doesn’t seem to be a way to generate rules based on closed itemsets directly. Instead, you can start by generating the closed itemsets via the target parameter. Once you have those, you can generate the corresponding rules with the ruleInduction() function, which takes the dataset reformatted as transactional data as an argument. From there, you can work with the rules just as before, but the non-closed sets have already been filtered out (about half of them for the vote dataset).
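
The defining property is simple to state: a frequent itemset is closed if no proper superset has exactly the same support. A base-R sketch of that filter on made-up toy transactions (this is not the arules implementation, which uses a far more efficient algorithm):

```r
# Toy transactions (made up); note that a never occurs without b
trans <- list(c("a", "b"), c("a", "b", "c"), c("b", "c"),
              c("b"), c("a", "b"))
items <- sort(unique(unlist(trans)))
supp  <- function(x) mean(sapply(trans, function(t) all(x %in% t)))

# All frequent itemsets at a low support threshold
all_sets <- unlist(lapply(seq_along(items),
                          function(k) combn(items, k, simplify = FALSE)),
                   recursive = FALSE)
frequent <- Filter(function(x) supp(x) >= 0.2, all_sets)

# Closed: no proper superset with identical support
is_closed <- function(x) {
  for (y in frequent) {
    if (length(y) > length(x) && all(x %in% y) && supp(y) == supp(x)) {
      return(FALSE)
    }
  }
  TRUE
}
closed <- Filter(is_closed, frequent)
```

Here {a} is frequent but not closed, because its superset {a, b} has the same support (a never occurs without b); the 7 frequent itemsets reduce to 4 closed ones, the same kind of reduction seen for the vote dataset above.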

itemsets = apriori(vote, parameter=list(target="closed frequent itemsets"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen                   target   ext
##      10 closed frequent itemsets FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 43 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.11s].
## filtering closed item sets ... done [0.01s].
## writing ... [46577 set(s)] done [0.01s].
## creating S4 object  ... done [0.02s].
vote_trans = as(vote, "transactions")
rules = ruleInduction(itemsets, vote_trans, .8, control=list(method="apriori"))
rules
## set of 246588 rules