This is a study note from course material from IE 583 by Siggi Olafsson at ISU, with some additional material.
library(arules)
library(arulesViz)
library(dplyr)
library(tibble)
Machine Learning | Description |
---|---|
Supervised learning (SML) | The learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative. |
Unsupervised learning (UML) | No labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data. |
Semi-supervised learning | Uses labelled data to inform unsupervised learning on the unlabelled data, in order to identify and annotate new classes in the dataset (also called novelty detection). |
Reinforcement learning | The learning algorithm performs a task using feedback from operating in a real or synthetic environment. |
Association Rule Discovery aims to discover interesting correlations or other relationships between attributes in large databases, expressed in the form \(X \Rightarrow Y\), such as \(\{bread, butter\} \Rightarrow \{milk\}\).
In practice, association rules are useful for analyzing and predicting customer behavior. They play an important part in customer analytics, market basket analysis, product clustering, catalog design and store layout.
Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships.
1. Find all item sets that meet minimum coverage (support).
2. Find all rules that meet minimum accuracy (confidence).
3. (Optional) Find all rules that meet minimum lift.
4. Prune by:
* retaining the more precise rule with smaller/same support, higher/same accuracy, or higher/same confidence;
* retaining the simpler rule with higher accuracy or higher confidence.
* Step 3 can be skipped if we only compare rules with a common rhs, \(Y\), because \(Lift(X \Rightarrow Y) = \frac{Confidence(X \Rightarrow Y)}{P(Y)} \propto Confidence(X \Rightarrow Y)\).
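As a quick illustration of these measures, the following sketch computes support, confidence, and lift for a single rule by hand in base R. The transactions and item names here are made up for illustration only:

```r
# Toy transactions: each element is one basket of items
trans <- list(
  c("bread", "butter", "milk"),
  c("bread", "milk"),
  c("bread", "butter"),
  c("beer"),
  c("bread", "butter", "beer")
)
n <- length(trans)

# Rule under consideration: {bread, butter} => {milk}
lhs <- c("bread", "butter")
rhs <- "milk"

has_lhs  <- sapply(trans, function(t) all(lhs %in% t))           # X present
has_rhs  <- sapply(trans, function(t) all(rhs %in% t))           # Y present
has_both <- sapply(trans, function(t) all(c(lhs, rhs) %in% t))   # X and Y present

support    <- sum(has_both) / n             # P(X and Y) = 1/5 = 0.2
confidence <- sum(has_both) / sum(has_lhs)  # P(Y | X)   = 1/3
lift       <- confidence / (sum(has_rhs) / n)  # (1/3) / (2/5) = 0.833

c(support = support, confidence = confidence, lift = lift)
```

A lift below 1, as here, indicates the lhs and rhs co-occur less often than they would under independence.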
vote <- read.csv("vote.csv")
head(as.tibble(vote))
## # A tibble: 6 x 17
## handicapped_inf… water_project_c… adoption_of_the… physician_fee_f…
## <fct> <fct> <fct> <fct>
## 1 n y n y
## 2 n y n y
## 3 ? y y ?
## 4 n y y n
## 5 y y y n
## 6 n y y n
## # … with 13 more variables: el_salvador_aid <fct>,
## # religious_groups_in_schools <fct>, anti_satellite_test_ban <fct>,
## # aid_to_nicaraguan_contras <fct>, mx_missile <fct>, immigration <fct>,
## # synfuels_corporation_cutback <fct>, education_spending <fct>,
## # superfund_right_to_sue <fct>, crime <fct>, duty_free_exports <fct>,
## # export_administration_act_south_africa <fct>, Class <fct>
The Apriori algorithm, apriori(), was developed in the early 90s and is the most commonly used approach.

apriori() takes parameter = list(target = ...), where target can be:

* target = "frequent itemsets" (default)
* target = "maximally frequent itemsets"
* target = "closed frequent itemsets"
* target = "rules" (only available for Apriori; use ruleInduction() for eclat)
* target = "hyperedgesets" (only available for Apriori; see references for the definition of association hyperedgesets)

plotly_arules(): plots an interactive scatter plot for association rules using plotly.

findRules <- function(rules, c = 0.9, s = 0.35, l = 1.8){
  good_rules = rules[quality(rules)$confidence > c]
  good_rules = good_rules[quality(good_rules)$support > s]
  good_rules = good_rules[quality(good_rules)$lift > l]
  good_rules  # return the filtered rules explicitly
}
rules <- apriori(vote)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 43
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.09s].
## writing ... [482817 rule(s)] done [0.07s].
## creating S4 object ... done [0.12s].
good_rules <- findRules(rules) # custom helper function defined above
plotly_arules(good_rules, c = 0.9, s = 0.35, l = 1.8)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
inspect(head(good_rules, n=10, by = "lift"))
## lhs rhs support confidence lift count
## [1] {el_salvador_aid=y,
## Class=republican} => {physician_fee_freeze=y} 0.3586207 0.9936306 2.441973 156
## [2] {crime=y,
## Class=republican} => {physician_fee_freeze=y} 0.3563218 0.9810127 2.410963 155
## [3] {physician_fee_freeze=y,
## el_salvador_aid=y} => {Class=republican} 0.3586207 0.9285714 2.404337 156
## [4] {physician_fee_freeze=y,
## crime=y} => {Class=republican} 0.3563218 0.9226190 2.388924 155
## [5] {Class=republican} => {physician_fee_freeze=y} 0.3747126 0.9702381 2.384483 163
## [6] {physician_fee_freeze=y} => {Class=republican} 0.3747126 0.9209040 2.384483 163
## [7] {physician_fee_freeze=y,
## mx_missile=n} => {el_salvador_aid=y} 0.3517241 0.9935065 2.038563 153
## [8] {aid_to_nicaraguan_contras=n,
## crime=y} => {el_salvador_aid=y} 0.3770115 0.9879518 2.027165 164
## [9] {religious_groups_in_schools=y,
## aid_to_nicaraguan_contras=n,
## crime=y} => {el_salvador_aid=y} 0.3540230 0.9871795 2.025581 154
## [10] {anti_satellite_test_ban=y,
## aid_to_nicaraguan_contras=y,
## education_spending=n,
## Class=democrat} => {el_salvador_aid=n} 0.3586207 0.9570552 2.001534 156
Alternatively, we can subset the rule output from apriori() by constraining the rhs or the lhs.
# subset for rule with specification on lhs
def_rules <- subset(rules, subset = lhs %in% "Class=democrat")
# subset for rule with specification on rhs
Dem_rules <- subset(rules, subset = rhs %in% "Class=democrat")
good_Dem_rules <- findRules(Dem_rules, c = 0.9, s = 0.43, l = 1.5)
plotly_arules(good_Dem_rules)
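Beyond %in%, arules also provides stricter and looser matching operators for subset(). As a sketch, assuming the rules object generated above:

```r
# %ain% keeps rules whose lhs contains ALL of the listed items
strict_rules <- subset(rules, subset = lhs %ain% c("el_salvador_aid=y", "crime=y"))

# %pin% does partial (substring) matching on item labels,
# e.g. matching both aid_to_nicaraguan_contras=y and aid_to_nicaraguan_contras=n
aid_rules <- subset(rules, subset = lhs %pin% "aid_")
```

%in% matches any of the given items, %ain% requires all of them, and %pin% matches item labels by substring.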
The apriori() function can return frequent itemsets, maximally frequent itemsets, closed itemsets, association rules, or hyperedgesets. By default it returns rules based on frequent itemsets, and unfortunately there doesn't seem to be a way to generate rules based on closed itemsets directly. Instead, you can first generate the closed itemsets using the target parameter, and then generate the corresponding rules with the ruleInduction() function. Note that ruleInduction() requires your dataset reformatted as transaction data. From there, you can work with the rules just the same as before, but the non-closed sets have already been filtered out (about half of them for the vote dataset).
itemsets = apriori(vote, parameter=list(target="closed frequent itemsets"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 closed frequent itemsets FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 43
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 435 transaction(s)] done [0.00s].
## sorting and recoding items ... [36 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [0.11s].
## filtering closed item sets ... done [0.01s].
## writing ... [46577 set(s)] done [0.01s].
## creating S4 object ... done [0.02s].
vote_trans = as(vote, "transactions")
rules = ruleInduction(itemsets, vote_trans, .8, control=list(method="apriori"))
rules
## set of 246588 rules
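The induced rules can then be filtered and inspected exactly like the apriori() output. As a sketch, reusing the findRules() helper and thresholds defined earlier:

```r
# Filter the closed-itemset rules with the same thresholds as before
good_closed_rules <- findRules(rules, c = 0.9, s = 0.35, l = 1.8)

# Inspect the ten highest-lift rules among them
inspect(head(good_closed_rules, n = 10, by = "lift"))
```

Since the non-closed itemsets were removed before rule induction, this pass works over a substantially smaller rule set than the original 482,817 rules.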