jupyter_MLExperiment

Summary

Machine learning experiment

What does a machine learning experiment usually do?

  • search for the best hyper-parameters of a model
  • validate/compare performance and generalization between models (overfitting/underfitting)

How to conduct a machine learning experiment:

  • make comparisons using cross-validation (CV).

How to make a fair comparison among models?

  • use the same data splits to build the validation sets

What toolkit is available?

  • The scikit-learn library provides two main functionalities for machine learning experiments:
    Hyper-parameter optimizers: wrap an estimator, a data splitter, and a parameter grid; used to search for the best set of hyper-parameters of a model
    Model validators: wrap an estimator and a data splitter; used to validate the models

Data streaming pipeline

A pipeline wraps multiple steps, such as data transformations (scaler, dimensionality reduction) and a final estimator, into a single class. It ensures that CV applies the same data transformations and the same estimator to every fold, which keeps the comparison fair. Reference: Pipelines: chaining estimators

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
In [2]:
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# dimensionality reduction
pca = PCA()
# learning algorithm: logistic regression trained with SGD
# (note: loss='log' was renamed to 'log_loss' in newer scikit-learn versions)
logistic = SGDClassifier(loss='log', penalty='l2', early_stopping=True,
                         max_iter=10000, tol=1e-5, random_state=0)
# pipeline
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

print(pipe)
Pipeline(steps=[('pca', PCA()),
                ('logistic',
                 SGDClassifier(early_stopping=True, loss='log', max_iter=10000,
                               random_state=0, tol=1e-05))])

ML Experiment

Hyper-parameter optimizers: select hyper-parameters

optimizer().fit():

  • X,y: data
  • param_grid: parameter grid
  • estimator: class with fit function
  • scoring: evaluation metric
  • cv: Splitter (default 5-fold cross validation)
Hyper-parameter optimizer Description
model_selection.GridSearchCV() Exhaustive search over specified parameter values for an estimator.
model_selection.HalvingGridSearchCV() Search over specified parameter values with successive halving.
model_selection.ParameterGrid() Grid of parameters with a discrete number of values for each.
model_selection.ParameterSampler() Generator on parameters sampled from given distributions.
model_selection.RandomizedSearchCV() Randomized search on hyper-parameters.
model_selection.HalvingRandomSearchCV() Randomized search on hyper-parameters with successive halving.
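
Before running the search, it can help to see how a parameter grid expands into individual candidate settings. A minimal sketch (the two-value grid here is illustrative only):

from sklearn.model_selection import ParameterGrid

# every dict produced below is one candidate combination of hyper-parameters
small_grid = {'pca__n_components': [5, 20], 'logistic__alpha': [1e-4, 1e-2]}
for candidate in ParameterGrid(small_grid):
    print(candidate)   # 2 x 2 = 4 candidate settings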
In [3]:
%time
# Note: the line magic %time above only times an empty statement; use the cell magic %%time to time the whole cell.
# Parameters of pipelines can be set using '__'-separated parameter names:
param_grid = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': np.logspace(-4, 4, 5),
}
cv = 5
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='recall_macro', n_jobs=cv)
search.fit(X_digits, y_digits)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.01 µs
Best parameter (CV score=0.919):
{'logistic__alpha': 0.01, 'pca__n_components': 50}
In [4]:
d = {'mean_test_score': search.cv_results_.get('mean_test_score'),
     'std_test_score': search.cv_results_.get('std_test_score'),
     'rank_test_score': search.cv_results_.get('rank_test_score')}
pd.DataFrame(data=search.cv_results_.get('params')).join(pd.DataFrame(data=d))
Out[4]:
logistic__alpha pca__n_components mean_test_score std_test_score rank_test_score
0 0.0001 5 0.716044 0.052309 23
1 0.0001 20 0.887411 0.035888 15
2 0.0001 30 0.903600 0.044263 9
3 0.0001 40 0.895183 0.039510 11
4 0.0001 50 0.891895 0.035644 14
5 0.0001 64 0.894120 0.043411 12
6 0.0100 5 0.768088 0.031506 16
7 0.0100 20 0.903073 0.033513 10
8 0.0100 30 0.908520 0.026920 4
9 0.0100 40 0.913647 0.027858 3
10 0.0100 50 0.919154 0.023918 1
11 0.0100 64 0.918032 0.025162 2
12 1.0000 5 0.753747 0.037280 22
13 1.0000 20 0.892815 0.035741 13
14 1.0000 30 0.903896 0.037860 8
15 1.0000 40 0.906120 0.035376 5
16 1.0000 50 0.906104 0.036751 6
17 1.0000 64 0.906104 0.036751 6
18 100.0000 5 0.627318 0.109118 24
19 100.0000 20 0.764470 0.080391 21
20 100.0000 30 0.766754 0.080410 17
21 100.0000 40 0.766199 0.080017 18
22 100.0000 50 0.765643 0.080739 19
23 100.0000 64 0.765643 0.080739 19
24 10000.0000 5 0.355885 0.314470 30
25 10000.0000 20 0.390485 0.356177 29
26 10000.0000 30 0.391597 0.357540 28
27 10000.0000 40 0.392168 0.358197 25
28 10000.0000 50 0.392168 0.358197 25
29 10000.0000 64 0.392168 0.358197 25
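
GridSearchCV evaluated every one of the 30 combinations above. For larger grids, RandomizedSearchCV from the optimizer table samples a fixed number of candidates instead; a minimal sketch reusing pipe (the distribution and n_iter are illustrative, not tuned):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'pca__n_components': [5, 20, 30, 40, 50, 64],
    'logistic__alpha': loguniform(1e-4, 1e2),   # sample alpha on a log scale
}
# evaluate only 10 sampled candidates instead of the full grid
rand_search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=5,
                                 scoring='recall_macro', random_state=0)
rand_search.fit(X_digits, y_digits)
print(rand_search.best_params_, rand_search.best_score_)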

Model validator: check model

validator():

  • X,y: data
  • estimator: class with fit function
  • scoring: evaluation metric
  • cv: Splitter (default 5-fold cross validation)
Model validator Validates
sklearn.model_selection.cross_val_score() a single metric across CV folds
sklearn.model_selection.cross_validate() one or more metrics across CV folds (optionally returns fit times and fitted estimators)
sklearn.model_selection.validation_curve() scores over a range of values of one hyper-parameter
sklearn.model_selection.learning_curve() scores over increasing training-set sizes
In [5]:
%time
from sklearn.model_selection import cross_validate
from sklearn.metrics import recall_score
scoring = ['precision_macro', 'recall_macro']
cv=5
scores = cross_validate(pipe, X_digits, y_digits, scoring=scoring, cv=cv, return_estimator=True, n_jobs=cv)
CPU times: user 13 µs, sys: 34 µs, total: 47 µs
Wall time: 7.15 µs
In [6]:
for x in scoring:
    print(x, ":",
          np.mean(scores.get('test_' + x)).round(4),
          '+/-', np.std(scores.get('test_' + x)).round(4))
precision_macro : 0.901 +/- 0.039
recall_macro : 0.8941 +/- 0.0434
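
cross_validate covers the first two rows of the validator table; the curve-style validators are not demonstrated in this notebook, so here is a minimal sketch of validation_curve with the same pipe and data (the parameter range is illustrative):

from sklearn.model_selection import validation_curve

# score the pipeline while varying a single hyper-parameter, everything else fixed
param_range = [5, 20, 30, 40, 50, 64]
train_scores, valid_scores = validation_curve(
    pipe, X_digits, y_digits,
    param_name='pca__n_components', param_range=param_range,
    cv=5, scoring='recall_macro')
print(valid_scores.mean(axis=1).round(4))   # mean CV score per n_components value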

Cross-validation API

Splitter Classes

Splitting strategies for non-time-series data (a short usage sketch follows the tables below):

  • train_test_split: split the dataset once into a training set and a test set
  • KFold: divide the dataset into a prespecified number of folds
  • ShuffleSplit: randomly draw train/test splits from the entire dataset, with a prespecified test_size and train_size

Two categories of sampling:

  • Stratified sampling: elements are sampled within each stratum (class); aims to increase precision and reduce error
  • Cluster (group) sampling: only selected clusters (groups) are sampled; aims to reduce cost and increase sampling efficiency

One splitter for time-series data:

  • TimeSeriesSplit: folds respect temporal order, so training indices always precede test indices

Less-frequently-used Splitter Classes Description
model_selection.LeaveOneGroupOut() Leave One Group Out cross-validator
model_selection.LeavePGroupsOut() Leave P Group(s) Out cross-validator
model_selection.LeaveOneOut() Leave-One-Out cross-validator
model_selection.LeavePOut() Leave-P-Out cross-validator
model_selection.PredefinedSplit() Predefined split cross-validator
model_selection.RepeatedKFold() Repeated K-Fold cross validator.
model_selection.RepeatedStratifiedKFold() Repeated Stratified K-Fold cross validator.
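
As mentioned above, any of these splitter objects can be passed as the cv argument of the optimizers and validators. A minimal sketch with three common splitters (TimeSeriesSplit is only meaningful when rows are time-ordered, which the digits data is not; it is shown purely for the API):

from sklearn.model_selection import KFold, StratifiedKFold, TimeSeriesSplit, cross_val_score

splitters = [KFold(n_splits=5, shuffle=True, random_state=0),
             StratifiedKFold(n_splits=5),    # preserves class proportions in each fold
             TimeSeriesSplit(n_splits=5)]    # training indices always precede test indices
for splitter in splitters:
    scores = cross_val_score(pipe, X_digits, y_digits, cv=splitter, scoring='recall_macro')
    print(type(splitter).__name__, scores.mean().round(4))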

Model specific cross-validation API

method description
linear_model.ElasticNetCV([l1_ratio, eps, …]) Elastic Net model with iterative fitting along a regularization path
linear_model.LarsCV([fit_intercept, …]) Cross-validated Least Angle Regression model
linear_model.LassoCV([eps, n_alphas, …]) Lasso linear model with iterative fitting along a regularization path
linear_model.LassoLarsCV([fit_intercept, …]) Cross-validated Lasso, using the LARS algorithm
linear_model.LogisticRegressionCV([Cs, …]) Logistic Regression CV (aka logit, MaxEnt) classifier.
linear_model.MultiTaskElasticNetCV([…]) Multi-task L1/L2 ElasticNet with built-in cross-validation.
linear_model.MultiTaskLassoCV([eps, …]) Multi-task L1/L2 Lasso with built-in cross-validation.
linear_model.OrthogonalMatchingPursuitCV([…]) Cross-validated Orthogonal Matching Pursuit model (OMP)
linear_model.RidgeCV([alphas, …]) Ridge regression with built-in cross-validation.
linear_model.RidgeClassifierCV([alphas, …]) Ridge classifier with built-in cross-validation.
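
These estimators cross-validate their own regularization parameter while fitting, which is usually faster than wrapping the plain estimator in GridSearchCV. A minimal sketch with RidgeCV on the diabetes regression data (chosen here only as an example):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV

X_reg, y_reg = load_diabetes(return_X_y=True)
# the candidate alphas are cross-validated internally during fit()
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X_reg, y_reg)
print(ridge.alpha_)   # the selected regularization strength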

Score API

Classification score

Scoring Function Comment
'accuracy' metrics.accuracy_score
'balanced_accuracy' metrics.balanced_accuracy_score
'top_k_accuracy' metrics.top_k_accuracy_score
'average_precision' metrics.average_precision_score
'neg_brier_score' metrics.brier_score_loss
'f1' metrics.f1_score for binary targets
'f1_micro' metrics.f1_score micro-averaged
'f1_macro' metrics.f1_score macro-averaged
'f1_weighted' metrics.f1_score weighted average
'f1_samples' metrics.f1_score by multilabel sample
'neg_log_loss' metrics.log_loss requires predict_proba support
'precision' etc. metrics.precision_score suffixes apply as with 'f1'
'recall' etc. metrics.recall_score suffixes apply as with 'f1'
'jaccard' etc. metrics.jaccard_score suffixes apply as with 'f1'
'roc_auc' metrics.roc_auc_score
'roc_auc_ovr' metrics.roc_auc_score
'roc_auc_ovo' metrics.roc_auc_score
'roc_auc_ovr_weighted' metrics.roc_auc_score
'roc_auc_ovo_weighted' metrics.roc_auc_score

Regression score

Scoring Function
'explained_variance' metrics.explained_variance_score
'max_error' metrics.max_error
'neg_mean_absolute_error' metrics.mean_absolute_error
'neg_mean_squared_error' metrics.mean_squared_error
'neg_root_mean_squared_error' metrics.mean_squared_error
'neg_mean_squared_log_error' metrics.mean_squared_log_error
'neg_median_absolute_error' metrics.median_absolute_error
'r2' metrics.r2_score
'neg_mean_poisson_deviance' metrics.mean_poisson_deviance
'neg_mean_gamma_deviance' metrics.mean_gamma_deviance
'neg_mean_absolute_percentage_error' metrics.mean_absolute_percentage_error

Clustering score

Scoring Function
'adjusted_mutual_info_score' metrics.adjusted_mutual_info_score
'adjusted_rand_score' metrics.adjusted_rand_score
'completeness_score' metrics.completeness_score
'fowlkes_mallows_score' metrics.fowlkes_mallows_score
'homogeneity_score' metrics.homogeneity_score
'mutual_info_score' metrics.mutual_info_score
'normalized_mutual_info_score' metrics.normalized_mutual_info_score
'rand_score' metrics.rand_score
'v_measure_score' metrics.v_measure_score
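
Each string above can be passed directly as the scoring argument of the optimizers and validators, and custom metrics can be turned into scorers with make_scorer. A minimal sketch reusing pipe and the digits data:

from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import cross_val_score

# a built-in scoring string from the classification table
print(cross_val_score(pipe, X_digits, y_digits, cv=5, scoring='balanced_accuracy').mean())

# the same mechanism with a custom scorer built from a metric function
weighted_f1 = make_scorer(f1_score, average='weighted')
print(cross_val_score(pipe, X_digits, y_digits, cv=5, scoring=weighted_f1).mean())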

In [ ]: