Use of this document

This is a study note from the course material for IE 583 by Siggi Olafsson at ISU, with some additional material.

0. Prerequisites

0.1 Four Types of Machine Learning

Four Types of Machine Learning

  • Supervised learning (SML): the learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative, and 2) regression, where the output is quantitative.
  • Unsupervised learning (UML): no labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data.
  • Semi-supervised learning: uses labelled data to inform unsupervised learning on the unlabelled data, in order to identify and annotate new classes in the dataset (also called novelty detection).
  • Reinforcement learning: the learning algorithm performs a task using feedback from operating in a real or synthetic environment.

1 Development of Support Vector Machine

In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.

Category                         Maximum Margin Classifier   Support Vector Classifier   Support Vector Machine
Multi-dimensional data           Yes                         Yes                         Yes
Allows misclassification         No                          Yes                         Yes
Non-linear boundaries            No                          No                          Yes

Check reference for the classification of SVM.

1.1 Maximum Margin Classifier

The maximum margin classifier is the simplest way to divide your data if it is linearly separable (by a line, or by a hyperplane in multi-dimensional space). For obvious reasons it cannot be applied to data where no clear linear boundary is possible.

1.2 Support Vector Classifier

The support vector classifier is an extension of the maximum margin classifier in which some misclassification is allowed.

1.3 Support Vector Machine

The support vector machine (SVM) is a further extension of the SVC that accommodates non-linear boundaries; it is therefore also called a non-linear classifier, in contrast to the linear classifiers (the support vector classifier and the maximum margin classifier). Although there is a clear distinction between these definitions, people commonly call all of them SVMs to avoid complication.

Note: a linear SVM is a special case of a neural network with no hidden layers and the hinge loss as its loss function.

2 Theory

Suppose we are given a training dataset of \(n\) points of the form \((\vec{x}_1,y_1), ..., (\vec{x}_n,y_n)\), where each \(y_i\) is either 1 or -1, indicating the class to which the point \(\vec{x_i}\) belongs.

2.1 Linear function

In the discussion of SVM, we need to clarify the concept of a linear function to avoid confusion. In mathematics, the term linear function refers to two distinct but related notions:

  • In calculus and related areas, a linear function is a function whose graph is a straight line, that is, a polynomial function of degree one or zero.
    • Hyperplane: in an N-dimensional space, a hyperplane is a flat subspace of dimension N-1. For example:
      • \(\beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 = 0\) for a 2-dimensional space.
      • \(\beta_0 + \beta_1 \cdot X_1 + \beta_2 \cdot X_2 + ... + \beta_N \cdot X_N = 0\) for an N-dimensional space.
  • In linear algebra, mathematical analysis, and functional analysis, a linear function is a linear map.
    • Projection with an inner product: \(\vec{y}\) is projected onto the vector space \(V\) spanned by an orthogonal basis \(\{\vec{u}_1, ..., \vec{u}_k\}\) (see the numerical sketch after this list):
      • \(proj_V\vec{y}=\sum_{i=1}^{k}\frac{\vec{y} \cdot \vec{u}_i}{\vec{u}_i \cdot \vec{u}_i} \vec{u}_i\)
      • Linear kernel: an implicit linear mapping from the input vector space to its inner product space.
      • Non-linear kernel: an implicit non-linear mapping from the input vector space to its inner product space.
    • Projection matrix: a linear map is a function \(f\) between two vector spaces that preserves vector addition and scalar multiplication; the projection matrix is one example:
      • \(P=A(B^TA)^{-1}B^T\)
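
To make the projection formula above concrete, here is a minimal NumPy sketch (the vectors are made-up illustrative values, not from the course material) that projects \(\vec{y}\) onto the plane spanned by an orthogonal basis and checks that the residual is orthogonal to that plane.

```python
import numpy as np

# Made-up example vectors; u1 and u2 form an orthogonal basis of a plane V in R^3.
y = np.array([3.0, 1.0, 2.0])
u1 = np.array([1.0, 0.0, 1.0])
u2 = np.array([1.0, 0.0, -1.0])

# proj_V(y) = sum_i (y . u_i) / (u_i . u_i) * u_i  (valid because u1 . u2 = 0)
proj = (y @ u1) / (u1 @ u1) * u1 + (y @ u2) / (u2 @ u2) * u2
residual = y - proj

print(proj)                          # [3. 0. 2.]
print(residual @ u1, residual @ u2)  # both 0: the residual is orthogonal to V
```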

Check Reference for definition of linear function.
Check Reference for definition of Projection in linear algebra.
Check Reference for Kernel method.

2.2 Hyperplane

Figure: Orthogonalization

We want to find the maximum-margin hyperplane that divides the group of points \(\vec{x_i}\) for which \(y_i=1\) from the group of points for which \(y_{i}=-1\), defined so that the margin, i.e. the distance between the hyperplane and the nearest point \(\vec{x_i}\) from either group, is maximized. In the case of support-vector machines, a data point is viewed as a \(p\)-dimensional vector (a list of \(p\) numbers), and we want to know whether we can separate such points with a \((p-1)\)-dimensional hyperplane. The following facts make this precise (a quick numerical check follows the list):

  1. In linear algebra and functional analysis, a projection is a linear transformation P from a vector space to itself. The projection of \(\vec{v}\) onto the direction of \(\vec{\omega}\) can be expressed as a scalar multiple of the unit vector \(\vec{u}\) along that direction: \[proj_{\vec{\omega}}\vec{v} = c \cdot \vec{u}\]
  2. The dot product of perpendicular vectors is always zero, so the residual of the projection satisfies: \[(\vec{v} - proj_{\vec{\omega}}\vec{v}) \cdot \vec{\omega}= 0\]
  3. Any hyperplane can be written as the set of points \(\vec{x}\) that, in terms of a normal vector \(\vec{\omega}\), satisfy: \[\vec{\omega} \cdot \vec{x} - b = 0\]
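
As a quick sanity check of the hyperplane equation, the sketch below (with made-up values for \(\vec{\omega}\) and \(b\)) constructs points on the hyperplane \(\vec{\omega} \cdot \vec{x} - b = 0\) and confirms that any direction lying inside the hyperplane is orthogonal to the normal vector \(\vec{\omega}\).

```python
import numpy as np

# Made-up normal vector and offset for the hyperplane w . x - b = 0 in R^3.
w = np.array([2.0, -1.0, 1.0])
b = 3.0

def point_on_hyperplane(x1, x2):
    """Solve w . x - b = 0 for the third coordinate, given the first two."""
    x3 = (b - w[0] * x1 - w[1] * x2) / w[2]
    return np.array([x1, x2, x3])

p1 = point_on_hyperplane(0.0, 0.0)
p2 = point_on_hyperplane(1.0, 4.0)

print(w @ p1 - b, w @ p2 - b)  # both 0: the points lie on the hyperplane
print(w @ (p2 - p1))           # 0: a direction inside the hyperplane is orthogonal to w
```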

2.3 Margin and Loss

During the model learning phase, the SVM aims to maximize the width of the gap between the two classes, bounded by two parallel hyperplanes. The region bounded by these two hyperplanes is called the margin, and the hyperplane that lies halfway between them is called the maximum-margin hyperplane. It is completely determined by those \(\vec{x}_i\) that lie nearest to it. These \(\vec{x_i}\) are called support vectors.

Comparison of the linearly separable and linearly inseparable cases (a small numerical illustration of the two loss functions follows the figure below):

  • Normalized loss function: \(\ell(y) \in \{0,\ 1\}\) (separable); \(\ell(y) = \max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}\) (inseparable).
  • Type of margin: hard margin (separable); soft margin (inseparable).
  • Allowance of misclassification: data must lie on the correct side of the margin (separable); data can lie on the wrong side of the margin, with a hinge loss that increases linearly with the one-sided error (inseparable).
  • Optimization problem, separable case: \(\begin{array}{l}Minimize\ \|\vec{\omega}\|\\s.t.\ y_i(\vec{\omega} \cdot \vec{x_i} - b) \ge 1\ with\ 1 \le i \le n\end{array}\)
  • Optimization problem, inseparable case: \(\begin{array}{l}Minimize\ [\frac{1}{n}\sum_{i=1}^n \max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}] +\lambda \|\vec{\omega}\|^2\\s.t.\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \in \Re,\ for\ all\ 1 \le i \le n\end{array}\)
Figure: Maximum-margin hyperplane (left); Hinge loss vs zero-loss (right)
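
To illustrate the two loss functions compared above, here is a minimal NumPy sketch (with made-up values for \(\vec{\omega}\), \(b\), and the data) evaluating the zero-one loss and the hinge loss on a few points.

```python
import numpy as np

# Made-up linear classifier and labelled points for illustration.
w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[ 2.0,  2.0],   # comfortably on the correct side
              [ 0.2,  0.1],   # correct side, but inside the margin
              [-1.0, -1.0]])  # wrong side for label +1
y = np.array([1.0, 1.0, 1.0])

scores = y * (X @ w - b)                # y_i (w . x_i - b)
zero_one = (scores < 0).astype(float)   # zero-one loss: counts only misclassifications
hinge = np.maximum(0.0, 1.0 - scores)   # hinge loss: also penalizes margin violations

print(zero_one)  # [0. 0. 1.]
print(hinge)     # [0.  0.7 3. ]
```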

2.3.1 Hard Margin when linearly separable

If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data while the distance between the hyperplanes is maximized. (A small scikit-learn sketch follows the table below.)

Theory of Hard Margin SVM
Step Mathematical expression
1. The loss always equals 0, because no point is misclassified: \(\ell(y_i;\vec{x_i},\vec{\omega},b) = \left\{\begin{array}{l}1,\ if\ y_i (\vec{\omega} \cdot \vec{x_i}-b)<0\\0,\ if\ y_i (\vec{\omega} \cdot \vec{x_i}-b) \ge 0\end{array}\right.=0,\ for\ y_i (\vec{\omega} \cdot \vec{x_i}-b) \ge 0\)
2. With a normalized dataset, the minimal normalized value of \(\vec{\omega} \cdot \vec{x_i} - b\) for points labelled \(y_i = 1\) (on or above the upper hyperplane) is 1, and for points labelled \(y_i = -1\) (on or below the lower hyperplane) it is -1; hence the term hard margin. The corresponding hyperplanes can be described by the equations: \(\vec{\omega} \cdot \vec{x} - b = \pm Hard\ Margin = \pm 1\)
3. Since the data is separable, each vector \(\vec{x_i}\) must lie on the correct side of its hyperplane, so its functional margin is at least one: \(Constraint:\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1-0,\ for\ all\ 1 \le i \le n\)
4. Geometrically, since the normalized margin is 1, the distance between these two hyperplanes is twice the distance from the maximum-margin hyperplane to either of them (distance from a point to a plane): \(\begin{array}{l} Maximize\ d= 2 \times Margin = \frac{\|2\vec{\omega}\|}{\|\vec{\omega}\|^2}=\frac{2}{\|\vec{\omega}\|} \\\Leftrightarrow Minimize\ \|\vec{\omega}\|\end{array}\)
5. We solve the optimization problem of maximizing the distance between the hyperplanes subject to the hard margin constraint: \(\begin{array}{l} Minimize\ \|\vec{\omega}\|\\s.t.\ y_i(\vec{\omega} \cdot \vec{x_i} - b) \ge 1-0\ with\ 1 \le i \le n\end{array}\)
6. By solving this problem, the resulting \(\vec{\omega}\) and \(b\) determine the classifier via the sign function, sgn: \(\vec{x} \mapsto sgn(\vec{\omega} \cdot \vec{x} -b)\)
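
A minimal sketch of the hard-margin case, assuming scikit-learn is available: a made-up linearly separable toy dataset is fit with a linear-kernel SVC using a very large C (which approximates the hard-margin problem), and the learned \(\vec{\omega}\) and intercept are plugged into the sign rule of step 6. Note that scikit-learn parameterizes its decision function as \(\vec{\omega} \cdot \vec{x} + b\) rather than \(\vec{\omega} \cdot \vec{x} - b\).

```python
import numpy as np
from sklearn.svm import SVC

# Tiny made-up linearly separable dataset.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C approximates the hard margin: minimize ||w|| s.t. y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w = clf.coef_[0]           # learned normal vector omega
b = clf.intercept_[0]      # scikit-learn's decision function is w . x + b
pred = np.sign(X @ w + b)  # sign rule from step 6

print(w, b)
print(pred)                # matches y, since the data is separable
```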

2.3.2 Soft Margin when linearly inseparable

If the training data is linearly inseparable, we can select two parallel hyperplanes that separate the two classes of data while allowing some misclassification, while the distance between the hyperplanes is maximized. (A small NumPy sketch of minimizing the resulting objective follows the table below.)

Theory of Soft Margin SVM
Step Mathematical expression
1. For data on the wrong side of the margin, the loss is proportional to the distance from the margin. Therefore, the loss is defined by the hinge loss function: \(\ell(y_i;\vec{x_i},\vec{\omega},b) = \max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}\)
2. With a normalized or standardized dataset, to allow the two parallel hyperplanes to misclassify some outliers, a soft margin penalizes any point that falls on the wrong side of its hyperplane. These hyperplanes can be described by the equations with soft margin: \(\vec{\omega} \cdot \vec{x} - b = \pm Soft\ Margin \simeq \pm 1\)
3. Since the data is inseparable, the quantity \(y_i (\vec{\omega} \cdot \vec{x_i} - b)\) can be any real number, so the constraint is relaxed by the hinge loss: \(Constraint:\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1 -\ell(y_i;\vec{x_i},\vec{\omega},b),\ for\ all\ 1 \le i \le n\)
4. Knowing that the margin width \(d\) is inversely proportional to \(\|\vec{\omega}\|\), we need to balance maximizing the margin between the two classes against minimizing the amount of misclassification. So we introduce a regularization parameter \(\lambda\) that controls the degree of importance given to misclassifications. The loss function then becomes: \(\begin{array}{l} Maximize\ d \propto \frac{1}{\|\vec{\omega}\|^{*}} \Leftrightarrow\\ Minimize\ \|\vec{\omega}\|^*=[\frac{1}{n}\sum_{i=1}^n \ell(y_i;\vec{x_i},\vec{\omega},b)] +\lambda \|\vec{\omega}\|^2 \end{array}\)
5. We maximize the distance between the hyperplanes by minimizing the loss function: \(\begin{array}{l}Minimize\ \|\vec{\omega}\|^*=[\frac{1}{n}\sum_{i=1}^n \ell(y_i;\vec{x_i},\vec{\omega},b)] +\lambda \|\vec{\omega}\|^2\\s.t.\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1- \ell(y_i;\vec{x_i},\vec{\omega},b),\ for\ all\ 1 \le i \le n\end{array}\)
6. With a given \(\lambda\), the resulting \(\vec{\omega}\) and \(b\) determine the classifier via the sign function, sgn: \(\vec{x} \mapsto sgn_\lambda(\vec{\omega} \cdot \vec{x} -b)\)
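
A minimal sketch, assuming NumPy, of minimizing the soft-margin objective \(\frac{1}{n}\sum_{i=1}^n \max\{0, 1-y_i(\vec{\omega} \cdot \vec{x_i} - b)\} + \lambda\|\vec{\omega}\|^2\) by plain subgradient descent; the dataset, step size, and iteration count are made-up illustrative choices, not from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up, slightly overlapping two-class data.
X = np.vstack([rng.normal(loc=1.5, size=(50, 2)),
               rng.normal(loc=-1.5, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

lam, lr, n = 0.01, 0.1, len(y)
w, b = np.zeros(2), 0.0

for _ in range(2000):
    margins = y * (X @ w - b)
    active = margins < 1                       # points with non-zero hinge loss
    # Subgradient of (1/n) * sum_i hinge_i + lam * ||w||^2
    grad_w = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
    grad_b = y[active].sum() / n               # d/db of -y_i (w . x_i - b) is +y_i
    w -= lr * grad_w
    b -= lr * grad_b

objective = np.maximum(0, 1 - y * (X @ w - b)).mean() + lam * (w @ w)
accuracy = (np.sign(X @ w - b) == y).mean()
print(w, b, objective, accuracy)
```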

Check reference for the regularization parameter, \(\lambda\).

2.3.2.1 Primal problem and Dual problem

The primal problem can be solved via the Lagrangian function, whereas the dual problem can be solved via the Lagrangian dual function. The Lagrangian dual problem is obtained by forming the Lagrangian of a minimization problem by using nonnegative Lagrange multipliers to add the constraints to the objective function, and then solving for the primal variable values that minimize the original objective function. This solution gives the primal variables as functions of the Lagrange multipliers, which are called dual variables, so that the new problem is to maximize the objective function with respect to the dual variables under the derived constraints on the dual variables (including at least the nonnegativity constraints). The primal and dual problems have the following relation:

  • Weak Duality Theorem: primal optimal \(\le\) dual optimal
  • Strong Duality Theorem: primal optimal \(=\) dual optimal, if either the primal or the dual has a finite optimum.

The SVM problem satisfies the Strong Duality Theorem. It is difficult to solve the primal problem of the soft-margin SVM directly, because \(\vec{\omega}\) is tied up with the non-differentiable hinge loss \(\ell\). Therefore, we rewrite it as the Lagrangian dual problem (a numerical sketch of solving the dual follows the table below).

Step Mathematical expression
1. To explicitly represent \(\vec{\omega}\) with \(c_i\) for all \(i\), rewrite the Lagrangian function with \(\vec{\omega}= \sum_{i=1}^n c_iy_i\vec{x_i}\) into the Lagrangian dual function. The optimization problem over the vector of unknown variables \(\vec{c}\) can then be solved with quadratic programming: \(\begin{array}{l} maximize\ f(c_1...c_n) = \sum_{i=1}^n c_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_ic_i(x_i \cdot x_j)y_jc_j \\ s.t.\ \sum_{i=1}^n c_i y_i=0,\ and\ 0 \le c_i \le \frac{1}{2n\lambda}\ for\ all\ i \end{array}\)
2. Plug \(\vec{c}\) in and solve for: \(\vec{\omega}= \sum_{i=1}^n c_iy_i\vec{x_i}\); the offset \(b\) can be recovered from any support vector \(\vec{x_i}\) on the margin boundary (i.e., with \(0 < c_i < \frac{1}{2n\lambda}\)) via \(b = \vec{\omega} \cdot \vec{x_i} - y_i\)
3. Finally, new points can be classified by computing \(\vec{x} \mapsto sgn_\lambda(\vec{\omega} \cdot \vec{x} -b)\)

Check reference for duality. Check reference for quadratic programming.
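
The dual problem above is a standard quadratic program. Below is a minimal sketch of setting it up for a generic QP solver, assuming the cvxopt package is available (the data and \(\lambda\) are made-up); since cvxopt minimizes \(\frac{1}{2}\vec{c}^T P \vec{c} + q^T \vec{c}\), the dual maximization is negated.

```python
import numpy as np
from cvxopt import matrix, solvers

# Made-up two-class data and regularization parameter.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 1.0, (20, 2)), rng.normal(-1.5, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
n, lam = len(y), 0.01
C = 1.0 / (2 * n * lam)                 # box constraint 0 <= c_i <= 1/(2 n lambda)

# Dual: maximize sum(c) - 1/2 c^T (y y^T * X X^T) c
#  <=>  minimize 1/2 c^T P c + q^T c   with P = y y^T * X X^T, q = -1
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))    # -c <= 0 and c <= C
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))                      # equality constraint: sum_i c_i y_i = 0
b_eq = matrix(0.0)

solvers.options["show_progress"] = False
c = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

w = (c * y) @ X                                   # w = sum_i c_i y_i x_i
on_margin = (c > 1e-6) & (c < C - 1e-6)           # support vectors on the margin boundary
b = np.mean(X[on_margin] @ w - y[on_margin])      # b = w . x_i - y_i on the margin
print(w, b)
```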

2.4 Kernel Method

The kernel trick is applied to create nonlinear classifiers from maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.

The following are some common kernels; \(r\), \(d\), and \(\gamma\) are kernel parameters, and \(\vec{x_i}\), \(\vec{x_j}\) are arbitrary data points from the input dataset (a small sketch implementing these kernels follows the list).

  • Linear Kernel: \(k(\vec{x_i},\vec{x_j}) = \vec{x_i} \cdot \vec{x_j}\)
  • Polynomial kernel: \(k(\vec{x_i},\vec{x_j}) = (\gamma \vec{x_i} \cdot \vec{x_j} + r)^d, \gamma >0\)
  • Radial basis function (RBF) kernel: \(k(\vec{x_i},\vec{x_j}) = exp(-\frac{{ \|\vec{x_i}-\vec{x_j}\|}^2}{2\sigma^2})\), which in simplified form can be written as \(exp(-\gamma \|\vec{x_i} - \vec{x_j}\|^2), \gamma > 0\)
  • Sigmoid Kernel: \(k(\vec{x_i},\vec{x_j}) = tanh(\gamma \vec{x_i} \cdot \vec{x_j} + r)\)
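
A minimal NumPy sketch of the kernels listed above (the parameter values are arbitrary illustrative choices):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, gamma=1.0, r=1.0, d=3):
    return (gamma * (xi @ xj) + r) ** d

def rbf_kernel(xi, xj, gamma=0.5):        # gamma = 1 / (2 * sigma^2)
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

def sigmoid_kernel(xi, xj, gamma=0.1, r=0.0):
    return np.tanh(gamma * (xi @ xj) + r)

xi, xj = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj),
      rbf_kernel(xi, xj), sigmoid_kernel(xi, xj))
```
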
Step Mathematical expression
1. Same as solving the Lagrangian dual problem above, but using the transformed data vectors \(\varphi(\vec{x})\) instead of the original \(\vec{x}\): \(\begin{array}{l}\begin{aligned} maximize\ f(c_1...c_n) & = \sum_{i=1}^n c_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i c_i(\varphi(\vec{x_i}) \cdot \varphi(\vec{x_j}))y_jc_j \\ & = \sum_{i=1}^n c_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i c_i k(\vec{x_i}, \vec{x_j}) y_j c_j \\ \end{aligned} \\ s.t.\ \sum_{i=1}^n c_i y_i=0,\ and\ 0 \le c_i \le \frac{1}{2n\lambda}\ for\ all\ i \end{array}\)
2. Plug \(\vec{c}\) in and solve for the classification vector \(\vec{\omega}\) in the transformed space: \(\vec{\omega}=\sum_{i=1}^n c_i y_i \varphi(\vec{x_i})\)
3. Finally, new points can be classified by computing \(\vec{z} \mapsto sgn_\lambda(\vec{\omega} \cdot \varphi(\vec{z}) -b) =sgn([\sum_{i=1}^n c_i y_i k(\vec{x_i},\vec{z})]-b)\)
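
A minimal sketch of the kernelized decision rule in step 3, assuming scikit-learn: an RBF-kernel SVC is fit to made-up data, and its decision function is reproduced by hand from the stored support vectors and dual coefficients (scikit-learn stores \(c_i y_i\) in dual_coef_ and uses a \(+b\) intercept convention).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(1.0, 1.0, (30, 2)), rng.normal(-1.0, 1.0, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Reproduce the decision rule: sum_i c_i y_i k(x_i, z) + b over the support vectors.
Z = rng.normal(0.0, 1.0, (5, 2))                    # new points to classify
K = rbf_kernel(clf.support_vectors_, Z, gamma=0.5)  # k(x_i, z)
by_hand = clf.dual_coef_ @ K + clf.intercept_       # dual_coef_ holds c_i * y_i

print(np.allclose(by_hand.ravel(), clf.decision_function(Z)))  # True
print(np.sign(by_hand).ravel())                                # predicted classes
```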

Check Reference for kernel method

2.4.1 Kernel Function

A kernel function calculates the dot product of two data points in a higher-dimensional feature space. The following derivation explains why the kernel function alone is enough to recover the geometric properties, angle, and distance of two arbitrary data points in some higher-dimensional feature space, without knowing the transformation function \(\varphi()\):

  • The kernel function \(k\) must be finitely positive semi-definite, meaning that the corresponding kernel matrix \(M_k\) is positive semi-definite (written \(M_k \ge 0\)):

\[M_k = \left[\begin{array}{rrr} k(\vec{x}_1, \vec{x}_1) & ...& k(\vec{x}_1, \vec{x}_n)\\ ...&...&...\\ k(\vec{x}_n, \vec{x}_1) & ...& k(\vec{x}_n, \vec{x}_n)\\ \end{array}\right] \ge 0\]

  • dot product of two vector points in mapped feature space via \(\varphi()\):

\[\begin{aligned} \langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle & = \langle \vec{z}_i,\vec{z}_j \rangle \\ & = \left[\begin{array}{rrr}z_{i1} & z_{i2} & z_{i3} \end{array}\right] \cdot \left[\begin{array}{r}z_{j1} \\z_{j2} \\ z_{j3}\end{array}\right] \\ & = k(\vec{x_i},\vec{x_j})\\ \end{aligned}\]

  • distance of two vector points in mapped feature space via \(\varphi()\):

\[\begin{aligned} \|\varphi(\vec{x_i})-\varphi(\vec{x_j})\|^2 & = (\varphi(\vec{x_i})-\varphi(\vec{x_j}))^T(\varphi(\vec{x_i})-\varphi(\vec{x_j})) \\ & =(\varphi(\vec{x_i})^T-\varphi(\vec{x_j})^T)(\varphi(\vec{x_i})-\varphi(\vec{x_j})) \\ & = \varphi(\vec{x_i})^T\varphi(\vec{x_i})-2\varphi(\vec{x_i})^T\varphi(\vec{x_j})+\varphi(\vec{x_j})^T\varphi(\vec{x_j}) \\ & = \langle \varphi(\vec{x_i}),\varphi(\vec{x_i}) \rangle -2\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle+\langle \varphi(\vec{x_j}),\varphi(\vec{x_j}) \rangle \\ & = k(\vec{x_i},\vec{x_i})-2k(\vec{x_i},\vec{x_j})+k(\vec{x_j},\vec{x_j}) \\ \end{aligned}\]

  • angle of two vector points in mapped feature space via \(\varphi()\):

\[\begin{aligned} \langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle & = \|\varphi(\vec{x_i})\| \cdot \|\varphi(\vec{x_j})\| \cos\theta \\ \Rightarrow \cos\theta & = \frac{\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle}{\|\varphi(\vec{x_i})\| \cdot \|\varphi(\vec{x_j})\|}\\ & =\frac{\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle}{\sqrt{\langle \varphi(\vec{x_i}),\varphi(\vec{x_i}) \rangle }\sqrt{\langle \varphi(\vec{x_j}),\varphi(\vec{x_j}) \rangle }} \\ & = \frac{k(\vec{x_i},\vec{x_j})}{\sqrt{k(\vec{x_i},\vec{x_i})}\sqrt{k(\vec{x_j},\vec{x_j})}}\\ \end{aligned}\]

where:

  • \(\vec{x_i}\) and \(\vec{x_j}\) are arbitrary data points among \(\vec{x_i}\ for\ 1 \le i \le n\) in the original feature space.
  • \(\langle , \rangle\) is the dot product of two data points.
  • \(\varphi()\) is the transformation function from the original feature space to a higher-dimensional feature space.
  • \(\vec{z}_i\) and \(\vec{z}_j\) are the two transformed data points in the higher-dimensional feature space.
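
To verify these identities numerically, the following sketch uses the degree-2 polynomial kernel \(k(\vec{x_i},\vec{x_j}) = (\vec{x_i} \cdot \vec{x_j})^2\), whose explicit feature map in two dimensions is \(\varphi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)\) (a standard textbook example, not taken from the course notes), and checks that the dot product, distance, and angle computed through \(k\) match those computed through \(\varphi\).

```python
import numpy as np

def k(xi, xj):
    """Degree-2 polynomial kernel: k(xi, xj) = (xi . xj)^2."""
    return (xi @ xj) ** 2

def phi(x):
    """Explicit feature map for k: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])

dot_phi = phi(xi) @ phi(xj)
dot_k = k(xi, xj)

dist_phi = np.sum((phi(xi) - phi(xj)) ** 2)
dist_k = k(xi, xi) - 2 * k(xi, xj) + k(xj, xj)

cos_phi = dot_phi / (np.linalg.norm(phi(xi)) * np.linalg.norm(phi(xj)))
cos_k = k(xi, xj) / (np.sqrt(k(xi, xi)) * np.sqrt(k(xj, xj)))

# All three comparisons print True.
print(np.isclose(dot_phi, dot_k), np.isclose(dist_phi, dist_k), np.isclose(cos_phi, cos_k))
```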

2.4.2 Kernel Trick

The kernel trick uses kernel functions to operate on the raw feature vector representation in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data points in the feature space.

  • Before the transformation, the dot product of the input data and classification vector is:

\[\vec{\omega} \cdot \vec{x} = \langle \sum_{i=1}^n c_iy_i\vec{x_i}, \vec{x} \rangle = \sum_{i=1}^n c_i y_i \langle \vec{x_i}, \vec{x} \rangle\]

  • By some transformation \(\varphi()\), the dot product of the input data and the classification vector, represented in terms of the kernel function, is:

\[\vec{\omega} \cdot \varphi(\vec{x}) = \langle \sum_{i=1}^n c_i y_i \varphi(\vec{x_i}), \varphi(\vec{x})\rangle = \sum_{i=1}^n c_i y_i \langle \varphi(\vec{x_i}), \varphi(\vec{x}) \rangle =\sum_{i=1}^n c_i y_i k(\vec{x_i},\vec{x})\]

3 Examples