This is a study note based on the course material for IE 583 by Siggi Olafsson at ISU, with some additional material.
Machine Learning | Description |
---|---|
Supervised learning (SML) | The learning algorithm is presented with labelled example inputs, where the labels indicate the desired output. SML itself is composed of: 1) classification, where the output is qualitative; 2) regression, where the output is quantitative. |
Unsupervised learning (UML) | No labels are provided, and the learning algorithm focuses solely on detecting structure in unlabelled input data. |
Semi-supervised learning | Approaches that use labelled data to inform unsupervised learning on the unlabelled data, in order to identify and annotate new classes in the dataset (also called novelty detection). |
Reinforcement learning | The learning algorithm performs a task using feedback from operating in a real or synthetic environment. |
In machine learning, support-vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis.
Category | Maximum Margin Classifier | Support Vector Classifier | Support Vector Machine |
---|---|---|---|
Handles multi-dimensional data | YES | YES | YES |
Allows misclassification | NO | YES | YES |
Non-linear boundaries | NO | NO | YES |
Check reference for the classification of SVM.
The Maximum Margin Classifier is the simplest way to divide your data if it is linearly separable (or separable by a hyperplane in multi-dimensional space). For obvious reasons, it cannot be applied to data where no clear linear boundary is possible.
The Support Vector Classifier is an extension of the maximum margin classifier in which we allow some misclassification to happen.
The Support Vector Machine, or SVM, is a further extension of the SVC that accommodates non-linear boundaries; it is therefore also called a non-linear classifier, in contrast to the linear classifiers (the support vector classifier and the maximum margin classifier). Although there is a clear distinction between the various definitions, people prefer to call all of them SVMs to avoid complications.
Note: a linear SVM is a special case of a neural network, one with no hidden layers and the hinge loss as its loss function.
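To make that note concrete, here is a minimal sketch (my own illustration, not from the course notes) of the hinge loss of a linear model \(f(\vec{x}) = \vec{\omega} \cdot \vec{x} - b\), i.e. the single-layer "network" a linear SVM corresponds to; the names `w`, `b`, `X`, `y` and the toy data are illustrative choices.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Mean hinge loss of the linear model f(x) = w.x - b, for labels y in {-1, +1}."""
    margins = y * (X @ w - b)                    # y_i * (w . x_i - b) for every sample
    return np.mean(np.maximum(0.0, 1.0 - margins))

# Tiny illustrative data: two points per class in 2-D.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])                         # a candidate weight vector
b = 0.0
print(hinge_loss(w, b, X, y))                    # 0.0 here, since every margin is >= 1
```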
Suppose we are given a training dataset of \(n\) points of the form \((\vec{x_1},y_1), ..., (\vec{x_n},y_n)\), where each \(y_i\) is either 1 or -1, indicating the class to which the point \(\vec{x_i}\) belongs.
In the discussion of SVMs, we need to clarify the concept of a linear function to avoid confusion. In mathematics, the term linear function refers to two distinct but related notions.
Check Reference for the definition of a linear function.
Check Reference for the definition of projection in linear algebra.
Check Reference for the kernel method.
We want to find the maximum-margin hyperplane that divides the group of points \(\vec{x_i}\) for which \(y_i=1\) from the group of points for which \(y_i=-1\), defined so that the distance between the hyperplane and the nearest point \(\vec{x_i}\) from either group is maximized. In the case of support-vector machines, a data point is viewed as a \(p\)-dimensional vector (a list of \(p\) numbers), and we want to know whether we can separate such points with a \((p-1)\)-dimensional hyperplane. Such a hyperplane is characterized by a normal vector \(\vec{\omega}\) and consists of the points \(\vec{x}\) satisfying \[\vec{\omega} \cdot \vec{x} - b = 0\] During the model learning phase, the SVM aims to maximize the width of the gap between the two classes with two parallel hyperplanes. The region bounded by these two hyperplanes is called the margin, and the hyperplane that lies halfway between them is called the maximum-margin hyperplane. It is completely determined by those \(\vec{x_i}\) that lie nearest to it; these \(\vec{x_i}\) are called support vectors.
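As a small illustration (with a made-up \(\vec{\omega}\) and \(b\), not taken from the notes), the sketch below evaluates \(\vec{\omega} \cdot \vec{x} - b\) for a few points: the sign gives the predicted class, and \(|\vec{\omega} \cdot \vec{x} - b| / \|\vec{\omega}\|\) is the geometric distance to the hyperplane.

```python
import numpy as np

w = np.array([1.0, -2.0])       # hypothetical normal vector of the hyperplane
b = 0.5                         # hypothetical offset
points = np.array([[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

scores = points @ w - b                            # w . x - b for each point
classes = np.sign(scores)                          # predicted class: +1 or -1
distances = np.abs(scores) / np.linalg.norm(w)     # geometric distance to the hyperplane

for x, s, c, d in zip(points, scores, classes, distances):
    print(f"x={x}, score={s:+.2f}, class={c:+.0f}, distance={d:.3f}")
```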
Characteristic | linearly separable | linearly inseparable |
---|---|---|
Normalized loss function | \(\ell(y_i) \in \{0,\ 1\}\) (0-1 loss, which equals 0 at a separating hyperplane) | \(\ell(y_i) = \max\{0,\ 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}\) (hinge loss) |
Type of margin | hard margin | soft margin |
Allowance of misclassification | data must lie on the correct side of the margin | data can lie on the wrong side of the margin, with a hinge loss that increases linearly with the one-sided error |
Optimization problem | \(\begin{array}{l}Minimize\ \|\vec{\omega}\|\\ s.t.\ y_i(\vec{\omega} \cdot \vec{x_i} - b) \ge 1\ with\ 1 \le i \le n\end{array}\) | \(\begin{array}{l}Minimize\ [\frac{1}{n}\sum_{i=1}^n \max\{0,\ 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}] +\lambda \|\vec{\omega}\|^2\\ s.t.\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \in \Re,\ for\ all\ 1 \le i \le n\end{array}\) |
If the training data is linearly separable, we can select two parallel hyperplanes that separate the two classes of data such that the distance between the hyperplanes is maximized.
Step | Mathematical expression |
---|---|
1. The loss always yields 0, because no point can be misclassified: | \(\ell(y_i;\vec{x_i},\vec{\omega},b) = \left\{\begin{array}{l}1,\ if\ y_i (\vec{\omega} \cdot \vec{x_i}-b)<0\\0,\ if\ y_i (\vec{\omega} \cdot \vec{x_i}-b) \ge 0\end{array}\right.=0,\ for\ y_i (\vec{\omega} \cdot \vec{x_i}-b) \ge 0\) |
2. With a normalized dataset, the minimal normalized distance of any point labelled \(y_i = 1\) lying on or above its hyperplane is 1, and that of any point labelled \(y_i = -1\) lying on or below its hyperplane is -1; this is called a hard margin. The corresponding hyperplanes can be described by the equations: | \(\vec{\omega} \cdot \vec{x} - b = \pm Hard\ Margin = \pm 1\) |
3. Since the data is separable, the margin value \(y_i(\vec{\omega} \cdot \vec{x_i} - b)\) for each vector \(\vec{x_i}\) must be at least 1, so that every point lies on the correct side of its hyperplane: | \(Constraint:\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1 - \ell = 1 - 0,\ for\ all\ 1 \le i \le n\) |
4. Geometrically, since the normalized margin is 1, the distance between these two hyperplanes is twice the distance from the maximum-margin hyperplane to either of them (distance from a point to a plane): | \(\begin{array}{l} Maximize\ d = 2 \times Margin = \frac{2}{\|\vec{\omega}\|} \\ \Leftrightarrow Minimize\ \|\vec{\omega}\|\end{array}\) |
5. We solve the optimization problem of maximizing the distance between the hyperplanes subject to the hard-margin constraint: | \(\begin{array}{l} Minimize\ \|\vec{\omega}\|\\ s.t.\ y_i(\vec{\omega} \cdot \vec{x_i} - b) \ge 1\ with\ 1 \le i \le n\end{array}\) |
6. By solving this problem, the resulting \(\vec{\omega}\) and \(b\) determine the classifier using the sign function, sgn: | \(\vec{x} \mapsto sgn(\vec{\omega} \cdot \vec{x} -b)\) |
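As a hedged illustration of steps 5 and 6 (not part of the original notes), scikit-learn's `SVC` with a linear kernel and a very large penalty `C` approximates the hard-margin problem on linearly separable toy data; note that sklearn's decision function is \(\vec{\omega} \cdot \vec{x} + intercept\), so \(b = -intercept\) in this note's notation.

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data, labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A very large C makes the soft-margin solver behave like the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]            # omega
b = -clf.intercept_[0]      # sklearn uses w.x + intercept, so b = -intercept here
print("omega:", w, "b:", b)
print("support vectors:", clf.support_vectors_)

# Classify a new point with sgn(w.x - b), as in step 6.
x_new = np.array([1.0, 0.5])
print("predicted class:", np.sign(w @ x_new - b))
```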
If the training data is linearly inseparable, we can select two parallel hyperplanes that separate the two classes of data while allowing some misclassification, such that the distance between the hyperplanes is maximized.
Step | Mathematical expression |
---|---|
1. For data on the wrong side of the margin, the loss is proportional to the distance from the margin. Therefore, the loss is defined by the hinge loss function: | \(\ell(y_i;\vec{x_i},\vec{\omega},b) = \max\{0,\ 1-y_i(\vec{\omega} \cdot \vec{x_i} -b)\}\) |
2. With a normalized or standardized dataset, to allow the two parallel hyperplanes to misclassify some outliers, a soft margin penalizes any point that falls on the wrong side of its hyperplane (margin less than 1). These hyperplanes can be described by the equations: | \(\vec{\omega} \cdot \vec{x} - b = \pm Soft\ Margin \simeq \pm 1\) |
3. Since the data is inseparable, the margin value \(y_i(\vec{\omega} \cdot \vec{x_i} - b)\) for each vector \(\vec{x_i}\) can be any real number, so the constraint is relaxed by the hinge loss: | \(Constraint:\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1 -\ell(y_i;\vec{x_i},\vec{\omega},b),\ for\ all\ 1 \le i \le n\) |
4. Knowing that the margin width \(d = \frac{2}{\|\vec{\omega}\|}\) is inversely proportional to \(\|\vec{\omega}\|\), we need to balance maximizing the margin between the two classes against minimizing the amount of misclassification. So we introduce a regularization parameter \(\lambda\) that serves as the degree of importance given to misclassifications. The loss function then becomes: | \(\begin{array}{l} Maximize\ d \propto \frac{1}{\|\vec{\omega}\|^{*}} \Leftrightarrow\\ Minimize\ \|\vec{\omega}\|^*=[\frac{1}{n}\sum_{i=1}^n \ell(y_i;\vec{x_i},\vec{\omega},b)] +\lambda \|\vec{\omega}\|^2 \end{array}\) |
5. We maximize the distance between the hyperplanes by minimizing this loss function: | \(\begin{array}{l}Minimize\ \|\vec{\omega}\|^*=[\frac{1}{n}\sum_{i=1}^n \ell(y_i;\vec{x_i},\vec{\omega},b)] +\lambda \|\vec{\omega}\|^2\\ s.t.\ y_i (\vec{\omega} \cdot \vec{x_i} - b) \ge 1- \ell(y_i;\vec{x_i},\vec{\omega},b),\ for\ all\ 1 \le i \le n\end{array}\) |
6. With a given \(\lambda\), the resulting \(\vec{\omega}\) and \(b\) determine the classifier using the sign function, sgn: | \(\vec{x} \mapsto sgn_\lambda(\vec{\omega} \cdot \vec{x} -b)\) |
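The soft-margin objective in step 5 can also be minimized directly. Below is a minimal sketch (my own illustration, not from the notes) that minimizes \([\frac{1}{n}\sum_i \max\{0,\ 1-y_i(\vec{\omega}\cdot\vec{x_i}-b)\}] + \lambda\|\vec{\omega}\|^2\) by subgradient descent; the learning rate, iteration count, \(\lambda\), and toy data are arbitrary choices.

```python
import numpy as np

def train_soft_margin_svm(X, y, lam=0.01, lr=0.1, epochs=1000):
    """Subgradient descent on (1/n) * sum(hinge) + lam * ||w||^2, for labels y in {-1, +1}."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        active = margins < 1                       # points contributing hinge loss
        # Subgradient of the objective with respect to w and b.
        grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = y[active].sum() / n               # d/db of -y_i*(w.x_i - b) is +y_i, averaged
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data with near-overlapping points, so the margin has to be "soft".
X = np.array([[2.0, 2.0], [3.0, 1.0], [0.2, 0.1],
              [-2.0, -1.0], [-1.0, -3.0], [-0.3, -0.2]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_soft_margin_svm(X, y)
print("omega:", w, "b:", b)
print("train predictions:", np.sign(X @ w - b))
```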
Check reference for the regularization parameter, \(\lambda\).
The primal problem can be solved via the Lagrangian function, whereas the dual problem can be solved via the Lagrangian dual function. The Lagrangian dual problem is obtained by forming the Lagrangian of a minimization problem, using nonnegative Lagrange multipliers to add the constraints to the objective function, and then solving for the primal variable values that minimize the original objective function. This solution gives the primal variables as functions of the Lagrange multipliers, which are called dual variables, so that the new problem is to maximize the objective function with respect to the dual variables under the derived constraints on the dual variables (including at least the nonnegativity constraints). The primal and dual problems have the following relation:
- Weak duality: dual optimal \(\le\) primal optimal (for a minimization primal, the dual provides a lower bound).
- Strong duality: primal optimal \(=\) dual optimal, if either the primal or the dual has a finite optimum.
The SVM problem satisfies the Strong Duality Theorem. It is difficult to solve the primal problem of the soft-margin SVM directly, because \(\vec{\omega}\) appears inside the hinge loss \(\ell\). Therefore, we rewrite it as the Lagrangian dual problem.
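As a sketch of where the substitution used in the table below comes from (written here for the hard-margin form \(\frac{1}{2}\|\vec{\omega}\|^2\), which is equivalent to minimizing \(\|\vec{\omega}\|\); the soft-margin case additionally bounds the multipliers, giving the box constraint below): form the Lagrangian with multipliers \(c_i \ge 0\) and set its gradient with respect to the primal variables to zero.
\[L(\vec{\omega}, b, \vec{c}) = \frac{1}{2}\|\vec{\omega}\|^2 - \sum_{i=1}^n c_i \left[y_i(\vec{\omega} \cdot \vec{x_i} - b) - 1\right],\quad c_i \ge 0\]
\[\frac{\partial L}{\partial \vec{\omega}} = \vec{\omega} - \sum_{i=1}^n c_i y_i \vec{x_i} = 0 \Rightarrow \vec{\omega} = \sum_{i=1}^n c_i y_i \vec{x_i}, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^n c_i y_i = 0\]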
Step | Mathematical expression |
---|---|
1. To explicitly represent \(\vec{\omega}\) with \(c_i\) for all \(i\), rewrite the Lagrangian function with \(\vec{\omega}= \sum_{i=1}^n c_iy_i\vec{x_i}\) into the Lagrangian dual function. The optimization problem over the vector of unknown variables \(\vec{c}\) can then be solved with quadratic programming: | \(\begin{array}{l} maximize\ f(c_1...c_n) = \sum_{i=1}^n c_i - \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^ny_ic_i(\vec{x_i} \cdot \vec{x_j})y_jc_j \\ s.t.\ \sum_{i=1}^n c_i y_i=0,\ and\ 0 \le c_i \le \frac{1}{2n\lambda}\ for\ all\ i \end{array}\) |
2. Plug \(\vec{c}\) in and solve for: | \(\vec{\omega}= \sum_{i=1}^n c_iy_i\vec{x_i}\) |
3. Finally, new points can be classified by computing | \(\vec{x} \mapsto sgn_\lambda(\vec{\omega} \cdot \vec{x} -b)\) |
Check reference for duality. Check reference for quadratic programming.
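As a hedged sketch of step 1 (my own illustration, not from the notes), the dual can be solved numerically with a general-purpose constrained optimizer such as `scipy.optimize.minimize` (SLSQP), using the equality constraint \(\sum_i c_i y_i = 0\) and the box constraint \(0 \le c_i \le \frac{1}{2n\lambda}\); the toy data and \(\lambda\) are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, lam = len(y), 0.01

G = (y[:, None] * X) @ (y[:, None] * X).T        # G[i, j] = y_i y_j (x_i . x_j)

def neg_dual(c):
    # Negative of f(c) = sum(c) - 0.5 * c^T G c, since scipy minimizes.
    return -(c.sum() - 0.5 * c @ G @ c)

upper = 1.0 / (2 * n * lam)
res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, upper)] * n,                             # 0 <= c_i <= 1/(2 n lambda)
    constraints=[{"type": "eq", "fun": lambda c: c @ y}],  # sum_i c_i y_i = 0
)
c = res.x

w = (c * y) @ X                                  # omega = sum_i c_i y_i x_i
# Recover b from a support vector with 0 < c_i < upper: y_i (w.x_i - b) = 1  =>  b = w.x_i - y_i
sv = np.argmax((c > 1e-6) & (c < upper - 1e-6))
b = w @ X[sv] - y[sv]
print("omega:", w, "b:", b)
print("predictions:", np.sign(X @ w - b))
```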
The kernel trick is applied to maximum-margin hyperplanes to create nonlinear classifiers. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.
The following are some common kernels, where \(r\), \(d\), \(\gamma\) are kernel parameters and \(\vec{x_i}\), \(\vec{x_j}\) are arbitrary data points from the input dataset.
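Commonly used examples include the polynomial, Gaussian radial basis function (RBF), and sigmoid kernels (standard definitions, written here with the parameters above rather than taken from the course notes):

Kernel | \(k(\vec{x_i}, \vec{x_j})\) |
---|---|
Polynomial | \((\gamma\, \vec{x_i} \cdot \vec{x_j} + r)^d\) |
Gaussian RBF | \(\exp(-\gamma \|\vec{x_i} - \vec{x_j}\|^2)\) |
Sigmoid | \(\tanh(\gamma\, \vec{x_i} \cdot \vec{x_j} + r)\) |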
Step | Mathematical expression |
---|---|
1. Everything is the same as solving the Lagrangian dual function, but using the transformed data vectors \(\varphi(\vec{x})\) instead of the original \(\vec{x}\): | \(\begin{array}{l}\begin{aligned}{} maximize\ f(c_1...c_n) & = \sum_{i=1}^n c_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i c_i(\varphi(\vec{x_i}) \cdot \varphi(\vec{x_j}))y_jc_j \\ & = \sum_{i=1}^n c_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i c_i\, k(\vec{x_i}, \vec{x_j})\, y_j c_j \\ \end{aligned} \\ s.t.\ \sum_{i=1}^n c_i y_i=0,\ and\ 0 \le c_i \le \frac{1}{2n\lambda}\ for\ all\ i \end{array}\) |
2. Plug \(\vec{c}\) in and solve for the classification vector \(\vec{\omega}\) in the transformed space: | \(\vec{\omega}=\sum_{i=1}^n c_i y_i \varphi(\vec{x_i})\) |
3. Finally, new points can be classified by computing | \(\vec{z} \mapsto sgn_\lambda(\vec{\omega} \cdot \varphi(\vec{z}) -b) =sgn([\sum_{i=1}^n c_i y_i k(\vec{x_i},\vec{z})]-b)\) |
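As a hedged illustration (not from the notes), an RBF-kernel SVM can separate data that no linear classifier can, such as the XOR-like pattern below; `gamma` plays the role of \(\gamma\), and the dataset is made up for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# XOR-like data: the class is +1 when the two coordinates have the same sign.
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))    # usually close to 1.0; a linear kernel cannot do this
print("support vectors per class:", clf.n_support_)

# Classify new points well inside opposite quadrants.
print(clf.predict([[0.5, 0.5], [-0.5, 0.5]]))   # expected: [ 1, -1]
```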
Check Reference for the kernel method.
A kernel function calculates the dot product of two data points in a higher-dimensional feature space. The following derivation explains why we only need the kernel function to understand the geometric properties (angle and distance) of two arbitrary data points in some higher-dimensional feature space, without knowing the transformation function \(\varphi()\):
- A valid kernel is a finitely positive semi-definite function, meaning the corresponding kernel matrix \(M_k\) is positive semi-definite, i.e. \(\vec{v}^T M_k \vec{v} \ge 0\) for every vector \(\vec{v}\):\[M_k = \left[\begin{array}{rrr} k(\vec{x}_1, \vec{x}_1) & ...& k(\vec{x}_1, \vec{x}_n)\\ ...&...&...\\ k(\vec{x}_n, \vec{x}_1) & ...& k(\vec{x}_n, \vec{x}_n)\\ \end{array}\right] \succeq 0\]
- Dot product of two vector points in the mapped feature space via \(\varphi()\) (shown here for a 3-dimensional feature space):\[\begin{aligned} \langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle & = \langle \vec{z}_i,\vec{z}_j \rangle \\ & = \left[\begin{array}{rrr}z_{i1} & z_{i2} & z_{i3} \end{array}\right] \cdot \left[\begin{array}{r}z_{j1} \\z_{j2} \\ z_{j3}\end{array}\right] \\ & = k(\vec{x_i},\vec{x_j})\\ \end{aligned}\]
- Distance between two vector points in the mapped feature space via \(\varphi()\):\[\begin{aligned} \|\varphi(\vec{x_i})-\varphi(\vec{x_j})\|^2 & = (\varphi(\vec{x_i})-\varphi(\vec{x_j}))^T(\varphi(\vec{x_i})-\varphi(\vec{x_j})) \\ & =(\varphi(\vec{x_i})^T-\varphi(\vec{x_j})^T)(\varphi(\vec{x_i})-\varphi(\vec{x_j})) \\ & = \varphi(\vec{x_i})^T\varphi(\vec{x_i})-2\varphi(\vec{x_i})^T\varphi(\vec{x_j})+\varphi(\vec{x_j})^T\varphi(\vec{x_j}) \\ & = \langle \varphi(\vec{x_i}),\varphi(\vec{x_i}) \rangle -2\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle+\langle \varphi(\vec{x_j}),\varphi(\vec{x_j}) \rangle \\ & = k(\vec{x_i},\vec{x_i})-2k(\vec{x_i},\vec{x_j})+k(\vec{x_j},\vec{x_j}) \\ \end{aligned}\]
- Angle between two vector points in the mapped feature space via \(\varphi()\):\[\begin{aligned} \langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle & = \|\varphi(\vec{x_i})\| \cdot \|\varphi(\vec{x_j})\| \cos\theta \\ \Rightarrow \cos\theta & = \frac{\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle}{\|\varphi(\vec{x_i})\| \cdot \|\varphi(\vec{x_j})\|}\\ & =\frac{\langle \varphi(\vec{x_i}),\varphi(\vec{x_j}) \rangle}{\sqrt{\langle \varphi(\vec{x_i}),\varphi(\vec{x_i}) \rangle }\sqrt{\langle \varphi(\vec{x_j}),\varphi(\vec{x_j}) \rangle }} \\ & = \frac{k(\vec{x_i},\vec{x_j})}{\sqrt{k(\vec{x_i},\vec{x_i})}\sqrt{k(\vec{x_j},\vec{x_j})}}\\ \end{aligned}\]
The kernel trick uses kernel functions to operate on the raw feature vector representation in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space:
\[\vec{\omega} \cdot \vec{x} = \langle \sum_{i=1}^n c_iy_i\vec{x_i}, \vec{x} \rangle = \sum_{i=1}^n c_i y_i \langle \vec{x_i}, \vec{x} \rangle\]
\[\vec{\omega} \cdot \varphi(\vec{x}) = \langle \sum_{i=1}^n c_i y_i \varphi(\vec{x_i}), \varphi(\vec{x})\rangle = \sum_{i=1}^n c_i y_i \langle \varphi(\vec{x_i}), \varphi(\vec{x}) \rangle =\sum_{i=1}^n c_i y_i k(\vec{x_i},\vec{x})\]
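As a final hedged sketch (my own check, not part of the notes), the snippet below verifies numerically that the degree-2 polynomial kernel \(k(\vec{x_i},\vec{x_j}) = (\vec{x_i} \cdot \vec{x_j})^2\) equals the ordinary dot product \(\langle \varphi(\vec{x_i}), \varphi(\vec{x_j}) \rangle\) for an explicit feature map \(\varphi\), illustrating that the feature-space geometry is available without ever constructing \(\varphi(\vec{x})\).

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel k(a, b) = (a . b)^2 in 2-D."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def k(a, b):
    """Degree-2 polynomial kernel, computed without ever mapping into feature space."""
    return (a @ b) ** 2

rng = np.random.default_rng(1)
a, b = rng.normal(size=2), rng.normal(size=2)

print(phi(a) @ phi(b))                         # dot product in the 3-D feature space
print(k(a, b))                                 # same value from the kernel alone
print(np.allclose(phi(a) @ phi(b), k(a, b)))   # True
```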