The objective of this note is to explain loss functions from a statistical standpoint, specifically through maximum a posteriori (MAP) estimation.
library(tidyverse)
(source: https://www.cnblogs.com/LittleHann/p/9608599.html#_label0)
(source: https://medium.com/jarvis-toward-intelligence/%E6%AF%94%E8%BC%83-cross-entropy-%E8%88%87-mean-squared-error-8bebc0255f5)
(source: https://en.wikipedia.org/wiki/Loss_function)
par(mfrow = c(2, 2), mar = c(4, 4, 4, 4))
# density, cumulative distribution, random draws and quantile function of the standard normal
plot(function(x) dnorm(x), -3, 3,
     main = "Normal density: dnorm()", xlab = "z", ylab = "density")
plot(function(x) pnorm(x), -3, 3,
     main = "Normal cumulative: pnorm()", xlab = "z", ylab = "cumulative probability")
hist(rnorm(1000),
     main = "Sample from normal distribution: rnorm()", xlab = "z", ylab = "count")
plot(function(x) qnorm(x), 0.001, 0.999,
     main = "Normal quantile: qnorm()", xlab = "quantile", ylab = "z")
The functions for the density/mass function, cumulative distribution function, quantile function and random variate generation are named in the form dxxx, pxxx, qxxx and rxxx respectively.
distribution | d, density | p, cumulative | q, quantile | r, random generation |
---|---|---|---|---|
normal | dnorm() | pnorm() | qnorm() | rnorm() |
uniform | dunif() | punif() | qunif() | runif() |
beta | dbeta() | pbeta() | qbeta() | rbeta() |
binomial (including Bernoulli) | dbinom() | pbinom() | qbinom() | rbinom() |
Cauchy | dcauchy() | pcauchy() | qcauchy() | rcauchy() |
chi-squared | dchisq() | pchisq() | qchisq() | rchisq() |
exponential | dexp() | pexp() | qexp() | rexp() |
F | df() | pf() | qf() | rf() |
gamma | dgamma() | pgamma() | qgamma() | rgamma() |
geometric | dgeom() | pgeom() | qgeom() | rgeom() |
hypergeometric | dhyper() | phyper() | qhyper() | rhyper() |
log-normal | dlnorm() | plnorm() | qlnorm() | rlnorm() |
multinomial | dmultinom() | (not in base R) | (not in base R) | rmultinom() |
negative binomial | dnbinom() | pnbinom() | qnbinom() | rnbinom() |
Poisson | dpois() | ppois() | qpois() | rpois() |
Student’s t | dt() | pt() | qt() | rt() |
Weibull | dweibull() | pweibull() | qweibull() | rweibull() |
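A quick sanity check of the d/p/q/r convention with base R (the binomial parameters below are arbitrary examples):

```r
pnorm(qnorm(0.975))                 # 0.975: the quantile function inverts the CDF
dbinom(3, size = 10, prob = 0.5)    # P(X = 3) for Binomial(10, 0.5)
pbinom(3, size = 10, prob = 0.5)    # P(X <= 3)
qbinom(0.5, size = 10, prob = 0.5)  # median
mean(rnorm(1e4))                    # sample mean of 10,000 standard normal draws, close to 0
```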
In mathematics, a norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin.
The p-norm of a vector \(X \in \mathbb{R}^n\) is:
\[||X||_{p} = \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}\]
The (entrywise) p-norm of an \(n \times n\) matrix \(A\) is:
\[||A||_{p} = \Big(\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^p\Big)^{1/p}\]
Type | Vector Norm | Matrix Norm |
---|---|---|
0-Norm (L0) | \(||X||_{0} = \sum_{i=1}^{n} |x_i|^0\) (with the convention \(0^0 = 0\), the number of non-zero entries) | \(||A||_{0} = \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^0\) |
1-Norm (L1) | \(||X||_{1} = \sum_{i=1}^{n} |x_i|\) | \(||A||_{1} = \max\limits_{1 \leq j \leq n} \sum_{i=1}^{n} |a_{ij}|\) (maximum absolute column sum) |
2-Norm (L2), also called the Euclidean norm (Frobenius norm for matrices) | \(||X||_{2} = \sqrt{\sum_{i=1}^{n} |x_i|^2}\) (e.g. \(\sqrt{a^2 + b^2 + c^2}\) for a 3-vector) | \(||A||_{F} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^2}\) |
p-Norm | \(||X||_{p} = (\sum_{i=1}^{n} |x_i|^p)^{1/p}\) | \(||A||_{p} = (\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^p)^{1/p}\) |
\(\infty\)-Norm | \(||X||_{\infty} = \max\limits_{1 \leq i \leq n} |x_{i}|\) | \(||A||_{\infty} = \max\limits_{1 \leq i \leq n} \sum_{j=1}^{n} |a_{ij}|\) (maximum absolute row sum) |
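These definitions can be checked directly in base R (the vector x and matrix A below are arbitrary examples):

```r
x <- c(1, -2, 3)
A <- matrix(c(1, -2, 3, 4), nrow = 2)

# vector norms, computed from the definitions above
sum(x != 0)          # L0: number of non-zero entries
sum(abs(x))          # L1
sqrt(sum(x^2))       # L2 (Euclidean)
max(abs(x))          # infinity norm

# matrix norms via base::norm()
norm(A, type = "O")  # 1-norm: maximum absolute column sum
norm(A, type = "F")  # Frobenius norm
norm(A, type = "I")  # infinity norm: maximum absolute row sum
```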
Common activation functions and their associated problems:
Activation function | Mathematical formulation | Prominent use | Problems |
---|---|---|---|
sigmoid (a special case of the softmax function for a classifier with only two classes) | \(\sigma(x) = \frac{1}{1+e^{-x \omega}} = \frac{e^{x \omega}}{1+e^{x \omega}}\) | Converts the output to the probability of the positive class in binary classification | Gradient exploding problem; vanishing gradient problem; not zero-centered; high computational complexity |
Softmax | \(\sigma(\vec{z})_{i} = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\) | Converts the outputs to probabilities over the \(K\) classes in multiclass classification | - |
tanh (hyperbolic tangent) | \(\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}\) | Zero-centered alternative to the sigmoid | Gradient exploding problem; vanishing gradient problem; high computational complexity |
Rectified Linear Unit (ReLU) | \(\sigma(x) = \max(x,0)\) | Creates non-linearity when stacking layers; avoids the exploding and vanishing gradient problems when \(x>0\) | Gradient only flows for \(x > 0\) (dead ReLU problem for \(x \leq 0\)); not zero-centered; unbounded range |
Exponential Linear Unit (ELU) | \(\begin{cases} x & \text{if } x > 0 \\ \alpha (e^x-1) & \text{if } x \leq 0 \end{cases}\) | Between ReLU and Leaky ReLU | High computational complexity; unbounded range |
Leaky ReLU | \(\max(0.01x, x)\) | Improvement on ReLU; avoids the exploding and vanishing gradient problems | Not zero-centered; unbounded range; close to linear |
PReLU | \(\max(\alpha x, x)\), where \(\alpha\) is learned during training | Improvement on Leaky ReLU | Not zero-centered; unbounded range |
Maxout | \(z_i = \omega_i x + b_i\) for \(i = 1,\dots,k\); output \(= \max_i z_i\) | A learned combination of activation functions | High computational complexity; more parameters |
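The table can be made concrete with a minimal R sketch of these activations (plain numeric implementations, not the versions used inside any particular deep-learning framework):

```r
sigmoid    <- function(x) 1 / (1 + exp(-x))
softmax    <- function(z) exp(z) / sum(exp(z))
relu       <- function(x) pmax(x, 0)
leaky_relu <- function(x, a = 0.01) pmax(a * x, x)
elu        <- function(x, a = 1) ifelse(x > 0, x, a * (exp(x) - 1))

curve(sigmoid(x), -5, 5, ylab = "activation")  # saturates at both ends -> vanishing gradients
curve(relu(x), -5, 5, add = TRUE, lty = 2)     # zero gradient for x <= 0 (dead ReLU)
```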
In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function. The objective function (or risk function) consists only of the likelihood of the data:
\[likelihood: \ J(\hat{Y}, Y)_{MLE} = p(Y|X, \omega)\]
\[log\text{-}likelihood: \ L(\hat{Y}, Y)_{MLE} = \ln \big(J(\hat{Y}, Y)_{MLE}\big) = \ln p(Y|X, \omega)\]
In Bayesian statistics, a maximum a posteriori (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution (historically, "inverse probability"). The objective function (or risk function) consists of the likelihood of the data and the prior on the parameters:
\[likelihood \times prior: \ J(\hat{Y}, Y)_{MAP} = p(Y|X, \omega)\, p(\omega)\]
\[log\text{-}likelihood: \ L(\hat{Y}, Y)_{MAP} = \ln \big(p(Y|X, \omega)\, p(\omega)\big) = \ln p(Y|X, \omega) + \ln p(\omega)\]
In mathematics, the arguments of the maxima (abbreviated arg max or argmax) are the points, or elements, of the domain of some function at which the function values are maximized. Supervised learning can therefore be interpreted as finding the parameters that minimize the prediction error while regularizing the parameters:
\[\omega^* = \arg \max_\omega \big[\ln p(Y|X, \omega) + \ln p(\omega)\big]\]
\[\omega^* = \arg \min_\omega \big[- \ln p(Y|X, \omega) - \ln p(\omega)\big]\]
The negative log-likelihood is proportional to the average loss over the training data, with \(\hat{y_{i}} = f(x_i;\omega)\):
\[-\ln p(Y|X, \omega) \propto \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i;\omega), y_{i}\big) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y_{i}}, y_{i})\]
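As a concrete sketch of the two estimates (toy simulated data; the Gaussian likelihood, Gaussian prior and \(\lambda = 10\) are illustrative choices, not part of the derivation above), MLE under a Gaussian likelihood reduces to least squares, and adding the log-prior gives the MAP / ridge-style estimate:

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)

neg_log_lik   <- function(w) sum((y - (w[1] + w[2] * x))^2)   # -ln p(Y | X, w), up to constants
neg_log_prior <- function(w, lambda = 10) lambda * sum(w^2)   # -ln p(w) for a Gaussian prior

w_mle <- optim(c(0, 0), neg_log_lik)$par
w_map <- optim(c(0, 0), function(w) neg_log_lik(w) + neg_log_prior(w))$par

w_mle  # close to coef(lm(y ~ x))
w_map  # shrunk toward zero by the prior
```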
Loss function | Likelihood of data, \(p(Y|X, \omega)\) | Mathematical formulation of \(\ell(\hat{y_{i}}, y_{i})\) | Prominent algorithm |
---|---|---|---|
L1 loss, or Mean Absolute Error (MAE) loss | Laplace | \(|\hat{y_{i}} - y_{i}|\) | Regression with least absolute deviations (LAD); minimizes the absolute distance. Pro: the gradient is less likely to explode for large residuals than with the L2 loss. Con: the derivative has constant magnitude, so updates bounce around the optimum and convergence is harder. |
L2 loss, or Mean Squared Error (MSE) loss, quadratic loss | Gaussian | \((\hat{y_{i}} - y_{i})^2\) | Regression with ordinary least squares (OLS); minimizes the squared distance. Pro: smooth gradient for small residuals. Con: the gradient can explode for large residuals. |
Smooth L1 | - | \(\begin{cases} 0.5(\hat{y_{i}} - y_{i})^2 & \text{if } |\hat{y_{i}} - y_{i}| < 1 \\ |\hat{y_{i}} - y_{i}| - 0.5 & \text{otherwise} \end{cases}\) | Combines the advantages of the L1 and L2 losses. |
Huber loss | - | \(\begin{cases} \frac{1}{2} (\hat{y_{i}} - y_{i})^2 & \text{for } |\hat{y_{i}} - y_{i}| \leq \delta \\ \delta |\hat{y_{i}} - y_{i}| - \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}\) | Regression; behaves like MAE (up to scale) as \(\delta \rightarrow 0\) and like MSE as \(\delta \rightarrow \infty\). |
Poisson loss | Poisson | \(e^{\hat{y_{i}}} - y_{i} \hat{y_{i}}\) | Poisson regression (log link) for count data, where \(\hat{y_{i}}\) is the linear predictor (the log of the predicted mean). |
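A minimal R sketch of these regression losses (\(\delta = 1\) is an arbitrary choice), plotted against the residual to show the quadratic-versus-linear behaviour discussed above:

```r
l1_loss    <- function(y_hat, y) abs(y_hat - y)
l2_loss    <- function(y_hat, y) (y_hat - y)^2
smooth_l1  <- function(y_hat, y) { r <- abs(y_hat - y); ifelse(r < 1, 0.5 * r^2, r - 0.5) }
huber_loss <- function(y_hat, y, delta = 1) {
  r <- abs(y_hat - y)
  ifelse(r <= delta, 0.5 * r^2, delta * r - 0.5 * delta^2)
}

r <- seq(-3, 3, by = 0.01)
plot(r, l2_loss(r, 0), type = "l", xlab = "residual", ylab = "loss")  # explodes for large residuals
lines(r, l1_loss(r, 0), lty = 2)     # constant slope: bounces around the optimum
lines(r, huber_loss(r, 0), lty = 3)  # quadratic near 0, linear in the tails
```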
Loss function | Likelihood of data, \(p(Y|X, \omega)\) | Mathematical formulation of \(\ell(\hat{y_{i}}, y_{i})\) | Prominent algorithm |
---|---|---|---|
zero-one loss | Dirac | \(I(\hat{y_{i}} - y_{i} \neq 0) = \begin{cases} 1, & \hat{y_{i}} - y_{i} \neq 0 \\ 0, & \hat{y_{i}} - y_{i} = 0 \end{cases}\) | Maximum-margin classifier for hard-margin, linearly separable problems. Easy to interpret, but cannot distinguish between models that produce the same classifications. |
probit loss | probit, the CDF \(\Phi\) of the standard normal distribution | \(-[y_i \ln \Phi(x_i \omega) + (1-y_i) \ln(1-\Phi(x_i \omega))]\), with \(y_i \in \{0,1\}\) | Probit regression, binary classification |
perceptron loss | - | \(I(|\hat{y_{i}} - y_{i}| \geq t) = \begin{cases} 1, & |\hat{y_{i}} - y_{i}| \geq t \\ 0, & |\hat{y_{i}} - y_{i}| < t \end{cases}\) | Perceptron algorithm; a relaxed zero-one loss with tolerance \(t\) |
hinge loss | - | Binary: \(\max(0, 1 - y_{i} \hat{y_{i}}),\ y_{i} = \pm1\); multiclass: \(\sum_{j \neq y_i} \max(0, 1 + \hat{y}_{i,j} - \hat{y}_{i,y_i})\), where \(\hat{y}_{i,j}\) is the score of class \(j\) | Support vector machine (SVM) with soft margin for non-linearly-separable problems; multiclass classification |
Logistic loss (logit + sigmoid function + cross-entropy loss) | Bernoulli | \(-[I_{positive}(y_i) \ln \sigma(x_i \omega) + (1-I_{positive}(y_i)) \ln(1- \sigma(x_i \omega))]\), where the sigmoid function is \(\sigma(x_i \omega) = \frac{1}{1+e^{-x_i \omega}}\) and \(I_{positive}(y_i) = \begin{cases} 1 & \text{if } class = positive \\ 0 & \text{if } class \neq positive \end{cases}\) | Logistic regression, ordinal logistic regression, binary classification |
Cross-entropy loss (logit + sigmoid function + cross-entropy loss) | Categorical, also called generalized Bernoulli or multinoulli | \(-\sum_{j \in Classes} I_j(y_{i}) \ln \sigma\big((x_i \omega)_j\big)\), where \(\sigma\) is the sigmoid function and \(I_j(y_{i}) = \begin{cases} 1 & \text{if } class = j \\ 0 & \text{if } class \neq j \end{cases}\) | Multiclass classification when the classes are not too many; maximizes inter-class distance. A generalization of the logistic loss to multiple classes; exports the one-hot result(s). |
softmax loss, softmax cross-entropy loss (logit + softmax function + cross-entropy loss) | Multinomial | \(-\ln \sigma(x_i \omega)_{y_i} = -\ln \frac{e^{(x_i \omega)_{y_i}}}{\sum_{j \in Classes} e^{(x_i \omega)_j}}\), where \((x_i \omega)_{y_i}\) is the logit of the correct class | Multinomial logistic regression, multiclass classification; the cross-entropy loss applied to the softmax output |
Large-Margin Softmax Loss (L-Softmax loss) | - | Replaces \(\omega x\) in the softmax loss with a function of amplitude and angle | An improvement of the softmax loss on the decision margin. Constrains the decision margin using a function of amplitude and angle, but the separation is not as clean because classes can be separated along two different quantities. |
SphereFace: Angular Softmax Loss (A-Softmax loss) | - | Normalizes \(\omega\) and constrains the decision margin using a function of the angle only | An improvement of L-Softmax. |
L2-constrained softmax loss | - | Adds an L2-normalization layer and a scale layer before the softmax loss (NormFace) | An improvement of the softmax loss on difficult features. To prevent the DCNN from learning only easy features, it L2-normalizes the features of all samples so that difficult samples are learned from equally. |
CosFace: Large Margin Cosine Loss for deep face recognition, or Additive Margin Softmax for Face Verification (AM-Softmax loss) | - | - | An improvement of A-Softmax (SphereFace). Normalizes the features and maximizes an additive cosine margin to maximize inter-class variance and minimize intra-class variance. |
ArcFace: Additive Angular Margin Loss for Deep Face Recognition | - | - | An improvement of A-Softmax (SphereFace). Normalizes the features and maximizes an additive angular margin (more direct than the cosine margin) to maximize inter-class variance and minimize intra-class variance. |
Contrastive loss | - | - | Euclidean embedding. Increases the pairwise distance between samples in different classes and decreases the distance between samples in the same class. Con: uses the same margin for every pair of samples from different classes. |
triplet loss | - | - | Euclidean embedding. An improvement of contrastive loss. Enforces a relative margin between intra-class and inter-class distances, but does not explicitly enforce intra-class compactness. |
center loss | - | \(\ell(\hat{y_{i}}, y_{i})_{softmax} + \lambda \|x_i - c_{y_i}\|_2^2\) | Euclidean embedding. An improvement of contrastive loss. Minimizes the absolute intra-class distance. Con: the L2 term is easily affected by outliers within a class and has higher computational cost. |
exponential loss | - | \(\exp(-\hat{y_{i}} y_{i}),\ y_{i} = \pm1\) | Boosting, AdaBoost |
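A minimal R sketch of the main classification losses above, written for a single sample (\(y \in \{0, 1\}\) for the logistic loss, \(y = \pm1\) for the hinge loss, and \(y\) a class index for the softmax cross-entropy):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

logistic_loss <- function(z, y) -(y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z)))
hinge_loss    <- function(z, y) pmax(0, 1 - y * z)
softmax_ce    <- function(z, y) -log(exp(z[y]) / sum(exp(z)))  # z: vector of logits, y: true class index

logistic_loss(z = 2.0, y = 1)         # confident, correct prediction -> small loss
hinge_loss(z = -0.5, y = 1)           # inside the margin -> positive loss
softmax_ce(z = c(2, 0.5, -1), y = 1)  # multiclass cross-entropy on raw logits
```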
When fitting parameters via MAP, the prior distribution of \(\omega\) contributes the penalty (regularization) term:
\[-\ln p(\omega) \propto \lambda\, \Omega(\omega) \]
penalty | prior distribution of \(\omega\) | Mathematical formulation | Prominent use |
---|---|---|---|
L1 penalty (absolute error) | Laplace | \(|\omega|\) | lasso |
L2 penalty (squared error) | Gaussian | \(\omega^2\) | ridge |
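As a closing sketch (toy simulated data; \(\lambda = 5\) is arbitrary and the intercept is penalized only for brevity), the Gaussian prior / L2 penalty corresponds to the closed-form ridge estimate, which shrinks the MLE/OLS coefficients toward zero:

```r
set.seed(2)
n <- 100; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))   # design matrix with an intercept column
w_true <- c(1, 2, 0, 0, -1)
y <- X %*% w_true + rnorm(n)

lambda  <- 5
w_ols   <- solve(t(X) %*% X) %*% t(X) %*% y                     # MLE under a Gaussian likelihood
w_ridge <- solve(t(X) %*% X + lambda * diag(p)) %*% t(X) %*% y  # MAP with a Gaussian prior

cbind(ols = c(w_ols), ridge = c(w_ridge))  # ridge coefficients are shrunk toward zero
```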