Use of this document

The objective of this note is to explain the loss function from a statistical standpoint, specifically maximum a posteriori (MAP) estimation.

library(tidyverse)

(source: https://www.cnblogs.com/LittleHann/p/9608599.html#_label0)
(source: https://medium.com/jarvis-toward-intelligence/%E6%AF%94%E8%BC%83-cross-entropy-%E8%88%87-mean-squared-error-8bebc0255f5)
(source: https://en.wikipedia.org/wiki/Loss_function)

1. Background

1.1 Distribution

par(mfrow = c(2,2), mar = c(4,4,4,4))
plot(function(x) dnorm(x), -3, 3,
     main = "Normal density: dnorm()", xlab = "z", ylab = "density")
plot(function(x) pnorm(x), -3, 3,
     main = "Normal cumulative: pnorm()", xlab = "z", ylab = "cumulative probability")
hist(rnorm(1000),
     main = "Sample from normal distribution: rnorm()", xlab = "z", ylab = "count")
plot(function(x) qnorm(x), 0, 1,
     main = "Normal quantile: qnorm()", xlab = "quantile", ylab = "z")

The functions for the density/mass function, cumulative distribution function, quantile function and random variate generation are named in the form dxxx, pxxx, qxxx and rxxx respectively.

Distribution functions provided by the \(stats\) package:

| distribution | d, density | p, cumulative | q, quantile | r, random generation |
|---|---|---|---|---|
| normal | dnorm() | pnorm() | qnorm() | rnorm() |
| uniform | dunif() | punif() | qunif() | runif() |
| beta | dbeta() | pbeta() | qbeta() | rbeta() |
| binomial (including Bernoulli) | dbinom() | pbinom() | qbinom() | rbinom() |
| Cauchy | dcauchy() | pcauchy() | qcauchy() | rcauchy() |
| chi-squared | dchisq() | pchisq() | qchisq() | rchisq() |
| exponential | dexp() | pexp() | qexp() | rexp() |
| F | df() | pf() | qf() | rf() |
| gamma | dgamma() | pgamma() | qgamma() | rgamma() |
| geometric | dgeom() | pgeom() | qgeom() | rgeom() |
| hypergeometric | dhyper() | phyper() | qhyper() | rhyper() |
| log-normal | dlnorm() | plnorm() | qlnorm() | rlnorm() |
| multinomial | dmultinom() | - | - | rmultinom() |
| negative binomial | dnbinom() | pnbinom() | qnbinom() | rnbinom() |
| Poisson | dpois() | ppois() | qpois() | rpois() |
| Student’s t | dt() | pt() | qt() | rt() |
| Weibull | dweibull() | pweibull() | qweibull() | rweibull() |

(Note: for the multinomial distribution, the \(stats\) package provides only dmultinom() and rmultinom().)
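A quick numeric check of the d/p/q/r convention, using the normal distribution as an example:

dnorm(0)       # density of N(0, 1) at 0: ~0.399
pnorm(1.96)    # cumulative probability P(Z <= 1.96): ~0.975
qnorm(0.975)   # quantile function, the inverse of pnorm(): ~1.96
rnorm(3)       # three random draws from N(0, 1)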

1.2 Statistical inference

  • Orthogonal projection to ordinary least squares (OLS): orthogonal projection finds the closest point to \(Y\) in the column space of \(X\); ordinary least squares is used as the example (see the sketch after this list).
  • Posterior distribution and maximum likelihood estimation (MLE)
  • Prior probability distribution and maximum a posteriori estimation (MAP)
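A minimal sketch of OLS as an orthogonal projection, using simulated data (all names and values below are illustrative):

set.seed(4)
X <- cbind(1, rnorm(50))                    # design matrix with intercept
y <- drop(X %*% c(2, 3) + rnorm(50))        # simulated response

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # normal equations
H <- X %*% solve(t(X) %*% X) %*% t(X)       # projection (hat) matrix onto col(X)
fitted <- drop(H %*% y)                     # orthogonal projection of y
all.equal(fitted, drop(X %*% beta_hat))     # TRUE: projection = X beta_hat
coef(lm(y ~ X - 1))                         # lm() returns the same beta_hat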

1.3 The Norm

In mathematics, a norm is a function from a real or complex vector space to the nonnegative real numbers that behaves in certain ways like the distance from the origin.

The order-\(p\) norm (p-norm) of an \(n\)-dimensional vector is:

\[||X||_{p} = \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}\]

The entrywise p-norm of an \(n \times n\) matrix is:

\[||A||_{p} = \Big(\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^p\Big)^{1/p}\]

Comparison between vector norms and matrix norms:

  • 0-Norm (L0): vector \(||X||_{0} = \sum_{i=1}^{n} |x_i|^0\) (informally, the number of non-zero entries; not a true norm); matrix \(||A||_{0} = \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^0\)
  • 1-Norm (L1): vector \(||X||_{1} = \sum_{i=1}^{n} |x_i|\); matrix \(||A||_{1} = \max\limits_{1 \leq j \leq n} \sum_{i=1}^{n} |a_{ij}|\) (maximum absolute column sum)
  • 2-Norm (L2), also called the Euclidean norm (vectors) or Frobenius norm (matrices): vector \(||X||_{2} = \sqrt{\sum_{i=1}^{n} |x_i|^2}\) (e.g. \(\sqrt{a^2 + b^2 + c^2}\)); matrix \(||A||_{F} = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^2}\)
  • p-Norm: vector \(||X||_{p} = (\sum_{i=1}^{n} |x_i|^p)^{1/p}\); matrix \(||A||_{p} = (\sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij}|^p)^{1/p}\)
  • \(\infty\)-Norm: vector \(||X||_{\infty} = \max\limits_{1 \leq i \leq n} |x_{i}|\); matrix \(||A||_{\infty} = \max\limits_{1 \leq i \leq n} \sum_{j=1}^{n} |a_{ij}|\) (maximum absolute row sum)
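A small sketch checking these definitions against base R, computing the vector norms by hand and the matrix norms via norm():

x <- c(3, -4)                       # example vector
A <- matrix(c(1, -2, 3, 4), 2, 2)   # example 2 x 2 matrix

sum(abs(x))       # L1 vector norm: 7
sqrt(sum(x^2))    # L2 (Euclidean) vector norm: 5
max(abs(x))       # infinity vector norm: 4

norm(A, type = "O")   # matrix 1-norm (maximum absolute column sum)
norm(A, type = "F")   # Frobenius norm
norm(A, type = "I")   # matrix infinity-norm (maximum absolute row sum)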

1.4 Activation function

Common problems related to activation functions:

  • Gradient exploding problem: large error gradients accumulate and result in very large updates to the network weights during training. Causes: 1) the network is too deep; 2) the learning rate is too large.
  • Vanishing gradient problem: with each subsequent layer the magnitude of the gradients gets exponentially smaller (vanishes), making the update steps very small and the learning of the weights in the lower layers of a deep network very slow. Causes: 1) the network is too deep; 2) the magnitude of the activation function's derivative is well below 1.0 over its whole range, so the gradients vanish quickly.
  • Not zero-centered: forces either all-positive or all-negative updates for all \(\omega\) during backpropagation, e.g. sigmoid (mean = 0.5).
  • High computational complexity: slow learning, requiring more training time per update. Cause: computationally expensive activation functions or layers, e.g. the exponentiation in sigmoid and tanh.
  • Dead ReLU problem: a large gradient flowing through a ReLU neuron can update the weights such that the neuron never activates on any data point again, after which the gradient flowing through that unit is forever zero. Causes: 1) an activation function whose gradient is zero over some range, e.g. ReLU when \(x < 0\); 2) parameter initialization that leaves some neurons with \(x < 0\), so they are never activated during training.
  • Infinite range: if the range of the activation extends to \(-\infty\) or \(\infty\), the activations can blow up when the learning rate is too large.
  • Linearity: cannot be used for complex classification, e.g. Leaky ReLU lags behind sigmoid and tanh for some use cases.

Common activation functions are listed below. Note that "logits" in TensorFlow or PyTorch usually just refers to the raw scores \(X\omega\), not the logit function \(\ln(\frac{p}{1-p})\).
  • Sigmoid (a special case of the softmax function for a classifier with only two classes): \(\sigma(x) = \frac{1}{1+e^{-x \omega}} = \frac{e^{x \omega}}{1+e^{x \omega}}\). Use: converts the score to the probability of the positive class in binary classification. Problems: gradient exploding; vanishing gradient; not zero-centered; high computational complexity.
  • Softmax: \(\sigma(\vec{z})_{i} = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\). Use: converts scores to probabilities over the \(K\) classes in multiclass classification.
  • Tanh (hyperbolic tangent): \(\frac{e^x-e^{-x}}{e^x+e^{-x}}\). Use: zero-centered. Problems: gradient exploding; vanishing gradient; high computational complexity.
  • Rectified Linear Unit (ReLU): \(\sigma(x) = \max(x,0)\). Use: creates non-linearity when stacking linear layers; avoids the exploding and vanishing gradient problems when \(x > 0\). Problems: the exploding/vanishing gradient problems are avoided only for \(x > 0\); not zero-centered; dead ReLU problem; infinite range.
  • Exponential Linear Unit (ELU): \(\begin{cases} x & \text{if } x > 0 \\ \alpha (e^x-1) & \text{if } x \leq 0 \end{cases}\). Use: a compromise between ReLU and Leaky ReLU. Problems: high computational complexity; infinite range.
  • Leaky ReLU: \(\max(0.01x, x)\). Use: an improvement on ReLU; avoids the exploding and vanishing gradient problems. Problems: not zero-centered; infinite range; linearity.
  • PReLU: \(\max(\alpha x, x)\), where \(\alpha\) is learned during training. Use: an improvement on Leaky ReLU. Problems: not zero-centered; infinite range.
  • Maxout: \(z_i = \omega_i x + b_i\) for \(i \in \{1, \dots, k\}\), output \(\max_i z_i\). Use: a learned combination of different activation functions. Problems: high computational complexity, with more parameters.
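A quick visual comparison of a few of the activation functions above (a base-R sketch):

x <- seq(-4, 4, length.out = 200)
sigmoid <- function(x) 1 / (1 + exp(-x))
relu    <- function(x) pmax(x, 0)
leaky   <- function(x) pmax(0.01 * x, x)

par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(x, sigmoid(x), type = "l", main = "sigmoid",    xlab = "x", ylab = "")
plot(x, tanh(x),    type = "l", main = "tanh",       xlab = "x", ylab = "")
plot(x, relu(x),    type = "l", main = "ReLU",       xlab = "x", ylab = "")
plot(x, leaky(x),   type = "l", main = "Leaky ReLU", xlab = "x", ylab = "")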

Further reading: activation functions and their pros and cons.

2. Objective function

2.1 Maximum likelihood estimation (MLE)

In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function. The objective function (or risk function) consists only of

  • the likelihood of the data \(X\) given the parameters \(\omega\):

\[likelihood: \ J(\hat{Y}, Y)_{MLE} = p(Y|X, \omega)\]

Taking \(\ln()\) of both sides gives:

\[log\text{-}likelihood: \ L(\hat{Y}, Y)_{MLE} = \ln(J(\hat{Y}, Y)_{MLE}) = \ln(p(Y|X, \omega)) \]
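As a minimal sketch (on simulated data), the MLE of a normal mean and standard deviation can be obtained by maximizing this log-likelihood numerically with optim():

set.seed(1)
y <- rnorm(500, mean = 2, sd = 1.5)   # simulated sample

# negative log-likelihood; sd is parameterized on the log scale to keep it positive
neg_loglik <- function(par) -sum(dnorm(y, mean = par[1], sd = exp(par[2]), log = TRUE))

fit <- optim(c(0, 0), neg_loglik)
c(mean = fit$par[1], sd = exp(fit$par[2]))   # roughly (2, 1.5)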

2.2 Maximum a posteriori estimation (MAP)

In Bayesian statistics, a maximum a posteriori (MAP) estimate is an estimate of an unknown quantity that equals the mode of the posterior distribution (the "inverse probability"). The objective function (or risk function) consists of

  • the likelihood of the data \(X\) given the parameters \(\omega\);
  • the prior distribution of the parameters \(\omega\).

\[posterior: \ J(\hat{Y}, Y)_{MAP} \propto p(Y|X, \omega)\, p(\omega)\]

Taking \(\ln()\) of both sides gives:

\[log\text{-}posterior: \ L(\hat{Y}, Y)_{MAP} = \ln (p(Y|X, \omega)\, p(\omega)) = \ln (p(Y|X, \omega)) + \ln (p(\omega))\]

In mathematics, the arguments of the maxima (abbreviated arg max or argmax) are the points, or elements, of the domain of some function at which the function values are maximized. Supervised learning can thus be interpreted simply as finding the parameters that minimize the error while regularizing the parameters:

\[\omega^* = \arg \max_\omega \left[ \ln (p(Y|X, \omega)) + \ln (p(\omega)) \right]\]

equivalent to

\[\omega^* = \arg \min_\omega \left[ - \ln (p(Y|X, \omega)) - \ln (p(\omega)) \right]\]
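A minimal sketch of the MAP idea on a toy problem (simulated data, a N(0, 1) prior on the mean, and the noise standard deviation fixed at 1):

set.seed(2)
y <- rnorm(30, mean = 1)   # simulated sample

# negative log-posterior = -log-likelihood - log-prior (up to a constant)
neg_log_post <- function(mu) {
  -sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) -   # -log-likelihood
    dnorm(mu, mean = 0, sd = 1, log = TRUE)          # -log-prior
}

optimize(neg_log_post, interval = c(-5, 5))$minimum  # MAP estimate, shrunk toward 0
mean(y)                                              # MLE, for comparison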

3. Minimize error: Loss term

\[-\ln(p(Y|X, \omega)) \propto \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i;\omega), y_{i}) = \frac{1}{n} \sum_{i=1}^{n} \ell(\hat{y}_{i}, y_{i}), \quad \text{where } \hat{y}_{i} = f(x_i;\omega)\]
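For example, with a Gaussian likelihood (sd fixed at 1) the average negative log-likelihood is the mean squared error up to an affine transformation, which is easy to verify numerically (toy sketch):

set.seed(6)
y    <- rnorm(100, mean = 3)
yhat <- rep(mean(y), 100)   # a constant prediction, purely for illustration

mse <- mean((y - yhat)^2)
nll <- -mean(dnorm(y, mean = yhat, sd = 1, log = TRUE))
all.equal(nll, 0.5 * log(2 * pi) + 0.5 * mse)   # TRUE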

3.1 Regression

The following relates regression loss functions to the probability distributions whose maximum likelihood procedure they correspond to; \(y_{i}\) is the observation and \(\hat{y}_{i}\) is the estimate of \(y_{i}\). For each loss: the likelihood of the data \(p(Y|X, \omega)\) where applicable, the formulation of \(\ell(\hat{y}_{i}, y_{i})\), and the prominent algorithm.

  • L1 loss, or Mean Absolute Error loss. Likelihood: Laplace. Formulation: \(|\hat{y}_{i} - y_{i}|\). Use: regression with least absolute deviations (LAD); minimizes the absolute distance. Pro: less likely than the L2 loss to explode the gradient for large residuals. Con: the first derivative is constant, so the loss bounces around the optimum and is hard to converge.
  • L2 loss, or Mean Squared Error (MSE) loss, quadratic loss. Likelihood: Gaussian. Formulation: \((\hat{y}_{i}- y_{i})^2\). Use: regression with ordinary least squares (OLS); minimizes the squared distance. Pro: smooth gradient for small residuals. Con: the gradient easily explodes for large residuals.
  • Smooth L1 loss. Formulation: \(\begin{cases} 0.5(\hat{y}_{i}- y_{i})^2 & \text{if } |\hat{y}_{i}- y_{i}|<1 \\ |\hat{y}_{i}- y_{i}|-0.5 & \text{otherwise} \end{cases}\). Combines the advantages of the L1 and L2 losses.
  • Huber loss. Formulation: \(\begin{cases} \frac{1}{2} (\hat{y}_{i}-y_{i})^2 & \text{for } |\hat{y}_{i}-y_{i}| \leq \delta \\ \delta |\hat{y}_{i}-y_{i}|- \frac{1}{2} \delta^2 & \text{otherwise} \end{cases}\). Use: robust regression; behaves like the (scaled) MAE as \(\delta \rightarrow 0\) and like the MSE as \(\delta \rightarrow \infty\).
  • Poisson loss. Likelihood: Poisson. Formulation: \(e^{\hat{y}_{i}} - y_{i} \hat{y}_{i}\), where \(\hat{y}_{i}\) is the linear predictor on the log scale. Use: Poisson regression (log link) for count data.
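A small sketch of the shapes of these losses as a function of the residual \(r = \hat{y}_{i} - y_{i}\) (with \(\delta = 1\) for the Huber loss):

r <- seq(-3, 3, length.out = 200)
l1    <- abs(r)                                        # L1 loss
l2    <- r^2                                           # L2 loss
huber <- ifelse(abs(r) <= 1, 0.5 * r^2, abs(r) - 0.5)  # Huber, delta = 1

plot(r, l2, type = "l", xlab = "residual", ylab = "loss")
lines(r, l1, lty = 2)
lines(r, huber, lty = 3)
legend("top", legend = c("L2", "L1", "Huber (delta = 1)"), lty = 1:3)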

3.2 Classification

The following relates classification loss functions to the probability distributions whose maximum likelihood procedure they correspond to, where \(y_{i}\) is the observation and \(x_i \omega\) is the unnormalized estimate (the unnormalized logit) of the predicted class score. For each loss: the likelihood of the data \(p(Y|X, \omega)\) where applicable, the formulation of \(\ell(\hat{y}_{i}, y_{i})\), and the prominent algorithm.

  • Zero-one loss. Likelihood: Dirac. Formulation: \(I(\hat{y}_{i} - y_{i} \neq 0) = \begin{cases} 1, & \hat{y}_{i} - y_{i} \neq 0 \\ 0, & \hat{y}_{i} - y_{i} = 0 \end{cases}\). Use: maximum margin classifier with a hard margin on linearly separable problems. Easy to interpret, but cannot distinguish the performance of models that produce the same classifications.
  • Probit loss. Likelihood: probit, i.e. the CDF of the standard normal distribution as the link. Use: probit regression, binary classification.
  • Perceptron loss. Formulation: \(I(|\hat{y}_{i} - y_{i}| \geq t) = \begin{cases} 1, & |\hat{y}_{i} - y_{i}| \geq t \\ 0, & |\hat{y}_{i} - y_{i}| < t \end{cases}\). Use: soft/hard-margin SVM.
  • Hinge loss. Formulation (binary, \(y_{i} = \pm 1\)): \(\max(0, 1 - y_{i} \hat{y}_{i})\), summed over the incorrect classes in the multiclass case. Use: multiclass classification; support vector machines (SVM) with a soft margin on non-linearly separable problems.
  • Logistic loss (logit + sigmoid function + cross-entropy loss). Likelihood: Bernoulli. Formulation: \(-[I_{positive}(y_i) \ln( \sigma(x_i \omega) ) + (1-I_{positive}(y_i)) \ln(1- \sigma(x_i \omega) )]\), where the sigmoid function is \(\sigma(x_i \omega) = \frac{1}{1+e^{-x_i \omega}}\) and \(I_{positive}(y_i) = \begin{cases} 1 & \text{if class} = \text{positive} \\ 0 & \text{if class} \neq \text{positive} \end{cases}\). Use: logistic regression, ordinal logistic regression, binary classification.
  • Cross-entropy loss (logit + sigmoid function + cross-entropy loss). Likelihood: categorical, also called generalized Bernoulli or multinoulli. Formulation: \(-\sum_{j \in Classes} I_j(y_{i}) \ln( \sigma(x_i \omega)_j )\), where \(I_j(y_{i}) = \begin{cases} 1 & \text{if class} = j \\ 0 & \text{if class} \neq j \end{cases}\). Use: multiclass classification when there are not too many classes; maximizes the interclass distance. A generalization of the logistic loss to multiple classes, operating on one-hot labels.
  • Softmax loss, or softmax cross-entropy loss (logit + softmax function + cross-entropy loss). Likelihood: multinomial. Formulation: \(-\ln(\sigma(x_i \omega)_{y_i})\), where the softmax function is \(\sigma(x_i \omega)_{y_i} = \frac{e^{(x_i \omega)_{y_i}}}{\sum_{j \in Classes} e^{(x_i \omega)_j}}\). Use: multinomial logistic regression, multiclass classification; the cross-entropy loss with the softmax function.
  • Large-Margin Softmax loss (L-Softmax loss). Replaces \(x_i \omega\) in the softmax loss with a function of amplitude and angle. An improvement of the softmax loss on the decision margin: it constrains the margin via amplitude and angle, but does not separate classes as cleanly because a class can be discriminated along either of the two metrics.
  • SphereFace: Angular Softmax loss (A-Softmax loss). An improvement of L-Softmax: with normalization of \(\omega\), it constrains the decision margin using a function of the angle only.
  • L2-constrained softmax loss. Adds an L2-normalization layer and a scale layer before the softmax loss (see NormFace). An improvement of the softmax loss for difficult features: normalizing the features of all samples with the L2 norm prevents the DCNN from learning only the easy features and makes it learn equally from difficult samples.
  • CosFace: large margin cosine loss for deep face recognition, or Additive Margin Softmax for face verification (AM-Softmax loss). An improvement of A-Softmax (SphereFace): normalizes the features and maximizes an additive cosine margin to maximize interclass variance and minimize intraclass variance.
  • ArcFace: Additive Angular Margin loss for deep face recognition. An improvement of A-Softmax (SphereFace): normalizes the features and maximizes an additive angular margin (more direct than the cosine margin) to maximize interclass variance and minimize intraclass variance.
  • Contrastive loss. Euclidean embedding. Increases the pairwise distance between samples in different classes and decreases the distance between samples in the same class. Con: uses an identical margin for all pairs of samples from different classes.
  • Triplet loss. Euclidean embedding. An improvement of the contrastive loss: increases the relative interclass distance, but does not consider intraclass compactness.
  • Center loss. Formulation: \(\ell(\hat{y}_{i}, y_{i})_{softmax} + \lambda (x_i - C_{y_i})^2\). Euclidean embedding. An improvement of the contrastive loss: minimizes the absolute intraclass distance. Con: the L2 term is easily affected by outliers within a class, has higher computational complexity, and requires more computation power.
  • Exponential loss. Formulation: \(L(Y|f(X)) = \exp(-\hat{y}_{i} y_{i})\). Use: boosting, AdaBoost.
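As a small check of the logistic loss, the binary cross-entropy evaluated at the fitted probabilities from glm() matches the negative log-likelihood reported by logLik() (simulated data, illustrative only):

set.seed(5)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + 1.5 * x))   # Bernoulli outcomes

fit <- glm(y ~ x, family = binomial)
p   <- fit$fitted.values                     # fitted probabilities sigma(x_i omega)

-sum(y * log(p) + (1 - y) * log(1 - p))      # logistic (cross-entropy) loss
-as.numeric(logLik(fit))                     # the same value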

See also: fitting parameters via MLE for different distributions.

4. Regularizing parameters: Regularization term, or penalty

\[-\ln (p(\omega)) \propto \lambda\, \Omega(\omega) \]

Common penalties for regularization:

  • L1 penalty (absolute value, as in the L1 loss). Prior distribution of \(\hat{\omega}\): Laplace. Formulation: \(|\omega| = \sum_{j} |\omega_j|\). Prominent use: lasso.
  • L2 penalty (squared, as in the L2 loss). Prior distribution of \(\hat{\omega}\): Gaussian. Formulation: \(\omega^2 = \sum_{j} \omega_j^2\). Prominent use: ridge.
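A closing sketch of the regularization term in action: ridge regression as MAP with a zero-mean Gaussian prior on \(\omega\), in closed form on simulated data (the intercept is penalized too, purely for brevity):

set.seed(3)
X <- cbind(1, matrix(rnorm(100 * 3), 100, 3))   # design matrix with intercept
omega_true <- c(1, 2, 0, -1)
y <- X %*% omega_true + rnorm(100)

lambda <- 5
ridge <- drop(solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y))   # MAP / ridge
ols   <- drop(solve(t(X) %*% X, t(X) %*% y))                            # MLE / OLS

cbind(ols = ols, ridge = ridge)   # ridge estimates are shrunk toward zero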