Use of this document

This is a study note on the evolution of Deep Learning.

1. Introduction

2. The 20th Century

2.1 The 1940s to the 1960s

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| McCulloch–Pitts (MCP) neuron, 1943 | - | - | Simplifies the neuron's response into three steps: weighted inputs => sum => non-linear activation. |
| Adaptive Linear Element (ADALINE) | - | - | - |
| Perceptron, 1958 | - | - | Applies the neuron model to classification. |
| Linear model, 1969 | - | - | - |
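As a minimal sketch of the MCP/perceptron idea above (weighted inputs => sum => non-linear activation), the NumPy snippet below is illustrative only; the weights and threshold are made-up values, not taken from the original papers.

```python
import numpy as np

def mcp_neuron(x, w, threshold):
    """MCP-style neuron: weighted inputs -> sum -> step (non-linear) activation."""
    s = np.dot(w, x)                      # weighted sum of the inputs
    return 1 if s >= threshold else 0

# Toy example: an AND gate with hand-picked weights (illustrative values only).
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcp_neuron(np.array(x), w=np.array([1.0, 1.0]), threshold=1.5))
```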

2.2 The 1980s

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| Feed-Forward Network, 1989 | - | - | - |
| SVM, 1990s | - | - | - |
| Boltzmann machine, 1985 | - | - | - |
| Multilayer perceptron (MLP), 1986 | - | - | - |
| LeNet V1, 1989 | Back-propagation, 1986 | - | First applied the back-propagation algorithm to a practical application. |
| Auto Encoder, 1987 | - | - | - |
| , 1991 | - | - | Pointed out the vanishing gradient problem. |
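The last row above points out the vanishing gradient problem; the toy calculation below (my own illustration, not from the 1991 work) shows why it appears: the sigmoid derivative is at most 0.25, so the product of derivatives across many layers shrinks toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # maximum value is 0.25, reached at z = 0

# Even in the best case (derivative = 0.25 at every layer), the gradient that
# reaches the first layer of an n-layer sigmoid network decays like 0.25**n.
for n in (1, 5, 10, 20):
    print(f"{n:2d} layers -> gradient scale <= {0.25 ** n:.2e}")
```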

2.3 The End of the 20th Century

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| Variational Auto Encoder, 1993 | - | - | - |
| LeNet V5, 1998 | Convolution layer | LeNet architecture | The first application of the convolution layer to a classification task. Pro: 1) the convolution layer provides local receptive fields and shared weights, which ensure some degree of shift, scale, and distortion invariance. |
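A rough PyTorch sketch of a LeNet-5-style network, to make the "convolution layer + shared weights" idea concrete. The layer sizes follow the common 32x32 grayscale-input variant; treat the exact channel counts and activations as assumptions, not a faithful reproduction of the 1998 paper.

```python
import torch
import torch.nn as nn

class LeNet5Sketch(nn.Module):
    """LeNet-5-style CNN: conv layers give local receptive fields and shared weights."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 32x32 -> 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # 14x14 -> 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet5Sketch()(torch.randn(1, 1, 32, 32))   # -> shape (1, 10)
```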

3. The 21st Century: Image Classification

3.1 Deeper

The deeper the network, the more detailed the features it can extract, and therefore the higher the accuracy.

Improvements that make deeper networks trainable:

  • Activation function
  • Batch normalization
  • Bypass (skip/shortcut connections)
| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| AlexNet, 2012 | ReLU | ReLU | Pro: 1) mitigates the vanishing gradient problem by replacing Sigmoid (max derivative = 0.25) with ReLU, whose derivative is always 1 in the positive range, so gradients can flow back to the shallower layers during backpropagation; 2) lower computational cost during backpropagation compared to Sigmoid. Con: 1) not zero-centered; 2) dead ReLU problem if the learning rate or the gradient is too high. |
| AlexNet, 2012 | Dropout (0.3~0.5) | Dropout | Pro: 1) dropout at the fully-connected layers regularizes the cost function to avoid overfitting. Con: 1) the cost function is probably not monotonic during training; 2) training takes two to three times longer than without dropout, because only a sub-model is trained at each step. |
| GoogLeNet (Inception), 2014 | Inception module (group convolution) | Inception module | Pro: 1) combines features from various scales. |
| GoogLeNet (Inception), 2014 | Bottleneck layer | Dimension reduction | Pro: 1) compresses the feature map from a high to a low dimension; 2) dimension reduction with 1x1 convolution kernels reduces the number of parameters in the following feature extraction. |
| GoogLeNet (Inception), 2014 | Auxiliary classifier | Auxiliary output | Pro: 1) with BN or dropout inside, the auxiliary classifier regularizes the loss to avoid overfitting, as shown in Inception V2/V3. Con: 1) useless in shallow models; 2) no regularization effect without BN or dropout; 3) not useful in transfer learning because of the extra output. |
| GoogLeNet (Inception), 2014 | Average pooling | Average pooling | Pro: 1) largely reduces the number of parameters by abandoning the fully-connected layer; 2) global average pooling is considered to have a receptive field covering the whole feature map. |
| VGG, 2015 | Small convolution filters (3x3) | Receptive field | Pro: 1) replaces a 7x7 conv kernel with a stack of three 3x3 conv kernels to obtain the same receptive field \(F_i\), where \(F_i = (F_{i+1} - 1) \times \text{Stride} + \text{Ksize}\), with fewer parameters; 2) the smaller kernel size reduces the number of parameters: assuming the input and output both have \(C\) channels, three stacked 3x3 kernels have \((3 \times 3 \times C^2) \times 3 = 27C^2\) parameters, whereas one 7x7 kernel has \(7 \times 7 \times C^2 = 49C^2\). |
| ResNet, 2015 | Residual block | Residual block | Creates short paths from early layers to later layers. Pro: 1) solves the degradation problem with identity mappings that let information flow between shallow and deep layers. |
| ResNet, 2015 | Batch Normalization (Conv_NoBias => BN => ReLU) | Batch Normalization | Pro: 1) alleviates gradient exploding and vanishing in very deep models by normalizing the feature maps to \(N(0, 1)\), removing the scaling effect of the weights \(W\); 2) dropout and BN used together can cause variance shift, and dropout only applies to fully-connected layers, so using BN alone (no dropout) also saves training time. |
| DenseNet | Dense block | Dense block | Each layer has direct access to the gradients from the loss function and to the original input signal, leading to implicit deep supervision. Pro: 1) fewer parameters thanks to the narrower architecture. |
| SENet, 2017 | Squeeze-and-Excitation block | Squeeze-and-Excitation block | Pro: 1) enhances the important channels and weakens the unimportant ones. |
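A minimal PyTorch sketch of the Conv_NoBias => BN => ReLU pattern with an identity shortcut described in the ResNet rows above; the channel count and the fixed-shape shortcut are simplifying assumptions, not the exact paper configuration.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block: two (Conv-no-bias => BN) stages plus an identity shortcut (the "bypass")."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                               # short path from the early layer
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)           # gradients can flow straight through the shortcut

y = BasicResidualBlock(64)(torch.randn(2, 64, 32, 32))   # same shape in, same shape out
```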

3.2 Lighter

Goal: fewer MAdds (multiply-add operations).

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| SqueezeNet, 2017 | - | - | Proposes the Fire module: first compress the number of channels with 1x1 convolutions (squeeze), then extract features with parallel 1x1 and 3x3 convolutions (expand); accuracy is preserved by delaying the down-sampling stages. Overall, SqueezeNet aims to speed up by reducing the number of parameters. |
| MobileNet, 2018 | Separable convolution (Depthwise => Pointwise) | - | Pro: 1) replaces the traditional convolution with a depthwise-separable convolution to reduce computation. Con: 1) the depthwise kernels are easily diminished because ReLU loses high-dimensional feature information; 2) the number of pointwise filters is fixed by the input channel count. |
| MobileNet V2, 2018 | Inverted residual block (Pointwise => Depthwise => Pointwise) | Inverted residual | Uses 1x1 pointwise convolutions as expansion and projection layers to transform information between low and high dimensions. Pro: 1) the low-dimensional tensors flowing along the network reduce the total computation, while feature extraction happens in high dimension inside the bottleneck; 2) the additional pointwise layer before the depthwise layer removes the restriction on the number of pointwise filters. |
| MobileNet V2, 2018 | Linear bottleneck | Linear bottleneck | Pro: 1) replaces ReLU with a linear activation (\(f(x) = x\)) to prevent the loss of information caused by ReLU during the high-to-low-dimension transformation. |
| ShuffleNet, 2017 | - | - | Uses group convolution and channel shuffle to further improve model efficiency. |
| ShuffleNet V2, 2018 | - | - | - |
| CondenseNet | - | - | Learns to keep only the effective dense connections, reducing cost while maintaining accuracy. |
| ShiftNet | - | - | Replaces expensive spatial convolutions with shift operations and pointwise convolutions. |
| Xception | - | - | - |
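A small PyTorch sketch of the Depthwise => Pointwise factorization from the MobileNet rows above; the channel sizes are arbitrary examples, and the parameter comparison at the end is only meant to illustrate the savings.

```python
import torch
import torch.nn as nn

def separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Depthwise 3x3 conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

def count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(64, 128, 3, padding=1, bias=False)
separable = separable_conv(64, 128)
print(count(standard), count(separable))   # ~73.7k vs ~9.2k parameters in this toy setting
```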

4. Natural language processing (NLP)

4.1 Static Word Embedding

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| One-hot | - | sklearn.DictVectorizer() | Pro: 1) each dimension of the one-hot vector represents one word. Con: 1) curse of dimensionality (very many dimensions); 2) very sparse vectors: all entries are zero except the corresponding position; 3) cannot encode word semantics (order, position, distance), and therefore cannot measure the difference between word meanings. |
| Term Frequency (TF) | Count-based co-occurrence matrix | - | Pro: 1) uses overall statistics to represent the global context. Con: 1) no local (semantic) information. |
| Term Frequency–Inverse Document Frequency (TF-IDF) | Importance by inverse frequency | - | An improvement on TF. Pro: 1) utilizes the global context while weighting the importance of a word by its inverse document count. Con: 1) does not consider the context (order, position, distance) around a word. |
| Hashing | - | sklearn.HashingVectorizer() | Pro: 1) reduces memory. |
| Matrix Factorization | - | - | Also called Latent Semantic Analysis. Applies dimension reduction with Singular Value Decomposition to a co-occurrence matrix. Pro: 1) captures global statistics from the co-occurrence matrix. Con: 1) high computational complexity. |
| Word2Vec: Continuous Bag-of-Words (CBOW), 2013 | Distributional hypothesis with a context window | CBOW | An improvement from one-hot to distributional word representations: through word embeddings of the context, the center word is represented by a dense vector. Pro: 1) includes local context information via the sliding-window method; 2) lower complexity than Skip-gram, O(text length). Con: 1) out-of-vocabulary words; 2) cannot handle polysemous words; 3) the word vectors are poor for infrequent words. |
| Word2Vec: Continuous Skip-gram, 2013 | Distributional hypothesis with a context window | Skip-gram | Through word embedding (training the dictionary \(W\)), the word representation is converted from a one-hot encoding without meaning into a lower-dimensional vector with contextual meaning. Pro: 1) brings in local context information by using the center word to predict its context; 2) dimension reduction; 3) words become semantically comparable via cosine similarity (range -1 to 1); 4) better performance than CBOW when the dataset is small; 5) negative sampling reduces the computational cost by converting the multi-class problem into binary classification. Con: 1) out-of-vocabulary words; 2) cannot handle polysemous words; 3) higher complexity than CBOW, O(text length × window length). |
| GloVe, 2014 | Co-occurrence probability matrix | - | A combination of Word2Vec and Matrix Factorization that uses both local and global information. Pro: 1) assumes that the ratios of co-occurrence probabilities of word pairs within a sliding window carry meaning; 2) enables gradient descent by constructing a weighted least-squares loss derived from those ratios, lowering the computational cost compared to Matrix Factorization; 3) by optimizing the loss based on pairwise probability relationships within a sliding window, GloVe captures similarity and analogy in text. |
| Gaussian embedding | - | - | Pro: 1) embeds each word as a Gaussian distribution with a mean vector and a variance vector to represent the uncertainty of the word; 2) uses KL divergence to measure the similarity between two words. |
| Poincaré embedding | - | - | Converts discrete data in Euclidean space into a continuous representation in a non-Euclidean space, for data with an inherent hierarchical structure, e.g., trees. |
| FastText | - | - | - |
| Feedforward Neural Network Language Model (NNLM), 2003 | - | - | - |
| Recurrent Neural Network Language Model (RNNLM), 2010 | - | - | - |
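To make the one-hot / TF / TF-IDF rows concrete, here is a small scikit-learn sketch; the toy corpus is invented, and cosine similarity is the same measure the Word2Vec rows mention for comparing vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat", "the dog sat on the log"]   # toy corpus

tf = CountVectorizer().fit_transform(corpus)        # term-frequency counts (sparse matrix)
tfidf = TfidfVectorizer().fit_transform(corpus)     # counts re-weighted by inverse document frequency

print(tf.toarray())                                 # raw term counts per document
print(cosine_similarity(tfidf[0], tfidf[1]))        # similarity of the two documents, range [-1, 1]
```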

4.2 Dynamic (Contextualized) Word Embedding

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| CoVe: Contextualized Word Vectors, 2018 | Context vector | Context vector | LSTM-based. After training a neural machine translation model (Seq2Seq Bi-LSTM) with static vectors (e.g., GloVe) as input, the encoder is used to produce context vectors, which are appended to the static vectors. Pro: 1) the same word is encoded with a different vector in different contexts. Con: 1) the vanishing gradient problem prevents capturing long-term dependencies; 2) CoVe is a sequential model that cannot be computed in parallel. |
| ELMo: Embeddings from Language Models, 2018 | Context vector | Context vector | An improvement of CoVe's context vectors; LSTM-based. Pro: 1) represents context vectors better, with syntactic information from shallow layers and semantic information from deep layers. Con: 1) the vanishing gradient problem prevents capturing long-term dependencies; 2) like CoVe, ELMo is a sequential model that cannot be computed in parallel. |
| BERT: Bidirectional Encoder Representations from Transformers, 2019 | Transformer | - | Transformer-based. Pro: 1) captures long-term dependencies by avoiding the vanishing gradient problem; 2) the self-attention mechanism allows parallel computation; 3) masked language model. Con: 1) discrepancy between training and testing; 2) the autoencoding objective rests on an independence assumption: the predicted tokens are treated as independent of each other, which is not true in practice. |
| GPT, 2019 | - | - | Similar in spirit to ELMo, but uses a Transformer instead of LSTMs. |
| RoBERTa, 2019 | Dynamic masking | - | An improvement on BERT. Pro: 1) obtains a more robust result by dynamically changing the masking pattern applied to the training data; 2) trains the model longer, with bigger batches, over more data, and on longer sequences. |
| XLNet | Two-stream self-attention | - | An improvement of the autoregressive approach that considers all possible factorization orders. Pro: 1) unlike the autoencoding approach in BERT, XLNet uses an autoregressive objective that keeps the dependency between predicted tokens while still incorporating contextual information. |
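A bare-bones NumPy sketch of the scaled dot-product self-attention that lets Transformer models such as BERT process all tokens in parallel; the random Q/K/V matrices stand in for the learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the key dimension
    return weights @ V                                    # every token attends to every other token

seq_len, d_model = 5, 8
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(seq_len, d_model))           # stand-ins for learned projections
print(scaled_dot_product_attention(Q, K, V).shape)        # (5, 8)
```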

4.3 Model compression

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| , 2017 | Sparse priors | - | - |
| ALBERT, 2019 | Sparse matrix factorization | - | - |
| TinyBERT, 2019 | Knowledge distillation | Knowledge distillation | - |
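A hedged PyTorch sketch of the knowledge-distillation loss behind TinyBERT-style compression (soft teacher targets with a temperature, mixed with the hard-label loss). The temperature and mixing weight are arbitrary example values, and TinyBERT itself also distills intermediate layers, which is not shown here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the soft-target KL loss (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # rescale to compensate for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(4, 10, requires_grad=True)   # toy logits: batch of 4, 10 classes
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student, teacher, labels))
```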

4.4 Unsorted

| Paper, Year | Core Idea | Demo | Pro and Con |
|---|---|---|---|
| LSTM_NMT | - | - | - |
| Char Embedding | - | - | - |
| TextCNN, 2014 | - | - | - |
| CharTextCNN, 2015 | - | - | Pro: 1) simple model structure that works well on large corpora; 2) usable for any language, since no word segmentation is needed; 3) works well on noisy text because the OOV problem is largely absent. Con: 1) character-level texts are very long, which makes classifying long documents harder; 2) only character-level information is used, so the model learns little semantic information; 3) performs poorly on small corpora. |
| BahDanau_NMT | - | - | - |
| Han_Attention | Attention | - | - |
| SMG | - | - | - |
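Since the CharTextCNN row above discusses character-level CNN classification, here is a rough PyTorch sketch of a TextCNN-style classifier over (character or word) embeddings; the vocabulary size, embedding dimension, filter widths, and class count are made-up example values.

```python
import torch
import torch.nn as nn

class TextCNNSketch(nn.Module):
    """TextCNN-style classifier: embed tokens, apply 1-D convs of several widths, max-pool, classify."""
    def __init__(self, vocab_size=5000, embed_dim=128, num_classes=4, widths=(3, 4, 5), channels=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([nn.Conv1d(embed_dim, channels, w) for w in widths])
        self.fc = nn.Linear(channels * len(widths), num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens).transpose(1, 2)      # -> (batch, embed_dim, seq_len)
        pooled = [conv(x).relu().max(dim=-1).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=-1))

logits = TextCNNSketch()(torch.randint(0, 5000, (2, 50)))   # -> shape (2, 4)
```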