This is a study note on the evolution of deep learning.
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
McCulloch–Pitts (MCP) neuron, 1943 | - | - | Simplifies the neuron's response into three steps: weighted inputs => sum => non-linear activation. |
Adaptive linear element (ADALINE) | - | - | - |
Perceptron, 1958 | - | - | Applies the neuron model to classification. |
linear model, 1969 | - | - | - |
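
The three steps of the MCP neuron (weighted inputs => sum => non-linear activation) can be sketched in a few lines. This is a minimal illustration, not the original 1943 formulation: the function name `mcp_neuron`, the unit weights, and the threshold value are assumptions chosen for the example.

```python
def mcp_neuron(inputs, weights, threshold):
    """Weighted inputs => sum => non-linear (step) activation."""
    weighted = [w * x for w, x in zip(weights, inputs)]  # step 1: weight each input
    total = sum(weighted)                                # step 2: sum
    return 1 if total >= threshold else 0                # step 3: step activation


# Hypothetical setting: unit weights and threshold 2 make the neuron act as logical AND.
and_outputs = [mcp_neuron([a, b], [1, 1], threshold=2) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Lowering the threshold to 1 with the same weights turns the same neuron into a logical OR, which shows that the behavior is decided entirely by the weights and the threshold.
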
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
Feed-Forward Network, 1989 | - | - | - |
SVM, 1990s | - | - | - |
Boltzmann machine, 1985 | - | - | - |
Multilayer perceptron (MLP), 1986 | - | - | - |
LeNet V1, 1989 | Back-propagation, 1986 | - | The first to apply the back-propagation algorithm to a practical application. |
Auto Encoder, 1987 | - | - | - |
, 1991 | - | - | Points out the vanishing gradient problem. |
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
Variational Auto Encoder, 1993 | - | - | - |
LeNet V5, 1998 | Convolution layer | - | The first application of convolution layers to a classification task. Pro: 1) Convolution layers provide local receptive fields and shared weights, which ensure some degree of shift, scale, and distortion invariance. |
The deeper the network, the more detailed the features it can extract, and therefore the higher the accuracy.
Improvements for deeper networks:
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
AlexNet, 2012 | ReLU | - | Pro: 1) Mitigates the vanishing gradient problem by replacing sigmoid (maximum derivative 0.25) with ReLU, whose derivative is always 1 in the positive range, so the gradient can flow back to the shallower layers during backpropagation. 2) Lower computational cost during backpropagation than sigmoid. Con: 1) The output is not zero-centered. 2) Dead ReLU problem if the learning rate or the gradient is too large. |
AlexNet, 2012 | Dropout (0.3~0.5) | - | Pro: 1) Applies dropout at the fully connected layers as regularization to avoid overfitting. Con: 1) The cost function is probably not monotonic during training. 2) Training takes two to three times longer than without dropout, because only a sub-model is trained at each training iteration. |
GoogLeNet, or Inception, 2014 | Inception module, or group convolution | - | Pro: 1) Combines features from various scales. |
GoogLeNet, or Inception, 2014 | Bottleneck layer | - | Pro: 1) Compresses the feature map from a high to a low dimension. 2) The dimension reduction with a 1x1 convolution kernel reduces the number of parameters in the feature extraction that follows. |
GoogLeNet, or Inception, 2014 | Auxiliary classifier | - | Pro: 1) If the auxiliary classifier contains BN or dropout, it regularizes the loss and helps avoid overfitting, as shown in Inception V2 and V3. Con: 1) It is useless in shallow models. 2) It has no regularization effect without BN or dropout. 3) It is not useful in transfer learning because of the extra output. |
GoogLeNet, or Inception, 2014 | Average pooling | - | Pro: 1) Greatly reduces the number of parameters by dropping the fully connected layer. 2) Average pooling is considered to have a receptive field covering the whole feature map. |
VGG, 2015 | Small convolution filter (3x3) | - | Pro: 1) Replaces one 7x7 convolution kernel with a stack of three 3x3 kernels to obtain the same receptive field, \(F_i\), with fewer parameters: \(F_i = (F_{i+1} - 1) \times Stride + Ksize\). 2) The smaller kernel size reduces the number of parameters: assuming the input and the output both have \(C\) channels, three 3x3 kernels have \((3 \times 3 \times C^2) \times 3 = 27C^2\) parameters, while one 7x7 kernel has \(7 \times 7 \times C^2 = 49C^2\). |
ResNet, 2015 | Residual Block | - | Creates short paths from early layers to later layers. Pro: 1) Solves the degradation problem with identity mapping, which lets information pass between shallow and deep layers. |
ResNet, 2015 | Batch Normalization (Conv_NoBias=>BN=>ReLU) | - | Pro: 1) Mitigates exploding and vanishing gradients in very deep models by normalizing the feature maps to zero mean and unit variance, which removes the scaling effect of the weights \(W\). 2) Using dropout together with BN has been shown to cause variance shift, and dropout is only applied to fully connected layers, so use BN alone without dropout to save training time. |
DenseNet | Dense block | - | Each layer has direct access to the gradients from the loss function and to the original input signal, leading to implicit deep supervision. Pro: 1) Fewer parameters thanks to the narrower architecture. |
SENet, 2017 | Squeeze-and-Excitation block | - | Pro: 1) Enhances the important channels and weakens the unimportant ones. |
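
The VGG row's receptive-field recursion and parameter counts can be checked numerically. The sketch below is only an illustration: the helper names `receptive_field` and `conv_params` and the channel count `C = 64` are assumptions, and biases are ignored as in the formulas of the table.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from input to output.

    Applies F_i = (F_{i+1} - 1) * stride + ksize from the last layer back to the input.
    """
    field = 1  # start from a single output unit
    for ksize, stride in reversed(layers):
        field = (field - 1) * stride + ksize
    return field


def conv_params(ksize, channels, n_layers=1):
    """Weights (no bias) of n_layers stacked ksize x ksize convolutions with C in/out channels."""
    return n_layers * ksize * ksize * channels * channels


C = 64  # hypothetical channel count
rf_three_3x3 = receptive_field([(3, 1)] * 3)
rf_one_7x7 = receptive_field([(7, 1)])
params_three_3x3 = conv_params(3, C, n_layers=3)
params_one_7x7 = conv_params(7, C)

print(rf_three_3x3, rf_one_7x7)
print(params_three_3x3, params_one_7x7)
```

Both configurations cover a 7x7 input region, while the stacked 3x3 version needs 27/49 of the weights and gains two extra non-linearities between the layers.
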
Fewer multiply-adds (MAdds):
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
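SqueezeNet, 2017 | - | - | Proposes the Fire module: first compress the number of channels with 1x1 convolutions (Squeeze), then extract features with parallel 1x1 and 3x3 convolutions (Expand). Accuracy is preserved by delaying the downsampling stages. Overall, SqueezeNet aims to speed up the model by reducing the number of parameters. |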
MobileNet, 2018 | Separable convolution (Depthwise=>PointWise) | - | Pro: 1) Replaces the traditional convolution with a separable convolution to reduce computation. Con: 1) The depthwise kernels easily diminish because ReLU loses high-dimensional feature information. 2) The number of depthwise filters is fixed by the number of input channels. |
MobileNet V2, 2018 | Inverted residual block (PointWise=>Depthwise=>PointWise) | - | Uses 1x1 pointwise convolutions as the expansion and projection layers to move information between low, high, and low dimensions. Pro: 1) The low-dimensional tensors flowing along the network reduce total computation, while feature extraction happens at high dimension inside the block. 2) The additional pointwise convolution before the depthwise convolution removes the restriction on the number of depthwise filters. |
MobileNet V2, 2018 | Linear Bottleneck | - | Pro: 1) Replaces ReLU with a linear activation (\(f(x) = x\)) to prevent ReLU from losing information during the projection from high to low dimension. |
ShuffleNet, 2017 | - | - | Uses group convolution and channel shuffle to further improve model efficiency. |
ShuffleNet V2, 2018 | - | - | - |
CondenseNet | - | - | Learns to keep only the effective dense connections, reducing computation while maintaining accuracy. |
ShiftNet | - | - | Replaces expensive spatial convolutions with shift operations and pointwise convolutions. |
Xception | - | - | - |
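
To make the "reduce computation" claim for separable convolution concrete, the multiply-add counts of the two layer types can be compared directly. This is a rough sketch under stated assumptions: stride 1 with same padding (so the output keeps the input's spatial size), no bias, and a hypothetical layer with 64 input and output channels on a 56x56 feature map.

```python
def standard_conv_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a k x k standard convolution producing an h x w x c_out output."""
    return k * k * c_in * c_out * h * w


def separable_conv_madds(k, c_in, c_out, h, w):
    """Multiply-adds of a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
k, c_in, c_out, h, w = 3, 64, 64, 56, 56
standard = standard_conv_madds(k, c_in, c_out, h, w)
separable = separable_conv_madds(k, c_in, c_out, h, w)
ratio = separable / standard

print(standard, separable, round(ratio, 4))
```

The ratio simplifies to 1/c_out + 1/k^2, so with 3x3 kernels the separable version costs roughly one eighth to one ninth of the standard convolution once the channel count is large.
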
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
One-hot | - | - | sklearn.feature_extraction.DictVectorizer(). Pro: 1) Each dimension of the one-hot vector represents one word. Con: 1) Curse of dimensionality: the vector has as many dimensions as the vocabulary. 2) Very sparse vectors: all entries are zero except the one at the word's position. 3) Cannot encode word semantics (order, position, distance). 4) As a result, cannot measure the difference in meaning between words. |
Term Frequency (TF) | Count-based co-occurrence matrix | - | Pro: 1) Uses overall statistics to represent the global context. Con: 1) No local (semantic) information. |
Term Frequency–Inverse Document Frequency (TF-IDF) | Importance by inverse frequency | - | An improvement on TF. Pro: 1) Uses the global context while weighting the importance of a word by its inverse document count. Con: 1) Does not consider the context (order, position, distance) of a word. |
Hashing | - | - | sklearn.feature_extraction.text.HashingVectorizer(). Pro: 1) Reduces memory usage. |
Matrix Factorization | - | - | Also called latent semantic analysis (LSA). Applies dimension reduction with singular value decomposition (SVD) to a co-occurrence matrix. Pro: 1) Captures global statistics from the co-occurrence matrix. Con: 1) High computational complexity. |
Word2Vec: Continuous Bag-of-Words Model (CBOW), 2013 | Distributional hypothesis with context window | - | An improvement from one-hot to distributional word representations. Through word embedding on the context, the center word can be represented by a vector. Pro: 1) Includes local context information through the sliding-window method. 2) Compared to Skip-gram, it has lower complexity, O(text length). Con: 1) Cannot handle out-of-vocabulary words. 2) Cannot handle polysemous words. 3) The word vectors are poor for less frequent words. |
Word2Vec: Continuous Skip-gram Model, 2013 | Distributional hypothesis with context window | - | Through word embedding (training the dictionary \(W\)) on a word, the word representation is converted from a one-hot encoding without meaning into a lower-dimensional vector with contextual meaning. Pro: 1) Brings in local context information by using a word to infer its context. 2) Dimension reduction. 3) Words can be compared semantically using cosine similarity, which ranges from -1 to 1. 4) Better performance than CBOW when the data size is small. 5) Negative sampling reduces computational complexity by converting the multiclass problem into binary classification. Con: 1) Cannot handle out-of-vocabulary words. 2) Cannot handle polysemous words. 3) Compared to CBOW, it has higher complexity, O(text length * window length). |
GloVe, 2014 | Co-occurrence probability matrix | - | A combination of Word2Vec and matrix factorization that uses both local and global information. Pro: 1) Assumes that the co-occurrence probabilities of any two words within a particular sliding window have a fixed ratio. 2) Enables gradient descent by constructing a weighted least-squares error derived from that ratio; compared to matrix factorization, this lowers the computational complexity. 3) By optimizing a loss function based on the pairwise probability relationships within a sliding window, GloVe can capture similarity and analogy in text. |
Gaussian embedding | - | - | Pro: 1) Embeds each word as a Gaussian distribution, with a mean vector and a variance vector, to represent the uncertainty of the word. 2) Uses KL divergence to measure the similarity between two words. |
Poincaré embedding | - | - | Embeds discrete data into a continuous non-Euclidean (hyperbolic) space instead of Euclidean space, for data with an inherent hierarchical structure, e.g., tree structures. |
FastText | - | - | - |
Feedforward Neural Network Language Model (NNLM), 2003 | - | - | - |
Recurrent Neural Net Language Model (RNNLM), 2010 | - | - | - |
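
The TF-IDF weighting and the cosine similarity mentioned in the table can be sketched without any library. This is a minimal version under stated assumptions: it uses the plain, unsmoothed weighting tf * log(N / df), which differs from the smoothed variant used by default in scikit-learn's TfidfVectorizer, and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns (vocab, vectors), one TF-IDF vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[tok] / len(doc) * idf[tok] for tok in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus, made up for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
vocab, vectors = tf_idf(docs)
the_weights = [vec[vocab.index("the")] for vec in vectors]
sim_01 = cosine(vectors[0], vectors[1])

print(the_weights)
print(round(sim_01, 3))
```

The word "the" occurs in every document, so its inverse document frequency and therefore its weight is zero everywhere. Shuffling the tokens inside a document leaves its vector unchanged, which is the "does not consider order or position" limitation listed in the table.
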
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
CoVe: Contextualized Word Vectors, 2018 | Context vector | - | LSTM-based model. After training a neural machine translation model (Seq2Seq Bi-LSTM) on static input vectors (e.g., GloVe), uses the encoder to produce context vectors and appends them to the static vectors. Pro: 1) Encodes the same word in different contexts with different vectors. Con: 1) Vanishing gradient problem, so it cannot capture long-term dependencies. 2) CoVe is a sequential model, so it cannot be computed in parallel. |
ELMo: Embeddings from Language Models, 2018 | Context vector | - | An improvement on CoVe's context vectors. LSTM-based model. Pro: 1) Better context vectors, with syntactic information from the shallow layers and semantic information from the deep layers. Con: 1) Vanishing gradient problem, so it cannot capture long-term dependencies. 2) Like CoVe, ELMo is a sequential model, so it cannot be computed in parallel. |
BERT: Bidirectional Encoder Representations from Transformers, 2019 | Transformer | - | Transformer-based model. Pro: 1) Captures long-term dependencies by avoiding the vanishing gradient problem. 2) The self-attention mechanism allows parallel computation. 3) Masked language model. Con: 1) Discrepancy between training and testing: the [MASK] token appears only during pretraining. 2) The autoencoding method relies on an independence assumption, namely that the predicted tokens do not depend on each other, which is not true in practice. |
GPT, 2019 | - | - | Similar to ELMo, but uses a Transformer instead of an LSTM. |
RoBERTa, 2019 | Dynamic masking | - | An improvement on BERT. Pro: 1) Obtains more robust results by dynamically changing the masking pattern applied to the training data. 2) Trains the model longer, with bigger batches, over more data, and on longer sequences. |
XLNet | Two-stream self-attention | - | An improvement on the autoregressive method that considers all possible factorization orders. Pro: 1) Unlike the autoencoding method in BERT, XLNet uses an autoregressive method, which allows dependencies between predicted tokens, modified so that it also considers contextual information. |
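
The difference between a fixed masking pattern and the dynamic masking listed for RoBERTa can be illustrated with a toy sampler. This is a simplified sketch, not the actual preprocessing of either model: the helper `mask_tokens` is hypothetical, the 15% masking rate follows BERT, and the rule that sometimes keeps or randomly replaces the selected tokens is left out.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK]; returns (masked_tokens, positions)."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    return masked, positions


# Toy sentence, made up for illustration (15 tokens).
tokens = "the quick brown fox jumps over the lazy dog near the quiet river bank today".split()

# Static masking: the pattern is sampled once during preprocessing and reused every epoch.
static_masked, static_pos = mask_tokens(tokens, random.Random(0))
static_epochs = [static_pos for _ in range(4)]

# Dynamic masking: a new pattern is sampled every time the sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng)[1] for _ in range(4)]

print(static_epochs)
print(dynamic_epochs)
```

With static masking the model sees the same masked positions in every epoch, while dynamic masking exposes it to different prediction targets for the same sentence over the course of training.
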
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
, 2017 | Sparse priors | - | - |
ALBERT, 2019 | Sparse matrix factorization | - | - |
TinyBERT, 2019 | Knowledge distillation | - | - |
Paper, Year | Core Idea | Demo | Pro and Con |
---|---|---|---|
LSTM_NMT | - | - | - |
Char Embedding | - | - | - |
TextCNN, 2014 | - | - | - |
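CharTextCNN, 2015 | - | - | Pro: 1) The model structure is simple and works well on large corpora. 2) It can be used for any language and needs no word segmentation. 3) It performs well on noisy text because there is essentially no OOV problem. Con: 1) Character-level text is very long, which makes it hard to classify long documents. 2) Only character-level information is used, so the model learns less semantic information. 3) It performs poorly on small corpora. |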
BahDanau_NMT | - | - | - |
Han_Attention | - | - | - |
SMG | - | - | - |