The deeper the model,
the more successful the feature extraction,
the higher the performance.
This note discusses a simple question:
How can a model be built deeper?
The abstract answer is:
By controlling the magnitude and direction of the gradient.
1. Terminology
- the weight of one neuron: \(W_j\)
- the output of the jth neuron: \(Z_j\)
- the gradient, or weight update: \(\frac{\Delta loss}{\Delta W_j}\)
- the ith element of an object: \(()^{i}\)
- the matrix with m by n dimension: \(()_{m,n}\)
2. Problem: Exploding/Vanishing gradient
\[Final \ solution: \ W_jx=> BN() => LeakyReLU()\]
2.1 Batch normalization, BN:
- Problem: As \(|x| \neq |Wx|\), the shift in mean and variance accumulates over a series of \(|Wx|\).
- Solution: Batch normalization
\[Z^{(i)}=(W_jx)^{(i)}, \quad BN(Z^{(i)})=\frac{Z^{(i)}-E(Z^{(i)})}{std(Z^{(i)}) + \epsilon}\]
- Effect:
- In forward propagation, BN keeps \(W_jx\) at mean 0 and variance 1 (approximately \(Normal(0,1)\)), so the scale of \(W_jx\) stays the same at every layer.
- Therefore, in backward propagation, the magnitude of the gradient \(|\frac{\Delta loss}{\Delta W_j}|\) stays in a similar range at every layer.
- Result: focus on \(W\), reducing the chance that amplification or attenuation accumulates into an exploding/vanishing gradient.
- Additional note: maintaining \(|W_j| \sim Normal(0,1)\) directly is computationally heavy, so BN is a compromise between computational cost and reducing the shift in the mean and variance of \(W\).
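The normalization formula above can be checked on a small batch. This is a minimal sketch in plain Python, assuming scalar pre-activations and omitting the learnable scale and shift that full BN layers usually add; the function name `batch_norm` and the sample values are hypothetical.

```python
import math

def batch_norm(z, eps=1e-5):
    """Standardize a batch of scalars as (z - mean) / (std + eps)."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in z]

# Hypothetical batch of pre-activations W_j x whose mean and scale have drifted.
z = [10.0, 12.0, 14.0, 16.0]
z_hat = batch_norm(z)

# After normalization the batch has mean close to 0 and variance close to 1.
mean_hat = sum(z_hat) / len(z_hat)
var_hat = sum((v - mean_hat) ** 2 for v in z_hat) / len(z_hat)
print(mean_hat, var_hat)
```

The \(\epsilon\) in the denominator also keeps a constant batch (where std is 0) from causing a division by zero.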
2.2 Leaky Rectified Linear Unit, LeakyReLU
- Problem:
- the traditional activation function sigmoid() has a saturated gradient \(\frac{\Delta sigmoid(x)}{\Delta x} \approx 0\) when the input is far from 0 (for example, the derivative is below 0.01 for \(|x|>5\)). The weight update is suppressed when the input value is extreme.
- The highest derivative of sigmoid() is 0.25, \(\frac{\Delta sigmoid(x)}{\Delta x} \in (0,0.25]\). From the activation alone, each sigmoid layer scales the back-propagated gradient by at most 0.25 (ten layers give at most \(0.25^{10} \approx 10^{-6}\)), so the gradient diminishes toward the shallower layers and their weights can hardly be updated.
- Solution: Leaky Rectified Linear Unit
\[LeakyReLU(x) = max(\alpha x,x), \quad 0<\alpha<1\]
- Effect:
- In forward propagation, LeakyReLU provides nonlinear activation by treating \(x>0\) with identity mapping and \(x \leq 0\) with suppression by the factor \(\alpha\).
- In backward propagation, LeakyReLU enables consistent weight update \(\frac{\Delta LeakyReLU(x)}{\Delta x} = \begin{cases} \alpha \ if \ x \leq 0 \\ 1 \ if \ x > 0 \end{cases}\) (for example \(\alpha = 0.001\)) in both the positive and negative ranges.
- Result: focus on \(x\), providing nonlinear activation that covers the full input range to avoid a vanishing gradient.
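The contrast between the two activations can be sketched numerically. The example below is a simplified illustration: it ignores the weight matrices and assumes the same input reaches every layer, so it only shows the factor contributed by the activation itself. The helper `chained_grad` and the depth of 10 are hypothetical choices.

```python
import math

def sigmoid_grad(x):
    """Derivative of sigmoid: s(x) * (1 - s(x)), at most 0.25 (reached at x = 0)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def leaky_relu(x, alpha=0.001):
    """LeakyReLU as max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)

def leaky_relu_grad(x, alpha=0.001):
    """Derivative of LeakyReLU: alpha for x <= 0, 1 for x > 0."""
    return 1.0 if x > 0 else alpha

def chained_grad(local_grad, x, depth):
    """Product of the same local derivative over `depth` stacked activations."""
    g = 1.0
    for _ in range(depth):
        g *= local_grad(x)
    return g

depth = 10  # hypothetical number of stacked layers
print(chained_grad(sigmoid_grad, 0.0, depth))     # best case for sigmoid: 0.25 ** 10
print(chained_grad(leaky_relu_grad, 2.0, depth))  # positive input: the factor stays at 1.0
```

Even at its most favorable input, the sigmoid factor shrinks by four orders of magnitude every few layers, while the LeakyReLU factor on the positive side is unchanged by depth.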
3. Problem: Gradient degradation
3.1 Residual block
- Problem: Information loss occurs when passing through a series of nonlinear activations.
- Solution: Residual block
\[Z_j + f(Z_j) = Z_{j+1}\]
- Effect:
- In forward propagation, while the residual mapping extracts features, the identity mapping passes the shallow information to the deeper layers.
- In backward propagation, the identity mapping always has a gradient equal to 1, and the residual mapping adds its own gradient \(\frac{\Delta f(Z_j)}{\Delta Z_j}\) on top of it. The total gradient flow through the residual block is:
\[\frac{\Delta loss}{\Delta Z_j} = \frac{\Delta loss}{\Delta Z_{j+1}} \frac{\Delta Z_{j+1}}{\Delta Z_j} = \frac{\Delta loss}{\Delta Z_{j+1}} (1 + \frac{\Delta f(Z_j)}{\Delta Z_j})\]
- Result: the gradient through the block always contains the constant term 1 from the identity mapping, so it does not vanish even when the gradient of the residual mapping is close to 0, and gradient degradation is avoided.
- Extreme case: When the residual equals 0 (the block only has to map x to x), the residual mapping contributes nothing and the residual block becomes the identity mapping \(Z_j + 0 = Z_{j+1}\).
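The factor \(1 + \frac{\Delta f(Z_j)}{\Delta Z_j}\) can be compared against a plain stack without the identity path. This is a scalar sketch under stated assumptions: the residual branch `f` is a hypothetical tanh with a deliberately tiny weight of 0.01, chosen so that its own gradient is small.

```python
import math

def f(z, w=0.01):
    """Hypothetical residual branch: a tiny weight followed by tanh."""
    return math.tanh(w * z)

def f_grad(z, w=0.01):
    """Derivative of f with respect to z: w * (1 - tanh(w z)^2)."""
    return w * (1.0 - math.tanh(w * z) ** 2)

def backprop_factor(z, depth, residual):
    """Multiply the local derivatives dZ_{j+1}/dZ_j over `depth` blocks."""
    factor = 1.0
    for _ in range(depth):
        local = f_grad(z)      # gradient of the branch at the current input
        if residual:
            local += 1.0       # identity path: Z_{j+1} = Z_j + f(Z_j)
            z = z + f(z)
        else:
            z = f(z)           # plain stack: Z_{j+1} = f(Z_j)
        factor *= local
    return factor

plain = backprop_factor(1.0, depth=10, residual=False)
res = backprop_factor(1.0, depth=10, residual=True)
print(plain, res)
```

With this branch, the plain stack multiplies ten factors of roughly 0.01, while the residual stack multiplies ten factors of roughly 1.01, so only the residual version keeps a usable gradient at the first block.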
4. Problem: Equal importance assumption
4.1 Attention
- Problem: the model does not distinguish between useful information (channel/feature, space, time) and useless information.
- Solution: weighted summation (Hadamard product) of value \((Z_j)_{m,n}\) and weight \(att(q,k)_{n,n}\), where \(att(q,k)_{n,n} = (\frac{similarity(q,k)}{|q||k| +\epsilon})_{n,n}\) and \(|att(q,k)_{n,n}| = 1\).
\[(Z_j)_{m,n} \circ att(q,k)_{n,n} = (Z_{j+1})_{m,n}\]
- Perform dimensionality reduction using an affine mapping to a lower-dimensional space to get q and k.
- In this lower-dimensional space, obtain the element-wise relation by calculating the element-wise similarity \(att(q,k)\).
- Use the Hadamard product to reweight the value \(v\) with \(att(q,k)\).
- Effect:
- In forward propagation, useful information is amplified, whereas useless information is suppressed.
- In backward propagation, the gradient through the identity mapping is \(|\frac{\Delta Z_{j+1}}{\Delta Z_j}| = |att(q,k)| = 1\). Unless all information is equally important or equally unimportant, the norm of the attention is never zero, and the gradient through the affine mapping is never zero. The total gradient flow through the attention block is:
\[\frac{\Delta loss}{ \Delta Z_j} = \frac{\Delta loss}{ \Delta Z_{j+1}} \frac{\Delta Z_{j+1}}{ \Delta Z_j} = \frac{\Delta loss}{ \Delta Z_{j+1}}( att(q,k) + Z_j \frac{\Delta att(q,k)}{ \Delta Z_j} )\]
- Result: the weights are updated efficiently toward learning the important information.
- Extreme case: When all information is equally important (collinear), \(att(q,k)\) is the identity matrix, so \(|att(q,k)|=|I|=1\). The whole attention block becomes an identity mapping with a constant gradient of 1. When the information is not important at all, a lower-dimensional space does not exist, and \(att(q,k)\) becomes zero.
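The three steps of the attention block (projection, similarity, reweighting) can be sketched in plain Python. Several choices here are assumptions rather than part of the note: the similarity is taken to be the dot product, so \(att(q,k)\) is the cosine similarity between columns of q and k; the projections `w_q` and `w_k` are hypothetical, shared, and have no bias; the normalization \(|att(q,k)| = 1\) is omitted; and because an \(m \times n\) value and an \(n \times n\) weight cannot be multiplied element by element, the reweighting is written as a matrix product.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column(mat, j):
    """Return column j of a nested-list matrix."""
    return [row[j] for row in mat]

def cosine_attention(q, k, eps=1e-8):
    """att[a][b] = <q_a, k_b> / (|q_a| |k_b| + eps) over the columns of q and k."""
    n = len(q[0])
    att = []
    for a in range(n):
        qa = column(q, a)
        row = []
        for b in range(n):
            kb = column(k, b)
            dot = sum(x * y for x, y in zip(qa, kb))
            norm = math.sqrt(sum(x * x for x in qa)) * math.sqrt(sum(y * y for y in kb))
            row.append(dot / (norm + eps))
        att.append(row)
    return att

# Hypothetical value matrix Z_j with m = 3 features and n = 2 positions.
z = [[1.0, 1.0],
     [0.0, 1.0],
     [2.0, 0.0]]
# Hypothetical projection from m = 3 down to d = 2 dimensions (bias omitted).
w_q = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
w_k = w_q  # sharing the projection is an assumption made for brevity

q = matmul(w_q, z)            # shape (d, n)
k = matmul(w_k, z)            # shape (d, n)
att = cosine_attention(q, k)  # shape (n, n)
z_next = matmul(z, att)       # shape (m, n); matrix product is an assumption
print(att)
print(z_next)
```

In this example the two projected columns are neither orthogonal nor collinear, so the off-diagonal attention entries fall strictly between 0 and 1 and each output column mixes both input columns.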