Not so short introduction to Attention and Transformers

By Kemal Erdem

Attention weights visualization

Figure 1: Attention weights for English to French translation, Source: Neural Machine Translation by Jointly Learning to Align and Translate

Soft attention

Figure 2: Image attention visualization, white regions have higher attention weight values, Source: Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention

Hard attention

Figure 3: Hard attention visualization, white regions are the regions the model attends to, Source: Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention
- **Input vectors**: $\color{limegreen}{X}$ (shape $N_X \times D_X$)
- **Query vector**: $\color{mediumpurple}{q}$ (shape $D_Q$), this is our previous hidden state
- **Similarity function**: $f_{att}$
- **Similarities**: $e$, where $e_i = f_{att}(\color{mediumpurple}{q}, \color{limegreen}{X_i})$
- **Attention weights**: $a = \text{softmax}(e)$ (shape $N_X$)
- **Output**: $\color{hotpink}{y} = \sum_i a_i \color{limegreen}{X_i}$
$f_{att}$ → **Scaled dot product**

$\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}|\cos(\theta)$

$e_i = \color{mediumpurple}{q} \cdot \color{limegreen}{X_i} / \sqrt{D_Q}$
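The single-query case can be sketched in a few lines of NumPy. This is an illustrative sketch, not the article's code; the function name and toy inputs are my own:

```python
import numpy as np

def softmax(e):
    # Numerically stable softmax over the similarity vector
    e = e - e.max()
    exp = np.exp(e)
    return exp / exp.sum()

def single_query_attention(q, X):
    """Attention output for one query q (shape D_Q) over inputs X
    (shape N_X x D_X), using scaled dot-product similarity."""
    D_Q = q.shape[0]
    e = X @ q / np.sqrt(D_Q)   # similarities e_i = q . X_i / sqrt(D_Q)
    a = softmax(e)             # attention weights, sum to 1
    y = a @ X                  # output: weighted sum of the inputs
    return y, a

# Toy example: the query points in the direction of the first input
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
q = np.array([1.0, 0.0])
y, a = single_query_attention(q, X)
```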
- **Input vectors**: $\color{limegreen}{X}$ (shape $N_X \times D_X$)
- **Query vectors**: $\color{mediumpurple}{Q}$ (shape $N_Q \times D_Q$)
- **Similarity function**: _scaled dot product_
- **Similarities**: $E = \color{mediumpurple}{Q}\color{limegreen}{X}^T / \sqrt{D_Q}$ (shape $N_Q \times N_X$), $E_{i,j} = \color{mediumpurple}{Q_i} \cdot \color{limegreen}{X_j} / \sqrt{D_Q}$
- **Attention weights**: $A = \text{softmax}(E, \text{dim}=1)$ (shape $N_Q \times N_X$)
- **Output**: $\color{hotpink}{Y} = A\color{limegreen}{X}$ (shape $N_Q \times D_X$), where $\color{hotpink}{Y_i} = \sum_j A_{i,j} \color{limegreen}{X_j}$
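The batched, matrix form maps directly onto NumPy as well. Again a minimal sketch with names of my own choosing, assuming $D_Q = D_X$ so the dot products are defined:

```python
import numpy as np

def attention(Q, X):
    """Scaled dot-product attention in matrix form.
    Q: (N_Q, D_Q) queries, X: (N_X, D_X) inputs, with D_Q == D_X."""
    D_Q = Q.shape[1]
    E = Q @ X.T / np.sqrt(D_Q)                 # similarities, (N_Q, N_X)
    E = E - E.max(axis=1, keepdims=True)       # stability shift
    A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)  # softmax over dim 1
    return A @ X                               # outputs, (N_Q, D_X)
```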

Transformer Architecture


Values Visualization

Figure 4: Positional Encoding values for the first 20 positions, Generated with the TensorFlow positional encoding code

Values Visualization

Figure 5: Positional Encoding periods for higher indices, Generated with the TensorFlow positional encoding code
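The values plotted in Figures 4 and 5 come from the sinusoidal positional encoding of "Attention Is All You Need". A NumPy sketch of the formula (assuming an even `d_model`; function name is my own):

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal positional encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(length)[:, None]        # (length, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model // 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)             # even dimensions
    pe[:, 1::2] = np.cos(angle)             # odd dimensions
    return pe
```

Each dimension oscillates with a different period, which is why the higher indices in Figure 5 change so slowly.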

Masked Attention - "I can't see the future"

Figure 6: Masked Self-Attention Layer
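The "can't see the future" rule is implemented by setting the similarities for future positions to $-\infty$ before the softmax, so their attention weights become exactly zero. A self-contained sketch (names are my own; for clarity $X$ is used directly as queries, keys, and values):

```python
import numpy as np

def masked_self_attention(X):
    """Causal self-attention: position i may only attend to positions j <= i.
    X: (N, D). Returns the outputs and the attention weight matrix."""
    N, D = X.shape
    E = X @ X.T / np.sqrt(D)                          # similarities, (N, N)
    mask = np.triu(np.ones((N, N), dtype=bool), k=1)  # True = future position
    E = np.where(mask, -np.inf, E)                    # block the future
    E = E - E.max(axis=1, keepdims=True)
    A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)
    return A @ X, A
```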

Multi-Headed Self-Attention

Figure 7: Multi-head processing, Source: Jay Alammar, The Illustrated Transformer, 2019
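The multi-head processing in Figure 7 can be sketched as follows. Random, untrained matrices stand in for the learned projections, so this only illustrates the shapes and data flow, not a trained model:

```python
import numpy as np

def multi_head_self_attention(X, num_heads, rng):
    """Multi-head self-attention sketch. Each head projects X into its own
    query/key/value spaces, attends independently, then the head outputs
    are concatenated and projected back to the model width D."""
    N, D = X.shape
    d_head = D // num_heads
    heads = []
    for _ in range(num_heads):
        # Random stand-ins for the learned W_Q, W_K, W_V of this head
        Wq = rng.standard_normal((D, d_head))
        Wk = rng.standard_normal((D, d_head))
        Wv = rng.standard_normal((D, d_head))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        E = Q @ K.T / np.sqrt(d_head)
        E = E - E.max(axis=1, keepdims=True)
        A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)
        heads.append(A @ V)                  # (N, d_head) per head
    W_o = rng.standard_normal((num_heads * d_head, D))  # output projection
    return np.concatenate(heads, axis=1) @ W_o          # back to (N, D)
```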

Architecture Recap

Figure 8: The Transformer model architecture, Source: Vaswani et al., "Attention Is All You Need", NeurIPS 2017

Encoder Block

Figure 9: Encoder Block structure with residual connection and normalization, Source: Jay Alammar, The Illustrated Transformer, 2019
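The "Add & Norm" wrapping in Figure 9 can be sketched generically: each sub-layer (attention or feed-forward) is wrapped in a residual connection followed by layer normalization. A minimal post-norm sketch with assumed names, taking the sub-layers as plain functions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X, self_attention, feed_forward):
    """Post-norm encoder block: residual connection + layer norm
    around each of the two sub-layers."""
    X = layer_norm(X + self_attention(X))  # Add & Norm around attention
    X = layer_norm(X + feed_forward(X))    # Add & Norm around the FFN
    return X
```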

Encoder-Decoder

Figure 10: Stacked encoders with connection to stacked decoders, Source: Jay Alammar, The Illustrated Transformer, 2019

First timestep

Figure 11: Generating output for the first timestep, Source: Jay Alammar, The Illustrated Transformer, 2019

Decoding the output

Figure 12: Step-by-step generation of the output until the EOS token is returned, Source: Jay Alammar, The Illustrated Transformer, 2019
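The loop in Figure 12 is plain autoregressive decoding: feed the tokens generated so far back into the decoder until it emits EOS. A sketch where `decoder_step` is a hypothetical stand-in for the whole decoder stack plus the final linear + softmax, returning the argmax token id:

```python
def greedy_decode(encoder_out, decoder_step, bos, eos, max_len=50):
    """Greedy autoregressive decoding sketch.
    `decoder_step` is a hypothetical callable: (encoder output,
    tokens so far) -> next token id."""
    tokens = [bos]
    for _ in range(max_len):
        next_token = decoder_step(encoder_out, tokens)
        tokens.append(next_token)
        if next_token == eos:   # stop once EOS is produced
            break
    return tokens
```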

The output

Figure 13: Output of the last decoder with classification, Source: Jay Alammar, The Illustrated Transformer, 2019

Original Transformer

Table 1: Variations on the Transformer architecture. Unlisted values are identical to those of the base model, Source: Vaswani et al., "Attention Is All You Need", NeurIPS 2017

Transformers...

| Model | Layers | Width | Heads | Params | Data | Training |
|---|---|---|---|---|---|---|
| Transformer-Base | 12 | 512 | 8 | 65M | | 8x P100 (12h) |
| Transformer-Large | 12 | 1024 | 16 | 213M | | 8x P100 (3.5d) |
| GPT-2-XL | 48 | 1600 | 25 | 1558M | 40GB | |
| Megatron-LM | 72 | 3072 | 32 | 8300M | 174GB | 512x V100 (9d) |
| GPT-3 175B | 96 | 12288 | 96 | 175000M | 45TB | 355y on V100 |

Table 2: Transformer model comparison

Bibliography

Thanks

"There's no such thing as a stupid question!"

Kemal Erdem
https://erdem.pl