Not so short introduction to Attention and Transformers
By Kemal Erdem
Attention weights visualization
Soft attention
Hard attention
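The difference is easiest to see in code. Below is a minimal NumPy sketch (the `scores` and `values` arrays are illustrative placeholders, not taken from the article): soft attention blends every value vector with softmax weights, which keeps it differentiable, while hard attention commits to the single highest-scoring value.

```python
import numpy as np

def soft_attention(scores, values):
    """Soft attention: softmax-weighted average of all value vectors (differentiable)."""
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights = weights / weights.sum()
    return weights @ values                   # weighted sum over every position

def hard_attention(scores, values):
    """Hard attention: pick only the single highest-scoring value (not differentiable)."""
    return values[np.argmax(scores)]

scores = np.array([0.1, 2.0, 0.3])                  # one alignment score per input position
values = np.array([[1., 0.], [0., 1.], [1., 1.]])   # one value vector per input position
print(soft_attention(scores, values))               # a blend dominated by values[1]
print(hard_attention(scores, values))               # exactly values[1]
```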
Transformer Architecture
Values Visualization
Masked Attention - "I can't see the future"
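In code, "I can't see the future" just means adding a very large negative number to every score that points at a later position, so the softmax drives those weights to zero. A rough sketch, assuming single-head scaled dot-product attention with `Q`, `K`, `V` of shape (seq_len, d):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal scaled dot-product attention: position i may only attend to j <= i."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) score matrix
    future = np.triu(np.ones_like(scores), k=1)    # 1s strictly above the diagonal
    scores = np.where(future == 1, -1e9, scores)   # hide the "future" positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4, 8))
print(masked_self_attention(Q, K, V).shape)        # (4, 8)
```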
Multi-Headed Self-Attention
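Multi-headed self-attention runs several smaller attention computations in parallel on slices of the model width and concatenates the results. The sketch below uses random projection matrices purely to show the shapes; real models learn them and also apply a final output projection, which is omitted here:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (no masking, for brevity)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_self_attention(X, num_heads, rng):
    """Split the model width into num_heads slices, attend in each, then concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # W_q, W_k, W_v are learned in a real model; random here just to show shapes.
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    return np.concatenate(heads, axis=-1)          # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 512))                  # 5 tokens, d_model = 512
print(multi_head_self_attention(X, num_heads=8, rng=rng).shape)  # (5, 512)
```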
Architecture Recap
Encoder Block
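As a rough sketch, an encoder block is just self-attention and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization; `self_attention` and `feed_forward` below are placeholder callables, not a full implementation, and the learned scale/shift of layer norm is omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def encoder_block(x, self_attention, feed_forward):
    """One encoder block: self-attention, then a position-wise feed-forward network,
    each followed by 'add & norm' (residual connection plus layer normalization)."""
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x
```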
Encoder-Decoder
First timestep
Decoding whole output
The output
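The timestep figures above boil down to a simple loop: feed the decoder everything generated so far, take the most likely next token, and stop once the end token appears. A hedged sketch with a placeholder `decoder_step` callable (greedy decoding, not beam search):

```python
def greedy_decode(encoder_output, decoder_step, start_token, end_token, max_len=50):
    """Autoregressive decoding: at every timestep, feed everything generated so far
    back into the decoder and append the most likely next token."""
    output = [start_token]
    for _ in range(max_len):
        next_token = decoder_step(encoder_output, output)  # predicts one token id
        output.append(next_token)
        if next_token == end_token:                        # stop once <end> is produced
            break
    return output[1:]                                      # drop the <start> token
```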
Original Transformer
Transformers...
| Model | Layers | Width | Heads | Params | Data | Training |
|---|---|---|---|---|---|---|
| Transformer-Base | 12 | 512 | 8 | 65M | | 8xP100 (12h) |
| Transformer-Large | 12 | 1024 | 16 | 213M | | 8xP100 (3.5d) |
| GPT-2-XL | 48 | 1600 | 25 | 1558M | 40GB | |
| Megatron-LM | 72 | 3072 | 32 | 8300M | 174GB | 512x V100 (9d) |
| GPT-3 175B | 96 | 12288 | 96 | 175000M | 45TB | 355y on V100 |
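Most of the parameter counts in the table follow directly from depth and width. A quick back-of-the-envelope check for the GPT-style (decoder-only) rows, using the rule of thumb of roughly 12 · layers · width² weights in the attention and feed-forward layers (embeddings excluded):

```python
# Rough parameter estimate for a GPT-style decoder layer:
# ~4*d^2 for the attention projections + ~8*d^2 for the 4x-wide feed-forward = 12*d^2.
def approx_params(layers, width):
    return 12 * layers * width**2

for name, layers, width in [("GPT-2-XL", 48, 1600),
                            ("Megatron-LM", 72, 3072),
                            ("GPT-3 175B", 96, 12288)]:
    print(f"{name}: ~{approx_params(layers, width) / 1e6:.0f}M")
# GPT-2-XL: ~1475M, Megatron-LM: ~8154M, GPT-3 175B: ~173946M -- close to the table;
# most of the remainder is embedding parameters.
```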
Bibliography
"There's no such thing as a stupid question!"
Kemal Erdem
https://erdem.pl