Transformer Self-Attention
Description
Visualizes the scaled dot-product self-attention mechanism in a Transformer. Input tokens are projected to Q, K, and V matrices; attention scores are computed as the dot products QK^T and rendered as a heatmap; a row-wise softmax normalizes the scores; and the output is the attention-weighted sum of the rows of V. The full attention formula is displayed.
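The mechanism described above condenses to a few lines of NumPy. A minimal sketch (matrix shapes are illustrative, not fixed by this spec):

```python
import numpy as np

def softmax(x):
    # Row-wise softmax; subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # raw values behind the heatmap
    weights = softmax(scores)        # each row sums to 1
    return weights @ V               # weighted sum over the rows of V
```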
Phases
| # | Phase Name | Duration | Description |
|---|---|---|---|
| 1 | Intro | 3s | Title and input tokens displayed |
| 2 | Q K V Projection | 8s | Token embeddings projected to Q, K, V matrices shown |
| 3 | Attention Scores | 10s | Q×K^T dot products shown; result is a heatmap grid |
| 4 | Scale & Softmax | 6s | Divide by sqrt(d_k), apply softmax; rows sum to 1 |
| 5 | Output Computation | 8s | Weighted sum with V; output tokens shown |
| 6 | Full Formula | 6s | Attention(Q,K,V) formula displayed |
| 7 | Multi-Head Note | 5s | Brief note: multiple heads run in parallel |
| 8 | Outro | 4s | Summary |
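Phases 2–5 map directly onto the computation. A runnable trace, assuming the five tokens from the layout and a hypothetical head dimension d_k = 4 (the spec fixes no dimensions; the embeddings and projection weights are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "mat"]
n, d_k = len(tokens), 4                     # d_k is an illustrative choice

# Phase 2: project token embeddings to Q, K, V.
X = rng.normal(size=(n, d_k))               # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_k, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Phase 3: attention scores -- the 5x5 heatmap grid.
scores = Q @ K.T

# Phase 4: scale by sqrt(d_k), then row-wise softmax.
scaled = scores / np.sqrt(d_k)
e = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # each row sums to 1

# Phase 5: output is the attention-weighted sum of the rows of V.
output = weights @ V                         # shape (5, d_k)
```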
Layout
+--------------------------------------------------+
| Title: Transformer Self-Attention |
+--------------------------------------------------+
| |
| Tokens: [The] [cat] [sat] [on] [mat] |
| |
| Q matrix K matrix V matrix (3 column panels) |
| |
| Attention heatmap (5×5 grid): |
| softmax(QK^T / sqrt(d_k)) |
| |
| Output = attn_weights × V |
| |
| Formula (bottom): |
| Attention(Q,K,V) = softmax(QK^T/√d_k)V |
+--------------------------------------------------+
Area Descriptions
- Top: Input token boxes
- Center left: Q, K, V matrix panels
- Center right: Attention heatmap
- Bottom: Formula display
Assets & Dependencies
- Fonts: LaTeX / sans-serif
- Manim version: ManimCE 0.19.1
Notes
- Token boxes colored distinctly (different hues)
- Attention heatmap uses color gradient from dark (low) to bright (high attention)
- Show which token attends most to which (highlight max attention in each row)
- d_k scaling factor annotated with reasoning (large dot products saturate the softmax toward one-hot rows, yielding near-zero gradients; dividing by sqrt(d_k) keeps the score variance near 1)
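The last two notes can be checked numerically: without the sqrt(d_k) divisor, dot products grow with d_k and the softmax saturates toward a one-hot row. A small sketch (dimensions and random data are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d_k = 256                                  # large head dim makes the effect visible
q, K = rng.normal(size=d_k), rng.normal(size=(5, d_k))

raw = softmax(q @ K.T)                     # unscaled: weights pile onto one key
scaled = softmax(q @ K.T / np.sqrt(d_k))   # scaled: weights stay spread out

# Highlighting the strongest attention per row (third note above):
W = softmax(rng.normal(size=(5, 5)))
strongest = W.argmax(axis=-1)              # column to highlight in each heatmap row
```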
Audience: Software Engineer
Category: CS