Transformer Self-Attention

Audience: Software Engineer
Category: Computer Science

Description

Visualizes scaled dot-product self-attention in a Transformer. Input token embeddings are projected to Q, K, and V matrices; attention scores are computed as QK^T dot products and shown as a heatmap; the scores are divided by sqrt(d_k) and passed through a row-wise softmax; and the output is computed as the attention-weighted sum of the rows of V. The full attention formula is then displayed.
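The computation the animation walks through can be sketched in plain Python. This is a minimal illustration, not the animation's own code: the helper names (`matmul`, `transpose`, `softmax`, `self_attention`), the toy embeddings, and the projection weights are all made up for the example, and a real model would use learned weights and a tensor library.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Return (weights, output) for token embeddings x of shape (n, d_model)."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                 # (n, n) raw dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]       # each row sums to 1
    return weights, matmul(weights, v)               # weighted sum of the rows of V

# Toy inputs (assumed values): 5 tokens, d_model = 3, d_k = d_v = 2.
x = [
    [1.0, 0.0, 0.5],
    [0.2, 1.0, 0.0],
    [0.0, 0.3, 1.0],
    [0.7, 0.7, 0.1],
    [0.1, 0.4, 0.9],
]
w_q = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
w_k = [[0.3, 0.6], [-0.5, 0.2], [0.9, -0.1]]
w_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = self_attention(x, w_q, w_k, w_v)
```

Here `weights` is the 5×5 grid that the heatmap would display, and `output` holds one 2-dimensional vector per token.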

Phases

#	Phase Name	Duration	Description
1	Intro	3s	Title and input tokens displayed
2	Q K V Projection	8s	Token embeddings are projected to Q, K, V; the three matrices are shown
3	Attention Scores	10s	Q×K^T dot products shown; result is a heatmap grid
4	Scale & Softmax	6s	Divide by sqrt(d_k), apply softmax; rows sum to 1
5	Output Computation	8s	Weighted sum with V; output tokens shown
6	Full Formula	6s	Attention(Q,K,V) formula displayed
7	Multi-Head Note	5s	Brief note: multiple heads run in parallel
8	Outro	4s	Summary
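One way to check the timing is to turn the phase table into data and derive each phase's start and end time. This is a small sketch; the names `PHASES`, `timeline`, and `total_runtime` are illustrative and not part of the spec.

```python
from itertools import accumulate

# (phase name, duration in seconds), copied from the phase table above.
PHASES = [
    ("Intro", 3),
    ("Q K V Projection", 8),
    ("Attention Scores", 10),
    ("Scale & Softmax", 6),
    ("Output Computation", 8),
    ("Full Formula", 6),
    ("Multi-Head Note", 5),
    ("Outro", 4),
]

durations = [d for _, d in PHASES]
# A phase starts when all earlier phases have finished.
starts = [0] + list(accumulate(durations))[:-1]
timeline = [(name, start, start + d) for (name, d), start in zip(PHASES, starts)]
total_runtime = sum(durations)

for name, start, end in timeline:
    print(f"{start:>2}s-{end:>2}s  {name}")
print(f"total: {total_runtime}s")
```

With the durations listed in the table, the phases add up to 50 seconds.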

Layout

+--------------------------------------------------+
|  Title: Transformer Self-Attention               |
+--------------------------------------------------+
|                                                  |
|  Tokens: [The] [cat] [sat] [on] [mat]            |
|                                                  |
|  Q matrix  K matrix  V matrix  (3 column panels) |
|                                                  |
|  Attention heatmap (5×5 grid):                   |
|  softmax(QK^T / sqrt(d_k))                       |
|                                                  |
|  Output = attn_weights × V                       |
|                                                  |
|  Formula (bottom):                               |
|  Attention(Q,K,V) = softmax(QK^T/√d_k)V          |
+--------------------------------------------------+

Area Descriptions

Top: Input token boxes
Center left: Q, K, V matrix panels
Center right: Attention heatmap
Bottom: Formula display

Assets & Dependencies

Fonts: LaTeX / sans-serif
Manim version: ManimCE 0.19.1

Notes

Token boxes colored distinctly (different hues)
Attention heatmap uses color gradient from dark (low) to bright (high attention)
Show which token attends most to which (highlight max attention in each row)
d_k scaling factor annotated with reasoning (dividing by sqrt(d_k) keeps the dot products from growing with d_k, which would saturate the softmax and leave very small gradients)
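The scaling note and the row-maximum highlight can be illustrated with a few lines of plain Python. This is a sketch under assumed values: `d_k = 64` and the raw scores are invented for the example, and `row_argmax` is a hypothetical helper, not something the spec defines.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def row_argmax(weights):
    """Index of the largest weight in each row (the cell to highlight)."""
    return [max(range(len(row)), key=row.__getitem__) for row in weights]

d_k = 64                                # assumed key dimension
raw = [8.0, 16.0, -8.0, 0.0, 24.0]      # assumed raw QK^T scores for one query

unscaled = softmax(raw)
scaled = softmax([s / math.sqrt(d_k) for s in raw])

# Derivative of a softmax output p with respect to its own input is p * (1 - p).
# Near p = 1 this is close to zero, which is the saturation the note refers to.
p_unscaled, p_scaled = max(unscaled), max(scaled)
grad_unscaled = p_unscaled * (1.0 - p_unscaled)
grad_scaled = p_scaled * (1.0 - p_scaled)

print(f"unscaled: max weight {p_unscaled:.4f}, gradient {grad_unscaled:.6f}")
print(f"scaled:   max weight {p_scaled:.4f}, gradient {grad_scaled:.6f}")
print("highlight columns:", row_argmax([unscaled, scaled]))
```

Without scaling, almost all of the weight lands on a single key and the gradient through the softmax is close to zero. After dividing by sqrt(d_k), the distribution is softer and the gradient is far larger, while the highlighted maximum stays on the same key.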
