nanoGPT, explained from scratch
Andrej Karpathy's nanoGPT is the cleanest ~300-line implementation of a GPT-style language model. Read it once and the rest of the transformer ecosystem stops feeling like magic.
What is nanoGPT?
nanoGPT is a minimal, hackable PyTorch re-implementation of GPT-2 by Andrej Karpathy. The whole model lives in one file (model.py), training in another (train.py), and sampling in a third (sample.py). It's small enough to read in an afternoon and powerful enough to reproduce GPT-2 (124M) on a single 8×A100 node.
The mental model
A GPT is a function that takes a sequence of tokens and predicts the next token. That's it. Training teaches it to predict the next token across billions of examples; sampling repeatedly calls it, appending each prediction, to generate text. Everything else — attention, MLPs, layer norms — is plumbing that makes that one prediction more accurate.
tokens ──▶ token + position embeddings
                    │
                    ▼
          ┌───────────────────┐
          │ Transformer block │  × N layers
          │  ┌─────────────┐  │
          │  │ LayerNorm   │  │
          │  │ Causal Attn │  │  ← tokens look only at past tokens
          │  │ + residual  │  │
          │  ├─────────────┤  │
          │  │ LayerNorm   │  │
          │  │ MLP (4×)    │  │
          │  │ + residual  │  │
          │  └─────────────┘  │
          └───────────────────┘
                    │
                    ▼
LayerNorm → Linear → logits over vocab → softmax → next token
1. Tokenization
Text is split into integer tokens using Byte-Pair Encoding (BPE). nanoGPT uses tiktoken's GPT-2 encoding with a 50,257-token vocabulary. Each token becomes an index into the embedding table.
import tiktoken
enc = tiktoken.get_encoding("gpt2")
ids = enc.encode("Vector databases are useful.")
# [38469, 18209, 389, 4465, 13]
enc.decode(ids)
# 'Vector databases are useful.'
2. Embeddings: token + position
Each token id is looked up in a learned token embedding table to get a vector of size n_embd (e.g. 768). Transformers have no built-in sense of order, so we add a learned position embedding for each slot in the sequence. The sum is what enters the first transformer block.
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.tok_emb = nn.Embedding(config.vocab_size, config.n_embd)
        self.pos_emb = nn.Embedding(config.block_size, config.n_embd)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)
        self.head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape  # batch size, sequence length
        pos = torch.arange(T, device=idx.device)  # positions 0..T-1
        x = self.tok_emb(idx) + self.pos_emb(pos)  # (B, T, n_embd); pos broadcasts over the batch
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.head(x)  # (B, T, vocab_size)
3. Causal self-attention (the heart)
Self-attention lets every token look at every other token and decide which ones to pay attention to. In a decoder-only GPT it's causal: token t can attend to tokens 0…t but not the future. This is enforced with a triangular mask.
For each token we project the embedding into three vectors: Query (what am I looking for?), Key (what do I offer?), and Value (what info do I carry?). The attention score between tokens i and j is the dot product Qᵢ · Kⱼ, scaled and softmaxed, then used to weight the Values.
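To make the scale, mask, softmax sequence concrete, here is a minimal single-head sketch in plain Python (no PyTorch, so every number is easy to inspect). The function name `causal_attention_weights` and the toy vectors are illustrative assumptions, not code from nanoGPT.

```python
import math

def causal_attention_weights(Q, K):
    """Return the T x T matrix of attention weights for one head.

    Q and K are lists of T vectors (lists of floats) of equal length d.
    Row i holds the weights token i places on tokens 0..i; future
    positions get weight 0.
    """
    T, d = len(Q), len(Q[0])
    weights = []
    for i in range(T):
        # Scaled dot-product scores against the visible keys only (j <= i)
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Masked (future) positions contribute exactly zero
        weights.append([e / total for e in exps] + [0.0] * (T - i - 1))
    return weights

# Toy example: 3 tokens, head_dim 2 (values chosen arbitrarily)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention_weights(Q, K)
for row in W:
    print([round(w, 3) for w in row])
```

The first row is always [1.0, 0.0, 0.0], because the first token can only attend to itself. A batched tensor implementation typically reaches the same result by filling the masked scores with -inf before the softmax, which turns them into zeros.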
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0  # heads must split n_embd evenly
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # Q, K, V in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # lower-triangular mask: position i can see 0..i
        self.register_buffer(
            "mask",
            torch.tril(torch.ones(config.block_size, config.block_size))
            .view(1, 1, config.block_size, config.block_size),
        )

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # reshape into (B, n_head, T, head_dim)
        hd = C // self.n_head
        q = q.view(B, T, self.n_head, hd).transpose(1, 2)
        k = k.view(B, T, self.n_head, hd).transpose(1, 2)
        v = v.view(B, T, self.n_head, hd).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hd)  # (B, h, T, T)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))  # hide the future
        att = F.softmax(att, dim=-1)
        y = att @ v  # (B, h, T, hd)
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # concatenate the heads
        return self.c_proj(y)
Splitting the embedding into n_head smaller heads lets each head specialize: one head might track subject–verb agreement, another might track quotation pairing. The outputs are concatenated and projected back.
4. The MLP and the transformer block
After attention mixes information across tokens, a position-wise MLP (two linear layers with a GELU in between, expanding to 4× the embedding dim) processes each token independently. Each sublayer is preceded by a LayerNorm (pre-LN) and wrapped in a residual connection.
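As a sanity check on the 4× expansion and the "124M" figure, the parameters can be counted by hand. This is a rough sketch under stated assumptions: the GPT-2 small configuration (12 layers, n_embd 768, vocabulary 50,257, block size 1,024), biases on every linear layer, and an output head that shares its weights with the token embedding (so it adds no parameters). The helper names `block_params` and `gpt_params` are hypothetical.

```python
def block_params(n_embd):
    """Parameters in one transformer block (weights + biases)."""
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd   # c_attn: Q, K, V projections
    attn_proj = n_embd * n_embd + n_embd          # attention output projection
    mlp_fc = n_embd * 4 * n_embd + 4 * n_embd     # c_fc: expand to 4x
    mlp_proj = 4 * n_embd * n_embd + n_embd       # MLP c_proj: back to n_embd
    layer_norms = 2 * 2 * n_embd                  # ln_1 and ln_2: scale + shift each
    return attn_qkv + attn_proj + mlp_fc + mlp_proj + layer_norms

def gpt_params(n_layer, n_embd, vocab_size, block_size):
    """Total parameters, assuming the output head reuses the token embedding."""
    embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
    final_ln = 2 * n_embd
    return embeddings + n_layer * block_params(n_embd) + final_ln

total = gpt_params(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024)
print(f"{total:,}")  # 124,439,808
```

Under these assumptions the MLP holds roughly two thirds of each block's parameters, which is why the 4× width matters so much for model size.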
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.c_proj(F.gelu(self.c_fc(x)))

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual + attention
        x = x + self.mlp(self.ln_2(x))   # residual + MLP
        return x
5. The training loop
Training is unglamorous: grab a random window of the corpus, shift it by one to make the targets, compute cross-entropy between predicted and actual next tokens, backprop, step the optimizer (AdamW with cosine LR schedule). Repeat for hundreds of thousands of iterations.
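The cosine learning-rate schedule mentioned above can be sketched in a few lines of plain Python. The linear warmup phase and all default values here are assumptions chosen for illustration, and `get_lr` is just a convenient name for the sketch.

```python
import math

def get_lr(it, max_lr=6e-4, min_lr=6e-5, warmup_iters=2000, lr_decay_iters=600000):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    # 1) Linear warmup from near zero up to max_lr
    if it < warmup_iters:
        return max_lr * (it + 1) / (warmup_iters + 1)
    # 2) Past the decay horizon, hold the floor
    if it > lr_decay_iters:
        return min_lr
    # 3) In between, follow half a cosine wave from max_lr down to min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)

for it in (0, 1000, 2000, 300000, 600000):
    print(it, f"{get_lr(it):.2e}")
```

In a PyTorch training loop, one way to apply it is to set `param_group["lr"] = get_lr(step)` for every `param_group` in `optimizer.param_groups` before calling `optimizer.step()`.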
import torch
import torch.nn.functional as F

def get_batch(data, block_size, batch_size):
    # random start offsets, one per sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # shifted by 1
    return x, y

for step in range(max_iters):
    xb, yb = get_batch(train_data, block_size, batch_size)
    logits = model(xb)  # (B, T, vocab)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # (B*T, vocab)
        yb.view(-1),                       # (B*T,)
    )
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
6. Sampling: generating text
To generate, feed the prompt through the model, take the logits at the last position, optionally apply temperature and top-k, sample a token, append it, and repeat until you hit a length limit or stop token.
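To see what temperature and top-k do in isolation, here is a small plain-Python sketch on a toy logit vector. The function name `next_token_probs` and the logit values are made up for illustration.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw logits (a list of floats) into sampling probabilities."""
    # Temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep the k largest logits; everything below the k-th is masked out
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [s if s >= kth else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0, -1.0]
sharp = next_token_probs(logits, temperature=0.5)
flat = next_token_probs(logits, temperature=2.0)
top2 = next_token_probs(logits, top_k=2)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print([round(p, 3) for p in top2])
```

Lower temperature pushes more probability onto the top token, higher temperature spreads it out, and top-k gives exactly zero probability to everything outside the k best candidates.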
@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size, temperature=1.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]  # crop to context window
        logits = model(idx_cond)[:, -1, :] / temperature  # last position only
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")  # drop everything below the k-th largest
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # (B, 1)
        idx = torch.cat((idx, next_id), dim=1)
        # Optional early stop; .item() only works for batch size 1
        if eos_id is not None and idx.size(0) == 1 and next_id.item() == eos_id:
            break
    return idx
7. KV-cache: making decoding fast
During generation we call the model once per new token. Naively, every call recomputes Q, K, and V for the entire prefix and the full T×T attention matrix, even though the prefix hasn't changed. That's O(T²) attention work for every single new token.
The fix: cache the K and V tensors for every past token in every layer. On each new step we only compute Q/K/V for the single new token, append its K and V to the cache, and attend against the full cached K/V. Each generation step becomes O(T) instead of O(T²), and the speedup grows with context length.
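The claim that caching changes the cost but not the result can be checked on a toy single-head example in plain Python. Everything here is a simplified assumption for illustration: `project` stands in for the learned Q/K/V projections with fixed scalar weights, and there is only one layer.

```python
import math

def attend(q, keys, values):
    """Single-head attention of one query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]
    return [sum(wi * v[c] for wi, v in zip(w, values)) for c in range(len(values[0]))]

def project(x, scale):
    # Stand-in for a learned linear projection (hypothetical toy weights)
    return [scale * xi for xi in x]

xs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]  # toy token embeddings

# Naive decoding: at every step, recompute K and V for the whole prefix
naive_outputs, naive_projections = [], 0
for t in range(1, len(xs) + 1):
    keys = [project(x, 2.0) for x in xs[:t]]
    values = [project(x, 3.0) for x in xs[:t]]
    naive_projections += 2 * t
    naive_outputs.append(attend(project(xs[t - 1], 1.0), keys, values))

# Cached decoding: project only the newest token and append to the cache
cache_k, cache_v = [], []
cached_outputs, cached_projections = [], 0
for x in xs:
    cache_k.append(project(x, 2.0))
    cache_v.append(project(x, 3.0))
    cached_projections += 2
    cached_outputs.append(attend(project(x, 1.0), cache_k, cache_v))

print(naive_projections, cached_projections)  # 20 8
```

Both loops produce the same attention outputs, but the naive loop performs 20 key/value projections for 4 tokens while the cached loop performs 8. The gap widens quadratically as the sequence grows.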
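class CausalSelfAttention(nn.Module):
    # __init__ is unchanged from the version in section 3

    def forward(self, x, kv_cache=None):
        B, T, C = x.shape  # T == 1 during decoding, T == prompt length during prefill
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        hd = C // self.n_head
        q = q.view(B, T, self.n_head, hd).transpose(1, 2)
        k = k.view(B, T, self.n_head, hd).transpose(1, 2)
        v = v.view(B, T, self.n_head, hd).transpose(1, 2)
        # Prepend the cached K, V for this layer (if any) to the new ones
        if kv_cache is not None:
            past_k, past_v = kv_cache
            k = torch.cat([past_k, k], dim=2)  # (B, h, T_past+T, hd)
            v = torch.cat([past_v, v], dim=2)
        new_cache = (k, v)  # always defined, including on the prefill call
        T_total = k.size(2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(hd)  # (B, h, T, T_total)
        # Query row i sits at absolute position T_total - T + i and must not see later keys.
        # With T == 1 the mask row is all ones, so decoding is effectively unmasked;
        # the mask only matters for the multi-token prefill.
        att = att.masked_fill(self.mask[:, :, T_total - T:T_total, :T_total] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y), new_cache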
@torch.no_grad()
def generate_cached(model, idx, max_new_tokens):
    # forward_with_cache and sample are helpers not shown here.
    # forward_with_cache threads one (k, v) cache per layer through the blocks and must
    # offset the position indices by the cached length; sample draws a (B, 1) token id
    # from the last-position logits.
    caches = [None] * len(model.blocks)  # one per layer
    # 1) Prefill: run the full prompt once to populate the caches
    x, caches = model.forward_with_cache(idx, caches)
    next_id = sample(x[:, -1, :])
    out = [next_id]
    # 2) Decode: feed only the newest token each step
    for _ in range(max_new_tokens - 1):
        x, caches = model.forward_with_cache(next_id, caches)
        next_id = sample(x[:, -1, :])
        out.append(next_id)
    return torch.cat([idx] + out, dim=1)
This is also why long contexts get expensive at inference time even on small models: the KV-cache grows linearly with context length in every layer. Tricks like grouped-query attention, multi-query attention, and paged KV-caches (vLLM) all exist to shrink or better manage this exact buffer.
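The linear growth is easy to quantify. The following is a rough sketch, assuming fp16 storage (2 bytes per element), batch size 1, standard multi-head attention that keeps full-width K and V for every layer, and the GPT-2 small shape (12 layers, n_embd 768). `kv_cache_bytes` is a hypothetical helper, not part of nanoGPT.

```python
def kv_cache_bytes(n_layer, n_embd, seq_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for standard multi-head attention."""
    # Two tensors (K and V) per layer, each of shape (batch, seq_len, n_embd)
    return 2 * n_layer * batch_size * seq_len * n_embd * bytes_per_elem

gpt2_small = kv_cache_bytes(n_layer=12, n_embd=768, seq_len=1024)
print(gpt2_small / 2**20, "MiB")  # 36.0 MiB

# Doubling the context doubles the cache
doubled = kv_cache_bytes(n_layer=12, n_embd=768, seq_len=2048)
print(doubled / gpt2_small)  # 2.0
```

The same formula scales with layer count, width, and batch size, so the cache for a much larger model serving many concurrent requests can rival the size of the weights themselves.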
What nanoGPT deliberately leaves out
To stay readable, nanoGPT skips many production niceties — and that's the point. Once you understand the core, you can read papers about each addition without losing the thread.
Try it yourself
The fastest on-ramp is the Shakespeare character-level demo — trains in minutes on a laptop CPU and produces recognizably Shakespearean gibberish. Then graduate to OpenWebText on a GPU to reproduce GPT-2.
git clone https://github.com/karpathy/nanoGPT
cd nanoGPT
pip install torch numpy transformers datasets tiktoken wandb tqdm
# Prepare the tiny Shakespeare dataset
python data/shakespeare_char/prepare.py
# Train a small model on a single GPU (on a CPU-only machine, append --device=cpu --compile=False)
python train.py config/train_shakespeare_char.py
# Sample from your trained model
python sample.py --out_dir=out-shakespeare-char
Suggested reading order: model.py → train.py → sample.py. Then watch Karpathy's "Let's build GPT" video alongside the code; it walks through every line.