How LLMs
Actually Work


A complete walkthrough of how large language models like ChatGPT are built — from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.

Training Tokens: 15T
Parameters: 405B
Text Data: 44 TB
Token Vocabulary: 100K

Downloading
the Internet

The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.

The goal: large quantity of high quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes — roughly what fits on a single hard drive — representing ~15 trillion tokens.

Key Insight: The quality and diversity of this training data have more impact on the final model than almost anything else. Garbage in, garbage out — but at a trillion-token scale.


🌐 Common Crawl
2.7B web pages · Raw HTML · Since 2007
A non-profit organization that crawls the web and freely provides its data. Their bots follow links from seed pages, recursively indexing the internet. The raw archive is petabytes of gzip'd WARC files containing raw HTML.
Chapter 1 · Pre-Training · Stage 2

Tokenization

Neural networks can't process raw text — they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID.

GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs — compressing the sequence length while expanding the vocabulary.

Why not just use words? Words have infinite variants. "run", "running", "runner" would be 3 separate entries. Subword tokens share roots: "run" + "ning", "run" + "ner". This also handles new words, typos, and multiple languages efficiently.
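The merge loop at the heart of BPE can be sketched in a few lines of Python. This is a toy version that operates on characters rather than raw bytes and skips every optimization a production tokenizer uses; the training string is illustrative:

```python
from collections import Counter

def bpe_train(text: str, num_merges: int):
    """Toy BPE: start from single characters, repeatedly merge the
    most frequent adjacent pair into a new token."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing left worth merging
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token.
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("run running runner", 6)
```

Note how the shared root "run" emerges as a single token after just two merges ("r"+"u", then "ru"+"n") — exactly the subword-sharing behavior described above.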

BPE in Action

BPE Tokenization Steps — interactive diagram showing how Byte Pair Encoding progressively merges characters into subword tokens.

Live Tokenizer


Explore tokenization across GPT-4, Claude, Llama and more → tiktokenizer.vercel.app

Chapter 1 · Pre-Training · Stage 3

Training the
Neural Network

The Transformer neural network is initialized with random parameters — billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.

Every training step: sample a window of tokens → feed to network → compare prediction to actual next token → nudge all parameters slightly in the right direction. Repeat billions of times.

The loss — a single number measuring prediction error — falls steadily as the model learns the statistical patterns of human language.
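That loop can be made concrete with the smallest possible model: a bigram table with one "knob" per (current token, next token) pair, trained with exactly the step described above — predict, measure cross-entropy loss, nudge the parameters. A minimal NumPy sketch; the vocabulary size, learning rate, and repeating token pattern are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 16
# One "knob" per (current token, next token) pair: a bigram logit table.
W = rng.normal(0, 0.1, size=(vocab_size, vocab_size))

# Tiny synthetic corpus: a repeating token pattern the model can learn.
data = np.array([1, 2, 3, 4] * 200)

def train_step(W, x, y, lr=0.5):
    """One step: predict the next token, measure loss, nudge parameters."""
    logits = W[x]                          # scores for every possible next token
    probs = np.exp(logits - logits.max())  # softmax (numerically stable)
    probs /= probs.sum()
    loss = -np.log(probs[y])               # cross-entropy on the true next token
    grad = probs.copy()
    grad[y] -= 1.0                         # d(loss)/d(logits)
    W[x] -= lr * grad                      # nudge in the right direction
    return loss

losses = [train_step(W, data[i], data[i + 1]) for i in range(len(data) - 1)]
```

The first loss is near log(16) ≈ 2.77 — the model knows nothing — and it falls toward zero as the table absorbs the pattern, the same curve a real Transformer traces at vastly larger scale.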

Scale: GPT-2 (2019) had 1.6B params, was trained on 100B tokens, and cost ~$40K to train; today the same quality costs ~$100. Llama 3: 405B params, 15T tokens. Modern frontier models: hundreds of billions of parameters, trillions of tokens.

Transformer Architecture


Interactive chart: cross-entropy training loss vs. training step (e.g. loss 4.8 at step 500), falling steadily as training progresses.

Model Output at This Stage

the model has learning but confustion still the wqp mxr model bns to predict...
What the model is learning: At step 1, pure noise. By step 500, local coherence appears. By step 32K, fluent English. The model is learning grammar, facts, reasoning patterns — all implicitly from token prediction.
Chapter 1 · Pre-Training · Stage 4

Inference &
Token Sampling

Once trained, the network generates text autoregressively: feed a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append → repeat.

This process is stochastic — the same prompt generates different outputs every time because we're flipping a biased coin. Higher-probability tokens are more likely but not guaranteed to be chosen.

Temperature controls randomness. Low temperature (0.1) → the model almost always picks the top token. High temperature (2.0) → near-uniform chaos. 0.7–1.0 is the sweet spot for coherent-but-creative text.

Key Mental Model: The model doesn't "think" about what to say. It computes a probability distribution over all possible next tokens and samples from it. Every word is a coin flip — just a very informed one.
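The whole sampling step — softmax plus a temperature knob — fits in a few lines. A minimal sketch; the logits here are made-up scores for four candidate tokens, not real model outputs:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    """Sample one token id from the model's output scores.

    Dividing by a low temperature sharpens the distribution toward
    the top token; a high temperature flattens it toward uniform."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)  # the informed coin flip

logits = [2.0, 1.0, 0.2, -1.0]  # hypothetical scores for 4 candidate tokens
rng = np.random.default_rng(0)
samples = [sample_next_token(logits, temperature=0.8, rng=rng) for _ in range(1000)]
```

Over many draws the top-scoring token wins most often but not always — the stochasticity that makes the same prompt produce different outputs each run.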

Token Sampling Demo

Watch the model choose the next word. Each bar shows the probability of a candidate token.

The sky appears blue
Chapter 2 · The Base Model

The Internet
Simulator

After pre-training, you have a base model — a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.

Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent — whatever was statistically common in its training data.

The base model's knowledge lives in its 405 billion parameters — a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.

Base Model Behavior

Few-Shot Prompting
Hello: Bonjour | Cat: Chat | Dog: Chien | Teacher:
→ Professeur  ✓ correct
Memorization
Zebras (/ˈzɛbrə, ˈziːbrə/) are African equines with distinctive...
...black-and-white striped coats. There are three living species: the Grévy's zebra, plains zebra, and mountain zebra...
↑ Verbatim Wikipedia recall from weights
Hallucination
The Republican Party nominated Trump and [running mate] in the 2024 election against...
→ ...Mike Pence, facing Hillary Clinton and Tim Kaine...
→ ...Ron DeSantis, against Joe Biden and Kamala Harris...
↑ Knowledge cutoff → plausible confabulation
In-Context Learning: Base models can perform translation, classification, and Q&A via few-shot prompts — no fine-tuning needed. The model infers the task from the pattern of examples in its context window.
Chapter 3 · Post-Training

Building the Assistant

The base model is a token simulator. To turn it into a helpful assistant, we need post-training — a much cheaper but equally critical stage. This is where the model learns conversations.

Supervised Fine-Tuning (SFT)

Human labelers create a dataset of ideal conversations, following detailed labeling instructions: be helpful, be truthful, be harmless. The model is then trained on these conversations — not from scratch, but by continuing to adjust the pre-trained weights on this new data.

Modern SFT datasets (like UltraChat) have millions of conversations — mostly synthetic (LLM-generated), with human review. The model learns by imitation: it adopts the persona of the ideal assistant reflected in the data.

Training Conversation Example
Human
What is 2 + 2?
Assistant
2 + 2 = 4. Is there anything else you'd like help with?
Human
What if it was multiplication instead?
Assistant
2 × 2 = 4 as well — the same result! For multiplication, 2 × 2 means adding 2 to itself once, giving you 4.
What you're really talking to: ChatGPT is a statistical simulation of the human labelers OpenAI hired — experts following labeling instructions. When it answers a coding question, it's imitating what a skilled developer-labeler would write.

Conversation Token Format

Every conversation must be encoded as a flat token sequence. Special tokens mark the structure:

<|im_start|>user<|im_sep|> What is 2 + 2? <|im_end|> <|im_start|>assistant<|im_sep|> 2 + 2 = 4. <|im_end|>
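Assembling that flat sequence from a structured conversation is mechanical. A sketch using the special-token names shown above — in a real tokenizer each marker maps to a single reserved token ID rather than being spelled out as text:

```python
def render_chat(messages):
    """Flatten a list of {role, content} messages into the
    <|im_start|>/<|im_sep|>/<|im_end|> token format."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}<|im_sep|>{msg['content']}<|im_end|>")
    return "".join(parts)

chat = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "2 + 2 = 4."},
]
encoded = render_chat(chat)
```

During SFT the model is trained on exactly such sequences, with the loss applied to the assistant's tokens; at inference time the prompt ends after `<|im_start|>assistant<|im_sep|>` and the model continues from there.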

Then RLHF refines the assistant's behavior further:

RLHF — Reinforcement Learning
from Human Feedback

Human raters rank multiple model responses. A reward model learns to predict human preferences. The language model is then trained via reinforcement learning to generate responses the reward model scores highly.
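The reward model is commonly trained with a pairwise loss of the Bradley–Terry form: minimize −log σ(r_preferred − r_rejected), so the loss shrinks as the reward model scores the human-preferred response higher. A minimal sketch; the reward values are illustrative, not real model scores:

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise reward-model loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that already agrees with the raters incurs low loss...
low = preference_loss(2.0, -1.0)
# ...and one that prefers the rejected answer incurs high loss.
high = preference_loss(-1.0, 2.0)
```

Gradient descent on this loss over many ranked pairs is what turns raw rater rankings into a differentiable stand-in for human preference.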

✓ Preferred
Here are the top 5 landmarks in Paris: 1) Eiffel Tower — iconic iron lattice structure... 2) The Louvre — world's largest art museum...
✗ Rejected
Paris has many landmarks. You should visit the Eiffel Tower. There is also a museum called the Louvre. Also Notre-Dame Cathedral is there...
Why RLHF matters: SFT teaches the model what to say. RLHF teaches it how to say it well — making responses more helpful, better structured, more honest, and less likely to hallucinate.
Chapter 4 · LLM Psychology

Cognitive Quirks
of Language Models

Understanding why LLMs behave the way they do requires thinking about their psychology — the emergent properties of being trained to statistically imitate human text.

🌀
Hallucination
Models confabulate confidently because training data always has confident answers. "Who is Orson Kovats?" gets a made-up biography because the training distribution of "who is X?" questions is always followed by confident replies — even for fictional names. Fix: add "I don't know" examples for questions the model gets wrong consistently.
🧠
Two Types of Memory
Parameters = long-term memory. Everything the model learned during training — vast but vague, like something you read months ago. Context window = working memory. Text in the current conversation — precise, directly accessible. Always paste important info into context rather than relying on the model to "remember."
🔧
Tool Use
Models can emit special tokens that trigger external tools: <search>query</search>. The program pauses generation, executes the search, stuffs the results into the context window, then resumes. The model "looks things up" the same way you do — by refreshing working memory.
🪞
No Persistent Self
Each conversation starts fresh — no memory of prior chats. The model "boots up," processes tokens, then shuts off. It has no stable identity. When it says "I'm ChatGPT by OpenAI," that's just the most statistically likely answer from training data — not genuine self-knowledge.
📊
Stochastic Token Tumbler
The model doesn't "decide" what to say. It computes probability distributions and samples. Run the same prompt 10 times and get 10 different outputs — all plausible, all drawn from the same learned distribution. Temperature controls how broadly it samples from this distribution.
📚
Knowledge Cutoff
Training data has a date. The model genuinely doesn't know what happened after that. Ask about recent events and it will hallucinate — not from malice but from the same mechanism that answers every question: predict the most likely continuation of the token sequence.
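The tool-use mechanism described above can be sketched as a simple controller loop around the model. Everything here is a stand-in — `generate` and `search` are hypothetical fakes, and real systems use reserved token IDs rather than string matching:

```python
import re

def run_with_tools(generate, search):
    """Generate until a <search>query</search> tag appears, pause,
    execute the tool, append the result to the context, then resume."""
    context = ""
    while True:
        chunk = generate(context)
        context += chunk
        match = re.search(r"<search>(.*?)</search>", chunk)
        if match:
            # Refresh working memory with the tool's output.
            context += f"<result>{search(match.group(1))}</result>"
        else:
            return context

# Hypothetical stand-ins for a model and a search engine.
def fake_generate(context):
    if "<result>" not in context:
        return "Let me check. <search>weather in Paris</search>"
    return " It is sunny in Paris today."

def fake_search(query):
    return "Paris: sunny, 22°C"

transcript = run_with_tools(fake_generate, fake_search)
```

The key design point: the model never "runs" the search itself. An outer program detects the special token, performs the side effect, and hands the result back as ordinary context tokens.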
Applied LLMs · RAG

Retrieval-Augmented
Generation

LLMs have a knowledge cutoff and a finite context window. RAG solves this by embedding your documents into a vector store, retrieving the most semantically relevant chunks at query time, and injecting them into the context — shifting the model's prediction distribution toward grounded, up-to-date facts rather than memorized training data.

Step 01 — Embed everything

Every document is converted to a dense vector (~1,536 numbers) by an embedding model. Semantically similar texts land near each other in this high-dimensional space — no keyword matching needed.

Step 02 — Embed the query & search

The user's question is embedded the same way. Cosine similarity finds the nearest document vectors — the chunks most semantically related to the query — typically the top 2–5.

Step 03 — Inject & generate

Retrieved chunks are prepended to the prompt before the LLM sees the question. The model generates from injected facts rather than relying on memorized training data — dramatically reducing hallucination on knowledge-intensive tasks.
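The three steps can be sketched end to end with a toy bag-of-words embedding standing in for a real embedding model (which would produce ~1,536 learned dimensions); the documents and query below mirror the illustrative example that follows:

```python
import re
import numpy as np

def embed(text, vocab):
    """Toy embedding: one dimension per vocabulary word, unit-normalized.
    A real system would call a learned embedding model instead."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    v = np.array([words.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

docs = [
    "The Mars colony Ares Base was established in 2031 near Hellas Planitia.",
    "The administrative capital of Ares Base is New Houston, housing 312 colonists.",
    "Mars surface temperature averages -63 C with seasonal variation near the poles.",
]
query = "What is the administrative capital of the Ares Base colony?"

# Step 1 & 2: embed documents and the query into the same space.
vocab = sorted({w for t in docs + [query] for w in re.findall(r"[a-z0-9]+", t.lower())})
doc_vecs = np.stack([embed(d, vocab) for d in docs])
q_vec = embed(query, vocab)

# Step 2: cosine similarity (dot product of unit vectors), take top-k.
scores = doc_vecs @ q_vec
top_k = np.argsort(scores)[::-1][:2]

# Step 3: inject the retrieved chunks ahead of the question.
prompt = "\n".join(f"[Retrieved] {docs[i]}" for i in top_k) + f"\n[Query] {query}"
```

The capital document scores highest and lands first in the assembled prompt — the same shift toward grounded context that a production vector store achieves at scale.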

1 · User Query
"What is the capital of Ares Base?"
2 · Embedding Model
Text → [0.23, −0.87, 0.41, ...] ~1,536 floats
3 · Vector DB — Cosine Search
Find top-k nearest neighbors in embedding space
4 · Retrieved Chunks (top 2)
Doc 1: "Ares Base established 2031..."
Doc 2: "Capital is New Houston, 312 colonists..."
5 · Context Window (assembled)
[Retrieved] Ares Base est. 2031...
[Retrieved] Capital: New Houston...
[Query] What is the capital...?
6 · LLM → Grounded Answer
"The capital of Ares Base is New Houston."

Effect on Predictions

Query "What is the administrative capital of the Ares Base colony?"
Knowledge Base — 4 documents
DOC 1
The Mars colony Ares Base was established in 2031 near Hellas Planitia.
DOC 2
The administrative capital of Ares Base is New Houston, housing 312 colonists.
DOC 3
Mars surface temperature averages −63°C with seasonal variation near the poles.
DOC 4
The first crewed Mars mission launched from Kennedy Space Center in 2029.
Context Window — sent to LLM
📄 [Retrieved] The Mars colony Ares Base was established in 2031 near Hellas Planitia.
📄 [Retrieved] The administrative capital of Ares Base is New Houston, housing 312 colonists.
❓ [Query] What is the administrative capital of the Ares Base colony?
✕ Without RAG
"I don't have reliable information about a colony called Ares Base. As of my training cutoff, no such Mars colony has been established..."
Hallucination / Refusal
✓ With RAG
"The administrative capital of Ares Base is New Houston, which houses 312 colonists. The colony was established in 2031 near Hellas Planitia."
Grounded in retrieved context
Full Pipeline

From Text to
Assistant

The complete journey from raw web crawl to the ChatGPT you interact with — across two major stages, months of compute, and billions of parameters.

01
Data Collection
Common Crawl + other sources → URL filtering → text extraction → language filtering → deduplication → PII removal → 44 TB of curated text (FineWeb, etc.)
Common Crawl · FineWeb · 44 TB · 15T tokens
02
Tokenization
Text → UTF-8 bytes → Byte Pair Encoding → 15 trillion token sequence. Each token is a sub-word chunk with an integer ID. GPT-4 vocabulary: 100,277 tokens.
BPE · 100K vocab · Sub-word units
03
Pre-Training
Transformer neural network trained to predict the next token. Billions of parameters tuned via gradient descent. Months of compute on thousands of GPUs. Loss decreases from ~11 to ~2.4.
Transformer · 405B params · $millions compute · 3 months
04
Base Model
An internet document simulator. Can autocomplete, few-shot prompt, and regurgitate memorized facts. NOT an assistant — just a very sophisticated token predictor.
GPT-2 · Llama 3 base · Token autocomplete
05
Supervised Fine-Tuning (SFT)
Base model retrained on human-labeled conversations. Labelers write ideal responses following company guidelines: helpful, truthful, harmless. Modern datasets: millions of synthetic + human-curated conversations. Duration: hours (not months).
Human labelers · InstructGPT · UltraChat · ~3 hours
06
RLHF
Human raters rank model outputs. A reward model learns these preferences. The language model is optimized via reinforcement learning to score higher — producing responses that are more helpful, better structured, and more honest.
Reward Model · PPO · Human preferences
07
🤖 ChatGPT / Claude / Gemini
The final assistant. A statistical simulation of expert human labelers, backed by a vast compressed representation of the internet. Not magic — but remarkable engineering at enormous scale.
Conversational · Helpful · Truthful · Harmless · Tool use

Built from Andrej Karpathy's "Intro to Large Language Models" lecture. The most important takeaway: every word generated is a probabilistic sample — a biased coin flip, at 100K-way scale, billions of times.