How to Use LLMs

Beyond the internals — a practical walkthrough of how to actually use large language models in your daily work. Based on Andrej Karpathy's follow-up to his LLM deep dive.

Tools covered: 14+ · Use cases: 12 · Models: 8+ · Source: 3h

Companion to Part 1: How LLMs Work. All content and examples traced directly to Karpathy's 2025 video.

Chapter 1 · Foundation

You're Talking to a ZIP File

Karpathy's mental model: ChatGPT is a "one-tab ZIP file" — a highly compressed snapshot of the internet. It read virtually every web page, book, and document up to its training cutoff, roughly 6–12 months ago. What comes back is a probabilistic recollection of that data.

The context window is its working memory — a finite tape of tokens it can see right now. Anything in it is directly accessible. Anything outside it doesn't exist for this conversation. There's no persistent memory between sessions unless you enable it.

There's no live connection to the web by default. The model produces the most statistically likely continuation of your prompt — not a lookup, not a search, not a guarantee.

The Introduction: "Hi, I'm ChatGPT. I'm a one-tab ZIP file. My knowledge comes from reading the internet about 6 months ago. I only know what's in this conversation. Every word I generate is a probabilistic sample — treat it accordingly."
Context Window · live working memory
system: You are a helpful assistant.
user: How much caffeine is in an Americano?
assistant: About 63mg per shot...
user: What about a double shot?
Everything above is visible to the model. The moment the window fills, old context falls off the edge and is gone.
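
To make that concrete, here is a minimal sketch of how a chat client might assemble the window and let old turns fall off; the 4-characters-per-token heuristic and the 8,000-token budget are illustrative assumptions, not any vendor's real numbers.

python
# Illustrative only: how a chat client might trim history to fit a context window.
# The 4-chars-per-token heuristic and the 8,000-token budget are assumptions.

def count_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def build_context(system: str, history: list[dict], budget: int = 8_000) -> list[dict]:
    """Keep the system prompt plus the most recent turns that still fit the budget."""
    kept, used = [], count_tokens(system)
    for msg in reversed(history):          # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break                          # older turns fall off the edge
        kept.append(msg)
        used += cost
    return [{"role": "system", "content": system}] + list(reversed(kept))

history = [
    {"role": "user", "content": "How much caffeine is in an Americano?"},
    {"role": "assistant", "content": "About 63mg per shot..."},
    {"role": "user", "content": "What about a double shot?"},
]
print(build_context("You are a helpful assistant.", history))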
Stale knowledge caveat: For timeless facts like caffeine content, the model's weights are reliable. For last month's news, they're not. Know the difference before trusting the answer.
Chapter 2 · Ecosystem

Models & Tiers

ChatGPT is the "Original Gangster" — most features, most popular, most polished. But the ecosystem has exploded since 2022. Pick the right tool for the task.

Claude
Anthropic
Exceptional at coding and document analysis. Powers Cursor (3.7 Sonnet) under the hood. Often outperforms on nuanced reasoning tasks.
Coding · Docs
Gemini
Google
Google's entrant, with Gemini 2.0 Pro available as an experimental model. Deep integration with Google Workspace. Strong multimodal capabilities.
Multimodal
Perplexity
Perplexity AI
Search-first LLM. Always retrieves and cites sources. Karpathy demoed its Deep Research feature for the rapamycin research example.
Search-first · Citations
Le Chat
Mistral
French startup alternative. Mistral's consumer chat interface. Strong at European languages and code.
Alternative
DeepSeek
DeepSeek
Chinese AI lab. Surprisingly strong at code and reasoning. Different training approach from US labs — worth benchmarking.
Alternative · Code
Model Families
OpenAI: GPT-4o (fast, smart, default) · o1 / o3 / o3-mini (thinking models) · o1 Pro ($200/mo, deep reasoning)
Anthropic: Claude 3.7 Sonnet (coding + reasoning) · Claude 3.5 Sonnet (fast + capable) · Claude Haiku (lightweight)
Google: Gemini 2.0 Pro (multimodal) · Gemini Flash (fast, often free)
Others: DeepSeek (Chinese, strong at code) · Mistral (French, Le Chat)
Where to compare: LM Arena (lmarena.ai) — formerly Chatbot Arena — maintains a live leaderboard ranked by human preference votes. It's the most reliable signal for "which model is actually better right now."
Chapter 3 · Reasoning

Thinking Models

OpenAI's o1, o1 Pro, o3, and o3-mini are a different breed — all model names starting with "o" are thinking models. Before returning an answer, they run an extended internal monologue: exploring approaches, backtracking, trying alternatives.

This emerged from reinforcement learning: the model discovered that deliberation strategies lead to better outcomes on hard problems. It tries different ideas, backtracks, checks its reasoning — much like the inner monologue you have when problem-solving.

Karpathy noted that Claude 3.7 Sonnet (non-thinking) solved a hard coding problem that o1 Pro could not. Model selection isn't always obvious — the right tool depends on the specific task.

When to use thinking models: Hard math, complex multi-step code, formal reasoning, logic puzzles. Skip them for simple tasks — they're slower, more expensive, and deliberation helps less when there's nothing difficult to reason through.
o1 Pro · Extended Thinking
Example prompt: "Prove that the sum of two odd numbers is always even."
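For reference, the fact the prompt asks for takes one line of algebra: if a = 2m + 1 and b = 2n + 1 are odd integers, then a + b = 2(m + n + 1), which is twice an integer and therefore even. What a thinking model adds is the visible search for that line: stating definitions, trying the algebra, checking an instance like 3 + 5 = 8, then committing to the answer.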
Chapter 5 · Synthesis

Deep Research

Deep Research = extended thinking + web search, run for 5–15 minutes. The model searches dozens of sources in parallel, reasons across them, and produces a structured report — work that would take a human researcher hours.

Karpathy's demo: researching rapamycin and longevity. The model looked at 27+ sources, thought for 5 minutes, and produced a report covering mechanism of action (mTOR inhibition), worm/mouse/human trial data, safety concerns, and ongoing studies.

Both ChatGPT Deep Research (requires $200/mo Pro) and Perplexity's research mode offer this. For literature reviews, competitive analysis, and due diligence — it dramatically lowers the research bar.

Best for: Scientific literature surveys, competitive landscape analysis, due diligence on decisions, medical/legal research (with verification). Not worth it for simple factual questions.
Deep Research Pipeline
1. Query Planning: Break the question into subtopics and parallel search queries
2. Parallel Web Search: Fetches 20–30 sources simultaneously across subtopics
3. Extended Thinking: Reasons across sources, resolves conflicts, identifies gaps
4. Report Generation: Structured report with citations, mechanisms, caveats
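
A rough sketch of those four stages in code, assuming hypothetical llm(), search_web(), and fetch() helpers; this is not OpenAI's or Perplexity's implementation, just the shape of the loop.

python
# Illustrative deep-research loop; llm(), search_web(), and fetch() are
# hypothetical stand-ins, not any vendor's actual API.
from concurrent.futures import ThreadPoolExecutor

def deep_research(question: str, llm, search_web, fetch) -> str:
    # 1. Query planning: ask the model to split the question into subtopics.
    subqueries = llm(f"Break this into 5 focused search queries:\n{question}").splitlines()

    # 2. Parallel web search: fetch many sources at once.
    with ThreadPoolExecutor(max_workers=8) as pool:
        urls = [u for q in subqueries for u in search_web(q, limit=5)]
        sources = list(pool.map(fetch, urls))

    # 3. Extended thinking: reason across sources, note conflicts and gaps.
    notes = llm("Summarize, reconcile conflicts, and flag gaps:\n" + "\n---\n".join(sources))

    # 4. Report generation: structured output with citations and caveats.
    return llm(f"Write a cited report answering '{question}' from these notes:\n{notes}")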
Chapter 6 · Reading

Docs & Books

Attaching documents transforms the model into a reading assistant. Upload a PDF, paste a chapter, share a spreadsheet — then ask questions, request summaries, or generate conceptual diagrams from the content.

Karpathy's example: reading The Wealth of Nations with Claude. "I'm attaching Chapter 3, Book 1 — please create a conceptual diagram of this chapter." Claude responds with Mermaid code, a diagram markup language that renders as a graph connecting key concepts.

For data, the model can write and run Python to generate charts. But treat it as a very junior data analyst — brilliant at writing the code, yet it hallucinated a "1.7 trillion" figure in one chart, a mistake Karpathy caught only by checking. Always scrutinize the numbers, not just the chart shape.

Key workflow: Attach document → request summary → drill into sections with Q&A → ask for concept map (Mermaid) → verify any specific figures against the source.
The Wealth of Nations · Book I, Ch. 3 Claude 3.7 Sonnet
Please create a conceptual diagram of this chapter
mermaid
graph TD
  A[Division of Labor] -->|limited by| B[Extent of Market]
  B --> C[Local Market]
  B --> D[Trade Networks]
  D --> E[Water Transport]
  D --> F[Land Transport]
  E -->|lower cost| G[Coastal Cities]
  G -->|develop first| H[Specialization]
Rendered diagram showing how Smith argues market size constrains specialization — larger markets enable deeper division of labor.
What implicit assumption is in node D?
Node D assumes that trade networks are legally and politically accessible — Smith's implicit premise that functioning exchange infrastructure already exists. He notes this precondition without arguing for it.
Chapter 7 · Data Analysis

Code Execution

ChatGPT's Advanced Data Analysis wires the model to a live Python runtime. You describe a task in plain language — it writes code, runs it, and shows you the result. No copy-paste, no local setup.

This is the integration of language with computation. Arithmetic, statistics, data cleaning, chart generation — anything Python can do. Upload a CSV and ask for a trend analysis; get a matplotlib chart in seconds.

Karpathy's caution: he caught the model generating a chart with a hallucinated "1.7 trillion" instead of the correct value. The code ran fine; the number was wrong. Treat it like a very capable but unreliable junior — verify the figures, not just the output shape.

The rule: Use code execution when you need computation, transformation, or visualization. Always check: does the generated code match what you asked? Does the output look plausible against your source data?
Advanced Data Analysis · Python Runtime
"Plot GDP growth for G7 countries from 1990–2023"
Chapter 8 · Development

Agentic Coding

Beyond chat, a new class of tools integrates LLMs directly into your code editor. Cursor and Windsurf run Claude or GPT under the hood, operating autonomously across your entire codebase — reading files, writing code, running commands, and iterating.

Cursor's Composer (⌘I) is an autonomous agent loop: describe a task, and it plans, writes files, runs shell commands, reads errors, and loops — asking your confirmation before any destructive action. Karpathy built a React app from scratch in a few minutes.

The model under the hood in Karpathy's setup: Claude 3.7 Sonnet. The key insight is that these tools are most powerful when you understand the model well enough to guide and correct it, not just prompt and hope.

Cursor keyboard shortcuts: ⌘K inline edit · ⌘L chat sidebar · ⌘I Composer (agentic)
Composer Agent Loop
1. Plan: Break the task into file changes and shell commands
2. Generate: Write or edit source files across the codebase
3. Execute: Run shell commands — asks your approval first
4. Observe: Read output, catch errors, update its plan
↺ loops until done or stuck
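
In code, that cycle looks something like the sketch below; the llm() helper, the EDIT/RUN/DONE step format, and the approval prompt are illustrative assumptions rather than Cursor's actual internals.

python
# Illustrative agent loop in the spirit of Plan → Generate → Execute → Observe.
# llm() and the step format are assumptions, not Cursor's implementation.
import subprocess

def run_shell(cmd: str) -> str:
    """Run a shell command only after explicit user approval."""
    if input(f"Run `{cmd}`? [y/N] ").lower() != "y":
        return "skipped by user"
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

def agent(task: str, llm, max_steps: int = 10) -> None:
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        # Plan: ask the model for the next step in a simple command format.
        step = llm("Plan the next step (EDIT <file> / RUN <cmd> / DONE):\n" + transcript)
        if step.startswith("DONE"):
            break
        if step.startswith("RUN "):
            observation = run_shell(step[4:])            # Execute, with approval
        elif step.startswith("EDIT "):
            path, _, content = step[5:].partition("\n")  # Generate file changes
            with open(path, "w") as f:
                f.write(content)
            observation = f"wrote {path}"
        else:
            observation = "unrecognized step"
        transcript += f"\n{step}\nObservation: {observation}"  # Observe, then loop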
Chapter 9 · Multimodal

Voice & Audio

Karpathy routes roughly half his queries through voice using Super Whisper — his pick among Super Whisper, WhisperFlow, and MacWhisper. Press a hotkey, speak, press again — query transcribed and sent. No typing, no friction.

ChatGPT's Advanced Voice Mode goes further: audio tokens flow directly to and from the model, with no text transcription layer. The result feels genuinely conversational, not a text-to-speech wrapper.

NotebookLM (Google) generates audio podcasts from your documents. Upload papers, books, or notes — it produces a two-host discussion. Karpathy uses it on walks and long drives for passive learning on topics outside his expertise.

Voice tip from Karpathy: For queries with product names, library names, or technical terms — switch to typing. Whisper often mistranscribes niche technical vocabulary. Voice is best for natural-language questions.
🎙 Speak → Whisper transcribes → 🧠 LLM responds → 📝 Text response
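
A minimal sketch of that pipeline using the open-source whisper package; the send_to_llm() callable is a hypothetical stand-in for whichever chat client you use, and this is not how Super Whisper itself is built.

python
# Record → Whisper transcription → LLM. send_to_llm() is a hypothetical helper.
import whisper

def transcribe(audio_path: str) -> str:
    model = whisper.load_model("base")          # small, fast speech-to-text model
    return model.transcribe(audio_path)["text"]

def voice_query(audio_path: str, send_to_llm) -> str:
    text = transcribe(audio_path)
    print("Transcribed:", text)                 # worth eyeballing for technical terms
    return send_to_llm(text)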
Super Whisper
Karpathy's pick · Mac
Global hotkey to record → auto-transcribe → paste anywhere. Works system-wide.
NotebookLM
Google · Free
Upload docs → generate a two-host podcast discussion. Good for passive learning.
Advanced Voice
ChatGPT
Native audio tokens — low latency, no transcription layer, genuinely conversational.
Chapter 10 · Visual Input

Vision & Camera

Modern LLMs accept images as input — photos, screenshots, scans, diagrams. The model reasons about visual content as fluently as it reasons about text, drawing on training data that included billions of image-text pairs.

Karpathy's examples: uploading a blood test scan for interpretation, pointing a camera at an Aranet 4 CO2 monitor to identify the device and interpret the 713 PPM reading, and showing a Lord of the Rings map, which it correctly identified as Middle-earth.

Vision is most reliable for well-documented subjects — blood test reference ranges, common consumer devices, famous maps — where training data covers the domain thoroughly. For proprietary or rare objects, expect more hallucination.

Strong vision use cases: Identifying unknown objects, interpreting standard lab results, explaining charts and diagrams, OCR on printed text, reading handwriting, and analyzing screenshots.
🩸
Blood Test Panel
"Here are my lab results — explain the flagged values"
Works well — reference ranges are extensively documented in training data. (In a related demo, Karpathy checked a photographed ingredient list against the actual box.) Always confirm with a doctor for medical decisions.
📊
CO2 Monitor (Aranet 4)
"What is this device, and is 713 PPM a good reading?"
Correctly identified the device, explained that 713 PPM is acceptable indoors (target: below 800 PPM, ventilate above 1000 PPM).
🗺
Fantasy Map Identification
"Do you know what this map is?"
Immediately identified as the map of Middle-Earth from The Lord of the Rings — a famous, widely-reproduced image in training data.
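
For API users, sending an image looks roughly like the sketch below (OpenAI Python SDK style, with gpt-4o as an assumed model name and a local file path chosen for illustration); the exact request shape may differ by vendor and version.

python
# Sketch of an image query via the OpenAI Python SDK; model name and file path
# are illustrative. Verify against current API docs before relying on this.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("co2_monitor.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is this device, and is 713 PPM a good reading?"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)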
Chapter 11 · Personalization

Memory & Personalization

By default, every conversation is stateless — the model forgets everything when the tab closes. Two features change this: Memory (ChatGPT auto-saves facts about you across sessions) and Custom Instructions (a persistent system prompt shaping every response).

Karpathy's custom instructions: request educational framing ("be educational whenever you can"), set Korean language formality register for language learning, and share context about his work and interests.

Think of custom instructions as your personal system prompt — it loads before every conversation. Good instructions compress preferences you'd otherwise repeat on every query, making each session feel like it already knows you.

Starter custom instructions: "Be concise. Prefer code over prose when both work. When I give you a document, start with a one-paragraph summary. Flag your assumptions explicitly. I work in [your field]."
Custom Instructions · ChatGPT
I'm a software engineer interested in ML. I prefer concise, technical answers. I'm learning Korean — when providing Korean text, use polite-formal register (합쇼체) by default.
Be educational when explaining concepts. Lead with the most important information first. Use code snippets liberally. Flag any assumptions you make explicitly.
Memory · auto-saved across sessions
User prefers bullet lists for multi-step summaries
User monitors indoor CO2 levels at home
User is learning Korean, wants 합쇼체 register
+ saved memories accumulate over time
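
Mechanically, custom instructions amount to a system message the client prepends to every fresh conversation. A minimal sketch, reusing the starter instructions above:

python
# Custom instructions as a persistent system prompt; the assembly below is an
# illustration of the mechanism, not any vendor's implementation.
CUSTOM_INSTRUCTIONS = (
    "Be concise. Prefer code over prose when both work. "
    "When I give you a document, start with a one-paragraph summary. "
    "Flag your assumptions explicitly."
)

def with_instructions(history: list[dict]) -> list[dict]:
    """Prepend the persistent system prompt to a fresh conversation."""
    return [{"role": "system", "content": CUSTOM_INSTRUCTIONS}] + history

print(with_instructions([{"role": "user", "content": "Summarize this PDF."}]))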
Chapter 12 · Reference

Tools & Resources

Every tool, model, and resource mentioned in Karpathy's lecture — linked and categorized.

Chapter 13 · Summary

Key Takeaways

01
You're talking to a ZIP file
The model compressed the internet into weights. Knowledge is ~6–12 months stale, output is probabilistic, and it has no working memory outside the context window. It cannot verify its own answers.
Foundation
02
Know your tier and model
Free → limited. $20/mo → GPT-4o / Claude Sonnet. $200/mo → o1 Pro, Deep Research. Match the model to the task — thinking models for hard reasoning, fast models for simple queries.
Models
03
Search for time-sensitive info only
For timeless, well-documented knowledge — the weights are enough, skip search. For recent events, changing situations, or niche topics — enable search or use Perplexity.
Search
04
Deep Research for multi-source synthesis
5–15 minutes, 20–30 sources, structured report. Genuinely useful for literature reviews and due diligence. Currently behind the $200/mo paywall on ChatGPT; Perplexity is cheaper.
Research
05
Verify code and data output
Advanced Data Analysis runs real Python — but the model can hallucinate values in the code it writes. Check the numbers against your source data, not just the chart's visual shape.
Code
06
Voice removes half the friction
A Whisper-based dictation tool eliminates the typing barrier. Karpathy routes ~50% of queries through voice. Use text for technical product names and library names that Whisper mistranscribes.
Voice
07
ChatGPT is the default — for now
Most features, largest ecosystem, most polished UX. Claude for coding. Perplexity for search-first. The landscape shifts quickly — check LM Arena for current rankings before committing.
Ecosystem

Built from Andrej Karpathy's "How I use LLMs" lecture. All content, examples, and framings traced directly to that source. Interactive visualizations built with AI assistance.

← Part 1: How LLMs Work · Full transcript · GitHub