Chapter 10 · Visual Input
Vision &
Camera
Modern LLMs accept images as input — photos, screenshots, scans, diagrams. The model reasons about visual content as fluently as it reasons about text, drawing on training data that included billions of image-text pairs.
Karpathy's examples: uploading a blood test scan for interpretation, pointing a camera at an Aeronet 4 CO2 monitor to identify the device and interpret the 713 PPM reading, and showing a Lord of the Rings map which it correctly identified as Middle-Earth.
Vision is most reliable for well-documented subjects — blood test reference ranges, common consumer devices, famous maps — where training data covers the domain thoroughly. For proprietary or rare objects, expect more hallucination.
Strong vision use cases
Identifying unknown objects, interpreting standard lab results, explaining charts and diagrams, OCR on printed text, reading handwriting, and analyzing screenshots.