Fact-Check Report
LLM Council
GPT-4.1
Gemini 2.5 Pro
Llama 4 Maverick
Reviews were anonymised before synthesis — the synthesiser did not know which model produced which findings.
High Confidence — flagged by 2+ models
High confidence
Common Crawl — indexing 2.7 billion pages by 2024
The 2.7B figure refers to a single crawl dump processed for FineWeb, not the cumulative archive since 2007 which is vastly larger. The phrasing implies a running total, which is misleading.
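For scale, a rough upper-bound sketch of the cumulative archive (the per-crawl size and crawl cadence here are assumptions for illustration, not figures from the guide):

    # Rough upper bound on cumulative Common Crawl page captures, ignoring overlap between snapshots.
    pages_per_crawl = 2.7e9        # roughly the size of one recent crawl snapshot
    crawls_per_year = 10           # Common Crawl has published roughly monthly snapshots in recent years
    years_running = 2024 - 2008    # the archive dates back to around 2008
    cumulative_upper_bound = pages_per_crawl * crawls_per_year * years_running
    print(f"{cumulative_upper_bound:.1e}")   # ~4e11 page captures, i.e. hundreds of billions vs. 2.7 billion per dump

Even allowing for heavy overlap between snapshots, the cumulative archive is far larger than any single dump.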
High confidence
GPT-4 uses a vocabulary of 100,277 tokens via BPE
OpenAI has never officially confirmed GPT-4's vocabulary size. The 100,277 figure is the vocabulary size of the cl100k_base tokenizer associated with GPT-3.5/GPT-4 era models, but attributing it specifically to GPT-4 is unverified. Worth softening to "approximately 100K tokens"; see the quick check below.
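The figure can be checked directly against the tokenizer itself (assumes the tiktoken package is installed; this verifies the tokenizer's vocabulary size, not anything about GPT-4's internals):

    # Inspect the cl100k_base tokenizer shipped in OpenAI's tiktoken library.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)                   # 100277 (base BPE vocabulary plus special tokens)
    print(enc.encode("Hello, world!"))   # a short list of integer token ids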
High confidence
GPT-2 training cost ~$40K; same quality today for ~$100
The $100 figure is a significant underestimate without context. Training a 1.5B parameter model from scratch still costs thousands of dollars. The $100 figure likely refers to toy-scale reproductions (e.g. Karpathy's nanoGPT), not full GPT-2 quality. Needs clarification.
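A rough compute-cost sketch for a full 1.5B-parameter reproduction, using the common ~6 × params × tokens FLOPs rule of thumb (the GPU throughput and price below are assumptions for illustration):

    # Back-of-the-envelope training cost for a GPT-2 XL scale run.
    params = 1.5e9                     # GPT-2 XL parameter count
    tokens = 30e9                      # assumed tokens seen during training
    flops = 6 * params * tokens        # ~2.7e20 FLOPs
    effective_flops_per_sec = 3e14     # assume ~300 TFLOP/s sustained per modern GPU
    gpu_hours = flops / effective_flops_per_sec / 3600
    cost_usd = gpu_hours * 3.0         # assume ~$3 per GPU-hour
    print(f"{gpu_hours:.0f} GPU-hours, ~${cost_usd:.0f}")   # ~250 GPU-hours, roughly $750 for the final run alone

Real totals climb into the low thousands once data preparation, failed runs, and lower utilisation are included; either way, well above $100.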
High confidence
ChatGPT is "imitating what a skilled developer-labeler would write"
Misleading framing — coding ability and core knowledge come from pre-training on vast internet data, not from SFT labelers. SFT shapes style and assistant behaviour; it does not create underlying skills from scratch.
High confidence — likely defensible
Llama 3: 405B params, 15T tokens
Two models flagged this, but Llama 3.1 405B was publicly released in July 2024 with a 15T-token training set — so the claim appears correct. The flags reflect outdated reviewer knowledge. Worth adding a citation to confirm.
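As a quick internal plausibility check, the reported figures imply roughly 37 tokens per parameter, above the ~20 tokens-per-parameter reference point from the Chinchilla scaling work and in a normal range for recent training runs, so the pairing is at least internally consistent:

    # Tokens-per-parameter ratio implied by the reported Llama 3.1 405B configuration.
    params = 405e9
    tokens = 15e12
    print(tokens / params)   # ~37 tokens per parameter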
Lower Confidence — flagged by 1 model
Worth reviewing
44 TB FineWeb represents ~15 trillion tokens
FineWeb's standard release is documented at roughly 15T tokens. One reviewer claimed 22T; worth verifying against HuggingFace's published figures directly.
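A quick consistency check between the two published figures (simple arithmetic, no external data):

    # Implied bytes per token if 44 TB of text yields ~15T tokens.
    disk_bytes = 44e12
    tokens = 15e12
    print(disk_bytes / tokens)   # ~2.9 bytes per token, a plausible ratio for BPE-tokenised English web text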
Worth reviewing
GPT-2 trained on 100B tokens
GPT-2 was trained on WebText (~40GB), typically estimated at 20–30B tokens. The 100B figure appears overstated.
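The same bytes-per-token arithmetic makes the 100B figure look suspect as a description of the corpus:

    # Implied bytes per token if 40 GB of WebText produced 100B training tokens.
    corpus_bytes = 40e9
    claimed_tokens = 100e9
    print(corpus_bytes / claimed_tokens)   # 0.4 bytes per token, far below what BPE tokenisers produce on English text

The 100B number would only be plausible as tokens seen across repeated epochs, not as the corpus size.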
Worth reviewing
Embedding size ~1,000–4,000 numbers
Frontier models often use embedding dimensions exceeding 8,000. The upper bound of 4,000 is too narrow for current models.
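Published dimensions for a few well-known models, for comparison (values from the respective papers and released configurations; worth re-checking against the sources):

    # Reported hidden/embedding dimensions from public papers and model configs.
    hidden_sizes = {
        "GPT-2 XL (1.5B)": 1600,
        "GPT-3 (175B)": 12288,
        "Llama 3.1 8B": 4096,
        "Llama 3.1 70B": 8192,
        "Llama 3.1 405B": 16384,
    }
    for name, dim in hidden_sizes.items():
        print(f"{name}: {dim}")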
Worth reviewing
RLHF produces "more honest" responses
RLHF optimises for human preference, not truthfulness — and can reinforce sycophancy or hallucination if raters prefer confident-sounding wrong answers. "More helpful and better structured" is safer wording.
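To make the distinction concrete, reward models behind RLHF are typically trained on pairwise human preferences with a Bradley-Terry style loss; nothing in the objective refers to ground truth (a minimal illustrative sketch, not any particular lab's implementation):

    # Pairwise preference loss for a reward model: the target is "which answer the
    # rater preferred", not "which answer is true".
    import math

    def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
        # -log(sigmoid(r_chosen - r_rejected))
        return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

    # A confident-sounding wrong answer that raters preferred still gets its reward pushed up.
    print(pairwise_loss(reward_chosen=2.0, reward_rejected=0.5))   # small loss once the preference is learned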
Worth reviewing
PII removal finds "named individuals"
Full removal of all named persons is infeasible and not what filtering pipelines actually do. "Attempts to detect" is more accurate.
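For context, PII filters in web-scale pipelines typically pattern-match identifiers like email addresses and phone numbers rather than trying to recognise every personal name (an illustrative sketch only, not FineWeb's actual pipeline):

    # Pattern-based PII scrubbing: catches emails and phone-like strings,
    # but makes no attempt to detect arbitrary personal names.
    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def scrub(text: str) -> str:
        text = EMAIL.sub("<EMAIL>", text)
        return PHONE.sub("<PHONE>", text)

    print(scrub("Write to Jane Doe at jane@example.com or call +1 555 123 4567."))
    # -> "Write to Jane Doe at <EMAIL> or call <PHONE>." (the name itself is untouched)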
Dismissed — intentional simplification
SFT duration: "hours (not months)"
Borderline over-pedantic. Reads as a pedagogical contrast with pre-training duration, not a precise claim.
Dismissed — intentional simplification
The model doesn't "think" — it computes a probability distribution
Standard accepted simplification for explaining next-token prediction. Not misleading in context.
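The simplification maps onto something concrete: the network's final layer emits one score per vocabulary token, and a softmax turns those scores into a probability distribution (a toy sketch with a made-up four-token vocabulary):

    # Toy version of "the model computes a probability distribution over the next token".
    import math

    vocab = ["cat", "dog", "the", "sat"]
    logits = [2.0, 0.5, -1.0, 1.2]        # raw scores from the final layer (made up for illustration)

    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    for token, p in zip(vocab, probs):
        print(f"{token}: {p:.2f}")        # probabilities sum to 1; sampling from them yields the "prediction"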
Overall Assessment
The guide is largely sound as an educational overview, but several specific numerical claims need correction or clarification. The most serious issues are the GPT-2-to-$100 cost comparison (needs context about scale), the Common Crawl page count (conflates a single dump with the full archive), and the "imitating a developer-labeler" framing (understates pre-training's role). The GPT-4 tokenizer and Llama 3 specs are defensible but should be sourced. Several flags target intentional simplifications appropriate for an educational context and can be dismissed.
Raw Findings by Model
GPT-4.1 — 6 issues
Claim: Llama 3: 405B params, 15T tokens
Verdict: misleading
Notes: As of mid-2024, the largest publicly disclosed Llama 3 model was 70B parameters. (Note: Llama 3.1 405B was released July 2024; this flag appears outdated.)
Claim: GPT-4 uses a vocabulary of 100,277 tokens via BPE
Verdict: misleading
Notes: The 100,277 figure is accurate for cl100k_base but not officially confirmed for GPT-4 specifically.
Claim: Embedding size ~1,000–4,000 numbers
Verdict: misleading
Notes: Most frontier models use 2,048–8,192+ dimensions. The upper bound of 4,000 is too narrow.
Claim: The model learns grammar, facts, reasoning patterns implicitly
Verdict: misleading
Notes: Overstates robustness of learned reasoning. (Dismissed as acceptable simplification.)
Claim: SFT duration: hours (not months)
Verdict: misleading
Notes: Large-scale SFT can take days. (Dismissed as intentional simplification.)
Claim: RLHF produces responses that are "more honest"
Verdict: misleading
Notes: RLHF optimises for human preference, not truthfulness; can reinforce sycophancy.
Gemini 2.5 Pro — 6 issues
Claim: Common Crawl — indexing 2.7 billion pages by 2024
Verdict: misleading
Notes: Refers to a single FineWeb crawl dump, not the cumulative archive since 2007.
Claim: 44 TB FineWeb = ~15 trillion tokens
Verdict: wrong
Notes: Reviewer claims FineWeb contains ~22T tokens; the 15T figure is Llama 3's training mix. Needs verification against HuggingFace's published figures.
Claim: PII removal finds "named individuals"
Verdict: misleading
Notes: Full removal of all named persons is infeasible; pipelines focus on PII patterns, not all proper names.
Claim: GPT-2 trained on 100B tokens
Verdict: wrong
Notes: WebText (~40GB) corresponds to ~20–30B tokens, not 100B.
Claim: Same GPT-2 quality today for ~$100
Verdict: wrong
Notes: Training a 1.5B model from scratch still costs thousands of dollars. The $100 figure applies only to toy-scale runs.
Claim: ChatGPT imitates "a skilled developer-labeler"
Verdict: misleading
Notes: Core skills come from pre-training, not SFT labelers. SFT shapes style and assistant behaviour only.
Llama 4 Maverick — 7 issues
Claim: Common Crawl — indexing 2.7 billion pages by 2024
Verdict: outdated
Notes: Figure may be an underestimate or conflation with a single crawl snapshot.
Claim: GPT-4 uses a vocabulary of 100,277 tokens
Verdict: outdated
Notes: Exact vocabulary size not publicly disclosed by OpenAI.
Claim: GPT-2 training cost ~$40K; same quality today for ~$100
Verdict: misleading
Notes: Actual cost depends on hardware, electricity, and model architecture; $100 is an oversimplification.
Claim: Llama 3: 405B params, 15T tokens
Verdict: unverifiable
Notes: Specific details not publicly confirmed without an official citation.
Claim: Modern frontier models: hundreds of billions of parameters, trillions of tokens
Verdict: unverifiable
Notes: Vague generalisation. (Dismissed; intentionally general and accurate.)
Claim: The model doesn't "think" — it computes a probability distribution
Verdict: misleading
Notes: Standard simplification of next-token prediction. (Dismissed.)
Claim: ChatGPT imitates "a skilled developer-labeler"
Verdict: misleading
Notes: Model generates based on patterns from all training data, not just labeler outputs.