Fact-Check Report
LLM Council
GPT-4.1
Gemini 2.5 Pro
Llama 4 Maverick
Reviews were anonymised before synthesis — the synthesiser did not know which model produced which findings.
High Confidence — flagged by 2+ models
High confidence
Common Crawl — indexing 2.7 billion pages by 2024
The 2.7B figure refers to a single crawl dump processed for FineWeb, not the cumulative archive since 2007 which is vastly larger. The phrasing implies a running total, which is misleading.
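For scale, a rough upper-bound sketch of the cumulative archive (the per-crawl size and crawl cadence here are assumptions for illustration, not figures from the guide):

    # Rough upper bound on cumulative Common Crawl page captures, ignoring overlap between snapshots.
    pages_per_crawl = 2.7e9        # roughly the size of one recent crawl snapshot
    crawls_per_year = 10           # Common Crawl has published roughly monthly snapshots in recent years
    years_running = 2024 - 2008    # the archive dates back to around 2008
    cumulative_upper_bound = pages_per_crawl * crawls_per_year * years_running
    print(f"{cumulative_upper_bound:.1e}")   # ~4e11 page captures, i.e. hundreds of billions vs. 2.7 billion per dump

Even allowing for heavy overlap between snapshots, the cumulative archive is far larger than any single dump.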
High confidence
GPT-4 uses a vocabulary of 100,277 tokens via BPE
OpenAI has never officially confirmed GPT-4's vocabulary size. The 100,277 figure is the vocabulary size of the cl100k_base tokenizer associated with GPT-3.5/GPT-4 era models, but attributing it specifically to GPT-4 is unverified. Worth softening to "approximately 100K tokens"; see the quick check below.
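The figure can be checked directly against the tokenizer itself (assumes the tiktoken package is installed; this verifies the tokenizer's vocabulary size, not anything about GPT-4's internals):

    # Inspect the cl100k_base tokenizer shipped in OpenAI's tiktoken library.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    print(enc.n_vocab)                   # 100277 (base BPE vocabulary plus special tokens)
    print(enc.encode("Hello, world!"))   # a short list of integer token ids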
High confidence
GPT-2 training cost ~$40K; same quality today for ~$100
The $100 figure is a significant underestimate without context. Training a 1.5B parameter model from scratch still costs thousands of dollars. The $100 figure likely refers to toy-scale reproductions (e.g. Karpathy's nanoGPT), not full GPT-2 quality. Needs clarification.
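A rough compute-cost sketch for a full 1.5B-parameter reproduction, using the common ~6 × params × tokens FLOPs rule of thumb (the GPU throughput and price below are assumptions for illustration):

    # Back-of-the-envelope training cost for a GPT-2 XL scale run.
    params = 1.5e9                     # GPT-2 XL parameter count
    tokens = 30e9                      # assumed tokens seen during training
    flops = 6 * params * tokens        # ~2.7e20 FLOPs
    effective_flops_per_sec = 3e14     # assume ~300 TFLOP/s sustained per modern GPU
    gpu_hours = flops / effective_flops_per_sec / 3600
    cost_usd = gpu_hours * 3.0         # assume ~$3 per GPU-hour
    print(f"{gpu_hours:.0f} GPU-hours, ~${cost_usd:.0f}")   # ~250 GPU-hours, roughly $750 for the final run alone

Real totals climb into the low thousands once data preparation, failed runs, and lower utilisation are included; either way, well above $100.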
High confidence
ChatGPT is "imitating what a skilled developer-labeler would write"
Misleading framing — coding ability and core knowledge come from pre-training on vast internet data, not from SFT labelers. SFT shapes style and assistant behaviour; it does not create underlying skills from scratch.
High confidence — likely defensible
Llama 3: 405B params, 15T tokens
Two models flagged this, but Llama 3.1 405B was publicly released in July 2024 with a 15T-token training set — so the claim appears correct. The flags reflect outdated reviewer knowledge. Worth adding a citation to confirm.
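As a quick internal plausibility check, the reported figures imply roughly 37 tokens per parameter, above the ~20 tokens-per-parameter reference point from the Chinchilla scaling work and in a normal range for recent training runs, so the pairing is at least internally consistent:

    # Tokens-per-parameter ratio implied by the reported Llama 3.1 405B configuration.
    params = 405e9
    tokens = 15e12
    print(tokens / params)   # ~37 tokens per parameter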
Lower Confidence — flagged by 1 model
Worth reviewing
44 TB FineWeb represents ~15 trillion tokens
FineWeb's standard release is documented at roughly 15T tokens. One reviewer claimed 22T; worth verifying against HuggingFace's published figures directly.
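A quick consistency check between the two published figures (simple arithmetic, no external data):

    # Implied bytes per token if 44 TB of text yields ~15T tokens.
    disk_bytes = 44e12
    tokens = 15e12
    print(disk_bytes / tokens)   # ~2.9 bytes per token, a plausible ratio for BPE-tokenised English web text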
Worth reviewing
GPT-2 trained on 100B tokens
GPT-2 was trained on WebText (~40GB), typically estimated at 20–30B tokens. The 100B figure appears overstated.
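The same bytes-per-token arithmetic makes the 100B figure look suspect as a description of the corpus:

    # Implied bytes per token if 40 GB of WebText produced 100B training tokens.
    corpus_bytes = 40e9
    claimed_tokens = 100e9
    print(corpus_bytes / claimed_tokens)   # 0.4 bytes per token, far below what BPE tokenisers produce on English text

The 100B number would only be plausible as tokens seen across repeated epochs, not as the corpus size.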
Worth reviewing
Embedding size ~1,000–4,000 numbers
Frontier models often use embedding dimensions exceeding 8,000. The upper bound of 4,000 is too narrow for current models.
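Published dimensions for a few well-known models, for comparison (values from the respective papers and released configurations; worth re-checking against the sources):

    # Reported hidden/embedding dimensions from public papers and model configs.
    hidden_sizes = {
        "GPT-2 XL (1.5B)": 1600,
        "GPT-3 (175B)": 12288,
        "Llama 3.1 8B": 4096,
        "Llama 3.1 70B": 8192,
        "Llama 3.1 405B": 16384,
    }
    for name, dim in hidden_sizes.items():
        print(f"{name}: {dim}")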
Worth reviewing
RLHF produces "more honest" responses
RLHF optimises for human preference, not truthfulness — and can reinforce sycophancy or hallucination if raters prefer confident-sounding wrong answers. "More helpful and better structured" is safer wording.
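To make the distinction concrete, reward models behind RLHF are typically trained on pairwise human preferences with a Bradley-Terry style loss; nothing in the objective refers to ground truth (a minimal illustrative sketch, not any particular lab's implementation):

    # Pairwise preference loss for a reward model: the target is "which answer the
    # rater preferred", not "which answer is true".
    import math

    def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
        # -log(sigmoid(r_chosen - r_rejected))
        return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

    # A confident-sounding wrong answer that raters preferred still gets its reward pushed up.
    print(pairwise_loss(reward_chosen=2.0, reward_rejected=0.5))   # small loss once the preference is learned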
Worth reviewing
PII removal finds "named individuals"
Full removal of all named persons is infeasible and not what filtering pipelines actually do. "Attempts to detect" is more accurate.
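For context, PII filters in web-scale pipelines typically pattern-match identifiers like email addresses and phone numbers rather than trying to recognise every personal name (an illustrative sketch only, not FineWeb's actual pipeline):

    # Pattern-based PII scrubbing: catches emails and phone-like strings,
    # but makes no attempt to detect arbitrary personal names.
    import re

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def scrub(text: str) -> str:
        text = EMAIL.sub("<EMAIL>", text)
        return PHONE.sub("<PHONE>", text)

    print(scrub("Write to Jane Doe at jane@example.com or call +1 555 123 4567."))
    # -> "Write to Jane Doe at <EMAIL> or call <PHONE>." (the name itself is untouched)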
Dismissed — intentional simplification
SFT duration: "hours (not months)"
Borderline over-pedantic. Reads as a pedagogical contrast with pre-training duration, not a precise claim.
Dismissed — intentional simplification
The model doesn't "think" — it computes a probability distribution
Standard accepted simplification for explaining next-token prediction. Not misleading in context.
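The simplification maps onto something concrete: the network's final layer emits one score per vocabulary token, and a softmax turns those scores into a probability distribution (a toy sketch with a made-up four-token vocabulary):

    # Toy version of "the model computes a probability distribution over the next token".
    import math

    vocab = ["cat", "dog", "the", "sat"]
    logits = [2.0, 0.5, -1.0, 1.2]        # raw scores from the final layer (made up for illustration)

    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    for token, p in zip(vocab, probs):
        print(f"{token}: {p:.2f}")        # probabilities sum to 1; sampling from them yields the "prediction"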
Overall Assessment
The guide is largely sound as an educational overview, but several specific numerical claims need correction or clarification. The most serious issues are the GPT-2-to-$100 cost comparison (needs context about scale), the Common Crawl page count (conflates a single dump with the full archive), and the "imitating a developer-labeler" framing (understates pre-training's role). The GPT-4 tokenizer and Llama 3 specs are defensible but should be sourced. Several flags target intentional simplifications appropriate for an educational context and can be dismissed.
Raw Findings by Model
GPT-4.1 — 6 issues
Claim: Llama 3: 405B params, 15T tokens
Verdict: misleading
Notes: As of mid-2024, the largest publicly disclosed Llama 3 model was 70B parameters. (Note: Llama 3.1 405B was released July 2024; this flag appears outdated.)
Claim: GPT-4 uses a vocabulary of 100,277 tokens via BPE
Verdict: misleading
Notes: The 100,277 figure is accurate for cl100k_base but not officially confirmed for GPT-4 specifically.
Claim: Embedding size ~1,000–4,000 numbers
Verdict: misleading
Notes: Most frontier models use 2,048–8,192+ dimensions. The upper bound of 4,000 is too narrow.
Claim: The model learns grammar, facts, reasoning patterns implicitly
Verdict: misleading
Notes: Overstates robustness of learned reasoning. (Dismissed as acceptable simplification.)
Claim: SFT duration: hours (not months)
Verdict: misleading
Notes: Large-scale SFT can take days. (Dismissed as intentional simplification.)
Claim: RLHF produces responses that are "more honest"
Verdict: misleading
Notes: RLHF optimises for human preference, not truthfulness; can reinforce sycophancy.
Gemini 2.5 Pro — 6 issues
Claim: Common Crawl — indexing 2.7 billion pages by 2024
Verdict: misleading
Notes: Refers to a single FineWeb crawl dump, not the cumulative archive since 2007.
Claim: 44 TB FineWeb = ~15 trillion tokens
Verdict: wrong
Notes: Reviewer claims FineWeb contains ~22T tokens; the 15T figure is Llama 3's training mix. Needs verification against HuggingFace's published figures.
Claim: PII removal finds "named individuals"
Verdict: misleading
Notes: Full removal of all named persons is infeasible; pipelines focus on PII patterns, not all proper names.
Claim: GPT-2 trained on 100B tokens
Verdict: wrong
Notes: WebText (~40GB) corresponds to ~20–30B tokens, not 100B.
Claim: Same GPT-2 quality today for ~$100
Verdict: wrong
Notes: Training a 1.5B model from scratch still costs thousands of dollars. The $100 figure applies only to toy-scale runs.
Claim: ChatGPT imitates "a skilled developer-labeler"
Verdict: misleading
Notes: Core skills come from pre-training, not SFT labelers. SFT shapes style and assistant behaviour only.
Llama 4 Maverick — 7 issues
Claim: Common Crawl — indexing 2.7 billion pages by 2024
Verdict: outdated
Notes: Figure may be an underestimate or conflation with a single crawl snapshot.
Claim: GPT-4 uses a vocabulary of 100,277 tokens
Verdict: outdated
Notes: Exact vocabulary size not publicly disclosed by OpenAI.
Claim: GPT-2 training cost ~$40K; same quality today for ~$100
Verdict: misleading
Notes: Actual cost depends on hardware, electricity, and model architecture; $100 is an oversimplification.
Claim: Llama 3: 405B params, 15T tokens
Verdict: unverifiable
Notes: Specific details not publicly confirmed without an official citation.
Claim: Modern frontier models: hundreds of billions of parameters, trillions of tokens
Verdict: unverifiable
Notes: Vague generalisation. (Dismissed; intentionally general and accurate.)
Claim: The model doesn't "think" — it computes a probability distribution
Verdict: misleading
Notes: Standard simplification of next-token prediction. (Dismissed.)
Claim: ChatGPT imitates "a skilled developer-labeler"
Verdict: misleading
Notes: Model generates based on patterns from all training data, not just labeler outputs.