Llama Guard vs Llama Guard 2 vs Llama Guard 3: The Lineage, Clarified
Meta's Llama Guard series gets cited loosely, often with the wrong base model or category count. Here's the verified lineage — base models, taxonomies, and category counts — with the version differences that actually matter in production.
“We use Llama Guard” is one of the most ambiguous sentences in a content-moderation design review. There are several distinct models under that name, built on different base models, trained against different taxonomies, and covering different numbers of categories. Citations get this wrong constantly — wrong base model, wrong category count, conflated versions.
This is the verified lineage, drawn from Meta’s own model cards and the original paper. Where a number matters, the primary source is linked; verify the exact figures and any benchmark claims against those sources before you quote them, because Meta updates model cards and the details shift between point releases.
The short version
| Model | Base model | Categories | Taxonomy | Notable |
|---|---|---|---|---|
| Llama Guard (original) | Llama 2 7B | 6 | Custom (Meta) | First release; input + output classifier |
| Llama Guard 2 | Llama 3 8B | 11 | MLCommons | Category expansion, English-primary |
| Llama Guard 3 8B | Llama 3.1 8B | 14 | MLCommons + 1 | 8 languages; adds tool-use category |
The recurring confusion is treating these as the same model with a version bump. They are not. The base model changes every generation, the taxonomy changed at version 2, and the category set grew each time. If a benchmark or a blog post says “Llama Guard” without a version, treat the claim as unspecified until you pin the version.
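One way to enforce version pinning in a codebase or review checklist is to keep the table above as data and reject the bare name. The following is a minimal sketch: the lookup keys (`llama-guard-1` and so on) and the `resolve` helper are hypothetical names invented for illustration, not identifiers Meta uses, and the values are copied from the table, so re-check them against the model cards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardVersion:
    name: str
    base_model: str
    categories: int
    taxonomy: str


# Values copied from the summary table; the keys are hypothetical labels.
LINEAGE = {
    "llama-guard-1": GuardVersion("Llama Guard (original)", "Llama 2 7B", 6, "Custom (Meta)"),
    "llama-guard-2": GuardVersion("Llama Guard 2", "Llama 3 8B", 11, "MLCommons"),
    "llama-guard-3-8b": GuardVersion("Llama Guard 3 8B", "Llama 3.1 8B", 14, "MLCommons + 1"),
}


def resolve(key: str) -> GuardVersion:
    """Return the pinned version, rejecting unversioned references."""
    normalized = key.strip().lower().replace(" ", "-")
    if normalized == "llama-guard":
        raise ValueError(
            "'Llama Guard' without a version is ambiguous; pin one of: "
            + ", ".join(sorted(LINEAGE))
        )
    try:
        return LINEAGE[normalized]
    except KeyError:
        raise ValueError(f"unknown Llama Guard version: {key!r}") from None
```

Calling `resolve("Llama Guard")` raises, which is the point: the unversioned name never reaches a config file or a benchmark table.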
Llama Guard (original)
Meta introduced the original Llama Guard in the paper Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (arXiv 2312.06674, published December 2023). The key facts:
- Base model: a Llama 2 7B model, instruction-tuned for safety classification.
- Taxonomy: a custom Meta taxonomy of 6 categories: violence and hate, sexual content, guns and illegal weapons, regulated or controlled substances, suicide and self-harm, and criminal planning. (Read the paper for the exact category definitions.)
- Design: classifies both prompts (input) and responses (output) as safe/unsafe with category labels. This input-and-output framing is the part of the design that carried through every later version.
- Flexibility: because it’s instruction-tuned, the taxonomy can be adapted at inference via the prompt — you can supply your own category definitions zero- or few-shot. That capability also persisted.
The thing to take from the original release is the shape: an LLM-as-classifier that scores both sides of a conversation against a stated taxonomy. Everything after this is refinement.
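The safe/unsafe-plus-categories shape is easiest to see from the consuming side. The sketch below assumes the reply format described in the paper and model cards: a first line reading `safe` or `unsafe`, and, when unsafe, a second line of comma-separated category codes (such as `O1,O3` in the original release or `S1,S10` in later versions). The `Verdict` type and `parse_verdict` function are hypothetical helpers, not part of any Meta library, so confirm the format against the version you deploy.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Verdict:
    safe: bool
    categories: List[str]


def parse_verdict(raw: str) -> Verdict:
    """Parse a classifier reply into a verdict plus violated category codes."""
    # Drop blank lines and surrounding whitespace the model may emit.
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty classifier output")

    label = lines[0].lower()
    if label == "safe":
        return Verdict(safe=True, categories=[])
    if label == "unsafe":
        codes: List[str] = []
        if len(lines) > 1:
            codes = [code.strip() for code in lines[1].split(",") if code.strip()]
        return Verdict(safe=False, categories=codes)

    # Anything else is treated as a failure rather than silently passed.
    raise ValueError(f"unexpected classifier output: {raw!r}")
```

Raising on unrecognized output is a deliberate choice here: a moderation layer that treats a malformed reply as "safe" fails open.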
Llama Guard 2
Llama Guard 2 shipped alongside the Llama 3 model family (Llama 3 was released April 18, 2024). Per its model card:
- Base model: an 8B-parameter Llama 3-based model — note the jump from Llama 2 7B to Llama 3 8B.
- Taxonomy: moved to the MLCommons hazards taxonomy (the v0.5 proof-of-concept taxonomy), a deliberate step toward an industry-standard category set rather than a Meta-internal one.
- Categories: 11, up from 6. Meta’s own model card notes that measured performance shifted partly because the harm-category count expanded from 6 to 11 — a useful reminder that a “lower number” across versions can reflect a harder, broader test set rather than a worse model.
- Language: still English-primary.
The headline change at version 2 isn’t raw accuracy — it’s the taxonomy move to MLCommons. If your moderation policy needs to map onto an industry-recognized hazard set, version 2 is where Llama Guard started speaking that language.
Llama Guard 3
Llama Guard 3 shipped with Llama 3.1 (released July 23, 2024). Per the 8B model card and Meta’s model docs:
- Base model: Llama 3.1 8B.
- Taxonomy: the MLCommons standardized hazards taxonomy (13 hazards), plus one additional category — Code Interpreter Abuse — for tool-use cases, for 14 categories total (labeled S1–S14).
- Languages: 8 — English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. This multilingual expansion is the most operationally significant change from version 2’s English-primary scope.
- Tool-use awareness: the added Code Interpreter Abuse category reflects the shift toward agent and tool-calling deployments.
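For logging and routing, the S1–S14 codes usually get mapped to readable names. The mapping below is transcribed from our reading of the Llama Guard 3 8B model card and should be treated as an assumption to verify before use, since labels can change between releases. The `describe` helper is a hypothetical name for illustration.

```python
# Assumed from the Llama Guard 3 8B model card; verify each label before relying on it.
LLAMA_GUARD_3_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",  # the one category added on top of the 13 hazards
}


def describe(codes):
    """Translate category codes to names, flagging anything unrecognized."""
    return [
        LLAMA_GUARD_3_CATEGORIES.get(code.strip().upper(), f"unknown code {code!r}")
        for code in codes
    ]
```

Flagging unknown codes instead of dropping them helps catch the case where output from an earlier version, which used a different code scheme, is fed into a version 3 pipeline.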
Meta’s model card reports improved English F1 and a lower false-positive rate for version 3 versus version 2. Treat those as vendor-reported benchmark numbers and verify the current figures on the model card. As we always argue, benchmark numbers are not production numbers: your traffic distribution, languages, and false-positive tolerance determine real-world behavior far more than a headline F1.
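Measuring the same two quantities on your own labeled traffic takes very little code. This is a minimal sketch using the standard definitions of F1 and false-positive rate, with "unsafe" as the positive class; the `moderation_metrics` name and the pair layout are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def moderation_metrics(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute F1 and false-positive rate from (predicted_unsafe, actually_unsafe) pairs."""
    tp = fp = tn = fn = 0
    for predicted_unsafe, actually_unsafe in pairs:
        if predicted_unsafe and actually_unsafe:
            tp += 1
        elif predicted_unsafe:
            fp += 1  # benign content that was flagged
        elif actually_unsafe:
            fn += 1  # harmful content that was missed
        else:
            tn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }
```

Run it per language and per category rather than over the whole sample: an aggregate F1 can hide a high false-positive rate on the slice of traffic you care about most.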
The Llama Guard 3 family is bigger than “the 8B”
Two more Llama Guard 3 variants matter when someone says “Llama Guard 3” generically:
- Llama Guard 3 1B: based on Llama 3.2 1B, with a pruned and quantized build also published for a much smaller footprint. The relevant trade-off is the usual one: far cheaper and faster to run, with reduced capability versus the 8B. Worth evaluating where latency and cost dominate and your category needs are narrow.
- Llama Guard 3 11B Vision — built to support Llama 3.2’s image understanding, so it can classify text+image prompts and the text responses to them. This is the entry point if you need multimodal moderation in the Llama Guard family rather than text-only. (For the broader image/video moderation landscape, see our tooling roundup.)
A note on “Llama Guard 4”
Meta has since published a Llama Guard 4 12B model (a multimodal safeguard). It’s outside the version-3-and-earlier comparison this article is scoped to, but it’s worth knowing it exists so you don’t assume version 3 is the current top of the line. If you’re starting fresh, check what the latest released model is on Meta’s hub before standardizing on a version.
What actually changes your decision
Pinning the version matters because the differences are real, but the choice between them usually comes down to four practical questions, not the version number itself:
- Languages. If you serve non-English traffic, version 3’s 8-language support is the dividing line versus the English-primary earlier versions.
- Taxonomy fit. If your policy needs to map onto MLCommons hazards, that starts at version 2. If you have domain-specific categories, none of these cover them out of the box — you’ll be fine-tuning or layering a second classifier regardless.
- Footprint and latency. The 1B variant exists precisely for cost- and latency-bound deployments; the 8B needs meaningful GPU to stay interactive. We covered the latency profile in detail.
- Modality. Text-only deployments use the 8B or 1B; anything touching images needs the 11B Vision variant (or a different tool entirely).
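The four questions can be written down as a small decision function, which makes the assumptions behind a choice reviewable. This sketch encodes only the rules stated in the list above; `pick_variant`, its parameters, and the ISO 639-1 language codes are hypothetical conventions, and the language set is the one listed for the 8B, so check each variant's own model card before relying on it.

```python
from typing import List, Set, Tuple

# The 8 languages listed for Llama Guard 3 8B, as ISO 639-1 codes.
V3_8B_LANGUAGES = {"en", "fr", "de", "hi", "it", "pt", "es", "th"}


def pick_variant(
    languages: Set[str],
    needs_images: bool,
    latency_bound: bool,
    custom_categories: bool = False,
) -> Tuple[str, List[str]]:
    """Suggest a Llama Guard 3 variant plus caveats to raise in review."""
    caveats: List[str] = []

    # Modality decides first: only the Vision variant handles images.
    if needs_images:
        variant = "Llama Guard 3 11B Vision"
    elif latency_bound:
        variant = "Llama Guard 3 1B"
    else:
        variant = "Llama Guard 3 8B"

    uncovered = sorted(languages - V3_8B_LANGUAGES)
    if uncovered:
        caveats.append("languages outside the 8B's listed set: " + ", ".join(uncovered))
    if custom_categories:
        caveats.append("domain-specific categories need fine-tuning or a second classifier")

    return variant, caveats
```

For example, `pick_variant({"en", "ja"}, needs_images=False, latency_bound=True)` suggests the 1B and returns a caveat about Japanese, which is the kind of gap that otherwise surfaces only after launch.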
The version is the easy part to get right once you’ve stopped citing “Llama Guard” without a number. The hard part — false-positive tolerance on your traffic — is the same regardless of which version you pick, and it’s why we keep pointing back at measuring against your own distribution rather than a vendor benchmark.
For comparative data across Llama Guard variants and other classifiers on specific harm categories, bestaisecuritytools.com maintains benchmark pointers.