AI Moderation Tools

Llama Guard vs Llama Guard 2 vs Llama Guard 3: The Lineage, Clarified

Meta's Llama Guard series gets cited loosely, often with the wrong base model or category count. Here's the verified lineage — base models, taxonomies, and category counts — with the version differences that actually matter in production.

By AI Moderation Tools Editorial · 8 min read

“We use Llama Guard” is one of the most ambiguous sentences in a content-moderation design review. There are several distinct models under that name, built on different base models, trained against different taxonomies, and covering different numbers of categories. Citations get this wrong constantly — wrong base model, wrong category count, conflated versions.

This is the verified lineage, drawn from Meta’s own model cards and the original paper. Where a number matters, the primary source is linked; verify the exact figures and any benchmark claims against those sources before you quote them, because Meta updates model cards and the details shift between point releases.

The short version

| Model | Base model | Categories | Taxonomy | Notable |
| --- | --- | --- | --- | --- |
| Llama Guard (original) | Llama 2 7B | 6 | Custom (Meta) | First release; input + output classifier |
| Llama Guard 2 | Llama 3 8B | 11 | MLCommons | Category expansion, English-primary |
| Llama Guard 3 8B | Llama 3.1 8B | 14 | MLCommons + 1 | 8 languages; adds tool-use category |

The recurring confusion is treating these as the same model with a version bump. They are not. The base model changes every generation, the taxonomy changed at version 2, and the category set grew each time. If a benchmark or a blog post says “Llama Guard” without a version, treat the claim as unspecified until you pin the version.
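One way to enforce version pinning in a codebase is to refuse any unversioned reference outright. The sketch below is a minimal illustration under our own assumptions: the registry keys (`llama-guard-1` and so on) and the `resolve` helper are hypothetical names, and the figures are copied from the summary table above, so re-check them against the model cards before relying on them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardVersion:
    base_model: str
    categories: int
    taxonomy: str


# Hypothetical registry keys; values mirror the summary table in this article.
LINEAGE = {
    "llama-guard-1": GuardVersion("Llama 2 7B", 6, "Custom (Meta)"),
    "llama-guard-2": GuardVersion("Llama 3 8B", 11, "MLCommons"),
    "llama-guard-3-8b": GuardVersion("Llama 3.1 8B", 14, "MLCommons + 1"),
}


def resolve(name: str) -> GuardVersion:
    """Map a human-written model reference to a pinned version, or fail loudly."""
    key = name.strip().lower().replace(" ", "-")
    if key not in LINEAGE:
        # A bare "Llama Guard" lands here: it names no specific model.
        raise ValueError(f"unversioned or unknown model reference: {name!r}")
    return LINEAGE[key]
```

With this in place, `resolve("Llama Guard 2")` returns the pinned record, while `resolve("Llama Guard")` raises instead of silently picking a default.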

Llama Guard (original)

Meta introduced the original Llama Guard in the paper Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (arXiv 2312.06674, published December 2023). The key facts:

  • Base model: a Llama 2 7B model, instruction-tuned for safety classification.
  • Taxonomy: a custom Meta taxonomy of 6 categories: violence and hate, sexual content, guns and illegal weapons, regulated or controlled substances, suicide and self-harm, and criminal planning. (Read the paper for the exact category definitions.)
  • Design: classifies both prompts (input) and responses (output) as safe/unsafe with category labels. This input-and-output framing is the part of the design that carried through every later version.
  • Flexibility: because it’s instruction-tuned, the taxonomy can be adapted at inference via the prompt — you can supply your own category definitions zero- or few-shot. That capability also persisted.
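To make the prompt-level flexibility concrete, here is a rough sketch of assembling a classification prompt from your own category definitions. The layout is an illustration we made up, not Meta's actual template: the real prompt format is documented in the model cards and should be used verbatim in production. The function name `build_guard_prompt` and the `C1`/`C2` codes are hypothetical.

```python
def build_guard_prompt(
    categories: dict[str, str],
    conversation: list[tuple[str, str]],
    target_role: str,
) -> str:
    """Assemble an illustrative safety-classification prompt.

    target_role is "User" to check the prompt (input) or "Agent" to check
    the response (output). The layout is NOT Meta's official template.
    """
    if target_role not in ("User", "Agent"):
        raise ValueError("target_role must be 'User' or 'Agent'")

    lines = [
        f"Task: decide whether the last '{target_role}' message is unsafe "
        "under the categories below.",
        "",
        "Categories:",
    ]
    # Custom taxonomy supplied at inference time (zero-shot adaptation).
    for code, definition in categories.items():
        lines.append(f"{code}: {definition}")

    lines += ["", "Conversation:"]
    for speaker, text in conversation:
        lines.append(f"{speaker}: {text}")

    lines += [
        "",
        "Answer 'safe' or 'unsafe'. If unsafe, list the violated category "
        "codes on a second line.",
    ]
    return "\n".join(lines)
```

Calling it with `target_role="User"` frames the request as an input check; `target_role="Agent"` frames it as an output check on the same conversation.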

The thing to take from the original release is the shape: an LLM-as-classifier that scores both sides of a conversation against a stated taxonomy. Everything after this is refinement.

Llama Guard 2

Llama Guard 2 shipped alongside the Llama 3 model family (Llama 3 was released April 18, 2024). Per its model card:

  • Base model: an 8B-parameter Llama 3-based model — note the jump from Llama 2 7B to Llama 3 8B.
  • Taxonomy: moved to the MLCommons hazards taxonomy (the v0.5 proof-of-concept taxonomy), a deliberate step toward an industry-standard category set rather than a Meta-internal one.
  • Categories: 11, up from 6. Meta’s own model card notes that measured performance shifted partly because the harm-category count expanded from 6 to 11 — a useful reminder that a “lower number” across versions can reflect a harder, broader test set rather than a worse model.
  • Language: still English-primary.

The headline change at version 2 isn’t raw accuracy — it’s the taxonomy move to MLCommons. If your moderation policy needs to map onto an industry-recognized hazard set, version 2 is where Llama Guard started speaking that language.

Llama Guard 3

Llama Guard 3 shipped with Llama 3.1 (released July 23, 2024). Per the 8B model card and Meta’s model docs:

  • Base model: Llama 3.1 8B.
  • Taxonomy: the MLCommons standardized hazards taxonomy (13 hazards), plus one additional category — Code Interpreter Abuse — for tool-use cases, for 14 categories total (labeled S1–S14).
  • Languages: 8 — English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. This multilingual expansion is the most operationally significant change from version 2’s English-primary scope.
  • Tool-use awareness: the added Code Interpreter Abuse category reflects the shift toward agent and tool-calling deployments.
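If you consume the classifier's verdict in code, it helps to validate the category labels against the S1–S14 range instead of trusting free text. The parser below is a sketch that assumes the output is a `safe` or `unsafe` line, followed (when unsafe) by comma-separated codes on the next line. Confirm that format against the model card for the exact version you deploy. `Verdict` and `parse_verdict` are our own hypothetical names.

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    safe: bool
    codes: tuple[str, ...]


MAX_CATEGORY = 14  # Llama Guard 3 8B labels run S1 through S14


def parse_verdict(raw: str) -> Verdict:
    """Parse an assumed 'safe' / 'unsafe\\nS1,S14' style classifier output."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier output")

    head = lines[0].lower()
    if head == "safe":
        return Verdict(True, ())
    if head != "unsafe":
        raise ValueError(f"unexpected verdict line: {lines[0]!r}")

    codes = []
    for token in ",".join(lines[1:]).split(","):
        token = token.strip().upper()
        if not token:
            continue
        match = re.fullmatch(r"S(\d+)", token)
        if not match or not 1 <= int(match.group(1)) <= MAX_CATEGORY:
            raise ValueError(f"category code out of range: {token!r}")
        codes.append(token)
    return Verdict(False, tuple(codes))
```

Rejecting out-of-range codes matters when you switch versions: a pipeline written for an 11-category model should fail visibly, not silently drop labels it does not recognize.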

Meta’s model card reports improved English F1 and a lower false-positive rate for version 3 versus version 2. Treat those as vendor-reported benchmark numbers and verify the current figures on the model card. As we always argue, benchmark numbers are not production numbers: your traffic distribution, languages, and false-positive tolerance determine real-world behavior far more than a headline F1.
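Measuring F1 and false-positive rate on your own labeled sample takes only a few lines. The sketch below uses the standard definitions (precision, recall, F1 as their harmonic mean, and FPR as false positives over all truly safe items). The function name `moderation_metrics` and the input shape are our own assumptions.

```python
def moderation_metrics(pairs):
    """Compute F1 and false-positive rate.

    pairs: iterable of (truth_unsafe, predicted_unsafe) booleans, one per
    labeled message from your own traffic sample.
    """
    tp = fp = tn = fn = 0
    for truth_unsafe, predicted_unsafe in pairs:
        if truth_unsafe and predicted_unsafe:
            tp += 1
        elif truth_unsafe:
            fn += 1
        elif predicted_unsafe:
            fp += 1
        else:
            tn += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # FPR: share of genuinely safe messages that were flagged.
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"f1": f1, "false_positive_rate": false_positive_rate}
```

Run it per language and per category, not only in aggregate, since a good overall F1 can hide a high false-positive rate on one slice of traffic.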

The Llama Guard 3 family is bigger than “the 8B”

Two more Llama Guard 3 variants matter when someone says “Llama Guard 3” generically:

  • Llama Guard 3 1B — based on Llama 3.2 1B, pruned and quantized to a much smaller footprint. The relevant trade-off is the usual one: far cheaper and faster to run, with reduced capability versus the 8B. Worth evaluating where latency and cost dominate and your category needs are narrow.
  • Llama Guard 3 11B Vision — built to support Llama 3.2’s image understanding, so it can classify text+image prompts and the text responses to them. This is the entry point if you need multimodal moderation in the Llama Guard family rather than text-only. (For the broader image/video moderation landscape, see our tooling roundup.)

A note on “Llama Guard 4”

Meta has since published a Llama Guard 4 12B model (a multimodal safeguard). It’s outside the version-3-and-earlier comparison this article is scoped to, but it’s worth knowing it exists so you don’t assume version 3 is the current top of the line. If you’re starting fresh, check what the latest released model is on Meta’s hub before standardizing on a version.

What actually changes your decision

Pinning the version matters because the differences are real, but the choice between them usually comes down to four practical questions, not the version number itself:

  1. Languages. If you serve non-English traffic, version 3’s 8-language support is the dividing line versus the English-primary earlier versions.
  2. Taxonomy fit. If your policy needs to map onto MLCommons hazards, that starts at version 2. If you have domain-specific categories, none of these cover them out of the box — you’ll be fine-tuning or layering a second classifier regardless.
  3. Footprint and latency. The 1B variant exists precisely for cost- and latency-bound deployments; the 8B needs substantial GPU capacity to stay interactive. We covered the latency profile in detail in a separate piece.
  4. Modality. Text-only deployments use the 8B or 1B; anything touching images needs the 11B Vision variant (or a different tool entirely).
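The four questions above can be written down as a small decision helper. This is a sketch of the article's reasoning, not an official selection rule: `pick_variant` is a hypothetical name, the ISO 639-1 codes are our mapping of the eight languages listed for the 8B, and language coverage for the 1B and Vision variants is deliberately left as a caveat to confirm on their model cards.

```python
# Our ISO 639-1 mapping of the eight languages listed for Llama Guard 3 8B.
GUARD3_8B_LANGUAGES = frozenset({"en", "fr", "de", "hi", "it", "pt", "es", "th"})


def pick_variant(languages, needs_images, latency_bound, custom_categories):
    """Return (variant, caveats) following the four questions in this article."""
    caveats = []

    # Modality first, then footprint: images force the Vision variant.
    if needs_images:
        variant = "Llama Guard 3 11B Vision"
    elif latency_bound:
        variant = "Llama Guard 3 1B"
    else:
        variant = "Llama Guard 3 8B"

    uncovered = sorted(set(languages) - GUARD3_8B_LANGUAGES)
    if uncovered:
        caveats.append(
            "languages outside the 8B's listed set: " + ", ".join(uncovered)
        )
    if variant != "Llama Guard 3 8B" and set(languages) != {"en"}:
        caveats.append("confirm language coverage on this variant's model card")
    if custom_categories:
        caveats.append(
            "domain-specific categories need fine-tuning or a second classifier"
        )
    return variant, caveats
```

For example, `pick_variant(["en"], False, False, False)` returns the 8B with no caveats, while any request with images or non-English traffic comes back with items to verify before you standardize.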

The version is the easy part to get right once you’ve stopped citing “Llama Guard” without a number. The hard part — false-positive tolerance on your traffic — is the same regardless of which version you pick, and it’s why we keep pointing back at measuring against your own distribution rather than a vendor benchmark.

For comparative data across Llama Guard variants and other classifiers on specific harm categories, bestaisecuritytools.com maintains benchmark pointers.

Sources

  1. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (paper)
  2. Llama Guard paper (arXiv 2312.06674)
  3. Meta Llama Guard 2 8B Model Card (Hugging Face)
  4. Llama Guard 3 8B Model Card (Hugging Face)
  5. Llama Guard 3 Model Cards and Prompt Formats (llama.com)