Tag
#content-moderation
6 posts tagged content-moderation.
- reviews
Perspective API: Still Good at Its Original Job, Still Wrong for LLM Safety
Jigsaw's Perspective API has 8+ years of production data on toxicity detection. For community content moderation it remains strong. For LLM application safety, it was never designed for the job, and it shows.
- ops
Content Moderation for RAG Applications: The Retrieval Layer Is an Attack Surface
RAG pipelines have a moderation problem at the retrieval layer that input/output classifiers don't address. Injected content in retrieved documents can override model behavior. Here's the architecture that covers it.
- ops
Classifier Ensembles for Production Content Moderation
Single classifiers have characteristic failure modes. Ensembles that combine models with different architectures and training distributions reduce coverage gaps. How to build and operate them.
- ops
The Real Cost of False Positives in AI Content Moderation
False positive rates in content moderation are usually discussed as a technical metric. The business costs — user abandonment, manual review queues, appeal escalations — are rarely quantified. Here's how to measure and manage them.
- reviews
OpenAI Moderation API: An Honest Review After 18 Months in Production
OpenAI's Moderation API is the path-of-least-resistance choice for teams already in the OpenAI ecosystem. The speed is good. The category granularity has improved. The gaps are predictable.
- reviews
Llama Guard Benchmark Review: Real-World Performance vs. Vendor Claims
Meta's Llama Guard series has become a default choice for open-source content moderation. Benchmarks on the standard test sets look strong. Production behavior is more complicated.