Topics
Browse posts by category and tag — every topic we cover, with the latest pieces under each.
Tags
- #llm-safety 7
- #content-moderation 6
- #production 3
- #meta 2
- #accuracy 1
- #api-review 1
- #architecture 1
- #benchmark 1
- #classifier 1
- #conversation-control 1
- #ensemble 1
- #false-positives 1
- #google-jigsaw 1
- #guardrails 1
- #llama-guard 1
- #nemo-guardrails 1
- #nvidia 1
- #openai-moderation 1
- #ops 1
- #perspective-api 1
- #prompt-injection 1
- #rag 1
- #retrieval-augmented-generation 1
- #safety-classifier 1
- #toxicity-detection 1
- #user-experience 1
Categories
reviews 4 posts
- Perspective API: Still Good at Its Original Job, Still Wrong for LLM Safety
  Jigsaw's Perspective API has 8+ years of production data on toxicity detection. For community content moderation it remains strong. For LLM application safety, it was never designed for the use case, and it shows.
- OpenAI Moderation API: An Honest Review After 18 Months in Production
  OpenAI's Moderation API is the path-of-least-resistance choice for teams already in the OpenAI ecosystem. The speed is good. The category granularity has improved. The gaps are predictable.
- Llama Guard Benchmark Review: Real-World Performance vs. Vendor Claims
  Meta's Llama Guard series has become a default choice for open-source content moderation. Benchmarks on the standard test sets look strong. Production behavior is more complicated.
- NeMo Guardrails in Production: What It Does Well and Where It Falls Over
  NVIDIA's NeMo Guardrails offers conversation-flow control that classifiers can't provide. The deployment complexity is real. This is an honest review from a team that has run it in production.
ops 3 posts
- Content Moderation for RAG Applications: The Retrieval Layer Is an Attack Surface
  RAG pipelines have a moderation problem at the retrieval layer that input/output classifiers don't address. Injected content in retrieved documents can override model behavior. Here's the architecture that covers it.
- Classifier Ensembles for Production Content Moderation
  Single classifiers have characteristic failure modes. Ensembles that combine models with different architectures and training distributions reduce coverage gaps. How to build and operate them.
- The Real Cost of False Positives in AI Content Moderation
  False positive rates in content moderation are usually discussed as a technical metric. The business costs — user abandonment, manual review queues, appeal escalations — are rarely quantified. Here's how to measure and manage them.