OpenAI Moderation API: An Honest Review After 18 Months in Production
OpenAI's Moderation API is the path-of-least-resistance choice for teams already in the OpenAI ecosystem. The speed is good. The category granularity has improved. The gaps are predictable.
The OpenAI Moderation API is the default choice for teams already using GPT-4 or GPT-3.5 in their LLM application. It’s fast (~20ms typical), free with API credits, and requires minimal integration. For teams that want to move quickly, this is appealing.
We’ve used it in production for 18 months across three different application types. Here’s the honest assessment.
What it does well
Latency. The Moderation API is fast. At 15-25ms typical latency, it adds negligible overhead when run synchronously on user inputs. For output classification, async operation means it adds zero user-visible latency. This is the best latency profile in the category.
Category breadth. The current model (omni-moderation-2024-09-26) covers a reasonable taxonomy:
- Harassment (with/without threat)
- Hate (with/without threat)
- Self-harm (intent, instructions, ideation)
- Sexual (general, minors)
- Violence (graphic/non-graphic)
- Illicit (firearms, drugs — separate categories)
The subcategory structure is useful. “Sexual content” and “sexual content involving minors” warrant different business responses; having separate classification flags makes threshold calibration easier.
Ease of integration. A few lines of Python. If you’re already using the openai library, there’s essentially no integration overhead:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.moderations.create(
    model="omni-moderation-latest",
    input=user_message,  # the user text to classify
)

result = response.results[0]
flagged = result.flagged            # bool: did any category trip?
categories = result.categories      # per-category boolean flags
scores = result.category_scores     # per-category raw scores
```
Where it falls short
Coverage on obfuscated and encoded content. The OpenAI Moderation API operates on the text you send it. If that text is Base64-encoded, ROT13’d, or obfuscated with zero-width characters, the moderation model performs poorly. There’s no built-in normalization layer.
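If this matters for your traffic, you end up writing that layer yourself. Here is a minimal sketch of the kind of pre-pass we mean; the `normalize_input` helper and its rules are our own illustration, not part of the API:

```python
import base64
import unicodedata

# Zero-width characters commonly used to split up flagged substrings.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize_input(text: str) -> str:
    """Best-effort normalization before text reaches the moderation endpoint."""
    # Fold compatibility forms (fullwidth letters, ligatures) to plain text.
    text = unicodedata.normalize("NFKC", text)
    # Strip zero-width characters.
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    # If the whole message decodes cleanly as Base64 UTF-8, append the
    # decoded form so the classifier sees both.
    try:
        decoded = base64.b64decode(text, validate=True).decode("utf-8")
        text = f"{text}\n{decoded}"
    except ValueError:  # binascii.Error and UnicodeDecodeError both subclass this
        pass
    return text
```

This catches the lazy encodings; determined obfuscation still needs adversarial testing against your specific pipeline.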
Non-English language performance. Coverage is better than most alternatives for major languages (French, Spanish, German, Portuguese), but performance on less-supported languages is significantly degraded. If you have significant traffic from users writing in Arabic, Hindi, or smaller language communities, measure performance explicitly before relying on the API.
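“Measure explicitly” here means running a small labeled sample per language through the endpoint and comparing agreement with your own labels. A sketch, assuming you have such a sample; the function and tuple format are ours:

```python
from collections import defaultdict

def agreement_by_language(client, samples):
    """samples: iterable of (language, text, should_flag) tuples.

    Returns the fraction of samples per language where the API's
    flagged decision matches your label.
    """
    tallies = defaultdict(lambda: [0, 0])  # language -> [matches, total]
    for language, text, should_flag in samples:
        result = client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        ).results[0]
        tallies[language][0] += int(result.flagged == should_flag)
        tallies[language][1] += 1
    return {lang: matches / total for lang, (matches, total) in tallies.items()}
```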
Context-free classification. The API classifies single messages in isolation. It doesn’t have memory of prior conversation turns. A jailbreak spread across multiple turns — innocuous individually, harmful in combination — won’t be caught by per-message classification.
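The workaround we use is crude but catches the naive case: classify a rolling window of recent turns as a single input, so cross-turn payloads are at least seen together. A sketch; the window size and join format are our choices, not an API feature:

```python
def moderate_with_history(client, user_turns: list[str], window: int = 5) -> bool:
    """Classify the last `window` user turns as one concatenated input.

    Per-message classification misses payloads assembled across turns;
    concatenation at least puts the combined text in front of the model.
    """
    combined = "\n".join(user_turns[-window:])
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=combined,
    ).results[0]
    return result.flagged
```

Paraphrased multi-turn attacks still get through, and the window multiplies input size, but it closes the most obvious gap.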
Lack of customization. You cannot add custom harm categories or adjust the model’s training distribution. If your application has domain-specific risks (financial advice, medical content, legal advice) that don’t map cleanly to the standard taxonomy, you’re adding a second classification layer anyway.
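Concretely, that second layer just chains after the first call. A sketch of the shape; `domain_classifier` is a hypothetical stand-in for whatever domain-specific model or rule set you bring:

```python
def classify(client, text: str) -> dict:
    """Standard moderation plus a domain layer the stock taxonomy can't cover."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    ).results[0]
    return {
        "standard_flagged": result.flagged,
        # Hypothetical: your own classifier for e.g. unlicensed financial
        # or medical advice, which the standard categories don't capture.
        "domain_flagged": domain_classifier(text),
    }
```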
Opacity on borderline cases. The API returns a raw score per category but no explanation. When content is flagged at 0.45 (borderline), there’s no mechanism to understand why, and manual reviewers lack the context to make good decisions.
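Our mitigation is procedural rather than technical: band the raw score and route the middle band to humans with the full message attached. A sketch with illustrative band boundaries:

```python
def route(score: float, block_at: float = 0.8, review_at: float = 0.4) -> str:
    """Route on the raw category score: block, queue for a human, or allow."""
    if score >= block_at:
        return "block"
    if score >= review_at:
        return "human_review"  # borderline scores like 0.45 land here
    return "allow"
```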
Threshold calibration in practice
By default, the flagged field reflects OpenAI’s internal thresholds. The raw category scores let you set your own. In practice:
- Sexual content default threshold is conservative — legitimate romantic fiction is frequently flagged
- Violence threshold is calibrated for general consumer use — news content discussing violence occasionally trips it
- The drug and firearms categories are where we’ve seen the most useful flagging in community platform contexts
Our production calibration: we run with OpenAI’s defaults for the high-severity categories (sexual content involving minors, imminent threats) and raise thresholds significantly for harassment, general violence, and drug/firearms categories to reduce false positives on legitimate content.
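In code, that calibration amounts to re-deriving the flagged decision from the raw scores. A sketch using the category attribute names from the openai Python SDK; the threshold values are illustrative, not our exact production numbers:

```python
def custom_flagged(result) -> bool:
    """Re-derive the block decision from raw scores with our own thresholds."""
    s = result.category_scores
    # High-severity categories: keep aggressive (low) thresholds.
    if s.sexual_minors >= 0.01 or s.harassment_threatening >= 0.01:
        return True
    # Raised thresholds to cut false positives on legitimate content.
    return (
        s.harassment >= 0.85
        or s.violence >= 0.85
        or s.illicit >= 0.80
    )
```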
The vendor lock-in question
The Moderation API is free with OpenAI usage but creates architectural dependency on OpenAI’s API. Teams that want to switch LLM providers or run air-gapped deployments need to replace this too.
If architectural independence matters, Llama Guard is the portable alternative. If you’re deeply committed to OpenAI for the foreseeable future and latency matters, the OpenAI Moderation API is hard to beat on the operations side.
Comparing to the alternatives
| Dimension | OpenAI Moderation API | Llama Guard 3 8B | Perspective API |
|---|---|---|---|
| Latency (p99) | ~25ms | 100-200ms self-hosted | ~30ms |
| Cost | Free with OpenAI credits | Self-hosting cost | Free (limited) |
| Custom categories | No | Via fine-tuning | No |
| Context window | Single message | Single message | Single message |
| Multilingual | Good for major languages | English-primary | Good for toxicity |
For a deeper comparison of these tools on specific harm categories, with numbers, aisecreviews.com publishes comparative data across the content moderation tool landscape.