Introduction
Every platform that accepts user-generated content faces the same brutal tradeoff: manual moderation doesn't scale, and rule-based filters are brittle. A comment that slips past a keyword blocklist today becomes tomorrow's headline. AI-powered moderation changes the equation by understanding context, not just surface patterns.
This tutorial is for backend developers and MLOps engineers who need a real, controllable moderation pipeline — not a black-box API with opaque pricing and no insight into why content was flagged. We'll build a multi-stage pipeline using Detoxify for fast toxic content scoring and Llama Guard for deeper semantic classification. By the end you'll have a working pipeline you can plug into any Python web service.
Prerequisites
- Python 3.10+ with pip
- A machine with at least 8 GB RAM (16 GB recommended for Llama Guard)
- Basic familiarity with Hugging Face Transformers
- Optional but helpful: a GPU with CUDA support for faster inference
Step-by-Step Guide
Step 1: Set Up Your Environment
Create an isolated environment and install the required dependencies.
python -m venv moderation-env
source moderation-env/bin/activate
pip install detoxify transformers accelerate torch sentencepiece fastapi uvicorn
Step 2: Build the Fast First-Pass Filter with Detoxify
Detoxify runs in milliseconds and scores text across six toxicity dimensions. Use it as your cheap first gate — only escalate content that scores above a threshold to the heavier model.
# fast_filter.py
from detoxify import Detoxify

model = Detoxify('original')
THRESHOLD = 0.5

def fast_screen(text: str) -> dict:
    scores = model.predict(text)
    flagged = {label: score for label, score in scores.items() if score > THRESHOLD}
    return {
        "text": text,
        "flagged": len(flagged) > 0,
        "scores": scores,
        "triggered_labels": flagged,
    }

if __name__ == "__main__":
    sample = "I hope your project goes really well!"
    result = fast_screen(sample)
    print(result)
Step 3: Add Deep Semantic Classification with Llama Guard
Llama Guard (Meta's safety-tuned model) understands intent and context that a first-pass toxicity score misses entirely. We load it via Hugging Face and wrap it in a classification function.
# deep_classifier.py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def build_prompt(user_text: str) -> str:
    return (
        "[INST] Task: Determine if the following user message is safe or unsafe.\n"
        f"User message: {user_text}\n"
        "Respond with 'safe' or 'unsafe' followed by a brief reason. [/INST]"
    )

def deep_classify(text: str) -> dict:
    prompt = build_prompt(text)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=80,
            do_sample=False  # greedy decoding for reproducible verdicts; temperature is ignored when sampling is off
        )
    # Decode only the newly generated tokens, skipping the prompt.
    decoded = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
    verdict = "unsafe" if "unsafe" in decoded.lower() else "safe"
    return {"verdict": verdict, "explanation": decoded.strip()}
Step 4: Wire Up the Two-Stage Pipeline
Combine both models into a single orchestrator. Fast screen first; only invoke the heavy model when necessary. This pattern keeps average latency low while maintaining accuracy on edge cases.
# pipeline.py
from fast_filter import fast_screen
from deep_classifier import deep_classify

def moderate(text: str) -> dict:
    fast_result = fast_screen(text)
    if not fast_result["flagged"]:
        return {"action": "allow", "stage": "fast_filter", "details": fast_result}
    deep_result = deep_classify(text)
    action = "block" if deep_result["verdict"] == "unsafe" else "allow"
    return {
        "action": action,
        "stage": "deep_classifier",
        "fast_details": fast_result,
        "deep_details": deep_result,
    }

if __name__ == "__main__":
    test_cases = [
        "Great tutorial, learned a lot!",
        "I will destroy everything you love.",
        "Buy cheap meds online, click here now",
    ]
    for t in test_cases:
        print(moderate(t))
        print("---")
Step 5: Expose the Pipeline as a REST Endpoint
Wrap the pipeline in a FastAPI service so any service in your stack can call it over HTTP.
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
from pipeline import moderate

app = FastAPI(title="Content Moderation API")

class ContentRequest(BaseModel):
    text: str

@app.post("/moderate")
def moderate_content(req: ContentRequest):
    return moderate(req.text)
Run the server:
uvicorn app:app --host 0.0.0.0 --port 8080 --workers 1
Test it with a quick curl:
curl -X POST http://localhost:8080/moderate \
  -H "Content-Type: application/json" \
  -d '{"text": "You are all going to regret this"}'
Common Pitfalls and How to Avoid Them
- Model cold-start latency: Llama Guard takes 10–20 seconds to load into memory. Load it once at startup — never import it inside the request handler.
- Threshold sensitivity: A Detoxify threshold of 0.5 catches most toxicity but will produce false positives on sarcasm and fiction. Track your false positive rate in production and tune accordingly.
- Llama Guard gating: The model is gated on Hugging Face. Request access and run huggingface-cli login before your first pull, or your pipeline will silently fail at download time.
- Memory fragmentation under load: Running both models in the same process eats RAM. In production, deploy the Detoxify service and the Llama Guard service as separate containers so each can scale independently.
- No audit log: Every moderation decision should be written to a store with the original text hash, verdict, and timestamp. Without this you cannot retrain or dispute decisions.
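The audit-log point above can be sketched as an append-only JSON Lines writer. The file path, field names, and log_decision helper here are illustrative, not a fixed schema:

```python
# audit_log.py
# Minimal audit-trail sketch: one JSON line per moderation decision.
# Stores a SHA-256 hash of the text rather than the raw content, plus the
# verdict, pipeline stage, and a UTC timestamp.
import hashlib
import json
from datetime import datetime, timezone

def log_decision(text: str, action: str, stage: str,
                 path: str = "moderation_audit.jsonl") -> dict:
    record = {
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "action": action,
        "stage": stage,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

Calling log_decision from moderate() after each verdict gives you a replayable record; hashing keeps raw user content out of the log while still letting you match decisions back to stored originals.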
Conclusion
You now have a working two-stage moderation pipeline: a lightweight toxicity scorer that handles the bulk of traffic in milliseconds, backed by a context-aware language model that catches nuanced violations the first stage misses. The logic is yours — you control the thresholds, the categories, and the escalation rules.
The part that gets painful fast is everything else: provisioning GPU instances, keeping model weights cached across restarts, managing concurrency without OOM errors, and deploying updates without downtime. That infrastructure work can easily dwarf the time spent on the actual moderation logic.
That's the problem SwiftInference was built to solve. It gives you a managed inference endpoint for open-source models — including safety-focused ones like Llama Guard — so you deploy a pipeline like this one without standing up a single GPU server yourself. If you want to skip the ops overhead and get straight to improving your moderation accuracy, it's worth a look.