Introduction

User-generated content is a liability without a solid moderation layer. Whether you're running a social platform, a comment section, or an AI-powered chat product, shipping without automated content moderation is a risk you can't afford. Human review doesn't scale, and rule-based keyword filters are embarrassingly easy to evade.

This tutorial shows you how to build a real, working AI content moderation pipeline using open-source models — specifically Llama Guard for policy-based text classification and Detoxify for toxicity scoring. By the end, you'll have a modular pipeline you can drop into any Python backend. This guide is aimed at backend developers and ML engineers who are comfortable with Python and have basic familiarity with Hugging Face Transformers.

Prerequisites

  • Python 3.10 or higher
  • A machine with at least 16 GB RAM (GPU with 8 GB VRAM recommended for Llama Guard)
  • A Hugging Face account and API token (for gated model access)
  • pip packages: transformers, torch, detoxify, accelerate, bitsandbytes, fastapi, uvicorn, huggingface_hub

Step-by-Step Guide

Step 1: Install Dependencies

Start with a clean virtual environment and install everything you need:

python -m venv moderation-env
source moderation-env/bin/activate

pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate bitsandbytes detoxify fastapi uvicorn huggingface_hub

Step 2: Authenticate With Hugging Face

Llama Guard is a gated model. Log in with your token before attempting to download it:

huggingface-cli login --token YOUR_HF_TOKEN

Step 3: Load the Detoxify Toxicity Scorer

Detoxify is a lightweight model that scores text across several toxicity dimensions. It's fast, requires no GPU, and works well as a first-pass filter:

from detoxify import Detoxify

toxicity_model = Detoxify('original')

def score_toxicity(text: str) -> dict:
    scores = toxicity_model.predict(text)
    return {k: round(float(v), 4) for k, v in scores.items()}

# Example
print(score_toxicity("I love this community!"))
# {'toxicity': 0.0012, 'severe_toxicity': 0.0001, 'obscene': 0.0003, ...}
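Because score_toxicity returns a flat dict of dimension scores, it's handy to log which dimensions crossed a threshold whenever something gets flagged. A small helper along these lines (the function name is ours, not part of Detoxify's API) does the trick:

```python
def flagged_dimensions(scores: dict, threshold: float = 0.5) -> list:
    """Return the toxicity dimensions at or above the threshold,
    highest-scoring first."""
    hits = [(dim, score) for dim, score in scores.items() if score >= threshold]
    return [dim for dim, score in sorted(hits, key=lambda item: -item[1])]
```

Feeding it the output of score_toxicity gives you a compact "why was this flagged" field for your moderation logs.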

Step 4: Load Llama Guard for Policy Classification

Llama Guard classifies content against a set of safety categories (violence, sexual content, hate speech, etc.). Load it with 4-bit quantization (via the bitsandbytes library) to keep memory usage manageable on a single GPU:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_ID = "meta-llama/LlamaGuard-7b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
    device_map="auto",
)

def classify_with_llama_guard(user_message: str) -> str:
    # A minimal prompt in the Llama Guard format. For the model's full
    # policy taxonomy, prefer tokenizer.apply_chat_template instead.
    prompt = f"""[INST] Task: Check if there is unsafe content in the user message.

<BEGIN CONVERSATION>

User: {user_message}

<END CONVERSATION>

Provide your safety assessment: [/INST]"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=False,  # greedy decoding; a temperature setting would be ignored
        )
    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
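Downstream code shouldn't have to string-match the raw reply everywhere. Llama Guard typically answers with "safe" or "unsafe" on the first line and, when unsafe, a comma-separated list of category codes on the second (the exact codes differ by model version; see the Pitfalls section). A small parser for that assumed shape:

```python
def parse_guard_verdict(raw: str) -> dict:
    """Parse Llama Guard's free-text reply into a structured verdict.
    Fails closed: an empty or unparseable reply is treated as unsafe."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    if not lines:
        return {"safe": False, "categories": ["parse_error"]}
    safe = lines[0].lower().startswith("safe")
    categories = []
    if not safe and len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return {"safe": safe, "categories": categories}
```

Failing closed on unparseable output is a deliberate choice: a moderation system that silently allows on error is worse than one that occasionally over-blocks.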

Step 5: Build the Moderation Pipeline

Now wire both models together into a single moderate() function that returns a structured verdict:

TOXICITY_THRESHOLD = 0.7

def moderate(text: str) -> dict:
    # First pass: fast toxicity score
    tox_scores = score_toxicity(text)
    is_toxic = tox_scores["toxicity"] >= TOXICITY_THRESHOLD

    # Second pass: policy classification (run unconditionally here for clarity;
    # see the Pitfalls section for gating it on the toxicity score)
    guard_result = classify_with_llama_guard(text)
    is_unsafe = guard_result.lower().startswith("unsafe")

    return {
        "text": text,
        "toxicity_scores": tox_scores,
        "llama_guard_verdict": guard_result,
        "action": "block" if (is_toxic or is_unsafe) else "allow",
    }

# Test it
result = moderate("This is a perfectly normal message.")
print(result)

Step 6: Expose It as an API With FastAPI

Wrap the pipeline in a FastAPI endpoint so any service in your stack can call it over HTTP:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Content Moderation API")

class ModerationRequest(BaseModel):
    text: str

@app.post("/moderate")
async def moderate_endpoint(request: ModerationRequest):
    result = moderate(request.text)
    return result

# Run with: uvicorn main:app --host 0.0.0.0 --port 8080

You can now send POST requests to http://localhost:8080/moderate with a JSON body like {"text": "your content here"} and receive a structured moderation decision.
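For a quick smoke test from Python, a small client using only the standard library works fine (the moderate_remote name and the base URL are just this tutorial's choices):

```python
import json
import urllib.request

def moderate_remote(text: str, base_url: str = "http://localhost:8080") -> dict:
    """POST a piece of text to the moderation API and return the parsed JSON verdict."""
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/moderate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

In production you'd add retries and a circuit breaker around this call, but for verifying the endpoint works it's all you need.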

Common Pitfalls and How to Avoid Them

  • Running Llama Guard on CPU: Without a GPU, inference takes 30–90 seconds per request. Always use device_map="auto" and confirm CUDA is available with torch.cuda.is_available() before loading the model.
  • Calling Llama Guard on every request: It's expensive. Use Detoxify as a cheap first-pass filter and only invoke Llama Guard when the toxicity score is ambiguous (e.g., between 0.3 and 0.7).
  • Not batching requests: Both models support batch inference. For high-throughput workloads, collect requests into small batches (8–16 items) before running inference to maximize GPU utilization.
  • Hardcoding thresholds: A toxicity threshold of 0.7 is a starting point, not a universal truth. Monitor false positive and false negative rates in production and tune accordingly with labeled data from your specific domain.
  • Ignoring model versioning: Llama Guard has multiple versions (1, 2, 3) with different safety taxonomies. Pin your model version explicitly and document which policy categories each version covers.
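The second and fourth bullets combine naturally into a gated pipeline. Here's a minimal, model-agnostic sketch of that control flow; the threshold values are illustrative, and the scoring callables are injected as parameters (score_toxicity here is assumed to return a single scalar rather than the full dict, purely to keep the sketch short):

```python
REVIEW_LOW = 0.3   # below this: allow without calling the policy model
BLOCK_HIGH = 0.7   # at or above this: block on toxicity alone

def gated_moderate(text, score_toxicity, classify_policy):
    """Run the cheap toxicity scorer on every request; invoke the
    expensive policy model only for scores in the ambiguous band."""
    toxicity = score_toxicity(text)
    if toxicity >= BLOCK_HIGH:
        return {"action": "block", "reason": "toxicity", "toxicity": toxicity}
    if toxicity < REVIEW_LOW:
        return {"action": "allow", "reason": "low_toxicity", "toxicity": toxicity}
    verdict = classify_policy(text)  # e.g. Llama Guard; called only here
    action = "block" if verdict.lower().startswith("unsafe") else "allow"
    return {"action": action, "reason": "policy", "toxicity": toxicity}
```

With this shape, the expensive Llama Guard call drops out of the hot path for clearly clean and clearly toxic traffic, which is usually the vast majority of it.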

Conclusion

You now have a working two-stage content moderation pipeline: Detoxify handles the fast, cheap first pass, and Llama Guard provides nuanced policy-level classification for borderline cases. The FastAPI wrapper makes it easy to integrate with any backend over HTTP.

The hard part isn't writing this code — it's keeping it running in production. GPU memory management, model loading times, cold starts, autoscaling, and keeping model weights up to date are all real operational headaches. Every time you push a new model version you're wrestling with CUDA drivers, container images, and inference server configuration.

That's exactly why I'd point you toward SwiftInference. It lets you deploy open-source models like Llama Guard behind a managed inference endpoint — no GPU provisioning, no container orchestration, no cold-start tuning. You point it at the model, it gives you an API. For teams that want the transparency and control of open-source models without the infrastructure tax, it's genuinely worth a look.