Introduction

Every platform that accepts user-generated content — forums, marketplaces, social apps, developer tools — faces the same unglamorous problem: keeping harmful content out without blocking legitimate users. Commercial moderation APIs are convenient, but they come with per-call costs that compound quickly at scale, opaque decision logic, and data-sharing agreements that can be a compliance nightmare.

Open-source models have reached a point where they are genuinely competitive for this task. Meta's Llama Guard family was purpose-built for safety classification, and small encoder classifiers like toxic-bert can triage high volumes cheaply before you even touch a larger model. In this tutorial you will wire both together into a two-stage pipeline, wrap it in a FastAPI service, and have something deployable by the end of the post.

This guide is aimed at backend engineers and MLOps practitioners who are comfortable with Python and have basic familiarity with Hugging Face Transformers.

Prerequisites

  • Python 3.10+ with a virtual environment ready
  • A machine with at least 16 GB RAM (a GPU is helpful but not required for the demo)
  • Hugging Face account and CLI authenticated (huggingface-cli login)
  • Access granted to meta-llama/LlamaGuard-7b on the Hugging Face Hub
  • Familiarity with FastAPI and async Python

Step-by-Step Guide

Step 1 — Install Dependencies

pip install transformers accelerate torch fastapi uvicorn pydantic sentencepiece

Step 2 — Build the Fast Triage Layer with toxic-bert

The first stage uses a lightweight classifier to score every request. Clear violations are blocked outright and clearly benign text is allowed outright; only ambiguous scores in between escalate to the expensive second stage.

from transformers import pipeline

# Load once at startup — keep this object alive
_triage = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    top_k=None,          # return all labels
    truncation=True,
    max_length=512,
)

TRIAGE_BLOCK_THRESHOLD = 0.85   # flag immediately if toxic score >= this
TRIAGE_PASS_THRESHOLD  = 0.10   # skip deep scan if toxic score <= this

def triage(text: str) -> dict:
    scores = {r["label"]: r["score"] for r in _triage(text)[0]}
    toxic_score = scores.get("toxic", 0.0)
    return {
        "toxic_score": toxic_score,
        "action": (
            "block"     if toxic_score >= TRIAGE_BLOCK_THRESHOLD else
            "allow"     if toxic_score <= TRIAGE_PASS_THRESHOLD  else
            "deep_scan"
        ),
    }
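
The threshold routing can be exercised on its own, without loading the model, by extracting the comparison into a pure function. A small sketch — `route` is a hypothetical helper for illustration, not part of the tutorial code above:

```python
TRIAGE_BLOCK_THRESHOLD = 0.85
TRIAGE_PASS_THRESHOLD = 0.10

def route(toxic_score: float) -> str:
    """Map a toxic-bert score to an action (same logic as triage())."""
    if toxic_score >= TRIAGE_BLOCK_THRESHOLD:
        return "block"
    if toxic_score <= TRIAGE_PASS_THRESHOLD:
        return "allow"
    return "deep_scan"
```

Note that the boundaries are inclusive on the cheap side: a score of exactly 0.85 blocks and exactly 0.10 allows, so neither edge case pays for a deep scan.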

Step 3 — Add Llama Guard for Deep Policy Enforcement

Llama Guard uses a structured prompt format that maps to customisable harm categories. Here we use the default safety policy shipped with the model.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

LG_MODEL = "meta-llama/LlamaGuard-7b"

_lg_tokenizer = AutoTokenizer.from_pretrained(LG_MODEL)
_lg_model = AutoModelForCausalLM.from_pretrained(
    LG_MODEL,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

_SYSTEM_PROMPT = """[INST] Task: Check if there is unsafe content in the user message according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence and Hate
O2: Sexual Content
O3: Criminal Planning
O4: Guns and Illegal Weapons
O5: Regulated or Controlled Substances
O6: Self-Harm
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

User: {user_message}

<END CONVERSATION>

Provide your safety assessment for the user message above:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""

def llama_guard_check(text: str) -> dict:
    prompt = _SYSTEM_PROMPT.format(user_message=text)
    inputs = _lg_tokenizer(prompt, return_tensors="pt").to(_lg_model.device)
    with torch.no_grad():
        output_ids = _lg_model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=False,
        )
    decoded = _lg_tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    ).strip().lower()
    verdict = "block" if decoded.startswith("unsafe") else "allow"
    return {"verdict": verdict, "raw_output": decoded}
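
Since the model can emit a category line after the verdict (e.g. "unsafe" followed by "O3"), it is worth parsing both rather than keeping only the raw string. A sketch of such a parser — `parse_guard_output` is a hypothetical helper; the category codes follow the policy block above:

```python
def parse_guard_output(decoded: str) -> dict:
    """Split a Llama Guard completion into a verdict and violated categories."""
    lines = [line.strip() for line in decoded.strip().splitlines() if line.strip()]
    is_unsafe = bool(lines) and lines[0].lower().startswith("unsafe")
    categories: list[str] = []
    if is_unsafe and len(lines) > 1:
        # Second line is a comma-separated category list, e.g. "O3,O4"
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return {"verdict": "block" if is_unsafe else "allow", "categories": categories}
```

Persisting the category list alongside the verdict makes the audit log (see the pitfalls below) far more useful than a bare block/allow flag.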

Step 4 — Compose the Two-Stage Pipeline

def moderate(text: str) -> dict:
    stage1 = triage(text)

    if stage1["action"] in ("block", "allow"):
        return {
            "decision":    stage1["action"],
            "stage":       "triage",
            "toxic_score": stage1["toxic_score"],
            "llm_output":  None,
        }

    # Ambiguous — escalate to Llama Guard
    stage2 = llama_guard_check(text)
    return {
        "decision":    stage2["verdict"],
        "stage":       "deep_scan",
        "toxic_score": stage1["toxic_score"],
        "llm_output":  stage2["raw_output"],
    }
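
The composition is easy to verify without loading either model by injecting stub stage functions. `moderate_with` is a hypothetical test-only variant with the same control flow as `moderate`:

```python
def moderate_with(text: str, triage_fn, deep_fn) -> dict:
    """Same control flow as moderate(), with the stages passed in as callables."""
    stage1 = triage_fn(text)
    if stage1["action"] in ("block", "allow"):
        return {"decision": stage1["action"], "stage": "triage",
                "toxic_score": stage1["toxic_score"], "llm_output": None}
    stage2 = deep_fn(text)
    return {"decision": stage2["verdict"], "stage": "deep_scan",
            "toxic_score": stage1["toxic_score"], "llm_output": stage2["raw_output"]}

# Stub stages: triage is ambiguous, so the deep scan makes the call.
ambiguous = lambda text: {"action": "deep_scan", "toxic_score": 0.5}
guard_says_safe = lambda text: {"verdict": "allow", "raw_output": "safe"}
```

This keeps the routing logic testable in CI, where pulling multi-gigabyte weights is impractical.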

Step 5 — Expose the Pipeline via FastAPI

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Content Moderation API")

class ModerationRequest(BaseModel):
    text: str

@app.post("/moderate")
async def moderate_endpoint(req: ModerationRequest):
    result = moderate(req.text)
    return result

Run the server, then send a smoke test request:

# Start the server
uvicorn main:app --host 0.0.0.0 --port 8080 --workers 1
# Quick smoke test
curl -X POST http://localhost:8080/moderate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a perfectly normal message."}'

Common Pitfalls and How to Avoid Them

  • Cold-start latency kills user experience. Load both models at application startup, not on the first request. Use FastAPI's lifespan context manager to pre-warm them.
  • Llama Guard output is not always a single word. The model sometimes emits a category list after the verdict. Always check decoded.startswith("unsafe") rather than an exact equality match.
  • Truncation silently drops context. toxic-bert caps at 512 tokens. For longer posts, chunk the text and aggregate scores with a max-pool strategy rather than relying on blind truncation.
  • Running both models on the same GPU causes OOM errors. Keep toxic-bert on CPU (it is small enough) by passing device="cpu" in the pipeline call, and reserve the GPU exclusively for Llama Guard.
  • No audit log means no feedback loop. Persist every moderation decision with the input hash, scores, and final verdict. You need this data to fine-tune thresholds and catch drift over time.
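
The chunk-and-aggregate strategy from the truncation pitfall can be sketched as follows. Word-based chunking is an approximation for illustration; a production version would chunk on the tokenizer's actual token offsets:

```python
def chunk_words(text: str, max_words: int = 350, overlap: int = 50) -> list[str]:
    """Split long text into overlapping word-window chunks."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

def aggregate_toxic_score(chunk_scores: list[float]) -> float:
    """Max-pool: a single toxic chunk makes the whole post toxic."""
    return max(chunk_scores)
```

Each chunk goes through triage() independently; max-pooling the per-chunk scores means abuse buried at the end of a long post still triggers the same thresholds.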

Conclusion

You now have a working two-stage moderation pipeline: a fast classifier handles the easy cases in milliseconds, and Llama Guard applies nuanced policy reasoning only where it is actually needed. The architecture is sound, but operating it in production surfaces a different class of problems — provisioning GPU instances, managing model versions, scaling workers to absorb traffic spikes, and keeping inference latency within acceptable bounds under load.

That infrastructure work is genuinely time-consuming, and it pulls engineering attention away from the product logic that actually differentiates your platform. SwiftInference handles exactly this layer: you bring the model, they handle the serving infrastructure, autoscaling, and low-latency routing. If you find yourself spending more time on Kubernetes configs and CUDA driver versions than on your moderation logic, it is worth a look — the onboarding is straightforward and the operational overhead reduction is real.