Introduction

Not every team has a rack of A100s sitting idle. Whether you are prototyping a product, running inference on an air-gapped server, or simply trying to keep cloud bills under control, CPU-based large language model inference is a practical and increasingly viable option. Thanks to llama.cpp, a highly optimized C++ inference engine with 4-bit quantization support, you can run capable open-weight models on standard x86 or ARM hardware with acceptable latency for many use cases.

This tutorial is aimed at developers and DevOps engineers who want a self-hosted inference endpoint they fully control. By the end, you will have a running REST API backed by a quantized LLM, callable from any HTTP client.

Prerequisites

  • A Linux or macOS machine with at least 8 GB of RAM (16 GB recommended for 7B models)
  • Python 3.10 or later installed
  • git, cmake, and a C++17-capable compiler (gcc or clang)
  • Basic comfort with the terminal and virtual environments
  • About 5 GB of free disk space for a quantized 7B model

Step-by-Step Guide

Step 1: Clone and Build llama.cpp

Start by pulling the repository and compiling it for your CPU. The build system uses CMake and, when building for the native machine, automatically enables the SIMD instruction sets your CPU supports (such as AVX2 on x86 or NEON on ARM).

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)  # on macOS use -j$(sysctl -n hw.ncpu)

On Apple Silicon, recent llama.cpp builds enable Metal acceleration by default (the CMake option is GGML_METAL in current versions; older versions used -DLLAMA_METAL=ON on the first cmake command). Metal offloads inference to the integrated GPU and typically improves throughput substantially over CPU-only runs.

Step 2: Download a Quantized GGUF Model

llama.cpp uses the GGUF format. The Hugging Face Hub hosts many pre-quantized models. We will use a Q4_K_M quantization of Mistral 7B, which strikes a good balance between quality and memory use.

# Install the huggingface_hub CLI if needed
pip install huggingface_hub

# Download the model file (~4.4 GB)
huggingface-cli download \
  TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models
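As a rough sanity check on that ~4.4 GB figure, you can estimate a quantized model's footprint from its parameter count and average bits per weight. The numbers below are approximations (Q4_K_M is commonly cited at around 4.85 bits per weight because it mixes 4-bit and 6-bit blocks), not exact values:

```python
def estimate_model_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk / in-RAM size of the quantized weights in GB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1e9

# Mistral 7B has roughly 7.24 billion parameters.
size = estimate_model_gb(7.24e9, 4.85)
print(f"~{size:.1f} GB")  # prints ~4.4 GB
```

The same arithmetic explains why a lower quantization such as Q2_K fits in much less RAM, at a cost in output quality.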

Step 3: Verify Inference from the CLI

Before wiring up an API, confirm the binary and model work together. A quick prompt test catches most configuration issues early.

./build/bin/llama-cli \
  -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  -n 128 \
  --threads $(nproc) \
  -p "[INST] Explain what a transformer model is in two sentences. [/INST]"

You should see tokens streaming to stdout within a few seconds. If the process is killed immediately, you likely have insufficient RAM — try a smaller quantization like Q2_K.
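The [INST] tags in that prompt are Mistral's instruction template, and malformed templates are a common cause of poor output. A minimal renderer for single-turn and multi-turn conversations might look like the sketch below; the exact template, including BOS/EOS token handling, is defined by the model's tokenizer configuration, so treat this as an illustration rather than the canonical format:

```python
def render_mistral_prompt(turns):
    """Render a list of (user, assistant) turns into Mistral-style
    instruction format. The final turn may have assistant=None,
    meaning the model should generate the reply."""
    parts = []
    for user, assistant in turns:
        parts.append(f"[INST] {user} [/INST]")
        if assistant is not None:
            parts.append(assistant)
    return " ".join(parts)

prompt = render_mistral_prompt([("Explain quantization briefly.", None)])
```

A manual renderer like this is mainly useful when hitting the raw /completion endpoint; the chat endpoint in Step 5 applies the model's template for you.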

Step 4: Start the Built-in llama.cpp HTTP Server

llama.cpp ships a built-in HTTP server that exposes both its native /completion endpoint and an OpenAI-compatible /v1/chat/completions endpoint. No Python required for this part.

./build/bin/llama-server \
  -m ./models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --threads $(nproc) \
  --ctx-size 4096 \
  --n-predict 512

The server will print a startup message and begin listening. Leave this terminal open.
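Model loading can take anywhere from a few seconds to a minute depending on disk speed, so scripts that launch the server should wait for readiness rather than firing requests immediately. llama-server exposes a /health endpoint; a stdlib-only poller (a sketch, assuming the default port used above) could look like:

```python
import time
import urllib.request
import urllib.error

def wait_for_server(base_url: str, timeout: float = 60.0) -> bool:
    """Poll the llama-server /health endpoint until it returns
    HTTP 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry shortly
        time.sleep(0.5)
    return False

# Example: wait_for_server("http://localhost:8080")
```

The same poller doubles as a pre-warm hook in deployment scripts, which also addresses the cold-start pitfall discussed later.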

Step 5: Send Your First API Request

In a second terminal, test the endpoint with curl using the OpenAI-compatible chat format.

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b",
    "messages": [
      {"role": "user", "content": "What is quantization in the context of LLMs?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }' | python3 -m json.tool
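The same request can be sent from Python with no extra client libraries. The payload mirrors the curl call above; building it in a separate helper means you can unit-test the request construction without a running server. (llama-server largely ignores the "model" field and serves whichever model it was started with, so it is omitted here.)

```python
import json
import urllib.request

LLAMA_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_payload(prompt: str, max_tokens: int = 256,
                       temperature: float = 0.7) -> dict:
    """Construct an OpenAI-style chat completion payload."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str) -> str:
    """POST the payload and extract the assistant's reply."""
    req = urllib.request.Request(
        LLAMA_URL,
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```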

Step 6: Build a Thin Python Wrapper (Optional)

If you need custom middleware — rate limiting, authentication, logging, or prompt templating — a lightweight FastAPI wrapper is a clean solution. Install dependencies first.

pip install fastapi uvicorn httpx

Then save the following as app.py:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import httpx

app = FastAPI(title="LLM Proxy")
LLAMA_SERVER = "http://localhost:8080"

class ChatRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/infer")
async def infer(req: ChatRequest):
    # Translate the simple /infer schema into the OpenAI chat format
    payload = {
        "messages": [{"role": "user", "content": req.prompt}],
        "max_tokens": req.max_tokens,
        "temperature": req.temperature,
    }
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            f"{LLAMA_SERVER}/v1/chat/completions",
            json=payload
        )
    if resp.status_code != 200:
        raise HTTPException(status_code=resp.status_code, detail=resp.text)
    data = resp.json()
    return {"response": data["choices"][0]["message"]["content"]}

Run the wrapper with uvicorn:

uvicorn app:app --host 0.0.0.0 --port 9000 --reload

Your wrapper now listens on port 9000 and proxies requests to the llama.cpp server, giving you a clean insertion point for any business logic.

Common Pitfalls and How to Avoid Them

  • Out-of-memory crashes: The model must fit entirely in RAM. Check your system's available memory before loading. Use htop or free -h and choose a lower quantization level if needed.
  • Slow first-token latency: The model file is memory-mapped, so weights are paged into RAM lazily and the first request pays that cost. Pre-warm the server by sending a dummy request at startup to avoid cold-start surprises in production.
  • Thread contention: Setting --threads higher than your physical core count often hurts throughput. Match physical cores, not hyperthreads.
  • Context length overflow: Requests longer than --ctx-size are silently truncated. Set an appropriate context window and validate input length at the API layer.
  • Model format mismatch: Only GGUF files work with current llama.cpp builds. Older GGML files must be converted using the conversion scripts shipped in the repository.
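For the context-length pitfall, a cheap guard at the API layer is to estimate token count before forwarding a request. The four-characters-per-token ratio below is a rough heuristic for English text, not an exact tokenizer; for precise counts you would call the server's /tokenize endpoint instead.

```python
CTX_SIZE = 4096            # must match --ctx-size on the server
RESERVED_FOR_OUTPUT = 512  # leave room for the generated reply

def estimate_tokens(text: str) -> int:
    """Rough heuristic: English text averages ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str) -> bool:
    """Reject prompts likely to overflow the server's context window."""
    return estimate_tokens(prompt) <= CTX_SIZE - RESERVED_FOR_OUTPUT

# Example: reject the request with HTTP 413 when fits_in_context() is False.
```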

Conclusion

Running LLMs on CPU with llama.cpp is genuinely feasible, and this setup gives you a fully local, auditable inference stack with no third-party data exposure. That said, the operational side is real work: you are now responsible for model versioning, server restarts, scaling to concurrent users, monitoring latency regressions as models update, and securing the endpoint. For a team prototype this is fine; for production traffic it compounds quickly.

If you have reached the point where the infrastructure management is taking more time than the actual application you are building, SwiftInference is worth a look. It handles the model serving layer — auto-scaling, hardware selection, cold-start mitigation, and an OpenAI-compatible API — so you can drop it in as a backend without rewriting any of your client code. Think of it as the managed version of everything you just built, for when the DIY overhead stops making sense.