Introduction

Most document analysis pipelines still rely on brittle OCR chains — extract text, hope the layout survives, then feed it to a language model. Multi-modal vision models collapse that pipeline into a single step: give the model an image of a document and ask it what you need. The catch is that these models are usually assumed to require GPU hardware, which puts them out of reach for many teams running on standard cloud VMs or on-premise servers.

The good news is that smaller quantized vision-language models — models like LLaVA, MiniCPM-V, and Moondream — can run entirely on CPU with acceptable latency for batch document workflows. This tutorial is for developers and DevOps engineers who need to process invoices, forms, scanned contracts, or technical diagrams without provisioning GPU instances. You will get a working pipeline by the end.

Prerequisites

  • A Linux or macOS machine with at least 16 GB RAM (32 GB recommended for larger models)
  • Python 3.10 or later installed
  • pip and virtualenv available
  • Basic familiarity with Python virtual environments and the command line
  • A few sample document images in PNG or JPEG format for testing

Step-by-Step Guide

Step 1: Create a Virtual Environment and Install Dependencies

Start by isolating your environment so library versions do not conflict with other projects.

python3 -m venv docvision-env
source docvision-env/bin/activate
pip install --upgrade pip
pip install transformers pillow accelerate torch torchvision --index-url https://download.pytorch.org/whl/cpu
pip install bitsandbytes  # optional, needed only for 4-bit loading on CPU via llama.cpp path

We are installing the CPU-only build of PyTorch deliberately. This keeps the installation lean and avoids CUDA dependencies entirely.

Step 2: Choose and Download a Quantized Vision Model

For CPU inference, model size is everything. Moondream2 (1.8B parameters) is an excellent starting point — it handles document images well and fits comfortably in 8 GB of RAM at full precision, or around 2 GB with 4-bit quantization via GGUF format.

# Install llama-cpp-python with CPU support for GGUF models
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python

# Download Moondream2 GGUF (Q4_K_M quantization)
wget https://huggingface.co/vikhyatk/moondream2/resolve/main/moondream2-text-model-f16.gguf -O moondream2.gguf

Alternatively, if you prefer the HuggingFace Transformers path and have 16 GB RAM available, you can load the model directly in float32 without any quantization step.

Step 3: Build a Document Analysis Helper Class

The following module wraps model loading and inference into a clean interface. Save it as doc_analyzer.py.

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import torch

class DocumentAnalyzer:
    def __init__(self, model_id: str = "vikhyatk/moondream2", revision: str = "2025-01-09"):
        print(f"Loading model {model_id} on CPU — this may take 30–90 seconds...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            revision=revision,
            torch_dtype=torch.float32,  # CPU requires float32
            device_map="cpu",
            trust_remote_code=True,
        )
        self.model.eval()
        print("Model loaded successfully.")

    def analyze(self, image_path: str, prompt: str) -> str:
        image = Image.open(image_path).convert("RGB")
        enc_image = self.model.encode_image(image)
        result = self.model.answer_question(enc_image, prompt, self.tokenizer)
        return result

if __name__ == "__main__":
    analyzer = DocumentAnalyzer()

    # Extract key fields from an invoice image
    invoice_path = "sample_invoice.png"
    fields = [
        "What is the invoice number?",
        "What is the total amount due?",
        "What is the due date?",
        "Who is the vendor?",
    ]

    for question in fields:
        answer = analyzer.analyze(invoice_path, question)
        print(f"Q: {question}\nA: {answer}\n")

Step 4: Run the Analysis and Inspect Outputs

python doc_analyzer.py

On a modern 8-core CPU, expect roughly 5–20 seconds per question depending on document complexity. For batch workflows processing hundreds of documents overnight, this throughput is entirely practical.

Step 5: Build a Batch Processing Pipeline

For production document workflows, you will want to process directories of files and emit structured JSON output.

import json
import os
from doc_analyzer import DocumentAnalyzer

def batch_process(image_dir: str, output_file: str):
    analyzer = DocumentAnalyzer()
    results = []

    schema = [
        "What type of document is this?",
        "List all dates mentioned in the document.",
        "What monetary amounts appear in the document?",
        "Summarize the main purpose of this document in one sentence.",
    ]

    for filename in sorted(os.listdir(image_dir)):
        if not filename.lower().endswith((".png", ".jpg", ".jpeg")):
            continue

        path = os.path.join(image_dir, filename)
        doc_data = {"file": filename, "fields": {}}

        for question in schema:
            try:
                doc_data["fields"][question] = analyzer.analyze(path, question)
            except Exception as e:
                doc_data["fields"][question] = f"ERROR: {e}"

        results.append(doc_data)
        print(f"Processed: {filename}")

    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    print(f"\nResults written to {output_file}")

if __name__ == "__main__":
    batch_process(image_dir="./documents", output_file="extraction_results.json")

Common Pitfalls and How to Avoid Them

  • Running out of RAM: The full-precision model needs roughly 8–10 GB. If your machine has less, use a GGUF Q4 quantized version via llama-cpp-python instead of the Transformers path.
  • Slow first inference due to model loading: Always load the model once and reuse the instance. Never reload it inside a loop or per-request handler.
  • Image resolution too low: Vision models struggle with tiny text. Ensure document images are at least 150 DPI. Use img = img.resize((img.width * 2, img.height * 2)) as a quick upscale if needed.
  • trust_remote_code warnings: Moondream requires trust_remote_code=True. Pin the model revision explicitly so you are not running arbitrary code from an updated upstream commit.
  • Multi-page PDFs: Convert each page to an image first using pdf2image — vision models accept images, not raw PDFs.

Conclusion

Running multi-modal vision models on CPU is genuinely viable for document analysis workloads — especially batch pipelines where you trade latency for infrastructure simplicity. The setup you have built here gives you a fully self-contained extraction system with no GPU dependency and no external API calls.

That said, managing model versions, handling RAM constraints across environments, scaling to parallel workers, and keeping inference latency acceptable as your document volume grows are all real operational challenges. This is exactly the problem SwiftInference is designed to solve. Rather than maintaining your own model serving stack and tuning quantization settings by hand, SwiftInference handles the infrastructure layer — giving you an API endpoint for vision and language models that scales automatically, with CPU and GPU backends selected based on your latency and cost requirements. If you have validated the approach locally with this tutorial and need to move it into production without the ops overhead, it is worth a look.