Introduction
Keyword search is fast and simple, but it breaks down the moment a user phrases their query differently from how content was written. A user searching for "how to cut expenses" won't find an article titled "reducing operational costs" — even though both mean the same thing. Semantic search solves this by comparing the meaning of text rather than its exact words.
Until recently, building semantic search required either expensive API calls to proprietary embedding providers or serious ML infrastructure. Today, open-source models like sentence-transformers run comfortably on a laptop, and lightweight vector stores like ChromaDB need zero infrastructure to get started. This tutorial is for backend developers and DevOps engineers who want a working semantic search engine they can deploy or extend — without a cloud bill that scales with every query.
Prerequisites
- Python 3.9 or higher installed locally
- Basic familiarity with pip and virtual environments
- At least 2 GB of free disk space for the packages and model download (PyTorch accounts for most of it)
- No GPU required — CPU inference is fast enough for this tutorial
Step-by-Step Guide
Step 1: Set Up Your Environment
Create a fresh virtual environment and install the required packages. We will use sentence-transformers for embedding generation and ChromaDB as our local vector store.
```bash
python -m venv semantic-search-env
source semantic-search-env/bin/activate  # Windows: semantic-search-env\Scripts\activate
pip install sentence-transformers chromadb
```

Step 2: Prepare Your Documents
For this tutorial we use a small list of sentences. In a real project, these would be database records, knowledge-base articles, or product descriptions loaded from a file or API.
```python
documents = [
    {"id": "1", "text": "How to reduce operational costs in a startup"},
    {"id": "2", "text": "Tips for improving team collaboration remotely"},
    {"id": "3", "text": "A beginner's guide to machine learning pipelines"},
    {"id": "4", "text": "Best practices for containerizing Python applications"},
    {"id": "5", "text": "Cutting expenses without sacrificing product quality"},
    {"id": "6", "text": "How distributed teams can work better together"},
    {"id": "7", "text": "Introduction to end-to-end ML workflows"},
    {"id": "8", "text": "Docker and Kubernetes for Python microservices"},
]
```

Step 3: Load the Embedding Model
We will use all-MiniLM-L6-v2, a compact model of roughly 23 million parameters (about 90 MB on disk) that punches well above its weight for English-language semantic similarity tasks. It downloads automatically on first run.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded successfully.")
```

Step 4: Generate Embeddings and Index Documents
We embed every document and store both the raw text and the vector in ChromaDB. ChromaDB persists the collection to disk automatically, so your index survives restarts.
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# Use cosine distance so that 1 - distance is a meaningful similarity score.
# (ChromaDB's default distance metric is squared L2.)
collection = client.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},
)

texts = [doc["text"] for doc in documents]
ids = [doc["id"] for doc in documents]
embeddings = model.encode(texts, show_progress_bar=True).tolist()

collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=texts,
)
print(f"Indexed {len(ids)} documents.")
```

Step 5: Run a Semantic Query
To search, embed the user's query with the same model and ask ChromaDB to return the closest vectors. No keyword matching — pure semantic similarity.
```python
def semantic_search(query: str, top_k: int = 3):
    query_embedding = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
    )
    print(f"\nQuery: {query}")
    print("-" * 40)
    for doc, distance in zip(
        results["documents"][0],
        results["distances"][0],
    ):
        score = round(1 - distance, 4)  # cosine similarity proxy
        print(f"[{score}] {doc}")

semantic_search("how to cut expenses")
semantic_search("working from home as a team")
semantic_search("packaging apps with containers")
```

Running that last block should produce output similar to the following:
```
Query: how to cut expenses
----------------------------------------
[0.9121] Cutting expenses without sacrificing product quality
[0.7843] How to reduce operational costs in a startup
[0.5102] Tips for improving team collaboration remotely
```

Notice that "how to cut expenses" correctly surfaces both the "cutting expenses" document and the "reducing operational costs" one, even though the latter shares no keyword at all with the query.
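In practice you rarely want to show every neighbor ChromaDB returns: low-similarity hits read as noise to users. Here is a minimal sketch of post-filtering results by a minimum score. The `filter_by_threshold` helper and the 0.6 cutoff are illustrative assumptions, not part of ChromaDB's API; tune the cutoff on real queries, as discussed in the pitfalls below.

```python
def filter_by_threshold(docs, distances, min_score=0.6):
    """Keep results whose similarity proxy (1 - distance) clears min_score.

    min_score=0.6 is a placeholder value; tune it empirically.
    """
    hits = []
    for doc, dist in zip(docs, distances):
        score = round(1 - dist, 4)
        if score >= min_score:
            hits.append((score, doc))
    return hits

# Mock values shaped like results["documents"][0] / results["distances"][0]:
docs = [
    "Cutting expenses without sacrificing product quality",
    "How to reduce operational costs in a startup",
    "Tips for improving team collaboration remotely",
]
distances = [0.0879, 0.2157, 0.4898]
for score, doc in filter_by_threshold(docs, distances):
    print(f"[{score}] {doc}")
```

With the mock distances above, the third result falls below the cutoff and is dropped.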
Step 6: Reload the Index on Subsequent Runs
Because we used PersistentClient, you can skip the embedding and indexing steps on future runs and jump straight to querying:
```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="docs")

# The semantic_search function from Step 5 still needs to be defined
# in this session before you can call it.
semantic_search("machine learning for beginners")
```

Common Pitfalls and How to Avoid Them
- Re-embedding documents on every restart. Always check whether a collection already exists before calling `collection.add()`. Use `get_or_create_collection` and track which IDs are already stored.
- Mixing embedding models. Your query must be encoded with the same model used to build the index. Swapping models mid-project silently breaks relevance; bump the collection name and re-index whenever you change models.
- Indexing huge documents without chunking. Embedding models have a token limit (typically 256–512 tokens). Split long documents into overlapping chunks before indexing, then group results by source document when displaying them.
- Treating cosine distance as a percentage. A distance of 0.2 does not mean 80% relevant. Tune your acceptance threshold empirically on a sample of real queries before going to production.
- Forgetting to batch large encoding jobs. Encoding thousands of documents in a single `model.encode()` call can exhaust RAM. Use the `batch_size` parameter: `model.encode(texts, batch_size=64)`.
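The chunking pitfall above can be sketched with a simple word-window splitter. This helper is an illustrative assumption, not a library function; real projects often chunk on sentence boundaries or with the model's own tokenizer, since word counts only approximate token limits.

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word windows.

    Word counts only approximate the model's token limit; for exact
    budgeting, count tokens with the model's tokenizer instead.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap  # each window restarts `overlap` words early
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

When indexing, derive each chunk's ID from its parent document (for example `f'{doc["id"]}-{i}'` for the i-th chunk) so results can be grouped back to their source document at display time.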
Conclusion
You now have a working semantic search engine that runs entirely on your own hardware, costs nothing per query, and is built on models with permissive open-source licenses. The full solution is under 100 lines of Python and can be extended to REST APIs, Slack bots, or internal documentation portals with minimal additional work.
That said, there is a real operational cost hiding in this stack once you move to production: you still need to manage model serving latency, scale the vector store as your corpus grows, handle batch re-indexing pipelines, and keep the infrastructure running reliably. I have dealt with all of that myself, and the toil adds up quickly — usually right when you have more important product work to do.
That is the problem SwiftInference is built to solve. It handles model serving, embedding generation at scale, and retrieval infrastructure so you can stay focused on building the product layer. If you find yourself spending more time on infrastructure than on search quality, it is worth taking a look.