Introduction
Keyword search is fast and simple, but it breaks down the moment a user phrases their query differently from how content was written. A user searching for "how to cut expenses" won't find an article titled "reducing operational costs" — even though both mean the same thing. Semantic search solves this by comparing the meaning of text rather than its exact words.
Until recently, building semantic search required either expensive API calls to proprietary embedding providers or serious ML infrastructure. Today, open-source models like sentence-transformers run comfortably on a laptop, and lightweight vector stores like ChromaDB need zero infrastructure to get started. This tutorial is for backend developers and DevOps engineers who want a working semantic search engine they can deploy or extend — without a cloud bill that scales with every query.
Prerequisites
- Python 3.9 or higher installed locally
- Basic familiarity with pip and virtual environments
- At least 2 GB of free disk space for the packages and model download (PyTorch accounts for most of it)
- No GPU required — CPU inference is fast enough for this tutorial
Step-by-Step Guide
Step 1: Set Up Your Environment
Create a fresh virtual environment and install the required packages. We will use sentence-transformers for embedding generation and ChromaDB as our local vector store.
```bash
python -m venv semantic-search-env
source semantic-search-env/bin/activate  # Windows: semantic-search-env\Scripts\activate
pip install sentence-transformers chromadb
```

Step 2: Prepare Your Documents
For this tutorial we use a small list of sentences. In a real project, these would be database records, knowledge-base articles, or product descriptions loaded from a file or API.
```python
documents = [
    {"id": "1", "text": "How to reduce operational costs in a startup"},
    {"id": "2", "text": "Tips for improving team collaboration remotely"},
    {"id": "3", "text": "A beginner's guide to machine learning pipelines"},
    {"id": "4", "text": "Best practices for containerizing Python applications"},
    {"id": "5", "text": "Cutting expenses without sacrificing product quality"},
    {"id": "6", "text": "How distributed teams can work better together"},
    {"id": "7", "text": "Introduction to end-to-end ML workflows"},
    {"id": "8", "text": "Docker and Kubernetes for Python microservices"},
]
```

Step 3: Load the Embedding Model
We will use all-MiniLM-L6-v2, a compact model of roughly 23 million parameters (about 90 MB on disk) that punches well above its weight for English-language semantic similarity tasks. It downloads automatically on first run.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print("Model loaded successfully.")
```

Step 4: Generate Embeddings and Index Documents
We embed every document and store both the raw text and the vector in ChromaDB. ChromaDB persists the collection to disk automatically, so your index survives restarts.
```python
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# Use cosine distance so that 1 - distance is a meaningful similarity score.
# (ChromaDB's default distance metric is squared L2.)
collection = client.get_or_create_collection(
    name="docs",
    metadata={"hnsw:space": "cosine"},
)

texts = [doc["text"] for doc in documents]
ids = [doc["id"] for doc in documents]
embeddings = model.encode(texts, show_progress_bar=True).tolist()

collection.add(
    ids=ids,
    embeddings=embeddings,
    documents=texts,
)
print(f"Indexed {len(ids)} documents.")
```

Step 5: Run a Semantic Query
To search, embed the user's query with the same model and ask ChromaDB to return the closest vectors. No keyword matching — pure semantic similarity.
```python
def semantic_search(query: str, top_k: int = 3):
    query_embedding = model.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=top_k,
    )
    print(f"\nQuery: {query}")
    print("-" * 40)
    for doc, distance in zip(
        results["documents"][0],
        results["distances"][0],
    ):
        score = round(1 - distance, 4)  # cosine similarity proxy
        print(f"[{score}] {doc}")

semantic_search("how to cut expenses")
semantic_search("working from home as a team")
semantic_search("packaging apps with containers")
```

Running that last block should produce output similar to the following:
```
Query: how to cut expenses
----------------------------------------
[0.9121] Cutting expenses without sacrificing product quality
[0.7843] How to reduce operational costs in a startup
[0.5102] Tips for improving team collaboration remotely
```

Notice that "how to cut expenses" correctly surfaces both the "cutting expenses" document and the "reducing operational costs" one, even though the latter shares no keyword at all with the query.
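In practice you rarely want to show every neighbor ChromaDB returns: low-similarity hits read as noise to users. Here is a minimal sketch of post-filtering results by a minimum score. The `filter_by_threshold` helper and the 0.6 cutoff are illustrative assumptions, not part of ChromaDB's API; tune the cutoff on real queries, as discussed in the pitfalls below.

```python
def filter_by_threshold(docs, distances, min_score=0.6):
    """Keep results whose similarity proxy (1 - distance) clears min_score.

    min_score=0.6 is a placeholder value; tune it empirically.
    """
    hits = []
    for doc, dist in zip(docs, distances):
        score = round(1 - dist, 4)
        if score >= min_score:
            hits.append((score, doc))
    return hits

# Mock values shaped like results["documents"][0] / results["distances"][0]:
docs = [
    "Cutting expenses without sacrificing product quality",
    "How to reduce operational costs in a startup",
    "Tips for improving team collaboration remotely",
]
distances = [0.0879, 0.2157, 0.4898]
for score, doc in filter_by_threshold(docs, distances):
    print(f"[{score}] {doc}")
```

With the mock distances above, the third result falls below the cutoff and is dropped.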
Step 6: Reload the Index on Subsequent Runs
Because we used PersistentClient, you can skip the embedding and indexing steps on future runs and jump straight to querying:
```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection(name="docs")

# The semantic_search function from Step 5 still needs to be defined
# in this session before you can call it.
semantic_search("machine learning for beginners")
```

Common Pitfalls and How to Avoid Them
- Re-embedding documents on every restart. Always check whether a collection already exists before calling `collection.add()`. Use `get_or_create_collection` and track which IDs are already stored.
- Mixing embedding models. Your query must be encoded with the same model used to build the index. Swapping models mid-project silently breaks relevance; bump the collection name and re-index whenever you change models.
- Indexing huge documents without chunking. Embedding models have a token limit (typically 256–512 tokens). Split long documents into overlapping chunks before indexing, then group results by source document when displaying them.
- Treating cosine distance as a percentage. A distance of 0.2 does not mean 80% relevant. Tune your acceptance threshold empirically on a sample of real queries before going to production.
- Forgetting to batch large encoding jobs. Encoding thousands of documents in a single `model.encode()` call can exhaust RAM. Use the `batch_size` parameter: `model.encode(texts, batch_size=64)`.
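The chunking pitfall above can be sketched with a simple word-window splitter. This helper is an illustrative assumption, not a library function; real projects often chunk on sentence boundaries or with the model's own tokenizer, since word counts only approximate token limits.

```python
def chunk_text(text, max_words=200, overlap=40):
    """Split text into overlapping word windows.

    Word counts only approximate the model's token limit; for exact
    budgeting, count tokens with the model's tokenizer instead.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap  # each window restarts `overlap` words early
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

When indexing, derive each chunk's ID from its parent document (for example `f'{doc["id"]}-{i}'` for the i-th chunk) so results can be grouped back to their source document at display time.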
Conclusion
You now have a working semantic search engine that runs entirely on your own hardware, costs nothing per query, and is built on models with permissive open-source licenses. The full solution is under 100 lines of Python and can be extended to REST APIs, Slack bots, or internal documentation portals with minimal additional work.
That said, there is a real operational cost hiding in this stack once you move to production: you still need to manage model serving latency, scale the vector store as your corpus grows, handle batch re-indexing pipelines, and keep the infrastructure running reliably. I have dealt with all of that myself, and the toil adds up quickly — usually right when you have more important product work to do.
That is the problem SwiftInference is built to solve. It handles model serving, embedding generation at scale, and retrieval infrastructure so you can stay focused on building the product layer. If you find yourself spending more time on infrastructure than on search quality, it is worth taking a look.