Semantic Product Search
Match shopper intent to your product catalog without keyword matching
The Problem
Your search box and your shoppers speak different languages
TechHeaven carries 291 products across 15 categories. A shopper searching “laptop for video editing” should find three relevant products - all with descriptions mentioning creator workflows, ProRes acceleration, and high-core-count chips. Traditional keyword search returns nothing. No product title contains the phrase “video editing.”
This vocabulary gap between how shoppers describe what they want and how product teams write descriptions is permanent. It is not a content problem that better copy solves - the same shopper who types “earbuds for the gym” is looking for products described as “IPX5 water resistant sport earphones.” No human writes product descriptions in the language shoppers use to search for them.
Semantic search closes the vocabulary gap by understanding intent rather than matching words. The same query - “laptop for video editing” - maps to the same region of an embedding space as “ProRes acceleration” and “creator workflows,” even without a single shared word. This playbook builds both systems and shows exactly where keyword search fails and why.
The Data
What the notebook indexes
Both search systems in this playbook index the same dataset: the TechHeaven product catalog loaded directly from GitHub. No database, no local setup, no API key required - the notebook fetches the data at runtime and builds the indexes from scratch.
- Products: 291 consumer electronics items across 15 categories: Laptops, Audio, Gaming, Smartphones, Cameras, Smart Home, and more
- Indexed text: name + category + short_description + description, concatenated into a single string per product
- TF-IDF index: a sparse matrix in which most query terms score zero against most products
- Embedding index: all-MiniLM-L6-v2 output, one dense vector per product, from a 22M-parameter model that needs no API key
data/catalog/products.json
Every product has: name, category, brand, price, short_description, and a longer description with technical specifications. The description field is where the vocabulary gap lives - technical terms that shoppers never use in a search box.
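The concatenation step can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the helper name `build_search_text` and the sample product are hypothetical, and only the four field names come from the catalog description above.

```python
# Fields that go into the indexed text, in the order listed above.
SEARCH_FIELDS = ("name", "category", "short_description", "description")

def build_search_text(product):
    """Join the searchable fields of one product into a single string.

    Missing or empty fields are skipped, so a product without a
    short_description does not produce doubled spaces.
    """
    parts = [str(product[field]).strip() for field in SEARCH_FIELDS if product.get(field)]
    return " ".join(parts)

# Hypothetical product record, shaped like the fields described above.
sample_product = {
    "name": "Example Sport Buds",
    "category": "Audio",
    "brand": "ExampleBrand",
    "price": 79.0,
    "short_description": "IPX5 water resistant sport earphones",
    "description": "Secure-fit ear hooks and sweat-proof drivers.",
}

search_text = build_search_text(sample_product)
print(search_text)
```

Brand and price are left out of the indexed text here because the playbook lists only four fields for indexing; they remain available for filtering at query time.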
The Approaches
Keyword search vs. semantic search
Both systems use cosine similarity as the final ranking step. The difference is entirely in the representation: how the query and products are encoded before comparison.
Keyword search: scikit-learn TfidfVectorizer (low complexity)
Semantic search: sentence-transformers / all-MiniLM-L6-v2 (medium complexity)
Results
Doubling accuracy on natural-language queries
The notebook runs 10 natural-language test queries - each designed to use shopper language that does not match product description vocabulary. Both systems return the top 5 results. The correct answer for each query is the expected product category.
Three queries illustrate the failure modes most clearly. These are not edge cases - they represent the everyday gap between how shoppers search and how products are described.
The open-back headphones case
The most striking failure is not just a wrong result - it is a product from the right category that is actively harmful for the use case. TF-IDF scores the Sennheiser HD 660S2 Open-Back highest for “headphones for a noisy open office” because the word “open” appears in both the query and the product title. But open-back headphones leak sound in both directions - they are the worst possible choice for a noisy environment. TF-IDF has no way to know this, because it only counts shared words. Semantic search returns Sony WH-1000XM5 with noise cancellation because the embedding space places “noisy office” near “noise cancelling” and far from “open soundstage.”
Core Concepts
How each piece works
TF-IDF (Term Frequency-Inverse Document Frequency)
Represents each document as a sparse vector where each dimension is a vocabulary word. The weight for a word is proportional to how often it appears in this document (TF) multiplied by the inverse of how commonly it appears across all documents (IDF). Words that appear in every product get low weight; rare words that identify specific products get high weight. Cosine similarity then compares query vector to product vectors.
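To make the weighting concrete, here is a from-scratch sketch in plain Python. It is an illustration under assumptions, not the notebook's implementation: the three product texts are made up, tokenization is a simple whitespace split, and the IDF is the unsmoothed log(N/df) + 1. scikit-learn's TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so its exact numbers will differ.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_tfidf(docs):
    """Return one sparse {term: weight} vector per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def query_vector(query, idf):
    """Weight query terms with the fitted IDF; unknown terms are dropped."""
    counts = Counter(term for term in tokenize(query) if term in idf)
    return {term: count * idf[term] for term, count in counts.items()}

def sparse_cosine(a, b):
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical product texts written in "product team" vocabulary.
docs = [
    "creator laptop with prores acceleration and high-core-count chip",
    "ipx5 water resistant sport earphones",
    "open-back headphones with wide soundstage",
]
doc_vectors, idf = fit_tfidf(docs)

def score_all(query):
    q = query_vector(query, idf)
    return [sparse_cosine(q, d) for d in doc_vectors]

gym_scores = score_all("earbuds for the gym")
earphone_scores = score_all("water resistant earphones")
print(gym_scores, earphone_scores)
```

The shopper-language query shares no term with any product text, so every score is zero, while the query that reuses the product team's wording finds the earphones. That is the vocabulary gap in miniature.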
Sentence Embeddings
A neural network encodes any text (a product description, a search query, a sentence) as a fixed-length dense vector. The model is trained so that texts with similar meaning produce similar vectors - regardless of word overlap. all-MiniLM-L6-v2 produces 384-dimension vectors from 22M parameters. It runs locally, requires no API key, and encodes 291 products in under 10 seconds.
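A minimal sketch of embedding-based ranking follows. The function names `rank_by_embedding` and `load_minilm_encoder` are hypothetical, and the encoder is passed in as a callable so the ranking logic can be exercised without downloading a model. The loader assumes the sentence-transformers package is installed and uses its `SentenceTransformer(...).encode` method.

```python
import numpy as np

def rank_by_embedding(query, texts, encode, top_k=5):
    """Rank texts by cosine similarity to the query.

    `encode` is any callable that maps a list of strings to a 2D array-like
    with one row per string.
    """
    doc_vecs = np.asarray(encode(texts), dtype=float)
    query_vec = np.asarray(encode([query]), dtype=float)[0]
    # After L2 normalisation the dot product equals cosine similarity.
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec
    order = np.argsort(-scores)[:top_k]
    return [(texts[i], float(scores[i])) for i in order]

def load_minilm_encoder():
    """Return the encode method of all-MiniLM-L6-v2 (downloads on first use)."""
    # Imported lazily so the ranking function above has no heavy dependency.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")
    return model.encode
```

With the real model, a call would look like `rank_by_embedding("laptop for video editing", product_texts, load_minilm_encoder())`, where `product_texts` is the list of concatenated product strings.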
Cosine Similarity
Measures the angle between two vectors in a high-dimensional space. Returns a value from -1 (opposite directions) to 1 (identical direction). In practice, product search scores range from 0.1 (unrelated) to 0.85 (very close match). Both TF-IDF and semantic search use cosine similarity as the final ranking step - the difference is in the vectors being compared, not the comparison method.
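The computation itself is short. The sketch below uses plain Python lists for clarity; the function name is illustrative, and returning 0.0 for a zero-length vector is a convention chosen here, not something the playbook specifies.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty vectors: treat as unrelated
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # no shared direction
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
```

Because only the angle matters, a long product description and a three-word query can still score close to 1 if their vectors point the same way. TF-IDF weights are never negative, so TF-IDF scores stay between 0 and 1.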
Embedding Space Visualization (PCA)
The notebook reduces 384 dimensions to 2 using PCA (Principal Component Analysis) to visualize how products cluster. Products in the same category cluster together in the embedding space - headphones near headphones, laptops near laptops. A query for "headphones for a noisy office" falls into the audio cluster, near noise-cancelling products and far from open-back products.
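The projection can be sketched with NumPy alone. This is a stand-in under assumptions: the notebook may use a library PCA instead, the helper name `pca_2d` is hypothetical, and the two clusters below are synthetic vectors rather than real product embeddings.

```python
import numpy as np

def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

rng = np.random.default_rng(0)
# Two synthetic "categories" of 384-dimensional vectors with different centres.
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 384))
cluster_b = rng.normal(loc=1.0, scale=0.1, size=(20, 384))
coords = pca_2d(np.vstack([cluster_a, cluster_b]))
print(coords.shape)
```

Plotting `coords[:, 0]` against `coords[:, 1]` with one colour per category gives the kind of cluster picture described above. Keep in mind that two components discard most of the 384 dimensions, so distances in the plot are only a rough guide.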
Evaluation: Category Accuracy
For each test query, we define an expected product category (e.g., "headphones for a noisy office" expects Audio / headphones). We check whether the top-1 result falls in the correct category. This is a proxy for relevance - not a perfect metric, but one that is easy to audit and reproduce without click data.
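The metric is simple enough to sketch directly. The function name, the canned results, and the "Example Phone" entry are hypothetical; a real run would pass in the TF-IDF or semantic search function instead of `toy_search`.

```python
def top1_category_accuracy(test_cases, search_fn):
    """Fraction of queries whose top-1 result is in the expected category."""
    hits = 0
    for query, expected_category in test_cases:
        results = search_fn(query)
        if results and results[0]["category"] == expected_category:
            hits += 1
    return hits / len(test_cases)

# Hypothetical stand-in for a search system: canned ranked results per query.
CANNED_RESULTS = {
    "headphones for a noisy office": [{"name": "Sony WH-1000XM5", "category": "Audio"}],
    "laptop for video editing": [{"name": "Example Phone", "category": "Smartphones"}],
}

def toy_search(query):
    return CANNED_RESULTS.get(query, [])

test_cases = [
    ("headphones for a noisy office", "Audio"),
    ("laptop for video editing", "Laptops"),
]
accuracy = top1_category_accuracy(test_cases, toy_search)
print(accuracy)
```

Note the blind spot the open-back headphones case exposes: a result can land in the right category and still be the wrong product, so category accuracy is an upper bound on true relevance.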
Architecture
Offline indexing and online serving
Semantic search splits cleanly into two pipelines. Encoding products is expensive (one model inference per product), so it runs offline on a schedule. Serving a query is cheap (one model inference plus a cosine similarity lookup), so it runs in real time.
Offline indexing
1. Fetch product catalog from data source
2. Concatenate name + category + description into search text
3. Encode all product texts with sentence transformer model
4. Store embedding vectors alongside product IDs
5. Repeat whenever products are added, updated, or removed
Online serving
1. Receive search query from the storefront
2. Encode query with the same sentence transformer model
3. Compute cosine similarity against all product vectors
4. Apply filters (in stock, category, price range)
5. Return top-N results ranked by similarity score
6. Log query and results for evaluation and retraining
For a catalog under 10,000 products, numpy cosine similarity across all product vectors returns results in under 50ms with no vector database required. The notebook demonstrates this exact approach on 291 products.
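A brute-force version of the scoring, filtering, and top-N steps might look like the sketch below. It is illustrative only: random unit vectors stand in for real product embeddings, the `top_n` helper and the in-stock mask are hypothetical, and the rows are assumed to be L2-normalized at indexing time so that a dot product equals cosine similarity.

```python
import numpy as np

def top_n(query_vec, product_matrix, allowed_mask, n=5):
    """Score every product, drop filtered-out ones, return the n best.

    Assumes query_vec and the rows of product_matrix are L2-normalised.
    Returns (row index, similarity) pairs, best first.
    """
    scores = product_matrix @ query_vec
    # Filtered-out products get -inf so they can never rank.
    scores = np.where(allowed_mask, scores, -np.inf)
    n = min(n, int(allowed_mask.sum()))
    order = np.argsort(-scores)[:n]
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
# Stand-in for stored embeddings: 10,000 products, 384 dimensions each.
matrix = rng.normal(size=(10_000, 384))
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)

query = matrix[42]                      # pretend the query embeds like product 42
in_stock = np.ones(10_000, dtype=bool)
in_stock[7] = False                     # product 7 is filtered out

results = top_n(query, matrix, in_stock, n=5)
print(results[0])
```

The whole lookup is one matrix-vector product plus a sort, which is why a dedicated vector database is not needed at this catalog size. In production, the row indices would be mapped back to the stored product IDs.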
Business Applications
Run the notebook
The Jupyter notebook builds both search systems from scratch on 291 TechHeaven products. It includes a TF-IDF baseline, sentence-transformer semantic search, a side-by-side comparison on 10 natural-language queries, and a 2D PCA visualization of the product embedding space. No API key required.