
Semantic Product Search

Match shopper intent to your product catalog without keyword matching

The Problem

Your search box and your shoppers speak different languages

TechHeaven carries 291 products across 15 categories. A shopper searching “laptop for video editing” should find three relevant products - all with descriptions mentioning creator workflows, ProRes acceleration, and high-core-count chips. Traditional keyword search returns nothing. No product title contains the phrase “video editing.”

This vocabulary gap between how shoppers describe what they want and how product teams write descriptions is permanent. Better copy does not solve it: the shopper who types “earbuds for the gym” is looking for products described as “IPX5 water resistant sport earphones.” Product descriptions are written in specification language, not in the language shoppers use to search.

10-15%
of searches return zero results
Bloomreach / Algolia ecommerce benchmarks
30-60%
exit rate after zero results
Shoppers who see nothing relevant leave immediately
2x
category accuracy
Semantic search vs TF-IDF on natural-language queries - this notebook

Semantic search closes the vocabulary gap by understanding intent rather than matching words. The same query - “laptop for video editing” - maps to the same region of an embedding space as “ProRes acceleration” and “creator workflows,” even without a single shared word. This playbook builds both systems and shows exactly where keyword search fails and why.

The Data

What the notebook indexes

Both search systems in this playbook index the same dataset: the TechHeaven product catalog loaded directly from GitHub. No database, no local setup, no API key required - the notebook fetches the data at runtime and builds the indexes from scratch.

Products indexed: 291

Consumer electronics across 15 categories: Laptops, Audio, Gaming, Smartphones, Cameras, Smart Home, and more

Search text per product: 4 fields

name + category + short_description + description concatenated into a single string for indexing

TF-IDF vocabulary: ~4,800 terms

Sparse matrix: most query terms score zero against most products

Embedding dimensions: 384

all-MiniLM-L6-v2 output: dense vector for every product, 22M parameter model, no API key required

data/catalog/products.json

Every product has: name, category, brand, price, short_description, and a longer description with technical specifications. The description field is where the vocabulary gap lives - technical terms that shoppers never use in a search box.
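
The four-field concatenation can be sketched as below. The field names come from the catalog description above; the `build_search_text` helper, the sample record, and the single-space separator are illustrative assumptions, not the notebook's exact code.

```python
SEARCH_FIELDS = ("name", "category", "short_description", "description")

def build_search_text(product: dict) -> str:
    """Join the four indexed fields into one string, skipping missing or empty ones."""
    parts = (str(product.get(field) or "").strip() for field in SEARCH_FIELDS)
    return " ".join(part for part in parts if part)

# Hypothetical product record, shaped like an entry in products.json.
sample = {
    "name": "Example Sport Earphones",
    "category": "Audio",
    "brand": "ExampleBrand",
    "price": 79.0,
    "short_description": "IPX5 water resistant",
    "description": "Secure-fit earphones for workouts.",
}
search_text = build_search_text(sample)
print(search_text)
```

Brand and price are left out of the search text here because the page lists only four indexed fields; they remain available for filtering.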

The Approaches

Keyword search vs. semantic search

Both systems use cosine similarity as the final ranking step. The difference is entirely in the representation: how the query and products are encoded before comparison.

TF-IDF (Baseline)
scikit-learn TfidfVectorizer - Low complexity
How it works
Encodes products and queries as sparse word-frequency vectors. A product scores higher when the query words appear often in it and rarely across the corpus.
Strength
Fast, interpretable, no model download; precise when the query uses the exact terms found in the catalog, such as a brand name or model number
Weakness
Zero score for any query word not in a product description. Cannot match synonyms, intent, or natural-language phrasing.
Semantic Search (Dense Embeddings)
sentence-transformers / all-MiniLM-L6-v2 - Medium complexity
How it works
Encodes products and queries as 384-dimension dense vectors using a pre-trained sentence transformer. Nearby vectors in this space share meaning, not just words.
Strength
Handles vocabulary gaps, synonyms, and natural-language intent. Works even when query and product share zero words.
Weakness
Slower to index (model inference per product), requires more memory per product than sparse vectors.

Results

Doubling accuracy on natural-language queries

The notebook runs 10 natural-language test queries - each designed to use shopper language that does not match product description vocabulary. Both systems return the top 5 results. The correct answer for each query is the expected product category.

5/10
TF-IDF
correct category in top-5 results
10/10
Semantic
correct category in top-5 results

Three queries illustrate the failure modes most clearly. These are not edge cases - they represent the everyday gap between how shoppers search and how products are described.

Query: headphones for a noisy open office
Expected: Closed-back, noise-isolating headphones
TF-IDF result
Sennheiser HD 660S2 Open-Back
Matched "open" in query to "open-back" in title - confidently returns the opposite of what was needed
Semantic result
Sony WH-1000XM5 (Noise Cancelling)
Understood "noisy office" = need isolation, not open sound staging
Query: laptop for video editing
Expected: High-performance laptop, GPU/CPU priority
TF-IDF result
No strong match - no products contain "video editing"
Zero vocabulary overlap with product descriptions
Semantic result
Apple MacBook Pro M4 Max
Understood creator workflows = high-performance laptop with strong GPU
Query: something to carry my gear
Expected: Bags / cases
TF-IDF result
Random weak matches across categories
No product contains "gear" - no meaningful score produced
Semantic result
Incase DSLR Pro Pack
Understood "carry my gear" = carrying case category

The open-back headphones case

The most striking failure is not merely a wrong result: TF-IDF returns a product from the right category that is actively wrong for the shopper. It scores the Sennheiser HD 660S2 Open-Back highest for “headphones for a noisy open office” because the word “open” appears in both the query and the product title. But open-back headphones leak sound in both directions, which makes them the worst possible choice for a noisy environment. TF-IDF has no way to know this, because it compares words rather than meaning. Semantic search returns the Sony WH-1000XM5 with noise cancellation because the embedding space places “noisy office” near “noise cancelling” and far from “open soundstage.”
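
The collision on the word "open" can be reproduced with a small from-scratch TF-IDF. This is a sketch in plain Python rather than scikit-learn's TfidfVectorizer; the three product strings are abbreviated, hypothetical stand-ins for catalog entries, and the smoothed IDF formula is assumed to mirror scikit-learn's default.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so 'Open-Back' becomes two tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(token_lists):
    """Smoothed inverse document frequency per term."""
    n = len(token_lists)
    df = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; terms outside the vocabulary are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

# Abbreviated, hypothetical product texts (titles only).
products = [
    "Sennheiser HD 660S2 Open-Back Headphones",
    "Sony WH-1000XM5 Noise Cancelling Headphones",
    "Incase DSLR Pro Pack Camera Backpack",
]
token_lists = [tokenize(p) for p in products]
idf = fit_idf(token_lists)
query = tfidf(tokenize("headphones for a noisy open office"), idf)
scores = [cosine(query, tfidf(tokens, idf)) for tokens in token_lists]
ranked = sorted(zip(scores, products), reverse=True)
for score, name in ranked:
    print(f"{score:.2f}  {name}")
```

In this toy corpus the open-back product ranks first because it shares two tokens with the query ("headphones" and "open"), while the noise-cancelling product shares only one: "noisy" and "noise" are different tokens to a keyword model.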

Core Concepts

How each piece works

TF-IDF (Term Frequency-Inverse Document Frequency)

Represents each document as a sparse vector where each dimension is a vocabulary word. The weight for a word is proportional to how often it appears in this document (TF) multiplied by the inverse of how commonly it appears across all documents (IDF). Words that appear in every product get low weight; rare words that identify specific products get high weight. Cosine similarity then compares query vector to product vectors.

TechHeaven: The word "laptop" appears in 40 of 291 products - moderate IDF weight. The word "Thunderbolt" appears in 12 products - higher IDF weight. A query containing "Thunderbolt" gets a stronger signal toward those 12 products than a query containing "laptop" would.
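
To put numbers on the two weights, the sketch below applies the smoothed IDF formula that scikit-learn's TfidfVectorizer uses by default (`smooth_idf=True`). Whether the notebook keeps that default is an assumption; the document counts (40 and 12 of 291) come from the example above.

```python
import math

N_PRODUCTS = 291  # catalog size from the text

def smoothed_idf(doc_freq: int, n_docs: int = N_PRODUCTS) -> float:
    """idf = ln((1 + n) / (1 + df)) + 1, scikit-learn's smooth_idf=True formula."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

idf_laptop = smoothed_idf(40)       # "laptop" appears in 40 products
idf_thunderbolt = smoothed_idf(12)  # "Thunderbolt" appears in 12 products
print(f"laptop: {idf_laptop:.2f}, thunderbolt: {idf_thunderbolt:.2f}")
```

Under these assumptions the weights come out at about 2.96 for "laptop" and 4.11 for "Thunderbolt", so a single occurrence of the rarer word contributes roughly 40% more to a product's score.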

Sentence Embeddings

A neural network encodes any text (a product description, a search query, a sentence) as a fixed-length dense vector. The model is trained so that texts with similar meaning produce similar vectors - regardless of word overlap. all-MiniLM-L6-v2 produces 384-dimension vectors from 22M parameters. It runs locally, requires no API key, and encodes 291 products in under 10 seconds.

TechHeaven: "Laptop for video editing" and "Apple MacBook Pro M4 Max - ProRes hardware acceleration, 14-core GPU for creator workflows" produce vectors that are close in the 384-dimension space even though they share no meaningful words. The model learned this relationship from millions of sentence pairs during pre-training.
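
A minimal sketch of the encoding step with the sentence-transformers package follows. `SentenceTransformer` and `encode(..., normalize_embeddings=True)` are real library calls; the helper names `load_model`, `embed`, and `best_match` are illustrative and not taken from the notebook.

```python
def load_model(name: str = "all-MiniLM-L6-v2"):
    """Load the sentence transformer (downloads the weights on first use)."""
    # Imported lazily so the helpers below can be exercised without the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)

def embed(model, texts):
    """Encode texts as unit-length vectors, so a dot product equals cosine similarity."""
    return model.encode(texts, normalize_embeddings=True)

def best_match(model, query, product_texts):
    """Return (index, score) of the product text closest to the query."""
    query_vec = embed(model, [query])[0]
    product_vecs = embed(model, product_texts)
    scores = [float(sum(q * p for q, p in zip(query_vec, vec))) for vec in product_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Usage (requires `pip install sentence-transformers`):
#   model = load_model()
#   idx, score = best_match(model, "laptop for video editing", product_texts)
```

Normalizing at encode time is a design choice: it lets the ranking step use a plain dot product, which is what makes brute-force search over a small catalog cheap.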

Cosine Similarity

Measures the angle between two vectors in a high-dimensional space. Returns a value from -1 (opposite directions) to 1 (identical direction). In practice, product search scores range from 0.1 (unrelated) to 0.85 (very close match). Both TF-IDF and semantic search use cosine similarity as the final ranking step - the difference is in the vectors being compared, not the comparison method.

TechHeaven: Query "headphones for a noisy open office" vs. Sennheiser HD 660S2 (TF-IDF): score 0.31 (word "open" matches). The same query vs. Sony WH-1000XM5 (semantic): score 0.67 (embeddings capture noise-cancelling intent). The semantic score for the correct product is more than twice the TF-IDF score for the wrong one.
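
The measure itself fits in a few lines. The sketch below is a plain-Python version for illustration; the function name and the convention of returning 0.0 for an all-zero vector are choices made here, not something the notebook specifies.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # no shared direction
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

Because TF-IDF weights are never negative, TF-IDF scores fall between 0 and 1; only dense embeddings can produce negative values.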

Embedding Space Visualization (PCA)

The notebook reduces 384 dimensions to 2 using PCA (Principal Component Analysis) to visualize how products cluster. Products in the same category cluster together in the embedding space - headphones near headphones, laptops near laptops. A query for "headphones for a noisy open office" falls into the audio cluster, near noise-cancelling products and far from open-back products.

TechHeaven: The 2D PCA chart shows 15 distinct product clusters. Gaming peripherals, audio equipment, and mobile accessories each occupy a distinct region. Overlaps appear at meaningful boundaries - wireless earbuds bridge the audio and mobile clusters.
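
For readers who want to see what the projection does, here is a from-scratch sketch of PCA to two dimensions using power iteration on the covariance matrix. It is written in plain Python for readability and is not the notebook's code; on real 384-dimension embeddings a library implementation such as scikit-learn's `PCA` is the practical choice. The toy points are invented for the example.

```python
import math
import random

def pca_2d(vectors, iterations=100, seed=0):
    """Project row vectors onto their top two principal components."""
    n, d = len(vectors), len(vectors[0])
    means = [sum(row[j] for row in vectors) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            # Deflate: remove directions already found, then apply the covariance.
            for c in components:
                overlap = sum(a * b for a, b in zip(v, c))
                v = [a - overlap * b for a, b in zip(v, c)]
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        components.append(v)
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]

# Invented toy data: most of the spread lies along the first axis.
points = [[0, 0, 0], [1, 0.1, 0], [2, -0.1, 0], [3, 0.1, 0], [4, -0.1, 0]]
projected = pca_2d(points)
for x, y in projected:
    print(f"{x:+.2f} {y:+.2f}")
```

The first output coordinate captures the direction of greatest variance, which is why category clusters that differ strongly from each other separate on a 2D chart even though most of the 384 dimensions are discarded.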

Evaluation: Category Accuracy

For each test query, we define an expected product category (e.g., "headphones for a noisy open office" expects Audio / headphones). We check whether a product from that category appears in the top-5 results. This is a proxy for relevance - not a perfect metric, but one that is easy to audit and reproduce without click data.

TechHeaven: 10 queries, each with a defined expected category. TF-IDF: 5 correct. Semantic: 10 correct. The notebook prints the result table and reproduces the finding in any environment, with no manual data download required.
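
The metric is simple enough to audit by hand. The sketch below takes the cutoff `k` as a parameter so the same function covers a strict top-1 check and a looser top-5 check; the function name, the input shape, and the two sample rankings are hypothetical and are not the notebook's output.

```python
def category_accuracy(results_by_query, expected_category, k=5):
    """Fraction of queries whose expected category appears in the top-k results.

    results_by_query: {query: [category of rank 1, category of rank 2, ...]}
    expected_category: {query: expected category}
    """
    hits = sum(
        expected_category[q] in ranked[:k]
        for q, ranked in results_by_query.items()
    )
    return hits / len(results_by_query)

# Hypothetical ranked categories for two queries (not the notebook's output).
results = {
    "laptop for video editing": ["Cameras", "Laptops", "Audio"],
    "something to carry my gear": ["Gaming", "Audio", "Smart Home"],
}
expected = {
    "laptop for video editing": "Laptops",
    "something to carry my gear": "Bags",
}
top1 = category_accuracy(results, expected, k=1)
top5 = category_accuracy(results, expected, k=5)
print(top1, top5)
```

Reporting the cutoff alongside the score matters: the same ranking can fail at top-1 and pass at top-5, as the first sample query does here.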

Architecture

Offline indexing and online serving

Semantic search splits cleanly into two pipelines. Encoding products is expensive (one model inference per product), so it runs offline on a schedule. Serving a query is cheap (one model inference plus a cosine similarity lookup), so it runs in real time.

Offline - Indexing
Runs nightly or on catalog change
  1. Fetch product catalog from data source
  2. Concatenate name + category + short_description + description into search text
  3. Encode all product texts with sentence transformer model
  4. Store embedding vectors alongside product IDs
  5. Repeat whenever products are added, updated, or removed
Online - Serving
Runs on every search, target <100ms
  1. Receive search query from the storefront
  2. Encode query with the same sentence transformer model
  3. Compute cosine similarity against all product vectors
  4. Apply filters (in stock, category, price range)
  5. Return top-N results ranked by similarity score
  6. Log query and results for evaluation and retraining
Technology choices per layer
Encoding
sentence-transformers, OpenAI embeddings, Cohere embed
Vector store
pgvector, Qdrant, Pinecone, Weaviate, numpy (small catalogs)
Serving
FastAPI, Next.js API routes, Shopify App Bridge
Monitoring
Zero-result rate, click-through rate, query logs

For a catalog under 10,000 products, numpy cosine similarity across all product vectors returns results in under 50ms with no vector database required. The notebook demonstrates this exact approach on 291 products.
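
The offline/online split can be sketched as two functions that share one encoder. Everything named here (`build_index`, `search`, the `toy_encode` stand-in, the record fields) is hypothetical; in practice `encode` would wrap the sentence transformer. The sketch uses plain Python lists so it runs without dependencies, and with numpy the scoring loop collapses to a single matrix-vector product.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def build_index(products, encode):
    """Offline step: encode each product's search text once and keep it by ID."""
    return [
        {"id": p["id"], "in_stock": p["in_stock"], "vector": normalize(encode(p["text"]))}
        for p in products
    ]

def search(query, index, encode, top_n=5, in_stock_only=True):
    """Online step: encode the query, score eligible products, return the top-N."""
    q = normalize(encode(query))
    scored = [
        (sum(a * b for a, b in zip(q, item["vector"])), item["id"])
        for item in index
        if item["in_stock"] or not in_stock_only
    ]
    return sorted(scored, reverse=True)[:top_n]

def toy_encode(text):
    """Stand-in for a sentence transformer: counts words from two hand-picked topics."""
    words = text.lower().split()
    return [
        float(sum(w in {"headphones", "earbuds", "noise"} for w in words)),
        float(sum(w in {"laptop", "gpu", "editing"} for w in words)),
    ]

catalog = [
    {"id": "p1", "text": "noise cancelling headphones", "in_stock": True},
    {"id": "p2", "text": "laptop with fast gpu", "in_stock": True},
    {"id": "p3", "text": "laptop for editing", "in_stock": False},
]
index = build_index(catalog, toy_encode)
hits = search("laptop for video editing", index, toy_encode, top_n=2)
print(hits)
```

This version applies the stock filter while scoring rather than afterwards, which avoids returning fewer than N results when top-ranked products are unavailable.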

Business Applications

Shopify / WooCommerce
Replace the default storefront search with semantic embeddings. A shopper searching "wireless earbuds for running" finds products described as "sport earphones" and "IPX5 water resistant" even with no word overlap. Products are encoded at publish time, and for a small catalog cosine similarity serves queries in under 50ms.
B2B / Manufacturing
Part number search fails for buyers who describe what a part does rather than what it is called. Semantic search matches "connector that fits into a 3mm slot" to technical specifications written in engineering terms. Reduces "no results" dead ends that send buyers to a competitor.
Insurance / Professional Services
Policy and product search applies the same logic. A customer searching "I need cover if I can't work" should surface disability and income protection products - even if none of those exact words appear in the product name. Dense retrieval closes the vocabulary gap between customer language and actuarial language.
SaaS / Marketplaces
Feature discovery, documentation search, and help content retrieval all benefit from semantic search. Users searching "how do I undo a change" find results about version history and revision tracking even if neither phrase appears in the documentation title.

Run the notebook

The Jupyter notebook builds both search systems from scratch on 291 TechHeaven products. TF-IDF baseline, sentence-transformer semantic search, a side-by-side comparison on 10 natural-language queries, and a 2D PCA visualization of the product embedding space. No API key required.
