Building Semantic Search for E-commerce Using Product Embeddings and OpenSearch

Hey! So, I recently worked on this cool proof of concept (POC) where I tried to combine product embeddings with OpenSearch to enable a k-nearest neighbours (k-NN) search. Basically, I wanted to see if I could make searching through a product catalog smarter by using embeddings (which are essentially vector representations of the product information) instead of just basic text search.

In the end, I was able to set up a system where you could search for products based on their semantic meaning rather than just the keywords, which is a huge step forward in building intuitive search experiences for e-commerce. Let me break down what I did and how it all worked.

Table of Contents

The Problem I Was Solving

We’ve all seen e-commerce websites where the search just doesn’t understand what we want.

For example, if you search for “blue running shoes” some sites will only return products with the exact words “blue” and “running” in the title or description. That’s pretty limiting, right?

What if a product is described as “sky-colored jogger sneakers“?

Traditional text-based search would miss that.

This is where embeddings come in. Embeddings turn text (like product descriptions) into vectors (numbers in a high-dimensional space), which allows you to compare them by how similar they are instead of whether the exact words match. That’s the basis of k-NN search, where you find the items closest to each other in vector space.

I decided to build a system that would take e-commerce products, generate embeddings from their descriptions, and then store those embeddings in OpenSearch. This would allow us to perform k-NN searches based on those embeddings. In simpler terms: instead of searching based on exact words, we’d be searching based on meaning!

Understanding Vector Dimensions

Vector dimensions play a crucial role in capturing the semantic relationships between words and concepts. The higher the number of dimensions, the more accurately the vector can represent the nuances and complexities of language.

Imagine a simple 2-dimensional vector space, where each dimension represents a specific feature or characteristic. In this simplified scenario, let’s consider the dimensions as “size” and “color”. We can represent words or concepts as points in this 2D space, with their coordinates determined by their respective size and color values.

For example, consider the words “apple” and “banana.” We can represent “apple” as a point with coordinates (2, 1), indicating a medium size and a red color. Similarly, “banana” could be represented as (5, 2), indicating a larger size and a yellow color.

	Feature – Size (X-coordinate)	Feature – Color (Y-coordinate)
Apple	2	1
Banana	5	2

Now, let’s consider two distinct words, like “laptop” and “ocean.” We might represent “laptop” as (3, 4), indicating a medium size and a gray color, while “ocean” could be (1, 6), indicating a large size and a blue color.

	Feature – Size (X-coordinate)	Feature – Color (Y-coordinate)
Laptop	3	4
Ocean	1	6

In this 2D space, we can calculate the distance between these points (words) using the Euclidean distance formula. Similar words, like “apple” and “banana,” will have a smaller distance between them, while distinct words, like “laptop” and “ocean,” will have a larger distance.

However, language is far more complex than just two dimensions. Words and concepts have multiple facets, including context, connotations, and nuances. Higher-dimensional vector spaces allow us to capture these intricacies more accurately.

For instance, a 1536-dimensional vector space, like the one used by Amazon Bedrock’s Titan model, can represent words and concepts with a much higher level of detail. Each dimension could correspond to a different aspect of meaning, such as synonyms, antonyms, parts of speech, sentiment, and more.

In this high-dimensional space, semantically similar words and concepts will be clustered together, while dissimilar ones will be farther apart. This enables more accurate and meaningful semantic searches, as words with similar meanings will have vectors that are closer together in this vast vector space.

OpenSearch and KNN Integration

Let’s start with the OpenSearch part, which is where the magic happens. OpenSearch is an open-source search engine, similar to Elasticsearch.

What’s cool is that it supports k-NN, meaning you can use it to search through vectors (embeddings) instead of just text.

This is the architecture that I followed for the POC:

Setting up OpenSearch

First, I had to connect to OpenSearch using AWS credentials. Here’s how I did that:

from requests_aws4auth import AWS4Auth
from opensearchpy import OpenSearch
from opensearchpy.connection import RequestsHttpConnection
import boto3

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(region='us-east-1', service='aoss', refreshable_credentials=credentials)

host = 'https://your-opensearch-domain'
client = OpenSearch(
    hosts=[host],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

Here, I authenticated using AWS’s boto3 and connected it to my OpenSearch instance. Once I had this connection, I could start creating indexes and working with the data.

Creating the Index

I needed to create an index that would store my product data, including the embeddings. An index in OpenSearch is like a database table where you can define the structure of the documents you want to store. Here’s how I created the index:

def create_index(index_name, field_mappings):
    index_body = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
            "index.knn": True  # Enable KNN vector search
        },
        "mappings": field_mappings
    }

    response = client.indices.create(index=index_name, body=index_body)
    print(f"Index Created: {response}")

I used a mapping for the product data that included fields like name, price, category, and embedding. The important one is embedding, which is where I would store the product’s vector representation. I set the index.knn setting to True to enable vector search.

Product Data and Embeddings

Now for the fun part – getting embeddings.

I used Amazon Bedrock’s Titan model to generate the embeddings. Titan is great for creating text embeddings with higher dimensions (1536, in this case), which is perfect for our k-NN search.

Here’s how I generated embeddings:

from langchain_aws import BedrockEmbeddings

def generate_text_embeddings(text):
    text_embeddings = BedrockEmbeddings(
        region_name="us-east-1",
        model_id="amazon.titan-embed-text-v1"
    )
    embedding_vector = text_embeddings.embed_query(text)
    return embedding_vector

Basically, I took each product’s description and passed it through the Titan model, which converted the text into a 1536-dimensional vector. This vector represents the product in a way that captures its meaning, not just its words.

Ingesting Product Data

Once I had my embeddings, I needed to get the product data (including the embeddings) into OpenSearch. But before doing that, I wanted to check if the products were already in my DynamoDB table to avoid duplicate entries.

I wrote a function that checks if the product exists in DynamoDB:

def pk_exists(pk):
    response = table.get_item(Key={'pk': pk})
    if 'Item' in response:
        return response['Item']['status'] == 'created'
    return False

# for bulk fetch
def bulk_pk_exists(pks):
    try:
        # Prepare the list of keys for the BatchGetItem request
        keys_to_check = [{'pk': pk} for pk in pks]

        # Perform the BatchGetItem operation
        response = table.meta.client.batch_get_item(
            RequestItems={
                table.name: {
                    'Keys': keys_to_check
                }
            }
        )

        # Process the response to check if keys exist and their statuses
        existing_items = response['Responses'].get(table.name, [])

        # Create a dictionary to hold the status of each key
        results = {pk: False for pk in pks}  # Default to False (does not exist)

        # For each returned item, check its status
        for item in existing_items:
            pk_value = item['pk']
            if item.get('status') == 'created':
                results[pk_value] = True
        
        print(len(results))
        return results

    except ClientError as e:
        #print(f"Unable to check Partition Keys: {e.response['Error']['Message']}")
        raise Exception()

I also made a batch method to check multiple products at once and another function to write new items into DynamoDB.

With this setup, I could safely write new products to both DynamoDB (for tracking purposes) and OpenSearch (for search purposes). Here’s how I ingested the data into OpenSearch:

def bulk_index_documents(index_name, documents, success_callback):
    bulk_data = []
    for doc in documents:
        bulk_data.append({"index": {"_index": index_name}})
        bulk_data.append(doc)

    response = client.bulk(body=bulk_data)
    if not response['errors']:
        success_callback(documents, response)

This bulk ingestion allowed me to index multiple products at once, which made the whole process much faster.

Data Preparation

To prepare the product data, I wrote some helper functions to clean up the descriptions, image URLs, and other fields:

Flattened descriptions: I took complex product descriptions and parsed them into simpler, flattened text.
SKU sanitation: Made sure the SKU (a unique product identifier) was in a consistent format.

This was all handled in a generate_product_document function that yielded cleaned-up product data. For each product, I created a document field, which was a nicely formatted string containing all the relevant product details. I also added the embedding field to this product document.

Running the Ingestion Pipeline

With everything set up, I ran the ingestion script to generate embeddings for each product and index them in OpenSearch.

Here’s what that process looked like:

Generate embeddings for each product description.
Check if the product exists in DynamoDB (to avoid duplicates).
Bulk index the products into OpenSearch.
Store product metadata (including a status flag) in DynamoDB.

This looped through all the products in the dataset, and whenever a batch of 100 products was processed, it was written to both OpenSearch and DynamoDB.

Results

I’m sure you all are interested in the results and I’m not going to leave you without it. Although, I cannot share everything here, I can share some of the top results and link you to the entire image repository.

Search Query: Show me floral printed dresses

Keyword Search Results

Floral Printed Dress using Keyword Search

Vector Search Results

If you look at the above results, you can clearly see that vector based search results are pretty accurate. They are not just focused on the keyword such as “floral” but also they understand which dresses will contain “floral printed” pattern on the dress. This clearly depicts that vector based searches understand the semantic behind the user’s query. And it’s amazing to see when it comes to live.

Final Thoughts

So that’s pretty much it! By the end of this process, I had a working k-NN search for product embeddings. The cool thing about this setup is that it allows for much more intuitive search experiences. For example, a user could search for a product using natural language, and the system would return results based on the meaning of the query, not just the keywords.

This was a fun and rewarding POC, and I’m excited to see how I can build on it. In the future, I could add more product attributes or fine-tune the embedding generation process for even better search results. But for now, this was a great step forward!

Let me know if you have any thoughts or questions! 😄