Introduction

Similar to many other e-commerce platforms, Meesho places significant importance on its product search functionality. Search plays a crucial role in helping users discover the right products from a catalog of hundreds of millions of items, through modes such as text, voice, and image. Among these, text search is the most commonly used. In this blog, we'll delve into our development of a semantic search system designed to decipher user search intent from text queries. This, in turn, enhances user engagement and conversions on search results pages and across the platform as a whole.

Conventional search solutions, such as term or lexicon matching engines, often struggle to understand natural language, leading to subpar performance on queries that mix words from multiple languages or use scripts other than the Latin alphabet. Meesho caters to a substantial user base in smaller towns, and our goal is to enhance the search experience, especially when users search with queries that include words from regional Indian languages, whether written in Latin script or their native scripts. For example, "पूजा सामग्री" (Hindi: "puja supplies"), "baccho ke liye train toys" (Hindi in Latin script: "train toys for kids"), and "ঐতিহ্যবাহী সুতির শাড়ি" (Bengali: "traditional cotton saree").

In the upcoming sections, we will outline the architecture of the two-tower model, discuss the model training process along with offline evaluation, and subsequently delve into the generation of embeddings and the real-time deployment of the model for inference. Following that, we will assess the accomplishments and outline potential areas for future work.

Model Architecture & Training

Figure 1. Model Architecture

Drawing inspiration from Que2Search, we trained a two-tower model on weakly supervised datasets and achieved state-of-the-art performance for search query and product representation compared to previous baselines at Meesho. The model architecture is shown in Figure 1.

The query tower is designed to learn two distinct representations for the search query. One is built on character-3 gram embeddings, and the other is based on the output generated by the pre-trained IndicBERT encoder. We compile a list of character-3 grams from the search query, pass them through an embedding bag, and derive the representation by applying sum pooling. The representation of the [CLS] token from the final layer of the IndicBERT encoder serves as the encoder representation, which is then passed through a projection layer to align it with the dimension of the embedding bag. These two representations of the search query are merged using attention fusion.
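
To make this concrete, here is a minimal PyTorch sketch of the query tower as described. The vocabulary size, embedding dimension, fusion head, and the ai4bharat/indic-bert checkpoint name are illustrative assumptions, not our exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class QueryTower(nn.Module):
    """Sketch of the query tower: a sum-pooled bag of character-3-gram
    embeddings fused with the IndicBERT [CLS] representation."""

    def __init__(self, ngram_vocab_size=100_000, dim=256,
                 encoder_name="ai4bharat/indic-bert"):
        super().__init__()
        self.ngram_bag = nn.EmbeddingBag(ngram_vocab_size, dim, mode="sum")
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # Project the [CLS] output down to the embedding-bag dimension.
        self.proj = nn.Linear(self.encoder.config.hidden_size, dim)
        # Scores one attention weight per representation.
        self.fusion = nn.Linear(dim, 1)

    def forward(self, ngram_ids, ngram_offsets, input_ids, attention_mask):
        ngram_repr = self.ngram_bag(ngram_ids, ngram_offsets)        # (B, dim)
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        bert_repr = self.proj(cls)                                    # (B, dim)
        # Attention fusion: softmax-weighted sum of the two views.
        stacked = torch.stack([ngram_repr, bert_repr], dim=1)         # (B, 2, dim)
        weights = torch.softmax(self.fusion(stacked), dim=1)          # (B, 2, 1)
        return F.normalize((weights * stacked).sum(dim=1), dim=-1)
```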

In the case of the product tower, the input features consist of the product title and description. Additionally, we crafted sentences for product descriptions based on attribute key-value data associated with the products, as illustrated in the sketch below. Much like the approach taken in the query tower, we create character-3 grams from the title and description, pass them through a shared embedding bag, and generate a representation by applying sum pooling. Another representation for the product is produced using the IndicBERT encoder, with the [CLS] token output of the final layer serving as the representative value. These two product representations are merged through attention fusion, with the attention weights being learned during model training.
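
As an illustration of crafting sentences from attribute data, a minimal helper might look like the following; the attribute names and the template are hypothetical:

```python
def attributes_to_sentence(attributes: dict) -> str:
    """Render product attribute key-value pairs as a plain sentence
    that can be appended to the product description."""
    return ". ".join(f"{key} is {value}" for key, value in attributes.items()) + "."

# attributes_to_sentence({"color": "maroon", "fabric": "cotton"})
# -> "color is maroon. fabric is cotton."
```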

By employing the IndicBERT encoder, we can effectively handle queries originating from a wide array of Indian regional languages, including Bengali, Hindi, Malayalam, Marathi, and many more. While the IndicBERT encoder adeptly learns representations for standard tokens across these diverse languages, the process of learning representations from the embedding bag of character-3 grams serves to enhance robustness of the query representation in cases of mistyped or misspelled search queries.
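
For intuition, character-3 grams can be extracted as below (the '#' boundary padding is an assumption); the example shows why a misspelled query still shares most of its trigrams with the correct spelling:

```python
def char_trigrams(text: str) -> list[str]:
    """Character 3-grams over the padded query; '#' marks word
    boundaries so prefixes and suffixes get their own trigrams."""
    padded = f"#{text.strip().lower()}#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

# char_trigrams("saree") -> ['#sa', 'sar', 'are', 'ree', 'ee#']
# char_trigrams("sare")  -> ['#sa', 'sar', 'are', 're#']
```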

For model training, we created a dataset from user search query logs, gathering <query, product> pairs that displayed positive user interactions, such as clicks, buy now, add to cart, and orders. We gathered only positive samples and, by leveraging mixed negatives (both in-batch negatives and random negatives), trained the model to optimize the NT-Xent (normalized temperature-scaled cross-entropy) loss, a.k.a. the multi-class N-pair loss.

The loss for the i-th sample, corresponding to the query-product pair <q_i, p_i> in a batch of size N, is computed using the following expression, where s denotes the scaling factor and cos represents cosine similarity:

L_i = -log( exp(s · cos(q_i, p_i)) / Σ_{j=1}^{N} exp(s · cos(q_i, p_j)) )

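A minimal PyTorch sketch of this loss with in-batch negatives is shown below; the scale value is illustrative, and random negatives would simply be appended as extra columns of the logit matrix:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(query_emb: torch.Tensor, product_emb: torch.Tensor,
                 scale: float = 20.0) -> torch.Tensor:
    """NT-Xent / multi-class N-pair loss over in-batch negatives.

    Row i of each (N, dim) tensor of L2-normalized embeddings forms the
    positive pair <q_i, p_i>; every other product in the batch serves
    as a negative for q_i.
    """
    # Scaled cosine similarities: a dot product suffices because the
    # inputs are L2-normalized.
    logits = scale * query_emb @ product_emb.T                   # (N, N)
    # The positive for query i sits on the diagonal, i.e. class index i.
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```
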
During the model training process, we computed the BatchRecall@1 metric to track in-batch recall. Upon completion of model training, we evaluated the model using human-rated relevance labels for <query, product> pairs. Relevant pairs receive a label of 1, while irrelevant pairs are labeled as 0. For these same pairs, we calculate the cosine similarity between the query embedding and the product embedding generated by the model. We utilize the true labels and model predictions to compute the ROC AUC, which serves as a validation metric for fine-tuning model hyperparameters.
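
Assuming paired tensors of query and product embeddings for the rated pairs, along with their 0/1 relevance labels, this validation metric can be computed with scikit-learn:

```python
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

# query_emb, product_emb: (M, dim) tensors for M human-rated pairs
# labels: sequence of 0/1 relevance ratings for the same pairs
scores = F.cosine_similarity(query_emb, product_emb, dim=-1)
auc = roc_auc_score(labels, scores.detach().cpu().numpy())
print(f"validation ROC AUC: {auc:.4f}")
```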

The model was implemented using PyTorch, HuggingFace, and PyTorch Lightning, and trained with multi-GPU data parallelism on a g5.24xlarge instance equipped with 4 NVIDIA A10G Tensor Core GPUs. Regularization techniques like dropout and gradient clipping were used.
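
A sketch of this training setup with PyTorch Lightning might look like the following; TwoTowerModule, the clipping value, and the epoch count are placeholders rather than our exact configuration:

```python
import pytorch_lightning as pl

# TwoTowerModule is a hypothetical LightningModule wrapping both towers
# and the NT-Xent loss; train_loader yields <query, product> batches.
model = TwoTowerModule()

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,               # one DDP process per A10G GPU
    strategy="ddp",
    gradient_clip_val=1.0,   # gradient clipping (value is illustrative)
    max_epochs=10,           # illustrative
)
trainer.fit(model, train_dataloaders=train_loader)
```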

Embedding Generation & Real-time Inference

We generated embeddings for all existing products using the product tower of the trained model, and also hosted the product tower behind a SageMaker real-time inference endpoint to generate embeddings for newly added products. Additionally, we built and continue to maintain a Qdrant index over all product embeddings, facilitating approximate nearest neighbor lookup.
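
A simplified sketch of building such a collection with the Qdrant Python client follows; the URL, collection name, vector size, and the product_embeddings iterable are illustrative:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")  # placeholder URL

# Cosine distance matches the similarity the model was trained with.
client.recreate_collection(
    collection_name="products",
    vectors_config=VectorParams(size=256, distance=Distance.COSINE),
)

# product_embeddings: iterable of (product_id, vector) pairs produced
# by the product tower.
client.upsert(
    collection_name="products",
    points=[PointStruct(id=pid, vector=vec) for pid, vec in product_embeddings],
)
```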

Figure 2. Embedding Generation and Candidate Retrieval

Every time we receive a search query, the search retriever utilizes another SageMaker endpoint to execute the query tower, generating a query embedding. This query embedding is then used to search for the closest products through the Qdrant index, as depicted in Figure 2.
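
In sketch form, the retrieval step then reduces to an approximate nearest neighbor search; query_tower_endpoint and the candidate limit are assumptions:

```python
# query_tower_endpoint is assumed to wrap the SageMaker invocation and
# return the query embedding as a list of floats.
query_vector = query_tower_endpoint("baccho ke liye train toys")

hits = client.search(
    collection_name="products",
    query_vector=query_vector,
    limit=100,               # candidate count is illustrative
)
candidate_product_ids = [hit.id for hit in hits]
```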

We've successfully implemented the entire embedding-based retrieval (EBR) process with a P-99 latency of 80 ms. Out of this time, approximately 20 ms are attributed to generating the query embedding, while the remainder is dedicated to retrieving the nearest neighbors. We combine these candidates with results obtained from a lexicon matching engine and forward them to a ranker model optimized for conversion, ultimately producing the search results.

Achievements & Future Work

We categorize search queries into head, torso, and tail buckets, stratifying them by frequency. Notably, we observed the most significant enhancements in the tail bucket, where lexicon matching alone was insufficient to deliver relevant results.

Overall, we achieved substantial improvements:

  • 20% increase in order conversion for tail queries
  • 5% increase in orders via search
  • 0.8% increase in the platform's orders per visitor

With the deployment of this EBR system into production, we've established the groundwork for semantic search. This paves the way for future rapid iterations on model architecture and training to continuously enhance our search and business metrics. Several challenges and improvements that we aim to address with the current system include:

  • Incorporating additional modalities for products, such as image representations, reviews, and past interactions on the platform.
  • Exploring the use of other pre-trained language models such as MuRIL, XLM-R, and LEALLA for the search query and product text, and comparing their performance with IndicBERT.
  • Implementing multi-objective model training to prioritize products with higher conversion probabilities among a pool of products with similar relevance.

We express our sincere appreciation to Rajesh Kumar SA (Director of Data Science) and Debdoot Mukherjee (Head of AI) for their unwavering support and guidance. A big shout-out to the dedicated team members who played crucial roles in bringing this project to life: Srinivasa Rao Jami, Nidhi Singh, Rishabh Mishra, Shubham Gupta, Saksham Raj Seth, Piyush Kumar, Sujith Cheedella, Aniket Kumar, Aditya Kumar Garg and Vipin Gupta.