A brief take on why we chose the K-Nearest Neighbours algorithm over a brute-force technique

By Rohit Sah, Neha Mangal, Rochak Singh, Ashok Lingadurai

Imagine an online marketplace where millions of sellers and buyers transact on a daily basis:

  • 📦 Suppliers re-stocking the inventory by adding new product images;
  • ⭐️ Buyers uploading product reviews as images;
  • 👗 Users searching for that specific low-priced yellow coloured kurti via image search;

In such scenarios, the responsibility of product discovery using images becomes more and more challenging in the long run.

At Meesho, already home to millions of suppliers, resellers and direct buyers, we are constantly preparing for the future to give our users the very best experience and features.

In this blog, we discuss why and how we developed a faster way to identify similar product images in an autonomous manner. 🤖

Why is product discovery via images important?

Once you’ve started identifying products using a machine learning algorithm, the possibilities of leveraging that data are endless. For example, we can compare visually similar product images to understand the price gaps between them.

Furthermore, we could recommend visually similar products to users when their favourite wish-listed product is out of stock, and also alert the supplier to restock the inventory.

Sure, as a human you can tell a lot about a product just by looking at it. But the trick is to make a machine discover and predict those attributes automatically, so that manual effort is minimised.

KNN Algorithm — Explained

KNN (K-Nearest Neighbours) is an algorithm used to identify the closest approximate neighbours to an input data point. The input data can be of any format (in our case, images). For images, the comparison is done by converting each image into a vector in a multi-dimensional space.

Fig 1: KNN Algorithm Explained

We use this concept to identify the ‘K’ nearest visually similar images (neighbours) to an input product image. The neighbours are identified by comparing the input image vector against the trained index in the multi-dimensional space. The value of K is configurable and determines how far we want to expand the search results.
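At its core, the idea can be expressed as a brute-force search: compute the distance from the query vector to every indexed vector and keep the K closest. Below is a minimal Python sketch of that idea; the array sizes and the Euclidean metric are purely illustrative.

```python
import numpy as np

# Hypothetical toy data: 10,000 product vectors of dimension 80
rng = np.random.default_rng(0)
index_vectors = rng.normal(size=(10_000, 80))
query_vector = rng.normal(size=80)

def knn_brute_force(query, vectors, k=5):
    """Return the indices of the k vectors closest to the query (Euclidean)."""
    distances = np.linalg.norm(vectors - query, axis=1)  # distance to every point
    return np.argsort(distances)[:k]                     # indices of the k smallest

print(knn_brute_force(query_vector, index_vectors, k=5))
```

This exact search is what becomes too slow at our scale, which is why the rest of this post focuses on approximating it.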

When comparing an input image vector with an indexed data vector, there are numerous ways to calculate the distance between two points: Euclidean distance, Manhattan distance, Cosine distance, Hamming distance and Dot (Inner) Product distance are among the most commonly used metrics in a multi-dimensional space (a small sketch of these computations follows the list below). The quality of prediction depends on the chosen distance metric. The neighbour search itself can be achieved via two methods:

  • Exact search: Linear search, Space partitioning.
  • Approximate search: Greedy search in proximity neighbourhood graphs, locality sensitive hashing.
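To make the distance metrics above concrete, here is a small sketch of how a few of them can be computed between two vectors, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

print(distance.euclidean(a, b))                # straight-line (L2) distance
print(distance.cityblock(a, b))                # Manhattan (L1) distance
print(distance.cosine(a, b))                   # 1 - cosine similarity
print(np.dot(a, b))                            # dot (inner) product similarity
print(distance.hamming([1, 0, 1], [0, 0, 1]))  # fraction of differing positions
```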

Finding the right library to integrate KNN

We are using Annoy, an open-source library, to leverage KNN for our specific use cases. To provide a brief background: Annoy (Approximate Nearest Neighbours Oh Yeah) is a C++ library with Python bindings for searching for points in space that are close to a given query point. The best part of this library (apart from its speed) is that it produces large read-only file-based data structures that are mapped into memory, allowing several processes to access the same data.

We now have the flexibility to search using custom iterators (or vectors). This can be considered a trade-off in terms of accuracy; however, the library performs incredibly fast on larger datasets.

We use angular distance as the distance metric between two points in the multi-dimensional vector space.

How does it work? — Annoy builds a forest of n_trees trees.

  • Here, the forest is an ensemble of binary trees built over the indexed vectors, each tree partitioning the vector space with random splits. A higher number of trees gives higher precision while querying.
  • Similarly, search_k gives a run-time trade-off between accuracy and speed (here search_k is the number of tree nodes inspected during a query).
  • During a query, Annoy will inspect up to search_k nodes. The default value for search_k is n * n_trees, where n represents the number of nearest neighbours requested. We can also manually adjust the value of search_k as required (see the sketch below).
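Here is a minimal sketch of how an Annoy index is built and queried; the vector dimension, file name and parameter values are illustrative rather than our production settings:

```python
from annoy import AnnoyIndex
import random

dim = 80                                   # feature vector dimension (illustrative)
index = AnnoyIndex(dim, "angular")         # angular distance, as described above

# Add item vectors; in practice these come from the feature-generation step
for item_id in range(10_000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(dim)])

index.build(50)                            # n_trees: more trees => higher precision
index.save("products.ann")                 # read-only, mmap-able index file

# Query the 100 nearest neighbours of item 0, inspecting more nodes for accuracy
neighbours = index.get_nns_by_item(0, 100, search_k=100 * 50)
```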

Use Cases

At Meesho, we’ve been experimenting with the KNN concept in two major use cases:

1) Collaborative filtering based catalog recommendation

We trained on around 4M product records on a daily basis to classify product variations. Feature vectors for each product were generated using a matrix factorisation model of rank 80.

The overall requirement is to identify the top K nearest neighbours for each product (where the threshold for K is determined by the level of accuracy needed).
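As a rough sketch of how such feature vectors can be produced, here is what a rank-80 matrix factorisation looks like with Spark ALS; note that ALS, the column names and the S3 paths are assumptions for illustration, not necessarily our exact pipeline. The resulting item factors are the 80-dimensional vectors that later get indexed with Annoy.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("catalog-factors").getOrCreate()

# Illustrative schema: one row per (user_id, product_id, implicit interaction score)
interactions = spark.read.parquet("s3://example-bucket/interactions/")

als = ALS(rank=80, implicitPrefs=True,
          userCol="user_id", itemCol="product_id", ratingCol="score")
model = als.fit(interactions)

# model.itemFactors has columns (id, features); each 'features' array is an
# 80-dimensional product vector that can be fed into the Annoy index
model.itemFactors.write.parquet("s3://example-bucket/product-vectors/")
```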

2) Visual product recommendation

Based on features generated using the ResNet50 model (Keras), we trained the model with around 30M product images. The vector dimension was 2048, which was then compressed to 300 using PCA (PySpark ML PCA).

Requirement: to identify the top K visually similar product images for products uploaded daily to our marketplace.
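A simplified sketch of this feature-generation step follows, assuming Keras for the ResNet50 embeddings and PySpark ML for the PCA compression; the file paths and column names are illustrative.

```python
# --- Feature extraction with Keras ResNet50 (2048-d vectors) ---
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

cnn = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path):
    img = image.load_img(path, target_size=(224, 224))
    batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return cnn.predict(batch)[0]           # 2048-dimensional vector per image

# --- Dimensionality reduction with PySpark ML PCA (2048 -> 300) ---
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA

spark = SparkSession.builder.appName("image-pca").getOrCreate()

# Assumed layout: a parquet of (product_id, features) where 'features' is an ML vector
df = spark.read.parquet("s3://example-bucket/resnet50-features/")
pca = PCA(k=300, inputCol="features", outputCol="compressed")
compressed_df = pca.fit(df).transform(df)
compressed_df.write.parquet("s3://example-bucket/compressed-features/")
```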

Our Architecture — An Overview

  • Feature generation: First, a feature extraction model is used to create the features. This stage is carried out on an AWS EMR cluster, and the output is written to S3. Depending on the size of the feature dimension, an additional compression step may be required (PCA in our case, to compress the vector dimensions).
  • Annoy Data Indexing: This procedure is carried out using AWS SageMaker, which allows us to use our own custom algorithm and deploy it as a container. The generated indexed data is stored in S3 so that it can be reused across different product use cases. A new index is also produced whenever retraining is required (a simplified sketch follows this list).
  • Identify Approximate Neighbours: KNN results are generated for a test dataset on a SageMaker Annoy container, based on the requirements.
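A simplified sketch of the indexing step is shown below; the bucket names, keys and the feature file format are assumptions for illustration.

```python
import boto3
import numpy as np
from annoy import AnnoyIndex

s3 = boto3.client("s3")

# 1. Download the (compressed) feature vectors produced by the EMR step
s3.download_file("example-bucket", "features/products.npy", "/tmp/products.npy")
vectors = np.load("/tmp/products.npy")          # assumed shape: (num_products, 300)

# 2. Build the Annoy index over all product vectors
index = AnnoyIndex(vectors.shape[1], "angular")
for item_id, vec in enumerate(vectors):
    index.add_item(item_id, vec.tolist())
index.build(300)                                # n_trees

# 3. Persist the index back to S3 for the query containers to load
index.save("/tmp/products.ann")
s3.upload_file("/tmp/products.ann", "example-bucket", "indexes/products.ann")
```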

Benchmark Metrics — Lessons & Learning

Considering the scale at which the incoming product data was expanding over the months, we had to find a faster algorithm than the brute-force approach. Here are our benchmark results from using the Annoy library with different types of datasets:

  • For a 4M dataset with a feature dimension of 80 on an ml.m5.2xlarge SageMaker instance, indexing takes roughly fifteen minutes, and generating predictions for the complete dataset (100 nearest neighbours) takes around 1 hour.
  • For a 10M image-features dataset with a feature dimension of 300 on an ml.m4.10xlarge instance, indexing takes about two hours, and generating predictions takes less than 3 minutes for ~100k product images.

Fig 3: The above graph describes how tuning num_trees affects accuracy and initialisation time.

Constant parameters: 8 vCPU core machine, nns = 100, num_trees = 300, distance metric = angular, ~2M training dataset

For 300 trees and the top 100 (K) nearest neighbours on an 8 vCPU core machine, we observed a throughput of around 1,000 queries/second and a latency of less than a millisecond across cores.

This throughput can further be improved:

  • Through vertical scaling, which can be achieved by adding more cores.
  • Via horizontal scaling, distributing the data across multiple instances using SageMaker. The same Annoy index is loaded on each instance and can be used to improve the overall performance (see the sketch below).
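Because the Annoy index is a read-only, memory-mapped file, the same index can also be shared by several worker processes on one machine without duplicating it in RAM. A minimal sketch (the index path, dimension and worker count are illustrative):

```python
from multiprocessing import Pool
from annoy import AnnoyIndex

DIM = 300
INDEX_PATH = "products.ann"   # path to a previously built index (illustrative)
index = None

def init_worker():
    # Each worker process mmaps the same read-only index file; the OS shares
    # the underlying pages, so memory is not duplicated per process.
    global index
    index = AnnoyIndex(DIM, "angular")
    index.load(INDEX_PATH)

def query(item_id):
    return index.get_nns_by_item(item_id, 100)

if __name__ == "__main__":
    with Pool(processes=8, initializer=init_worker) as pool:
        results = pool.map(query, range(1000))
```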

Final Note

We found that KNN performed better & faster than the traditional brute-force method in our initial experiments. In addition, predicting & classifying product images proved useful in many of our internal use cases.

Lastly, we expect Meesho’s product & user volume to grow 10x by the end of 2021, and we are fine-tuning our infrastructure to meet this scale in the coming months as we adopt KNN.

Interested in working with our Data Science team to make an impact on billions of users? — Do check out Meesho’s Careers page for more opportunities.