Isn't it amazing how technology has advanced to a stage where we can find what we want with just an image – a picture is worth a thousand searches!

But how is it possible to query an image against millions of images within a fraction of a second 🫣? One of the key components of such a system is image matching using Vector Search, also known as Nearest Neighbour Search (NNS), which we’ll be discussing in this blog. After reading it you will have a good understanding of embeddings, vector similarity, vector databases and, finally, nearest neighbour search. And don’t worry, there are no prerequisites for this blog.

Just grab a cup of coffee and stay with me :)

Problem Realisation

Given an image, how do we find similar images from a collection of millions? Before we deep dive into a solution, let us first grasp a few concepts.
Let's start with a basic question:
If we have an arbitrary number, 2, and a list of positive integers [4, 6, 9, 3], how would you find the number closest to 2?




Well, it's quite simple! You iterate through the list, checking the difference between each number in the list and the input number. Wherever the distance is the minimum, that's the answer.

Time to think mathematically about what you did:

  1. Defined a distance function, in this case:
    distance(a, b) = |a - b|
  2. Defined closeness:
    closeness is nothing but -distance, meaning the element with the least distance is closest to the input value.
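The two steps above fit in a couple of lines of Python (a minimal sketch of the idea, nothing more):

```python
def nearest(query, candidates):
    # Closeness = negative distance, so the nearest element is the one
    # with the minimum absolute distance to the query.
    return min(candidates, key=lambda x: abs(x - query))

print(nearest(2, [4, 6, 9, 3]))  # → 3
```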

Now, imagine that instead of these numbers, you were given images. How would you approach it?
It's pretty straightforward: convert the images into numbers and apply the solution mentioned above. While converting an entire image into a single number isn't feasible, we can transform an image into a list of numbers, keeping everything else the same. This list of numbers representing an image is known as the image's embedding.

Embeddings - Lists of numbers that represent a given data-point (image, text, etc.) such that similar data-points have similar values in the list. These lists of numbers are also known as Vectors.


But wait, it looks pretty straightforward. Is it really that simple? 😊
The Devil is in the details. Even the most straightforward tasks can become daunting when undertaken at a large scale.

Don’t worry! Every cloud has a silver lining, and Vector Databases are the solution here.
Let’s first understand what a vector database is, and then we’ll establish some basic expectations for a scalable Image Matching Service.

Vector Database

We understand the concept of a traditional database, right? A vector database, by contrast, is designed to store vectors (embeddings) and to rapidly search for vectors similar to a target vector, returning results in milliseconds.

But, hold on for a moment. If a database contains 200 million vectors, how would we manage to search for a target embedding so quickly?
Well, readers, you already know the answer. Forget about the vector database for a moment. What do we typically do to accelerate our queries in an SQL database? We use indexing, that's right!

We create an index for the vectors to significantly enhance the speed of querying.

Vector Indexing

While a detailed discussion of vector indexing is beyond the scope of this blog, we'll attempt to explain it with an example. Imagine you visit a library in search of history books related to the Battle of Panipat. Let's consider how the books are arranged there to facilitate an efficient search:

  1. Locate the section containing various shelves such as History, Science, Fiction, etc.
  2. Within the History shelves, identify sections for Ancient History, Medieval History, or Modern History.
  3. In the Medieval History section, shelves might be organized by decades representing different time periods.
  4. Within a decade(time-period), numerous historical incidents are cataloged, including various wars.
  5. Among the cataloged wars, you're specifically interested in finding information about the Battle of Panipat.

Considering each book as an embedding (point), the same strategy in more sophisticated terms sounds like:

  1. Points that are related are connected to each other.
  2. Points are sparse in the top layers and dense in the bottom layers.
  3. Search algorithm, where embeddings are indexed in a similar fashion:
    a. Start searching from the upper layer.
    b. Greedily search through the current layer until a local minimum is reached.
    c. Switch the search to the lower layer, restarting from the point that was the local minimum in the previous layer.
    d. Repeat steps b and c until you have traversed all the layers.
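The layered greedy search can be sketched in a few lines of Python. The toy graph, 1-D "embeddings" and entry point below are made up purely for illustration; real HNSW implementations build these layers probabilistically and work in high-dimensional space:

```python
# Toy layered graph: each layer maps node -> neighbours. Upper layers are
# sparse, the bottom layer is dense (every point connected to every other).
coords = {"a": 1.0, "b": 4.0, "c": 7.0, "d": 9.0, "e": 5.5}
layers = [
    {"a": ["d"], "d": ["a"]},                            # top layer (sparse)
    {"a": ["b", "d"], "b": ["a", "d"], "d": ["a", "b"]}, # middle layer
    {n: [m for m in coords if m != n] for n in coords},  # bottom layer (dense)
]

def greedy_layered_search(query, entry="a"):
    current = entry
    for layer in layers:
        improved = True
        while improved:  # greedy walk until a local minimum in this layer
            improved = False
            for neigh in layer.get(current, []):
                if abs(coords[neigh] - query) < abs(coords[current] - query):
                    current, improved = neigh, True
        # fall through to the next (lower) layer from this local minimum
    return current

print(greedy_layered_search(6.2))  # → "e", the closest point to 6.2
```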

Congratulations!! We now understand the basics of one of the most commonly used vector indexing algorithms, HNSW (Hierarchical Navigable Small World). There are many open-source and licensed vector search solutions available today, a few being Qdrant, Pinecone, Weaviate, and the Faiss library. We opted for Qdrant for our use case due to its robust pre-filtering capabilities and efficient payload indexing support.

Pre-Filtering

Suppose you want to search for a WhatsApp image that someone sent you a few weeks back. Now, would you prefer scanning all the images present in your gallery or simply searching within the WhatsApp directory? It's pretty obvious, right? When it comes to searching, we want to apply as many filters as possible to narrow down our search space. Pre-filters are essentially filtering conditions that help narrow down the search space. In our case, we'll use pre-filters to query the input image based on specific categories. After all, there's no point in searching for a wallet among electronic items 😊.
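As a plain-Python sketch of the idea (not the Qdrant API; in Qdrant this is done by passing a filter on an indexed payload field along with the query vector), pre-filtering simply shrinks the candidate set before any distances are computed:

```python
# Each point is (embedding, payload); payloads carry the category we pre-filter on.
points = [
    ([0.9, 0.1], {"category": "wallets"}),
    ([0.1, 0.9], {"category": "electronics"}),
    ([0.8, 0.2], {"category": "wallets"}),
]

def search(query, category, k=1):
    # Pre-filter: keep only points whose payload matches the filter condition.
    candidates = [(vec, pl) for vec, pl in points if pl["category"] == category]
    # Rank the (much smaller) candidate set by squared distance to the query.
    dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, query))
    return sorted(candidates, key=lambda c: dist(c[0]))[:k]

print(search([1.0, 0.0], "wallets"))  # only wallets are ever compared
```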

Vector Sizing Maths

Let’s say you are using float-32 vectors of dimension 128. How do we know the size (in KB) of one vector? Each float-32 value takes 4 bytes, so one vector takes 128 × 4 = 512 bytes, i.e. 0.5 KB.
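The same arithmetic as a reusable helper (assuming 1 KB = 1024 bytes):

```python
def vector_size_kb(dims, bytes_per_value=4):
    # float-32 = 4 bytes per value; 1 KB = 1024 bytes
    return dims * bytes_per_value / 1024

print(vector_size_kb(128))  # → 0.5
```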

Proposed Solution

In this section we’ll deep dive into the end-to-end solution.

Requirements


Basic understanding of Qdrant

If you are already familiar with Qdrant, feel free to skip this section.

1. Point - Data containers within Qdrant that hold the data.

2. Payload - Additional metadata on which you want to apply pre-filters.

3. Collection - A set of points; within a collection you can store multiple vectors with payloads, each mapped to a unique name. While creating a collection we should know some basic and some advanced parameters.

4. Storage - How does Qdrant store the data (vectors/payload/index)?
It stores all the data in segments that are generally non-overlapping. We can manually specify the number of segments, but Qdrant will readjust it if needed. The number of segments has a direct implication on search latency: the more segments, the better the latency, but this has an adverse effect on indexation time.
It provides two different storage options for indexes (vectors/payload):

  • In-memory (RAM) storage - Definitely super fast, but at the expense of cost.
  • Disk storage - Stores data on disk and uses the page cache to access the contents of the file. Performance varies with the quality of the disk; for example, NVMe/SSD should perform better than a general disk (HDD).

5. Indexing - Qdrant supports indexing of both payloads and vectors.

  • Payload Indexing - Equivalent to conventional indexing in document-oriented databases, primarily used for filtering.
  • Vector Indexing - Qdrant does vector indexing using HNSW, the most widely used indexing algorithm. HNSW has two parameters that we need to tune through experiments: m (the number of edges per node in the graph) and ef_construct (the number of neighbours considered while building the index).

Hence, using the Vector Sizing Maths above, the default threshold corresponds to 20K vectors. So until our total number of points exceeds 20,000, a full scan will be used rather than the index.

  • Optimizers - Qdrant provides many optimizers to improve performance, but the one of interest to us is the Indexing Optimizer.
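Putting these knobs together, a collection configuration might look like the sketch below (a plain Python dict mirroring Qdrant's collection settings; the specific values are illustrative assumptions, not recommendations):

```python
# Illustrative collection settings, mirroring Qdrant's config fields.
collection_config = {
    "vectors": {"size": 128, "distance": "Cosine"},  # 128-dim float-32 vectors
    "hnsw_config": {
        "m": 16,              # edges per node in the HNSW graph
        "ef_construct": 100,  # neighbours considered while building the index
    },
    "optimizers_config": {
        # Segments smaller than this threshold (in KB) are not indexed;
        # set it very high during bulk ingestion to defer indexing.
        "indexing_threshold": 20000,
    },
}
```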

Alright!! Now that we know enough about Qdrant, it’s time to put it into action.

System Design

  • Phase 1 - One-time bulk data ingestion and indexing
    [Diagram: qdrant-blog-Page-2.drawio.png]
  • Phase 2 - Querying the input image and then indexing it
    [Diagram: qdrant-blog-Page-3.drawio.png]

Which EC2 instance should we choose in a standalone setup?

Let’s say we need to bulk insert 200M image embeddings, with each embedding being a 128-dimensional float-32 vector (0.5 KB).
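A back-of-the-envelope memory estimate for this setup (the vector counts and sizes come from the text; the ~1.5× overhead factor for the HNSW graph and payloads is our assumption):

```python
num_vectors = 200_000_000
vector_bytes = 128 * 4             # 128 float-32 dims = 512 bytes = 0.5 KB

raw_gb = num_vectors * vector_bytes / 1024**3
total_gb = raw_gb * 1.5            # assumed overhead for HNSW graph + payloads

print(round(raw_gb), round(total_gb))  # ≈ 95 GiB raw, ≈ 143 GiB with overhead
```

For in-memory indexing this points at a memory-optimized instance with a few hundred GiB of RAM, which is consistent with the r6g.12xlarge (384 GiB) used in the experiments below.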

Knowledge Sharing

  • Bulk Ingestion - During bulk ingestion, it is advisable to avoid indexing, as it may significantly prolong the process. Hence, set indexing_threshold and memmap_threshold_kb (if you plan to use it) to large values.
  • Indexing - After data ingestion, we need to trigger indexing. However, determining the indexing_threshold required for this remains a question: how should we proceed?
    Well, it depends on how frequently you’ll update your data.

→ One-time ingestion with no frequent data updates planned:
    → If the data size is smaller (e.g. < 1L), indexing is overkill; Qdrant will work better without it.

    → If the data size is larger (e.g. > 1L), you do need indexing. The default indexing threshold (20K Kb) will be sufficient.

→ Batch update (once a day) - The default indexing threshold (20K Kb) will be sufficient.

→ Real-time/near-real-time update - which means the current query (image) should be queryable almost immediately after ingestion.

[Simulation graph: non-indexed vector count over time while querying and ingesting at 200 RPS]

Let’s say your throughput is 200 QPS.
Size of vectors (in KB) that we are querying per second = 200 × 0.5 KB = 100 KB.

In an optimal scenario, for Qdrant to perform real-time indexing, the threshold should be set close to 100 KB. However, consider the simulation above, where the same data is both queried and ingested at a rate of 200 RPS. It takes approximately five minutes for the non-indexed vector count to reach zero, which indicates that newly ingested vectors are indexed within this five-minute timeframe. Therefore, in this situation, there is a delay of about five minutes even when keeping indexing_threshold_kb as low as 100 KB.
Also, if you are planning for near-real-time indexing with < 15 minutes of delay, we won’t recommend memmap-based indexing: in our experiment, memmap took > 1 hr to index what in-memory indexing did in < 10 mins.
You can play around with the maths above to determine which threshold works best.

  • But wait, if newly ingested vectors are not indexed in real time, are they not queryable for five minutes?
    Fortunately, this is not the case. In another simulation, we continued to ingest vectors from one system while simultaneously querying the same data from another. Qdrant was able to retrieve all the ingested data, suggesting that Qdrant employs a brute-force approach to query non-indexed vectors.
  • Recall benchmarking
    Why do we need it? Because a vector database does not always perform a full-scan/brute-force search; it uses indexing, right!! So while optimizing query timing, we need to make sure we aren't losing the desired results.
    Recall: how many of the true matches can we catch?
    Ground truth generation:
    Use brute-force search to find the top 100 matches for a given query image within a category-id. We can do brute-force search within Qdrant using the exact param in search.

Now, if we do the same with exact=False, which lets Qdrant use the index, and then check how many of the results match the actual top 100, that is our recall.
We did this for ~25k images/category for our top 10 categories, and the results are provided in the table below.
Average recall is > 98%
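The recall computation itself is a simple set intersection. A sketch in plain Python (here `truth` stands in for ids from an exact=True search and `approx` for ids from the default indexed search; both lists are made up for illustration):

```python
def recall_at_k(ground_truth, approx):
    # Fraction of true top-k ids that the approximate search also returned.
    return len(set(ground_truth) & set(approx)) / len(ground_truth)

# Toy example: the indexed search recovers 98 of the true top-100 ids.
truth = list(range(100))
approx = list(range(98)) + [200, 201]
print(recall_at_k(truth, approx))  # → 0.98
```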

  • Performance analysis
    Where does Qdrant stand in terms of latency and throughput?
    In our experiment, we used one r6g.12xlarge for 200M images with in-memory indexing, and the results are as below.

Clearly, latency depends on how many nearest neighbours we want to fetch.

Conclusion

  1. Qdrant is one of the best open-source vector DBs, with good support for pre-filters and payload indexing.
  2. It supports both in-memory and disk-based indexing.
  3. After a bit of tweaking, it can support near-real-time indexing.
  4. If deployed in a distributed setup with proper sharding and replication, one can design a Qdrant-based scalable centralised Vector Similarity Search (VSS) platform. In our next blog we’ll discuss how we have implemented our centralised VSS platform.

I am grateful to Dhaval Parmar and Srinivasa Rao Jami for their exceptional mentoring during this journey. Special mention to Rajesh Kumar SA, Director of Data Science at Meesho, and Debdoot Mukherjee, Head of AI at Meesho for providing me the opportunity. Also, Special thanks to Bhaskar Yarabala, Senior Technical Writer at Meesho for your valuable input and review of the blog.