Most online businesses depend on accurate data models. These models help predict user behaviour and trends, and help businesses decide the future course of action. However, there’s one (often unaccounted for) phenomenon that can make predictions unreliable and throw a wrench in the plans: Data Drift.
Data drift can be a tricky topic to understand and tackle, so explanations are in order.
In this article, I’ll explain the basics of data drift — what it is, why it’s critical to account for it, how to detect if there is data drift in your models, and how to use it to decide when to train the next iteration of your data models. To help you better understand how it works in real-world situations, I’ll supplement the theory with examples of how we handle it at Meesho.
What is data drift and why do we need to account for it?
Data drift is the deviation of the data a model sees during inference from the data it was trained on. As a real-world example, let’s consider a feature that keeps track of the number of orders placed by a user in the last 3 months.
Assume that while training the model, the domain of the feature was [1, 5] — in other words, any given user placed at most 5 orders in the last three months. As the app rose in popularity, the domain grew to, say, [1, 30]. That is, the maximum number of orders per user in the last three months got as high as 30.
Since we now have values which weren’t in scope during training, we conclude that there’s a drift in this feature’s data.
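As a quick illustration of the example above, here is a minimal sketch that measures how many inference values fall outside the range seen during training. The order counts and the helper name `out_of_range_share` are made up for illustration and are not taken from our production code.

```python
def out_of_range_share(train_values, inference_values):
    """Return the fraction of inference values outside the training range."""
    low, high = min(train_values), max(train_values)
    outside = [v for v in inference_values if v < low or v > high]
    return len(outside) / len(inference_values)


# Hypothetical order counts per user over the last three months.
train_orders = [1, 2, 2, 3, 5, 1, 4]           # training range is [1, 5]
inference_orders = [1, 3, 7, 12, 30, 2, 5, 4]  # range has grown to [1, 30]

share = out_of_range_share(train_orders, inference_orders)
print(f"Share of inference values outside the training range: {share}")
```

A range check like this only catches values that leave the training range. It says nothing about shifts that happen inside the range, which is why the rest of the article compares full distributions.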
Not accounting for data drift causes degradation in the model’s performance. As mentioned before, modern online businesses depend on accurate data models and this degradation leads to inaccurate predictions, causing non-optimal policy decisions and loss in revenue.
Is there a way to tame data drift?
Definitely! If data drift is detected accurately and on time, it can be tamed before it hurts model performance. Detection helps in two ways:
- Analysing data drift helps perform root cause analysis (RCA) and get deep insight into the reason(s) for the feature to deviate from its original behaviour
- Detecting a large amount of data drift helps decide the right time to deploy new, more effective models
How to compute data drift
To compute data drift, we first capture the distribution of each feature during inference and compare it to the distribution of the same feature in the training dataset. We estimate both distributions with kernel density estimation and then use them to compute a divergence score. Let’s explore this process in more detail.
Obtaining probability distributions for features used in inference vs training
Once we’ve obtained both training and inference data, we make sure that the feature has the same data type in both datasets. Afterwards, we use statistical methods to calculate the drift between the two probability distributions.
Yes, I said probability distributions. Bear with me because this is going to get nerdy 🤓.
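Recall the order tracking feature example mentioned in the beginning. It has a numeric data type. Strictly speaking, an order count is a whole number, but for drift computation we treat numeric features like this one as continuous random variables, which can take values from an infinite and uncountable set.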
Estimating the kernel density
Next, we pick the same feature from both datasets and fit a kernel density estimator (KDE) with a Gaussian kernel to each set of values. We then evaluate both estimates over the same domain to obtain two comparable probability distributions.
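The sketch below shows one way this step could look, using only the Python standard library. It is an illustration under assumptions, not our production code: the fixed `bandwidth` of 1.0, the 200-point grid, the sample values, and the function names are all hypothetical. In practice, a library implementation such as `scipy.stats.gaussian_kde` would likely be a better fit.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE fitted to `samples`."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def distributions_on_shared_grid(train, inference, bandwidth=1.0, points=200):
    """Evaluate both KDEs on one common grid and normalise each to sum to 1."""
    # The grid spans both datasets, padded so the kernel tails are included.
    low = min(min(train), min(inference)) - 3 * bandwidth
    high = max(max(train), max(inference)) + 3 * bandwidth
    step = (high - low) / (points - 1)
    grid = [low + i * step for i in range(points)]

    def normalised(samples):
        kde = gaussian_kde(samples, bandwidth)
        densities = [kde(x) for x in grid]
        total = sum(densities)
        return [d / total for d in densities]

    return grid, normalised(train), normalised(inference)


# Hypothetical order counts for the same feature in both datasets.
train_orders = [1, 2, 2, 3, 5, 1, 4]
inference_orders = [1, 3, 7, 12, 30, 2, 5, 4]
grid, p_train, p_inference = distributions_on_shared_grid(train_orders, inference_orders)
```

Evaluating both estimates on the same grid matters: the two lists then line up point by point, so they can be compared directly by a divergence measure.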
We use a Gaussian kernel because we assume the feature is approximately normally distributed, an assumption motivated by the central limit theorem. According to this theorem, the properly normalised sum of many independent random variables tends toward a Gaussian (or normal) distribution, even if the original variables themselves are not normally distributed. Note that a KDE with a Gaussian kernel can still represent non-normal shapes, since the kernel only determines how each observation is smoothed.
Calculating the final drift (or divergence) score
When talking about divergence, one usually thinks of Kullback–Leibler divergence (KL divergence). However, to calculate the score, we need a measure that is symmetric, i.e., D(P||Q) = D(Q||P). In other words, if distribution P is at some distance from distribution Q, then Q must be at the same distance from P.
KL divergence does not satisfy this condition. It is not symmetric, so the divergence of P from Q is generally not equal to the divergence of Q from P.
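To make the asymmetry concrete, here is a small sketch that computes KL divergence in both directions for two discrete distributions. The two distributions are arbitrary illustrative values, and the function assumes `q` is non-zero wherever `p` is non-zero.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two arbitrary distributions over the same two outcomes.
p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_divergence(p, q)   # D(p || q)
backward = kl_divergence(q, p)  # D(q || p)
print(f"D(p||q) = {forward:.3f}, D(q||p) = {backward:.3f}")
```

The two numbers differ, so the resulting score would depend on which dataset is treated as the reference.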
The two most popular measures that offer this symmetry while serving the same purpose as KL divergence are:
- Bhattacharyya distance
- Jensen-Shannon (or JS) distance
A major difference between the two is that Bhattacharyya distance does not satisfy the triangle inequality, while JS distance does. Since the triangle inequality is required for a statistical distance to be considered a metric, we use the latter.
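The difference can be checked on a small example. The sketch below compares the direct distance between two distributions `p` and `r` with the detour through a third distribution `q`. The three distributions are hand-picked for illustration, and both functions are simple reference implementations for discrete distributions.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance: negative log of the Bhattacharyya coefficient."""
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(coefficient)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence (base-2 logs)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))


# Hand-picked distributions: q sits "between" p and r.
p = [0.99, 0.01]
q = [0.50, 0.50]
r = [0.01, 0.99]

for name, dist in [("Bhattacharyya", bhattacharyya_distance), ("Jensen-Shannon", js_distance)]:
    direct = dist(p, r)
    via_q = dist(p, q) + dist(q, r)
    print(f"{name}: direct={direct:.3f}, via q={via_q:.3f}, triangle holds={direct <= via_q}")
```

For the Bhattacharyya distance, the direct distance is larger than the detour through `q`, which violates the triangle inequality. For the JS distance, the inequality holds on the same three distributions.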
Finally, to compute the drift score for the feature, we feed the two probability distributions obtained from the KDE step into the JS distance. With base-2 logarithms, this score lies in the range [0, 1], where 0 means that the values of the feature in both datasets come from the same distribution, while 1 means that the two distributions do not overlap at all, i.e., the drift is as large as it can be.
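Here is a minimal sketch of the score and its range, assuming the two distributions have already been evaluated on a common grid and normalised to sum to 1. The input vectors are illustrative. If SciPy is available, `scipy.spatial.distance.jensenshannon` with `base=2` computes the same quantity.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, so the result lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


# Identical distributions: no drift.
same = js_distance([0.2, 0.3, 0.5], [0.2, 0.3, 0.5])

# Distributions with no overlap: maximum drift.
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])

# Partially overlapping distributions: somewhere in between.
shifted = js_distance([0.6, 0.3, 0.1, 0.0], [0.0, 0.1, 0.3, 0.6])

print(same, disjoint, shifted)
```

Note that the upper bound of 1 depends on the logarithm base. With natural logarithms, the maximum JS distance is the square root of ln 2, roughly 0.83.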
While every prediction model has different requirements and context, the rule of thumb is that the higher the drift score, the more critical it is to tweak the model to continue getting accurate predictions.
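As a rough sketch of how this rule of thumb could be applied, the snippet below flags features whose drift score exceeds a threshold. The threshold of 0.3, the feature names, and the scores are all hypothetical. A sensible cut-off depends on the model and has to be chosen per use case.

```python
def features_needing_attention(drift_scores, threshold=0.3):
    """Return features whose drift score exceeds the threshold, highest first."""
    flagged = {name: score for name, score in drift_scores.items() if score > threshold}
    return sorted(flagged, key=flagged.get, reverse=True)


# Hypothetical per-feature JS distances between training and inference data.
scores = {
    "orders_last_3_months": 0.62,
    "avg_cart_value": 0.08,
    "days_since_signup": 0.35,
}

flagged_features = features_needing_attention(scores)
print(flagged_features)
```

A list like this can be used to prioritise root cause analysis or to decide when to retrain the model.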
Closing words
I hope this article helped you understand the basics of data drift. No matter how much (or little) data your model currently handles, it helps to have at least a working knowledge of this topic.
In Meesho’s case, we deal with millions of feeds daily and these feeds change frequently depending on the behaviour of our users. Capturing the data drift score has been essential in keeping our prediction models up to date.
Data drift can also be used as an alerting mechanism for model owners to do RCA and figure out the cases that the model doesn’t currently cover. Additionally, it helps to detect if some out-of-scope values have come into one of the features.
Shout out to Hardik and Siddharth for working with me on this project and making it a success.
Meesho is hiring! If you found this article interesting and wish to pursue a career in data science, you’re in luck as we have multiple data science job openings! Join us and help us in our quest to bring e-commerce to the next billion users.
Credits
Author: Lokesh Todwal
Reviewed by: Debdoot Mukherjee, Rajesh Kumar SA and Debashis Mukherjee
Edited by: Shivam Raj