Backtesting through time is an effective model validation technique. When the data has time-based patterns and cross-validation is not an option for evaluating model performance, we backtest through time instead.
The idea is to train a machine learning model on data up to a point in time ‘t’, and validate its performance on the period immediately following ‘t’, for which the actual labels are already known. Backtesting shines on problems such as demand forecasting, stock-price prediction, and learning user preferences up to a point in time to evaluate how much a new recommendation algorithm improves performance.
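As an illustration, here is a minimal sketch of such a time-based split in Python; the interaction log, column names, and window length are hypothetical.

```python
import pandas as pd

# Hypothetical interaction log: one row per user-product event with a timestamp.
interactions = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "product_id": [10, 11, 10, 12, 13],
    "clicked":    [1, 0, 1, 1, 0],
    "event_date": pd.to_datetime(
        ["2023-01-02", "2023-01-05", "2023-01-20", "2023-02-03", "2023-02-10"]
    ),
})

# Train on everything up to time t, validate on the window right after t,
# for which the true labels are already known.
t = pd.Timestamp("2023-01-31")
validation_window = pd.Timedelta(days=7)

train_df = interactions[interactions["event_date"] <= t]
valid_df = interactions[(interactions["event_date"] > t) &
                        (interactions["event_date"] <= t + validation_window)]
```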
In this blog, we will look at how we utilized backtesting in a recommendation context to find the right balance between the feed’s relevance and platform ad revenue. We will also cover certain best practices for performing backtesting efficiently and avoiding common pitfalls.
Our Problem Statement
Meesho hosts millions of items that cater to the needs of millions of users transacting every day. Sellers often opt for paid advertisements to promote their products. As we continuously improve every aspect of the platform, the challenge here was to find the right balance between the feed’s relevance and platform ad revenue (coming from the advertised products).
We already have sophisticated machine learning models to predict the relevance of a product for a user, which we term the relevance score. To find a good balance between relevance and platform ad revenue, we wanted to find the right weight for blending the terms that optimize for the two objectives:
w1 * relevance_goodness + (1-w1) * platform_ad_revenue
The relevance goodness comes from sophisticated machine learning models that predict the relevance of a product for a user (the probability of a click and an order). CPC, the cost-per-click of the advertised product, contributes to the platform ad revenue term. Throughout the blog, we will refer to w1 as the relevance weight, since it favors the feed’s relevance over platform ad revenue.
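For concreteness, here is a minimal sketch of how such a blended ranking score could be computed. The simple max-based CPC scaling is an assumption to keep both terms on a comparable scale, not necessarily what runs in production.

```python
def blended_score(relevance_score: float, cpc: float, w1: float, max_cpc: float) -> float:
    """Blend relevance and ad revenue into one ranking score for an advertised product.

    relevance_score: model output in [0, 1] (e.g. predicted click/order probability).
    cpc:             cost-per-click of the advertised product.
    w1:              relevance weight; higher values favour feed relevance.
    max_cpc:         assumed scaling constant so CPC also lands in [0, 1].
    """
    revenue_term = cpc / max_cpc if max_cpc > 0 else 0.0
    return w1 * relevance_score + (1 - w1) * revenue_term


# w1 = 0.8 leans towards relevance; w1 = 0.2 leans towards ad revenue.
score = blended_score(relevance_score=0.42, cpc=3.5, w1=0.8, max_cpc=10.0)
```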
The main challenges for which we utilized backtesting are the following.
- How does the choice of weights trade off relevance against platform ad revenue in the given equation?
- If we run an A/B test to measure the variant’s performance, what would be a good choice of weights, given that online experiments are costly and carry real stakes?
We were able to answer both questions by backtesting through time. Next, we will share our learnings as good practices for performing such backtesting and avoiding pitfalls, along with some results for the problem described above.
Good Practices while doing Backtesting
Backtest the model performance on multiple timeframes
It’s always a good practice to validate model performance over more than one timeframe to remove any bias from special events, holidays, or seasonal effects. Backtesting through time over multiple timeframes also builds confidence in the model’s performance.
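One way to set this up is to generate several rolling train/validation cut-offs and repeat the same backtest over each of them. A sketch follows, with window lengths chosen purely for illustration.

```python
import pandas as pd

def backtest_windows(start, end, train_days=30, valid_days=7, step_days=7):
    """Yield (train_start, cutoff, valid_end) boundaries for rolling backtests.

    Evaluating several windows keeps a single holiday or sale event
    from dominating the conclusion.
    """
    cutoff = pd.Timestamp(start) + pd.Timedelta(days=train_days)
    end = pd.Timestamp(end)
    while cutoff + pd.Timedelta(days=valid_days) <= end:
        yield cutoff - pd.Timedelta(days=train_days), cutoff, cutoff + pd.Timedelta(days=valid_days)
        cutoff += pd.Timedelta(days=step_days)


for train_start, t, valid_end in backtest_windows("2023-01-01", "2023-03-31"):
    # Train on [train_start, t], evaluate on (t, valid_end] for every window,
    # then compare metrics across windows.
    print(train_start.date(), t.date(), valid_end.date())
```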
Take Backtesting results with a pinch of salt
Backtesting gives a directional indication of model performance, not always an exact one. For example, when the recommendation algorithm is changed to give less weight to relevance, backtesting shows platform ad revenue increasing steadily, which may not always hold in reality.
If the relevance of the feed deteriorates beyond a point, customers may lose interest and churn. What backtesting does give us is the rate at which relevance deteriorates and platform ad revenue increases as the relevance weight in the equation decreases, so we can choose sensible values for A/B tests.
The Backtesting data should be representative of the expected future data
For the recommendation use case, our intuition is that as we decrease the relevance weight, the feed’s relevance will deteriorate and platform ad revenue will increase; the reason to backtest is to find the rate of deterioration and the rate of revenue increase. If the pattern turns out to be counter-intuitive, it is an indication that the model or the backtesting validation is not set up right. In such cases, the following things can be checked.
- Model training happened across a sufficiently long timeframe, let’s say 15–30 days.
- Backtesting is performed on a long enough timeframe to correctly capture the interaction data (what was viewed, clicked, ordered, shared, or returned) and the model’s performance on it.
- While backtesting, the interaction data comes from the entire app rather than some specific part/section of it, which may otherwise contribute randomness to the pattern.
- To smooth the patterns and results further, we can calculate metrics like MAP, MRR, NDCG, and revenue at different values of k (k denoting the slot) and take an average, as shown in the sketch after this list.
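A minimal sketch of averaging ranking metrics over several slot cut-offs; the binary labels below are hypothetical, and MRR or revenue can be averaged the same way.

```python
import numpy as np

def average_precision_at_k(labels, k):
    """AP@k for one feed; `labels` are binary (interacted / not) in ranked order."""
    hits, score = 0, 0.0
    for i, rel in enumerate(labels[:k]):
        if rel:
            hits += 1
            score += hits / (i + 1)
    denom = min(sum(labels), k)
    return score / denom if denom > 0 else 0.0

def ndcg_at_k(labels, k):
    """NDCG@k with binary gains."""
    gains = np.asarray(labels[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))
    idcg = float(np.sum(np.sort(gains)[::-1] * discounts))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranked feed for one user: 1 = interacted, 0 = not.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
ks = [3, 5, 10]
avg_map = np.mean([average_precision_at_k(labels, k) for k in ks])
avg_ndcg = np.mean([ndcg_at_k(labels, k) for k in ks])
```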
In our case, the pattern was counter-intuitive, and on a deep dive we found the reason: we were looking at interaction data from only a part of the app. Once we included data from the overall app, the expected pattern became evident and we got what we wanted (the rate at which the feed’s relevance drops and platform ad revenue rises).
For the form w1 * relevance_goodness + (1-w1) * platform_ad_revenue, this is how changing the weight affects the different metrics.
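Below is a minimal sketch of the kind of offline weight sweep that produces such a trade-off curve. The candidate scores, CPCs, and labels are purely illustrative, and revenue is proxied by summing CPC over the top slots.

```python
# (relevance_score, scaled_cpc, interacted_in_backtest_period) for one user's candidates.
candidates = [
    (0.90, 0.10, 1),
    (0.70, 0.80, 0),
    (0.55, 0.60, 1),
    (0.40, 0.95, 0),
    (0.20, 0.30, 0),
]

def map_at_k(ranked, k=3):
    hits, score = 0, 0.0
    for i, (_, _, label) in enumerate(ranked[:k]):
        if label:
            hits += 1
            score += hits / (i + 1)
    denom = min(sum(label for _, _, label in ranked), k)
    return score / denom if denom > 0 else 0.0

def revenue_proxy(ranked, k=3):
    """Toy proxy for platform ad revenue: CPC captured in the top-k slots."""
    return sum(cpc for _, cpc, _ in ranked[:k])

# Sweep the relevance weight and trace how relevance and revenue move against each other.
for w1 in [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]:
    ranked = sorted(candidates, key=lambda c: w1 * c[0] + (1 - w1) * c[1], reverse=True)
    print(f"w1={w1:.1f}  MAP@3={map_at_k(ranked):.3f}  revenue~{revenue_proxy(ranked):.2f}")
```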
For the A/B experiment, we selected the relevance weight by keeping the relevance drop (in terms of Mean Average Precision) within 3%, which corresponded to a platform ad revenue increase of close to 9% during backtesting.
Conclusion
In this blog, we covered when to utilize backtesting through time and the best practices around it. We have used this technique across multiple projects and experiments and seen significant improvements.
Credits
- Special thanks to Parth Gautam and Rahul Kumar for working closely on the project and Ravindra Kumar Yadav and Debdoot Mukherjee for their guidance.