We do about 90 million orders every month — roughly 3 million orders a day.
But these are just conversions. For every confirmed order, we have millions of people browsing feeds.
To increase conversions, i.e., translate window shopping into a purchase, we study every user’s behaviour, patterns, and habits — all classified as interactions.
This helps with a more refined search, and in turn, conversions.
Every interaction is a data point.
We’ve grown 100x in 3 years. But so have interactions on our platform.
More interactions = more data.
Our earlier data ingestion system could not keep up with our rapid growth. We had to create a solution from scratch that could handle a load 3 times this size.
ReactiveX did just that.
Status Quo
Ingestion-Service was initially built as a simple web service that took a REST API request and inserted the payload into a single Kafka topic. The container was Tomcat, running Spring Boot.
No security layer was added, as this service was hosted in a demilitarised zone and authentication was already handled by another layer.
This was a fairly straightforward service. It could handle 350-450K requests per minute (RPM) per instance (c5.2xlarge: 8-core CPU & 16 GB RAM).
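For illustration, the imperative shape looked roughly like the minimal sketch below. An in-memory bounded queue stands in for the Kafka producer, and the class and method names are ours, not the actual service's; the key property is that each request is handled on its own container thread, which blocks on the hand-off.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch of the original imperative shape: one container
// thread per request, blocking until the payload is handed off. A bounded
// in-memory queue stands in for the Kafka topic.
public class ImperativeIngestSketch {
    private final BlockingQueue<String> topic = new ArrayBlockingQueue<>(10_000);

    // Called once per HTTP request; the caller's thread does the work.
    public boolean ingest(String payload) {
        return topic.offer(payload); // false when the buffer is full
    }

    public int pending() {
        return topic.size();
    }

    public static void main(String[] args) {
        ImperativeIngestSketch svc = new ImperativeIngestSketch();
        svc.ingest("{\"event\":\"click\"}");
        System.out.println(svc.pending()); // prints 1
    }
}
```

The ceiling is visible in this shape: throughput is bounded by how many such threads the container can keep alive at once.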
This worked well for our limited needs at the time, which were only 1/100th of our final load.
However, we still had occasional hiccups. Sometimes, these resulted in catastrophic failures across the whole stack. Here’s why:
- Limited concurrency capability per instance.
- Difficulty in handling sudden 3-4x request spikes.
- Too many threads stuck in socket wait on Linux, leading to machine failures.

We couldn’t scale this to 100% of our target load: the increase in computational power required would have been outrageous.
We needed a solution that was built with high throughput messaging in mind.
Enter: ReactiveX Programming.
ReactiveX Programming
ReactiveX programming is a declarative programming paradigm concerned with data streams and the propagation of change. With this paradigm, it's possible to express static (e.g., arrays) or dynamic (e.g., event emitters) data streams with ease.
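The paradigm is easiest to see in code. The minimal sketch below uses the JDK's built-in `java.util.concurrent.Flow` API (Project Reactor, which we ultimately used, offers a far richer operator set, but the publish/subscribe shape is the same): a publisher emits events dynamically, and a subscriber declares what should happen to each one.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.SubmissionPublisher;

// Minimal reactive-stream sketch using the JDK's built-in Flow API.
public class StreamSketch {

    public static List<Integer> doubledEvents(List<Integer> events) {
        List<Integer> out = new CopyOnWriteArrayList<>();
        SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>();
        // Declare what happens to each event; work runs as items arrive.
        CompletableFuture<Void> done = publisher.consume(e -> out.add(e * 2));
        events.forEach(publisher::submit); // dynamic event emission
        publisher.close();                 // signals completion downstream
        done.join();                       // wait for the async pipeline to drain
        return out;
    }

    public static void main(String[] args) {
        System.out.println(doubledEvents(List.of(1, 2, 3))); // [2, 4, 6]
    }
}
```

Note that the consuming code never manages threads or blocking queues itself; it only describes the transformation, and the runtime propagates each change through the pipeline.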
The ReactiveX programming paradigm has been around for decades. However, not every stack has had end-to-end support built for it. As the paradigm matured, so did the clients we depend on.
The library of our choice was Project Reactor. For serving requests, we used Spring WebFlux combined with reactive Kafka clients.
We did a bunch of performance tests on the new stack.
Here’s what we found:
- Max latencies here show the values from the JMeter report. These latencies include travel time through the cloud layers, not just application-specific time.
- Average latencies are the same as above, just averaged over all the requests.
- Average CPU percentage helped us evaluate how CPU performance degrades as request counts increase. As you can see, the imperative stack touched ~75% at around 400K requests. With ReactiveX, we are able to go beyond 1M!
- CPU Context Switches: in a multitasking context, a context switch is the process of storing the system state for one task so that the task can be paused and another resumed. Context switches are computationally expensive; much of operating-system design goes into optimising their use. As we can see, these numbers are far lower for ReactiveX than for the imperative stack at the same request counts.
- CPU Interrupts: an interrupt (sometimes referred to as a trap) is a request for the processor to interrupt the currently executing code (when permitted) so that an event can be processed in a timely manner. An interrupt is usually followed by a context switch, so the lower the number, the better the performance. The graph below shows how much better the reactive cases fare.
- TCP TW: TCP TIME_WAIT is normal TCP protocol operation. After delivering the last FIN-ACK, the client side waits for twice the maximum segment lifetime (MSL) to be sure the remote TCP received the acknowledgement of its connection-termination request. (The client here is the Spring application, from the Linux operating system's perspective.) With the ReactiveX approach these numbers are very small, because requests are handled immediately on arrival rather than from a fixed thread pool as in the imperative case.
- TCP In Use: the TCP sockets currently in use. These are not limited by the container's network thread pool, so the service can process requests concurrently, as shown in the graph.
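The thread-count point behind these numbers can be made concrete with a toy comparison (not our production code): a thread-per-request model pins one OS thread to every in-flight request, while an event-loop model multiplexes all of them onto a single thread, which is exactly why context switches and socket-wait threads drop.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy comparison: thread-per-request vs. event-loop thread usage.
public class ThreadModelSketch {

    // One OS thread per request: N requests cost N threads.
    public static int threadPerRequest(int requests) throws InterruptedException {
        Set<Thread> threads = ConcurrentHashMap.newKeySet();
        CountDownLatch done = new CountDownLatch(requests);
        for (int i = 0; i < requests; i++) {
            Thread t = new Thread(done::countDown); // "handle" the request
            threads.add(t);
            t.start();
        }
        done.await();
        return threads.size();
    }

    // Event-loop style: the same requests are multiplexed onto one thread.
    public static int eventLoop(int requests) throws InterruptedException {
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        ExecutorService loop = Executors.newSingleThreadExecutor();
        CountDownLatch done = new CountDownLatch(requests);
        for (int i = 0; i < requests; i++) {
            loop.execute(() -> {
                threadNames.add(Thread.currentThread().getName());
                done.countDown();
            });
        }
        done.await();
        loop.shutdown();
        return threadNames.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(threadPerRequest(100)); // 100 distinct threads
        System.out.println(eventLoop(100));        // 1 thread
    }
}
```

Fewer runnable threads means fewer context switches and interrupts for the scheduler to juggle, which matches the graphs above.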
Conclusion
ReactiveX has surpassed our expectations.
Specific request-handling use cases have seen a capacity increase of at least 3x compared to before (~350K to ~1M RPM per instance).
This resulted in a massive 40% reduction in overall infrastructure requirements, while fulfilling our needs for sudden-spike handling and better resiliency at scale.
But, we didn’t just stop there.
With ReactiveX, we no longer limit or assess an instance's capability based on the thread count defined for the network threads.
So we increased the server size without having to change any other configuration, moving from c5.2xlarge to c5.4xlarge instances (16-core CPU & 32 GB RAM).
End result? Per-instance handling capacity increased to a whopping ~2M RPM!
Hiccups
However, all issues aren’t ironed out just yet.
We still face the occasional problem due to the gateway service, which still runs as an imperative code-base.
To solve this, we moved the authentication from the gateway service into Ingestion-Service itself.
The authentication layer, along with support for gzipping, added to the overall load on the system, reducing overall capacity.
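For a sense of where that extra load comes from, this is the stdlib form of the gzip work the service now performs per request before it can authenticate or parse anything (the payload contents here are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Per-request gzip round trip: clients send compressed payloads, and the
// ingestion layer spends CPU inflating each one on arrival.
public class GzipSketch {

    public static byte[] gzip(String s) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    public static String gunzip(byte[] b) throws Exception {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(b))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        String payload = "{\"event\":\"product_view\"}";
        System.out.println(gunzip(gzip(payload)).equals(payload)); // true
    }
}
```

Inflation is pure CPU work on the request path, which is why adding it (plus authentication) trims per-instance capacity even though the reactive pipeline itself is unchanged.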
Despite these issues, we are still able to accommodate a lot more than we did before while having resiliency and high scalability built into the systems.
Our current per-instance throughput capacity is around 900K RPM, including support for compression and authentication. We are evaluating further optimisation by shifting these responsibilities to the LB (load balancer) layer.
After landing, requests are translated into serialised JSON that is temporarily stored in Kafka, supporting both batch and real-time analytics. Stay tuned for more on that soon!
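A hypothetical sketch of that hand-off, with made-up field names (not our actual event schema) and an in-memory queue standing in for the Kafka topic that both batch and real-time consumers would read from:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical hand-off sketch: an interaction event is serialised to JSON
// and appended to a topic; field names are illustrative only.
public class EventSerialiseSketch {

    public static String toJson(String userId, String eventType, long ts) {
        return String.format("{\"userId\":\"%s\",\"type\":\"%s\",\"ts\":%d}",
                userId, eventType, ts);
    }

    public static void main(String[] args) {
        Queue<String> topic = new ConcurrentLinkedQueue<>(); // Kafka stand-in
        topic.add(toJson("u42", "product_view", 1700000000L));
        System.out.println(topic.peek());
    }
}
```

In production a schema-aware serialiser and the real Kafka producer replace both the string formatting and the queue; the sketch only shows the flow of an event from request to topic.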
Parting words
Data drives much of the decision-making we do on a day-to-day basis. By tripling RPM capacity using reactive, we’re able to give our backend team more breathing room for their decisions.
My team, Data Platform, prides itself on making decision-making easier for the rest of the teams at Meesho. Why not come join us? Head to meesho.io to check our openings!
