1. Introduction
Ever wondered how millions of orders are processed seamlessly on Meesho every day? With over 175 million users transacting annually, ensuring a smooth and reliable checkout experience is mission-critical. Behind the scenes, the checkout system acts as the backbone of order processing, and any hiccup here can significantly hamper the user experience and has a disproportionately higher business impact than other parts of the user journey.
So, how do we keep it running flawlessly? Enter our active-active strategy—a game-changing approach to ensure high availability and zero downtime for the Meesho Checkout System. In this article, we take you behind the scenes to uncover the challenges we faced, the solutions we implemented, and the key lessons we learned while building a checkout system that’s not just reliable but built to scale. Let’s dive in! 🚀
1.1 Checkout System Architecture
Here’s what a typical user journey looks like while placing an order.
Our checkout system consists of four primary microservices:
- Cart Service
- Address Service
- Order Service
- Payment Service
These services also interact with various databases and caching systems, in addition to multiple other services.
2. Evolution of the System
Our checkout system was initially a monolith, which sufficed when we operated at a smaller scale. As our platform expanded, we transitioned to a microservice architecture to accommodate growth. This involved a core checkout system and other dependent services. However, the number of services that interacted with the checkout system increased substantially over time. Currently, our checkout system has approximately 15 downstream dependencies, and some of these, like the Product Service, have multiple downstream services of their own. Each new dependency within this extensive dependency graph introduced another potential point of failure, thereby increasing the chances of failures and errors in the overall system.
Over the years, we have used different mechanisms to improve the reliability of our critical systems.
2.1 Circuit Breakers to contain failures
The most common issue is a downstream service degrading, either returning error responses or simply becoming latent. These failures have a cascading effect if not contained quickly.
To mitigate the cascading impact of such failures, we implemented circuit breakers (CBs) using libraries like Hystrix and Resilience4j, which are configured to open and return a default fallback response when error thresholds are breached.
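To make this concrete, here is a rough sketch of how such a circuit breaker could be wired up with Resilience4j. The thresholds, service name, client types and fallback below are illustrative assumptions, not our production configuration.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class OfferClientWithBreaker {

    // Hypothetical downstream client and response type, for illustration only.
    interface OfferServiceClient { List<Offer> getOffers(String cartId); }
    record Offer(String id) {}

    // Illustrative thresholds; real values are tuned per downstream service.
    private static final CircuitBreakerConfig CONFIG = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                          // open when >50% of calls fail
            .slowCallDurationThreshold(Duration.ofMillis(300)) // calls slower than this count as slow
            .slowCallRateThreshold(50)                         // open when >50% of calls are slow
            .waitDurationInOpenState(Duration.ofSeconds(30))   // how long to stay open before probing
            .permittedNumberOfCallsInHalfOpenState(10)         // probe calls used to decide recovery
            .slidingWindowSize(100)
            .build();

    private final CircuitBreaker breaker =
            CircuitBreakerRegistry.of(CONFIG).circuitBreaker("offer-service");

    private final OfferServiceClient offerClient;

    public OfferClientWithBreaker(OfferServiceClient offerClient) {
        this.offerClient = offerClient;
    }

    public List<Offer> fetchOffers(String cartId) {
        Supplier<List<Offer>> decorated =
                CircuitBreaker.decorateSupplier(breaker, () -> offerClient.getOffers(cartId));
        try {
            return decorated.get();
        } catch (Exception e) {
            // Default fallback: proceed with no offers rather than failing the checkout call.
            return Collections.emptyList();
        }
    }
}
```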
However, implementing CBs comes with its own set of challenges:
1. Setting Timeout Thresholds
- If timeouts are too low, CBs can open prematurely, leading to unnecessary service degradation.
- If timeouts are too high, failing services may remain connected longer than they should, propagating failures across the system.
2. False Positives
- Network glitches or temporary spikes in latency can unnecessarily trigger a circuit breaker, affecting service availability.
3. Stuck Open Circuits
- Once a CB opens, it may stay open longer than necessary if the success threshold for closing is set too high. This causes artificial downtime and degrades service quality.
- We worked around this with manual overrides for circuit breakers, or by simply restarting the service.
- This need for manual intervention contradicts the self-healing nature of CBs.
2.2 Fallback Strategies for Critical Services
While an empty or static fallback response works for certain downstreams, some critical services (classified as P0 services) need a real fallback.
For these services, we deployed secondary clusters as fallback options. A secondary cluster often has its own separate database or reads from a replica of the same database. The additional complexity here is that all upstreams need to be configured to connect to both the primary and secondary clusters, with separate thread pools, timeouts and failover logic.
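A minimal sketch of what this upstream wiring could look like, with a separate thread pool and timeout per cluster; the client interface, pool sizes and timeouts here are assumptions for illustration, not our actual configuration.

```java
import java.util.concurrent.*;

public class P0ServiceCaller {

    // Hypothetical clients pointing at the primary and secondary clusters.
    interface PricingClient { String getPrice(String productId); }

    private final PricingClient primary;
    private final PricingClient secondary;

    // Separate thread pools so a stuck primary cannot starve secondary calls.
    private final ExecutorService primaryPool = Executors.newFixedThreadPool(20);
    private final ExecutorService secondaryPool = Executors.newFixedThreadPool(10);

    public P0ServiceCaller(PricingClient primary, PricingClient secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    public String getPrice(String productId) {
        try {
            // Primary cluster gets a tighter timeout.
            return call(primaryPool, () -> primary.getPrice(productId), 200);
        } catch (Exception primaryFailure) {
            // Failover: retry against the secondary cluster with its own timeout.
            try {
                return call(secondaryPool, () -> secondary.getPrice(productId), 400);
            } catch (Exception secondaryFailure) {
                throw new RuntimeException("Both clusters failed", secondaryFailure);
            }
        }
    }

    private String call(ExecutorService pool, Callable<String> task, long timeoutMs) throws Exception {
        Future<String> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // free the worker thread instead of letting it hang
            throw e;
        }
    }
}
```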
2.3 Context-Based strategies for handling outages
While circuit breakers and fallbacks work well for intermittent failures, larger outages in dependent services require a context-based fallback strategy. For such scenarios, a dedicated switch-off mechanism can be built into the system: kill switches and feature flags, imported as dynamic configuration, can be toggled when there is an issue in a specific service. For instance, if our discounting platform experiences an extended outage, we can toggle the offer_kill_switch. This still allows the user to place orders, just without an offer applied, and this degraded state is any day better than a full-blown order outage.
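As a rough sketch, the offer kill switch check could look something like this; the DynamicConfig interface and client types are hypothetical stand-ins for whatever dynamic-configuration mechanism is in use.

```java
import java.util.Collections;
import java.util.List;

public class OfferApplicator {

    // Hypothetical wrapper over a dynamic-configuration store. Flags are hot-reloaded,
    // so flipping the switch takes effect without a deploy or restart.
    interface DynamicConfig { boolean isEnabled(String flag); }

    // Hypothetical downstream client and response type.
    interface OfferServiceClient { List<Offer> getOffers(String cartId); }
    record Offer(String id) {}

    private final DynamicConfig config;
    private final OfferServiceClient offerClient;

    public OfferApplicator(DynamicConfig config, OfferServiceClient offerClient) {
        this.config = config;
        this.offerClient = offerClient;
    }

    public List<Offer> offersFor(String cartId) {
        // offer_kill_switch is toggled during an extended outage of the discounting platform.
        if (config.isEnabled("offer_kill_switch")) {
            // Degraded but functional: the order goes through without any offer applied.
            return Collections.emptyList();
        }
        return offerClient.getOffers(cartId);
    }
}
```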
These fallback mechanisms improve system resilience while ensuring a predictable experience for users.
2.4 Unique Challenge of checkout system availability
While the above strategies helped minimise the impact of failures caused by dependencies, we were still vulnerable to failures that happen within the checkout system itself.
The checkout process can be disrupted by the failure of any of the microservices or data stores it depends on, regardless of whether the failure is anticipated or not.
- Example 1: A database scale-up activity typically results in a predictable spike in latency, which can cause timeouts and degrade service performance.
- Example 2: Unexpected failures, such as increased network latency causing service calls to the order database to fail, pose a more unpredictable challenge.
Unlike with other services, per-service fallback strategies are ineffective for services within the checkout ecosystem. For example, let’s assume that we have two cart-serving deployables:
- cart-primary with cart database-1
- cart-secondary with cart database-2
Assume a user added a product while their request was being served by cart-primary. Now, because a fallback triggered, cart-secondary becomes the active cart. Once the user clicks place order, the order service has to check both cart-primary and cart-secondary to find their cart. This can also lead to cases where the user sees their product appearing and disappearing, depending on which cluster is currently serving fetch-cart requests.
This complexity makes it impractical to maintain fallback logic in checkout at a per-service level, and we realised that checkout needed to be treated as a unified system.
In the rest of this document, we will explore how we enabled high availability for the entire checkout system based on this insight.
3. Enabling High Availability for Checkout
Objective: To develop a disaster recovery strategy that minimizes the impact of such outages while ensuring zero user-visible disruption.
For a highly available system, there are two well-known strategies, both of which require a redundant cluster that can seamlessly take over in the event of a failure: Active-Active (AA) and Active-Passive (AP).
Both strategies have their own set of tradeoffs in terms of cost, complexity and failover strategy. We conducted a thorough evaluation of both strategies to determine the most effective approach.
3.1 Evaluating High Availability Strategies
Active-Passive Strategy
In an Active-Passive (AP) setup, we have two clusters (primary and secondary); only one cluster actively handles all traffic, while the other remains idle, ready to take over in the event of a failure.
Pros:
- Simpler Architecture: The AP setup is easier to implement and manage due to its straightforward failover mechanism.
- Lower Complexity in Data Synchronization: In an AP setup, only one cluster is active at any given time, so we can treat the two as mutually exclusive, eliminating the need for data replication and synchronization.
Cons:
- Higher Mean Time to Recovery (MTTR): The failover process introduces latency, which can lead to temporary service disruptions.
- Underutilized resources: The standby system remains idle during normal operations, resulting in inefficient use of computational capacity.
- Risk of cold standby and version mismatch issues: Secondary systems might require a warm-up period, causing delays during failover. Additionally, there’s a risk of the standby system running an outdated version, which may no longer be compatible or valid.
Active-Active Strategy
In an Active-Active (AA) setup, unlike Active-Passive, both clusters actively handle traffic simultaneously. In our use case, this means both clusters should be able to process checkouts simultaneously, without any conflicts or collisions.
Pros:
- Extremely Low MTTR: AA enables instant traffic rerouting, ensuring minimal service disruption during failures.
- Better Utilisation of Resources: Traffic can be dynamically balanced across both clusters, enhancing scalability and making effective use of resources.
- Zero-Downtime Maintenance: Traffic can be gracefully shifted to one cluster while the other undergoes updates or maintenance.
- Continuous Failover Testing: While Active-Passive configurations can only test failover through a disruptive, complete traffic shift, AA supports controlled traffic redistribution between active clusters during normal operations.
Cons:
- Complex Implementation: Designing and maintaining AA systems is significantly more challenging than Active-Passive setups, as it requires sophisticated mechanisms for data replication and load balancing.
- Potential performance overhead: Synchronization and coordination between nodes may introduce latency, affecting performance.
We chose Active-Active (AA) for its ability to deliver high availability, lower MTTR, and seamless operations during failures by instantly redirecting traffic to a secondary cluster. Its real-time fault tolerance, scalability, and uninterrupted operational assurance were key advantages. AA also enables risk-free failover testing and offers future-proof flexibility, allowing fallback clusters to be deployed across diverse regions or zones. Despite its more complex implementation, the benefits of AA solidify it as the optimal solution for our system.
4. Implementing Active-Active
4.1 Challenges
While an active-active setup offers the highest level of reliability and fault tolerance, its implementation comes with significant complexity, including challenges that are unique to our system. Key considerations include:
- User Switching: How will user sessions be seamlessly switched between active nodes? Can this process be automated?
- Lost Transactions: In case of failure, what level of transaction loss is acceptable? Can lost transactions be recovered effectively?
- Data Collisions: Is there a risk of data conflicts, and if so, how will these be resolved?
- Service Compatibility: Can our current services run smoothly in an active-active environment, or will they require modifications?
- Cost Tolerance: What additional costs can the system sustain for maintaining an active-active setup?
Addressing these questions was critical to ensuring a robust, scalable and cost-effective Active-Active architecture.
To solve this, we broke the problem into two parts:
- Achieving Isolation: Making changes such that an independent set of services are capable of processing orders in parallel without needing to synchronize data.
- Dynamic Traffic Switching: Building a traffic router on top of both clusters to enable seamless failover.
4.2 Achieving Isolation
In an active-active setup, one of the biggest challenges is ensuring data consistency across multiple systems.
We simplified the process by leveraging the short duration (2-3 minutes) of a typical checkout session. This meant that as long as all user data for that session was kept within the same cluster, we could avoid complex data synchronization between clusters.
The following changes were made in the services to achieve isolation.
4.2.1 Cart Service
The Cart Service has two types of carts, each with different state persistence needs:
- Add-to-Cart Flow (ATC): This cart persists in the Cart DB, ensuring the items stay in the cart even after a user exits and returns.
- Buy-Now Flow (BN): This cart is stored temporarily in cache, providing a quicker, more transient experience for users who opt for immediate checkout.
Products added via the ATC flow are persistent and require data synchronization to maintain consistency for the user, so that an added product remains visible in the user's cart even if traffic shifts between clusters. We took a call to only allow the Buy-Now flow on the secondary cluster, as 95% of our checkout transactions happen via the BN flow. Making this trade-off allowed us to cover the vast majority of our users while eliminating the need for complex data synchronisation.
Custom logic:
Since only BN requests are allowed in the secondary cluster, ATC requests would fail when the secondary cluster is active. In order to handle ATC requests, we implemented custom proxy logic to always route these requests to the primary cluster. This ensures users attempting to place an ATC order receive the best possible experience.
But to handle scenarios where we know the primary cluster is 100% down, we introduced a config flag that allows us to disable the ATC flow. In such cases, users are shown a popup message encouraging them to use the Buy-Now flow instead, thus providing a graceful experience and redirection.
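Putting the two rules together, the cart-routing decision can be sketched roughly as below. The real logic lives in our proxy layer (described later); the flag, enum and exception names here are illustrative assumptions.

```java
public class CartRequestRouter {

    enum Cluster { PRIMARY, SECONDARY }
    enum FlowType { ADD_TO_CART, BUY_NOW }

    // Hypothetical dynamic-config flags.
    interface DynamicConfig {
        Cluster activeCluster();     // which cluster currently serves new sessions
        boolean isAtcFlowDisabled(); // set only when the primary cluster is fully down
    }

    private final DynamicConfig config;

    public CartRequestRouter(DynamicConfig config) {
        this.config = config;
    }

    /** Decides which cluster should serve a cart request, or rejects it gracefully. */
    public Cluster route(FlowType flow) {
        if (flow == FlowType.ADD_TO_CART) {
            if (config.isAtcFlowDisabled()) {
                // Primary is known to be fully down: the client shows the
                // "please use Buy Now" popup instead of a hard failure.
                throw new AtcTemporarilyUnavailableException();
            }
            // ATC carts are persisted only in the primary Cart DB, so always go there.
            return Cluster.PRIMARY;
        }
        // Buy-Now carts are cache-backed and can be served by whichever cluster is active.
        return config.activeCluster();
    }

    static class AtcTemporarilyUnavailableException extends RuntimeException {}
}
```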
4.2.2 Address Service
Address data poses a challenge for isolation since it needs to remain available and consistent across both clusters. To address this, we opted to use the Address-DB Slave for the secondary cluster. Given that address updates are relatively infrequent, the replication lag between the primary and secondary databases is negligible, ensuring consistency without compromising performance.
Custom logic:
Since we are using the slave database for the secondary cluster, users cannot add or update addresses directly in that cluster. To address this, we route all add/update requests to the primary cluster. This ensures we can handle address modifications even during partial failures. For instance, if the secondary cluster is active due to a Cart DB failure, address-related requests are seamlessly supported in the primary cluster, minimizing the overall impact of the outage.
4.2.3 Order Service
Order data, though not ephemeral, has a relatively short lifespan. It is primarily needed during an ongoing transaction, and once the order is confirmed, it is pushed to a Kafka topic. After this, order data is only required for debugging purposes, which can be handled manually.
Unique Order Number: The main challenge was ensuring the uniqueness of order numbers across two checkout systems operating independently. To solve this, we implemented a decentralized algorithm for order number generation, which guarantees uniqueness without relying on cross-system coordination.
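The exact algorithm is beyond the scope of this post, but a common decentralized approach is a Snowflake-style ID that embeds the cluster identifier, so the two clusters can never generate the same number. The bit layout below is an illustrative assumption, not our production scheme.

```java
/**
 * Illustrative Snowflake-style order number generator.
 * Layout: 41-bit timestamp | 2-bit cluster id | 8-bit node id | 12-bit sequence.
 * Uniqueness needs no cross-cluster coordination because the cluster id is
 * baked into every number.
 */
public class OrderNumberGenerator {

    private static final long EPOCH = 1704067200000L; // arbitrary custom epoch (2024-01-01T00:00:00Z)
    private final long clusterId; // e.g. 0 = primary, 1 = secondary
    private final long nodeId;    // unique per service instance within a cluster
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public OrderNumberGenerator(long clusterId, long nodeId) {
        this.clusterId = clusterId & 0x3;  // 2 bits
        this.nodeId = nodeId & 0xFF;       // 8 bits
    }

    public synchronized long next() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;     // up to 4096 ids per millisecond per node
            if (sequence == 0) {
                while (now <= lastTimestamp) {     // sequence exhausted: wait for the next millisecond
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (clusterId << 20) | (nodeId << 12) | sequence;
    }
}
```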
4.2.4 Async Operation
There are several asynchronous operations within the checkout cluster, some of which have external side effects, such as sending a push notification upon order confirmation.
To manage this, we allowed both clusters to produce events to a Kafka topic. However, consumers that could cause duplicate side effects were activated only in the primary cluster. For consumers operating within the cluster, we ensured idempotency to prevent unintended duplicates or side effects.
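A rough sketch of this gating and idempotency, assuming a Kafka consumer, a hypothetical topic name and a simple processed-key store (a real deployment would use a shared cache or DB table for dedup):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class OrderConfirmedConsumer {

    interface DynamicConfig { boolean isSideEffectConsumerEnabled(); } // true only in the primary cluster
    interface NotificationSender { void sendOrderConfirmation(String orderNumber); }

    private final KafkaConsumer<String, String> consumer;
    private final DynamicConfig config;
    private final NotificationSender notifier;

    // Illustrative in-memory dedup store for idempotency.
    private final Set<String> processedOrders = ConcurrentHashMap.newKeySet();

    public OrderConfirmedConsumer(KafkaConsumer<String, String> consumer,
                                  DynamicConfig config,
                                  NotificationSender notifier) {
        this.consumer = consumer;
        this.config = config;
        this.notifier = notifier;
        consumer.subscribe(Collections.singletonList("order-confirmed")); // hypothetical topic name
    }

    public void pollLoop() {
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                String orderNumber = record.key();
                // Idempotency: skip anything we have already handled.
                if (!processedOrders.add(orderNumber)) {
                    continue;
                }
                // External side effects (e.g. push notifications) fire only in the primary cluster.
                if (config.isSideEffectConsumerEnabled()) {
                    notifier.sendOrderConfirmation(orderNumber);
                }
            }
        }
    }
}
```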
4.3 Switching Clusters
Switching traffic between clusters can technically be achieved through DNS changes, but this approach poses certain challenges: it’s not immediate, and it’s abrupt. Using it would have caused user-visible disruption and made traffic routing unpredictable.
To minimise user disruption, we wanted session stickiness: ensuring that users stick to one cluster throughout their journey. New users should be directed to a healthy cluster, while ongoing sessions should be completed on their original cluster.
Transact Proxy
To address these issues, we developed a lightweight reverse proxy in Go called Transact Proxy. It serves as an abstraction layer over the checkout system, enabling us to implement custom routing logic tailored to our needs.
The following approach was implemented to maintain session stickiness (a rough sketch follows the list):
- At the start of a session, the proxy routes the request to the cart service of the currently active cluster.
- The cart service adds a cluster identifier into the session key and returns it to the client.
- The client ensures that the session key is present in all subsequent API calls for the entire checkout session.
- The proxy then extracts the cluster identifier from the session key and routes requests to that cluster.
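The Transact Proxy itself is written in Go; purely as an illustration, the routing decision it makes can be sketched as follows, assuming a hypothetical session-key format that carries the cluster identifier as a prefix.

```java
public class StickyRouter {

    enum Cluster { PRIMARY, SECONDARY }

    interface DynamicConfig { Cluster activeCluster(); } // cluster currently taking new sessions

    private final DynamicConfig config;

    public StickyRouter(DynamicConfig config) {
        this.config = config;
    }

    /**
     * Hypothetical session-key format: "<cluster>:<opaque-session-id>", e.g. "SECONDARY:ab12...".
     * The cart service mints the key at session start; clients echo it on every checkout call.
     */
    public Cluster resolveTarget(String sessionKey) {
        if (sessionKey == null || sessionKey.isEmpty()) {
            // New session: send it to whichever cluster is currently active.
            return config.activeCluster();
        }
        // Ongoing session: stay sticky to the cluster that created it,
        // even if the active cluster has been switched since.
        String clusterTag = sessionKey.split(":", 2)[0];
        return Cluster.valueOf(clusterTag);
    }
}
```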
Other key features of Transact Proxy include:
- Ability to switch between primary and secondary (DR) clusters dynamically.
- Extremely low memory and CPU footprint, thus being cost-effective.
- Whitelisting logic for controlled user migration.
- Highly optimised streaming parsing, to ensure minimal latency.
- Completely stateless, thus not adding new possible failure points in the system.
- Custom logic added for handling partial failures.
5. Testing and Validation
5.1 Disaster recovery drill
A Disaster recovery failover test was conducted in the production environment to validate the reliability of our failover mechanisms.
Methodology: Users were gradually shifted from the primary cluster to the secondary cluster in phases (1%, 5%, 25%, 100%), while closely monitoring key metrics such as order failures and conversion drop.
Expectation: Users should seamlessly complete their orders after being moved to the secondary cluster, with no disruptions to ongoing order sessions.
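The phased shift described in the methodology above can be driven by deterministic user bucketing, so that a user stays in the same phase as the percentage ramps up; the hashing scheme below is an illustrative sketch, not our actual rollout logic.

```java
public class FailoverRollout {

    enum Cluster { PRIMARY, SECONDARY }

    /**
     * Deterministically maps a user to a bucket in [0, 100) and moves them to the
     * secondary cluster once the rollout percentage covers their bucket.
     * Ramping 1% -> 5% -> 25% -> 100% only ever adds users, never flip-flops them.
     */
    public static Cluster clusterFor(String userId, int rolloutPercent) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < rolloutPercent ? Cluster.SECONDARY : Cluster.PRIMARY;
    }

    public static void main(String[] args) {
        // Example: at 5% rollout, roughly 5 in 100 users land on the secondary cluster.
        System.out.println(clusterFor("user-42", 5));
    }
}
```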
5.2 Result
The Disaster Recovery drill was a significant success, marking a major step towards achieving high availability for the Meesho Checkout system. Here’s what we accomplished:
- Zero Order Loss: All ongoing transactions were handled flawlessly, with no orders lost during the process of switching clusters.
- Seamless User Experience: Leveraging session stickiness, users in the middle of a checkout session continued smoothly on their original cluster. Subsequent orders from these users were dynamically routed to the secondary cluster without any disruption or noticeable impact.
- Gradual Traffic Movement: We ensured a stable and controlled failover by incrementally shifting traffic—starting from 1% and scaling up to 100%—while continuously validating the stability of both clusters.
- Minimal Performance Impact: The introduction of the lightweight Transact Proxy resulted in a negligible increase of ~4 ms in P50 latency at the API Gateway, ensuring system responsiveness remained intact.
Gradual increase and drop in RPS for the primary and secondary clusters during the DR activity.
6. Conclusion
The implementation of an active-active strategy for our checkout system has significantly enhanced its resilience and availability. Here’s what we were able to achieve with this setup:
- Elimination of planned downtime: Planned downtime for DB and cache scaling or DB migrations can be completely avoided.
- Risk-free failover testing: With session-level stickiness and the Transact Proxy, failover testing can be done in a risk-free manner without any user downtime.
- Cost-effectiveness: There is no idle backup system sitting around in an active-active setup. We can effectively share the load across both clusters, thus utilising all of the capacity.
- 100% user coverage regardless of app version.
- Extensibility: The solution doesn’t make any assumptions about where the clusters are deployed, allowing a multi-region, multi-AZ setup in the future.
By addressing key challenges such as data consistency, traffic isolation, dynamic routing, and session stickiness, we’ve built a robust, fault-tolerant system that ensures seamless operations and minimizes user impact during outages. This marks a major step forward in delivering a reliable and high-availability experience for our users.
7. Future Steps
- We plan to add automated traffic routing to Transact Proxy, which can auto-detect API failures and redirect users to an alternate available cluster.
- Using anomaly detection to trigger scale-up and failover pre-emptively.