At Meesho, we run hundreds of microservices at large scale. Given the nature of our business, it is crucial that each service meets its defined SLAs (Service Level Agreements). Until recently, these services communicated with each other using the HTTP/1 protocol. However, to reduce infrastructure cost and improve performance, some teams decided to adopt gRPC for its numerous advantages. In terms of operational benefits, gRPC offers:
- Improved performance through streaming, multiplexing, and the use of the more efficient HTTP/2 protocol.
- Reduced network bandwidth consumption due to smaller packet sizes.
Both of these benefits directly contribute to cost reduction and performance improvement. However, we encountered some interesting challenges when it came to load distribution with gRPC.
In this blog, we're going to explore how we handled this challenge to keep our services efficient, even during high-demand periods.
What’s the problem?
Kubernetes services (ClusterIP/NodePort/LoadBalancer) largely rely on connection-level load balancing. This works well for HTTP/1.1 traffic because a connection carries only one active request at a time and there is no multiplexing. gRPC, however, uses HTTP/2, which multiplexes requests so that a single connection can carry several active requests at any given time. This reduces the overhead of connection management, but it also means that the connection-level load balancing done by Kubernetes becomes problematic.
If a pod of Service-A needs to communicate with Service-B over gRPC and uses the ClusterIP of Service-B to do so, requests will not be distributed evenly across Service-B's pods: once a connection is established, more and more requests are pushed through that same connection, so some pods receive far more requests than others. This problem does not occur with HTTP/1 traffic.
Solving Load Balancing - Pretty Straightforward
At Meesho, we use Envoy almost everywhere as the L7 proxy. We do not run Envoy as a sidecar; instead, we run standalone Envoy clusters that scale independently and use our xDS APIs to handle routing. We will discuss this design in detail in an upcoming blog.
Envoy, by default, solved one part of the puzzle for us. Traffic reaching the destination deployment was perfectly distributed by Envoy across all of its pods, simply because Envoy is very good at load balancing HTTP/2. So where was the problem?
For traffic reaching Envoy, we mostly use native Kubernetes services. So although Envoy was distributing requests evenly across the upstream pods, its own pods were seeing heavily imbalanced traffic because the ClusterIP kept pushing more and more traffic to only a handful of pods. This meant a few Envoy pods would become overwhelmed, degrading the overall performance of the service.
Removing ClusterIP from the picture
The quickest and most efficient solution here was to remove ClusterIP from the picture completely. This meant using a headless service and letting the client load balance across the Envoy pods. Our Java clients required minimal changes for this, and we quickly modified them to use a headless service. Now, instead of seeing a single ClusterIP, the client receives the endpoints of all the Envoy pods and does a fairly good job of load balancing traffic across them.
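For reference, this is roughly what a headless service in front of the Envoy pods looks like; the name, namespace, labels, and port below are illustrative, not our actual manifests:

```yaml
# Headless service: clusterIP is None, so a DNS lookup returns the IPs of
# all Envoy pods instead of a single virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: envoy-headless          # hypothetical name
  namespace: infra              # hypothetical namespace
spec:
  clusterIP: None               # this is what makes the service headless
  selector:
    app: envoy                  # assumed label on the Envoy pods
  ports:
    - name: grpc
      port: 8443                # assumed listener port
      targetPort: 8443
```

The client then resolves envoy-headless.infra.svc.cluster.local, receives every Envoy pod IP, and spreads its connections across them.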
Connection Age
One last challenge in load balancing was handling client behavior during the addition and removal of Envoy pods. We observed that if load balancing was working correctly across X Envoy pods and a new Envoy pod came up as part of the autoscaling Envoy cluster, the client would not send any traffic to the newly added pod.
It turns out that the default behavior of the Java client is to not refresh the endpoints behind the DNS hostname until a connection is drained. To solve this, we set a max connection age at the Envoy level. This forces connections to be drained after a carefully chosen interval, and we keep the drain graceful by configuring drain timeouts at both the client and Envoy level. Now, whenever Envoy initiates a connection drain, the Java client refreshes the endpoints received from the headless service and starts sending traffic to any newly added Envoy machine.
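In Envoy terms, our reading is that this corresponds to the max_connection_duration and drain_timeout settings of the HTTP connection manager; a minimal sketch with placeholder values (the actual intervals need to be tuned per workload):

```yaml
# Fragment of the HttpConnectionManager filter configuration
common_http_protocol_options:
  # Gracefully close every downstream connection after this duration,
  # forcing gRPC clients to re-resolve the headless service and pick up new pods.
  max_connection_duration: 300s   # placeholder value
# Time Envoy allows for in-flight streams to finish once a drain starts.
drain_timeout: 30s                # placeholder value
```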
The shutdown manager already handled the scale-in of the Envoy pods, ensuring that the Envoy machines were only terminated after all connections had been drained. Additionally, clients would correctly handle the drain signal received from Envoy.
Multi-Cluster?
At Meesho, our clusters communicate with each other using Cilium Clustermesh, which relies on all endpoints being natively routable within the VPC (in other words, the pods and nodes of all clusters are on a flat network). Cilium has the concept of global services, which mirror the endpoints of a Kubernetes service into the other clusters that are part of the clustermesh. So if a service-A exists in cluster-A and a service with exactly the same name (service-A) is then created in cluster-B, Cilium will put the endpoints from cluster-A behind the service in both clusters. If any application in cluster-B calls service-A, the Kubernetes service can then load balance the traffic as if the endpoints belonged to the same cluster. Services are made global by adding the annotation io.cilium/global-service: 'true'.
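As a sketch, making a service global only requires annotating the identically named service in each participating cluster (names and ports here are illustrative):

```yaml
# Apply in both cluster-A and cluster-B: same name, same namespace,
# annotated as global so Cilium merges the endpoints across the mesh.
apiVersion: v1
kind: Service
metadata:
  name: service-a
  namespace: default
  annotations:
    io.cilium/global-service: "true"
spec:
  selector:
    app: service-a               # assumed label on the workload pods
  ports:
    - port: 80
      targetPort: 8080
```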
However, a headless service is essentially just a DNS result, while Cilium's global services work by load balancing traffic to the mirrored endpoints behind a single ClusterIP. In short, the headless service we created for the Envoy pods of one cluster cannot be resolved from another cluster using Cilium Clustermesh.
The only way to resolve a headless service of the target cluster is to somehow use the CoreDNS of the target cluster. But since all our pods are natively routable within the VPC, what if we reroute certain DNS queries to the CoreDNS pods of the target cluster? That should work.
Making CoreDNS Service Global
The first step is to make the CoreDNS of each cluster accessible from the other clusters. Let's name our two clusters aam and jamun, and say we need to resolve services of the aam cluster from the jamun cluster. In each cluster, the kube-dns service is responsible for resolving Kubernetes FQDNs (*.svc.cluster.local). Let's make the kube-dns of the aam cluster reachable from the jamun cluster. For this, we will create a new service; let's name it aam-kube-dns.
aam cluster
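A sketch of what this manifest could look like; the kube-system namespace, the k8s-app: kube-dns selector, and port 53 follow standard CoreDNS deployments, so verify them against your cluster:

```yaml
# aam cluster: a new service that selects the CoreDNS pods directly and is
# marked global so that Cilium mirrors its endpoints to other clusters.
apiVersion: v1
kind: Service
metadata:
  name: aam-kube-dns
  namespace: kube-system
  annotations:
    io.cilium/global-service: "true"
spec:
  selector:
    k8s-app: kube-dns            # standard label on CoreDNS pods
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53
    - name: dns-tcp
      port: 53
      protocol: TCP
      targetPort: 53
```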
We could not simply make the existing kube-dns service global. Why? Because marking kube-dns itself as global would mirror the CoreDNS endpoints of each cluster into the other, and ordinary cluster-local lookups could then be answered by the wrong cluster's CoreDNS. A dedicated aam-kube-dns service leaves normal DNS resolution untouched while still exposing the aam CoreDNS across the mesh.
jamun cluster
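And a sketch of the counterpart service in the jamun cluster, with the same name and namespace but no selector:

```yaml
# jamun cluster: same service name and namespace, marked global, but without
# a selector -- Cilium fills in the endpoints mirrored from the aam cluster.
apiVersion: v1
kind: Service
metadata:
  name: aam-kube-dns
  namespace: kube-system
  annotations:
    io.cilium/global-service: "true"
spec:
  ports:
    - name: dns
      port: 53
      protocol: UDP
      targetPort: 53
    - name: dns-tcp
      port: 53
      protocol: TCP
      targetPort: 53
```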
Note that there is no selector in the aam-kube-dns service created in the jamun cluster, while in the aam cluster there is one. Why? The selector in the aam cluster picks up the actual CoreDNS pods running there. In the jamun cluster there are no aam CoreDNS pods to select, so the service is left without a selector and Cilium populates it with the endpoints mirrored from the aam cluster.
CoreDNS Magic
Now that the CoreDNS of aam cluster is reachable from the jamun cluster, let’s move to the part where we reroute certain DNS queries from the jamun cluster to the aam cluster.
We will have to put a convention in place so that all hostnames which point to headless services use the same subdomain, and then add rules to the Corefile of the jamun cluster to route all queries for that subdomain to the CoreDNS of the aam cluster. For the services in the aam cluster, let's make this subdomain aam.meesho.com.
Corefile (jamun cluster)
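A sketch of the extra server block added to the jamun Corefile (the default server blocks are left untouched and omitted here; the plugins carried over may differ in your setup):

```
# Forward every query under aam.meesho.com to the aam-kube-dns ClusterIP.
aam.meesho.com:53 {
    errors
    cache 30
    forward . 10.XX.XX.XX
}
```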
Here 10.XX.XX.XX is the ClusterIP of the aam-kube-dns inside the jamun cluster.
The above setup ensures that any query of the form *.aam.meesho.com is resolved using the CoreDNS of the aam cluster. Let's now tell the CoreDNS of the aam cluster to resolve all requests under the aam.meesho.com subdomain to the correct in-cluster service.
Corefile (aam cluster)
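One way to do this is with CoreDNS's rewrite plugin inside the default server block; a sketch under that assumption (the remaining plugins are the stock Corefile and are abbreviated, and the regex may need tuning for your naming convention):

```
.:53 {
    # Rewrite foo.ns.aam.meesho.com -> foo.ns.svc.cluster.local, and rewrite
    # the answer back so the client sees the name it originally asked for.
    rewrite stop {
        name regex (.*)\.aam\.meesho\.com {1}.svc.cluster.local
        answer auto
    }
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```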
The journey of a DNS query from the jamun cluster can now be summarized as follows:
- An application pod in jamun looks up a name such as envoy-headless.infra.aam.meesho.com (illustrative name).
- The jamun CoreDNS matches the aam.meesho.com server block and forwards the query to the aam-kube-dns ClusterIP, behind which Cilium has mirrored the aam CoreDNS pods.
- The aam CoreDNS rewrites the name to the corresponding *.svc.cluster.local record and, since the target is a headless service, returns the Envoy pod IPs.
- Because all pods are routable on the flat VPC network, the client connects to those Envoy pod IPs directly.
While opting for an Envoy Sidecar Proxy-based service mesh might have simplified matters, the scale of our clusters and the unique abstractions we're building for multiple development teams led us to choose a different path. In our upcoming blog, we'll explore the various reasons behind this design choice.