As modern tech teams lean more and more into automation, Platform Engineering (PE) has emerged as one of the most promising developments in this space. But what exactly is Platform Engineering? Why does it matter, and how do we use PE principles at Meesho to improve our engineering output?
Our Principal Architect, Indroneel Das, walks us through PE concepts and what they mean for Meesho and other cloud-native organisations. We hope this conversation will answer all these questions and many more.
Can you briefly explain what issues engineering teams faced that led to the conception and eventual rise of Platform Engineering?
Indroneel: Engineers focus on creating and shipping code for specific business requirements. This is fine during the initial stages of a company’s growth, as requirements change rapidly, necessitating a lot of experimentation.
However, as engineering teams converge on an established business model requiring only incremental changes, their focus should change to, “How can we efficiently, reliably, and securely meet business demands?” But that usually doesn’t happen.
This is where PE comes in.
Let’s consider teams facing challenges around scale, reliability, availability, and security while trying to ship code quickly. Left to themselves, each team would solve these issues in its own way, leading to knowledge silos. Failing to share knowledge and experience across teams leads to duplicate or overlapping capabilities.
Having good PE practices solves most of these issues.
Let’s talk about the journey from development to production. Can you give some scenarios of how PE helps tackle such situations?
The journey benefits from having the right processes, tools, automation, and security practices for different stages and operations within the overall development life cycle. Here, the role of platform engineering is to augment and support these processes and tools. For example:
- Having observability as a platform capability across all services helps identify root causes during end-to-end and integration testing.
- Well-defined reliability features lead to better recommendations around issues identified through performance testing.
Additionally, PE can reduce development time and effort by introducing technical abstractions. For example, an event-driven architecture between services is usually implemented by each team using Kafka as the underlying technology. This approach has several drawbacks that increase development time:
- Each team must have the proficiency to integrate, debug, and tune Kafka for the best outcomes.
- Each team must separately build capabilities around monitoring, configuration, data validation, and interoperability for the eventing pipeline.
- Each team is responsible for the availability of their respective Kafka cluster. This also applies to upgrading and/or migrating to an alternative technology when required.
Having an eventing capability that exposes a set of strategies around event generation, propagation, and consumption improves developer bandwidth across engineering teams.
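To make the idea concrete, here is a minimal sketch of what such a shared eventing capability could look like. It is not Meesho's implementation: every name in it (`Event`, `EventBackend`, `InMemoryBackend`, `EventingPlatform`, the `order.placed` topic) is hypothetical, and the in-memory backend is only a stand-in for a real broker such as Kafka. The point is that product teams call `publish` and `subscribe`, while validation, basic monitoring, and the choice of transport live in one place.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Protocol


@dataclass(frozen=True)
class Event:
    topic: str
    payload: dict[str, Any]


Handler = Callable[[Event], None]


class EventBackend(Protocol):
    """What any transport would need to provide (hypothetical interface)."""

    def send(self, event: Event) -> None: ...

    def register(self, topic: str, handler: Handler) -> None: ...


class InMemoryBackend:
    """Stand-in transport for local runs; a real one would wrap a broker client."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def send(self, event: Event) -> None:
        # Deliver synchronously to every handler registered for the topic.
        for handler in self._handlers[event.topic]:
            handler(event)

    def register(self, topic: str, handler: Handler) -> None:
        self._handlers[topic].append(handler)


class EventingPlatform:
    """Single entry point that teams use instead of talking to the broker."""

    def __init__(self, backend: EventBackend,
                 required_fields: dict[str, set[str]]) -> None:
        self._backend = backend
        self._required_fields = required_fields
        # Simple per-topic counter as a placeholder for shared monitoring.
        self.published_count: dict[str, int] = defaultdict(int)

    def publish(self, topic: str, payload: dict[str, Any]) -> None:
        # Shared data validation: reject payloads missing required fields.
        missing = self._required_fields.get(topic, set()) - set(payload)
        if missing:
            raise ValueError(f"{topic}: missing fields {sorted(missing)}")
        self._backend.send(Event(topic, payload))
        self.published_count[topic] += 1

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._backend.register(topic, handler)


# Usage: a producing team and a consuming team share the same capability.
platform = EventingPlatform(InMemoryBackend(), {"order.placed": {"order_id"}})
received: list[Event] = []
platform.subscribe("order.placed", received.append)
platform.publish("order.placed", {"order_id": "A1"})
```

Because callers depend only on `EventingPlatform`, swapping the in-memory stand-in for a different transport would, under these assumptions, be a change inside the platform team's code rather than in every service.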
You touched upon knowledge silos. How does PE help avoid those?
First off, let’s understand knowledge silos in some depth.
Knowledge silos develop when engineering teams build capabilities and develop skills in isolation. These capabilities are usually contextualised to address a single business problem, which results in duplicate capabilities when solving similar problems. Duplicate capabilities mean more collective development efforts to improve and maintain all of them. Sometimes, knowledge propagation as shared capabilities does happen when driven by senior engineers, though it’s a very organic process and tends to be suboptimal.
With a central PE team and a stated vision, the following happens:
- Duplicate capabilities are identified across teams and rolled into single, shared offerings.
- Feature-driven capabilities (e.g., notifications, indexing, ranking, and relevance) are implemented in an extensible and configurable way to evolve rapidly with changing product requirements.
In your view, what are the tenets of building an effective PE team, and how have we applied them at Meesho?
Here are some of the tenets that a PE team must have:
- A clear mission: I’ve observed that every organisation has a different view of what constitutes a platform in its context. Therefore, it’s essential that the entire organisation has a clear answer to the question, “What does a platform mean to us?”
- Product mindset: The team should think of the platform as a product with engineers and product owners as end-users. This requires everyone on the team to revisit soft traits such as empathy, strong communication, the ability to receive and act upon feedback, and continuous learning.
- Generalise and create abstractions: The team also needs to pivot from a feature-driven mindset that targets the organisation’s end users to provide capabilities that help build and operationalise those features. One way to achieve this is to think in terms of abstractions.
- Identify differentiators: For every platform feature conceived, there should be a stated motivation and consensus on the expected benefits. These benefits should translate into a dollar-value benefit for the business.
- Scope: It’s also important to understand what isn’t the responsibility of a platform team. While the purpose is to serve the application developers, it’s not an extension of Site Reliability Engineering (SRE) or operations.
It seems like PE has a lot of positives for most engineering teams, yet few organisations have adopted its principles. Why do you think that is? What might be the barriers to its adoption?
I’d say it depends on the organisation. In older, established companies, there’s often a lot of legacy infrastructure. In many cases, entire portions of the system are not even in the cloud; they’re hosted on-premises. Existing rules and bureaucracy also make it difficult to drive large-scale change. These are short-term downsides that inhibit progress.
Meanwhile, for young companies, the focus is on short-term gains as they might still be working on an MVP and proving their business model, while PE’s benefits are mostly medium- and long-term. Therefore, it’s hard to justify spending effort on something whose benefits are only apparent after the company has grown beyond a certain point.
Another issue inherent to PE is that of rewards and recognition — since the team is supposed to be the “invisible glue”, it’s difficult to get recognised for your efforts. On the other hand, it’s immediately noticeable when things go wrong, increasing resistance to contributing.
Finally, PE is seen as a cost centre because the team doesn’t ship new features or solve the so-called “concrete” business problems, making it difficult to convince the top brass to invest in it.
On the cost-centre front, do you have any recommendations to convince the organisation to invest in it?
While there are indeed costs associated with building a PE team, there are many less obvious and intangible benefits that more than offset these costs. Firstly, embracing PE from the beginning allows teams to stay lean because fewer personnel are required for infra management.
Since developers spend less time and effort on infra management, they can focus on adding and improving features, solving business problems, and reducing time to market. The focus on observability and repeatability also reduces the chances of downtime caused by incorrect deployments.
There are also savings related to cloud infrastructure costs. Since the infra allocation is automated, there are fewer “wasted” and unused cloud resources that would otherwise be billed.
Meesho has grown by leaps and bounds in the past few years. Has PE improved our ability to scale?
Definitely. As Meesho has grown at an incredible pace, our PE initiatives have supported that growth while remaining cost-effective. We call our current efforts the 3 A’s: Adoption, Abstraction, and Automation.
Adoption: We are continuously evaluating new technologies and service providers to help us manage them. Changing our deployment strategy from VMs to containers and spreading out across multiple cloud providers are good examples of our ability to adapt fast. From a cost perspective, we are also getting close to finding the sweet spot between managed and self-hosted services.
Abstraction: Fast adoption comes with the inherent risk of unreliability due to the changes introduced. This is why we are in the process of finalising our abstractions over common building blocks such as storage, caching, and eventing. These abstractions help us stay true to certain standards while giving teams a degree of autonomy to try out newer technologies and incorporate changes with localised impact.
Automation: This is a continuous process where we systematically identify and replace manual operations with processes that are repeatable, configurable, observable, and therefore efficient. For example, our deployment pipelines are fully automated. We are also rolling out tooling for environment provisioning and test execution during performance engineering and quality assurance. All of these have a direct influence on developer productivity in the near term.
Indroneel Das has over 25 years of experience designing and developing large enterprise solutions and web-scale applications. For over 10 years of this journey, he has served as a principal architect across multiple industry domains such as online retailing, e-governance, and IT services.
If you found this conversation insightful, good news: we have more of them coming! To make sure you don’t miss out, follow us on LinkedIn, Twitter, and Instagram!
Credits
Interviewee: Indroneel Das
Interviewer: Shivam Raj