In this world, Site Reliability Engineer (SRE) roles and functionalities are essential to measuring availability, delivering releases, and taking immediate action in case of failures.
This article discusses monitoring for SREs in the dynamic, ever-changing cloud-native world. Let's start with a short introduction to the role of SRE and a comparison with DevOps culture.
What is SRE?
Site reliability engineering is a functional discipline that seeks to apply software development techniques to IT problems. Instead of reactive monitoring, deployment, and incident management, SRE focuses on building and monitoring every system in a continuous approach. The term SRE was coined at Google when they needed to shift from an IT-centric organization to a service-oriented one. Following Google, large enterprise software companies and small startups adopted the philosophy and created dedicated SRE teams within their organizations.
SRE vs. DevOps
In conventional software development, development and operations are two separate teams, each with its own processes and mindset. This separation creates silos, communication challenges, conflicts between groups, and, unfortunately, failed projects. DevOps was the solution that broke down the silos and brought the two teams together, as the name implies: Development and Operations.
DevOps is not just a team or role name but a set of concepts and practices that enriches collaboration and communication to create successful applications. For cloud-native applications that are developed and deployed as services, SRE teams are essential to achieving DevOps principles. The two critical points to discuss in terms of the relationship between SRE and DevOps are the following:
- SRE teams and their technical background help break down silos and bring development and operations together.
- Development and operations teams need to monitor and measure performance and take action for active services.
In the old days, it took months, even years, to develop a software application, and it was reasonably easy to observe its behavior. Today, software is produced and delivered as small services by SRE teams with both development and operational experience. In addition, software applications consist of microservices that have minimal interdependency with other microservices. This means that cloud-native applications and SRE teams require overall observability of their systems, which is not the focus of traditional monitoring systems.
In the following section, we will discuss the characteristics of a cloud-native monitoring system for SREs.
Characteristics of monitoring for SREs
For today’s rapidly evolving infrastructure and applications, SRE teams require a novel way of monitoring. When it comes to strategies, successful companies and SRE teams have adopted the following characteristics.
Fig. 1. Characteristics of monitoring for SREs
Cloud-native microservices are distributed over clusters of servers and run independently. Cloud orchestrators, such as Kubernetes, can autoscale them, move them to other nodes, and restart them if they become unresponsive.
In order to know what is going on in your system, you need to collect metrics, logs, and traces to create overall observability. The monitoring system you implement or use should be able to collect data from all instances and store them. Then, you can create dashboards to combine data from multiple sources and find correlations. For example, if you see a long response time in your frontend application, you also need to check the response time of your backend application and even your database to find the root cause. So, collecting metrics, logs, and traces is essential to creating the observability you need of your entire application.
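As a minimal sketch of that kind of correlation, the Python below takes latency samples for three services and, for the requests where the frontend was slow, reports which downstream service was slowest. The service names, the sample values, and the 500 ms threshold are illustrative assumptions, not data from a real system; in practice the samples would come from whatever metrics or tracing backend you use.

```python
from statistics import mean

# Hypothetical latency samples in milliseconds. Index i in every list refers
# to the same request, as if the values had been joined on a trace ID.
SAMPLES = {
    "frontend": [120, 135, 900, 110, 950],
    "backend": [80, 90, 850, 75, 900],
    "database": [20, 25, 30, 22, 28],
}


def slow_requests(latencies, threshold_ms):
    """Return the positions of requests slower than threshold_ms."""
    return [i for i, value in enumerate(latencies) if value > threshold_ms]


def likely_bottleneck(samples, entry="frontend", threshold_ms=500):
    """Name the downstream service with the highest mean latency during the
    entry service's slow requests, or None if no request was slow."""
    slow = slow_requests(samples[entry], threshold_ms)
    if not slow:
        return None
    downstream = {
        name: mean(values[i] for i in slow)
        for name, values in samples.items()
        if name != entry
    }
    return max(downstream, key=downstream.get)


if __name__ == "__main__":
    print(likely_bottleneck(SAMPLES))
```

With these sample values the function points at the backend rather than the database. A real root cause analysis would also draw on logs and traces; this only shows why the data from all services has to be stored in a form that can be joined.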
SLO, SLA & SLI
Metrics for SRE should reflect the business objectives, as SRE teams are the "owners" of services in production. The industry has defined and adopted three essential indicators for bringing business objectives and software applications together:
- Service-Level Agreement (SLA): The agreement between a provider and the user for the performance of a service
- Service-Level Objective (SLO): The service provider's goal in terms of performance
- Service-Level Indicator (SLI): The measurement the service provider uses to track performance against the goal
In practice, SRE teams first define the SLI as a set of metrics measured over a period of time and then set a target value for it, which is the SLO. They then keep the services up and reliable to meet that SLO. The SLA, meanwhile, is a promise to external users and should be lower, i.e., more achievable, than the SLO.
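To make the relationship between the three terms concrete, here is a small Python sketch that treats the SLI as a measured availability ratio and compares it with an internal SLO and a looser external SLA. The function names, the request counts, and the 99.9% and 99.5% targets are assumptions chosen for illustration only.

```python
def availability_sli(successful_requests, total_requests):
    """SLI: the fraction of requests served successfully in the window."""
    if total_requests == 0:
        # Assumption: a window without traffic is treated as fully available.
        return 1.0
    return successful_requests / total_requests


def evaluate(sli, slo, sla):
    """Compare a measured SLI with the internal SLO and the external SLA."""
    if sla > slo:
        raise ValueError("the SLA should not be stricter than the SLO")
    if sli >= slo:
        return "meets SLO"
    if sli >= sla:
        return "misses SLO, still within SLA"
    return "breaches SLA"


if __name__ == "__main__":
    # Hypothetical measurement window and targets.
    sli = availability_sli(successful_requests=998_500, total_requests=1_000_000)
    print(sli, evaluate(sli, slo=0.999, sla=0.995))
```

The gap between the two targets is the point of the design: the team gets a warning when the SLO is missed, while the promise made to users in the SLA is still intact.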
Whether building a monitoring system from scratch or redesigning it for your SRE team, you should include SLIs and SLOs as part of your system requirements. The values will help you track the overall performance of your applications and benchmark them for enhancements.
Improved productivity and automation
SRE teams combine software development experience with operational knowledge. In other words, they are expected to automate human knowledge by creating tools and procedures. Monitoring for SRE teams should thus find productivity bottlenecks in the system and potential automation opportunities. Improved productivity and automation make systems more reliable and also allow SRE teams to focus on more productive tasks than restarting nodes or pinging servers.
Alert and incident management
Alerts and incidents are the critical events you want to keep to a minimum. There are two essential aspects of alert and incident management for SRE monitoring:
- Incident action plan: SRE teams focus on automation and on converting human expertise into software as much as possible. When an alert or incident occurs, the SRE monitoring system should indicate where to look and which actions to take, typically through playbooks.
- Data collection: Monitoring systems for SREs should collect and store as much information as possible about the previous alerts and incidents. This data is invaluable for analyzing previous incidents and making a complete root cause analysis (RCA).
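The two aspects above can be sketched together: an alert is mapped to a playbook, and everything that happens afterwards is recorded for later analysis. The alert names, playbook steps, and the `IncidentRecord` structure below are hypothetical; this is one possible shape, not the interface of any particular incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical playbooks keyed by alert name; real ones would typically live
# in a version-controlled repository or wiki.
PLAYBOOKS = {
    "HighLatency": [
        "Check the service dashboard for the affected endpoints",
        "Compare latency with the most recent deployment time",
        "Roll back the deployment if the two line up",
    ],
}
FALLBACK_STEPS = ["No playbook found: escalate to the on-call engineer"]


@dataclass
class IncidentRecord:
    """What gets stored about an incident for later root cause analysis."""
    alert_name: str
    service: str
    opened_at: datetime
    playbook_steps: list
    timeline: list = field(default_factory=list)

    def log(self, message, now=None):
        """Append a timestamped note so the sequence of events is preserved."""
        self.timeline.append((now or datetime.now(timezone.utc), message))


def open_incident(alert_name, service, now=None):
    """Create a record and attach the matching playbook, if one exists."""
    steps = PLAYBOOKS.get(alert_name, FALLBACK_STEPS)
    return IncidentRecord(
        alert_name=alert_name,
        service=service,
        opened_at=now or datetime.now(timezone.utc),
        playbook_steps=list(steps),
    )
```

Calling `open_incident("HighLatency", "checkout")` returns a record that already carries the steps to follow, and each `log(...)` call adds to the timeline that a later RCA can draw on.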
Chaos engineering practices
Chaos engineering is a cultural paradigm focusing on the reliability of systems. It actively tries to test assumptions and create unexpected environments to check system reliability. For instance, you can unplug some of the servers or delete some configuration files, creating chaos in the system. Your monitoring system should be actively testing the reliability of running applications because your SRE team should be ready for every possible foreseen and unforeseen event.
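The idea of a chaos experiment can be shown with a toy simulation: state a hypothesis ("the service stays available if one replica dies"), break something at random, and check whether the hypothesis held. The `ToyService` class below is an in-memory stand-in invented for this sketch; it does not talk to a real cluster or to any chaos engineering tool.

```python
import random


class ToyService:
    """In-memory stand-in for a replicated service, not a real cluster API."""

    def __init__(self, replicas):
        self.healthy = {f"replica-{i}": True for i in range(replicas)}

    def kill(self, name):
        self.healthy[name] = False

    def is_available(self, min_healthy=1):
        return sum(self.healthy.values()) >= min_healthy


def run_experiment(service, rng, min_healthy=1):
    """Kill one randomly chosen healthy replica and report whether the
    hypothesis 'the service stays available' still holds."""
    candidates = sorted(name for name, ok in service.healthy.items() if ok)
    if not candidates:
        raise RuntimeError("no healthy replica left to terminate")
    victim = rng.choice(candidates)
    service.kill(victim)
    return {"victim": victim, "hypothesis_held": service.is_available(min_healthy)}


if __name__ == "__main__":
    print(run_experiment(ToyService(replicas=3), random.Random()))
```

A service with three replicas survives the experiment, while a single-replica service does not, which is exactly the kind of assumption such a test is meant to surface before a real failure does.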
SRE teams work with complex and large applications and infrastructure. And, unfortunately, there is no such thing as bug-free software. This means that some systems will eventually break and create downtime. SRE teams and the DevOps philosophy focus on learning from incidents and enhancing systems accordingly. Therefore, formalizing the learning process and creating an archive of postmortems is essential.
Postmortems have been widely adopted across the industry and consist of an incident description, root cause analysis, and follow-up actions. In SRE monitoring, you need to formalize how to write, store, and manage postmortems. Since incidents are inevitable for cloud-native distributed applications, gleaning valuable information from them is an opportunity you need to exploit.
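One way to formalize postmortems is to give them a fixed structure that mirrors the three parts named above and to store them in a machine-readable form. The following Python sketch assumes a `Postmortem` dataclass and JSON as the archive format; both are illustrative choices rather than an established standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Postmortem:
    title: str
    incident_description: str
    root_cause_analysis: str
    follow_up_actions: list = field(default_factory=list)


def missing_sections(postmortem):
    """List the sections that are still empty, in field order."""
    return [name for name, value in asdict(postmortem).items() if not value]


def to_archive_entry(postmortem):
    """Serialize to JSON so the postmortem can be stored and searched later."""
    return json.dumps(asdict(postmortem), indent=2)
```

A check such as `missing_sections` can gate the archive step, so that a postmortem without follow-up actions is flagged before it is filed.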
Software development has evolved with the growth of cloud-native architectures and microservices. In addition, the philosophy of teams, as well as the structure of organizations, has changed with SRE and DevOps culture. These new and modern paradigms require novel methods of monitoring, as discussed in this article.
There is no single monitoring system on the market today that covers all requirements; that is, there is no silver bullet. Instead, you need to analyze your business requirements and design a monitoring system per your given needs. If you implement the characteristics mentioned above, you will have more reliable and scalable applications with global observability.