Key AWS EKS Metrics to Monitor

Monitoring is critical to an infrastructure team’s work in keeping their platform up and running. Proper monitoring also helps teams understand the system’s behavior: with a good monitoring stack and well-chosen key metrics, they can take corrective measures before a problem impacts users. With Kubernetes clusters, monitoring is crucial for ensuring the smooth operation and availability of applications.

There are various ways to monitor infrastructure and applications. This article describes how to monitor Amazon EKS systematically, following four steps:

1. Identify key performance indicators (KPIs): Start by identifying the indicators for your application and infrastructure that are critical from a technical and business point of view.
2. Select monitoring tools: From the large set of available monitoring tools, focus only on tools that meet your needs and cover all the KPIs defined in the previous step.
3. Define alerts: After the monitoring tool is set up, define alerts for your KPIs. This ensures that you are notified when a KPI crosses a defined threshold, so you can act before your application is affected.
4. Review and optimize: Finally, periodically review all KPIs, tools, and alerts to improve them. Some may no longer be required, or you might need to add new KPIs and alerts to improve your application performance.
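
As a rough illustration of the first and third steps, the sketch below defines a few KPIs with alert thresholds and checks the latest observations against them. The `Kpi` class, the `breached` helper, the KPI names, and the threshold values are all hypothetical and chosen only for illustration; real thresholds depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kpi:
    """One key performance indicator and the threshold that triggers an alert."""
    name: str         # hypothetical KPI name, for illustration only
    threshold: float  # alert when the observed value exceeds this
    unit: str


def breached(kpis, observations):
    """Return the names of KPIs whose observed value exceeds the threshold.

    KPIs without an observation are skipped.
    """
    return [
        kpi.name
        for kpi in kpis
        if kpi.name in observations and observations[kpi.name] > kpi.threshold
    ]


# Illustrative values only.
KPIS = [
    Kpi("node_cpu_utilization", 80.0, "percent"),
    Kpi("node_memory_utilization", 85.0, "percent"),
    Kpi("p95_request_latency", 500.0, "milliseconds"),
]

latest = {"node_cpu_utilization": 91.5, "node_memory_utilization": 62.0}
print(breached(KPIS, latest))  # ['node_cpu_utilization']
```

Keeping KPI definitions in one place like this also makes the periodic review step easier, because obsolete entries can be removed and new ones added without touching the alerting logic.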

AWS EKS

Amazon Elastic Kubernetes Service (EKS) is a managed service for running Kubernetes in the AWS cloud and in on-premises data centers. It allows you to deploy your containerized applications on a Kubernetes cluster without having to manage the underlying cluster components yourself.

Amazon EKS automatically manages the availability and scalability of the Kubernetes control plane nodes. With EKS, you can also integrate containerized applications with other AWS services.

AWS observability services

AWS observability services are tools and services that help monitor, troubleshoot, and optimize your infrastructure and applications. These services, detailed below, allow developers to quickly identify and resolve issues by providing real-time visibility into the performance and availability of their applications and infrastructure.

Amazon CloudWatch collects metrics and logs and raises alarms for applications and infrastructure.
AWS X-Ray allows developers to trace user requests as they flow through the application and helps identify and troubleshoot performance issues and errors.
Amazon Managed Grafana provides scalable Grafana services to visualize and analyze application logs and metrics from various sources. It also offers customizable dashboards to display a wide range of metrics, including metrics on system and application performance as well as business metrics.
Amazon Managed Service for Prometheus is a Prometheus-compatible monitoring service that collects and stores application and infrastructure metrics from various sources.
CloudWatch Agent is a tool that can be installed on an EC2 instance or on-premises server to collect metrics and logs and send them to CloudWatch, where they can be viewed on dashboards. It can help monitor and troubleshoot issues with AWS resources and applications.

What should you monitor in EKS?

There are several key metrics that you should monitor to keep EKS workloads available and scalable. Different use cases call for different metrics. Here are some of the most important ones:

Cluster metrics
- Resource utilization: Monitors CPU and memory utilization of pods and nodes to ensure the cluster has sufficient resources to handle the workload.
- Node and pod availability: Monitors the availability of nodes and pods to ensure applications run smoothly on the cluster and to identify any performance issues.
- Network performance: Monitors network performance to ensure applications can communicate properly and to identify potential bottlenecks.
- Cluster and pod scaling: Monitors the scaling of clusters and pods to ensure it meets workload demand and to detect potential scaling issues.

For an Amazon EKS cluster with Container Insights enabled, a few additional metrics are published in the ContainerInsights namespace in CloudWatch. You can monitor them on your Amazon CloudWatch dashboard.

Metric name	Description
cluster_node_count	Total number of worker nodes in the cluster
node_cpu_utilization	Total percentage of CPU units being used on the nodes in the cluster
node_memory_utilization	Percentage of memory being used by nodes
pod_cpu_utilization	Percentage of CPU units being used by pods
pod_memory_utilization	Percentage of memory being used by pods
node_network_total_bytes	Total number of bytes per second transmitted and received over the network per node in a cluster
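
To make the table concrete, here is a minimal sketch that builds the parameters for a CloudWatch alarm on `node_cpu_utilization`. It assumes Container Insights is enabled and that the metric is published with a `ClusterName` dimension. The function name, the alarm naming scheme, the cluster name, and the threshold, period, and evaluation values are assumptions made for illustration. The block only builds a dictionary and does not call AWS.

```python
def node_cpu_alarm_params(cluster_name, threshold_percent=80.0):
    """Build keyword arguments for an alarm on average node CPU utilization."""
    return {
        "AlarmName": f"{cluster_name}-node-cpu-high",  # hypothetical naming scheme
        "Namespace": "ContainerInsights",
        "MetricName": "node_cpu_utilization",
        "Dimensions": [{"Name": "ClusterName", "Value": cluster_name}],
        "Statistic": "Average",
        "Period": 300,           # seconds per datapoint (illustrative)
        "EvaluationPeriods": 3,  # consecutive breaching periods before alarming
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# "demo-cluster" is a placeholder cluster name.
params = node_cpu_alarm_params("demo-cluster")

# With boto3 installed and credentials configured, the dictionary could be
# passed on as: boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["MetricName"], params["Threshold"])
```

The same pattern applies to the other metrics in the table by changing `MetricName` and the threshold.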

Pod metrics
- CPU and Memory Utilization: Monitors the CPU and memory of the application container to ensure the smooth running of the application.
- Network performance: Monitors the container network to ensure the container has sufficient bandwidth to handle requests and identify potential network bottlenecks.
- Container health: Checks that containers are running as expected, for example that they are not restarting repeatedly or failing their health checks.
- Application performance: Monitors applications running in containers to ensure they run efficiently and meet user needs.

The table below contains some of the metrics you should monitor to ensure the smooth running of the application within the container.

Metric name	Description
`container_cpu_load_average_10s`	Value of container CPU load average over the last 10 seconds
`container_fs_io_time_seconds_total`	Cumulative count of seconds spent doing I/Os
`container_memory_usage_bytes`	Current memory usage (includes all memory, regardless of when it was accessed)
`container_network_receive_errors_total`	Cumulative count of errors encountered while receiving
`container_tasks_state`	Number of tasks in a given state (`sleeping`, `running`, `stopped`, `uninterruptible`, or `iowaiting`)
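
Several of the metrics above, such as `container_network_receive_errors_total`, are cumulative counters, so the raw value matters less than how fast it grows. The sketch below shows one way to turn two samples into a per-second rate. The `counter_rate` helper and the sample values are hypothetical, and the reset handling is a simplification that treats the current value as the increase since the counter restarted.

```python
def counter_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second rate of increase between two samples of a cumulative counter.

    If the counter went backwards (for example after a container restart),
    the current value is treated as the increase since the reset.
    """
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr_value >= prev_value:
        increase = curr_value - prev_value
    else:
        increase = curr_value  # counter reset between the two samples
    return increase / elapsed


# Made-up samples of container_network_receive_errors_total, 60 seconds apart.
print(counter_rate(120, 1000.0, 150, 1060.0))  # 0.5
print(counter_rate(150, 1060.0, 6, 1120.0))    # 0.1 (counter was reset)
```

Alerting on the rate rather than the raw total avoids alarms that fire simply because a container has been running for a long time.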

Event monitoring

Both Kubernetes control plane components and AWS services generate several key events. These events can help identify and fix issues related to infrastructure as well as application performance. Below is a list of some key events that you should monitor.
- Node events: Help ensure that worker nodes run smoothly and that healthy nodes are available. If a node becomes unavailable or unhealthy, these events help you fix the problem and prevent an outage.
- Pod events: Any change in the pod state generates events. These events shed light on the behavior of the application pod. They can also help you understand pod resource requirements.
- Volume events: Provide a stream of events for the creation, deletion, and other changes to persistent volumes and persistent volume claims in the cluster.
- Configuration events: Any change in the configuration of the application can lead to abnormal application behavior. Tracking these events helps developers perform root cause analysis and take appropriate steps to fix or prevent the problem.
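
One simple way to work with such events is to keep only the warnings and count them by object kind and reason. The sketch below assumes event dictionaries shaped like the items returned by `kubectl get events -o json` (fields `type`, `reason`, and `involvedObject.kind`). The `warning_summary` helper and the sample data are hypothetical and written by hand for illustration.

```python
from collections import Counter


def warning_summary(events):
    """Count Warning events by (involved object kind, reason)."""
    counts = Counter()
    for event in events:
        if event.get("type") != "Warning":
            continue  # skip Normal events
        kind = event.get("involvedObject", {}).get("kind", "Unknown")
        counts[(kind, event.get("reason", "Unknown"))] += 1
    return counts


# Hand-written sample, not output from a real cluster.
sample_events = [
    {"type": "Normal", "reason": "Scheduled", "involvedObject": {"kind": "Pod"}},
    {"type": "Warning", "reason": "BackOff", "involvedObject": {"kind": "Pod"}},
    {"type": "Warning", "reason": "BackOff", "involvedObject": {"kind": "Pod"}},
    {"type": "Warning", "reason": "NodeNotReady", "involvedObject": {"kind": "Node"}},
    {"type": "Warning", "reason": "FailedMount", "involvedObject": {"kind": "Pod"}},
]

for (kind, reason), count in warning_summary(sample_events).most_common():
    print(kind, reason, count)
```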

Error monitoring

Errors are inevitable when running applications on a Kubernetes cluster. This is why monitoring errors is key to ensuring they do not significantly impact application and infrastructure availability. Here is a list of key error types.
- API server errors: The API server acts as the brain of the Kubernetes cluster; therefore, it is important to look for failures that can impact the core functions of the cluster. Some of the API server errors to look out for are listed below.
  - Connection failures: Due to network issues, the API server may not be able to handle requests from other components of the Kubernetes cluster.
  - Resource constraints: Due to insufficient resources such as CPU, memory, or storage, the API server may struggle to handle requests.
  - Configuration errors: Misconfiguration of the API server may lead to misbehavior of the cluster or failure to perform a request.
- Worker node errors: Worker nodes are another main component of the Kubernetes cluster. They are responsible for running the entire workload of the user application. An error in a worker node may make the application unavailable. Here are some key errors in worker nodes to watch out for:
  - Hardware failures: These errors can lead to total failure of the worker node. However, EKS worker nodes run on EC2 instances or AWS Fargate, so AWS itself handles the underlying hardware.
  - Network issues: The worker node may be experiencing connectivity issues or network bottlenecks that cause it to produce errors.
  - Resource constraints: The worker node may be running out of resources such as memory, CPU, or disk space, which can cause it to produce errors.
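
As a rough example of turning API server errors into a number you can alert on, the sketch below computes the share of requests that ended with an HTTP 5xx status. It assumes request counts grouped by status code over some time window, for instance derived from the `code` label of the API server's `apiserver_request_total` metric. The `server_error_ratio` helper and the counts are hypothetical.

```python
def server_error_ratio(requests_by_code):
    """Fraction of requests that returned an HTTP 5xx status code."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 0.0  # no traffic in the window, so no errors to report
    errors = sum(
        count
        for code, count in requests_by_code.items()
        if str(code).startswith("5")
    )
    return errors / total


# Made-up request counts for one time window, keyed by status code.
window = {"200": 9400, "201": 350, "404": 200, "500": 30, "503": 20}
print(f"{server_error_ratio(window):.3%}")  # 0.500%
```

A ratio is usually easier to set a threshold on than a raw error count, because it stays comparable as request volume changes.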

Conclusion

Maintaining a healthy and efficient Kubernetes cluster requires proper monitoring. To successfully monitor an EKS cluster, it is crucial to determine the right metrics, select appropriate monitoring tools, establish alerts for the metrics, and consistently evaluate and improve the monitoring system.

AWS offers a variety of observability services, including Amazon CloudWatch, AWS X-Ray, Amazon Managed Grafana, and Amazon Managed Service for Prometheus, to aid in the monitoring, debugging, and optimization of both infrastructure and applications.

Some important metrics to monitor for an EKS cluster include the utilization of cluster resources, the health of pods and nodes, and metrics specific to the applications. Monitoring these metrics helps guarantee the smooth functioning and availability of the applications running on the EKS cluster.

FAQs

1. Which EKS metrics does Site24x7 monitor by default?

Site24x7 automatically collects critical EKS metrics including cluster node count, CPU/memory utilization at both the node and pod levels, network traffic, and container health states.

2. How does Site24x7 handle EKS pod and node events?

Site24x7 uses CloudWatch APIs to collect AWS metrics, including EKS pod and node events. The monitoring agent continuously tracks events from the EKS control plane and worker nodes, generating alerts for critical events like OOMKilled, CrashLoopBackOff, or node unavailability.

3. Can I monitor application performance inside EKS pods with Site24x7?

Yes, you can deploy Site24x7 APM Insight agents alongside your applications within EKS pods to trace distributed transactions, identify code-level errors, and monitor overall application performance.
