EC2 Troubleshooting Guide

Amazon EC2 instances are essential parts of many systems, and any issue with them can cause major disruptions. When an instance goes down, stops responding, or runs slower than expected, every minute counts.

This guide offers a clear, practical approach to diagnosing and fixing common EC2 problems. We'll provide actionable troubleshooting advice that will help get your systems back online quickly. Let’s get started!

EC2 basics

Amazon Elastic Compute Cloud (EC2) is a core service in AWS that provides scalable virtual servers for running application workloads. Users can launch, manage, and scale compute instances based on their specific needs.

Because of its flexibility, pay-as-you-go pricing, and deep integration with other AWS services, EC2 serves a variety of purposes across industries, including:

  • Web hosting: EC2 is often used to host websites and web applications because it enables businesses to scale resources up or down based on traffic demands.
  • Application servers: Many backend services, APIs, and microservices run on EC2 instances.
  • Batch processing: EC2 is also used for large data processing workloads such as log analysis, big data, and AI/ML training.
  • Development and testing: Developers use EC2 to create isolated environments for software testing and development.
  • High-Performance Computing (HPC): Complex simulations, scientific research, and large-scale data analysis are also run on EC2 instances.

Core EC2 components

To become an effective EC2 troubleshooter, you must understand its core components and how they interact.

Instances

An EC2 instance is a virtual server that runs applications. Instances come in different types that are optimized for a wide range of workloads, such as compute-intensive tasks, memory-heavy applications, and storage-heavy processes.

For example, general purpose instances like t3.medium are suitable for web servers and development environments, while mac1.metal instances are for developers looking to build and test macOS applications. Network-optimized instances like m6in.32xlarge are suited to network-intensive workloads, and GPU instances like p5.48xlarge are built for machine learning training and high-performance computing.

Security groups

Security groups are the virtual firewalls that control inbound and outbound traffic to EC2 instances. Inbound rules define which sources can connect to an instance whereas outbound rules specify what connections an instance can initiate. Accidental blocking of SSH/RDP access and overly permissive rules are common issues related to security groups.

Elastic IPs

Elastic IPs are static, public IP addresses that you can allocate and associate with your instances. These IPs don’t change even if an instance is stopped or restarted. They are useful for applications that require a fixed IP for whitelisting or DNS records.

Key pairs

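Key pairs are cryptographic credentials used to secure access to EC2 instances. On Linux, the key pair authenticates SSH logins directly. On Windows, the private key is used to decrypt the initial administrator password, which you then use to connect over RDP. The public key is stored on the instance whereas the private key is held by the user. A key pair misconfiguration or loss can lock you out of your instance.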

Volumes

Elastic Block Store (EBS) volumes provide persistent storage for EC2 instances. They function like hard drives that retain data when an instance stops. On termination, a volume is kept or deleted according to its DeleteOnTermination setting, and root volumes are deleted by default. Supported volume types include General Purpose SSD (gp3, gp2), Provisioned IOPS SSD (io2, io1), Throughput Optimized HDD (st1), and Cold HDD (sc1).

Launch configurations and templates

Launch configurations and templates define pre-configured settings for launching instances. They ensure consistency and reduce setup time. It’s important to note that AWS has deprecated launch configurations. As of October 1, 2024, new AWS accounts can no longer create them. Instead, AWS recommends using launch templates, as they offer greater flexibility and better support for the latest features.

The importance of prompt troubleshooting and resolution of EC2 issues

Next, let’s discuss why rapid troubleshooting of EC2 issues is crucial.

Minimizing downtime and business disruptions

Unresolved EC2 issues can cause service disruptions that affect users and business operations. For example, if an e-commerce application running on EC2 fails to auto-scale, the website may become unresponsive or go down.

Ensuring security

Misconfigurations in EC2 instances can expose sensitive data and lead to security vulnerabilities. For example, if an engineer accidentally opens port 22 (SSH) to the public in a security group, unauthorized individuals could potentially gain remote access to the instance. This would allow them to execute commands, modify files, or even exfiltrate sensitive data.

Maintaining application performance

Poorly performing instances can lead to slow load times and system instability. For example, if an instance has insufficient memory, it can cause applications to swap to disk, which drastically slows down performance.

Reducing operational costs

Unoptimized EC2 instances can lead to higher-than-necessary AWS costs. For example, if you leave idle instances running when they're not needed, or don’t right-size instances based on actual usage patterns, it can result in significant and avoidable expenses.

Tools for EC2 troubleshooting

When troubleshooting EC2 issues, having the right tools can make all the difference. Here are the ones worth keeping at hand:

Amazon CloudWatch

Amazon CloudWatch is a monitoring and logging service that provides real-time insights into EC2 instance performance. Engineers can use it to track system metrics, detect anomalies, perform root cause analysis, and set up automated responses to issues.

For example, if an EC2 instance becomes unresponsive, CloudWatch can help diagnose the issue by showing its CPU usage and other metrics. If the CPU is at 100%, it can be a sign of excessive load. If the instance belongs to an Auto Scaling group, a CloudWatch alarm can trigger a scaling policy that launches additional instances to handle the demand and prevent downtime.
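
As a minimal sketch of the CPU check described above, the AWS CLI can pull the CPUUtilization metric for a single instance. The function name cpu_stats, the instance ID, and the timestamps are placeholders chosen for illustration, and the command assumes the AWS CLI is installed with credentials and a default Region configured.

```shell
# cpu_stats INSTANCE_ID START END
# Prints the 5-minute average and maximum CPU utilization for one instance.
# START and END are ISO 8601 timestamps, e.g. 2025-01-01T00:00:00Z.
cpu_stats() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 300 \
    --statistics Average Maximum \
    --output table
}

# Example call (hypothetical instance ID):
# cpu_stats i-1234567890abcdef0 2025-01-01T00:00:00Z 2025-01-01T01:00:00Z
```

A sustained maximum near 100% across several periods suggests load rather than a one-off spike.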

AWS Management Console

The AWS Management Console is the main web interface for managing EC2 instances and other AWS services. Here are some ways it can come in handy while troubleshooting:

  • Displays instance health and failure diagnostics.
  • Allows you to reboot, stop, or terminate instances as needed.
  • Helps check firewall rules and connectivity settings.
  • Enables resizing, reattaching, or restoring storage volumes in case of failures.

For example, if an EC2 instance is unreachable, you can use the console to check its status, view system logs, and review recent events. You can also quickly verify the instance's network configuration, including security group rules and network interfaces, to identify potential connectivity issues.

Site24x7’s EC2 monitoring

Site24x7 is a purpose-built monitoring tool that provides deeper insights into EC2 performance and availability. It collects data using its own agents while also aggregating CloudWatch metrics into a centralized dashboard for a complete view of your EC2 instances. You can set custom thresholds and alerts to stay ahead of potential issues.

If you're looking for a single source of truth for all critical EC2 performance data, Site24x7 can be a great choice. For example, if an EC2 instance experiences high memory usage, Site24x7 can detect the spike and raise an alert before it impacts performance.

EC2 issue troubleshooting guide

Here’s a list of the most common EC2 issues along with troubleshooting advice on how to solve them:

Instance launch failures

Instances fail to launch due to configuration errors, resource limits, or AWS service issues.

Symptoms

  • Instance stuck in "Pending" state.
  • Error messages like "InstanceLimitExceeded" or "InsufficientInstanceCapacity".

Troubleshooting

  • Check your EC2 quotas in Service Quotas to verify that you haven't exceeded your instance limits. On-Demand limits are counted in vCPUs per instance family, not in number of instances.
  • Check that your selected instance type is available in the chosen Availability Zone.
  • Ensure that your IAM permissions allow instance creation.
  • Check if your Amazon Machine Image (AMI) is valid and available.
  • If you are launching from an Auto Scaling group, inspect the scaling policies and launch template settings.
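
The first two checks above can be sketched with the AWS CLI. The function names are made up for this example. L-1216C47A is understood to be the quota code for Running On-Demand Standard instances, so confirm it in the Service Quotas console before relying on it.

```shell
# Prints the current vCPU quota for On-Demand Standard instance families.
# Assumption: L-1216C47A is the quota code for "Running On-Demand Standard
# (A, C, D, H, I, M, R, T, Z) instances"; other families use other codes.
ondemand_vcpu_quota() {
  aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --query 'Quota.Value' \
    --output text
}

# zones_offering INSTANCE_TYPE REGION
# Lists the Availability Zones in REGION that offer INSTANCE_TYPE.
zones_offering() {
  aws ec2 describe-instance-type-offerings \
    --region "$2" \
    --location-type availability-zone \
    --filters "Name=instance-type,Values=$1" \
    --query 'InstanceTypeOfferings[].Location' \
    --output text
}

# Example calls:
# ondemand_vcpu_quota
# zones_offering t3.medium us-east-1
```

Note that an offering only shows where a type is sold. An "InsufficientInstanceCapacity" error can still occur in a listed zone when capacity is temporarily exhausted, in which case retrying later or in another zone is the usual workaround.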

Incorrect IAM role permissions

An EC2 instance is unable to perform certain actions due to missing or incorrect IAM role permissions.

Symptoms

  • Applications fail when trying to access AWS services (e.g., S3, DynamoDB).
  • Error messages related to "Access Denied" or missing permissions in logs.

Troubleshooting

  • Open the AWS Console to verify the IAM role attached to the instance.
  • Check the IAM role’s policies to ensure that all the necessary permissions are granted.
  • Use the AWS IAM Policy Simulator to test policies and confirm access rights.
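  • IAM policy changes normally take effect within moments, and the instance's temporary credentials are rotated automatically, so a reboot is rarely required. If the application caches credentials, restart it so that it picks up fresh ones.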
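
The same checks can be run from the command line. The sketch below uses hypothetical function names, and the role ARN, action, and resource ARN are values you supply. The simulation evaluates the role's identity-based policies, so a resource-based policy (such as an S3 bucket policy) can still change the real outcome.

```shell
# Run on the instance itself: prints the ARN that its credentials resolve to,
# which confirms which role (if any) the instance is actually using.
whoami_aws() {
  aws sts get-caller-identity --query 'Arn' --output text
}

# can_role ROLE_ARN ACTION RESOURCE_ARN
# Prints the simulated decision: allowed, implicitDeny, or explicitDeny.
can_role() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names "$2" \
    --resource-arns "$3" \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}

# Example call (hypothetical ARNs):
# can_role arn:aws:iam::123456789012:role/app-role s3:GetObject arn:aws:s3:::my-bucket/report.csv
```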

Instance connectivity problems

Instances are running but cannot be accessed via SSH, RDP, or other network protocols.

Symptoms

  • "Connection timed out" or "Permission denied" errors when connecting.
  • Cannot ping the instance’s public or private IP.

Troubleshooting

  • Verify that the security group rules allow inbound SSH (port 22) or RDP (port 3389) traffic.
  • Check the network ACLs to ensure that they are not blocking the connection.
  • If you are trying to connect from an external network, confirm that the EC2 instance has a public IP.
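  • If SSH is unavailable, connect with AWS Systems Manager Session Manager instead, or reboot the instance if it appears to be hung.
  • If the instance ships its operating system logs to CloudWatch Logs, inspect them for failures related to SSH or RDP connections. Otherwise, check the instance's system log from the console.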
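
A rough sketch of the first and third checks with the AWS CLI follows. The function names and IDs are placeholders for illustration.

```shell
# sg_inbound_rules SECURITY_GROUP_ID
# Prints one line per inbound rule: protocol, from-port, to-port, source CIDRs.
sg_inbound_rules() {
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query "SecurityGroups[0].IpPermissions[].[IpProtocol,FromPort,ToPort,join(',',IpRanges[].CidrIp)]" \
    --output text
}

# instance_public_ip INSTANCE_ID
# Prints the public IPv4 address, or None if the instance has none.
instance_public_ip() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' \
    --output text
}

# Example calls (hypothetical IDs):
# sg_inbound_rules sg-0123456789abcdef0
# instance_public_ip i-1234567890abcdef0
```

If neither check explains the failure, aws ssm start-session --target followed by the instance ID opens a shell without SSH, provided the Session Manager plugin is installed locally and the instance runs the SSM Agent with a suitable instance profile.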

Performance bottlenecks

Instances run slower than expected due to high resource utilization or misconfiguration.

Symptoms

  • High CPU, memory, or disk usage.
  • Applications respond slowly or become unresponsive.

Troubleshooting

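  • Use Amazon CloudWatch to monitor CPU, memory, and disk utilization. CPU, network, and EBS metrics are collected by default; memory and disk-space metrics require the CloudWatch agent on the instance.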
  • Resize to a larger instance type if resources are insufficient.
  • Enable enhanced networking to improve network performance.
  • Check for overloaded EBS volumes and consider using Provisioned IOPS (input/output operations per second) volumes.
  • Optimize application configurations and use Auto Scaling if workload spikes are expected.
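
Resizing, mentioned in the list above, is a stop, modify, start sequence. The following is a minimal sketch with a hypothetical function name. It assumes an EBS-backed instance and accepts downtime for the duration of the change.

```shell
# resize_instance INSTANCE_ID NEW_TYPE
# Stops the instance, changes its type, and starts it again.
# Each step runs only if the previous one succeeded.
# Note: a public IP that is not an Elastic IP changes across stop/start.
resize_instance() {
  aws ec2 stop-instances --instance-ids "$1" &&
  aws ec2 wait instance-stopped --instance-ids "$1" &&
  aws ec2 modify-instance-attribute --instance-id "$1" \
    --instance-type "{\"Value\": \"$2\"}" &&
  aws ec2 start-instances --instance-ids "$1"
}

# Example call (hypothetical instance ID):
# resize_instance i-1234567890abcdef0 m5.large
```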

Elastic Block Store (EBS) issues

EBS volumes are not attaching, mounting, or performing as expected.

Symptoms

  • "Volume in use" or "Volume not found" errors.
  • Slow disk read/write speeds.

Troubleshooting

  • Check if the EBS volume is attached to the instance correctly.
  • Use lsblk or df -h to confirm that the volume is recognized and mounted.
  • Run sudo fsck on Linux (against an unmounted filesystem) or chkdsk on Windows to detect filesystem corruption.
  • If using gp3 or io1/io2 volumes, verify that the IOPS and throughput settings match workload demands.
  • Resize the volume if you find that storage capacity is leading to performance issues.
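
The attachment and performance checks above can be sketched as follows. The function names and the example values are assumptions for illustration; gp3 volumes start from a baseline of 3,000 IOPS and 125 MiB/s.

```shell
# volume_state VOLUME_ID
# Prints state, attached instance, device name, type, IOPS, and throughput.
volume_state() {
  aws ec2 describe-volumes \
    --volume-ids "$1" \
    --query 'Volumes[0].[State,Attachments[0].InstanceId,Attachments[0].Device,VolumeType,Iops,Throughput]' \
    --output text
}

# raise_gp3_performance VOLUME_ID IOPS THROUGHPUT_MIBPS
# Changes the provisioned IOPS and throughput of a gp3 volume in place.
raise_gp3_performance() {
  aws ec2 modify-volume --volume-id "$1" --iops "$2" --throughput "$3"
}

# On the instance, lsblk shows whether an attached volume is also mounted:
# lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT

# Example calls (hypothetical volume ID):
# volume_state vol-0123456789abcdef0
# raise_gp3_performance vol-0123456789abcdef0 6000 250
```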

Networking issues

Instances cannot communicate with each other or external services.

Symptoms

  • Packets are dropped and/or latency is high.
  • Instances in the same VPC (Virtual Private Cloud) cannot connect.

Troubleshooting

  • Check security groups and network ACLs to make sure that traffic is not blocked.
  • Ensure that subnets are correctly configured and instances have the right route tables.
  • If domain resolution is failing, verify your DNS configuration.
  • If using a VPN or Direct Connect, confirm that the tunnels or connections are active.
  • Use the AWS VPC Reachability Analyzer to check if network paths are valid.
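
Reachability Analyzer can also be driven from the AWS CLI. The sketch below uses hypothetical function names and assumes TCP traffic between two resources in the same account. Analyses run asynchronously and are billed per run.

```shell
# check_path SOURCE_ID DESTINATION_ID PORT
# SOURCE_ID and DESTINATION_ID can be instance or network interface IDs.
# Creates a path, starts an analysis, and prints the analysis ID.
check_path() {
  path_id=$(aws ec2 create-network-insights-path \
    --source "$1" --destination "$2" \
    --protocol tcp --destination-port "$3" \
    --query 'NetworkInsightsPath.NetworkInsightsPathId' \
    --output text) || return 1
  aws ec2 start-network-insights-analysis \
    --network-insights-path-id "$path_id" \
    --query 'NetworkInsightsAnalysis.NetworkInsightsAnalysisId' \
    --output text
}

# path_found ANALYSIS_ID
# Prints whether the finished analysis found a valid network path.
path_found() {
  aws ec2 describe-network-insights-analyses \
    --network-insights-analysis-ids "$1" \
    --query 'NetworkInsightsAnalyses[0].NetworkPathFound' \
    --output text
}

# Example calls (hypothetical IDs):
# check_path i-1234567890abcdef0 i-0fedcba0987654321 443
# path_found nia-0123456789abcdef0
```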

Auto scaling failures

Instances fail to launch or terminate within an Auto Scaling group.

Symptoms

  • Auto Scaling group does not maintain the desired number of instances.
  • Scaling policies do not trigger when expected.

Troubleshooting

  • Verify that the launch template or launch configuration is defined correctly.
  • Check IAM roles to ensure that the Auto Scaling group has permissions to launch instances.
  • Review CloudWatch alarms and scaling policies to confirm that they are configured properly.
  • Ensure that the selected instance type is available in the Availability Zone.
  • Check termination policies if instances are not being removed correctly.
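
The activity history of an Auto Scaling group usually states why a launch or termination failed. A minimal sketch for reading it, with hypothetical function names and a group name you supply:

```shell
# asg_recent_activity ASG_NAME
# Prints start time, status code, and status message for the first ten
# scaling activities returned for the group.
asg_recent_activity() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --query 'Activities[:10].[StartTime,StatusCode,StatusMessage]' \
    --output text
}

# asg_capacity ASG_NAME
# Prints the desired capacity followed by the number of instances in the group.
asg_capacity() {
  aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$1" \
    --query 'AutoScalingGroups[0].[DesiredCapacity,length(Instances)]' \
    --output text
}

# Example calls (hypothetical group name):
# asg_recent_activity web-asg
# asg_capacity web-asg
```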

Elastic Load Balancer (ELB) issues

Load balancer fails to distribute traffic as expected.

Symptoms

  • Some instances receive no traffic.
  • High latency or dropped connections.

Troubleshooting

  • Confirm that target instances are healthy in the ELB target group.
  • Verify listener rules to ensure that traffic is directed correctly.
  • Check security groups and subnets for misconfigurations.
  • Enable access logs to diagnose traffic patterns and issues.
  • Test with AWS Reachability Analyzer to validate network paths.
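
Target health can be read directly from the AWS CLI. This sketch assumes an Application or Network Load Balancer (the elbv2 API); the function name and ARN are placeholders.

```shell
# target_health TARGET_GROUP_ARN
# Prints one line per target: ID, health state, and the reason code
# reported for targets that are not healthy.
target_health() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
    --output text
}

# Example call: pass the ARN of your own target group.
# target_health "$TARGET_GROUP_ARN"
```

A target that never receives traffic is often listed here as unhealthy, and the reason code narrows down whether the health check is timing out or returning an unexpected response.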

Instance termination problems

Instances do not terminate when requested.

Symptoms

  • Instance remains in the "shutting-down" state indefinitely.
  • "Instance termination protection enabled" error appears.

Troubleshooting

  • Check if termination protection is indeed enabled. If it’s enabled, consult with your colleagues before disabling it.
  • Verify that the instance is not part of an Auto Scaling group that automatically relaunches it.
  • Retry the termination from the AWS CLI, replacing the example instance ID with your own:

aws ec2 terminate-instances --instance-ids i-1212351890abcddf0

  • Check CloudTrail logs to see if termination attempts are being blocked by any permissions.
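
The protection and Auto Scaling checks above can be sketched as follows. The function names are hypothetical, and disabling protection should follow whatever approval process your team uses.

```shell
# termination_protected INSTANCE_ID
# Prints True if termination protection is enabled, otherwise False.
termination_protected() {
  aws ec2 describe-instance-attribute \
    --instance-id "$1" \
    --attribute disableApiTermination \
    --query 'DisableApiTermination.Value' \
    --output text
}

# disable_termination_protection INSTANCE_ID
disable_termination_protection() {
  aws ec2 modify-instance-attribute \
    --instance-id "$1" \
    --no-disable-api-termination
}

# owning_asg INSTANCE_ID
# Prints the Auto Scaling group the instance belongs to, or None.
owning_asg() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[0].AutoScalingGroupName' \
    --output text
}
```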

Elastic IP assignment issues

Elastic IP addresses fail to attach to instances.

Symptoms

  • Elastic IP assignment fails with an error.
  • Traffic still routes to the old IP after reassignment.

Troubleshooting

  • Ensure that the Elastic IP is not already in use by another instance.
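  • Verify that the instance's subnet has a route to an internet gateway. The instance does not need an existing public IPv4 address, but an Elastic IP is only reachable from the internet through an internet gateway.
  • Use the disassociate-address command to detach the IP, then associate-address to reassign it. (The allocate-address command only reserves a new Elastic IP.)
  • If the operating system still shows stale network settings, restart its network service (for example, sudo systemctl restart networking on Debian-based systems; the service name differs between distributions).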
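
Moving an Elastic IP between instances takes two calls: disassociate-address, then associate-address. A minimal sketch, with hypothetical function names and IDs:

```shell
# eip_association PUBLIC_IP
# Prints the allocation ID, the association ID, and the instance currently
# holding the address (None where the address is unassociated).
eip_association() {
  aws ec2 describe-addresses \
    --public-ips "$1" \
    --query 'Addresses[0].[AllocationId,AssociationId,InstanceId]' \
    --output text
}

# move_eip ASSOCIATION_ID ALLOCATION_ID NEW_INSTANCE_ID
# Detaches the address from its current instance, then attaches it to the new one.
move_eip() {
  aws ec2 disassociate-address --association-id "$1" &&
  aws ec2 associate-address --allocation-id "$2" --instance-id "$3"
}

# Example calls (hypothetical values):
# eip_association 203.0.113.10
# move_eip eipassoc-0123456789abcdef0 eipalloc-0123456789abcdef0 i-1234567890abcdef0
```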

Snapshot and backup failures

EBS snapshots or AMI backups fail to create or restore.

Symptoms

  • You see "Snapshot creation failed" or similar errors.
  • AMI restore takes too long or fails.

Troubleshooting

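  • Confirm that the EBS volume is in the "in-use" or "available" state. Snapshots can be taken from attached volumes, but pausing writes (or stopping the instance, for a root volume) gives a more consistent snapshot.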
  • Check IAM permissions for ec2:CreateSnapshot and ec2:CreateImage.
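  • Review CloudTrail logs for failed CreateSnapshot or CreateImage calls, and check each snapshot's state (pending, completed, or error) to confirm that it finished.
  • If restores are slow, keep in mind that a volume created from a snapshot loads its blocks lazily on first access. Reading every block once, or enabling fast snapshot restore, removes that initial latency; raising provisioned IOPS alone does not fix a failing snapshot.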
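
Snapshot state and progress can be listed per volume. The function name and volume ID below are placeholders for illustration.

```shell
# volume_snapshots VOLUME_ID
# Prints snapshot ID, state, progress, and start time for every snapshot
# of the volume that is owned by the current account.
volume_snapshots() {
  aws ec2 describe-snapshots \
    --owner-ids self \
    --filters "Name=volume-id,Values=$1" \
    --query 'Snapshots[].[SnapshotId,State,Progress,StartTime]' \
    --output text
}

# Example call (hypothetical volume ID):
# volume_snapshots vol-0123456789abcdef0
```

A snapshot that stays in the pending state is still copying data, whereas the error state means it has to be taken again.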

Kernel panics or OS-level issues

The instance experiences kernel panics or other operating system–level failures.

Symptoms

  • Instance becomes unresponsive and requires a hard reboot.
  • System logs show kernel panic messages.

Troubleshooting

  • Use the AWS Management Console to access system logs and identify the cause of the panic.
  • Ensure that the instance is running the latest kernel version and drivers.
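  • Check the EC2 status checks. A failed system status check points to a problem with the underlying host, while a failed instance status check points to the operating system or its configuration.
  • If possible, use the EC2 Serial Console, or attach the root volume to a working rescue instance, to diagnose and fix the issue.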
  • If the issue persists, terminate the instance and launch a new one from a backup or AMI.
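
The system log and status checks are also available from the AWS CLI, which helps when the console is slow or the check needs to be scripted. The function names below are hypothetical, and the --latest option is only supported on Nitro-based instances.

```shell
# status_checks INSTANCE_ID
# Prints the instance state, the system status check, and the instance
# status check (for example: running ok impaired).
status_checks() {
  aws ec2 describe-instance-status \
    --instance-ids "$1" \
    --include-all-instances \
    --query 'InstanceStatuses[0].[InstanceState.Name,SystemStatus.Status,InstanceStatus.Status]' \
    --output text
}

# console_log INSTANCE_ID
# Prints the most recent console output, where kernel panic traces appear.
console_log() {
  aws ec2 get-console-output \
    --instance-id "$1" \
    --latest \
    --query 'Output' \
    --output text
}

# Example: search the log for a panic message (hypothetical instance ID).
# console_log i-1234567890abcdef0 | grep -i -B5 -A20 'kernel panic'
```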

Preventative measures and best practices

To avoid several of the aforementioned issues, follow these best practices:

Security and access management

  • Restrict IAM roles and policies to grant only the necessary permissions.
  • Remove unnecessary open ports and restrict inbound traffic to known IPs.
  • Store private keys securely and rotate them periodically. Never hardcode private keys in applications.
  • Use MFA (multi-factor authentication) for IAM users with EC2 access to prevent unauthorized access.
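
To find candidates for the open-port cleanup above, security groups can be filtered by port and source range. This is a rough sketch with a hypothetical function name, and its output is a starting list rather than a complete audit.

```shell
# open_to_world PORT
# Lists security groups with an inbound rule starting and ending at PORT
# and an inbound rule open to 0.0.0.0/0.
# Limitations: the filters are matched per group, not per rule, so verify
# each hit. Wider port ranges (such as 0-65535) and IPv6 (::/0) are missed.
open_to_world() {
  aws ec2 describe-security-groups \
    --filters "Name=ip-permission.from-port,Values=$1" \
              "Name=ip-permission.to-port,Values=$1" \
              "Name=ip-permission.cidr,Values=0.0.0.0/0" \
    --query 'SecurityGroups[].[GroupId,GroupName]' \
    --output text
}

# Example calls: SSH and RDP.
# open_to_world 22
# open_to_world 3389
```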

Instance configuration and maintenance

  • Use launch templates instead of launch configurations, as the latter are deprecated and the former unlock access to the latest features.
  • Enable detailed monitoring in CloudWatch. This provides better insights into instance performance.
  • Regularly update the base Amazon Machine Images (AMIs) and installed software to patch security vulnerabilities.
  • Use the latest instance types. Newer instance types provide better performance and efficiency.

Storage and backup strategies

  • Use Amazon Data Lifecycle Manager (DLM) to automate snapshot creation and retention.
  • Deploy instances across multiple Availability Zones (AZs) to improve fault tolerance.
  • Prevent performance issues by setting CloudWatch alerts for disk utilization and IOPS limits.

Networking and connectivity

  • Use Elastic Load Balancing (ELB) for high availability. It allows you to distribute traffic across multiple instances to prevent overload.
  • Capture logs for troubleshooting network connectivity issues. For example, you can capture VPC flow logs, which record information about the IP traffic going to and from network interfaces in your VPC.
  • Regularly review route tables and NAT gateway settings to ensure traffic is routed as intended and that instances have the necessary internet access.
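
As one way to capture VPC flow logs, the sketch below records only rejected traffic for a VPC and delivers it to an S3 bucket. The function name, VPC ID, and bucket ARN are assumptions for illustration, and the bucket must already exist.

```shell
# log_rejected_traffic VPC_ID S3_BUCKET_ARN
# Creates a flow log that records rejected traffic only, which keeps the
# log volume down while still showing what security groups or NACLs block.
log_rejected_traffic() {
  aws ec2 create-flow-logs \
    --resource-type VPC \
    --resource-ids "$1" \
    --traffic-type REJECT \
    --log-destination-type s3 \
    --log-destination "$2"
}

# Example call (hypothetical IDs):
# log_rejected_traffic vpc-0123456789abcdef0 arn:aws:s3:::my-flow-log-bucket
```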

Scaling and cost optimization

  • Configure Auto Scaling Groups (ASGs) to handle fluctuating workloads efficiently.
  • Optimize costs by using a mix of On-Demand, Spot, and Reserved Instances.
  • Avoid unexpected costs by setting alerts for EC2 usage and spending limits.
  • For critical workloads, deploy instances in multiple regions for better disaster recovery.

Conclusion

Amazon EC2 is a core component of many modern IT infrastructures. To keep EC2 instances running smoothly, it’s crucial to promptly troubleshoot any issues. We hope that this guide has provided you with the foundational knowledge and practical tools necessary to do so.

If you want to have complete visibility into all critical EC2 metrics, don’t forget to try out the AWS EC2 monitoring solution by Site24x7.
