Load average: What is it, and what's the best load average for your Linux servers?
If you're using a Linux server, you're probably familiar with the term load average, also called system load. Measuring the load average is critical to understanding how your servers are performing; if a server is overloaded, you need to kill or optimize the processes consuming large amounts of resources, or provide more resources to balance the workload.
But how do you determine if your server has sufficient load capacity, and when should you be worried? Let's dive in and find out.
What is a load average?
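The load average is the average system load on a Linux server over a defined period of time. In other words, it is the demand placed on the server's CPU, measured as the sum of the running and the waiting threads. On Linux, the count also includes threads in uninterruptible sleep, which are usually waiting on disk I/O.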
Typically, the top or the uptime command will provide the load average of your server, with output that includes something like:
load average: 0.50, 1.50, 3.00
These numbers are the averages of the system load over the last one, five, and 15 minutes.
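As a minimal sketch of where these three numbers come from, the Python snippet below parses a line in the format of /proc/loadavg, the file through which Linux exposes them, and then reads the live values with the standard library's os.getloadavg(). The helper name parse_loadavg and the sample line are illustrative assumptions, not part of any tool mentioned in this article.

```python
import os


def parse_loadavg(line):
    """Return the 1-, 5-, and 15-minute load averages from a /proc/loadavg-style line."""
    fields = line.split()
    # The first three fields are the load averages; the remaining fields
    # (running/total tasks and the most recent PID) are ignored here.
    one, five, fifteen = (float(value) for value in fields[:3])
    return one, five, fifteen


# Hypothetical sample line that reuses this article's example values.
sample = "0.50 1.50 3.00 2/345 6789"
print(parse_loadavg(sample))

# On a live Linux (or other Unix) system, the standard library reads the same values.
if hasattr(os, "getloadavg"):
    try:
        print(os.getloadavg())
    except OSError:
        print("load average is not available on this system")
```

The parser is kept separate from the live call so that it can be tried on any saved line without access to the server itself.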
Before getting into how to read the load average output and what each of these values means, let's get into the simplest example: a server with a single core processor.
Breaking down the load
A server with a single core processor is like a single line of customers waiting to get their items billed in a grocery store. During peak hours, there is usually a long line and the waiting time for every individual is also high.
If you're the cashier and want to record the waiting time, one important metric would be the number of people waiting during a particular period of time. If there are no customers waiting, then the wait time is zero. On the other hand, if there is a long line of customers, then the wait time is high.
Applying that to the example load average values (0.5, 1.5, 3.0) on a single core server:
- 0.5 means the counter is lightly used and there is little or no waiting. Between 0.00 and 1.00, there is no need to worry. Your servers are safe!
- 1.5 means the queue is filling up: on average, more work is arriving than the single counter can serve at once. If the average gets any higher, things are going to slow down noticeably.
- 3.0 means there's a considerably long queue waiting, and an extra resource/counter is required to clear the queue faster.
What you want is a load average (queue length) between 0.00 and 1.00. So can we conclude that the ideal load average is at most 1.00, and anything above that is a call to troubleshoot? Well, although that's a safe bet, a more proactive approach is to leave some extra headroom to manage unexpected loads.
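The single core reasoning above can be sketched as a small classifier. This is only an illustration: the function name describe_single_core_load and the 0.70 headroom limit are assumptions made for the example, since the article does not prescribe a specific headroom value.

```python
def describe_single_core_load(load, headroom_limit=0.70):
    """Classify a load average for a single core server.

    headroom_limit is a hypothetical early-warning level below full
    capacity (1.00); choose a value that suits your own workload.
    """
    if load < 0:
        raise ValueError("load average cannot be negative")
    if load <= headroom_limit:
        return "ok"
    if load <= 1.00:
        return "near capacity"
    return "overloaded"


# The example values used in this article.
for value in (0.5, 1.5, 3.0):
    print(value, describe_single_core_load(value))
```

Treating the band just below 1.00 as its own state gives you a warning before the queue actually starts to build.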
Multicores and multiprocessors to the rescue
Are a server with a single quad core processor and a server with four single core processors the same? For load average purposes, roughly yes. The main difference between multicore and multiprocessor is that the former refers to a single CPU having multiple cores, while the latter refers to multiple CPUs. To sum up: one quad core is equivalent to two dual cores, which are equivalent to four single cores.
The load average is relative to the number of cores available in the server, not to how they are spread out over CPUs. This means the load range up to full utilization is 0-1 for a single core, 0-2 for a dual core, 0-4 for a quad core, 0-8 for an octa-core, and so on.
Referring to the cashier example again, a load of 1.00 would mean the capacity is just right on a single core processor, while on a dual core processor, a load of 1.50 would mean one line is full and the other is filling up. Similarly, a load of 5.00 on a quad core processor is something to worry about, while on an octa-core processor, 5.00 means the lines are only partly filled, and there is still room to spare.
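One way to compare these cases is to divide the load average by the number of cores, so that 1.00 always means "exactly at capacity". The sketch below does that for the examples above; the helper name load_per_core is an assumption made for illustration, and the live reading at the end relies on the standard library's os.cpu_count() and os.getloadavg().

```python
import os


def load_per_core(load, cores):
    """Normalize a load average by the number of cores."""
    if cores < 1:
        raise ValueError("a server needs at least one core")
    return load / cores


# (load average, number of cores) pairs taken from the examples above.
examples = [(1.00, 1), (1.50, 2), (5.00, 4), (5.00, 8)]
for load, cores in examples:
    ratio = load_per_core(load, cores)
    state = "over capacity" if ratio > 1.0 else "within capacity"
    print(f"load {load:.2f} on {cores} core(s): {ratio:.3f} per core, {state}")

# Live reading, where the platform supports it.
if hasattr(os, "getloadavg"):
    cores = os.cpu_count() or 1  # cpu_count() may return None
    one_minute = os.getloadavg()[0]
    print(f"current one-minute load per core: {load_per_core(one_minute, cores):.3f}")
```

Because the result is a ratio, the same alert level can be reused across servers with different core counts.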
Role of Site24x7: Monitoring load average
Adding resources to handle a higher load value might add to your infrastructure costs. It's ideal to manage the load efficiently and maintain an optimum level to avoid server performance degradation. Site24x7 Linux Monitoring tracks the load average as one of over 60 performance metrics and provides the 1, 5, and 15 minute average values in an intuitive and easy-to-understand graph.
Further, you can set thresholds and receive a notification when there is a breach. But what if there's a breach in the middle of the night? Site24x7 has a solution for that, too. The monitoring tool provides a set of IT automations for automatic fault resolution.
For example, if the system load threshold is set at 2.90 for a dual core processor, you can upload a server script or add server commands to kill the process consuming the highest CPU when the threshold is breached. This way, without any manual intervention, the issue can be resolved and the mean time to repair (MTTR) is vastly reduced.
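To make the idea concrete, here is a hedged sketch of what such a remediation script might check. It is not Site24x7's automation API: the function names, the sample process listing, and the dry-run behavior are all assumptions for illustration. The listing mimics the output of ps -eo pid,pcpu,comm, and the script only reports which process it would stop rather than killing anything.

```python
def threshold_breached(load, threshold):
    """Return True when the load average is above the configured threshold."""
    return load > threshold


def top_cpu_process(ps_output):
    """Return (pid, cpu_percent, command) for the busiest process.

    Expects text shaped like the output of `ps -eo pid,pcpu,comm`,
    with a header on the first line.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        pid, cpu, command = line.split(None, 2)
        rows.append((int(pid), float(cpu), command))
    return max(rows, key=lambda row: row[1]) if rows else None


# Hypothetical process listing used in place of a live `ps` call.
sample_ps = """\
  PID %CPU COMMAND
  101  3.5 sshd
  202 91.0 worker
  303 12.4 nginx
"""

# 2.90 is the dual core threshold from the example above; 3.10 is a made-up reading.
if threshold_breached(3.10, threshold=2.90):
    pid, cpu, command = top_cpu_process(sample_ps)
    # A real script might call os.kill(pid, signal.SIGTERM) at this point;
    # this sketch deliberately stops at reporting the candidate.
    print(f"would stop PID {pid} ({command}, {cpu}% CPU)")
```

Keeping the script in dry-run mode first lets you confirm it picks the process you expect before letting it act unattended.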
Adding more cores might improve your server performance, but it might also add to your infrastructure spending. Consistently monitoring the load average so that you can manage the existing setup efficiently is a good alternative. Site24x7 Server Monitoring not only monitors the load average, but also provides complementary fault resolution tools to act before a high load average impacts server performance. Sign up for a 30-day free trial now!