High System Load
General Information
Troubleshooting high system load.
Checklist
Understanding System Load
Load average is shown by both “uptime” and “top”. Each reports three values: the average load over the last 1, 5, and 15 minutes.
Traffic/Bridge analogy
Reposting the takeaway here in case the source goes away.
Source: ScoutBlog Load Average
On a Single Core CPU System
The server is a bridge operator.
Cars are processes.
Cars on the bridge are using CPU time.
Cars waiting to go on the bridge are waiting for CPU time (the bridge is backed up, so they cannot get CPU time immediately).
Load of 0.00 means there is no traffic on the bridge.
Load of 1.00 means the bridge is at capacity. No more cars (processes) can get CPU time at this very second without waiting.
Load over 1.00 means there is a backup.
Multi-CPU/Core Systems
Load is relative to how many CPUs are on the system.
1 CPU/Core = 100% is load 1.00
2 CPU/Cores = 100% is load 2.00
4 CPU/Cores = 100% is load 4.00
Example: In the analogy above, each CPU core is one bridge lane, so a 4-core system can carry 4 cars (processes) at once before any have to wait.
Calculate Overall CPU Load
The following tools are useful when troubleshooting system load.
Typically built in
top ⇒ live system process view
uptime ⇒ system uptime and load averages
vmstat ⇒ virtual memory stats (memory, swap, i/o, cpu)
Need to install (if using a minimal install base)
Base Repo
yum -y install iotop lsof sysstat
Troubleshooting Steps
Know how many processors you have. This is essential for deciding whether load is high. See “Understanding System Load” above for more details.
grep -c ^processor /proc/cpuinfo
%Load (decimal) = (Load Average / Number of Processors)
Example: Number of processors = 2, load average seen = 1.50
1.50 / 2 = 0.75 or 75% load on the processors
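The calculation above can be scripted. The following is a minimal sketch, not a standard tool: load_pct is a hypothetical helper name, and the live part assumes a Linux system where /proc/loadavg and /proc/cpuinfo are readable.

```shell
#!/bin/sh
# load_pct LOAD CORES: print the load as a percentage of total CPU capacity.
# (load_pct is a hypothetical helper name used only for this example.)
load_pct() {
    awk -v load="$1" -v cores="$2" \
        'BEGIN { printf "%.0f%%\n", (load / cores) * 100 }'
}

# The worked example: 2 processors, load average 1.50.
load_pct 1.50 2

# Live values: the 1-minute load average is the first field of /proc/loadavg.
if [ -r /proc/loadavg ] && [ -r /proc/cpuinfo ]; then
    cores=$(grep -c '^processor' /proc/cpuinfo || true)
    read -r one _ < /proc/loadavg
    if [ "${cores:-0}" -gt 0 ]; then
        load_pct "$one" "$cores"
    fi
fi
```

Anything over 100% means processes are queueing for CPU time.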
Check load averages
“uptime” shows the load average for the last 1, 5, and 15 minutes. If the values are high for the number of processors, or trending up (1-minute value above the 15-minute value), investigate further.
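One rough way to judge the trend is to compare the 1-minute value with the 15-minute value. The sketch below assumes a Linux /proc/loadavg; load_trend is a hypothetical helper name, and the labels it prints are illustrative only.

```shell
#!/bin/sh
# load_trend ONE FIVE FIFTEEN: compare the 1-minute and 15-minute averages.
# (load_trend is a hypothetical helper; FIVE is accepted but not used.)
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one > fifteen)      print "rising"
        else if (one < fifteen) print "falling"
        else                    print "steady"
    }'
}

# Live values: the first three fields of /proc/loadavg are the
# 1, 5, and 15 minute load averages.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen _ < /proc/loadavg
    echo "load: $one $five $fifteen ($(load_trend "$one" "$five" "$fifteen"))"
fi
```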
What kind of load
Use vmstat to determine what kind of load the system is under. “vmstat 1” prints stats every second.
Important columns to take note of:
CPU: “wa” ⇒ Time spent waiting for I/O. If high, something is probably heavily utilizing disk.
CPU: “id” ⇒ CPU time spent idle. If close to 0, CPU is used heavily.
CPU: “sy” ⇒ CPU time spent running system/kernel processes. Mail and firewalls are common causes of high system use.
CPU: “us” ⇒ CPU time spent running user processes. If high, investigate with top.
IO: “bi” ⇒ Blocks received from a block device each second. If high, something is heavily reading from disk.
IO: “bo” ⇒ Blocks sent to a block device each second. If high, something is heavily writing to disk.
SWAP: “si” ⇒ Memory swapped in from disk each second.
SWAP: “so” ⇒ Memory swapped out to disk each second.
If either is high, the system is most likely very low on memory.
MEMORY: “free” ⇒ Memory free. If this is low, there is probably swapping going on as well; confirm with “si” and “so”.
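The column checks above can be applied automatically to vmstat output. This is only a sketch: classify_vmstat is a hypothetical helper, the column positions assume the default procps vmstat layout (r b swpd free buff cache si so bi bo in cs us sy id wa st), and the thresholds are illustrative assumptions rather than recommended values.

```shell
#!/bin/sh
# classify_vmstat: read vmstat data lines on stdin, print one rough hint per line.
# Header lines are skipped because their first field is not a number.
# Thresholds (any swap activity, wa >= 20, id <= 10) are assumptions.
classify_vmstat() {
    awk '$1 ~ /^[0-9]+$/ {
        si = $7; so = $8; id = $15; wa = $16
        if (si + so > 0)   print "swapping: check memory use"
        else if (wa >= 20) print "I/O wait: check disk I/O"
        else if (id <= 10) print "CPU bound: check top"
        else               print "no obvious bottleneck"
    }'
}
```

Usage: “vmstat 1 5 | classify_vmstat”. The first line vmstat prints is an average since boot, so the later lines are the ones that describe current load.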
Further investigate either high CPU/Memory use or Disk I/O
High CPU
Clues that you should investigate high CPU usage:
top
While in top:
High Memory Use
Notes on Linux memory management
Linux uses free memory in RAM as a buffer cache to speed up application performance.
When memory is needed, the buffer cache shrinks to allow other applications to use it.
Actual free memory = Memory Free + Buffers + Cached
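The sum above can be read straight from /proc/meminfo. A minimal sketch, assuming the usual “MemFree:”, “Buffers:”, and “Cached:” lines with values in kB; actual_free_kb is a hypothetical helper name.

```shell
#!/bin/sh
# actual_free_kb [FILE]: print MemFree + Buffers + Cached in kB.
# FILE defaults to /proc/meminfo; any file in the same format works.
# (actual_free_kb is a hypothetical helper name.)
actual_free_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "actual free: $(actual_free_kb) kB"
fi
```

Kernels from 3.14 onward also report a “MemAvailable:” line in /proc/meminfo, which is the kernel's own estimate of the same idea.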
Clues that you should investigate high memory usage:
Start top sorted by %MEM usage (“-a” is the flag in older procps versions of top; procps-ng versions use “top -o %MEM” instead)
top -a
While in top:
Other memory columns (Shift + '<' or '>' changes the sort column)
VIRT = virtual memory size used: code, data, and shared libraries plus swapped pages.
RES = resident size: non-swapped physical memory a process is using
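Outside of top, the resident size of each process can also be read from /proc/PID/status, where it appears as VmRSS. The sketch below lists the largest processes by that value. top_rss is a hypothetical helper, and its optional second argument (an alternate proc root) exists only to make the example easy to try on sample data.

```shell
#!/bin/sh
# top_rss [N] [PROC_ROOT]: print "rss_kB pid name" for the N processes with
# the largest resident size. PROC_ROOT defaults to /proc.
# Processes with no VmRSS line (kernel threads) are skipped.
top_rss() {
    n=${1:-10}
    root=${2:-/proc}
    for f in "$root"/[0-9]*/status; do
        [ -r "$f" ] || continue
        awk '/^Name:/  { name = $2 }
             /^Pid:/   { pid = $2 }
             /^VmRSS:/ { rss = $2 }
             END { if (rss > 0) print rss, pid, name }' "$f" 2>/dev/null
    done | sort -rn | head -n "$n"
}
```

Usage: “top_rss 5” prints the five largest processes by resident memory.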
Disk I/O
Clues that you should investigate high Disk I/O:
High load average while CPU “id” (idle) is also high: processes are waiting on something other than the CPU
High CPU “wa” (wait)
iostat - View I/O stats with extended statistics, every 3 seconds
iostat -x 3
iotop - Live disk I/O similar to top
iotop
lsof - Once a particular device has been identified as busy, list the open files on its mount point for further details.
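A sketch of that last step, assuming a Linux /proc/mounts: look up where the device is mounted, then hand the mount point to lsof. mount_for_device is a hypothetical helper, and /dev/sda1 is only a placeholder device name. iostat prints short names such as “sda”, so the full device path has to be supplied by hand.

```shell
#!/bin/sh
# mount_for_device DEV [MOUNTS_FILE]: print the mount point(s) backed by DEV.
# MOUNTS_FILE defaults to /proc/mounts (fields: device, mount point, type, ...).
# (mount_for_device is a hypothetical helper name.)
mount_for_device() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}

# Example with a placeholder device name; uncomment on a system with lsof:
# mp=$(mount_for_device /dev/sda1)
# [ -n "$mp" ] && lsof "$mp"
```

When the argument is a mount point, lsof lists the open files on that file system.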