High System Load
General Information
Troubleshooting high system load.
Checklist
- Distro: Enterprise Linux 6.x
Understanding System Load
Load average is shown by both “uptime” and “top”. It is reported as three numbers: the average load over the last 1, 5, and 15 minutes.
Traffic/Bridge analogy
The takeaway is reposted here in case the original goes away. Source: ScoutBlog Load Average
On a Single Core CPU System
- The server is a bridge operator.
- Cars are processes.
- Cars on the bridge are using CPU time.
- Cars waiting to go on the bridge are waiting for CPU time (the bridge is backed up, so they cannot get CPU time immediately).
- Load of 0.00 means there is no traffic on the bridge.
- Load of 1.00 means the bridge is at capacity. No more cars (processes) can get CPU time at this very second without waiting.
- Load over 1.00 means there is a backup.
- 2.00 ⇒ the bridge is full and another full lane's worth of cars (processes) is waiting for CPU time.
Multi-CPU/Core Systems
- Load is relative to how many CPUs are on the system.
- 1 CPU/core: load 1.00 is 100%
- 2 CPUs/cores: load 2.00 is 100%
- 4 CPUs/cores: load 4.00 is 100%
- In terms of the analogy above, each CPU core is one bridge lane.
Calculate Overall CPU Load
- Get number of CPUs
grep -c ^processor /proc/cpuinfo
- Load Average / Number of Processors = load as a decimal fraction
- Example: load average 1.50 / 2 processors = 0.75, or 75% system load on a dual-core system.
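The calculation above can be sketched as a small shell helper. The function name `load_pct` is made up for illustration. The commented lines read the live values from `/proc/cpuinfo` and `/proc/loadavg`, and so assume a Linux system.

```shell
#!/bin/sh
# load_pct LOAD NCPU: print load as a percentage of total CPU capacity.
# (Hypothetical helper name, not part of any package.)
load_pct() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN { printf "%.0f\n", load / ncpu * 100 }'
}

# Worked example from the text: load 1.50 on 2 processors.
load_pct 1.50 2    # prints 75

# Live values on a Linux host (left commented out):
# ncpu=$(grep -c ^processor /proc/cpuinfo)
# load1=$(cut -d' ' -f1 /proc/loadavg)
# load_pct "$load1" "$ncpu"
```

A result above 100 means that, on average, more processes wanted CPU time than there were processors to run them.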
Troubleshooting Tools
The following tools are useful when troubleshooting system load.
Typically built in
- top ⇒ live system process view
- uptime ⇒ system uptime and load averages
- vmstat ⇒ virtual memory stats (memory, swap, i/o, cpu)
Need to install (if using a minimal install base)
- iostat (sysstat package) ⇒ print i/o statistics
- iotop ⇒ live disk i/o
- lsof ⇒ list open files
Base Repo
yum -y install iotop lsof sysstat
Troubleshooting Steps
- Know how many processors you have. This is essential to determine if load is high. See “Understanding System Load” above for more details.
grep -c ^processor /proc/cpuinfo
- %Load (decimal) = (Load Average / Number of Processors)
- Example: Number of processors = 2, load average seen = 1.50
- 1.50 / 2 = 0.75 or 75% load on the processors
- Check load averages
- uptime shows the load averages for the last 1, 5, and 15 minutes. If they are high relative to the number of processors, or the 1-minute value is well above the 15-minute value (trending up), it is time to investigate further.
uptime
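- Determine what kind of load
- Use vmstat to determine what kind of load the system is under. “vmstat 1” prints stats every second. The first line of output reports averages since boot, so judge by the lines that follow.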
vmstat 1
- Important columns to take note of:
- CPU: “wa” ⇒ Time spent waiting for I/O. If high, something is probably heavily utilizing disk.
- CPU: “id” ⇒ CPU time spent idle. If close to 0, CPU is used heavily.
- CPU: “sy” ⇒ CPU time spent running system/kernel processes. Mail and firewalls are common causes of high system use.
- CPU: “us” ⇒ CPU time spent running user processes. If high, investigate with top.
- IO: “bi” ⇒ blocks received from a block device each second. (If high, something is heavily reading from disk)
- IO: “bo” ⇒ blocks sent to a block device each second. (If high, something is heavily writing to disk)
- SWAP: “si” ⇒ Memory swapped in from disk each second.
- SWAP: “so” ⇒ Memory swapped out to disk each second.
- If either is high, free memory is most likely very low as well.
- MEMORY: “free” ⇒ memory free. If this is low, there is probably swapping going on as well.
- Investigate further, depending on what vmstat shows: high CPU use, high memory use, or disk I/O
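As a rough illustration of reading the vmstat columns listed above, the sketch below classifies each sample line. The function name `classify_vmstat` and the thresholds (any swapping, wa of 20 or more, us + sy of 80 or more) are assumptions chosen for the example, not established limits. It also assumes the 17-column vmstat layout (r b swpd free buff cache si so bi bo in cs us sy id wa st).

```shell
#!/bin/sh
# classify_vmstat: read `vmstat` output on stdin, print one hint per sample.
# Thresholds are illustrative assumptions; tune them for your systems.
classify_vmstat() {
    awk '
    $1 !~ /^[0-9]+$/ { next }            # skip the two header lines
    {
        si = $7; so = $8; us = $13; sy = $14; wa = $16
        if (si + so > 0)        print "swapping (si=" si " so=" so ")"
        else if (wa >= 20)      print "io-wait (wa=" wa ")"
        else if (us + sy >= 80) print "cpu-bound (us=" us " sy=" sy ")"
        else                    print "ok"
    }'
}

# Usage on a live system (left commented out): five one-second samples.
# vmstat 1 5 | classify_vmstat
```

Swapping is checked first because heavy swapping usually also drives up I/O wait, so it is the more specific clue.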
High CPU
Clues that you should investigate high CPU usage:
- Low CPU “id” (idle)
- High CPU “sy” (system processes)
- High CPU “us” (user processes)
top
While in top:
- Turn on highlighting: 'z'
- Highlight sort column: 'x'
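For a non-interactive snapshot (for example to paste into a ticket), something like the following may help. The function name `top_cpu` is made up for illustration, and the sketch assumes the procps `ps`, which supports `--sort`. Note that the `pcpu` value from `ps` is CPU time divided by the process's lifetime, so it will not match the live %CPU figure in top.

```shell
#!/bin/sh
# top_cpu [N]: list the N processes with the highest CPU figure (default 10).
# Hypothetical helper; assumes procps ps with --sort support.
top_cpu() {
    ps -eo pid,user,pcpu,pmem,comm --sort=-pcpu | head -n "$(( ${1:-10} + 1 ))"
}

top_cpu 5
```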
High Memory Use
Clues that you should investigate high memory usage:
- Memory free: Very low
- Swap si: High swapping in from disk
- Swap so: High swapping out to disk
Start top sorted by memory usage (%MEM)
top -a
While in top:
- Turn on highlighting: 'z'
- Highlight sort column: 'x'
Other memory columns (press '<' or '>' to move the sort column left or right)
- VIRT = virtual memory size used: code, data, and shared libraries plus swapped pages.
- RES = resident size: non-swapped physical memory a process is using
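The same view can be taken as a one-off listing. In the sketch below, the `rss` and `vsz` fields of `ps` correspond to the RES and VIRT columns in top, both reported in KiB. The function name `top_mem` is made up for illustration, and the procps `ps` is assumed.

```shell
#!/bin/sh
# top_mem [N]: list the N processes with the largest resident memory (default 10).
# Hypothetical helper; rss ~ top's RES, vsz ~ top's VIRT (KiB).
top_mem() {
    ps -eo pid,user,pmem,rss,vsz,comm --sort=-rss | head -n "$(( ${1:-10} + 1 ))"
}

top_mem 5
```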
Disk I/O
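- I/O wait (wa) is the percentage of time the CPUs were idle while waiting for outstanding disk I/O to complete.
- If I/O wait % is greater than (100 / number of CPU cores), for example above 25% on a 4-core system, then the CPUs are spending a lot of time waiting on disk.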
- Easiest ways to improve disk I/O
- Give the system more memory
- Tune the application to rely more on in-memory caches and less on disk
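The rule of thumb for I/O wait can be written out as follows. Both function names are hypothetical, and the threshold is only the (100 / number of cores) heuristic from this page, not a hard limit.

```shell
#!/bin/sh
# iowait_threshold NCPU: print the rule-of-thumb wa% threshold (100 / cores).
iowait_threshold() {
    awk -v ncpu="$1" 'BEGIN { printf "%.1f\n", 100 / ncpu }'
}

# iowait_high WA NCPU: succeed (exit 0) if WA exceeds the threshold.
iowait_high() {
    awk -v wa="$1" -v ncpu="$2" 'BEGIN { exit !(wa > 100 / ncpu) }'
}

iowait_threshold 4    # prints 25.0
if iowait_high 30 4; then
    echo "wa=30 on 4 cores: investigate disk I/O"
fi
```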
Clues that you should investigate high Disk I/O:
- High CPU “id” (idle) while the load average is still high (processes are waiting on disk, not on CPU)
- High CPU “wa” (wait)
iostat - View I/O stats with extended statistics, every 3 seconds
iostat -x 3
- “%util” ⇒ If this is close to 100%, the listed “Device” is the one to investigate.
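To pick out the busy devices automatically, the iostat output can be filtered. This is a sketch under two assumptions: that %util is the last column of the device table (the column order varies between sysstat versions), and that values use a dot as the decimal separator. The function name `busy_devices` and the default threshold of 90 are made up for the example.

```shell
#!/bin/sh
# busy_devices [THRESHOLD]: read `iostat -x` output on stdin and print
# "device %util" for every device at or above THRESHOLD (default 90).
# Assumes %util is the LAST column of the device table.
busy_devices() {
    awk -v limit="${1:-90}" '
    $1 ~ /^Device/ { in_dev = 1; next }   # header row starts a device table
    NF == 0        { in_dev = 0; next }   # blank line ends the table
    in_dev && $NF + 0 >= limit { print $1, $NF }
    '
}

# Usage on a live system (left commented out): two reports, 3 seconds apart.
# iostat -x 3 2 | busy_devices 90
```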
iotop - Live disk I/O similar to top
iotop
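lsof - Once a busy device has been identified, another option for further detail is to list the open files on its mount point.
- Identify the device with iostat
- Find the mount point for that device
- If it is a 'dm' (device-mapper) device, list the mapper names to see which volume it is:
ls -l /dev/mapper
- Then search the lsof output for that mount point (/var/ in this example):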
lsof | grep /var/
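The 'dm' lookup can be scripted when `/dev/mapper` entries are symlinks to `../dm-N`, which is an assumption about the system layout. The function name `dm_name` and the volume names in the comments are hypothetical.

```shell
#!/bin/sh
# dm_name DEVICE: read `ls -l /dev/mapper` output on stdin and print the
# mapper name whose symlink points at ../DEVICE (e.g. dm-2).
# Assumes /dev/mapper entries are symlinks of the form "name -> ../dm-N".
dm_name() {
    awk -v dev="$1" '
    NF >= 3 && $(NF-1) == "->" && $NF == ("../" dev) { print $(NF-2) }
    '
}

# Usage on a live system (left commented out; names are examples only):
# ls -l /dev/mapper | dm_name dm-2      # e.g. vg_root-lv_var
# mount | grep vg_root-lv_var           # shows where that volume is mounted
# lsof /var                             # open files on that file system
```

When given a mount point as its argument, lsof lists the files open on that file system, which avoids matching unrelated paths that merely contain the string.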