High System Load


General Information

Troubleshooting high system load.

Checklist

Distro: Enterprise Linux 6.x

Understanding System Load

Load average is reported by both “uptime” and “top” as three numbers: the average load over the last 1, 5, and 15 minutes.

Reposting the takeaway here in case the original goes away. Source: ScoutBlog Load Average

On a Single Core CPU System

The server is a bridge operator.
Cars are processes.
Cars on the bridge are using CPU time.
Cars waiting to go on the bridge are waiting for CPU time (the bridge is backed up, so they cannot get CPU time immediately).
Load of 0.00 means there is no traffic on the bridge.
Load of 1.00 means the bridge is at capacity. No more cars (processes) can get CPU time at this very second without waiting.
Load over 1.00 means there is a backup.
- 2.00 ⇒ there are “two lanes” worth of cars (processes) in total: one lane's worth on the bridge and one lane's worth waiting for CPU time.

Multi-CPU/Core Systems

Load is relative to how many CPUs are on the system.
- 1 CPU/core: load 1.00 = 100% utilization
- 2 CPUs/cores: load 2.00 = 100% utilization
- 4 CPUs/cores: load 4.00 = 100% utilization
Example: In the analogy above, each CPU core is one bridge lane.

Get number of CPUs
- ```
grep -c ^processor /proc/cpuinfo
```
Load Average / Number of Processors = load as a decimal fraction
- Example: Load average 1.5 / 2 processors = 0.75, or 75% system load on a dual-core system.
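
The calculation above can be sketched as a small shell function. This is only an illustration: the function name `load_pct` is made up for this page, and the live check assumes a Linux system that exposes /proc/loadavg and /proc/cpuinfo.

```shell
# load_pct LOAD CPUS: print the load as a percentage of total CPU capacity.
# (load_pct is an illustrative name, not a standard tool.)
load_pct() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f\n", (load / cpus) * 100 }'
}

# Worked example from the text: load average 1.5 on 2 processors.
load_pct 1.5 2    # prints 75

# Live values, if this system exposes them under /proc.
if [ -r /proc/loadavg ] && [ -r /proc/cpuinfo ]; then
    # The anchored pattern counts only the per-CPU "processor" lines.
    cpus=$(grep -c ^processor /proc/cpuinfo)
    read -r one five fifteen rest < /proc/loadavg
    if [ "$cpus" -gt 0 ]; then
        echo "1-minute load: $(load_pct "$one" "$cpus")% of $cpus CPU(s)"
    fi
fi
```

A result above 100 means more work is queued than the processors can run at once.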

Troubleshooting Tools

The following tools are useful when troubleshooting system load.

Typically built in

top ⇒ live system process view
uptime ⇒ system uptime and load averages
vmstat ⇒ virtual memory stats (memory, swap, i/o, cpu)

Need to install (if using a minimal install base)

iostat (sysstat package) ⇒ print i/o statistics
iotop ⇒ live disk i/o
lsof ⇒ list open files

Base Repo

yum -y install iotop lsof sysstat

Troubleshooting Steps

Know how many processors you have. This is essential to determine if load is high. See “Understanding System Load” above for more details.
1. ```
grep -c ^processor /proc/cpuinfo
```
2. Load (decimal) = Load Average / Number of Processors
3. Example: number of processors = 2, load average seen = 1.50
4. 1.50 / 2 = 0.75, or 75% load on the processors
Check load averages
1. uptime shows the load averages for the last 1, 5, and 15 minutes. If they are high for the number of processors or trending up, investigate further.
2. ```
uptime
```
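
Whether load is trending up can be read off by comparing the 1-minute average with the 15-minute average. A minimal sketch, assuming /proc/loadavg is available for the live check; `load_trend` is a hypothetical helper name, and comparing only the first and last values is a simplification.

```shell
# load_trend ONE FIVE FIFTEEN: compare the 1-minute and 15-minute averages.
# FIVE is accepted so the three uptime values can be passed as-is.
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one + 0 > fifteen + 0)      print "rising"
        else if (one + 0 < fifteen + 0) print "falling"
        else                            print "steady"
    }'
}

load_trend 2.10 1.40 0.90    # prints rising

# Live values, if available on this system.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    echo "load $one $five $fifteen is $(load_trend "$one" "$five" "$fifteen")"
fi
```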
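Determine what kind of load
1. Use vmstat to determine what kind of load the system is under. “vmstat 1” prints stats every second. The first line reports averages since boot; the lines after it show current activity.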
  1. ```
  vmstat 1
```
2. Important columns to take note of:
  1. CPU: “wa” ⇒ Time spent waiting for I/O. If high, something is probably heavily utilizing disk.
  2. CPU: “id” ⇒ CPU time spent idle. If close to 0, CPU is used heavily.
  3. CPU: “sy” ⇒ CPU time spent running kernel code (system calls, interrupts, network handling). Mail and firewalls are common causes of high system use.
  4. CPU: “us” ⇒ CPU time spent running user processes. If high, investigate with top.
  5. IO: “bi” ⇒ blocks received from block device each second. (If high, something is heavily reading a disk)
  6. IO: “bo” ⇒ blocks sent to a block device each second. (If high, something is heavily writing to disk)
  7. SWAP: “si” ⇒ Memory swapped in from disk each second.
  8. SWAP: “so” ⇒ Memory swapped to disk each second.
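    1. If either is consistently high, the system is most likely short on memory.
  9. MEMORY: “free” ⇒ idle memory. If this is low and si/so are active, the system is probably swapping. Linux also uses spare memory for buffers and cache, so a low “free” value on its own is not a problem.
3. Investigate further based on what vmstat shows: high CPU, high memory use, or disk I/O.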
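
The column checks above can be sketched as a filter over vmstat output. This is an illustration only: `classify_vmstat` is a hypothetical name, and the thresholds (wa of 20 or more, id of 10 or less, any swap activity) are assumptions chosen for the example, not values defined by vmstat.

```shell
# classify_vmstat: read vmstat output on stdin and name the dominant clue.
# Thresholds are illustrative assumptions; tune them for your systems.
classify_vmstat() {
    awk -v wa_max=20 -v id_min=10 '
        # Header row ("r b swpd free ..."): record each column position by name.
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data rows start with a number; keep the most recent sample.
        $1 ~ /^[0-9]+$/ && ("id" in col) {
            id = $col["id"]; wa = $col["wa"]
            si = $col["si"]; so = $col["so"]
            seen = 1
        }
        END {
            if (!seen)              print "no data"
            else if (si + so > 0)   print "memory: swapping (si/so active)"
            else if (wa >= wa_max)  print "disk: high I/O wait (wa)"
            else if (id <= id_min)  print "cpu: little idle time (id)"
            else                    print "ok: no obvious bottleneck"
        }'
}

# Two samples one second apart; the second reflects current activity.
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 2 | classify_vmstat
fi
```

Columns are located by header name rather than by position, so the filter does not depend on the exact column order of the installed vmstat.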

High CPU

Clues that you should investigate high CPU usage:

Low CPU “id” (idle)
High CPU “sy” (system processes)
High CPU “us” (user processes)

top

While in top:

Turn on highlighting: 'z'
Highlight sort column: 'x'

High Memory Use

Clues that you should investigate high memory usage:

Memory free: Very low
Swap si: High swapping in from disk
Swap so: High swapping out to disk

Start top sorted by memory usage (%MEM). Inside a running top, Shift+M does the same.

top -a

While in top:

Turn on highlighting: 'z'
Highlight sort column: 'x'

Other memory columns (press '<' or '>' to move the sort column)

VIRT = virtual memory size: code, data, and shared libraries plus pages that have been swapped out.
RES = resident size: non-swapped physical memory a process is using.
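
The free-memory and swap clues can also be checked without an interactive tool by reading /proc/meminfo. A minimal sketch: `mem_summary` is a hypothetical helper, and the optional file argument exists only so the function can be tried against a saved copy.

```shell
# mem_summary [MEMINFO_FILE]: print free memory as a percentage of total
# and the amount of swap in use (kB). Defaults to /proc/meminfo.
mem_summary() {
    awk '$1 == "MemTotal:"  { total = $2 }
         $1 == "MemFree:"   { free = $2 }
         $1 == "SwapTotal:" { stotal = $2 }
         $1 == "SwapFree:"  { sfree = $2 }
         END {
             if (total > 0)
                 printf "free=%.0f%% swap_used_kb=%d\n", free * 100 / total, stotal - sfree
         }' "${1:-/proc/meminfo}"
}

# Live values, if available on this system.
if [ -r /proc/meminfo ]; then
    mem_summary
fi
```

A low free percentage together with a non-zero swap figure is worth following up in top; a low free percentage alone may simply reflect memory used for cache.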

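Disk I/O

I/O wait (wa) is the percentage of time the CPUs are idle while waiting on outstanding disk I/O.
- If I/O wait % is greater than (1 / number of CPU cores) expressed as a percentage, for example above 25% on a 4-core system, then the CPUs are spending a lot of time waiting on disk.
Easiest ways to improve disk I/O
- Give the system more memory
- Tune the application to use in-memory caches instead of disk where possible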
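
The I/O wait rule of thumb above works out as follows. This is a sketch of the arithmetic only; `iowait_threshold` and `iowait_high` are hypothetical helper names.

```shell
# iowait_threshold CORES: print the wa percentage equal to 1 / CORES.
iowait_threshold() {
    awk -v cores="$1" 'BEGIN { printf "%.1f\n", 100 / cores }'
}

# iowait_high WA CORES: succeed (exit status 0) when WA exceeds that threshold.
iowait_high() {
    awk -v wa="$1" -v cores="$2" 'BEGIN { exit !(wa + 0 > 100 / cores) }'
}

iowait_threshold 4    # prints 25.0

if iowait_high 30 4; then
    echo "wa of 30% on 4 cores is above the threshold: investigate disk I/O"
fi
```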

Clues that you should investigate high Disk I/O:

CPU “id” (idle) is not low, yet the load average is high
High CPU “wa” (wait)

iostat - View I/O stats with extended statistics, every 3 seconds

iostat -x 3

“%util” ⇒ If this is close to 100%, the listed “Device” is the one to investigate.
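
Picking out the busy devices from iostat output can be sketched as a filter. `busy_devices` is a hypothetical name and the default threshold of 90 is an assumption for illustration. The %util column is located by its header because its position differs between sysstat versions.

```shell
# busy_devices [LIMIT]: read "iostat -x" output on stdin and print each
# device whose %util is at or above LIMIT (assumed default: 90).
busy_devices() {
    awk -v limit="${1:-90}" '
        # Header row: find the %util column by name.
        /%util/ { for (i = 1; i <= NF; i++) if ($i == "%util") u = i; next }
        # Device rows: wide enough to have that column and over the limit.
        u && NF >= u && $u + 0 >= limit { print $1, $u }
    '
}

# Two reports one second apart. The first report covers the time since
# boot, so a device may be listed once for each report.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 2 | busy_devices 90
fi
```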

iotop - Live disk I/O similar to top

iotop

lsof - Once a busy device has been identified, another option for further details is to list the open files on its mount point.

Find the busy device with iostat, then find its mount point.
- If it is a 'dm' device, list the device-mapper names to identify the volume:
```
ls -l /dev/mapper
```
Then search the lsof output for that mount point (/var/ in this example):
```
lsof | grep /var/
```
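
The mount point lookup can be sketched by reading /proc/mounts. `mount_point_of` is a hypothetical helper, and the volume name in the example is made up; substitute a name listed in /dev/mapper on your system.

```shell
# mount_point_of DEVICE [MOUNTS_FILE]: print where DEVICE is mounted.
# MOUNTS_FILE defaults to /proc/mounts; the parameter exists so the
# function can be tried against a saved copy.
mount_point_of() {
    awk -v dev="$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}

# Example (hypothetical volume name):
#   mp=$(mount_point_of /dev/mapper/vg_example-lv_var)
#   lsof | grep "$mp"
```

The function prints nothing when the device is not mounted, so check for an empty result before passing it to grep.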

Owlbear Consulting - IT Knowledge