Differences

This shows you the differences between two versions of the page.

--- linux_wiki:high_system_load [2015/10/06 23:06]
billdozor [Troubleshooting Steps]
+++ linux_wiki:high_system_load [2019/05/25 23:50] (current)
@@ Line 6: / Line 6: @@
 **Checklist**
-  * Distro: Enterprise Linux 6.x
+  * Distro(s): Enterprise Linux 6
+----
+====== Understanding System Load ======
+Load average is shown by both "uptime" and "top". It is reported as three values: the averages over the last 1, 5, and 15 minutes.
+===== Traffic/Bridge analogy =====
+Reposting the takeaway here in case the source goes away.
+Source: [[http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages|ScoutBlog Load Average]]
+**On a Single Core CPU System**
+  * The server is a bridge operator.
+  * Cars are processes.
+  * Cars on the bridge are using CPU time.
+  * Cars waiting to go on the bridge are waiting for CPU time (because the bridge is backed up and they cannot get CPU time immediately).
+  * Load of 0.00 means there is no traffic on the bridge.
+  * Load of 1.00 means the bridge is at capacity. No more cars (processes) can get CPU time at this very second without waiting.
+  * Load over 1.00 means there is a backup.
+    * 2.00 => there are "two lanes" worth of cars (processes). One lane is being processed, another lane is waiting for CPU time.
+**Multi-CPU/Core Systems**
+  * Load is relative to how many CPUs are on the system.
+    * 1 CPU/Core = 100% is load 1.00
+    * 2 CPU/Cores = 100% is load 2.00
+    * 4 CPU/Cores = 100% is load 4.00
+  * Example: From the analogy above, each CPU Core can actively process 1 bridge lane.
+===== Calculate Overall CPU Load =====
+  * Get number of CPUs<code bash>grep -c ^processor /proc/cpuinfo
+# OR
+nproc</code>
+  * Load Average / NumProcessors = load as a decimal (multiply by 100 for %)
+    * Example: LoadAvg(1.5) / 2 Processors = 0.75 or 75% system load on a dual core system.
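The calculation above can be sketched as a small shell helper. This is only a minimal sketch and not part of the original page: the function name load_percent is hypothetical, and it assumes awk is installed and, for the live reading, that /proc/loadavg and nproc are available.

```shell
#!/usr/bin/env bash
# Sketch: express a load average as a percentage of total CPU capacity.
# load_percent is a hypothetical helper name, not a standard tool.
load_percent() {
    # $1 = load average (e.g. 1.5), $2 = number of processors (e.g. 2)
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.0f\n", (load / cpus) * 100 }'
}

# Example from the text: load average 1.5 on a dual core system.
load_percent 1.5 2    # prints 75

# Live values, when available: the first field of /proc/loadavg is the
# 1-minute load average, and nproc reports the processor count.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 _ < /proc/loadavg
    echo "Current 1-minute load: $(load_percent "$load1" "$(nproc)")%"
fi
```

A result above 100 means processes are queuing for CPU time, which matches the "backup on the bridge" case in the analogy.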
 ----
@@ Line 15: / Line 50: @@
 Typically built in
-  * uptime
+  * top => live system process view
-  * top
+  * uptime => system uptime and load averages
-  * vmstat
+  * vmstat => virtual memory stats (memory, swap, i/o, cpu)
 Need to install (if using a minimal install base)
-  * iotop
+  * iostat (sysstat package) => print i/o statistics
-  * iostat (sysstat package)
+  * iotop => live disk i/o
-  * sar (sysstat package)
+  * lsof => list open files
 Base Repo
 <code bash>
-yum -y install iotop sysstat
+yum -y install iotop lsof sysstat
 </code>
@@ Line 33: / Line 68: @@
 ====== Troubleshooting Steps ======
-  - Know how many processors you have. This is essential to determine if load is high.
+  - **Know how many processors you have**. This is essential to determine if load is high. See "Understanding System Load" above for more details.
-    - <code bash>grep -c processor /proc/cpuinfo</code>
+    - <code bash>grep -c ^processor /proc/cpuinfo</code>
     - %Load (decimal) = (Load Average / Number Processors)
     - Example: Number of processors = 2, load average seen = 1.50
     - 1.50 / 2 = 0.75 or 75% load on the processors
-  - Check load averages
+  - **Check load averages**
     - Uptime shows the load average for the last 1, 5, and 15 minutes. If it is too high or trending up, time to investigate further.
     - <code bash>uptime</code>
-  - What kind of load
+  - **What kind of load**
     - Use vmstat to determine what kind of system load. "vmstat 1" prints stats every 1 second.
       - <code bash>vmstat 1</code>
@@ Line 49: / Line 84: @@
       - CPU: "sy" => CPU time spent running system/kernel processes. Mail and firewalls are common causes of high system use.
       - CPU: "us" => CPU time spent running user processes. If high, investigate with top.
+      - IO: "bi" => blocks received from block device each second. (If high, something is heavily reading a disk)
+      - IO: "bo" => blocks sent to a block device each second. (If high, something is heavily writing to disk)
       - SWAP: "si" => Memory swapped in from disk each second.
       - SWAP: "so" => Memory swapped to disk each second.
         - If either is high, free memory is most likely also very low.
-    - Further investigate either high CPU use or Disk I/O
+      - MEMORY: "free" => memory free. If this is low, there is probably swapping going on as well.
+    - **Further investigate either high CPU/Memory use or Disk I/O**
-==== High CPU ====
+----
+===== High CPU =====
+Clues that you should investigate high CPU usage:
+  * Low CPU "id" (idle)
+  * High CPU "sy" (system processes)
+  * High CPU "us" (user processes)
 <code bash>
@@ Line 60: / Line 105: @@
 </code>
-==== Disk I/O ====
+While in top:
+  * Turn on highlighting: 'z'
+  * Highlight sort column: 'x'
+----
+===== High Memory Use =====
+Notes on Linux memory management
+  * Linux uses free memory in RAM as a buffer cache to speed up application performance.
+  * When memory is needed, the buffer cache shrinks to allow other applications to use it.
+  * **Actual free memory = Memory Free + Buffers + Cached**
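The formula above can be checked against /proc/meminfo. The following is a minimal sketch under stated assumptions: estimated_free_kb is a hypothetical helper name, it expects meminfo-style text on standard input, and it relies on the MemFree, Buffers, and Cached fields being reported in kB as they are in /proc/meminfo.

```shell
#!/usr/bin/env bash
# Sketch: estimate "actual free" memory as MemFree + Buffers + Cached.
# estimated_free_kb is a hypothetical helper; it reads meminfo-style text on stdin.
estimated_free_kb() {
    # Exact field-name matches, so "SwapCached:" is not counted as "Cached:".
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { printf "%.0f\n", sum }'
}

# Illustrative sample values (made up for this example, not real measurements).
printf 'MemTotal: 1000000 kB\nMemFree: 100000 kB\nBuffers: 50000 kB\nCached: 250000 kB\nSwapCached: 10 kB\n' \
    | estimated_free_kb    # prints 400000

# Live value in kB, when /proc/meminfo is readable.
if [ -r /proc/meminfo ]; then
    estimated_free_kb < /proc/meminfo
fi
```

Reading from standard input keeps the helper usable on a saved copy of /proc/meminfo from another host.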
+Clues that you should investigate high memory usage:
+  * Memory free: Very low
+  * Swap si: High swapping in from disk
+  * Swap so: High swapping out to disk
+Start top and sort by %mem usage
+<code bash>
+top -a
+</code>
+While in top:
+  * Turn on highlighting: 'z'
+  * Highlight sort column: 'x'
+Other memory columns (press '<' or '>' to move the sort column left or right)
+  * VIRT = virtual memory size used: code, data, and shared libraries plus swapped pages.
+  * RES = resident size: non-swapped physical memory a process is using
+----
+===== Disk I/O =====
+  * I/O wait (wa) is the percentage of time a CPU is waiting on disk.
+    * If I/O wait % is greater than (100 / number of CPU cores), e.g. above 25% on a 4 core system, then the CPUs are spending a lot of time waiting on disk.
+  * Easiest ways to improve disk I/O
+    * Give the system more memory
+    * Tune the application to rely more on in-memory caches than on disk
+Clues that you should investigate high Disk I/O:
+  * High CPU "id" (idle)
+  * High CPU "wa" (wait)
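The I/O wait rule of thumb above (wait % compared against one core's share of the total) can be sketched as a shell check. This is an illustrative sketch only: iowait_is_high is a hypothetical helper name, and the "wa" value is assumed to be read by hand from vmstat or top and passed in as a percentage.

```shell
#!/usr/bin/env bash
# Sketch: flag I/O wait as high when it exceeds one core's share of CPU time.
# iowait_is_high is a hypothetical helper, not a standard tool.
iowait_is_high() {
    # $1 = "wa" percentage as shown by vmstat or top, $2 = number of CPU cores
    # Exit status 0 (true) when wa > 100 / cores, non-zero otherwise.
    awk -v wa="$1" -v cores="$2" 'BEGIN { exit !(wa > 100 / cores) }'
}

# Example: 30% wait on a 4 core system exceeds the 25% threshold.
if iowait_is_high 30 4; then
    echo "I/O wait is high, investigate disk I/O"
else
    echo "I/O wait is within the rule of thumb"
fi
```

Returning an exit status rather than printing a value lets the check be used directly in an if statement or a monitoring script.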
+\\
+**iostat** - View I/O stats with extended statistics, every 3 seconds
+<code bash>
+iostat -x 3
+</code>
+  * "%util" => If this is close to 100%, the listed "Device" is the one to investigate.
+\\
+**iotop** - Live disk I/O similar to top
 <code bash>
-iostat
 iotop
 </code>
+\\
+**lsof** - If a particular device is discovered, another option for further details is to list open files for that mount point.
+  * Device discovered via iostat
+  * Mount point discovered
+    * If 'dm' device:<code bash>ls -l /dev/mapper</code>
+  * Then search lsof for that mount point:<code bash>lsof | grep /var/</code>
 ----