1. Introduction: Recently I have been working on real-time log synchronization. Before going live, I stress-tested a single online log stream, and the message queue, the client, and the local machine all held up without problems. What I didn't expect was that as soon as the second log stream was brought online, the trouble started:
top on one machine in the cluster showed an extremely high load. Every machine in the cluster has the same hardware configuration and runs the same software, yet this single machine alone had a load problem, so my initial guess was a hardware issue.
At the same time, we needed to identify the culprit behind the abnormal load and then look for solutions at both the software and hardware levels.
2. Troubleshooting: top shows that the load average is high, %wa is high, and %us is low:
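To capture the same view non-interactively, running top in batch mode is enough; a minimal sketch using only standard options:

```bash
# One batch-mode iteration; the first few lines carry the load average
# and the CPU breakdown (%us, %sy, %wa, %id).
top -b -n 1 | head -n 5
```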
From the top output (high load average, high %wa, low %us) we can roughly infer that IO has hit a bottleneck. Next, we can use dedicated IO diagnostic tools to verify this and narrow the problem down.
Commonly used combinations are as follows (a command sketch follows the list):
•Use vmstat, sar, and iostat to check for a CPU bottleneck
•Use free and vmstat to check for a memory bottleneck
•Use iostat and dmesg to check for a disk I/O bottleneck
•Use netstat to check for a network bandwidth bottleneck
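As a rough starting point, the corresponding invocations might look like the following; the intervals and counts here are arbitrary choices, not values taken from this particular investigation:

```bash
# CPU: run queue, context switches and the %us/%sy/%wa split
vmstat 1 5
sar -u 1 5

# Memory: free/used memory, buffers/cache and swap activity
free -m

# Disk I/O: per-device throughput, queue length and utilization
iostat -x -k 1 5
dmesg | tail

# Network: per-interface packet and error counters
netstat -i
```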
2.1 vmstat: The vmstat command reports virtual memory statistics ("Virtual Memory Statistics"), but it also gives an overall picture of the system: processes, memory, I/O, and so on (a sample invocation follows the field descriptions below).
Its related fields are described as follows:
Procs (processes)
•r: The number of processes in the run queue. This value can also be used to judge whether more CPUs are needed (if it stays above 1 for a long time).
•b: The number of processes waiting for IO, that is, processes in uninterruptible sleep; it shows tasks that are blocked while waiting for resources. When this value exceeds the number of CPUs, a bottleneck appears.
Memory
•swpd: The amount of virtual memory (swap) in use. If swpd is non-zero but si and so stay at 0 for a long time, this does not affect system performance.
•free: Free physical memory size.
•buff: The size of memory used as buffer.
•cache: The amount of memory used as cache. A large cache value means many files are cached; if frequently accessed files can be served from the cache, the disk read traffic (bi) will be very small.
Swap (swap area)
•si: The amount of data read into memory from the swap area per second (disk to memory).
•so: The amount of data written from memory to the swap area per second (memory to disk).
Note: when memory is sufficient, both values are 0. If they stay above 0 for a long time, system performance suffers, because swapping consumes disk IO and CPU. Some people assume memory is short as soon as free memory (free) is very small or close to 0; you cannot judge from that alone, you must also look at si and so. If free is very small but si and so are also very small (mostly 0), then there is nothing to worry about: system performance is not affected.
IO (input and output)
(In current Linux versions the block size is 1 KB.)
•bi: Number of blocks read per second
•bo: Number of blocks written per second
Note: with random disk reads and writes, the larger these two values are (e.g. above 1024k), the higher the CPU IO-wait value (wa) you will see.
System
•in: Number of interrupts per second, including clock interrupts.
•cs: Number of context switches per second.
Note: The larger the above two values are, the greater the CPU time consumed by the kernel will be.
CPU
(expressed as a percentage)
•us: Percentage of time spent running user processes (user time). A high us value means user processes are consuming a lot of CPU; if it stays above 50% for a long time, we should consider optimizing the program's algorithms or otherwise speeding it up.
•sy: Percentage of time spent in the kernel (system time). A high sy value means the kernel is consuming a lot of CPU, which is not a healthy sign, and we should investigate the cause.
•wa: Percentage of time spent waiting for IO. A high wa value means IO waits are severe, which may be caused by heavy random disk access or by a disk bottleneck (blocked operations).
•id: idle time percentage
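With the fields above in mind, a typical way to observe the system is to sample once per second for a short while (a sketch; the interval and count are arbitrary):

```bash
# Report every second, five times. The first line is an average since
# boot, so focus on the later samples for r, b, si/so, bi/bo, in/cs
# and the us/sy/id/wa columns.
vmstat 1 5
```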
As can be seen from vmstat, most of the CPU's time is spent waiting for IO, which may be caused by heavy random disk access or by limited disk bandwidth; bi and bo also exceed 1024k, which points to an IO bottleneck.
2.2 iostat: Let's use a more specialized disk IO diagnostic tool to look at the relevant statistics (a sample invocation follows the field list).
Its related fields are described as follows:
•rrqm/s: The number of merged read requests per second, i.e. delta(rmerge)/s
•wrqm/s: The number of merged write requests per second, i.e. delta(wmerge)/s
•r/s: The number of read I/O requests completed per second, i.e. delta(rio)/s
•w/s: The number of write I/O requests completed per second, i.e. delta(wio)/s
•rsec/s: The number of sectors read per second, i.e. delta(rsect)/s
•wsec/s: The number of sectors written per second, i.e. delta(wsect)/s
•rkB/s: Kilobytes read per second; half of rsec/s, since each sector is 512 bytes (needs calculation)
•wkB/s: Kilobytes written per second; half of wsec/s (needs calculation)
•avgrq-sz: Average size (in sectors) of each device I/O request, i.e. delta(rsect+wsect)/delta(rio+wio)
•avgqu-sz: Average I/O queue length, i.e. delta(aveq)/s/1000 (because aveq is in milliseconds)
•await: Average wait time (in milliseconds) of each device I/O request, i.e. delta(ruse+wuse)/delta(rio+wio)
•svctm: Average service time (in milliseconds) of each device I/O request, i.e. delta(use)/delta(rio+wio)
•%util: The percentage of each second spent doing I/O, or how much of the second the I/O queue is non-empty, i.e. delta(use)/s/1000 (because use is in milliseconds)
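An invocation that produces the extended statistics described above might look like this (a sketch; the interval and count are arbitrary):

```bash
# -x: extended per-device statistics (rrqm/s, await, svctm, %util, ...)
# -k: report throughput in kB/s rather than sectors/s
# 1 5: sample every second, five times
iostat -x -k 1 5
```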
You can see that of the two hard disks, sdb's utilization is 100%: a serious IO bottleneck. The next step is to find out which process is reading from and writing to this disk.
2.3 iotop: According to the iotop results, we quickly pinpointed the flume process as the source of the heavy IO waits.
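iotop needs root privileges; to see only the processes that are actually doing IO, something like the following is enough (a sketch with commonly used flags):

```bash
# -o: only show processes/threads actually doing I/O
# -P: show processes instead of individual threads
# -b -n 3: batch mode, three iterations (convenient for logging)
sudo iotop -o -P -b -n 3
```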
But as I said at the beginning, the machines in the cluster have identical configurations, and the deployed programs are kept fully in sync with rsync. Could the hard disk be failing?
This had to be verified by an operations colleague, and the final conclusion was:
sdb is a two-disk RAID 1. Its RAID card, an "LSI Logic / Symbios Logic SAS1068E", has no cache, and a sustained load of nearly 400 IOPS had reached the hardware's limit. The other machines use an "LSI Logic / Symbios Logic MegaRAID SAS 1078" RAID card with a 256MB cache and had not hit a hardware bottleneck. The solution is to switch to a machine with higher IOPS; in the end we moved to machines with a PERC 6/i integrated RAID controller. Note that RAID metadata is stored both on the RAID card and in the disk firmware; the RAID information on the disk must match the format expected by the RAID card, otherwise the card cannot recognize the disk and the disk has to be reformatted.
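If you want to check which RAID controller a machine has without opening the case, the PCI device list is usually enough (a sketch; the grep pattern is just a guess at common controller names):

```bash
# List PCI devices and keep lines that look like RAID/SAS controllers;
# the model string (SAS1068E, MegaRAID SAS 1078, PERC, ...) shows up here.
lspci | grep -iE 'raid|sas'
```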
IOPS ultimately depends on the disk itself, but there are many ways to improve it: adding a hardware cache and using RAID arrays are common approaches. For high-IOPS scenarios such as databases, it is now popular to replace traditional mechanical disks with SSDs.
But as mentioned before, the point of approaching this from both the software and hardware sides is to see whether we can find the lowest-cost solution on each:
Now that we know the hardware cause, we can try moving the read and write operations to another disk and then see what effect that has:
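What "moving the reads and writes" looks like depends on the application. As a purely illustrative sketch, with hypothetical paths and service name rather than the actual setup, relocating a busy data directory onto another disk and symlinking it back might look like this:

```bash
# Stop the writer first so the directory is quiescent
# (the service name below is hypothetical)
sudo service flume-ng-agent stop

# Move the hot data directory to a less loaded disk, then symlink it back
# (both paths are hypothetical)
sudo mv /data/flume /mnt/other-disk/flume
sudo ln -s /mnt/other-disk/flume /data/flume

sudo service flume-ng-agent start
```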
3. Final words: Find another way
In fact, besides using the professional tools above, we can also locate the relevant processes directly from their process state.
We know that the process has the following states:
•D uninterruptible sleep (usually IO)
•R running or runnable (on run queue)
•S interruptible sleep (waiting for an event to complete)
•T stopped, either by a job control signal or because it is being traced.
•W paging (not valid since the 2.6.xx kernel)
•X dead (should never be seen)
•Z defunct ("zombie") process, terminated but not reaped by its parent.
State D is the classic "uninterruptible sleep" caused by waiting on IO. We can start from this and locate the problem step by step:
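A low-tech way to do that is to sample the process table repeatedly and keep only the processes in state D (a sketch using standard ps fields):

```bash
# Every few seconds, list processes whose state starts with D
# (uninterruptible sleep); the repeat offenders are the IO suspects.
for i in $(seq 10); do
    ps -eo state,pid,cmd | awk '$1 ~ /^D/'
    echo '----'
    sleep 5
done
```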