A Linux system is made up of several major physical components, such as the CPU, memory, network cards, and storage devices. To manage a Linux environment effectively, you should be able to measure these resources with reasonable accuracy: how much load each component is handling, whether there are bottlenecks, and so on. The commands below cover the basics of Linux resource monitoring.
View system release version
root@cf0c6032ba2f:/# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
top(cpu)
The Cpu(s) line in the top output summarizes how CPU time is currently being spent:
Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st
us: User CPU time. The percentage of CPU time spent running user processes that have not been niced (a "niced" process is one whose scheduling priority has been changed relative to other processes).
sy: System CPU time. The percentage of CPU time spent running the kernel and kernel processes.
ni: Nice CPU time. If you have changed the priority of some processes with nice, this shows the percentage of CPU time they use.
id: CPU idle time. This is one of the metrics you want to be high. It is the percentage of time the CPU is idle. If the system is slow but this value is high, you can be sure that high CPU load is not the cause.
wa: I/O wait. The percentage of time the CPU is idle while waiting for I/O operations to complete. This is a very valuable metric when troubleshooting a slow system, because if it is low you can quickly rule out I/O (mainly disk) as the bottleneck.
hi: Hardware interrupts. The percentage of time the CPU spends servicing hardware interrupts.
si: Software interrupts. The percentage of time the CPU spends servicing software interrupts.
st: Steal time. If you are running inside a virtual machine, this is the percentage of CPU time taken away from this virtual machine by the hypervisor for other tasks, such as running other virtual machines.
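The fields above can be pulled out of the Cpu(s) line programmatically. The following is a minimal Python sketch, not part of the original article; the function name parse_cpu_line is made up for illustration, and the pattern assumes the two-letter field labels shown in the sample line.

```python
import re

def parse_cpu_line(line):
    """Parse a top 'Cpu(s):' summary line into {field: percent}."""
    # Fields look like '11.4%us'. Some top versions print '11.4 us' instead,
    # so the '%' sign is treated as optional (an assumption of this sketch).
    pairs = re.findall(r"([\d.]+)\s*%?\s*([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}

sample = "Cpu(s): 11.4%us, 29.6%sy, 0.0%ni, 58.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st"
cpu = parse_cpu_line(sample)

# Time spent doing actual work in user and kernel space.
busy = cpu["us"] + cpu["sy"] + cpu["ni"]
print(f"busy={busy:.1f}% idle={cpu['id']}% iowait={cpu['wa']}%")
```

A low id together with a high wa points toward I/O rather than raw CPU demand, which is the distinction the field descriptions draw.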
Check the number of CPUs
The CPU layout of a Linux server can be worked out from /proc/cpuinfo using these rules:
Logical CPUs with the same core id are hyper-threads of the same core.
Logical CPUs with the same physical id are threads or cores in the same physical CPU package.
The command to display the number of physical CPUs is as follows:
cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l
The command to display the number of cores in each physical CPU is as follows:
cat /proc/cpuinfo | grep "cpu cores" | uniq
The command result is as follows: cpu cores: 1
The command to display the number of logical CPUs is as follows:
cat /proc/cpuinfo | grep "processor" | wc -l
The command result is as follows: 4
In principle, the following equation should hold: number of physical CPUs × number of cores per CPU = number of logical CPUs. If the two sides are not equal, the server's CPUs have hyper-threading enabled. When configuring server applications, use the number of logical CPUs as the reference.
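The three counts and the hyper-threading check can also be computed in one pass. This is a rough Python sketch under stated assumptions: the function names are hypothetical, the sample text is made up for illustration (it is not output from the server above), and the parsing assumes the "physical id" and "cpu cores" keys are present, which is not the case on every architecture or virtual machine.

```python
def cpu_topology(cpuinfo_text):
    """Return (physical CPUs, cores per CPU, logical CPUs) from /proc/cpuinfo text."""
    physical_ids = set()
    cores_per_cpu = 0
    logical = 0
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1                    # one entry per logical CPU
        elif key == "physical id":
            physical_ids.add(value)         # one distinct id per package
        elif key == "cpu cores":
            cores_per_cpu = int(value)      # cores in each package
    return len(physical_ids), cores_per_cpu, logical

def hyper_threading_enabled(physical, cores, logical):
    """More logical CPUs than physical cores implies hyper-threading."""
    return logical > physical * cores

# Made-up sample: one package, two cores, four logical CPUs.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0
cpu cores : 2

processor : 1
physical id : 0
core id : 0
cpu cores : 2

processor : 2
physical id : 0
core id : 1
cpu cores : 2

processor : 3
physical id : 0
core id : 1
cpu cores : 2
"""

physical, cores, logical = cpu_topology(SAMPLE)
print(physical, cores, logical, hyper_threading_enabled(physical, cores, logical))
```

On a real server, pass the contents of /proc/cpuinfo instead of SAMPLE, for example open("/proc/cpuinfo").read().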
uptime (average load)
Sometimes the system feels slow but the cause is not obvious. In that case, check the load average to see whether a large number of processes are waiting in the run queue. The average number of processes in the run queue over a given interval reflects how busy the system is, so the load average is usually the first thing to check when a website or system slows down. The simplest command for this is uptime, as shown below:
uptime
The command displays the results as follows:
11:31:11 up 11 days, 19:01, 2 users, load average: 0.02, 0.01, 0.00
Mainstream servers today have two quad-core CPUs and are quite powerful, so for general application services you rarely need to worry about the load on the Linux system.
What matters here is the load average output. Its three values are the averages over the last 1, 5, and 15 minutes, and they should generally not exceed the number of logical CPUs in the system. For example, if the system has 4 logical CPUs and all three load average values stay above 4, the CPUs are very busy and the load is high, which may affect system performance. An occasional value above 4 is nothing to worry about and generally will not affect performance. Conversely, if the load average is below the number of CPUs, the CPUs still have idle capacity. The output in this example shows that the CPUs are mostly idle.
At this point, we can use the vmstat command to confirm whether the system is too busy. If it is, consider replacing the server or adding CPUs. In summary: if r is often greater than 3 or 4 and id is often less than 50, the CPU is heavily loaded.
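The comparison between load average and logical CPU count can be written down as a small check. This Python sketch is only illustrative; parse_load_average and overloaded are hypothetical names, and "every window above the CPU count" is one possible reading of the rule of thumb.

```python
import re

def parse_load_average(uptime_line):
    """Extract the 1, 5, and 15 minute load averages from uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: " + uptime_line)
    return tuple(float(x) for x in match.groups())

def overloaded(loads, logical_cpus):
    """True when every window exceeds the logical CPU count (sustained load)."""
    return all(load > logical_cpus for load in loads)

line = "11:31:11 up 11 days, 19:01, 2 users, load average: 0.02, 0.01, 0.00"
loads = parse_load_average(line)
print(loads, overloaded(loads, logical_cpus=4))
```

Requiring all three windows to be high filters out the occasional spike, which the text treats as harmless.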
top(mem)
Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers
Swap: 1004052k total, 4360k used, 999692k free, 286040k cached
The Mem line shows how much physical memory there is in total, how much is used, how much is free, and how much is used for buffers. The Swap line gives the same information for swap space, plus how much RAM is used by the Linux file cache (cached).
To find out how much RAM processes are really using, subtract the file cache from the used figure. In the sample output, 286040KB of the 997408KB of used RAM is occupied by the file cache, so only 711368KB is actually in use by processes. A good way to tell whether you are running out of RAM is to look at the file cache.
If the used memory minus the file cache is still large and swap usage is also high, you most likely do have a memory problem.
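The subtraction can be reproduced directly from the two summary lines. This Python sketch is an illustration only; kb_fields is a made-up helper, and it assumes the older top layout in which each value is written as a number followed by "k" and a label.

```python
import re

def kb_fields(line):
    """Map labels such as 'used' to their KB values in a top Mem:/Swap: line."""
    return {name: int(kb) for kb, name in re.findall(r"(\d+)k (\w+)", line)}

mem = kb_fields("Mem: 1024176k total, 997408k used, 26768k free, 85520k buffers")
swap = kb_fields("Swap: 1004052k total, 4360k used, 999692k free, 286040k cached")

# 'cached' is printed on the Swap line but describes RAM used by the file cache.
used_without_cache = mem["used"] - swap["cached"]
swap_used_pct = 100 * swap["used"] / swap["total"]

print(used_without_cache, "KB used by processes")
print(round(swap_used_pct, 2), "% of swap in use")
```

Here swap usage is well under 1%, so by the rule above this sample does not indicate a memory problem.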
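free -m(memory)
This command displays current memory usage; the -m option shows the values in megabytes. It is available only on Linux; FreeBSD does not have this command. The command output is as follows: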
             total       used       free     shared    buffers     cached
Mem:          3949       1397       2551          0        268        917
-/+ buffers/cache:        211       3737
Swap:         8001          0       8001
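The fields in the output above are as follows:
total: total memory.
used: memory already in use.
free: free memory.
shared: total memory shared by multiple processes.
buffers (buffer cache) and cached (page cache): the size of the disk cache.
-buffers/cache: memory actually in use, that is, used - buffers - cached.
+buffers/cache: memory actually available, that is, free + buffers + cached.
The formula for available memory is therefore: available memory = free + buffers + cached, that is, 2551MB + 268MB + 917MB = 3737MB.
Note: the two sides of this equation are not exactly equal (the sum is 3736), but that does not matter, because -m rounds each value to a whole number. If you doubt the result, run free without -m and the figures will be clear at a glance.
So -buffers/cache reflects the memory genuinely occupied by programs, while +buffers/cache reflects the total memory that can be reclaimed for use.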
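The two derived figures, and the off-by-one caused by rounding to megabytes, can be checked with a few lines of arithmetic. This is a small Python sketch using the values from the sample output; the variable names are chosen for illustration.

```python
# Values from the Mem: row of the free -m sample, in MB.
mem = {"total": 3949, "used": 1397, "free": 2551,
       "shared": 0, "buffers": 268, "cached": 917}

# "-buffers/cache": memory really occupied by programs.
really_used = mem["used"] - mem["buffers"] - mem["cached"]

# "+buffers/cache": memory that could be made available.
available = mem["free"] + mem["buffers"] + mem["cached"]

print(really_used, available)
```

free itself reports 211 and 3737 for these two figures. The differences of 1 MB come from each column being rounded independently before display.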
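vmstat(io)
vmstat is a fairly comprehensive performance analysis tool. It reports process states, memory usage, virtual memory usage, disk I/O, interrupts, context switches, CPU usage, and other performance information, so it is well worth mastering.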
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
 2  0      0 519024  74732 4606568    0    0     3     9    5   10 27  5 68  0
 2  0      0 519664  74732 4606568    0    0     0     0 1847 1244 20 17 63  0
 1  0      0 517296  74732 4606568    0    0     0   284 2092 1617 37 17 47  0
 3  0      0 515440  74732 4606568    0    0     0   164 1620  718 26 17 57  0
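Where:
(1) procs: r is the number of processes waiting to run; b is the number of processes in uninterruptible sleep.
(2) memory: swpd is the amount of virtual memory used (KB); free is the amount of idle memory (KB); buff is the amount of memory used as buffers (KB).
(3) swap: si is the amount of memory swapped in from disk (KB/s); so is the amount of memory swapped out to disk (KB/s).
(4) io: bi is the number of blocks received from block devices (blocks/s); bo is the number of blocks sent to block devices (blocks/s).
(5) system: in is the number of interrupts per second, including clock interrupts; cs is the number of context switches per second.
(6) cpu: shown as percentages of total CPU time. us is user time; sy is system time; id is idle time.
Under normal conditions r and b should satisfy r < 5 and b ≈ 0. If user% + sys% < 70%, system performance is good. If user% + sys% >= 85%, system performance is poor and the system needs a thorough check. Here user% is the percentage of time the CPU spends in user mode, and sys% is the percentage of time it spends in system mode.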
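These rules of thumb are easy to apply to a vmstat data row. The Python sketch below is an illustration only: the function names are hypothetical, the "borderline" label for the 70% to 85% range is an assumption (the text does not name that range), and b ≈ 0 is simplified to b == 0.

```python
COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

def parse_vmstat_row(row):
    """Turn one vmstat data row into a {column: value} dict."""
    values = [int(v) for v in row.split()]
    return dict(zip(COLUMNS, values))

def assess(sample):
    """Apply the r/b and user%+sys% thresholds described in the text."""
    cpu_busy = sample["us"] + sample["sy"]
    if cpu_busy >= 85:
        cpu = "poor"
    elif cpu_busy < 70:
        cpu = "good"
    else:
        cpu = "borderline"   # assumption: the text leaves 70-85% unnamed
    return {
        "cpu": cpu,
        "run_queue_ok": sample["r"] < 5,
        "blocked_ok": sample["b"] == 0,
    }

# One data row in the 16-column layout listed in COLUMNS.
row = parse_vmstat_row("1 0 0 517296 74732 4606568 0 0 0 284 2092 1617 37 17 47 0")
print(assess(row))
```

A single row is only a snapshot, so in practice the check would be applied to several consecutive samples, for example from vmstat 1 5.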
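ps auxf(processes)
To see every process that users are running on the system, use the following options with the ps command:
a (all users)
u (user-oriented format, that is, show the user who owns each process)
x (processes without a controlling tty or terminal screen; another way of saying "show every process")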
ps aux
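Note that "ps -aux" is different from "ps aux". The POSIX and UNIX standards require "ps -aux" to print all processes owned by a user named "x", as well as all processes that would be selected by the -a option. If no user named "x" exists, ps may interpret the command as "ps aux" instead and print a warning. This behavior is intended to ease the transition of old scripts and habits. It is fragile, subject to change, and should not be relied upon.
To view the process tree, add the f option (its name comes from "ASCII art forest") to the a, u, and x options used above.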
ps auxf
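ps -ef(processes)
ps aux displays its results in BSD format, while ps -ef uses the System V full-format listing, which shows process names with their full paths.
One practical difference is that aux may truncate the COMMAND column, whereas -ef does not. When combining ps with grep, prefer -ef to avoid false results.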
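netstat(network)
The netstat command displays information about network connections, routing tables, and network interfaces, letting you see which network connections are currently active. Its important options are described below:
-A: shows the address of any associated protocol control blocks. Mainly used for debugging.
-a: shows the state of all sockets. Normally, sockets associated with server processes are not shown.
-i: shows the state of auto-configured interfaces. Interfaces configured after the initial system boot are not included in the output.
-m: prints network memory usage.
-n: prints actual addresses instead of interpreting them or displaying symbolic host and network names.
-r: prints the routing table.
-f address_family: prints statistics and control block information for the named address family. So far the only supported address family is inet.
-I interface: prints only the state of the named interface.
-p protocol-name: prints only the statistics and protocol control block information for the named protocol.
-s: prints per-protocol statistics.
-t: replaces the queue length information in the output with timer information.
The two options we use most, and are most used to, are -a and -n, that is, netstat -an, as shown below: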
netstat -an | grep -v unix
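A common follow-up to netstat -an is to count TCP connections by state. The Python sketch below shows one way to do that; count_tcp_states is a hypothetical helper, the sample text is made up for illustration, and the parsing assumes the Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```python
from collections import Counter

def count_tcp_states(netstat_output):
    """Count TCP sockets per state in `netstat -an` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have six columns, with the state last; UDP rows have no state.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states

# Made-up sample using documentation-range addresses.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:22           192.0.2.20:51514        ESTABLISHED
tcp        0      0 192.0.2.10:80           192.0.2.30:40022        TIME_WAIT
tcp        0      0 192.0.2.10:80           192.0.2.31:40100        TIME_WAIT
udp        0      0 0.0.0.0:68              0.0.0.0:*
"""

states = count_tcp_states(SAMPLE)
print(dict(states))
```

On a live system the input would be the captured output of netstat -an rather than SAMPLE.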
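lsof(files)
lsof (list open files) is a tool that lists the files currently open on the system. In a UNIX environment everything exists as a file: through files you can access not only regular data but also network connections and hardware. For things such as Transmission Control Protocol (TCP) and User Datagram Protocol (UDP) sockets, the system assigns the application a file descriptor behind the scenes. Whatever the nature of the file, the file descriptor provides a common interface between the application and the underlying operating system. Because the list of file descriptors an application has open reveals a great deal about that application, viewing the list with lsof is very helpful for system monitoring and troubleshooting. Incidentally, the tool first appeared on UNIX systems and was later ported to Linux.
At work the most frequently used option is -i, which shows what is using a particular port. For example, lsof -i:22 shows which programs are using port 22.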
fdisk -l(disk partitions)
View disk and partition information as follows:
fdisk -l
The command output is as follows:
Disk /dev/sda: 160.0 GB, 160040803840 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          13      104391   83  Linux
/dev/sda2              14        3200    25599577+  83  Linux
/dev/sda3            3201        3582     3068415   82  Linux swap / Solaris
/dev/sda4            3583       19457   127515937+   5  Extended
/dev/sda5            3583       19457   127515906   83  Linux
The output above shows that this is a 160GB server hard disk.
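The geometry figures in the fdisk output can be cross-checked with simple arithmetic. The Python sketch below uses the numbers from the sample output; the variable names are illustrative, and it assumes the Blocks column counts 1024-byte blocks.

```python
heads, sectors_per_track, cylinders = 255, 63, 19457
sector_bytes = 512
reported_bytes = 160040803840          # from the "Disk /dev/sda" line

# One cylinder = heads * sectors per track * bytes per sector.
bytes_per_cylinder = heads * sectors_per_track * sector_bytes

# Space covered by whole cylinders; the remainder is a partial cylinder.
whole_cylinder_bytes = bytes_per_cylinder * cylinders
leftover_bytes = reported_bytes - whole_cylinder_bytes

# /dev/sda2: 25599577 blocks of 1024 bytes, expressed in GiB.
sda2_gib = round(25599577 * 1024 / 2**30, 1)

print(bytes_per_cylinder, whole_cylinder_bytes, leftover_bytes, sda2_gib)
```

The per-cylinder figure matches the Units line, and the /dev/sda2 size of about 24.4 GiB is in line with a 24G root file system.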
df(disk space)
Check the disk space usage of the file systems with the following command:
df -h
The command output is as follows:
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        24G  5.9G   17G  26% /
/dev/sda5       118G  8.8G  103G   8% /data
/dev/sda1        99M   20M   75M  21% /boot
tmpfs           859M     0  859M   0% /dev/shm
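du(directory size)
Checking the size of a directory on a Linux system is a common task at work. Use the following command:
du -sh directory_name
For example, du -sh /data produces the following output:
8.6G /data/
Check whether any partition has an excessively high usage rate (Use%), for example above 90%. If a partition is nearly full, change to its mount point and run the following command, which sorts the entries from largest to smallest and lists the ten files or directories that take up the most space: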
du -sh * | sort -hr | head -n 10
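The same two checks (usage percentage and largest entries) can be scripted. The Python sketch below is a rough equivalent under stated assumptions: the function names are hypothetical, use_percent divides used by total space and so can differ slightly from the Use% column of df, and dir_size adds up file lengths rather than allocated disk blocks, so its totals will not match du exactly.

```python
import os
import shutil

def use_percent(path):
    """Used space as a percentage of the file system that holds path."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

def dir_size(path):
    """Total size in bytes of the files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                try:
                    total += os.path.getsize(full)
                except OSError:
                    pass  # file vanished or is unreadable; skip it
    return total

def largest_entries(path, limit=10):
    """Return up to `limit` (size, name) pairs for path, largest first."""
    sizes = []
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            size = dir_size(entry.path)
        else:
            size = entry.stat(follow_symlinks=False).st_size
        sizes.append((size, entry.name))
    return sorted(sizes, reverse=True)[:limit]
```

A typical use would be to call use_percent on a mount point such as /data and, if the result is above 90, print largest_entries for the same path.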