Background
"There is no problem offline", "There cannot be problems with the code because of the system", "Can I remotely debug online"
Online problems are different from bugs during development and runtime. Environment, pressure, concurrency situations, specific business related. For online problems, it is important to use the tools available in the online environment to collect necessary information to locate the problem.
It is difficult to intuitively obtain data on bugs and resource bottlenecks that cause problems. It is necessary to infer the root cause of the problem based on resource usage data, logs and other information. And the location of difficult problems usually requires the use of different methods to trace the source.
In this wiki, I have compiled the tools I have used and shared some cases.
1. Frequently Asked Questions
1.1 Availability
Here are a few common situations that lead to service availability:
a) 502 Bad Gateway
The most common problem for application systems, especially http-based applications Nothing is more serious than "502 Bad Gateway", which means that the back-end service is completely unavailable. Possible reasons
Insufficient resources 1: caused by garbage collection, which will cause serious application suspension when the CMS leaks application memory or lacks memory. .
Insufficient resources 2: Insufficient number of server threads. Common web servers such as tomcat and jetty are configured with a maximum working thread.
Insufficient resources 3: Insufficient database resources. The database usually uses a connection pool configuration. The maxConnection configuration is too low and too high. How slow a query will block the working thread of the web server
Insufficient resources 4: IO resource bottleneck, online environment IO is shared, especially for mixed environments (fortunately, CRM does not have this situation, but there are many agents), we The commonly used log4j logging tool is also an exclusive resource for each recorded log file. The thread must first obtain the lock before it can record data to the log.
... ...
Various OOM
b) Socket exceptions
Common Connection reset by peer, Broken Pipe, EOFException
Network problems: in the case of cross-operator and computer room access You may encounter
Program bug: socket is closed abnormally
1.2 Average response time
The most intuitive symptom when a problem occurs in the system. This parameter can provide early warning before the situation worsens and infects other services, causing the entire system to become unavailable. Possible reasons:
Resource competition 1: CPU
Resource competition 2: IO
Resource competition 3: network IO
Resource competition 4: Database
Resource competition 5: solr, medis
Downstream Interface: Exception causes response delay
1.3 Machine alarm
Compared with application service unavailability, this type of error will not directly lead to service unavailability, and if there is confusion, multiple services deployed on the machine may interfere with each other:
CPU
Disk
fd
IO (Network Disk)
1.4 Summary
It took me a long time to write, many cases are mentioned repeatedly, usually the cause of online problems is nothing more than system resources. , applications, and mastering the tools to monitor and view these resources and data will make it easier to locate online problems.
2Commonly used tools
2.1 Linux tools
a) sysstat:
iostat: View read and write pressure
[sankuai@cos-mop01 logs]$ iostat Linux 2.6.32-20131120.mt (cos-mop01.lf.sankuai.com) 2015年10月21日 _x86_64_ (4 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 1.88 0.00 0.87 0.12 0.05 97.07 Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn vda 1.88 57.90 12.11 2451731906 512911328 vdb 0.01 0.40 1.41 17023940 59522616 vdc 1.14 28.88 36.63 1223046988 1551394969
sar: View CPU network IO IO, enable parameters to view historical data
/etc/sysconfig/sysstat HISTORY=7 /etc/cron.d/sysstat */10 * * * * root /usr/lib/sa/sa1 1 1 sar -u/-r/-B/-b/-q/-P/-n -f /var/log/sa/sa09
b) top
Pay attention to load, cpu, mem, swap
You can view resource information according to threads (version is greater than 3.2.7)
top - 19:33:00 up 490 days, 4:33, 2 users, load average: 0.13, 0.39, 0.42 Tasks: 157 total, 1 running, 156 sleeping, 0 stopped, 0 zombie Cpu(s): 4.9%us, 2.7%sy, 0.0%ni, 92.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st Mem: 5991140k total, 5788884k used, 202256k free, 4040k buffers Swap: 2096440k total, 447332k used, 1649108k free, 232884k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA COMMAND 18720 sankuai 20 0 8955m 4.3g 6744 S 22.6 74.5 174:30.73 0 4 8.6g java 27794 sankuai 20 0 5715m 489m 2116 S 11.6 8.4 3922:43 121m 4 3.9g java 13233 root 20 0 420m 205m 2528 S 0.0 3.5 1885:15 91m 4 304m puppetd 21526 sankuai 20 0 2513m 69m 4484 S 0.0 1.2 45:56.28 37m 4 2.4g java
c) vmstat
[sankuai@cos-mop01 logs]$ vmstat procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 0 0 447332 200456 4160 234512 0 0 11 6 0 0 2 1 97 0 0
d) tcpdump
A tool for locating network problems. You can see the details of TCPIP messages. You need to be familiar with the TCPIP protocol at the same time. It can be used in conjunction with wireshark.
Common scenario analysis of network delay, network packet loss, and network problem analysis in complex environments.
#!/bin/bash tcpdump -i eth0 -s 0 -l -w - dst port 3306 | strings | perl -e ' while() { chomp; next if /^[^ ]+[ ]*$/; if(/^(SELECT|UPDATE|DELETE|INSERT|SET|COMMIT|ROLLBACK|CREATE|DROP|ALTER|CALL)/i) { if (defined $q) { print "$q\n"; } $q=$_; } else { $_ =~ s/^[ \t]+//; $q.=" $_"; } }'
3.2 java tools
a) jstat
[sankuai@cos-mop01 logs]$ jstat -gc 18704 S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT 3584.0 3584.0 0.0 0.0 24064.0 13779.7 62976.0 0.0 4480.0 677.9 384.0 66.6 0 0.000 0 0.000 0.000
b) jmap
jmap -dump:format=b,file=heap.bin $pid
c) jstack or kill -3
View deadlocks and thread waits.
Thread status:
Running
TIMED_WAITING(on object monitor)
TIMED_WAITING(sleeping)
TIMED_WAITING(parking)
WAINTING(on object monitor)
d) jhat jconsole
jhat is difficult to use jconsole to obtain information through jmx, which has an impact on performance
e) gc log
-XX:+UseParallelOld
-XX:+ConcurrentMultiSweep
3.3 Third-party tools
a) mat
Object details
inboud/outbound
thread overview
配置项
./MemoryAnalyzer -keep_unreachable_objects heap_file
4. 案例分析
4.1 cpu高
现象:CPU报警
定位问题:
查看CPU占用高的线程
sankuai@sin2:~$ ps H -eo user,pid,ppid,tid,time,%cpu|sort -rnk6 |head -10 sankuai 13808 13807 13808 00:00:00 8.4 sankuai 29153 1 29211 00:21:13 0.9 sankuai 29153 1 29213 00:20:01 0.8 sankuai 29153 1 29205 00:17:35 0.7 sankuai 29153 1 29210 00:11:50 0.5 sankuai 29153 1 1323 00:08:37 0.5 sankuai 29153 1 29207 00:10:02 0.4 sankuai 29153 1 29206 00:07:10 0.3 sankuai 29153 1 29208 00:06:44 0.2
thread dump
jstack $pid > a.txt printf %x $tid $xTID
查找线程执行的代码
"main-SendThread(cos-zk13.lf.sankuai.com:9331)" #25 daemon prio=5 os_prio=0 tid=0x00007f78fc350000 nid=$TIDx runnable [0x00007f79c4d09000] java.lang.Thread.State: RUNNABLE at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
4.2 io高
现象:磁盘IO报警
环境:需要安装sysstat工具
定位问题:
a) 查看CPU占用高的线程
1
pidstat -d -t -p $pid
b) 其他同4.1
4.3 资源
a) 数据库
"DB-Processor-13" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f000] java.lang.Thread.State: BLOCKED (on object monitor) at ConnectionPool.getConnection(ConnectionPool.java:102) - waiting to lock <0xe0375410> (a beans.ConnectionPool) at Service.getCount(ServiceCnt.java:111) at Service.insert(ServiceCnt.java:43) "DB-Processor-14" daemon prio=5 tid=0x003edf98 nid=0xca waiting for monitor entry [0x000000000825f020] java.lang.Thread.State: BLOCKED (on object monitor) at ConnectionPool.getConnection(ConnectionPool.java:102) - waiting to lock <0xe0375410> (a beans.ConnectionPool) at Service.getCount(ServiceCnt.java:111) at Service.insertCount(ServiceCnt.java:43)
b) log
"RMI TCP Connection(267865)-172.16.5.25" daemon prio=10 tid=0x00007fd508371000 nid=0x55ae waiting for monitor entry [0x00007fd4f8684000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.log4j.Category.callAppenders(Category.java:201) - waiting to lock <0x00000000acf4d0c0> (a org.apache.log4j.Logger) at org.apache.log4j.Category.forcedLog(Category.java:388) at org.apache.log4j.Category.log(Category.java:853) at org.apache.commons.logging.impl.Log4JLogger.warn(Log4JLogger.java:234) at com.xxx.core.common.lang.cache.remote.MemcachedClient.get(MemcachedClient.java:110)
c) web server
有两个非常重要的系统参数:
maxThread: 工作线程数
backlog:TCP连接缓存数,Jetty(ServerConnector.acceptQueueSize) Tomcat(Connector.acceptCount),高并发下设置过小会有502
4.4 gc
a) CMS fail
promotion failed
172966 2015-09-18T03:47:33.108+0800: 627188.183: [GC 627188.183: [ParNew (promotion failed) 172967 Desired survivor size 17432576 bytes, new threshold 1 (max 6) 172968 - age 1: 34865032 bytes, 34865032 total 172969 : 306688K->306688K(306688K), 161.1284530 secs]627349.311: [CMS CMS: abort preclean due to time 2015-09-18T03:50:14.743+0800: 627349.818: [CMS-concurrent-abortable-preclean: 1.597/162.729 secs] [Times: user=174.58 sys=84.57, real=162.71 secs] 172970 (concurrent mode failure): 1550703K->592286K(1756416K), 2.9879760 secs] 1755158K->592286K(2063104K), [CMS Perm : 67701K->67695K(112900K)], 164.1167250 secs] [Times: user=175.61 sys=84.57, real=164.09 secs]
concurrent fail
[CMS2015-09-18T07:07:27.132+0800: 639182.207: [CMS-concurrent-sweep: 1.704/13.116 secs] [Times: user=17.16 sys=5.20,real=13.12 secs] 443222 (concurrent mode failure): 1546078K->682301K(1756416K), 4.0745320 secs] 1630977K->682301K(2063104K), [CMS Perm :67700K->67693K(112900K)], 15.4860730 secs] [Times: user=19.40 sys=5.20, real=15.48 secs]
b) 连续Full GC
应用存在内存泄漏,垃圾收集会占用系统大量cpu时间,极端情况下可能发生90%以上时间在做GC的情况。
在系统使用http访问check alive或者使用了Zookeeper这种通过心跳保证存活性的应用中,会可用性异常或者被zk的master剔除。
5. 注意
保留现场:threaddump top heapdump
注意日志记录:文件 数据库
The above is the detailed content of Summary of Java problem location and solutions. For more information, please follow other related articles on the PHP Chinese website!