Home > Database > Mysql Tutorial > 11.2.0.3RAC(VCS)节点crash以及hang的问题分析

11.2.0.3RAC(VCS)节点crash以及hang的问题分析

WBOY
Release: 2016-06-07 15:55:50
Original
2822 people have browsed it

昨天某个客户的一套双节RAC其中一个节点crash,同时最后导致另外一个节点也hang住,只能shutdown abort. 且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。其中有lgwr,arch等进程. 首先我们来看下,在下午出现crash的节点的alert

昨天某个客户的一套双节RAC其中一个节点crash,同时最后导致另外一个节点也hang住,只能shutdown abort.
且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。其中有lgwr,arch等进程.

首先我们来看下,在下午出现crash的节点的alert log信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 Tue Apr 22 17:16:04 2014 <code class="php plain">Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4878_9l51y1cc_.arc <code class="php plain">Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4879_9l529hc6_.arc <code class="php plain">Archived Log entry 10847 added <code class="php keyword">for <code class="php plain">thread 1 sequence 5314 ID 0xffffffffae21a60f dest 1: <code class="php plain">Tue Apr 22 17:25:05 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)] <code class="php plain">Receiver: inst 2 binc 95439 ospid 13752 <code class="php plain">Communications reconfiguration: instance_number 2 <code class="php plain">Tue Apr 22 17:26:49 2014 <code class="php plain">LMON (ospid: 27573) has not called a wait <code class="php keyword">for <code class="php plain">89 secs. <code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_lmhb_27613.trc (incident=14129): <code class="php plain">ORA-29770: <code class="php keyword">global <code class="php plain">enqueue process LMON (OSID 27573) is hung <code class="php keyword">for <code class="php plain">more than 70 seconds <code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/incident/incdir_14129/authorcl1_lmhb_27613_i14129.trc <code class="php plain">Tue Apr 22 17:26:58 2014 <code class="php plain">Sweep [inc][14129]: completed <code class="php plain">Sweep [inc2][14129]: completed <code class="php plain">ERROR: Some process(s) is not making progress. <code class="php plain">LMHB (ospid: 27613) is terminating the instance. <code class="php plain">Please check LMHB trace file <code class="php keyword">for <code class="php plain">more details. <code class="php plain">Please also check the CPU load, I/O load <code class="php keyword">and <code class="php plain">other system properties <code class="php keyword">for <code class="php plain">anomalous behavior <code class="php plain">ERROR: Some process(s) is not making progress. <code class="php plain">Tue Apr 22 17:26:58 2014 <code class="php plain">System state dump requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination]. <code class="php plain">LMHB (ospid: 27613): terminating the instance due to error 29770 <code class="php plain">System State dumped to trace file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc <code class="php plain">Tue Apr 22 17:27:00 2014 <code class="php plain">ORA-1092 : opitsk aborting process <code class="php plain">Tue Apr 22 17:27:01 2014 <code class="php plain">License high water mark = 144 <code class="php plain">Tue Apr 22 17:27:08 2014 <code class="php plain">Termination issued to instance processes. Waiting <code class="php keyword">for <code class="php plain">the processes to <code class="php functions">exit <code class="php plain">Instance termination failed to kill one <code class="php keyword">or <code class="php plain">more processes <code class="php plain">Instance terminated by LMHB, pid = 27613 <code class="php plain">Tue Apr 22 17:27:15 2014 <code class="php plain">USER (ospid: 1378): terminating the instance <code class="php plain">Termination issued to instance processes. Waiting <code class="php keyword">for <code class="php plain">the processes to <code class="php functions">exit <code class="php plain">Tue Apr 22 17:27:25 2014 <code class="php plain">Instance termination failed to kill one <code class="php keyword">or <code class="php plain">more processes <code class="php plain">Instance terminated by USER, pid = 1378 <code class="php plain">Tue Apr 22 21:51:56 2014 <code class="php plain">Adjusting the <code class="php keyword">default <code class="php plain">value of parameter parallel_max_servers <code class="php plain">from 640 to 135 due to the value of parameter processes (150) <code class="php plain">Starting ORACLE instance (normal)

我们可以看到,最早在Apr 22 17:25:05 2014 时间点,即抛出LMON IPC send timeout的错误了。

Receiver: inst 2 binc 95439 ospid 13752 这里的receiver进程为节点2的13752进程,即节点2的LMON进程。

对于LMON进程,主要是监控RAC的GES信息,当然其作用不仅仅局限于此,其还负责检查集群中各个Node的健康
情况,当有节点出现故障是,负责进行reconfig以及GRD(global resource Directory)的恢复等等。我们知道
RAC的脑裂机制,如果IO fencing是Oracle本身来完成,也就是说由CLusterware来完成。那么Lmon进程检查到
实例级别出现脑裂时,会通知clusterware来进行脑裂操作,然而其并不会等待clusterware的处理结果。当等待
超过一定时间,那么LMON进程会自动触发IMR(instance membership recovery),这实际上也就是我们所说的
Instance membership reconfig。

其次,lmon进程主要通过2种心跳机制来检查判断集群节点的健康状态:
1) 网络心跳 (主要是通过ping进行检测)
2) 控制文件磁盘心跳,其实就是每个节点的ckpt进程每3s更新一次controlfile的机制。

所以这里大家可以看出,Lmon进程是需要操作controlfile的。否则无法判断第2点。
虽然从上面的错误来看,该实例是被LMHB进程给终止掉的,这里我们需要说明一下,LMBH进程的原理。
LMBH进程是Oracle 11R2版本引入的一个进程,该进程的作用的监控LMD,LMON,LCK,LMS等核心进程,防止这些Oracle

关键性后台进程spin或不被阻塞。该进程会定时的将监控的信息打印输出在trace文件中,便于我们进行诊断,

这也是11gR2一个亮点。当LMBH进程发现其他核心进程出现异常时,会尝试发起一些kill动作,如何有进程被阻塞的话。
如果一定时间内仍然无法解决,那么将触发保护,将实例强行终止掉,当然,这是为了保证RAC节点数据的完整性和一致性。

这里比较郁闷的是,这个diag的dump并没有产没 /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc

我们首先来看下Node1的Lmon进程的信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 *** ACTION NAME:() 2014-04-22 17:26:56.052 <code class="php plain">*** 2014-04-22 17:26:49.401 <code class="php plain">============================== <code class="php plain">LMON (ospid: 27573) has not moved <code class="php keyword">for <code class="php plain">105 sec (1398158808.1398158703) <code class="php plain">kjfmGCR_HBCheckAll: LMON (ospid: 27573) has status 6 <code class="php plain">================================================== <code class="php plain">=== LMON (ospid: 27573) Heartbeat Report <code class="php plain">================================================== <code class="php plain">LMON (ospid: 27573) has no heartbeats <code class="php keyword">for <code class="php plain">105 sec. (threshold 70 sec) <code class="php spaces"><code class="php plain">: Not in wait; last wait ended 89 secs ago. -------------等待了89秒 <code class="php spaces"><code class="php plain">: last wait_id 165313538 at <code class="php string">'enq: CF - contention'<code class="php plain">. <code class="php plain">============================== <code class="php plain">Dumping PROCESS LMON (ospid: 27573) States <code class="php plain">============================== <code class="php plain">===[ System Load State ]=== <code class="php spaces"><code class="php plain">CPU Total 16 Core 16 Socket 16 <code class="php spaces"><code class="php plain">Load normal: Cur 988 Highmark 20480 (3.85 80.00) <code class="php plain">===[ Latch State ]=== <code class="php spaces"><code class="php plain">Not in Latch Get <code class="php plain">===[ Session State Object ]=== <code class="php spaces"><code class="php plain">---------------------------------------- <code class="php spaces"><code class="php plain">SO: 0x52daba340, type: 4, owner: 0x52f5d8330, flag: INIT/-/-/0x00 <code class="php keyword">if<code class="php plain">: 0x3 c: 0x3 <code class="php spaces"><code class="php plain">proc=0x52f5d8330, name=session, file=ksu.h LINE:12624 ID:, pg=0 <code class="php spaces"><code class="php plain">(session) sid: 1057 ser: 1 trans: 0x0, creator: 0x52f5d8330 <code class="php spaces"><code class="php plain">flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- <code class="php spaces"><code class="php plain">flags2: (0x409) -/-/INC <code class="php spaces"><code class="php plain">DID: , short-term DID: <code class="php plain">txn branch: 0x0 <code class="php spaces"><code class="php plain">oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS <code class="php spaces"><code class="php plain">ksuxds FALSE at location: 0 <code class="php spaces"><code class="php plain">service name: SYS<code class="php variable">$BACKGROUND <code class="php spaces"><code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">Not in wait; last wait ended 1 min 29 sec ago <code class="php plain">There are 1 sessions blocked by this session. <code class="php spaces"><code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 1, sid: 297, ser: 6347 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'name-service call wait' <code class="php spaces"><code class="php plain">p1: <code class="php string">'waittime'<code class="php plain">=0x32 <code class="php spaces"><code class="php plain">p2: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">p3: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <code class="php spaces"><code class="php plain">min_blocked_time: 0 secs, waiter_cache_ver: 30272 <code class="php spaces"><code class="php plain">Wait State: <code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x20 boundary=0x0/-1 <code class="php spaces"><code class="php plain">Session Wait History: <code class="php spaces"><code class="php plain">elapsed time of 1 min 29 sec since last wait ---LMON进程等待enq: CF - contention,等待了1分29秒,即89秒 <code class="php spaces"><code class="php plain">0: waited <code class="php keyword">for <code class="php string">'enq: CF - contention' <code class="php spaces"><code class="php plain">name|mode=0x43460006, 0=0x0, operation=0x3 <code class="php spaces"><code class="php plain">wait_id=165313538 seq_num=35946 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=1.027254 sec, exc=1.027254 sec, total=1.027254 sec <code class="php spaces"><code class="php plain">wait times: max=1.000000 sec <code class="php spaces"><code class="php plain">wait counts: calls=1 os=1 <code class="php spaces"><code class="php plain">occurred after 0.000109 sec of elapsed time <code class="php spaces"><code class="php plain">。。。。。。

如下是该进程的资源使用情况:

?
1 2 3 4 5 6 *** 2014-04-22 17:26:57.229 <code class="php plain">loadavg : 3.94 3.80 3.99 <code class="php plain">swap info: free_mem = 36949.53M rsv = 24548.22M <code class="php plain">alloc = 23576.62M avail = 45643.61M swap_free = 46615.21M <code class="php spaces"><code class="php plain">F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD <code class="php spaces"><code class="php plain">0 O oa 27573 1 6 79 20 ? 799187 Jan 23 ? 1589:29 ora_lmon_authorcl1

我们可以看到,系统在该时间点load并不高,Memory也很充足。

这里有一个问题,该节点LMON进程hung的原因是什么? 从日志分析来看,是由于无法获得enq: CF – contention。

我们知道ckpt 进程会定时更新操作controlfile,且就需要获得该enqueue。 所有这里我大胆的假设,是由于ckpt持有CF的latch

不释放,导致LMON进程无法获得. 根据这一点我搜mos 发现一个bug,可惜该bug说已经在11.2.0.3中fixed了。

Bug 10276173 LMON hang possible while trying to get access to CKPT progress record

该bug描述说,当在进行reconfig时,lmon会尝试去获得ckpt processes record,会等待enq: CF -contention,会导致hung.

根据文档来看,显然这跟我们的实际情况不符。
下面我们来结合Node2的日志进行综合分析:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 Tue Apr 22 17:25:06 2014 <code class="php plain">IPC Send timeout detected. Receiver ospid 13752 [oracle@xhdb-server4 (LMON)] <code class="php plain">Tue Apr 22 17:26:59 2014 <code class="php plain">Dumping diagnostic data in directory=[cdmp_20140422172658], requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination]. <code class="php plain">Tue Apr 22 17:29:22 2014 <code class="php plain">WARNING: aiowait timed out 1 times <code class="php plain">Tue Apr 22 17:29:53 2014 <code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3681): <code class="php plain">ORA-32701: Possible hangs up to hang ID=3 detected <code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3681/authorcl2_dia0_13750_i3681.trc <code class="php plain">Tue Apr 22 17:30:24 2014 <code class="php plain">DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3 <code class="php spaces"><code class="php plain">requested by master DIA0 process on instance 2 <code class="php spaces"><code class="php plain">Hang Resolution Reason: Automatic hang resolution was performed to free a <code class="php spaces"><code class="php plain">significant number of affected sessions. <code class="php spaces"><code class="php plain">by terminating session sid: 578 ospid: 16818 <code class="php plain">DIA0 successfully terminated session sid:578 ospid:16818 with status 31. <code class="php plain">DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3. <code class="php plain">Tue Apr 22 17:30:33 2014 <code class="php plain">LMS2 (ospid: 13764) has detected no messaging activity from instance 1 <code class="php plain">LMS2 (ospid: 13764) issues an IMR to resolve the situation <code class="php plain">Please check LMS2 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:31:48 2014 <code class="php plain">LMD0 (ospid: 13754) has detected no messaging activity from instance 1 <code class="php plain">LMD0 (ospid: 13754) issues an IMR to resolve the situation <code class="php plain">Please check LMD0 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:32:03 2014 <code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3682): <code class="php plain">ORA-32701: Possible hangs up to hang ID=3 detected <code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3682/authorcl2_dia0_13750_i3682.trc <code class="php plain">Tue Apr 22 17:32:16 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 23666 [oracle@xhdb-server4 (TNS V1-V3)] <code class="php plain">Receiver: inst 1 binc 380222 ospid 27575 <code class="php plain">IPC Send timeout to 1.0 inc 20 <code class="php keyword">for <code class="php plain">msg type 12 from opid 154 <code class="php plain">IPC Send timeout: Terminating pid 154 osid 23666 <code class="php plain">Tue Apr 22 17:32:20 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 13746 [oracle@xhdb-server4 (PING)] <code class="php plain">Receiver: inst 1 binc 380164 ospid 27565 <code class="php plain">Tue Apr 22 17:32:34 2014 <code class="php plain">DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3 <code class="php spaces"><code class="php plain">requested by master DIA0 process on instance 2 <code class="php spaces"><code class="php plain">Hang Resolution Reason: Automatic hang resolution was performed to free a <code class="php spaces"><code class="php plain">significant number of affected sessions. <code class="php spaces"><code class="php plain">by terminating the process <code class="php plain">DIA0 successfully terminated process ospid:16818. <code class="php plain">DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3. <code class="php plain">Tue Apr 22 17:32:35 2014 <code class="php plain">LMS1 (ospid: 13760) has detected no messaging activity from instance 1 <code class="php plain">LMS1 (ospid: 13760) issues an IMR to resolve the situation <code class="php plain">Please check LMS1 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:32:44 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 13754 [oracle@xhdb-server4 (LMD0)] <code class="php plain">Receiver: inst 1 binc 380222 ospid 27575 <code class="php plain">IPC Send timeout to 1.0 inc 20 <code class="php keyword">for <code class="php plain">msg type 65521 from opid 12 <code class="php plain">Tue Apr 22 17:33:11 2014 <code class="php plain">LMS0 (ospid: 13756) has detected no messaging activity from instance 1 <code class="php plain">LMS0 (ospid: 13756) issues an IMR to resolve the situation <code class="php plain">Please check LMS0 trace file <code class="php keyword">for <code class="php plain">more detail. <code class="php plain">Tue Apr 22 17:34:29 2014 <code class="php plain">IPC Send timeout detected. Sender: ospid 13764 [oracle@xhdb-server4 (LMS2)] <code class="php plain">Receiver: inst 1 binc 380309 ospid 27585 <code class="php plain">IPC Send timeout to 1.1 inc 20 <code class="php keyword">for <code class="php plain">msg type 65522 from opid 15 <code class="php plain">Tue Apr 22 17:36:31 2014

我们可以看到Node2 在Apr 22 17:26:59 2014 节点Node1的LMBH终止instance的信息了。然后在后面抛出hung的信息,
不过Oracle自动解决了hung的session。 下面我们来看下Node2上lmon进程的trace内容:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 *** 2014-04-22 17:26:59.377 <code class="php plain">Process diagnostic dump <code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (LMON), OS id=13752, <code class="php plain">pid: 11, proc_ser: 1, sid: 353, sess_ser: 1 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">current sql: <none> <code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">0: waiting <code class="php keyword">for <code class="php string">'control file sequential read' <code class="php spaces"><code class="php plain">file#=0x0, block#=0x23, blocks=0x1 <code class="php spaces"><code class="php plain">wait_id=272969233 seq_num=24337 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=7 min 42 sec, exc=7 min 42 sec, total=7 min 42 sec ---已经等待了7分42秒 <code class="php spaces"><code class="php plain">wait times: max=infinite, heur=7 min 42 sec <code class="php spaces"><code class="php plain">wait counts: calls=0 os=0 <code class="php spaces"><code class="php plain">in_wait=1 iflags=0x5a0 <code class="php plain">There are 1 sessions blocked by this session. <code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 2, sid: 1092, ser: 49369 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'name-service call wait' <code class="php spaces"><code class="php plain">p1: <code class="php string">'waittime'<code class="php plain">=0x32 <code class="php spaces"><code class="php plain">p2: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">p3: <code class="php string">''<code class="php plain">=0x0 <code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <code class="php spaces"><code class="php plain">min_blocked_time: 0 secs, waiter_cache_ver: 6248 <code class="php plain">Wait State: <code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x22 boundary=0x0/-1 <code class="php plain">Session Wait History:

从lmon的trace信息我们可以看出,该进程正在等待control file sequential read,且已经等待了7分42秒。

根据trace的时间点,我们可以向前推进7分42秒,换句话讲,从17:19:18秒就开始等待了。

既然是controlfile的等待,那么我们就有必要来看下Node2节点上的ckpt进程在干什么了? 如下是ckpt进程的信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Redo thread mounted by this instance: 2 <code class="php plain">Oracle process number: 26 <code class="php plain">Unix process pid: 13788, image: oracle@xhdb-server4 (CKPT) <code class="php plain">*** 2014-04-22 17:26:59.882 <code class="php plain">*** SESSION ID:(833.1) 2014-04-22 17:26:59.882 <code class="php plain">*** 2014-04-22 17:26:59.882 <code class="php plain">Process diagnostic dump <code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (CKPT), OS id=13788, <code class="php plain">pid: 26, proc_ser: 1, sid: 833, sess_ser: 1 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">current sql: <none> <code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">0: waiting <code class="php keyword">for <code class="php string">'control file sequential read' <code class="php spaces"><code class="php plain">file#=0x0, block#=0x1, blocks=0x1 <code class="php spaces"><code class="php plain">wait_id=14858985 seq_num=48092 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=7 min 40 sec, exc=7 min 40 sec, total=7 min 40 sec ----等待了7分40秒 <code class="php spaces"><code class="php plain">wait times: max=infinite, heur=7 min 40 sec <code class="php spaces"><code class="php plain">wait counts: calls=0 os=0 <code class="php spaces"><code class="php plain">in_wait=1 iflags=0x5a0 <code class="php plain">There are 2 sessions blocked by this session. <code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 2, sid: 291, ser: 59157 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'DFS lock handle' <code class="php spaces"><code class="php plain">p1: <code class="php string">'type|mode'<code class="php plain">=0x43490005 <code class="php spaces"><code class="php plain">p2: <code class="php string">'id1'<code class="php plain">=0xa <code class="php spaces"><code class="php plain">p3: <code class="php string">'id2'<code class="php plain">=0x2 <code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0 <code class="php spaces"><code class="php plain">min_blocked_time: 352 secs, waiter_cache_ver: 6248 <code class="php plain">Wait State: <code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x22 boundary=0x0/-1

我们可以看到,Node2的ckpt进程等待control file sequential read,等待了7分40秒。同时大家还可以看到,ckpt
进程阻塞了2个进程,也就是说ckpt进程有2个waiter,其中一个waiter的信息是:sid:291,ser:59157
且该waiter进程的等待事件居然是DFS lock handle,这是一个比较危险的event。 这里我们还无法确认这个waiter是什么?
同时ckpt进程为啥等待这么长的时间 ?

大家知道11g引入的hung auto resolution,那么我们就来看下Node1上的diag的信息:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 Unix process pid: 27571, image: oracle@xhdb-server3 (DIA0) <code class="php plain">*** 2014-04-22 17:22:01.536 <code class="php plain">*** SESSION ID:(961.1) 2014-04-22 17:22:01.536 <code class="php plain">*** CLIENT ID:() 2014-04-22 17:22:01.536 <code class="php plain">*** SERVICE NAME:(SYS<code class="php variable">$BACKGROUND<code class="php plain">) 2014-04-22 17:22:01.536 <code class="php plain">*** MODULE NAME:() 2014-04-22 17:22:01.536 <code class="php plain">*** ACTION NAME:() 2014-04-22 17:22:01.536 <code class="php plain">One <code class="php keyword">or <code class="php plain">more possible hangs have been detected on your system. <code class="php plain">These could be genuine hangs in which no further progress will <code class="php plain">be made without intervention, <code class="php keyword">or <code class="php plain">it may be very slow progress <code class="php plain">in the system due to high load. <code class="php plain">Previously detected <code class="php keyword">and <code class="php plain">output hangs are not displayed again. <code class="php plain">Instead, the <code class="php string">'Victim Information' <code class="php plain">section will indicate that <code class="php plain">the victim is from an <code class="php string">'Existing Hang' <code class="php plain">under the <code class="php string">'Previous Hang' <code class="php plain">column. <code class="php string">'Verified Hangs' <code class="php plain">below indicate one <code class="php keyword">or <code class="php plain">more hangs that were found <code class="php keyword">and <code class="php plain"> identify the <code class="php keyword">final <code class="php plain"> blocking session <code class="php keyword">and <code class="php plain"> instance on which <code class="php plain">they occurred. Since the current hang resolution state is <code class="php string">'PROCESS'<code class="php plain">, <code class="php plain">any hangs requiring session <code class="php keyword">or <code class="php plain">process termination will be <code class="php plain">automatically resolved. <code class="php plain">Any hangs with a <code class="php string">'Hang Resolution Action' <code class="php plain"> of <code class="php string">'Unresolvable' <code class="php plain">will be ignored. These types of hangs will either be resolved <code class="php plain">by another layer in the RDBMS <code class="php keyword">or <code class="php plain">cannot be resolved <code class="php keyword">as <code class="php plain">they may <code class="php keyword">require <code class="php plain"> external intervention. <code class="php plain">Deadlocks (also named cycles) are currently NOT resolved even <code class="php keyword">if <code class="php plain">hang resolution is enabled. The <code class="php string">'Hang Type' <code class="php plain">of DLCK in the <code class="php string">'Verified Hangs' <code class="php plain">output identifies these hangs. <code class="php plain">Below that are the complete hang chains from the time the hang <code class="php plain">was detected. <code class="php plain">The following information will assist Oracle Support Services <code class="php plain">in further analysis of the root cause of the hang. <code class="php plain">*** 2014-04-22 17:22:01.537 <code class="php plain">Verified Hangs in the System <code class="php spaces"><code class="php plain">Root Chain Total Hang <code class="php plain">Hang Hang Inst Root #hung #hung Hang Hang Resolution <code class="php plain">ID Type Status Num Sess Sess Sess Conf Span Action <code class="php plain">----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- <code class="php spaces"><code class="php plain">2 HANG VICSELTD 2 833 2 2 HIGH LOCAL IGNRD:InstKillNotA <code class="php plain">Hang Ignored Reason: Since instance termination is not allowed, automatic <code class="php spaces"><code class="php plain">hang resolution cannot be performed. <code class="php plain">inst# SessId Ser# OSPID PrcNm Event <code class="php spaces"><code class="php plain">----- ------ ----- --------- ----- ----- <code class="php spaces"><code class="php plain">2 291 59157 10646 M000 DFS lock handle ----大家注意这里的sid和ser#以及PrcNm <code class="php spaces"><code class="php plain">2 833 1 13788 CKPT control file sequential read

这里提到M000进程,大家应该知道这是跟AWR快照有关系的进程,该进程其实是被CKPT所阻塞,我们也可以来看下该进程
的情况,如下:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** 2014-04-22 17:27:00.778 <code class="php plain">Process diagnostic dump <code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (M000), OS id=10646, <code class="php plain">pid: 57, proc_ser: 143, sid: 291, sess_ser: 59157 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">current sql: <code class="php plain">select tablespace_id, rfno, allocated_space, file_size, file_maxsize, changescn_base, changescn_wrap, flag, <code class="php plain">inst_id from sys.ts$, GV<code class="php variable">$FILESPACE_USAGE <code class="php plain">where ts# = tablespace_id <code class="php keyword">and <code class="php plain">online$ != 3 <code class="php keyword">and <code class="php plain"> (changescn_wrap > PITRSCNWRP <code class="php keyword">or <code class="php plain"> (changescn_wrap = PITRSCNWRP <code class="php keyword">and <code class="php plain"> changescn_base >= PITRSCNBAS)) <code class="php keyword">and <code class="php plain"> inst_id != :inst <code class="php keyword">and <code class="php plain"> (changescn_wrap > :w <code class="php keyword">or <code class="php plain"> (changescn_wrap = :w <code class="php keyword">and <code class="php plain"> changescn_base >= :b)) <code class="php plain">Current Wait Stack: <code class="php spaces"><code class="php plain">0: waiting <code class="php keyword">for <code class="php string">'DFS lock handle' <code class="php spaces"><code class="php plain">type|mode=0x43490005, id1=0xa, id2=0x2 <code class="php spaces"><code class="php plain">wait_id=6 seq_num=7 snap_id=1 <code class="php spaces"><code class="php plain">wait times: snap=6 min 12 sec, exc=6 min 12 sec, total=6 min 12 sec <code class="php spaces"><code class="php plain">wait times: max=infinite, heur=6 min 12 sec <code class="php spaces"><code class="php plain">wait counts: calls=818 os=818 <code class="php spaces"><code class="php plain">in_wait=1 iflags=0x15a2 <code class="php plain">There is at least one session blocking this session. <code class="php spaces"><code class="php plain">Dumping 2 direct blocker(s): <code class="php spaces"><code class="php plain">inst: 2, sid: 833, ser: 1 <code class="php spaces"><code class="php plain">inst: 1, sid: 482, ser: 1 <code class="php spaces"><code class="php plain">Dumping <code class="php keyword">final <code class="php plain">blocker: <code class="php spaces"><code class="php plain">inst: 2, sid: 833, ser: 1 -----最终的blocker是833,也就是Node2节点的CKPT进程。 <code class="php plain">There are 1 sessions blocked by this session. <code class="php plain">Dumping one waiter: <code class="php spaces"><code class="php plain">inst: 1, sid: 581, ser: 36139 <code class="php spaces"><code class="php plain">wait event: <code class="php string">'DFS lock handle' <code class="php spaces"><code class="php plain">p1: <code class="php string">'type|mode'<code class="php plain">=0x43490005 <code class="php spaces"><code class="php plain">p2: <code class="php string">'id1'<code class="php plain">=0xa <code class="php spaces"><code class="php plain">p3: <code class="php string">'id2'<code class="php plain">=0x5

从这里看,root sess却是833,也就是我们Node2节点的CKPT进程。 到这里或许有人会说,问题的原因
应该很明确了,由于Node2 ckpt的异常,到这Node2 节点Lmon进程异常,由于需要和Node1的Lmon进程
进行通信,导致Node1 的lmon进程出现IPc send timeout的情况。

其实不然,到最后至始至终我们都没有完全弄清楚为何CKPT进程会等待这么长时间 ?

到这里我不得不怀疑IO的问题了,再回过头来分析Node1的diag trace时,发现一个有趣的事情:

?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 *** 2014-04-22 17:24:08.363 <code class="php plain">Verified Hangs in the System <code class="php spaces"><code class="php plain">Root Chain Total Hang <code class="php plain">Hang Hang Inst Root #hung #hung Hang Hang Resolution <code class="php plain">ID Type Status Num Sess Sess Sess Conf Span Action <code class="php plain">----- ---- -------- ---- ----- ----- ----- ------ ------ ------------------- <code class="php spaces"><code class="php plain">7 HANG VICSELTD 2 801 2 3 HIGH LOCAL IGNRD:InstKillNotA <code class="php plain">Hang Ignored Reason: Since instance termination is not allowed, automatic <code class="php spaces"><code class="php plain">hang resolution cannot be performed. <code class="php plain">inst# SessId Ser# OSPID PrcNm Event <code class="php spaces"><code class="php plain">----- ------ ----- --------- ----- ----- <code class="php spaces"><code class="php plain">2 996 39917 6598 FG log file sync <code class="php spaces"><code class="php plain">2 801 1 13786 LGWR log file parallel write <code class="php plain">。。。。。省略部分内容 <code class="php plain">------------------------------------------------------------------------------- <code class="php plain">Chain 2: <code class="php plain">------------------------------------------------------------------------------- <code class="php spaces"><code class="php plain">Oracle session identified by: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">instance: 1 (authorcl.authorcl1) <code class="php spaces"><code class="php plain">os id: 22323 <code class="php spaces"><code class="php plain">process id: 64, oracle@xhdb-server3 <code class="php spaces"><code class="php plain">session id: 14 <code class="php spaces"><code class="php plain">session serial #: 29257 <code class="php spaces"><code class="php plain">} <code class="php spaces"><code class="php plain">is waiting <code class="php keyword">for <code class="php string">'log file sync' <code class="php plain">with wait info: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">p1: <code class="php string">'buffer#'<code class="php plain">=0x2ab <code class="php spaces"><code class="php plain">p2: <code class="php string">'sync scn'<code class="php plain">=0x1ed4d8be <code class="php spaces"><code class="php plain">time in wait: 4 min 18 sec <code class="php spaces"><code class="php plain">timeout after: never <code class="php spaces"><code class="php plain">wait id: 13287 <code class="php spaces"><code class="php plain">blocking: 0 sessions <code class="php spaces"><code class="php plain">wait history: <code class="php spaces"><code class="php plain">* time between current wait <code class="php keyword">and <code class="php plain">wait #1: 0.000388 sec <code class="php spaces"><code class="php plain">1. event: <code class="php string">'SQL*Net message from client' <code class="php spaces"><code class="php plain">time waited: 0.000486 sec <code class="php spaces"><code class="php plain">wait id: 13286 p1: <code class="php string">'driver id'<code class="php plain">=0x54435000 <code class="php spaces"><code class="php plain">p2: <code class="php string">'#bytes'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">* time between wait #1 <code class="php keyword">and <code class="php plain">#2: 0.000027 sec <code class="php spaces"><code class="php plain">2. event: <code class="php string">'SQL*Net message to client' <code class="php spaces"><code class="php plain">time waited: 0.000003 sec <code class="php spaces"><code class="php plain">wait id: 13285 p1: <code class="php string">'driver id'<code class="php plain">=0x54435000 <code class="php spaces"><code class="php plain">p2: <code class="php string">'#bytes'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">* time between wait #2 <code class="php keyword">and <code class="php plain">#3: 0.001494 sec <code class="php spaces"><code class="php plain">3. event: <code class="php string">'SQL*Net message from client' <code class="php spaces"><code class="php plain">time waited: 0.002089 sec <code class="php spaces"><code class="php plain">wait id: 13284 p1: <code class="php string">'driver id'<code class="php plain">=0x54435000 <code class="php spaces"><code class="php plain">p2: <code class="php string">'#bytes'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">} <code class="php spaces"><code class="php keyword">and <code class="php plain">is blocked by <code class="php spaces"><code class="php plain">=> Oracle session identified by: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">instance: 1 (authorcl.authorcl1) <code class="php spaces"><code class="php plain">os id: 27634 <code class="php spaces"><code class="php plain">process id: 20, oracle@xhdb-server3 (LGWR) <code class="php spaces"><code class="php plain">session id: 386 <code class="php spaces"><code class="php plain">session serial #: 1 <code class="php spaces"><code class="php plain">} <code class="php spaces"><code class="php plain">which is waiting <code class="php keyword">for <code class="php string">'log file parallel write' <code class="php plain">with wait info: <code class="php spaces"><code class="php plain">{ <code class="php spaces"><code class="php plain">p1: <code class="php string">'files'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">p2: <code class="php string">'blocks'<code class="php plain">=0x2 <code class="php spaces"><code class="php plain">p3: <code class="php string">'requests'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">time in wait: 4 min 32 sec -----等待了4分32秒 <code class="php spaces"><code class="php plain">timeout after: never <code class="php spaces"><code class="php plain">wait id: 51742574 <code class="php spaces"><code class="php plain">blocking: 17 sessions ------阻塞了17个Session <code class="php spaces"><code class="php plain">wait history: <code class="php spaces"><code class="php plain">* time between current wait <code class="php keyword">and <code class="php plain">wait #1: 0.000176 sec <code class="php spaces"><code class="php plain">1. event: <code class="php string">'rdbms ipc message' <code class="php spaces"><code class="php plain">time waited: 0.047194 sec <code class="php spaces"><code class="php plain">wait id: 51742573 p1: <code class="php string">'timeout'<code class="php plain">=0x22 <code class="php spaces"><code class="php plain">* time between wait #1 <code class="php keyword">and <code class="php plain">#2: 0.000399 sec <code class="php spaces"><code class="php plain">2. event: <code class="php string">'log file parallel write' <code class="php spaces"><code class="php plain">time waited: 0.004006 sec <code class="php spaces"><code class="php plain">wait id: 51742572 p1: <code class="php string">'files'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">p2: <code class="php string">'blocks'<code class="php plain">=0x2 <code class="php spaces"><code class="php plain">p3: <code class="php string">'requests'<code class="php plain">=0x1 <code class="php spaces"><code class="php plain">* time between wait #2 <code class="php keyword">and <code class="php plain">#3: 0.000318 sec <code class="php spaces"><code class="php plain">3. event: <code class="php string">'rdbms ipc message' <code class="php spaces"><code class="php plain">time waited: 2.654606 sec <code class="php spaces"><code class="php plain">wait id: 51742571 p1: <code class="php string">'timeout'<code class="php plain">=0x12c <code class="php spaces"><code class="php plain">} <code class="php plain">Chain 2 Signature: <code class="php string">'log file parallel write'<code class="php plain">'log file sync'

我们可以看到,Node1上,lgwr进程阻塞了17个会话,等待log file parallel write,一直持续了4分32秒。

如果将时间点2014-04-22 17:24:08,向前推进4分32秒,那么就是2014-04-22 17:19:36.

我们再来检查Node2的操作系统日志你会发现有意思的事情:

?
1 2 3 4 5 6 7 8 9 10 11 Apr 22 17:19:16 xhdb-server4 vxio: [ID 442312 kern.notice] NOTICE: VxVM VVR vxio V-5-0-209 Log overflow protection on rlink oa_zs_bj triggered DCM protection <code class="php plain">Apr 22 17:19:16 xhdb-server4 vxio: [ID 636438 kern.warning] WARNING: VxVM VVR vxio V-5-0-1905 Replication stopped <code class="php keyword">for <code class="php plain">RVG rvg_oa due to SRL overflow, DCM protection is triggered. To start replication, perform DCM resynchronization using <code class="php string">"vradmin resync" <code class="php plain">command. <code class="php plain">Apr 22 19:41:19 xhdb-server4 su: [ID 810491 auth.crit] <code class="php string">'su root' <code class="php plain">failed <code class="php keyword">for <code class="php plain">oa on /dev/pts/5 <code class="php plain">Apr 22 21:31:38 xhdb-server4 su: [ID 810491 auth.crit] <code class="php string">'su grid' <code class="php plain">failed <code class="php keyword">for <code class="php plain">root on /dev/pts/14 <code class="php plain">Apr 22 21:36:03 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 Thread(4) Agent is calling clean <code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own. <code class="php plain">Apr 22 21:36:03 xhdb-server4 Had[5704]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling clean <code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own. <code class="php plain">Apr 22 21:36:06 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(4) Resource(cssd_oaora) - clean completed successfully. <code class="php plain">Apr 22 21:40:51 xhdb-server4 reboot: [ID 662345 auth.crit] rebooted by root <code class="php plain">Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal received <code class="php keyword">while <code class="php plain">waiting <code class="php keyword">for <code class="php plain">event on reaper <code class="php string">'CFSMount'<code class="php plain">. Returning. <code class="php plain">Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal received <code class="php keyword">while <code class="php plain">waiting <code class="php keyword">for <code class="php plain">event on reaper <code class="php string">'CFSfsckd'<code class="php plain">. Returning. <code class="php plain">Apr 23 00:31:59 xhdb-server4 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-25 64-bit

我们可以看到,在17:19:16秒时vertias的VVR出现了异常。这也就是为什么我们在前面分析看到Node2在17:19:18时出现control file sequential read
等待的原因。虽然从vcs的日志什么信息:

?
1 2 3 4 5 6 2014/04/22 14:39:12 VCS INFO V-16-1-53504 VCS Engine Alive message!! <code class="php plain">2014/04/22 18:39:13 VCS INFO V-16-1-53504 VCS Engine Alive message!! <code class="php plain">2014/04/22 21:36:03 VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling clean <code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own. <code class="php plain">2014/04/22 21:36:06 VCS INFO V-16-2-13068 (xhdb-server4) Resource(cssd_oaora) - clean completed successfully. <code class="php plain">2014/04/22 21:36:09 VCS INFO V-16-1-10307 Resource cssd_oaora (Owner: Unspecified, Group: sg_oaora) is offline on xhdb-server4 (Not initiated by VCS) <code class="php plain">2014/04/22 21:36:09 VCS NOTICE V-16-1-10446 Group sg_oaora is offline on system xhdb-server4

所以,最后我的感觉是根本原因是vcs的问题。虽然vertias的工程师一直解释这里的Log overflow protection没有太大的关系。

针对这个问题,欢迎大家探讨。

补充:关于ora-29770导致实例crash的问题,Oracle确实有不少的bug,但是针对这个情况,目前没有发现符合的。如下是来自Oracle MOS的搜索结果:

Bug 11890804:LMHB crashes instance with ORA-29770 after long “control file sequential read” waits
Bug 8888434: LMHB crashes the instance with LMON waiting on controlfile read
Bug 11890804: LMHB TERMINATE INSTANCE WHEN LMON WAIT CHANGE FROM CF READ AFTER 60 SEC
Bug 13467673: CSS MISSCOUNT AND ALL ASM DOWN WITH ORA-29770 BY LMHB
Bug 13390052: KJFMGCR_HBCHECKALL MESSAGES ARE CONTINUOUSLY LOGGED IN LMHB TRACE FILE.
Bug 13322797: LMHB TERMINATES THE INSTANCE DUE TO ERROR 29770
Bug 13061883: LMHB IS TERMINATING THE INSTANCE DURING SHUTDOWN IMMEDIATE
Bug 12886605: ESSC: LMHB TERMINATE INSTANCE DUE TO 29770 – LMON WAIT ENQ: AM – DISK OFFLINE
Bug 12757321: LMHB TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10296263: LMHB (OSPID: 15872): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10431752: SINGLE NODE RAC: LMHB TERMINATES INSTANCE DUE TO 29770
Bug 11656856: LMHB (OSPID: 27701): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10411143: INSTANCE CRASHES WITH IPC SEND TIMEOUT AND LMHB TERMINATES WITH ORA-29770
Bug 11704041: DATABASE INSTANCE CRASH BY LMHB PROCESS
Bug 10412545: ORA-29770 LMHB TERMINATE INSTANCE DUE TO VARIOUS LONG CSS WAIT
Bug 10147827: INSTANCE TERMINATED BY LMHB WITH ERROR ORA-29770
Bug 10016974: ORA-29770 LMD IS HUNG FOR MORE THAN 70 SECONDS AND LMHB TERMINATE INSTANCE
Bug 9376100: LMHB TERMINATING INSTANCE DUE ERROR 29770

Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template