昨天某个客户的一套双节RAC其中一个节点crash,同时最后导致另外一个节点也hang住,只能shutdown abort. 且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。其中有lgwr,arch等进程. 首先我们来看下,在下午出现crash的节点的alert
昨天某个客户的一套双节RAC其中一个节点crash,同时最后导致另外一个节点也hang住,只能shutdown abort.
且出现shutdown abort实例之后,还有部分进程无法通过kill -9 进行kill的情况。其中有lgwr,arch等进程.
首先我们来看下,在下午出现crash的节点的alert log信息:
?1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
Tue Apr 22 17:16:04 2014
<code class="php plain">Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4878_9l51y1cc_.arc
<code class="php plain">Deleted Oracle managed file /oa1/arch/AUTHORCL/archivelog/2014_03_14/o1_mf_2_4879_9l529hc6_.arc
<code class="php plain">Archived Log entry 10847 added
<code class="php keyword">for <code class="php plain">thread 1 sequence 5314 ID 0xffffffffae21a60f dest 1:
<code class="php plain">Tue Apr 22 17:25:05 2014
<code class="php plain">IPC Send timeout detected. Sender: ospid 27573 [oracle@xhdb-server3 (LMON)]
<code class="php plain">Receiver: inst 2 binc 95439 ospid 13752
<code class="php plain">Communications reconfiguration: instance_number 2
<code class="php plain">Tue Apr 22 17:26:49 2014
<code class="php plain">LMON (ospid: 27573) has not called a wait
<code class="php keyword">for <code class="php plain">89 secs.
<code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_lmhb_27613.trc (incident=14129):
<code class="php plain">ORA-29770:
<code class="php keyword">global <code class="php plain">enqueue process LMON (OSID 27573) is hung
<code class="php keyword">for <code class="php plain">more than 70 seconds
<code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/incident/incdir_14129/authorcl1_lmhb_27613_i14129.trc
<code class="php plain">Tue Apr 22 17:26:58 2014
<code class="php plain">Sweep [inc][14129]: completed
<code class="php plain">Sweep [inc2][14129]: completed
<code class="php plain">ERROR: Some process(s) is not making progress.
<code class="php plain">LMHB (ospid: 27613) is terminating the instance.
<code class="php plain">Please check LMHB trace file
<code class="php keyword">for <code class="php plain">more details.
<code class="php plain">Please also check the CPU load, I/O load
<code class="php keyword">and <code class="php plain">other system properties
<code class="php keyword">for <code class="php plain">anomalous behavior
<code class="php plain">ERROR: Some process(s) is not making progress.
<code class="php plain">Tue Apr 22 17:26:58 2014
<code class="php plain">System state dump requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination].
<code class="php plain">LMHB (ospid: 27613): terminating the instance due to error 29770
<code class="php plain">System State dumped to trace file /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc
<code class="php plain">Tue Apr 22 17:27:00 2014
<code class="php plain">ORA-1092 : opitsk aborting process
<code class="php plain">Tue Apr 22 17:27:01 2014
<code class="php plain">License high water mark = 144
<code class="php plain">Tue Apr 22 17:27:08 2014
<code class="php plain">Termination issued to instance processes. Waiting
<code class="php keyword">for <code class="php plain">the processes to
<code class="php functions">exit
<code class="php plain">Instance termination failed to kill one
<code class="php keyword">or <code class="php plain">more processes
<code class="php plain">Instance terminated by LMHB, pid = 27613
<code class="php plain">Tue Apr 22 17:27:15 2014
<code class="php plain">USER (ospid: 1378): terminating the instance
<code class="php plain">Termination issued to instance processes. Waiting
<code class="php keyword">for <code class="php plain">the processes to
<code class="php functions">exit
<code class="php plain">Tue Apr 22 17:27:25 2014
<code class="php plain">Instance termination failed to kill one
<code class="php keyword">or <code class="php plain">more processes
<code class="php plain">Instance terminated by USER, pid = 1378
<code class="php plain">Tue Apr 22 21:51:56 2014
<code class="php plain">Adjusting the
<code class="php keyword">default <code class="php plain">value of parameter parallel_max_servers
<code class="php plain">from 640 to 135 due to the value of parameter processes (150)
<code class="php plain">Starting ORACLE instance (normal)
|
我们可以看到,最早在Apr 22 17:25:05 2014 时间点,即抛出LMON IPC send timeout的错误了。
Receiver: inst 2 binc 95439 ospid 13752 这里的receiver进程为节点2的13752进程,即节点2的LMON进程。
对于LMON进程,主要是监控RAC的GES信息,当然其作用不仅仅局限于此,其还负责检查集群中各个Node的健康
情况,当有节点出现故障是,负责进行reconfig以及GRD(global resource Directory)的恢复等等。我们知道
RAC的脑裂机制,如果IO fencing是Oracle本身来完成,也就是说由CLusterware来完成。那么Lmon进程检查到
实例级别出现脑裂时,会通知clusterware来进行脑裂操作,然而其并不会等待clusterware的处理结果。当等待
超过一定时间,那么LMON进程会自动触发IMR(instance membership recovery),这实际上也就是我们所说的
Instance membership reconfig。
其次,lmon进程主要通过2种心跳机制来检查判断集群节点的健康状态:
1) 网络心跳 (主要是通过ping进行检测)
2) 控制文件磁盘心跳,其实就是每个节点的ckpt进程每3s更新一次controlfile的机制。
所以这里大家可以看出,Lmon进程是需要操作controlfile的。否则无法判断第2点。
虽然从上面的错误来看,该实例是被LMHB进程给终止掉的,这里我们需要说明一下,LMBH进程的原理。
LMBH进程是Oracle 11R2版本引入的一个进程,该进程的作用的监控LMD,LMON,LCK,LMS等核心进程,防止这些Oracle
关键性后台进程spin或不被阻塞。该进程会定时的将监控的信息打印输出在trace文件中,便于我们进行诊断,
这也是11gR2一个亮点。当LMBH进程发现其他核心进程出现异常时,会尝试发起一些kill动作,如何有进程被阻塞的话。
如果一定时间内仍然无法解决,那么将触发保护,将实例强行终止掉,当然,这是为了保证RAC节点数据的完整性和一致性。
这里比较郁闷的是,这个diag的dump并没有产没 /u01/app/oa_base/diag/rdbms/authorcl/authorcl1/trace/authorcl1_diag_27561.trc
我们首先来看下Node1的Lmon进程的信息:
?1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
*** ACTION NAME:() 2014-04-22 17:26:56.052
<code class="php plain">*** 2014-04-22 17:26:49.401
<code class="php plain">==============================
<code class="php plain">LMON (ospid: 27573) has not moved
<code class="php keyword">for <code class="php plain">105 sec (1398158808.1398158703)
<code class="php plain">kjfmGCR_HBCheckAll: LMON (ospid: 27573) has status 6
<code class="php plain">==================================================
<code class="php plain">=== LMON (ospid: 27573) Heartbeat Report
<code class="php plain">==================================================
<code class="php plain">LMON (ospid: 27573) has no heartbeats
<code class="php keyword">for <code class="php plain">105 sec. (threshold 70 sec)
<code class="php spaces"><code class="php plain">: Not in wait; last wait ended 89 secs ago. -------------等待了89秒
<code class="php spaces"><code class="php plain">: last wait_id 165313538 at
<code class="php string">'enq: CF - contention'<code class="php plain">.
<code class="php plain">==============================
<code class="php plain">Dumping PROCESS LMON (ospid: 27573) States
<code class="php plain">==============================
<code class="php plain">===[ System Load State ]===
<code class="php spaces"><code class="php plain">CPU Total 16 Core 16 Socket 16
<code class="php spaces"><code class="php plain">Load normal: Cur 988 Highmark 20480 (3.85 80.00)
<code class="php plain">===[ Latch State ]===
<code class="php spaces"><code class="php plain">Not in Latch Get
<code class="php plain">===[ Session State Object ]===
<code class="php spaces"><code class="php plain">----------------------------------------
<code class="php spaces"><code class="php plain">SO: 0x52daba340, type: 4, owner: 0x52f5d8330, flag: INIT/-/-/0x00
<code class="php keyword">if<code class="php plain">: 0x3 c: 0x3
<code class="php spaces"><code class="php plain">proc=0x52f5d8330, name=session, file=ksu.h LINE:12624 ID:, pg=0
<code class="php spaces"><code class="php plain">(session) sid: 1057 ser: 1 trans: 0x0, creator: 0x52f5d8330
<code class="php spaces"><code class="php plain">flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
<code class="php spaces"><code class="php plain">flags2: (0x409) -/-/INC
<code class="php spaces"><code class="php plain">DID: , short-term DID:
<code class="php plain">txn branch: 0x0
<code class="php spaces"><code class="php plain">oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
<code class="php spaces"><code class="php plain">ksuxds FALSE at location: 0
<code class="php spaces"><code class="php plain">service name: SYS<code class="php variable">$BACKGROUND
<code class="php spaces"><code class="php plain">Current Wait Stack:
<code class="php spaces"><code class="php plain">Not in wait; last wait ended 1 min 29 sec ago
<code class="php plain">There are 1 sessions blocked by this session.
<code class="php spaces"><code class="php plain">Dumping one waiter:
<code class="php spaces"><code class="php plain">inst: 1, sid: 297, ser: 6347
<code class="php spaces"><code class="php plain">wait event:
<code class="php string">'name-service call wait'
<code class="php spaces"><code class="php plain">p1:
<code class="php string">'waittime'<code class="php plain">=0x32
<code class="php spaces"><code class="php plain">p2:
<code class="php string">''<code class="php plain">=0x0
<code class="php spaces"><code class="php plain">p3:
<code class="php string">''<code class="php plain">=0x0
<code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
<code class="php spaces"><code class="php plain">min_blocked_time: 0 secs, waiter_cache_ver: 30272
<code class="php spaces"><code class="php plain">Wait State:
<code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x20 boundary=0x0/-1
<code class="php spaces"><code class="php plain">Session Wait History:
<code class="php spaces"><code class="php plain">elapsed time of 1 min 29 sec since last wait ---LMON进程等待enq: CF - contention,等待了1分29秒,即89秒
<code class="php spaces"><code class="php plain">0: waited
<code class="php keyword">for <code class="php string">'enq: CF - contention'
<code class="php spaces"><code class="php plain">name|mode=0x43460006, 0=0x0, operation=0x3
<code class="php spaces"><code class="php plain">wait_id=165313538 seq_num=35946 snap_id=1
<code class="php spaces"><code class="php plain">wait times: snap=1.027254 sec, exc=1.027254 sec, total=1.027254 sec
<code class="php spaces"><code class="php plain">wait times: max=1.000000 sec
<code class="php spaces"><code class="php plain">wait counts: calls=1 os=1
<code class="php spaces"><code class="php plain">occurred after 0.000109 sec of elapsed time
<code class="php spaces"><code class="php plain">。。。。。。
|
如下是该进程的资源使用情况:
?1 2 3 4 5 6 |
*** 2014-04-22 17:26:57.229
<code class="php plain">loadavg : 3.94 3.80 3.99
<code class="php plain">swap info: free_mem = 36949.53M rsv = 24548.22M
<code class="php plain">alloc = 23576.62M avail = 45643.61M swap_free = 46615.21M
<code class="php spaces"><code class="php plain">F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
<code class="php spaces"><code class="php plain">0 O oa 27573 1 6 79 20 ? 799187 Jan 23 ? 1589:29 ora_lmon_authorcl1
|
我们可以看到,系统在该时间点load并不高,Memory也很充足。
这里有一个问题,该节点LMON进程hung的原因是什么? 从日志分析来看,是由于无法获得enq: CF – contention。
我们知道ckpt 进程会定时更新操作controlfile,且就需要获得该enqueue。 所有这里我大胆的假设,是由于ckpt持有CF的latch
不释放,导致LMON进程无法获得. 根据这一点我搜mos 发现一个bug,可惜该bug说已经在11.2.0.3中fixed了。
Bug 10276173 LMON hang possible while trying to get access to CKPT progress record
该bug描述说,当在进行reconfig时,lmon会尝试去获得ckpt processes record,会等待enq: CF -contention,会导致hung.
根据文档来看,显然这跟我们的实际情况不符。
下面我们来结合Node2的日志进行综合分析:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 |
Tue Apr 22 17:25:06 2014
<code class="php plain">IPC Send timeout detected. Receiver ospid 13752 [oracle@xhdb-server4 (LMON)]
<code class="php plain">Tue Apr 22 17:26:59 2014
<code class="php plain">Dumping diagnostic data in directory=[cdmp_20140422172658], requested by (instance=1, osid=27613 (LMHB)), summary=[abnormal instance termination].
<code class="php plain">Tue Apr 22 17:29:22 2014
<code class="php plain">WARNING: aiowait timed out 1 times
<code class="php plain">Tue Apr 22 17:29:53 2014
<code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3681):
<code class="php plain">ORA-32701: Possible hangs up to hang ID=3 detected
<code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3681/authorcl2_dia0_13750_i3681.trc
<code class="php plain">Tue Apr 22 17:30:24 2014
<code class="php plain">DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3
<code class="php spaces"><code class="php plain">requested by master DIA0 process on instance 2
<code class="php spaces"><code class="php plain">Hang Resolution Reason: Automatic hang resolution was performed to free a
<code class="php spaces"><code class="php plain">significant number of affected sessions.
<code class="php spaces"><code class="php plain">by terminating session sid: 578 ospid: 16818
<code class="php plain">DIA0 successfully terminated session sid:578 ospid:16818 with status 31.
<code class="php plain">DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3.
<code class="php plain">Tue Apr 22 17:30:33 2014
<code class="php plain">LMS2 (ospid: 13764) has detected no messaging activity from instance 1
<code class="php plain">LMS2 (ospid: 13764) issues an IMR to resolve the situation
<code class="php plain">Please check LMS2 trace file
<code class="php keyword">for <code class="php plain">more detail.
<code class="php plain">Tue Apr 22 17:31:48 2014
<code class="php plain">LMD0 (ospid: 13754) has detected no messaging activity from instance 1
<code class="php plain">LMD0 (ospid: 13754) issues an IMR to resolve the situation
<code class="php plain">Please check LMD0 trace file
<code class="php keyword">for <code class="php plain">more detail.
<code class="php plain">Tue Apr 22 17:32:03 2014
<code class="php plain">Errors in file /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/trace/authorcl2_dia0_13750.trc (incident=3682):
<code class="php plain">ORA-32701: Possible hangs up to hang ID=3 detected
<code class="php plain">Incident details in: /u01/app/oa_base/diag/rdbms/authorcl/authorcl2/incident/incdir_3682/authorcl2_dia0_13750_i3682.trc
<code class="php plain">Tue Apr 22 17:32:16 2014
<code class="php plain">IPC Send timeout detected. Sender: ospid 23666 [oracle@xhdb-server4 (TNS V1-V3)]
<code class="php plain">Receiver: inst 1 binc 380222 ospid 27575
<code class="php plain">IPC Send timeout to 1.0 inc 20
<code class="php keyword">for <code class="php plain">msg type 12 from opid 154
<code class="php plain">IPC Send timeout: Terminating pid 154 osid 23666
<code class="php plain">Tue Apr 22 17:32:20 2014
<code class="php plain">IPC Send timeout detected. Sender: ospid 13746 [oracle@xhdb-server4 (PING)]
<code class="php plain">Receiver: inst 1 binc 380164 ospid 27565
<code class="php plain">Tue Apr 22 17:32:34 2014
<code class="php plain">DIA0 terminating blocker (ospid: 16818 sid: 578 ser#: 39069) of hang with ID = 3
<code class="php spaces"><code class="php plain">requested by master DIA0 process on instance 2
<code class="php spaces"><code class="php plain">Hang Resolution Reason: Automatic hang resolution was performed to free a
<code class="php spaces"><code class="php plain">significant number of affected sessions.
<code class="php spaces"><code class="php plain">by terminating the process
<code class="php plain">DIA0 successfully terminated process ospid:16818.
<code class="php plain">DIA0 successfully resolved a LOCAL, HIGH confidence hang with ID=3.
<code class="php plain">Tue Apr 22 17:32:35 2014
<code class="php plain">LMS1 (ospid: 13760) has detected no messaging activity from instance 1
<code class="php plain">LMS1 (ospid: 13760) issues an IMR to resolve the situation
<code class="php plain">Please check LMS1 trace file
<code class="php keyword">for <code class="php plain">more detail.
<code class="php plain">Tue Apr 22 17:32:44 2014
<code class="php plain">IPC Send timeout detected. Sender: ospid 13754 [oracle@xhdb-server4 (LMD0)]
<code class="php plain">Receiver: inst 1 binc 380222 ospid 27575
<code class="php plain">IPC Send timeout to 1.0 inc 20
<code class="php keyword">for <code class="php plain">msg type 65521 from opid 12
<code class="php plain">Tue Apr 22 17:33:11 2014
<code class="php plain">LMS0 (ospid: 13756) has detected no messaging activity from instance 1
<code class="php plain">LMS0 (ospid: 13756) issues an IMR to resolve the situation
<code class="php plain">Please check LMS0 trace file
<code class="php keyword">for <code class="php plain">more detail.
<code class="php plain">Tue Apr 22 17:34:29 2014
<code class="php plain">IPC Send timeout detected. Sender: ospid 13764 [oracle@xhdb-server4 (LMS2)]
<code class="php plain">Receiver: inst 1 binc 380309 ospid 27585
<code class="php plain">IPC Send timeout to 1.1 inc 20
<code class="php keyword">for <code class="php plain">msg type 65522 from opid 15
<code class="php plain">Tue Apr 22 17:36:31 2014
|
我们可以看到Node2 在Apr 22 17:26:59 2014 节点Node1的LMBH终止instance的信息了。然后在后面抛出hung的信息,
不过Oracle自动解决了hung的session。 下面我们来看下Node2上lmon进程的trace内容:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
*** 2014-04-22 17:26:59.377
<code class="php plain">Process diagnostic dump
<code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (LMON), OS id=13752,
<code class="php plain">pid: 11, proc_ser: 1, sid: 353, sess_ser: 1
<code class="php plain">-------------------------------------------------------------------------------
<code class="php plain">current sql: <none>
<code class="php plain">Current Wait Stack:
<code class="php spaces"><code class="php plain">0: waiting
<code class="php keyword">for <code class="php string">'control file sequential read'
<code class="php spaces"><code class="php plain">file#=0x0, block#=0x23, blocks=0x1
<code class="php spaces"><code class="php plain">wait_id=272969233 seq_num=24337 snap_id=1
<code class="php spaces"><code class="php plain">wait times: snap=7 min 42 sec, exc=7 min 42 sec, total=7 min 42 sec ---已经等待了7分42秒
<code class="php spaces"><code class="php plain">wait times: max=infinite, heur=7 min 42 sec
<code class="php spaces"><code class="php plain">wait counts: calls=0 os=0
<code class="php spaces"><code class="php plain">in_wait=1 iflags=0x5a0
<code class="php plain">There are 1 sessions blocked by this session.
<code class="php plain">Dumping one waiter:
<code class="php spaces"><code class="php plain">inst: 2, sid: 1092, ser: 49369
<code class="php spaces"><code class="php plain">wait event:
<code class="php string">'name-service call wait'
<code class="php spaces"><code class="php plain">p1:
<code class="php string">'waittime'<code class="php plain">=0x32
<code class="php spaces"><code class="php plain">p2:
<code class="php string">''<code class="php plain">=0x0
<code class="php spaces"><code class="php plain">p3:
<code class="php string">''<code class="php plain">=0x0
<code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
<code class="php spaces"><code class="php plain">min_blocked_time: 0 secs, waiter_cache_ver: 6248
<code class="php plain">Wait State:
<code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x22 boundary=0x0/-1
<code class="php plain">Session Wait History:
|
从lmon的trace信息我们可以看出,该进程正在等待control file sequential read,且已经等待了7分42秒。
根据trace的时间点,我们可以向前推进7分42秒,换句话讲,从17:19:18秒就开始等待了。
既然是controlfile的等待,那么我们就有必要来看下Node2节点上的ckpt进程在干什么了? 如下是ckpt进程的信息:
?1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
Redo thread mounted by this instance: 2
<code class="php plain">Oracle process number: 26
<code class="php plain">Unix process pid: 13788, image: oracle@xhdb-server4 (CKPT)
<code class="php plain">*** 2014-04-22 17:26:59.882
<code class="php plain">*** SESSION ID:(833.1) 2014-04-22 17:26:59.882
<code class="php plain">*** 2014-04-22 17:26:59.882
<code class="php plain">Process diagnostic dump
<code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (CKPT), OS id=13788,
<code class="php plain">pid: 26, proc_ser: 1, sid: 833, sess_ser: 1
<code class="php plain">-------------------------------------------------------------------------------
<code class="php plain">current sql: <none>
<code class="php plain">Current Wait Stack:
<code class="php spaces"><code class="php plain">0: waiting
<code class="php keyword">for <code class="php string">'control file sequential read'
<code class="php spaces"><code class="php plain">file#=0x0, block#=0x1, blocks=0x1
<code class="php spaces"><code class="php plain">wait_id=14858985 seq_num=48092 snap_id=1
<code class="php spaces"><code class="php plain">wait times: snap=7 min 40 sec, exc=7 min 40 sec, total=7 min 40 sec ----等待了7分40秒
<code class="php spaces"><code class="php plain">wait times: max=infinite, heur=7 min 40 sec
<code class="php spaces"><code class="php plain">wait counts: calls=0 os=0
<code class="php spaces"><code class="php plain">in_wait=1 iflags=0x5a0
<code class="php plain">There are 2 sessions blocked by this session.
<code class="php plain">Dumping one waiter:
<code class="php spaces"><code class="php plain">inst: 2, sid: 291, ser: 59157
<code class="php spaces"><code class="php plain">wait event:
<code class="php string">'DFS lock handle'
<code class="php spaces"><code class="php plain">p1:
<code class="php string">'type|mode'<code class="php plain">=0x43490005
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'id1'<code class="php plain">=0xa
<code class="php spaces"><code class="php plain">p3:
<code class="php string">'id2'<code class="php plain">=0x2
<code class="php spaces"><code class="php plain">row_wait_obj#: 4294967295, block#: 0, row#: 0, file# 0
<code class="php spaces"><code class="php plain">min_blocked_time: 352 secs, waiter_cache_ver: 6248
<code class="php plain">Wait State:
<code class="php spaces"><code class="php plain">fixed_waits=0 flags=0x22 boundary=0x0/-1
|
我们可以看到,Node2的ckpt进程等待control file sequential read,等待了7分40秒。同时大家还可以看到,ckpt
进程阻塞了2个进程,也就是说ckpt进程有2个waiter,其中一个waiter的信息是:sid:291,ser:59157
且该waiter进程的等待事件居然是DFS lock handle,这是一个比较危险的event。 这里我们还无法确认这个waiter是什么?
同时ckpt进程为啥等待这么长的时间 ?
大家知道11g引入的hung auto resolution,那么我们就来看下Node1上的diag的信息:
?1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 |
Unix process pid: 27571, image: oracle@xhdb-server3 (DIA0)
<code class="php plain">*** 2014-04-22 17:22:01.536
<code class="php plain">*** SESSION ID:(961.1) 2014-04-22 17:22:01.536
<code class="php plain">*** CLIENT ID:() 2014-04-22 17:22:01.536
<code class="php plain">*** SERVICE NAME:(SYS<code class="php variable">$BACKGROUND<code class="php plain">) 2014-04-22 17:22:01.536
<code class="php plain">*** MODULE NAME:() 2014-04-22 17:22:01.536
<code class="php plain">*** ACTION NAME:() 2014-04-22 17:22:01.536
<code class="php plain">One <code class="php keyword">or
<code class="php plain">more possible hangs have been detected on your system.
<code class="php plain">These could be genuine hangs in which no further progress will
<code class="php plain">be made without intervention,
<code class="php keyword">or <code class="php plain">it may be very slow progress
<code class="php plain">in the system due to high load.
<code class="php plain">Previously detected
<code class="php keyword">and <code class="php plain">output hangs are not displayed again.
<code class="php plain">Instead, the
<code class="php string">'Victim Information' <code class="php plain">section will indicate that
<code class="php plain">the victim is from an
<code class="php string">'Existing Hang' <code class="php plain">under the
<code class="php string">'Previous Hang'
<code class="php plain">column.
<code class="php string">'Verified Hangs'
<code class="php plain">below indicate one <code class="php keyword">or
<code class="php plain">more hangs that were found
<code class="php keyword">and <code class="php plain">
identify the <code class="php keyword">final <code class="php plain">
blocking session <code class="php keyword">and <code class="php plain">
instance on which
<code class="php plain">they occurred. Since the current hang resolution state is
<code class="php string">'PROCESS'<code class="php plain">,
<code class="php plain">any hangs requiring session
<code class="php keyword">or <code class="php plain">process termination will be
<code class="php plain">automatically resolved.
<code class="php plain">Any hangs with a
<code class="php string">'Hang Resolution Action' <code class="php plain">
of <code class="php string">'Unresolvable'
<code class="php plain">will be ignored. These types of hangs will either be resolved
<code class="php plain">by another layer in the RDBMS
<code class="php keyword">or <code class="php plain">cannot be resolved
<code class="php keyword">as <code class="php plain">they may
<code class="php keyword">require <code class="php plain">
external intervention.
<code class="php plain">Deadlocks (also named cycles) are currently NOT resolved even
<code class="php keyword">if
<code class="php plain">hang resolution is enabled. The
<code class="php string">'Hang Type' <code class="php plain">of DLCK in the
<code class="php string">'Verified Hangs'
<code class="php plain">output identifies these hangs.
<code class="php plain">Below that are the complete hang chains from the time the hang
<code class="php plain">was detected.
<code class="php plain">The following information will assist Oracle Support Services
<code class="php plain">in further analysis of the root cause of the hang.
<code class="php plain">*** 2014-04-22 17:22:01.537
<code class="php plain">Verified Hangs in the System
<code class="php spaces"><code class="php plain">Root Chain Total Hang
<code class="php plain">Hang Hang Inst Root #hung #hung Hang Hang Resolution
<code class="php plain">ID Type Status Num Sess Sess Sess Conf Span Action
<code class="php plain">----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
<code class="php spaces"><code class="php plain">2 HANG VICSELTD 2 833 2 2 HIGH LOCAL IGNRD:InstKillNotA
<code class="php plain">Hang Ignored Reason: Since instance termination is not allowed, automatic
<code class="php spaces"><code class="php plain">hang resolution cannot be performed.
<code class="php plain">inst# SessId Ser# OSPID PrcNm Event
<code class="php spaces"><code class="php plain">----- ------ ----- --------- ----- -----
<code class="php spaces"><code class="php plain">2 291 59157 10646 M000 DFS lock handle ----大家注意这里的sid和ser#以及PrcNm
<code class="php spaces"><code class="php plain">2 833 1 13788 CKPT control file sequential read
|
这里提到M000进程,大家应该知道这是跟AWR快照有关系的进程,该进程其实是被CKPT所阻塞,我们也可以来看下该进程
的情况,如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
*** 2014-04-22 17:27:00.778
<code class="php plain">Process diagnostic dump
<code class="php keyword">for <code class="php plain">oracle@xhdb-server4 (M000), OS id=10646,
<code class="php plain">pid: 57, proc_ser: 143, sid: 291, sess_ser: 59157
<code class="php plain">-------------------------------------------------------------------------------
<code class="php plain">current sql:
<code class="php plain">select tablespace_id, rfno, allocated_space, file_size, file_maxsize, changescn_base, changescn_wrap, flag,
<code class="php plain">inst_id from sys.ts$, GV<code class="php variable">$FILESPACE_USAGE
<code class="php plain">where ts# = tablespace_id <code class="php keyword">and
<code class="php plain">online$ != 3 <code class="php keyword">and <code class="php plain">
(changescn_wrap > PITRSCNWRP
<code class="php keyword">or <code class="php plain">
(changescn_wrap = PITRSCNWRP <code class="php keyword">and <code class="php plain">
changescn_base >= PITRSCNBAS)) <code class="php keyword">and <code class="php plain">
inst_id != :inst <code class="php keyword">and <code class="php plain">
(changescn_wrap > :w
<code class="php keyword">or <code class="php plain">
(changescn_wrap = :w <code class="php keyword">and <code class="php plain">
changescn_base >= :b))
<code class="php plain">Current Wait Stack:
<code class="php spaces"><code class="php plain">0: waiting
<code class="php keyword">for <code class="php string">'DFS lock handle'
<code class="php spaces"><code class="php plain">type|mode=0x43490005, id1=0xa, id2=0x2
<code class="php spaces"><code class="php plain">wait_id=6 seq_num=7 snap_id=1
<code class="php spaces"><code class="php plain">wait times: snap=6 min 12 sec, exc=6 min 12 sec, total=6 min 12 sec
<code class="php spaces"><code class="php plain">wait times: max=infinite, heur=6 min 12 sec
<code class="php spaces"><code class="php plain">wait counts: calls=818 os=818
<code class="php spaces"><code class="php plain">in_wait=1 iflags=0x15a2
<code class="php plain">There is at least one session blocking this session.
<code class="php spaces"><code class="php plain">Dumping 2 direct blocker(s):
<code class="php spaces"><code class="php plain">inst: 2, sid: 833, ser: 1
<code class="php spaces"><code class="php plain">inst: 1, sid: 482, ser: 1
<code class="php spaces"><code class="php plain">Dumping
<code class="php keyword">final <code class="php plain">blocker:
<code class="php spaces"><code class="php plain">inst: 2, sid: 833, ser: 1 -----最终的blocker是833,也就是Node2节点的CKPT进程。
<code class="php plain">There are 1 sessions blocked by this session.
<code class="php plain">Dumping one waiter:
<code class="php spaces"><code class="php plain">inst: 1, sid: 581, ser: 36139
<code class="php spaces"><code class="php plain">wait event:
<code class="php string">'DFS lock handle'
<code class="php spaces"><code class="php plain">p1:
<code class="php string">'type|mode'<code class="php plain">=0x43490005
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'id1'<code class="php plain">=0xa
<code class="php spaces"><code class="php plain">p3:
<code class="php string">'id2'<code class="php plain">=0x5
|
从这里看,root sess却是833,也就是我们Node2节点的CKPT进程。 到这里或许有人会说,问题的原因
应该很明确了,由于Node2 ckpt的异常,到这Node2 节点Lmon进程异常,由于需要和Node1的Lmon进程
进行通信,导致Node1 的lmon进程出现IPc send timeout的情况。
其实不然,到最后至始至终我们都没有完全弄清楚为何CKPT进程会等待这么长时间 ?
到这里我不得不怀疑IO的问题了,再回过头来分析Node1的diag trace时,发现一个有趣的事情:
?1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 |
*** 2014-04-22 17:24:08.363
<code class="php plain">Verified Hangs in the System
<code class="php spaces"><code class="php plain">Root Chain Total Hang
<code class="php plain">Hang Hang Inst Root #hung #hung Hang Hang Resolution
<code class="php plain">ID Type Status Num Sess Sess Sess Conf Span Action
<code class="php plain">----- ---- -------- ---- ----- ----- ----- ------ ------ -------------------
<code class="php spaces"><code class="php plain">7 HANG VICSELTD 2 801 2 3 HIGH LOCAL IGNRD:InstKillNotA
<code class="php plain">Hang Ignored Reason: Since instance termination is not allowed, automatic
<code class="php spaces"><code class="php plain">hang resolution cannot be performed.
<code class="php plain">inst# SessId Ser# OSPID PrcNm Event
<code class="php spaces"><code class="php plain">----- ------ ----- --------- ----- -----
<code class="php spaces"><code class="php plain">2 996 39917 6598 FG log file sync
<code class="php spaces"><code class="php plain">2 801 1 13786 LGWR log file parallel write
<code class="php plain">。。。。。省略部分内容
<code class="php plain">-------------------------------------------------------------------------------
<code class="php plain">Chain 2:
<code class="php plain">-------------------------------------------------------------------------------
<code class="php spaces"><code class="php plain">Oracle session identified by:
<code class="php spaces"><code class="php plain">{
<code class="php spaces"><code class="php plain">instance: 1 (authorcl.authorcl1)
<code class="php spaces"><code class="php plain">os id: 22323
<code class="php spaces"><code class="php plain">process id: 64, oracle@xhdb-server3
<code class="php spaces"><code class="php plain">session id: 14
<code class="php spaces"><code class="php plain">session serial #: 29257
<code class="php spaces"><code class="php plain">}
<code class="php spaces"><code class="php plain">is waiting
<code class="php keyword">for <code class="php string">'log file sync'
<code class="php plain">with wait info:
<code class="php spaces"><code class="php plain">{
<code class="php spaces"><code class="php plain">p1:
<code class="php string">'buffer#'<code class="php plain">=0x2ab
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'sync scn'<code class="php plain">=0x1ed4d8be
<code class="php spaces"><code class="php plain">time in wait: 4 min 18 sec
<code class="php spaces"><code class="php plain">timeout after: never
<code class="php spaces"><code class="php plain">wait id: 13287
<code class="php spaces"><code class="php plain">blocking: 0 sessions
<code class="php spaces"><code class="php plain">wait history:
<code class="php spaces"><code class="php plain">* time between current wait
<code class="php keyword">and <code class="php plain">wait #1: 0.000388 sec
<code class="php spaces"><code class="php plain">1. event:
<code class="php string">'SQL*Net message from client'
<code class="php spaces"><code class="php plain">time waited: 0.000486 sec
<code class="php spaces"><code class="php plain">wait id: 13286 p1:
<code class="php string">'driver id'<code class="php plain">=0x54435000
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'#bytes'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">* time between wait #1
<code class="php keyword">and <code class="php plain">#2: 0.000027 sec
<code class="php spaces"><code class="php plain">2. event:
<code class="php string">'SQL*Net message to client'
<code class="php spaces"><code class="php plain">time waited: 0.000003 sec
<code class="php spaces"><code class="php plain">wait id: 13285 p1:
<code class="php string">'driver id'<code class="php plain">=0x54435000
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'#bytes'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">* time between wait #2
<code class="php keyword">and <code class="php plain">#3: 0.001494 sec
<code class="php spaces"><code class="php plain">3. event:
<code class="php string">'SQL*Net message from client'
<code class="php spaces"><code class="php plain">time waited: 0.002089 sec
<code class="php spaces"><code class="php plain">wait id: 13284 p1:
<code class="php string">'driver id'<code class="php plain">=0x54435000
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'#bytes'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">}
<code class="php spaces"><code class="php keyword">and
<code class="php plain">is blocked by
<code class="php spaces"><code class="php plain">=> Oracle session identified by:
<code class="php spaces"><code class="php plain">{
<code class="php spaces"><code class="php plain">instance: 1 (authorcl.authorcl1)
<code class="php spaces"><code class="php plain">os id: 27634
<code class="php spaces"><code class="php plain">process id: 20, oracle@xhdb-server3 (LGWR)
<code class="php spaces"><code class="php plain">session id: 386
<code class="php spaces"><code class="php plain">session serial #: 1
<code class="php spaces"><code class="php plain">}
<code class="php spaces"><code class="php plain">which is waiting
<code class="php keyword">for <code class="php string">'log file parallel write'
<code class="php plain">with wait info:
<code class="php spaces"><code class="php plain">{
<code class="php spaces"><code class="php plain">p1:
<code class="php string">'files'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'blocks'<code class="php plain">=0x2
<code class="php spaces"><code class="php plain">p3:
<code class="php string">'requests'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">time in wait: 4 min 32 sec -----等待了4分32秒
<code class="php spaces"><code class="php plain">timeout after: never
<code class="php spaces"><code class="php plain">wait id: 51742574
<code class="php spaces"><code class="php plain">blocking: 17 sessions ------阻塞了17个Session
<code class="php spaces"><code class="php plain">wait history:
<code class="php spaces"><code class="php plain">* time between current wait
<code class="php keyword">and <code class="php plain">wait #1: 0.000176 sec
<code class="php spaces"><code class="php plain">1. event:
<code class="php string">'rdbms ipc message'
<code class="php spaces"><code class="php plain">time waited: 0.047194 sec
<code class="php spaces"><code class="php plain">wait id: 51742573 p1:
<code class="php string">'timeout'<code class="php plain">=0x22
<code class="php spaces"><code class="php plain">* time between wait #1
<code class="php keyword">and <code class="php plain">#2: 0.000399 sec
<code class="php spaces"><code class="php plain">2. event:
<code class="php string">'log file parallel write'
<code class="php spaces"><code class="php plain">time waited: 0.004006 sec
<code class="php spaces"><code class="php plain">wait id: 51742572 p1:
<code class="php string">'files'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">p2:
<code class="php string">'blocks'<code class="php plain">=0x2
<code class="php spaces"><code class="php plain">p3:
<code class="php string">'requests'<code class="php plain">=0x1
<code class="php spaces"><code class="php plain">* time between wait #2
<code class="php keyword">and <code class="php plain">#3: 0.000318 sec
<code class="php spaces"><code class="php plain">3. event:
<code class="php string">'rdbms ipc message'
<code class="php spaces"><code class="php plain">time waited: 2.654606 sec
<code class="php spaces"><code class="php plain">wait id: 51742571 p1:
<code class="php string">'timeout'<code class="php plain">=0x12c
<code class="php spaces"><code class="php plain">}
<code class="php plain">Chain 2 Signature:
<code class="php string">'log file parallel write'<code class="php plain">'log file sync'
|
我们可以看到,Node1上,lgwr进程阻塞了17个会话,等待log file parallel write,一直持续了4分32秒。
如果将时间点2014-04-22 17:24:08,向前推进4分32秒,那么就是2014-04-22 17:19:36.
我们再来检查Node2的操作系统日志你会发现有意思的事情:
?1 2 3 4 5 6 7 8 9 10 11 |
Apr 22 17:19:16 xhdb-server4 vxio: [ID 442312 kern.notice] NOTICE: VxVM VVR vxio V-5-0-209 Log overflow protection on rlink oa_zs_bj triggered DCM protection
<code class="php plain">Apr 22 17:19:16 xhdb-server4 vxio: [ID 636438 kern.warning] WARNING: VxVM VVR vxio V-5-0-1905 Replication stopped
<code class="php keyword">for <code class="php plain">RVG rvg_oa due to SRL overflow, DCM protection is triggered. To start replication, perform DCM resynchronization using
<code class="php string">"vradmin resync" <code class="php plain">command.
<code class="php plain">Apr 22 19:41:19 xhdb-server4 su: [ID 810491 auth.crit]
<code class="php string">'su root' <code class="php plain">failed
<code class="php keyword">for <code class="php plain">oa on /dev/pts/5
<code class="php plain">Apr 22 21:31:38 xhdb-server4 su: [ID 810491 auth.crit]
<code class="php string">'su grid' <code class="php plain">failed
<code class="php keyword">for <code class="php plain">root on /dev/pts/14
<code class="php plain">Apr 22 21:36:03 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 Thread(4) Agent is calling clean
<code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own.
<code class="php plain">Apr 22 21:36:03 xhdb-server4 Had[5704]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling clean
<code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own.
<code class="php plain">Apr 22 21:36:06 xhdb-server4 AgentFramework[5814]: [ID 702911 daemon.notice] VCS ERROR V-16-2-13068 Thread(4) Resource(cssd_oaora) - clean completed successfully.
<code class="php plain">Apr 22 21:40:51 xhdb-server4 reboot: [ID 662345 auth.crit] rebooted by root
<code class="php plain">Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal received
<code class="php keyword">while <code class="php plain">waiting
<code class="php keyword">for <code class="php plain">event on reaper
<code class="php string">'CFSMount'<code class="php plain">. Returning.
<code class="php plain">Apr 22 21:41:14 xhdb-server4 amf: [ID 451996 kern.notice] AMF NOTICE V-292-1-67 Signal received
<code class="php keyword">while <code class="php plain">waiting
<code class="php keyword">for <code class="php plain">event on reaper
<code class="php string">'CFSfsckd'<code class="php plain">. Returning.
<code class="php plain">Apr 23 00:31:59 xhdb-server4 genunix: [ID 540533 kern.notice] ^MSunOS Release 5.10 Version Generic_147440-25 64-bit
|
我们可以看到,在17:19:16秒时vertias的VVR出现了异常。这也就是为什么我们在前面分析看到Node2在17:19:18时出现control file sequential read
等待的原因。虽然从vcs的日志什么信息:
1 2 3 4 5 6 |
2014/04/22 14:39:12 VCS INFO V-16-1-53504 VCS Engine Alive message!!
<code class="php plain">2014/04/22 18:39:13 VCS INFO V-16-1-53504 VCS Engine Alive message!!
<code class="php plain">2014/04/22 21:36:03 VCS ERROR V-16-2-13067 (xhdb-server4) Agent is calling clean
<code class="php keyword">for <code class="php plain">resource(cssd_oaora) because the resource became OFFLINE unexpectedly, on its own.
<code class="php plain">2014/04/22 21:36:06 VCS INFO V-16-2-13068 (xhdb-server4) Resource(cssd_oaora) - clean completed successfully.
<code class="php plain">2014/04/22 21:36:09 VCS INFO V-16-1-10307 Resource cssd_oaora (Owner: Unspecified, Group: sg_oaora) is offline on xhdb-server4 (Not initiated by VCS)
<code class="php plain">2014/04/22 21:36:09 VCS NOTICE V-16-1-10446 Group sg_oaora is offline on system xhdb-server4
|
所以,最后我的感觉是根本原因是vcs的问题。虽然vertias的工程师一直解释这里的Log overflow protection没有太大的关系。
针对这个问题,欢迎大家探讨。
补充:关于ora-29770导致实例crash的问题,Oracle确实有不少的bug,但是针对这个情况,目前没有发现符合的。如下是来自Oracle MOS的搜索结果:
Bug 11890804:LMHB crashes instance with ORA-29770 after long “control file sequential read” waits
Bug 8888434: LMHB crashes the instance with LMON waiting on controlfile read
Bug 11890804: LMHB TERMINATE INSTANCE WHEN LMON WAIT CHANGE FROM CF READ AFTER 60 SEC
Bug 13467673: CSS MISSCOUNT AND ALL ASM DOWN WITH ORA-29770 BY LMHB
Bug 13390052: KJFMGCR_HBCHECKALL MESSAGES ARE CONTINUOUSLY LOGGED IN LMHB TRACE FILE.
Bug 13322797: LMHB TERMINATES THE INSTANCE DUE TO ERROR 29770
Bug 13061883: LMHB IS TERMINATING THE INSTANCE DURING SHUTDOWN IMMEDIATE
Bug 12886605: ESSC: LMHB TERMINATE INSTANCE DUE TO 29770 – LMON WAIT ENQ: AM – DISK OFFLINE
Bug 12757321: LMHB TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10296263: LMHB (OSPID: 15872): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10431752: SINGLE NODE RAC: LMHB TERMINATES INSTANCE DUE TO 29770
Bug 11656856: LMHB (OSPID: 27701): TERMINATING THE INSTANCE DUE TO ERROR 29770
Bug 10411143: INSTANCE CRASHES WITH IPC SEND TIMEOUT AND LMHB TERMINATES WITH ORA-29770
Bug 11704041: DATABASE INSTANCE CRASH BY LMHB PROCESS
Bug 10412545: ORA-29770 LMHB TERMINATE INSTANCE DUE TO VARIOUS LONG CSS WAIT
Bug 10147827: INSTANCE TERMINATED BY LMHB WITH ERROR ORA-29770
Bug 10016974: ORA-29770 LMD IS HUNG FOR MORE THAN 70 SECONDS AND LMHB TERMINATE INSTANCE
Bug 9376100: LMHB TERMINATING INSTANCE DUE ERROR 29770