Home > Database > Mysql Tutorial > NDMCDB数据库hang住故障分析-cursor:pinSwaitonX

NDMCDB数据库hang住故障分析-cursor:pinSwaitonX

WBOY
Release: 2016-06-07 16:01:01
Original
1748 people have browsed it

问题描述: 上午刚刚到办公室,就有监控人员邮件反馈,昨晚NDMCDB407数据库被重启过,让我分析一下数据库重启的原因。由于昨晚业务有版本上线,所以短信警告关闭了,所以没有短信下发到我手机上,而且故障时相关人员也没有通知到我。 1 检查alert日志 从aler

问题描述:

上午刚刚到办公室,就有监控人员邮件反馈,昨晚NDMCDB407数据库被重启过,让我分析一下数据库重启的原因。由于昨晚业务有版本上线,所以短信警告关闭了,所以没有短信下发到我手机上,而且故障时相关人员也没有通知到我。

1 检查alert日志

从alert日志中,可以看到,先是在03:29时有一个job运行失败了:
Fri Aug 22 03:29:29 2014
Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
ORA-12012: error on auto execute of job 31
ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
ORA-06512: at "NDMC.PROC_NDMC_CANCEL_OPEN",line 5
ORA-06512: at line 1
然后在03:49时,出现了连接超时失败,而且一直持续到05:00:08:
Fri Aug 22 03:49:43 2014
***********************************************************************
 
Fatal NI connect error 12170.
 
 VERSION INFORMATION:
       TNS for Linux: Version 11.1.0.7.0 - Production
       Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 -Production
       TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
 Time: 22-AUG-2014 03:49:43
 Tracing not turned on.
  Tnserror struct:
   ns main err code: 12535
   
TNS-12535: TNS:operation timed out
   ns secondary err code: 12606
   nt main err code: 0
   nt secondary err code: 0
   nt OS err code: 0
 Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.130.87)(PORT=36628))
WARNING: inbound connection timed out(ORA-3136)
Fri Aug 22 03:49:44 2014
……
而且出现了连接数耗尽了:
Fri Aug 22 03:49:50 2014
ORA-00020: maximum number of processes 0exceeded
   ns secondary err code: 12560
   ns secondary err code: 12560
   ns main err code: 12537
Fri Aug 22 03:49:50 2014
……
Fri Aug 22 03:51:48 2014
 
***********************************************************************
 
Fatal NI connect error 12537, connectingto:
 (LOCAL=NO)
 
 VERSION INFORMATION:
       TNS for Linux: Version 11.1.0.7.0 - Production
       Oracle Bequeath NT Protocol Adapter for Linux: Version 11.1.0.7.0 -Production
       TCP/IP NT Protocol Adapter for Linux: Version 11.1.0.7.0 - Production
 Time: 22-AUG-2014 03:51:48
 Tracing not turned on.
  Tnserror struct:
   ns main err code: 12537
   
TNS-12537: TNS:connection closed
ns secondaryerr code: 12560
   nt main err code: 0
   nt secondary err code: 0
   nt OS err code: 0
ORA-609 : opiodr aborting process unknownospid (30476_47044991385184)
Fri Aug 22 04:14:15 2014
ORA-28 : opiodr aborting process unknownospid (24925_46986315964000)
Fri Aug 22 04:16:27 2014
ORA-28 : opiodr aborting process unknownospid (22475_47013891882592)
Fri Aug 22 04:16:28 2014
ORA-28 : opiodr aborting process unknownospid (21356_47116835528288)
Fri Aug 22 04:16:29 2014
ORA-28 : opiodr aborting process unknownospid (24947_47774766210656)
ORA-28 : opiodr aborting process unknownospid (14958_47053435166304)
……
Fri Aug 22 05:00:05 2014
ORA-28 : opiodr aborting process unknownospid (25765_46941307182688)
Fri Aug 22 05:00:08 2014
ORA-28 : opiodr aborting process unknownospid (4949_47396524895840)
于是在05:04数据库被关闭,从日志来看,这是正常关闭的,初步怀疑是人为关闭或是VCS双机自动将数据库关闭了:
Fri Aug 22 05:04:10 2014
Stopping background process SMCO
Stopping background process FBDA
Shutting down instance: further logonsdisabled
Fri Aug 22 05:04:12 2014
Stopping background process CJQ0
Stopping background process QMNC
Stopping background process MMNL
Stopping background process MMON
Shutting down instance (immediate)
License high water mark = 1220
Stopping Job queue slave processes, flags =7
Fri Aug 22 05:04:20 2014
Waiting for Job queue slaves to complete
Job queue slave processes stopped
Fri Aug 22 05:09:11 2014
License high water mark = 1220
USER (ospid: 25110): terminating theinstance
Termination issued to instance processes.Waiting for the processes to exit
Fri Aug 22 05:09:21 2014
Instance termination failed to kill one ormore processes
Instance terminated by USER, pid = 25110
Copy after login

2 检查messages日志

大概在05:03:51时,人为的想将双机切换到备机中:

Aug 22 05:03:51 NDMCDB11 user_cmd:2014-08-22 05:03:51 hagrp -switch RCS_DB_SG -to system by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]
Aug 22 05:04:01 NDMCDB11/usr/sbin/cron[15348]: (root) CMD (su - root -c'/opt/watchdog/watchdog_schedule -n OS,oracle' >/dev/null 2>&1)
Aug 22 05:04:01 NDMCDB11 su: (to root) rooton none
Aug 22 05:04:03 NDMCDB11 su: (to oracle)root on none
Aug 22 05:04:09 NDMCDB11 user_cmd:2014-08-22 05:04:09 hagrp -switch RCS_DB_SG -to NDMCDB12 by root from [oraclepts/9 Aug 22 04:29 (192.168.128.142)]
Aug 22 05:04:09 NDMCDB11 su: (to oracle)root on none 
Copy after login

但双机切换失败,最后是直接将双机停止,重启VCS:

Aug 22 05:06:18 NDMCDB11 user_cmd:2014-08-22 05:06:18 hastop -all by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]
……
Aug 22 05:07:02 NDMCDB11 user_cmd:2014-08-22 05:07:02 hastat by root from [oracle pts/9 Aug 22 04:29(192.168.128.142)]
Copy after login

所以,到这里就已经确定,数据库这所以重启了,完全是由于人为将VCS集群重启引起的。那么为什么要VCS群集重启呢?数据库到底有没有问题呢?再来看看。

最后,经向升级人员操作确认,在升级时,有一个存储过程需要跑,但执行后,数据库基本响应就非常慢了,一直运行到3:29左右,人为cancel掉了,所以这也就是为什么会出现这样的报错了:

Fri Aug 22 03:29:29 2014
Errors in file/opt/oracle/diag/rdbms/ndmcdb/NDMCDB/trace/NDMCDB_j000_28856.trc:
ORA-12012: error on auto execute of job 31
ORA-04023: ObjectNDMC.DELETE_ANONY_RSHARE_INFO could not be validated or authorized
ORA-06512: at"NDMC.PROC_NDMC_CANCEL_OPEN", line 5
ORA-06512: at line 1 
Copy after login

3 查看系统负载

CPU负载:

\

内存负载:

\

可见,系统在3:49左右,出现了CPU及内存均被耗尽的情况,这个时间段,刚好数据库出现了大量连接超时失败,甚至是出现了连接数超过阀值:

Fri Aug 22 03:49:50 2014
ORA-00020: maximum number of processes 0exceeded
   ns secondary err code: 12560
   ns secondary err code: 12560
   ns main err code: 12537
Fri Aug 22 03:49:50 2014
Copy after login

4 分析AWR

\

从这里看,数据库在2点到3点时,已经非常的繁忙,但从之前有系统负载来看,2点到3点时,CPU及内存使用率都不算很高的。接着看:

\

指标都没有什么特别高的。

\

从top 5 event中,看到了有大量的cursor: pin S wait on X等待,可见出现mutex争用,但通常这只是表象而已,并非根因。

\

绝大部分时间都在做SQL的解析,而且解析还失败了,这就是数据库hang住的根因。正常来说,一个数据库的绝大部分时间应该是用于SQL的执行,所以这个是占用最多时间的:sql execute elapsedtime等。

\

不存在较高的versioncount。

那么数据库什么时候出现的不停解析SQL,并且解析失败了呢?

查了DBA_HIST_ACTIVE_SESS_HISTORY,分析了下历史会话信息,发现在02:57:00至03:00:00出现的问题:

\

经过确认,恰巧就是执行存储过程的时间点左右。

至此,数据库从3:00开始,已经是不正常的,数据库不停的在解析SQL,SQL都还没有到执行这一步,数据库已经处于无响应的状态,连接会话都被阻塞住了,直到连接数达到了最大连接数,最后被升级操作人员重启了VCS集群。

5 分析结论

(1)数据库down机主要还是人为进行了VCS切换失败后,进行了VCS重启操作引起。

(2)这套数据库故障的根因,还是为什么数据库在2:58左右时出现解析SQL失败上。从目前的日志分析来看,看不出是什么原因。

-- Bosco ---- END ----
Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template