Failover handling after the db0101 master went down in one of 48 production node groups (MySQL)

WBOY
Release: 2016-06-01 13:30:12

 

I got a call while on the road: db0101 was down. The symptoms:

(1) Application pages returned 500, 503, and 504 errors.

(2) An email alert: db0201 is down now!

 

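1 Initial diagnosis of the symptoms

I immediately pinged 20.222.21.173 and got an "unreachable" error. I called the system administrator and the hardware engineer right away and asked them to log in to the physical host and find out what had failed.

Meanwhile, I went to the MMM control server to check the status: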

[nova@db0203 ~]$ sudo -u mmmd mmm_control show

# Warning: agent on host db1 is not reachable

  db1(20.222.21.173) master/HARD_OFFLINE. Roles: reader(20.222.22.57), writer(20.222.22.56)

  db2(20.222.22.145) master/ONLINE. Roles: reader(20.222.22.58)

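db1 is in state master/HARD_OFFLINE, which MMM reports when its ping and/or mysql checks against the host fail. My guess at this point was a hardware or storage-media fault.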
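When many node groups are involved, reading the `show` output by eye is slow. The following is a minimal sketch, assuming the output format shown in the transcript above; `mmm_not_online` is a hypothetical helper name and not part of MMM. It reads the `show` output on stdin and prints every node whose state is not ONLINE.

```shell
#!/bin/sh
# Sketch: list nodes that `mmm_control show` does not report as ONLINE.
# mmm_not_online is a hypothetical helper, not an MMM command.
# Input: `mmm_control show` output on stdin. Output: "<host> <state>" per node.
mmm_not_online() {
  awk '
    # Node lines start with name(ip); the second field is mode/STATE.
    $1 ~ /^[A-Za-z0-9_-]+\(/ && index($2, "/") > 0 {
      host = $1
      sub(/\(.*/, "", host)      # db1(20.222.21.173) -> db1
      split($2, parts, "/")      # master/HARD_OFFLINE. -> HARD_OFFLINE.
      state = parts[2]
      sub(/\..*/, "", state)     # drop the trailing period and anything after it
      if (state != "ONLINE") print host, state
    }
  '
}
```

It could be used as `sudo -u mmmd mmm_control show | mmm_not_online`; empty output would mean every node is ONLINE.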

 

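2 Emergency failover to restore the application

The application pages were returning errors and db0201 was already down, so a failover was needed immediately to move traffic to db0202 as quickly as possible. I switched manually: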

[nova@db0203 ~]$  sudo -u mmmd /usr/sbin/mmm_control move_role writer db2

OK: Role 'writer' has been moved from 'db1' to 'db2'. Now you can wait some time and check new roles info!

[nova@db0203 ~]$ sudo -u mmmd mmm_control show

# Warning: agent on host db1 is not reachable

  db1(20.222.21.173) master/HARD_OFFLINE. Roles: reader(20.222.22.57)

  db2(20.222.22.145) master/ONLINE. Roles: reader(20.222.22.58), writer(20.222.22.56)

 

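It worked: the writer role now points to db0202, and the pages stopped returning errors. Logging in to db0202 and running show full processlist; showed more than 500 client connections, which confirms that the application had switched over to db0202.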
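The role check described here can also be scripted. This is a small sketch under the same output-format assumption as the transcripts above; `mmm_writer_host` is a hypothetical helper name. It prints the name of the node that currently holds the writer role.

```shell
#!/bin/sh
# Sketch: print the node that holds the writer role.
# mmm_writer_host is a hypothetical helper, not an MMM command.
# Input: `mmm_control show` output on stdin.
mmm_writer_host() {
  awk '
    # Only node lines (name(ip) ...) that list a writer(...) role.
    $1 ~ /^[A-Za-z0-9_-]+\(/ && /writer\(/ {
      host = $1
      sub(/\(.*/, "", host)      # db2(20.222.22.145) -> db2
      print host
    }
  '
}
```

For example, `sudo -u mmmd mmm_control show | mmm_writer_host` should print db2 after the switch above. The connection count could be checked in a similar spirit with `mysql -N -e "SELECT COUNT(*) FROM information_schema.processlist"` on db0202, assuming a client login with sufficient privileges.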

 

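3 Open questions about the failover

What should be done before a failover? Should you wait, or just run it right away? This was a live production operation with no precedent to draw on, so I ran the failover directly.

Time executed: 18:45

Command executed: sudo -u mmmd /usr/sbin/mmm_control move_role writer db2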
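On the question of what to check before a failover, one common precaution (a suggestion, not something that was done in this incident) is to confirm that the surviving master has applied everything it had already received from the failed one. The sketch below compares the read and executed binlog coordinates in `SHOW SLAVE STATUS\G` output; `relay_log_applied` is a hypothetical helper name, and whether MMM itself performs an equivalent check is not covered here.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) only if the replica has executed every event it has
# read from its master, according to `SHOW SLAVE STATUS\G` output on stdin.
# relay_log_applied is a hypothetical helper name.
relay_log_applied() {
  awk '
    # Each status line looks like "  Key: value"; keep the first word of the value.
    { key = $1; sub(/:$/, "", key); val[key] = $2 }
    END {
      ok = (val["Master_Log_File"] != "" &&
            val["Master_Log_File"] == val["Relay_Master_Log_File"] &&
            val["Read_Master_Log_Pos"] == val["Exec_Master_Log_Pos"])
      exit !ok
    }
  '
}
```

A possible use on db0202 before moving the writer role, assuming a working client login: `mysql -e 'SHOW SLAVE STATUS\G' | relay_log_applied && echo "relay log fully applied"`.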

 

 

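About an hour later, the SA and the hardware engineer finished checking the physical host. It had run out of memory, so the MySQL virtual machine, which was using the most memory, was killed by default. They adjusted some parameter settings and added safeguards (I do not fully understand the details).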

 

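4 Set db1 online

Once the db0201 server was back up, replication had to be started manually by running start slave; after that, replication began syncing data normally. Then I checked the MMM status again: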

[nova@db0203 ~]$ sudo -u mmmd mmm_control show

  db1(20.222.21.173) master/AWAITING_RECOVERY. Roles: reader(20.222.22.57)

  db2(20.222.22.145) master/ONLINE. Roles: reader(20.222.22.58), writer(20.222.22.56)

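Do not panic at AWAITING_RECOVERY. Because db1 came back after a failure, mmm_control can see it again but will not set it online by itself. We have to judge whether db1 is healthy and, if it is, set it online ourselves. This is one of the places where MMM is deliberately cautious. So after checking db1 and confirming that its replication was running normally, I set db1 online.

Command executed: sudo -u mmmd mmm_control set_online db1

The status then showed db1(20.222.21.173) master/ONLINE. Roles: reader(20.222.22.57), so db1 was online again.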
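The "is replication on db1 normal?" judgment before set_online can be made less subjective with a small check. This is a sketch only: `replication_healthy` is a hypothetical helper name, and the lag threshold is an assumption to tune. It reads `SHOW SLAVE STATUS\G` output and succeeds when both replication threads are running and the reported lag is within the threshold.

```shell
#!/bin/sh
# Sketch: succeed (exit 0) only if both replication threads run and the lag is
# at most MAX_LAG seconds (default 0), based on `SHOW SLAVE STATUS\G` on stdin.
# replication_healthy is a hypothetical helper name.
# usage: replication_healthy [MAX_LAG]
replication_healthy() {
  awk -v max_lag="${1:-0}" '
    # Each status line looks like "  Key: value"; keep the first word of the value.
    { key = $1; sub(/:$/, "", key); val[key] = $2 }
    END {
      lag = val["Seconds_Behind_Master"]
      ok = (val["Slave_IO_Running"] == "Yes" &&
            val["Slave_SQL_Running"] == "Yes" &&
            lag ~ /^[0-9]+$/ && lag + 0 <= max_lag + 0)   # NULL lag fails here
      exit !ok
    }
  '
}
```

One way to combine it with the command above, assuming a working client login on db1: `mysql -e 'SHOW SLAVE STATUS\G' | replication_healthy 5 && sudo -u mmmd mmm_control set_online db1`.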

 

5 Change writer from db2 to db1

After that, I monitored db1 and db2 running as dual masters for about 20 minutes and then switched the writer back, since db1 is on SSD while db2 is on ordinary disks.
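The observation period can be turned into a repeatable check. Below is a generic sketch; `stable_for` is a hypothetical helper name, and the count and interval are assumptions to adjust. It runs any check command a fixed number of times and succeeds only if every run succeeds.

```shell
#!/bin/sh
# Sketch: run a check command repeatedly and stop at the first failure.
# stable_for is a hypothetical helper name.
# usage: stable_for COUNT INTERVAL_SECONDS COMMAND [ARGS...]
stable_for() {
  sf_count=$1
  sf_interval=$2
  shift 2
  sf_i=0
  while [ "$sf_i" -lt "$sf_count" ]; do
    "$@" || return 1             # any failed check ends the watch immediately
    sf_i=$((sf_i + 1))
    # Sleep between checks, but not after the last one.
    if [ "$sf_i" -lt "$sf_count" ]; then
      sleep "$sf_interval"
    fi
  done
  return 0
}
```

For instance, `stable_for 40 30 some_check` would run a check command of your choice (`some_check` is a placeholder) 40 times at 30-second intervals, which is roughly 20 minutes, before the writer role is moved back.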

[nova@db0203 ~]$ date

Thu Sep  5 12:11:02 GMT 2013

[nova@db0203 ~]$ sudo -u mmmd /usr/sbin/mmm_control move_role writer db1

OK: Role 'writer' has been moved from 'db2' to 'db1'. Now you can wait some time and check new roles info!

[nova@db0203 ~]$ sudo -u mmmd mmm_control show

  db1(20.222.21.173) master/ONLINE. Roles: reader(20.222.22.57), writer(20.222.22.56)

  db2(20.222.22.145) master/ONLINE. Roles: reader(20.222.22.58)

db1 is now the writer again.

 
