Contents
  1. 1. 测试背景
  2. 2. 一、MHA安装
  3. 3. 二、MHA配置文件
  4. 4. 三、前提要求(必须满足)
    1. 4.1. 3.1 SSH公钥认证
    2. 4.2. 3.2 操作系统
    3. 4.3. 3.3 单写master和多slave或者只读master
    4. 4.4. 3.4 三层或者多层复制环境
    5. 4.5. 3.5 MySQL版本必须是5.0 或者高于 5.0
    6. 4.6. 3.6 使用mysqlbinlog 5.1+ 支持MySQL5.1+
    7. 4.7. 3.7 log-bin必须在候选master上开启
    8. 4.8. 3.8 binlog,relay-log 主从环境必须全部一致
    9. 4.9. 3.9 复制用户必须在候选master上要存在
    10. 4.10. 3.10 使用purge_relay_logs来定期删除relay logs
    11. 4.11. 3.11 不要在SBR的环境中使用load data infile
  5. 5. 四、实战演练
    1. 5.1. 4.0 简单的failover过程
  6. 6. 4.2 测试场景一、 手工操作,手动完成,failover
  7. 7. 4.3 [功能测试] 测试场景二、手工操作, 自动完成,failover
  8. 8. 4.4 [功能测试] 测试场景三、手工操作,自动完成,online master switch
  9. 9. 4.5 测试场景四、自动监控,自动操作,自动完成,failover
  10. 10. 4.6 [用例测试] MySQL master too many connection,无权限,响应慢
  11. 11. 4.7 [用例测试] MySQL master 服务器 down掉,且候选master落后的最多, 是否自动failover,知否可以成功的做日志补偿
  12. 12. 4.8 [用例测试] Master 没挂,但是MySQL slave 服务 down掉,是否自动failover
  13. 13. 4.9 [用例测试] Master 挂了,但是MySQL slave 有问题,是否自动failover
  14. 14. 4.10 [用例测试] MySQL master 有大事务超过100s再执行,是否online master switch
  15. 15. 4.11 [用例测试] MySQL master 网络断掉
  16. 16. 4.12 [用例测试] MySQL master 网路瞬断(1~30秒),是否自动failover
  17. 17. 4.14 [用例测试] 多段网络测试0 (没有二次检测的机制)
  18. 18. 4.14 [用例测试] 多段网络测试1 (有二次检测的机制)
  19. 19. 4.14 [用例测试] 多段网络测试2 (有二次检测的机制)
  20. 20. 4.14 [用例测试] 多段网络测试3 (有二次检测的机制)
  21. 21. 4.14 [用例测试] 多段网络测试4 (没有二次检测的机制)
  22. 22. 4.14 [用例测试] 多段网络测试5 (有二次检测的机制)
  23. 23. 4.14 [用例测试] 多段网络测试6 (没有二次检测的机制)
  24. 24. 4.14 [用例测试] 多段网络测试7 (有二次检测的机制)
  25. 25. 4.15 [用例测试] Non-GTID模式下,需要relay-log吗?是否能够成功的补齐日志
  26. 26. 4.16 [用例测试] GTID模式下,需要relay-log吗?是否能够成功的补齐日志
  27. 27. 4.18 [用例测试] 在一开始没有开启MHA的group中,然后master挂了,如何做到日志补偿,然后change master呢?

测试背景

以下所有测试,全部基于以下复制结构完成

1
2
3
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)

一、MHA安装

mha node 所有服务器节点都要安装
mha manager 只需要安装在manager节点

  • MHA NODE
1
2
1. yum install perl-DBD-MySQL
2. rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm
  • MHA Manager
1
2
3
4
5
6
1. yum install perl-DBD-MySQL
2. yum install perl-Config-Tiny
3. yum install perl-Log-Dispatch
4. yum install perl-Parallel-ForkManager
5. yum install perl-Time-HiRes
6. rpm -ivh mha4mysql-manager-0.56-0.el6.noarch.rpm
  • MHA rpm安装路径
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
* Node
Node> rpm -qpl mha4mysql-node-0.56-0.el6.noarch.rpm
/usr/bin/apply_diff_relay_logs
/usr/bin/filter_mysqlbinlog
/usr/bin/purge_relay_logs
/usr/bin/save_binary_logs
/usr/share/man/man1/apply_diff_relay_logs.1.gz
/usr/share/man/man1/filter_mysqlbinlog.1.gz
/usr/share/man/man1/purge_relay_logs.1.gz
/usr/share/man/man1/save_binary_logs.1.gz
/usr/share/perl5/vendor_perl/MHA/BinlogHeaderParser.pm
/usr/share/perl5/vendor_perl/MHA/BinlogManager.pm
/usr/share/perl5/vendor_perl/MHA/BinlogPosFindManager.pm
/usr/share/perl5/vendor_perl/MHA/BinlogPosFinder.pm
/usr/share/perl5/vendor_perl/MHA/BinlogPosFinderElp.pm
/usr/share/perl5/vendor_perl/MHA/BinlogPosFinderXid.pm
/usr/share/perl5/vendor_perl/MHA/NodeConst.pm
/usr/share/perl5/vendor_perl/MHA/NodeUtil.pm
/usr/share/perl5/vendor_perl/MHA/SlaveUtil.pm
* Manager
Manager> rpm -qpl mha4mysql-manager-0.56-0.el6.noarch.rpm
/usr/bin/masterha_check_repl
/usr/bin/masterha_check_ssh
/usr/bin/masterha_check_status
/usr/bin/masterha_conf_host
/usr/bin/masterha_manager
/usr/bin/masterha_master_monitor
/usr/bin/masterha_master_switch
/usr/bin/masterha_secondary_check
/usr/bin/masterha_stop
/usr/share/man/man1/masterha_check_repl.1.gz
/usr/share/man/man1/masterha_check_ssh.1.gz
/usr/share/man/man1/masterha_check_status.1.gz
/usr/share/man/man1/masterha_conf_host.1.gz
/usr/share/man/man1/masterha_manager.1.gz
/usr/share/man/man1/masterha_master_monitor.1.gz
/usr/share/man/man1/masterha_master_switch.1.gz
/usr/share/man/man1/masterha_secondary_check.1.gz
/usr/share/man/man1/masterha_stop.1.gz
/usr/share/perl5/vendor_perl/MHA/Config.pm
/usr/share/perl5/vendor_perl/MHA/DBHelper.pm
/usr/share/perl5/vendor_perl/MHA/FileStatus.pm
/usr/share/perl5/vendor_perl/MHA/HealthCheck.pm
/usr/share/perl5/vendor_perl/MHA/ManagerAdmin.pm
/usr/share/perl5/vendor_perl/MHA/ManagerAdminWrapper.pm
/usr/share/perl5/vendor_perl/MHA/ManagerConst.pm
/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm
/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm
/usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm
/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm
/usr/share/perl5/vendor_perl/MHA/SSHCheck.pm
/usr/share/perl5/vendor_perl/MHA/Server.pm
/usr/share/perl5/vendor_perl/MHA/ServerManager.pm

二、MHA配置文件

  • global scope
1
2
3
4
5
6
7
8
9
10
11
12
* /etc/masterha_default.cnf
[server default] -- 这下面的都是全局配置,适用于所有app.cnf
user=dba
password=dba
ssh_user=root
master_binlog_dir= /data/mysql.bin
secondary_check_script= masterha_secondary_check -s remote_host1 -s remote_host2
ping_interval=3
master_ip_failover_script=/script/masterha/master_ip_failover
shutdown_script= /script/masterha/power_manager
report_script= /script/masterha/send_master_failover_mail
  • application scope
1
2
3
4
5
6
7
8
* /etc/app1.cnf
[server default]
remote_workdir=/var/log/masterha/app1
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/app1.log
[server default] 里面的这些配置,会影响到下面的N 个 server
  • local scope
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
* /etc/app1.cnf
[server1]
hostname=192.168.1.10
candidate_master=1
[server2]
hostname=192.168.1.11
candidate_master=1
[server3]
hostname=192.168.1.12
no_master=1
每个server下面的配置,专属于某个server

三、前提要求(必须满足)

3.1 SSH公钥认证

基本上MHA manager,MNA node,以及二次检测的节点,都需要互相信任

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# masterha_check_ssh --conf=/etc/app1.cnf
Sat May 14 14:42:19 2011 - [warn] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Sat May 14 14:42:19 2011 - [info] Reading application default configurations from /etc/app1.cnf..
Sat May 14 14:42:19 2011 - [info] Reading server configurations from /etc/app1.cnf..
Sat May 14 14:42:19 2011 - [info] Starting SSH connection tests..
Sat May 14 14:42:19 2011 - [debug] Connecting via SSH from root@host1(192.168.0.1) to root@host2(192.168.0.2)..
Sat May 14 14:42:20 2011 - [debug] ok.
Sat May 14 14:42:20 2011 - [debug] Connecting via SSH from root@host1(192.168.0.1) to root@host3(192.168.0.3)..
Sat May 14 14:42:20 2011 - [debug] ok.
Sat May 14 14:42:21 2011 - [debug] Connecting via SSH from root@host2(192.168.0.2) to root@host1(192.168.0.1)..
Sat May 14 14:42:21 2011 - [debug] ok.
Sat May 14 14:42:21 2011 - [debug] Connecting via SSH from root@host2(192.168.0.2) to root@host3(192.168.0.3)..
Sat May 14 14:42:21 2011 - [debug] ok.
Sat May 14 14:42:22 2011 - [debug] Connecting via SSH from root@host3(192.168.0.3) to root@host1(192.168.0.1)..
Sat May 14 14:42:22 2011 - [debug] ok.
Sat May 14 14:42:22 2011 - [debug] Connecting via SSH from root@host3(192.168.0.3) to root@host2(192.168.0.2)..
Sat May 14 14:42:22 2011 - [debug] ok.
Sat May 14 14:42:22 2011 - [info] All SSH connection tests passed successfully.

如果slave比较多,实例比较多,最好提高下 /etc/ssh/sshd_config MaxStartups 的值(默认是10)

3.2 操作系统

仅在Linux上测试过

3.3 单写master和多slave或者只读master

打从一开始,MHA就是为了解决数据一致性而出生,所以,最好是多个slave
如果你只有一个slave,根本就碰不到数据一致性问题,也就不需要mha了
如果是一个slave,用版同步复制也能解决

从0.52开始,MHA就支持多master的复制架构了,下面列举了多master环境下注意点

  • 多master,但是只允许单点写入
  • 默认情况下,只支持2层复制架构

3.4 三层或者多层复制环境

默认情况下,MHA是不支持3层或多层复制架构的(Master1 -> Master2 -> Slave3)
MHA可以恢复Master2,但是不能恢复Slave3,因为Master2,Slave3有不同的master
为了让MHA支持以上架构,可以参考如下配置:

  • 在配置文件中,只配置两层(master1 and master2)
  • 使用 “multi_tier_slave=1” 参数,然后设置所有hosts

3.5 MySQL版本必须是5.0 或者高于 5.0

  • MySQL版本必须大于等于5.0
  • 尽量使用高版本的MySQL

3.6 使用mysqlbinlog 5.1+ 支持MySQL5.1+

  • MHA使用mysqlbinlog来应用日志到目标slave上的
  • 如果MySQL master设置的是row格式,那么MySQL必须是大于等于5.1版本,因为5.0不支持row
  • mysqlbinlog版本可以这样被检测:
1
2
[app@slave_host1]$ mysqlbinlog --version
mysqlbinlog Ver 3.3 for unknown-linux-gnu at x86_64
  • 如果你使用的是MySQL5.1,那么mysqlbinlog必须大于等于3.3
  • 如果mysqlbinlog的版本是3.2,而mysql的版本是5.1,那么mha manager会报错,且停止monitoring

3.7 log-bin必须在候选master上开启

  • 如果当前slave没有设置log-bin,那么很显然它不能成为提升为new master
  • 如果没有任何机器设置了log-bin,那么mha会报错且停止failover

3.8 binlog,relay-log 主从环境必须全部一致

  • 复制过滤规则(binlog-do-db, replicate-ignore-db 等等)必须全部一致

3.9 复制用户必须在候选master上要存在

  • 切换完成后,所有slave都必须执行change master 命令。在new master上复制用户必须有(REPLICATEION SLAVE权限)

3.10 使用purge_relay_logs来定期删除relay logs

默认情况下,如果SQL线程执行完relay-log,relay logs就会被自动删除。
但是这些relay-logs 也许还会用来恢复其他的slave,所以你需要关闭自动删除relay-logs的purge线程,然后自己阶段性的来删除
如果是你自己来删的话,必须考虑repl 延迟问题

最好让slave删除relay log不要在同一时间点,假如需要恢复,那么这个时间点所有relay logs都被删除了就不好了

3.11 不要在SBR的环境中使用load data infile

不管是SBR,还是RBR,最好不要使用load data

四、实战演练

  1. [功能测试] 测试场景一、手工操作,手动完成,failover
  2. [功能测试] 测试场景二、手工操作, 自动完成,failover
  3. [功能测试] 测试场景三、手工操作,自动完成,online master switch
  4. [功能测试] 测试场景四、自动监控,自动操作,自动完成,failover
  5. [用例测试] MySQL master 服务 down掉,是否成功自动failover
  6. [用例测试] MySQL master too many connection,无权限,响应慢,是否自动failover
  7. [用例测试] MySQL master 服务器 down掉,且候选master落后的最多, 是否自动failover,知否可以成功的做日志补偿
  8. [用例测试] MySQL slave 服务 down掉,是否自动failover
  9. [用例测试] MySQL 有一台slave,或者多台slave服务器延迟很大,是否自动failover
  10. [用例测试] MySQL slave IO/SQL线程 stop,是否自动failover
  11. [用例测试] MySQL slave IO/SQL线程 报错,是否自动failover
  12. [用例测试] MySQL master 有大事务超过100s再执行,是否可以online master switch
  13. [用例测试] MySQL master 网络断掉,是否自动failover
  14. [用例测试] MySQL master 网路瞬断(1~30秒),是否自动failover
  15. [用例测试] MySQL master 和 候选master 网络都挂掉的情况,是否自动failover
  16. [用例测试] MHA manager 和 MySQL master之间的网络断掉,但是master和slave之间的网络是好的,是否自动failover
  17. [用例测试] MHA manager 和 MySQL master,slave 网络都断掉的情况,是否自动failover
  18. [用例测试] GTID模式下,还需要relay-log吗?是否能够成功的补齐日志
  19. [用例测试] 多线程复制模式下,做failover 和 online master switch,会不会有问题呢?
  20. [用例测试] 在一开始没有开启MHA的group中,如何做到日志补偿,然后change master呢?

4.0 简单的failover过程

  • step1:安装MHA node在所有节点上

  • step2:安装MHA manager

  • step3:创建配置文件

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
manager_host$ cat /etc/app1.cnf
[server default]
# mysql user and password
user=root
password=mysqlpass
ssh_user=root
# working directory on the manager
manager_workdir=/var/log/masterha/app1
# working directory on MySQL servers
remote_workdir=/var/log/masterha/app1
[server1]
hostname=host1
[server2]
hostname=host2
[server3]
hostname=host3
  • step4: 检查SSH 互信
1
# masterha_check_ssh --conf=/etc/app1.cnf
  • step5: 检查复制配置
1
manager_host$ masterha_check_repl --conf=/etc/app1.cnf
  • step6:开启Manager
1
manager_host$ masterha_manager --conf=/etc/app1.cnf
  • step7: 检查manager的状态
1
manager_host$ masterha_check_status --conf=/etc/app1.cnf
  • step8: 测试关闭manager
1
manager_host$ masterha_stop --conf=/etc/app1.cnf
  • step9:测试master failover
1
host1$ killall -9 mysqld mysqld_safe
  • 未完成的步骤(进阶)
1
2
3
4
5
6
7
8
* 二次检测: 检查master是否真的挂了,避免脑裂 secondary_network_script
* master_ip_failover: 默认的是空,什么都不做
ip漂移
新master赋予写
日志保留
报警
等等

4.2 测试场景一、 手工操作,手动完成,failover

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
* 关闭mha 监控
masterha_stop --conf=/etc/app1.cnf
  • step2: 手工制造master mysql挂了
1
killall -9 mysqld mysqld_safe
  • step3: 人工交互式failover
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
masterha_master_switch --master_state=dead --conf=/etc/app1.cnf --dead_master_host=host_1 --interactive=1 --ignore_last_failover
* 切换日志如下:
--dead_master_ip=<dead_master_ip> is not set. Using host_1.
--dead_master_port=<dead_master_port> is not set. Using 3306.
Thu Jul 28 16:37:04 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Jul 28 16:37:04 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Thu Jul 28 16:37:04 2016 - [info] Reading server configuration from /etc/app1.cnf..
Thu Jul 28 16:37:04 2016 - [info] MHA::MasterFailover version 0.56.
Thu Jul 28 16:37:04 2016 - [info] Starting master failover.
Thu Jul 28 16:37:04 2016 - [info]
Thu Jul 28 16:37:04 2016 - [info] * Phase 1: Configuration Check Phase..
Thu Jul 28 16:37:04 2016 - [info]
Thu Jul 28 16:37:04 2016 - [info] GTID failover mode = 0
Thu Jul 28 16:37:04 2016 - [info] Dead Servers:
Thu Jul 28 16:37:04 2016 - [info] host_1(host_1:3306)
Thu Jul 28 16:37:04 2016 - [info] Checking master reachability via MySQL(double check)...
Thu Jul 28 16:37:04 2016 - [info] ok.
Thu Jul 28 16:37:04 2016 - [info] Alive Servers:
Thu Jul 28 16:37:04 2016 - [info] host_2(host_2:3306)
Thu Jul 28 16:37:04 2016 - [info] host_3(host_3:3306)
Thu Jul 28 16:37:04 2016 - [info] Alive Slaves:
Thu Jul 28 16:37:04 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:04 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:04 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 16:37:04 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:04 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:04 2016 - [info] Not candidate for the new Master (no_master is set)
Master host_1(host_1:3306) is dead. Proceed? (yes/NO): yes
Thu Jul 28 16:37:06 2016 - [info] Starting Non-GTID based failover.
Thu Jul 28 16:37:06 2016 - [info]
Thu Jul 28 16:37:06 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jul 28 16:37:06 2016 - [info]
Thu Jul 28 16:37:06 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Jul 28 16:37:06 2016 - [info]
Thu Jul 28 16:37:06 2016 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Jul 28 16:37:07 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Jul 28 16:37:07 2016 - [info] Executing master IP deactivation script:
Thu Jul 28 16:37:07 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Jul 28 16:37:07 2016 - [info] done.
Thu Jul 28 16:37:07 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Jul 28 16:37:07 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] * Phase 3: Master Recovery Phase..
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000004:154
Thu Jul 28 16:37:07 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Jul 28 16:37:07 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:07 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 16:37:07 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:07 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 16:37:07 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000004:154
Thu Jul 28 16:37:07 2016 - [info] Oldest slaves:
Thu Jul 28 16:37:07 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:07 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 16:37:07 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:07 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] Fetching dead master's binary logs..
Thu Jul 28 16:37:07 2016 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1_name.000004 --start_pos=154 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728163704.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/app1 if not exists.. ok.
Concat binary/relay logs from host_1_name.000004 pos 154 to host_1_name.000004 EOF into /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728163704.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
No need to dump effective binlog data from /data/mysql.bin/host_1_name.000004 (pos starts 154, filesize 154). Skipping.
Binlog Checksum enabled
/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728163704.binlog has no effective data events.
Event not exists.
Thu Jul 28 16:37:07 2016 - [info] Additional events were not found from the orig master. No need to save.
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] * Phase 3.3: Determining New Master Phase..
Thu Jul 28 16:37:07 2016 - [info]
Thu Jul 28 16:37:07 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Jul 28 16:37:07 2016 - [info] All slaves received relay logs to the same position. No need to resync each other.
Thu Jul 28 16:37:07 2016 - [info] Searching new master from slaves..
Thu Jul 28 16:37:07 2016 - [info] Candidate masters from the configuration file:
Thu Jul 28 16:37:07 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:07 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 16:37:07 2016 - [info] Non-candidate masters:
Thu Jul 28 16:37:07 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 16:37:07 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 16:37:07 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Jul 28 16:37:07 2016 - [info] New master is host_2(host_2:3306)
Thu Jul 28 16:37:07 2016 - [info] Starting master failover..
Thu Jul 28 16:37:07 2016 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Starting master switch from host_1(host_1:3306) to host_2(host_2:3306)? (yes/NO): yes
Thu Jul 28 16:37:09 2016 - [info] New master decided manually is host_2(host_2:3306)
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Jul 28 16:37:09 2016 - [info] Starting recovery on host_2(host_2:3306)..
Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Jul 28 16:37:09 2016 - [info] done.
Thu Jul 28 16:37:09 2016 - [info] All relay logs were successfully applied.
Thu Jul 28 16:37:09 2016 - [info] Getting new master's binlog name and position..
Thu Jul 28 16:37:09 2016 - [info] host_2_name.000002:294
Thu Jul 28 16:37:09 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2_name.000002', MASTER_LOG_POS=294, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jul 28 16:37:09 2016 - [info] Executing master IP activate script:
Thu Jul 28 16:37:09 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Set read_only=0 on the new master.
No need to Creating app user on the new master..
Thu Jul 28 16:37:09 2016 - [info] OK.
Thu Jul 28 16:37:09 2016 - [info] ** Finished master recovery successfully.
Thu Jul 28 16:37:09 2016 - [info] * Phase 3: Master Recovery Phase completed.
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] * Phase 4: Slaves Recovery Phase..
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 6211. Check tmp log /var/log/masterha/app1/host_3_3306_20160728163704.log if it takes time..
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] Log messages from host_3 ...
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Jul 28 16:37:09 2016 - [info] End of log messages from host_3.
Thu Jul 28 16:37:09 2016 - [info] -- host_3(host_3:3306) has the latest relay log events.
Thu Jul 28 16:37:09 2016 - [info] Generating relay diff files from the latest slave succeeded.
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Jul 28 16:37:09 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 6216. Check tmp log /var/log/masterha/app1/host_3_3306_20160728163704.log if it takes time..
Thu Jul 28 16:37:10 2016 - [info]
Thu Jul 28 16:37:10 2016 - [info] Log messages from host_3 ...
Thu Jul 28 16:37:10 2016 - [info]
Thu Jul 28 16:37:09 2016 - [info] Starting recovery on host_3(host_3:3306)..
Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Jul 28 16:37:09 2016 - [info] done.
Thu Jul 28 16:37:09 2016 - [info] All relay logs were successfully applied.
Thu Jul 28 16:37:09 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Thu Jul 28 16:37:09 2016 - [info] Executed CHANGE MASTER.
Thu Jul 28 16:37:10 2016 - [info] Slave started.
Thu Jul 28 16:37:10 2016 - [info] End of log messages from host_3.
Thu Jul 28 16:37:10 2016 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Jul 28 16:37:10 2016 - [info] All new slave servers recovered successfully.
Thu Jul 28 16:37:10 2016 - [info]
Thu Jul 28 16:37:10 2016 - [info] * Phase 5: New master cleanup phase..
Thu Jul 28 16:37:10 2016 - [info]
Thu Jul 28 16:37:10 2016 - [info] Resetting slave info on the new master..
Thu Jul 28 16:37:10 2016 - [info] host_2: Resetting slave info succeeded.
Thu Jul 28 16:37:10 2016 - [info] Master failover to host_2(host_2:3306) completed successfully.
Thu Jul 28 16:37:10 2016 - [info]
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name for details.
Started manual(interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.3 [功能测试] 测试场景二、手工操作, 自动完成,failover

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
* 关闭mha 监控
masterha_stop --conf=/etc/app1.cnf
  • step2: 手工制造master mysql挂了
1
killall -9 mysqld mysqld_safe
  • step3: 人工-非交互式 failover
1
2
3
4
5
6
7
8
9
10
masterha_master_switch --master_state=dead --conf=/etc/app1.cnf --dead_master_host=host_1 --interactive=0 --ignore_last_failover
* 切换日志如下:
--dead_master_ip=<dead_master_ip> is not set. Using host_1.
--dead_master_port=<dead_master_port> is not set. Using 3306.
Thu Jul 28 16:33:21 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Jul 28 16:33:21 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Thu Jul 28 16:33:21 2016 - [info] Reading server configuration from /etc/app1.cnf..

4.4 [功能测试] 测试场景三、手工操作,自动完成,online master switch

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手工制造master mysql挂了
1
killall -9 mysqld mysqld_safe
  • step3: 手工操作,自动完成,online master switch
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
masterha_master_switch --master_state=alive --conf=/etc/app1.cnf --orig_master_is_new_slave --interactive=0
* 在online master切换过程中,如果MHA manager 开启,那么切换不会成功
Thu Jul 28 16:13:57 2016 - [info] Checking MHA is not monitoring or doing failover..
Thu Jul 28 16:13:57 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln142] Getting advisory lock failed on the current master. MHA Monitor runs on the current master. Stop MHA Manager/Monitor and try again.
Thu Jul 28 16:13:57 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53
* 在online master切换过程中,如果设置--orig_master_is_new_slave, 但是没有设置repl_password,就会报错
Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln784] Slave could not be started on host_1(host_1:3306)! Check slave status.
Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln862] Starting slave IO/SQL thread on host_1(host_1:3306) failed!
Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln573] Failed!
Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln602] Switching master to host_2(host_2:3306) done, but switching slaves partially failed.
* 切换成功的日志如下:
Thu Jul 28 16:23:25 2016 - [info] MHA::MasterRotate version 0.56.
Thu Jul 28 16:23:25 2016 - [info] Starting online master switch..
Thu Jul 28 16:23:25 2016 - [info]
Thu Jul 28 16:23:25 2016 - [info] * Phase 1: Configuration Check Phase..
Thu Jul 28 16:23:25 2016 - [info]
Thu Jul 28 16:23:25 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Jul 28 16:23:25 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Thu Jul 28 16:23:25 2016 - [info] Reading server configuration from /etc/app1.cnf..
Thu Jul 28 16:23:25 2016 - [info] GTID failover mode = 0
Thu Jul 28 16:23:25 2016 - [info] Current Alive Master: host_2(host_2:3306)
Thu Jul 28 16:23:25 2016 - [info] Alive Slaves:
Thu Jul 28 16:23:25 2016 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306)
Thu Jul 28 16:23:25 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 16:23:25 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306)
Thu Jul 28 16:23:25 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 16:23:25 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
Thu Jul 28 16:23:25 2016 - [info] ok.
Thu Jul 28 16:23:25 2016 - [info] Checking MHA is not monitoring or doing failover..
Thu Jul 28 16:23:25 2016 - [info] Checking replication health on host_1..
Thu Jul 28 16:23:25 2016 - [info] ok.
Thu Jul 28 16:23:25 2016 - [info] Checking replication health on host_3..
Thu Jul 28 16:23:25 2016 - [info] ok.
Thu Jul 28 16:23:25 2016 - [info] Searching new master from slaves..
Thu Jul 28 16:23:25 2016 - [info] Candidate masters from the configuration file:
Thu Jul 28 16:23:25 2016 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306)
Thu Jul 28 16:23:25 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 16:23:25 2016 - [info] host_2(host_2:3306) Version=5.7.13-log log-bin:enabled
Thu Jul 28 16:23:25 2016 - [info] Non-candidate masters:
Thu Jul 28 16:23:25 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306)
Thu Jul 28 16:23:25 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 16:23:25 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Jul 28 16:23:25 2016 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
+--host_2(host_2:3306)
Thu Jul 28 16:23:25 2016 - [info] Checking whether host_1(host_1:3306) is ok for the new master..
Thu Jul 28 16:23:25 2016 - [info] ok.
Thu Jul 28 16:23:25 2016 - [info] host_2(host_2:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Thu Jul 28 16:23:25 2016 - [info] host_2(host_2:3306): Resetting slave pointing to the dummy host.
Thu Jul 28 16:23:25 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jul 28 16:23:25 2016 - [info]
Thu Jul 28 16:23:25 2016 - [info] * Phase 2: Rejecting updates Phase..
Thu Jul 28 16:23:25 2016 - [info]
Thu Jul 28 16:23:25 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master.
Thu Jul 28 16:23:25 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root):
Thu Jul 28 16:23:25 2016 - [info] Executing FLUSH TABLES WITH READ LOCK..
Thu Jul 28 16:23:25 2016 - [info] ok.
Thu Jul 28 16:23:25 2016 - [info] Orig master binlog:pos is host_2_name.000002:294.
Thu Jul 28 16:23:25 2016 - [info] Waiting to execute all relay logs on host_1(host_1:3306)..
Thu Jul 28 16:23:25 2016 - [info] master_pos_wait(host_2_name.000002:294) completed on host_1(host_1:3306). Executed 0 events.
Thu Jul 28 16:23:25 2016 - [info] done.
Thu Jul 28 16:23:25 2016 - [info] Getting new master's binlog name and position..
Thu Jul 28 16:23:25 2016 - [info] host_1_name.000004:154
Thu Jul 28 16:23:25 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1_name.000004', MASTER_LOG_POS=154, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jul 28 16:23:25 2016 - [info] Setting read_only=0 on host_1(host_1:3306)..
Thu Jul 28 16:23:26 2016 - [info] ok.
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] * Switching slaves in parallel..
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] -- Slave switch on host host_3(host_3:3306) started, pid: 32120
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] Log messages from host_3 ...
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] Waiting to execute all relay logs on host_3(host_3:3306)..
Thu Jul 28 16:23:26 2016 - [info] master_pos_wait(host_2_name.000002:294) completed on host_3(host_3:3306). Executed 0 events.
Thu Jul 28 16:23:26 2016 - [info] done.
Thu Jul 28 16:23:26 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Thu Jul 28 16:23:26 2016 - [info] Executed CHANGE MASTER.
Thu Jul 28 16:23:26 2016 - [info] Slave started.
Thu Jul 28 16:23:26 2016 - [info] End of log messages from host_3 ...
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] -- Slave switch on host host_3(host_3:3306) succeeded.
Thu Jul 28 16:23:26 2016 - [info] Unlocking all tables on the orig master:
Thu Jul 28 16:23:26 2016 - [info] Executing UNLOCK TABLES..
Thu Jul 28 16:23:26 2016 - [info] ok.
Thu Jul 28 16:23:26 2016 - [info] Starting orig master as a new slave..
Thu Jul 28 16:23:26 2016 - [info] Resetting slave host_2(host_2:3306) and starting replication from the new master host_1(host_1:3306)..
Thu Jul 28 16:23:26 2016 - [info] Executed CHANGE MASTER.
Thu Jul 28 16:23:26 2016 - [info] Slave started.
Thu Jul 28 16:23:26 2016 - [info] All new slave servers switched successfully.
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] * Phase 5: New master cleanup phase..
Thu Jul 28 16:23:26 2016 - [info]
Thu Jul 28 16:23:26 2016 - [info] host_1: Resetting slave info succeeded.
Thu Jul 28 16:23:26 2016 - [info] Switching master to host_1(host_1:3306) completed successfully.

4.5 测试场景四、自动监控,自动操作,自动完成,failover

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手工制造master mysql挂了
1
killall -9 mysqld mysqld_safe
  • step3: 自动监控,自动操作,自动完成,自动failover
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
Thu Jul 28 17:10:25 2016 - [info] MHA::MasterMonitor version 0.56.
Thu Jul 28 17:10:26 2016 - [info] GTID failover mode = 0
Thu Jul 28 17:10:26 2016 - [info] Dead Servers:
Thu Jul 28 17:10:26 2016 - [info] Alive Servers:
Thu Jul 28 17:10:26 2016 - [info] host_1(host_1:3306)
Thu Jul 28 17:10:26 2016 - [info] host_2(host_2:3306)
Thu Jul 28 17:10:26 2016 - [info] host_3(host_3:3306)
Thu Jul 28 17:10:26 2016 - [info] Alive Slaves:
Thu Jul 28 17:10:26 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:10:26 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:10:26 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 17:10:26 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:10:26 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:10:26 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 17:10:26 2016 - [info] Current Alive Master: host_1(host_1:3306)
Thu Jul 28 17:10:26 2016 - [info] Checking slave configurations..
Thu Jul 28 17:10:26 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306).
Thu Jul 28 17:10:26 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
Thu Jul 28 17:10:26 2016 - [info] Checking replication filtering settings..
Thu Jul 28 17:10:26 2016 - [info] binlog_do_db= , binlog_ignore_db=
Thu Jul 28 17:10:26 2016 - [info] Replication filtering check ok.
Thu Jul 28 17:10:26 2016 - [info] GTID (with auto-pos) is not supported
Thu Jul 28 17:10:26 2016 - [info] Starting SSH connection tests..
Thu Jul 28 17:10:27 2016 - [info] All SSH connection tests passed successfully.
Thu Jul 28 17:10:27 2016 - [info] Checking MHA Node version..
Thu Jul 28 17:10:28 2016 - [info] Version check ok.
Thu Jul 28 17:10:28 2016 - [info] Checking SSH publickey authentication settings on the current master..
Thu Jul 28 17:10:28 2016 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Jul 28 17:10:29 2016 - [info] Master MHA Node version is 0.56.
Thu Jul 28 17:10:29 2016 - [info] Checking recovery script configurations on host_1(host_1:3306)..
Thu Jul 28 17:10:29 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=host_1_name.000004
Thu Jul 28 17:10:29 2016 - [info] Connecting to root@host_1(host_1:22)..
Creating /var/log/masterha/app1 if not exists.. ok.
Checking output directory is accessible or not..
ok.
Binlog found at /data/mysql.bin, up to host_1_name.000004
Thu Jul 28 17:10:29 2016 - [info] Binlog setting check done.
Thu Jul 28 17:10:29 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers..
Thu Jul 28 17:10:29 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='dba' --slave_host=host_2 --slave_ip=host_2 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.7.13-log --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2_name-relay-bin.000001 --slave_pass=xxx
Thu Jul 28 17:10:29 2016 - [info] Connecting to root@host_2(host_2:22)..
Checking slave recovery environment settings..
Relay log found at /data/mysql_data, up to host_2_name-relay-bin.000002
Temporary relay log file is /data/mysql_data/host_2_name-relay-bin.000002
Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Jul 28 17:10:29 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.7.13-log --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3_name-relay-bin.000001 --slave_pass=xxx
Thu Jul 28 17:10:29 2016 - [info] Connecting to root@host_3(host_3:22)..
Checking slave recovery environment settings..
Relay log found at /data/mysql_data, up to host_3_name-relay-bin.000002
Temporary relay log file is /data/mysql_data/host_3_name-relay-bin.000002
Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure.
done.
Testing mysqlbinlog output.. done.
Cleaning up test file(s).. done.
Thu Jul 28 17:10:29 2016 - [info] Slaves settings check done.
Thu Jul 28 17:10:29 2016 - [info]
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
Thu Jul 28 17:10:29 2016 - [info] Checking master_ip_failover_script status:
Thu Jul 28 17:10:29 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=status --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306
Thu Jul 28 17:10:30 2016 - [info] OK.
Thu Jul 28 17:10:30 2016 - [warning] shutdown_script is not defined.
Thu Jul 28 17:10:30 2016 - [info] Set master ping interval 3 seconds.
Thu Jul 28 17:10:30 2016 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Thu Jul 28 17:10:30 2016 - [info] Starting ping health check on host_1(host_1:3306)..
Thu Jul 28 17:10:30 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Thu Jul 28 17:11:21 2016 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away)
Thu Jul 28 17:11:21 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Thu Jul 28 17:11:21 2016 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Jul 28 17:11:24 2016 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Thu Jul 28 17:11:24 2016 - [warning] Connection failed 2 time(s)..
Thu Jul 28 17:11:27 2016 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Thu Jul 28 17:11:27 2016 - [warning] Connection failed 3 time(s)..
Thu Jul 28 17:11:30 2016 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111)
Thu Jul 28 17:11:30 2016 - [warning] Connection failed 4 time(s)..
Thu Jul 28 17:11:30 2016 - [warning] Master is not reachable from health checker!
Thu Jul 28 17:11:30 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Thu Jul 28 17:11:30 2016 - [warning] SSH is reachable.
Thu Jul 28 17:11:30 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status..
Thu Jul 28 17:11:30 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Thu Jul 28 17:11:30 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Thu Jul 28 17:11:30 2016 - [info] Reading server configuration from /etc/app1.cnf..
Thu Jul 28 17:11:30 2016 - [info] GTID failover mode = 0
Thu Jul 28 17:11:30 2016 - [info] Dead Servers:
Thu Jul 28 17:11:30 2016 - [info] host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Alive Servers:
Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306)
Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306)
Thu Jul 28 17:11:30 2016 - [info] Alive Slaves:
Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 17:11:30 2016 - [info] Checking slave configurations..
Thu Jul 28 17:11:30 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306).
Thu Jul 28 17:11:30 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
Thu Jul 28 17:11:30 2016 - [info] Checking replication filtering settings..
Thu Jul 28 17:11:30 2016 - [info] Replication filtering check ok.
Thu Jul 28 17:11:30 2016 - [info] Master is down!
Thu Jul 28 17:11:30 2016 - [info] Terminating monitoring script.
Thu Jul 28 17:11:30 2016 - [info] Got exit code 20 (Master dead).
Thu Jul 28 17:11:30 2016 - [info] MHA::MasterFailover version 0.56.
Thu Jul 28 17:11:30 2016 - [info] Starting master failover.
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] * Phase 1: Configuration Check Phase..
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] GTID failover mode = 0
Thu Jul 28 17:11:30 2016 - [info] Dead Servers:
Thu Jul 28 17:11:30 2016 - [info] host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Checking master reachability via MySQL(double check)...
Thu Jul 28 17:11:30 2016 - [info] ok.
Thu Jul 28 17:11:30 2016 - [info] Alive Servers:
Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306)
Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306)
Thu Jul 28 17:11:30 2016 - [info] Alive Slaves:
Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 17:11:30 2016 - [info] Starting Non-GTID based failover.
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Jul 28 17:11:30 2016 - [info] Executing master IP deactivation script:
Thu Jul 28 17:11:30 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Jul 28 17:11:30 2016 - [info] done.
Thu Jul 28 17:11:30 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Jul 28 17:11:30 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] * Phase 3: Master Recovery Phase..
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000004:154
Thu Jul 28 17:11:30 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 17:11:30 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000004:154
Thu Jul 28 17:11:30 2016 - [info] Oldest slaves:
Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Jul 28 17:11:30 2016 - [info]
Thu Jul 28 17:11:30 2016 - [info] Fetching dead master's binary logs..
Thu Jul 28 17:11:30 2016 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1_name.000004 --start_pos=154 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728171130.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/app1 if not exists.. ok.
Concat binary/relay logs from host_1_name.000004 pos 154 to host_1_name.000004 EOF into /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728171130.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
No need to dump effective binlog data from /data/mysql.bin/host_1_name.000004 (pos starts 154, filesize 154). Skipping.
Binlog Checksum enabled
/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728171130.binlog has no effective data events.
Event not exists.
Thu Jul 28 17:11:31 2016 - [info] Additional events were not found from the orig master. No need to save.
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 3.3: Determining New Master Phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Jul 28 17:11:31 2016 - [info] All slaves received relay logs to the same position. No need to resync each other.
Thu Jul 28 17:11:31 2016 - [info] Searching new master from slaves..
Thu Jul 28 17:11:31 2016 - [info] Candidate masters from the configuration file:
Thu Jul 28 17:11:31 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:31 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:31 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Jul 28 17:11:31 2016 - [info] Non-candidate masters:
Thu Jul 28 17:11:31 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Jul 28 17:11:31 2016 - [info] Replicating from host_1(host_1:3306)
Thu Jul 28 17:11:31 2016 - [info] Not candidate for the new Master (no_master is set)
Thu Jul 28 17:11:31 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Jul 28 17:11:31 2016 - [info] New master is host_2(host_2:3306)
Thu Jul 28 17:11:31 2016 - [info] Starting master failover..
Thu Jul 28 17:11:31 2016 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Jul 28 17:11:31 2016 - [info] Starting recovery on host_2(host_2:3306)..
Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Jul 28 17:11:31 2016 - [info] done.
Thu Jul 28 17:11:31 2016 - [info] All relay logs were successfully applied.
Thu Jul 28 17:11:31 2016 - [info] Getting new master's binlog name and position..
Thu Jul 28 17:11:31 2016 - [info] host_2_name.000002:294
Thu Jul 28 17:11:31 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2_name.000002', MASTER_LOG_POS=294, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Jul 28 17:11:31 2016 - [info] Executing master IP activate script:
Thu Jul 28 17:11:31 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Set read_only=0 on the new master.
No need to Creating app user on the new master..
Thu Jul 28 17:11:31 2016 - [info] OK.
Thu Jul 28 17:11:31 2016 - [info] ** Finished master recovery successfully.
Thu Jul 28 17:11:31 2016 - [info] * Phase 3: Master Recovery Phase completed.
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 4: Slaves Recovery Phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 22427. Check tmp log /var/log/masterha/app1/host_3_3306_20160728171130.log if it takes time..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] Log messages from host_3 ...
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Jul 28 17:11:31 2016 - [info] End of log messages from host_3.
Thu Jul 28 17:11:31 2016 - [info] -- host_3(host_3:3306) has the latest relay log events.
Thu Jul 28 17:11:31 2016 - [info] Generating relay diff files from the latest slave succeeded.
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 22429. Check tmp log /var/log/masterha/app1/host_3_3306_20160728171130.log if it takes time..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] Log messages from host_3 ...
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] Starting recovery on host_3(host_3:3306)..
Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Jul 28 17:11:31 2016 - [info] done.
Thu Jul 28 17:11:31 2016 - [info] All relay logs were successfully applied.
Thu Jul 28 17:11:31 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Thu Jul 28 17:11:31 2016 - [info] Executed CHANGE MASTER.
Thu Jul 28 17:11:31 2016 - [info] Slave started.
Thu Jul 28 17:11:31 2016 - [info] End of log messages from host_3.
Thu Jul 28 17:11:31 2016 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Jul 28 17:11:31 2016 - [info] All new slave servers recovered successfully.
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] * Phase 5: New master cleanup phase..
Thu Jul 28 17:11:31 2016 - [info]
Thu Jul 28 17:11:31 2016 - [info] Resetting slave info on the new master..
Thu Jul 28 17:11:31 2016 - [info] host_2: Resetting slave info succeeded.
Thu Jul 28 17:11:31 2016 - [info] Master failover to host_2(host_2:3306) completed successfully.
Thu Jul 28 17:11:31 2016 - [info]
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.6 [用例测试] MySQL master too many connection,无权限,响应慢

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手工制造master too many connection
1
2
3
4
5
mysqlslap --concurrency=4000 --iterations=10 --query='insert into tb(id_2) select (1);' --debug-info -hhost_1 -uxx -pxx --create-schema=lc
mysqlslap: Error when connecting to server: 1040 Too many connections
mysqlslap: Error when connecting to server: 1040 Too many connections
mysqlslap: Error when connecting to server: 1040 Too many connections
  • step3: 检查是否发生failover
1
2
3
4
5
6
7
* 1. 手动开启另一个session,链接报错tmc。 但是并没有发生failover,说明master 还活着
* 2. 检查了MHAhealth check机制,原来是长链接
DIGEST_TEXT: SELECT ? AS VALUE
COUNT_STAR: 46
FIRST_SEEN: 2016-08-02 11:28:00
LAST_SEEN: 2016-08-02 11:30:37

4.7 [用例测试] MySQL master 服务器 down掉,且候选master落后的最多, 是否自动failover,知否可以成功的做日志补偿

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造候选master落后的情况
1
候选master> stop slave; sleep 600s ; start slave;
  • step3: 手动制造master 服务器挂掉的情况
1
master > reboot
  • step4: 观察MHA 日志
1
2
3
Tue Aug 2 14:33:34 2016 - [warning] Slave host_2(host_2:3306) SQL Thread delays too much. Latest log file:host_1_name.000002:576864, Current log file:host_1_name.000001:39731438. This server is not selected as a new master because recovery will take long time.
Tue Aug 2 14:33:34 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln970] None of existing slaves matches as a new master. Maybe preferred node is misconfigured or all slaves are too far behind.
Tue Aug 2 14:33:34 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53
  • step5:重新修正配置文件

由于要指定特定的slave为候选master,而此slave还落后非常多
必须在每组服务器上都加上check_repl_delay=0

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
vi /etc/app1.cnf
[server default]
remote_workdir=/var/log/masterha/app1
manager_workdir=/var/log/masterha/app1
manager_log=/var/log/masterha/app1/app1.log
[server1]
hostname=host_1
candidate_master=1
check_repl_delay=0
[server2]
hostname=host_2
candidate_master=1
check_repl_delay=0
[server3]
hostname=host_3
no_master=1
check_repl_delay=0
  • step6:用手工failover再来模拟
1
masterha_master_switch --master_state=dead --conf=/etc/app1.cnf --dead_master_host=host_1 --interactive=1 --ignore_last_failover --new_master_host=host_2
  • step7: 观察MHA 日志
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
Tue Aug 2 14:38:55 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Tue Aug 2 14:38:55 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Tue Aug 2 14:38:55 2016 - [info] Reading server configuration from /etc/app1.cnf..
Tue Aug 2 14:38:55 2016 - [info] MHA::MasterFailover version 0.56.
Tue Aug 2 14:38:55 2016 - [info] Starting master failover.
Tue Aug 2 14:38:55 2016 - [info]
Tue Aug 2 14:38:55 2016 - [info] * Phase 1: Configuration Check Phase..
Tue Aug 2 14:38:55 2016 - [info]
Tue Aug 2 14:38:55 2016 - [info] GTID failover mode = 0
Tue Aug 2 14:38:55 2016 - [info] Dead Servers:
Tue Aug 2 14:38:55 2016 - [info] host_1(host_1:3306)
Tue Aug 2 14:38:55 2016 - [info] Checking master reachability via MySQL(double check)...
Tue Aug 2 14:38:55 2016 - [info] ok.
Tue Aug 2 14:38:55 2016 - [info] Alive Servers:
Tue Aug 2 14:38:55 2016 - [info] host_2(host_2:3306)
Tue Aug 2 14:38:55 2016 - [info] host_3(host_3:3306)
Tue Aug 2 14:38:55 2016 - [info] Alive Slaves:
Tue Aug 2 14:38:55 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Aug 2 14:38:55 2016 - [info] Replicating from host_1(host_1:3306)
Tue Aug 2 14:38:55 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Aug 2 14:38:55 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Aug 2 14:38:55 2016 - [info] Replicating from host_1(host_1:3306)
Tue Aug 2 14:38:55 2016 - [info] Not candidate for the new Master (no_master is set)
Master host_1(host_1:3306) is dead. Proceed? (yes/NO): yes
Tue Aug 2 14:38:57 2016 - [info] Starting Non-GTID based failover.
Tue Aug 2 14:38:57 2016 - [info]
Tue Aug 2 14:38:57 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Aug 2 14:38:57 2016 - [info]
Tue Aug 2 14:38:57 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Aug 2 14:38:57 2016 - [info]
Tue Aug 2 14:38:58 2016 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Aug 2 14:38:58 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Aug 2 14:38:58 2016 - [info] Executing master IP deactivation script:
Tue Aug 2 14:38:58 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Tue Aug 2 14:38:58 2016 - [info] done.
Tue Aug 2 14:38:58 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Aug 2 14:38:58 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Aug 2 14:38:58 2016 - [info]
Tue Aug 2 14:38:58 2016 - [info] * Phase 3: Master Recovery Phase..
Tue Aug 2 14:38:58 2016 - [info]
Tue Aug 2 14:38:58 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Aug 2 14:38:58 2016 - [info]
Tue Aug 2 14:38:58 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000002:576864
Tue Aug 2 14:38:58 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Aug 2 14:38:58 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Aug 2 14:38:58 2016 - [info] Replicating from host_1(host_1:3306)
Tue Aug 2 14:38:58 2016 - [info] Not candidate for the new Master (no_master is set)
Tue Aug 2 14:38:58 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000001:39731619
Tue Aug 2 14:38:58 2016 - [info] Oldest slaves:
Tue Aug 2 14:38:58 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Aug 2 14:38:58 2016 - [info] Replicating from host_1(host_1:3306)
Tue Aug 2 14:38:58 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Aug 2 14:38:58 2016 - [info]
Tue Aug 2 14:38:58 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Tue Aug 2 14:38:58 2016 - [info]
Tue Aug 2 14:38:58 2016 - [info] Fetching dead master's binary logs..
Tue Aug 2 14:38:58 2016 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1_name.000002 --start_pos=576864 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/app1 if not exists.. ok.
Concat binary/relay logs from host_1_name.000002 pos 576864 to host_1_name.000002 EOF into /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql.bin/host_1_name.000002 position 576864 to tail(577435).. ok.
Binlog Checksum enabled
Concat succeeded.
saved_master_binlog_from_host_1_3306_20160802143855.binlog 100% 725 0.7KB/s 00:00
Tue Aug 2 14:38:59 2016 - [info] scp from root@host_1:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog to local:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog succeeded.
Tue Aug 2 14:38:59 2016 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Aug 2 14:39:00 2016 - [info] HealthCheck: SSH to host_3 is reachable.
Tue Aug 2 14:39:00 2016 - [info]
Tue Aug 2 14:39:00 2016 - [info] * Phase 3.3: Determining New Master Phase..
Tue Aug 2 14:39:00 2016 - [info]
Tue Aug 2 14:39:00 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Tue Aug 2 14:39:00 2016 - [info] Checking whether host_3 has relay logs from the oldest position..
Tue Aug 2 14:39:00 2016 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1_name.000002 --latest_rmlp=576864 --target_mlf=host_1_name.000001 --target_rmlp=39731619 --server_id=12616606 --workdir=/var/log/masterha/app1 --timestamp=20160802143855 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3_name-relay-bin.000004 :
Relay log found at /data/mysql_data, up to host_3_name-relay-bin.000004
Fast relay log position search failed. Reading relay logs to find..
Reading host_3_name-relay-bin.000004
Binlog Checksum enabled
Master Version is 5.7.13-log
Binlog Checksum enabled
host_3_name-relay-bin.000004 contains master host_1_name.000002 from position 4
Reading host_3_name-relay-bin.000003
Binlog Checksum enabled
host_3_name-relay-bin.000003 contains master host_1_name.000001 from position 218246073
Reading host_3_name-relay-bin.000002
Binlog Checksum enabled
host_3_name-relay-bin.000002 contains master host_1_name.000001 from position 154
Target relay log FOUND!
Tue Aug 2 14:39:00 2016 - [info] OK. host_3 has all relay logs.
Tue Aug 2 14:39:00 2016 - [info] Searching new master from slaves..
Tue Aug 2 14:39:00 2016 - [info] Candidate masters from the configuration file:
Tue Aug 2 14:39:00 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Aug 2 14:39:00 2016 - [info] Replicating from host_1(host_1:3306)
Tue Aug 2 14:39:00 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Aug 2 14:39:00 2016 - [info] Non-candidate masters:
Tue Aug 2 14:39:00 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Aug 2 14:39:00 2016 - [info] Replicating from host_1(host_1:3306)
Tue Aug 2 14:39:00 2016 - [info] Not candidate for the new Master (no_master is set)
Tue Aug 2 14:39:00 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Aug 2 14:39:00 2016 - [info] Not found.
Tue Aug 2 14:39:00 2016 - [info] Searching from all candidate_master slaves..
Tue Aug 2 14:39:00 2016 - [info] New master is host_2(host_2:3306)
Tue Aug 2 14:39:00 2016 - [info] Starting master failover..
Tue Aug 2 14:39:00 2016 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Starting master switch from host_1(host_1:3306) to host_2(host_2:3306)? (yes/NO): yes
Tue Aug 2 14:39:05 2016 - [info] New master decided manually is host_2(host_2:3306)
Tue Aug 2 14:39:05 2016 - [info]
Tue Aug 2 14:39:05 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Tue Aug 2 14:39:05 2016 - [info]
Tue Aug 2 14:39:05 2016 - [info] Server host_2 received relay logs up to: host_1_name.000001:39731619
Tue Aug 2 14:39:05 2016 - [info] Need to get diffs from the latest slave(host_3) up to: host_1_name.000002:576864 (using the latest slave's relay logs)
Tue Aug 2 14:39:05 2016 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Tue Aug 2 14:39:05 2016 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_2 --latest_mlf=host_1_name.000002 --latest_rmlp=576864 --target_mlf=host_1_name.000001 --target_rmlp=39731619 --server_id=12616606 --diff_file_readtolatest=/var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog --workdir=/var/log/masterha/app1 --timestamp=20160802143855 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3_name-relay-bin.000004
Tue Aug 2 14:39:11 2016 - [info]
Relay log found at /data/mysql_data, up to host_3_name-relay-bin.000004
Fast relay log position search failed. Reading relay logs to find..
Reading host_3_name-relay-bin.000004
Binlog Checksum enabled
Master Version is 5.7.13-log
Binlog Checksum enabled
host_3_name-relay-bin.000004 contains master host_1_name.000002 from position 4
Reading host_3_name-relay-bin.000003
Binlog Checksum enabled
host_3_name-relay-bin.000003 contains master host_1_name.000001 from position 218246073
Reading host_3_name-relay-bin.000002
Binlog Checksum enabled
host_3_name-relay-bin.000002 contains master host_1_name.000001 from position 154
Target relay log file/position found. start_file:host_3_name-relay-bin.000002, start_pos:39731788.
Concat binary/relay logs from host_3_name-relay-bin.000002 pos 39731788 to host_3_name-relay-bin.000004 EOF into /var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog ..
Binlog Checksum enabled
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 323.. ok.
Dumping effective binlog data from /data/mysql_data/host_3_name-relay-bin.000002 position 39731788 to tail(218246252).. ok.
Dumping binlog head events (rotate events), skipping format description events from /data/mysql_data/host_3_name-relay-bin.000003.. Binlog Checksum enabled
dumped up to pos 264. ok.
No need to dump effective binlog data from /data/mysql_data/host_3_name-relay-bin.000003 (pos starts 264, filesize 264). Skipping.
Dumping binlog head events (rotate events), skipping format description events from /data/mysql_data/host_3_name-relay-bin.000004.. Binlog Checksum enabled
Binlog Checksum enabled
dumped up to pos 373. ok.
Dumping effective binlog data from /data/mysql_data/host_3_name-relay-bin.000004 position 373 to tail(577083).. ok.
Binlog Checksum enabled
Binlog Checksum enabled
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog .
scp host_3_name.58os.org:/var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog to root@host_2(22) succeeded.
Tue Aug 2 14:39:11 2016 - [info] Generating diff files succeeded.
Tue Aug 2 14:39:11 2016 - [info] Sending binlog..
saved_master_binlog_from_host_1_3306_20160802143855.binlog 100% 725 0.7KB/s 00:00
Tue Aug 2 14:39:12 2016 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog to root@host_2:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog succeeded.
Tue Aug 2 14:39:12 2016 - [info]
Tue Aug 2 14:39:12 2016 - [info] * Phase 3.4: Master Log Apply Phase..
Tue Aug 2 14:39:12 2016 - [info]
Tue Aug 2 14:39:12 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Tue Aug 2 14:39:12 2016 - [info] Starting recovery on host_2(host_2:3306)..
Tue Aug 2 14:39:12 2016 - [info] Generating diffs succeeded.
Tue Aug 2 14:39:12 2016 - [info] Waiting until all relay logs are applied.
Tue Aug 2 14:39:12 2016 - [info] done.
Tue Aug 2 14:39:12 2016 - [info] Getting slave status..
Tue Aug 2 14:39:12 2016 - [info] This slave(host_2)'s Exec_Master_Log_Pos(host_1_name.000001:39731438) does not equal to Read_Master_Log_Pos(host_1_name.000001:39731619). It is likely that relay log was cut during transaction. Need to recover from Exec_Master_Log_Pos.
Tue Aug 2 14:39:12 2016 - [info] Saving local relay logs from exec pos to read pos on host_2: from host_2_name-relay-bin.000003:39726380 to the end of the relay log..
Tue Aug 2 14:39:12 2016 - [info] Executing command : save_binary_logs --command=save --start_file=host_2_name-relay-bin.000003 --start_pos=39726380 --output_file=/var/log/masterha/app1/relay_from_exec_to_read_host_2_3306_20160802143855.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --binlog_dir=/data/mysql_data
Tue Aug 2 14:39:12 2016 - [info]
Creating /var/log/masterha/app1 if not exists.. ok.
Concat binary/relay logs from host_2_name-relay-bin.000003 pos 39726380 to host_2_name-relay-bin.000003 EOF into /var/log/masterha/app1/relay_from_exec_to_read_host_2_3306_20160802143855.binlog ..
Binlog Checksum enabled
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 323.. ok.
Dumping effective binlog data from /data/mysql_data/host_2_name-relay-bin.000003 position 39726380 to tail(39726561).. ok.
Binlog Checksum enabled
Binlog Checksum enabled
Concat succeeded.
Tue Aug 2 14:39:12 2016 - [info] Connecting to the target slave host host_2, running recover script..
Tue Aug 2 14:39:12 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_2 --slave_ip=host_2 --slave_port=3306 --apply_files=/var/log/masterha/app1/relay_from_exec_to_read_host_2_3306_20160802143855.binlog,/var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog,/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog --workdir=/var/log/masterha/app1 --target_version=5.7.13-log --timestamp=20160802143855 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Aug 2 14:42:11 2016 - [info]
Concat all apply files to /var/log/masterha/app1/total_binlog_for_host_2_3306.20160802143855.binlog ..
Copying the first binlog file /var/log/masterha/app1/relay_from_exec_to_read_host_2_3306_20160802143855.binlog to /var/log/masterha/app1/total_binlog_for_host_2_3306.20160802143855.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog.. Binlog Checksum enabled
Binlog Checksum enabled
dumped up to pos 323. ok.
/var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog has effective binlog events from pos 323.
Dumping effective binlog data from /var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog position 323 to tail(179091707).. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog.. Binlog Checksum enabled
dumped up to pos 154. ok.
/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog has effective binlog events from pos 154.
Dumping effective binlog data from /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog position 154 to tail(725).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/app1/total_binlog_for_host_2_3306.20160802143855.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/app1/relay_from_exec_to_read_host_2_3306_20160802143855.binlog,/var/log/masterha/app1/relay_from_read_to_latest_host_2_3306_20160802143855.binlog,/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog on host_2:3306. This may take long time...
Applying log files succeeded.
Tue Aug 2 14:42:11 2016 - [info] All relay logs were successfully applied.
Tue Aug 2 14:42:11 2016 - [info] Getting new master's binlog name and position..
Tue Aug 2 14:42:11 2016 - [info] host_2_name.000001:217647730
Tue Aug 2 14:42:11 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2_name.000001', MASTER_LOG_POS=217647730, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Aug 2 14:42:11 2016 - [info] Executing master IP activate script:
Tue Aug 2 14:42:11 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Set read_only=0 on the new master.
No need to Creating app user on the new master..
Tue Aug 2 14:42:11 2016 - [info] OK.
Tue Aug 2 14:42:11 2016 - [info] ** Finished master recovery successfully.
Tue Aug 2 14:42:11 2016 - [info] * Phase 3: Master Recovery Phase completed.
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] * Phase 4: Slaves Recovery Phase..
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 22541. Check tmp log /var/log/masterha/app1/host_3_3306_20160802143855.log if it takes time..
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] Log messages from host_3 ...
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Tue Aug 2 14:42:11 2016 - [info] End of log messages from host_3.
Tue Aug 2 14:42:11 2016 - [info] -- host_3(host_3:3306) has the latest relay log events.
Tue Aug 2 14:42:11 2016 - [info] Generating relay diff files from the latest slave succeeded.
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Tue Aug 2 14:42:11 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 22543. Check tmp log /var/log/masterha/app1/host_3_3306_20160802143855.log if it takes time..
saved_master_binlog_from_host_1_3306_20160802143855.binlog 100% 725 0.7KB/s 00:00
Tue Aug 2 14:42:12 2016 - [info]
Tue Aug 2 14:42:12 2016 - [info] Log messages from host_3 ...
Tue Aug 2 14:42:12 2016 - [info]
Tue Aug 2 14:42:11 2016 - [info] Sending binlog..
Tue Aug 2 14:42:12 2016 - [info] scp from local:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog to root@host_3:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog succeeded.
Tue Aug 2 14:42:12 2016 - [info] Starting recovery on host_3(host_3:3306)..
Tue Aug 2 14:42:12 2016 - [info] Generating diffs succeeded.
Tue Aug 2 14:42:12 2016 - [info] Waiting until all relay logs are applied.
Tue Aug 2 14:42:12 2016 - [info] done.
Tue Aug 2 14:42:12 2016 - [info] Getting slave status..
Tue Aug 2 14:42:12 2016 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1_name.000002:576864). No need to recover from Exec_Master_Log_Pos.
Tue Aug 2 14:42:12 2016 - [info] Connecting to the target slave host host_3, running recover script..
Tue Aug 2 14:42:12 2016 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog --workdir=/var/log/masterha/app1 --target_version=5.7.13-log --timestamp=20160802143855 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Aug 2 14:42:12 2016 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Tue Aug 2 14:42:12 2016 - [info] All relay logs were successfully applied.
Tue Aug 2 14:42:12 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Tue Aug 2 14:42:12 2016 - [info] Executed CHANGE MASTER.
Tue Aug 2 14:42:12 2016 - [info] Slave started.
Tue Aug 2 14:42:12 2016 - [info] End of log messages from host_3.
Tue Aug 2 14:42:12 2016 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Tue Aug 2 14:42:12 2016 - [info] All new slave servers recovered successfully.
Tue Aug 2 14:42:12 2016 - [info]
Tue Aug 2 14:42:12 2016 - [info] * Phase 5: New master cleanup phase..
Tue Aug 2 14:42:12 2016 - [info]
Tue Aug 2 14:42:12 2016 - [info] Resetting slave info on the new master..
Tue Aug 2 14:42:12 2016 - [info] host_2: Resetting slave info succeeded.
Tue Aug 2 14:42:12 2016 - [info] Master failover to host_2(host_2:3306) completed successfully.
Tue Aug 2 14:42:12 2016 - [info]
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name for details.
Started manual(interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.8 [用例测试] Master 没挂,但是MySQL slave 服务 down掉,是否自动failover

  • MySQL slave down掉,不会自动failover
1
2
3
* MHA 只会监控master的状态,只有master挂了,才会自动failover
[info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
  • master没挂,其他slave的错误,都不会导致自动failover
1
2
> 10. [用例测试] MySQL slave IO/SQL线程 stop,是否自动failover
> 11. [用例测试] MySQL slave IO/SQL线程 报错,是否自动failover

4.9 [用例测试] Master 挂了,但是MySQL slave 有问题,是否自动failover

MySQL slave IO/SQL线程 stop
MySQL slave IO/SQL线程 延迟很多

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动让某台slave stop slave ,然后再让master 挂掉
1
2
某台slave> stop slave;
master> shutdown mysql;
  • step3: 观察日志,观察情况
1
2
* 如果是slave 停止同步,然后master也挂了,任然会自动failover,不会报错
* 如果是slave 延迟严重,然后master也挂了,任然会自动failover,不会报错

4.10 [用例测试] MySQL master 有大事务超过100s再执行,是否online master switch

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
  • step2: 手动在master上制造大事务
1
master> select id , sleep(20) from tb for update;
  • step3: 哪种情况MHA会报错
1
2
3
4
5
6
7
8
* slave 有问题的时候
IO,sql thread有问题
slave 挂了
slave 延迟太多
复制过滤规则不统一
等等
以上情况,都会导致切换失败
  • step4:日志如下

    如果有大事务,MHA会一直等待,直到事务结束

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Wed Aug 3 15:33:04 2016 - [info] MHA::MasterRotate version 0.56.
Wed Aug 3 15:33:04 2016 - [info] Starting online master switch..
Wed Aug 3 15:33:04 2016 - [info]
Wed Aug 3 15:33:04 2016 - [info] * Phase 1: Configuration Check Phase..
Wed Aug 3 15:33:04 2016 - [info]
Wed Aug 3 15:33:04 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Wed Aug 3 15:33:04 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Wed Aug 3 15:33:04 2016 - [info] Reading server configuration from /etc/app1.cnf..
Wed Aug 3 15:33:04 2016 - [info] GTID failover mode = 0
Wed Aug 3 15:33:04 2016 - [info] Current Alive Master: host_1(host_1:3306)
Wed Aug 3 15:33:04 2016 - [info] Alive Slaves:
Wed Aug 3 15:33:04 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Aug 3 15:33:04 2016 - [info] Replicating from host_1(host_1:3306)
Wed Aug 3 15:33:04 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Aug 3 15:33:04 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Aug 3 15:33:04 2016 - [info] Replicating from host_1(host_1:3306)
Wed Aug 3 15:33:04 2016 - [info] Not candidate for the new Master (no_master is set)
Wed Aug 3 15:33:04 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time..
... 一直等待 ...

4.11 [用例测试] MySQL master 网络断掉

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造master网络不通的情况
1
2
3
4
5
6
* MASTER 网络断掉,只允许host_monitor的22端口访问(便于恢复网络)
master> iptables -A INPUT -p tcp -s host_monitor --dprot 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
* MASTER 恢复网络
master> service iptables restart
  • step3: master的网络不通,观察MHA日志
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
Wed Aug 3 18:32:26 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Wed Aug 3 18:32:26 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Wed Aug 3 18:32:29 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Wed Aug 3 18:32:29 2016 - [warning] Connection failed 2 time(s)..
Wed Aug 3 18:32:31 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Wed Aug 3 18:32:32 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Wed Aug 3 18:32:32 2016 - [warning] Connection failed 3 time(s)..
Wed Aug 3 18:32:35 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Wed Aug 3 18:32:35 2016 - [warning] Connection failed 4 time(s)..
Wed Aug 3 18:32:35 2016 - [warning] Master is not reachable from health checker!
Wed Aug 3 18:32:35 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Wed Aug 3 18:32:35 2016 - [warning] SSH is NOT reachable.
Wed Aug 3 18:32:35 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status..
Wed Aug 3 18:32:35 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Wed Aug 3 18:32:35 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Wed Aug 3 18:32:35 2016 - [info] Reading server configuration from /etc/app1.cnf..
Wed Aug 3 18:32:35 2016 - [info] GTID failover mode = 0
Wed Aug 3 18:32:35 2016 - [info] Dead Servers:
Wed Aug 3 18:32:35 2016 - [info] host_1(host_1:3306)
Wed Aug 3 18:32:35 2016 - [info] Alive Servers:
Wed Aug 3 18:32:35 2016 - [info] host_2(host_2:3306)
Wed Aug 3 18:32:35 2016 - [info] host_3(host_3:3306)
Wed Aug 3 18:32:35 2016 - [info] Alive Slaves:
Wed Aug 3 18:32:35 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Aug 3 18:32:35 2016 - [info] Replicating from host_1(host_1:3306)
Wed Aug 3 18:32:35 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Aug 3 18:32:35 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Aug 3 18:32:35 2016 - [info] Replicating from host_1(host_1:3306)
Wed Aug 3 18:32:35 2016 - [info] Not candidate for the new Master (no_master is set)
Wed Aug 3 18:32:35 2016 - [info] Checking slave configurations..
Wed Aug 3 18:32:35 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306).
Wed Aug 3 18:32:35 2016 - [info] read_only=1 is not set on slave host_3(host_3:3306).
Wed Aug 3 18:32:35 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
Wed Aug 3 18:32:35 2016 - [info] Checking replication filtering settings..
Wed Aug 3 18:32:35 2016 - [info] Replication filtering check ok.
Wed Aug 3 18:32:35 2016 - [info] Master is down!
Wed Aug 3 18:32:35 2016 - [info] Terminating monitoring script.
Wed Aug 3 18:32:35 2016 - [info] Got exit code 20 (Master dead).
Wed Aug 3 18:32:35 2016 - [info] MHA::MasterFailover version 0.56.
Wed Aug 3 18:32:35 2016 - [info] Starting master failover.
Wed Aug 3 18:32:35 2016 - [info]
Wed Aug 3 18:32:35 2016 - [info] * Phase 1: Configuration Check Phase..
Wed Aug 3 18:32:35 2016 - [info]
Wed Aug 3 18:32:35 2016 - [info] GTID failover mode = 0
Wed Aug 3 18:32:35 2016 - [info] Dead Servers:
Wed Aug 3 18:32:35 2016 - [info] host_1(host_1:3306)
Wed Aug 3 18:32:35 2016 - [info] Checking master reachability via MySQL(double check)...
Wed Aug 3 18:32:36 2016 - [info] ok.
Wed Aug 3 18:32:36 2016 - [info] Alive Servers:
Wed Aug 3 18:32:36 2016 - [info] host_2(host_2:3306)
Wed Aug 3 18:32:36 2016 - [info] host_3(host_3:3306)
Wed Aug 3 18:32:36 2016 - [info] Alive Slaves:
Wed Aug 3 18:32:36 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Aug 3 18:32:36 2016 - [info] Replicating from host_1(host_1:3306)
Wed Aug 3 18:32:36 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Aug 3 18:32:36 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Aug 3 18:32:36 2016 - [info] Replicating from host_1(host_1:3306)
Wed Aug 3 18:32:36 2016 - [info] Not candidate for the new Master (no_master is set)
Wed Aug 3 18:32:36 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln309] Last failover was done at 2016/08/03 16:07:43. Current time is too early to do failover again. If you want to do failover, manually remove /var/log/masterha/app1/app1.failover.complete and run this script again.
Wed Aug 3 18:32:36 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_manager line 65
* 由于之前做过多次切换,然后时间太短,所以failover终止,但是从这里可以看出,MHA已经检测到网络问题,进行failover动作了

4.12 [用例测试] MySQL master 网路瞬断(1~30秒),是否自动failover

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造master网络不通的情况, 断掉1~10s左右
1
2
3
4
5
6
7
* MASTER 网络断掉,只允许host_monitor的22端口访问(便于恢复网络)
master> iptables -A INPUT -p tcp -s host_monitor --dprot 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
* 1~10秒左右 MASTER 恢复网络
master> service iptables restart
  • step3: 观察MHA日志

结论:如果时间非常短,比如1~2秒的网络瞬断,不会failover
结论:如果时间比较长,比如:> 5~10秒,那么如果MHA已经判定到Master is not reachable from health checker! 那么,就会进行failover阶段

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
Wed Aug 3 18:48:35 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Wed Aug 3 18:48:50 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Wed Aug 3 18:48:50 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Wed Aug 3 18:48:53 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Wed Aug 3 18:48:53 2016 - [warning] Connection failed 2 time(s)..
Wed Aug 3 18:48:55 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Wed Aug 3 18:48:56 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Wed Aug 3 18:48:56 2016 - [warning] Connection failed 3 time(s)..
Wed Aug 3 18:48:59 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Wed Aug 3 18:48:59 2016 - [warning] Connection failed 4 time(s)..
Wed Aug 3 18:48:59 2016 - [warning] Master is not reachable from health checker!
Wed Aug 3 18:48:59 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Wed Aug 3 18:48:59 2016 - [warning] SSH is NOT reachable.
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.14 [用例测试] 多段网络测试0 (没有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 正常 —> S1 <— 正常 —> master
Manager <— 正常 —> S2 <— 正常 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
4
5
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s host_2 -j ACCEPT
master> iptables -A INPUT -p tcp -s host_3 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 检查日志

已经发生failover

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.14 [用例测试] 多段网络测试1 (有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 正常 —> S1 <— 正常 —> master
Manager <— 正常 —> S2 <— 正常 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf --last_failover_minute=1 &
* 二次检测
secondary_check_script= masterha_secondary_check -s host_2 -s host_3
  • step2: 手动制造上述网络故障
1
2
3
4
5
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s host_2 -j ACCEPT
master> iptables -A INPUT -p tcp -s host_3 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

结论: 因为二次检查可以通过,所以MHA认为master没有挂,没有failover,但是在一直循环检测中
结论: 只要有二次检测,那么只要有一个别的server可以连通master,那么就会认为master没有挂,就不会failover

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
Fri Aug 5 10:16:24 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Fri Aug 5 10:16:24 2016 - [info] HealthCheck: SSH to host_1 is reachable.
Fri Aug 5 10:17:06 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Fri Aug 5 10:17:06 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:17:06 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Master is reachable from host_2!
Fri Aug 5 10:17:07 2016 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Aug 5 10:17:09 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:09 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:17:11 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Aug 5 10:17:12 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:12 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:17:15 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:15 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 10:17:15 2016 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Aug 5 10:17:18 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:18 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 5 10:17:19 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:17:19 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Master is reachable from host_2!
Fri Aug 5 10:17:19 2016 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Aug 5 10:17:22 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:22 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:17:24 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Aug 5 10:17:25 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:25 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:17:28 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:28 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 10:17:28 2016 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Aug 5 10:17:31 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:17:31 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 5 10:17:31 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:17:31 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Master is reachable from host_2!

4.14 [用例测试] 多段网络测试2 (有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 正常 —> S1 <— 不通 —> master
Manager <— 正常 —> S2 <— 正常 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
4
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s host_3 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

结论: 因为二次检查可以通过,所以MHA认为master没有挂,没有failover,但是在一直循环检测中
结论: 只要有二次检测,那么只要有一个别的server可以连通master,那么就会认为master没有挂,就不会failover

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
Fri Aug 5 10:32:27 2016 - [info] Starting ping health check on host_1(host_1:3306)..
Fri Aug 5 10:32:27 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Fri Aug 5 10:33:30 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Fri Aug 5 10:33:30 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:33:30 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:33:33 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:33 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:33:35 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Monitoring server host_2 is reachable, Master is not reachable from host_2. OK.
Master is reachable from host_3!
Fri Aug 5 10:33:35 2016 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Aug 5 10:33:36 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:36 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:33:39 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:39 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 10:33:39 2016 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Aug 5 10:33:42 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:42 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 5 10:33:42 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:33:42 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:33:45 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:45 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:33:47 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Monitoring server host_2 is reachable, Master is not reachable from host_2. OK.
Master is reachable from host_3!
Fri Aug 5 10:33:47 2016 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Aug 5 10:33:48 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:48 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:33:51 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:51 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 10:33:51 2016 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Aug 5 10:33:54 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:54 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 5 10:33:54 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:33:54 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:33:57 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:33:57 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:33:59 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Monitoring server host_2 is reachable, Master is not reachable from host_2. OK.
Master is reachable from host_3!
Fri Aug 5 10:34:00 2016 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Aug 5 10:34:00 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:34:00 2016 - [warning] Connection failed 3 time(s)..

4.14 [用例测试] 多段网络测试3 (有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 正常 —> S1 <— 不通 —> master
Manager <— 正常 —> S2 <— 不通 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

由于都不通,那么会failover

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
Fri Aug 5 10:27:56 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Fri Aug 5 10:27:56 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:27:56 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:27:59 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:27:59 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:28:01 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Monitoring server host_2 is reachable, Master is not reachable from host_2. OK.
Fri Aug 5 10:28:02 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:28:02 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:28:05 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:28:05 2016 - [warning] Connection failed 4 time(s)..
Monitoring server host_3 is reachable, Master is not reachable from host_3. OK.
Fri Aug 5 10:28:06 2016 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Aug 5 10:28:06 2016 - [warning] Master is not reachable from health checker!
Fri Aug 5 10:28:06 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Fri Aug 5 10:28:06 2016 - [warning] SSH is NOT reachable.
Fri Aug 5 10:28:06 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status..
Fri Aug 5 10:28:06 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Fri Aug 5 10:28:06 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Fri Aug 5 10:28:06 2016 - [info] Reading server configuration from /etc/app1.cnf..
Fri Aug 5 10:28:06 2016 - [info] GTID failover mode = 0
Fri Aug 5 10:28:06 2016 - [info] Dead Servers:
Fri Aug 5 10:28:06 2016 - [info] host_1(host_1:3306)
Fri Aug 5 10:28:06 2016 - [info] Alive Servers:
Fri Aug 5 10:28:06 2016 - [info] host_2(host_2:3306)
Fri Aug 5 10:28:06 2016 - [info] host_3(host_3:3306)
Fri Aug 5 10:28:06 2016 - [info] Alive Slaves:
Fri Aug 5 10:28:06 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:06 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:06 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 10:28:06 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:06 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:06 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 10:28:06 2016 - [info] Checking slave configurations..
Fri Aug 5 10:28:06 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306).
Fri Aug 5 10:28:06 2016 - [info] read_only=1 is not set on slave host_3(host_3:3306).
Fri Aug 5 10:28:06 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
Fri Aug 5 10:28:06 2016 - [info] Checking replication filtering settings..
Fri Aug 5 10:28:06 2016 - [info] Replication filtering check ok.
Fri Aug 5 10:28:06 2016 - [info] Master is down!
Fri Aug 5 10:28:06 2016 - [info] Terminating monitoring script.
Fri Aug 5 10:28:06 2016 - [info] Got exit code 20 (Master dead).
Fri Aug 5 10:28:06 2016 - [info] MHA::MasterFailover version 0.56.
Fri Aug 5 10:28:06 2016 - [info] Starting master failover.
Fri Aug 5 10:28:06 2016 - [info]
Fri Aug 5 10:28:06 2016 - [info] * Phase 1: Configuration Check Phase..
Fri Aug 5 10:28:06 2016 - [info]
Fri Aug 5 10:28:07 2016 - [info] GTID failover mode = 0
Fri Aug 5 10:28:07 2016 - [info] Dead Servers:
Fri Aug 5 10:28:07 2016 - [info] host_1(host_1:3306)
Fri Aug 5 10:28:07 2016 - [info] Checking master reachability via MySQL(double check)...
Fri Aug 5 10:28:08 2016 - [info] ok.
Fri Aug 5 10:28:08 2016 - [info] Alive Servers:
Fri Aug 5 10:28:08 2016 - [info] host_2(host_2:3306)
Fri Aug 5 10:28:08 2016 - [info] host_3(host_3:3306)
Fri Aug 5 10:28:08 2016 - [info] Alive Slaves:
Fri Aug 5 10:28:08 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 10:28:08 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 10:28:08 2016 - [info] Starting Non-GTID based failover.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Aug 5 10:28:08 2016 - [info] Executing master IP deactivation script:
Fri Aug 5 10:28:08 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
Fri Aug 5 10:28:08 2016 - [info] done.
Fri Aug 5 10:28:08 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Aug 5 10:28:08 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 3: Master Recovery Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000001:154
Fri Aug 5 10:28:08 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Aug 5 10:28:08 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 10:28:08 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 10:28:08 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000001:154
Fri Aug 5 10:28:08 2016 - [info] Oldest slaves:
Fri Aug 5 10:28:08 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 10:28:08 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 3.3: Determining New Master Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Fri Aug 5 10:28:08 2016 - [info] All slaves received relay logs to the same position. No need to resync each other.
Fri Aug 5 10:28:08 2016 - [info] Searching new master from slaves..
Fri Aug 5 10:28:08 2016 - [info] Candidate masters from the configuration file:
Fri Aug 5 10:28:08 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 10:28:08 2016 - [info] Non-candidate masters:
Fri Aug 5 10:28:08 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 10:28:08 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 10:28:08 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 10:28:08 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Aug 5 10:28:08 2016 - [info] New master is host_2(host_2:3306)
Fri Aug 5 10:28:08 2016 - [info] Starting master failover..
Fri Aug 5 10:28:08 2016 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 3.4: Master Log Apply Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Fri Aug 5 10:28:08 2016 - [info] Starting recovery on host_2(host_2:3306)..
Fri Aug 5 10:28:08 2016 - [info] This server has all relay logs. Waiting all logs to be applied..
Fri Aug 5 10:28:08 2016 - [info] done.
Fri Aug 5 10:28:08 2016 - [info] All relay logs were successfully applied.
Fri Aug 5 10:28:08 2016 - [info] Getting new master's binlog name and position..
Fri Aug 5 10:28:08 2016 - [info] host_2_name.000001:154
Fri Aug 5 10:28:08 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2_name.000001', MASTER_LOG_POS=154, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Aug 5 10:28:08 2016 - [info] Executing master IP activate script:
Fri Aug 5 10:28:08 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Set read_only=0 on the new master.
No need to Creating app user on the new master..
Fri Aug 5 10:28:08 2016 - [info] OK.
Fri Aug 5 10:28:08 2016 - [info] ** Finished master recovery successfully.
Fri Aug 5 10:28:08 2016 - [info] * Phase 3: Master Recovery Phase completed.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 4: Slaves Recovery Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 3964. Check tmp log /var/log/masterha/app1/host_3_3306_20160805102806.log if it takes time..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] Log messages from host_3 ...
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Fri Aug 5 10:28:08 2016 - [info] End of log messages from host_3.
Fri Aug 5 10:28:08 2016 - [info] -- host_3(host_3:3306) has the latest relay log events.
Fri Aug 5 10:28:08 2016 - [info] Generating relay diff files from the latest slave succeeded.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 3966. Check tmp log /var/log/masterha/app1/host_3_3306_20160805102806.log if it takes time..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] Log messages from host_3 ...
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] Starting recovery on host_3(host_3:3306)..
Fri Aug 5 10:28:08 2016 - [info] This server has all relay logs. Waiting all logs to be applied..
Fri Aug 5 10:28:08 2016 - [info] done.
Fri Aug 5 10:28:08 2016 - [info] All relay logs were successfully applied.
Fri Aug 5 10:28:08 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Fri Aug 5 10:28:08 2016 - [info] Executed CHANGE MASTER.
Fri Aug 5 10:28:08 2016 - [info] Slave started.
Fri Aug 5 10:28:08 2016 - [info] End of log messages from host_3.
Fri Aug 5 10:28:08 2016 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Fri Aug 5 10:28:08 2016 - [info] All new slave servers recovered successfully.
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] * Phase 5: New master cleanup phase..
Fri Aug 5 10:28:08 2016 - [info]
Fri Aug 5 10:28:08 2016 - [info] Resetting slave info on the new master..
Fri Aug 5 10:28:08 2016 - [info] host_2: Resetting slave info succeeded.
Fri Aug 5 10:28:08 2016 - [info] Master failover to host_2(host_2:3306) completed successfully.
Fri Aug 5 10:28:08 2016 - [info]
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.14 [用例测试] 多段网络测试4 (没有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 不通 —> S1 <— 正常 —> master
Manager <— 不通 —> S2 <— 正常 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
4
5
6
7
8
9
10
11
12
13
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s host_2 -j ACCEPT
master> iptables -A INPUT -p tcp -s host_3 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s1> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s1> iptables -A INPUT -p tcp -s host_1 -j ACCEPT
s1> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s2> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s2> iptables -A INPUT -p tcp -s host_1 -j ACCEPT
s2> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

结论:因为master不通,其他slave也不通,那么没有存活的slave,也就不会failover了

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
Fri Aug 5 10:39:39 2016 - [info] Checking master_ip_failover_script status:
Fri Aug 5 10:39:39 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=status --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306
Fri Aug 5 10:39:39 2016 - [info] OK.
Fri Aug 5 10:39:39 2016 - [warning] shutdown_script is not defined.
Fri Aug 5 10:39:39 2016 - [info] Set master ping interval 3 seconds.
Fri Aug 5 10:39:39 2016 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Fri Aug 5 10:39:39 2016 - [info] Starting ping health check on host_1(host_1:3306)..
Fri Aug 5 10:39:39 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Fri Aug 5 10:41:51 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Fri Aug 5 10:41:51 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:41:54 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:41:54 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:41:56 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Aug 5 10:41:57 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:41:57 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:42:00 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:42:00 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 10:42:00 2016 - [warning] Master is not reachable from health checker!
Fri Aug 5 10:42:00 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Fri Aug 5 10:42:00 2016 - [warning] SSH is NOT reachable.
Fri Aug 5 10:42:00 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status..
Fri Aug 5 10:42:00 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Fri Aug 5 10:42:00 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Fri Aug 5 10:42:00 2016 - [info] Reading server configuration from /etc/app1.cnf..
Fri Aug 5 10:42:08 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln188] There is no alive server. We can't do failover
Fri Aug 5 10:42:08 2016 - [warning] Got Error: at /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm line 558
Fri Aug 5 10:42:08 2016 - [info] Got exit code 1 (Not master dead).

4.14 [用例测试] 多段网络测试5 (有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 不通 —> S1 <— 正常 —> master
Manager <— 不通 —> S2 <— 正常 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
4
5
6
7
8
9
10
11
12
13
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s host_2 -j ACCEPT
master> iptables -A INPUT -p tcp -s host_3 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s1> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s1> iptables -A INPUT -p tcp -s host_1 -j ACCEPT
s1> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s2> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s2> iptables -A INPUT -p tcp -s host_1 -j ACCEPT
s2> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

Monitoring server host_2 is NOT reachable!
只要monitor 服务器,有一台不通,那么就不会failover,MHA manager会认为这是网络故障

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
Fri Aug 5 10:47:41 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Fri Aug 5 10:55:50 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Fri Aug 5 10:55:50 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:55:50 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:55:53 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:55:53 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:55:55 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
ssh: connect to host host_2 port 22: Connection timed out
Monitoring server host_2 is NOT reachable!
Fri Aug 5 10:55:55 2016 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Aug 5 10:55:56 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:55:56 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 10:55:59 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:55:59 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 10:55:59 2016 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Aug 5 10:56:02 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:56:02 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 5 10:56:02 2016 - [info] Executing secondary network check script: masterha_secondary_check -s host_2 -s host_3 --user=root --master_host=host_1 --master_ip=host_1 --master_port=3306 --master_user=dba --master_password=dba --ping_type=SELECT
Fri Aug 5 10:56:02 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 10:56:05 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 10:56:05 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 10:56:07 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
ssh: connect to host host_2 port 22: Connection timed out
Monitoring server host_2 is NOT reachable!
Fri Aug 5 10:56:07 2016 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.

4.14 [用例测试] 多段网络测试6 (没有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 不通 —> S1 <— 不通 —> master
Manager <— 不通 —> S2 <— 不通 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
4
5
6
7
8
9
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s1> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s1> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s2> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s2> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

由于都不通,所以不会failover

4.14 [用例测试] 多段网络测试7 (有二次检测的机制)

Manager <— 不通 —> Master
Manager <— 不通 —> S1 <— 不通 —> master
Manager <— 不通 —> S2 <— 不通 —> master

  • step1: 检查各种MHA manager的各种环境
1
2
3
4
5
6
7
8
9
10
11
12
13
* 检查SSH
masterha_check_ssh --conf=/etc/app1.cnf
* 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告)
masterha_check_repl --conf=/etc/app1.cnf
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
* 检查mha状态
masterha_check_status --conf=/etc/app1.cnf
* 开启mha 监控
nohup masterha_manager --conf=/etc/app1.cnf &
  • step2: 手动制造上述网络故障
1
2
3
4
5
6
7
8
9
master> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s1> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s1> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
s2> iptables -A INPUT -p tcp -s host_monitor -j ACCEPT
s2> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
  • step3: 观察日志

由于都不通,所以不会failover

4.15 [用例测试] Non-GTID模式下,需要relay-log吗?是否能够成功的补齐日志

如果没有足够的relay-log,在恢复的时候,会报错的
在生产差异relay log的时候发现没有足够的日志,所以就报错,不会failover

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
Fri Aug 5 11:09:51 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
Fri Aug 5 11:10:42 2016 - [warning] Got timeout on MySQL Ping(SELECT) child process and killed it! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 431.
Fri Aug 5 11:10:42 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name
Fri Aug 5 11:10:45 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 11:10:45 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 11:10:47 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Aug 5 11:10:48 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 11:10:48 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 11:10:51 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 11:10:51 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 11:10:51 2016 - [warning] Master is not reachable from health checker!
Fri Aug 5 11:10:51 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Fri Aug 5 11:10:51 2016 - [warning] SSH is NOT reachable.
Fri Aug 5 11:10:51 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status..
Fri Aug 5 11:10:51 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Fri Aug 5 11:10:51 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Fri Aug 5 11:10:51 2016 - [info] Reading server configuration from /etc/app1.cnf..
Fri Aug 5 11:10:51 2016 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Fri Aug 5 11:10:51 2016 - [info] GTID failover mode = 0
Fri Aug 5 11:10:51 2016 - [info] Dead Servers:
Fri Aug 5 11:10:51 2016 - [info] host_1(host_1:3306)
Fri Aug 5 11:10:51 2016 - [info] Alive Servers:
Fri Aug 5 11:10:51 2016 - [info] host_2(host_2:3306)
Fri Aug 5 11:10:51 2016 - [info] host_3(host_3:3306)
Fri Aug 5 11:10:51 2016 - [info] Alive Slaves:
Fri Aug 5 11:10:51 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 11:10:51 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 11:10:51 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 11:10:51 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 11:10:51 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 11:10:51 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 11:10:51 2016 - [info] Checking slave configurations..
Fri Aug 5 11:10:51 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306).
Fri Aug 5 11:10:51 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306).
Fri Aug 5 11:10:51 2016 - [info] read_only=1 is not set on slave host_3(host_3:3306).
Fri Aug 5 11:10:51 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306).
Fri Aug 5 11:10:51 2016 - [info] Checking replication filtering settings..
Fri Aug 5 11:10:51 2016 - [info] Replication filtering check ok.
Fri Aug 5 11:10:51 2016 - [info] Master is down!
Fri Aug 5 11:10:51 2016 - [info] Terminating monitoring script.
Fri Aug 5 11:10:51 2016 - [info] Got exit code 20 (Master dead).
Fri Aug 5 11:10:51 2016 - [info] MHA::MasterFailover version 0.56.
Fri Aug 5 11:10:51 2016 - [info] Starting master failover.
Fri Aug 5 11:10:51 2016 - [info]
Fri Aug 5 11:10:51 2016 - [info] * Phase 1: Configuration Check Phase..
Fri Aug 5 11:10:51 2016 - [info]
Fri Aug 5 11:10:51 2016 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Fri Aug 5 11:10:51 2016 - [info] GTID failover mode = 0
Fri Aug 5 11:10:51 2016 - [info] Dead Servers:
Fri Aug 5 11:10:51 2016 - [info] host_1(host_1:3306)
Fri Aug 5 11:10:51 2016 - [info] Checking master reachability via MySQL(double check)...
Fri Aug 5 11:10:52 2016 - [info] ok.
Fri Aug 5 11:10:52 2016 - [info] Alive Servers:
Fri Aug 5 11:10:52 2016 - [info] host_2(host_2:3306)
Fri Aug 5 11:10:52 2016 - [info] host_3(host_3:3306)
Fri Aug 5 11:10:52 2016 - [info] Alive Slaves:
Fri Aug 5 11:10:52 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 11:10:52 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 11:10:52 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 11:10:52 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 11:10:52 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 11:10:52 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 11:10:52 2016 - [info] Starting SQL thread on host_3(host_3:3306) ..
Fri Aug 5 11:10:52 2016 - [info] done.
Fri Aug 5 11:10:52 2016 - [info] Starting Non-GTID based failover.
Fri Aug 5 11:10:52 2016 - [info]
Fri Aug 5 11:10:52 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Aug 5 11:10:52 2016 - [info]
Fri Aug 5 11:10:52 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Aug 5 11:10:52 2016 - [info]
Fri Aug 5 11:10:52 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Aug 5 11:10:52 2016 - [info] Executing master IP deactivation script:
Fri Aug 5 11:10:52 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
Fri Aug 5 11:10:53 2016 - [info] done.
Fri Aug 5 11:10:53 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Aug 5 11:10:53 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [info] * Phase 3: Master Recovery Phase..
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000001:32925754
Fri Aug 5 11:10:53 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Aug 5 11:10:53 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 11:10:53 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 11:10:53 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 11:10:53 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000001:2093512
Fri Aug 5 11:10:53 2016 - [info] Oldest slaves:
Fri Aug 5 11:10:53 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 11:10:53 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 11:10:53 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [info] * Phase 3.3: Determining New Master Phase..
Fri Aug 5 11:10:53 2016 - [info]
Fri Aug 5 11:10:53 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Fri Aug 5 11:10:53 2016 - [info] HealthCheck: SSH to host_2 is reachable.
Fri Aug 5 11:10:53 2016 - [info] Checking whether host_2 has relay logs from the oldest position..
Fri Aug 5 11:10:53 2016 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1_name.000001 --latest_rmlp=32925754 --target_mlf=host_1_name.000001 --target_rmlp=2093512 --server_id=12616506 --workdir=/var/log/masterha/app1 --timestamp=20160805111051 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2_name-relay-bin.000008 :
Relay log found at /data/mysql_data, up to host_2_name-relay-bin.000008
Fast relay log position search failed. Reading relay logs to find..
Reading host_2_name-relay-bin.000008
Binlog Checksum enabled
Master Version is 5.7.13-log
Binlog Checksum enabled
host_2_name-relay-bin.000008 contains master host_1_name.000001 from position 26175199
Reading host_2_name-relay-bin.000007
Binlog Checksum enabled
host_2_name-relay-bin.000007 contains master host_1_name.000001 from position 12184106
Reading host_2_name-relay-bin.000006
No such file or directory:/data/mysql_data/host_2_name-relay-bin.000006 at /usr/share/perl5/vendor_perl/MHA/BinlogPosFindManager.pm line 102
Fri Aug 5 11:10:53 2016 - [warning] host_2 doesn't have all relay logs. Maybe some logs were purged.
Fri Aug 5 11:10:53 2016 - [warning] None of latest servers have enough relay logs from oldest position. We can't recover oldest slaves.
Fri Aug 5 11:10:53 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln947] None of the latest slaves has enough relay logs for recovery.
Fri Aug 5 11:10:53 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_manager line 65
Fri Aug 5 11:10:53 2016 - [info]
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306)
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
None of the latest slaves has enough relay logs for recovery.
Got Error so couldn't continue failover from here.

4.16 [用例测试] GTID模式下,需要relay-log吗?是否能够成功的补齐日志

GTID模式下,不需要relay-log,因为是走新的协议。
但是必须打开 binlog,log_slave_updates

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
Fri Aug 5 13:32:25 2016 - [info] OK.
Fri Aug 5 13:32:25 2016 - [warning] shutdown_script is not defined.
Fri Aug 5 13:32:25 2016 - [info] Set master ping interval 3 seconds.
Fri Aug 5 13:32:25 2016 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Fri Aug 5 13:32:25 2016 - [info] Starting ping health check on host_1(host_1:3306)..
Fri Aug 5 13:32:28 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 13:32:28 2016 - [warning] Connection failed 1 time(s)..
Fri Aug 5 13:32:28 2016 - [info] Executing SSH check script: exit 0
Fri Aug 5 13:32:31 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 13:32:31 2016 - [warning] Connection failed 2 time(s)..
Fri Aug 5 13:32:33 2016 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Aug 5 13:32:34 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 13:32:34 2016 - [warning] Connection failed 3 time(s)..
Fri Aug 5 13:32:37 2016 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on 'host_1' (4))
Fri Aug 5 13:32:37 2016 - [warning] Connection failed 4 time(s)..
Fri Aug 5 13:32:37 2016 - [warning] Master is not reachable from health checker!
Fri Aug 5 13:32:37 2016 - [warning] Master host_1(host_1:3306) is not reachable!
Fri Aug 5 13:32:37 2016 - [warning] SSH is NOT reachable.
Fri Aug 5 13:32:37 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status..
Fri Aug 5 13:32:37 2016 - [info] Reading default configuration from /etc/masterha_default.cnf..
Fri Aug 5 13:32:37 2016 - [info] Reading application default configuration from /etc/app1.cnf..
Fri Aug 5 13:32:37 2016 - [info] Reading server configuration from /etc/app1.cnf..
Fri Aug 5 13:32:37 2016 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Fri Aug 5 13:32:37 2016 - [info] GTID failover mode = 1
Fri Aug 5 13:32:37 2016 - [info] Dead Servers:
Fri Aug 5 13:32:37 2016 - [info] host_1(host_1:3306)
Fri Aug 5 13:32:37 2016 - [info] Alive Servers:
Fri Aug 5 13:32:37 2016 - [info] host_2(host_2:3306)
Fri Aug 5 13:32:37 2016 - [info] host_3(host_3:3306)
Fri Aug 5 13:32:37 2016 - [info] Alive Slaves:
Fri Aug 5 13:32:37 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:37 2016 - [info] GTID ON
Fri Aug 5 13:32:37 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:37 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 13:32:37 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:37 2016 - [info] GTID ON
Fri Aug 5 13:32:37 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:37 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 13:32:37 2016 - [info] Checking slave configurations..
Fri Aug 5 13:32:37 2016 - [info] Checking replication filtering settings..
Fri Aug 5 13:32:37 2016 - [info] Replication filtering check ok.
Fri Aug 5 13:32:37 2016 - [info] Master is down!
Fri Aug 5 13:32:37 2016 - [info] Terminating monitoring script.
Fri Aug 5 13:32:37 2016 - [info] Got exit code 20 (Master dead).
Fri Aug 5 13:32:37 2016 - [info] MHA::MasterFailover version 0.56.
Fri Aug 5 13:32:37 2016 - [info] Starting master failover.
Fri Aug 5 13:32:37 2016 - [info]
Fri Aug 5 13:32:37 2016 - [info] * Phase 1: Configuration Check Phase..
Fri Aug 5 13:32:37 2016 - [info]
Fri Aug 5 13:32:37 2016 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Fri Aug 5 13:32:37 2016 - [info] GTID failover mode = 1
Fri Aug 5 13:32:37 2016 - [info] Dead Servers:
Fri Aug 5 13:32:37 2016 - [info] host_1(host_1:3306)
Fri Aug 5 13:32:37 2016 - [info] Checking master reachability via MySQL(double check)...
Fri Aug 5 13:32:38 2016 - [info] ok.
Fri Aug 5 13:32:38 2016 - [info] Alive Servers:
Fri Aug 5 13:32:38 2016 - [info] host_2(host_2:3306)
Fri Aug 5 13:32:38 2016 - [info] host_3(host_3:3306)
Fri Aug 5 13:32:38 2016 - [info] Alive Slaves:
Fri Aug 5 13:32:38 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:38 2016 - [info] GTID ON
Fri Aug 5 13:32:38 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:38 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 13:32:38 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:38 2016 - [info] GTID ON
Fri Aug 5 13:32:38 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:38 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 13:32:38 2016 - [info] Starting SQL thread on host_3(host_3:3306) ..
Fri Aug 5 13:32:38 2016 - [info] done.
Fri Aug 5 13:32:38 2016 - [info] Starting GTID based failover.
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Aug 5 13:32:38 2016 - [info] Executing master IP deactivation script:
Fri Aug 5 13:32:38 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
Fri Aug 5 13:32:38 2016 - [info] done.
Fri Aug 5 13:32:38 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Aug 5 13:32:38 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 3: Master Recovery Phase..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000002:56745425
Fri Aug 5 13:32:38 2016 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-210949
Fri Aug 5 13:32:38 2016 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Aug 5 13:32:38 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:38 2016 - [info] GTID ON
Fri Aug 5 13:32:38 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:38 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 13:32:38 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000002:517969
Fri Aug 5 13:32:38 2016 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-1925
Fri Aug 5 13:32:38 2016 - [info] Oldest slaves:
Fri Aug 5 13:32:38 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:38 2016 - [info] GTID ON
Fri Aug 5 13:32:38 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:38 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 3.3: Determining New Master Phase..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] Searching new master from slaves..
Fri Aug 5 13:32:38 2016 - [info] Candidate masters from the configuration file:
Fri Aug 5 13:32:38 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:38 2016 - [info] GTID ON
Fri Aug 5 13:32:38 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:38 2016 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Aug 5 13:32:38 2016 - [info] Non-candidate masters:
Fri Aug 5 13:32:38 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Aug 5 13:32:38 2016 - [info] GTID ON
Fri Aug 5 13:32:38 2016 - [info] Replicating from host_1(host_1:3306)
Fri Aug 5 13:32:38 2016 - [info] Not candidate for the new Master (no_master is set)
Fri Aug 5 13:32:38 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Aug 5 13:32:38 2016 - [info] New master is host_2(host_2:3306)
Fri Aug 5 13:32:38 2016 - [info] Starting master failover..
Fri Aug 5 13:32:38 2016 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] Waiting all logs to be applied..
Fri Aug 5 13:32:38 2016 - [info] done.
Fri Aug 5 13:32:38 2016 - [info] Getting new master's binlog name and position..
Fri Aug 5 13:32:38 2016 - [info] host_2_name.000008:10198806
Fri Aug 5 13:32:38 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Aug 5 13:32:38 2016 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_2_name.000008, 10198806, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-210949
Fri Aug 5 13:32:38 2016 - [info] Executing master IP activate script:
Fri Aug 5 13:32:38 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Set read_only=0 on the new master.
No need to Creating app user on the new master..
Fri Aug 5 13:32:38 2016 - [info] OK.
Fri Aug 5 13:32:38 2016 - [info] Setting read_only=0 on host_2(host_2:3306)..
Fri Aug 5 13:32:38 2016 - [info] ok.
Fri Aug 5 13:32:38 2016 - [info] ** Finished master recovery successfully.
Fri Aug 5 13:32:38 2016 - [info] * Phase 3: Master Recovery Phase completed.
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 4: Slaves Recovery Phase..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Aug 5 13:32:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 22029. Check tmp log /var/log/masterha/app1/host_3_3306_20160805133237.log if it takes time..
Fri Aug 5 13:33:38 2016 - [info]
Fri Aug 5 13:33:38 2016 - [info] Log messages from host_3 ...
Fri Aug 5 13:33:38 2016 - [info]
Fri Aug 5 13:32:38 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Fri Aug 5 13:32:38 2016 - [info] Executed CHANGE MASTER.
Fri Aug 5 13:32:39 2016 - [info] Slave started.
Fri Aug 5 13:33:38 2016 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-210949) completed on host_3(host_3:3306). Executed 206170 events.
Fri Aug 5 13:33:38 2016 - [info] End of log messages from host_3.
Fri Aug 5 13:33:38 2016 - [info] -- Slave on host host_3(host_3:3306) started.
Fri Aug 5 13:33:38 2016 - [info] All new slave servers recovered successfully.
Fri Aug 5 13:33:38 2016 - [info]
Fri Aug 5 13:33:38 2016 - [info] * Phase 5: New master cleanup phase..
Fri Aug 5 13:33:38 2016 - [info]
Fri Aug 5 13:33:38 2016 - [info] Resetting slave info on the new master..
Fri Aug 5 13:33:38 2016 - [info] host_2: Resetting slave info succeeded.
Fri Aug 5 13:33:38 2016 - [info] Master failover to host_2(host_2:3306) completed successfully.
Fri Aug 5 13:33:38 2016 - [info]
----- Failover Report -----
app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): OK: Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.

4.18 [用例测试] 在一开始没有开启MHA的group中,然后master挂了,如何做到日志补偿,然后change master呢?

1
2
3
4
5
6
7
* 如果master挂了,而且没有HA,那怎么做日志补偿呢?
1. 其实很简单,只要之前在slave都设置了relay-log-purge=0就可以
2. 然后自己在按照正常流程部署好manager和node
3. 操作 手工-自动 failover 就可以了
Contents
  1. 1. 测试背景
  2. 2. 一、MHA安装
  3. 3. 二、MHA配置文件
  4. 4. 三、前提要求(必须满足)
    1. 4.1. 3.1 SSH公钥认证
    2. 4.2. 3.2 操作系统
    3. 4.3. 3.3 单写master和多slave或者只读master
    4. 4.4. 3.4 三层或者多层复制环境
    5. 4.5. 3.5 MySQL版本必须是5.0 或者高于 5.0
    6. 4.6. 3.6 使用mysqlbinlog 5.1+ 支持MySQL5.1+
    7. 4.7. 3.7 log-bin必须在候选master上开启
    8. 4.8. 3.8 binlog,relay-log 主从环境必须全部一致
    9. 4.9. 3.9 复制用户必须在候选master上要存在
    10. 4.10. 3.10 使用purge_relay_logs来定期删除relay logs
    11. 4.11. 3.11 不要在SBR的环境中使用load data infile
  5. 5. 四、实战演练
    1. 5.1. 4.0 简单的failover过程
  6. 6. 4.2 测试场景一、 手工操作,手动完成,failover
  7. 7. 4.3 [功能测试] 测试场景二、手工操作, 自动完成,failover
  8. 8. 4.4 [功能测试] 测试场景三、手工操作,自动完成,online master switch
  9. 9. 4.5 测试场景四、自动监控,自动操作,自动完成,failover
  10. 10. 4.6 [用例测试] MySQL master too many connection,无权限,响应慢
  11. 11. 4.7 [用例测试] MySQL master 服务器 down掉,且候选master落后的最多, 是否自动failover,知否可以成功的做日志补偿
  12. 12. 4.8 [用例测试] Master 没挂,但是MySQL slave 服务 down掉,是否自动failover
  13. 13. 4.9 [用例测试] Master 挂了,但是MySQL slave 有问题,是否自动failover
  14. 14. 4.10 [用例测试] MySQL master 有大事务超过100s再执行,是否online master switch
  15. 15. 4.11 [用例测试] MySQL master 网络断掉
  16. 16. 4.12 [用例测试] MySQL master 网路瞬断(1~30秒),是否自动failover
  17. 17. 4.14 [用例测试] 多段网络测试0 (没有二次检测的机制)
  18. 18. 4.14 [用例测试] 多段网络测试1 (有二次检测的机制)
  19. 19. 4.14 [用例测试] 多段网络测试2 (有二次检测的机制)
  20. 20. 4.14 [用例测试] 多段网络测试3 (有二次检测的机制)
  21. 21. 4.14 [用例测试] 多段网络测试4 (没有二次检测的机制)
  22. 22. 4.14 [用例测试] 多段网络测试5 (有二次检测的机制)
  23. 23. 4.14 [用例测试] 多段网络测试6 (没有二次检测的机制)
  24. 24. 4.14 [用例测试] 多段网络测试7 (有二次检测的机制)
  25. 25. 4.15 [用例测试] Non-GTID模式下,需要relay-log吗?是否能够成功的补齐日志
  26. 26. 4.16 [用例测试] GTID模式下,需要relay-log吗?是否能够成功的补齐日志
  27. 27. 4.18 [用例测试] 在一开始没有开启MHA的group中,然后master挂了,如何做到日志补偿,然后change master呢?

幸福,不在于得到的多

而在于计较的少