Contents
  1. 1. 假定环境(经典三节点)
  2. 2. 一、Master : MySQL down
    1. 2.1. 1.1 etl 延迟8小时
    2. 2.2. 1.2 slave(候选master)比etl还要落后更多
    3. 2.3. 1.3 slave(候选master)的日志是最新的,比etl要多
    4. 2.4. 1.4 slave(候选master)上面有大事务在跑
    5. 2.5. 1.5 binlog server 不同场景的测试
    6. 2.6. 1.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
    7. 2.7. 1.7 Master:MySQL down小结
  3. 3. 二、Master : Server down
    1. 3.1. 2.1 etl 延迟8小时
    2. 3.2. 2.2 slave(候选master)比etl还要落后更多
    3. 3.3. 2.3 slave(候选master)的日志是最新的,比etl要多
    4. 3.4. 2.4 slave(候选master)上面有大事务在跑
    5. 3.5. 2.5 binlog server 不同场景的测试
    6. 3.6. 2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
  4. 4. 三、遇到的坑
    1. 4.1. 3.1 交互模式下,如果没有及时敲’YES’,则终止切换
  5. 5. 四、总结
  6. 6. 五、流程简介

这里以masterha_master_switch为背景详解各种可能遇到的场景

假定环境(经典三节点)

1
2
3
host_1(host_1:3306) (current master)
+--host_2(host_2:3306 slave[candidate master])
+--host_3(host_3:3306 etl)

一、Master : MySQL down

1.1 etl 延迟8小时

1
配置文件中加上no_check_delay=0 即可忽略报错

1.2 slave(候选master)比etl还要落后更多

  • 1.2.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
### 模拟现场,现场的3台DB gtid状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_2.000002 | 2885 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave (candidate master) host_1
Retrieved_Gtid_Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446353
Auto_Position: 1
* etl (other slave) host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353-446356
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446356
Auto_Position: 1
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Thu Nov 9 10:43:49 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 9 10:43:49 2017 - [info] Starting master failover.
Thu Nov 9 10:43:49 2017 - [info]
Thu Nov 9 10:43:49 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 9 10:43:49 2017 - [info]
Thu Nov 9 10:43:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 9 10:43:50 2017 - [info] Binlog server host_2 is reachable.
Thu Nov 9 10:43:50 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 9 10:43:50 2017 - [info] Binlog server host_1 is reachable.
Thu Nov 9 10:43:50 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 9 10:43:50 2017 - [info] Binlog server host_3 is reachable.
Thu Nov 9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_1( host_1:3306)
Thu Nov 9 10:43:51 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Thu Nov 9 10:43:51 2017 - [info] GTID failover mode = 1
Thu Nov 9 10:43:51 2017 - [info] Dead Servers:
Thu Nov 9 10:43:51 2017 - [info] host_2( host_2:3306)
Thu Nov 9 10:43:51 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 9 10:43:51 2017 - [info] ok.
Thu Nov 9 10:43:51 2017 - [info] Alive Servers:
Thu Nov 9 10:43:51 2017 - [info] host_1( host_1:3306)
Thu Nov 9 10:43:51 2017 - [info] host_3( host_3:3306)
Thu Nov 9 10:43:51 2017 - [info] Alive Slaves:
Thu Nov 9 10:43:51 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:43:51 2017 - [info] GTID ON
Thu Nov 9 10:43:51 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 10:43:51 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 10:43:51 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:43:51 2017 - [info] GTID ON
Thu Nov 9 10:43:51 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 10:43:51 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 10:43:51 2017 - [info] Starting SQL thread on host_1( host_1:3306) ..
Thu Nov 9 10:43:51 2017 - [info] done.
Thu Nov 9 10:43:51 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Thu Nov 9 10:43:51 2017 - [info] done.
Thu Nov 9 10:43:51 2017 - [info] Starting GTID based failover.
Thu Nov 9 10:43:51 2017 - [info]
Thu Nov 9 10:43:51 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 9 10:43:51 2017 - [info]
Thu Nov 9 10:43:51 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 9 10:43:51 2017 - [info]
Thu Nov 9 10:43:51 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 9 10:43:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 9 10:43:51 2017 - [info] Executing master IP deactivation script:
Thu Nov 9 10:43:51 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Nov 9 10:43:53 2017 - [info] done.
Thu Nov 9 10:43:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 9 10:43:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] The latest binary log file/position on all slaves is host_2.000002:1115
Thu Nov 9 10:43:53 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:4-16,
Thu Nov 9 10:43:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 9 10:43:53 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:43:53 2017 - [info] GTID ON
Thu Nov 9 10:43:53 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 10:43:53 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 10:43:53 2017 - [info] The oldest binary log file/position on all slaves is host_2.000002:230
Thu Nov 9 10:43:53 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446353
Thu Nov 9 10:43:53 2017 - [info] Oldest slaves:
Thu Nov 9 10:43:53 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:43:53 2017 - [info] GTID ON
Thu Nov 9 10:43:53 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 10:43:53 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] Searching new master from slaves..
Thu Nov 9 10:43:53 2017 - [info] Candidate masters from the configuration file:
Thu Nov 9 10:43:53 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:43:53 2017 - [info] GTID ON
Thu Nov 9 10:43:53 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 10:43:53 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 10:43:53 2017 - [info] Non-candidate masters:
Thu Nov 9 10:43:53 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:43:53 2017 - [info] GTID ON
Thu Nov 9 10:43:53 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 10:43:53 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 10:43:53 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 9 10:43:53 2017 - [info] Not found.
Thu Nov 9 10:43:53 2017 - [info] Searching from all candidate_master slaves..
Thu Nov 9 10:43:53 2017 - [info] New master is host_1( host_1:3306)
Thu Nov 9 10:43:53 2017 - [info] Starting master failover..
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov 9 10:43:53 2017 - [info]
Thu Nov 9 10:43:53 2017 - [info] Waiting all logs to be applied..
Thu Nov 9 10:43:53 2017 - [info] done.
Thu Nov 9 10:43:53 2017 - [info] Replicating from the latest slave host_3( host_3:3306) and waiting to apply..
Thu Nov 9 10:43:53 2017 - [info] Waiting all logs to be applied on the latest slave..
Thu Nov 9 10:43:53 2017 - [info] Resetting slave host_1( host_1:3306) and starting replication from the new master host_3( host_3:3306)..
Thu Nov 9 10:43:53 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 10:43:54 2017 - [info] Slave started.
Thu Nov 9 10:43:54 2017 - [info] Waiting to execute all relay logs on host_1( host_1:3306)..
Thu Nov 9 10:43:54 2017 - [info] master_pos_wait( host_3.000049:18041) completed on host_1( host_1:3306). Executed 0 events.
Thu Nov 9 10:43:54 2017 - [info] done.
Thu Nov 9 10:43:54 2017 - [info] done.
Thu Nov 9 10:43:54 2017 - [info] -- Saving binlog from host host_2 started, pid: 150294
Thu Nov 9 10:43:54 2017 - [info] -- Saving binlog from host host_1 started, pid: 150295
Thu Nov 9 10:43:54 2017 - [info] -- Saving binlog from host host_3 started, pid: 150297
Thu Nov 9 10:43:54 2017 - [info]
Thu Nov 9 10:43:54 2017 - [info] Log messages from host_1 ...
Thu Nov 9 10:43:54 2017 - [info]
Thu Nov 9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_1..
Thu Nov 9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002 --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov 9 10:43:54 2017 - [info] End of log messages from host_1.
Thu Nov 9 10:43:54 2017 - [warning] Got error from host_1.
Thu Nov 9 10:43:54 2017 - [info]
Thu Nov 9 10:43:54 2017 - [info] Log messages from host_3 ...
Thu Nov 9 10:43:54 2017 - [info]
Thu Nov 9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_3..
Thu Nov 9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002 --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 10:43:54 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov 9 10:43:54 2017 - [info] End of log messages from host_3.
Thu Nov 9 10:43:54 2017 - [warning] Got error from host_3.
Thu Nov 9 10:43:55 2017 - [info]
Thu Nov 9 10:43:55 2017 - [info] Log messages from host_2 ...
Thu Nov 9 10:43:55 2017 - [info]
Thu Nov 9 10:43:54 2017 - [info] Fetching binary logs from binlog server host_2..
Thu Nov 9 10:43:54 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000002 --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 10:43:55 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109104349.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109104349.binlog succeeded.
Thu Nov 9 10:43:55 2017 - [info] End of log messages from host_2.
Thu Nov 9 10:43:55 2017 - [info] Saved mysqlbinlog size from host_2 is 6047 bytes.
Thu Nov 9 10:43:55 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109104349.binlog ..
Thu Nov 9 10:43:55 2017 - [info] Differential log apply from binlog server succeeded.
Thu Nov 9 10:43:55 2017 - [info] Getting new master's binlog name and position..
Thu Nov 9 10:43:55 2017 - [info] host_1.000053:3624
Thu Nov 9 10:43:55 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 9 10:43:55 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000053, 3624, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Thu Nov 9 10:43:55 2017 - [info] Executing master IP activate script:
Thu Nov 9 10:43:55 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Thu Nov 9 10:43:57 2017 - [info] OK.
Thu Nov 9 10:43:57 2017 - [info] Setting read_only=0 on host_1( host_1:3306)..
Thu Nov 9 10:43:57 2017 - [info] ok.
Thu Nov 9 10:43:57 2017 - [info] ** Finished master recovery successfully.
Thu Nov 9 10:43:57 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 9 10:43:57 2017 - [info]
Thu Nov 9 10:43:57 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 9 10:43:57 2017 - [info]
Thu Nov 9 10:43:57 2017 - [info]
Thu Nov 9 10:43:57 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov 9 10:43:57 2017 - [info]
Thu Nov 9 10:43:57 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 155162. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109104349.log if it takes time..
Thu Nov 9 10:43:58 2017 - [info]
Thu Nov 9 10:43:58 2017 - [info] Log messages from host_3 ...
Thu Nov 9 10:43:58 2017 - [info]
Thu Nov 9 10:43:57 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_1( host_1:3306)..
Thu Nov 9 10:43:57 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 10:43:58 2017 - [info] Slave started.
Thu Nov 9 10:43:58 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Thu Nov 9 10:43:58 2017 - [info] End of log messages from host_3.
Thu Nov 9 10:43:58 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Thu Nov 9 10:43:58 2017 - [info] All new slave servers recovered successfully.
Thu Nov 9 10:43:58 2017 - [info]
Thu Nov 9 10:43:58 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 9 10:43:58 2017 - [info]
Thu Nov 9 10:43:58 2017 - [info] Resetting slave info on the new master..
Thu Nov 9 10:43:58 2017 - [info] host_1: Resetting slave info succeeded.
Thu Nov 9 10:43:58 2017 - [info] Master failover to host_1( host_1:3306) completed successfully.
Thu Nov 9 10:43:58 2017 - [info]
Thu Nov 9 10:43:58 2017 - [info] Sending mail..
  • 1.2.2 当master的所有日志已经传递到1个etl,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
### 模拟现场,现场的3台DB gtid状态
* master host_1
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_1.000053 | 5229 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave (candidate master) host_2
Retrieved_Gtid_Set:
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362
Auto_Position: 1
* etl (other slave) host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446357-446362
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446362
Auto_Position: 1
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_1 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Thu Nov 9 10:59:14 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 9 10:59:14 2017 - [info] Starting master failover.
Thu Nov 9 10:59:14 2017 - [info]
Thu Nov 9 10:59:14 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 9 10:59:14 2017 - [info]
Thu Nov 9 10:59:15 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 9 10:59:15 2017 - [info] Binlog server host_2 is reachable.
Thu Nov 9 10:59:15 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 9 10:59:15 2017 - [info] Binlog server host_1 is reachable.
Thu Nov 9 10:59:15 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 9 10:59:16 2017 - [info] Binlog server host_3 is reachable.
Thu Nov 9 10:59:16 2017 - [warning] SQL Thread is stopped(no error) on host_2( host_2:3306)
Thu Nov 9 10:59:16 2017 - [info] GTID failover mode = 1
Thu Nov 9 10:59:16 2017 - [info] Dead Servers:
Thu Nov 9 10:59:16 2017 - [info] host_1( host_1:3306)
Thu Nov 9 10:59:16 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 9 10:59:16 2017 - [info] ok.
Thu Nov 9 10:59:16 2017 - [info] Alive Servers:
Thu Nov 9 10:59:16 2017 - [info] host_2( host_2:3306)
Thu Nov 9 10:59:16 2017 - [info] host_3( host_3:3306)
Thu Nov 9 10:59:16 2017 - [info] Alive Slaves:
Thu Nov 9 10:59:16 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:59:16 2017 - [info] GTID ON
Thu Nov 9 10:59:16 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 10:59:16 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 10:59:16 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:59:16 2017 - [info] GTID ON
Thu Nov 9 10:59:16 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 10:59:16 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 10:59:16 2017 - [info] Starting SQL thread on host_2( host_2:3306) ..
Thu Nov 9 10:59:16 2017 - [info] done.
Thu Nov 9 10:59:16 2017 - [info] Starting GTID based failover.
Thu Nov 9 10:59:16 2017 - [info]
Thu Nov 9 10:59:16 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 9 10:59:16 2017 - [info]
Thu Nov 9 10:59:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 9 10:59:16 2017 - [info]
Thu Nov 9 10:59:16 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 9 10:59:16 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 9 10:59:16 2017 - [info] Executing master IP deactivation script:
Thu Nov 9 10:59:16 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Nov 9 10:59:20 2017 - [info] done.
Thu Nov 9 10:59:20 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 9 10:59:20 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] The latest binary log file/position on all slaves is host_1.000053:5229
Thu Nov 9 10:59:20 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:17-21,
Thu Nov 9 10:59:20 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 9 10:59:20 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:59:20 2017 - [info] GTID ON
Thu Nov 9 10:59:20 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 10:59:20 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 10:59:20 2017 - [info] The oldest binary log file/position on all slaves is host_1.000053:3624
Thu Nov 9 10:59:20 2017 - [info] Oldest slaves:
Thu Nov 9 10:59:20 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:59:20 2017 - [info] GTID ON
Thu Nov 9 10:59:20 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 10:59:20 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] Searching new master from slaves..
Thu Nov 9 10:59:20 2017 - [info] Candidate masters from the configuration file:
Thu Nov 9 10:59:20 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:59:20 2017 - [info] GTID ON
Thu Nov 9 10:59:20 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 10:59:20 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 10:59:20 2017 - [info] Non-candidate masters:
Thu Nov 9 10:59:20 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 10:59:20 2017 - [info] GTID ON
Thu Nov 9 10:59:20 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 10:59:20 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 10:59:20 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 9 10:59:20 2017 - [info] Not found.
Thu Nov 9 10:59:20 2017 - [info] Searching from all candidate_master slaves..
Thu Nov 9 10:59:20 2017 - [info] New master is host_2( host_2:3306)
Thu Nov 9 10:59:20 2017 - [info] Starting master failover..
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov 9 10:59:20 2017 - [info]
Thu Nov 9 10:59:20 2017 - [info] Waiting all logs to be applied..
Thu Nov 9 10:59:20 2017 - [info] done.
Thu Nov 9 10:59:20 2017 - [info] Replicating from the latest slave host_3( host_3:3306) and waiting to apply..
Thu Nov 9 10:59:20 2017 - [info] Waiting all logs to be applied on the latest slave..
Thu Nov 9 10:59:20 2017 - [info] Resetting slave host_2( host_2:3306) and starting replication from the new master host_3( host_3:3306)..
Thu Nov 9 10:59:20 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 10:59:21 2017 - [info] Slave started.
Thu Nov 9 10:59:21 2017 - [info] Waiting to execute all relay logs on host_2( host_2:3306)..
Thu Nov 9 10:59:21 2017 - [info] master_pos_wait( host_3.000049:22035) completed on host_2( host_2:3306). Executed 0 events.
Thu Nov 9 10:59:21 2017 - [info] done.
Thu Nov 9 10:59:21 2017 - [info] done.
Thu Nov 9 10:59:21 2017 - [info] -- Saving binlog from host host_2 started, pid: 184482
Thu Nov 9 10:59:21 2017 - [info] -- Saving binlog from host host_1 started, pid: 184483
Thu Nov 9 10:59:21 2017 - [info] -- Saving binlog from host host_3 started, pid: 184487
Thu Nov 9 10:59:21 2017 - [info]
Thu Nov 9 10:59:21 2017 - [info] Log messages from host_2 ...
Thu Nov 9 10:59:21 2017 - [info]
Thu Nov 9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_2..
Thu Nov 9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053 --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov 9 10:59:21 2017 - [info] End of log messages from host_2.
Thu Nov 9 10:59:21 2017 - [warning] Got error from host_2.
Thu Nov 9 10:59:21 2017 - [info]
Thu Nov 9 10:59:21 2017 - [info] Log messages from host_3 ...
Thu Nov 9 10:59:21 2017 - [info]
Thu Nov 9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_3..
Thu Nov 9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053 --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 10:59:21 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov 9 10:59:21 2017 - [info] End of log messages from host_3.
Thu Nov 9 10:59:21 2017 - [warning] Got error from host_3.
Thu Nov 9 10:59:22 2017 - [info]
Thu Nov 9 10:59:22 2017 - [info] Log messages from host_1 ...
Thu Nov 9 10:59:22 2017 - [info]
Thu Nov 9 10:59:21 2017 - [info] Fetching binary logs from binlog server host_1..
Thu Nov 9 10:59:21 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000053 --start_pos=5229 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 10:59:22 2017 - [info] scp from root@ host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171109105914.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171109105914.binlog succeeded.
Thu Nov 9 10:59:22 2017 - [info] End of log messages from host_1.
Thu Nov 9 10:59:22 2017 - [info] Saved mysqlbinlog size from host_1 is 800 bytes.
Thu Nov 9 10:59:22 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171109105914.binlog ..
Thu Nov 9 10:59:22 2017 - [info] Differential log apply from binlog server succeeded.
Thu Nov 9 10:59:22 2017 - [info] Getting new master's binlog name and position..
Thu Nov 9 10:59:22 2017 - [info] host_2.000003:1680
Thu Nov 9 10:59:22 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 9 10:59:22 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_2.000003, 1680, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
Thu Nov 9 10:59:22 2017 - [info] Executing master IP activate script:
Thu Nov 9 10:59:22 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Thu Nov 9 10:59:24 2017 - [info] OK.
Thu Nov 9 10:59:24 2017 - [info] Setting read_only=0 on host_2( host_2:3306)..
Thu Nov 9 10:59:24 2017 - [info] ok.
Thu Nov 9 10:59:24 2017 - [info] ** Finished master recovery successfully.
Thu Nov 9 10:59:24 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 9 10:59:24 2017 - [info]
Thu Nov 9 10:59:24 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 9 10:59:24 2017 - [info]
Thu Nov 9 10:59:24 2017 - [info]
Thu Nov 9 10:59:24 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov 9 10:59:24 2017 - [info]
Thu Nov 9 10:59:24 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 189393. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109105914.log if it takes time..
Thu Nov 9 10:59:25 2017 - [info]
Thu Nov 9 10:59:25 2017 - [info] Log messages from host_3 ...
Thu Nov 9 10:59:25 2017 - [info]
Thu Nov 9 10:59:24 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_2( host_2:3306)..
Thu Nov 9 10:59:24 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 10:59:25 2017 - [info] Slave started.
Thu Nov 9 10:59:25 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
Thu Nov 9 10:59:25 2017 - [info] End of log messages from host_3.
Thu Nov 9 10:59:25 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Thu Nov 9 10:59:25 2017 - [info] All new slave servers recovered successfully.
Thu Nov 9 10:59:25 2017 - [info]
Thu Nov 9 10:59:25 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 9 10:59:25 2017 - [info]
Thu Nov 9 10:59:25 2017 - [info] Resetting slave info on the new master..
Thu Nov 9 10:59:25 2017 - [info] host_2: Resetting slave info succeeded.
Thu Nov 9 10:59:25 2017 - [info] Master failover to host_2( host_2:3306) completed successfully.
Thu Nov 9 10:59:25 2017 - [info]
Thu Nov 9 10:59:25 2017 - [info] Sending mail..

1.3 slave(候选master)的日志是最新的,比etl要多

  • 1.3.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_1 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Tue Nov 7 17:11:29 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov 7 17:11:29 2017 - [info] Starting master failover.
Tue Nov 7 17:11:29 2017 - [info]
Tue Nov 7 17:11:29 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 7 17:11:29 2017 - [info]
Tue Nov 7 17:11:29 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Nov 7 17:11:29 2017 - [info] Binlog server host_2 is reachable.
Tue Nov 7 17:11:29 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Nov 7 17:11:30 2017 - [info] Binlog server host_1 is reachable.
Tue Nov 7 17:11:30 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Tue Nov 7 17:11:30 2017 - [info] Binlog server host_3 is reachable.
Tue Nov 7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on host_2( host_2:3306)
Tue Nov 7 17:11:30 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Tue Nov 7 17:11:30 2017 - [info] GTID failover mode = 1
Tue Nov 7 17:11:30 2017 - [info] Dead Servers:
Tue Nov 7 17:11:30 2017 - [info] host_1( host_1:3306)
Tue Nov 7 17:11:30 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov 7 17:11:30 2017 - [info] ok.
Tue Nov 7 17:11:30 2017 - [info] Alive Servers:
Tue Nov 7 17:11:30 2017 - [info] host_2( host_2:3306)
Tue Nov 7 17:11:30 2017 - [info] host_3( host_3:3306)
Tue Nov 7 17:11:30 2017 - [info] Alive Slaves:
Tue Nov 7 17:11:30 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 17:11:30 2017 - [info] GTID ON
Tue Nov 7 17:11:30 2017 - [info] Replicating from host_1( host_1:3306)
Tue Nov 7 17:11:30 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 7 17:11:30 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 17:11:30 2017 - [info] GTID ON
Tue Nov 7 17:11:30 2017 - [info] Replicating from host_1( host_1:3306)
Tue Nov 7 17:11:30 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 7 17:11:30 2017 - [info] Starting SQL thread on host_2( host_2:3306) ..
Tue Nov 7 17:11:30 2017 - [info] done.
Tue Nov 7 17:11:30 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Tue Nov 7 17:11:30 2017 - [info] done.
Tue Nov 7 17:11:30 2017 - [info] Starting GTID based failover.
Tue Nov 7 17:11:30 2017 - [info]
Tue Nov 7 17:11:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov 7 17:11:30 2017 - [info]
Tue Nov 7 17:11:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov 7 17:11:30 2017 - [info]
Tue Nov 7 17:11:30 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Nov 7 17:11:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov 7 17:11:31 2017 - [info] Executing master IP deactivation script:
Tue Nov 7 17:11:31 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Tue Nov 7 17:11:33 2017 - [info] done.
Tue Nov 7 17:11:33 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov 7 17:11:33 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] The latest binary log file/position on all slaves is host_1.000051:13508
Tue Nov 7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:3-8
Tue Nov 7 17:11:33 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov 7 17:11:33 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 17:11:33 2017 - [info] GTID ON
Tue Nov 7 17:11:33 2017 - [info] Replicating from host_1( host_1:3306)
Tue Nov 7 17:11:33 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 7 17:11:33 2017 - [info] The oldest binary log file/position on all slaves is host_1.000051:11918
Tue Nov 7 17:11:33 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:2-3,
Tue Nov 7 17:11:33 2017 - [info] Oldest slaves:
Tue Nov 7 17:11:33 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 17:11:33 2017 - [info] GTID ON
Tue Nov 7 17:11:33 2017 - [info] Replicating from host_1( host_1:3306)
Tue Nov 7 17:11:33 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Searching new master from slaves..
Tue Nov 7 17:11:33 2017 - [info] Candidate masters from the configuration file:
Tue Nov 7 17:11:33 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 17:11:33 2017 - [info] GTID ON
Tue Nov 7 17:11:33 2017 - [info] Replicating from host_1( host_1:3306)
Tue Nov 7 17:11:33 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 7 17:11:33 2017 - [info] Non-candidate masters:
Tue Nov 7 17:11:33 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 17:11:33 2017 - [info] GTID ON
Tue Nov 7 17:11:33 2017 - [info] Replicating from host_1( host_1:3306)
Tue Nov 7 17:11:33 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 7 17:11:33 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov 7 17:11:33 2017 - [info] New master is host_2( host_2:3306)
Tue Nov 7 17:11:33 2017 - [info] Starting master failover..
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Waiting all logs to be applied..
Tue Nov 7 17:11:33 2017 - [info] done.
Tue Nov 7 17:11:33 2017 - [info] -- Saving binlog from host host_2 started, pid: 54677
Tue Nov 7 17:11:33 2017 - [info] -- Saving binlog from host host_1 started, pid: 54681
Tue Nov 7 17:11:33 2017 - [info] -- Saving binlog from host host_3 started, pid: 54683
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Log messages from host_3 ...
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_3..
Tue Nov 7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051 --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov 7 17:11:33 2017 - [info] End of log messages from host_3.
Tue Nov 7 17:11:33 2017 - [warning] Got error from host_3.
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Log messages from host_2 ...
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_2..
Tue Nov 7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051 --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 17:11:33 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov 7 17:11:33 2017 - [info] End of log messages from host_2.
Tue Nov 7 17:11:33 2017 - [warning] Got error from host_2.
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Log messages from host_1 ...
Tue Nov 7 17:11:33 2017 - [info]
Tue Nov 7 17:11:33 2017 - [info] Fetching binary logs from binlog server host_1..
Tue Nov 7 17:11:33 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000051 --start_pos=13508 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 17:11:33 2017 - [info] scp from root@ host_1:/var/log/masterha/mha_test/saved_binlog_binlog2_20171107171129.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171107171129.binlog succeeded.
Tue Nov 7 17:11:33 2017 - [info] End of log messages from host_1.
Tue Nov 7 17:11:33 2017 - [info] Saved mysqlbinlog size from host_1 is 8578 bytes.
Tue Nov 7 17:11:33 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_1_binlog2_20171107171129.binlog ..
Tue Nov 7 17:11:33 2017 - [info] Differential log apply from binlog server succeeded.
Tue Nov 7 17:11:33 2017 - [info] Getting new master's binlog name and position..
Tue Nov 7 17:11:33 2017 - [info] host_2.000001:5048
Tue Nov 7 17:11:33 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov 7 17:11:33 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_2.000001, 5048, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Tue Nov 7 17:11:33 2017 - [info] Executing master IP activate script:
Tue Nov 7 17:11:33 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Tue Nov 7 17:11:36 2017 - [info] OK.
Tue Nov 7 17:11:36 2017 - [info] Setting read_only=0 on host_2( host_2:3306)..
Tue Nov 7 17:11:36 2017 - [info] ok.
Tue Nov 7 17:11:36 2017 - [info] ** Finished master recovery successfully.
Tue Nov 7 17:11:36 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov 7 17:11:36 2017 - [info]
Tue Nov 7 17:11:36 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov 7 17:11:36 2017 - [info]
Tue Nov 7 17:11:36 2017 - [info]
Tue Nov 7 17:11:36 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Tue Nov 7 17:11:36 2017 - [info]
Tue Nov 7 17:11:36 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 58422. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171107171129.log if it takes time..
Tue Nov 7 17:11:37 2017 - [info]
Tue Nov 7 17:11:37 2017 - [info] Log messages from host_3 ...
Tue Nov 7 17:11:37 2017 - [info]
Tue Nov 7 17:11:36 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_2( host_2:3306)..
Tue Nov 7 17:11:36 2017 - [info] Executed CHANGE MASTER.
Tue Nov 7 17:11:37 2017 - [info] Slave started.
Tue Nov 7 17:11:37 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-16,
Tue Nov 7 17:11:37 2017 - [info] End of log messages from host_3.
Tue Nov 7 17:11:37 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Tue Nov 7 17:11:37 2017 - [info] All new slave servers recovered successfully.
Tue Nov 7 17:11:37 2017 - [info]
Tue Nov 7 17:11:37 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov 7 17:11:37 2017 - [info]
Tue Nov 7 17:11:37 2017 - [info] Resetting slave info on the new master..
Tue Nov 7 17:11:37 2017 - [info] host_2: Resetting slave info succeeded.
Tue Nov 7 17:11:37 2017 - [info] Master failover to host_2( host_2:3306) completed successfully.
Tue Nov 7 17:11:37 2017 - [info]
Tue Nov 7 17:11:37 2017 - [info] Sending mail..
  • 1.3.2 当master的所有日志已经传递slave,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Tue Nov 7 15:56:11 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov 7 15:56:11 2017 - [info] Starting master failover.
Tue Nov 7 15:56:11 2017 - [info]
Tue Nov 7 15:56:11 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 7 15:56:11 2017 - [info]
Tue Nov 7 15:56:11 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Nov 7 15:56:12 2017 - [info] Binlog server host_2 is reachable.
Tue Nov 7 15:56:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Nov 7 15:56:12 2017 - [info] Binlog server host_1 is reachable.
Tue Nov 7 15:56:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Tue Nov 7 15:56:13 2017 - [info] Binlog server host_3 is reachable.
Tue Nov 7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on host_1( host_1:3306)
Tue Nov 7 15:56:13 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Tue Nov 7 15:56:13 2017 - [info] GTID failover mode = 1
Tue Nov 7 15:56:13 2017 - [info] Dead Servers:
Tue Nov 7 15:56:13 2017 - [info] host_2( host_2:3306)
Tue Nov 7 15:56:13 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov 7 15:56:13 2017 - [info] ok.
Tue Nov 7 15:56:13 2017 - [info] Alive Servers:
Tue Nov 7 15:56:13 2017 - [info] host_1( host_1:3306)
Tue Nov 7 15:56:13 2017 - [info] host_3( host_3:3306)
Tue Nov 7 15:56:13 2017 - [info] Alive Slaves:
Tue Nov 7 15:56:13 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 15:56:13 2017 - [info] GTID ON
Tue Nov 7 15:56:13 2017 - [info] Replicating from host_2( host_2:3306)
Tue Nov 7 15:56:13 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 7 15:56:13 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 15:56:13 2017 - [info] GTID ON
Tue Nov 7 15:56:13 2017 - [info] Replicating from host_2( host_2:3306)
Tue Nov 7 15:56:13 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 7 15:56:13 2017 - [info] Starting SQL thread on host_1( host_1:3306) ..
Tue Nov 7 15:56:13 2017 - [info] done.
Tue Nov 7 15:56:13 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Tue Nov 7 15:56:13 2017 - [info] done.
Tue Nov 7 15:56:13 2017 - [info] Starting GTID based failover.
Tue Nov 7 15:56:13 2017 - [info]
Tue Nov 7 15:56:13 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov 7 15:56:13 2017 - [info]
Tue Nov 7 15:56:13 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov 7 15:56:13 2017 - [info]
Tue Nov 7 15:56:13 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Nov 7 15:56:13 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov 7 15:56:13 2017 - [info] Executing master IP deactivation script:
Tue Nov 7 15:56:13 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
Tue Nov 7 15:56:16 2017 - [info] done.
Tue Nov 7 15:56:16 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov 7 15:56:16 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] The latest binary log file/position on all slaves is host_2.000049:11291
Tue Nov 7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446352
Tue Nov 7 15:56:16 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov 7 15:56:16 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 15:56:16 2017 - [info] GTID ON
Tue Nov 7 15:56:16 2017 - [info] Replicating from host_2( host_2:3306)
Tue Nov 7 15:56:16 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 7 15:56:16 2017 - [info] The oldest binary log file/position on all slaves is host_2.000049:10703
Tue Nov 7 15:56:16 2017 - [info] Retrieved Gtid Set: ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:3-446350
Tue Nov 7 15:56:16 2017 - [info] Oldest slaves:
Tue Nov 7 15:56:16 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 15:56:16 2017 - [info] GTID ON
Tue Nov 7 15:56:16 2017 - [info] Replicating from host_2( host_2:3306)
Tue Nov 7 15:56:16 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Searching new master from slaves..
Tue Nov 7 15:56:16 2017 - [info] Candidate masters from the configuration file:
Tue Nov 7 15:56:16 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 15:56:16 2017 - [info] GTID ON
Tue Nov 7 15:56:16 2017 - [info] Replicating from host_2( host_2:3306)
Tue Nov 7 15:56:16 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 7 15:56:16 2017 - [info] Non-candidate masters:
Tue Nov 7 15:56:16 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 7 15:56:16 2017 - [info] GTID ON
Tue Nov 7 15:56:16 2017 - [info] Replicating from host_2( host_2:3306)
Tue Nov 7 15:56:16 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 7 15:56:16 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov 7 15:56:16 2017 - [info] New master is host_1( host_1:3306)
Tue Nov 7 15:56:16 2017 - [info] Starting master failover..
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Tue Nov 7 15:56:16 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Waiting all logs to be applied..
Tue Nov 7 15:56:16 2017 - [info] done.
Tue Nov 7 15:56:16 2017 - [info] -- Saving binlog from host host_2 started, pid: 79759
Tue Nov 7 15:56:16 2017 - [info] -- Saving binlog from host host_1 started, pid: 79768
Tue Nov 7 15:56:16 2017 - [info] -- Saving binlog from host host_3 started, pid: 79770
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:17 2017 - [info] Log messages from host_1 ...
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_1..
Tue Nov 7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049 --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov 7 15:56:17 2017 - [info] End of log messages from host_1.
Tue Nov 7 15:56:17 2017 - [warning] Got error from host_1.
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:17 2017 - [info] Log messages from host_3 ...
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_3..
Tue Nov 7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049 --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov 7 15:56:17 2017 - [info] End of log messages from host_3.
Tue Nov 7 15:56:17 2017 - [warning] Got error from host_3.
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:17 2017 - [info] Log messages from host_2 ...
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_2..
Tue Nov 7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049 --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 15:56:17 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog succeeded.
Tue Nov 7 15:56:17 2017 - [info] End of log messages from host_2.
Tue Nov 7 15:56:17 2017 - [info] Saved mysqlbinlog size from host_2 is 768 bytes.
Tue Nov 7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog ..
Tue Nov 7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.
Tue Nov 7 15:56:17 2017 - [info] Getting new master's binlog name and position..
Tue Nov 7 15:56:17 2017 - [info] host_1.000051:11449
Tue Nov 7 15:56:17 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov 7 15:56:17 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000051, 11449, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,
Tue Nov 7 15:56:17 2017 - [info] Executing master IP activate script:
Tue Nov 7 15:56:17 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Tue Nov 7 15:56:20 2017 - [info] OK.
Tue Nov 7 15:56:20 2017 - [info] Setting read_only=0 on host_1( host_1:3306)..
Tue Nov 7 15:56:20 2017 - [info] ok.
Tue Nov 7 15:56:20 2017 - [info] ** Finished master recovery successfully.
Tue Nov 7 15:56:20 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov 7 15:56:20 2017 - [info]
Tue Nov 7 15:56:20 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov 7 15:56:20 2017 - [info]
Tue Nov 7 15:56:20 2017 - [info]
Tue Nov 7 15:56:20 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Tue Nov 7 15:56:20 2017 - [info]
Tue Nov 7 15:56:20 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 85941. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171107155611.log if it takes time..
Tue Nov 7 15:56:21 2017 - [info]
Tue Nov 7 15:56:21 2017 - [info] Log messages from host_3 ...
Tue Nov 7 15:56:21 2017 - [info]
Tue Nov 7 15:56:20 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_1( host_1:3306)..
Tue Nov 7 15:56:20 2017 - [info] Executed CHANGE MASTER.
Tue Nov 7 15:56:21 2017 - [info] Slave started.
Tue Nov 7 15:56:21 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1,
Tue Nov 7 15:56:21 2017 - [info] End of log messages from host_3.
Tue Nov 7 15:56:21 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Tue Nov 7 15:56:21 2017 - [info] All new slave servers recovered successfully.
Tue Nov 7 15:56:21 2017 - [info]
Tue Nov 7 15:56:21 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov 7 15:56:21 2017 - [info]
Tue Nov 7 15:56:21 2017 - [info] Resetting slave info on the new master..
Tue Nov 7 15:56:21 2017 - [info] host_1: Resetting slave info succeeded.
Tue Nov 7 15:56:21 2017 - [info] Master failover to host_1( host_1:3306) completed successfully.
Tue Nov 7 15:56:21 2017 - [info]
Tue Nov 7 15:56:21 2017 - [info] Sending mail..

1.4 slave(候选master)上面有大事务在跑

  • 1000s的大查询
1
无影响,正常切换
  • flush tables with readlock
1
2
3
4
5
6
7
8
9
10
11
12
dba:(none)> show processlist;
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 63 | dba | localhost | NULL | Query | 0 | starting | show processlist |
| 65 | dba | xx:11164 | NULL | Sleep | 121 | | NULL |
| 83 | dba | new master:49022 | NULL | Query | 176 | Waiting for global read lock | BINLOG '
GpAKWhNYUy1LMAAAAGYHAAAAAG0AAAAAAAEAAmxjAAh0X2NoYXJfMgACAw8CLAEC
GpAKWh5YUy1LJwAAAI0HAAAAAG |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)

1.5 binlog server 不同场景的测试

dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl (这是最严苛的情况)

  • binlog server 写3台
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_1 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Tue Nov 7 15:56:17 2017 - [info] Log messages from host_1 ...
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_1..
Tue Nov 7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049 --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov 7 15:56:17 2017 - [info] End of log messages from host_1.
Tue Nov 7 15:56:17 2017 - [warning] Got error from host_1.
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:17 2017 - [info] Log messages from host_3 ...
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_3..
Tue Nov 7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049 --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 15:56:17 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Tue Nov 7 15:56:17 2017 - [info] End of log messages from host_3.
Tue Nov 7 15:56:17 2017 - [warning] Got error from host_3.
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:17 2017 - [info] Log messages from host_2 ...
Tue Nov 7 15:56:17 2017 - [info]
Tue Nov 7 15:56:16 2017 - [info] Fetching binary logs from binlog server host_2..
Tue Nov 7 15:56:16 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000049 --start_pos=11291 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Tue Nov 7 15:56:17 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171107155611.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog succeeded.
Tue Nov 7 15:56:17 2017 - [info] End of log messages from host_2.
Tue Nov 7 15:56:17 2017 - [info] Saved mysqlbinlog size from host_2 is 768 bytes.
Tue Nov 7 15:56:17 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171107155611.binlog ..
Tue Nov 7 15:56:17 2017 - [info] Differential log apply from binlog server succeeded.
  • binlog server 只写master
1
2
3
4
5
6
7
8
9
10
11
12
13
14
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Thu Nov 9 11:20:04 2017 - [info] -- Saving binlog from host host_2 started, pid: 117389
Thu Nov 9 11:20:05 2017 - [info]
Thu Nov 9 11:20:05 2017 - [info] Log messages from host_2 ...
Thu Nov 9 11:20:05 2017 - [info]
Thu Nov 9 11:20:04 2017 - [info] Fetching binary logs from binlog server host_2..
Thu Nov 9 11:20:04 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000004 --start_pos=1115 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 11:20:05 2017 - [info] scp from root@ host_2:/var/log/masterha/mha_test/saved_binlog_binlog1_20171109111957.binlog to local:/var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109111957.binlog succeeded.
Thu Nov 9 11:20:05 2017 - [info] End of log messages from host_2.
Thu Nov 9 11:20:05 2017 - [info] Saved mysqlbinlog size from host_2 is 4444 bytes.
Thu Nov 9 11:20:05 2017 - [info] Applying differential binlog /var/log/masterha/mha_test/saved_binlog_ host_2_binlog1_20171109111957.binlog ..
Thu Nov 9 11:20:05 2017 - [info] Differential log apply from binlog server succeeded.
  • binlog server 只写slave
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
### 3台服务器的GTID状态
* master host_1
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_1.000055 | 6016 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave host_2
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-21,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
Auto_Position: 1
* etl host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446366-446369
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
Auto_Position: 1
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_1 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Thu Nov 9 15:00:09 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 9 15:00:09 2017 - [info] Starting master failover.
Thu Nov 9 15:00:09 2017 - [info]
Thu Nov 9 15:00:09 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 9 15:00:09 2017 - [info]
Thu Nov 9 15:00:09 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 9 15:00:09 2017 - [info] Binlog server host_2 is reachable.
Thu Nov 9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on host_2( host_2:3306)
Thu Nov 9 15:00:10 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Thu Nov 9 15:00:10 2017 - [info] GTID failover mode = 1
Thu Nov 9 15:00:10 2017 - [info] Dead Servers:
Thu Nov 9 15:00:10 2017 - [info] host_1( host_1:3306)
Thu Nov 9 15:00:10 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 9 15:00:10 2017 - [info] ok.
Thu Nov 9 15:00:10 2017 - [info] Alive Servers:
Thu Nov 9 15:00:10 2017 - [info] host_2( host_2:3306)
Thu Nov 9 15:00:10 2017 - [info] host_3( host_3:3306)
Thu Nov 9 15:00:10 2017 - [info] Alive Slaves:
Thu Nov 9 15:00:10 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 15:00:10 2017 - [info] GTID ON
Thu Nov 9 15:00:10 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 15:00:10 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 15:00:10 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 15:00:10 2017 - [info] GTID ON
Thu Nov 9 15:00:10 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 15:00:10 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 15:00:10 2017 - [info] Starting SQL thread on host_2( host_2:3306) ..
Thu Nov 9 15:00:10 2017 - [info] done.
Thu Nov 9 15:00:10 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Thu Nov 9 15:00:10 2017 - [info] done.
Thu Nov 9 15:00:10 2017 - [info] Starting GTID based failover.
Thu Nov 9 15:00:10 2017 - [info]
Thu Nov 9 15:00:10 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 9 15:00:10 2017 - [info]
Thu Nov 9 15:00:10 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 9 15:00:10 2017 - [info]
Thu Nov 9 15:00:10 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 9 15:00:10 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 9 15:00:10 2017 - [info] Executing master IP deactivation script:
Thu Nov 9 15:00:10 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
Thu Nov 9 15:00:17 2017 - [info] done.
Thu Nov 9 15:00:17 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 9 15:00:17 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] The latest binary log file/position on all slaves is host_1.000055:4090
Thu Nov 9 15:00:17 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:22-25,
Thu Nov 9 15:00:17 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 9 15:00:17 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 15:00:17 2017 - [info] GTID ON
Thu Nov 9 15:00:17 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 15:00:17 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 15:00:17 2017 - [info] The oldest binary log file/position on all slaves is host_1.000055:2806
Thu Nov 9 15:00:17 2017 - [info] Oldest slaves:
Thu Nov 9 15:00:17 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 15:00:17 2017 - [info] GTID ON
Thu Nov 9 15:00:17 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 15:00:17 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] Searching new master from slaves..
Thu Nov 9 15:00:17 2017 - [info] Candidate masters from the configuration file:
Thu Nov 9 15:00:17 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 15:00:17 2017 - [info] GTID ON
Thu Nov 9 15:00:17 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 15:00:17 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 15:00:17 2017 - [info] Non-candidate masters:
Thu Nov 9 15:00:17 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 15:00:17 2017 - [info] GTID ON
Thu Nov 9 15:00:17 2017 - [info] Replicating from host_1( host_1:3306)
Thu Nov 9 15:00:17 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 15:00:17 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 9 15:00:17 2017 - [info] Not found.
Thu Nov 9 15:00:17 2017 - [info] Searching from all candidate_master slaves..
Thu Nov 9 15:00:17 2017 - [info] New master is host_2( host_2:3306)
Thu Nov 9 15:00:17 2017 - [info] Starting master failover..
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov 9 15:00:17 2017 - [info]
Thu Nov 9 15:00:17 2017 - [info] Waiting all logs to be applied..
Thu Nov 9 15:00:17 2017 - [info] done.
Thu Nov 9 15:00:17 2017 - [info] Replicating from the latest slave host_3( host_3:3306) and waiting to apply..
Thu Nov 9 15:00:17 2017 - [info] Waiting all logs to be applied on the latest slave..
Thu Nov 9 15:00:17 2017 - [info] Resetting slave host_2( host_2:3306) and starting replication from the new master host_3( host_3:3306)..
Thu Nov 9 15:00:17 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 15:00:18 2017 - [info] Slave started.
Thu Nov 9 15:00:18 2017 - [info] Waiting to execute all relay logs on host_2( host_2:3306)..
Thu Nov 9 15:00:18 2017 - [info] master_pos_wait( host_3.000049:25843) completed on host_2( host_2:3306). Executed 0 events.
Thu Nov 9 15:00:18 2017 - [info] done.
Thu Nov 9 15:00:18 2017 - [info] done.
Thu Nov 9 15:00:18 2017 - [info] -- Saving binlog from host host_2 started, pid: 175683
Thu Nov 9 15:00:18 2017 - [info]
Thu Nov 9 15:00:18 2017 - [info] Log messages from host_2 ...
Thu Nov 9 15:00:18 2017 - [info]
Thu Nov 9 15:00:18 2017 - [info] Fetching binary logs from binlog server host_2..
Thu Nov 9 15:00:18 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000055 --start_pos=4090 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171109150009.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Thu Nov 9 15:00:18 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Thu Nov 9 15:00:18 2017 - [info] End of log messages from host_2.
Thu Nov 9 15:00:18 2017 - [warning] Got error from host_2.
Thu Nov 9 15:00:18 2017 - [info] Getting new master's binlog name and position..
Thu Nov 9 15:00:18 2017 - [info] host_2.000005:1390
Thu Nov 9 15:00:18 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 9 15:00:18 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_2.000005, 1390, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
Thu Nov 9 15:00:18 2017 - [info] Executing master IP activate script:
Thu Nov 9 15:00:18 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Thu Nov 9 15:00:22 2017 - [info] OK.
Thu Nov 9 15:00:22 2017 - [info] Setting read_only=0 on host_2( host_2:3306)..
Thu Nov 9 15:00:22 2017 - [info] ok.
Thu Nov 9 15:00:22 2017 - [info] ** Finished master recovery successfully.
Thu Nov 9 15:00:22 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 9 15:00:22 2017 - [info]
Thu Nov 9 15:00:22 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 9 15:00:22 2017 - [info]
Thu Nov 9 15:00:22 2017 - [info]
Thu Nov 9 15:00:22 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov 9 15:00:22 2017 - [info]
Thu Nov 9 15:00:22 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 180681. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109150009.log if it takes time..
Thu Nov 9 15:00:23 2017 - [info]
Thu Nov 9 15:00:23 2017 - [info] Log messages from host_3 ...
Thu Nov 9 15:00:23 2017 - [info]
Thu Nov 9 15:00:22 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_2( host_2:3306)..
Thu Nov 9 15:00:22 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 15:00:23 2017 - [info] Slave started.
Thu Nov 9 15:00:23 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-25,
Thu Nov 9 15:00:23 2017 - [info] End of log messages from host_3.
Thu Nov 9 15:00:23 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Thu Nov 9 15:00:23 2017 - [info] All new slave servers recovered successfully.
Thu Nov 9 15:00:23 2017 - [info]
Thu Nov 9 15:00:23 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 9 15:00:23 2017 - [info]
Thu Nov 9 15:00:23 2017 - [info] Resetting slave info on the new master..
Thu Nov 9 15:00:23 2017 - [info] host_2: Resetting slave info succeeded.
Thu Nov 9 15:00:23 2017 - [info] Master failover to host_2( host_2:3306) completed successfully.
Thu Nov 9 15:00:23 2017 - [info]
Thu Nov 9 15:00:23 2017 - [info] Sending mail..

结论: 由于binlog server没有配置master,所以会丢失master没有传递过来的事务日志
好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

  • binlog server 啥都不写
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
### 3台DB的GTID状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_2.000005 | 5785 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446378 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave host_1
Retrieved_Gtid_Set:
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446369
Auto_Position: 1
* etl host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372
Auto_Position: 1
### 切换日志
Thu Nov 9 16:22:41 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 9 16:22:41 2017 - [info] Starting master failover.
Thu Nov 9 16:22:41 2017 - [info]
Thu Nov 9 16:22:41 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 9 16:22:41 2017 - [info]
Thu Nov 9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on host_1( host_1:3306)
Thu Nov 9 16:22:41 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Thu Nov 9 16:22:41 2017 - [info] GTID failover mode = 1
Thu Nov 9 16:22:41 2017 - [info] Dead Servers:
Thu Nov 9 16:22:41 2017 - [info] host_2( host_2:3306)
Thu Nov 9 16:22:41 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 9 16:22:41 2017 - [info] ok.
Thu Nov 9 16:22:41 2017 - [info] Alive Servers:
Thu Nov 9 16:22:41 2017 - [info] host_1( host_1:3306)
Thu Nov 9 16:22:41 2017 - [info] host_3( host_3:3306)
Thu Nov 9 16:22:41 2017 - [info] Alive Slaves:
Thu Nov 9 16:22:41 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 16:22:41 2017 - [info] GTID ON
Thu Nov 9 16:22:41 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 16:22:41 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 16:22:41 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 16:22:41 2017 - [info] GTID ON
Thu Nov 9 16:22:41 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 16:22:41 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 16:22:41 2017 - [info] Starting SQL thread on host_1( host_1:3306) ..
Thu Nov 9 16:22:41 2017 - [info] done.
Thu Nov 9 16:22:41 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Thu Nov 9 16:22:41 2017 - [info] done.
Thu Nov 9 16:22:41 2017 - [info] Starting GTID based failover.
Thu Nov 9 16:22:41 2017 - [info]
Thu Nov 9 16:22:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 9 16:22:41 2017 - [info]
Thu Nov 9 16:22:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 9 16:22:41 2017 - [info]
Thu Nov 9 16:22:42 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 9 16:22:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 9 16:22:42 2017 - [info] Executing master IP deactivation script:
Thu Nov 9 16:22:42 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-09 16:22:42-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 9.79M=0s
2017-11-09 16:22:44 (9.79 MB/s) - 已写入标准输出 [38]
Thu Nov 9 16:22:44 2017 - [info] done.
Thu Nov 9 16:22:44 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 9 16:22:44 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] The latest binary log file/position on all slaves is host_2.000005:4015
Thu Nov 9 16:22:44 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:26-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446370-446372
Thu Nov 9 16:22:44 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 9 16:22:44 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 16:22:44 2017 - [info] GTID ON
Thu Nov 9 16:22:44 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 16:22:44 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 16:22:44 2017 - [info] The oldest binary log file/position on all slaves is host_2.000005:3130
Thu Nov 9 16:22:44 2017 - [info] Oldest slaves:
Thu Nov 9 16:22:44 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 16:22:44 2017 - [info] GTID ON
Thu Nov 9 16:22:44 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 16:22:44 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] Searching new master from slaves..
Thu Nov 9 16:22:44 2017 - [info] Candidate masters from the configuration file:
Thu Nov 9 16:22:44 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 16:22:44 2017 - [info] GTID ON
Thu Nov 9 16:22:44 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 16:22:44 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 9 16:22:44 2017 - [info] Non-candidate masters:
Thu Nov 9 16:22:44 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 9 16:22:44 2017 - [info] GTID ON
Thu Nov 9 16:22:44 2017 - [info] Replicating from host_2( host_2:3306)
Thu Nov 9 16:22:44 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 9 16:22:44 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 9 16:22:44 2017 - [info] Not found.
Thu Nov 9 16:22:44 2017 - [info] Searching from all candidate_master slaves..
Thu Nov 9 16:22:44 2017 - [info] New master is host_1( host_1:3306)
Thu Nov 9 16:22:44 2017 - [info] Starting master failover..
Thu Nov 9 16:22:44 2017 - [info]
From:
host_2( host_2:3306) (current master)
+-- host_1( host_1:3306)
+-- host_3( host_3:3306)
To:
host_1( host_1:3306) (new master)
+-- host_3( host_3:3306)
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Thu Nov 9 16:22:44 2017 - [info]
Thu Nov 9 16:22:44 2017 - [info] Waiting all logs to be applied..
Thu Nov 9 16:22:44 2017 - [info] done.
Thu Nov 9 16:22:44 2017 - [info] Replicating from the latest slave host_3( host_3:3306) and waiting to apply..
Thu Nov 9 16:22:44 2017 - [info] Waiting all logs to be applied on the latest slave..
Thu Nov 9 16:22:44 2017 - [info] Resetting slave host_1( host_1:3306) and starting replication from the new master host_3( host_3:3306)..
Thu Nov 9 16:22:44 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 16:22:45 2017 - [info] Slave started.
Thu Nov 9 16:22:45 2017 - [info] Waiting to execute all relay logs on host_1( host_1:3306)..
Thu Nov 9 16:22:45 2017 - [info] master_pos_wait( host_3.000049:28663) completed on host_1( host_1:3306). Executed 0 events.
Thu Nov 9 16:22:45 2017 - [info] done.
Thu Nov 9 16:22:45 2017 - [info] done.
Thu Nov 9 16:22:45 2017 - [info] Getting new master's binlog name and position..
Thu Nov 9 16:22:45 2017 - [info] host_1.000056:1170
Thu Nov 9 16:22:45 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 9 16:22:45 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000056, 1170, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372
Thu Nov 9 16:22:45 2017 - [info] Executing master IP activate script:
Thu Nov 9 16:22:45 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Thu Nov 9 16:22:47 2017 - [info] OK.
Thu Nov 9 16:22:47 2017 - [info] Setting read_only=0 on host_1( host_1:3306)..
Thu Nov 9 16:22:47 2017 - [info] ok.
Thu Nov 9 16:22:47 2017 - [info] ** Finished master recovery successfully.
Thu Nov 9 16:22:47 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 9 16:22:47 2017 - [info]
Thu Nov 9 16:22:47 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 9 16:22:47 2017 - [info]
Thu Nov 9 16:22:47 2017 - [info]
Thu Nov 9 16:22:47 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Thu Nov 9 16:22:47 2017 - [info]
Thu Nov 9 16:22:47 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 112317. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171109162241.log if it takes time..
Thu Nov 9 16:22:48 2017 - [info]
Thu Nov 9 16:22:48 2017 - [info] Log messages from host_3 ...
Thu Nov 9 16:22:48 2017 - [info]
Thu Nov 9 16:22:47 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_1( host_1:3306)..
Thu Nov 9 16:22:47 2017 - [info] Executed CHANGE MASTER.
Thu Nov 9 16:22:48 2017 - [info] Slave started.
Thu Nov 9 16:22:48 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-31,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446372) completed on host_3( host_3:3306). Executed 0 events.
Thu Nov 9 16:22:48 2017 - [info] End of log messages from host_3.
Thu Nov 9 16:22:48 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Thu Nov 9 16:22:48 2017 - [info] All new slave servers recovered successfully.
Thu Nov 9 16:22:48 2017 - [info]
Thu Nov 9 16:22:48 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 9 16:22:48 2017 - [info]
Thu Nov 9 16:22:48 2017 - [info] Resetting slave info on the new master..
Thu Nov 9 16:22:49 2017 - [info] host_1: Resetting slave info succeeded.
Thu Nov 9 16:22:49 2017 - [info] Master failover to host_1( host_1:3306) completed successfully.
Thu Nov 9 16:22:49 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2( host_2:3306) to host_1( host_1:3306) succeeded
Master host_2( host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2( host_2:3306)
Selected host_1( host_1:3306) as a new master.
host_1( host_1:3306): OK: Applying all logs succeeded.
host_1( host_1:3306): OK: Activated master IP address.
host_3( host_3:3306): OK: Slave started, replicating from host_1( host_1:3306)
host_1( host_1:3306): Resetting slave info succeeded.
Master failover to host_1( host_1:3306) completed successfully.

结论: 由于binlog server没有配置master,所以会丢失master没有传递过来的事务日志
好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

1.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?

  • 99%的场景都是可以重新执行的

  • 1%的场景不能再次执行,执行会报错

一般这种场景就是:已经failover到最后的change master阶段,这样主从结构已经变更,MHA无法重新走一遍。
不过,即便到这步骤失败了,表示master的日志已经补完,由于是gtid模式,自己再让slave change master到最新的master即可,最后ACTIVE new ip和readonly=1就好了

1
2
3
4
5
6
7
8
Thu Nov 9 16:49:39 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 9 16:49:39 2017 - [info] Starting master failover.
Thu Nov 9 16:49:39 2017 - [info]
Thu Nov 9 16:49:39 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 9 16:49:39 2017 - [info]
Thu Nov 9 16:49:39 2017 - [info] GTID failover mode = 1
Thu Nov 9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master host_1( host_1:3306) does not match with specified dead master host_2( host_2:3306)!
Thu Nov 9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53

1.7 Master:MySQL down小结

1
2
3
4
5
6
7
8
9
10
11
12
1. failover最终命令
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
2. binlog server建议
配置master就可以了
[binlog1]
$master_ip
只配置slave,或者没有配置,会导致丢失部分没有从master传递过来的日志事务

二、Master : Server down

2.1 etl 延迟8小时

1
同1.1 结论

2.2 slave(候选master)比etl还要落后更多

  • 2.2.1 当master的部分日志还没传递两个slave,这时候master server挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
### 3台DB的GTID状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_2.000008 | 5445 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave host_1
Retrieved_Gtid_Set:
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446385
Auto_Position: 1
* etl host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388
Auto_Position: 1
### 模拟故障场景
* 隔离master的网络,让其等同于down机
master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Fri Nov 10 11:12:38 2017 - [info] MHA::MasterFailover version 0.56.
Fri Nov 10 11:12:38 2017 - [info] Starting master failover.
Fri Nov 10 11:12:38 2017 - [info]
Fri Nov 10 11:12:38 2017 - [info] * Phase 1: Configuration Check Phase..
Fri Nov 10 11:12:38 2017 - [info]
Fri Nov 10 11:13:28 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 11:13:28 2017 - [warning] Failed to SSH to binlog server host_2
Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Fri Nov 10 11:13:29 2017 - [info] Binlog server host_1 is reachable.
Fri Nov 10 11:13:29 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Fri Nov 10 11:13:29 2017 - [info] Binlog server host_3 is reachable.
Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on host_1( host_1:3306)
Fri Nov 10 11:13:29 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Fri Nov 10 11:13:29 2017 - [info] GTID failover mode = 1
Fri Nov 10 11:13:29 2017 - [info] Dead Servers:
Fri Nov 10 11:13:29 2017 - [info] host_2( host_2:3306)
Fri Nov 10 11:13:29 2017 - [info] Checking master reachability via MySQL(double check)...
Fri Nov 10 11:13:30 2017 - [info] ok.
Fri Nov 10 11:13:30 2017 - [info] Alive Servers:
Fri Nov 10 11:13:30 2017 - [info] host_1( host_1:3306)
Fri Nov 10 11:13:30 2017 - [info] host_3( host_3:3306)
Fri Nov 10 11:13:30 2017 - [info] Alive Slaves:
Fri Nov 10 11:13:30 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:13:30 2017 - [info] GTID ON
Fri Nov 10 11:13:30 2017 - [info] Replicating from host_2( host_2:3306)
Fri Nov 10 11:13:30 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 11:13:30 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:13:30 2017 - [info] GTID ON
Fri Nov 10 11:13:30 2017 - [info] Replicating from host_2( host_2:3306)
Fri Nov 10 11:13:30 2017 - [info] Not candidate for the new Master (no_master is set)
Fri Nov 10 11:13:30 2017 - [info] Starting SQL thread on host_1( host_1:3306) ..
Fri Nov 10 11:13:30 2017 - [info] done.
Fri Nov 10 11:13:30 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Fri Nov 10 11:13:30 2017 - [info] done.
Fri Nov 10 11:13:30 2017 - [info] Starting GTID based failover.
Fri Nov 10 11:13:30 2017 - [info]
Fri Nov 10 11:13:30 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Nov 10 11:13:30 2017 - [info]
Fri Nov 10 11:13:30 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Nov 10 11:13:30 2017 - [info]
Fri Nov 10 11:14:20 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 11:14:20 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Nov 10 11:14:20 2017 - [info] Executing master IP deactivation script:
Fri Nov 10 11:14:20 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --command=stop
ssh: connect to host host_2 port 22: Connection timed out
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-10 11:14:27-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT
0K 11.4M=0s
2017-11-10 11:16:27 (11.4 MB/s) - 已写入标准输出 [38]
Fri Nov 10 11:16:27 2017 - [info] done.
Fri Nov 10 11:16:27 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Nov 10 11:16:27 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3: Master Recovery Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] The latest binary log file/position on all slaves is host_2.000008:4265
Fri Nov 10 11:16:27 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:46-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446386-446388
Fri Nov 10 11:16:27 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Nov 10 11:16:27 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info] GTID ON
Fri Nov 10 11:16:27 2017 - [info] Replicating from host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info] Not candidate for the new Master (no_master is set)
Fri Nov 10 11:16:27 2017 - [info] The oldest binary log file/position on all slaves is host_2.000008:3380
Fri Nov 10 11:16:27 2017 - [info] Oldest slaves:
Fri Nov 10 11:16:27 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info] GTID ON
Fri Nov 10 11:16:27 2017 - [info] Replicating from host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: Determining New Master Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] Searching new master from slaves..
Fri Nov 10 11:16:27 2017 - [info] Candidate masters from the configuration file:
Fri Nov 10 11:16:27 2017 - [info] host_1( host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info] GTID ON
Fri Nov 10 11:16:27 2017 - [info] Replicating from host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 11:16:27 2017 - [info] Non-candidate masters:
Fri Nov 10 11:16:27 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 11:16:27 2017 - [info] GTID ON
Fri Nov 10 11:16:27 2017 - [info] Replicating from host_2( host_2:3306)
Fri Nov 10 11:16:27 2017 - [info] Not candidate for the new Master (no_master is set)
Fri Nov 10 11:16:27 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Nov 10 11:16:27 2017 - [info] Not found.
Fri Nov 10 11:16:27 2017 - [info] Searching from all candidate_master slaves..
Fri Nov 10 11:16:27 2017 - [info] New master is host_1( host_1:3306)
Fri Nov 10 11:16:27 2017 - [info] Starting master failover..
Fri Nov 10 11:16:27 2017 - [info]
From:
host_2( host_2:3306) (current master)
+-- host_1( host_1:3306)
+-- host_3( host_3:3306)
To:
host_1( host_1:3306) (new master)
+-- host_3( host_3:3306)
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Nov 10 11:16:27 2017 - [info]
Fri Nov 10 11:16:27 2017 - [info] Waiting all logs to be applied..
Fri Nov 10 11:16:27 2017 - [info] done.
Fri Nov 10 11:16:27 2017 - [info] Replicating from the latest slave host_3( host_3:3306) and waiting to apply..
Fri Nov 10 11:16:27 2017 - [info] Waiting all logs to be applied on the latest slave..
Fri Nov 10 11:16:27 2017 - [info] Resetting slave host_1( host_1:3306) and starting replication from the new master host_3( host_3:3306)..
Fri Nov 10 11:16:27 2017 - [info] Executed CHANGE MASTER.
Fri Nov 10 11:16:28 2017 - [info] Slave started.
Fri Nov 10 11:16:28 2017 - [info] Waiting to execute all relay logs on host_1( host_1:3306)..
Fri Nov 10 11:16:28 2017 - [info] master_pos_wait( host_3.000049:40136) completed on host_1( host_1:3306). Executed 0 events.
Fri Nov 10 11:16:28 2017 - [info] done.
Fri Nov 10 11:16:28 2017 - [info] done.
Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_2 started, pid: 43038
Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_1 started, pid: 43039
Fri Nov 10 11:16:28 2017 - [info] -- Saving binlog from host host_3 started, pid: 43041
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Log messages from host_2 ...
Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_2.
Fri Nov 10 11:16:28 2017 - [warning] SSH is not reachable on host_2. Skipping
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Log messages from host_1 ...
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server host_1..
Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000008 --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog2_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
at /usr/bin/save_binary_logs line 123
eval {...} called at /usr/bin/save_binary_logs line 70
main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_1.
Fri Nov 10 11:16:28 2017 - [warning] Got error from host_1.
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Log messages from host_3 ...
Fri Nov 10 11:16:28 2017 - [info]
Fri Nov 10 11:16:28 2017 - [info] Fetching binary logs from binlog server host_3..
Fri Nov 10 11:16:28 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_2.000008 --start_pos=4265 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110111238.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
at /usr/bin/save_binary_logs line 123
eval {...} called at /usr/bin/save_binary_logs line 70
main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 11:16:28 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 11:16:28 2017 - [info] End of log messages from host_3.
Fri Nov 10 11:16:28 2017 - [warning] Got error from host_3.
Fri Nov 10 11:16:28 2017 - [info] Getting new master's binlog name and position..
Fri Nov 10 11:16:28 2017 - [info] host_1.000058:4059
Fri Nov 10 11:16:28 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_1', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Nov 10 11:16:28 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_1.000058, 4059, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388
Fri Nov 10 11:16:28 2017 - [info] Executing master IP activate script:
Fri Nov 10 11:16:28 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_2 --orig_master_ip= host_2 --orig_master_port=3306 --new_master_host= host_1 --new_master_ip= host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Fri Nov 10 11:16:30 2017 - [info] OK.
Fri Nov 10 11:16:30 2017 - [info] ** Finished master recovery successfully.
Fri Nov 10 11:16:30 2017 - [info] * Phase 3: Master Recovery Phase completed.
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] * Phase 4: Slaves Recovery Phase..
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Nov 10 11:16:30 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 46878. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171110111238.log if it takes time..
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:31 2017 - [info] Log messages from host_3 ...
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:30 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_1( host_1:3306)..
Fri Nov 10 11:16:30 2017 - [info] Executed CHANGE MASTER.
Fri Nov 10 11:16:31 2017 - [info] Slave started.
Fri Nov 10 11:16:31 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446388) completed on host_3( host_3:3306). Executed 0 events.
Fri Nov 10 11:16:31 2017 - [info] End of log messages from host_3.
Fri Nov 10 11:16:31 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Fri Nov 10 11:16:31 2017 - [info] All new slave servers recovered successfully.
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:31 2017 - [info] * Phase 5: New master cleanup phase..
Fri Nov 10 11:16:31 2017 - [info]
Fri Nov 10 11:16:31 2017 - [info] Resetting slave info on the new master..
Fri Nov 10 11:16:31 2017 - [info] host_1: Resetting slave info succeeded.
Fri Nov 10 11:16:31 2017 - [info] Master failover to host_1( host_1:3306) completed successfully.
Fri Nov 10 11:16:31 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2( host_2:3306) to host_1( host_1:3306) succeeded
Master host_2( host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2( host_2:3306)
Selected host_1( host_1:3306) as a new master.
host_1( host_1:3306): OK: Applying all logs succeeded.
host_1( host_1:3306): OK: Activated master IP address.
host_3( host_3:3306): OK: Slave started, replicating from host_1( host_1:3306)
host_1( host_1:3306): Resetting slave info succeeded.
Master failover to host_1( host_1:3306) completed successfully.
Fri Nov 10 11:16:31 2017 - [info] Sending mail..
### 最后一步很重要
如果dead master之后又活过来了,那么这一步要做
dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==》 tgw章节详细描述

结论: 由于master 已挂,然而最后的日志没有传递到其他服务器,所以会丢失master没有传递过来的事务日志
好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

  • 2.2.2 当master的所有日志已经传递到1个etl,这时候master server挂了
1
测试省略,和2.2.1基本一样

结论:由于master上的所有日志全部传递到etl,所以最后是不会丢失master上任何数据的。

2.3 slave(候选master)的日志是最新的,比etl要多

  • 2.3.1 当master的部分日志还没传递两个slave,这时候master server挂了
1
测试省略,和2.2.1基本一样

结论: 由于master 已挂,然而最后的日志没有传递到其他服务器,所以会丢失master没有传递过来的事务日志
好在,slave和etl之间会互相change master,所以尽管slave(candidate master)的日志落后,最终也还是用etl的日志补齐了slave缺失的日志。

  • 2.3.2 当master的所有日志已经传递slave,这时候master server挂了
1
测试省略,和2.2.1基本一样

结论:由于master上的所有日志全部传递到slave,所以最后是不会丢失master上任何数据的。

2.4 slave(候选master)上面有大事务在跑

  • 1000s的大查询
1
同1.4结论
  • flush tables with readlock
1
2
同1.4结论

2.5 binlog server 不同场景的测试

dead_master上的最后部分日志没有传递到slave和etl的情况, 然而slave的日志也落后etl (这是最严苛的情况)

  • binlog server 写3台
1
2.2.1 测试的就是这种情况,详细日志切换请看2.2.1

结论: 由于binlog server配置了3台,但是由于master server已经挂掉,无法从master的binlog server上获取日志,所以会丢失master上没有传递的日志事务

  • binlog server 只写master
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
### 3台DB的gtid 状态
* master host_1
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_1.000058 | 8517 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave host_2
Retrieved_Gtid_Set:
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
Auto_Position: 1
* etl host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
Auto_Position: 1
### 模拟故障
master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 故障切换
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_1 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Fri Nov 10 14:15:51 2017 - [info] MHA::MasterFailover version 0.56.
Fri Nov 10 14:15:51 2017 - [info] Starting master failover.
Fri Nov 10 14:15:51 2017 - [info]
Fri Nov 10 14:15:51 2017 - [info] * Phase 1: Configuration Check Phase..
Fri Nov 10 14:15:51 2017 - [info]
Fri Nov 10 14:16:41 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 14:16:41 2017 - [warning] Failed to SSH to binlog server host_1
Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ServerManager.pm, ln239] Binlog Server is defined but there is no alive server.
Fri Nov 10 14:16:41 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/share/perl5/vendor_perl/MHA/MasterFailover.pm line 2082

结论: binlog server 必须要配置一个活的 server,如果只配置master,如果master挂了,那么就等于一个都没有,MHA不会切换

  • binlog server 只写slave
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
### 3台DB的gtid 状态
* master host_1
dba:lc> show master status;
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
| host_1.000058 | 8517 | | | 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-60,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392 |
+---------------------+----------+--------------+------------------+------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
* slave host_2
Retrieved_Gtid_Set:
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-50,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
Auto_Position: 1
* etl host_3
Retrieved_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
Executed_Gtid_Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
Auto_Position: 1
### 模拟故障
master> iptables -A INPUT -p tcp -s other_host --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 故障切换
Fri Nov 10 14:29:50 2017 - [info] MHA::MasterFailover version 0.56.
Fri Nov 10 14:29:50 2017 - [info] Starting master failover.
Fri Nov 10 14:29:50 2017 - [info]
Fri Nov 10 14:29:50 2017 - [info] * Phase 1: Configuration Check Phase..
Fri Nov 10 14:29:50 2017 - [info]
Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Fri Nov 10 14:29:50 2017 - [info] Binlog server host_2 is reachable.
Fri Nov 10 14:29:50 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Fri Nov 10 14:29:50 2017 - [info] Binlog server host_3 is reachable.
Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on host_2( host_2:3306)
Fri Nov 10 14:29:50 2017 - [warning] SQL Thread is stopped(no error) on host_3( host_3:3306)
Fri Nov 10 14:29:50 2017 - [info] GTID failover mode = 1
Fri Nov 10 14:29:50 2017 - [info] Dead Servers:
Fri Nov 10 14:29:50 2017 - [info] host_1( host_1:3306)
Fri Nov 10 14:29:50 2017 - [info] Checking master reachability via MySQL(double check)...
Fri Nov 10 14:29:51 2017 - [info] ok.
Fri Nov 10 14:29:51 2017 - [info] Alive Servers:
Fri Nov 10 14:29:51 2017 - [info] host_2( host_2:3306)
Fri Nov 10 14:29:51 2017 - [info] host_3( host_3:3306)
Fri Nov 10 14:29:51 2017 - [info] Alive Slaves:
Fri Nov 10 14:29:51 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:29:51 2017 - [info] GTID ON
Fri Nov 10 14:29:51 2017 - [info] Replicating from host_1( host_1:3306)
Fri Nov 10 14:29:51 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 14:29:51 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:29:51 2017 - [info] GTID ON
Fri Nov 10 14:29:51 2017 - [info] Replicating from host_1( host_1:3306)
Fri Nov 10 14:29:51 2017 - [info] Not candidate for the new Master (no_master is set)
Fri Nov 10 14:29:51 2017 - [info] Starting SQL thread on host_2( host_2:3306) ..
Fri Nov 10 14:29:51 2017 - [info] done.
Fri Nov 10 14:29:51 2017 - [info] Starting SQL thread on host_3( host_3:3306) ..
Fri Nov 10 14:29:52 2017 - [info] done.
Fri Nov 10 14:29:52 2017 - [info] Starting GTID based failover.
Fri Nov 10 14:29:52 2017 - [info]
Fri Nov 10 14:29:52 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Nov 10 14:29:52 2017 - [info]
Fri Nov 10 14:29:52 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Nov 10 14:29:52 2017 - [info]
Fri Nov 10 14:30:42 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Fri Nov 10 14:30:42 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Nov 10 14:30:42 2017 - [info] Executing master IP deactivation script:
Fri Nov 10 14:30:42 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --command=stop
ssh: connect to host host_1 port 22: Connection timed out
=================== swift vip : tgw_vip from host_1 is deleted ==============================
--2017-11-10 14:30:49-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 12.1M=0s
2017-11-10 14:32:47 (12.1 MB/s) - 已写入标准输出 [38]
Fri Nov 10 14:32:47 2017 - [info] done.
Fri Nov 10 14:32:47 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Nov 10 14:32:47 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3: Master Recovery Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] The latest binary log file/position on all slaves is host_1.000058:6912
Fri Nov 10 14:32:47 2017 - [info] Retrieved Gtid Set: 0923e916-3c36-11e6-82a5-ecf4bbf1f518:51-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:446389-446392
Fri Nov 10 14:32:47 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Nov 10 14:32:47 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info] GTID ON
Fri Nov 10 14:32:47 2017 - [info] Replicating from host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info] Not candidate for the new Master (no_master is set)
Fri Nov 10 14:32:47 2017 - [info] The oldest binary log file/position on all slaves is host_1.000058:5307
Fri Nov 10 14:32:47 2017 - [info] Oldest slaves:
Fri Nov 10 14:32:47 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info] GTID ON
Fri Nov 10 14:32:47 2017 - [info] Replicating from host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: Determining New Master Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] Searching new master from slaves..
Fri Nov 10 14:32:47 2017 - [info] Candidate masters from the configuration file:
Fri Nov 10 14:32:47 2017 - [info] host_2( host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info] GTID ON
Fri Nov 10 14:32:47 2017 - [info] Replicating from host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Nov 10 14:32:47 2017 - [info] Non-candidate masters:
Fri Nov 10 14:32:47 2017 - [info] host_3( host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Fri Nov 10 14:32:47 2017 - [info] GTID ON
Fri Nov 10 14:32:47 2017 - [info] Replicating from host_1( host_1:3306)
Fri Nov 10 14:32:47 2017 - [info] Not candidate for the new Master (no_master is set)
Fri Nov 10 14:32:47 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Nov 10 14:32:47 2017 - [info] Not found.
Fri Nov 10 14:32:47 2017 - [info] Searching from all candidate_master slaves..
Fri Nov 10 14:32:47 2017 - [info] New master is host_2( host_2:3306)
Fri Nov 10 14:32:47 2017 - [info] Starting master failover..
Fri Nov 10 14:32:47 2017 - [info]
From:
host_1( host_1:3306) (current master)
+-- host_2( host_2:3306)
+-- host_3( host_3:3306)
To:
host_2( host_2:3306) (new master)
+-- host_3( host_3:3306)
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Nov 10 14:32:47 2017 - [info]
Fri Nov 10 14:32:47 2017 - [info] Waiting all logs to be applied..
Fri Nov 10 14:32:47 2017 - [info] done.
Fri Nov 10 14:32:47 2017 - [info] Replicating from the latest slave host_3( host_3:3306) and waiting to apply..
Fri Nov 10 14:32:47 2017 - [info] Waiting all logs to be applied on the latest slave..
Fri Nov 10 14:32:47 2017 - [info] Resetting slave host_2( host_2:3306) and starting replication from the new master host_3( host_3:3306)..
Fri Nov 10 14:32:47 2017 - [info] Executed CHANGE MASTER.
Fri Nov 10 14:32:48 2017 - [info] Slave started.
Fri Nov 10 14:32:48 2017 - [info] Waiting to execute all relay logs on host_2( host_2:3306)..
Fri Nov 10 14:32:48 2017 - [info] master_pos_wait( host_3.000049:42954) completed on host_2( host_2:3306). Executed 0 events.
Fri Nov 10 14:32:48 2017 - [info] done.
Fri Nov 10 14:32:48 2017 - [info] done.
Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host host_2 started, pid: 76664
Fri Nov 10 14:32:48 2017 - [info] -- Saving binlog from host host_3 started, pid: 76665
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Log messages from host_2 ...
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server host_2..
Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000058 --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog1_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
at /usr/bin/save_binary_logs line 123
eval {...} called at /usr/bin/save_binary_logs line 70
main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 14:32:48 2017 - [info] End of log messages from host_2.
Fri Nov 10 14:32:48 2017 - [warning] Got error from host_2.
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Log messages from host_3 ...
Fri Nov 10 14:32:48 2017 - [info]
Fri Nov 10 14:32:48 2017 - [info] Fetching binary logs from binlog server host_3..
Fri Nov 10 14:32:48 2017 - [info] Executing binlog save command: save_binary_logs --command=save --start_file= host_1.000058 --start_pos=6912 --output_file=/var/log/masterha/mha_test/saved_binlog_binlog3_20171110142950.binlog --handle_raw_binlog=0 --skip_filter=1 --disable_log_bin=0 --manager_version=0.56 --oldest_version=5.7.13-log --binlog_dir=/data/mysql.bin
Failed to save binary log: Binlog not found from /data/mysql.bin! If you got this error at MHA Manager, please set "master_binlog_dir=/path/to/binlog_directory_of_the_master" correctly in the MHA Manager's configuration file and try again.
at /usr/bin/save_binary_logs line 123
eval {...} called at /usr/bin/save_binary_logs line 70
main::main() called at /usr/bin/save_binary_logs line 66
Fri Nov 10 14:32:48 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln660] Failed to save binary log events from the binlog server. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
Fri Nov 10 14:32:48 2017 - [info] End of log messages from host_3.
Fri Nov 10 14:32:48 2017 - [warning] Got error from host_3.
Fri Nov 10 14:32:48 2017 - [info] Getting new master's binlog name and position..
Fri Nov 10 14:32:48 2017 - [info] host_2.000008:6895
Fri Nov 10 14:32:48 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST=' host_2', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Fri Nov 10 14:32:48 2017 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: host_2.000008, 6895, 0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392
Fri Nov 10 14:32:48 2017 - [info] Executing master IP activate script:
Fri Nov 10 14:32:48 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host= host_1 --orig_master_ip= host_1 --orig_master_port=3306 --new_master_host= host_2 --new_master_ip= host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_2 is added ==============================
Fri Nov 10 14:32:51 2017 - [info] OK.
Fri Nov 10 14:32:51 2017 - [info] ** Finished master recovery successfully.
Fri Nov 10 14:32:51 2017 - [info] * Phase 3: Master Recovery Phase completed.
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] * Phase 4: Slaves Recovery Phase..
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Nov 10 14:32:51 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] -- Slave recovery on host host_3( host_3:3306) started, pid: 80398. Check tmp log /var/log/masterha/mha_test/ host_3_3306_20171110142950.log if it takes time..
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:52 2017 - [info] Log messages from host_3 ...
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:51 2017 - [info] Resetting slave host_3( host_3:3306) and starting replication from the new master host_2( host_2:3306)..
Fri Nov 10 14:32:51 2017 - [info] Executed CHANGE MASTER.
Fri Nov 10 14:32:52 2017 - [info] Slave started.
Fri Nov 10 14:32:52 2017 - [info] gtid_wait(0923e916-3c36-11e6-82a5-ecf4bbf1f518:1-55,
ebd9ff93-c5b2-11e6-b21d-ecf4bbf1f42c:1-446392) completed on host_3( host_3:3306). Executed 0 events.
Fri Nov 10 14:32:52 2017 - [info] End of log messages from host_3.
Fri Nov 10 14:32:52 2017 - [info] -- Slave on host host_3( host_3:3306) started.
Fri Nov 10 14:32:52 2017 - [info] All new slave servers recovered successfully.
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:52 2017 - [info] * Phase 5: New master cleanup phase..
Fri Nov 10 14:32:52 2017 - [info]
Fri Nov 10 14:32:52 2017 - [info] Resetting slave info on the new master..
Fri Nov 10 14:32:52 2017 - [info] host_2: Resetting slave info succeeded.
Fri Nov 10 14:32:52 2017 - [info] Master failover to host_2( host_2:3306) completed successfully.
Fri Nov 10 14:32:52 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_1( host_1:3306) to host_2( host_2:3306) succeeded
Master host_1( host_1:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1( host_1:3306)
Selected host_2( host_2:3306) as a new master.
host_2( host_2:3306): OK: Applying all logs succeeded.
host_2( host_2:3306): OK: Activated master IP address.
host_3( host_3:3306): OK: Slave started, replicating from host_2( host_2:3306)
host_2( host_2:3306): Resetting slave info succeeded.
Master failover to host_2( host_2:3306) completed successfully.
Fri Nov 10 14:32:52 2017 - [info] Sending mail..

结论: binlog server 配置成多台slave,这是正确的方案。 由于master 挂了,master没有传递过来的binlog会丢失,这是没办法的. 好在,其余slave自动补齐现有日志

  • binlog server 啥都不写
1
2
3
4
会切换成功,由于master 挂了,master没有传递过来的binlog会丢失
好在,其余slave自动补齐现有日志

2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?

1
同1.6结论

三、遇到的坑

3.1 交互模式下,如果没有及时敲’YES’,则终止切换

四、总结

MHA + GTID 模式,重点配置和用法如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
1. command
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host= host_1 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
2. binlog server
在配置文件中对 master,slave,etl 都写在binlog server中。对MySQL down 和 DB server down 综合考虑下,建议这样配置。
3. tgw 清理
dead master 如果还可以起来,那么必须在上面执行: /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
原因可参看:http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节

五、流程简介

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
* Phase 1: Configuration Check Phase..
HealthCheck: SSH N台DB是否reachable
Binlog server: N台DB 是否reachable
GTID failover mode = ?
Dead Servers is
Primary candidate for the new Master (candidate_master is set) ?
* Phase 2: Dead Master Shutdown Phase..
Executing master IP deactivation script:
TGW-vip delete操作
shutdown_script: ?
* Phase 3: Master Recovery Phase..
* Phase 3.1: Getting Latest Slaves Phase..
Latest slaves ,file position
Oldest slaves , file position
* Phase 3.3: Determining New Master Phase..
选择哪个slave为new master
* Phase 3.3: New Master Recovery Phase..
Replicating from the latest slave and waiting to apply.. --让new master change master 到 latest slave
Waiting all logs to be applied on the latest slave.. --让new master跟 latest slave的日志保持一致
Saving binlog from binlog server。。。 --根据配置的binlog server,开始生成latest slave和dead master之间的diff日志
Applying differential binlog --apply这些差异日志到new master,让new master执行完所有缺失的日志
Getting new master's binlog name and position.. --获取new master现在的binlog file pos
All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST, MASTER_PORT=3306, MASTER_AUTO_POSITION=1
Executing master IP activate script: --TGW-vip 激活操作,并且设置readonly=0
* Phase 4: Slaves Recovery Phase.. (并行操作)
Resetting slave and starting replication from the new master
* Phase 5: New master cleanup phase..
Resetting slave info on the new master.
Contents
  1. 1. 假定环境(经典三节点)
  2. 2. 一、Master : MySQL down
    1. 2.1. 1.1 etl 延迟8小时
    2. 2.2. 1.2 slave(候选master)比etl还要落后更多
    3. 2.3. 1.3 slave(候选master)的日志是最新的,比etl要多
    4. 2.4. 1.4 slave(候选master)上面有大事务在跑
    5. 2.5. 1.5 binlog server 不同场景的测试
    6. 2.6. 1.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
    7. 2.7. 1.7 Master:MySQL down小结
  3. 3. 二、Master : Server down
    1. 3.1. 2.1 etl 延迟8小时
    2. 3.2. 2.2 slave(候选master)比etl还要落后更多
    3. 3.3. 2.3 slave(候选master)的日志是最新的,比etl要多
    4. 3.4. 2.4 slave(候选master)上面有大事务在跑
    5. 3.5. 2.5 binlog server 不同场景的测试
    6. 3.6. 2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
  4. 4. 三、遇到的坑
    1. 4.1. 3.1 交互模式下,如果没有及时敲’YES’,则终止切换
  5. 5. 四、总结
  6. 6. 五、流程简介

幸福,不在于得到的多

而在于计较的少