Contents
  1. 1. MHA failover NON-GTID 专题
    1. 1.1. 假定环境(经典三节点)
    2. 1.2. 一、Master : MySQL down
      1. 1.2.1. 1.1 etl 延迟8小时
      2. 1.2.2. 1.2 slave(候选master)比etl还要落后更多
      3. 1.2.3. 1.3 slave(候选master)的日志是最新的,比etl要多
      4. 1.2.4. 1.4 slave(候选master)上面有大事务在跑
      5. 1.2.5. 1.5 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
      6. 1.2.6. 1.7 Master:MySQL down小结
    3. 1.3. 二、Master : Server down
      1. 1.3.1. 2.1 etl 延迟8小时
      2. 1.3.2. 2.2 slave(候选master)比etl还要落后更多
      3. 1.3.3. 2.3 slave(候选master)的日志是最新的,比etl要多
      4. 1.3.4. 2.4 slave(候选master)上面有大事务在跑
      5. 1.3.5. 2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
    4. 1.4. 三、遇到的坑
      1. 1.4.1. 3.1 交互模式下,如果没有及时敲’YES’,则终止切换
      2. 1.4.2. 3.2 如果在non-gtid模式的机器上,配置了binlog server,会有什么影响呢?
      3. 1.4.3. 3.3 不要在relay-log的地方伪造日志
      4. 1.4.4. 3.4 flush tables with read lock 会阻塞
      5. 1.4.5. 3.5 binlog伪造
    5. 1.5. 四、总结
    6. 1.6. 五、 流程简介

MHA failover NON-GTID 专题

这里以masterha_master_switch为背景详解各种可能遇到的场景

假定环境(经典三节点)

1
2
3
host_1(host_1:3306) (current master)
+--host_2(host_2:3306 slave[candidate master])
+--host_3(host_3:3306 etl)

一、Master : MySQL down

1.1 etl 延迟8小时

1
配置文件中加上no_check_delay=0 即可忽略报错

1.2 slave(候选master)比etl还要落后更多

  • 1.2.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
### 模拟现场,现场的3台DB binlog状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000002 | 1920 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
* slave (candidate master) host_1
Master_Log_File: host_2.000002
Read_Master_Log_Pos: 150
Exec_Master_Log_Pos: 150
* etl (other slave)
Master_Log_File: host_2.000002
Read_Master_Log_Pos: 1035
Exec_Master_Log_Pos: 1035
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Mon Nov 13 16:24:03 2017 - [info] MHA::MasterFailover version 0.56.
Mon Nov 13 16:24:03 2017 - [info] Starting master failover.
Mon Nov 13 16:24:03 2017 - [info]
Mon Nov 13 16:24:03 2017 - [info] * Phase 1: Configuration Check Phase..
Mon Nov 13 16:24:03 2017 - [info]
Mon Nov 13 16:24:03 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Mon Nov 13 16:24:03 2017 - [info] Binlog server host_2 is reachable.
Mon Nov 13 16:24:06 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Mon Nov 13 16:24:06 2017 - [info] Binlog server host_3 is reachable.
Mon Nov 13 16:24:06 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Mon Nov 13 16:24:06 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Mon Nov 13 16:24:06 2017 - [info] GTID failover mode = 0
Mon Nov 13 16:24:06 2017 - [info] Dead Servers:
Mon Nov 13 16:24:06 2017 - [info] host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Checking master reachability via MySQL(double check)...
Mon Nov 13 16:24:06 2017 - [info] ok.
Mon Nov 13 16:24:06 2017 - [info] Alive Servers:
Mon Nov 13 16:24:06 2017 - [info] host_1(host_1:3306)
Mon Nov 13 16:24:06 2017 - [info] host_3(host_3:3306)
Mon Nov 13 16:24:06 2017 - [info] Alive Slaves:
Mon Nov 13 16:24:06 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:06 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:06 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:06 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:06 2017 - [info] Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:06 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Mon Nov 13 16:24:06 2017 - [info] done.
Mon Nov 13 16:24:06 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Mon Nov 13 16:24:06 2017 - [info] done.
Mon Nov 13 16:24:06 2017 - [info] Starting Non-GTID based failover.
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Mon Nov 13 16:24:06 2017 - [info]
Mon Nov 13 16:24:06 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Mon Nov 13 16:24:06 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Mon Nov 13 16:24:06 2017 - [info] Executing master IP deactivation script:
Mon Nov 13 16:24:06 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-13 16:24:07-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 11.2M=0s
2017-11-13 16:24:11 (11.2 MB/s) - 已写入标准输出 [38]
Mon Nov 13 16:24:11 2017 - [info] done.
Mon Nov 13 16:24:11 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Mon Nov 13 16:24:11 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3: Master Recovery Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] The latest binary log file/position on all slaves is host_2.000002:1035
Mon Nov 13 16:24:11 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Mon Nov 13 16:24:11 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:11 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:11 2017 - [info] Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:11 2017 - [info] The oldest binary log file/position on all slaves is host_2.000002:150
Mon Nov 13 16:24:11 2017 - [info] Oldest slaves:
Mon Nov 13 16:24:11 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:11 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:11 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Mon Nov 13 16:24:11 2017 - [info]
Mon Nov 13 16:24:11 2017 - [info] Fetching dead master's binary logs..
Mon Nov 13 16:24:11 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000002 --start_pos=1035 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_2.000002 pos 1035 to host_2.000002 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog ..
Dumping binlog format description event, from position 0 to 150.. ok.
Dumping effective binlog data from /data/mysql.bin/host_2.000002 position 1035 to tail(1939).. ok.
Concat succeeded.
Mon Nov 13 16:24:12 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Mon Nov 13 16:24:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] * Phase 3.3: Determining New Master Phase..
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Mon Nov 13 16:24:13 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Mon Nov 13 16:24:13 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000002 --latest_rmlp=1035 --target_mlf=host_2.000002 --target_rmlp=150 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171113162403 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000005 :
Relay log found at /data/mysql_data, up to host_3-relay-bin.000005
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000005, start_pos:269.
Target relay log FOUND!
Mon Nov 13 16:24:13 2017 - [info] OK. host_3 has all relay logs.
Mon Nov 13 16:24:13 2017 - [info] Searching new master from slaves..
Mon Nov 13 16:24:13 2017 - [info] Candidate masters from the configuration file:
Mon Nov 13 16:24:13 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:13 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:13 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Mon Nov 13 16:24:13 2017 - [info] Non-candidate masters:
Mon Nov 13 16:24:13 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Mon Nov 13 16:24:13 2017 - [info] Replicating from host_2(host_2:3306)
Mon Nov 13 16:24:13 2017 - [info] Not candidate for the new Master (no_master is set)
Mon Nov 13 16:24:13 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Mon Nov 13 16:24:13 2017 - [info] Not found.
Mon Nov 13 16:24:13 2017 - [info] Searching from all candidate_master slaves..
Mon Nov 13 16:24:13 2017 - [info] New master is host_1(host_1:3306)
Mon Nov 13 16:24:13 2017 - [info] Starting master failover..
Mon Nov 13 16:24:13 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Mon Nov 13 16:24:13 2017 - [info]
Mon Nov 13 16:24:13 2017 - [info] Server host_1 received relay logs up to: host_2.000002:150
Mon Nov 13 16:24:13 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000002:1035 (using the latest slave's relay logs)
Mon Nov 13 16:24:13 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Mon Nov 13 16:24:13 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000002 --latest_rmlp=1035 --target_mlf=host_2.000002 --target_rmlp=150 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000005
Mon Nov 13 16:24:15 2017 - [info]
Relay log found at /data/mysql_data, up to host_3-relay-bin.000005
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000005, start_pos:269.
Concat binary/relay logs from host_3-relay-bin.000005 pos 269 to host_3-relay-bin.000005 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog ..
Dumping binlog format description event, from position 0 to 269.. ok.
Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000005 position 269 to tail(1154).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog .
scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog to root@host_1(22) succeeded.
Mon Nov 13 16:24:15 2017 - [info] Generating diff files succeeded.
Mon Nov 13 16:24:15 2017 - [info] Sending binlog..
Mon Nov 13 16:24:16 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:16 2017 - [info]
Mon Nov 13 16:24:16 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Mon Nov 13 16:24:16 2017 - [info]
Mon Nov 13 16:24:16 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Mon Nov 13 16:24:16 2017 - [info] Starting recovery on host_1(host_1:3306)..
Mon Nov 13 16:24:16 2017 - [info] Generating diffs succeeded.
Mon Nov 13 16:24:16 2017 - [info] Waiting until all relay logs are applied.
Mon Nov 13 16:24:16 2017 - [info] done.
Mon Nov 13 16:24:16 2017 - [info] Getting slave status..
Mon Nov 13 16:24:16 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000002:150). No need to recover from Exec_Master_Log_Pos.
Mon Nov 13 16:24:16 2017 - [info] Connecting to the target slave host host_1, running recover script..
Mon Nov 13 16:24:16 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Mon Nov 13 16:24:16 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog.. dumped up to pos 150. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog has effective binlog events from pos 150.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog position 150 to tail(1054).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171113162403.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171113162403.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Mon Nov 13 16:24:16 2017 - [info] All relay logs were successfully applied.
Mon Nov 13 16:24:16 2017 - [info] Getting new master's binlog name and position..
Mon Nov 13 16:24:16 2017 - [info] host_1.000001:3232
Mon Nov 13 16:24:16 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000001', MASTER_LOG_POS=3232, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Mon Nov 13 16:24:16 2017 - [info] Executing master IP activate script:
Mon Nov 13 16:24:16 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Mon Nov 13 16:24:21 2017 - [info] OK.
Mon Nov 13 16:24:21 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Mon Nov 13 16:24:21 2017 - [info] ok.
Mon Nov 13 16:24:21 2017 - [info] ** Finished master recovery successfully.
Mon Nov 13 16:24:21 2017 - [info] * Phase 3: Master Recovery Phase completed.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4: Slaves Recovery Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 104955. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171113162403.log if it takes time..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Log messages from host_3 ...
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Mon Nov 13 16:24:21 2017 - [info] End of log messages from host_3.
Mon Nov 13 16:24:21 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Mon Nov 13 16:24:21 2017 - [info] Generating relay diff files from the latest slave succeeded.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 104966. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171113162403.log if it takes time..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Log messages from host_3 ...
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Sending binlog..
Mon Nov 13 16:24:21 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog succeeded.
Mon Nov 13 16:24:21 2017 - [info] Starting recovery on host_3(host_3:3306)..
Mon Nov 13 16:24:21 2017 - [info] Generating diffs succeeded.
Mon Nov 13 16:24:21 2017 - [info] Waiting until all relay logs are applied.
Mon Nov 13 16:24:21 2017 - [info] done.
Mon Nov 13 16:24:21 2017 - [info] Getting slave status..
Mon Nov 13 16:24:21 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000002:1035). No need to recover from Exec_Master_Log_Pos.
Mon Nov 13 16:24:21 2017 - [info] Connecting to the target slave host host_3, running recover script..
Mon Nov 13 16:24:21 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171113162403 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Mon Nov 13 16:24:21 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171113162403.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Mon Nov 13 16:24:21 2017 - [info] All relay logs were successfully applied.
Mon Nov 13 16:24:21 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Mon Nov 13 16:24:21 2017 - [info] Executed CHANGE MASTER.
Mon Nov 13 16:24:21 2017 - [info] Slave started.
Mon Nov 13 16:24:21 2017 - [info] End of log messages from host_3.
Mon Nov 13 16:24:21 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Mon Nov 13 16:24:21 2017 - [info] All new slave servers recovered successfully.
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] * Phase 5: New master cleanup phase..
Mon Nov 13 16:24:21 2017 - [info]
Mon Nov 13 16:24:21 2017 - [info] Resetting slave info on the new master..
Mon Nov 13 16:24:21 2017 - [info] host_1: Resetting slave info succeeded.
Mon Nov 13 16:24:21 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Mon Nov 13 16:24:21 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Mon Nov 13 16:24:21 2017 - [info] Sending mail..
  • 1.2.2 当master的所有日志已经传递到1个etl,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
### 模拟现场,现场的3台DB binlog状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000010 | 1694 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 164 | 1 |
| 165 | 2 |
| 166 | 3 |
+-----+------+
3 rows in set (0.00 sec)
* slave (candidate master) host_1
Master_Log_File: host_2.000010
Exec_Master_Log_Pos: 806
dba:lc> select * from t_char_2;
Empty set (0.00 sec)
* etl (other slave) host_3
Master_Log_File: host_2.000010
Exec_Master_Log_Pos: 1694
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 164 | 1 |
| 165 | 2 |
| 166 | 3 |
+-----+------+
3 rows in set (0.00 sec)
### 切换日志
Wed Nov 15 10:25:50 2017 - [info] MHA::MasterFailover version 0.56.
Wed Nov 15 10:25:50 2017 - [info] Starting master failover.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] * Phase 1: Configuration Check Phase..
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Wed Nov 15 10:25:50 2017 - [info] GTID failover mode = 0
Wed Nov 15 10:25:50 2017 - [info] Dead Servers:
Wed Nov 15 10:25:50 2017 - [info] host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Checking master reachability via MySQL(double check)...
Wed Nov 15 10:25:50 2017 - [info] ok.
Wed Nov 15 10:25:50 2017 - [info] Alive Servers:
Wed Nov 15 10:25:50 2017 - [info] host_1(host_1:3306)
Wed Nov 15 10:25:50 2017 - [info] host_3(host_3:3306)
Wed Nov 15 10:25:50 2017 - [info] Alive Slaves:
Wed Nov 15 10:25:50 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:50 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:50 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:50 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:50 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:50 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Wed Nov 15 10:25:50 2017 - [info] done.
Wed Nov 15 10:25:50 2017 - [info] Starting Non-GTID based failover.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Nov 15 10:25:50 2017 - [info]
Wed Nov 15 10:25:50 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Wed Nov 15 10:25:51 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Wed Nov 15 10:25:51 2017 - [info] Executing master IP deactivation script:
Wed Nov 15 10:25:51 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-15 10:25:51-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 11.1M=0s
2017-11-15 10:25:53 (11.1 MB/s) - 已写入标准输出 [38]
Wed Nov 15 10:25:53 2017 - [info] done.
Wed Nov 15 10:25:53 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Nov 15 10:25:53 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3: Master Recovery Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] The latest binary log file/position on all slaves is host_2.000010:1694
Wed Nov 15 10:25:53 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Nov 15 10:25:53 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:53 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:53 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:53 2017 - [info] The oldest binary log file/position on all slaves is host_2.000010:806
Wed Nov 15 10:25:53 2017 - [info] Oldest slaves:
Wed Nov 15 10:25:53 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:53 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:53 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Nov 15 10:25:53 2017 - [info]
Wed Nov 15 10:25:53 2017 - [info] Fetching dead master's binary logs..
Wed Nov 15 10:25:53 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000010 --start_pos=1694 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_2.000010 pos 1694 to host_2.000010 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog ..
Dumping binlog format description event, from position 0 to 150.. ok.
Dumping effective binlog data from /data/mysql.bin/host_2.000010 position 1694 to tail(1713).. ok.
Concat succeeded.
Wed Nov 15 10:25:53 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:54 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Wed Nov 15 10:25:54 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] * Phase 3.3: Determining New Master Phase..
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Nov 15 10:25:54 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Wed Nov 15 10:25:54 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000010 --latest_rmlp=1694 --target_mlf=host_2.000010 --target_rmlp=806 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171115102550 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004 :
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:1017.
Target relay log FOUND!
Wed Nov 15 10:25:54 2017 - [info] OK. host_3 has all relay logs.
Wed Nov 15 10:25:54 2017 - [info] Searching new master from slaves..
Wed Nov 15 10:25:54 2017 - [info] Candidate masters from the configuration file:
Wed Nov 15 10:25:54 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:54 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:54 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:25:54 2017 - [info] Non-candidate masters:
Wed Nov 15 10:25:54 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:25:54 2017 - [info] Replicating from host_2(host_2:3306)
Wed Nov 15 10:25:54 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:25:54 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Wed Nov 15 10:25:54 2017 - [info] Not found.
Wed Nov 15 10:25:54 2017 - [info] Searching from all candidate_master slaves..
Wed Nov 15 10:25:54 2017 - [info] New master is host_1(host_1:3306)
Wed Nov 15 10:25:54 2017 - [info] Starting master failover..
Wed Nov 15 10:25:54 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Nov 15 10:25:54 2017 - [info]
Wed Nov 15 10:25:54 2017 - [info] Server host_1 received relay logs up to: host_2.000010:806
Wed Nov 15 10:25:54 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000010:1694 (using the latest slave's relay logs)
Wed Nov 15 10:25:55 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Wed Nov 15 10:25:55 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000010 --latest_rmlp=1694 --target_mlf=host_2.000010 --target_rmlp=806 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004
Wed Nov 15 10:25:55 2017 - [info]
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:1017.
Concat binary/relay logs from host_3-relay-bin.000004 pos 1017 to host_3-relay-bin.000004 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog ..
Dumping binlog format description event, from position 0 to 361.. ok.
Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000004 position 1017 to tail(1905).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog .
scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog to root@host_1(22) succeeded.
Wed Nov 15 10:25:55 2017 - [info] Generating diff files succeeded.
Wed Nov 15 10:25:55 2017 - [info] Sending binlog..
Wed Nov 15 10:25:56 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:56 2017 - [info]
Wed Nov 15 10:25:56 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Nov 15 10:25:56 2017 - [info]
Wed Nov 15 10:25:56 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Nov 15 10:25:56 2017 - [info] Starting recovery on host_1(host_1:3306)..
Wed Nov 15 10:25:56 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:25:56 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:25:56 2017 - [info] done.
Wed Nov 15 10:25:56 2017 - [info] Getting slave status..
Wed Nov 15 10:25:56 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000010:806). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:25:56 2017 - [info] Connecting to the target slave host host_1, running recover script..
Wed Nov 15 10:25:56 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:25:56 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog to /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog.. dumped up to pos 150. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog has effective binlog events from pos 150.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog position 150 to tail(169).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_1_3306.20171115102550.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171115102550.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:25:56 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:25:56 2017 - [info] Getting new master's binlog name and position..
Wed Nov 15 10:25:56 2017 - [info] host_1.000010:2060
Wed Nov 15 10:25:56 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000010', MASTER_LOG_POS=2060, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Nov 15 10:25:56 2017 - [info] Executing master IP activate script:
Wed Nov 15 10:25:56 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Wed Nov 15 10:25:59 2017 - [info] OK.
Wed Nov 15 10:25:59 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Wed Nov 15 10:25:59 2017 - [info] ok.
Wed Nov 15 10:25:59 2017 - [info] ** Finished master recovery successfully.
Wed Nov 15 10:25:59 2017 - [info] * Phase 3: Master Recovery Phase completed.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4: Slaves Recovery Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 125962. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115102550.log if it takes time..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Wed Nov 15 10:25:59 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:25:59 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Wed Nov 15 10:25:59 2017 - [info] Generating relay diff files from the latest slave succeeded.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 125967. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115102550.log if it takes time..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Sending binlog..
Wed Nov 15 10:25:59 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog succeeded.
Wed Nov 15 10:25:59 2017 - [info] Starting recovery on host_3(host_3:3306)..
Wed Nov 15 10:25:59 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:25:59 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:25:59 2017 - [info] done.
Wed Nov 15 10:25:59 2017 - [info] Getting slave status..
Wed Nov 15 10:25:59 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000010:1694). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:25:59 2017 - [info] Connecting to the target slave host host_3, running recover script..
Wed Nov 15 10:25:59 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115102550 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:25:59 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171115102550.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:25:59 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:25:59 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Wed Nov 15 10:25:59 2017 - [info] Executed CHANGE MASTER.
Wed Nov 15 10:25:59 2017 - [info] Slave started.
Wed Nov 15 10:25:59 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:25:59 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:25:59 2017 - [info] All new slave servers recovered successfully.
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] * Phase 5: New master cleanup phase..
Wed Nov 15 10:25:59 2017 - [info]
Wed Nov 15 10:25:59 2017 - [info] Resetting slave info on the new master..
Wed Nov 15 10:25:59 2017 - [info] host_1: Resetting slave info succeeded.
Wed Nov 15 10:25:59 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Wed Nov 15 10:25:59 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Wed Nov 15 10:25:59 2017 - [info] Sending mail..
### 切换后的结果
* new master 和 new etl 数据一致
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 164 | 1 |
| 165 | 2 |
| 166 | 3 |
+-----+------+
3 rows in set (0.00 sec)

1.3 slave(候选master)的日志是最新的,比etl要多

  • 1.3.1 当master的部分日志还没传递两个slave,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
### 模拟现场,现场的3台DB binlog状态
* master host_2
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000004 | 4577 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
| 122 | 10 |
| 123 | 11 |
| 124 | 12 |
| 125 | 13 |
| 126 | 14 |
| 127 | 15 |
+-----+------+
14 rows in set (0.00 sec)
* slave (candidate master) host_1
Master_Log_File: host_2.000004
Exec_Master_Log_Pos: 3683
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
| 122 | 10 |
| 123 | 11 |
| 124 | 12 |
+-----+------+
11 rows in set (0.00 sec)
* etl (other slave) host_3
Master_Log_File: host_2.000004
Exec_Master_Log_Pos: 2789
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
+-----+------+
8 rows in set (0.00 sec)
### 切换日志
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=host_2 --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
Tue Nov 14 17:13:08 2017 - [info] MHA::MasterFailover version 0.56.
Tue Nov 14 17:13:08 2017 - [info] Starting master failover.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] * Phase 1: Configuration Check Phase..
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Tue Nov 14 17:13:08 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Tue Nov 14 17:13:08 2017 - [info] GTID failover mode = 0
Tue Nov 14 17:13:08 2017 - [info] Dead Servers:
Tue Nov 14 17:13:08 2017 - [info] host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Checking master reachability via MySQL(double check)...
Tue Nov 14 17:13:08 2017 - [info] ok.
Tue Nov 14 17:13:08 2017 - [info] Alive Servers:
Tue Nov 14 17:13:08 2017 - [info] host_1(host_1:3306)
Tue Nov 14 17:13:08 2017 - [info] host_3(host_3:3306)
Tue Nov 14 17:13:08 2017 - [info] Alive Slaves:
Tue Nov 14 17:13:08 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:08 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:08 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:08 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:08 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:08 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Tue Nov 14 17:13:08 2017 - [info] done.
Tue Nov 14 17:13:08 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Tue Nov 14 17:13:08 2017 - [info] done.
Tue Nov 14 17:13:08 2017 - [info] Starting Non-GTID based failover.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Nov 14 17:13:08 2017 - [info]
Tue Nov 14 17:13:08 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Tue Nov 14 17:13:09 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Nov 14 17:13:09 2017 - [info] Executing master IP deactivation script:
Tue Nov 14 17:13:09 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-14 17:13:09-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 7.19M=0s
2017-11-14 17:13:11 (7.19 MB/s) - 已写入标准输出 [38]
Tue Nov 14 17:13:11 2017 - [info] done.
Tue Nov 14 17:13:11 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Nov 14 17:13:11 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3: Master Recovery Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] The latest binary log file/position on all slaves is host_2.000004:3683
Tue Nov 14 17:13:11 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Nov 14 17:13:11 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:11 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:11 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:11 2017 - [info] The oldest binary log file/position on all slaves is host_2.000004:2789
Tue Nov 14 17:13:11 2017 - [info] Oldest slaves:
Tue Nov 14 17:13:11 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:11 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:11 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Tue Nov 14 17:13:11 2017 - [info]
Tue Nov 14 17:13:11 2017 - [info] Fetching dead master's binary logs..
Tue Nov 14 17:13:11 2017 - [info] Executing command on the dead master host_2(host_2:3306): save_binary_logs --command=save --start_file=host_2.000004 --start_pos=3683 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_2.000004 pos 3683 to host_2.000004 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog ..
Dumping binlog format description event, from position 0 to 150.. ok.
Dumping effective binlog data from /data/mysql.bin/host_2.000004 position 3683 to tail(4596).. ok.
Concat succeeded.
Tue Nov 14 17:13:12 2017 - [info] scp from root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:12 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Tue Nov 14 17:13:12 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Tue Nov 14 17:13:12 2017 - [info]
Tue Nov 14 17:13:12 2017 - [info] * Phase 3.3: Determining New Master Phase..
Tue Nov 14 17:13:12 2017 - [info]
Tue Nov 14 17:13:12 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Tue Nov 14 17:13:12 2017 - [info] Checking whether host_1 has relay logs from the oldest position..
Tue Nov 14 17:13:12 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000004 --latest_rmlp=3683 --target_mlf=host_2.000004 --target_rmlp=2789 --server_id=1261261646 --workdir=/var/log/masterha/mha_test --timestamp=20171114171308 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_1-relay-bin.000003 :
Relay log found at /data/mysql_data, up to host_1-relay-bin.000003
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_1-relay-bin.000003, start_pos:269.
Target relay log FOUND!
Tue Nov 14 17:13:13 2017 - [info] OK. host_1 has all relay logs.
Tue Nov 14 17:13:13 2017 - [info] Searching new master from slaves..
Tue Nov 14 17:13:13 2017 - [info] Candidate masters from the configuration file:
Tue Nov 14 17:13:13 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:13 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:13 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Tue Nov 14 17:13:13 2017 - [info] Non-candidate masters:
Tue Nov 14 17:13:13 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Tue Nov 14 17:13:13 2017 - [info] Replicating from host_2(host_2:3306)
Tue Nov 14 17:13:13 2017 - [info] Not candidate for the new Master (no_master is set)
Tue Nov 14 17:13:13 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Tue Nov 14 17:13:13 2017 - [info] New master is host_1(host_1:3306)
Tue Nov 14 17:13:13 2017 - [info] Starting master failover..
Tue Nov 14 17:13:13 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Tue Nov 14 17:13:13 2017 - [info] Sending binlog..
Tue Nov 14 17:13:13 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Tue Nov 14 17:13:13 2017 - [info]
Tue Nov 14 17:13:13 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Tue Nov 14 17:13:13 2017 - [info] Starting recovery on host_1(host_1:3306)..
Tue Nov 14 17:13:13 2017 - [info] Generating diffs succeeded.
Tue Nov 14 17:13:13 2017 - [info] Waiting until all relay logs are applied.
Tue Nov 14 17:13:13 2017 - [info] done.
Tue Nov 14 17:13:13 2017 - [info] Getting slave status..
Tue Nov 14 17:13:13 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000004:3683). No need to recover from Exec_Master_Log_Pos.
Tue Nov 14 17:13:13 2017 - [info] Connecting to the target slave host host_1, running recover script..
Tue Nov 14 17:13:13 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Nov 14 17:13:13 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Tue Nov 14 17:13:13 2017 - [info] All relay logs were successfully applied.
Tue Nov 14 17:13:13 2017 - [info] Getting new master's binlog name and position..
Tue Nov 14 17:13:13 2017 - [info] host_1.000003:2347
Tue Nov 14 17:13:13 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000003', MASTER_LOG_POS=2347, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Nov 14 17:13:13 2017 - [info] Executing master IP activate script:
Tue Nov 14 17:13:13 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Tue Nov 14 17:13:16 2017 - [info] OK.
Tue Nov 14 17:13:16 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Tue Nov 14 17:13:16 2017 - [info] ok.
Tue Nov 14 17:13:16 2017 - [info] ** Finished master recovery successfully.
Tue Nov 14 17:13:16 2017 - [info] * Phase 3: Master Recovery Phase completed.
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] * Phase 4: Slaves Recovery Phase..
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Tue Nov 14 17:13:16 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 29885. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171114171308.log if it takes time..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Log messages from host_3 ...
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:16 2017 - [info] Server host_3 received relay logs up to: host_2.000004:2789
Tue Nov 14 17:13:16 2017 - [info] Need to get diffs from the latest slave(host_1) up to: host_2.000004:3683 (using the latest slave's relay logs)
Tue Nov 14 17:13:16 2017 - [info] Connecting to the latest slave host host_1, generating diff relay log files..
Tue Nov 14 17:13:16 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_2.000004 --latest_rmlp=3683 --target_mlf=host_2.000004 --target_rmlp=2789 --server_id=1261261646 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_1-relay-bin.000003
Tue Nov 14 17:13:17 2017 - [info]
Relay log found at /data/mysql_data, up to host_1-relay-bin.000003
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_1-relay-bin.000003, start_pos:269.
Concat binary/relay logs from host_1-relay-bin.000003 pos 269 to host_1-relay-bin.000003 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog ..
Dumping binlog format description event, from position 0 to 269.. ok.
Dumping effective binlog data from /data/mysql_data/host_1-relay-bin.000003 position 269 to tail(1163).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog .
scp host_1.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog to root@host_3(22) succeeded.
Tue Nov 14 17:13:17 2017 - [info] Generating diff files succeeded.
Tue Nov 14 17:13:17 2017 - [info] End of log messages from host_3.
Tue Nov 14 17:13:17 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Tue Nov 14 17:13:17 2017 - [info] Generating relay diff files from the latest slave succeeded.
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 31393. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171114171308.log if it takes time..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Log messages from host_3 ...
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Sending binlog..
Tue Nov 14 17:13:17 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog succeeded.
Tue Nov 14 17:13:17 2017 - [info] Starting recovery on host_3(host_3:3306)..
Tue Nov 14 17:13:17 2017 - [info] Generating diffs succeeded.
Tue Nov 14 17:13:17 2017 - [info] Waiting until all relay logs are applied.
Tue Nov 14 17:13:17 2017 - [info] done.
Tue Nov 14 17:13:17 2017 - [info] Getting slave status..
Tue Nov 14 17:13:17 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000004:2789). No need to recover from Exec_Master_Log_Pos.
Tue Nov 14 17:13:17 2017 - [info] Connecting to the target slave host host_3, running recover script..
Tue Nov 14 17:13:17 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171114171308 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Tue Nov 14 17:13:17 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog.. dumped up to pos 150. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog has effective binlog events from pos 150.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog position 150 to tail(1063).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171114171308.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171114171308.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_2_3306_20171114171308.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Tue Nov 14 17:13:17 2017 - [info] All relay logs were successfully applied.
Tue Nov 14 17:13:17 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Tue Nov 14 17:13:17 2017 - [info] Executed CHANGE MASTER.
Tue Nov 14 17:13:17 2017 - [info] Slave started.
Tue Nov 14 17:13:17 2017 - [info] End of log messages from host_3.
Tue Nov 14 17:13:17 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Tue Nov 14 17:13:17 2017 - [info] All new slave servers recovered successfully.
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] * Phase 5: New master cleanup phase..
Tue Nov 14 17:13:17 2017 - [info]
Tue Nov 14 17:13:17 2017 - [info] Resetting slave info on the new master..
Tue Nov 14 17:13:17 2017 - [info] host_1: Resetting slave info succeeded.
Tue Nov 14 17:13:17 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Tue Nov 14 17:13:17 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_1(host_1:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_1(host_1:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
Tue Nov 14 17:13:17 2017 - [info] Sending mail..
## 切换结果
new master 和 new etl
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 114 | 1 |
| 115 | 2 |
| 116 | 3 |
| 117 | 4 |
| 118 | 5 |
| 119 | 6 |
| 120 | 7 |
| 121 | 8 |
| 122 | 10 |
| 123 | 11 |
| 124 | 12 |
| 125 | 13 |
| 126 | 14 |
| 127 | 15 |
+-----+------+
14 rows in set (0.00 sec)
  • 1.3.2 当master的所有日志已经传递slave,这时候master 上的MySQL挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
### 模拟现场,现场的3台DB binlog状态
* master host_1
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_1.000010 | 3341 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 167 | 1 |
| 168 | 2 |
| 169 | 3 |
+-----+------+
3 rows in set (0.00 sec)
* slave host_2
Master_Log_File: host_1.000010
Exec_Master_Log_Pos: 3341
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 167 | 1 |
| 168 | 2 |
| 169 | 3 |
+-----+------+
3 rows in set (0.00 sec)
* etl host_3
Master_Log_File: host_1.000010
Exec_Master_Log_Pos: 2381
### 切换日志
Wed Nov 15 10:39:36 2017 - [info] MHA::MasterFailover version 0.56.
Wed Nov 15 10:39:36 2017 - [info] Starting master failover.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] * Phase 1: Configuration Check Phase..
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Wed Nov 15 10:39:36 2017 - [info] GTID failover mode = 0
Wed Nov 15 10:39:36 2017 - [info] Dead Servers:
Wed Nov 15 10:39:36 2017 - [info] host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Checking master reachability via MySQL(double check)...
Wed Nov 15 10:39:36 2017 - [info] ok.
Wed Nov 15 10:39:36 2017 - [info] Alive Servers:
Wed Nov 15 10:39:36 2017 - [info] host_2(host_2:3306)
Wed Nov 15 10:39:36 2017 - [info] host_3(host_3:3306)
Wed Nov 15 10:39:36 2017 - [info] Alive Slaves:
Wed Nov 15 10:39:36 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:36 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:36 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:36 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:36 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:36 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Wed Nov 15 10:39:36 2017 - [info] done.
Wed Nov 15 10:39:36 2017 - [info] Starting Non-GTID based failover.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Wed Nov 15 10:39:36 2017 - [info]
Wed Nov 15 10:39:36 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Wed Nov 15 10:39:37 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Wed Nov 15 10:39:37 2017 - [info] Executing master IP deactivation script:
Wed Nov 15 10:39:37 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root
=================== swift vip : tgw_vip from host_1 is deleted ==============================
--2017-11-15 10:39:37-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 11.4M=0s
2017-11-15 10:39:39 (11.4 MB/s) - 已写入标准输出 [38]
Wed Nov 15 10:39:39 2017 - [info] done.
Wed Nov 15 10:39:39 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Wed Nov 15 10:39:39 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3: Master Recovery Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] The latest binary log file/position on all slaves is host_1.000010:3341
Wed Nov 15 10:39:39 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Wed Nov 15 10:39:39 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:39 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:39 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:39 2017 - [info] The oldest binary log file/position on all slaves is host_1.000010:2381
Wed Nov 15 10:39:39 2017 - [info] Oldest slaves:
Wed Nov 15 10:39:39 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:39 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:39 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Wed Nov 15 10:39:39 2017 - [info]
Wed Nov 15 10:39:39 2017 - [info] Fetching dead master's binary logs..
Wed Nov 15 10:39:39 2017 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1.000010 --start_pos=3341 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_1.000010 pos 3341 to host_1.000010 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql.bin/host_1.000010 position 3341 to tail(3364).. ok.
Binlog Checksum enabled
Concat succeeded.
Wed Nov 15 10:39:39 2017 - [info] scp from root@host_1:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:40 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Wed Nov 15 10:39:40 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] * Phase 3.3: Determining New Master Phase..
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Wed Nov 15 10:39:40 2017 - [info] Checking whether host_2 has relay logs from the oldest position..
Wed Nov 15 10:39:40 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1.000010 --latest_rmlp=3341 --target_mlf=host_1.000010 --target_rmlp=2381 --server_id=1261261656 --workdir=/var/log/masterha/mha_test --timestamp=20171115103936 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002 :
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:636.
Target relay log FOUND!
Wed Nov 15 10:39:40 2017 - [info] OK. host_2 has all relay logs.
Wed Nov 15 10:39:40 2017 - [info] Searching new master from slaves..
Wed Nov 15 10:39:40 2017 - [info] Candidate masters from the configuration file:
Wed Nov 15 10:39:40 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:40 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:40 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Wed Nov 15 10:39:40 2017 - [info] Non-candidate masters:
Wed Nov 15 10:39:40 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Wed Nov 15 10:39:40 2017 - [info] Replicating from host_1(host_1:3306)
Wed Nov 15 10:39:40 2017 - [info] Not candidate for the new Master (no_master is set)
Wed Nov 15 10:39:40 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Wed Nov 15 10:39:40 2017 - [info] New master is host_2(host_2:3306)
Wed Nov 15 10:39:40 2017 - [info] Starting master failover..
Wed Nov 15 10:39:40 2017 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Wed Nov 15 10:39:40 2017 - [info]
Wed Nov 15 10:39:40 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Wed Nov 15 10:39:40 2017 - [info] Sending binlog..
Wed Nov 15 10:39:41 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to root@host_2:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:41 2017 - [info]
Wed Nov 15 10:39:41 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Wed Nov 15 10:39:41 2017 - [info]
Wed Nov 15 10:39:41 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Wed Nov 15 10:39:41 2017 - [info] Starting recovery on host_2(host_2:3306)..
Wed Nov 15 10:39:41 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:39:41 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:39:41 2017 - [info] done.
Wed Nov 15 10:39:41 2017 - [info] Getting slave status..
Wed Nov 15 10:39:41 2017 - [info] This slave(host_2)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000010:3341). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:39:41 2017 - [info] Connecting to the target slave host host_2, running recover script..
Wed Nov 15 10:39:41 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_2 --slave_ip=host_2 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:39:41 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog on host_2:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:39:41 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:39:41 2017 - [info] Getting new master's binlog name and position..
Wed Nov 15 10:39:41 2017 - [info] host_2.000011:1307
Wed Nov 15 10:39:41 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2.000011', MASTER_LOG_POS=1307, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Wed Nov 15 10:39:41 2017 - [info] Executing master IP activate script:
Wed Nov 15 10:39:41 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_2 is added ==============================
Wed Nov 15 10:39:44 2017 - [info] OK.
Wed Nov 15 10:39:44 2017 - [info] Setting read_only=0 on host_2(host_2:3306)..
Wed Nov 15 10:39:44 2017 - [info] ok.
Wed Nov 15 10:39:44 2017 - [info] ** Finished master recovery successfully.
Wed Nov 15 10:39:44 2017 - [info] * Phase 3: Master Recovery Phase completed.
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] * Phase 4: Slaves Recovery Phase..
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Wed Nov 15 10:39:44 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 11760. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115103936.log if it takes time..
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:44 2017 - [info] Server host_3 received relay logs up to: host_1.000010:2381
Wed Nov 15 10:39:44 2017 - [info] Need to get diffs from the latest slave(host_2) up to: host_1.000010:3341 (using the latest slave's relay logs)
Wed Nov 15 10:39:44 2017 - [info] Connecting to the latest slave host host_2, generating diff relay log files..
Wed Nov 15 10:39:44 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_1.000010 --latest_rmlp=3341 --target_mlf=host_1.000010 --target_rmlp=2381 --server_id=1261261656 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002
Wed Nov 15 10:39:45 2017 - [info]
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:636.
Concat binary/relay logs from host_2-relay-bin.000002 pos 636 to host_2-relay-bin.000002 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 315.. ok.
Dumping effective binlog data from /data/mysql_data/host_2-relay-bin.000002 position 636 to tail(1596).. ok.
Binlog Checksum enabled
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog .
scp host_2.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog to root@host_3(22) succeeded.
Wed Nov 15 10:39:45 2017 - [info] Generating diff files succeeded.
Wed Nov 15 10:39:45 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:39:45 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:39:45 2017 - [info] Generating relay diff files from the latest slave succeeded.
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Wed Nov 15 10:39:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 12881. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171115103936.log if it takes time..
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] Log messages from host_3 ...
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:39:45 2017 - [info] Sending binlog..
Wed Nov 15 10:39:45 2017 - [info] scp from local:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog to root@host_3:/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog succeeded.
Wed Nov 15 10:39:45 2017 - [info] Starting recovery on host_3(host_3:3306)..
Wed Nov 15 10:39:45 2017 - [info] Generating diffs succeeded.
Wed Nov 15 10:39:45 2017 - [info] Waiting until all relay logs are applied.
Wed Nov 15 10:39:45 2017 - [info] done.
Wed Nov 15 10:39:45 2017 - [info] Getting slave status..
Wed Nov 15 10:39:45 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000010:2381). No need to recover from Exec_Master_Log_Pos.
Wed Nov 15 10:39:45 2017 - [info] Connecting to the target slave host host_3, running recover script..
Wed Nov 15 10:39:45 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171115103936 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Wed Nov 15 10:40:45 2017 - [info]
Concat all apply files to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog ..
Copying the first binlog file /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog to /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog.. ok.
Dumping binlog head events (rotate events), skipping format description events from /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog.. Binlog Checksum enabled
dumped up to pos 154. ok.
/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog has effective binlog events from pos 154.
Dumping effective binlog data from /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog position 154 to tail(177).. ok.
Concat succeeded.
All apply target binary logs are concatinated at /var/log/masterha/mha_test/total_binlog_for_host_3_3306.20171115103936.binlog .
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171115103936.binlog,/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171115103936.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Wed Nov 15 10:40:45 2017 - [info] All relay logs were successfully applied.
Wed Nov 15 10:40:45 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Wed Nov 15 10:40:45 2017 - [info] Executed CHANGE MASTER.
Wed Nov 15 10:40:45 2017 - [info] Slave started.
Wed Nov 15 10:40:45 2017 - [info] End of log messages from host_3.
Wed Nov 15 10:40:45 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Wed Nov 15 10:40:45 2017 - [info] All new slave servers recovered successfully.
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] * Phase 5: New master cleanup phase..
Wed Nov 15 10:40:45 2017 - [info]
Wed Nov 15 10:40:45 2017 - [info] Resetting slave info on the new master..
Wed Nov 15 10:40:45 2017 - [info] host_2: Resetting slave info succeeded.
Wed Nov 15 10:40:45 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
Wed Nov 15 10:40:45 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_2(host_2:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.
Wed Nov 15 10:40:45 2017 - [info] Sending mail..
### 切换结果
new master & new etl 数据一致
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 167 | 1 |
| 168 | 2 |
| 169 | 3 |

1.4 slave(候选master)上面有大事务在跑

  • 1000s的大查询
1
无影响,正常切换

1.5 如果MHA过程中失败,是否可以重新执行MHA的failover呢?

  • 90%的场景都是可以重新执行的

  • 5%的场景不能执行

1
2
3
* NO。1
问题: dead master上面的vip已经delete,再次failover 这一步会报错
方案: a)将dead master上的vip再重新add上去 或者 b)直接过滤掉 delete dead master vip这一步
  • 5%的场景不能再次执行,执行会报错

一般这种场景就是:已经failover到最后的change master阶段,这样主从结构已经变更( candidate slave <==rep== etl ),MHA无法重新走一遍。
好在,change master的语句已经生成,其余slave再次执行一遍日志里面的change master即可

1
2
3
Thu Nov 9 16:49:39 2017 - [info] GTID failover mode = 0
Thu Nov 9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln169] Detected dead master host_1(host_1:3306) does not match with specified dead master host_2(host_2:3306)!
Thu Nov 9 16:49:39 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53

1.7 Master:MySQL down小结

1
2
3
1. failover最终命令
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=$dead_master_ip --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error

二、Master : Server down

2.1 etl 延迟8小时

1
同1.1 结论

2.2 slave(候选master)比etl还要落后更多

  • 2.2.1 当master的部分日志还没传递两个slave,这时候master server挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
### 3台DB的binlog状态
* master
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_2.000012 | 1651 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc>
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 172 | 1 |
| 173 | 2 |
| 174 | 3 |
| 175 | 4 |
+-----+------+
4 rows in set (0.00 sec)
* slave
Master_Log_File: host_2.000012
Exec_Master_Log_Pos: 467
dba:lc> select * from t_char_2;
Empty set (0.00 sec)
* etl
Master_Log_File: host_2.000012
Exec_Master_Log_Pos: 1059
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 172 | 1 |
| 173 | 2 |
+-----+------+
2 rows in set (0.00 sec)
### 模拟故障场景
* 隔离master的网络,让其等同于down机
master> iptables -A INPUT -p tcp -s other_ip --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 切换日志
Thu Nov 16 16:54:40 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 16 16:54:40 2017 - [info] Starting master failover.
Thu Nov 16 16:54:40 2017 - [info]
Thu Nov 16 16:54:40 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 16 16:54:40 2017 - [info]
Thu Nov 16 16:54:40 2017 - [warning] SQL Thread is stopped(no error) on host_1(host_1:3306)
Thu Nov 16 16:54:40 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Thu Nov 16 16:54:40 2017 - [info] GTID failover mode = 0
Thu Nov 16 16:54:40 2017 - [info] Dead Servers:
Thu Nov 16 16:54:40 2017 - [info] host_2(host_2:3306)
Thu Nov 16 16:54:40 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 16 16:54:41 2017 - [info] ok.
Thu Nov 16 16:54:41 2017 - [info] Alive Servers:
Thu Nov 16 16:54:41 2017 - [info] host_1(host_1:3306)
Thu Nov 16 16:54:41 2017 - [info] host_3(host_3:3306)
Thu Nov 16 16:54:41 2017 - [info] Alive Slaves:
Thu Nov 16 16:54:41 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:54:41 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:54:41 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:54:41 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:54:41 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:54:41 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 16:54:41 2017 - [info] Starting SQL thread on host_1(host_1:3306) ..
Thu Nov 16 16:54:41 2017 - [info] done.
Thu Nov 16 16:54:41 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Thu Nov 16 16:54:41 2017 - [info] done.
Thu Nov 16 16:54:41 2017 - [info] Starting Non-GTID based failover.
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:54:41 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:54:41 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 16 16:54:41 2017 - [info]
Thu Nov 16 16:55:31 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_2! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Thu Nov 16 16:55:31 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 16 16:55:31 2017 - [info] Executing master IP deactivation script:
Thu Nov 16 16:55:31 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --command=stop
ssh: connect to host host_2 port 22: Connection timed out
=================== swift vip : tgw_vip from host_2 is deleted ==============================
--2017-11-16 16:55:38-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 12.1M=0s
2017-11-16 16:57:36 (12.1 MB/s) - 已写入标准输出 [38]
Thu Nov 16 16:57:36 2017 - [info] done.
Thu Nov 16 16:57:36 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 16 16:57:36 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] The latest binary log file/position on all slaves is host_2.000012:1059
Thu Nov 16 16:57:36 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 16 16:57:36 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:36 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:36 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 16:57:36 2017 - [info] The oldest binary log file/position on all slaves is host_2.000012:467
Thu Nov 16 16:57:36 2017 - [info] Oldest slaves:
Thu Nov 16 16:57:36 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:36 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:36 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 16 16:57:36 2017 - [info]
Thu Nov 16 16:57:36 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Nov 16 16:57:37 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 16 16:57:37 2017 - [info] Checking whether host_3 has relay logs from the oldest position..
Thu Nov 16 16:57:37 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_2.000012 --latest_rmlp=1059 --target_mlf=host_2.000012 --target_rmlp=467 --server_id=1261261666 --workdir=/var/log/masterha/mha_test --timestamp=20171116165440 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004 :
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:678.
Target relay log FOUND!
Thu Nov 16 16:57:37 2017 - [info] OK. host_3 has all relay logs.
Thu Nov 16 16:57:37 2017 - [info] HealthCheck: SSH to host_1 is reachable.
Thu Nov 16 16:57:37 2017 - [info] Searching new master from slaves..
Thu Nov 16 16:57:37 2017 - [info] Candidate masters from the configuration file:
Thu Nov 16 16:57:37 2017 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:37 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:37 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 16:57:37 2017 - [info] Non-candidate masters:
Thu Nov 16 16:57:37 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 16:57:37 2017 - [info] Replicating from host_2(host_2:3306)
Thu Nov 16 16:57:37 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 16:57:37 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 16 16:57:37 2017 - [info] Not found.
Thu Nov 16 16:57:37 2017 - [info] Searching from all candidate_master slaves..
Thu Nov 16 16:57:37 2017 - [info] New master is host_1(host_1:3306)
Thu Nov 16 16:57:37 2017 - [info] Starting master failover..
Thu Nov 16 16:57:37 2017 - [info]
From:
host_2(host_2:3306) (current master)
+--host_1(host_1:3306)
+--host_3(host_3:3306)
To:
host_1(host_1:3306) (new master)
+--host_3(host_3:3306)
Thu Nov 16 16:57:37 2017 - [info]
Thu Nov 16 16:57:37 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Nov 16 16:57:37 2017 - [info]
Thu Nov 16 16:57:37 2017 - [info] Server host_1 received relay logs up to: host_2.000012:467
Thu Nov 16 16:57:37 2017 - [info] Need to get diffs from the latest slave(host_3) up to: host_2.000012:1059 (using the latest slave's relay logs)
Thu Nov 16 16:57:38 2017 - [info] Connecting to the latest slave host host_3, generating diff relay log files..
Thu Nov 16 16:57:38 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_1 --latest_mlf=host_2.000012 --latest_rmlp=1059 --target_mlf=host_2.000012 --target_rmlp=467 --server_id=1261261666 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171116165440 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3-relay-bin.000004
Thu Nov 16 16:57:38 2017 - [info]
Relay log found at /data/mysql_data, up to host_3-relay-bin.000004
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_3-relay-bin.000004, start_pos:678.
Concat binary/relay logs from host_3-relay-bin.000004 pos 678 to host_3-relay-bin.000004 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog ..
Dumping binlog format description event, from position 0 to 361.. ok.
Dumping effective binlog data from /data/mysql_data/host_3-relay-bin.000004 position 678 to tail(1270).. ok.
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog .
scp host_3.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog to root@host_1(22) succeeded.
Thu Nov 16 16:57:38 2017 - [info] Generating diff files succeeded.
Thu Nov 16 16:57:38 2017 - [info]
Thu Nov 16 16:57:38 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Nov 16 16:57:38 2017 - [info]
Thu Nov 16 16:57:38 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Nov 16 16:57:38 2017 - [info] Starting recovery on host_1(host_1:3306)..
Thu Nov 16 16:57:38 2017 - [info] Generating diffs succeeded.
Thu Nov 16 16:57:38 2017 - [info] Waiting until all relay logs are applied.
Thu Nov 16 16:57:38 2017 - [info] done.
Thu Nov 16 16:57:38 2017 - [info] Getting slave status..
Thu Nov 16 16:57:38 2017 - [info] This slave(host_1)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_2.000012:467). No need to recover from Exec_Master_Log_Pos.
Thu Nov 16 16:57:38 2017 - [info] Connecting to the target slave host host_1, running recover script..
Thu Nov 16 16:57:38 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_1 --slave_ip=host_1 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171116165440 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Thu Nov 16 16:57:39 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_1_3306_20171116165440.binlog on host_1:3306. This may take long time...
Applying log files succeeded.
Thu Nov 16 16:57:39 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 16:57:39 2017 - [info] Getting new master's binlog name and position..
Thu Nov 16 16:57:39 2017 - [info] host_1.000012:1310
Thu Nov 16 16:57:39 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1.000012', MASTER_LOG_POS=1310, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 16 16:57:39 2017 - [info] Executing master IP activate script:
Thu Nov 16 16:57:39 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_2 --orig_master_ip=host_2 --orig_master_port=3306 --new_master_host=host_1 --new_master_ip=host_1 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_1 is added ==============================
Thu Nov 16 16:57:41 2017 - [info] OK.
Thu Nov 16 16:57:41 2017 - [info] Setting read_only=0 on host_1(host_1:3306)..
Thu Nov 16 16:57:41 2017 - [info] ok.
Thu Nov 16 16:57:41 2017 - [info] ** Finished master recovery successfully.
Thu Nov 16 16:57:41 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 123011. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116165440.log if it takes time..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Log messages from host_3 ...
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Nov 16 16:57:41 2017 - [info] End of log messages from host_3.
Thu Nov 16 16:57:41 2017 - [info] -- host_3(host_3:3306) has the latest relay log events.
Thu Nov 16 16:57:41 2017 - [info] Generating relay diff files from the latest slave succeeded.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 123044. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116165440.log if it takes time..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Log messages from host_3 ...
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Starting recovery on host_3(host_3:3306)..
Thu Nov 16 16:57:41 2017 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Nov 16 16:57:41 2017 - [info] done.
Thu Nov 16 16:57:41 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 16:57:41 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306)..
Thu Nov 16 16:57:41 2017 - [info] Executed CHANGE MASTER.
Thu Nov 16 16:57:41 2017 - [info] Slave started.
Thu Nov 16 16:57:41 2017 - [info] End of log messages from host_3.
Thu Nov 16 16:57:41 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Nov 16 16:57:41 2017 - [info] All new slave servers recovered successfully.
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 16 16:57:41 2017 - [info]
Thu Nov 16 16:57:41 2017 - [info] Resetting slave info on the new master..
Thu Nov 16 16:57:41 2017 - [info] host_1: Resetting slave info succeeded.
Thu Nov 16 16:57:41 2017 - [info] Master failover to host_1(host_1:3306) completed successfully.
Thu Nov 16 16:57:41 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_2(host_2:3306) to host_1(host_1:3306) succeeded
Master host_2(host_2:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_2(host_2:3306)
The latest slave host_3(host_3:3306) has all relay logs for recovery.
Selected host_1(host_1:3306) as a new master.
host_1(host_1:3306): OK: Applying all logs succeeded.
host_1(host_1:3306): OK: Activated master IP address.
host_3(host_3:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_1(host_1:3306)
host_1(host_1:3306): Resetting slave info succeeded.
Master failover to host_1(host_1:3306) completed successfully.
### 切换后的结果
new_master & new_etl 结果一致,由于master上面未传过来的binlog彻底丢失,所以相应的新集群也缺失这些数据。
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 172 | 1 |
| 173 | 2 |
+-----+------+
2 rows in set (0.00 sec)
### 最后一步很重要
如果dead master之后又活过来了,那么这一步要做
dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==》 tgw章节详细描述
  • 2.2.2 当master的所有日志已经传递到1个etl,这时候master server挂了
1
测试省略,和2.2.1基本一样

结论:由于master上的所有日志全部传递到etl,所以最后是不会丢失master上任何数据的。

2.3 slave(候选master)的日志是最新的,比etl要多

  • 2.3.1 当master的部分日志还没传递两个slave,这时候master server挂了
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
### 3台DB的binlog状态
* master
dba:lc> show master status;
+---------------------+----------+--------------+------------------+-------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+---------------------+----------+--------------+------------------+-------------------+
| host_1.000012 | 3860 | | | |
+---------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 176 | 1 |
| 177 | 2 |
| 178 | 10 |
| 179 | 20 |
+-----+------+
4 rows in set (0.00 sec)
* slave
Master_Log_File: host_1.000012
Exec_Master_Log_Pos: 3216
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 176 | 1 |
| 177 | 2 |
+-----+------+
2 rows in set (0.00 sec)
* etl
Master_Log_File: host_1.000012
Exec_Master_Log_Pos: 2576
dba:lc> select * from t_char_2;
Empty set (0.00 sec)
### 模拟故障场景
* 隔离master的网络,让其等同于down机
master> iptables -A INPUT -p tcp -s other_ip --dport 22 -j ACCEPT
master> iptables -A INPUT -p tcp -s 0.0.0.0/0 -j DROP
### 切换日志
Thu Nov 16 17:17:59 2017 - [info] MHA::MasterFailover version 0.56.
Thu Nov 16 17:17:59 2017 - [info] Starting master failover.
Thu Nov 16 17:17:59 2017 - [info]
Thu Nov 16 17:17:59 2017 - [info] * Phase 1: Configuration Check Phase..
Thu Nov 16 17:17:59 2017 - [info]
Thu Nov 16 17:17:59 2017 - [warning] SQL Thread is stopped(no error) on host_2(host_2:3306)
Thu Nov 16 17:17:59 2017 - [warning] SQL Thread is stopped(no error) on host_3(host_3:3306)
Thu Nov 16 17:17:59 2017 - [info] GTID failover mode = 0
Thu Nov 16 17:17:59 2017 - [info] Dead Servers:
Thu Nov 16 17:17:59 2017 - [info] host_1(host_1:3306)
Thu Nov 16 17:17:59 2017 - [info] Checking master reachability via MySQL(double check)...
Thu Nov 16 17:18:00 2017 - [info] ok.
Thu Nov 16 17:18:00 2017 - [info] Alive Servers:
Thu Nov 16 17:18:00 2017 - [info] host_2(host_2:3306)
Thu Nov 16 17:18:00 2017 - [info] host_3(host_3:3306)
Thu Nov 16 17:18:00 2017 - [info] Alive Slaves:
Thu Nov 16 17:18:00 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:18:00 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:18:00 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:18:00 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:18:00 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:18:00 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 17:18:00 2017 - [info] Starting SQL thread on host_2(host_2:3306) ..
Thu Nov 16 17:18:00 2017 - [info] done.
Thu Nov 16 17:18:00 2017 - [info] Starting SQL thread on host_3(host_3:3306) ..
Thu Nov 16 17:18:00 2017 - [info] done.
Thu Nov 16 17:18:00 2017 - [info] Starting Non-GTID based failover.
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:00 2017 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:00 2017 - [info] * Phase 2: Dead Master Shutdown Phase..
Thu Nov 16 17:18:00 2017 - [info]
Thu Nov 16 17:18:50 2017 - [warning] HealthCheck: Got timeout on checking SSH connection to host_1! at /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm line 342.
Thu Nov 16 17:18:50 2017 - [info] Forcing shutdown so that applications never connect to the current master..
Thu Nov 16 17:18:50 2017 - [info] Executing master IP deactivation script:
Thu Nov 16 17:18:50 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stop
ssh: connect to host host_1 port 22: Connection timed out
=================== swift vip : tgw_vip from host_1 is deleted ==============================
--2017-11-16 17:18:57-- http://tgw_server/cgi-bin/fun_logic/bin/public_api/op_rs.cgi
正在连接 tgw_server:80... 已连接。
已发出 HTTP 请求,正在等待回应... 200 OK
长度:未指定 [text/html]
正在保存至: “STDOUT”
0K 8.61M=0s
2017-11-16 17:20:55 (8.61 MB/s) - 已写入标准输出 [38]
Thu Nov 16 17:20:55 2017 - [info] done.
Thu Nov 16 17:20:55 2017 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Thu Nov 16 17:20:55 2017 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3: Master Recovery Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] The latest binary log file/position on all slaves is host_1.000012:3216
Thu Nov 16 17:20:55 2017 - [info] Latest slaves (Slaves that received relay log files to the latest):
Thu Nov 16 17:20:55 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:55 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:55 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:20:55 2017 - [info] The oldest binary log file/position on all slaves is host_1.000012:2576
Thu Nov 16 17:20:55 2017 - [info] Oldest slaves:
Thu Nov 16 17:20:55 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:55 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:55 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [warning] Dead Master is not SSH reachable. Could not save it's binlogs. Transactions that were not sent to the latest slave (Read_Master_Log_Pos to the tail of the dead master's binlog) were lost.
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] * Phase 3.3: Determining New Master Phase..
Thu Nov 16 17:20:55 2017 - [info]
Thu Nov 16 17:20:55 2017 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Thu Nov 16 17:20:55 2017 - [info] HealthCheck: SSH to host_2 is reachable.
Thu Nov 16 17:20:55 2017 - [info] Checking whether host_2 has relay logs from the oldest position..
Thu Nov 16 17:20:55 2017 - [info] Executing command: apply_diff_relay_logs --command=find --latest_mlf=host_1.000012 --latest_rmlp=3216 --target_mlf=host_1.000012 --target_rmlp=2576 --server_id=1261261656 --workdir=/var/log/masterha/mha_test --timestamp=20171116171759 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002 :
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:1581.
Target relay log FOUND!
Thu Nov 16 17:20:56 2017 - [info] OK. host_2 has all relay logs.
Thu Nov 16 17:20:56 2017 - [info] HealthCheck: SSH to host_3 is reachable.
Thu Nov 16 17:20:56 2017 - [info] Searching new master from slaves..
Thu Nov 16 17:20:56 2017 - [info] Candidate masters from the configuration file:
Thu Nov 16 17:20:56 2017 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:56 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:56 2017 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Nov 16 17:20:56 2017 - [info] Non-candidate masters:
Thu Nov 16 17:20:56 2017 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled
Thu Nov 16 17:20:56 2017 - [info] Replicating from host_1(host_1:3306)
Thu Nov 16 17:20:56 2017 - [info] Not candidate for the new Master (no_master is set)
Thu Nov 16 17:20:56 2017 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Thu Nov 16 17:20:56 2017 - [info] New master is host_2(host_2:3306)
Thu Nov 16 17:20:56 2017 - [info] Starting master failover..
Thu Nov 16 17:20:56 2017 - [info]
From:
host_1(host_1:3306) (current master)
+--host_2(host_2:3306)
+--host_3(host_3:3306)
To:
host_2(host_2:3306) (new master)
+--host_3(host_3:3306)
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] This server has all relay logs. No need to generate diff files from the latest slave.
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] * Phase 3.4: Master Log Apply Phase..
Thu Nov 16 17:20:56 2017 - [info]
Thu Nov 16 17:20:56 2017 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Thu Nov 16 17:20:56 2017 - [info] Starting recovery on host_2(host_2:3306)..
Thu Nov 16 17:20:56 2017 - [info] This server has all relay logs. Waiting all logs to be applied..
Thu Nov 16 17:20:56 2017 - [info] done.
Thu Nov 16 17:20:56 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 17:20:56 2017 - [info] Getting new master's binlog name and position..
Thu Nov 16 17:20:56 2017 - [info] host_2.000012:3959
Thu Nov 16 17:20:56 2017 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2.000012', MASTER_LOG_POS=3959, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Thu Nov 16 17:20:56 2017 - [info] Executing master IP activate script:
Thu Nov 16 17:20:56 2017 - [info] /data/online/agent/MHA/masterha/bak_mha_test/master_ip_failover_mha_test --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba'
Unknown option: new_master_user
Unknown option: new_master_password
=================== swift vip : tgw_vip to host_2 is added ==============================
Thu Nov 16 17:20:59 2017 - [info] OK.
Thu Nov 16 17:20:59 2017 - [info] ** Finished master recovery successfully.
Thu Nov 16 17:20:59 2017 - [info] * Phase 3: Master Recovery Phase completed.
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] * Phase 4: Slaves Recovery Phase..
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Thu Nov 16 17:20:59 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 77007. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116171759.log if it takes time..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Log messages from host_3 ...
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:20:59 2017 - [info] Server host_3 received relay logs up to: host_1.000012:2576
Thu Nov 16 17:20:59 2017 - [info] Need to get diffs from the latest slave(host_2) up to: host_1.000012:3216 (using the latest slave's relay logs)
Thu Nov 16 17:20:59 2017 - [info] Connecting to the latest slave host host_2, generating diff relay log files..
Thu Nov 16 17:20:59 2017 - [info] Executing command: apply_diff_relay_logs --command=generate_and_send --scp_user=root --scp_host=host_3 --latest_mlf=host_1.000012 --latest_rmlp=3216 --target_mlf=host_1.000012 --target_rmlp=2576 --server_id=1261261656 --diff_file_readtolatest=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog --workdir=/var/log/masterha/mha_test --timestamp=20171116171759 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2-relay-bin.000002
Thu Nov 16 17:21:00 2017 - [info]
Relay log found at /data/mysql_data, up to host_2-relay-bin.000002
Fast relay log position search succeeded.
Target relay log file/position found. start_file:host_2-relay-bin.000002, start_pos:1581.
Concat binary/relay logs from host_2-relay-bin.000002 pos 1581 to host_2-relay-bin.000002 EOF into /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 315.. ok.
Dumping effective binlog data from /data/mysql_data/host_2-relay-bin.000002 position 1581 to tail(2221).. ok.
Binlog Checksum enabled
Concat succeeded.
Generating diff relay log succeeded. Saved at /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog .
scp host_2.58os.org:/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog to root@host_3(22) succeeded.
Thu Nov 16 17:21:00 2017 - [info] Generating diff files succeeded.
Thu Nov 16 17:21:00 2017 - [info] End of log messages from host_3.
Thu Nov 16 17:21:00 2017 - [info] -- Slave diff log generation on host host_3(host_3:3306) succeeded.
Thu Nov 16 17:21:00 2017 - [info] Generating relay diff files from the latest slave succeeded.
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 78627. Check tmp log /var/log/masterha/mha_test/host_3_3306_20171116171759.log if it takes time..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Log messages from host_3 ...
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Starting recovery on host_3(host_3:3306)..
Thu Nov 16 17:21:00 2017 - [info] Generating diffs succeeded.
Thu Nov 16 17:21:00 2017 - [info] Waiting until all relay logs are applied.
Thu Nov 16 17:21:00 2017 - [info] done.
Thu Nov 16 17:21:00 2017 - [info] Getting slave status..
Thu Nov 16 17:21:00 2017 - [info] This slave(host_3)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(host_1.000012:2576). No need to recover from Exec_Master_Log_Pos.
Thu Nov 16 17:21:00 2017 - [info] Connecting to the target slave host host_3, running recover script..
Thu Nov 16 17:21:00 2017 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --apply_files=/var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog --workdir=/var/log/masterha/mha_test --target_version=5.7.13-log --timestamp=20171116171759 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 --slave_pass=xxx
Thu Nov 16 17:21:00 2017 - [info]
MySQL client version is 5.7.13. Using --binary-mode.
Applying differential binary/relay log files /var/log/masterha/mha_test/relay_from_read_to_latest_host_3_3306_20171116171759.binlog on host_3:3306. This may take long time...
Applying log files succeeded.
Thu Nov 16 17:21:00 2017 - [info] All relay logs were successfully applied.
Thu Nov 16 17:21:00 2017 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306)..
Thu Nov 16 17:21:00 2017 - [info] Executed CHANGE MASTER.
Thu Nov 16 17:21:00 2017 - [info] Slave started.
Thu Nov 16 17:21:00 2017 - [info] End of log messages from host_3.
Thu Nov 16 17:21:00 2017 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded.
Thu Nov 16 17:21:00 2017 - [info] All new slave servers recovered successfully.
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] * Phase 5: New master cleanup phase..
Thu Nov 16 17:21:00 2017 - [info]
Thu Nov 16 17:21:00 2017 - [info] Resetting slave info on the new master..
Thu Nov 16 17:21:00 2017 - [info] host_2: Resetting slave info succeeded.
Thu Nov 16 17:21:00 2017 - [info] Master failover to host_2(host_2:3306) completed successfully.
Thu Nov 16 17:21:00 2017 - [info]
----- Failover Report -----
bak_mha_test: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded
Master host_1(host_1:3306) is down!
Check MHA Manager logs at tjtx135-2-217.58os.org:/var/log/masterha/mha_test/mha_test.log for details.
Started automated(non-interactive) failover.
Invalidated master IP address on host_1(host_1:3306)
The latest slave host_2(host_2:3306) has all relay logs for recovery.
Selected host_2(host_2:3306) as a new master.
host_2(host_2:3306): OK: Applying all logs succeeded.
host_2(host_2:3306): OK: Activated master IP address.
host_3(host_3:3306): Generating differential relay logs up to host_2(host_2:3306)succeeded.
Generating relay diff files from the latest slave succeeded.
host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306)
host_2(host_2:3306): Resetting slave info succeeded.
Master failover to host_2(host_2:3306) completed successfully.
### 切换后的结果
new_master & new_etl 结果一致,由于master上面未传过来的binlog彻底丢失,所以相应的新集群也缺失这些数据。
dba:lc> select * from t_char_2;
+-----+------+
| id | name |
+-----+------+
| 176 | 1 |
| 177 | 2 |
+-----+------+
2 rows in set (0.00 sec)
### 最后一步很重要
如果dead master之后又活过来了,那么这一步要做
dead_master> /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==》 tgw章节详细描述

结论: 由于master 已挂,然而最后的日志没有传递到其他服务器,所以会丢失master没有传递过来的事务日志

  • 2.3.2 当master的所有日志已经传递slave,这时候master server挂了
1
测试省略,和2.3.1基本一样

结论:由于master上的所有日志全部传递到slave,所以最后是不会丢失master上任何数据的。

2.4 slave(候选master)上面有大事务在跑

  • 1000s的大查询
1
同1.4结论
  • flush tables with readlock
1
2
同1.4结论

2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?

1
同1.6结论

三、遇到的坑

3.1 交互模式下,如果没有及时敲’YES’,则终止切换

3.2 如果在non-gtid模式的机器上,配置了binlog server,会有什么影响呢?

1
无影响

3.3 不要在relay-log的地方伪造日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
xx-relay-bin.000001
xx-relay-bin.000002
xx-relay-bin.000002.lc --这个是伪造的,当时这个日志是自己解析的 mysqlbinlog -vv xx-relay-bin.000002 > xx-relay-bin.000002.lc
MHA在切换的时候报错如下:
Reading xx-relay-bin.000002.lc
Event too large: pos: 4, event_length: 1163083840, event_type: 32
at /usr/share/perl5/vendor_perl/MHA/BinlogPosFindManager.pm line 103
Tue Nov 14 10:39:08 2017 - [warning] xx doesn't have all relay logs. Maybe some logs were purged.
Tue Nov 14 10:39:08 2017 - [warning] None of latest servers have enough relay logs from oldest position. We can't recover oldest slaves.
Tue Nov 14 10:39:08 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln947] None of the latest slaves has enough relay logs for recovery.
Tue Nov 14 10:39:08 2017 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53

3.4 flush tables with read lock 会阻塞

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
dba:(none)> show processlist;
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
| 63 | dba | localhost | NULL | Query | 0 | starting | show processlist |
| 65 | dba | xx:11164 | NULL | Sleep | 121 | | NULL |
| 83 | dba | new master:49022 | NULL | Query | 176 | Waiting for global read lock | BINLOG '
GpAKWhNYUy1LMAAAAGYHAAAAAG0AAAAAAAEAAmxjAAh0X2NoYXJfMgACAw8CLAEC
GpAKWh5YUy1LJwAAAI0HAAAAAG |
+----+------+----------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
会卡在这一步
Tue Nov 14 15:16:23 2017 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
--------------Waiting for global read lock-----------------
那么后面的步骤都会卡住
* Phase 4.2: Starting Parallel Slave Log Apply Phase..
--将 当前需要恢复的slave 和 latest slave之间的diff日志 传送给 需要恢复的 slave
--将 latest slave 和 dead master之间的diff日志 传送给 需要恢复的 slave
--将 刚刚的差异日志依次apply 到需要恢复的slave
--Resetting slave , Executed CHANGE MASTER to new_master
* Phase 5: New master cleanup phase..
Resetting slave info on the new master.

3.5 binlog伪造

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Tue Nov 14 17:23:10 2017 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1.000003 --start_pos=4042 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171114172304.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56
Creating /var/log/masterha/mha_test if not exists.. ok.
Concat binary/relay logs from host_1.000003 pos 4042 to host_1.000050 EOF into /var/log/masterha/mha_test/saved_master_binlog_from_host_1_3306_20171114172304.binlog ..
Binlog Checksum enabled
Dumping binlog format description event, from position 0 to 154.. ok.
Dumping effective binlog data from /data/mysql.bin/host_1.000003 position 4042 to tail(4065).. ok.
Failed to save binary log: Target file /data/mysql.bin/host_1.000004 not found!
at /usr/bin/save_binary_logs line 176
Tue Nov 14 17:23:10 2017 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln760] Failed to save binary log events from the orig master. Maybe disks on binary logs are not accessible or binary log itself is corrupt?
测试:
flush + 无binlog.xx + 1伪造 = save binlog --成功
flush + 有binlog.xx + 1伪造 = save binlog --失败
flush + 无binlog.xx + 1伪造 = save binlog --成功
flush + 有binlog.xx + 1伪造 = save binlog --失败
说明,如果有伪造的binlog,且后缀比之前的大,那么就会告警
解析:
/usr/bin/save_binary_logs ==》
generate_diff_binary_log =>
/usr/share/perl5/vendor_perl/MHA/BinlogManager.pm ==》
concat_all_binlogs_from():: =>
start_num , end_num(如果出现伪造的binlog,则会报错)

四、总结

MHA + NON-GTID 模式,重点配置和用法如下:

1
2
3
4
5
6
7
8
9
10
11
1. command
masterha_master_switch --global_conf=/data/online/agent/MHA/conf/masterha_default.cnf --conf=/data/online/agent/MHA/conf/bak_mha_test.cnf --dead_master_host=$dead_master_ip --dead_master_port=3306 --master_state=dead --interactive=0 --ignore_last_failover --ignore_binlog_server_error
2. tgw 清理
dead master 如果还可以起来,那么必须在上面执行: /usr/local/realserver/RS_TUNL0/etc/setup_rs.sh -c
原因可参看:http://gitlab.corp.anjuke.com/_dba/architecture/blob/master/personal/Keithlan/other/share/tools/always_used_command.md ==> TGW 章节

五、 流程简介

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
* Phase 1: Configuration Check Phase..
HealthCheck: SSH N台DB是否reachable
Binlog server: N台DB 是否reachable
GTID failover mode = ?
Dead Servers is
Primary candidate for the new Master (candidate_master is set) ?
* Phase 2: Dead Master Shutdown Phase..
Executing master IP deactivation script
TGW-vip delete操作
shutdown_script: ?
* Phase 3: Master Recovery Phase..
* Phase 3.1: Getting Latest Slaves Phase..
Latest slaves ,file position ?
Oldest slaves , file position ?
* Phase 3.2: Saving Dead Master's Binlog Phase..
Executing command on the dead master : save_binary_logs --command=save
Concat binary/relay logs from latest_slave file_pos to master's binlog EOF
scp from root@master_ip: binlog to local: xx.binlog --将master最新的,latest缺失的binlog传递到manager机器
* Phase 3.3: Determining New Master Phase..
选择哪个slave为new master
* Phase 3.3: New Master Diff Log Generation Phase..
Need to get diffs from the latest slave up to: xx (using the latest slave's relay logs) --生成new master和latest slave之间的diff日志
scp latest_ip:xx to new_master:xx --将new master和latest slave之间的diff日志 传送给 new master
scp from local: xx to new_master: xx --将latest slave 和 dead master之间的diff日志 传送给 new master
* Phase 3.4: Master Log Apply Phase..
new master 将之前的两段diff日志合并,然后开始apply
All other slaves should start replication from here --开始生成change master语句
Executing master IP activate script: TGW-vip 激活操作,并且设置readonly=0
* Phase 4: Slaves Recovery Phase.. (并行操作)
* Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. --开始生成其余slave 和 latest 之间的diff日志
* Phase 4.2: Starting Parallel Slave Log Apply Phase..
--将 当前需要恢复的slave 和 latest slave之间的diff日志 传送给 需要恢复的 slave
--将 latest slave 和 dead master之间的diff日志 传送给 需要恢复的 slave
--将 刚刚的差异日志依次apply 到需要恢复的slave
--Resetting slave , Executed CHANGE MASTER to new_master
* Phase 5: New master cleanup phase..
Resetting slave info on the new master.
Contents
  1. 1. MHA failover NON-GTID 专题
    1. 1.1. 假定环境(经典三节点)
    2. 1.2. 一、Master : MySQL down
      1. 1.2.1. 1.1 etl 延迟8小时
      2. 1.2.2. 1.2 slave(候选master)比etl还要落后更多
      3. 1.2.3. 1.3 slave(候选master)的日志是最新的,比etl要多
      4. 1.2.4. 1.4 slave(候选master)上面有大事务在跑
      5. 1.2.5. 1.5 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
      6. 1.2.6. 1.7 Master:MySQL down小结
    3. 1.3. 二、Master : Server down
      1. 1.3.1. 2.1 etl 延迟8小时
      2. 1.3.2. 2.2 slave(候选master)比etl还要落后更多
      3. 1.3.3. 2.3 slave(候选master)的日志是最新的,比etl要多
      4. 1.3.4. 2.4 slave(候选master)上面有大事务在跑
      5. 1.3.5. 2.6 如果MHA过程中失败,是否可以重新执行MHA的failover呢?
    4. 1.4. 三、遇到的坑
      1. 1.4.1. 3.1 交互模式下,如果没有及时敲’YES’,则终止切换
      2. 1.4.2. 3.2 如果在non-gtid模式的机器上,配置了binlog server,会有什么影响呢?
      3. 1.4.3. 3.3 不要在relay-log的地方伪造日志
      4. 1.4.4. 3.4 flush tables with read lock 会阻塞
      5. 1.4.5. 3.5 binlog伪造
    5. 1.5. 四、总结
    6. 1.6. 五、 流程简介

幸福,不在于得到的多

而在于计较的少