programer
前回の「MySQL-MHAのインストール&設定」ではMySQL-MHAの設定が完了したので、
今回は実際にmha-managerのコマンドによるフェイルオーバーを行いたいと思います。

■masterha_master_switchについて

masterha_master_switchは手動でフェイルオーバーを行うためのコマンドになります。
また、挙動やオプションを大きく分けると、
–master_stateオプションで指定できる、masterが死んでいるのか?生きているのか?で違ってきます。
※–master_state=deadが指定されている場合には、死んでいる。
–master_state=aliveが指定されている場合には、生きている。という意味になる。

今回は、–master_state=deadオプションを利用した、
サーバーが死んでる状態で実行できる手動によるフェイルオーバーを行いたいと思います
–master_state=aliveのmasterが生きている状態でのフェイルオーバーは
masterha_master_switchについて(前編)を参照してください。

サーバー構成は前回と同じ内容で行いたいと思います。
※masterには仮想IPを設定してあります。

■masterが死んでる場合に利用されそうなオプションについて
※誤解などがあるかもしれないので詳細やその他オプションは公式のドキュメントを読んでください!
masterha_master_switchのドキュメント

- conf
 設定ファイルの指定
- master_state=dead
 必須パラメータ。 
 masterが生きている状態で行う場合には「alive」を設定する。(オンラインマスタースイッチになる)
 死んでる状態で切り替える場合には「dead」を設定する。
  
- dead_master_host =(ホスト名)
 master_state=deadを指定した場合には、必須パラメータ。
 死んでるとmasterの指定(死んでると仮定するmasterの指定)。
 
- dead_master_ip 
 gethostbyname(dead_master_host)の結果になる。
 
- dead_master_port
 デフォルトは3306。
 
- new_master_host =(ホスト名)
 新しいmasterの指定。デフォルトのmha managerの自動フェイルオーバーによるslaveの選択ではなくなる。
 
- interactive=(0 | 1)
 0:非対話型フェイルオーバー
 1:対話型フェイルオーバー
 
- ssh_reachable =(0 | 1 | 2)
 masterにSSH経由で到達可能であるかどうかを指定します。
 0:SSHが到達不可の場合
 1:SSHが到達可能な場合
 2:SSHが到達可能か不明の場合(デフォルト)
  デフォルトの2が設定されている場合には、SSH接続が可能かチェックして、0または1にステータスを変更する。
  このコマンドは、内部的にマスターはSSHまたはしない経由で到達可能かチェックし、0または1を使って内部のSSHステータスを更新します。
  masterがSSH経由で到達可能である場合、master_ip_failover_script、shutdown_scriptが設定されている場合には、
  " -command=stopssh"が実行される。
  また、masterからバイナリーログがコピーされる。
  設定されていない場合には、masterha_master_switchの" -command=stop"が実行される。
 
- skip_change_master
 差動リレーのログを適用した後のCHANGE MASTER TOコマンドをスキップする。
 
- skip_disable_read_only
 新しいmasterにread_only=0の設定をスキップする。
 
- last_failover_minute =(分)
 以前のフェイルオーバーから指定された時間内に再度フェイルオーバーを実行することになった場合に、
 フェイルオーバーを行わない。(デフォルトでは480分。8時間)。
 ただし、ignore_last_failoverが設定されている場合には、無効になる。
 
- ignore_last_failover
 以前フェイルオーバーが失敗した場合に再び同じ現象が起こる可能性があるので、、
 last_failover_minuteで指定された時間を経過するまではフェイルオーバーを開始しないため、
 それを無効化するためのもの。
  
- wait_on_failover_error =(秒)
 フェイルオーバー失敗時に、指定された秒数待機してから終了する。
 デーモンスクリプトから自動化されたmasterの監視とフェイルオーバー用に、再び監視させるまでの待機時間ように。
 
- remove_dead_master_conf
 フェイルオーバーが正常終了した場合に、死んだmasterの設定ファイルのセクション名を削除する。(デフォルトは削除されない)



作業前に現在のサーバーの状態を確認したいと思います。
※前回の記事の構成を利用しています。

現在のサーバーの状態の確認

・mha-managerサーバー
 IP:192.168.10.227)
・masterサーバー
IP:192.168.10.233
VIP:192.168.10.234
・salve1
IP:192.168.10.229
・salve2
IP:192.168.10.228

この状態から、現在のmaster(192.168.10.233)を死んだと仮定してやっていきたいと思います( ゚Д゚)ゞ リョーカイ!!

では、192.168.10.233は死んだことにします!お前は、既に死んでいる!!北斗百烈拳( -_-)=○アタ☆=○アタ☆=○アター☆)°o°)アウ!

■現在masterの192.168.10.233が死んだ事にして、フェイルオーバーを実行する。

次のコマンドの各オプションなどの意味は次のようになります。
–master_stateで、現在のmasterの状態を指定します。(dead or alive)
–dead_master_hostで、死んでいるmasterを指定します。(deadの場合に必須)
オプションで新しいmasterを指定していないので、masterに選定されるslaveはmha-managerにお任せする。



masterha_master_switch --master_state=dead --conf=/etc/mha.cnf --dead_master_host=192.168.10.233


実行した際に、192.168.10.233が起動している場合には、次のようエラーが発生して、フェイルオーバーされない・・・。
さ、さすが、ラオウ((((;゚Д゚))))ガクガクブルブル


Tue Feb 12 08:29:50 2013 - [info] Dead Servers:
Tue Feb 12 08:29:50 2013 - [error][/usr/lib/perl5/vendor_perl/MHA/MasterFailover.pm, ln183] None of server is dead. Stop failover.
Tue Feb 12 08:29:50 2013 - [error][/usr/lib/perl5/vendor_perl/MHA/ManagerUtil.pm, ln178] Got ERROR:  at /usr/bin/masterha_master_switch line 53


(・д・)チッ

仕方ないので、実際にmasterを停止してやってみることに!

現在のmasterは192.168.10.233なので、そのサーバーのmysqlを停止させます!ザ・ワールド!ド━(゚Д゚)━ン!!


#プロセスの確認
# ps axu|grep mysql
root      4905  0.0  0.2   5556  1180 ?        S    Feb08   0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
mysql     5463  0.0 13.9 614640 73252 ?        Sl   Feb08   0:00 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/lib/mysql/mysql-error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306
root      5495  0.0  0.5  23016  2776 xvc0     S+   Feb08   0:00 mysql -u root
root     19318  0.0  0.1   5120   780 pts/0    R+   11:22   0:00 grep mysql
 
#停止
# /etc/init.d/mysqld stop
mysqld を停止中:                                           [  OK  ]
 
#プロセスが停止したことの確認
# ps axu|grep mysql
root     19359  0.0  0.1   5120   780 pts/0    R+   11:23   0:00 grep mysql



それでは、再度、フェイルオーバーを実行してみる。



# masterha_master_switch --master_state=dead --conf=/etc/mha.cnf --dead_master_host=192.168.10.233
--dead_master_ip=<dead_master_ip> is not set. Using 192.168.10.233.
--dead_master_port=<dead_master_port> is not set. Using 3306.
Tue Feb 12 11:26:41 2013 - [info] Reading default configuratoins from /etc/masterha_default.cnf..
Tue Feb 12 11:26:41 2013 - [info] Reading application default configurations from /etc/mha.cnf..
Tue Feb 12 11:26:41 2013 - [info] Reading server configurations from /etc/mha.cnf..
Tue Feb 12 11:26:41 2013 - [info] MHA::MasterFailover version 0.55.
Tue Feb 12 11:26:41 2013 - [info] Starting master failover.
Tue Feb 12 11:26:41 2013 - [info]
Tue Feb 12 11:26:41 2013 - [info] * Phase 1: Configuration Check Phase..
Tue Feb 12 11:26:41 2013 - [info]
Tue Feb 12 11:26:42 2013 - [info] Dead Servers:
Tue Feb 12 11:26:42 2013 - [info]   192.168.10.233(192.168.10.233:3306)
Tue Feb 12 11:26:42 2013 - [info] Checking master reachability via mysql(double check)..
Tue Feb 12 11:26:42 2013 - [info]  ok.
Tue Feb 12 11:26:42 2013 - [info] Alive Servers:
Tue Feb 12 11:26:42 2013 - [info]   192.168.10.228(192.168.10.228:3306)
Tue Feb 12 11:26:42 2013 - [info]   192.168.10.229(192.168.10.229:3306)
Tue Feb 12 11:26:42 2013 - [info] Alive Slaves:
Tue Feb 12 11:26:42 2013 - [info]   192.168.10.228(192.168.10.228:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 11:26:42 2013 - [info]     Replicating from 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 11:26:42 2013 - [info]   192.168.10.229(192.168.10.229:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 11:26:42 2013 - [info]     Replicating from 192.168.10.233(192.168.10.233:3306)
Master 192.168.10.233 is dead. Proceed? (yes/NO): y
Tue Feb 12 11:27:01 2013 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Feb 12 11:27:01 2013 - [info]
Tue Feb 12 11:27:01 2013 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Feb 12 11:27:01 2013 - [info]
Tue Feb 12 11:27:02 2013 - [info] HealthCheck: SSH to 192.168.10.233 is reachable.
Tue Feb 12 11:27:03 2013 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Feb 12 11:27:03 2013 - [info] Executing master IP deactivatation script:
Tue Feb 12 11:27:03 2013 - [info]   /usr/bin/master_ip_failover --virtual_ip=192.168.10.234 --orig_master_vip_eth=eth0:234 --new_master_vip_eth=eth0:234 --orig_master_host=192.168.10.233 --orig_master_ip=192.168.10.233 --orig_master_port=3306 --command=stopssh --ssh_user=root
DEBUG PARAMETERS***********
command => stopssh
ssh_user=s => root
orig_master_host => 192.168.10.233
orig_master_ip => 192.168.10.233
orig_master_port => 3306
virtual_ip => 192.168.10.234
orig_master_vip_eth => eth0:234
new_master_vip_eth => eth0:234
Tue Feb 12 11:27:03 2013 - [info]  done.
Tue Feb 12 11:27:03 2013 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Feb 12 11:27:03 2013 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Feb 12 11:27:03 2013 - [info]
Tue Feb 12 11:27:03 2013 - [info] * Phase 3: Master Recovery Phase..
Tue Feb 12 11:27:03 2013 - [info]
Tue Feb 12 11:27:03 2013 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Feb 12 11:27:03 2013 - [info]
Tue Feb 12 11:27:03 2013 - [info] The latest binary log file/position on all slaves is mysql-bin.000011:929
Tue Feb 12 11:27:03 2013 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Feb 12 11:27:03 2013 - [info]   192.168.10.228(192.168.10.228:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 11:27:03 2013 - [info]     Replicating from 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 11:27:03 2013 - [info]   192.168.10.229(192.168.10.229:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 11:27:03 2013 - [info]     Replicating from 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 11:27:03 2013 - [info] The oldest binary log file/position on all slaves is mysql-bin.000011:929
Tue Feb 12 11:27:03 2013 - [info] Oldest slaves:
Tue Feb 12 11:27:03 2013 - [info]   192.168.10.228(192.168.10.228:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 11:27:03 2013 - [info]     Replicating from 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 11:27:03 2013 - [info]   192.168.10.229(192.168.10.229:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 11:27:03 2013 - [info]     Replicating from 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 11:27:03 2013 - [info]
Tue Feb 12 11:27:03 2013 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Tue Feb 12 11:27:03 2013 - [info]
Tue Feb 12 11:27:04 2013 - [info] Fetching dead master's binary logs..
Tue Feb 12 11:27:04 2013 - [info] Executing command on the dead master 192.168.10.233(192.168.10.233:3306): save_binary_logs --command=save --start_file=mysql-bin.000011  --start_pos=929 --binlog_dir=/var/lib/mysql,/var/log/mysql --output_file=/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55
  Creating /tmp/mha if not exists..    ok.
 Concat binary/relay logs from mysql-bin.000011 pos 929 to mysql-bin.000011 EOF into /tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog ..
  Dumping binlog format description event, from position 0 to 107.. ok.
  Dumping effective binlog data from /var/lib/mysql/mysql-bin.000011 position 929 to tail(948).. ok.
 Concat succeeded.
saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog                                                                                                                        100%  126     0.1KB/s   00:00
Tue Feb 12 11:27:07 2013 - [info] scp from root@192.168.10.233:/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog to local:/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog succeeded.
Tue Feb 12 11:27:07 2013 - [info] HealthCheck: SSH to 192.168.10.228 is reachable.
Tue Feb 12 11:27:09 2013 - [info] HealthCheck: SSH to 192.168.10.229 is reachable.
Tue Feb 12 11:27:09 2013 - [info]
Tue Feb 12 11:27:09 2013 - [info] * Phase 3.3: Determining New Master Phase..
Tue Feb 12 11:27:09 2013 - [info]
Tue Feb 12 11:27:09 2013 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Tue Feb 12 11:27:09 2013 - [info] All slaves received relay logs to the same position. No need to resync each other.
Tue Feb 12 11:27:09 2013 - [info] Searching new master from slaves..
Tue Feb 12 11:27:09 2013 - [info]  Candidate masters from the configuration file:
Tue Feb 12 11:27:09 2013 - [info]  Non-candidate masters:
Tue Feb 12 11:27:09 2013 - [info] New master is 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 11:27:09 2013 - [info] Starting master failover..
Tue Feb 12 11:27:09 2013 - [info]
From:
192.168.10.233 (current master)
 +--192.168.10.228
 +--192.168.10.229
 
To:
192.168.10.228 (new master)
 +--192.168.10.229
 
Starting master switch from 192.168.10.233(192.168.10.233:3306) to 192.168.10.228(192.168.10.228:3306)? (yes/NO): y
Tue Feb 12 11:27:15 2013 - [info] New master decided manually is 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 11:27:15 2013 - [info]
Tue Feb 12 11:27:15 2013 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Tue Feb 12 11:27:15 2013 - [info]
Tue Feb 12 11:27:15 2013 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Tue Feb 12 11:27:15 2013 - [info] Sending binlog..
saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog                                                                                                                        100%  126     0.1KB/s   00:00
Tue Feb 12 11:27:17 2013 - [info] scp from local:/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog to root@192.168.10.228:/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog succeeded.
Tue Feb 12 11:27:17 2013 - [info]
Tue Feb 12 11:27:17 2013 - [info] * Phase 3.4: Master Log Apply Phase..
Tue Feb 12 11:27:17 2013 - [info]
Tue Feb 12 11:27:17 2013 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Tue Feb 12 11:27:17 2013 - [info] Starting recovery on 192.168.10.228(192.168.10.228:3306)..
Tue Feb 12 11:27:17 2013 - [info]  Generating diffs succeeded.
Tue Feb 12 11:27:17 2013 - [info] Waiting until all relay logs are applied.
Tue Feb 12 11:27:17 2013 - [info]  done.
Tue Feb 12 11:27:17 2013 - [info] Getting slave status..
Tue Feb 12 11:27:17 2013 - [info] This slave(192.168.10.228)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000011:929). No need to recover from Exec_Master_Log_Pos.
Tue Feb 12 11:27:17 2013 - [info] Connecting to the target slave host 192.168.10.228, running recover script..
Tue Feb 12 11:27:17 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='repl' --slave_host=192.168.10.228 --slave_ip=192.168.10.228  --slave_port=3306 --apply_files=/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog --workdir=/tmp/mha --target_version=5.5.29-log --timestamp=20130212112641 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx
Tue Feb 12 11:27:17 2013 - [info]
Applying differential binary/relay log files /tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog on 192.168.10.228:3306. This may take long time...
Applying log files succeeded.
Tue Feb 12 11:27:17 2013 - [info]  All relay logs were successfully applied.
Tue Feb 12 11:27:17 2013 - [info] Getting new master's binlog name and position..
Tue Feb 12 11:27:17 2013 - [info]  mysql-bin.000011:3185
Tue Feb 12 11:27:17 2013 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.10.228', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=3185, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Feb 12 11:27:17 2013 - [info] Executing master IP activate script:
Tue Feb 12 11:27:17 2013 - [info]   /usr/bin/master_ip_failover --virtual_ip=192.168.10.234 --orig_master_vip_eth=eth0:234 --new_master_vip_eth=eth0:234 --command=start --ssh_user=root --orig_master_host=192.168.10.233 --orig_master_ip=192.168.10.233 --orig_master_port=3306 --new_master_host=192.168.10.228 --new_master_ip=192.168.10.228 --new_master_port=3306 --new_master_user='repl' --new_master_password='repl'
DEBUG PARAMETERS***********
command => start
ssh_user=s => root
orig_master_host => 192.168.10.233
orig_master_ip => 192.168.10.233
orig_master_port => 3306
new_master_host => 192.168.10.228
new_master_ip => 192.168.10.228
new_master_port => 3306
virtual_ip => 192.168.10.234
orig_master_vip_eth => eth0:234
new_master_vip_eth => eth0:234
Set read_only=0 on the new master.
Tue Feb 12 11:27:19 2013 - [info]  OK.
Tue Feb 12 11:27:19 2013 - [info] ** Finished master recovery successfully.
Tue Feb 12 11:27:19 2013 - [info] * Phase 3: Master Recovery Phase completed.
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] * Phase 4: Slaves Recovery Phase..
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] -- Slave diff file generation on host 192.168.10.229(192.168.10.229:3306) started, pid: 14227. Check tmp log /tmp/mha/192.168.10.229_3306_20130212112641.log if it takes time..
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] Log messages from 192.168.10.229 ...
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Tue Feb 12 11:27:19 2013 - [info] End of log messages from 192.168.10.229.
Tue Feb 12 11:27:19 2013 - [info] -- 192.168.10.229(192.168.10.229:3306) has the latest relay log events.
Tue Feb 12 11:27:19 2013 - [info] Generating relay diff files from the latest slave succeeded.
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Tue Feb 12 11:27:19 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] -- Slave recovery on host 192.168.10.229(192.168.10.229:3306) started, pid: 14229. Check tmp log /tmp/mha/192.168.10.229_3306_20130212112641.log if it takes time..
saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog                                                                                                                        100%  126     0.1KB/s   00:00
CHANGE MASTER TO MASTER_HOST='192.168.10.228', MASTER_PORT=3306, MASTER_USER='repl', MASTER_PASSWORD='repl', MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=3185Tue Feb 12 11:27:21 2013 - [info]
Tue Feb 12 11:27:21 2013 - [info] Log messages from 192.168.10.229 ...
Tue Feb 12 11:27:21 2013 - [info]
Tue Feb 12 11:27:19 2013 - [info] Sending binlog..
Tue Feb 12 11:27:20 2013 - [info] scp from local:/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog to root@192.168.10.229:/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog succeeded.
Tue Feb 12 11:27:20 2013 - [info] Starting recovery on 192.168.10.229(192.168.10.229:3306)..
Tue Feb 12 11:27:20 2013 - [info]  Generating diffs succeeded.
Tue Feb 12 11:27:20 2013 - [info] Waiting until all relay logs are applied.
Tue Feb 12 11:27:20 2013 - [info]  done.
Tue Feb 12 11:27:20 2013 - [info] Getting slave status..
Tue Feb 12 11:27:20 2013 - [info] This slave(192.168.10.229)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000011:929). No need to recover from Exec_Master_Log_Pos.
Tue Feb 12 11:27:20 2013 - [info] Connecting to the target slave host 192.168.10.229, running recover script..
Tue Feb 12 11:27:20 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='repl' --slave_host=192.168.10.229 --slave_ip=192.168.10.229  --slave_port=3306 --apply_files=/tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog --workdir=/tmp/mha --target_version=5.5.29-log --timestamp=20130212112641 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx
Tue Feb 12 11:27:21 2013 - [info]
Applying differential binary/relay log files /tmp/mha/saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog on 192.168.10.229:3306. This may take long time...
Applying log files succeeded.
Tue Feb 12 11:27:21 2013 - [info]  All relay logs were successfully applied.
Tue Feb 12 11:27:21 2013 - [info]  Resetting slave 192.168.10.229(192.168.10.229:3306) and starting replication from the new master 192.168.10.228(192.168.10.228:3306)..
Tue Feb 12 11:27:21 2013 - [info]  Executed CHANGE MASTER.
Tue Feb 12 11:27:21 2013 - [info]  Slave started.
Tue Feb 12 11:27:21 2013 - [info] End of log messages from 192.168.10.229.
Tue Feb 12 11:27:21 2013 - [info] -- Slave recovery on host 192.168.10.229(192.168.10.229:3306) succeeded.
Tue Feb 12 11:27:21 2013 - [info] All new slave servers recovered successfully.
Tue Feb 12 11:27:21 2013 - [info]
Tue Feb 12 11:27:21 2013 - [info] * Phase 5: New master cleanup phase..
Tue Feb 12 11:27:21 2013 - [info]
Tue Feb 12 11:27:21 2013 - [info] Resetting slave info on the new master..
Tue Feb 12 11:27:21 2013 - [info]  192.168.10.228: Resetting slave info succeeded.
Tue Feb 12 11:27:21 2013 - [info] Master failover to 192.168.10.228(192.168.10.228:3306) completed successfully.
Tue Feb 12 11:27:21 2013 - [info]
 
----- Failover Report -----
 
mha: MySQL Master failover 192.168.10.233 to 192.168.10.228 succeeded
 
Master 192.168.10.233 is down!
 
Check MHA Manager logs at local-dev1-vm004 for details.
 
Started manual(interactive) failover.
Invalidated master IP address on 192.168.10.233.
The latest slave 192.168.10.228(192.168.10.228:3306) has all relay logs for recovery.
Selected 192.168.10.228 as a new master.
192.168.10.228: OK: Applying all logs succeeded.
192.168.10.228: OK: Activated master IP address.
192.168.10.229: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.10.229: OK: Applying all logs succeeded. Slave started, replicating from 192.168.10.228.
192.168.10.228: Resetting slave info succeeded.
Master failover to 192.168.10.228(192.168.10.228:3306) completed successfully.
</dead_master_port></dead_master_ip>




完了(+・`ー’・)ドヤ

確認事項として次の2点を確認されます。
問題なければ「yes」を入力する。




#master(192.168.10.233)は死んでるの?エッ(゚Д゚≡゚Д゚)マジ?
Master 192.168.10.233 is dead. Proceed? (yes/NO): y
 
#masterを192.168.10.233から192.168.10.223に変えちゃうよ?(○ `人´ ○) タノンマスー!
Starting master switch from 192.168.10.233(192.168.10.233:3306) to 192.168.10.228(192.168.10.228:3306)? (yes/NO):




フェイルオーバー完了後に
作業用ディレクトリを見てみると、2つのファイルが作られているようです。

mha.failover.completeは、フェイルオーバーが完了したという判断用のファイルかな?
saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlogは、masterのバイナリーログかな?


# ls -alt /tmp/mha/
合計 36
drwxrwxrwx 3 root root 4096  2月 12 11:27 .
-rw-r--r-- 1 root root    0  2月 12 11:27 mha.failover.complete
-rw-r--r-- 1 root root  126  2月 12 11:27 saved_master_binlog_from_192.168.10.233_3306_20130212112641.binlog



各MySQLのサーバーにもリレーログとバイナリーログが作られていると思います。

mha-managerは、差分のリレーログをファイルオーバー時にslaveに適用してくれるので、
masterにあるバイナリーログを各slaveに転送しているみたいです。

slave(192.168.10.229)が新しいmasterを参照しているか確認する


mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.228
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql-bin.000011
          Read_Master_Log_Pos: 3185
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000011
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3185
              Relay_Log_Space: 409
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
1 row in set (0.00 sec)



問題ない(゚д゚)(。_。)(゚д゚)(。_。) ウンウン

次に、新master(192.168.10.228)が正しくmasterとし機能しているか確認する。
仮想IPが正しく設定されているか確認。(eth0:234があること)



#新master(192.168.10.228)で行う
# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:16:3E:47:5E:B8
          inet addr:192.168.10.228  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3237356 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1441341 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:993055512 (947.0 MiB)  TX bytes:4238095734 (3.9 GiB)
 
eth0:234  Link encap:Ethernet  HWaddr 00:16:3E:47:5E:B8
          inet addr:192.168.10.234  Bcast:192.168.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
 
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:413 errors:0 dropped:0 overruns:0 frame:0
          TX packets:413 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:56639 (55.3 KiB)  TX bytes:56639 (55.3 KiB)



次に、フェイルオーバー前はslaveだったので、
show slaveで表示してみて、何も表示されないことを確認する。


mysql> show slave status\G
Empty set (0.00 sec)


参照されているslaveの一覧を取得する。


mysql> show slave hosts;
+-----------+---------+------+-----------+
| Server_id | Host    | Port | Master_id |
+-----------+---------+------+-----------+
|        30 | slave30 | 3306 |        10 |
|        20 | slave20 | 3306 |        10 |
+-----------+---------+------+-----------+
2 rows in set (0.00 sec)
※落ちてるはずのslave30があるのはなぜ・・・?



read_onlyが0になっているか確認する


mysql> SELECT @@read_only;
+-------------+
| @@read_only |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)



masterの状態を確認する。


mysql> show master status;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000011 |     3185 |              |                  |
+------------------+----------+--------------+------------------+
1 row in set (0.00 sec)



どっちも問題なさそうです!ワーイヽ(゚∀゚)メ(゚∀゚)メ(゚∀゚)ノワーイ

実際に新master(192.168.10.228)にデータを入れて、レプリケーションされるか確認してみる。



#新master(192.168.10.228)で行う
mysql> use test;
Database changed
 
mysql> show tables;
Empty set (0.00 sec)
 
mysql> CREATE TABLE IF NOT EXISTS `dummy` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(30) NOT NULL,
  `insert_date` datetime NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;
 
mysql> INSERT INTO dummy( id, name, insert_date ) VALUES ( NULL, 'aaa', now() );
Query OK, 1 row affected (0.01 sec)
 
mysql> SELECT * FROM dummy;
+----+------+---------------------+
| id | name | insert_date         |
+----+------+---------------------+
|  1 | aaa  | 2013-02-09 11:24:52 |
+----+------+---------------------+
1 row in set (0.00 sec)




※各slaveで同じデータが登録されているか確認してください。

仮想IPでMySQLに接続できるか確認する。


# mysql -u repl -h 192.168.10.234 -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 17
Server version: 5.5.29-log MySQL Community Server (GPL) by Remi
 
Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
 
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.




検証のために落とした元master(192.168.10.233)をslaveとして起動させる。

masterだった状態で停止させたので、mha-managerには何もやられていないこともあり、
CHANGE MASTER TOコマンドを実行する必要があります。(´;ω;`)ウッ…


#元master(192.168.10.233)で行う
mysql> show slave status\G
Empty set (0.00 sec)



CHANGE MASTER TOコマンドを実行して再設定を行います。

192.168.10.229が新しいmasterに切り替えるため用に、
CHANGE MASTER TOを実行しているので、それと同じものを実行すると楽チン!( ̄ー ̄)ニヤリ

mha-managerは親切なことに、ログに実行すべきCHANGE MASTER TOコマンド出力されていますъ(゚Д゚)グッジョブ!!



Tue Feb 12 11:27:17 2013 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.10.228', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000011', MASTER_LOG_POS=3185, MASTER_USER='repl', MASTER_PASSWORD='xxx'



このログのMASTER_PASSWORDの部分を正しい値に書き換えて実行すれば良いだけになるようです。
その前にreset slave;を実行して置いた方が良いかも?


#元master(192.168.10.233)で行う
mysql> reset slave;


masterの情報を設定する。



#元master(192.168.10.233)で行う
mysql> CHANGE MASTER TO 
MASTER_HOST='192.168.10.228', 
MASTER_PORT=3306, 
MASTER_USER='repl', 
MASTER_PASSWORD='repl', 
MASTER_LOG_FILE='mysql-bin.000011', 
MASTER_LOG_POS=3185;



Master_HostやMaster_Log_Fileの情報が正しいか確認する。






#元master(192.168.10.233)で行う
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 192.168.10.228
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000011
          Read_Master_Log_Pos: 3185
               Relay_Log_File: mysql-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: mysql-bin.000011
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3185
              Relay_Log_Space: 107
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
1 row in set (0.00 sec)



同期を開始させる


#元master(192.168.10.233)で行う
mysql> start slave;



正常に同期が始まっているか確認する。
Slave_IO_Running、Slave_SQL_RunningがYesになっているか確認。


#元master(192.168.10.233)で行う
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.228
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000011
          Read_Master_Log_Pos: 3185
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000011
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3185
              Relay_Log_Space: 409
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
1 row in set (0.00 sec)




今回は必要最低限のオプションで実行したので、
次は、色々オプションをつけて確認したいと思います

現在のサーバーの状態

・mha-managerサーバー
 IP:192.168.10.227)
・masterサーバー
IP:192.168.10.228
VIP:192.168.10.234
・salve1
IP:192.168.10.229
・salve2
IP:192.168.10.233

その前に各オプションで変更される項目の現在の設定などの確認をしたいと思います。

–skip_disable_read_onlyの挙動確認のため、各サーバーでread_onlyを確認する
master(192.168.10.228)のread_only値を確認する。

mysql> SELECT @@read_only;
+-------------+
| @@read_only |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)


slave1(192.168.10.229),slave2(192.168.10.233)のread_only値を確認する。


mysql> SELECT @@read_only;
+-------------+
| @@read_only |
+-------------+
|           1 |
+-------------+
1 row in set (0.00 sec)


–remove_dead_master_confの挙動確認のため、/etc/mha.cnfの[server_XXX]セクションを確認する。


view /etc/mha.cnf
:
:
省略
:
:
[server1]
hostname=192.168.10.228
 
[server2]
hostname=192.168.10.229
 
[server3]
hostname=192.168.10.233



–master_state=deadを指定するので、実際にmaster(192.168.10.228)を停止させます。


# /etc/init.d/mysqld stop
mysqld を停止中:                                           [  OK  ]


mha-managerでフェイルオーバーを実行する


masterha_master_switch --master_state=dead --conf=/etc/mha.cnf --dead_master_host=192.168.10.228 --new_master_host=192.168.10.233 --skip_disable_read_only --skip_change_master --remove_dead_master_conf


あれ・・・エラー発生。なぜ (・・?)ナゼ?


Tue Feb 12 12:32:01 2013 - [error][/usr/lib/perl5/vendor_perl/MHA/MasterFailover.pm, ln298] Last failover was done at 2013/02/12 11:01:21. Current time is too early to do failover again. If you want to do failover, manually remove /tmp/mha/mha.failover.complete and run this script again.
Tue Feb 12 12:32:01 2013 - [error][/usr/lib/perl5/vendor_perl/MHA/ManagerUtil.pm, ln178] Got ERROR:  at /usr/bin/masterha_master_switch line 53


エラー内容をみると、
最後にフェイルオーバーしてからすぐにまたフェイルオーバーするなんておかしいよ!絶対ダメ( ‘д‘⊂彡☆))Д´) パーン

って、感じらしいです。

もしそれでもフェイルオーバーしたいなら/tmp/mha/mha.failover.completeを削除しろ(#゚Д゚)ゴルァ!!
って事らしいです。

そういえば、マニュアルを読んでた時にこのチェックを無効化するオプションなどがあったので、それを使ってみることに。

last_failover_minuteは、以前フェイルオーバーしてから再度実行する場合に、
どのくらい経過したら許可するかという分の指定らしいです。デフォルトでは480分の8時間で設定されている。

ignore_last_failoverは、フェイルオーバーを一度行ってから再度実行する場合に、
last_failover_minuteで指定した時間だけ経過時したかのチェックを行わない。

とりあえず、last_failover_minute=10を追加して実行してみる。
10という事は前回のフェイルオーバーから10分経過していれば実行するという意味になる。



# masterha_master_switch --master_state=dead --conf=/etc/mha.cnf --dead_master_host=192.168.10.228 --new_master_host=192.168.10.233 --skip_disable_read_only --skip_change_master --remove_dead_master_conf --last_failover_minute=10
--dead_master_ip=<dead_master_ip> is not set. Using 192.168.10.228.
--dead_master_port=<dead_master_port> is not set. Using 3306.
Tue Feb 12 12:39:12 2013 - [info] Reading default configuratoins from /etc/masterha_default.cnf..
Tue Feb 12 12:39:12 2013 - [info] Reading application default configurations from /etc/mha.cnf..
Tue Feb 12 12:39:12 2013 - [info] Reading server configurations from /etc/mha.cnf..
Tue Feb 12 12:39:12 2013 - [info] MHA::MasterFailover version 0.55.
Tue Feb 12 12:39:12 2013 - [info] Starting master failover.
Tue Feb 12 12:39:12 2013 - [info]
Tue Feb 12 12:39:12 2013 - [info] * Phase 1: Configuration Check Phase..
Tue Feb 12 12:39:12 2013 - [info]
Tue Feb 12 12:39:12 2013 - [info] Dead Servers:
Tue Feb 12 12:39:12 2013 - [info]   192.168.10.228(192.168.10.228:3306)
Tue Feb 12 12:39:12 2013 - [info] Checking master reachability via mysql(double check)..
Tue Feb 12 12:39:12 2013 - [info]  ok.
Tue Feb 12 12:39:12 2013 - [info] Alive Servers:
Tue Feb 12 12:39:12 2013 - [info]   192.168.10.229(192.168.10.229:3306)
Tue Feb 12 12:39:12 2013 - [info]   192.168.10.233(192.168.10.233:3306)
Tue Feb 12 12:39:12 2013 - [info] Alive Slaves:
Tue Feb 12 12:39:12 2013 - [info]   192.168.10.229(192.168.10.229:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 12:39:12 2013 - [info]     Replicating from 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 12:39:12 2013 - [info]   192.168.10.233(192.168.10.233:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 12:39:12 2013 - [info]     Replicating from 192.168.10.228(192.168.10.228:3306)
Master 192.168.10.228 is dead. Proceed? (yes/NO): y
Tue Feb 12 12:39:14 2013 - [info] ** Phase 1: Configuration Check Phase completed.
Tue Feb 12 12:39:14 2013 - [info]
Tue Feb 12 12:39:14 2013 - [info] * Phase 2: Dead Master Shutdown Phase..
Tue Feb 12 12:39:14 2013 - [info]
Tue Feb 12 12:39:15 2013 - [info] HealthCheck: SSH to 192.168.10.228 is reachable.
Tue Feb 12 12:39:16 2013 - [info] Forcing shutdown so that applications never connect to the current master..
Tue Feb 12 12:39:16 2013 - [info] Executing master IP deactivatation script:
Tue Feb 12 12:39:16 2013 - [info]   /usr/bin/master_ip_failover --virtual_ip=192.168.10.234 --orig_master_vip_eth=eth0:234 --new_master_vip_eth=eth0:234 --orig_master_host=192.168.10.228 --orig_master_ip=192.168.10.228 --orig_master_port=3306 --command=stopssh --ssh_user=root
DEBUG PARAMETERS***********
command => stopssh
ssh_user=s => root
orig_master_host => 192.168.10.228
orig_master_ip => 192.168.10.228
orig_master_port => 3306
virtual_ip => 192.168.10.234
orig_master_vip_eth => eth0:234
new_master_vip_eth => eth0:234
Tue Feb 12 12:39:16 2013 - [info]  done.
Tue Feb 12 12:39:16 2013 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Tue Feb 12 12:39:16 2013 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Tue Feb 12 12:39:16 2013 - [info]
Tue Feb 12 12:39:16 2013 - [info] * Phase 3: Master Recovery Phase..
Tue Feb 12 12:39:16 2013 - [info]
Tue Feb 12 12:39:16 2013 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Tue Feb 12 12:39:16 2013 - [info]
Tue Feb 12 12:39:16 2013 - [info] The latest binary log file/position on all slaves is mysql-bin.000011:3185
Tue Feb 12 12:39:16 2013 - [info] Latest slaves (Slaves that received relay log files to the latest):
Tue Feb 12 12:39:16 2013 - [info]   192.168.10.229(192.168.10.229:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 12:39:16 2013 - [info]     Replicating from 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 12:39:16 2013 - [info]   192.168.10.233(192.168.10.233:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 12:39:16 2013 - [info]     Replicating from 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 12:39:16 2013 - [info] The oldest binary log file/position on all slaves is mysql-bin.000011:3185
Tue Feb 12 12:39:16 2013 - [info] Oldest slaves:
Tue Feb 12 12:39:16 2013 - [info]   192.168.10.229(192.168.10.229:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 12:39:16 2013 - [info]     Replicating from 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 12:39:16 2013 - [info]   192.168.10.233(192.168.10.233:3306)  Version=5.5.29-log (oldest major version between slaves) log-bin:enabled
Tue Feb 12 12:39:16 2013 - [info]     Replicating from 192.168.10.228(192.168.10.228:3306)
Tue Feb 12 12:39:16 2013 - [info]
Tue Feb 12 12:39:16 2013 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase..
Tue Feb 12 12:39:16 2013 - [info]
Tue Feb 12 12:39:17 2013 - [info] Fetching dead master's binary logs..
Tue Feb 12 12:39:17 2013 - [info] Executing command on the dead master 192.168.10.228(192.168.10.228:3306): save_binary_logs --command=save --start_file=mysql-bin.000011  --start_pos=3185 --binlog_dir=/var/lib/mysql,/var/log/mysql --output_file=/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55
  Creating /tmp/mha if not exists..    ok.
 Concat binary/relay logs from mysql-bin.000011 pos 3185 to mysql-bin.000011 EOF into /tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog ..
  Dumping binlog format description event, from position 0 to 107.. ok.
  Dumping effective binlog data from /var/lib/mysql/mysql-bin.000011 position 3185 to tail(3204).. ok.
 Concat succeeded.
saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog                                                                             100%  126     0.1KB/s   00:00
Tue Feb 12 12:39:19 2013 - [info] scp from root@192.168.10.228:/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog to local:/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog succeeded.
Tue Feb 12 12:39:19 2013 - [info] HealthCheck: SSH to 192.168.10.229 is reachable.
Tue Feb 12 12:39:21 2013 - [info] HealthCheck: SSH to 192.168.10.233 is reachable.
Tue Feb 12 12:39:21 2013 - [info]
Tue Feb 12 12:39:21 2013 - [info] * Phase 3.3: Determining New Master Phase..
Tue Feb 12 12:39:21 2013 - [info]
Tue Feb 12 12:39:21 2013 - [info] Finding the latest slave that has all relay logs for recovering other slaves..
Tue Feb 12 12:39:21 2013 - [info] All slaves received relay logs to the same position. No need to resync each other.
Tue Feb 12 12:39:21 2013 - [info] 192.168.10.233 can be new master.
Tue Feb 12 12:39:21 2013 - [info] New master is 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 12:39:21 2013 - [info] Starting master failover..
Tue Feb 12 12:39:21 2013 - [info]
From:
192.168.10.228 (current master)
 +--192.168.10.229
 +--192.168.10.233
 
To:
192.168.10.233 (new master)
 +--192.168.10.229
 
Starting master switch from 192.168.10.228(192.168.10.228:3306) to 192.168.10.233(192.168.10.233:3306)? (yes/NO): y
Tue Feb 12 12:39:24 2013 - [info] New master decided manually is 192.168.10.233(192.168.10.233:3306)
Tue Feb 12 12:39:24 2013 - [info]
Tue Feb 12 12:39:24 2013 - [info] * Phase 3.3: New Master Diff Log Generation Phase..
Tue Feb 12 12:39:24 2013 - [info]
Tue Feb 12 12:39:24 2013 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Tue Feb 12 12:39:24 2013 - [info] Sending binlog..
saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog                                                                             100%  126     0.1KB/s   00:00
Tue Feb 12 12:39:26 2013 - [info] scp from local:/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog to root@192.168.10.233:/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog succeeded.
Tue Feb 12 12:39:26 2013 - [info]
Tue Feb 12 12:39:26 2013 - [info] * Phase 3.4: Master Log Apply Phase..
Tue Feb 12 12:39:26 2013 - [info]
Tue Feb 12 12:39:26 2013 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed.
Tue Feb 12 12:39:26 2013 - [info] Starting recovery on 192.168.10.233(192.168.10.233:3306)..
Tue Feb 12 12:39:26 2013 - [info]  Generating diffs succeeded.
Tue Feb 12 12:39:26 2013 - [info] Waiting until all relay logs are applied.
Tue Feb 12 12:39:26 2013 - [info]  done.
Tue Feb 12 12:39:26 2013 - [info] Getting slave status..
Tue Feb 12 12:39:26 2013 - [info] This slave(192.168.10.233)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000011:3185). No need to recover from Exec_Master_Log_Pos.
Tue Feb 12 12:39:26 2013 - [info] Connecting to the target slave host 192.168.10.233, running recover script..
Tue Feb 12 12:39:26 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='repl' --slave_host=192.168.10.233 --slave_ip=192.168.10.233  --slave_port=3306 --apply_files=/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog --workdir=/tmp/mha --target_version=5.5.29-log --timestamp=20130212123912 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx
Tue Feb 12 12:39:27 2013 - [info]
Applying differential binary/relay log files /tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog on 192.168.10.233:3306. This may take long time...
Applying log files succeeded.
Tue Feb 12 12:39:27 2013 - [info]  All relay logs were successfully applied.
Tue Feb 12 12:39:27 2013 - [info] Getting new master's binlog name and position..
Tue Feb 12 12:39:27 2013 - [info]  mysql-bin.000012:107
Tue Feb 12 12:39:27 2013 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.10.233', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000012', MASTER_LOG_POS=107, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Feb 12 12:39:27 2013 - [info] Executing master IP activate script:
Tue Feb 12 12:39:27 2013 - [info]   /usr/bin/master_ip_failover --virtual_ip=192.168.10.234 --orig_master_vip_eth=eth0:234 --new_master_vip_eth=eth0:234 --command=start --ssh_user=root --orig_master_host=192.168.10.228 --orig_master_ip=192.168.10.228 --orig_master_port=3306 --new_master_host=192.168.10.233 --new_master_ip=192.168.10.233 --new_master_port=3306 --new_master_user='repl' --new_master_password='repl'
DEBUG PARAMETERS***********
command => start
ssh_user=s => root
orig_master_host => 192.168.10.228
orig_master_ip => 192.168.10.228
orig_master_port => 3306
new_master_host => 192.168.10.233
new_master_ip => 192.168.10.233
new_master_port => 3306
virtual_ip => 192.168.10.234
orig_master_vip_eth => eth0:234
new_master_vip_eth => eth0:234
Set read_only=0 on the new master.
Tue Feb 12 12:39:28 2013 - [info]  OK.
Tue Feb 12 12:39:28 2013 - [info] ** Finished master recovery successfully.
Tue Feb 12 12:39:28 2013 - [info] * Phase 3: Master Recovery Phase completed.
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] * Phase 4: Slaves Recovery Phase..
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase..
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] -- Slave diff file generation on host 192.168.10.229(192.168.10.229:3306) started, pid: 14418. Check tmp log /tmp/mha/192.168.10.229_3306_20130212123912.log if it takes time..
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] Log messages from 192.168.10.229 ...
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info]  This server has all relay logs. No need to generate diff files from the latest slave.
Tue Feb 12 12:39:28 2013 - [info] End of log messages from 192.168.10.229.
Tue Feb 12 12:39:28 2013 - [info] -- 192.168.10.229(192.168.10.229:3306) has the latest relay log events.
Tue Feb 12 12:39:28 2013 - [info] Generating relay diff files from the latest slave succeeded.
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase..
Tue Feb 12 12:39:28 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] -- Slave recovery on host 192.168.10.229(192.168.10.229:3306) started, pid: 14420. Check tmp log /tmp/mha/192.168.10.229_3306_20130212123912.log if it takes time..
saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog                                                                             100%  126     0.1KB/s   00:00
Tue Feb 12 12:39:30 2013 - [info]
Tue Feb 12 12:39:30 2013 - [info] Log messages from 192.168.10.229 ...
Tue Feb 12 12:39:30 2013 - [info]
Tue Feb 12 12:39:28 2013 - [info] Sending binlog..
Tue Feb 12 12:39:29 2013 - [info] scp from local:/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog to root@192.168.10.229:/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog succeeded.
Tue Feb 12 12:39:29 2013 - [info] Starting recovery on 192.168.10.229(192.168.10.229:3306)..
Tue Feb 12 12:39:29 2013 - [info]  Generating diffs succeeded.
Tue Feb 12 12:39:29 2013 - [info] Waiting until all relay logs are applied.
Tue Feb 12 12:39:29 2013 - [info]  done.
Tue Feb 12 12:39:29 2013 - [info] Getting slave status..
Tue Feb 12 12:39:29 2013 - [info] This slave(192.168.10.229)'s Exec_Master_Log_Pos equals to Read_Master_Log_Pos(mysql-bin.000011:3185). No need to recover from Exec_Master_Log_Pos.
Tue Feb 12 12:39:29 2013 - [info] Connecting to the target slave host 192.168.10.229, running recover script..
Tue Feb 12 12:39:29 2013 - [info] Executing command: apply_diff_relay_logs --command=apply --slave_user='repl' --slave_host=192.168.10.229 --slave_ip=192.168.10.229  --slave_port=3306 --apply_files=/tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog --workdir=/tmp/mha --target_version=5.5.29-log --timestamp=20130212123912 --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.55 --slave_pass=xxx
Tue Feb 12 12:39:30 2013 - [info]
Applying differential binary/relay log files /tmp/mha/saved_master_binlog_from_192.168.10.228_3306_20130212123912.binlog on 192.168.10.229:3306. This may take long time...
Applying log files succeeded.
Tue Feb 12 12:39:30 2013 - [info]  All relay logs were successfully applied.
Tue Feb 12 12:39:30 2013 - [info] Skipping change master and start slave
Tue Feb 12 12:39:30 2013 - [info] End of log messages from 192.168.10.229.
Tue Feb 12 12:39:30 2013 - [info] -- Slave recovery on host 192.168.10.229(192.168.10.229:3306) succeeded.
Tue Feb 12 12:39:30 2013 - [info] All slave servers are applied logs successfully.
Tue Feb 12 12:39:30 2013 - [info]
Tue Feb 12 12:39:30 2013 - [info] Master failover to 192.168.10.233(192.168.10.233:3306) completed successfully.
Tue Feb 12 12:39:30 2013 - [info] Deleted server1 entry from /etc/mha.cnf .
Tue Feb 12 12:39:30 2013 - [info]
 
----- Failover Report -----
 
mha: MySQL Master failover 192.168.10.228 to 192.168.10.233 succeeded
 
Master 192.168.10.228 is down!
 
Check MHA Manager logs at local-dev1-vm004 for details.
 
Started manual(interactive) failover.
Invalidated master IP address on 192.168.10.228.
The latest slave 192.168.10.229(192.168.10.229:3306) has all relay logs for recovery.
Selected 192.168.10.233 as a new master.
192.168.10.233: OK: Applying all logs succeeded.
192.168.10.233: OK: Activated master IP address.
192.168.10.229: This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.10.229: OK: Applying all logs succeeded.
Master failover to 192.168.10.233(192.168.10.233:3306) completed successfully.
</dead_master_port></dead_master_ip>



(゚∀゚)キタコレ!!

それでは、各オプションが正常に機能しているか確認してみます(・д・)ジーッ

まず、–remove_dead_master_confオプションの確認。




view /etc/mha.conf
[server default]
log_level=info
manager_log=/tmp/mha/log/mha.log
manager_workdir=/tmp/mha
master_ip_failover_script=/usr/bin/master_ip_failover --virtual_ip=192.168.10.234 --orig_master_vip_eth=eth0:234 --new_master_vip_eth=eth0:234
master_ip_online_change_script=/usr/bin/master_ip_online_change --virtual_ip=192.168.10.234 --orig_master_vip_eth=eth0:234 --new_master_vip_eth=eth0:234
password=repl
remote_workdir=/tmp/mha
repl_password=repl
repl_user=repl
ssh_user=root
user=repl
 
[server2]
hostname=192.168.10.229
 
[server3]
hostname=192.168.10.233



ちゃんとmaster(192.168.10.228)が定義されていた [server1]消えてる消えてるスゲ━━━━━━ヽ(゚Д゚)ノ━━━━━━!!!!

って、自分が書いてたコメントとかも消えてるじゃん(つд⊂)ゴシゴシ
並び順も変わっとるがな:(;゙゚’ω゚’):

こ、こここ、これは仕方ないよね!!(b´∀`)ネッ!

次、は–skip_disable_read_onlyオプションの確認。
slave1(192.168.10.229)はslaveのままなのでread_only=1のままでOK!



mysql> SELECT @@read_only;
+-------------+
| @@read_only |
+-------------+
|           1 |
+-------------+
1 row in set (0.00 sec)


slave2(192.168.10.233)はnew_master_hostで指定されているのでmasterになる予定のサーバーだけど、
–skip_disable_read_onlyが設定されているからread_only=1のままかと思いきや0になっていました。



mysql> SELECT @@read_only;
+-------------+
| @@read_only |
+-------------+
|           0 |
+-------------+
1 row in set (0.00 sec)



これは、master_ip_failoverで実行されている新masterに対してread_only=0の実行が優先されているってことなのかな?

次は、–skip_change_masterオプションの確認
slave1(192.168.10.229)、slave2(192.168.10.233)ともに、
show slave statusの結果がどちらともに次のようになっていました。

CHANGE MASTER TOコマンドが実行されていないために、エラーが出てしまっている状況になっている。



mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 192.168.10.228
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql-bin.000011
          Read_Master_Log_Pos: 3185
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000011
             Slave_IO_Running: No
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 3185
              Relay_Log_Space: 409
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'repl@192.168.10.228:3306' - retry-time: 10  retries: 86400
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
1 row in set (0.00 sec)



以上の結果、オプション的には正しく動いていそうです!(゚д゚)(。_。)(゚д゚)(。_。) ウンウン

–skip_change_masterオプションのおかげで、
slaveで本来実行してくれているCHANGE MASTER TOコマンドが実行されていないので、
先程と同じようにログからCHANGE MASTER TOコマンドを見つけ出して、実行します。
※MASTER_PASSWORDがxxxになっているので正しい値を設定して実行すること。


Tue Feb 12 12:39:27 2013 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='192.168.10.233', MASTER_PORT=3306, MASTER_LOG_FILE='mysql-bin.000012', MASTER_LOG_POS=107, MASTER_USER='repl', MASTER_PASSWORD='xxx';


slave1(192.168.10.229)で実行する。


CHANGE MASTER TO 
MASTER_HOST='192.168.10.233', 
MASTER_PORT=3306, 
MASTER_LOG_FILE='mysql-bin.000012', 
MASTER_LOG_POS=107, 
MASTER_USER='repl', 
MASTER_PASSWORD='repl';


同期を開始する。


mysql> start slave;
Query OK, 0 rows affected (0.00 sec)


slaveの状態を確認する。


mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.10.233
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 10
              Master_Log_File: mysql-bin.000012
          Read_Master_Log_Pos: 107
               Relay_Log_File: mysql-relay-bin.000002
                Relay_Log_Pos: 253
        Relay_Master_Log_File: mysql-bin.000012
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 107
              Relay_Log_Space: 409
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 30
1 row in set (0.00 sec)


slave1(192.168.10.229)復旧キタ――(゚∀゚)――!!

新master(192.168.10.233)で確認してみる。
さっきまでなかった参照されているslave一覧が表示される。


mysql> show slave hosts;
+-----------+---------+------+-----------+
| Server_id | Host    | Port | Master_id |
+-----------+---------+------+-----------+
|        20 | slave20 | 3306 |        30 |
+-----------+---------+------+-----------+
1 row in set (0.00 sec)


新しいmaster(192.168.10.233)に実際にデータの登録を行ってみる。


#
mysql> INSERT INTO dummy( id, name, insert_date ) VALUES ( NULL, 4, now() );




slave1(192.168.10.229)で正しく同期されているか確認する。


#
mysql> SELECT * FROM dummy;
+----+------+---------------------+
| id | name | insert_date         |
+----+------+---------------------+
|  1 | aaa  | 2013-02-09 11:24:52 |
|  2 | aaa1 | 2013-02-09 12:35:30 |
|  3 | 3    | 2013-02-09 13:36:23 |
|  4 | 4    | 2013-02-09 14:16:02 |
+----+------+---------------------+


次に、停止していた元master(192.168.10.228)を起動する。

/etc/init.d/mysqld start


元master(192.168.10.228)は落ちていた&新しいmasterへの設定を行っていないため、 先程追加したデータは繁栄されていないことを確認する。


#
mysql> use test;
Database changed
 
#
mysql> SELECT * FROM dummy;
+----+------+---------------------+
| id | name | insert_date         |
+----+------+---------------------+
|  1 | aaa  | 2013-02-09 11:24:52 |
|  2 | aaa1 | 2013-02-09 12:35:30 |
|  3 | 3    | 2013-02-09 13:36:23 |
+----+------+---------------------+
3 rows in set (0.06 sec)


元master(192.168.10.228)に新しいmasterを参照させる。


CHANGE MASTER TO 
MASTER_HOST='192.168.10.233', 
MASTER_PORT=3306, 
MASTER_LOG_FILE='mysql-bin.000012', 
MASTER_LOG_POS=107, 
MASTER_USER='repl', 
MASTER_PASSWORD='repl';


同期を開始させる。


mysql> start slave;
Query OK, 0 rows affected (0.00 sec)


同期が開始されたことによって、データの同期が取られる。


mysql> SELECT * FROM dummy;
+----+------+---------------------+
| id | name | insert_date         |
+----+------+---------------------+
|  1 | aaa  | 2013-02-09 11:24:52 |
|  2 | aaa1 | 2013-02-09 12:35:30 |
|  3 | 3    | 2013-02-09 13:36:23 |
|  4 | 4    | 2013-02-12 13:16:02 |
+----+------+---------------------+
4 rows in set (0.00 sec)


一応、仮想IPで接続できるか確認する。


# mysql -u repl -h 192.168.10.234 -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 18
Server version: 5.5.29-log MySQL Community Server (GPL) by Remi
 
Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
 
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
mysql> show slave hosts;
+-----------+---------+------+-----------+
| Server_id | Host    | Port | Master_id |
+-----------+---------+------+-----------+
|        20 | slave20 | 3306 |        30 |
|        10 |         | 3306 |        30 |
+-----------+---------+------+-----------+
2 rows in set (0.00 sec)


これで、無事にレプリケーションの環境が元に戻りましたワーイヽ(゚∀゚)メ(゚∀゚)メ(゚∀゚)ノワーイ
あ、/etc/mha.cnfの[server_XXX]セクションが消えてしまっているので、
そちらを元に戻すのを忘れずに:(;゙゚’ω゚’):
元に戻さないとmha-managerの監視対象にならないので・・・orz

他のオプションも検証してみたいけど、今回は以上(`・ω・´)ゞビシッ!!
次回「masterha_master_switchについて(後編)」では、–master_state=aliveが指定されている場合のフェイルオーバーを行いたいと思います(ΦωΦ)フフフ…

この記事を書いた人

nakajima
nakajima
企画開発部のエンジニアです。プログラム、サーバーもどっちも楽しくて好きですが最近はサーバーの方がメインになってる気がします。