RightScale Blog

Cloud Management Blog
Cloud Management Blog

Surviving a MySQL Master Database Crash

Nothing is more heart-arresting than to find out that your database machine has died. Site down. Data gone. Life s...

That's what happened to one of our customers yesterday morning, right when they were featured on some prominent sites. The Amazon EC2 instance hosting their master DB died. Fortunately they had tested the master-slave setup using our Manager for MySQL, so everything was set up to recover quickly. They IMed me so I could help should things go wrong. We waited a couple of minutes to see whether the machine was just rebooting, but to no avail. So we hit the "promote to master" on the slave instance, and here's the log of what happened:

[2007-08-21 16:24:45] [ServerActionsWorker] : Executing: 'Executing action: DB promote to master' 
[2007-08-21 16:24:46] [ServerActionsWorker] : Using MasterDB DNS ID: 2577432 . 
[2007-08-21 16:24:46] [ServerActionsWorker] : Using SlaveDB DNS ID: 2577433 . 
[2007-08-21 16:24:54] [ServerActionsWorker] : No slave argument given...assuming localhost 
Using C interface for mysql, client version 5.0.22 
Server doesn't appear to be logging binary logs, configuring and restarting server with binary logging 
Locking slave (and enabling writes) 
[2007-08-21 16:28:04] [ServerActionsWorker] : Process 7927 has the lock. terminating others. 
Written read_only changes to new master conf file 
Stopping master (if alive), noting position, making RO, stopping and unconfiguring replication 
Previously connected master db-p-master.company.com not reachable... 
...Warning: assuming old master is dead and that the current contents of the Slave is the latest and best we can get. 
Promoting slave... 
Waiting until it catches up (if alive), stopping and unconfiguring replication, 
unlocking tables and setting up replication privileges 
Retrieved new master info...File: mysql-bin.000001 position: 98 
Stopping slave and misconfiguring master 
granting rep rights... 
done with rights... 
Unlocking tables 
Demoting old master... 
Changing Master DB DNS... 
OK. Result: DNSID 2577432 set to this instance IP: 10.255.47.70 
Mission accomplished. 
[2007-08-21 16:28:04] [ServerActionsWorker] : Server action successully completed

Woot! The slave promoted to master just fine. At that point we had to bounce the Mongrel servers because, as far as we can tell, ActiveRecord just doesn't switch to the new DNS entry for the database in any reasonable amount of time. After verifying that the site was back up and fixing an ancilliary server that wasn't pointing to the proper database DNS entry, we launched a fresh slave with another button press.

MySQL after Failure

Phew, all this within about a half hour, including initial reaction and troubleshooting time and followup cleanup work. Everything we put in place with Manager for MySQL worked like a charm!


Archived Comments

Lox “Stopping slave and misconfiguring master” ?? :)

Thorsten Yes, “Stopping slave and misconfiguring master” Don’t you love the comments developers put into their code :-). This refers to stopping the replication on the slave so it can become the master, and mis-configuring the master to ensure nothing continues to talk to it.

Ian All the notes in there mentioning DNS: I assume you have a facility setup that ensures when a new DB master goes live that all the app servers are automagically updated with the new settings? I’d love to learn more about how you do that! I’m very interested in using your Manager for MySQL for my company, once we’re ready to release. I’ve been using the free version of RightScale and love it!

Thorsten Ian, thanks for the interest! We use DNSmadeeasy.com for our DNS services. They make it easy to create dynamic DNS entries that can be updated automatically when an instance boots. So when the master DB boots or a slave DB promotes to master it updates the appropriate DNS entry (e.g. db-master.rightscale.com). We set the TTL on that DNS entry to 75 seconds, which is the lowest supported by dnsmadeeasy.com for the type of account we have with them. We have found no problems in the propagation of such a change, and in the case of the DB this switch-over time is fine considering everything else that needs to happen. We are having trouble, however, in convincing Rails’ ActiveRecord to switch.  We’re still tracking down some scenarios where restarting the Rails app is required.

Comments

Nice work! I'm interested in how do you make initial import of the data for MySQL slave. I guess you reset the binary log position on master, then do regular mysqldump/import from master to slave and then you start the master-slave replication? How do you solve the data consistency in dump without long-term locking on master side?
Spes, the initial import is done by taking an LVM snapshot on the master and copying the raw files over to the slave. For the snapshot we lock all tables in mysql, freeze xfs, then lvm snapshot. We rsync the data over to the slave and then fire it up. We have found this to be the fastest way to get the data over. A dump & import would take much longer and much more cpu resources.

Post a comment