This post is to follow up with our users and subscribers and let them know about the short outage we had yesterday, Saturday, March 19. Most of this post is going to be relatively technical, so if that’s not your cup of tea, you can rest assured that Rdio is back up and functioning.
At 3:28 PM our operations team was notified that the site was returning errors and not functioning properly. The on-call operations engineer logged in and discovered pretty quickly that an entire rack in our colo facility had disappeared off the network. All of our other racks were reachable and functioning. We later discovered that the switch on that rack had simply died and become completely unreachable.
We immediately put up a maintenance page letting our users know that we were working on the problem, and began looking at what was on that rack and why it would have caused an outage. Fortunately nothing critical was on the lost rack. After promoting a Redis slave to a master and restarting a few backend services, we thought we were ready to turn everything back on. However, there was a minor issue with Django (one of the tools we use) that prevented everything from coming back up properly. Django requires every database defined for an application to be up and accepting connections, even if a given configuration of the application never uses it. We have a lot of different database clusters, and one is used only by our ingestion system. The master database for that cluster was on the failed rack, and in our rush to get the service back up we hadn’t changed its configuration. Once we figured that out, we made a quick change, brought everything back up, and the site returned to normal.
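For readers curious what that quick change might look like: Django ships a dummy database backend that satisfies the "every defined database must be configured" requirement without needing a live server behind it. This is a minimal sketch, not our actual settings; the `ingestion` alias, hostnames, and engine choice are illustrative assumptions.

```python
# settings.py -- a minimal sketch. The "ingestion" alias, names, and
# hosts below are hypothetical, not Rdio's real configuration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "main",
        "HOST": "db-master.example.internal",
    },
    # The ingestion cluster's master was on the dead rack. Pointing the
    # unused alias at Django's dummy backend lets the web tier start
    # without that cluster being reachable.
    "ingestion": {
        "ENGINE": "django.db.backends.dummy",
    },
}
```

Any code path that actually touched the `ingestion` alias would still fail, but the rest of the site can come back up in the meantime.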
This all sounds like it happened really fast, but because it was Saturday most of us were at home watching NCAA basketball (Go Heels!) or out getting our kid’s hair cut. We primarily used IM for communication during the incident, which added latency and made it harder to work together. If this had happened on a weekday, the site would probably have been down for 20 minutes instead of 73.
After we brought it back up, we went back and looked at exactly what caused the site to go down, considering there was nothing mission-critical in the dead rack. It turns out that, due to how one of our backend components is structured, if it is waiting on data from one of its datasources and that machine turns into a network black hole, it doesn’t handle the connection failure well: it blocks that entire connection all the way up to the worker processes serving the web request. We have some connection timeouts in place, but they aren’t nearly low enough to prevent everything from stacking up and causing worker process starvation. We think that was the root cause of the outage.
So, what did we learn? Well, we learned a lot of stuff. First, we need to have a conference bridge available for diagnosing these issues, because they happen at all hours, and IM has much higher latency than a phone call. Second, we have redundant equipment for our core router, and that’s great, but we need redundancy all the way down the line to help prevent problems. Third, we need to harden our systems against machines blinking off the network, and make sure our backend systems exit quickly in error conditions instead of locking up retrying unavailable resources for too long.
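That third lesson, exiting quickly instead of retrying a dead resource forever, can be sketched as a retry loop with an overall deadline. Again, this is a hypothetical sketch with made-up names, not our production code.

```python
import time

def call_with_deadline(op, deadline_s=1.0, backoff_s=0.05):
    """Retry op() on connection errors, but give up once the overall
    deadline would be exceeded, so a dead dependency can't pin a
    worker process for long."""
    start = time.monotonic()
    while True:
        try:
            return op()
        except ConnectionError:
            # If even one more backoff would blow the deadline,
            # surface the error now instead of locking up.
            if time.monotonic() - start + backoff_s > deadline_s:
                raise
            time.sleep(backoff_s)
            backoff_s *= 2  # exponential backoff between attempts

def dead_backend():
    raise ConnectionError("rack is a network black hole")

try:
    call_with_deadline(dead_backend, deadline_s=0.3)
    outcome = "hung"
except ConnectionError:
    outcome = "gave up fast"
```

The key design choice is budgeting time for the whole operation rather than per attempt: a per-attempt timeout with unbounded retries still adds up to a stuck worker.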
Software is hard. Dead networking gear sucks. Managing worker process starvation and network timeouts is key. Now we know (better). Sorry about that. The silver lining is that if you have synced music to your mobile phone, you could still play it.