Sunday at midnight, we had scheduled some downtime for Wufoo to transfer our current setup to some new servers in another facility. We, unfortunately, ran into some unexpected problems, and the downtime spilled over into Monday morning. For that, we offer our most sincere apologies. The purpose of this message is to let you know that all accounts are now up and functioning properly, and that we are a bit embarrassed at the level of service we provided this morning.
For those of you interested in the technical details, here’s what happened. Wufoo is unique in that each form created is essentially its own database setup. Because of this, we have some complicated and non-traditional procedures to follow when we make server changes like today’s. When we moved the accounts to the new servers, the estimated transfer time was about 55 minutes. The first transfer completed on time, but due to our setup, certain configuration files were corrupted in the process. After that first transfer, it took us three additional transfers plus some digging to isolate and fix the problem.
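The per-form architecture described above can be sketched roughly like this. This is a hypothetical illustration using SQLite; Wufoo’s actual stack, schema, and function names are not public here. The point is simply that a migration means moving and validating many independent databases rather than one shared one:

```python
import os
import sqlite3
import tempfile

# Hypothetical sketch: each form gets its own independent database,
# so a server migration transfers (and must validate) many separate
# databases -- a corrupted config file can break some forms but not others.
def provision_form_db(base_dir, form_id):
    path = os.path.join(base_dir, "form_%d.db" % form_id)
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY, field_name TEXT, value TEXT)"
    )
    conn.commit()
    conn.close()
    return path

base = tempfile.mkdtemp()
paths = [provision_form_db(base, i) for i in range(3)]
print(len(set(paths)))  # 3 -- one independent database file per form
```

The upside of this isolation is that one form’s data can’t interfere with another’s; the downside, as the outage showed, is that bulk operations across every form are slower and have many more moving parts.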
We’ll be the first to tell you that we have no excuses to justify all the things that went wrong today, and we have learned quite a few things from the experience. In the future, we will be running maintenance on Friday / Saturday night to prevent any chance of the work week being affected, and we will always test as much as we can ahead of time given our resources. Additionally, we’re working on a way for you to be better notified in these situations. We have to present proper error messages on your sites to keep not just you, but your users informed as well — there will be no cryptic database messages in the future from Wufoo.
Thanks to everyone who was patient and understanding while we worked on this. You guys were great. We, obviously, have a lot of work to do now, but we assure you that this isn’t a common thing for us. For our recent crop of users and customers, we hope you don’t judge us by these hiccups.
Again, we can’t thank you enough for your understanding.
Absolutely no problems from me. Your customer service is exemplary and I really appreciate the update. Onwards and upwards!
Posted May 14th, 2007 by Dave Foy.
Excellent communication and transparency. Glad you surmounted the problems, and thanks for sharing the details.
Posted May 14th, 2007 by Anton Zuiker.
I agree. It’s almost like… I don’t so much mind problems as not knowing if they will be fixed soon. Silence is the WORST! Awesome job on the emails, Team Wufoo.
Posted May 14th, 2007 by David Kypuros.
I appreciate your friendly and honest approach. It was not a big deal for us, but it is nice to find out what happened anyways (because I’m a computer nerd). Keep up the great work!
Posted May 14th, 2007 by Justin.
To any new wufoo customers out there I just want to say these guys are outstanding. Definitely do not judge the service by this hiccup. We’ve been using wufoo at auditoriumA.com to power all of our contact forms since they launched and have loved the service. You will too.
Posted May 14th, 2007 by Tony Mars.
What happened yesterday afternoon? And am I supposed to be getting email updates automatically from you all when the server goes down?
Thanks,
Posted May 14th, 2007 by Martha Hardy.
We’ll always email you if we are expecting an outage, but yesterday was a freak accident. We’re still receiving some information on the problem, but it appears that a hardware failure at our datacenter knocked out a number of companies.
Posted May 14th, 2007 by Chris Campbell.
It would be great to get an email update when something like yesterday happens. I was in the middle of recommending wufoo to an associate (literally when it went down – went to the site to show it off and… nothing!). I’d also just gone live with a form for a tournament.
I have added the wufoo url to one of my monitoring services, but think it would be great if you guys sent out email/rss updates – even after you were back up.
I will continue to use wufoo because you guys do what you do better than any other service like this, but I’m crossing my fingers we don’t see too many outages like yesterday’s.
Posted May 14th, 2007 by Linda Irvine.
I agree, and the problem yesterday was that our email server, which is separate from the live server, was also knocked out. Basically, everything on that line went down. If it had been a hardware failure on the main server, we would have been able to access the other server and send out the mail. What we should probably do is use a 3rd party app like CampaignMonitor to handle emergency emails in that situation.
Posted May 14th, 2007 by Chris Campbell.
An explanation of Monday’s outage.
Today, June 4, 2007 at 10:31AM PDT, one of our upstream providers suffered a hardware failure on a line card in their router. Despite their previous assurances that a) their infrastructure was fully redundant and b) a failure of one of our uplinks would not correlate with a failure of the others, this single component proved to be a single point of failure that brought down all of our connectivity.
(all times in Pacific Daylight Time)
10:31 connectivity on all of BitPusher’s uplinks is lost
10:35 BitPusher starts investigating the problem
10:40 BitPusher discovers that the problem is upstream from BitPusher, problem escalated with vendors
11:05 BitPusher arrives at the data center and determines that everything within BitPusher’s infrastructure is behaving correctly
11:45 vendor gives erroneous assessment that the root cause was a denial of service attack
12:30 vendor corrects the assessment and identifies a hardware failure as root cause (a router line card)
12:30 BitPusher starts discussions with another vendor for redundant connectivity (but this is normally a week or longer process)
13:00 (approx) replacement line card is dispatched from vendor (needs to be flown to Seattle). Meanwhile, vendor continues to try to get services running with broken line card. Some brief periods of functioning service occur.
14:35 line card has been replaced and service is restored
15:03 backlogged incoming e-mail for bitpusher.com and customer domains, which was spooled on a BitPusher server in Texas, is delivered
BitPusher has three network uplinks. Our primary uplink involves a layer 2 connection from our facility (Fortress) to the Westin Building, one of the main peering centers in Seattle. At the Westin we get transit from Internap, a premium ISP. The layer 2 connection is provided by Mzima.
We also have two connections directly to Mzima at Fortress, providing layer 2 and Internet service.
Per our previous discussions with Mzima, we believed that their network was fully redundant, and that any failures that would cause our Mzima Internet service to fail would not cause our layer 2 connection to the Westin to fail.
In reality, the failure of a single line card in a single Mzima router caused both our layer 2 connection to the Westin to go down and all of Mzima’s Internet service at Fortress and the Westin to go down.
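The correlated failure described above can be illustrated with a small sketch (the component names here are hypothetical, chosen only to mirror the topology in the text): model each uplink as the set of components it depends on, then intersect those sets to find components every path shares.

```python
# Hypothetical sketch of the failure mode described above: three uplinks
# that look redundant but all depend on the same router line card.
uplinks = {
    "westin_layer2":  {"mzima_linecard", "westin_fiber", "internap_transit"},
    "mzima_direct_1": {"mzima_linecard", "mzima_port_1"},
    "mzima_direct_2": {"mzima_linecard", "mzima_port_2"},
}

# Any component present in every uplink's dependency set is a single
# point of failure for all connectivity, no matter how many uplinks exist.
shared = set.intersection(*uplinks.values())
print(sorted(shared))  # ['mzima_linecard']
```

This is why vendor assurances of redundancy are worth verifying against the actual dependency graph: redundancy only holds if the intersection of the paths’ dependencies is empty.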
The line card was mostly but not completely non-functional, which is why some of you might have noticed brief periods in the middle of the outage when services worked.
The resolution required the replacement of the line card, and Mzima did not have a spare line card in Seattle. A replacement line card was flown in from the vendor (Force10). Once the line card was replaced, network connectivity was reestablished without any problems.
During the outage BitPusher started working with another vendor to try to arrange additional/alternate connectivity. This is usually a much longer process, though, and we were not able to expedite it sufficiently to make it happen today.
Prevention and Mitigation: Short Term
The vendor has now started building a stock of spare parts in Seattle in case of future component failures.
Prevention and Mitigation: Long Term
We are working with another vendor to provide connectivity at Fortress (layer 2 and Internet service) that will not depend on any Mzima network equipment. This should be up by the end of the week.
We are also in the process of reviewing our architecture to find any other vulnerabilities we may have based on vendor assertions that have not been carefully verified.
Posted May 14th, 2007 by Michael T. Halligan.