Exclusive: TypePad outage update and details

Six Apart executive status report

Popular blog hosting provider TypePad.com was unavailable to its members for 18 hours today as a result of a failed storage upgrade during scheduled maintenance late last night. I visited Six Apart’s headquarters in San Francisco earlier this evening to talk with Six Apart employees about the events that transpired and the current state of TypePad’s blogging platform.

I sat down for an interview with Anil Dash, Six Apart’s Vice President of Professional Products, about an hour after TypePad’s servers came back online. We discussed TypePad’s reliability issues over the past two months, what went wrong today, and how the company plans to prevent future outages and create a reliable service.

Anil and I had a candid conversation about the difficulties of building a large-scale hosted service, the uniqueness of TypePad’s members, future plans for the TypePad product including Project Comet and a business-level offering, and how the company approaches scalability and building dependable applications that people love to rely on as a utility.

The entire 23-minute interview with Anil Dash on TypePad service issues is available as a 10.4 MB MP3 audio download. Any rumbling you may hear is the sound of TypePad engineers running back and forth nearby and literally shaking the entire room.

Each paragraph in the interview transcript below is uniquely identified in the page markup. To link directly to a paragraph, just add “#” and the question and paragraph id found in the source.

Interview Transcript

Niall:

Hi I’m Niall Kennedy and I’m here with Anil Dash of Six Apart. Anil is VP of Professional Products and we’re going to talk about TypePad and some of the issues that happened today, Friday, and last night. Hi Anil.

Anil:

Hello.

Niall:

Can you tell us a bit about what happened last night to cause a loss of service to TypePad?

Anil:

Yeah. I think the first thing I would start with is no data was lost and the service is back up and running now. A couple weeks ago, as you well know, we had some problems with capacity and scalability as TypePad has been getting more popular. We made a bunch of commitments in terms of improving bandwidth, improving power, and things like that. Among those changes was improving our storage infrastructure and one part of that was increasing redundancy. We actually have accomplished most of the things we promised we would do on that list and one of the last things to do was to improve redundancy on storage.

We were bringing online another storage system and it’s one of those things where you have a moment of vulnerability while you’re improving the system and we had a failure during that point. So we had what we thought would be a fairly routine scheduled maintenance last night and we had posted an alert about that so people knew ahead of time “oh, OK, this is why this blip is happening” and right at the end of that window of maintenance we had a failure of the new storage system.

The process at that point was understanding what had happened and whether there were any issues of data loss, which fortunately there were not. The first step that was decided — and this is something I am sure will be avidly debated — is that we would go to the last full snapshot. We have incremental backups and we have full snapshots and most people I think that are listening to this probably know the difference. The full snapshot was anywhere from a few hours to a few days old, depending on people’s blogs. We then did a verify on the data that we were restoring.

As we brought the disks back up — the process is like the olden days of Windows 98 when you would get that reboot into the blue screen saying “I am running checkdisk and please wait forever while I check your hard drive.” It was doing that process and this is what we started doing last night and all through the night and in the interim we had those old snapshots of pages up. I think a lot of the people who were looking at their blogs today or posting had that alarm of “oh my gosh, old data” and “does this mean old posts are safe and my new posts are not?” It was making sure that was not the case that took a lot of time today. And then, of course, eliminating single points of failure, identifying all the other problems. Between all those steps together it took until this afternoon Pacific Time before we were able to switch the app back on.
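For readers less familiar with the distinction Anil draws between full snapshots and incremental backups, here is a minimal sketch in Python. It is a toy model and not a description of Six Apart's actual backup system: the `BackupStore` class, its method names, and the sample post ids are all hypothetical. It only illustrates why a restore from the last full snapshot alone shows older data until the later changes are applied on top.

```python
from dataclasses import dataclass, field


@dataclass
class BackupStore:
    """Toy model (hypothetical): one full snapshot plus the changes made since."""

    snapshot: dict = field(default_factory=dict)
    incrementals: list = field(default_factory=list)

    def take_full_snapshot(self, live: dict) -> None:
        # A full snapshot copies everything and resets the incremental history.
        self.snapshot = dict(live)
        self.incrementals = []

    def record_incremental(self, changes: dict) -> None:
        # An incremental backup stores only what changed since the last backup.
        self.incrementals.append(dict(changes))

    def restore_snapshot_only(self) -> dict:
        # Restoring the snapshot alone yields the older view of the data.
        return dict(self.snapshot)

    def restore_full(self) -> dict:
        # Replaying the incrementals in order brings the data up to date.
        state = dict(self.snapshot)
        for changes in self.incrementals:
            state.update(changes)
        return state


store = BackupStore()
store.take_full_snapshot({"post-1": "Hello"})
store.record_incremental({"post-2": "New post"})

old_view = store.restore_snapshot_only()  # contains only post-1
current_view = store.restore_full()       # contains post-1 and post-2
```

In this sketch the newer post is never lost; it is simply absent from the snapshot-only view, which matches the "old data" alarm Anil describes.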

Niall:

You guys run a variety of sites from TypePad.com, blogs.com, Major League Baseball blogs. What sites and partners were affected by the outage?

Anil:

All the TypePad-powered sites in the U.S.: TypePad.com, blogs.com, Friendster’s blog service, as well as Major League Baseball’s service to various degrees. I don’t actually know offhand the details about which blogs were affected on those services or what percentage it is. That’s something I know our operational team definitely knows but my job today was not to interfere with them. My estimation is the majority of TypePad blogs were affected.

Niall:

How did you notify TypePad members of the outage and the unavailability of their recent posts?

Anil:

We had a couple of different steps. The first is our status blog which is our first line of defense and the first place you should check if you ever have a question as a TypePad user. It honestly is probably not nearly prominent enough to find and that’s something we are going to be remedying. It’s one of those things you find out in these situations is that people don’t know that channel exists. And then we’ll be blogging about it. I think one of the things we’ve shown a pretty good track record of is having a conversation about these things and really engaging in a dialogue with our members and I think you’ll see our usual breadth of posts which is here’s a technical explanation for people who care about that, here’s what we’re going to do to make it right for people — I think for most people that’s their priority — and then a little about what we’ve learned from the past is something we will go into. You work at Technorati. I think the folks at Technorati have done a really good job about being open in communication on that. I think we’ve got a pretty good reputation there and I’d like to be sure we stay in touch with everyone.

Niall:

Did you consider sending out e-mails to every TypePad member to let them know that they wouldn’t be able to post and their latest posts might not be available?

Anil:

Yeah, I think we talked about a variety of different channels and the thing you want to do is make sure you have all the right information getting to the right people at the right time. Part of it is until you know that everything is up and running and all the data is OK you don’t want to send a “we don’t know” e-mail because it tends to make things worse before it makes things better. That’s something we’ve gotten by talking to our professional users and people who are really invested in their blog. They’ve said “just let me know you are aware there is a problem and then let me know when you’ve fixed it and in between if you’re going to tell me about the fsck you’re running on your hard drives I don’t care.” I think that’s probably true for most TypePad users. “Just let me know when it works.” That’s I think what we’re really going to focus on is that blogging experience.

There are many different channels. I spent all day commenting on blogs today. Other folks here at the company have. Jay Allen I think a lot of people will see out there. A lot of people have been on IM and on Skype and on phone calls all day too. We talked about if we should do a big Skype call with a bunch of people but finding something that we can set up in a short amount of time that scales to millions of users or millions of listeners is pretty hard. I think in the future we’ll see what we can do about that too.

Niall:

How are today’s problems related to the problems you had at the end of October? Are these similar issues? Are they going to be recurring?

Anil:

No, it is actually not about capacity at all. I saw somebody describe it as “a perfect storm” today or it was like a “lightning strike.” I felt a little bit like Charlie Brown when Lucy pulls the football out. It was absolutely one of those situations where we had tackled the problems that caused application performance issues in the past and we wanted to make sure we were bulletproof. At the point where we are kind of putting up another line of defense we kind of hit ourselves on the head and got knocked out. The timing was terrible and it’s just one of those things where Murphy’s Law will just always kick in.

Lucy pulls back the football

Niall:

Are LiveJournal users affected at all or are they on separate hardware?

Anil:

Not at all. They are on separate hardware. The best thing about having LiveJournal in-house is they have really taught us, and the whole web in general really, about scaling and reliability in large-scale operations. They’ve got 9 million journals and they recently did a datacenter move to the same datacenter we’ve got TypePad in. They have hundreds of servers, millions of users, millions and millions of posts and they did it with some glitches around e-mail and things like that but no downtime. The open-source technologies they have created like MogileFS, a caching file system, and Perlbal, a load balancing system, run sites like Wikipedia and Slashdot and Friendster: these gigantic and dynamic sites. They’ve shown that the technology exists. People on our team build these services capable of making large services reliable. I think you are going to see a lot of knowledge transfer and best practices about how to make this stuff bulletproof and reliable.

One of the examples I think Steve Rubel talked about on his blog is eBay back in 1999 when people were just starting to have their livelihood depend on the auctions closing on time and the service would go down. People solved these things then and I think we are going to see the same thing for us.

The thing that gives me a lot of faith is we have these TypePad users that are rooting for us to do it even though we have absolutely not met their expectations, especially today. We had somebody send us pizzas for the whole TypePad team for lunch today saying “I love TypePad, I know you guys are going to do right by us, I trust you to get it right, and we are thinking of you.” That’s an astounding thing for a community to do. That’s really special, that’s why you do it. That’s why you keep going even though we dropped the ball today.

Niall:

Can you talk about the backup solutions that are in place to make sure people don’t lose data on TypePad?

Anil:

I’ll speak as much as I know. I don’t want to overstep what I am knowledgeable about because I am not on the operations or infrastructure teams that we have. There is, as I understand it, the now-redundant storage system. One of the interesting things about our architecture is we have a database that stores all your posts and a machine that generates the pages that your posts appear on. They are separate. The storage failure today was on the system that stores your web pages that you can view and that’s why we saw an older cached copy when they went back live. The entries in the database are on a separate database server and that is already redundant and somewhere we knew we didn’t lose any data so we felt a lot better about that. We were able to verify that pretty quickly, it was just the disks that took some time to come back online. In terms of architecture I don’t know exactly how the different tiers split up but the database is separate from the web server pages and now we have redundancy on both.
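The separation Anil describes, with posts in a database and generated pages on a separate storage system, can be sketched roughly as follows. This is an illustrative Python toy and not TypePad's code: `ToyBlog`, `render_page`, and the `index.html` key are hypothetical names, and the `rebuild_pages` step is my assumption about how such a design could recover (Anil only says the disks took time to come back online).

```python
from dataclasses import dataclass, field


def render_page(posts: list[str]) -> str:
    """Generate a static HTML list of post titles, newest first."""
    items = "".join(f"<li>{title}</li>" for title in reversed(posts))
    return f"<ul>{items}</ul>"


@dataclass
class ToyBlog:
    # Source of truth: every post lives here (hypothetical stand-in for the database).
    database: list[str] = field(default_factory=list)
    # Separate storage holding the generated pages that visitors actually see.
    page_storage: dict[str, str] = field(default_factory=dict)

    def publish(self, title: str) -> None:
        # Save the post first, then regenerate the page from the database.
        self.database.append(title)
        self.page_storage["index.html"] = render_page(self.database)

    def restore_pages(self, old_pages: dict[str, str]) -> None:
        # Simulate bringing page storage back from an older snapshot.
        self.page_storage = dict(old_pages)

    def rebuild_pages(self) -> None:
        # Assumed recovery step: regenerate pages from the intact database.
        self.page_storage["index.html"] = render_page(self.database)


blog = ToyBlog()
blog.publish("Monday post")
page_snapshot = dict(blog.page_storage)   # snapshot of page storage only
blog.publish("Friday post")

blog.restore_pages(page_snapshot)         # page storage fails, old snapshot restored
stale_page = blog.page_storage["index.html"]

blog.rebuild_pages()
fresh_page = blog.page_storage["index.html"]
```

In this toy the stale page lacks the Friday post even though the database still holds it, which is the situation members saw: an older published copy, with no posts actually lost.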

Niall:

Last February in an interview Ben and Mena mentioned that TypePad was originally intended for 3000 users and then it was going to switch over to an invite-only system. Is this a scalability issue and is TypePad up to the challenge of hosting hundreds of thousands of blogs?

Anil:

I think what they were talking about there was not so much the architecture of TypePad but the concept of how many people would want to be blogging, how many people would want to be on a hosted service, and how many people would pay for a hosted service. It really spoke more to whether, I think this is something Om Malik was talking about with Web 2.0 scalability in general, are you designing for a community of people who you are personal friends with and can talk to one-on-one or are you trying to bring something to the masses? I think that’s the biggest distinction that Ben and Mena were talking about the change for.

As far as the application architecture actually I feel more confident than I ever have that we can scale out to support the number of users TypePad is hosting. We have got millions of people on this platform. TypePad and TypeKey are related and I think between the two of them you are certainly into the millions of users. That’s something where I think the learnings from LiveJournal in particular are really really educational for us.

There are a couple of different places you can fail. You can fail at app, you can fail at hardware, you can fail at connectivity. We’ve probably had each of those problems. I think that we have stated that it’s a goal for us and that it is a requirement for us to be business class. It’s really really important to be professional-grade.

I think that is a distinction from … I have this debate all the time with Jason Fried from 37signals where his team and David Heinemeier Hansson have made Ruby on Rails into this web development phenomenon but nobody’s ever done a really large-scale service, certainly not the scale of TypePad let alone LiveJournal, on that platform. David Heinemeier Hansson has said “we don’t want to invest in scale.” I think he is overstating the point for rhetorical effect, but it’s understated whether it is a priority of yours or not and whether you want to have impact with a small influential crowd or whether you want to bring this technology to the masses. Those are things that we are really clear on. I think it’s really important to us that we are a company of bloggers. I think, you know, we’re all bloggers. I’ve had a blog for 6 years now I think and there are a number of people in this company that is true for. We’ve been doing it a long time and we want it to be something that everybody can do. It’s a different goal and it matters to us that it is people who are of blogging and from blogging who bring blogs to a regular user as opposed to being Bill Gates saying “this is a high growth area that I think we should maximize revenue potential in.” We want people to have a voice and we think we can help them do that.

I think TypePad users today are sharing their passion for their blogs and they care that much about their sites. We are absolutely lucky to have that kind of passion from our users. One of the kindest things today was a friend of mine on the Blogger team IM’d me and said “hang in there, we had a 7-hour outage one time” and everyone has had different outages but to be really supportive and also to say we’re kind of kindred spirits and we’re a team of people that are bloggers too. It’s a nice thing that there is still that sense of community and that people notice.

Every major blogging service except you guys has had downtime this week. Del.icio.us got unplugged for a while, and I think Bloglines said “they are sucking eggs” was their quote. That’s nothing negative against those guys, I have huge amounts of respect for both those teams, and we probably share a huge overlap in users too so it’s the same people we are talking about. It’s a hard problem and it has to be something that either that’s your priority or it’s not. You only get the problems that you are uniquely qualified to suffer for. TypePad is the thing we care about and so this is the burden we take on.

Niall:

I know you guys are planning some changes to TypePad’s infrastructure including some behind-the-scenes stuff you are doing as part of Project Comet and some other initiatives. Can you talk about that and what’s being done?

Anil:

I don’t have a lot of specifics there because I think a lot of that stuff is always evolving based on what feedback people have. Frankly we redoubled our efforts on TypePad’s infrastructure sometime before the problems we had in October. It was something we recognized and we wanted to focus on and the first step in that was moving our server location. We didn’t get done entirely in time for that but it did happen and pretty smoothly once the performance issues were addressed.

I think we’ve got our best team, some of the people who have been looking at TypePad the longest, really looking at all of the different ways TypePad can scale. Something you and I, Niall, talked about that we haven’t really announced but the intention is to do a business-level TypePad. We have the kind of flagship customers that some people were blogging about today like MSNBC or ABC News or USA Today are all running their blogs on TypePad. That’s an incredible responsibility. It’s just as serious to me as people who have their baby pictures and talk to their friends and family on there. That’s an incredible amount of trust to put in us and I am at least heartened that it made the news that we let them down today. You want it to be newsworthy when you fail, not when you don’t.

That much is good, but we do have a bit of a hole to dig out of as far as people trusting us and I don’t know what the ratio is. It’s maybe for every day you screw up you have to have a month where you’re perfect or a week, or whatever it is. That’s something I have a lot of confidence people are going to give us the chance to get right because we’ve had that dialog with them because they know that we are out there listening and they know they can call me or Skype me or IM me or TrackBack me or whatever and we’ll all reply.

Niall:

Talk about that “business class.” How does that differ from what there is now? How do you define business class reliability at Six Apart?

Anil:

We haven’t gone into specifics on it and I think obviously this is something we are going to redefine based on today to some degree. It really is about saying that it matters to us to be a professional tool people can trust. We understand that the constraint for a lot of companies with blogging is they don’t want to deal with their IT department. You’d rather sign up and just use something you can expense on your credit card and ask forgiveness rather than permission, although we’ll probably say it nicer than that on the website. People also want an option for priority support, people want to be able to have that relationship with us, people want to optionally purchase a service-level agreement.

I think that’s absolutely a reasonable option for people to have if they think that it’s worth their investment. Those are things where we’ve had people say “I do this for fun” or “I like a service to be free” or “I don’t like to deal with SLAs or boring legal stuff,” bring on the boring legal stuff. If we can do things the way we want to for the business audience blogging should be deadly, dull, boring. Something that you just don’t even have to think about, it just runs like an e-mail server or a web server or a search server where it’s just there as a utility. I think you guys have stated probably different terms but similar aspirations for Technorati where you just count on it, you know it’s comprehensive, and you know it’s there. That’s definitely a parallel to what we want to do.

Niall:

Are you guys going to take any time off this Christmas? Is your team going to get a break at all?

Anil:

I work most closely day-to-day these days with the Movable Type team and they have had a great week with the Yahoo! partnership we announced so they are going to have some well-deserved time off.

I personally am going to be traveling and dealing with some of our partnerships overseas. The nice thing about having a break here in the U.S. is it’s not always taken in other countries so we get some time, some downtime, to really connect with our partnerships in Japan and things like that.

There’s no question to me that everybody who is in operations is going to be 24/7 for the foreseeable future on their beepers and on their phones and on-call. They have done that before and it wouldn’t be the first time that a hard drive was the grinch that stole Christmas. That’s what they signed on for. They really do have an amazing amount of dedication to making sure people can connect with the people they care about so that goes a long way.

Niall:

Is Six Apart hiring operations personnel to take care of some of this stuff?

Anil:

We’ve brought on a new director of ops, we do have a bunch of new people on the ops team. We’ve been building up that team for some time and like I said, it’s funny I was looking at one of the performance monitoring systems that monitors TypePad and we had these old cached copies of the pages coming up but they were coming up really fast! It was one of those things where that was an improvement that had been made, a performance improvement and a reliability improvement, and even when you have this hardware failure there are these other things that are good news.

They are a really dedicated team. This is not something that was any human failure or error. I have a tremendous amount of respect for them doing hard work day in and day out.

Niall:

If there are aspiring sysadmins out there should they be sending in their resumes?

Anil:

Yeah. I think anybody that’s got talent and cares about blogging we want to have on our team and have them contribute. So yes, please.

Niall:

Where is the best place for TypePad users to send in comments or feedback?

Anil:

There are a couple of different ways. We have our help ticket support system built in to the app. People who use TypePad know that support team is pretty legendary for their dedication. Feel free to get me directly. I’m Anil [at] sixapart.com so I am pretty easy to get ahold of for most people. I know Barak our CEO is also willing to get feedback. He’s BarakB [at] sixapart.com.

Those are great ways to get us. We are going to have TrackBacks on the posts that we will have up about this so feel free to link. Blog about it, we do our Technorati searches and find everybody talking about what we’re doing. We probably can justify the ego surfing more than most people so give us a mention and talk about it. We’ll try to be out there. We don’t guarantee we’ll read every single post but we try to.

Niall:

Thanks Anil. I hope you get some downtime this Christmas season.

Anil:

Bad word!

Niall:

I hope you get some relaxation time this holiday season and I hope everyone out there using TypePad keeps on blogging.

Anil:

I appreciate it and thanks for giving us a chance to talk to everybody.
