February 10, 2009: Disaster Day – Sheep Guarding Llama

I didn’t realize it when I woke up this morning, but today was destined to be one of those horribly long, awful days whose stress you just can’t imagine until they happen.

I woke up rather on the early side but ended up not actually getting out of bed for a while. I did my morning weigh-in and was up just a smidgen, but that was to be expected after yesterday’s huge drop all at once. Then it was down to work with intermittent trips upstairs to see the family.

We took some pictures with dad and Liesl this morning, then dad headed out the door sometime after ten twenty to get on the road back home to Peoria. Just after dad left I went back down to work, attempted to sign on to my email and discovered, to my horror, that both the email system and the instant messaging system were down. I quickly discovered that the host server that handles both of them had suffered a catastrophic drive failure in which one or more of its drives had been physically pulled from their connectors. I don’t know how this happened and I have never seen it in a server before.

I spent the first few hours trying to get the drives reconnected by talking someone through the procedure over the phone which is never fun. Eventually we had that portion of the problem figured out and got the Smart Array able to recognize all of the drives. The Smart Array was happy to tell us that everything was “okay” but, of course, it was not really “okay”.

I thought for sure that the RAID array had completely failed and that there was going to be nothing at all. Things were not quite that bleak. It turned out that the array survived somewhat intact but that the filesystem on the drives was seriously hosed. I have no idea what trauma that server went through to cause so much damage but this was really something. I have never seen so much filesystem damage.

I spent much of the afternoon in a panic attempting to get a rescue disc mounted and booted on the server. The data center didn’t answer ten of my twelve phone calls and eventually I resorted to email from my Yahoo account. I am guessing that something bad was happening at the data center. They always stop taking calls when they have broken a bunch of stuff.

Eventually we got a maintenance CD mounted but then spent a few hours trying in vain to get that to work. It turns out, we believe, that the CD was bad. It took forever to get another CD mounted but eventually we did and we were able to get to the console and begin working on the box.

The first thing that we learned was that the filesystem really was hosed and that there was almost no chance of salvaging anything. Now that is depressing. Pretty much the only thing to do was to run a complete file system check and hope for the best. The filesystem was in heavy use when the world was yanked out from under it so the damage is potentially pretty significant.

While looking for the status of the backups after the server failed, I also quickly discovered that, around the same time this morning that the email and instant messaging systems died, the backup server (a completely separate machine with its own redundant drive systems) had also completely and utterly disappeared! Now this is a bad day in the making. Key server gone along with the backups.

I ended up doing nothing today but work and work on these servers. It was exhausting. Exhausting and depressing.

The filesystem check came back with the most depressing results of the day – everything was gone. Everything. Gone. Nothing left. Nope. Nada.

Around nine o’clock this evening we made the call that there was just nothing that could be done with the lost filesystem and that any continued work on it was wasted effort that could be better spent elsewhere. There was also little to no chance of being able to repair the lost SunFire server remotely, as it had been restarted and no one at the data center was able to determine anything about its status. So that left me with nothing to do but hop into the Mazda PR5 and hit the road for Scranton.

It was just after ten when I actually made it out of the door and onto the road. I arrived at the Scranton Data Center just a few minutes after midnight. Luckily the crew was standing around outside smoking, so I was able to find everyone that I needed instantly and get right in, derack the two lost servers, load them into the car, swing into Turkey Hill to pick up a pack of cashews (I hadn’t eaten since breakfast) and an energy drink, and get back onto the road heading to Peekskill.

I arrived back at the house in Peekskill just minutes after two in the morning. Four hours for a round trip from Peekskill to Scranton with two servers being deracked is pretty impressive if I do say so myself. No time wasted anywhere.

My first order of business was restoring the backup server itself. The handy thing about tonight’s move was that this server was always supposed to be deracked and moved to the house in Peekskill. Originally we were not planning on making that move until after everything else had left Scranton, but from that perspective this ended up working out reasonably well.

I got the backup server working and determined that there was an elusive backup available from September, which was our “best case scenario” once we had seen how catastrophic things were. So getting a copy of that backup was of primary interest, although these backups are so large that just moving a copy from one machine to another is rather difficult.

While I was working on getting the backups moved around to more useful places (and making additional copies of them for safety, as they are now the master copies) I got back to work on building the Zimbra server that is going to take over for the failed machine. Another multi-hour process.

The startup of the first run of the Zimbra server ended up taking a very long time as did the file copies. In the end I resorted to just going to bed and leaving the Zimbra server to come up on its own, completing the first, small file copy and kicking off a massive compression job on the one server in the hopes of reducing the amount of data that has to be moved around.

It was six in the morning when I finally managed to call it a night. Not nearly as much progress as I was hoping to have made, but I think that there was enough done that the new server will likely be complete enough to be packed and shipped to Toronto tomorrow afternoon. Now I just have to get up and set things up with the data center in the morning so that they are ready to receive the new box!!

What a day.
