After having worked in IT for decades you tend to accumulate anecdotes. This is one of my favourites.
I once consulted for a major “new Ivy League” university on a project involving the database server cluster backing their core line-of-business application. The university’s library relied completely on this system: it had to have near-perfect uptime, and the data stored in it was critical. Because of these needs the university had spared no expense, buying a high-end SAN, clustering the database, buying great hardware, hiring consulting engineers (like myself) and keeping a full-time system administrator dedicated to this single system. The resources being thrown at it were truly impressive, which surprised me because I could not figure out how the system paid for itself – but that was a business decision whose factors I was not privy to, so perhaps it was well justified.
The point here, however, is that the scale of the resources thrown at this single system was staggering. The SAN alone had to be over a half-million dollar investment. Perhaps much more.
While I was consulting there one day, the system administrator went into the other room to make a minor change at her SAN console and accidentally clicked on the wrong LUN. One click. One mistake. Because it was a SAN, this was not a system that was managed regularly; it was “set and forget” storage that would go years without human intervention. This was a normal, everyday change. Nothing big. No reason not to touch it during business hours. No cause for concern. No need for special safeties. Except that clicking on the wrong line in the display was the difference between a temporary LUN used for testing and the production LUN on which the entire operation depended.
In the blink of an eye everything was gone. One wrong click and all of the expense, all of the redundancy, all of the planning went right out the window. It was gone. The entire database was just… gone.
So now we are operating in an “outage” state. What is the plan? What do we do from here? We had no idea, because instantly, upon realizing what had happened, the staff system administrator went into shock. The stress of not only having the system go offline on her watch, but having done so due to her own error, was too much, and she was, quite instantly, incapacitated. She could not talk, stand or do anything. In fact, rather than dealing with the outage, the remaining IT staff first had to get her breathing normally and move her to a cafeteria or some other place out of the way so that she could recover and we could get to work repairing the damage.
So in this case, the human component not only caused the error (to err is human and all that) but upon having erred went into a “failure state.” This was not good at all.
Fortunately a SAN administrator was available and able to start provisioning replacement storage for me, and I was able to start rebuilding the cluster. After a few hours the systems were back online and working. Data loss was significant: the expensive storage and high levels of redundancy had lured them into believing that the system could not fail, so the backups were not nearly as recent as one would have hoped – but they were recent enough to eventually bring the system back online.
Had this event not happened on the very day I was there consulting – and immediately after I had been walked through their cluster setup – the magnitude of the disaster would have been far greater. This could have been an outage of days or weeks. There was little protection, either technical or procedural, against catastrophic human error, and no redundancy for the most fragile piece of the system – the human.
I learned many lessons from this event and have carried them with me through my career. They are lessons everyone should know. There are so many assumptions in IT that we often forget to step back and evaluate the big picture. Several things could have prevented or mitigated this disaster: a better backup process, physical separation of environments, and process controls limiting production changes to non-production hours – of which this system had plenty, since it only ran about 70–80 hours per week. There was plenty of time to have been doing this work during off-hours.
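None of the tooling in this story is public, but the kind of process control described above – forcing the operator to confirm the exact target and blocking production changes during business hours – can be sketched in a few lines. Everything here (the function name, the LUN identifiers, the hours) is hypothetical, shown only to illustrate the idea:

```python
from datetime import datetime

PRODUCTION_LUNS = {"lun-prod-01"}   # hypothetical production LUN identifier
BUSINESS_HOURS = range(8, 18)       # 08:00-17:59 local time, for illustration

def confirm_destructive_change(target_lun: str, typed_confirmation: str,
                               now: datetime) -> bool:
    """Allow a destructive change only if the operator retyped the exact
    LUN name, and never against production during business hours."""
    if typed_confirmation != target_lun:
        return False  # a mis-click cannot confirm itself
    if target_lun in PRODUCTION_LUNS and now.hour in BUSINESS_HOURS:
        return False  # production changes wait for the maintenance window
    return True

# A wrong click selected the production LUN, but the operator meant the
# test LUN – retyping the intended name catches the mismatch:
print(confirm_destructive_change("lun-prod-01", "lun-test-07",
                                 datetime(2024, 5, 6, 14, 0)))  # False
```

The point is not the specific checks but that a deliberate, typed confirmation turns one accidental click into two independent mistakes, which is far less likely.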
Some of the things I learned: human error is our biggest fear – computers rarely fail as spectacularly as humans do. The presence of redundancy, even at every level, does not protect against intra-system failure – in this case the redundancy replicated the error to every point in the system instantly. Humans often need redundancy more than computers do: humans need it every day when they eat, sleep, travel, vacation or get sick, while computers need it only when something bad happens. Processes matter – had someone written down in a manual that development and production shared the same management interface, someone would probably have realized that was a bad idea. And finally, unbelievably reliable technology not only can cost more than it can possibly save, it also makes it very tempting to trust the technology to protect against every scenario and to fail to plan accordingly.
In this particular case, this multi-million dollar system could have been reduced to a single fifteen-thousand-dollar server and run faster and more reliably. The complexity of the system ended up contributing to its downfall.