There’s a fine line between bravery and idiocy, and it’s usually
determined by the outcome. Such is true in conflict and in IT. One of
the major benefits of experience in either is that you develop a sixth
sense about when discretion truly is the better part of valor.
Shakespeare may have intended this as a joke, but it rings true.
Before we undertake any major action, whether that’s pushing a major new app version to production, migrating massive data sets from one storage array to another and cutting over, or performing intricate maintenance on a production system without taking it offline, we hedge our bets. Well, we should hedge our bets.
I try to imagine any possible blocking problems beforehand and determine whether there are ways to deal with them before they happen. If at all possible, I like to have a reversion method already scripted that can reset everything to its pre-work state, akin to pulling a ripcord. I like to leave nothing to chance.
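The exact ripcord varies with the environment, but the shape rarely does: capture the pre-change state before touching anything, and keep a single command that puts it all back. Here’s a minimal sketch in Python; the paths and the service name are hypothetical stand-ins for whatever your environment actually uses (LVM or ZFS snapshots, VM snapshots, config backups, and so on).

```python
#!/usr/bin/env python3
"""Minimal "ripcord" sketch: snapshot state before a change, restore on demand.

The directories and the service name below are placeholders -- swap in
whatever your environment really uses.
"""
import shutil
import subprocess
import sys
from datetime import datetime
from pathlib import Path

CONFIG_DIR = Path("/etc/myapp")            # hypothetical config tree to protect
BACKUP_ROOT = Path("/var/backups/ripcord")  # hypothetical stable backup location

def take_snapshot() -> Path:
    """Copy the current state aside before any work begins."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = BACKUP_ROOT / stamp
    shutil.copytree(CONFIG_DIR, dest)
    return dest

def pull_ripcord(snapshot: Path) -> None:
    """Put everything back exactly as it was before the work started."""
    shutil.rmtree(CONFIG_DIR, ignore_errors=True)
    shutil.copytree(snapshot, CONFIG_DIR)
    # Restart the affected service (name is a placeholder) so the restored
    # configuration takes effect.
    subprocess.run(["systemctl", "restart", "myapp"], check=True)

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "revert":
        # Pull the ripcord: restore the most recent snapshot taken.
        latest = sorted(BACKUP_ROOT.iterdir())[-1]
        pull_ripcord(latest)
    else:
        print(f"Snapshot saved to {take_snapshot()}")
```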
There may be a time during the work when the infrastructure is in an
extremely precarious position, but I like to limit that exposure as much
as possible and have a clear path back to safety. This concept is built
into some code deployment methods, but it’s not as easy in IT in
general.
As we all know, IT is a fickle beast, and there are eventualities that
can’t be fully accounted for. A storage array intended as temporary
holding space that was completely stable for months will throw a disk or
two halfway through the process, becoming a major bottleneck at best or
completely blowing up the migration at worst. Or an order-of-operations mistake will be made, and you'll find yourself painted into a corner -- the only questions being how dirty you'll get trying to get out and what you'll have to sacrifice along the way.
With enough experience in this world, you can see some of these
possibilities before they happen. You can either bail on the planned maintenance or upgrade, or quickly develop an alternate plan that sidesteps the problem. However, if more than a few of those issues crop up, even
if there’s a seemingly clear path to success, you may hear that little
IT voice in the back of your head screaming that it’s a trap, and it’s
better to walk away while you still can. It’s usually wise to listen to
that voice.
The basic concept is that no matter what, we should never lose data or
systems during any IT function. Even if everything goes completely
pear-shaped, the resulting questions should center on how long it will
take to recover, not whether it can be recovered. Even if it requires a few
extra days of preparation beforehand, there should always be a way to
undo whatever work is being done. It may cost more money in the form of
backup storage or systems, but it’s always worth it, even if it’s
ultimately not needed.
This is where the cowboys come in. It’s in the midst of sensitive, delicate operations, when unforeseen problems appear, that a cowboy admin will push forward without a safety net and try to reach the other side.
If he succeeds, everyone’s thrilled and admiring, and rounds of beers
will be bought at the pub. If he fails, everyone sticks around for hours
or even days of constant stress and pressure until whatever can be
recovered is recovered. These are situations that you don’t want to be
part of if you can help it, because they usually don’t end well.
There’s a trick to determining whether a move was a true cowboy move, however, because to an observer the difference can be hard to spot. I’ve made plenty of saves in the middle of crises that some might consider unorthodox or avant-garde, but with a backup plan in place if at all possible. It might be as simple as SCPing a broken management VM from one array to another in order to repair it and bring it up on stable storage so that further rescue migrations could proceed, or reworking iSCSI LUN masking on the fly to block certain problem servers from overloading a failing storage array and allow a fragile recovery to complete.
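To make the first of those saves concrete, here’s a rough sketch of the shape it takes. The hostnames, paths, and disk image are hypothetical; the point is the order of operations: copy the VM’s disk off the failing array, verify the copy, and only then bring the machine up from stable storage.

```python
#!/usr/bin/env python3
"""Rough sketch of the "SCP the broken VM to stable storage" save.

Hostnames and paths are hypothetical placeholders. Copy first, verify the
copy, and only then touch the original.
"""
import hashlib
import subprocess
from pathlib import Path

FAILING_HOST = "hv-failing.example.com"                 # host on the dying array
REMOTE_DISK = "/var/lib/libvirt/images/mgmt-vm.qcow2"   # broken VM's disk image
LOCAL_DEST = Path("/srv/stable-storage/mgmt-vm.qcow2")  # path on stable storage

def remote_sha256(host: str, path: str) -> str:
    """Checksum the source over SSH so the copy can be verified later."""
    out = subprocess.run(
        ["ssh", host, "sha256sum", path],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()[0]

def local_sha256(path: Path) -> str:
    """Checksum the local copy in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    source_sum = remote_sha256(FAILING_HOST, REMOTE_DISK)
    # Pull the disk image off the failing array onto stable storage.
    subprocess.run(["scp", f"{FAILING_HOST}:{REMOTE_DISK}", str(LOCAL_DEST)], check=True)
    if local_sha256(LOCAL_DEST) != source_sum:
        raise SystemExit("Copy does not match the source -- do not boot from it.")
    print("Copy verified; repair and boot the VM from stable storage.")
```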
Full disclosure: I've had my share of cowboy moments with no safety net. I'm pretty sure most of us have.
If we lived in a perfect world, these things wouldn’t require any
thought or planning at all. Big data and VM migrations, app and database
rollouts and upgrades -- everything would be as easy and natural as
breathing. We have made great strides in this area over the past few
decades, and there may come a day when that is possible, but it’s
certainly not today. There is no magic bullet; there is only Zuul.