Wednesday, September 1, 2010

The technology SPOF

When I'm thinking about availability, a lot of my time and thought goes into the careful search for SPOFs (single points of failure).

  1. I do look for hardware SPOFs, like a single machine doing an important job and needing a backup in case of breakdown. [first step of a BCP]
  2. I also do look for network SPOFs, for instance by making sure the backup has the same network access as the master machine, so that it doesn't sit alone and useless. The same goes for firewall rules and every other kind of filtering that network flows pass through. [careful execution of the BCP]
  3. I furthermore do look for configuration SPOFs. This is the rather funny case where the backup machine is up and reachable on the network, but the clients are not aware it's there and so never use it. It usually happens when IP addresses are not switched automatically between the master and the backup, or when a configuration screen only lets you type in one server, not two or more. Fortunately, this should not happen with MS Domain Controllers, given the way they work (or we would hear about this kind of SPOF more often). Still, it's common in many "small vendor" appliances and in the application world. [very careful execution of the BCP]
  4. Nonetheless, a fearsome SPOF remains: the technology SPOF. You have the backup machine, it's reachable, and the clients know they should talk to it. But the backup suffers from the same technological incident that took down the master machine.
    Say, for example, that the master breaks down because it fails to handle a large quantity of "client data" (or anything else) that has to be processed. The backup will break down under the same load.
    Let's take another example: the server is an application connecting to a database. The database receives a minor software update that changes something and makes the master server go crazy. The backup takes over the job and goes crazy too.
    Third and last example: you have a nice application server with scheduled tasks that do necessary work. One task fails and the server goes down, unable to continue. At that moment, the backup comes up and launches the same scheduled task, which fails again... [and that's outside the perimeter of most BCPs]
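
The fourth case can be made concrete with a toy simulation. This is only a minimal sketch, not a real failover system: the `Server` class, its `process` method and the 1000-item limit are hypothetical names and numbers chosen for illustration.

```python
class Overloaded(Exception):
    """Raised when a server receives more data than its software can handle."""


class Server:
    # Hypothetical limit: master and backup run the same software,
    # so they share the same breaking point.
    MAX_ITEMS = 1000

    def __init__(self, name):
        self.name = name

    def process(self, items):
        if len(items) > self.MAX_ITEMS:
            raise Overloaded(f"{self.name} cannot handle {len(items)} items")
        return f"{self.name} processed {len(items)} items"


def failover(servers, items):
    """Try each server in order; return the first success, or None if all fail."""
    for server in servers:
        try:
            return server.process(items)
        except Overloaded:
            continue  # the switch to the next machine itself works fine...
    return None  # ...but every replica hit the same limit


servers = [Server("master"), Server("backup")]
normal_result = failover(servers, list(range(10)))
big_result = failover(servers, list(range(5000)))
```

With a small batch the master answers and the backup is never needed. With the oversized batch the switch to the backup happens exactly as planned, and the backup fails for the same reason, so `big_result` is `None`.
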
That's the time when you'd want to have another way to provide your service. That's the moment when you are forced to remember that the machine is not there just to be up and running; it's there to help provide a service. And that service may be provided some other way. That's the time when you enjoy having a well-prepared DR plan, with reduced/degraded modes thought out in advance...
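
As a rough illustration of what a degraded mode can look like in code, here is a minimal sketch. It assumes the service can accept a request now and do the real work later; the function names, the `pending` list and the reply strings are all hypothetical.

```python
def provide_service(order, primary, pending):
    """Use the primary path; fall back to a degraded mode if it fails."""
    try:
        return primary(order)
    except Exception:
        # Degraded mode: accept the request and defer the real work.
        pending.append(order)
        return "accepted, will be processed later"


def broken_primary(order):
    # Stands in for a master AND a backup that both fail the same way.
    raise RuntimeError("application servers are down")


def healthy_primary(order):
    return f"order {order['id']} processed"


pending = []
ok_reply = provide_service({"id": 1}, healthy_primary, pending)
degraded_reply = provide_service({"id": 2}, broken_primary, pending)
```

The point of the design is that the fallback does not depend on the failing technology at all: here it is a plain in-memory list, and in real life it might be a paper form or a phone call.
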