Myth of the isolated production fix

18 October, 2007. It was a Thursday.

While in WCF training this week, I heard once again the argument for why config files are great: your IT staff can change them without a recompilation. Sounds great, right? But what exactly does this imply?

Sure, there’s no recompilation, but do the changes get tested? I can’t imagine someone modifying the production environment without testing those changes first. If something goes wrong, whose fault is it? It’s the application team’s responsibility to ensure a tested, reliable application, but they’re not the ones making this change.

How exactly does this change happen? Does the IT staff manually edit the configuration files? What happens if they make a mistake?

Flirting with disaster

Statements like “change without a recompilation” absolutely make me cringe, as they usually imply that these changes are small, isolated, and easy, and therefore don’t need to be tested.

The problem is that making configuration changes can sometimes lead to unexpected behavior changes because something we didn’t anticipate is dependent in some way on the configuration file. Even if the development team is consulted about these changes, how can they say with confidence the change is small or isolated without actually testing those changes?

Every change that could potentially affect the behavior of the system, no matter how small, needs to be tested in a variety of systems and environments. Automated deployments to, and testing of, clones of production environments give the team confidence in their changes. Anything else is a wild and potentially hazardous guess.
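To make the point concrete, here is a minimal sketch of treating a config edit like any other change: diff the proposed file against the current one so the team knows exactly which settings need testing before anything reaches production. It assumes a .NET-style config with an appSettings section; the setting names, the sample values, and the helper functions are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional, Tuple


def read_app_settings(config_xml: str) -> Dict[str, str]:
    """Return the key/value pairs under <appSettings> of a .NET-style config."""
    root = ET.fromstring(config_xml)  # root is the <configuration> element
    return {
        add.attrib["key"]: add.attrib.get("value", "")
        for add in root.findall("./appSettings/add")
    }


def changed_settings(
    current_xml: str, proposed_xml: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each added, removed, or modified key to its (old, new) value."""
    current = read_app_settings(current_xml)
    proposed = read_app_settings(proposed_xml)
    keys = sorted(set(current) | set(proposed))
    return {
        key: (current.get(key), proposed.get(key))
        for key in keys
        if current.get(key) != proposed.get(key)
    }


# Hypothetical sample files: only the timeout differs.
CURRENT_CONFIG = """
<configuration>
  <appSettings>
    <add key="ServiceUrl" value="http://orders.example/svc" />
    <add key="TimeoutSeconds" value="30" />
  </appSettings>
</configuration>
"""

PROPOSED_CONFIG = """
<configuration>
  <appSettings>
    <add key="ServiceUrl" value="http://orders.example/svc" />
    <add key="TimeoutSeconds" value="5" />
  </appSettings>
</configuration>
"""

if __name__ == "__main__":
    for key, (old, new) in changed_settings(CURRENT_CONFIG, PROPOSED_CONFIG).items():
        print(f"{key}: {old!r} -> {new!r} (needs a regression run before promotion)")
```

A diff like this doesn’t prove the change is safe. It only tells you what changed, which is the starting point for deciding what to test.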

A responsible SCM process

The first step in any reasonable SCM process is continuous integration. Until the build is repeatable, automated, and tested, the team can’t have much confidence in the reliability or quality of a build. Even small production fixes should start at this phase, as there’s never any guarantee a configuration change doesn’t affect business logic without regression testing.

After a CI build and possibly a nightly deployment, we typically have a set of gated environments where builds are put through more and more tests until they are certified as production-ready. Only builds labeled as production-ready are allowed to be promoted to the production environment. Although every build in CI processes might be labeled “production-ready”, sometimes longer regression tests need to occur before certifying a build ready for the next environment.
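The gating idea can be sketched in a few lines. This is only an illustration under my own assumptions: the gate names, the Build type, and the promote function are hypothetical and not taken from any particular build server.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical gates, in the order a build must clear them.
GATES = ["ci", "nightly", "regression"]


@dataclass
class Build:
    label: str
    passed_gates: List[str] = field(default_factory=list)


def promote(build: Build, gate: str, tests_passed: bool) -> bool:
    """Record that the build cleared `gate`; gates cannot be skipped or reordered."""
    if len(build.passed_gates) >= len(GATES):
        raise ValueError(f"{build.label} has already cleared every gate")
    expected = GATES[len(build.passed_gates)]
    if gate != expected:
        raise ValueError(f"{build.label} must clear '{expected}' before '{gate}'")
    if not tests_passed:
        return False  # the build stays where it is
    build.passed_gates.append(gate)
    return True


def is_production_ready(build: Build) -> bool:
    """Only a build that cleared every gate, in order, may go to production."""
    return build.passed_gates == GATES


if __name__ == "__main__":
    build = Build("1.0.42")
    for gate in GATES:
        promote(build, gate, tests_passed=True)
    print(build.label, "production-ready:", is_production_ready(build))
```

The interesting part is the ValueError: a build that tries to jump straight to a later gate is rejected, no matter how small the change behind it is.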

But why are small changes allowed to subvert this process? The only time this process should be allowed to be subverted is when there is a critical failure and the business is losing money because of downtime. If it takes three weeks for a build to move through the pipeline, that’s three weeks where the business is losing money. The only time we compromise on the quality of the build is when we absolutely have to push it out right away (i.e., in hours).

That’s not to say testing doesn’t happen, but it happens in a targeted area. After the hotfix is pushed out, we still go through the gated promotion process, just to make sure we didn’t miss anything. We might even throw out the changes in source control in favor of more sound fixes. That hotfix is still considered suspect, and is treated as temporary rather than production-ready.

Refuse to compromise values

Hotfixes should be a rare occurrence and must go through a promotion process themselves, where the bug has to be severe enough to have a certain level of negative effect on the business. The temptation to push out lower-severity bugs through hotfixes grows after a few successful hotfix deployments. The business asks, “Well, if it was that easy, why don’t we do that all the time?” This is just playing production roulette, and eventually it will catch up to you.

I don’t feel terrible occasionally compromising on practices when the urgency of a hotfix requires it, but it’s important for the team never to compromise on their core values. When hotfix patches are demanded on a regular basis, the team should learn to say “no”, push back, and volunteer a more responsible approach.
