Myth of the isolated production fix

While in WCF training this week, I once again heard the argument for why config files are great: your IT staff can change them without a recompilation.  Sounds great, right?  But what exactly does this imply?

Sure, there’s no recompilation, but do the changes get tested?  I can’t imagine someone modifying the production environment without testing those changes first.  If something goes wrong, whose fault is it?  It’s the application team’s responsibility to ensure a tested, reliable application, but they’re not the ones making this change.

How exactly does this change happen?  Does the IT staff manually edit the configuration files?  What happens if they make a mistake?

Flirting with disaster

Statements like “change without a recompilation” absolutely make me cringe, as they usually imply that these changes are small, isolated, and easy, and therefore don’t need to be tested.

The problem is that making configuration changes can sometimes lead to unexpected behavior changes because something we didn’t anticipate is dependent in some way on the configuration file.  Even if the development team is consulted about these changes, how can they say with confidence the change is small or isolated without actually testing those changes?

Every change that could potentially affect the behavior of the system, no matter how small, needs to be tested in a variety of systems and environments.  Automated deployments to, and testing of, clones of production environments give the team confidence in their changes.  Anything else is a wild and potentially hazardous guess.
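As one small illustration of testing a configuration change before it goes anywhere near production, here is a minimal sketch in Python.  It assumes a .NET-style config file with an `<appSettings>` section; the setting names `serviceUrl` and `timeoutSeconds` and both function names are hypothetical, made up for this example.  A check like this only catches malformed or missing values.  It is no substitute for deploying the change to a clone of production and running the regression suite.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical settings; the real keys depend on the application.
REQUIRED_KEYS = ("serviceUrl", "timeoutSeconds")


def load_app_settings(xml_text):
    """Return the <appSettings> entries of a .NET-style config as a dict."""
    root = ET.fromstring(xml_text)
    return {
        node.get("key"): node.get("value")
        for node in root.findall("./appSettings/add")
    }


def validate_config(xml_text):
    """Return a list of human-readable problems; an empty list means OK."""
    try:
        settings = load_app_settings(xml_text)
    except ET.ParseError as exc:
        return [f"config is not well-formed XML: {exc}"]

    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"missing setting: {key}")

    url = settings.get("serviceUrl")
    if url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"serviceUrl is not an http(s) URL: {url}")

    timeout = settings.get("timeoutSeconds")
    if timeout and not (timeout.isdecimal() and int(timeout) > 0):
        problems.append(f"timeoutSeconds must be a positive integer: {timeout}")

    return problems
```

Run as part of the CI build, a check like this would fail the build on a typo in the config file, which is exactly the kind of mistake a manual edit in production invites.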

A responsible SCM process

The first step in any reasonable SCM process is continuous integration.  Until the build is repeatable, automated, and tested, the team can’t have much confidence in the reliability or quality of a build.  Even small production fixes should start at this phase, because without regression testing there’s never any guarantee that a configuration change doesn’t affect business logic.

After a CI build and possibly a nightly deployment, we typically have a set of gated environments where builds are put through more and more tests until they are certified as production-ready.  Only builds labeled as production-ready are allowed to be promoted to the production environment.  Although every build in CI processes might be labeled “production-ready”, sometimes longer regression tests need to occur before certifying a build ready for the next environment.
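The gating rule can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the environment names in `PIPELINE`, the `Build` class, and the two functions are all hypothetical, and a real pipeline would live in the build server rather than in a script like this.

```python
from dataclasses import dataclass, field

# Hypothetical environment names, ordered from first gate to last.
PIPELINE = ("ci", "qa", "staging", "production")


@dataclass
class Build:
    label: str
    certified: set = field(default_factory=set)  # gates this build has passed


def can_promote(build, target):
    """A build may enter `target` only after passing every earlier gate."""
    earlier = PIPELINE[:PIPELINE.index(target)]
    return all(env in build.certified for env in earlier)


def certify(build, env):
    """Record that the build passed the tests for `env`, if it was allowed in."""
    if not can_promote(build, env):
        raise ValueError(f"{build.label} has not passed the gates before {env}")
    build.certified.add(env)
```

The point of the sketch is that the rule has no exception for small changes: a config-only build that skipped `qa` is rejected at `staging` just like any other build.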

But why are small changes allowed to subvert this process?  The only time this process should be subverted is in the event of a critical failure, when the business is losing money because of downtime.  If it takes three weeks for a build to move through the pipeline, that’s three weeks where the business is losing money.  The only time we compromise on the quality of the build is when we absolutely have to push it out right away (i.e., in hours).

That’s not to say testing doesn’t happen for a hotfix, but it happens only in a targeted area.  After the hotfix is pushed out, we still go through the gated promotion process, just to make sure we didn’t miss anything.  We might even throw out the changes in source control in favor of sounder fixes.  That hotfix is still considered suspect and is treated as temporary, not as production-ready.

Refuse to compromise values

Hotfixes are a rare occurrence and must go through a promotion process themselves, where the bug must have a certain level of negative effect on the business to qualify.  The temptation to push out lower-severity bugs through hotfixes grows after a few successful hotfix deployments.  The business asks, “Well, if it was that easy, why don’t we do that all the time?”  This is just playing production roulette, and eventually it will catch up to you.

I don’t feel terrible occasionally compromising on practices when the urgency of a hotfix requires it, but it’s important for the team never to compromise on their core values.  When hotfix patches are demanded on a regular basis, the team should learn to say “no”, push back, and volunteer a more responsible approach.

About Jimmy Bogard

I'm a technical architect with Headspring in Austin, TX. I focus on DDD, distributed systems, and any other acronym-centric design/architecture/methodology. I created AutoMapper and am a co-author of the ASP.NET MVC in Action books.
  • Think about it. If you don’t practice TDD, you have no idea how the changes made to code impact the production architecture.

    In essence, you are practicing FDD (Faith Driven Development). You are praying that nothing goes wrong when you change code. But wait, a smart developer on your team has minimized risk by decoupling configuration out of code and into an XML file. What genius! Now my risk has been mitigated by the fact that this configuration file is certainly isolated to this one component and can’t possibly have any side effects. Let’s just make this simple change in production. 16 hours later:

    Dev: “Yeah, we should have tested that before we deployed to production.”

    Build Guy: “But I thought you said it didn’t affect anything else.”

    Dev: “I didn’t know they changed the code to also interact with this system. When I coded it, it worked; it’s not my fault they didn’t update the documentation.”

    Ahhh, the memories of lessons learned… How I don’t miss them.

  • That’s exactly why I’d rather push config into code, and if not that, have tests for my config files. FDD, I like that.