Profiling a legacy app

Approaching a legacy application can be a daunting task.  You may or may not have access to the original developers (if they even still work for the company), and the domain experts might not be able to commit to teaching you the software full-time.  Even if you’re lucky enough to have access to true domain experts, it’s rare that they know the system from a technical standpoint, or are familiar with the entire system.  More likely, they have intimate knowledge of one piece and cursory knowledge of the rest.

We’ve gone with an approach that’s allowed us to glean quite a bit of information about the domain in a fairly short period of time.  It can be overwhelming trying to figure out where to start, especially if you’re looking at a codebase with hundreds of thousands, if not over a million, lines of code.


Some pitfalls we wanted to avoid were:

  • Getting bogged down in code
  • Worrying/complaining about code quality
  • Searching for a pot o’ gold (magical class that gives you complete insight into the system)

The biggest one is the complaining.  Many very successful systems are built with duck tape and baling wire.  It’s rather pointless to vilify an application or system that’s netted a company millions of dollars.  Lack of structure or tests might be bogging down the company now, but it’s put a lot of food on people’s tables.

When seeing code that I wouldn’t necessarily write, I like to step back and ask myself, “how many meals did this code buy for a family?”  I’ve written my share of stinkers over the years, so who am I to point fingers?

Getting started

The absolute first step is to get access to:

  1. A working application
  2. Its outputs

Without a working application, we’re forced to comb through a codebase.  That’s hardly productive: without an original developer, we have to make educated guesses about how it works and what’s important.

Also vital is insight into the outputs of the application, whether that be files, web service calls or a database.  We want to treat the application like a black box.  We’re going to poke and prod the application, so it’s necessary to see what pops out the other side.

If the application writes to a database, get access to a profiler so you can watch traffic.  If it makes network calls, use a port sniffer or something like Fiddler to watch the traffic.  If it writes to files or to MSMQ, again, find a profiler so you can watch traffic.

Collect some stats

Chances are your legacy app writes to some kind of database.  The older the system, the more variety you’ll see in the design, and the more alien the design will seem to you.  Modern relational databases haven’t been around long considering the timeline of mainframes, and those older designs were much different from the databases you’re probably used to.

Don’t be surprised if you don’t find any referential integrity.  You might not be able to run a dependency analysis tool (like RedGate’s excellent Dependency Tracker) and find anything connected.  If the application uses SQL Server, do yourself a favor and get one of RedGate’s bundles.

What worked well for us was not the diagrams from the legacy app, which can have more orphans than a Dickens novel, but a simple listing of tables with some important extra data.  We created another “Analysis” database with two tables:

  • TableInformation
  • ColumnInformation

The TableInformation table contained columns for Database, TableName, ColumnCount, RowCount, create/modify dates, and anything else we could glean from the catalog metadata.  RowCount is important: you can query TableInformation, sort by RowCount, and get a good idea of what the most important tables are.  Chances are a table with 10 million rows is fairly important.  Tables with zero rows can be crossed off the list immediately, as your database probably has tables that were created but never used.
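We did this against SQL Server’s catalog views; as a rough sketch of the idea (not our actual script), here’s the same shape in Python against an in-memory SQLite stand-in, with invented table names:

```python
import sqlite3

# Sketch only: SQLite stands in for the legacy SQL Server database,
# where you'd pull this from catalog views instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Pretend legacy schema: one busy table, one lookup table, one dead table.
cur.executescript("""
    CREATE TABLE ORDERS (ORDER_NUM INTEGER, CUST_ID INTEGER, TOTAL REAL);
    CREATE TABLE STATE_CODES (CODE TEXT, NAME TEXT);
    CREATE TABLE TMP_MIGRATION (ID INTEGER);
""")
cur.executemany("INSERT INTO ORDERS VALUES (?, ?, ?)",
                [(i, i % 3, 9.99) for i in range(1000)])
cur.executemany("INSERT INTO STATE_CODES VALUES (?, ?)",
                [("TX", "Texas"), ("CA", "California")])

# The "Analysis" side: table-level stats gleaned from metadata.
cur.execute("""CREATE TABLE TableInformation (
    TableName TEXT, ColumnCount INTEGER, RowCount INTEGER)""")

for (name,) in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name <> 'TableInformation'").fetchall():
    cols = len(cur.execute(f"PRAGMA table_info({name})").fetchall())
    rows = cur.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    cur.execute("INSERT INTO TableInformation VALUES (?, ?, ?)",
                (name, cols, rows))

# Sort by RowCount descending: the big tables float to the top.
for row in cur.execute(
        "SELECT TableName, RowCount FROM TableInformation "
        "ORDER BY RowCount DESC"):
    print(row)
```

The point is the shape of the result, not the tooling: one row per table, sortable by RowCount, which immediately separates the heavily-used tables from the dead ones.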

By seeing a list of all of the tables with their RowCount, you can get an idea of which tables are Transactional (lots of reads/writes), Lookup (written once, now just for lookup values, like states or country lists), or Unused (one or fewer rows).  The number of “important” tables is now a fraction of the original number of tables you were looking at.

The ColumnInformation contained columns for Database, TableName, ColumnName, as well as data type information.  Collecting column information is extremely useful when your database doesn’t have any relationships explicitly defined.  You can perform queries like "SELECT * FROM ColumnInformation WHERE ColumnName = 'ORDER_NUM'".  This can give you a great indication of what is related to what.
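A minimal sketch of that kind of query, again using Python and SQLite as a stand-in for the real database (the table and column names are invented):

```python
import sqlite3

# Sketch only: build a ColumnInformation table from schema metadata,
# then hunt for tables sharing a column name.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE ORDERS      (ORDER_NUM INTEGER, CUST_ID INTEGER);
    CREATE TABLE SHIPMENTS   (SHIP_ID INTEGER, ORDER_NUM INTEGER);
    CREATE TABLE INVOICES    (INV_ID INTEGER, ORDER_NUM INTEGER);
    CREATE TABLE STATE_CODES (CODE TEXT, NAME TEXT);
""")

cur.execute("""CREATE TABLE ColumnInformation (
    TableName TEXT, ColumnName TEXT, DataType TEXT)""")

for (table,) in cur.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name <> 'ColumnInformation'").fetchall():
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk)
    for _, col, coltype, *_ in cur.execute(
            f"PRAGMA table_info({table})").fetchall():
        cur.execute("INSERT INTO ColumnInformation VALUES (?, ?, ?)",
                    (table, col, coltype))

# Every table that carries an ORDER_NUM column is probably related
# to ORDERS, even with no foreign keys declared anywhere.
related = [t for (t,) in cur.execute(
    "SELECT TableName FROM ColumnInformation "
    "WHERE ColumnName = 'ORDER_NUM' ORDER BY TableName")]
print(related)  # ['INVOICES', 'ORDERS', 'SHIPMENTS']
```

Shared column names are only a hint, not proof, of a relationship; the profiling and join-counting steps below are what confirm it.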

Poke and watch

Finally, with a running application and some base stats, we’re ready to profile the application.  The basic idea is to perform a concrete operation and examine the traffic.  Pull up that Customer page, and with the profiler open, capture a slice of traffic related to your operation.  It helps if no one else is using the system at the time, as you don’t want to collect false data.

For each operation we find, we’ll:

  • Start the profiler
  • Perform the operation
  • Stop the profiler
  • Archive/examine the profiler results

By doing something like examining a product or looking up a (valid) customer, we can see not only what the main entity table is, but also what the ancillary tables are.  Most SQL profilers (such as SQL Server’s) allow you to copy the actual SQL script being used.  We can then paste this SQL into our query tool, re-run it, and examine the data returned.

Finally, as we’re noting relationships between the tables, we create a completely new database that contains only the tables and relationships.  We can’t add relationships to the existing database, as referential integrity probably wasn’t enforced (maybe it was only suggested or encouraged).  This allows us to create a very descriptive diagram that contains all of the tables and relationships of the legacy database, just without all of the other stuff that gets in the way.

We keep the original names of the tables and columns too, as that lets us go back to our stats database and do additional queries.  As soon as we determine what the main “Customer” table is (and its primary key or identifier column), we can query to see if any other tables reference it in some way.  Testing the connection through counts and joins lets us confirm the relationship.
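A sketch of what testing the connection through counts and joins might look like, under the assumption that SHIPMENTS.ORDER_NUM is suspected to reference ORDERS (all names invented, SQLite standing in for the real database):

```python
import sqlite3

# Sketch: if SHIPMENTS.ORDER_NUM really references ORDERS, joining
# the two should cover every shipment, leaving few or no orphans.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE ORDERS    (ORDER_NUM INTEGER);
    CREATE TABLE SHIPMENTS (SHIP_ID INTEGER, ORDER_NUM INTEGER);
""")
cur.executemany("INSERT INTO ORDERS VALUES (?)",
                [(n,) for n in range(1, 101)])
cur.executemany("INSERT INTO SHIPMENTS VALUES (?, ?)",
                [(i, (i % 100) + 1) for i in range(250)])

total = cur.execute("SELECT COUNT(*) FROM SHIPMENTS").fetchone()[0]
matched = cur.execute("""
    SELECT COUNT(*) FROM SHIPMENTS s
    JOIN ORDERS o ON o.ORDER_NUM = s.ORDER_NUM
""").fetchone()[0]
orphans = total - matched

print(total, matched, orphans)  # 250 250 0 -> the relationship holds
```

If the orphan count were large, the matching column names would be a coincidence (or the data is dirtier than expected), and the suspected relationship shouldn’t go into the diagram.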

Don’t get discouraged

It can be easy to get bogged down and discouraged when a legacy app falls in your lap.  An application with millions of lines of code and hundreds, if not thousands, of database tables can be completely overwhelming.

But often the goal is not to understand the codebase, but the entities, relationships and business processes.  Focus on key scenarios with domain experts, and profile the results.  With any application, importance is not equally distributed to all features.  With some targeted analysis and heavy conversation with the domain experts, you’ll be able to gain a deep insight into the business behind the legacy application.

About Jimmy Bogard

I'm a technical architect with Headspring in Austin, TX. I focus on DDD, distributed systems, and any other acronym-centric design/architecture/methodology. I created AutoMapper and am a co-author of the ASP.NET MVC in Action books.
  • Good stuff Jimmy. I’m in this exact situation at the moment working in a mostly untested codebase and it is easy to get overwhelmed. Another thing I’m doing right now is taking some advice from Michael Feathers and writing some “characterization” tests against a particular (huge) class before I start making changes to it. Doing this can help to understand the existing behavior much more and act as good documentation in the future.

  • This was awesome:
    “The biggest one is the complaining. Many very successful systems are built with duck tape and baling wire. It’s rather pointless to vilify an application or system that’s netted a company millions of dollars. Lack of structure or tests might be bogging down the company now, but it’s put a lot of food on people’s tables.”

    And I totally agree. Complaining allows the bar for improvement to be lowered and “justified,” but really it’s just a way to avoid really examining the problems. It’s a lot easier to blame it on “oh their code is fragile; this is undoable.”

    I’m also convinced now that patching legacy code without the original developers is one of the hardest jobs there is.

  • Excellent advice, thanks.

    Regarding SQL Profiler, you can limit your profiling to a particular process ID; then it does not matter if others are using the app.

  • @Steve

    Thanks, yeah there are definitely some columns to filter out what you’re looking at. In one case, we had all calls going through remoting objects on an application server, so there was literally no way to distinguish our calls from others. Nearly all of the time there is some unique information though, whether it’s process ID, login account, machine name etc.

  • Whatever

    “The biggest one is the complaining. Many very successful systems are built with duck tape and baling wire. It’s rather pointless to vilify an application or system that’s netted a company millions of dollars. Lack of structure or tests might be bogging down the company now, but it’s put a lot of food on people’s tables.”

    Amen, brother. I’m pragmatic if anything and I cringe when I hear developers who are pulling down six figures bitch and moan about the condition of the code that is responsible for paying their hefty paychecks.

  • Actually, complaining about the code is only helpful when the original developer is there to overhear your frustration and perhaps learn a little something-something. It just isn’t socially acceptable, perhaps, but then I feel we are missing out on the benefits of public humiliation as a society…

    I’ve written some crap code, as all developers have at one point. To me, the developers that can rise above the crap code and actually get something done without their complaining slowing them down can complain all day long for all I care.

    Complaining can be therapeutic, no?

    I groan in the instances where you cannot create a staging or test server. Good times there…