Profiling a legacy app

Approaching a legacy application can be a daunting task.  You may or may not have access to the original developers (if they even still work for the company), and the domain experts might not be able to commit to teaching you the software full time.  If you’re lucky enough to have access to true domain experts, it’s rare that they know the system from a technical standpoint, or are familiar with the entire system.  It’s more likely they have intimate knowledge of one piece, and cursory knowledge of the rest.

We’ve gone with an approach that’s allowed us to glean quite a bit of information about the domain in a fairly short period of time.  It can be overwhelming trying to figure out where to start, especially if you’re looking at a codebase with hundreds of thousands, if not over a million, lines of code.


Some pitfalls we wanted to avoid were:

  • Getting bogged down in code
  • Worrying/complaining about code quality
  • Searching for a pot o’ gold (magical class that gives you complete insight into the system)

The biggest one is the complaining.  Many very successful systems are built with duct tape and baling wire.  It’s rather pointless to vilify an application or system that’s netted a company millions of dollars.  Lack of structure or tests might be bogging down the company now, but it’s put a lot of food on people’s tables.

When seeing code that I wouldn’t necessarily write, I like to step back and ask myself, “how many meals did this code buy for a family?”  I’ve written my share of stinkers over the years, so who am I to point fingers?

Getting started

The absolute first step is to get access to:

  1. A working application
  2. Its outputs

Without a working application, we’re forced to comb through the codebase.  That’s hardly productive, as without an original developer, we have to make educated guesses about how it works and what’s important.

Also vital is insight into the outputs of the application, whether that’s files, web service calls or a database.  We want to treat the application like a black box.  We’re going to poke and prod the application, so it’s necessary to see what pops out the other side.

If the application writes to a database, get access to a profiler so you can watch traffic.  If it makes network calls, use a port sniffer or something like Fiddler to watch the traffic.  If it writes to files or to MSMQ, again, find a profiler so you can watch traffic.

Collect some stats

Chances are your legacy app writes to some kind of database.  The older the system, the more variety you’ll see in its design, and the more alien that design will seem to you.  Modern relational databases haven’t been around long when measured against the timeline of mainframes, and mainframe-era designs were much different from the databases you’re probably used to.

Don’t be surprised if you don’t find any referential integrity.  You might run a dependency analysis tool (like RedGate’s excellent Dependency Tracker) and not find anything connected at all.  If the application uses SQL Server, do yourself a favor and get one of RedGate’s bundles.

What worked well for us was not the diagrams from the legacy app, which can have more orphans than a Dickens novel, but a simple listing of tables with some important extra data.  We created another “Analysis” database with two tables:

  • TableInformation
  • ColumnInformation

The TableInformation table contained columns for Database, TableName, ColumnCount, RowCount, create/modify dates, and anything else we could glean from the metadata.  RowCount is important, as you can query TableInformation, sort by RowCount, and have a good idea of what the most important tables are.  Chances are a table with 10 million rows is fairly important.  Tables with zero rows can be crossed off the list immediately, as your database probably has tables that were created but never used.
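Filling TableInformation is just a walk over the database’s metadata.  Our app ran against SQL Server, but here’s a minimal sketch using SQLite so it’s self-contained; the legacy table names (ORDERS, CUSTOMER, OLD_UNUSED) are made up for illustration:

```python
import sqlite3

# Hypothetical stand-in for the legacy database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ORDERS (ORDER_NUM INTEGER, CUST_ID INTEGER);
    CREATE TABLE CUSTOMER (CUST_ID INTEGER, NAME TEXT);
    CREATE TABLE OLD_UNUSED (ID INTEGER);
    INSERT INTO ORDERS VALUES (1, 10), (2, 10), (3, 11);
    INSERT INTO CUSTOMER VALUES (10, 'A'), (11, 'B');
""")

# The "Analysis" database's TableInformation, reduced to the essentials.
conn.execute("""
    CREATE TABLE TableInformation (
        TableName TEXT, ColumnCount INTEGER, RowCount INTEGER)
""")

# Enumerate the tables from the metadata and record their stats.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name <> 'TableInformation'")]
for t in tables:
    col_count = len(conn.execute(f"PRAGMA table_info({t})").fetchall())
    row_count = conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
    conn.execute("INSERT INTO TableInformation VALUES (?, ?, ?)",
                 (t, col_count, row_count))

# Sort by RowCount to surface the likely-important tables first.
stats = conn.execute(
    "SELECT * FROM TableInformation ORDER BY RowCount DESC").fetchall()
for name, cols, rows in stats:
    print(name, cols, rows)
```

On SQL Server you’d pull the same information from the catalog views instead of running COUNT(*) against every table, but the shape of the result is the same.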

By seeing a list of all of the tables with their RowCount, you can get an idea of which tables are Transactional (lots of reads/writes), Lookup (written once, now just for lookup values, like states or country lists), or Unused (one or fewer rows).  The number of “important” tables is now a fraction of the original number of tables you were looking at.
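The Transactional/Lookup/Unused split can be sketched as a simple heuristic.  Row count alone can’t truly distinguish lookup from transactional tables (you’d combine it with the create/modify dates and what you see in the profiler), and the threshold below is an assumption you’d tune per system:

```python
def classify(row_count, lookup_threshold=10_000):
    """Rough first-pass bucketing of a table by its RowCount."""
    if row_count <= 1:
        return "Unused"        # created but never really used
    if row_count < lookup_threshold:
        return "Lookup"        # written once, read for lookup values
    return "Transactional"     # lots of reads/writes; likely important
```

Running this over TableInformation immediately shrinks the list of tables you actually need to study.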

The ColumnInformation table contained columns for Database, TableName and ColumnName, as well as data type information.  Collecting column information is extremely useful when your database doesn’t have any relationships explicitly defined.  You can perform queries like “SELECT * FROM ColumnInformation WHERE ColumnName = 'ORDER_NUM'”.  This can give you a great indication of what is related to what.
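Here’s that query in miniature, again with SQLite standing in and hypothetical table names (CUSTOMER, ORDERS, SHIPMENT); every table that mentions ORDER_NUM is a candidate for a relationship:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ColumnInformation (
        TableName TEXT, ColumnName TEXT, DataType TEXT)
""")
# Made-up column metadata, as harvested from the legacy schema.
conn.executemany("INSERT INTO ColumnInformation VALUES (?, ?, ?)", [
    ("CUSTOMER", "CUST_ID",   "int"),
    ("CUSTOMER", "NAME",      "varchar"),
    ("ORDERS",   "ORDER_NUM", "int"),
    ("ORDERS",   "CUST_ID",   "int"),
    ("SHIPMENT", "ORDER_NUM", "int"),
])

# Which tables share the ORDER_NUM column?  Probably related.
related = [r[0] for r in conn.execute(
    "SELECT TableName FROM ColumnInformation "
    "WHERE ColumnName = 'ORDER_NUM' ORDER BY TableName")]
print(related)  # ['ORDERS', 'SHIPMENT']
```

A GROUP BY on ColumnName with a HAVING COUNT(*) > 1 gives you the full list of shared column names in one shot.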

Poke and watch

Finally, with a running application and some base stats, we’re ready to profile the application.  The basic idea is to perform a concrete operation and examine the traffic.  Pull up that Customer page, and with the profiler open, capture a slice of traffic related to your operation.  It helps if no one else is using the system at the time, as you don’t want to capture unrelated traffic.

For each operation we find, we’ll:

  • Start the profiler
  • Perform the operation
  • Stop the profiler
  • Archive/examine the profiler results

By doing something like examining a product or looking up a (valid) customer, we can see not only what the main entity table is, but also what the ancillary tables are.  Most SQL profilers (such as SQL Server’s) allow you to copy the actual SQL script being used.  We can then paste this script into our query tool, re-run it, and examine the data returned.

Finally, as we note relationships between the tables, we create a completely new database that contains only those tables and relationships.  We can’t add relationships to the existing database, as referential integrity probably wasn’t enforced (maybe it was only suggested or encouraged).  This lets us create a very descriptive diagram that contains all of the tables and relationships of the legacy database, just without all of the other stuff that gets in the way.

We keep the original names of the tables and columns too, as that lets us go back to our stats database and do additional queries.  As soon as we can determine what the main “Customer” table is (and its primary key or identifier column), we can query to see if any other tables reference it in some way.  Testing the connection through counts and joins lets us confirm the relationship.
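The counts-and-joins check boils down to an orphan count: if nearly every foreign-key candidate value joins back to the suspected parent table, the relationship is probably real.  A sketch, with made-up CUSTOMER/ORDERS tables and an intentionally orphaned row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER (CUST_ID INTEGER, NAME TEXT);
    CREATE TABLE ORDERS (ORDER_NUM INTEGER, CUST_ID INTEGER);
    INSERT INTO CUSTOMER VALUES (10, 'A'), (11, 'B');
    -- CUST_ID 99 has no matching customer: an orphan.
    INSERT INTO ORDERS VALUES (1, 10), (2, 10), (3, 11), (4, 99);
""")

# Count ORDERS rows whose CUST_ID doesn't join to any CUSTOMER row.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM ORDERS o
    LEFT JOIN CUSTOMER c ON o.CUST_ID = c.CUST_ID
    WHERE c.CUST_ID IS NULL
""").fetchone()[0]
print(orphans)  # 1
```

Zero or near-zero orphans suggests ORDERS.CUST_ID really does point at CUSTOMER; a large orphan count means your guess at the relationship (or the key column) is wrong.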

Don’t get discouraged

It can be easy to get bogged down and discouraged when a legacy app falls in your lap.  An application with millions of lines of code and hundreds, if not thousands, of database tables can be completely overwhelming.

But often the goal is not to understand the codebase, but the entities, relationships and business processes behind it.  Focus on key scenarios with domain experts, and profile the results.  In any application, importance is not distributed equally across all features.  With some targeted analysis and heavy conversation with the domain experts, you’ll be able to gain deep insight into the business behind the legacy application.
