Designing primary keys

When creating a primary key for a table, we have a few options:

  • Composite key
  • Natural key
  • Surrogate key

Each has its own advantages and disadvantages, but by and large I wind up going with the last option.

Composite keys

Composite keys are primary keys made up of multiple columns.  For example, consider a system with an Order table and a LineItem table:
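
Something like this (a rough sketch; the column types and any columns beyond OrderID and ProductID are just for illustration):

CREATE TABLE Orders (
    OrderID int NOT NULL PRIMARY KEY,
    OrderDate datetime NOT NULL
);

CREATE TABLE LineItems (
    OrderID int NOT NULL REFERENCES Orders (OrderID),
    ProductID int NOT NULL,
    Quantity int NOT NULL,
    PRIMARY KEY (OrderID, ProductID)  -- composite primary key
);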

The LineItems table has an identity that is composed of two columns: the OrderID and the ProductID.  This makes sense in that an Order can only have one LineItem per Product.

The problem happens when I need to make a table that has a foreign key relationship to the LineItems table:
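
For the sake of a sketch, imagine a hypothetical LineItemDiscounts table that needs to reference a LineItem; it has to drag both columns of the composite key along with it:

CREATE TABLE LineItemDiscounts (
    OrderID int NOT NULL,
    ProductID int NOT NULL,
    Amount decimal(9, 2) NOT NULL,
    PRIMARY KEY (OrderID, ProductID),
    FOREIGN KEY (OrderID, ProductID)
        REFERENCES LineItems (OrderID, ProductID)  -- the composite key leaks into the child table
);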

Blech! Now our composite key is invading every table that relates to it.  Order is a natural aggregate root, but what happens if we decide later down the line that a LineItem is a standalone concept?

I’ve seen this a few times in databases I’ve created and some legacy ones.  You assume that a row’s identity is defined by its parent, but later it begins to have a life of its own.

Unless the table is a junction table (one that holds a many-to-many relationship) and has no attributes of its own, a single-column primary key is the safest bet.
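
A pure junction table (table and column names here are only illustrative) is about the one place a composite key stays harmless:

CREATE TABLE ProductCategories (
    ProductID int NOT NULL REFERENCES Products (ProductID),
    CategoryID int NOT NULL REFERENCES Categories (CategoryID),
    PRIMARY KEY (ProductID, CategoryID)  -- no attributes of its own, just the relationship
);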

Natural keys

Ugh.

Please don’t continue this (worst) practice going forward.  Natural keys are keys with business uniqueness, such as Social Security Number, Tax ID, Driver’s License Number, etc.

Besides performance issues with data like these as keys, one thing to keep in mind is that an entity’s identity can never change, from now until the end of time.

Things like SSNs are likely entered by humans.  Ever mistype something?  If something needs to be unique, there are other ways besides a primary key to ensure uniqueness.
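
As a sketch (table and column names are made up, SQL Server syntax), a unique constraint keeps the business rule without making the SSN the primary key:

CREATE TABLE Employees (
    EmployeeID uniqueidentifier NOT NULL PRIMARY KEY,  -- surrogate key
    SSN char(11) NOT NULL,
    LastName nvarchar(100) NOT NULL,
    CONSTRAINT UQ_Employees_SSN UNIQUE (SSN)  -- uniqueness enforced without being the key
);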

Surrogate keys

Surrogate keys are computer-generated keys that carry no business meaning and might never be shown to the end user.  Typical types are seeded integers and GUIDs.

Personally, I prefer GUIDs, as they’re guaranteed to be not only unique to the table, but globally unique (it’s in the name, for Pete’s sake!).  Combed GUIDs provide performance comparable to integers or longs, with the added benefit that they can be created by the application instead of the database.  GUIDs also play very nicely in replication scenarios.

You can still use natural key information, such as SSN, to identify or search for a particular record, but a surrogate key ensures uniqueness and performance.
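
Continuing the sketch above, a one-off search can still hit the natural value while every join and foreign key uses the surrogate:

SELECT e.EmployeeID, e.LastName
FROM Employees e
WHERE e.SSN = '123-45-6789';  -- lookup by the natural value; joins use EmployeeID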

Frankenstein keys

Unfortunately I have found another kind of primary key: the Frankenstein key.  Here’s a small taste:

Hmmm…let’s see: a CUSTOMER table, an ADDRESS_MASTER table, and one table that should join the two together, CUST_ADDR_DTL.  But it doesn’t have any foreign key relationships to the two tables it should; instead, it has a cryptic CUST_ADDR_NO column.
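
Roughly, the relevant pieces looked like this (the column types are my guess):

CREATE TABLE CUSTOMER (
    CUST_NUM int NOT NULL PRIMARY KEY
);

CREATE TABLE ADDRESS_MASTER (
    ADDR_KEY int NOT NULL PRIMARY KEY
);

CREATE TABLE CUST_ADDR_DTL (
    CUST_ADDR_NO char(20) NOT NULL PRIMARY KEY  -- no foreign keys, just one cryptic column
);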

Looking at the data reveals a mind-boggling design:

CUST_NUM
----------
     89480

ADDR_KEY
----------
    441839

CUST_ADDR_NO
--------------------
89480         441839

Wow.  Fixed-length column values aren’t new, especially in databases ported from mainframes.  But that last column, instead of splitting the two values into a composite key, just jams both foreign key values into one single Frankenstein column.

It was just about the ugliest thing I’d ever seen, until I saw a SELECT statement whose WHERE clauses used string functions to parse out the individual values.
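
Something along these lines (not the actual query, just the flavor, assuming the fixed widths sketched above):

SELECT c.CUST_NUM, a.ADDR_KEY
FROM CUST_ADDR_DTL d, CUSTOMER c, ADDRESS_MASTER a
WHERE c.CUST_NUM = CAST(LTRIM(RTRIM(SUBSTRING(d.CUST_ADDR_NO, 1, 10))) AS int)   -- first half of the jammed column
  AND a.ADDR_KEY = CAST(LTRIM(RTRIM(SUBSTRING(d.CUST_ADDR_NO, 11, 10))) AS int); -- second half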

Surrogate GUIDs by default

It doesn’t cost anything, and it’s the easiest and simplest choice to make going forward.  Composite keys can still be represented as foreign keys and unique constraints, but it’s tough to add identity later.
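
For example, the LineItems table from earlier could get its own GUID identity while keeping the old composite key around as a plain unique constraint (again, just a sketch):

CREATE TABLE LineItems (
    LineItemID uniqueidentifier NOT NULL PRIMARY KEY,  -- surrogate identity
    OrderID int NOT NULL REFERENCES Orders (OrderID),
    ProductID int NOT NULL,
    Quantity int NOT NULL,
    CONSTRAINT UQ_LineItems_Order_Product UNIQUE (OrderID, ProductID)  -- the old composite key survives as a constraint
);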

Not to mention, domain models look pretty terrible and are quite brittle with composite keys.  Since any system is so much more than the data it contains, why not go with surrogate keys, which give so much flexibility to your design?


  • http://www.codethinked.com Justin Etheredge

    I completely agree, but you better hope that no DBAs read your blog or I think you might get an earful! :-)

  • http://Bryan.ReynoldsLive.com Bryan Reynolds

    Completely agree.
    When I see an application with almost all of the tables holding long composite keys or natural keys, it makes me sad.
    Bryan

  • jlockwood

    Preach it brother! I’m working on a project now that has a number of composite keys that totally aren’t needed.

    Example:
    TABLE_WIDGET
    ------------
    ID NUMBER(30)

    TABLE_GIDGET
    ------------
    WIDGET_FK NUMBER(30)
    NAME VARCHAR(30)

    This is to support a one-to-one relationship. The DBA used both the widget foreign key and name to create a composite key. This is way unnecessary for a one-to-one and ended up making the associated classes look retarded (they are using the DDLs to generate the hbm files, then use the hbm files to generate the DataObjects).

    It makes baby Jesus cry I tell you!

  • Andre Borst

    Our customer’s DBA didn’t allow us to use generated IDs because it was supposedly very problematic to drop the database and import it into another system (something to do with the increment values, I think). Is this a valid problem? We had to go with using a timestamp as the primary key (the time of insertion), but this caused problems later when the db-server had a different timezone than the client…

  • http://compaspascal.blogspot.com/ Lars D

    My experience is that you should always design a database to be read fast. If you don’t know how a table is going to be read, use surrogate keys. However, there may be situations where a natural or composite key gives a performance increase of 10-1,000 times. I have run into quite a lot of these.

    Let’s imagine that we have 10000 things. For each of these, we have 10000 pieces of data, each 1kbyte. That’s 100GByte data. Now let’s imagine that we want the first 100 pieces of data from thing number 527… let us assume that all indexes (but no data) are cached in RAM, that the database sorts records by the primary key, and that hard disk seek time is 10ms. Finding these 100 pieces of data would take 10 ms or 1 second, depending on the choice of primary key.

  • Michi

    You don’t like “Natural Keys” because someone could enter a non-unique one by mistake. But isn’t that what Primary Keys are also about? To check/verify data when it’s entered? When you only have a “Surrogate Key” and someone enters wrong data, your database just inserts this wrong data without you noticing.
    What do you guys say about this controlling aspect (Constraint) of Primary Keys?

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Andre

    I stay away from database-generated IDs as much as possible. Sometimes legacy DBs need them (when a table doesn’t have one). Otherwise, I always create the primary key from the application. When it’s a GUID, it’s easy to do. When we’re using NHibernate, we use the guid.comb ID generator.

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Michi

    I don’t like natural keys because 1) of the bad performance I’ve seen. In a recent legacy DB we used, we switched from natural keys of fixed-length chars to GUIDs and saw about two orders of magnitude improvement.

    The other issue is that primary key values can never, ever change. If natural keys are entered by humans, right there you have a problem of human error. That violates the immutability principle of the primary key.

    I do like to use constraints for values that should be unique, like usernames, etc.

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Lars

    Really? That’s the opposite of the experience I’ve had. Natural keys always had just awful performance. Indexing on a char(30) SSN column was a couple of orders of magnitude worse than a combed GUID. (This is SQL Server, so YMMV.)

    I definitely do a lot of performance testing (near the end of the project) looking at various indexes I can put in place, looking at what queries I’m doing. Typically though, there’s one search (username, address, whatever), then the rest of the time I’m using the ID. Maybe 90-95% of the time we use the primary key, the rest a single search.

    Do you see a performance difference between natural and composite keys, and an index?

  • Kovica

    In our database every table has a primary key of a Long (BIGINT), as numbers are so much faster to compare than strings. Then we use unique keys to represent our “natural keys”. You simply join tables via numbers, so you get readable SQL statements and speed. We tried using a natural key as our primary key once, but that was a disaster. We ended up writing complex, long and unreadable SQL statements. Luckily it was a minor project. :)))

  • http://www.bin-co.com/ Binny V A

    I am not a DBA, but when I design a table, the primary key is almost always an integer with auto increment.

    1. Does this method have a name?
    2. What are the advantages/disadvantages of this method?
    3. How come you did not mention this method in the article?

  • http://chadmyers.lostechies.com Chad Myers

    @Binny:

    That’s known as ‘identity’ in SQL Server and ‘sequence’ in Oracle (auto-increment in most other DB’s).

    Advantages are that the DB usually guarantees that it’s sequential, so you avoid a lot of hassle. Performance is decent compared to most of the other methods, though in my experience that point is negligible due to other performance issues with RDBMS’s. And that’s about it as far as advantages.

    Disadvantages are that the app usually has to make a separate call or add stuff to an INSERT statement to get the ID of the most-recently-inserted record(s).

    ID’s are not unique among all records and DB’s, so if you have to merge/replicate to another DB, it is no less than a complete, total nightmare.

    The best solution I’ve ever seen that solves just about every problem and only creates a few is using GUID’s (specifically the COMBined method of generating GUIDs) for all ID’s. On certain databases, it can impact performance. Again, the performance impact of ID generation (no matter what you do) is negligible compared to all the other performance problems with using databases (locking, join queries, etc., etc.).

    The acceleration I get by using GUID’s usually far outweighs any negative aspects. And I can usually deal with the negative aspects simply.

  • http://compaspascal.blogspot.com/ Lars D

    @bogardj: The use of GUIDs in databases is probably the main reason why Microsoft SQL servers tend to have more RAM than data. If your server has more RAM than data, then GUIDs are probably ok, because RAM supports random access at high speed. However, if your data amount is significantly larger than the amount of RAM, distributing data across the hard disk will kill performance immediately.

    I’m not fortunate enough to work with databases that can fit into the RAM of a server, so GUIDs are definitely outlawed. Auto-incrementing integers are ok, because proximity of two of these integers is correlated to the probability that the records are both returned by a query.

  • Bob Saggett

    Your article makes some valid points but appears to be completely biased to your personal viewpoint. There are good, and bad, reasons for using surrogate keys and good, and bad, reasons for using other options.

    One point worthy of mention: you talk about the “COMB” GUIDs in close proximity to talking about GUIDs being globally unique. A simple mathematical exercise will show that there is MORE chance of the COMB GUIDs being non-unique on a global scale than those generated by other methods. The linked article makes the point that it would be difficult to create a million GUIDs in less than 1/300th of a second. However, on a global scale or, in fact, in a large organisation with many hundreds of replicated servers, the overall grid may produce many more GUIDs in that amount of time. This would still give a low chance of collision, but not as low as a good GUID algorithm.

    It is interesting to note that Microsoft’s preferred algorithm is simply a completely random number (or as random as a computer can be).

  • Mark Stouffer

    Overloaded fields suck in general, but especially when they are the join keys in a many-to-many relationship table. I use auto-increment keys in most tables just because testing selects by typing or copying a guid is a bit of a constant hassle during development.

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Lars

    That’s why we use the combed GUIDs. We’ve seen similar performance to LONGs in SQL Server (similar, but not equal). IIRC, MS is putting some more effort into the GUID keys to up their performance.

    We did try GUIDs in Oracle once. We wound up having to sequence the bytes a little differently to get a similar effect.

    In the Oracle installs with TB’s of data, they were doing other things to help with performance, such as creating clusters based on dates, etc. Of course, that was because we made the mistake of letting the business use our OLTP DB as their reporting DB, but that’s another story…

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Bob

    Yes, there’s definitely a greater chance of collision. It’s a risk. But when using identity columns with increasing numbers, collision is guaranteed (in replication scenarios).

    If all of the GUIDs are comb’ed, then the chance of a collision is the chance that two GUIDs are created with the same value _in the same time window_. I’d have to remember back to my stats class to figure out the probability…

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Binny

    Good – I’m not a DBA either! :)

    There’s another reason why I don’t like the identity column. It’s too easy to put a number in and just get something back. When integrating with other systems that use INT identifiers, I always wind up having to do an extra verification, just to make sure I got back what I meant to ask for.

  • Bob Saggett

    @bogardj

    You are absolutely correct. Identities are unusable in a multi-server environment or a synchronised off-line environment where users can write to the database off-line. The possibility of synchronisation and replication should be considered carefully before thinking about using identities.

  • http://statestreetgang.net Will

    Wow. Apparently no DBAs read this blog (which isn’t necessarily a bad thing!).

    Relevant links:
    Celko on PKs
    http://www.intelligententerprise.com/channels/business_intelligence/showArticle.jhtml?articleID=201806814
    Atwood on Celko on PKs
    http://www.codinghorror.com/blog/archives/000817.html

    The last time I created a database I mixed guids with natural keys. The guids have not caused me any problems. The natural keys have been a headache, as I didn’t plan them correctly (is that EVER possible?).

    I think unless I have strict requirements for speed and size I’m going to be sticking with guids from now on.

  • http://www.tandemgames.com Aaron Murray

    As some others have stated, and I want to reiterate, please do not use db guids as the primary key *unless you must*

    Just last week I had to deal with a table that used guids for the PK, and the table was so burdened that it finally became unresponsive during inserts. The table had approx 120 million records in it (a giant log table), but each insert was causing hundreds of disk hits because the clustered index was gigantic. We recreated the table with a new identity column (int) and left the guid column on there without constraints. The table started behaving correctly again.

  • http://jimmybogard.lostechies.com Jimmy Bogard

    @Aaron

    Just curious, were these GUIDs in this table combed GUIDs? Also, it sounds like since this was a log table, and not one for entities (that need to be UPDATEd as much as INSERTed), that also has to be a consideration.

    When we used combed GUIDs on a particularly large Oracle installation (~1TB, most in one table), we didn’t run into any problems. Then again, it’s Oracle.

  • john

    Has anyone run tests using newsequentialid()? Does it give any of the performance benefits of the comb GUIDs this article talks about?