Storage Size And Performance Implications Of A GUID PK

I sent the same Guid vs. Int. vs BigInt question to a group of coworkers yesterday. One of the responses I got was from a DBA, and I thought was worth repeating for the world to hear. Keep in mind that these represent the opinions and research of my coworker, not me. Whether I agree with him or not, is not the point of posting his response for the world to see.

———

Just as with most other things, there is no one correct answer.  I am a proponent of natural keys when they make sense.  Occasionally, we find that the natural key of an relation becomes cumbersome to deal with on a regular basis and we opt to utilize a surrogate key.  Surrogate key usage is a different discussion all together regardless of the data type chosen to represent it.

As far as size and scalability… in SQL Server, an int is 4 Bytes and can hold roughly 2 Billion positive unique numbers, bigint is 8 Bytes and can hold about  9.2 Quintillion (9,223,372,036,854,775,808) positive unique records.

GUIDs are typically 16 Bytes and can store 3.4 X10^38 unique values. 

There are several arguments for the use of  GUID in databases utilized by applications that we write. 

  1. You can generate the pk in code.
  2. You don’t have to execute and additional DB call to retrieve the key, post insert.
  3. Should you need to merge databases between different environments, the keys *should* not collide
    1. This assumes that some other natural key, represented by a unique index, is not violated.
  4. You all can read the rest of the reasons in [the article linked at NHForge].

That having been said, if you were to have a relation with 500 million records. The following would be true.

 

Records

PK Data Type

Data Type Size

Size Units

Size of PK

PK Unit

500,000,000

Integer

4

Bytes

1.863

GB

500,000,000

Big Integer

8

Bytes

3.725

GB

500,000,000

GUID

16

Bytes

7.451

GB

 

You can see that there is a significant different in the amount of space required to store a GUID vs Int or BigInt.  Although disk-space is cheap,  ultimately, that equates to physical contiguous blocks on a hard drive platter (until we all move to SS HDD’s) that have to be physically read or skipped during a seek event.  This can have a negative impact on database seeks that require a table scan.  Once the data is located, the GUID also consumes significantly more memory (RAM) during the retrieval…

If you are doing exact (indexed) fetches via the primary key (OLTP transactions), most of those arguments don’t hold water.  Most of what we do is OLTP.  In an OLAP environment, I strongly recommend the use of natural keys and the smallest data type available to store the data, for the reasons listed above.  When you truly have a relation with 260 million rows and that needs to be joined with a relation containing 500 million rows, an 8-12 byte increase on one of the columns used in the join (on each row) can bring even a hefty db server to its knees.  The likelihood of this happening in one of our OLTP systems is about the same as the likelihood that a GUID will be duplicated. :)

One last thought, db indexes on integer data types are EXTREMELY EFFICIENT, and extremely small.  I honestly don’t know the size difference, but I do know that an index on a GUID is inherently more complex.

Long story short.  Use your brain when you architect these things.  When it makes sense to use a GUID for application architectural reasons, by all means, use them.  However, be aware that the natural keys still exist, and your GUID may need to be done away with in a migration to a large aggregate OLAP reporting system.

———

Now, if I could only convince him (and the rest of my company) to blog about these things, on their own! :)


Post Footer automatically generated by Add Post Footer Plugin for wordpress.

About Derick Bailey

Derick Bailey is an entrepreneur, problem solver (and creator? :P ), software developer, screecaster, writer, blogger, speaker and technology leader in central Texas (north of Austin). He runs SignalLeaf.com - the amazingly awesome podcast audio hosting service that everyone should be using, and WatchMeCode.net where he throws down the JavaScript gauntlets to get you up to speed. He has been a professional software developer since the late 90's, and has been writing code since the late 80's. Find me on twitter: @derickbailey, @mutedsolutions, @backbonejsclass Find me on the web: SignalLeaf, WatchMeCode, Kendo UI blog, MarionetteJS, My Github profile, On Google+.
This entry was posted in Data Access, Database. Bookmark the permalink. Follow any comments here with the RSS feed for this post.
  • Mikael Egnér

    I think another relevant problem with GUIDs is in the cache of the database server.
    If you use the PK as the clustered index (which is common), then all other indexes on that table will have the GUID value stored as well (for the lookup).
    When you add more and more indexes, the size of the GUID in the PK and the indexes will take a lot of space.
    That space is not only relevant when reading from disk, but also decides how much “logical” data that could be stored in the cache of SQL-server.

    The performance for queries that could use the cache is very much faster than when SQL-server have to read from disk

    Our database is near 1 TB of data with very many GUIDs…

  • http://www.lostechies.com/members/colinjack/default.aspx colinjack

    Really good stuff.

  • Joel

    bvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvbnbnbnbnbnbnbnbnbnbnbnbnbnbnbnbnbnbn bn bnbnbnbnbnbnbnbnbnbnbn

  • Kaji

    Appreciable work..!

  • http://twitter.com/DaveVdE Dave Van den Eynde

    I think a clustered index on a GUID column is nonsensical. Also, primary keys don’t need to be clustered indexes. Finally, if you have 500mio rows, the size difference between ints and GUIDs may be the least of your problems.