Storage Size And Performance Implications Of A GUID PK
I sent the same Guid vs. Int. vs BigInt question to a group of coworkers yesterday. One of the responses I got was from a DBA, and I thought was worth repeating for the world to hear. Keep in mind that these represent the opinions and research of my coworker, not me. Whether I agree with him or not, is not the point of posting his response for the world to see.
———
Just as with most other things, there is no one correct answer. I am a proponent of natural keys when they make sense. Occasionally, we find that the natural key of an relation becomes cumbersome to deal with on a regular basis and we opt to utilize a surrogate key. Surrogate key usage is a different discussion all together regardless of the data type chosen to represent it.
As far as size and scalability… in SQL Server, an int is 4 Bytes and can hold roughly 2 Billion positive unique numbers, bigint is 8 Bytes and can hold about 9.2 Quintillion (9,223,372,036,854,775,808) positive unique records.
GUIDs are typically 16 Bytes and can store 3.4 X10^38 unique values.
There are several arguments for the use of GUID in databases utilized by applications that we write.
- You can generate the pk in code.
- You don’t have to execute and additional DB call to retrieve the key, post insert.
- Should you need to merge databases between different environments, the keys should not collide
-
This assumes that some other natural key, represented by a unique index, is not violated.
- You all can read the rest of the reasons in [the article linked at NHForge].
That having been said, if you were to have a relation with 500 million records. The following would be true.
Records
PK Data Type
Data Type Size
Size Units
Size of PK
PK Unit
500,000,000
Integer
4
Bytes
1.863
GB
500,000,000
Big Integer
8
Bytes
3.725
GB
500,000,000
GUID
16
Bytes
7.451
GB
You can see that there is a significant different in the amount of space required to store a GUID vs Int or BigInt. Although disk-space is cheap, ultimately, that equates to physical contiguous blocks on a hard drive platter (until we all move to SS HDD’s) that have to be physically read or skipped during a seek event. This can have a negative impact on database seeks that require a table scan. Once the data is located, the GUID also consumes significantly more memory (RAM) during the retrieval…
If you are doing exact (indexed) fetches via the primary key (OLTP transactions), most of those arguments don’t hold water. Most of what we do is OLTP. In an OLAP environment, I strongly recommend the use of natural keys and the smallest data type available to store the data, for the reasons listed above. When you truly have a relation with 260 million rows and that needs to be joined with a relation containing 500 million rows, an 8-12 byte increase on one of the columns used in the join (on each row) can bring even a hefty db server to its knees. The likelihood of this happening in one of our OLTP systems is about the same as the likelihood that a GUID will be duplicated. 🙂
One last thought, db indexes on integer data types are EXTREMELY EFFICIENT, and extremely small. I honestly don’t know the size difference, but I do know that an index on a GUID is inherently more complex.
Long story short. Use your brain when you architect these things. When it makes sense to use a GUID for application architectural reasons, by all means, use them. However, be aware that the natural keys still exist, and your GUID may need to be done away with in a migration to a large aggregate OLAP reporting system.
———
Now, if I could only convince him (and the rest of my company) to blog about these things, on their own! 🙂