One of the biggest, although non-normalizing, issues that I’ve struggled with is that of the natural versus surrogate keys. Now, I didn’t know what these were a few years ago so on the off chance that you are me in that time period, I’ll briefly explain. A natural key is a unique/primary key that is made up of strictly information that is logically connected with the entity you are storing in the row. The tried and true example is that of a social security number or a detailed title, basically anything that is part of the entity that makes it unique. What makes things hard is choosing the right data to make that storage easier, obviously the same FIRST_NAME occurs in more than one PERSON so you’ll have to use another or more rows (ex. FIRST_NAME, LAST_NAME, BIRTHDATE, BIRTH_PLACE). But it’s a lot easier to use a surrogate key, most often an autoincrementing integer, to just represent each row, because you know that will be distinct.
When instructed in my undergrad database lectures I was told, flat out, to never use surrogate keys. Fortunately, the man I learned this from can now be properly labeled as a “hack” when it came to practical computational theory. He would literally grab his shirt collar and adjust it uncomfortable while making strange, wounded-animal-like noises every time a programmer stuck an extra method call in a constructor or used a surrogate key. His goal was to make us flawless when it came to design practice, but he really just made us terrified that if we did something outside of his strict guidelines he would leap on us with his red pen. Or that the noises he was making was part of the summoning act to bring some giant bird to come and tear our typing fingers off.
Well, we learned to do a pretty good impression of him, and we also learned that what he was saying wasn’t exactly flawless, or even reasonable. The reality of the matter is that it’s OKAY TO USE INTEGER PRIMARY KEYS. First of all, it’s not the end of the world to de-naturalize a piece of data. Plus, not only is it okay, but there are significant benefits in speed and tool interoperability.
The only argument in existence against using a surrogate key is that you are essentially defining rows as being defined by a simple number that has nothing to do with the actual data. Sure, this is a loss, and it’s going to make comparisons between multiple tables a whole lot easier. But what about complex tables that aren’t easy to represent in one or two fields. I remember that same “E” professor made us, for a practical software course, store foreign key relationships to a table using 3 varchar fields. Since we had 2 mappings to said table, this meant 6 varchar fields defining the relationships instead of 2 integer fields that would have been much faster for our thousands of rows and much more readable.
Beyond that, do you know how to fix the entity integrity issue? Really, you can just add a unique key constraint on the same fields you would have used for the natural primary key. By using this method you essentially recreate the exact same restrictions on a faster indexed table for which it is much easier to define relationships.
CAVEAT: But it’s not right to say that a database designer, especially a green one, gets to use surrogate keys off the start. Why not? Because it’s not the best case. To be honest, I think this whole world would (www) be a little bit better off if we could make the initial model work. I will continue to use natural keys whenever it is convenient for the model, but the difference is that I will make a conscious decision about which one to pick.
Call me a glassy-eyed, idealist youth, but I honestly believe that you can test a good programmer not by what he knows but by how he evaluates his options. It’s a fine balance between diving off the board before checking for pool water and hourly, broad-field pH tests, but it’s the one that will make you a solid programmer/architect/designer.
Posted on August 1st, 2008 | filed under databases, programming | Trackback |