SQL Server - Varchars/Chars vs Int/BigInt as keys

Asked By ryan putman on 03-Mar-08 11:53 PM
This conversation came up and I was wondering if anyone had some
definitive proof one way or another....

If you have a key column for a table, and that column will be referenced
frequently as a foreign key in other tables but could also reach values
well into the millions, at what point (if ever) do you opt for a
char/varchar key as opposed to an int/bigint?

If the choice does become char/varchar at some point, are there
particular advantages to using char vs. varchar?  In the system in
question, the id values are currently between 7 and 11 characters
long......so nothing really small, but nothing too large either.

Thanks in Advance
Ryan Putman




Andrew J. Kelly replied on 03-Mar-08 08:31 PM
The decision to use char vs. int should depend on what type of data it is.
If the values are always going to be numbers, then why make it a char or
varchar when an INT is what it really is?  An INT will always be more
efficient than a 12-character datatype, since the INT takes up only 4
bytes while the character value takes up 12.

--
Andrew J. Kelly    SQL MVP
Solid Quality Mentors
ryan putman replied on 06-Mar-08 06:11 AM
hmm...thanks for the reply but apparently I didn't state my question
clearly enough.

The column in question is just some key column.  The question is
totally hypothetical, not analytical.  If I have a table that could
eventually reach well into the millions of rows, should I use a
standard integer (or bigint) identity column as a primary key or
should I use a generated char/varchar column?

The integer should provide faster linking and searching initially, but
is there a point where having a generated varchar key like A283CD is
better than an integer ID of 2,828,828?
Dan Guzman replied on 04-Mar-08 08:56 AM
I think Andrew understood your question, but you may not have understood his
answer.  The key datatype is determined by the underlying data.  For
example, if you have a numeric CustomerID, then an integer type is
appropriate.  A char (or varchar) is appropriate for an alphanumeric
CustomerID.


Since you mention "generated", I think you are specifically referring to a
surrogate key.  A surrogate key is an artificial value used as an
alternative to the natural key.  For a surrogate key data type, integers are
most commonly used in SQL Server because values are easily generated using
IDENTITY and provide good performance and low storage requirements.
Uniqueidentifiers are used in cases where the value needs to be globally
unique.  Character values are used only when business requirements dictate.

When you use surrogate keys, it is important that you also have a constraint
on the natural key to ensure uniqueness and prevent duplicate rows.
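
For example, a minimal sketch with a hypothetical Customer table (the names
here are made up purely for illustration):

Create Table dbo.Customer
(
    CustomerID   int Identity(1,1) Primary Key, -- surrogate key
    CustomerCode varchar(11) Not Null Unique    -- natural key, kept unique
)

The IDENTITY surrogate keeps joins cheap, while the UNIQUE constraint on the
natural key prevents duplicate rows.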

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/
TheSQLGuru replied on 04-Mar-08 09:44 AM
Andrew hit the nail on the head.  An int WILL be more efficient for your
application as stated.

Also, don't waste the extra 4 bytes on a bigint if millions are all you
expect.  An int can hold values from -2.1 billion to +2.1 billion, so unless
you expect to hit 4 billion+ rows, go with an int identity seeded at the
negative limit.
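
A minimal sketch of that seeding (the table and column names are just for
illustration):

Create Table dbo.BigTable
(
    ID int Identity(-2147483648, 1) Primary Key -- seed at the int minimum
)

Seeding at the negative limit gives you the full ~4.3 billion values of an
int before a bigint would ever be needed.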

--
Kevin G. Boles
Indicium Resources, Inc.
SQL Server MVP
kgboles a earthlink dt net
Andrew J. Kelly replied on 04-Mar-08 09:49 AM
I think Dan answered the questions, but I have one more comment.  I still
don't get why you feel the need to use a varchar over an INT just because of
the number of rows.  An INT will hold about 4.3 billion values (roughly -2.1
billion to +2.1 billion) while taking up only 4 bytes, and will always be
more efficient in storage and searching than the equivalent character
datatype would be.
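
As a quick sanity check of that range (these are simply the documented
limits of the int type):

Select Cast(-2147483648 As int) As MinInt, Cast(2147483647 As int) As MaxInt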

--
Andrew J. Kelly    SQL MVP
Solid Quality Mentors
ryan putman replied on 06-Mar-08 06:11 AM
There is no underlying data.  As I said, this is mainly a hypothetical
question.

A colleague of mine made the point that if I was going to have a table
with a very large number of rows (into the millions), a char-based
generated ID value would be more efficient than an integer
value......which I doubted, because INTs have always been faster in my
experience.

There is no data, just the question.....is there ever a point where
INTs become less efficient, because of how large they get, versus a
char?  I think you guys have basically said that chars are never
faster.....which is the answer I was looking for.....everyone just
seemed to get hung up on the thought that there was a pre-existing
data format already in place or something.
ryan putman replied on 06-Mar-08 06:11 AM
To provide an example....

The identity seeded integer style ID would have values like this -
1,2,3,4,5,6,7,8,9,10,11,12,13,14....34,35,36,37......45,46,47
The generated char ID would have values like this -
1,2,3,4,5,6,7,8,9,A,B,C,D,E..........Y,Z,10,11.........19,1A,1B

Obviously, you would eventually be storing a much smaller-looking char
value compared to the INT value once the number of rows is into the
millions, but the question remains: is that char more efficient or better
in any way than the INT?
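
For reference, here is a minimal sketch of how such a generated ID could be
produced from a row number (the function name and the 0-9/A-Z alphabet are
just my illustration, not an existing system):

Create Function dbo.ToBase36 (@n int)
Returns varchar(7)
As
Begin
    Declare @digits char(36)
    Declare @result varchar(7)
    Set @digits = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    Set @result = ''
    While @n > 0
    Begin
        -- prepend the digit for the current base-36 position
        Set @result = Substring(@digits, (@n % 36) + 1, 1) + @result
        Set @n = @n / 36
    End
    Return @result
End

For example, Select dbo.ToBase36(10) returns 'A' and Select dbo.ToBase36(36)
returns '10', matching the sequence above.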
Andrew J. Kelly replied on 04-Mar-08 04:36 PM
A char or varchar takes up 1 byte for each character, while an INT takes up
4 bytes regardless of the value.  So as soon as the CHAR value gets to 5
characters it becomes larger than the INT, and that would happen well before
you hit the millions, let alone billions.  And the INT is still more
efficient than the same-size CHAR.  In SQL 2008 the INT can even take up
less space if the values are not near the max.
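
A minimal sketch of that SQL 2008 option, assuming row compression is what
is meant here (the table name is hypothetical):

Alter Table dbo.MyTable Rebuild With (Data_Compression = Row)

With row compression, an int whose value fits in fewer bytes is stored in
fewer than 4 bytes on disk.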

--
Andrew J. Kelly    SQL MVP
Solid Quality Mentors
Tom Cooper replied on 04-Mar-08 04:56 PM
Smaller "looking" char value is right.  It only looks smaller.  Remember
that a varchar takes 2 extra bytes in your table.  So the characters for the
first 36 rows
0,1,2,3,4,5,6,7,8,9,A,B,C,D,E..........Y,Z each take 3 bytes to store which
is slightly better than an int which takes 4 bytes to store.
The next 1296 rows (00 thru ZZ) each take 4 bytes
The next 46,656 rows 000 thru ZZZ each take 5 bytes
The next 1,679,616 rows 0000 thru ZZZZ each take 6 bytes
The next 60,466,176 rows 00000 thru ZZZZZ each take 7 bytes
The next 2,176,782,336 rows 000000 thru ZZZZZZ each take 8 bytes
But each one of those rows could be assigned an int that only took 4 bytes.
So, as you can see, using varchar vs using an int will actually cost you
storage space once you have more than about 1300 rows.
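
You can check the per-value sizes yourself; DATALENGTH shows the bytes of
the value itself (the 2-byte varchar length overhead comes on top of that).
The sample string here is just 2,828,828 written in base 36:

Select DataLength(Cast(2828828 As int)) As IntBytes,          -- 4
       DataLength(Cast('1OMQK' As varchar(6))) As CharBytes   -- 5 (+2 overhead)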

And searching/comparisons/joining on an int value is faster than on a
varchar value.

And using an identity attribute on the int to generate each value is much
easier and more straightforward than the code you would need to find the
next varchar value when a row was inserted.

It's a slam dunk. Use an identity column and choose between smallint, int,
and bigint depending on how many rows could be added to the table.

Tom
ryan putman replied on 06-Mar-08 06:11 AM
Thanks, everyone, for your replies.

They were definitely helpful.

Not sure I can convince the people in charge but at least now I know
I am right :)
Tom Cooper replied on 05-Mar-08 02:05 PM
If they won't listen to you, suggest they run a test.  For example, a simple
test would be to create a table with an integer primary key with the
identity clause, insert some number of rows into it, and see how long it
takes and how much space the table takes up when you're done.  Then do the
same thing with a table using a varchar key with the scheme you outlined and
see how long that takes and how much space that table takes up.  For
example, for the identity column and 100,000 rows, I ran

Create Table TestInt (PK int identity Primary Key)
go
Declare @Time1 datetime
Declare @Time2 datetime
Declare @InsertCounter int
Set @InsertCounter = 1
Set @Time1 = Current_Timestamp
While @InsertCounter <= 100000
Begin
    Insert TestInt Default Values
    Set @InsertCounter = @InsertCounter + 1
End
Set @Time2 = Current_Timestamp
Select DateDiff(ms, @Time1, @Time2) As ElapsedMS
Exec sp_spaceused 'TestInt', 'True'

(I know that's not the fastest way to insert 100,000 rows, but since I
presume your real system will insert the rows one at a time, I had this test
do that also.)  This test on my machine took about 50 seconds, and
sp_spaceused said reserved space was 1,352 KB, data 1,288 KB, index 16 KB
and unused 48 KB.
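
(For what it's worth, a set-based load is much faster when you don't need to
simulate row-at-a-time inserts.  A sketch, run against an empty copy of the
table, using a cross join of system views purely as a row source:

Set Identity_Insert TestInt On
Insert TestInt (PK)
Select n
From (Select Row_Number() Over (Order By (Select Null)) As n
      From sys.all_columns a Cross Join sys.all_columns b) As x
Where n <= 100000
Set Identity_Insert TestInt Off

But the loop above better matches a system that inserts rows one at a time.)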

With the varchar key, it's more complicated.  You need a way to store the
next key (I added a table with one row) and a way to compute the next key (I
wrote a stored proc; please note this should not be used in production, as
it's not bulletproof at all: it doesn't, for example, handle concurrent
requests for the next key).  But this is only a test, so I was keeping it
simple.  So that code was

Create Table TestVC (PK varchar(6) Primary Key)
Create Table NextVC (PK int Primary Key, NextVCPK varchar(6))
Insert NextVC (PK, NextVCPK) Values (1, '1')
go
Create Procedure GetNextKey @NextKey varchar(6) Output As
Begin
    Declare @NewNextKey varchar(6)
    Declare @CharToProcess tinyint
    Declare @CharAsciiValue tinyint
    Declare @Done tinyint
    Select @NextKey = NextVCPK From NextVC Where PK = 1
    Set @NewNextKey = @NextKey
    Set @CharToProcess = Len(@NewNextKey)
    /* if the value is all Z's (Z, ZZ, ZZZ, etc.)
       then the next value is a 1 followed by the appropriate number of zeros */
    If @NewNextKey In ('Z', 'ZZ', 'ZZZ', 'ZZZZ', 'ZZZZZ')
    Begin
        Set @NewNextKey = '1' + Replicate('0', @CharToProcess)
    End
    Else
    Begin
        Set @Done = 0
        While @Done = 0
        Begin
            Set @CharAsciiValue = Ascii(Substring(@NewNextKey, @CharToProcess, 1))
            /* Find the next value for this character:
               change 0-8 to the next number, 9 to A, A-Y to the next letter,
               and Z to 0 (which forces a carry into the previous character) */
            Set @CharAsciiValue = Case
                When @CharAsciiValue Between 48 And 56 Then @CharAsciiValue + 1
                When @CharAsciiValue = 57 Then 65
                When @CharAsciiValue Between 65 And 89 Then @CharAsciiValue + 1
                Else 48 End
            Set @NewNextKey = Stuff(@NewNextKey, @CharToProcess, 1,
                Char(@CharAsciiValue))
            If @CharAsciiValue <> 48 Set @Done = 1 /* no carry, we are done */
            Else Set @CharToProcess = @CharToProcess - 1 /* carry left */
        End
    End
    Update NextVC Set NextVCPK = @NewNextKey Where PK = 1
End
go
Declare @Time1 datetime
Declare @Time2 datetime
Declare @InsertCounter int
Declare @NextKey varchar(6)
Set @InsertCounter = 1
Set @Time1 = Current_Timestamp
While @InsertCounter <= 100000
Begin
    Exec GetNextKey @NextKey Output
    Insert TestVC (PK) Values (@NextKey)
    Set @InsertCounter = @InsertCounter + 1
End
Set @Time2 = Current_Timestamp
Select DateDiff(ms, @Time1, @Time2) As ElapsedMS
Exec sp_spaceused 'TestVC', 'True'

This took about 120 seconds, and sp_spaceused said reserved space was
2,568 KB, data 2,504 KB, index 16 KB and unused 48 KB.

So about twice as long to load the varchar keys, twice as much space, and
much more complicated coding.

Tom
Andrew J. Kelly replied on 05-Mar-08 02:14 PM
And don't forget that every index you add will be that much larger as well,
since each nonclustered index carries the clustering key in every row.  That
means more network and disk I/O, and less data will fit into cache.

--
Andrew J. Kelly    SQL MVP
Solid Quality Mentors