SQL Server - Unicode and Varchar

Asked By Roger.Norepl on 12-Apr-08 08:39 PM
Hi,

I use a development tool that is able to handle Unicode characters.
I'm a bit confused about its mapping to SQL Server. There are two
mapping possibilities:
The first possibility maps a Unicode string to a varchar string, but
with twice the length of your setting in the internal data model of
the tool (fairly logical).
The second possibility maps a Unicode string to an nchar with the
length of your setting in the internal data model.

The nchar isn't suitable (I miss an nvarchar possibility), as I
have some very large tables whose data length varies greatly from
row to row.

What disadvantages are there to saving Unicode data in the varchar
type? Apparently my tool is able to handle the mapping to/from
varchar in the database, but what about e.g. Visual Studio and other
development tools? What problems might there be with Unicode data in
varchar fields?

Regards RogerW.

PS. My question could also be put as: what's the difference between
nvarchar(50) and varchar(100), and what benefits/disadvantages do
you get from the nvarchar type?




Dan Guzman replied on 11-Apr-08 08:12 AM
The issue is that not all Unicode characters can be converted to
corresponding varchar values, because you are going from 2 bytes per
character to 1 byte per character.  The standard ASCII characters in
the 0-127 range are common to all collations, so you won't have any
problems with those.  However, the mapping of other Unicode
characters into the 128-255 range depends on the target collation of
the varchar column.


An nvarchar(50) column can hold 50 Unicode characters at a cost of 100
bytes.  A varchar(100) column can hold up to 100 characters with the same
100 byte storage cost.  The benefit of Unicode is that you can store many
more different characters (e.g. simplified Chinese) without the need to pick
a specific collation for all the characters you might need to store.  The
benefit of varchar is the reduced storage.
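To see the collation dependence concretely, here is a rough Python
sketch (Python's single-byte code pages stand in for varchar
collations; the code page names are my choice for illustration):

```python
# Sketch: single-byte code pages behave like varchar collations.
# ASCII 0-127 survives everywhere, but code points above 127 only
# survive if the chosen code page happens to contain them.

gamma = "\u0393"  # Greek capital letter Gamma

# ASCII text fits in any single-byte code page.
assert "hello".encode("cp1252") == "hello".encode("cp1253") == b"hello"

# Gamma exists in the Greek code page (cp1253) but not in cp1252.
assert gamma.encode("cp1253") == b"\xc3"
try:
    gamma.encode("cp1252")
except UnicodeEncodeError:
    print("Gamma has no Latin1 mapping")  # lossy, as with varchar
```

The same character either maps to one byte or fails outright,
depending entirely on which single-byte target you picked.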

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/
Roger.Norepl replied on 12-Apr-08 08:39 PM
My tool handles all Unicode characters, so the bytes in the
varchar field are actually used in the same way as the bytes in an
nvarchar field, i.e. each character, including the standard
characters (0-127), needs 2 bytes of the varchar field. I suppose
that the tool does the transformation between program/screen and the
database.


I still don't understand why you can't store as many Unicode
characters in a varchar(100) as in an nvarchar(50), as you have the
same number of bytes available?

I'm interested in, e.g., whether it's possible to tell a Visual
Studio program that it should handle the field as an nvarchar column
although it is actually a varchar field in the database?
And what's actually the difference between an nchar(50) and a
varchar(100)? I suppose some programs (e.g. Microsoft products such
as Query Analyzer) can recognise nvarchar as a Unicode type and
handle the characters correctly, but overall it should be possible
to store Unicode characters in a varchar (with the proper
transformations)?
Is nvarchar a standardized type used by databases other than SQL
Server?

Sorry if I misunderstood your answer.

Regards RogerW.
Bob Milton replied on 11-Apr-08 02:37 PM
Roger,
One big problem: driver-level code tends to translate bit streams
on the assumption that you are writing to and reading from an 8-bit
text column. So it will do things like truncate at the first null
byte, or translate Unicode to 8-bit based on the current language
settings. In either case, what is stored is NOT the same bits as it
would be if the column were marked nvarchar (which is an SQL
standard type, not a Microsoft extension). In short, don't do it!
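A rough Python sketch of that lossiness (Python's encode stands in
for a driver forcing text through an 8-bit code page; cp1252 is just
one example target):

```python
# Sketch: a driver forcing Unicode text through an 8-bit code page
# loses the original bits, just like writing Unicode into varchar.
s = "\u0393\u03b5\u03b9\u03ac"  # Greek text "Geia"
stored = s.encode("cp1252", errors="replace")  # Latin1-style target
assert stored == b"????"             # every character was replaced
assert stored.decode("cp1252") != s  # the round trip fails
```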
Bob
Dan Guzman replied on 12-Apr-08 09:18 AM
If you were to convert the Unicode string to a byte array and store the
result in a varbinary(100) instead of varchar(100), then you could store
Unicode strings as you describe.

I think your misunderstanding is that you are confusing the number of
bytes with the number of characters.  It is only an implementation
detail that varchar(100) requires 100 bytes.  A varchar(100)
declaration in SQL Server means that you want to store 100
*characters*, and non-ASCII characters will be mapped according to
the target column collation where possible.  Consequently, each
Unicode character (2 bytes) is mapped to a varchar character (1
byte), so the remaining 50 characters of the varchar are unused.

The upshot of this is that you won't be able to do a round-trip conversion
for all Unicode characters.  I've included a script below to illustrate
this.  Note that the Unicode characters in the script might not display
correctly, depending on your reader.


Why not change the database column to nvarchar?  If you must use varchar,
you can only choose a collation that provides the best coverage of the
characters you need to store.  There will be compromises since not all
Unicode characters can be represented in a single varchar regardless of the
collation.


No, you can't transform all possible Unicode characters to a varchar for the
reasons mentioned earlier.


I don't know the official ANSI SQL standard on this, but SQL Server
and Oracle provide nvarchar/nvarchar2 for Unicode character data.

CREATE TABLE dbo.UnicodeExample
(
unicode_column nvarchar(50),
varchar_latin varchar(100) COLLATE Latin1_General_CI_AS,
varchar_greek varchar(100) COLLATE Greek_CI_AS,
varbinary_column varbinary(100)
)

DECLARE @unicode_variable nvarchar(50)
SET @unicode_variable =
REPLICATE(NCHAR(0x0393), 50) -- Greek letter Gamma x 50

INSERT INTO dbo.UnicodeExample
(
unicode_column,
varchar_latin,
varchar_greek,
varbinary_column
)
SELECT
@unicode_variable,
@unicode_variable,
@unicode_variable,
CAST(@unicode_variable AS varbinary(100))

SELECT
unicode_column, --ΓΓΓΓΓ...
varchar_latin, --GGGGG...
varchar_greek, --ΓΓΓΓΓ...
varbinary_column, --0x930393039303...
CAST(varbinary_column AS nvarchar(50)) --ΓΓΓΓΓ...
FROM dbo.UnicodeExample

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/
Roger.Norepl replied on 16-Apr-08 07:53 AM
Thank you very much for your answers,

here are my thoughts:


I'm trying to persuade those responsible for the IDE tool that I'm
using to change their mappings. But as this is a very long process
(they support many databases), it will take a while. I'm now
investigating whether I can proceed with the mappings available or
whether I should wait. As usual I won't have time to wait for new
mappings, but as Visual Studio is also used, it should be able to
handle the mappings in use. However, the main application is built
with the other IDE tool, and its SQL Server database driver should
define the table settings. I will probably be able to convert the
data once their nvarchar mapping becomes available (if ever).
Meanwhile I have to check further how varchar is used with Unicode
in the IDE tool.


I'm no Oracle expert, but I know that you can use at least UTF-8
with the char type (varchar) in Oracle, although the standard type
is nchar (or nvarchar). Below is an extract from an Oracle document:

"... encoding forms. For example, if you store Unicode data in CHAR
columns using UTF-8 only to later discover that you are storing many
Asian characters and that using UTF-16 would save you a great deal
of disk space, help is available. Oracle provides a quick and
painless migration solution whereby you can easily change the
encoding from UTF-8 to UTF-16 by using the ALTER TABLE MODIFY
statement to change the column's datatype from CHAR to NCHAR. The
reverse conversion is also possible. The conversion from one Unicode
encoding form to another is seamlessly handled."
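The space trade-off that extract describes is easy to check outside
Oracle; here is a small Python sketch of the byte counts (the sample
strings are my own):

```python
# Sketch: CJK characters take 3 bytes in UTF-8 but only 2 in UTF-16,
# while ASCII takes 1 byte in UTF-8 and 2 in UTF-16. Hence UTF-16
# saves space for mostly-Asian data, and UTF-8 for mostly-ASCII data.
cjk = "\u6f22\u5b57"  # two CJK characters ("kanji")
ascii_text = "ab"

assert len(cjk.encode("utf-8")) == 6        # 3 bytes per character
assert len(cjk.encode("utf-16-le")) == 4    # 2 bytes per character
assert len(ascii_text.encode("utf-8")) == 2
assert len(ascii_text.encode("utf-16-le")) == 4
```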

Any further comments appreciated, especially about UTF-8 and Unicode!

Regards RogerW.
Dan Guzman replied on 14-Apr-08 06:41 AM
I think you will face a challenge if you need to support multiple
DBMS products and cannot change the data type mappings for Unicode
data.  To store UTF-8 data in Microsoft SQL Server, you'll need to
store the data in a varbinary data type, convert and store it as
UCS-2 in nvarchar/nchar, or store it in a UTF-8 user-defined data
type (see http://msdn2.microsoft.com/en-us/library/ms160893.aspx).
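A rough Python sketch of the first two options (Python's codecs
stand in for the conversions; SQL Server stores nvarchar internally
as little-endian UCS-2/UTF-16):

```python
# Sketch: the same string as raw UTF-8 bytes (what a varbinary
# column would hold) versus UCS-2/UTF-16-LE bytes (how nvarchar is
# stored internally).
s = "\u0393amma"  # "Gamma" with a Greek capital Gamma
utf8_bytes = s.encode("utf-8")      # for a varbinary(n) column
ucs2_bytes = s.encode("utf-16-le")  # nvarchar's internal encoding

assert utf8_bytes == b"\xce\x93amma"  # Gamma is 2 bytes in UTF-8
assert len(ucs2_bytes) == 2 * len(s)  # always 2 bytes per character
# Converting between the two is a simple decode/encode:
assert utf8_bytes.decode("utf-8").encode("utf-16-le") == ucs2_bytes
```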

Also, note that character set support may be a database/instance-
level setting in products other than SQL Server.  I don't think you
can assume that char/varchar can store UTF-8 data.  For example, my
understanding is that an Oracle database installed with US7ASCII
rather than UTF8 will require the DBA to ALTER the database
accordingly before UTF-8 can be used in CHAR/VARCHAR2.  Note that,
like you, I am no Oracle expert, and I suggest posting
database-specific questions to the appropriate forum.

--
Hope this helps.

Dan Guzman
SQL Server MVP
http://weblogs.sqlteam.com/dang/

Roger.Norepl replied on 16-Apr-08 07:54 AM

I found some other useful links about this:
http://msdn2.microsoft.com/en-us/library/bb330962.aspx#intlftrql2005_topic2
http://support.microsoft.com/kb/232580

Having read these, I agree with you that those responsible for the
IDE tool have chosen the wrong mapping to varchar.
The statement in question reads: "... from databases by mapping
between UTF-8 (intern format of tool) and the UTF format used in the
database via an algorithm."

The right algorithm would transform to/from UCS-2, and so the
mapping should be nvarchar or nchar.
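A rough Python sketch of that algorithm (UTF-16-LE here stands in
for SQL Server's UCS-2; for characters in the Basic Multilingual
Plane the two coincide):

```python
# Sketch: the tool's internal UTF-8 converted losslessly to
# UCS-2/UTF-16-LE for an nvarchar column and back. No characters are
# lost, unlike a mapping through a single-byte varchar collation.
internal = "\u00c4\u00e4kk\u00f6set and \u0393".encode("utf-8")
for_nvarchar = internal.decode("utf-8").encode("utf-16-le")
back = for_nvarchar.decode("utf-16-le").encode("utf-8")
assert back == internal  # lossless round trip
```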

Regards RogerW.