
What's the difference between the XML and LONG VARCHAR data types?

VolkerBarth
Contributor

AFAIK, in older versions, the XML data type was just a "built-in user-defined" data type based on LONG VARCHAR, i.e. it was listed in SYS.SYSUSERTYPE.

In current versions, it seems to be a distinct data type listed in SYS.SYSDOMAIN.

Questions:

  • Is there any difference between XML and LONG VARCHAR in v12 and above?
  • And related: If the XML data I'm dealing with is UTF-8 encoded, should I use LONG NVARCHAR as data type instead of XML in case the default char collation is a single-byte one (such as Windows-1252)?
jack_schueler
Advisor

By current versions, you mean anything after version 9.

VolkerBarth
Contributor

I'm not sure that's what I meant, but now I know :)

FWIW, I have not found anything regarding that change in the v10 "What's New" doc sections...

jack_schueler
Advisor

I did the same search. I did find a .NET change where we noted a new SADbType.Xml data type. But that's about the only hint concerning this change.

Accepted Solutions (1)


Former Member

Some good questions, and a few possible challenges too, Volker.

. . . The rest of this is just a few observations that may help. . . .

While there is a distinct XML data type, such columns ... storage-wise, described-as, and encoding-wise ... are very much the same as a LONG VARCHAR, even being stored in (and, if required, converted to) the charset encoding of the database. This much you seem well aware of.

And therein lies one challenge: which charset encoding should you use for a database, when the standard itself permits such a wide variety of valid encodings?

My first instinct would be similar to yours (with your suggestion of LONG NVARCHAR): store the XML as UTF-8. But given that a full XML document can specify its own encoding, such a conversion could be problematic if you do not control that encoding at the source. Parsing the content and normalizing the Encoding attribute to match the stored encoding may be required to make that truly robust.
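To illustrate that normalization step, here is a rough sketch in Python, outside the server (the function name and regex are my own inventions, not an SQL Anywhere feature): it rewrites the declared encoding in the XML declaration to match the encoding the text is actually stored in.

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration,
# e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_ENC_RE = re.compile(r'(<\?xml[^?]*encoding=["\'])([^"\']+)(["\'])')

def normalize_declared_encoding(doc: str, stored_encoding: str) -> str:
    """Rewrite the declared encoding to match how the text is stored.

    Assumes doc is already decoded to a str; only the declaration
    text changes, not the character data itself.
    """
    return _ENC_RE.sub(
        lambda m: m.group(1) + stored_encoding + m.group(3), doc, count=1)

doc = '<?xml version="1.0" encoding="ISO-8859-1"?><root>café</root>'
print(normalize_declared_encoding(doc, "UTF-8"))
# <?xml version="1.0" encoding="UTF-8"?><root>café</root>
```

A document with no declaration (or no encoding pseudo-attribute) passes through unchanged, which is the safe behavior here.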

So yes, you could store your XML content as LONG NVARCHAR instead, though the easier solution may be to initialize the database as UTF-8 for both CHAR and NCHAR types (assuming all your content is UTF-8). There may even be a reason to go with LONG BINARY in some situations, but the good news is that the OpenXML() operator can deal with any of those (if correctly managed).

Note: none of this should be considered a recommendation without first factoring some practical considerations and application requirements into a proper analysis.

So that may only leave the question of "What do you give up when you do not use the XML data type explicitly?" Again, I'm not the expert. There may be some benefit to certain ADO.NET functions, but I suspect the biggest advantage comes when using the various XML*() functions and not needing to cast other data types to XML. The need to cast, and possibly to convert data encodings, can be a bit challenging.

Do share your thoughts and experiences though.

HTH

VolkerBarth
Contributor

Many thanks, Nick, that answers my questions - at least for the moment :). And particular thanks for including an answer to the inherent question "What do you give up when you do not use the XML data type explicitly?". As I'm using the XML_% functions, I'll prefer to stick with the XML data type, too.

So the most important point, methinks, is to be aware of the encoding issue at all...

VolkerBarth
Contributor

Parsing the content and normalizing the Encoding attribute to match the stored encoding may be required to make that truly robust.

FWIW, is there a builtin function that would help with this task? (AFAIK, DBISQL itself tries to determine a script's encoding based on several rules.)

I'm asking for the general case - in my current task, the encodings are known beforehand.

Answers (1)


Former Member

I haven't found a single function, builtin or otherwise, for the most critical part of this. The encoding="..." pseudo-attribute (if explicitly specified) exists only in the "<?xml ...?>" declaration and is not a parseable XML <tag>. So that only leaves string functions, such as this query:

    create or replace variable read_file long varchar;
    set read_file = xp_read_file( 'F:\...\Samples\UltraLite\sample.xml' );
    select substring(
              read_file,   -- from an xp_read_file() in my mockup
              locate( read_file, 'encoding="' ) + 10,
              /* say */ 6 );   -- or another locate to find the position of the closing '"'

Of course where that becomes important is how, when and where you are accessing and how much charset conversion has already happened.

Good luck

VolkerBarth
Contributor

OK, yes, that's helpful for XML data. (I guess I had hoped that SA would possibly have a general "attempt to determine a text file's encoding" function but that would certainly be very difficult to implement...)
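For the XML case specifically, there is at least a deterministic first pass: XML 1.0 Appendix F describes auto-detecting the encoding family from the byte-order mark, or from the byte pattern of '<?xml'. A rough sketch of that logic (my own simplification, not an SA builtin):

```python
def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML byte stream from its BOM
    or the byte pattern of '<?xml', per XML 1.0 Appendix F."""
    boms = [
        (b'\x00\x00\xfe\xff', 'utf-32-be'),
        (b'\xff\xfe\x00\x00', 'utf-32-le'),
        (b'\xfe\xff', 'utf-16-be'),
        (b'\xff\xfe', 'utf-16-le'),
        (b'\xef\xbb\xbf', 'utf-8'),
    ]
    for bom, enc in boms:
        if data.startswith(bom):
            return enc
    # No BOM: look at how '<?' is encoded.
    if data.startswith(b'\x00<\x00?'):
        return 'utf-16-be'
    if data.startswith(b'<\x00?\x00'):
        return 'utf-16-le'
    # Default per the spec; the declaration may then refine this
    # to a specific 8-bit charset such as windows-1252.
    return 'utf-8'

print(sniff_xml_encoding('<?xml version="1.0"?>'.encode('utf-8-sig')))
# utf-8
```

It only narrows things down to an encoding family; the declared encoding (if present) still has to be read afterwards to pin down single-byte charsets.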

Former Member

I agree it is a bit awkward, but then again I suspect much of the XML manipulation that goes on happens in applications and application servers, and not so much in the back-end store technology. And I suspect we are not very different from other RDBMSs in that regard.

But I do suspect much of the XML/SOAP/DISH/WSDL-type support happens inside a CLR (an external environment, in our case) or a web/content-serving environment, so you might want to see what benefits that approach brings, 'outside of the SQL DB server context' per se.

Maybe I'm a little too suspicious (or skeptical), but maybe that opens up your investigation a bit wider; hopefully a more complete answer can be had that way.

I'm pretty sure you will find lots of C# and VB.NET code examples (as well as Java+DOM/SAX examples, and C++ XML code samples too) that can shed some light on many requirements.