UTF-8 characters stored OK, but output via XML procedure fails?

vlad1
Participant

Hi,

SAP SQL Anywhere 17.0.11.7058

Our database uses character encoding 1252 Latin (Western Europe) for the CHAR character set and UTF-8 for NCHAR. Everything is stored and retrieved just fine in our PowerBuilder application, but we also have an interface that exports XML files. If we try to export a record containing UTF-8 characters, we get the following error:

There was an error reading the results of SQL Statement. The displayed results may be incorrect or incomplete. Control character in XML output SQLCODE=-911, ODBC 3 State = "HY000"

Our output statement is as follows:

EXECUTE IMMEDIATE 'UNLOAD SELECT XMLGEN( ''<?xml version="1.0" encoding="UTF-8" ?><Root>{$x}</Root>'',
(SELECT
   Header.No,
   Header.JournalName,
   ...
   for xml auto, elements) AS x ) TO ''' + filepath_filename + ''' ENCODING ''UTF-8'' format ASCII quotes off escapes off';

Exports from queries that do not retrieve UTF-8 characters work fine.

Any help appreciated.

Cheers

Accepted Solutions (0)

Answers (1)


johnsmirnios
Employee

The XML data type appears to be based on the CHAR data type, so yours will be encoded in 1252: https://dcx.sap.com/index.html#sqla170/en/html/819b6cda6ce2101488ce846acd462106.html

VolkerBarth
Contributor

The following discussion with Nick Elson might give some clues on how to proceed...

vlad1
Participant

Thanks, but I think there's a misunderstanding: we are not storing XML as a data type in any table. We are just retrieving text/numeric fields from two tables, some of them NVARCHAR and LONG NVARCHAR, the others mostly CHAR and LONGs, and writing the XML files as pure text files with XML markup.

And this works fine as long as there are no UTF-8 characters to be retrieved, but as soon as there are some, we get the above error.

Our "unload SELECT XMLGEN" line clearly states that encoding should be UTF-8:

EXECUTE IMMEDIATE 'UNLOAD SELECT XMLGEN( <?xml version="1.0" encoding="UTF-8" ?>

and the last output file settings specifies it as well:

ENCODING UTF-8 format ASCII quotes off escapes off'

Here is another thing: when I tried it on a database where even CHAR is set to UTF-8 encoding, everything works just fine. It's only this database, which has CHAR as Latin 1252 and NCHAR as UTF-8, that gives us this problem.

VolkerBarth
Contributor

That's exactly the point: the XML functions like XMLGEN() return the XML data type, which is based on VARCHAR and as such is restricted to your CHAR character set. That's why it works with UTF-8 as the CHAR charset but fails on a single-byte charset for code points that don't fit...
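To illustrate the restriction (a minimal Python sketch, illustrative only, not SQL Anywhere itself): cp1252 simply has no code points for many characters that UTF-8 NCHAR columns store happily, so any value forced through a cp1252-based type like XML must fail or be mangled.

```python
# Which strings survive the cp1252 (Windows-1252) restriction that the
# CHAR-based XML type imposes? Sample strings are hypothetical.
def fits_cp1252(s: str) -> bool:
    """Return True if every character has a cp1252 code point."""
    try:
        s.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False

print(fits_cp1252("Müller"))  # True:  u-umlaut exists in cp1252
print(fits_cp1252("Łódź"))    # False: Polish L-stroke does not
```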

vlad1
Participant

I see, thanks. So there is simply no way for me to fix it like that?

In that case, I will convert the database to UTF-8 for both CHAR and NCHAR. Are there any implications I should know about? I was a bit surprised that the default encoding for CHAR was Latin 1252 rather than UTF-8. Is there a reason for that, anything I should be cautious about?

vlad1
Participant

I've just read that initial discussion in a bit more detail down the thread. I can't immediately see how I could change the XML data type to LONG NVARCHAR. I can see the XML domain in SYS.SYSDOMAIN, but I'm not sure how to change it from CHAR to LONG NVARCHAR. Is this even worth trying? 🙂

VolkerBarth
Contributor

I don't think you can change the XML data type. If you rely on cp1252 as the CHAR charset (like we do...), another approach is to create the XML document first and then use CSCONVERT() to convert it to NCHAR (as we have done regularly). You would still be restricted to the cp1252 character set, since the XML functions can only handle that (and therefore won't support, say, Eastern European characters), but the result would then be encoded accordingly.

This should resolve the issue if your client expects UTF-8 or other encodings.
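The shape of that workaround, and its limits, can be sketched outside SQL (hypothetical Python, illustrative only): the document is built under cp1252 rules first and only then re-encoded, so the conversion changes the byte encoding of the output file but cannot bring back characters cp1252 never had.

```python
# Sketch of the CSCONVERT-style workaround: build the document within
# cp1252, then re-encode the finished text as UTF-8 for the output file.
# The sample document is hypothetical.
doc = '<?xml version="1.0" encoding="UTF-8" ?><Root><Name>Müller</Name></Root>'

cp1252_bytes = doc.encode("cp1252")  # what the CHAR-based pipeline holds
utf8_bytes = cp1252_bytes.decode("cp1252").encode("utf-8")  # the conversion

# Only the u-umlaut grows (1 byte in cp1252, 2 bytes in UTF-8).
print(len(utf8_bytes) - len(cp1252_bytes))
```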

vlad1
Participant

I see - that wouldn't really solve the problem. I think I might switch the CHAR encoding to UTF-8, unless that would have any repercussions for stability or performance?

Thanks very much 🙂

johnsmirnios
Employee

Handling multibyte characters always has some performance impact relative to single-byte characters, but it is probably very minor. It might even help you end-to-end if your web clients use UTF-8, since you will skip a charset translation from 1252 to UTF-8 for all data going to the client.

You will need to choose between UTF8BIN (smaller performance impact, but less flexible sorting/comparisons; for example, you cannot get accent insensitivity) and UCA (more impact, but with linguistic sorting). NCHAR uses UCA by default; I recommend using that for CHAR too.

vlad1
Participant

Thank you John