
SQL Anywhere: Problem with JSON data containing German umlauts, wrongly decoded depending on the following characters

VolkerBarth
Contributor

While consuming JSON data from a web service, I've stumbled upon problems with nvarchar data containing German umlauts. In some cases, these umlauts (and the German sharp s) are decoded incorrectly.

IMHO, these characters are correctly encoded in JSON with \u and 4 hex digits, say a "ü" as \u00fc. However, depending on the following character (usually a regular Latin character, encoded as a single byte without a \u prefix), the sequence is either decoded correctly or treated as a totally different Unicode codepoint with a much higher number.

Here's a small sample with the umlaut "ü", followed by a space, the text "ge", and the text "be". In the last case, the umlaut is wrongly decoded as codepoint 64702 (0xFCBE), which is "ARABIC LIGATURE FEH WITH JEEM INITIAL FORM" - apparently because "b" and "e" are valid hex digits, so the decoder consumes "00fcbe" as a single six-digit escape.

This is run on a 17.0.11.7908 database with CharSet = windows-1252 / collation = 1252LATIN1, and NCharSet = UTF-8 / NCHAR collation = UCA.

select '\u00fc' as umlaut,
       unistr('\u00fc ') as UmlautWithSpace,
       unistr('\u00fcge') as UmlautWithGe,
       unistr('\u00fcbe') as UmlautWithBe,
       unicode('\u00fc'),
       unicode(UmlautWithSpace),
       unicode(UmlautWithGe),
       unicode(UmlautWithBe)

returns 

[Screenshot of the result set: UmlautWithSpace and UmlautWithGe decode correctly, while UmlautWithBe comes back as codepoint 64702]

Is there a problem in the UTF-8 decoding? And is there a workaround to fix this behaviour?

To clarify: I'm aware that the UNISTR function supports both \uXXXX and \uXXXXXX, so the value \u00fcbe can be treated as an escape sequence with six digits. However, I'm only using UNISTR here to show the effect; I'm generally relying on the sp_parse_json() function, and AFAIK the JSON standard only allows Unicode escape sequences with 4 digits (\uXXXX).
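For reference, a minimal repro via sp_parse_json() itself; the variable name jdata is arbitrary, and the dot syntax for accessing the parsed field is my assumption based on typical sp_parse_json usage:

begin
    -- "\u00fcbe" should decode to the four characters "übe"
    call sp_parse_json( 'jdata', '{ "word": "\u00fcbe" }' );
    -- with the behaviour described above, unicode() returns 64702 (U+FCBE)
    -- for the first character instead of 252 ('ü')
    select jdata.word, unicode( jdata.word );
end;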

 

johnsmirnios
Product and Topic Expert

I looked at our JSON parser and it does accept 6-character hex sequences. It should not be doing so and it's a quick fix.

However, I would also recommend that you consider using a CHAR collation based on UTF8 rather than windows-1252. If I recall correctly, sp_parse_json returns its string values as CHAR datatype; however, although our parser internally converts the bare text in strings from NCHAR to CHAR, it explicitly generates UTF8 encoding for escape sequences. The result can be strings that are in a mix of CHAR and UTF8 encoding. It's an old bug but changing our JSON support to use NCHAR everywhere had backward compatibility issues and some other complications that meant it never got implemented. Using a UTF8-based CHAR collation is an easy workaround. Unfortunately, it means a database unload/reload would be required.
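As a quick way to check the current settings before planning an unload/reload, the relevant database properties can be queried; a small sketch using the db_property() function:

select db_property( 'CharSet' )        as char_charset,
       db_property( 'Collation' )      as char_collation,
       db_property( 'NcharCharSet' )   as nchar_charset,
       db_property( 'NcharCollation' ) as nchar_collation;

A database that already uses a UTF8-based CHAR collation reports UTF-8 for CharSet.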

VolkerBarth
Contributor

Thanks for the quick response, John - and for the hint about a different collation. I cannot currently change the database collation, but I will be careful in that respect.
As a workaround, I will replace the possibly misinterpreted \uxxxx escape sequences with the actual characters (typically umlauts) when they are followed by two hex digits - and do that before sp_parse_json() is called, as shown in the sketch below. This should work as long as the characters are available in our charset.
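A minimal sketch of such a pre-processing step; the function name FixUmlautEscapes and the covered character set are my own illustration, and it replaces the escapes unconditionally, which is safe because the replacement is the correct decoding regardless of what follows:

-- A minimal sketch, assuming a windows-1252 CHAR charset that contains
-- all of these characters; names are illustrative, not an official API.
create or replace function FixUmlautEscapes( in @json long varchar )
returns long varchar
begin
    declare @result long varchar;
    set @result = @json;
    -- Replacing unconditionally is safe: the replacement text is the correct
    -- decoding of the escape sequence whether or not hex digits follow it.
    set @result = replace( @result, '\u00e4', 'ä' );
    set @result = replace( @result, '\u00f6', 'ö' );
    set @result = replace( @result, '\u00fc', 'ü' );
    set @result = replace( @result, '\u00c4', 'Ä' );
    set @result = replace( @result, '\u00d6', 'Ö' );
    set @result = replace( @result, '\u00dc', 'Ü' );
    set @result = replace( @result, '\u00df', 'ß' );
    return @result;
end;

-- usage: call sp_parse_json( 'jdata', FixUmlautEscapes( @raw_json ) );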