cancel
Showing results for 
Search instead for 
Did you mean: 

How can I test whether a conversion from NCHAR to CHAR (single byte collation) is lossy?

VolkerBarth
Contributor
796

With a database with default Windows-1252 CHAR encoding and default UTF-8 NCHAR encoding, I'm facing values in a NCHAR column that are not valid in the CHAR charset. (These are probably Polish characters, e.g. "C with Acute Accent", small ć = U+0107, and yes, I'm aware that using UTF-8 as CHAR encoding would omit that problem – but that is not possible in this case.)

Note, I'm not dealing with conversion between client and database, it's just database internal conversion via csconvert(). I'm using a query such as

select MyColumn,
   cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar) as MyColumn1252
from MyTable

How do I test whether a conversion via csconvert leads to losses?

I had thought the on_charset_conversion_failure option would be of help but setting tast option to a value besides the default "Ignore" "returns SQLCODE -883:

You attempted to set the on_charset_conversion_failure option to produce either a warning or error report. This setting is not available when the single-byte to single-byte converter is used."

Using csconvert to convert according NCHAR values to CHAR replaces the invalid characters with a substitution character /x1A character as expected - so is that the recommended method to check for invalidity by either testing

  • for a (known) SUB character via "where MyColumn1252 like '%\\x1A%'"
  • or whether original and converted data are inequal, such as "where MyColumn != MyColumn1252"?

And is there a facility to get the "base character" in this case (here a latin "c") via some csconvert with different locales or collation tailoring?

Accepted Solutions (1)

Accepted Solutions (1)

johnsmirnios
Employee
Employee

The doc says that on_charset_conversion_failure is for conversions between the client & the server, not for conversions occurring within the server. There's a lot of history there but ultimately I don't think there's a built-in, authoritative way to accomplish what you want. You can look for a substitution character but not all character sets have the same substitution character. Your second suggestion is probably best: round-tripping the string. If to_nchar( to_char( nstr ) ) != nstr then the conversion was lossy or the original string had an invalid encoding.

I don't think there is a way to get the "base character" either.

VolkerBarth
Contributor
0 Kudos

Well, I had hoped that, say within the German locale, I could use collation tailoring options that treat a regular c and a c with acute accent as equal (as that would be the typical "real-life workaround"). Thanks for the quick clarification, and the hint with to_char().

johnsmirnios
Employee
Employee

That's a different question from your original "How do I test whether a conversion via csconvert leads to losses?" or "how to get the base character" 🙂

If you want to compare strings at just the primary level, I think you can use COMPARE( x, y, 'UCA(case=ignore;accent=ignore;punct=primary)' )

You may need to be careful if x & y are different string types. If one is CHAR and the other NCHAR, I have a vague recollection that they might both be converted to CHAR first which would be lossy but maybe I'm confusing it with inference rules for comparison: https://dcx.sap.com/index.html#sqla170/en/html/81f5aac76ce21014ab3ac8ae03b903ae.html. I would hope that they are both promoted to NCHAR.

VolkerBarth
Contributor

John, thanks for the further pointer, I had (as usually...) forgotten that all string literals are converted to the CHAR charset, which had made my previous comparisons pointless.

If I use column values or build fitting unicode strings via UNISTR and \\u codes, I get the expected results that different locales treat different (accented) characters as separate characters or not. E.g. "ć" is a separate character in Polish whereas it is a variant of "c" in German.

begin
   declare str1 nvarchar(30) = unistr('Novi\\u0107');
   declare str2 nvarchar(30) = unistr('Novic');
   select str1, str2,
      compare(str1, str2, 'UCA(Locale=pl;Accent=Ignore)') as pl_compare, 
      compare(str1, str2, 'UCA(Locale=de;Accent=Ignore)') as de_compare;
end;

return Nović Novic 1 0

VolkerBarth
Contributor
0 Kudos

Just to add: For the (very rare) occasions of lossy conversions in my case, I now use

  • the test "MyUnicodeColum != to_nchar(to_char(MyUnicodeColum))" to detect such cases, and
  • then check for the substituted characters with some kind of "brute-force" approach whether any of the CHAR "base characters" (A-ZÆÐÞa-zæðþ) is treated as equal when compared via "UCA(Locale=de;Case=respect;Accent=ignore)" with the unicode character - if that succeeds, the according base character is used instead of the substitute character.

So in my above sample, "Nović" will be converted to "Novic" instead of "Novi␚", as that's the best I can do here IMHO.

Answers (0)