on 2022 Aug 17 2:14 PM
With a database with default Windows-1252 CHAR encoding and default UTF-8 NCHAR encoding, I'm facing values in a NCHAR column that are not valid in the CHAR charset. (These are probably Polish characters, e.g. "C with Acute Accent", small ć = U+0107, and yes, I'm aware that using UTF-8 as CHAR encoding would omit that problem – but that is not possible in this case.)
Note, I'm not dealing with conversion between client and database, it's just database internal conversion via csconvert(). I'm using a query such as
select MyColumn, cast(csconvert(MyColumn, 'char_charset', 'nchar_charset') as long varchar) as MyColumn1252 from MyTable
How do I test whether a conversion via csconvert leads to losses?
I had thought the on_charset_conversion_failure option would be of help but setting tast option to a value besides the default "Ignore" "returns SQLCODE -883:
You attempted to set the on_charset_conversion_failure option to produce either a warning or error report. This setting is not available when the single-byte to single-byte converter is used."
Using csconvert to convert according NCHAR values to CHAR replaces the invalid characters with a substitution character /x1A character as expected - so is that the recommended method to check for invalidity by either testing
And is there a facility to get the "base character" in this case (here a latin "c") via some csconvert with different locales or collation tailoring?
Request clarification before answering.
The doc says that on_charset_conversion_failure is for conversions between the client & the server, not for conversions occurring within the server. There's a lot of history there but ultimately I don't think there's a built-in, authoritative way to accomplish what you want. You can look for a substitution character but not all character sets have the same substitution character. Your second suggestion is probably best: round-tripping the string. If to_nchar( to_char( nstr ) ) != nstr then the conversion was lossy or the original string had an invalid encoding.
I don't think there is a way to get the "base character" either.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
That's a different question from your original "How do I test whether a conversion via csconvert leads to losses?" or "how to get the base character" 🙂
If you want to compare strings at just the primary level, I think you can use COMPARE( x, y, 'UCA(case=ignore;accent=ignore;punct=primary)' )
You may need to be careful if x & y are different string types. If one is CHAR and the other NCHAR, I have a vague recollection that they might both be converted to CHAR first which would be lossy but maybe I'm confusing it with inference rules for comparison: https://dcx.sap.com/index.html#sqla170/en/html/81f5aac76ce21014ab3ac8ae03b903ae.html. I would hope that they are both promoted to NCHAR.
John, thanks for the further pointer, I had (as usually...) forgotten that all string literals are converted to the CHAR charset, which had made my previous comparisons pointless.
If I use column values or build fitting unicode strings via UNISTR and \\u codes, I get the expected results that different locales treat different (accented) characters as separate characters or not. E.g. "ć" is a separate character in Polish whereas it is a variant of "c" in German.
begin declare str1 nvarchar(30) = unistr('Novi\\u0107'); declare str2 nvarchar(30) = unistr('Novic'); select str1, str2, compare(str1, str2, 'UCA(Locale=pl;Accent=Ignore)') as pl_compare, compare(str1, str2, 'UCA(Locale=de;Accent=Ignore)') as de_compare; end;
return Nović Novic 1 0
Just to add: For the (very rare) occasions of lossy conversions in my case, I now use
So in my above sample, "Nović" will be converted to "Novic" instead of "Novi␚", as that's the best I can do here IMHO.
User | Count |
---|---|
77 | |
30 | |
10 | |
8 | |
8 | |
7 | |
7 | |
5 | |
4 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.