on 2015 Mar 05 5:16 AM
We'd like to create a database that stores CHAR data as "UTF-16LE", but literally ALL (!) collations listed with -l say they are incompatible with "UTF-16LE". So what is the magic trick...?
AFAIK, in SQL Anywhere neither CHAR nor NCHAR data types can use UTF-16 to store values. When using Unicode you are restricted to UTF-8 (which will often need much less storage) - in contrast to, say, MS SQL Server, which uses UTF-16 for NCHAR data IIRC.
Cf. the quote from the doc topic "International languages and character sets":
Unicode support: SQL Anywhere supports Unicode as follows:
- Client support for UTF-16 in SQL Anywhere client libraries for ODBC, OLE DB, ADO.NET, and JDBC
- NCHAR data types for storing Unicode character data in UTF-8
- CHAR data types can use UTF-8 encoding
(Note: I may be mixing up - as usual - the terms "character sets"/"encodings"/"collations"...)
Why then does dbinit -le list so many UTF16 entries?
SQL Anywhere Initialization Utility Version 11.0.1.3158
SA, IANA, MIME, ICU, JAVA, ASE
ANSI_X3.4-1968, ANSI_X3.4-1968, US-ASCII, US-ASCII, ASCII,
Big5, Big5, Big5, windows-950-2000, Big5, cp950
Big5-HKSCS, , , ibm-1375_P100-2003, MS950_HKSCS, big5hk
EUC-KR, EUC-KR, EUC-KR, ibm-970_P110-1995, EUC-KR, eucksc
EUC-TW, , x-euc-tw, ibm-964_P110-1999, cp964, euccns
Extended_UNIX_Code_Packed_Format_for_Japanese, Extended_UNIX_Code_Packed_Format_for_Japanese, EUC-JP, ibm-33722_P12A-1999, EUC-JP, eucjis
GB18030, GB18030, , gb18030, ,
GB2312, GB2312, GB2312, ibm-1383_P110-1999, , eucgb
GBK, GBK, , windows-936-2000, GBK, cp936
hp-roman8, hp-roman8, , ibm-1051_P100-1995, , roman8
IBM437, IBM437, , ibm-437_P100-1995, cp437, cp437
IBM850, IBM850, cp850, ibm-850_P100-1995, cp850, cp850
IBM852, IBM852, , ibm-852_P100-1995, cp852, cp852
IBM855, IBM855, , ibm-855_P100-1995, cp855, cp855
IBM856, , cp856, ibm-856_P100-1995, cp856, cp856
IBM857, IBM857, cp857, ibm-857_P100-1995, cp857, cp857
IBM860, IBM860, cp860, ibm-860_P100-1995, cp860, cp860
IBM861, IBM861, cp861, ibm-861_P100-1995, cp861, cp861
IBM862, IBM862, cp862, ibm-862_P100-1995, cp862, cp862
IBM863, IBM863, cp863, ibm-863_P100-1995, cp863, cp863
IBM864, IBM864, cp864, ibm-864_X110-1999, cp864, cp864
IBM865, IBM865, cp865, ibm-865_P100-1995, cp865, cp865
IBM866, IBM866, cp866, ibm-866_P100-1995, cp866, cp866
IBM869, IBM869, cp869, ibm-869_P100-1995, cp869, cp869
IBM949, , , windows-949-2000, windows-949, cp949
IBM950, , , ibm-950_P110-1999, cp950, cp950
ISO-8859-15, ISO-8859-15, ISO-8859-15, ibm-923_P100-1998, ISO-8859-15, iso15
ISO_8859-1:1987, ISO_8859-1:1987, ISO-8859-1, ISO-8859-1, ISO-8859-1, iso_1
ISO_8859-2:1987, ISO_8859-2:1987, iso-8859-2, ibm-912_P100-1995, iso-8859-2, iso88592
ISO_8859-5:1988, ISO_8859-5:1988, iso-8859-5, ibm-915_P100-1995, iso-8859-5, iso88595
ISO_8859-6:1987, ISO_8859-6:1987, iso-8859-6, ibm-1089_P100-1995, iso-8859-6, iso88596
ISO_8859-7:1987, ISO_8859-7:1987, iso-8859-7, ibm-813_P100-1995, iso-8859-7, iso88597
ISO_8859-8:1988, ISO_8859-8:1988, iso-8859-8, ibm-916_P100-1995, iso-8859-8, iso88598
ISO_8859-9:1989, ISO_8859-9:1989, iso-8859-9, ibm-920_P100-1995, iso-8859-9, iso88599
KOI8-R, KOI8-R, KOI8-R, ibm-878_P100-1996, KOI8-R, koi8
macintosh, macintosh, macintosh, macos-0_2-10.2, , mac
macos-29-10.2, , x-mac-centraleurroman, macos-29-10.2, , mac_ee
macos-6-10.2, , x-mac-greek, macos-6-10.2, , macgrk2
macos-7_3-10.2, , x-mac-cyrillic, macos-7_3-10.2, MacCyrillic, mac_cyr
macturk, , x-mac-turkish, macos-35-10.2, , macturk
Shift_JIS, Shift_JIS, Shift_JIS, ibm-943_P15A-2003, , sjis
TIS-620, TIS-620, , ibm-874_P100-1995, cp874, tis620
UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE,
UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, utf16
UTF-8, UTF-8, UTF-8, UTF-8, UTF-8, utf8
windows-1250, windows-1250, , ibm-5346_P100-1998, windows-1250, cp1250
windows-1251, windows-1251, , ibm-5347_P100-1998, windows-1251, cp1251
windows-1252, windows-1252, , ibm-5348_P100-1997, windows-1252, cp1252
windows-1253, windows-1253, , ibm-5349_P100-1998, windows-1253, cp1253
windows-1254, windows-1254, , ibm-5350_P100-1998, windows-1254, cp1254
windows-1255, windows-1255, , ibm-9447_P100-2002, windows-1255, cp1255
windows-1256, windows-1256, , windows-1256-2000, windows-1256, cp1256
windows-1257, windows-1257, , ibm-9449_P100-2002, windows-1257, cp1257
windows-1258, windows-1258, , ibm-5354_P100-1998, windows-1258, cp1258
Windows-31J, Windows-31J, , ibm-943_P15A-2003, cp943c, cp932
windows-874-2000, , , windows-874-2000, windows-874, cp874

C:\projects\$templates\$SA_templates\run\dbinit>PAUSE
Press any key to continue . . .
Actually we must store Chinese text AND English/German text in the same database due to the nature of the application. While this is possible with UTF-8, we expect that particularly the Asian characters will bloat the .db file with significant overhead due to UTF-8's bias towards Latin characters. We further expect a performance drawback because of that. So we wanted to benchmark what happens if UTF-16LE is used -- and asked dbinit -le whether it is supported (and it says it IS supported).
So the question is not "how to store Unicode" but "how to use UTF-16"? 🙂
And the answer is that you cannot use UTF-16 in a SQLAnywhere database as the encoding for the CHAR or NCHAR type.
dbinit -le lists all encodings known to SQLAnywhere, not just the encodings that can be used in a database. GB18030 is not supported in databases either because you cannot determine the size of a character from the first byte; however, it is supported as a client charset, and in LOAD, UNLOAD, CSCONVERT.
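As a rough illustration of the GB18030 point, here is a minimal Python sketch (independent of SQL Anywhere itself; the sample characters are arbitrary) showing why the length of a character cannot be derived from its first byte:

```python
# In GB18030 the 2-byte and 4-byte sequences share the same range of lead
# bytes (0x81-0xFE), so the lead byte alone does not tell you how many
# bytes the character occupies.
samples = ["A", "\u4e02", "\u00e0", "\U0001F600"]  # ASCII, CJK, Latin-1, emoji
for ch in samples:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
```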
I guess the output from DBINIT -le(+) raises some questions of its own... - obviously it shows more collations/encodings than can actually be used as a database encoding with DBINIT.
Why then does dbinit -le list so many UTF16 entries?
UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE, UTF-16BE,
UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, UTF-16LE, utf16
There are (just) two, one for "UTF-16BE" and one for "UTF-16LE". In my understanding, each row lists one encoding/collation name in the SA naming standard and its corresponding name in 5 other categories (and the first row seems to be missing an "ASE" name...).
FWIW: The output is essentially identical for current versions like 12.0.1.4216.
As John has said: "The database does not store data in UTF16." That is the answer to your original question, isn't it?
To summarize: SQL Anywhere cannot store Unicode in any encoding other than UTF-8, right?
This effectively implies a substantial overhead in .db file size for mixed databases which have to store Chinese and Western characters at the same time. Is there a solution available to reduce this overhead, or is the effect so small that SAP simply does not see any need to address it?
Depending on the mix of Chinese and Western data, there might be storage overhead or there might be storage savings: in UTF-8, Western data will take 1 byte per character and Chinese data will take 3 bytes per character. For UTF16, it is (almost) always 2 bytes per character for both Western and Chinese. The storage overhead would be greatest when storing only Chinese characters. For a fairly even mix of Western and Chinese, you will break even.
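For a rough feel of the numbers, here is a minimal Python sketch (the sample strings are made up; only the per-character byte counts matter):

```python
# Compare the encoded size of Western vs Chinese text in UTF-8 and UTF-16LE.
samples = {
    "German":  "Die Größe der Datenbank",
    "Chinese": "数据库的大小",
}
for label, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))
    print(f"{label}: {len(text)} chars, UTF-8 = {utf8} bytes, UTF-16LE = {utf16} bytes")
```

Note that accented Western characters such as "ö" or "ß" take 2 bytes in UTF-8, so "1 byte per character" is an approximation for Western text.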
Whether the effect is too small or too large depends on the application, the text stored, the absolute size of the database and, above all, your own particular requirements.
Would UTF16 storage be advantageous for storage and/or performance in some circumstances? Sometimes yes, sometimes no. And, many times, it doesn't matter if one approach outperforms the other when both solutions provide completely adequate space & time performance for the given situation.
I can't say whether SQLA will ever support the use of UTF16 as a database storage encoding, but it does not currently do so. However, it's entirely possible that SQLAnywhere as it exists today will still completely meet your needs.
ASE does not have endian-specific UTF16 encodings: 'utf16' is always the same endianness as the server (or presumably the endianness of the client if the client requests 'utf16').
SQLA lets you specify a specific endianness, and 'utf16' is an alias for the platform endianness. I believe you can even refer to 'UTF-16PE' (for Platform Endian) on SQLA as an alias. It's not listed above because it's not the canonical name for any of the categories listed in the chart. It's just an extra alias.
On a big-endian platform, FWIW, the 'utf16' ASE alias would appear on the first line.
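To illustrate the endianness point, another small Python sketch (not tied to either product; "A" is just a sample character):

```python
# The same code unit in the explicit-endian UTF-16 forms and in plain "UTF-16"
# (which Python writes in native byte order, prefixed with a BOM).
ch = "A"
print("UTF-16LE:", ch.encode("utf-16-le").hex(" "))  # 41 00
print("UTF-16BE:", ch.encode("utf-16-be").hex(" "))  # 00 41
print("UTF-16  :", ch.encode("utf-16").hex(" "))     # ff fe 41 00 on a little-endian machine
```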
Apparently I can't speak for John, but given his response, I don't see that he has said he is "convinced there will not be a performance problem" in your case.
In my understanding, he has said "it depends" whether UTF-16 would be more performant or not (or, more correctly, need more or less storage space) than UTF-8, and that even a performance difference may not really be noticeable to users.
As SA does not offer both UTF-8 and UTF-16 for storage, in my understanding you could instead test whether storage and performance behave worse with UTF-8 compared to using only non-Unicode character sets, i.e. a single-byte charset for Western data and a fitting Chinese character set for Chinese data. IMHO, those would be the trade-offs to check (see the sketch below).
(I assume Chinese and Western data will belong to the same tables and columns, so it would not be possible to use CHAR for one and NCHAR for the other - or to use two different databases with different character sets?)
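To make that comparison concrete, a minimal Python sketch (the sample strings are invented, and this only looks at storage size, not at the operational cost of splitting data across differently encoded databases):

```python
# Compare UTF-8 against typical non-Unicode charsets: cp1252 for Western text
# and GBK for Chinese text.
german  = "Die Größe der Datenbank"
chinese = "数据库的大小"
print("German : cp1252 =", len(german.encode("cp1252")),
      "bytes, UTF-8 =", len(german.encode("utf-8")), "bytes")
print("Chinese: GBK    =", len(chinese.encode("gbk")),
      "bytes, UTF-8 =", len(chinese.encode("utf-8")), "bytes")
```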
Volker is correct: I did not say that there would not be a performance problem in some cases. Some cases will have better space performance when using UTF16 and some will have better space performance when using UTF8. Some cases will have better time performance when using UTF16 and some will have better time performance when using UTF8 (and those are not necessarily the same cases). Whether the performance is a "problem" is completely subjective. Often, either approach yields completely acceptable storage & time performance characteristics for the job at hand. Only you can measure the system and determine if it meets your needs.