‎2011 Aug 24 11:32 AM - edited ‎2011 Aug 24 1:32 PM
Hi,
I have a client in Russia using our software in Cyrillic and they are getting what I would describe as "odd" ordering of alphanumeric lists.
When displaying a list of names the order appears alomost random in that there will be a number of A's then a number of B's then again some A's, Some B's, Some C's, Some Z's then some A's again. For example if it was in English they get:
Aberthforth, Brian
Aimes, John,
Charles, A
Charleston, Peter
Avery, J
Bryan, Zoe
It might start off looking OK but then fails. I have a nice screenshot to show the example but I can't post this until I get to 100 points. If allowed I can post a link to the screenshot on our website if that helps.
I have tried running simple queries in ISQL and get the same results:
select customer.keyname from customer order by customer.keyname
They are using SQL Anywhere 10 on Windows 7 and XP.
It has been suggested that changing the collation of the database may help but this is totally new territory for me so thoughgt I would tap up the forum knowledge before I go down completely the wrong route.
Thanks in advance.
Alasdair
Added for Alasdair:
Character set 1252LATIN1, SQLA 10.0.1.3831
Request clarification before answering.
Images from Alasdair of the results of using SORTKEY in the ordering as suggested in Volker's post:
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
If my suspicions about bypassing charset conversion are correct, I don't expect sortkey used in this way to do any better since sortkey will implicitly convert charsets before computing the the key. In the first example above, we will convert data from what we think is CP1252 to CP1251 and, largely, every string will be full of substitution characters. You might try something like "order by sortkey( csconvert(customer.keyname,'utf8','cp1251'),'UCA(locale-RU)')" but that's just a crazy way of getting around the fundamental issue that the charset conversion is being bypassed in the original configuration which means that data stored in the database is being interpreted by the app as CP1251 and by the engine as CP1252.
The reason might be that comments get "read-only" after exactly one hour - cf. this FAQ...
I agree with John that SORTKEY is not the right solution. It was implemented before we had NCHAR support and is intended to provide linguistically correct sorting beyond the capability of our SQL Anywhere collations such as 1251CYR.
Depending on the requirements, 1251CYR is a perfectly good choice to store Cyrillic in CHAR columns and get reasonable sorting. If you want sorting like a phone book, consider using NCHAR columns and NCHAR collations (introduced in SA10).
The difference in behaviour is most noticeable around letters that are considered related. I can't speak for Russian, but in Western languages an example is accented letters. SA collations consider all forms of the letter 'a' (with and without accents) to be equal and sort them as equal. Think of it as removing all of the accents, sorting, then putting the accents back in the result.
I can see that 1251CYR has only a few "equal" letters, for example, 0x83, 0xb4 and 0xE3 are equal (variations of CYRILLIC SMALL LETTER GJE according to my references).
Using NCHAR collations, you get phonebook-style sorting, where the base letter is used for initial comparison, then accents (and case) are considered as secondary attributes. See http://dcx.sybase.com/index.html#1201/en/dbadmin/natlang-s-7003956.html for more details.
| User | Count |
|---|---|
| 15 | |
| 9 | |
| 6 | |
| 5 | |
| 4 | |
| 4 | |
| 3 | |
| 2 | |
| 2 | |
| 2 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.