cancel
Showing results for 
Search instead for 
Did you mean: 
Read only

Cyrillic Alphanumeric Ordering Problems

RADical_Systems
Participant
11,408

Hi,

I have a client in Russia using our software in Cyrillic and they are getting what I would describe as "odd" ordering of alphanumeric lists.

When displaying a list of names the order appears alomost random in that there will be a number of A's then a number of B's then again some A's, Some B's, Some C's, Some Z's then some A's again. For example if it was in English they get:

Aberthforth, Brian
Aimes, John,
Charles, A
Charleston, Peter
Avery, J
Bryan, Zoe

It might start off looking OK but then fails. I have a nice screenshot to show the example but I can't post this until I get to 100 points. If allowed I can post a link to the screenshot on our website if that helps.

I have tried running simple queries in ISQL and get the same results:

select customer.keyname from customer order by customer.keyname

They are using SQL Anywhere 10 on Windows 7 and XP.

It has been suggested that changing the collation of the database may help but this is totally new territory for me so thoughgt I would tap up the forum knowledge before I go down completely the wrong route.

Thanks in advance.

Alasdair

Added for Alasdair:

Character set 1252LATIN1, SQLA 10.0.1.3831


Cyrillic Alphanumeric Ordering Problems

View Entire Topic
justin_willey
Participant
0 Likes

Images from Alasdair of the results of using SORTKEY in the ordering as suggested in Volker's post:


Re Cyrillic Alphanumeric Ordering Problems


Re Cyrillic Alphanumeric Ordering Problems

johnsmirnios
Product and Topic Expert
Product and Topic Expert
0 Likes

If my suspicions about bypassing charset conversion are correct, I don't expect sortkey used in this way to do any better since sortkey will implicitly convert charsets before computing the the key. In the first example above, we will convert data from what we think is CP1252 to CP1251 and, largely, every string will be full of substitution characters. You might try something like "order by sortkey( csconvert(customer.keyname,'utf8','cp1251'),'UCA(locale-RU)')" but that's just a crazy way of getting around the fundamental issue that the charset conversion is being bypassed in the original configuration which means that data stored in the database is being interpreted by the app as CP1251 and by the engine as CP1252.

johnsmirnios
Product and Topic Expert
Product and Topic Expert
0 Likes

For some reason, I cannot edit my comment... Change my workaround to "order by sortkey( cast( csconvert(customer.keyname,'utf8','cp1251') as nchar),'UCA(locale-RU)')"

VolkerBarth
Contributor
0 Likes

The reason might be that comments get "read-only" after exactly one hour - cf. this FAQ...

Former Member

I agree with John that SORTKEY is not the right solution. It was implemented before we had NCHAR support and is intended to provide linguistically correct sorting beyond the capability of our SQL Anywhere collations such as 1251CYR.

Depending on the requirements, 1251CYR is a perfectly good choice to store Cyrillic in CHAR columns and get reasonable sorting. If you want sorting like a phone book, consider using NCHAR columns and NCHAR collations (introduced in SA10).

The difference in behaviour is most noticeable around letters that are considered related. I can't speak for Russian, but in Western languages an example is accented letters. SA collations consider all forms of the letter 'a' (with and without accents) to be equal and sort them as equal. Think of it as removing all of the accents, sorting, then putting the accents back in the result.

I can see that 1251CYR has only a few "equal" letters, for example, 0x83, 0xb4 and 0xE3 are equal (variations of CYRILLIC SMALL LETTER GJE according to my references).

Using NCHAR collations, you get phonebook-style sorting, where the base letter is used for initial comparison, then accents (and case) are considered as secondary attributes. See http://dcx.sybase.com/index.html#1201/en/dbadmin/natlang-s-7003956.html for more details.