cancel
Showing results for 
Search instead for 
Did you mean: 
Read only

openxml xml parser error

6,495

Hello!

When using "openxml" on results of the Google Geocoding API, I receive an XML parser error with the message "Invalid control character" for certain address details. Here's an example:

select * from openxml("CallGoogleapi"('https://maps.googleapis.com/maps/api/geocode/xml?address=pasica+75000+tuzla&language=en'),'/GeocodeR... with("lat" long varchar 'text()')

with "CallGoogleapi" defined as

FUNCTION "CallGoogleapi"( in "sLocation" long varchar ) 
returns xml
url '!sLocation!'
type 'HTTP:GET'

What can I do, so that openxml does not geerate this error?

Thanks in advance!

Dirk

View Entire Topic
Vlad
Product and Topic Expert
Product and Topic Expert

I was finally able to solve this... I don't even know the correct word... 'thing'. I don't know where to start, because the error is unclear, and the documentation looks incomplete for Unicode-related questions.

There are important links that explain that SqlAnywhere (when it cannot convert a string from one encoding to another) replaces the multi-byte character with '0x1a'. There is an option (on_charset_conversion_failure) that you can try to switch to Error to see if SA faces some difficulties with an implicit conversion. And I forgot to mention that you cannot use this option, because the default collation is LATIN1251 (at least on my machine) - 'single byte'.

If you execute the SELECT statement in SCJVIEW, you will see that SA returns a different string than you submitted:

SELECT  'Pašića Hadži' AS chr1,   'Жпа' AS chr2,
       N'Pašića Hadži' AS nchr1, N'Жпа' AS nchr2;

Here is what I see on my screen:
:(

But after I have created the DB file with the following command line options:

dbinit -dba dba,sql -mpl 3 -ze utf8 -zn UTF8BIN -z UTF8BIN anotherDB

I was able to get what I really wanted:

:)

Even you code works on a fresh DB file:

CREATE OR REPLACE FUNCTION "dba"."CallGoogleApi"( IN "sLocation" LONG NVARCHAR)
RETURNS LONG BINARY
URL '!sLocation'
TYPE 'HTTP:GET'

SELECT * FROM OPENXML("CallGoogleapi"('https://maps.googleapis.com/maps/api/geocode/xml?address=pasica+75000+tuzla&language=de@region=de'), '/GeocodeResponse/result/geometry/location/lat')
              WITH("lat" LONG NVARCHAR'text()')

The Grand Budapest Hotel

p.s. don't repeat my faults... don't work with Unicode in SQL. This is painful.

MarkCulp
Participant

Note that all SQL statements are converted and transmitted to the server in Database CHAR charset and characters that cannot be mapped to the Database CHAR charset get replaced by the '0x1a' character. This is why you are getting the results that you are seeing.

In a similar way, HTTP results are expected to be in Database CHAR charset as Volker pointed out in his response above.

VolkerBarth
Contributor

So my conclusion is correct that a SOAP type request is required for NCHAR data when the database does use a single-byte CHAR collation?

I agree with Vlad that Unicode handling with web client procedures could be explained better in the docs... 🙂

Vlad
Product and Topic Expert
Product and Topic Expert

I would like to add that I do not see any conversion to the DB CHAR charset. I will explain why:
1. When the CallGoogleApi is executed, I get a temporary value (the server response) as LONG BINARY (one-to-one, no conversion)
2. I pass this BINARY to OPENXML that should accept it as well (indeed I can also add ENCODING if I want, but the documentation says that USING VALUE can read data from LONG BINARY)
3. Then I select all fields/records from the OPENXML result set.

In other words, I do not store values in the DB. I do not use single-byte functions. The code should work with temporary variables/values as in any other programming language. I just submit my query to the server and expect that it will process it well.

I agree that only after I plan to insert the obtained UTF-8 string into a table, the DB server should warn me that implicit character conversion is used and either decline the INSERT or replace characters.

0 Likes

Thanks a lot for your help!

With a new database with UTF8 as CHAR charset there are no problems.

With a new database with default chasets (LATIN1251/UCA) I still habe the same error. Although theoretically no CHAR conversion should take place, it seems to take place somewhere.

I also tried "TYPE 'SOAP'" in CallGoogleApi and get the same results.

Unfortunately, the Google Geocoding API does not seem to have an option to influence the encoding of the result.

Too bad that there seems to be no solution. In any case, changing the CHAR charset to UTF8 is not possible.

Vlad
Product and Topic Expert
Product and Topic Expert
0 Likes

I agree that no conversion should happen when no persistence is involved, but it seems that the entire DB server (including even temporary results, function calls) uses the DB charset.

I'd better avoid using OPENXML, and try to work with JSON instead (as a long string). E.g.

https://maps.googleapis.com/maps/api/geocode/json?address=pasica+75000+tuzla

or the most reliable way is to use strongly typed languages such as Java.

I like the idea to use DB as a universal platform for everything, but not today.

VolkerBarth
Contributor
0 Likes

Hm, I do use SOAP web client functions with UTF-8 result sets in a database with default CHAR collation (1252Latin1), so I think that should work here, too.

Oops, I have to correct myself: These are SOAP calls but I use type "HTTP:POST:text/xml" with long nvarchar parameters, and I compose the SOAP envelope internally. It was the contrary restriction: type SOAP:DOC with XML parameters does not work with NCHAR data as the XML datatype is based on CHAR, not NCHAR...

See that FAQ for further information.

Vlad
Product and Topic Expert
Product and Topic Expert
0 Likes

I have noticed that simply not all characters from Bitwerk's payload are available in 1252Latin1. E.g. there is an article somewhere in help, where the word 'Straße' is used as an example how to work with Unicode in SA, but this example is wrong, because the letter 'ß' exists in a single-byte codepage Windows-1252.

If they took the real Unicode word (e.g. 'Pašića Hadži' or something from Hebrew/Chinese/something_else), their example will look completely different. I'd better see how to manipulate with Unicode strings in SA on real examples, such as: if we do this, we get that.

Vlad
Product and Topic Expert
Product and Topic Expert
0 Likes

Oh, I forgot what I wanted to tell 🙂

Yes, you can work with the SOAP web service, but who knows, maybe you use characters from your current charset, and everything works fine. If your web service returns you something that doesn't exist in your charset, your code might fail, or you see a blank square.

VolkerBarth
Contributor
0 Likes

Hm, currently I have no time for further tests but I hope to do some in the next weeks...