on 2017 Dec 15 4:01 AM
Hello!
When using "openxml" on results of the Google Geocoding API, I receive an XML parser error with the message "Invalid control character" for certain address details. Here's an example:
select * from openxml("CallGoogleapi"('https://maps.googleapis.com/maps/api/geocode/xml?address=pasica+75000+tuzla&language=en'),'/GeocodeR... with("lat" long varchar 'text()')
with "CallGoogleapi" defined as
FUNCTION "CallGoogleapi"( in "sLocation" long varchar ) returns xml url '!sLocation!' type 'HTTP:GET'
What can I do, so that openxml does not geerate this error?
Thanks in advance!
Dirk
Request clarification before answering.
I suggest to have a look at the XML document returned from the API, ideally in an XML aware editor / viewer which visualizes the document structure.
HTH
Volker
DB-TecKnowledgy
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
I was finally able to solve this... I don't even know the correct word... 'thing'. I don't know where to start, because the error is unclear, and the documentation looks incomplete for Unicode-related questions.
There are important links that explain that SqlAnywhere (when it cannot convert a string from one encoding to another) replaces the multi-byte character with '0x1a'. There is an option (on_charset_conversion_failure) that you can try to switch to Error to see if SA faces some difficulties with an implicit conversion. And I forgot to mention that you cannot use this option, because the default collation is LATIN1251 (at least on my machine) - 'single byte'.
If you execute the SELECT statement in SCJVIEW, you will see that SA returns a different string than you submitted:
SELECT 'Pašića Hadži' AS chr1, 'Жпа' AS chr2, N'Pašića Hadži' AS nchr1, N'Жпа' AS nchr2;
Here is what I see on my screen:
But after I have created the DB file with the following command line options:
dbinit -dba dba,sql -mpl 3 -ze utf8 -zn UTF8BIN -z UTF8BIN anotherDB
I was able to get what I really wanted:
Even you code works on a fresh DB file:
CREATE OR REPLACE FUNCTION "dba"."CallGoogleApi"( IN "sLocation" LONG NVARCHAR) RETURNS LONG BINARY URL '!sLocation' TYPE 'HTTP:GET' SELECT * FROM OPENXML("CallGoogleapi"('https://maps.googleapis.com/maps/api/geocode/xml?address=pasica+75000+tuzla&language=de@region=de'), '/GeocodeResponse/result/geometry/location/lat') WITH("lat" LONG NVARCHAR'text()')
p.s. don't repeat my faults... don't work with Unicode in SQL. This is painful.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Note that all SQL statements are converted and transmitted to the server in Database CHAR charset and characters that cannot be mapped to the Database CHAR charset get replaced by the '0x1a' character. This is why you are getting the results that you are seeing.
In a similar way, HTTP results are expected to be in Database CHAR charset as Volker pointed out in his response above.
So my conclusion is correct that a SOAP type request is required for NCHAR data when the database does use a single-byte CHAR collation?
I agree with Vlad that Unicode handling with web client procedures could be explained better in the docs... 🙂
I would like to add that I do not see any conversion to the DB CHAR charset. I will explain why:
1. When the CallGoogleApi is executed, I get a temporary value (the server response) as LONG BINARY (one-to-one, no conversion)
2. I pass this BINARY to OPENXML that should accept it as well (indeed I can also add ENCODING if I want, but the documentation says that USING VALUE can read data from LONG BINARY)
3. Then I select all fields/records from the OPENXML result set.
In other words, I do not store values in the DB. I do not use single-byte functions. The code should work with temporary variables/values as in any other programming language. I just submit my query to the server and expect that it will process it well.
I agree that only after I plan to insert the obtained UTF-8 string into a table, the DB server should warn me that implicit character conversion is used and either decline the INSERT or replace characters.
Thanks a lot for your help!
With a new database with UTF8 as CHAR charset there are no problems.
With a new database with default chasets (LATIN1251/UCA) I still habe the same error. Although theoretically no CHAR conversion should take place, it seems to take place somewhere.
I also tried "TYPE 'SOAP'" in CallGoogleApi and get the same results.
Unfortunately, the Google Geocoding API does not seem to have an option to influence the encoding of the result.
Too bad that there seems to be no solution. In any case, changing the CHAR charset to UTF8 is not possible.
I agree that no conversion should happen when no persistence is involved, but it seems that the entire DB server (including even temporary results, function calls) uses the DB charset.
I'd better avoid using OPENXML, and try to work with JSON instead (as a long string). E.g.
https://maps.googleapis.com/maps/api/geocode/json?address=pasica+75000+tuzla
or the most reliable way is to use strongly typed languages such as Java.
I like the idea to use DB as a universal platform for everything, but not today.
Hm, I do use SOAP web client functions with UTF-8 result sets in a database with default CHAR collation (1252Latin1), so I think that should work here, too.
Oops, I have to correct myself: These are SOAP calls but I use type "HTTP:POST:text/xml" with long nvarchar parameters, and I compose the SOAP envelope internally. It was the contrary restriction: type SOAP:DOC with XML parameters does not work with NCHAR data as the XML datatype is based on CHAR, not NCHAR...
See that FAQ for further information.
I have noticed that simply not all characters from Bitwerk's payload are available in 1252Latin1. E.g. there is an article somewhere in help, where the word 'Straße' is used as an example how to work with Unicode in SA, but this example is wrong, because the letter 'ß' exists in a single-byte codepage Windows-1252.
If they took the real Unicode word (e.g. 'Pašića Hadži' or something from Hebrew/Chinese/something_else), their example will look completely different. I'd better see how to manipulate with Unicode strings in SA on real examples, such as: if we do this, we get that.
Oh, I forgot what I wanted to tell 🙂
Yes, you can work with the SOAP web service, but who knows, maybe you use characters from your current charset, and everything works fine. If your web service returns you something that doesn't exist in your charset, your code might fail, or you see a blank square.
I think you should use (long) nvarchar data types for the OPEN XML output columns (i.e. in your case the "lat" column). According to the docs, the XML data will then be interpreted as NCHAR data, as well - to cite:
The xml-data is parsed directly in the NCHAR encoding if there are any NCHAR columns in the output. The xpath and namespaces arguments are also converted and parsed in the NCHAR encoding.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
I forgot to mention that this call normally works. For "berlin" I can calculate lat/log without oproblems:
"https://maps.googleapis.com/maps/api/geocode/xml?address=berlin&language=en" "......<lat>52.5200066</lat><lng>13.4049540</lng>....."
In the query "pasica+75000+tuzla" sa-17 reports the parsing error at character 150, line 6, column 41:
1 < ?xml version="1.0" encoding="UTF-8"?> 2 -< GeocodeResponse> 3 < status>OK< /status> 4 < result> 5 < type>route< /type> 6 < formatted_address>Hadži Hasanage Pašića, Tuzla 75000, Bosnia and Herzegovina< /formatted_address> 7 < address_component> 8 < long_name>Hadži Hasanage Pašića< /long_name> .....
Saving the result to a file, I see the following result with an unicode character:
<formatted_address>Hadži Hasanage Paši**/u001a**a, Tuzla 75000, Bosnia and Herzegovina</formatted_address>
In line 6, column 41 there is a hex "1A00" or U+001A. Apparently, openxml has a problem with this character. Would I have to do a conversion here?
Kind regards, Dirk
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
I'm not sufficiently expert in XML to decide whether this is correct but my guess is that 'encoding="UTF-8"' makes it valid. Unless your database's char collation is UTF-8, it might be worth trying to use nvarchar instead of varchar in your definitions.
At least, this information could help to locate the problem.
Cheers
Volker
I think the problem is not the OPENXML call but the web client function. It seems to convert the returned value to the CHAR datatype (so it works when you use UTF-8 as CHAR encoding). That restriction is documented here under "data-type", if my assumption is correct:
Only SOAP requests support the transmission of typed data such as FLOAT, INT, and so on. HTTP requests support the transmission of strings only, so you are limited to CHAR types.
So you might have to use a web client function of type SOAP...
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
I tried it again with newly created databases. When using the default collation ("COLLATION '1252LATIN1' NCHAR COLLATION 'UCA'") I still get the error.
When using "COLLATION 'UCA' NCHAR COLLATION 'UCA'" everything works fine!
I changed everything to "NVARCHAR" and tried it again with default collation: I still get the error.
select * from openxml("CallGoogleapi"('https://maps.googleapis.com/maps/api/geocode/xml?address=pasica+75000+tuzla&language=de@region=de'),... with("lat" long nvarchar 'text()') OPTION (ENCODING='UTF-8') FUNCTION "DBA"."CallGoogleapi"( in "sLocation" long nvarchar ) returns long nvarchar url '!sLocation' type 'HTTP:GET'
Shouldn't the call also be possible with the default collation?
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
86 | |
9 | |
9 | |
9 | |
7 | |
6 | |
6 | |
5 | |
5 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.