UTF-8 Encoding not consistent in Simple Transformation

patricksteffens
Participant

Note: this is a follow-up to my previously answered question: https://answers.sap.com/questions/12921343/simple-transformation-problems-with-encoding-conve.html

I found it could be handled most efficiently as a separate question.

---------------

Using fragments, I managed to generate XML tags dynamically without losing the opening and closing brackets. However, if the content of such a dynamic tag contains special characters, they do not seem to be translated properly into the output byte string.

Example:

The field "DKTXT" is a description field containing the string "Ä & < >".

I want to generate the following output XML (ignore the whitespace after &):

<?xml version="1.0" encoding="utf-8"?>
<Paket>
	<Metadaten>
		<DKTXT>& Auml; & amp; & lt; & gt;</DKTXT>
	</Metadaten>
</Paket>

using this ABAP code for transformation:

DATA lv_xml_out TYPE xsdany.

CALL TRANSFORMATION zpd_st_nscale_xml_abap
  SOURCE paket = ms_xml_data-paket
  RESULT XML lv_xml_out. "With a byte-string result variable, the XML is produced as UTF-8

When I take the bytes from lv_xml_out and decode them with an online hex <--> UTF-8 converter (https://sites.google.com/site/nathanlexwww/tools/utf8-convert), I get the following output:

Formatted XML (ignore the whitespace after &):

<?xml version="1.0" encoding="utf-8"?>
<Paket>
  <Metadaten>
    <DKTXT>Ä & amp; & lt; & gt;</DKTXT>
  </Metadaten>
</Paket>

Raw bytes:

3C3F786D6C2076657273696F6E3D22312E302220656E636F64696E673D227574662D38223F3E0A3C50616B65743E0A20203C4D657461646174656E3E0A202020203C444B5458543EC3842026616D703B20266C743B202667743B3C2F444B5458543E0A20203C2F4D657461646174656E3E0A3C2F50616B65743E

The result: the processing software that parses the output XML complains about the 'Ä' and claims that the UTF-8 file is erroneous.

Why did it translate <, > and & but not Ä? I would expect all symbols to be translated (i.e. Ä --> & Auml;).

And how can I force the translation of all special symbols into HTML character entities?

1 ACCEPTED SOLUTION

Sandra_Rossi
Active Contributor

The XML is completely normal and the program which complains is wrong (or maybe the file is not correctly sent).

Ä is represented as C384, which in UTF-8 represents the Unicode character U+00C4 which is Ä. In your XML header, it is clearly stated that the encoding after the header is UTF-8 so the XML is technically fine.

In XML, the only characters which need to be escaped are < and &, as the XML specification explains:

The ampersand character (&) and the left angle bracket (<) must not appear in
their literal form, except when used as markup delimiters, or within a comment,
a processing instruction, or a CDATA section. If they are needed elsewhere, they
must be escaped using either numeric character references or the strings "&amp;"
and "&lt;" respectively.

Other characters don't need to be represented by their character entity references.
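
For readers who still need every non-ASCII character written as a reference: & Auml; is an HTML entity and is not predefined in XML (only lt, gt, amp, apos and quot are), so a numeric character reference such as &#196; would be the XML-safe form. Below is a minimal Python sketch of that idea, run outside SAP and not part of the Simple Transformation; the sample string is the DKTXT value from the question, and the variable names are illustrative assumptions.

```python
from xml.sax.saxutils import escape

# DKTXT value from the question.
dktxt = "Ä & < >"

# escape() replaces only the markup characters &, < and > with entity references;
# Ä is left untouched because XML does not require it to be escaped.
escaped = escape(dktxt)

# Encoding to ASCII with "xmlcharrefreplace" turns every remaining non-ASCII
# character into a numeric character reference (&#196; for Ä).
ascii_only = escaped.encode("ascii", "xmlcharrefreplace")

print(escaped)
print(ascii_only.decode("ascii"))
```

A conforming XML parser treats Ä and &#196; as the same character, so this step should only be needed when a consumer cannot handle non-ASCII bytes at all.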

5 REPLIES

Sandra_Rossi
Active Contributor

To help people answer, here is a side-by-side display of the text and its UTF-8 hexadecimal bytes:

<?xml version="1.0"             3C3F786D6C2076657273696F6E3D22312E3022
 encoding="utf-8"?>             20656E636F64696E673D227574662D38223F3E0A
<Paket>                         3C50616B65743E0A
  <Metadaten>                   20203C4D657461646174656E3E0A
    <DKTXT>                     202020203C444B5458543E
       Ä & amp; & lt;           C3842026616D703B20266C743B
       & gt;</DKTXT>            202667743B3C2F444B5458543E0A
  </Metadaten>                  20203C2F4D657461646174656E3E0A
</Paket>                        3C2F50616B65743E

As we can see, Ä is represented as C384, which in UTF-8 represents the Unicode character U+00C4 which is Ä.
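
The same check can be reproduced with a few lines of Python, independent of SAP. This is only a sketch: the hex string is the DKTXT element copied from the dump above, and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Bytes of the <DKTXT> element, taken from the hex dump above.
dktxt_hex = "3C444B5458543EC3842026616D703B20266C743B202667743B3C2F444B5458543E"
raw = bytes.fromhex(dktxt_hex)

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

# An XML parser resolves the entity references back to the original characters.
element = ET.fromstring(raw)

print(text)
print(element.text)
print(hex(ord(element.text[0])))  # code point of the first character
```

If the decode step and the parse step both succeed, the byte string produced by the transformation is valid UTF-8 and well-formed XML.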

patricksteffens
Participant

Hi Sandra, thank you for your clarification. It helped me understand the whole topic a bit better. We store the XML file on the SAP file system and found out that it is saved in ANSI, i.e. 'Ä' is encoded as the single byte C4 and there is no UTF-8 file header.

Hence the transformation is correct, but the storing step currently converts the encoding.
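
To illustrate why a strict UTF-8 parser rejects such a file, here is a small Python sketch. It assumes that "ANSI" means Windows-1252 (cp1252), which the thread does not state; the helper is_valid_utf8 and the sample element are hypothetical.

```python
# Sample content; the real file contains the full XML document.
text = "<DKTXT>Ä</DKTXT>"

utf8_bytes = text.encode("utf-8")    # what the transformation produces
ansi_bytes = text.encode("cp1252")   # assumed result of the storing step


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes.hex().upper())  # Ä appears as the two bytes C384
print(ansi_bytes.hex().upper())  # Ä appears as the single byte C4
print(is_valid_utf8(utf8_bytes), is_valid_utf8(ansi_bytes))
```

A file that declares encoding="utf-8" but contains the single byte C4 is not valid UTF-8, which matches the parser's complaint. On the ABAP side, writing the byte string unchanged (for example with OPEN DATASET ... FOR OUTPUT IN BINARY MODE) would presumably avoid the conversion; whether that fits depends on how the file is currently written, which this thread does not show.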

Sandra_Rossi
Active Contributor

By the way, why do you use an online converter to display the hexadecimal bytes instead of the ABAP debugger?

patricksteffens
Participant

I wanted to use an SAP-independent tool to make sure that I don't run into misconceptions in the SAP ABAP context. Additionally, I can now prove that it is not an SAP-related issue.