Solved: Re: Remove HTML Tags

Former Member · ‎11-08-2013

Hi All,

I am using FM 'SO_DOCUMENT_READ_API1' to read e-mails received in SAP. It returns the HTML data in table 'contents_hex', which I then convert with method cl_bcs_convert=>htmlbin_to_htmltxt.

My problem is that this method returns the data below in table ET_HTML:

<html dir="ltr"><head>##<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">##<meta name="GENERATOR" content="MSHTML 8.00.7601.18228">##<style>.EmailQuote {###BORDER-LEFT: #800000 2px solid; PADDING-LEFT: 4p

t; MARGIN-LEFT: 1pt##}##</style><style>DIV.PlainText {###FONT-FAMILY: monospace; FONT-SIZE: 120%##}##</style><style title="owaParaStyle"></style>##</head>##<body ocsi="x">##<font size="2">##<div class="

PlainText"><br>## <br>## <br>##*- NOTE -*<br>##Pls do not change mail subject<br>##You can add your remarks above<br>##Click SEND button to complete your action</div>##</font>##</body>##</html>####################################################

Using either

REPLACE ALL OCCURRENCES OF REGEX '<[a-zA-Z\/][^>]*>'

or

FM 'SOTR_TAGS_REMOVE_FROM_STRING'

on this data does not remove all tags.

I want to extract only the following text from the HTML above (after removing the HTML tags):

*- NOTE -*

Pls do not change mail subject

You can add your remarks above

Click SEND button to complete your action

Regards,

Neha

Former Member · ‎12-02-2013

Hi All,

The issue is resolved.

We changed the config. in Transaction SCOT for Inbound mails.

SCOT -> Settings -> Inbound Messages -> Settings ->

Choose 'Text Version' for field 'Prefer following version of an Inbound E-mail'

After this, all inbound mails are saved in SAP in text format, regardless of whether they are received as RAW or HTML and regardless of the sending device (Outlook, Blackberry or iOS).

Thus HTML conversion is no longer required.

Regards,

Neha

Former Member · ‎11-08-2013

The data you are looking for is inside the div tag.

So you can first strip everything outside the div tag, remove special characters, and then either replace the <br> tags with newline or cr_lf, or split the string at the <br> tags into an internal table.

Something like this:

REPLACE REGEX '.*<div[^>]*>' IN lv WITH ''.
REPLACE REGEX '</div>.*' IN lv WITH ''.
REPLACE ALL OCCURRENCES OF REGEX 
    '[^a-zA-Z0-9<> ]*' IN lv WITH ''.
REPLACE ALL OCCURRENCES OF REGEX '<br>' IN lv 
    WITH cl_abap_char_utilities=>newline.
CONDENSE lv.
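
The split alternative mentioned above could look roughly like the following. This is only a sketch: it assumes lv already holds the fragment between <div> and </div>, that the ## characters in the sample are CR/LF pairs, and the names lt_lines and <lv_line> are made up for illustration.

```abap
DATA: lv       TYPE string,
      lt_lines TYPE STANDARD TABLE OF string WITH DEFAULT KEY.
FIELD-SYMBOLS <lv_line> TYPE string.

" One table row per <br>-separated line (SPLIT is case-sensitive,
" so <BR> or <br/> would need separate handling)
SPLIT lv AT `<br>` INTO TABLE lt_lines.

LOOP AT lt_lines ASSIGNING <lv_line>.
  " Assumption: the ## in the sample are CR/LF pairs
  REPLACE ALL OCCURRENCES OF cl_abap_char_utilities=>cr_lf
          IN <lv_line> WITH ``.
  CONDENSE <lv_line>.
ENDLOOP.

" Drop rows that became empty (e.g. from consecutive <br> tags)
DELETE lt_lines WHERE table_line IS INITIAL.
```

Compared with replacing <br> by a newline character, this keeps each visible line as its own table row, which may be easier to process further.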

Former Member · ‎11-11-2013

Hi Manish,

Thanks for your input. But even after applying the REGEX statements you provided, the tags are not removed completely.

I would also like to add that the markup received in the ET_HTML table (shown in the original query above) is not always the same. The positioning of the tags differs depending on the device the e-mail was sent from, such as Blackberry or iOS.

Regards,

Neha

Former Member · ‎11-11-2013

>>does not remove the tags completely

Which tag is left over? When I tested with the example HTML snippet you gave, I didn't see any tags.

>>is not the same always

This should have been mentioned in the original post, with an example HTML snippet.

rdiger_plantiko2 · ‎11-11-2013

Neha,

you should first convert the SOLIX table into an XSTRING, by looping over the SOLIX table and concatenating the lines into an XSTRING with CONCATENATE ... IN BYTE MODE.
Then convert the XSTRING into a STRING in SAP's internal encoding. The simplest method for this is CL_ABAP_CODEPAGE=>CONVERT_FROM( ): LV_STRING = CL_ABAP_CODEPAGE=>CONVERT_FROM( LV_XSTRING ). (This call assumes a UTF-8 encoded XSTRING. In your case, you will add the string 'ISO-8859-1' as an additional parameter for the encoding.)
Finally, give the function module SOTR_TAGS_REMOVE_FROM_STRING another try on this string. It should strip off all the tags now.
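
A minimal sketch of the first two steps, assuming the binary content arrived in a table of type SOLIX_TAB (called lt_solix here; all variable names are illustrative) and that the mail is ISO-8859-1 encoded, as the meta tag in the sample suggests:

```abap
DATA: lt_solix   TYPE solix_tab,   " assumed: filled from contents_hex
      lv_xstring TYPE xstring,
      lv_string  TYPE string.
FIELD-SYMBOLS <ls_solix> TYPE solix.

" Step 1: glue the fixed-length binary rows together, without any separator
LOOP AT lt_solix ASSIGNING <ls_solix>.
  CONCATENATE lv_xstring <ls_solix>-line INTO lv_xstring IN BYTE MODE.
ENDLOOP.

" Step 2: decode the bytes into a character string
lv_string = cl_abap_codepage=>convert_from( source   = lv_xstring
                                            codepage = `ISO-8859-1` ).
```

Note that concatenating full rows also copies any padding bytes at the end of the last row, so the resulting string may carry trailing junk characters after </html> that are worth trimming before further processing.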

Regards,

Rüdiger

rdiger_plantiko2 · ‎11-12-2013

... I found out that the function module SOTR_TAGS_REMOVE_FROM_STRING has a slight disadvantage: It replaces the detected tags with a space character. This may be unwanted. You may use the following statement instead

replace all occurrences of regex `<[^>]*>` in cv_string with space.

Unlike the function module, it replaces the tags with 'nothing' (the ABAP literal SPACE is treated as the empty string here). If you want the space characters, use ` ` (one space character enclosed in backticks) instead of SPACE.

Remark: The regex is only a 99% solution, since '>' may be allowed in attribute values of a tag. So it doesn't work properly for a html string like:

<span title="Total amount > 100.00 GBP (critical value)" class="warning">123.00 GBP</span>

This would result in

 100.00 GBP (critical value)" class="warning">123.00 GBP

The function module SOTR_TAGS_REMOVE_FROM_STRING has the same problem.

Former Member · ‎11-12-2013

This is the reason why parsing html using regex is not recommended.

rdiger_plantiko2 · ‎11-12-2013

Of course: The string level is too low-level for source code instances of a grammar. You'll need a parser for that. For XML, a parser is available in ABAP. (Even for HTML, one could use a trick: Instrument the class CL_HTMLTIDY to transform the HTML as XHTML. Then parse the resulting XHTML document with ABAP's XML parser (from the if_ixml family). But I would never use this procedure in productive scenarios).

Former Member · ‎11-13-2013

Hi,

The REGEX statement does not allow { } brackets in the statement. There are also other keywords in the markup, like '.EmailQuote'. Is there any function module, method or code that can remove all tags?

If, as you suggest, we cannot use this in a production scenario, what other alternative can be used?

Regards,

Neha

rdiger_plantiko2 · ‎11-13-2013

Hi Neha,

>>The REGEX statement does not allow { } brackets in the statement.

Regexes allow whatever you write in them, but at the string level you will get lost in this task: now you want to detect the content of the <style> ... </style> tags and eliminate it. Next, someone will send you HTML with a script tag. In the end, you will have reinvented an HTML parser.

But after a quick search, I just found a tiny HTML parser which apparently does what you want (I didn't know it before). There is a class CL_CRM_HTML_PARSER (in package CRM_EMAIL_BASE, software component WEBCUIF). Don't expect it to be fully compliant with the W3C HTML specification. But for most simple, real-world HTML documents out there, it should work.

The following code gives what you want: The text content inside of the 'body' tag.

Regards,

Rüdiger

data: lt_parts type cmail_html_parser_tab,
      lv_error type flag,
      lv_start type i value 1.
field-symbols: <ls_part> type cmail_html_parser.

cl_crm_html_parser=>parse_html_string( exporting iv_html                = lv_html
                                                 iv_all_tags_lower_case = 'X'
                                       importing et_parts               = lt_parts
                                                 ev_error               = lv_error ).

" Let the search start after <body> (if present)
read table lt_parts transporting no fields with key tag = 'body'.
if sy-subrc eq 0.
  lv_start = sy-tabix + 1.
endif.

loop at lt_parts assigning <ls_part> from lv_start.
  if <ls_part>-tag is initial.
    " Strip leading CR/LF if present
    replace regex '^[\r\n]*' in <ls_part>-part with space.
    check <ls_part>-part cn space. " Skip empty text nodes
    " Here is a text part: do what you want with it, e.g. WRITE it
    write: / <ls_part>-part.
  endif.
  if <ls_part>-tag = 'body'.
    exit.  " Reached end of body: stop searching
  endif.
endloop.

Former Member · ‎12-02-2013

Hi Rudiger,

Thanks for your inputs, they really helped.

The method cl_crm_html_parser=>parse_html_string resolved the issue to a large extent, but not completely.

The remaining problem is that when the content is split across multiple rows of table it_soli_bcs, some content is lost from rows that do not begin with a tag. Example:

Data in table it_soli_bcs:

Row 1

PlainText"> </div>##<div class="PlainText"><font face="courier new">Remarks : this document is rejected due to large amount. Please change the amount of the##</font>PR 6000002560 with in budget. <br>## <br>## <br>##*- NOTE -*<br>##Plea

Row 2

se do not change content of this mail <br>## <br>##*********************************</div>##</font>##</body>##</html>############################################################################################################################### # # #

After passing these rows as a string in parameter iv_html of method cl_crm_html_parser=>parse_html_string, the data returned in table et_parts is:

Row 1

Remarks : this document is rejected due to large amount. Please change the amount of the##PR 6000002560 with in budget.##*- NOTE -*##Plea

Row 2

##*********************************##

The text 'se do not change content of this mail' from Row 2 of table it_soli_bcs is lost.

Regards,

Neha

rdiger_plantiko2 · ‎12-02-2013

>>The problem that remains is that when the content is split-ted in multiple lines in table it_soli_bcs, the rows which do not have beginning of tags for those rows some of the content is lost

Then you didn't properly concatenate the lines into the string that serves as input for the parser (GIGO). Don't use CR/LF as separator in the concatenate statement.
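
A sketch of such a concatenation, assuming the rows are in a table of type SOLI_TAB (lt_soli and lv_html are illustrative names) and that each row is a fixed-width chunk of one continuous text:

```abap
DATA: lt_soli TYPE soli_tab,   " assumed: the rows of it_soli_bcs
      lv_html TYPE string.
FIELD-SYMBOLS <ls_soli> TYPE soli.

LOOP AT lt_soli ASSIGNING <ls_soli>.
  " No separator between rows, so a word split across two rows
  " ('Plea' + 'se') is joined again. RESPECTING BLANKS keeps a blank
  " at the end of a row, which would otherwise be dropped and glue
  " two words together.
  CONCATENATE lv_html <ls_soli>-line INTO lv_html RESPECTING BLANKS.
ENDLOOP.
```

If the rows are not completely filled, RESPECTING BLANKS also adds padding blanks at the row ends; whether that matters depends on how the text is used afterwards.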

Regards,

Rüdiger

ravirayala_sap · ‎04-03-2023

Hi Neha,

In my case, even though 'Text Version' is maintained, I am still getting the data in <html> format.

Could you please help.