11-08-2013 8:32 AM
Hi All,
I am using FM 'SO_DOCUMENT_READ_API1' to read e-mails received in SAP. It returns the HTML data in table 'contents_hex', which I then convert with method
cl_bcs_convert=>htmlbin_to_htmltxt.
My problem is that cl_bcs_convert=>htmlbin_to_htmltxt returns the data below in table ET_HTML:
<html dir="ltr"><head>##<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">##<meta name="GENERATOR" content="MSHTML 8.00.7601.18228">##<!-- converted from text --><style>.EmailQuote {###BORDER-LEFT: #800000 2px solid; PADDING-LEFT: 4p
t; MARGIN-LEFT: 1pt##}##</style><style>DIV.PlainText {###FONT-FAMILY: monospace; FONT-SIZE: 120%##}##</style><style title="owaParaStyle"><!--P {###MARGIN-TOP: 0px; MARGIN-BOTTOM: 0px##}##--></style>##</head>##<body ocsi="x">##<font size="2">##<div class="
PlainText"><br>## <br>## <br>##*- NOTE -*<br>##Pls do not change mail subject<br>##You can add your remarks above<br>##Click SEND button to complete your action</div>##</font>##</body>##</html>####################################################
After this, neither
REPLACE ALL OCCURRENCES OF REGEX '<[a-zA-Z\/][^>]*>'
nor
FM 'SOTR_TAGS_REMOVE_FROM_STRING'
removes all the tags.
I want to read only the data below from the above format (after removing the HTML tags):
*- NOTE -*
Pls do not change mail subject
You can add your remarks above
Click SEND button to complete your action
Regards,
Neha
11-08-2013 3:49 PM
The data you are looking for is inside the div tag.
So you can first strip everything outside the div tag, remove special characters, and then either replace the <br> tags with a newline or CR/LF, or split the string at the <br> tags into an internal table.
Something like this:
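" lv is assumed to be a STRING holding the HTML
" Drop everything up to and including the opening div tag
REPLACE REGEX '.*<div[^>]*>' IN lv WITH ''.
" Drop everything from the closing div tag onwards
REPLACE REGEX '</div>.*' IN lv WITH ''.
" Remove every character that is not a letter, digit, blank or angle bracket
" (this also removes punctuation such as * and -)
REPLACE ALL OCCURRENCES OF REGEX '[^a-zA-Z0-9<> ]+' IN lv WITH ''.
" Turn the remaining <br> tags into line breaks
REPLACE ALL OCCURRENCES OF REGEX '<br>' IN lv
  WITH cl_abap_char_utilities=>newline.
CONDENSE lv.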
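The other option mentioned in this reply, splitting the string at the <br> tags into an internal table, could be sketched as follows. This is only an illustration: lt_lines and lv_line are hypothetical names, and the sample value assumes that everything outside the div has already been stripped.

```abap
DATA: lv       TYPE string,
      lt_lines TYPE STANDARD TABLE OF string WITH DEFAULT KEY,
      lv_line  TYPE string.

" Sample value (assumption): the div content from the question, markup around it already removed
lv = `*- NOTE -*<br>Pls do not change mail subject<br>You can add your remarks above`.

" One table row per text line; the <br> separators themselves are dropped
SPLIT lv AT `<br>` INTO TABLE lt_lines.

LOOP AT lt_lines INTO lv_line.
  WRITE / lv_line.
ENDLOOP.
```

Note that SPLIT compares the separator case-sensitively, so variants such as <BR> or <br/> would need to be normalized first.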
11-11-2013 9:16 AM
Hi Manish,
Thanks for your input, but the REGEX statements you provided do not remove the tags completely either.
I would also like to add that the HTML received in the ET_HTML table, as shown in the original question, is not always the same. The positioning of the tags differs with the device the e-mail was sent from, such as BlackBerry or iOS.
Regards,
Neha
11-11-2013 9:25 AM
>>does not remove the tags completely
Which tag is left over? When I tested with the example HTML snippet you gave, no tags remained.
>>is not the same always
This should have been mentioned in original post, with example html snippet.
11-11-2013 9:34 AM
Neha,
Regards,
Rüdiger
11-12-2013 7:23 AM
... I found out that the function module SOTR_TAGS_REMOVE_FROM_STRING has a slight disadvantage: it replaces the detected tags with a space character. This may be unwanted. You may use the following statement instead:
replace all occurrences of regex `<[^>]*>` in cv_string with space.
Unlike the function module, it replaces the tags with 'nothing' (the ABAP literal SPACE is treated as the empty string internally). If you want the space characters, use ` ` (one space character enclosed in backticks) instead of SPACE.
Remark: The regex is only a 99% solution, since '>' may be allowed in attribute values of a tag. So it doesn't work properly for an HTML string like:
<span title="Total amount > 100.00 GBP (critical value)" class="warning">123.00 GBP</span>
This would result in
100.00 GBP (critical value)" class="warning">123.00 GBP
The function module SOTR_TAGS_REMOVE_FROM_STRING has the same problem.
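One way to narrow this particular gap, as a sketch and not a general solution, is to let the pattern consume complete quoted attribute values before it looks for the closing '>'. The pattern below is an assumption of this note, not something from the function module or from the posts above. It is intended to reduce the example string to 123.00 GBP, but it still knows nothing about comments or the content of <style> and <script> elements.

```abap
DATA lv_html TYPE string.

lv_html = `<span title="Total amount > 100.00 GBP (critical value)" class="warning">123.00 GBP</span>`.

" Inside a tag, match whole double-quoted or single-quoted values first,
" so that a '>' between quotes does not end the tag match
REPLACE ALL OCCURRENCES OF REGEX `<("[^"]*"|'[^']*'|[^'">])*>` IN lv_html WITH ``.

WRITE / lv_html.
```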
11-12-2013 9:28 AM
This is the reason why parsing html using regex is not recommended.
11-12-2013 1:08 PM
Of course: the string level is too low-level for processing instances of a grammar, such as source code. You need a parser for that. For XML, a parser is available in ABAP. Even for HTML, one could use a trick: use the class CL_HTMLTIDY to transform the HTML into XHTML, then parse the resulting XHTML document with ABAP's XML parser (from the if_ixml family). But I would never use this procedure in productive scenarios.
11-13-2013 4:58 AM
Hi,
The REGEX statement does not allow { } brackets in the statement. There are also other keywords in the markup, like '.EmailQuote'. Is there any function module, method or code that can remove all tags?
If, as you suggest, we cannot use this in a production scenario, what alternative can be used?
Regards,
Neha
11-13-2013 7:28 AM
Hi Neha,
>>The REGEX statement does not allow { } brackets in the statement.
Regexes allow whatever you write in them, but on the string level you will get lost with this task: now you want to detect the content of <style> ... </style> tags and eliminate it. The next sender will give you HTML with a script tag. In the end, you will have reinvented an HTML parser.
After a quick search, though, I found a tiny HTML parser that apparently does what you want (I didn't know it before): the class CL_CRM_HTML_PARSER (in package CRM_EMAIL_BASE, software component WEBCUIF). Don't expect it to be fully compliant with the W3C HTML specification, but for most of the simple, real-world HTML documents out there, it should work.
The following code gives what you want: The text content inside of the 'body' tag.
Regards,
Rüdiger
data: lt_parts type cmail_html_parser_tab,
      lv_error type flag.

cl_crm_html_parser=>parse_html_string(
  exporting iv_html                = lv_html
            iv_all_tags_lower_case = 'X'
  importing et_parts               = lt_parts
            ev_error               = lv_error ).

data: lv_start type i value 1.
field-symbols: <ls_part> type cmail_html_parser.

" Let the search start after <body> (if present)
read table lt_parts transporting no fields with key tag = 'body'.
if sy-subrc eq 0.
  lv_start = sy-tabix + 1.
endif.

loop at lt_parts assigning <ls_part> from lv_start.
  if <ls_part>-tag is initial.
    " Strip leading CR/LF if present
    replace regex '^[\r\n]*' in <ls_part>-part with space.
    check <ls_part>-part cn space. " Skip empty text nodes
    " Here is a text part: Do what you want with it - e.g. WRITE it:
    write: / <ls_part>-part.
  endif.
  if <ls_part>-tag = 'body'.
    exit. " Reached end of body: Stop searching
  endif.
endloop.
12-02-2013 5:45 AM
Hi Rudiger,
Thanks for your inputs, they really helped.
The method cl_crm_html_parser=>parse_html_string resolved the issue to a large extent, but not completely.
The remaining problem is that when the content is split across multiple rows of table it_soli_bcs, some of the content is lost for rows that do not begin with a tag. Example:
data in table it_soli_bcs
Row 1
PlainText"> </div>##<div class="PlainText"><font face="courier new">Remarks : this document is rejected due to large amount. Please change the amount of the##</font>PR 6000002560 with in budget. <br>## <br>## <br>##*- NOTE -*<br>##Plea
Row 2
se do not change content of this mail <br>## <br>##*********************************</div>##</font>##</body>##</html>############################################################################################################################### # # #
After passing these rows as a string in parameter iv_html of method cl_crm_html_parser=>parse_html_string, the data returned in table et_parts is
Row 1
Remarks : this document is rejected due to large amount. Please change the amount of the##PR 6000002560 with in budget.##*- NOTE -*##Plea
Row 2
##*********************************##
The text 'se do not change content of this mail' from Row 2 of table it_soli_bcs is lost.
Regards,
Neha
12-02-2013 7:49 AM
>>The problem that remains is that when the content is split across multiple lines in table it_soli_bcs, some of the content is lost for the rows which do not begin with a tag
Then you didn't properly concatenate the lines into the string that serves as input for the parser (GIGO). Don't use CR/LF as separator in the concatenate statement.
Regards,
Rüdiger
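A minimal sketch of such a concatenation, assuming the mail body is available as a table of type SOLI_TAB (lt_soli and lv_html are hypothetical names; in the post above the table is called it_soli_bcs):

```abap
DATA: lt_soli TYPE soli_tab,   " assumed to hold the 255-character rows of the mail body
      lv_html TYPE string.
FIELD-SYMBOLS <ls_soli> TYPE soli.

LOOP AT lt_soli ASSIGNING <ls_soli>.
  " No separator between rows. RESPECTING BLANKS keeps the trailing blanks
  " of a row, so a blank that falls on a row boundary is not lost.
  CONCATENATE lv_html <ls_soli>-line INTO lv_html RESPECTING BLANKS.
ENDLOOP.
```

The resulting lv_html can then be passed to the parser as one continuous string, so a word that is cut at a row boundary (such as 'Plea' / 'se' in the example) is joined again.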
12-02-2013 5:50 AM
Hi All,
The issue is resolved.
We changed the configuration in transaction SCOT for inbound mails.
SCOT -> Settings -> Inbound Messages -> Settings ->
Choose 'Text Version' for the field 'Prefer following version of an Inbound E-mail'.
After this, all inbound mails in SAP are saved in text format, regardless of whether they are received as RAW or HTML, and regardless of the sending device (Outlook, BlackBerry or iOS).
HTML conversion is therefore no longer required.
Regards,
Neha
04-03-2023 5:31 PM
Hi Neha,
In my case, even though the setting is maintained as 'Text Version', I am still getting the data in <html> format.
Could you please help?