Solved: Remove HTML tags from a string

daniel_humberg · ‎2009 Dec 16

I have a string that contains a couple of HTML or XHTML tags, for example

 lv_my_string = '<p style="something">Hello <strong>World</strong>!</p>'.

For a special use case, I want to remove all HTML from that string and process only the plain text

 lv_my_new_string = 'Hello World!'.

Is there any method, function module, XSLT or anything else for that already?

Former Member · ‎2009 Dec 17

You can use some Regular Expressions -:)


DATA: message TYPE string.

message = '<p style="something">Hello <strong>World</strong>!</p>'.

REPLACE ALL OCCURRENCES OF REGEX '<[a-zA-Z\/][^>]*>' IN message with space.
WRITE:/ message.

Greetings,

Blag.
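
For anyone who also needs to keep stray "<" or ">" characters in the text, here is a minimal sketch of a slightly stricter variant of the regex above. It is only an illustration, not a full HTML parser: the report name, the variable name lv_text, the sample input and the entity replacements are my own assumptions, and a tag with a ">" inside an attribute value would still trip it up.

```abap
REPORT z_strip_html_sketch.   "hypothetical report name

DATA: lv_text TYPE string.

* Sample input (assumption): two tags plus a lone "<" in the plain text
lv_text = '<p style="something">Hello <strong>World</strong>!</p><br/>1 < 2'.

* Remove anything that looks like an opening or closing tag.
* The pattern requires a letter after "<" or "</", so a lone "<"
* followed by a blank or a digit is not matched and stays in the text.
REPLACE ALL OCCURRENCES OF REGEX '</?[a-zA-Z][^>]*>' IN lv_text WITH space.

* Optionally decode a few common entities afterwards.
* '&amp;' is replaced last so that '&amp;lt;' does not turn into '<'.
REPLACE ALL OCCURRENCES OF '&lt;'  IN lv_text WITH '<'.
REPLACE ALL OCCURRENCES OF '&gt;'  IN lv_text WITH '>'.
REPLACE ALL OCCURRENCES OF '&amp;' IN lv_text WITH '&'.

WRITE: / lv_text.
```

With the sample input, the tags are stripped while the "<" in "1 < 2" is left alone, because it is not followed by a letter.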

kesavadas_thekkillath · ‎2009 Dec 16

Check

HRDSYS_CONVERT_FROM_HTML & SOTR_TAGS_REMOVE_FROM_STRING

Former Member · ‎2009 Dec 16

Hi,

Check

Former Member · ‎2009 Dec 17

Try using the following FMs:

1. SOTR_TAGS_REMOVE_FROM_STRING

2. SWA_STRING_REMOVE_SUBSTRING

daniel_humberg · ‎2009 Dec 17

Thx for the help.

1. SOTR_TAGS_REMOVE_FROM_STRING => nice but not perfect. It also removes single characters like "<" and ">" from the text. I would have to encode them first, right?

2. HRDSYS_CONVERT_FROM_HTML => returns only an empty table in my test

3. SWA_STRING_REMOVE_SUBSTRING => what kind of delete pattern would I use?

former_member206377 · ‎2009 Dec 17

Hi Daniel,

Hope this code solves your problem.


DATA: ipstr  TYPE string,
      opstr1 TYPE string,
      opstr2 TYPE string,
      opstr  TYPE string,
      len    TYPE i,
      ch     TYPE c LENGTH 1,
      pos    TYPE i,          "Offset of the current character in the input string
      count  TYPE i.          "Number of '<' and '>' characters found
* Input string
ipstr = '<p style="something">Hello <strong>World</strong>!</p><br>I need the data.</br>How are you?'.

len = strlen( ipstr ).
DO len TIMES.
  " Read the input character by character
  ch = ipstr+pos(1).
  pos = pos + 1.
  " Count every '<' and '>' in the input string
  IF ch = '<' OR ch = '>'.
    count = count + 1.
  ENDIF.
ENDDO.

Edited by: Vasuki S Patki on Dec 17, 2009 4:56 PM

former_member206377 · ‎2009 Dec 17


SPLIT ipstr AT '>' INTO opstr opstr1.
DO count TIMES.
  SPLIT opstr1 AT '<' INTO opstr opstr1.

former_member206377 · ‎2009 Dec 17


  CONCATENATE opstr2 opstr INTO opstr2.
  SPLIT opstr1 AT '>' INTO opstr opstr1.
ENDDO.
WRITE: / opstr2.

Please combine all of the code posted above. I hope this helps you.

I tested it with this input and it works fine.

Former Member · ‎2009 Dec 18

Hi Daniel,

I tried using the FM (SWA_STRING_REMOVE_SUBSTRING), but I guess it expects a particular pattern, which is not so apparent in your case. I've written a small piece of code that you can use in an FM or a PERFORM, and that should do the trick. Please let me know if you have any questions.


PARAMETER: P_LINE(100).

TYPES: BEGIN OF TY_LINE,
         LINE(100),
       END OF TY_LINE.

DATA: T_LINE TYPE STANDARD TABLE OF TY_LINE,
      WA_LINE LIKE LINE OF T_LINE.

DATA: W_LINE(100),
      W_LEN(100),
      W_COUNT TYPE I,
      W_FLAG,
      W_FLAG1,
      W_I TYPE I.

W_COUNT = STRLEN( P_LINE ).

DO W_COUNT TIMES.
  IF P_LINE+W_I(1) = '<'.
    W_FLAG = 1.
    W_I = W_I + 1.
    IF NOT WA_LINE-LINE IS INITIAL.
      APPEND WA_LINE-LINE TO T_LINE.
      CLEAR WA_LINE.
    ENDIF.
    CONTINUE.

  ELSEIF P_LINE+W_I(1) = '>'.
    W_FLAG = 0.
    W_I = W_I + 1.
    CONTINUE.
  ENDIF.

  IF W_FLAG = 1.
    W_I = W_I + 1.
    CONTINUE.
  ELSE.
    CONCATENATE WA_LINE-LINE P_LINE+W_I(1) INTO WA_LINE-LINE.
    W_I = W_I + 1.
  ENDIF.

ENDDO.

LOOP AT T_LINE INTO WA_LINE.
  CONCATENATE W_LINE WA_LINE-LINE INTO W_LINE SEPARATED BY SPACE.

ENDLOOP.

SHIFT W_LINE LEFT DELETING LEADING SPACE.
WRITE: W_LINE.

Input:

<p style="something">Hello <strong>World</strong>!</p>

Output:

HELLO WORLD !

Regards,

Pritam

Former Member · ‎2009 Dec 19

Hi Daniel,

I realized I made a typo while copying the code. The DO ... ENDDO loop should look like this.


DO W_COUNT TIMES.
  IF P_LINE+W_I(1) = '<'.
    W_FLAG = 1.
    W_I = W_I + 1.
    CONTINUE.
  ENDIF.
 
  IF P_LINE+W_I(1) = '>'.
    W_FLAG = 0.
    W_I = W_I + 1.
    CONTINUE.
  ENDIF.

  IF W_FLAG = 1.
    W_I = W_I + 1.
    CONTINUE.
  ELSE.
    CONCATENATE WA_LINE-LINE P_LINE+W_I(1) INTO WA_LINE-LINE.
    W_I = W_I + 1.
  ENDIF.
 ENDDO.

Please try this and let me know if you have any questions.

Former Member · ‎2009 Dec 19

Hi Daniel,

There was a formatting issue when posting this code because of the "<" symbol.

Basically, inside the DO loop the IF condition has to be written twice: the first time to check for "<", and if satisfied w_flag is set to "1"; the second time to check for ">", and if satisfied w_flag is set to "0".
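
The flag logic described above can be sketched as follows. This is only a rough illustration under my own assumptions: the report name and all variable names are hypothetical, and it works on TYPE string instead of a PARAMETER field. The reason for that choice is that a PARAMETER without LOWER CASE converts the input to upper case, and CONCATENATE drops trailing blanks of fixed-length character fields, which is why the earlier output read HELLO WORLD !.

```abap
REPORT z_strip_tags_flag_sketch.   "hypothetical report name

DATA: lv_in     TYPE string,
      lv_out    TYPE string,
      lv_char   TYPE string,       "string type keeps a single blank intact
      lv_off    TYPE i,            "offset of the current character
      lv_len    TYPE i,
      lv_in_tag TYPE c LENGTH 1.   "'X' while we are between "<" and ">"

lv_in  = '<p style="something">Hello <strong>World</strong>!</p>'.
lv_len = strlen( lv_in ).

WHILE lv_off < lv_len.
  lv_char = lv_in+lv_off(1).
  lv_off  = lv_off + 1.

  IF lv_char = '<'.
    " A tag starts: skip everything until the closing ">"
    lv_in_tag = 'X'.
  ELSEIF lv_char = '>'.
    " The tag ends: plain text may follow
    lv_in_tag = space.
  ELSEIF lv_in_tag = space.
    " Outside of a tag: keep the character, including blanks
    CONCATENATE lv_out lv_char INTO lv_out.
  ENDIF.
ENDWHILE.

WRITE: / lv_out.
```

Like the original loop, this treats every "<" as the start of a tag, so a lone "<" in the plain text would swallow everything up to the next ">".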
