
find, replace, count, matches, etc.
FIND and REPLACE with the addition REGEX
CL_ABAP_REGEX and CL_ABAP_MATCHER
CX_SY_REGEX_TOO_COMPLEX error yourself.
$1 to be substituted for the contents of the first capture group
CL_ABAP_REGEX and CL_ABAP_MATCHER, via the new factory function CL_ABAP_REGEX=>CREATE_PCRE( )
FIND and REPLACE, via the new addition PCRE
find, find_end, matches, match, count, replace, substring and contains, via the new parameter pcre
+ meaning one or more times and * meaning zero or more times. As briefly mentioned above, POSIX only supports so-called greedy quantifiers. A greedy quantifier always matches as much as possible, as shown in the following example:
DATA(posix_result) = match( val = `<outer><inner></inner></outer>` regex = `<.+>` ) ##regex_posix.
" --> '<outer><inner></inner></outer>'
<.+> matches the complete string, as .+ will match as many characters as it possibly can.
But what if we only want to match up to the first >? In that case we can make use of lazy (also known as reluctant) quantifiers, which are probably the most sought-after feature in ABAP regular expressions:
DATA(pcre_result) = match( val = `<outer><inner></inner></outer>` pcre = `<.+?>` ).
" --> '<outer>'
(note the pcre parameter of the match function above)
A quantifier is made lazy by appending a ? character to it. A lazy quantifier has the same basic matching characteristics as its greedy counterpart (i.e. +? will match one or more characters), but will match as few characters as possible.
(?<=...) denotes a positive lookbehind, meaning the lookbehind fails if the current position of the match is not preceded by the given pattern
(?<!...) denotes a negative lookbehind, meaning the lookbehind fails if the current position of the match is preceded by the given pattern
DATA(result1) = find( val = `meter` pcre = `(?<!centi)meter` ).
" --> found; matches whole string
DATA(result2) = find( val = `millimeter` pcre = `(?<!centi)meter` ).
" --> found; match starts at offset 5
DATA(result3) = find( val = `centimeter` pcre = `(?<!centi)meter` ).
" --> not found
^ and $). Let's take a look at a modified version of the example above:
DATA(result) = match( val = `centimeter` pcre = `(?<=centi)meter` ).
" --> 'meter'
meter is preceded by centi, which succeeds. The reported match however only contains meter, as the lookbehind assertion is zero-length.
\w+ can match an arbitrary number of characters:
DATA(result) = match( val = `kilometer` pcre = `(?<!\w+)meter` ).
" --> ERROR
U+HHHH.., that is U+ followed by the code point value in hexadecimal. The code point U+0041 for example corresponds to the Latin capital letter A.
U+0000 to U+FFFF. Pretty much all characters of all modern languages fall into this plane, as well as some symbols and other stuff.
U+1F47D (also known as the Extraterrestrial Alien 👽): when encoded as UTF-16, it results in the code units 0xD83D (the high surrogate) and 0xDC7D (the low surrogate).
0xD83D alone is not valid UTF-16. Also, no two encoded valid BMP characters form a surrogate pair. This means that there is no combination of two non-surrogate code units that can be interpreted as a surrogate pair.
U+1F47D. From ABAP's perspective, this is just a string consisting of two characters, 0xD83D and 0xDC7D:
" go to great lengths to get the UTF-16 representation
" of U+1F47D (EXTRATERRESTRIAL ALIEN) into an ABAP string
DATA(surrogate_pair) = cl_abap_codepage=>convert_from(
  codepage = 'UTF-8'
  source   = CONV xstring( 'F09F91BD' ) ).
" surrogate_pair now contains the alien
DATA(length) = strlen( surrogate_pair ).
" --> 2
DATA(broken_surrogate) = surrogate_pair(1).
" oops... broken_surrogate only contains the first code unit (the high surrogate); this is not valid UTF-16
. character matches single 16-bit code units:
DATA(posix_result) = replace( val = surrogate_pair regex = `.` with = `x` occ = 0 ) ##regex_posix.
" --> 'xx'
UNICODE_HANDLING modes, which allow you to choose between the UTF-16 and the UCS-2 encoding, depending on your needs:
UNICODE_HANDLING | Description | Where Usable |
---|---|---|
STRICT | treat strings as UTF-16, throw an exception upon encountering invalid UTF-16 (i.e. broken surrogate pairs) | everywhere PCRE can be used; can be set via parameter UNICODE_HANDLING in CL_ABAP_REGEX=>CREATE_PCRE( ); can also be enabled with the special control verb (*UTF) at the start of a pattern, which also works outside of CL_ABAP_REGEX |
RELAXED | treat strings as UTF-16, ignore invalid UTF-16; parts of a string that are not valid UTF-16 cannot be matched in any way | only in conjunction with CL_ABAP_REGEX via parameter UNICODE_HANDLING |
IGNORE | treat strings as UCS-2; the \C syntax is enabled in patterns, the matching of surrogate pairs by their Unicode code point is however no longer possible | everywhere PCRE can be used; can be set via parameter UNICODE_HANDLING in CL_ABAP_REGEX=>CREATE_PCRE( ); this is the default outside of CL_ABAP_REGEX if no special control verb is used |
DATA(ucs2_result) = replace( val = surrogate_pair pcre = `.` with = `x` occ = 0 ).
" --> 'xx'
DATA(utf16_result) = replace( val = surrogate_pair pcre = `(*UTF).` with = `x` occ = 0 ).
" --> 'x'
strlen example from above? We can now calculate the correct length of a UTF-16 string:
DATA(length) = count( val = surrogate_pair pcre = `(*UTF).` ).
" --> 1
hy-\nphen, where \n denotes a line feed, becomes hyphen\n. We may be tempted to write a regular expression like this:
DATA(result1) = replace( val = |Pearl-\nand is not a town, Pear-\nland is| regex = `(\w+)-\n(\w+)\s*` with = `$1$2\n` occ = 0 ) ##regex_posix.
" --> 'Pearland\nis not a town, Pearland\nis' --> yeah, it works!
DATA(result2) = replace( val = |Pearl-\r\nand is not a town, Pear-\r\nland is| regex = `(\w+)-\n(\w+)\s*` with = `$1$2\n` occ = 0 ) ##regex_posix.
" --> 'Pearl-\r\nand is not a town, Pear-\r\nland is' --> oh no...
\r\n (carriage return line feed) typically used in a Windows context, we would have to get a bit more clever with the regular expression.
\R. With this, our regular expression works for all kinds of newline sequences:
DATA(result1) = replace( val = |Pearl-\nand is not a town, Pear-\nland is| pcre = `(\w+)-(\R)(\w+)\s*` with = `$1$3$2` occ = 0 ).
" --> 'Pearland\nis not a town, Pearland\nis' --> yeah, it works!
DATA(result2) = replace( val = |Pearl-\r\nand is not a town, Pear-\r\nland is| pcre = `(\w+)-(\R)(\w+)\s*` with = `$1$3$2` occ = 0 ).
" --> 'Pearland\r\nis not a town, Pearland\r\nis' --> still holding!
" --> also for vertical tab, form feed, ...
\R matches by placing one of the following control verbs at the start of the pattern:
Verb | Description |
---|---|
(*BSR_UNICODE) | \R matches all newline sequences as defined by the Unicode standard; this is the default |
(*BSR_ANYCRLF) | \R matches only \r, \n or \r\n |
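The difference between the two verbs can be sketched with a subject that contains a form feed, which counts as a Unicode newline sequence but is not one of \r, \n or \r\n. This is only an illustrative sketch: the variable names are made up, and the form feed is taken from the constant cl_abap_char_utilities=>form_feed.

```abap
" hypothetical subject: two pieces of text separated by a form feed
DATA(subject) = |page one{ cl_abap_char_utilities=>form_feed }page two|.

" default behavior, same as (*BSR_UNICODE): the form feed is matched by \R
DATA(unicode_hits) = count( val = subject pcre = `\R` ).
" --> 1

" restricted behavior: \R only matches \r, \n or \r\n
DATA(anycrlf_hits) = count( val = subject pcre = `(*BSR_ANYCRLF)\R` ).
" --> 0
```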
^ and $, which by default match the start and the end of the subject string respectively. Enabling multiline mode changes their semantics to now match the start and the end of every line in the subject string. Multiline mode can be enabled via parameter ENABLE_MULTILINE of factory function CL_ABAP_REGEX=>CREATE_PCRE( ), or by using the option setting syntax (?m) inside your pattern:
DATA(result1) = replace( val = |Hello\nWorld| pcre = `^` with = `- ` occ = 0 ).
" --> '- Hello\nWorld'
DATA(result2) = replace( val = |Hello\nWorld| pcre = `(?m)^` with = `- ` occ = 0 ).
" --> '- Hello\n- World'
\R syntax, you can also specify what exactly should be recognized as a newline sequence in the context of ^ and $, either using parameter MULTILINE_MODE of factory function CL_ABAP_REGEX=>CREATE_PCRE( ), or by prefixing your pattern with one of the following control verbs:
Verb | What is recognized as a Newline Sequence |
---|---|
(*CR) | carriage return only |
(*LF) | linefeed only |
(*CRLF) | carriage return followed by linefeed |
(*ANYCRLF) | all three of the above |
(*ANY) | any Unicode newline sequence |
(*NUL) | the NUL character (binary zero) |
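As a rough illustration of how these verbs change what ^ sees in multiline mode, the following sketch prefixes every line of a subject that mixes \r\n and \n. The subject and the variable names are made up for this example.

```abap
" with (*CRLF), only \r\n counts as a newline sequence;
" the lone \n does not start a new line
DATA(crlf_result) = replace( val = |one\r\ntwo\nthree| pcre = `(*CRLF)(?m)^` with = `- ` occ = 0 ).
" --> '- one\r\n- two\nthree'

" with (*ANYCRLF), \r, \n and \r\n are all recognized
DATA(anycrlf_result) = replace( val = |one\r\ntwo\nthree| pcre = `(*ANYCRLF)(?m)^` with = `- ` occ = 0 ).
" --> '- one\r\n- two\n- three'
```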
\R syntax, meaning the following pattern is totally fine:
DATA(pattern) = `(*BSR_UNICODE)(*CR)...`.
" --> '\R' will match any Unicode newline sequence
" --> '^' and '$' will also match before and after a carriage return
. meta-character in PCRE by default does not match a newline sequence and shares its definition of newline sequences with ^ and $:
DATA(result1) = match( val = |Hello\nWorld!| pcre = `(*CR).+` ).
" '\n' is not considered a newline sequence
" --> 'Hello\nWorld!'
DATA(result2) = match( val = |Hello\rWorld!| pcre = `(*CR).+` ).
" '\r' is considered a newline sequence, so '.' does not match it
" --> 'Hello'
. to match every character, including newlines, you can enable single line mode either using parameter DOT_ALL of factory function CL_ABAP_REGEX=>CREATE_PCRE( ), or by setting the (?s) option inside your pattern:
DATA(result2) = match( val = |Hello\rWorld!| pcre = `(*CR)(?s).+` ).
" '\r' is considered a newline sequence, but now we are in single line mode
" --> 'Hello\rWorld!'
Syntax | Matches |
---|---|
\A | start of subject (if matching on a subject is done with a starting offset, \A can never match) |
\Z | end of subject and before a newline at the end of the subject |
\z | end of subject |
\G | first matching position in subject (true if the current matching position is at the start point of the matching process, which may differ from the start of the subject e.g. if a starting offset is specified) |
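The difference between \Z and \z from the table above can be sketched as follows. This is a minimal example with made-up variable names; the (*LF) verb is only added so that the newline convention is explicit rather than assumed.

```abap
" the subject ends with a line feed
DATA(upper_z_found) = xsdbool( contains( val = |Hello\n| pcre = `(*LF)o\Z` ) ).
" --> abap_true: \Z also matches before a newline at the end of the subject

DATA(lower_z_found) = xsdbool( contains( val = |Hello\n| pcre = `(*LF)o\z` ) ).
" --> abap_false: \z only matches at the very end of the subject
```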
CL_ABAP_REGEX for multiple CL_ABAP_MATCHER instances without the pattern being recompiled every time:
DATA(regex) = CL_ABAP_REGEX=>CREATE_PCRE( pattern = `` ).
" --> checks pattern & compiles to byte code
DATA(matcher1) = regex->create_matcher( text = `` ).
" --> creates matching context & other stuff
DATA(result1) = matcher1->match( ).
" --> interprets byte code to perform actual match
DATA(matcher2) = regex->create_matcher( text = `` ).
" --> creates matching context & other stuff; pattern is NOT compiled again
DATA(result2) = matcher2->match( ).
" --> interprets byte code to perform actual match
CL_ABAP_REGEX=>CREATE_PCRE( ), you can do so using parameter ENABLE_JIT. If set to ABAP_TRUE, the regular expression will always be JIT compiled. Setting ENABLE_JIT to ABAP_FALSE disables JIT compilation for this regular expression.
(*NO_JIT) control verb. If you want to have full control over all properties of your regular expression, use classes CL_ABAP_REGEX and CL_ABAP_MATCHER instead.
Input | Should Match |
---|---|
(123) | ✔ |
((123)) | ✔ |
(((123))) | ✔ |
((((123)) | ❌ (unbalanced amount of parentheses) |
(((123w))) | ❌ (contains non-digit character) |
)((123))( | ❌ (parentheses are mixed up) |
\((\d++|(?R))\)
Element | Description |
---|---|
\( and \) | matches an opening and closing parenthesis literally |
(...|...) | matches either the left side or the right side; the left side is tried first |
\d++ | matches one or more digits possessively, meaning the match is treated as atomic and will not be backtracked into |
(?R) | recurses over the whole pattern |
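The pattern can be tried against the inputs from the first table with the built-in function matches. This is only a sketch with made-up variable names, and the expected results simply restate the "Should Match" column.

```abap
DATA(pattern) = `\((\d++|(?R))\)`.

DATA(balanced)   = xsdbool( matches( val = `((123))`    pcre = pattern ) ).
" --> abap_true
DATA(unbalanced) = xsdbool( matches( val = `((((123))`  pcre = pattern ) ).
" --> abap_false (unbalanced amount of parentheses)
DATA(non_digit)  = xsdbool( matches( val = `(((123w)))` pcre = pattern ) ).
" --> abap_false (contains non-digit character)
DATA(mixed_up)   = xsdbool( matches( val = `)((123))(`  pcre = pattern ) ).
" --> abap_false (parentheses are mixed up)
```

matches is used here instead of find because it only succeeds if the pattern covers the whole input.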
++ is just an optimization to avoid unnecessary backtracking. The real magic is introduced by (?R), which causes the whole pattern to be applied inside itself, similar to what a recursive subroutine call in your favorite programming language would do. Each recursion of the pattern will match an opening and a closing parenthesis and in between either some digits or another recursion of the pattern. Matching the digits is tried first and acts as our exit condition, so we do not fall into an infinite recursion loop.
((123)), the matching process roughly looks as follows:
[Figure: Pattern Recursion]
DATA(result1) = match( val = `sense and sensibility` pcre = `(sens|respons)e\ and\ \1ibility` ).
" --> does match
DATA(result2) = match( val = `response and responsibility` pcre = `(sens|respons)e\ and\ \1ibility` ).
" --> does match
DATA(result3) = match( val = `sense and responsibility` pcre = `(sens|respons)e\ and\ \1ibility` ).
" --> does NOT match
DATA(result4) = match( val = `response and sensibility` pcre = `(sens|respons)e\ and\ \1ibility` ).
" --> does NOT match
\n syntax, where n is a number) only matches what its corresponding capture group has matched previously, but not all the things that its corresponding capture group could match.
(?n) syntax (where n again is a number referring to a capture group):
DATA(result1) = match( val = `sense and sensibility` pcre = `(sens|respons)e\ and\ (?1)ibility` ).
" --> does match
DATA(result2) = match( val = `response and responsibility` pcre = `(sens|respons)e\ and\ (?1)ibility` ).
" --> does match
DATA(result3) = match( val = `sense and responsibility` pcre = `(sens|respons)e\ and\ (?1)ibility` ).
" --> does match
DATA(result4) = match( val = `response and sensibility` pcre = `(sens|respons)e\ and\ (?1)ibility` ).
" --> does match
(?<name>...) syntax. These capture groups can later be referred to by that name, either in backreferences using the \k<name> syntax or in subroutine calls using the (?&name) syntax:
DATA(result1) = find( val = `foobarfoo` pcre = `(?<my_group>foo)bar\k<my_group>` ).
" --> found
DATA(result2) = find( val = `foobarfoo` pcre = `(?<my_group>foo)bar(?&my_group)` ).
" --> found
T ::= true | false | 0 | 1 | if T then T else T | succ(T) | pred(T) | iszero(T)
T --> true
T --> false
T --> 0
T --> 1
T --> if T then T else T
T --> succ(T)
T --> pred(T)
T --> iszero(T)
true
0
succ(pred(1))
if iszero(0) then false else pred(succ(1))
pred(succ(iszero(true)))
if succ(true) then iszero(false) else 1
true false
if true then false
13
hello
(?(DEFINE)
  (?<true> true )
  (?<false> false )
  (?<zero> 0 )
  (?<one> 1 )
  (?<if> if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) )
  (?<succ> succ \s*+ \( \s*+ (?&T) \s*+ \) )
  (?<pred> pred \s*+ \( \s*+ (?&T) \s*+ \) )
  (?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) )
  (?<T> (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )
)
\s*+ (?&T) \s*+
T --> true becomes (?<true> true)
T is added, representing all possible choices our non-terminal symbol T can take; all choices simply become subroutine calls of the corresponding named capture group introduced in step 1
T simply call its capture group as a subroutine
(?(DEFINE)...), which causes them to be skipped during matching; the idea of (DEFINE) is that you can define named capture groups which you can later call as a subroutine from elsewhere
(DEFINE), the whole machinery is kicked off by calling the T capture group as a subroutine
DEMO_REGEX tool:
DATA(pattern) = `(?(DEFINE)` &&
                `(?<true> true )` &&
                `(?<false> false )` &&
                `(?<zero> 0 )` &&
                `(?<one> 1 )` &&
                `(?<if> if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) )` &&
                `(?<succ> succ \s*+ \( \s*+ (?&T) \s*+ \) )` &&
                `(?<pred> pred \s*+ \( \s*+ (?&T) \s*+ \) )` &&
                `(?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) )` &&
                `(?<T> (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )` &&
                `)` &&
                `\s*+ (?&T) \s*+`.
DATA(result1) = xsdbool( matches( pcre = pattern val = `true` ) ).
" --> does match
DATA(result2) = xsdbool( matches( pcre = pattern val = `if iszero(pred(1)) then false else true` ) ).
" --> does match
DATA(result3) = xsdbool( matches( pcre = pattern val = `true false` ) ).
" --> does NOT match
succ(pred(1)) for our language looks like this:
[Figure: Syntax Tree]
[Figure: Abstract Syntax Tree (AST)]
(?Cn) or the (?C"text") syntax. The former passes the integer number n to the callout routine, the latter passes the string text. To register a callout routine, an instance of a class implementing interface IF_ABAP_MATCHER_CALLOUT has to be passed to CL_ABAP_MATCHER->SET_CALLOUT( ). This means callouts can only be used with classes CL_ABAP_REGEX and CL_ABAP_MATCHER, but not with the built-in string functions or statements that accept regular expressions.
" implementation of interface if_abap_matcher_callout
CLASS demo_callout DEFINITION.
  PUBLIC SECTION.
    INTERFACES if_abap_matcher_callout.
ENDCLASS.
CLASS demo_callout IMPLEMENTATION.
  " the actual callout routine;
  " callout data passed by the pattern can be accessed via parameters 'callout_num' and
  " 'callout_string'; additional matcher data can be accessed through the other parameters
  " of this method
  METHOD if_abap_matcher_callout~callout.
    cl_demo_output=>write( |Number = '{ callout_num }'; String = '{ callout_string }'| ).
  ENDMETHOD.
ENDCLASS.
START-OF-SELECTION.
  " pattern with a callout after every 'a':
  DATA(regex) = cl_abap_regex=>create_pcre( pattern = `a(?C42)a(?C99)a(?C"hello")` ).
  DATA(matcher) = regex->create_matcher( text = `aaa` ).
  " register callout routine:
  matcher->set_callout( NEW demo_callout( ) ).
  " perform actual match:
  DATA(result) = matcher->match( ).
  cl_demo_output=>display( ).
  " --> Number = '42'; String = ''
  " --> Number = '99'; String = ''
  " --> Number = '0'; String = 'hello'
CL_ABAP_MATCHER. You can use the data passed from the pattern to the callout routine to dispatch different actions. The callout routine also has access to information regarding the current matcher state and can even influence how matching will continue via its return value.
(?(DEFINE)
  (?<true> true (?C"true") )
  (?<false> false (?C"false") )
  (?<zero> 0 (?C"zero") )
  (?<one> 1 (?C"one") )
  (?<if> if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) (?C"if") )
  (?<succ> succ \s*+ \( \s*+ (?&T) \s*+ \) (?C"succ") )
  (?<pred> pred \s*+ \( \s*+ (?&T) \s*+ \) (?C"pred") )
  (?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) (?C"iszero") )
  (?<T> (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )
)
\s*+ (?&T) \s*+
T. We simply do not need a node for the latter in the final AST. To build the actual tree structure, we make use of the order in which our pattern is traversed and the callouts are executed. The callout routine will generate nodes of different kinds, depending on the string data passed from the pattern. Intermediate nodes will be stored on a stack:
CLASS ast_builder DEFINITION.
  PUBLIC SECTION.
    INTERFACES if_abap_matcher_callout.
    METHODS:
      constructor,
      get_root
        RETURNING VALUE(root) TYPE REF TO ast_node.
  PRIVATE SECTION.
    " a basic stack data structure holding references
    " to ast_node instances:
    DATA: m_stack TYPE REF TO node_stack.
ENDCLASS.
CLASS ast_builder IMPLEMENTATION.
  METHOD constructor.
    m_stack = NEW node_stack( ).
  ENDMETHOD.
  METHOD if_abap_matcher_callout~callout.
    DATA: kind       TYPE string,
          child      TYPE REF TO ast_node,
          cond_child TYPE REF TO ast_node,
          then_child TYPE REF TO ast_node,
          else_child TYPE REF TO ast_node,
          node       TYPE REF TO ast_node.
    " determine kind of node to create:
    kind = callout_string.
    " dispatch based on kind:
    CASE kind.
      WHEN `true` OR `false` OR `zero` OR `one`.
        " nodes without children
        m_stack->push( NEW ast_node( kind ) ).
      WHEN `succ` OR `pred` OR `iszero`.
        " nodes with a single child node
        child = m_stack->pop( ).
        node = NEW ast_node( kind ).
        node->append( child ).
        m_stack->push( node ).
      WHEN `if`.
        " node(s) with three child nodes;
        " child nodes have to be popped off the stack in reverse order
        else_child = m_stack->pop( ).
        then_child = m_stack->pop( ).
        cond_child = m_stack->pop( ).
        node = NEW ast_node( kind ).
        node->append( cond_child ).
        node->append( then_child ).
        node->append( else_child ).
        m_stack->push( node ).
      WHEN OTHERS.
        " should not happen
    ENDCASE.
  ENDMETHOD.
  METHOD get_root.
    " after a successful match, the stack should contain
    " only one item: the root node of the AST
    root = m_stack->pop( ).
  ENDMETHOD.
ENDCLASS.
if iszero(0) then false else true in the following animation:
[Animation: AST Builder]
$0 is substituted for the whole match, $1 for the contents of the first capture group and so on:
DATA(result) = replace( val = `<world>` pcre = `<(.+?)>` with = `$0 becomes $1` ).
" --> '<world> becomes world'
${n:+true:false}, allowing you to check if the n-th capture group did participate in a match, substituting true if it did and false if it did not:
DATA(result1) = replace( val = `male` pcre = `(fe)?male` with = `${1:+her:his} majesty` ).
" --> 'his majesty'
DATA(result2) = replace( val = `female` pcre = `(fe)?male` with = `${1:+her:his} majesty` ).
" --> 'her majesty'
${n:-default} for substituting the contents of the n-th capture group, or default if said capture group did not participate in the match:
DATA(result1) = replace( val = `somebody` pcre = `(some)?body` with = `${1:-no}body` ).
" --> 'somebody'
DATA(result2) = replace( val = `body` pcre = `(some)?body` with = `${1:-no}body` ).
" --> 'nobody'
Syntax | Description |
---|---|
\u | the first character after \u that is inserted into the replacement text is converted to uppercase |
\l | the first character after \l that is inserted into the replacement text is converted to lowercase |
\U | all characters after \U up to the next \L or \E that are inserted into the replacement text are converted to uppercase |
\L | all characters after \L up to the next \U or \E that are inserted into the replacement text are converted to lowercase |
\E | terminates the current upper- or lowercase transformation |
DATA(result) = replace( val = `thEsE aRe noT THe dROiDs YoU arE loOKInG FOr`
pcre = `(\w)(\w*)` with = `\u$1\L$2` occ = 0 ).
" --> 'These Are Not The Droids You Are Looking For'
DATA(result1) = replace( val = `body` pcre = `(some)?body` with = `${1:+\U:\L}HeLLo` ).
" --> 'hello'
DATA(result2) = replace( val = `somebody` pcre = `(some)?body` with = `${1:+\U:\L}HeLLo` ).
" --> 'HELLO'
REPORT zpcre_parsing_demo.
"------------------------
" Node class to construct an abstract syntax tree (AST)
"------------------------
CLASS ast_node DEFINITION.
  PUBLIC SECTION.
    METHODS:
      constructor
        IMPORTING
          kind TYPE string,
      append
        IMPORTING
          node TYPE REF TO ast_node,
      to_string
        IMPORTING indentation TYPE i DEFAULT 0
        RETURNING VALUE(str)  TYPE string.
  PRIVATE SECTION.
    TYPES: BEGIN OF child_entry,
             child TYPE REF TO ast_node,
           END OF child_entry,
           child_entries TYPE STANDARD TABLE OF child_entry WITH EMPTY KEY.
    DATA: m_children TYPE child_entries,
          m_kind     TYPE string.
ENDCLASS.
CLASS ast_node IMPLEMENTATION.
  METHOD constructor.
    m_kind = kind.
  ENDMETHOD.
  METHOD append.
    APPEND VALUE #( child = node ) TO m_children.
  ENDMETHOD.
  METHOD to_string.
    " a very simple recursive way to turn a tree structure into a string;
    " the result is not pretty, but oh well...
    DATA(indent_str) = repeat( val = `-` occ = indentation ).
    str = |{ indent_str }[{ m_kind }]|.
    IF m_kind = `if` OR m_kind = `succ` OR m_kind = `pred` OR m_kind = `iszero`.
      " recursively obtain string representations of child nodes
      DATA(child_indent) = indentation + 2.
      LOOP AT m_children ASSIGNING FIELD-SYMBOL(<child>).
        DATA(child_str) = <child>-child->to_string( child_indent ).
        str &&= |\n{ child_str }|.
      ENDLOOP.
    ENDIF.
  ENDMETHOD.
ENDCLASS.
"------------------------
" Helper class implementing a first in last out (FILO) data structure, aka a stack
"------------------------
CLASS node_stack DEFINITION.
  PUBLIC SECTION.
    METHODS:
      push
        IMPORTING
          entry TYPE REF TO ast_node,
      pop
        RETURNING VALUE(entry) TYPE REF TO ast_node,
      size
        RETURNING VALUE(size) TYPE i.
  PRIVATE SECTION.
    TYPES: BEGIN OF stack_entry,
             entry TYPE REF TO ast_node,
           END OF stack_entry,
           stack_entries TYPE STANDARD TABLE OF stack_entry WITH EMPTY KEY.
    DATA: m_entries TYPE stack_entries.
ENDCLASS.
CLASS node_stack IMPLEMENTATION.
  METHOD push.
    APPEND VALUE #( entry = entry ) TO m_entries.
  ENDMETHOD.
  METHOD pop.
    DATA(last_entry) = size( ).
    entry = m_entries[ last_entry ]-entry.
    DELETE m_entries INDEX last_entry.
  ENDMETHOD.
  METHOD size.
    size = lines( m_entries ).
  ENDMETHOD.
ENDCLASS.
"------------------------
" Tree builder class, reacts on the callouts specified in the regex
"------------------------
CLASS ast_builder DEFINITION.
  PUBLIC SECTION.
    INTERFACES if_abap_matcher_callout.
    METHODS:
      constructor,
      get_root
        RETURNING VALUE(root) TYPE REF TO ast_node.
  PRIVATE SECTION.
    " a basic stack data structure holding references
    " to ast_node instances:
    DATA: m_stack TYPE REF TO node_stack.
ENDCLASS.
CLASS ast_builder IMPLEMENTATION.
  METHOD constructor.
    m_stack = NEW node_stack( ).
  ENDMETHOD.
  METHOD if_abap_matcher_callout~callout.
    DATA: kind       TYPE string,
          child      TYPE REF TO ast_node,
          cond_child TYPE REF TO ast_node,
          then_child TYPE REF TO ast_node,
          else_child TYPE REF TO ast_node,
          node       TYPE REF TO ast_node.
    " determine kind of node to create:
    kind = callout_string.
    " dispatch based on kind:
    CASE kind.
      WHEN `true` OR `false` OR `zero` OR `one`.
        " nodes without children
        m_stack->push( NEW ast_node( kind ) ).
      WHEN `succ` OR `pred` OR `iszero`.
        " nodes with a single child node
        child = m_stack->pop( ).
        node = NEW ast_node( kind ).
        node->append( child ).
        m_stack->push( node ).
      WHEN `if`.
        " node(s) with three child nodes;
        " child nodes have to be popped off the stack in reverse order
        else_child = m_stack->pop( ).
        then_child = m_stack->pop( ).
        cond_child = m_stack->pop( ).
        node = NEW ast_node( kind ).
        node->append( cond_child ).
        node->append( then_child ).
        node->append( else_child ).
        m_stack->push( node ).
      WHEN OTHERS.
        " should not happen
    ENDCASE.
  ENDMETHOD.
  METHOD get_root.
    " after a successful match, the stack should contain
    " only one item: the root node of the AST
    root = m_stack->pop( ).
  ENDMETHOD.
ENDCLASS.
"------------------------
" Entry point
"------------------------
START-OF-SELECTION.
  DATA input TYPE string.
  " The heart of the parser;
  " consists of named subroutine definitions like '(?<name> ...)'
  " that call each other recursively and contain a callout like '(?C"string")'
  " to trigger construction of the AST nodes
  DATA(regex) = cl_abap_regex=>create_pcre(
    pattern =
         `(?(DEFINE)`
      && ` (?<true> true (?C"true") )`
      && ` (?<false> false (?C"false") )`
      && ` (?<zero> 0 (?C"zero") )`
      && ` (?<one> 1 (?C"one") )`
      && ` (?<if> if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) (?C"if") )`
      && ` (?<succ> succ \s*+ \( \s*+ (?&T) \s*+ \) (?C"succ") )`
      && ` (?<pred> pred \s*+ \( \s*+ (?&T) \s*+ \) (?C"pred") )`
      && ` (?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) (?C"iszero") )`
      && ` (?<T> (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )`
      && `)`
      && `\s*+ (?&T) \s*+`
  ).
  cl_demo_input=>request( CHANGING field = input ).
  DATA(matcher) = regex->create_matcher( text = input ).
  DATA(builder) = NEW ast_builder( ).
  matcher->set_callout( builder ).
  DATA(result) = matcher->match( ).
  IF result = abap_true.
    cl_demo_output=>display_text( builder->get_root( )->to_string( ) ).
  ELSE.
    cl_demo_output=>display_text( `The given input cannot be parsed` ).
  ENDIF.