REGEX_TOO_COMPLEX error

DEMO_REGEX and DEMO_REGEX_TOY though, which have been updated to also support PCRE.

DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold|foldable)` ).
" --> returns 'unfold'
DATA(posix_result) = match( val = `unfoldable` regex = `un(fold|foldable)` ) ##regex_posix.
" --> returns 'unfoldable'
PCRE stopped after matching fold, the leftmost alternative, whereas POSIX tried all alternatives and found that matching foldable actually results in the longest match at this position, so it returned that. To retrieve the longest match in this example using PCRE, we have several options:

" 1. reorder the pattern so that the leftmost match is automatically the longest
DATA(fix1) = match( val = `unfoldable` pcre = `un(foldable|fold)` ).
" 2. anchor the pattern at the beginning and end of the subject string
DATA(fix2) = match( val = `unfoldable` pcre = `^un(fold|foldable)$` ).
" 3. anchor the pattern at the word boundaries
DATA(fix3) = match( val = `unfoldable` pcre = `\bun(fold|foldable)\b` ).
" 4. extract the common prefix
DATA(fix4) = match( val = `unfoldable` pcre = `unfold(able)?` ).
This affects not only the alternation operator |, but all cases where multiple matches start at the same location, for example when using the ? quantifier:

DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold)?(foldable)?` ).
" --> returns 'unfold'
DATA(posix_result) = match( val = `unfoldable` regex = `un(fold)?(foldable)?` ) ##regex_posix.
" --> returns 'unfoldable'
" the negative lookahead (?!able) keeps the first group from matching when 'able' follows
DATA(pcre_result) = match( val = `unfoldable` pcre = `un(fold(?!able))?(foldable)?` ).
" --> returns 'unfoldable'
Consider searching for the pattern Hello World:

DATA(posix_result) = find( val = `Hello World` regex = `Hello World` ) ##regex_posix.
" --> found
DATA(pcre_result) = find( val = `Hello World` pcre = `Hello World` ).
" --> not found, what is going on...?
The pattern Hello World is equivalent to HelloWorld for PCRE in extended mode:

DATA(posix_result) = find( val = `HelloWorld` regex = `Hello World` ) ##regex_posix.
" --> not found
DATA(pcre_result) = find( val = `HelloWorld` pcre = `Hello World` ).
" --> found
A literal space can be matched by escaping it with \ (backslash):

DATA(result1) = find( val = `Hello World` pcre = `Hello\ World` ).
" --> found
DATA(result2) = find( val = `Hello World` pcre = `Hello \ World` ).
" --> also found as unescaped whitespaces are ignored
Alternatively, whitespace can be matched using the \s syntax:

DATA(result1) = find( val = `Hello World` pcre = `Hello\sWorld` ).
" --> found
DATA(result2) = find( val = `Hello World` pcre = `Hello \s World` ).
" --> also found
DATA(result3) = find( val = |Hello\tWorld| pcre = `Hello \s World` ). " where '\t' denotes the tabulation character
" --> also found as the tabulation character is considered a whitespace
(?(DEFINE)
(?<true> true )
(?<false> false )
(?<zero> 0 )
(?<one> 1 )
(?<if> if \s++ (?&T) \s++ then \s++ (?&T) \s++ else \s++ (?&T) )
(?<succ> succ \s*+ \( \s*+ (?&T) \s*+ \) )
(?<pred> pred \s*+ \( \s*+ (?&T) \s*+ \) )
(?<iszero> iszero \s*+ \( \s*+ (?&T) \s*+ \) )
(?<T> (?&true) | (?&false) | (?&zero) | (?&one) | (?&if) | (?&succ) | (?&pred) | (?&iszero) )
)
\s*+ (?&T) \s*+
(?(DEFINE)(?<true>true)(?<false>false)(?<zero>0)(?<one>1)(?<if>if\s++(?&T)\s++then\s++(?&T)\s++else\s++(?&T))(?<succ>succ\s*+\(\s*+(?&T)\s*+\))(?<pred>pred\s*+\(\s*+(?&T)\s*+\))(?<iszero>iszero\s*+\(\s*+(?&T)\s*+\))(?<T>(?&true)|(?&false)|(?&zero)|(?&one)|(?&if)|(?&succ)|(?&pred)|(?&iszero)))\s*+(?&T)\s*+
Extended mode can be disabled by setting parameter EXTENDED to false when creating the regular expression via CL_ABAP_REGEX=>CREATE_PCRE( ), or by using the option syntax (?-x) in the pattern itself. The latter also works when used in the built-in string functions:

DATA(pcre_result) = find( val = `Hello World` pcre = `(?-x)Hello World` ).
" --> found
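
When the class-based API is used instead of the built-in functions, extended mode can be switched off through the EXTENDED parameter. The following is a minimal sketch, assuming that CREATE_PCRE accepts the parameters PATTERN and EXTENDED under those names and that a matcher is obtained via CREATE_MATCHER; check the class documentation of your release before relying on it.

```abap
" Sketch: disable extended mode so that the space in the pattern is significant.
" Assumption: parameter names PATTERN and EXTENDED as described in the text.
DATA(regex) = cl_abap_regex=>create_pcre( pattern  = `Hello World`
                                          extended = abap_false ).
DATA(matcher) = regex->create_matcher( text = `Hello World` ).
DATA(found) = matcher->find_next( ).
" found is expected to be abap_true, mirroring the (?-x) example
```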
In POSIX, the . meta-character matches anything. In PCRE this is not the case, as by default . will match everything except a newline sequence:

DATA(pcre_result) = replace( val = |Hello\nWorld| pcre = `.` with = `x` occ = 0 ).
" --> 'xxxxx\nxxxxx'
DATA(posix_result) = replace( val = |Hello\nWorld| regex = `.` with = `x` occ = 0 ) ##regex_posix.
" --> 'xxxxxxxxxxx'
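
The PCRE result can be brought in line with the POSIX one by letting the dot match newlines as well. A minimal sketch using PCRE's (?s) option at the start of the pattern (the variable name is chosen for illustration only):

```abap
" (?s) enables single line mode: the dot also matches the newline character
DATA(pcre_dotall) = replace( val = |Hello\nWorld| pcre = `(?s).` with = `x` occ = 0 ).
" --> 'xxxxxxxxxxx'
```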
The newline sequences recognized by the . meta-character can be controlled either via parameter NEWLINE_MODE of factory function CL_ABAP_REGEX=>CREATE_PCRE( ), or by prefixing your pattern with the corresponding control verb.

If you want the . meta-character to behave exactly as in the POSIX case, you can enable the so-called single line mode by either setting parameter DOT_ALL of factory function CL_ABAP_REGEX=>CREATE_PCRE( ) to ABAP_TRUE, or by setting the (?s) option inside your pattern.

Operation | Description | Default Behavior |
---|---|---|
methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | Unicode support is controlled by parameter UNICODE_HANDLING of the factory functions | UNICODE_HANDLING = STRICT is assumed unless specified otherwise |
built-in functions find, find_end, replace, ... and ABAP statements FIND and REPLACE | no additional parameter exists to control Unicode support; instead, the verb (*UTF) can be specified at the start of the pattern to enable UNICODE_HANDLING = STRICT | if the (*UTF) verb is not specified at the start, UNICODE_HANDLING = RELAXED is assumed; the \C syntax can however not be used |
Operation | Treat Input as UCS-2 or UTF-16? | Accept Invalid UTF-16? | Action |
---|---|---|---|
methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | Yes | set UNICODE_HANDLING to IGNORE |
methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | No | set UNICODE_HANDLING to STRICT (default) |
methods of class CL_ABAP_REGEX and CL_ABAP_MATCHER | UCS-2 (ABAP default) | - | set UNICODE_HANDLING to RELAXED |
built-in functions and ABAP statements | UTF-16 | Yes | this cannot be achieved with the built-in functions and ABAP statements; use CL_ABAP_REGEX and CL_ABAP_MATCHER instead |
built-in functions and ABAP statements | UTF-16 | No | add verb (*UTF) to the start of the pattern |
built-in functions and ABAP statements | UCS-2 (ABAP default) | - | (default) |
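
The last two rows of the table can be illustrated with a character outside the Basic Multilingual Plane, which occupies two 16-bit code units in ABAP. This is only a sketch: the sample string and variable names are made up for illustration, and the behavior described in the comments is what the UCS-2 versus UTF-16 distinction in the table suggests, not verified output.

```abap
" Assumption: the first character of the subject is a surrogate pair (U+1D11E).
DATA(subject) = `𝄞 clef`.

" Default for built-in functions: the input is treated as UCS-2,
" so the dot is expected to match a single 16-bit code unit.
DATA(ucs2_match) = match( val = subject pcre = `^.` ).

" With the (*UTF) verb the input is treated as UTF-16,
" so the dot is expected to match the complete surrogate pair.
DATA(utf16_match) = match( val = subject pcre = `(*UTF)^.` ).
```

Comparing strlen( ucs2_match ) with strlen( utf16_match ) should make the difference visible.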
Description | POSIX Syntax | PCRE Equivalent |
---|---|---|
matching uppercase and lowercase letters (and the negation thereof) | \u, \l, \U and \L | \p{Lu}, \p{Ll}, \P{Lu} and \P{Ll}; \p and its negation \P are in fact much more powerful and can match a lot more character properties, e.g. \p{Sc} matches any currency symbol and \p{Hangul} matches any Hangul character |
word anchoring at the beginning or the end | \< and \> | \b(?=\w) or [[:<:]] and \b(?<=\w) or [[:>:]] |
matching all "unicode" characters | [[:unicode:]] | use a character range depending on the context, e.g. [^\x{00}-\x{ff}] |
While POSIX and PCRE both use $0 for the contents of the whole match and $n for the contents of the n-th capture group, they pretty much differ in everything else replacement related. We will not explore what PCRE adds to the table as we already did that in the last part. Instead, we will focus on the POSIX replacement syntax that is not directly supported by PCRE.

POSIX also allows the whole match to be referenced as $&. This can be trivially replaced by the $0 syntax, which is equivalent.

POSIX can additionally refer to the text before and after the match via the $` and $' syntax respectively:

DATA(posix_result) = replace( val = `again and` regex = `and` with = '$0 $`' ) ##regex_posix.
" --> 'again and again'
" === breakdown ===
" subject = 'again and'
" match = 'and'
" $0 = 'and'
" $` = 'again '
" $0 $` = 'and again '
" replaced = 'again and again ' --> only the 'and' was replaced
In PCRE, the same result can be achieved by capturing everything before and using a capture group and just doing a simple capture group substitution:

DATA(pcre_result) = replace( val = `again and` pcre = `^(.+?)and` with = `$0 $1` ).
" --> 'again and again'
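
The $' syntax (the text after the match) can be emulated in a similar way. One possible approach, shown here as a sketch rather than the only solution, is to capture the remainder of the subject inside a lookahead so that it is not consumed by the match. The subject string and variable names are chosen for illustration.

```abap
" POSIX: $' refers to the text following the match
DATA(posix_after) = replace( val = `and again` regex = `and` with = `$' $0` ) ##regex_posix.

" PCRE: capture the rest of the subject in a lookahead and substitute it via $1
DATA(pcre_after) = replace( val = `and again` pcre = `and(?=(.*))` with = `$1 $0` ).
" === breakdown ===
" subject  = 'and again'
" match    = 'and'
" $1       = ' again'   (captured by the lookahead, not part of the match)
" $1 $0    = ' again and'
" replaced = ' again and again' --> only the 'and' was replaced
```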
Category | Symptoms | Prerequisites | Solution |
---|---|---|---|
extended mode | | | in PCRE's extended mode, which is enabled by default, whitespaces in the pattern are ignored; you can either escape the whitespace with \ or use \s, or disable extended mode via parameter EXTENDED or the (?-x) option; see Whitespaces in Patterns |
the . meta-character | | | the . meta-character in PCRE by default does not match newline sequences; you can either set parameter DOT_ALL or use the (?s) option; see What the Dot matches |
unicode handling | | | instances of CL_ABAP_REGEX by default assume UTF-16 input; see Choosing the right Unicode Mode |
replacement and substitution | | | $&, $' and $` are not supported by PCRE; you can use $0 instead of $& and capture groups instead of $` and $'; see Replacement and Substitution |