cancel
Showing results for 
Search instead for 
Did you mean: 

How to strip several different characters from a string?

VolkerBarth
Contributor
20,395

Using SA 11.0.1.2299, I want to strip some characters from a string - independent of their position.

An example: I want to get rid of several punction characters like hyphen, slash and the like in formatted numbers to get the "raw numbers" - say for telephone numbers.

Of course, I can do that with several calls of the REPLACE function like

DECLARE strNr varchar(30);
SET strNr = '+49-2345/67 89';
SELECT REPLACE(REPLACE(REPLACE(REPLACE(strNr, ' ', ''), '-', ''), '/', ''), '+', '');

But this is not very readable (though I could put it in a function).

Question: Are there better ways - e.g. by means of REGEXP_SUBSTR()?

VolkerBarth
Contributor
0 Kudos

To clarify: I appreciate the hint to the "onlyNumbers" question but would still like an answer if regular expressions could be used to strip characters, too - I think with some kind of assertion.

Accepted Solutions (1)

Accepted Solutions (1)

MarkCulp
Participant

I think Jon is on the right track. Depending on your input, there could be some performance gains by filtering out all bad characters which are similar as it is found in the string by using the replace() function. Doing so limits the number of times through the loop to the number of distinct characters in your character set (as opposed to the length of your string).

The other optimization that I would do is to strip off the initial characters from the input string that are known to be valid so that they are not looked at again. Note that the database server has optimizations so that long strings are not actually copied when such operations are performed and hence this operation is not as expensive as you might think.

Here is a revised version which also generalizes the set of valid characters that are desired:

CREATE FUNCTION FilterChars( in @sInput long varchar,
                             in @valid  char(256) default '0-9' )
RETURNS long varchar
DETERMINISTIC
BEGIN
    declare @pattern   long varchar;
    declare @iNotValid integer;
    declare @sOutput   long varchar;

set @pattern   = '%[^' || @valid || ']%';
    set @sOutput   = '';
    set @iNotValid = patindex( @pattern, @sInput );

while @iNotValid > 0 loop
        // pick up everything to the left of the non numeric
        set @sOutput = @sOutput || left( @sInput, @iNotValid - 1 );

// strip off the invalid character and any similar chars in the string
        set @sInput = replace( substr( @sInput, @iNotValid+1 ),
                               substr( @sInput, @iNotValid, 1 ), '' );

// find the next invalid character
        set @iNotValid = patindex( @pattern, @sInput );
    end loop;

// return our string plus any remaining chars
    return @sOutput || @sInput;
END;

Example usage:

select FilterChars( 'abc123def' )  // returns '123'
select FilterChars( 'abc123def', 'a-z' )  // returns 'abcdef'

Answers (3)

Answers (3)

thomas_duemesnil
Participant

I personally like the use of regex for string matching. But ASA is missing a full blown REGEXP_REPLACE function.

At least im not able to come up with a regular expression that would do it with REGEXP_SUBSTRING in one run.

But in a loop it is possbile to strip the string. I have prepared a iSQL script for copy an paste. I think it should be easy to build a Function out of it.

BEGIN 
    declare cSearch     varchar(1024);
    declare cPattern    varchar(512);
    declare cResult     varchar(1024);
    declare cFound      varchar(1024);
    declare nFound      integer;
    declare nOccurrence integer;

set cSearch = '+49-2345/67 89';
    set cPattern = '(?<=(^|[^[:digit:]+]))[[:digit:]+]+';
    set nOccurrence = 1;
    set cFound = REGEXP_SUBSTR( cSearch, cPattern, 1, nOccurrence );

message 'Loop ', nOccurrence, ' ', cFound type info to client ;
    while cFound is not null and length(cFound) > 0 loop
        set cResult = string( cResult, cFound );
        set nOccurrence = nOccurrence + 1;
        set cFound = REGEXP_SUBSTR( cSearch, cPattern, 1, nOccurrence );
        message 'in Loop ', nOccurrence, ' ', cFound type info to client ;
    end loop;

message 'Final ', cResult, '<!' type info to client;

END

Output of this script is

Loop 1 +49
Loop 2 2345
Loop 3 67
Loop 4 89
Loop 5 
Final +4923456789<!
Execution time: 0.016 seconds

I think as always it depends on the string and the pattern which solution is faster.

One remark you have to avoid that the normal pattern '[[:digit:]+]+' to match to fast. So you have to add a assertion that only makes sure you are at the begin of the string '^' or after a not matched character '[^[:digit:]+]'.

VolkerBarth
Contributor
0 Kudos

Thanks for that nice proposal - it underlines my assumption that REGEXP_SUBSTR() is generally suited for this task but needs to be looped in order to replace/strip more than one occurrence of the chars to filter out. - For the quite simple task discussed here, Mark's suggestion seems easier to understand. But for different tasks (like formatting/checking e-mail addresses), the RegExp search seems a lot more flexible.

Breck_Carter
Participant

@TDuemesnil: please post a new "question" entitled "Product Suggestion: Please implement REGEXP_REPLACE". Include a business case for why you want it. FWIW here's a description of the Oracle version: http://download.oracle.com/docs/cd/B19306_01/server.102/b14200/functions130.htm#SQLRF06302

Former Member

Hi, see this question!

VolkerBarth
Contributor
0 Kudos

Thanks for the hint - it's the one Jon's response is refering to, too.

Former Member

Is it only stripping out non-numerics from a number, if so then the solution we worked on here would probably work. Here is my version from that thread.

 ALTER FUNCTION onlyNumber(in val long varchar)     
    RETURNS long varchar     
    DETERMINISTIC  
 BEGIN         
    declare sOut long varchar;         
    declare iCurPos integer;         
    declare iNonNum integer;

set sOut    = '';     
    set iCurPos = 1;     
    set iNonNum = patindex('%[^0-9]%', val);

while iNonNum <> 0 loop        
       // pick up everything to the left of the non numeric        
       set sOut = sOut || substr(val, iCurPos, iNonNum - 1);

// check starting at next numeric        
       set iCurPos = iCurPos + iNonNum + patindex('%[0-9]%', substr(val, iNonNum + 1)) - 1;        
       set iNonNum = patindex('%[^0-9]%', substr(val, iCurPos));     
    end loop;

// return our string plus any remaining chars     
    return sOut || substr(val, iCurPos);   
    END;

This might change based upon whether we expected lots of non numerics and whether non numerics might show up grouped together or not which is what I designed for above. If we're just looking at something like SSN 999-99-9999 then the set iCurPos should just be set iCurPos = iCurPos + iNonNum;

VolkerBarth
Contributor
0 Kudos

Thanks for the VERY quick response! I guess this would work though I will have to leave more than just digits. Guess I could add "[^0-9A-Z]" as pattern. - However, I would still get to know if regular expressions can't do that better...