Regular Expressions in Oracle Logan Palanisamy
Agenda  Introduction to regular expressions  REGEXP_* functions in Oracle  Coffee Break  Examples  More examples
Meeting Basics  Put your phones/pagers on vibrate/mute  Messenger: Change the status to offline or in-meeting  Remote attendees: Mute yourself (*6). Ask questions via Adobe Connect.
What are Regular Expressions?  A way to express patterns  credit cards, license plate numbers, vehicle identification numbers, voter id, driving license  UNIX (grep, egrep), PHP, JAVA support Regular Expressions  PERL made it popular
String operations before Regular Expression support in Oracle  Pull the data from DB and perform it in middle tier or FE  OWA_PATTERN in 9i and before  LIKE operator
LIKE operator  % matches zero or more of any character  _ matches exactly one character  Examples  WHERE col1 LIKE 'abc%';  WHERE col1 LIKE '%abc';  WHERE col1 LIKE 'ab_d';  WHERE col1 LIKE 'ab_d' escape '';  WHERE col1 NOT LIKE 'abc%';  Very limited functionality  Check whether first character is numeric: where c1 like '0%' OR c1 like '1%' OR .. .. c1 like '9%'  Very trivial with Regular Exp: where regexp_like(c1, '^[0-9]')
Regular Expressions Meta Meaning character . Matches any single "character" except newline. * Matches zero or more of the character preceding it e.g.: bugs*, table.* ^ Denotes the beginning of the line. ^A denotes lines starting with A $ Denotes the end of the line. :$ denotes lines ending with : Escape character (., *, [, , etc) [] matches one or more characters within the brackets. e.g. [aeiou], [a-z], [a-zA-Z], [0-9], [:alpha:], [a-z?,!] [^] negation - matches any characters other than the ones inside brackets. eg. ^[^13579] denotes all lines not starting with odd numbers, [^02468]$ denotes all lines not ending with even numbers 7
Extended Regular Expressions Meta character Meaning | alternation. e.g.: ho(use|me), the(y|m), (they|them) + one or more occurrences of previous character. ? zero or one occurrences of previous character. {n} exactly n repetitions of the previous char or group {n,} n or more repetitions of the previous char or group {,m} zero to m repetitions of the previous char or group {n, m} n to m repetitions of previous char or group (....) grouping or subexpression n back referencing where n stands for the nth sub- expression. e.g.: 1 is the back reference for first sub-expression. 8
POSIX Character Classes POSIX Description [:alnum:] Alphanumeric characters [:alpha:] Alphabetic characters [:ascii:] ASCII characters [:blank:] Space and tab [:cntrl:] Control characters [:digit:] Digits, Hexadecimal digits [:xdigit:] [:graph:] Visible characters (i.e. anything except spaces, control characters, etc.) [:lower:] Lowercase letters [:print:] Visible characters and spaces (i.e. anything except control characters) [:punct:] Punctuation and symbols. [:space:] All whitespace characters, including line breaks [:upper:] Uppercase letters [:word:] Word characters (letters, numbers and underscores)
Perl Character Classes Perl POSIX Description d [[:digit:]] [0-9] D [^[:digit:]] [^0-9] w [[:alnum:]_] [0-9a-zA-Z_] W [^[:alnum:]_] [^0-9a-zA-Z_] s [[:space:]] S [^[:space:]] 10
Tools to learn Regular Expressions  http://www.weitz.de/regex-coach/  http://www.regexbuddy.com/
REGEXP_* functions  Available from 10g onwards.  Powerful and flexible, but CPU-hungry.  Easy and elegant, but sometimes less performant  Usable on text literal, bind variable, or any column that holds character data such as CHAR, NCHAR, CLOB, NCLOB, NVARCHAR2, and VARCHAR2 (but not LONG).  Useful as column constraint for data validation
REGEXP_LIKE  Determines whether pattern matches.  REGEXP_LIKE (source_str, pattern, [,match_parameter])  Returns TRUE or FALSE.  Use in WHERE clause to return rows matching a pattern  Use as a constraint  alter table t add constraint alphanum check (regexp_like (x, '[[:alnum:]]'));  Use in PL/SQL to return a boolean.  IF (REGEXP_LIKE(v_name, '[[:alnum:]]')) THEN ..  Can't be used in SELECT clause  regexp_like.sql
REGEXP_SUBSTR  Extracts the matching pattern. Returns NULL when nothing matches  REGEXP_SUBSTR(source_str, pattern [, position [, occurrence [, match_parameter]]])  position: character at which to begin the search. Default is 1  occurrence: The occurrence of pattern you want to extract  regexp_substr.sql
REGEXP_INSTR  Returns the location of match in a string  REGEXP_INSTR(source_str, pattern, [, position [, occurrence [, return_option [, match_parameter]]]])  return_option:  0, the default, returns the position of the first character.  1 returns the position of the character following the occurence.  regexp_instr.sql
REGEXP_REPLACE  Search and Replace a pattern  REGEXP_REPLACE(source_str, pattern [, replace_str] [, position [, occurrence [, match_parameter]]]])  If replace_str is not specified, pattern/search_str is replaced with empty string  occurence:  when 0, the default, replaces all occurrences of the match.  when n, any positive integer, replaces the nth occurrence.  regexp_replace.sql
REGEXP_COUNT  New in 11g  Returns the number of times a pattern appears in a string.  REGEXP_COUNT(source_str, pattern [,position [,match_param]])  For simple patterns it is same as (LENGTH(source_str) – LENGTH(REPLACE(source_str, pattern)))/LENGTH(pattern)  regexp_count.sql
Pattern Matching modifiers  i – Specifies case-insensitive matching (ignore case)  c – Specifies case-sensitive matching  n – allows the period (.) to match the newline character  m - treats the source string as multiple lines.  x - ignores whitespace characters  when match_parameter is not specified,  case sensitivity is determined by NLS_SORT parameter (BINARY, BINARY_CI)  A period (.) doesn't match newline character  Source string is treated as a single line  match_params.sql
Is a CHAR column all numeric?  to_number(c1) returns ORA-01722: invalid number if a varchar2 column contains alpha characters.  is_numeric.sql
Check constraints  Put validation close to where the data is stored  No need to have validation at different clients  check_constraint.sql
Extract email-ids  Find email-ids embedded in text strings. Possible email-id formats: abc123@company.com namex@mail.company.com xyz_1@yahoo.co.in  extract_emailid.sql
Extract dates  Extract dates embedded in text strings. Possible formats 1/5/2007, 2-5-03, 12-31-2009, 1/31/10, 2/5-10  extract_date.sql
Extracting hostnames from URLs  Extract hostnames/domain-names embedded in text strings. Possible formats  http://us.mg201.mail.yahoo.com/dc/launch?.partner =sbc&.gx=1&.rand=fegr2vucbecu5  https://www.mybank.com:8080/abc/xyz  www.mybank.com  ftp://www.mycharity.org/abc/xyz  extract_hostname.sql
Convert value pairs to XML  Input: A string such as 'remain1=1;remain2=2;'  Output: An XML string <remain1><value=1></remain1> <remain2><value=2></remain2>  convert_to_xml.sql
Sort IP addresses in numerical order  Sort IP addresses, that are stored as character strings, in numerical order.  Input  10.10.20.10  127.0.0.1  166.22.33.44  187.222.121.0  20.23.23.20  sort_ip_address.sql
Extract first name, last name, and middle initial  Extract the first name, last name with an optional middle initial.  first_last_mi.sql
Finding the Last Occurrence  Find the last numeric sequence from a sequence.  Return 567 from 'abc/123/def567/xyz'  INSTR and SUBSTR allow backward search when position is negative. REGEXP functions don't allow backward search  last_occurrence.sql
Fuzzy Match  Tables t1 and t2 each have a varchar2(12) column (t1.x, t2.y).  A row in t1 is considered a match for a row in t2, if any six characters in t1.x matches with any six characters in t2.y  fuzzy_match.sql
The lazy operator  ? is lazy/non-greedy quantifier  greedy_lazy.sql
Meta-characters with multiple meanings  Same meta characters are used with multiple meanings  ^ used for anchoring and negation.  ? used as quantifier and lazy operator  () used for grouping or sub-expression  metachars_with_multiple_meanings.sql
Nuances  ? (zero or one), * (zero or more) could sometimes mislead you  nuances.sql
Stored patterns  patterns can be stored in table columns and be referenced in REGEXP functions  No need to hard-code them  stored_patterns.sql
Random things  Insert a dash before the two last digits  Remove a substring  Get rid of useless commas from a string  Find the word that comes immediately before a substring (e.g. XXX)  Replace whole words, not its parts  Trimming the trailing digits  random.sql
A few other points  When not to use Regular Expressions  If the same thing could be used without regular expressions and without too much coding.  POSIX notations need double brackets [[:upper]]. [:upper:] won't work. [[:UPPER:]] won't work either. It has to be in lower case letters.  Locale support provided with Collation Element ][.ce.]], and Equivalence Classes [[=e=]]  MySQL supports regular expressions with RLIKE
References  Oracle® Database Advanced Application Developer's Guide (http://download.oracle.com/docs/cd/E11882_0 1/appdev.112/e17125/adfns_regexp.htm#CHDGH BHF)  Anti-Patterns in Regular Expressions: http://gennick.com/antiregex.html  First Expressions. An article by Jonathan Gennick Oracle Magazine, Sep/Oct 2003.  Oracle Regular Expressions Pocket Reference by Gonathan Gennick  http://examples.oreilly.com/9780596006013/Re gexPocketData.sql
References ...  http://www.psoug.org/reference/regexp.html  http://download.oracle.com/docs/cd/E11882_01/se rver.112/e10592/conditions007.htm#SQLRF00501  http://www.oracle.com/technology/pub/articles/sat ernos_regexp.html  http://www.oracle.com/technology/products/datab ase/application_development/pdf/TWP_Regular_E xpressions.pdf  http://asktom.oracle.com/pls/asktom/asktom.searc h?p_string=regexp_
References ...  http://www.oracle.com/technology/obe/10gr2_db_single/de velop/regexp/regexp_otn.htm  http://www.oracle.com/technology/sample_code/tech/pl_sq l/index.html  http://forums.oracle.com/forums/thread.jspa?threadID=427 716  http://forums.oracle.com/forums/search.jspa?threadID=&q= regular+expression&objID=f75&dateRange=all&userID=&nu mResults=120&rankBy=9  http://www.oracle.com/technology/sample_code/tech/pl_sq l/index.html  http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUE STION_ID:2200894550208#1568589800346862515
Q&A  devel_oracle@

Regular expressions in oracle

  • 1.
    Regular Expressions inOracle Logan Palanisamy
  • 2.
    Agenda  Introduction toregular expressions  REGEXP_* functions in Oracle  Coffee Break  Examples  More examples
  • 3.
    Meeting Basics  Putyour phones/pagers on vibrate/mute  Messenger: Change the status to offline or in-meeting  Remote attendees: Mute yourself (*6). Ask questions via Adobe Connect.
  • 4.
    What are RegularExpressions?  A way to express patterns  credit cards, license plate numbers, vehicle identification numbers, voter id, driving license  UNIX (grep, egrep), PHP, JAVA support Regular Expressions  PERL made it popular
  • 5.
    String operations beforeRegular Expression support in Oracle  Pull the data from DB and perform it in middle tier or FE  OWA_PATTERN in 9i and before  LIKE operator
  • 6.
    LIKE operator  %matches zero or more of any character  _ matches exactly one character  Examples  WHERE col1 LIKE 'abc%';  WHERE col1 LIKE '%abc';  WHERE col1 LIKE 'ab_d';  WHERE col1 LIKE 'ab_d' escape '';  WHERE col1 NOT LIKE 'abc%';  Very limited functionality  Check whether first character is numeric: where c1 like '0%' OR c1 like '1%' OR .. .. c1 like '9%'  Very trivial with Regular Exp: where regexp_like(c1, '^[0-9]')
  • 7.
    Regular Expressions Meta Meaning character . Matches any single "character" except newline. * Matches zero or more of the character preceding it e.g.: bugs*, table.* ^ Denotes the beginning of the line. ^A denotes lines starting with A $ Denotes the end of the line. :$ denotes lines ending with : Escape character (., *, [, , etc) [] matches one or more characters within the brackets. e.g. [aeiou], [a-z], [a-zA-Z], [0-9], [:alpha:], [a-z?,!] [^] negation - matches any characters other than the ones inside brackets. eg. ^[^13579] denotes all lines not starting with odd numbers, [^02468]$ denotes all lines not ending with even numbers 7
  • 8.
    Extended Regular Expressions Metacharacter Meaning | alternation. e.g.: ho(use|me), the(y|m), (they|them) + one or more occurrences of previous character. ? zero or one occurrences of previous character. {n} exactly n repetitions of the previous char or group {n,} n or more repetitions of the previous char or group {,m} zero to m repetitions of the previous char or group {n, m} n to m repetitions of previous char or group (....) grouping or subexpression n back referencing where n stands for the nth sub- expression. e.g.: 1 is the back reference for first sub-expression. 8
  • 9.
    POSIX Character Classes POSIX Description [:alnum:] Alphanumeric characters [:alpha:] Alphabetic characters [:ascii:] ASCII characters [:blank:] Space and tab [:cntrl:] Control characters [:digit:] Digits, Hexadecimal digits [:xdigit:] [:graph:] Visible characters (i.e. anything except spaces, control characters, etc.) [:lower:] Lowercase letters [:print:] Visible characters and spaces (i.e. anything except control characters) [:punct:] Punctuation and symbols. [:space:] All whitespace characters, including line breaks [:upper:] Uppercase letters [:word:] Word characters (letters, numbers and underscores)
  • 10.
    Perl Character Classes Perl POSIX Description d [[:digit:]] [0-9] D [^[:digit:]] [^0-9] w [[:alnum:]_] [0-9a-zA-Z_] W [^[:alnum:]_] [^0-9a-zA-Z_] s [[:space:]] S [^[:space:]] 10
  • 11.
    Tools to learnRegular Expressions  http://www.weitz.de/regex-coach/  http://www.regexbuddy.com/
  • 12.
    REGEXP_* functions  Availablefrom 10g onwards.  Powerful and flexible, but CPU-hungry.  Easy and elegant, but sometimes less performant  Usable on text literal, bind variable, or any column that holds character data such as CHAR, NCHAR, CLOB, NCLOB, NVARCHAR2, and VARCHAR2 (but not LONG).  Useful as column constraint for data validation
  • 13.
    REGEXP_LIKE  Determines whetherpattern matches.  REGEXP_LIKE (source_str, pattern, [,match_parameter])  Returns TRUE or FALSE.  Use in WHERE clause to return rows matching a pattern  Use as a constraint  alter table t add constraint alphanum check (regexp_like (x, '[[:alnum:]]'));  Use in PL/SQL to return a boolean.  IF (REGEXP_LIKE(v_name, '[[:alnum:]]')) THEN ..  Can't be used in SELECT clause  regexp_like.sql
  • 14.
    REGEXP_SUBSTR  Extracts thematching pattern. Returns NULL when nothing matches  REGEXP_SUBSTR(source_str, pattern [, position [, occurrence [, match_parameter]]])  position: character at which to begin the search. Default is 1  occurrence: The occurrence of pattern you want to extract  regexp_substr.sql
  • 15.
    REGEXP_INSTR  Returns thelocation of match in a string  REGEXP_INSTR(source_str, pattern, [, position [, occurrence [, return_option [, match_parameter]]]])  return_option:  0, the default, returns the position of the first character.  1 returns the position of the character following the occurence.  regexp_instr.sql
  • 16.
    REGEXP_REPLACE  Search andReplace a pattern  REGEXP_REPLACE(source_str, pattern [, replace_str] [, position [, occurrence [, match_parameter]]]])  If replace_str is not specified, pattern/search_str is replaced with empty string  occurence:  when 0, the default, replaces all occurrences of the match.  when n, any positive integer, replaces the nth occurrence.  regexp_replace.sql
  • 17.
    REGEXP_COUNT  New in11g  Returns the number of times a pattern appears in a string.  REGEXP_COUNT(source_str, pattern [,position [,match_param]])  For simple patterns it is same as (LENGTH(source_str) – LENGTH(REPLACE(source_str, pattern)))/LENGTH(pattern)  regexp_count.sql
  • 18.
    Pattern Matching modifiers i – Specifies case-insensitive matching (ignore case)  c – Specifies case-sensitive matching  n – allows the period (.) to match the newline character  m - treats the source string as multiple lines.  x - ignores whitespace characters  when match_parameter is not specified,  case sensitivity is determined by NLS_SORT parameter (BINARY, BINARY_CI)  A period (.) doesn't match newline character  Source string is treated as a single line  match_params.sql
  • 19.
    Is a CHARcolumn all numeric?  to_number(c1) returns ORA-01722: invalid number if a varchar2 column contains alpha characters.  is_numeric.sql
  • 20.
    Check constraints  Putvalidation close to where the data is stored  No need to have validation at different clients  check_constraint.sql
  • 21.
    Extract email-ids  Findemail-ids embedded in text strings. Possible email-id formats: abc123@company.com namex@mail.company.com xyz_1@yahoo.co.in  extract_emailid.sql
  • 22.
    Extract dates  Extractdates embedded in text strings. Possible formats 1/5/2007, 2-5-03, 12-31-2009, 1/31/10, 2/5-10  extract_date.sql
  • 23.
    Extracting hostnames fromURLs  Extract hostnames/domain-names embedded in text strings. Possible formats  http://us.mg201.mail.yahoo.com/dc/launch?.partner =sbc&.gx=1&.rand=fegr2vucbecu5  https://www.mybank.com:8080/abc/xyz  www.mybank.com  ftp://www.mycharity.org/abc/xyz  extract_hostname.sql
  • 24.
    Convert value pairsto XML  Input: A string such as 'remain1=1;remain2=2;'  Output: An XML string <remain1><value=1></remain1> <remain2><value=2></remain2>  convert_to_xml.sql
  • 25.
    Sort IP addressesin numerical order  Sort IP addresses, that are stored as character strings, in numerical order.  Input  10.10.20.10  127.0.0.1  166.22.33.44  187.222.121.0  20.23.23.20  sort_ip_address.sql
  • 26.
    Extract first name,last name, and middle initial  Extract the first name, last name with an optional middle initial.  first_last_mi.sql
  • 27.
    Finding the LastOccurrence  Find the last numeric sequence from a sequence.  Return 567 from 'abc/123/def567/xyz'  INSTR and SUBSTR allow backward search when position is negative. REGEXP functions don't allow backward search  last_occurrence.sql
  • 28.
    Fuzzy Match  Tablest1 and t2 each have a varchar2(12) column (t1.x, t2.y).  A row in t1 is considered a match for a row in t2, if any six characters in t1.x matches with any six characters in t2.y  fuzzy_match.sql
  • 29.
    The lazy operator ? is lazy/non-greedy quantifier  greedy_lazy.sql
  • 30.
    Meta-characters with multiplemeanings  Same meta characters are used with multiple meanings  ^ used for anchoring and negation.  ? used as quantifier and lazy operator  () used for grouping or sub-expression  metachars_with_multiple_meanings.sql
  • 31.
    Nuances  ? (zeroor one), * (zero or more) could sometimes mislead you  nuances.sql
  • 32.
    Stored patterns  patternscan be stored in table columns and be referenced in REGEXP functions  No need to hard-code them  stored_patterns.sql
  • 33.
    Random things  Inserta dash before the two last digits  Remove a substring  Get rid of useless commas from a string  Find the word that comes immediately before a substring (e.g. XXX)  Replace whole words, not its parts  Trimming the trailing digits  random.sql
  • 34.
    A few otherpoints  When not to use Regular Expressions  If the same thing could be used without regular expressions and without too much coding.  POSIX notations need double brackets [[:upper]]. [:upper:] won't work. [[:UPPER:]] won't work either. It has to be in lower case letters.  Locale support provided with Collation Element ][.ce.]], and Equivalence Classes [[=e=]]  MySQL supports regular expressions with RLIKE
  • 35.
    References  Oracle® DatabaseAdvanced Application Developer's Guide (http://download.oracle.com/docs/cd/E11882_0 1/appdev.112/e17125/adfns_regexp.htm#CHDGH BHF)  Anti-Patterns in Regular Expressions: http://gennick.com/antiregex.html  First Expressions. An article by Jonathan Gennick Oracle Magazine, Sep/Oct 2003.  Oracle Regular Expressions Pocket Reference by Gonathan Gennick  http://examples.oreilly.com/9780596006013/Re gexPocketData.sql
  • 36.
    References ...  http://www.psoug.org/reference/regexp.html http://download.oracle.com/docs/cd/E11882_01/se rver.112/e10592/conditions007.htm#SQLRF00501  http://www.oracle.com/technology/pub/articles/sat ernos_regexp.html  http://www.oracle.com/technology/products/datab ase/application_development/pdf/TWP_Regular_E xpressions.pdf  http://asktom.oracle.com/pls/asktom/asktom.searc h?p_string=regexp_
  • 37.
    References ...  http://www.oracle.com/technology/obe/10gr2_db_single/de velop/regexp/regexp_otn.htm  http://www.oracle.com/technology/sample_code/tech/pl_sq l/index.html  http://forums.oracle.com/forums/thread.jspa?threadID=427 716  http://forums.oracle.com/forums/search.jspa?threadID=&q= regular+expression&objID=f75&dateRange=all&userID=&nu mResults=120&rankBy=9  http://www.oracle.com/technology/sample_code/tech/pl_sq l/index.html  http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUE STION_ID:2200894550208#1568589800346862515
  • 38.