
1.7 PCRE Lib

The Perl Compatible Regular Expression (PCRE) library is a free-for-any-use, open source regular expression library developed by Philip Hazel. PCRE has been incorporated into PHP, Apache 2.0, KDE, Exim MTA, Analog, and Postfix. Users of those programs can use the supported metacharacters listed in Table 1-26 through Table 1-30.

The PCRE library uses a Traditional NFA match engine. For an explanation of the rules behind an NFA engine, see Section 1.2.

This reference covers PCRE Version 4.0, which aims to emulate Perl 5.8-style regular expressions.

1.7.1 Supported Metacharacters

PCRE supports the metacharacters and metasequences listed in Table 1-26 through Table 1-30. For expanded definitions of each metacharacter, see Section 1.2.1.

Table 1-26. Character representations

Sequence   Meaning
--------   -------
\a         Alert (bell), x07.
\b         Backspace, x08, supported only in character class.
\e         ESC character, x1B.
\n         Newline, x0A.
\r         Carriage return, x0D.
\f         Form feed, x0C.
\t         Horizontal tab, x09.
\octal     Character specified by a three-digit octal code.
\xhex      Character specified by a one- or two-digit hexadecimal code.
\x{hex}    Character specified by any hexadecimal code.
\cchar     Named control character.

Table 1-27. Character classes and class-like constructs

Class       Meaning
-----       -------
[...]       A single character listed or contained in a listed range.
[^...]      A single character not listed and not contained within a listed range.
[:class:]   POSIX-style character class valid only within a regex character class.
.           Any character except newline (unless single-line mode, /s).
\C          One byte; however, this may corrupt a Unicode character stream.
\w          Word character, [a-zA-Z0-9_].
\W          Non-word character, [^a-zA-Z0-9_].
\d          Digit character, [0-9].
\D          Non-digit character, [^0-9].
\s          Whitespace character, [\n\r\f\t ].
\S          Non-whitespace character, [^\n\r\f\t ].

Table 1-28. Anchors and zero-width tests

Sequence   Meaning
--------   -------
^          Start of string, or after any newline if in multiline match mode, /m.
\A         Start of search string, in all match modes.
$          End of search string or before a string-ending newline, or before any newline if in multiline match mode, /m.
\Z         End of string or before a string-ending newline, in any match mode.
\z         End of string, in any match mode.
\G         Beginning of current search.
\b         Word boundary; position between a word character (\w) and either a non-word character (\W), the start of the string, or the end of the string.
\B         Not-word-boundary.
(?=...)    Positive lookahead.
(?!...)    Negative lookahead.
(?<=...)   Positive lookbehind.
(?<!...)   Negative lookbehind.

Table 1-29. Comments and mode modifiers

Modifier/sequence      Mode character   Meaning
-----------------      --------------   -------
PCRE_CASELESS          i                Case-insensitive matching for characters with codepoint values less than 256.
PCRE_MULTILINE         m                ^ and $ match next to embedded \n.
PCRE_DOTALL            s                Dot (.) matches newline.
PCRE_EXTENDED          x                Ignore whitespace and allow comments (#) in pattern.
PCRE_UNGREEDY          U                Reverse greediness of all quantifiers: * becomes non-greedy and *? becomes greedy.
PCRE_ANCHORED                           Force match to start at the first position searched.
PCRE_DOLLAR_ENDONLY                     Force $ to match only at the very end of the string, not before a string-ending newline. Overridden by multiline mode.
PCRE_NO_AUTO_CAPTURE                    Disable capturing function of parentheses.
PCRE_UTF8                               Treat regular expression and subject strings as strings of multibyte UTF-8 characters.
(?mode)                                 Turn listed modes (imsxU) on for the rest of the subexpression.
(?-mode)                                Turn listed modes (imsxU) off for the rest of the subexpression.
(?mode:...)                             Turn listed modes (xsmi) on within parentheses.
(?-mode:...)                            Turn listed modes (xsmi) off within parentheses.
\Q                                      Quote all following regex metacharacters.
\E                                      End a span started with \Q.
(?#...)                                 Treat substring as a comment.
#...                                    Treat rest of line as a comment in PCRE_EXTENDED mode.

Table 1-30. Grouping, capturing, conditional, and control

Sequence                Meaning
--------                -------
(...)                   Group subpattern and capture submatch into \1,\2,...
(?P<name>...)           Group subpattern and capture submatch into named capture group, name.
\n                      Contains the results of the nth earlier submatch from a parentheses capture group or a named capture group.
(?:...)                 Group subpattern, but do not capture submatch.
(?>...)                 Disallow backtracking for text matched by subpattern.
...|...                 Try subpatterns in alternation.
*                       Match 0 or more times.
+                       Match 1 or more times.
?                       Match 1 or 0 times.
{n}                     Match exactly n times.
{n,}                    Match at least n times.
{x,y}                   Match at least x times, but no more than y times.
*?                      Match 0 or more times, but as few times as possible.
+?                      Match 1 or more times, but as few times as possible.
??                      Match 0 or 1 time, but as few times as possible.
{n,}?                   Match at least n times, but as few times as possible.
{x,y}?                  Match at least x times, no more than y times, and as few times as possible.
*+                      Match 0 or more times, and never backtrack.
++                      Match 1 or more times, and never backtrack.
?+                      Match 0 or 1 times, and never backtrack.
{n}+                    Match exactly n times, and never backtrack.
{n,}+                   Match at least n times, and never backtrack.
{x,y}+                  Match at least x times, no more than y times, and never backtrack.
(?(condition)...|...)   Match with if-then-else pattern. The condition can be either the number of a capture group or a lookahead or lookbehind construct.
(?(condition)...)       Match with if-then pattern. The condition can be either the number of a capture group or a lookahead or lookbehind construct.

1.7.2 PCRE API

Applications using PCRE should include the API prototypes from pcre.h and link against the library file, libpcre.a, by compiling with -lpcre.

Most functionality is contained in the functions pcre_compile( ), which prepares a regular expression data structure, and pcre_exec( ), which performs the pattern matching. You are responsible for freeing memory, although PCRE does provide pcre_free_substring( ) and pcre_free_substring_list( ) to help out.

PCRE API Synopsis

pcre *pcre_compile(const char * pattern, int options, const char ** errptr, int * erroffset, const unsigned char * tableptr)

Compile pattern with optional mode modifiers options and optional locale tables tableptr, which are created with pcre_maketables( ). Returns either a compiled regex or NULL with errptr pointing to an error message and erroffset pointing to the position in pattern where the error occurred.

int pcre_exec(const pcre * code, const pcre_extra *extra, const char * subject, int length, int startoffset, int options, int * ovector, int ovecsize)

Perform pattern matching with a compiled regular expression, code, and a supplied input string, subject, of length length. The results of a successful match are stored in ovector. The first and second elements of ovector contain the position of the first character in the overall match and the position of the character following the end of the overall match. Each additional pair of elements, up to two thirds the length of ovector, contains the position of the starting character and the position of the character after the end of a capture group submatch. The optional parameter options contains mode modifiers, and extra contains the results of a call to pcre_study( ).
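The ovector layout described above can be read with a few lines of index arithmetic. The sketch below needs no PCRE calls; the helper name capture_span( ) is an assumption, and a caller would pass it the vector and return value from pcre_exec( ).

```c
/* Hypothetical helper (not a PCRE function): looks up capture group
   `group` in an ovector filled by pcre_exec(), where rc is the value
   pcre_exec() returned. Stores the start offset in *start and returns
   the length of the captured text, or -1 if the group is out of range
   or did not participate in the match. Group 0 is the overall match. */
int capture_span(const int *ovector, int rc, int group, int *start)
{
    if (group < 0 || group >= rc)
        return -1;
    if (ovector[2 * group] < 0)         /* unset groups hold -1 */
        return -1;

    *start = ovector[2 * group];        /* first character of the submatch */
    return ovector[2 * group + 1]       /* one past the last character */
           - ovector[2 * group];
}
```

For instance, matching the date pattern from Example 1-18 against "12/30/1969" should leave the pairs (0,10), (0,2), (3,5), and (6,10) in the vector with a return value of 4, so group 3 starts at offset 6 and is 4 characters long. Only the first two thirds of the vector carry results, which is why the examples size it as a multiple of 3.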

pcre_extra *pcre_study(const pcre * code, int options, const char ** errptr)

Return information to speed up calls to pcre_exec( ) with code. There are currently no options, so options should always be zero. If an error occurred, errptr points to an error message.

int pcre_copy_named_substring(const pcre * code, const char * subject, int * ovector, int stringcount, const char * stringname, char * buffer, int buffersize)

Copy the substring matched by the named capture group stringname into buffer. stringcount is the number of substrings placed into ovector, usually the result returned by pcre_exec( ).

int pcre_copy_substring(const char * subject, int * ovector, int stringcount, int stringnumber, char * buffer, int buffersize)

Copy the substring matched by the numbered capture group stringnumber into buffer. stringcount is the number of substrings placed into ovector, usually the result returned by pcre_exec( ).

int pcre_get_named_substring(const pcre * code, const char * subject, int * ovector, int stringcount, const char * stringname, const char ** stringptr)

Create a new string, pointed to by stringptr, containing the substring matched by the named capture group stringname. Returns the length of the substring. stringcount is the number of substrings placed into ovector, usually the result returned by pcre_exec( ).

int pcre_get_stringnumber(const pcre * code, const char * name)

Return the numbered capture group associated with the named capture group, name.

int pcre_get_substring(const char * subject, int * ovector, int stringcount, int stringnumber, const char ** stringptr)

Create a new string, pointed to by stringptr, containing the substring matched by the numbered capture group stringnumber. Returns the length of the substring. stringcount is the number of substrings placed into ovector, usually the result returned by pcre_exec( ).

int pcre_get_substring_list(const char * subject, int * ovector, int stringcount, const char *** listptr)

Return a list of pointers, listptr, to all captured substrings.

void pcre_free_substring(const char * stringptr)

Free memory pointed to by stringptr and allocated by pcre_get_named_substring( ) or pcre_get_substring( ).

void pcre_free_substring_list(const char ** stringptr)

Free memory pointed to by stringptr and allocated by pcre_get_substring_list( ).

const unsigned char *pcre_maketables(void)

Build character tables for the current locale.

int pcre_fullinfo(const pcre * code, const pcre_extra * extra, int what, void * where)

Place info on a regex specified by what into where. Available values for what are PCRE_INFO_BACKREFMAX, PCRE_INFO_CAPTURECOUNT, PCRE_INFO_FIRSTBYTE, PCRE_INFO_FIRSTTABLE, PCRE_INFO_LASTLITERAL, PCRE_INFO_NAMECOUNT, PCRE_INFO_NAMEENTRYSIZE, PCRE_INFO_NAMETABLE, PCRE_INFO_OPTIONS, PCRE_INFO_SIZE, and PCRE_INFO_STUDYSIZE.

int pcre_config(int what, void * where)

Place the value of build-time options specified by what into where. Available values for what are PCRE_CONFIG_UTF8, PCRE_CONFIG_NEWLINE, PCRE_CONFIG_LINK_SIZE, PCRE_CONFIG_POSIX_MALLOC_THRESHOLD, and PCRE_CONFIG_MATCH_LIMIT.

char *pcre_version(void)

Return a pointer to a string containing the PCRE version and release date.

void *(*pcre_malloc)(size_t)

Entry point PCRE uses for malloc( ) calls.

void (*pcre_free)(void *)

Entry point PCRE uses for free( ) calls.

int (*pcre_callout)(pcre_callout_block *)

Can be set to a callout function that will be called during matches.

1.7.3 Unicode Support

PCRE provides basic Unicode support. When a pattern is compiled with the PCRE_UTF8 flag, the pattern will run on Unicode text. However, PCRE has no capability to recognize any properties of characters whose values are greater than 255.

PCRE determines case and the property of being a letter or digit based on a set of default tables. You can supply an alternate set of tables based on a different locale. For example:

setlocale(LC_CTYPE, "fr");
tables = pcre_maketables(  );
re = pcre_compile(..., tables);

1.7.4 Examples

Example 1-17 and Example 1-18 are adapted from an open source example written by Philip Hazel and copyrighted by the University of Cambridge, England.

Example 1-17. Simple match
#include <stdio.h>
#include <string.h>
#include <pcre.h>

#define CAPTUREVECTORSIZE 30   /* should be a multiple of 3 */

int main(int argc, char **argv)
{
pcre *regex;
const char *error;
int erroffset;
int capturevector[CAPTUREVECTORSIZE];
int rc;

char *pattern = "spider[- ]?man";
char *text ="SPIDERMAN menaces city!";

/* Compile Regex */
regex = pcre_compile(
  pattern,             
  PCRE_CASELESS,  /* OR'd mode modifiers */     
  &error,         /* error message */      
  &erroffset,     /* position in regex where error occurred */
  NULL);          /* use default locale */     

/* Handle Errors */
if (regex == NULL)
  {
  printf("Compilation failed at offset %d: %s\n", erroffset,
         error);
  return 1;
  }

/* Try Match */
rc = pcre_exec(
  regex,    /* compiled regular expression */                   
  NULL,     /* optional results from pcre_study */            
  text,     /* input string */         
  (int)strlen(text), /* length of input string */
  0,        /* starting position in input string */            
  0,        /* OR'd options */            
  capturevector, /* holds results of capture groups */            
  CAPTUREVECTORSIZE);            

/* Handle Errors */
if (rc < 0)
  {
  switch(rc)
    {
    case PCRE_ERROR_NOMATCH: printf("No match\n"); break;    
    default: printf("Matching error %d\n", rc); break;
    }
  return 1;
  }
pcre_free(regex);   /* release the compiled regex */
return 0;
}
Example 1-18. Match and capture group
#include <stdio.h>
#include <string.h>
#include <pcre.h>

#define CAPTUREVECTORSIZE 30   /* should be a multiple of 3 */

int main(int argc, char **argv)
{
pcre *regex;
const char *error;
int erroffset;
int capturevector[CAPTUREVECTORSIZE];
int rc, i;

char *pattern = "(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)";
char *text ="12/30/1969";

/* Compile the Regex */
regex = pcre_compile(
  pattern,              
  PCRE_CASELESS,  /* OR'd mode modifiers */
  &error,         /* error message */
  &erroffset,     /* position in regex where error occurred */
  NULL);          /* use default locale */   

/* Handle compilation errors */
if (regex == NULL)
  {
  printf("Compilation failed at offset %d: %s\n", 
         erroffset, error);
  return 1;
  }
    
rc = pcre_exec(
  regex,    /* compiled regular expression */                   
  NULL,     /* optional results from pcre_study */            
  text,     /* input string */         
  (int)strlen(text), /* length of input string */
  0,        /* starting position in input string */            
  0,        /* OR'd options */            
  capturevector, /* holds results of capture groups */         
  CAPTUREVECTORSIZE);           

/* Handle Match Errors */
if (rc < 0)
  {
  switch(rc)
    {
    case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
    /*
    Handle other special cases if you like
    */
    default: printf("Matching error %d\n", rc); break;
    }
  return 1;
  }

/* Match succeeded */

printf("Match succeeded\n");

/* Check whether the output vector had room for all capture groups */
if (rc == 0)
  {
  rc = CAPTUREVECTORSIZE/3;
  printf("ovector only has room for %d captured substrings\n",
         rc - 1);
  }

/* Show capture groups */

for (i = 0; i < rc; i++)
  {
  char *substring_start = text + capturevector[2*i];
  int substring_length = capturevector[2*i+1] 
                         - capturevector[2*i];
  printf("%2d: %.*s\n", i, substring_length, substring_start);
  }

pcre_free(regex);   /* release the compiled regex */
return 0;
}

1.7.5 Other Resources
