1.7 PCRE Lib
The Perl Compatible Regular Expression (PCRE)
library is a free-for-any-use, open source regular expression library
developed by Philip Hazel. PCRE has been incorporated
into PHP, Apache 2.0, KDE, Exim MTA, Analog, and Postfix. Users of
those programs can use the supported metacharacters listed in Table 1-26 through Table 1-30.
The PCRE library uses a Traditional NFA match engine. For an
explanation of the rules behind an NFA engine, see Section 1.2.
This reference covers PCRE Version 4.0, which aims to emulate Perl
5.8-style regular expressions.
1.7.1 Supported Metacharacters
PCRE supports the metacharacters and metasequences listed in
Table 1-26 through Table 1-30.
For expanded definitions of each metacharacter, see Section 1.2.1.
Table 1-26. Character representations
Sequence | Meaning
\a | Alert (bell), x07.
\b | Backspace, x08, supported only in character class.
\e | ESC character, x1B.
\n | Newline, x0A.
\r | Carriage return, x0D.
\f | Form feed, x0C.
\t | Horizontal tab, x09.
\octal | Character specified by a three-digit octal code.
\xhex | Character specified by a one- or two-digit hexadecimal code.
\x{hex} | Character specified by any hexadecimal code.
\cchar | Named control character.
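The mapping behind \cchar can be sketched in a few lines of C. This is a minimal illustration, assuming the rule given in PCRE's own pattern documentation (a lowercase letter is uppercased, then bit 6, hex 40, is inverted). The helper name control_code is hypothetical and is not part of the PCRE API.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper, not part of PCRE: returns the character code that
   \cX denotes. A lowercase letter is converted to uppercase, then bit 6
   (hex 40) of the character is inverted. */
static int control_code(char x)
{
    assert((unsigned char)x < 128);   /* the rule is stated for ASCII input */
    return toupper((unsigned char)x) ^ 0x40;
}
```

Under this rule, control_code('a') and control_code('A') both give x01, and control_code('[') gives x1B, the same ESC character that \e represents.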
Table 1-27. Character classes and class-like constructs
Sequence | Meaning
[...] | A single character listed or contained in a listed range.
[^...] | A single character not listed and not contained within a listed range.
[:class:] | POSIX-style character class valid only within a regex character class.
. | Any character except newline (unless single-line mode, /s).
\C | One byte; however, this may corrupt a Unicode character stream.
\w | Word character, [a-zA-Z0-9_].
\W | Non-word character, [^a-zA-Z0-9_].
\d | Digit character, [0-9].
\D | Non-digit character, [^0-9].
\s | Whitespace character, [\n\r\f\t ].
\S | Non-whitespace character, [^\n\r\f\t ].
Table 1-28. Anchors and zero-width tests
Sequence | Meaning
^ | Start of string, or after any newline if in multiline match mode, /m.
\A | Start of search string, in all match modes.
$ | End of search string or before a string-ending newline, or before any newline if in multiline match mode, /m.
\Z | End of string or before a string-ending newline, in any match mode.
\z | End of string, in any match mode.
\G | Beginning of current search.
\b | Word boundary; position between a word character (\w) and either a non-word character (\W), the start of the string, or the end of the string.
\B | Not-word-boundary.
(?=...) | Positive lookahead.
(?!...) | Negative lookahead.
(?<=...) | Positive lookbehind.
(?<!...) | Negative lookbehind.
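The word-boundary definition in Table 1-28 translates directly into code. The following is a minimal sketch, assuming the default tables in which \w is [a-zA-Z0-9_]; the helper names is_word_char and at_word_boundary are hypothetical and do not come from PCRE.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* \w under the default tables: [a-zA-Z0-9_] (assumes the "C" locale). */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: nonzero if \b would match at byte offset pos
   (0 <= pos <= len) of subject. A boundary exists when exactly one side
   of the position is a word character; the start and end of the string
   count as non-word. */
static int at_word_boundary(const char *subject, size_t len, size_t pos)
{
    int before, after;

    assert(subject != NULL && pos <= len);
    before = pos > 0   && is_word_char((unsigned char)subject[pos - 1]);
    after  = pos < len && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

For the subject spider-man, this reports a boundary at offsets 0, 6, 7, and 10, and none inside either word. Negating the result gives the behavior of \B.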
Table 1-29. Comments and mode modifiers
Modifier/sequence | Mode character | Meaning
PCRE_CASELESS | i | Case-insensitive matching for characters with codepoint values less than 256.
PCRE_MULTILINE | m | ^ and $ match next to embedded \n.
PCRE_DOTALL | s | Dot (.) matches newline.
PCRE_EXTENDED | x | Ignore whitespace and allow comments (#) in pattern.
PCRE_UNGREEDY | U | Reverse greediness of all quantifiers: * becomes non-greedy and *? becomes greedy.
PCRE_ANCHORED | | Force match to start at the first position searched.
PCRE_DOLLAR_ENDONLY | | Force $ to match at only the end of a string instead of before a string ending with a newline. Overridden by multiline mode.
PCRE_NO_AUTO_CAPTURE | | Disable capturing function of parentheses.
PCRE_UTF8 | | Treat regular expression and subject strings as strings of multibyte UTF-8 characters.
(?mode) | | Turn listed modes (imsxU) on for the rest of the subexpression.
(?-mode) | | Turn listed modes (imsxU) off for the rest of the subexpression.
(?mode:...) | | Turn listed modes (xsmi) on within parentheses.
(?-mode:...) | | Turn listed modes (xsmi) off within parentheses.
\Q | | Quote all following regex metacharacters.
\E | | End a span started with \Q.
(?#...) | | Treat substring as a comment.
#... | | Treat rest of line as a comment in PCRE_EXTENDED mode.
Table 1-30. Grouping, capturing, conditional, and control
Sequence | Meaning
(...) | Group subpattern and capture submatch into \1,\2,...
(?P<name>...) | Group subpattern and capture submatch into named capture group, name.
\n | Contains the results of the nth earlier submatch from a parentheses capture group or a named capture group.
(?:...) | Group subpattern, but do not capture submatch.
(?>...) | Disallow backtracking for text matched by subpattern.
...|... | Try subpatterns in alternation.
* | Match 0 or more times.
+ | Match 1 or more times.
? | Match 1 or 0 times.
{n} | Match exactly n times.
{n,} | Match at least n times.
{x,y} | Match at least x times, but no more than y times.
*? | Match 0 or more times, but as few times as possible.
+? | Match 1 or more times, but as few times as possible.
?? | Match 0 or 1 times, but as few times as possible.
{n,}? | Match at least n times, but as few times as possible.
{x,y}? | Match at least x times, no more than y times, and as few times as possible.
*+ | Match 0 or more times, and never backtrack.
++ | Match 1 or more times, and never backtrack.
?+ | Match 0 or 1 times, and never backtrack.
{n}+ | Match exactly n times, and never backtrack.
{n,}+ | Match at least n times, and never backtrack.
{x,y}+ | Match at least x times, no more than y times, and never backtrack.
(?(condition)...|...) | Match with if-then-else pattern. The condition can be either the number of a capture group or a lookahead or lookbehind construct.
(?(condition)...) | Match with if-then pattern. The condition can be either the number of a capture group or a lookahead or lookbehind construct.
1.7.2 PCRE API
Applications using PCRE should look for the API prototypes in
pcre.h and link against the library file,
libpcre.a, by compiling with
-lpcre.
Most functionality is contained in the functions
pcre_compile( ), which prepares a regular
expression data structure, and pcre_exec(
), which performs the pattern matching.
You are responsible for freeing memory, although PCRE does provide
pcre_free_substring( ) and
pcre_free_substring_list( ) to help out.
- pcre *pcre_compile(const char * pattern, int options, const char ** errptr, int * erroffset, const unsigned char * tableptr)
-
Compile pattern with optional mode
modifiers options and optional locale
tables tableptr, which are created with
pcre_maketables( ). Returns either a compiled
regex or NULL with errptr pointing to an
error message and erroffset pointing to
the position in pattern where the error
occurred.
- int pcre_exec(const pcre * code, const pcre_extra *extra, const char * subject, int length, int startoffset, int options, int * ovector, int ovecsize)
-
Perform pattern matching with a compiled regular expression,
code, and a supplied input string,
subject, of length
length. The results of a successful match
are stored in ovector. The first and
second elements of ovector contain the
offset of the first character in the overall match and the offset of the
character following the end of the overall match. Each additional
pair of elements, up to two thirds the length of
ovector, contains the offset of the first character of a
capture group submatch and the offset of the character following it.
The options parameter contains optional mode
modifiers, and extra contains the
results of a call to pcre_study( ), or NULL.
- pcre_extra *pcre_study(const pcre * code, int options, const char ** errptr)
-
Return information to speed up calls to pcre_exec(
) with code. There are currently
no options, so options should always be
zero. If an error occurred, errptr points
to an error message.
- int pcre_copy_named_substring(const pcre * code, const char * subject, int * ovector, int stringcount, const char * stringname, char * buffer, int buffersize)
-
Copy the substring matched by the named capture group
stringname into
buffer.
stringcount is the number of substrings
placed into ovector, usually the result
returned by pcre_exec( ).
- int pcre_copy_substring(const char * subject, int * ovector, int stringcount, int stringnumber, char * buffer, int buffersize)
-
Copy the substring matched by the numbered capture group
stringnumber into
buffer.
stringcount is the number of substrings
placed into ovector, usually the result
returned by pcre_exec( ).
- int pcre_get_named_substring(const pcre * code, const char * subject, int * ovector, int stringcount, const char * stringname, const char ** stringptr)
-
Create a new string, pointed to by
stringptr, containing the substring
matched by the named capture group
stringname. Returns the length of the
substring. stringcount is the number of
substrings placed into ovector, usually
the result returned by pcre_exec( ).
- int pcre_get_stringnumber(const pcre * code, const char * name)
-
Return the numbered capture group associated with the named capture
group, name.
- int pcre_get_substring(const char * subject, int * ovector, int stringcount, int stringnumber, const char ** stringptr)
-
Create a new string, pointed to by
stringptr, containing the substring
matched by the numbered capture group
stringnumber. Returns the length of the
substring. stringcount is the number of
substrings placed into ovector, usually
the result returned by pcre_exec( ).
- int pcre_get_substring_list(const char * subject, int * ovector, int stringcount, const char *** listptr)
-
Return a list of pointers, listptr, to all
captured substrings.
- void pcre_free_substring(const char * stringptr)
-
Free memory pointed to by stringptr and
allocated by pcre_get_substring( ) or
pcre_get_named_substring( ).
- void pcre_free_substring_list(const char ** stringptr)
-
Free memory pointed to by stringptr and
allocated by pcre_get_substring_list( ).
- const unsigned char *pcre_maketables(void)
-
Build character tables for the current locale.
- int pcre_fullinfo(const pcre * code, const pcre_extra * extra, int what, void * where)
-
Place info on a regex specified by what
into where. Available values for
what are
PCRE_INFO_BACKREFMAX,
PCRE_INFO_CAPTURECOUNT,
PCRE_INFO_FIRSTBYTE,
PCRE_INFO_FIRSTTABLE,
PCRE_INFO_LASTLITERAL,
PCRE_INFO_NAMECOUNT,
PCRE_INFO_NAMEENTRYSIZE,
PCRE_INFO_NAMETABLE,
PCRE_INFO_OPTIONS,
PCRE_INFO_SIZE, and
PCRE_INFO_STUDYSIZE.
- int pcre_config(int what, void * where)
-
Place the value of build-time options specified by
what into
where. Available values for
what are
PCRE_CONFIG_UTF8,
PCRE_CONFIG_NEWLINE,
PCRE_CONFIG_LINK_SIZE,
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD, and
PCRE_CONFIG_MATCH_LIMIT.
- char *pcre_version(void)
-
Return a pointer to a string containing the PCRE version and release
date.
- void *(*pcre_malloc)(size_t)
-
Entry point PCRE uses for malloc( ) calls.
- void (*pcre_free)(void *)
-
Entry point PCRE uses for free( ) calls.
- int (*pcre_callout)(pcre_callout_block *)
-
Can be set to a callout function that will be called during matches.
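The ovector layout described under pcre_exec( ) can be illustrated without calling the library at all. The sketch below is a hand-rolled approximation, assuming the layout stated above: pairs of start and end offsets in the first two thirds of the vector, with the remaining third left to PCRE as workspace. The names ovector_size_for and copy_capture are hypothetical. copy_capture only mirrors the idea behind pcre_copy_substring( ); the real function reports distinct PCRE error codes rather than -1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: smallest ovector size, in ints, that can return the
   overall match plus capture_count groups. Each submatch needs two offsets,
   and a further third of the vector is left for PCRE's own use. */
static int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Hypothetical stand-in for the idea behind pcre_copy_substring(): copy
   capture group `group` into buffer as a NUL-terminated string. Returns the
   substring length, or -1 if the group does not exist, did not participate
   in the match, or does not fit in the buffer. */
static int copy_capture(const char *subject, const int *ovector,
                        int stringcount, int group,
                        char *buffer, int buffersize)
{
    int start, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (group < 0 || group >= stringcount)
        return -1;                      /* no such group in this match */
    start  = ovector[2 * group];        /* offset of first character */
    length = ovector[2 * group + 1] - start;
    if (start < 0 || length + 1 > buffersize)
        return -1;                      /* unset group or buffer too small */
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

For a pattern with three capture groups, ovector_size_for(3) gives 12. If a match against 12/30/1969 fills the vector with 0,10, 0,2, 3,5, 6,10, then copy_capture( ) with group 3 copies 1969 and returns 4. In a real program, the capture count would typically come from pcre_fullinfo( ) with PCRE_INFO_CAPTURECOUNT.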
1.7.3 Unicode Support
PCRE provides basic Unicode support. When a pattern is compiled with the
PCRE_UTF8 flag, both the pattern and the subject text are treated as UTF-8.
However, PCRE has no capability to recognize any properties of
characters whose values are greater than 255.
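Because the length passed to pcre_exec( ) and the offsets it returns are counted in bytes, byte counts and character counts diverge as soon as the subject contains multibyte UTF-8 characters. The following sketch shows the difference. It assumes well-formed UTF-8 input, and utf8_char_count is a hypothetical helper, not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters in a well-formed UTF-8 buffer
   of nbytes bytes. Every character contributes exactly one byte that is
   not a continuation byte (continuation bytes look like 10xxxxxx). */
static size_t utf8_char_count(const char *s, size_t nbytes)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

The six bytes of "na\xC3\xAFve" hold five characters. This is also why \C, which consumes a single byte, can leave the match position in the middle of a character.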
PCRE determines case and the property of being a letter or digit
based on a set of default tables. You can supply an alternate set of
tables based on a different locale. For example:
setlocale(LC_CTYPE, "fr");
tables = pcre_maketables( );
re = pcre_compile(..., tables);
1.7.4 Examples
Example 1-17 and Example 1-18 are
adapted from an open source example written by Philip Hazel and
copyright by the University of Cambridge, England.
Example 1-17. Simple match
#include <stdio.h>
#include <string.h>
#include <pcre.h>

#define CAPTUREVECTORSIZE 30   /* should be a multiple of 3 */

int main(int argc, char **argv)
{
    pcre *regex;
    const char *error;
    int erroffset;
    int capturevector[CAPTUREVECTORSIZE];
    int rc;
    const char *pattern = "spider[- ]?man";
    const char *text = "SPIDERMAN menaces city!";

    /* Compile Regex */
    regex = pcre_compile(
        pattern,
        PCRE_CASELESS,   /* OR'd mode modifiers */
        &error,          /* error message */
        &erroffset,      /* position in regex where error occurred */
        NULL);           /* use default locale */

    /* Handle Errors */
    if (regex == NULL)
    {
        printf("Compilation failed at offset %d: %s\n", erroffset,
               error);
        return 1;
    }

    /* Try Match */
    rc = pcre_exec(
        regex,               /* compiled regular expression */
        NULL,                /* optional results from pcre_study */
        text,                /* input string */
        (int)strlen(text),   /* length of input string */
        0,                   /* starting position in input string */
        0,                   /* OR'd options */
        capturevector,       /* holds results of capture groups */
        CAPTUREVECTORSIZE);

    /* The compiled regex is no longer needed after matching */
    pcre_free(regex);

    /* Handle Errors */
    if (rc < 0)
    {
        switch (rc)
        {
            case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
            default: printf("Matching error %d\n", rc); break;
        }
        return 1;
    }
    return 0;
}
Example 1-18. Match and capture group
#include <stdio.h>
#include <string.h>
#include <pcre.h>

#define CAPTUREVECTORSIZE 30   /* should be a multiple of 3 */

int main(int argc, char **argv)
{
    pcre *regex;
    const char *error;
    int erroffset;
    int capturevector[CAPTUREVECTORSIZE];
    int rc, i;
    const char *pattern = "(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)";
    const char *text = "12/30/1969";

    /* Compile the Regex */
    regex = pcre_compile(
        pattern,
        PCRE_CASELESS,   /* OR'd mode modifiers */
        &error,          /* error message */
        &erroffset,      /* position in regex where error occurred */
        NULL);           /* use default locale */

    /* Handle compilation errors */
    if (regex == NULL)
    {
        printf("Compilation failed at offset %d: %s\n",
               erroffset, error);
        return 1;
    }

    rc = pcre_exec(
        regex,               /* compiled regular expression */
        NULL,                /* optional results from pcre_study */
        text,                /* input string */
        (int)strlen(text),   /* length of input string */
        0,                   /* starting position in input string */
        0,                   /* OR'd options */
        capturevector,       /* holds results of capture groups */
        CAPTUREVECTORSIZE);

    /* The compiled regex is no longer needed after matching */
    pcre_free(regex);

    /* Handle Match Errors */
    if (rc < 0)
    {
        switch (rc)
        {
            case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
            /*
               Handle other special cases if you like
            */
            default: printf("Matching error %d\n", rc); break;
        }
        return 1;
    }

    /* Match succeeded */
    printf("Match succeeded\n");

    /* A return of 0 means the output vector was too small for all groups */
    if (rc == 0)
    {
        rc = CAPTUREVECTORSIZE / 3;
        printf("ovector only has room for %d captured substrings\n",
               rc - 1);
    }

    /* Show capture groups */
    for (i = 0; i < rc; i++)
    {
        const char *substring_start = text + capturevector[2*i];
        int substring_length = capturevector[2*i+1]
                               - capturevector[2*i];
        printf("%2d: %.*s\n", i, substring_length, substring_start);
    }
    return 0;
}
1.7.5 Other Resources