1.4 Java (java.util.regex)
Java 1.4 supports regular
expressions with Sun's
java.util.regex package. Although there are
competing packages available for previous versions of Java, Sun's package is
poised to become the standard. Sun's package uses a
Traditional NFA match engine. For an explanation of the rules behind
a Traditional NFA engine, see Section 1.2.
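One visible consequence of a Traditional NFA engine is that alternation is ordered: the first alternative that allows the overall match to succeed wins, even when a later alternative would match more text. The following is a minimal sketch of that behavior; the class name NfaOrderDemo and the helper firstMatch are illustrative and not part of the package.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class NfaOrderDemo {
    // Return the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The left alternative is tried first and succeeds, so "b" is left over.
        System.out.println(firstMatch("a|ab", "ab"));   // prints "a"
        // Reordering the alternatives changes the result.
        System.out.println(firstMatch("ab|a", "ab"));   // prints "ab"
    }
}
```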
1.4.1 Supported Metacharacters
java.util.regex supports the
metacharacters and metasequences
listed in Table 1-10 through Table 1-14. For expanded definitions of each
metacharacter, see Section 1.2.1.
Table 1-10. Character representations
\a | Alert (bell).
\b | Backspace, x08, supported only in character class.
\e | ESC character, x1B.
\n | Newline, x0A.
\r | Carriage return, x0D.
\f | Form feed, x0C.
\t | Horizontal tab, x09.
\0octal | Character specified by a one-, two-, or three-digit octal code.
\xhex | Character specified by a two-digit hexadecimal code.
\uhex | Unicode character specified by a four-digit hexadecimal code.
\cchar | Named control character.
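As a rough illustration of the character representations in Table 1-10, the sketch below checks several escapes against the characters they should denote. The class name CharEscapeDemo and the helper matchesEscape are assumptions made for this example. Each regex is written with a doubled backslash so that the escape reaches the regex engine rather than being interpreted by the Java compiler.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CharEscapeDemo {
    // True if the whole input is matched by the given regex escape.
    static boolean matchesEscape(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesEscape("\\t", "\t"));       // horizontal tab, x09
        System.out.println(matchesEscape("\\x41", "A"));      // two-digit hexadecimal code
        System.out.println(matchesEscape("\\u0041", "A"));    // four-digit Unicode code
        System.out.println(matchesEscape("\\0101", "A"));     // octal 101 is decimal 65
        System.out.println(matchesEscape("\\cJ", "\n"));      // Control-J is x0A
        System.out.println(matchesEscape("\\e", "\u001B"));   // ESC, x1B
    }
}
```

Every line printed by main should be true.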
Table 1-11. Character classes and class-like constructs
[...] | A single character listed or contained in a listed range.
[^...] | A single character not listed and not contained within a listed range.
. | Any character, except a line terminator (unless DOTALL mode).
\w | Word character, [a-zA-Z0-9_].
\W | Non-word character, [^a-zA-Z0-9_].
\d | Digit, [0-9].
\D | Non-digit, [^0-9].
\s | Whitespace character, [ \t\n\f\r\x0B].
\S | Non-whitespace character, [^ \t\n\f\r\x0B].
\p{prop} | Character contained by given POSIX character class, Unicode property, or Unicode block.
\P{prop} | Character not contained by given POSIX character class, Unicode property, or Unicode block.
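The sketch below exercises a few rows of Table 1-11: a listed range, its negation, the dot with and without DOTALL, and a POSIX class. The class name CharClassDemo and its helper are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CharClassDemo {
    // True if regex, compiled with the given flags, matches the entire input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-c]x", 0, "bx"));               // true: b is in the range
        System.out.println(matches("[^a-c]x", 0, "bx"));              // false: b is excluded
        System.out.println(matches("a.b", 0, "a\nb"));                // false: dot skips line terminators
        System.out.println(matches("a.b", Pattern.DOTALL, "a\nb"));   // true in DOTALL mode
        System.out.println(matches("\\w+", 0, "snake_case9"));        // true: letters, digits, underscore
        System.out.println(matches("\\p{Punct}", 0, "!"));            // true: POSIX punctuation class
        System.out.println(matches("\\P{Punct}", 0, "!"));            // false: negated form
    }
}
```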
Table 1-12. Anchors and other zero-width tests
^ | Start of string, or after any newline if in MULTILINE mode.
\A | Beginning of string, in any match mode.
$ | End of string, or before any newline if in MULTILINE mode.
\Z | End of string but before any final line terminator, in any match mode.
\z | End of string, in any match mode.
\b | Word boundary.
\B | Not-word-boundary.
\G | Beginning of current search.
(?=...) | Positive lookahead.
(?!...) | Negative lookahead.
(?<=...) | Positive lookbehind.
(?<!...) | Negative lookbehind.
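To make the zero-width tests in Table 1-12 more concrete, here is a small sketch that compares ^ with and without MULTILINE, contrasts \Z with \z, and uses one lookbehind and one lookahead. AnchorDemo, countMatches, and firstMatch are illustrative names, not part of java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class AnchorDemo {
    // Count the non-overlapping matches of regex in input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Return the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(countMatches("^\\w+", 0, "one\ntwo"));                  // 1
        System.out.println(countMatches("^\\w+", Pattern.MULTILINE, "one\ntwo"));  // 2
        System.out.println(countMatches("end\\Z", 0, "the end\n"));                // 1
        System.out.println(countMatches("end\\z", 0, "the end\n"));                // 0
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost: $42"));               // 42
        System.out.println(firstMatch("\\w+(?=!)", "hello world!"));               // world
    }
}
```

The lookbehind and lookahead examples return only the digits and the word, because the dollar sign and the exclamation mark are tested but not consumed.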
Table 1-13. Comments and mode modifiers
Pattern.UNIX_LINES | d | Treat \n as the only line terminator.
Pattern.DOTALL | s | Dot (.) matches any character, including a line terminator.
Pattern.MULTILINE | m | ^ and $ match next to embedded line terminators.
Pattern.COMMENTS | x | Ignore whitespace and allow embedded comments starting with #.
Pattern.CASE_INSENSITIVE | i | Case-insensitive match for ASCII characters.
Pattern.UNICODE_CASE | u | When combined with CASE_INSENSITIVE, case-insensitive match for Unicode characters.
Pattern.CANON_EQ | | Unicode "canonical equivalence" mode, in which a character and a sequence of a base character plus combining characters are treated as equal when they have identical visual representations.
(?mode) | | Turn listed modes (idmsux) on for the rest of the subexpression.
(?-mode) | | Turn listed modes (idmsux) off for the rest of the subexpression.
(?mode:...) | | Turn listed modes (idmsux) on within parentheses.
(?-mode:...) | | Turn listed modes (idmsux) off within parentheses.
#... | | Treat rest of line as a comment in COMMENTS (x) mode.
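The modes in Table 1-13 can be set either with Pattern constants or inline. The following sketch, which assumes a hypothetical class ModeDemo with a helper named matches, compares the two forms, shows a mode scoped to parentheses, and uses COMMENTS mode with an embedded # comment. Non-ASCII letters are written as Java Unicode escapes so the example does not depend on source-file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class ModeDemo {
    // True if regex, compiled with the given flags, matches the entire input.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Flag constant and inline modifier are two ways to request the same mode.
        System.out.println(matches("spider", Pattern.CASE_INSENSITIVE, "SPIDER")); // true
        System.out.println(matches("(?i)spider", 0, "SPIDER"));                    // true

        // A mode inside (?i:...) applies only within those parentheses.
        System.out.println(matches("(?i:spider)man", 0, "SPIDERman"));             // true
        System.out.println(matches("(?i:spider)man", 0, "SPIDERMAN"));             // false

        // COMMENTS mode ignores whitespace and treats # to end of line as a comment.
        String phone = "\\d{3}  # exchange\n"
                     + "-\\d{4} # number";
        System.out.println(matches(phone, Pattern.COMMENTS, "555-1234"));          // true

        // CASE_INSENSITIVE alone covers ASCII only; add UNICODE_CASE for other letters.
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println(matches("\u00e9", ci, "\u00c9"));                        // false
        System.out.println(matches("\u00e9", ci | Pattern.UNICODE_CASE, "\u00c9")); // true
    }
}
```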
Table 1-14. Grouping, capturing, conditional, and control
(...) | Group subpattern and capture submatch into \1, \2,... and $1, $2,....
\n | Contains text matched by the nth capture group.
$n | In a replacement string, contains text matched by the nth capture group.
(?:...) | Groups subpattern, but does not capture submatch.
(?>...) | Disallow backtracking for text matched by subpattern.
...|... | Try subpatterns in alternation.
* | Match 0 or more times.
+ | Match 1 or more times.
? | Match 1 or 0 times.
{n} | Match exactly n times.
{n,} | Match at least n times.
{x,y} | Match at least x times, but no more than y times.
*? | Match 0 or more times, but as few times as possible.
+? | Match 1 or more times, but as few times as possible.
?? | Match 0 or 1 times, but as few times as possible.
{n,}? | Match at least n times, but as few times as possible.
{x,y}? | Match at least x times, no more than y times, and as few times as possible.
*+ | Match 0 or more times, and never backtrack.
++ | Match 1 or more times, and never backtrack.
?+ | Match 0 or 1 times, and never backtrack.
{n}+ | Match exactly n times, and never backtrack.
{n,}+ | Match at least n times, and never backtrack.
{x,y}+ | Match at least x times, no more than y times, and never backtrack.
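The difference between the greedy, lazy, and possessive quantifiers in Table 1-14 is easiest to see side by side. The sketch below is one way to compare them, along with a backreference and a $n replacement; QuantifierDemo, firstMatch, and swapNames are hypothetical names introduced for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class QuantifierDemo {
    // Return the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Use $1 and $2 in the replacement to swap two captured words.
    static String swapNames(String name) {
        return name.replaceAll("(\\w+) (\\w+)", "$2, $1");
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.+>", html));        // greedy: <b>bold</b>
        System.out.println(firstMatch("<.+?>", html));       // lazy: <b>

        // Greedy \d+ gives back one digit so the final \d can match.
        System.out.println(firstMatch("\\d+\\d", "123"));    // 123
        // Possessive \d++ and the atomic group never give a digit back.
        System.out.println(firstMatch("\\d++\\d", "123"));   // null
        System.out.println(firstMatch("(?>\\d+)\\d", "123")); // null

        // \1 must repeat whatever the first group captured.
        System.out.println(firstMatch("(\\w)\\1", "balloon")); // ll
        System.out.println(swapNames("John Smith"));            // Smith, John
    }
}
```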
1.4.2 Regular Expression Classes and Interfaces
Java 1.4 introduces two main
classes,
java.util.regex.Pattern and
java.util.regex.Matcher; an exception,
java.util.regex.PatternSyntaxException; and a new interface,
CharSequence. Additionally, Sun upgraded the
String class to implement the
CharSequence interface and to provide basic
pattern-matching methods. Pattern objects are
compiled regular expressions that can be applied to many strings. A
Matcher object is a match of one
Pattern applied to one string (or any object
implementing CharSequence).
Backslashes in
regular expression String literals need to be
escaped. So \n (newline) becomes
\\n when used in a Java String
literal that is to be used as a regular expression.
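A short sketch of the escaping rule: each backslash intended for the regex engine is doubled in Java source, so a regex that matches one literal backslash needs four. The class name BackslashDemo and its helpers are assumptions made for this example.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class BackslashDemo {
    // Split a Windows-style path on literal backslashes.
    // The regex is two backslashes, which Java source writes as four.
    static String[] splitPath(String path) {
        return path.split("\\\\");
    }

    // Split on a literal dot; the regex \. is written "\\." in Java source.
    static String[] splitOnDot(String s) {
        return s.split("\\.");
    }

    public static void main(String[] args) {
        System.out.println("2024".matches("\\d+"));                           // regex \d+
        System.out.println(Arrays.toString(splitPath("C:\\temp\\log.txt")));  // [C:, temp, log.txt]
        System.out.println(Arrays.toString(splitOnDot("a.b.c")));             // [a, b, c]
    }
}
```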
java.lang.String
Description
New methods for pattern matching.
Methods
- boolean matches(String regex)
  Return true if regex matches the entire String.
- String[ ] split(String regex)
  Return an array of the substrings surrounding matches of regex.
- String[ ] split(String regex, int limit)
  Return an array of the substrings surrounding the first limit-1 matches of regex.
- String replaceFirst(String regex, String replacement)
  Replace the first substring matched by regex with replacement.
- String replaceAll(String regex, String replacement)
  Replace all substrings matched by regex with replacement.
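The String convenience methods can be sketched as follows. StringRegexDemo and its helper methods are hypothetical names used only to keep the example testable; the calls inside them are the String methods listed above.

```java
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class StringRegexDemo {
    // matches() must cover the entire string, not just part of it.
    static boolean isIsoDate(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // Split on commas with optional trailing whitespace.
    static String[] fields(String line) {
        return line.split(",\\s*");
    }

    // With limit 2 the regex is applied at most once: head plus remainder.
    static String[] headAndRest(String line) {
        return line.split(",", 2);
    }

    public static void main(String[] args) {
        System.out.println(isIsoDate("2003-08-15"));                      // true
        System.out.println(isIsoDate("on 2003-08-15"));                   // false
        System.out.println(Arrays.toString(fields("a, b,c")));            // [a, b, c]
        System.out.println(Arrays.toString(headAndRest("a,b,c")));        // [a, b,c]
        System.out.println("2003-08-15".replaceFirst("-", "/"));          // 2003/08-15
        System.out.println("2003-08-15".replaceAll("-", "/"));            // 2003/08/15
    }
}
```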
java.util.regex.Pattern | extends Object and implements Serializable
Description
Models a regular expression pattern.
Methods
- static Pattern compile(String regex)
  Construct a Pattern object from regex.
- static Pattern compile(String regex, int flags)
  Construct a new Pattern object out of regex and the OR'd mode-modifier constants flags.
- int flags( )
  Return the Pattern's mode modifiers.
- Matcher matcher(CharSequence input)
  Construct a Matcher object that will match this Pattern against input.
- static boolean matches(String regex, CharSequence input)
  Return true if regex matches the entire string input.
- String pattern( )
  Return the regular expression used to create this Pattern.
- String[ ] split(CharSequence input)
  Return an array of the substrings surrounding matches of this Pattern in input.
- String[ ] split(CharSequence input, int limit)
  Return an array of the substrings surrounding the first limit-1 matches of this Pattern in input.
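Because a Pattern is a compiled regular expression that can be applied to many strings, a common arrangement is to compile it once and reuse it. The sketch below assumes a hypothetical class PatternReuseDemo with two precompiled patterns; the field and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class PatternReuseDemo {
    // Compiled once and shared by every call below.
    private static final Pattern COMMA = Pattern.compile("\\s*,\\s*");
    private static final Pattern HERO =
        Pattern.compile("spider[- ]?man", Pattern.CASE_INSENSITIVE);

    // Split a line on commas, ignoring whitespace around them.
    static String[] fields(String line) {
        return COMMA.split(line);
    }

    // With limit 2 the pattern is applied at most once.
    static String[] headAndRest(String line) {
        return COMMA.split(line, 2);
    }

    // matcher(...).matches() requires the whole input to match.
    static boolean isHero(String name) {
        return HERO.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("a , b,c")));       // [a, b, c]
        System.out.println(Arrays.toString(headAndRest("a , b,c")));  // [a, b,c]
        System.out.println(isHero("SPIDER-MAN"));                      // true
        System.out.println(isHero("Batman"));                          // false
        System.out.println(HERO.pattern());                            // spider[- ]?man
        // The one-shot static form compiles the regex on every call.
        System.out.println(Pattern.matches("\\d+", "123"));           // true
    }
}
```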
java.util.regex.Matcher
Description
Models a regular expression pattern matcher and pattern matching results.
Methods
- Matcher appendReplacement(StringBuffer sb, String replacement)
  Append substring preceding match and replacement to sb.
- StringBuffer appendTail(StringBuffer sb)
  Append substring following end of last match to sb.
- int end( )
  Index of the first character after the end of the match.
- int end(int group)
  Index of the first character after the text captured by group.
- boolean find( )
  Find the next match in the input string.
- boolean find(int start)
  Reset the matcher and find the next match starting at character position start.
- String group( )
  Text matched by this Pattern.
- String group(int group)
  Text captured by capture group, group.
- int groupCount( )
  Number of capturing groups in Pattern.
- boolean lookingAt( )
  Return true if Pattern matches at the beginning of the input.
- boolean matches( )
  Return true if Pattern matches entire input string.
- Pattern pattern( )
  Return Pattern object used by this Matcher.
- String replaceAll(String replacement)
  Replace every match with replacement.
- String replaceFirst(String replacement)
  Replace first match with replacement.
- Matcher reset( )
  Reset this matcher so that the next match starts at the beginning of the input string.
- Matcher reset(CharSequence input)
  Reset this matcher with new input.
- int start( )
  Index of first character matched.
- int start(int group)
  Index of first character matched in captured substring, group.
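The Matcher methods are usually combined in a find loop. The following sketch shows two such loops: one that reports each match with its start and end indexes, and one that uses appendReplacement and appendTail to compute each replacement in code. MatcherDemo, describe, and doubleNumbers are hypothetical names for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class MatcherDemo {
    // List every match as text@start-end, separated by spaces.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.group()).append('@').append(m.start()).append('-').append(m.end());
        }
        return sb.toString();
    }

    // Replace every integer in the input with twice its value.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before this match, then the computed replacement.
            m.appendReplacement(sb, String.valueOf(doubled));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\\d+", "3 apples, 10 pears"));  // 3@0-1 10@10-12
        System.out.println(doubleNumbers("3 apples, 10 pears"));     // 6 apples, 20 pears

        Matcher m = Pattern.compile("\\d+").matcher("42abc");
        System.out.println(m.lookingAt());  // true: match at the beginning of the input
        System.out.println(m.matches());    // false: "abc" is left over
    }
}
```

Note that end( ) is exclusive, so the match "10" that starts at index 10 ends at index 12.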
java.util.regex.PatternSyntaxException
Description
Thrown to indicate a syntax error in a regular expression pattern.
Methods
- PatternSyntaxException(String desc, String regex, int index)
  Construct an instance of this class.
- String getDescription( )
  Return error description.
- int getIndex( )
  Return error index.
- String getMessage( )
  Return a multiline error message containing error description, index, regular expression pattern, and indication of the position of the error within the pattern.
- String getPattern( )
  Return the regular expression pattern that threw the exception.
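When a pattern comes from user input or configuration, it can be validated by attempting to compile it and inspecting the exception. This is a minimal sketch; SyntaxCheckDemo and tryCompile are hypothetical names, and the exact description text and index reported for a given error may differ between Java versions, so the example prints them without assuming particular values.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class, for illustration only.
class SyntaxCheckDemo {
    // Return null when regex compiles, otherwise the exception that was thrown.
    static PatternSyntaxException tryCompile(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        PatternSyntaxException e = tryCompile("(unclosed");
        if (e != null) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index:       " + e.getIndex());
            System.out.println("Pattern:     " + e.getPattern());
            System.out.println(e.getMessage());  // multiline summary of the above
        }
        System.out.println(tryCompile("a+b") == null);  // true: valid pattern
    }
}
```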
java.lang.CharSequence | implemented by CharBuffer, String, StringBuffer
Description
Defines an interface for read-only access so that regular expression patterns may be applied to a sequence of characters.
Methods
- char charAt(int index)
  Return the character at the zero-based position, index.
- int length( )
  Return the number of characters in the sequence.
- CharSequence subSequence(int start, int end)
  Return a subsequence including the start index and excluding the end index.
- String toString( )
  Return a String representation of the sequence.
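Since Pattern.matcher accepts any CharSequence, the same code can search a String, a StringBuffer, or a CharBuffer without copying the text into a String first. The sketch below illustrates this with a hypothetical class CharSequenceDemo and helper countWords.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class CharSequenceDemo {
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Count runs of word characters in any CharSequence implementation.
    static int countWords(CharSequence input) {
        Matcher m = WORD.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("one two"));                          // 2
        System.out.println(countWords(new StringBuffer("one two three")));  // 3
        System.out.println(countWords(CharBuffer.wrap("a b c d")));         // 4
    }
}
```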
1.4.3 Unicode Support
This package supports Unicode 3.0,
although \w, \W,
\d, \D, \s,
and \S support only ASCII. You can use the
equivalent Unicode properties \p{L},
\P{L}, \p{Nd},
\P{Nd}, \p{Z}, and
\P{Z}. The word boundary sequences,
\b and \B, do understand
Unicode.
For supported Unicode properties and blocks, see Table 1-2. This package supports only the short
property names, such as \p{Lu}, and not
\p{Uppercase_Letter}. Block names require the
In prefix and support only the name form without
spaces or underscores; for example,
\p{InGreekExtended}, not
\p{In_Greek_Extended} or
\p{In Greek Extended}.
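The following sketch contrasts the ASCII-only shorthand classes with their Unicode property counterparts and shows the block-name form. UnicodeDemo and its helper are hypothetical names; non-ASCII characters are written as Java Unicode escapes so the example does not depend on source-file encoding. It reflects the default behavior described above, with no additional flags set.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class UnicodeDemo {
    // True if regex matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9";        // Latin small letter e with acute
        String arabicThree = "\u0663";   // Arabic-Indic digit three
        String alpha = "\u03b1";         // Greek small letter alpha

        System.out.println(matches("\\w", eAcute));              // false: ASCII only
        System.out.println(matches("\\p{L}", eAcute));           // true: Unicode letter
        System.out.println(matches("\\d", arabicThree));         // false: ASCII only
        System.out.println(matches("\\p{Nd}", arabicThree));     // true: decimal digit
        System.out.println(matches("\\p{Lu}", "A"));             // true: uppercase letter
        System.out.println(matches("\\p{InGreek}", alpha));      // true: Greek block
        System.out.println(matches("\\p{InGreekExtended}", "\u1f00")); // true: Greek Extended block
    }
}
```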
1.4.4 Examples
Example 1-5. Simple match
//Match Spider-Man, Spiderman, SPIDER-MAN, etc.
public class StringRegexTest {
    public static void main(String[ ] args) throws Exception {
        String dailybugle = "Spider-Man Menaces City!";
        //regex must match entire string
        String regex = "(?i).*spider[- ]?man.*";
        if (dailybugle.matches(regex)) {
            //do something
        }
    }
}
Example 1-6. Match and capture group
//Match dates formatted like MM/DD/YYYY, MM-DD-YY,...
import java.util.regex.*;
public class MatchTest {
    public static void main(String[ ] args) throws Exception {
        String date = "12/30/1969";
        Pattern p =
            Pattern.compile("(\\d\\d)[-/](\\d\\d)[-/](\\d\\d(?:\\d\\d)?)");
        Matcher m = p.matcher(date);
        if (m.find( )) {
            String month = m.group(1);
            String day = m.group(2);
            String year = m.group(3);
        }
    }
}
Example 1-7. Simple substitution
//Convert <br> to <br /> for XHTML compliance
import java.util.regex.*;
public class SimpleSubstitutionTest {
    public static void main(String[ ] args) {
        String text = "Hello world. <br>";
        try {
            Pattern p = Pattern.compile("<br>", Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(text);
            String result = m.replaceAll("<br />");
        }
        catch (PatternSyntaxException e) {
            System.out.println(e.getMessage( ));
        }
        //System.exit requires an integer status code
        catch (Exception e) { System.exit(1); }
    }
}
Example 1-8. Harder substitution
//urlify - turn URLs into HTML links
import java.util.regex.*;
public class Urlify {
    public static void main (String[ ] args) throws Exception {
        String text = "Check the website, http://www.oreilly.com/catalog/repr.";
        String regex =
              "\\b                           # start at word\n"
            + "                              # boundary\n"
            + "(                             # capture to $1\n"
            + "(https?|telnet|gopher|file|wais|ftp) : \n"
            + "                              # resource and colon\n"
            + "[\\w/\\#~:.?+=&%@!\\-] +?     # one or more valid\n"
            + "                              # characters\n"
            + "                              # but take as little\n"
            + "                              # as possible\n"
            + ")\n"
            + "(?=                           # lookahead\n"
            + "[.:?\\-] *                    # for possible punc\n"
            + "(?: [^\\w/\\#~:.?+=&%@!\\-]   # invalid character\n"
            + "| $ )                         # or end of string\n"
            + ")";
        //combine mode-modifier constants with bitwise OR
        Pattern p = Pattern.compile(regex,
            Pattern.CASE_INSENSITIVE | Pattern.COMMENTS);
        Matcher m = p.matcher(text);
        String result = m.replaceAll("<a href=\"$1\">$1</a>");
    }
}
1.4.5 Other Resources