Java provides a powerful regular expression API under the java.util.regex package. This tutorial explains how to use the regular expression API.
Regular Expressions
A regular expression is a text pattern used for searching text. In other words, it describes what to look for, so that occurrences of the pattern can be found in a text. For example, you can use regular expressions to search for email addresses or hyperlinks in web pages.
Regular Expression Example
Here is an example of a simple Java regular expression that searches for http:// in a text:
    import java.util.regex.Pattern;

    String text = "This is the text to be searched " +
                  "for occurrences of the http:// pattern.";
    String pattern = ".*http://.*";
    boolean matches = Pattern.matches(pattern, text);
    System.out.println("matches = " + matches);
The example code does not check whether the http:// it finds is part of a valid hyperlink, with a domain name and suffix (.com, .net, etc.). The code simply looks for an occurrence of the string http://.
The Java 6 Regular Expression API
This tutorial covers the regular expression API as it exists in Java 6.
Pattern (java.util.regex.Pattern)
The class java.util.regex.Pattern, referred to below simply as Pattern, is the main entry point to the Java regular expression API. Whenever you need to use regular expressions, you start with the Pattern class.
Pattern.matches()
The most direct way to check whether a regular expression pattern matches a piece of text is to call the static method Pattern.matches(). The example is as follows:
    String text = "This is the text to be searched " +
                  "for occurrences of the pattern.";
    String pattern = ".*is.*";
    boolean matches = Pattern.matches(pattern, text);
    System.out.println("matches = " + matches);
The code above checks whether the word "is" appears in the variable text, allowing zero or more characters before and after "is" (specified by the .* parts).
The Pattern.matches() method is fine for checking a pattern against a text a single time, and when the default settings of the Pattern class are appropriate.
If you need to find multiple occurrences, access the individual matched texts, or use non-default settings, you need to obtain a Pattern instance through the Pattern.compile() method.
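One detail worth seeing in code is that Pattern.matches() tests the regular expression against the entire text, which is why the example wraps the word in .* on both sides. The following is a minimal sketch; the class name WholeTextMatchDemo and the helper matchesWhole() are made up for illustration.

```java
import java.util.regex.Pattern;

class WholeTextMatchDemo {

    // Returns true only if the regex matches the ENTIRE text.
    static boolean matchesWhole(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        String text = "This is the text to be searched";

        // "is" on its own does not cover the whole text.
        System.out.println("is     -> " + matchesWhole("is", text));

        // Surrounding the word with .* lets the pattern span the whole text.
        System.out.println(".*is.* -> " + matchesWhole(".*is.*", text));
    }
}
```

The first call reports false and the second true, even though the word "is" occurs in the text in both cases.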
Pattern.compile()
If you need to match a regular expression against text more than once, create a Pattern object through the Pattern.compile() method. An example:
    String text = "This is the text to be searched " +
                  "for occurrences of the http:// pattern.";
    String patternString = ".*http://.*";
    Pattern pattern = Pattern.compile(patternString);
You can also pass flags to the compile() method:
    Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
The Pattern class contains a number of flags (int constants) that control how the Pattern matches. The flag in the code above makes the pattern matching ignore case.
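To make the effect of the flag visible, the sketch below runs the same regular expression with and without Pattern.CASE_INSENSITIVE. The class name, the helper matchesIgnoringCase() and the sample text are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

class CaseInsensitiveFlagDemo {

    // Compiles the regex with CASE_INSENSITIVE and tests it against the whole text.
    static boolean matchesIgnoringCase(String regex, String text) {
        Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        // Hypothetical sample text with an upper-case scheme.
        String text = "Fetch HTTP://example.com before noon";
        String regex = ".*http://.*";

        System.out.println("default flags:    " + Pattern.matches(regex, text));
        System.out.println("case insensitive: " + matchesIgnoringCase(regex, text));
    }
}
```

Because the flags are int constants, several of them can be combined with the bitwise OR operator, for example Pattern.CASE_INSENSITIVE | Pattern.DOTALL.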
Pattern.matcher()
Once you have a Pattern object, you can obtain a Matcher object from it. The Matcher instance is used to match the pattern against a text. An example:
    Matcher matcher = pattern.matcher(text);
The Matcher class has a matches() method that checks whether the text matches the pattern. The following is a complete Matcher example:
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "This is the text to be searched " +
                  "for occurrences of the http:// pattern.";
    String patternString = ".*http://.*";
    Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(text);
    boolean matches = matcher.matches();
    System.out.println("matches = " + matches);
Pattern.split()
The split() method of the Pattern class can use regular expressions as delimiters to split text into an array of String type. Example:
    String text = "A sep Text sep With sep Many sep Separators";
    String patternString = "sep";
    Pattern pattern = Pattern.compile(patternString);
    String[] split = pattern.split(text);
    System.out.println("split.length = " + split.length);
    for (String element : split) {
        System.out.println("element = " + element);
    }
In the above example, the text is divided into an array containing 5 strings.
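Since the delimiter is itself a regular expression, it does not have to be a fixed string. As a rough sketch (the class name, the helper splitOnCommas() and the sample input are all invented here), the following splits on a comma together with any whitespace around it:

```java
import java.util.regex.Pattern;

class SplitDemo {

    // Delimiter: a comma with optional whitespace on either side.
    static String[] splitOnCommas(String text) {
        return Pattern.compile("\\s*,\\s*").split(text);
    }

    public static void main(String[] args) {
        for (String element : splitOnCommas("red, green ,blue")) {
            System.out.println("element = [" + element + "]");
        }
    }
}
```

The whitespace next to each comma is consumed as part of the delimiter, so the resulting elements need no trimming.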
Pattern.pattern()
The pattern() method of the Pattern class returns the regular expression that was used to create the Pattern object. Example:
    String patternString = "sep";
    Pattern pattern = Pattern.compile(patternString);
    String pattern2 = pattern.pattern();
The value of pattern2 in the code above is sep, the same as the patternString variable.
Matcher (java.util.regex.Matcher)
The java.util.regex.Matcher class is used to find multiple occurrences of a regular expression in a text. A Matcher can also be used to match the same regular expression against several texts.
Matcher has many useful methods; see the official JavaDoc for the full list. Only the core methods are covered here.
The following code shows how to use a Matcher:
    String text = "This is the text to be searched " +
                  "for occurrences of the http:// pattern.";
    String patternString = ".*http://.*";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(text);
    boolean matches = matcher.matches();
First a Pattern is created, then a Matcher is obtained from it, and finally matches() is called. It returns true if the text matches the pattern and false if it does not.
You can do more with Matcher.
Create Matcher
Create a Matcher through the matcher() method of Pattern.
    String text = "This is the text to be searched " +
                  "for occurrences of the http:// pattern.";
    String patternString = ".*http://.*";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(text);
matches()
The matches() method of the Matcher class matches the regular expression against the whole text that was passed in when the Matcher was created:
    boolean matches = matcher.matches();
If the text matches the regular expression, matches() returns true. Otherwise it returns false.
The matches() method cannot be used to find multiple occurrences of a regular expression in a text. For that, use the find(), start() and end() methods.
lookingAt()
lookingAt() is similar to the matches() method. The main difference is that lookingAt() only matches the regular expression against the beginning of the text, while matches() matches it against the entire text. In other words, if the regular expression matches the beginning of the text but not the entire text, lookingAt() returns true and matches() returns false. Example:
    String text = "This is the text to be searched " +
                  "for occurrences of the http:// pattern.";
    String patternString = "This is the";
    Pattern pattern = Pattern.compile(patternString, Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(text);
    System.out.println("lookingAt = " + matcher.lookingAt());
    System.out.println("matches = " + matcher.matches());
The example above matches the regular expression "This is the" against the beginning of the text and against the entire text. Matching against the beginning of the text (lookingAt()) returns true.
Matching against the entire text (matches()) returns false, because the text contains additional characters, and the regular expression requires the text to be exactly "This is the", with no extra characters before or after it.
find() + start() + end()
The find() method searches for occurrences of the regular expression in the text that was passed to Pattern.matcher(text) when the Matcher was created. If the text contains several matches, the first call to find() finds the first one, and each subsequent call finds the next.
start() and end() return the start and end positions of each match within the text. More precisely, end() returns the index of the character just after the last matched character. As a result, the return values of start() and end() can be passed directly to String.substring().
    String text = "This is the text which is to be searched " +
                  "for occurrences of the word 'is'.";
    String patternString = "is";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(text);
    int count = 0;
    while (matcher.find()) {
        count++;
        System.out.println("found: " + count + " : " +
                           matcher.start() + " - " + matcher.end());
    }
This example finds the pattern "is" four times in the text. The output is:
    found: 1 : 2 - 4
    found: 2 : 5 - 7
    found: 3 : 23 - 25
    found: 4 : 70 - 72
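The remark about String.substring() can be illustrated with a short sketch. The class name SubstringDemo, the helper extract() and the sample input are hypothetical; only find(), start(), end() and substring() come from the text above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubstringDemo {

    // Collects the text of every match by slicing the input with start() and end().
    static List<String> extract(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        List<String> found = new ArrayList<String>();
        while (matcher.find()) {
            // end() is exclusive, which is exactly what substring() expects.
            found.add(text.substring(matcher.start(), matcher.end()));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract("\\d+", "a1 bb22 ccc333"));
    }
}
```

No index arithmetic is needed, because both substring() and the start()/end() pair use an inclusive start and an exclusive end.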
reset()
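The reset() method resets the matching state kept inside the Matcher. Once find() has started matching, the Matcher keeps track of how far into the text it has searched. Calling reset() makes the search start from the beginning of the text again.
There is also a reset(CharSequence) method. It resets the Matcher and at the same time takes a new text as a parameter, which replaces the text the Matcher was originally created with.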
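A small sketch of both variants follows. The class name ResetDemo, the helper countInEach() and the sample texts are assumptions for this illustration; the helper reuses a single Matcher for several texts through reset(CharSequence).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResetDemo {

    // Counts the matches in each text, reusing ONE Matcher via reset(CharSequence).
    static int[] countInEach(String regex, String... texts) {
        Matcher matcher = Pattern.compile(regex).matcher("");
        int[] counts = new int[texts.length];
        for (int i = 0; i < texts.length; i++) {
            matcher.reset(texts[i]);
            while (matcher.find()) {
                counts[i]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("is").matcher("This is it");
        matcher.find();
        matcher.find();
        System.out.println("second match starts at " + matcher.start());

        // reset() forgets the search position, so find() starts over.
        matcher.reset();
        matcher.find();
        System.out.println("after reset, match starts at " + matcher.start());

        System.out.println(Arrays.toString(countInEach("is", "This is it", "no match", "is")));
    }
}
```

Reusing one Matcher this way avoids creating a new Matcher object for every input text.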
group()
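Suppose you want to find URL links in a text and extract the links that are found. You could do this with the start() and end() methods, but it is easier with the group() method.
Groups are marked with parentheses in a regular expression. For example:
(John)
This regular expression matches John. The parentheses are not part of the text to be matched; they define a group. Once the regular expression has matched the text, you can access the part that was matched inside the group.
Use the group(int groupNo) method to access a group. A regular expression can have several groups, each marked by a pair of parentheses. To access the text matched by a particular group, pass the group number to group(int groupNo).
group(0) refers to the entire regular expression. Groups marked by parentheses are numbered starting from 1.
    String text = "John writes about this, and John writes about that," +
                  " and John writes about everything. ";
    String patternString1 = "(John)";
    Pattern pattern = Pattern.compile(patternString1);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        System.out.println("found: " + matcher.group(1));
    }
The code above searches the text for the word John. From each match it extracts group 1, the part marked by the parentheses. The output is:
    found: John
    found: John
    found: John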
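Returning to the URL scenario mentioned at the start of this section, a rough sketch might look as follows. The regular expression here is deliberately simplistic and is an assumption of this example, not a complete URL pattern; the class name, helper name and sample text are also made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlGroupDemo {

    // Simplified assumption: a URL is http:// or https:// followed by
    // non-whitespace characters. The parentheses make the URL group 1.
    private static final Pattern URL = Pattern.compile("(https?://\\S+)");

    static List<String> extractUrls(String text) {
        Matcher matcher = URL.matcher(text);
        List<String> urls = new ArrayList<String>();
        while (matcher.find()) {
            urls.add(matcher.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://example.com and https://example.org/docs for details";
        System.out.println(extractUrls(text));
    }
}
```

Compared with calling substring() with start() and end(), group(1) returns the matched text directly. Note that this simple pattern would also pick up punctuation that directly follows a URL.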
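Multiple Groups
As mentioned above, a regular expression can have multiple groups. For example:
(John) (.+?) 
This expression matches the text John followed by a space, then one or more characters, and finally another space. You may not be able to see the trailing space in the expression above.
Several characters in this expression have a special meaning. The dot . means any character. The + means one or more times, so .+ together means any character, one or more times. The ? after the + makes the quantifier match as little text as possible.
The complete code:
    String text = "John writes about this, and John Doe writes about that," +
                  " and John Wayne writes about everything.";
    String patternString1 = "(John) (.+?) ";
    Pattern pattern = Pattern.compile(patternString1);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        System.out.println("found: " + matcher.group(1) + " " + matcher.group(2));
    }
Note how the groups are referenced in the code. The output is:
    found: John writes
    found: John Doe
    found: John Wayne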
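Nested Groups
Groups can be nested inside other groups in a regular expression. For example:
((John) (.+?)) 
This is the expression from the previous example, now wrapped in one large group. (There is a space at the end of the expression.)
With nested groups, the group numbers are determined by the order of the opening parentheses. In the example above, group 1 is the large group, group 2 is the group containing John, and group 3 is the group containing .+?. This is important to know when you reference groups through group(int groupNo).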
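The following code shows how to use nested groups:
    String text = "John writes about this, and John Doe writes about that," +
                  " and John Wayne writes about everything.";
    String patternString1 = "((John) (.+?)) ";
    Pattern pattern = Pattern.compile(patternString1);
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        // Print each of the three groups between angle brackets.
        System.out.println("found: <" + matcher.group(1) +
                           "> <" + matcher.group(2) +
                           "> <" + matcher.group(3) + ">");
    }
The output is:
    found: <John writes> <John> <writes>
    found: <John Doe> <John> <Doe>
    found: <John Wayne> <John> <Wayne>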
replaceAll() + replaceFirst()
The replaceAll() and replaceFirst() methods replace parts of the text the Matcher searches. replaceAll() replaces every match of the regular expression; replaceFirst() replaces only the first match.
The Matcher is reset before the replacement is carried out, so matching starts from the beginning of the text.
An example:
    String text = "John writes about this, and John Doe writes about that," +
                  " and John Wayne writes about everything.";
    String patternString1 = "((John) (.+?)) ";
    Pattern pattern = Pattern.compile(patternString1);
    Matcher matcher = pattern.matcher(text);
    String replaceAll = matcher.replaceAll("Joe Blocks ");
    System.out.println("replaceAll = " + replaceAll);
    String replaceFirst = matcher.replaceFirst("Joe Blocks ");
    System.out.println("replaceFirst = " + replaceFirst);
The output is:
    replaceAll = Joe Blocks about this, and Joe Blocks writes about that,
        and Joe Blocks writes about everything.
    replaceFirst = Joe Blocks about this, and John Doe writes about that,
        and John Wayne writes about everything.
The line breaks and indentation in the output were added for readability.
Note that in the first string, every occurrence of John followed by a word has been replaced with Joe Blocks. In the second string, only the first occurrence has been replaced.
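The replacement string does not have to be fixed text. Matcher.replaceAll() also understands references such as $1 and $2 to the groups of the current match. The sketch below uses that to swap the two words; the class name, the helper swapNames() and the sample text are assumptions for illustration.

```java
import java.util.regex.Pattern;

class GroupReferenceDemo {

    // $1 and $2 in the replacement refer to groups 1 and 2 of each match.
    static String swapNames(String text) {
        return Pattern.compile("(John) (.+?) ")
                      .matcher(text)
                      .replaceAll("$2 $1 ");
    }

    public static void main(String[] args) {
        System.out.println(swapNames("John Doe writes about that"));
    }
}
```

A consequence of this feature is that a literal dollar sign in a replacement string has to be escaped, for example with Matcher.quoteReplacement().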
appendReplacement() + appendTail()
The appendReplacement() and appendTail() methods replace string tokens in the input text and append the resulting text to a StringBuffer.
When find() has found a match, you can call appendReplacement(). This appends a stretch of the input text to the StringBuffer, with the matched text replaced. Only the text from the end of the previous match up to and including the current match is copied.
appendReplacement() keeps track of what has already been copied into the StringBuffer, so you can keep calling find() until there are no more matches.
After the last match, part of the input text still has not been copied to the StringBuffer: the text from the end of the last match to the end of the input. Calling appendTail() copies this remainder into the StringBuffer as well.
    String text = "John writes about this, and John Doe writes about that," +
                  " and John Wayne writes about everything.";
    String patternString1 = "((John) (.+?)) ";
    Pattern pattern = Pattern.compile(patternString1);
    Matcher matcher = pattern.matcher(text);
    StringBuffer stringBuffer = new StringBuffer();
    while (matcher.find()) {
        matcher.appendReplacement(stringBuffer, "Joe Blocks ");
        System.out.println(stringBuffer.toString());
    }
    matcher.appendTail(stringBuffer);
    System.out.println(stringBuffer.toString());
Note that appendReplacement() is called inside the while loop, and appendTail() is called after the loop has finished. The output is:
    Joe Blocks 
    Joe Blocks about this, and Joe Blocks 
    Joe Blocks about this, and Joe Blocks writes about that, and Joe Blocks 
    Joe Blocks about this, and Joe Blocks writes about that, and Joe Blocks writes about everything.
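One reason to use this pair of methods instead of replaceAll() is that the replacement can be computed separately for each match. The following sketch upper-cases the word after John; the class name, the helper upperCaseWordAfterJohn() and the choice of transformation are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {

    // Rewrites every "John <word> " so that <word> is upper-cased.
    static String upperCaseWordAfterJohn(String text) {
        Matcher matcher = Pattern.compile("(John) (.+?) ").matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement =
                matcher.group(1) + " " + matcher.group(2).toUpperCase() + " ";
            // quoteReplacement() makes sure the computed text is inserted literally.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String text = "John writes about this, and John Doe writes about that," +
                      " and John Wayne writes about everything.";
        System.out.println(upperCaseWordAfterJohn(text));
    }
}
```

The structure is the same as in the example above: appendReplacement() inside the loop, appendTail() once after it.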
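Java Regular Expression Syntax
To use regular expressions effectively, you need to know the regular expression syntax. The syntax is complex and lets you write very advanced expressions. Mastering these rules takes a lot of practice.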
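In this text we will go through the basics of regular expression syntax by example. The focus is on the core concepts you need to understand in order to work with regular expressions, not on every detail. For a full explanation, see the Pattern class in the Java API documentation (JavaDoc).
Basic Syntax
Before introducing advanced features, let's take a quick look at the basic syntax of regular expressions.
Characters
Literal characters are among the most commonly used elements of regular expressions. A literal character simply matches that same character. For example: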
John
This simple expression will match the text John in an input text.
You can use any English character in an expression. You can also specify a character by its octal, hexadecimal or Unicode code. For example:
\0101
\x41
\u0041
All three expressions above represent the uppercase character A. The first uses the octal code (101, written with a leading 0 in Java regular expressions), the second the hexadecimal code (41), and the third the Unicode code (0041).
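When such escapes are written inside a Java string literal, each backslash has to be doubled so that it reaches the regular expression engine. A minimal sketch, with a made-up class name and helper:

```java
import java.util.regex.Pattern;

class CharEscapeDemo {

    // Returns true if the given regex matches the one-character text "A".
    static boolean matchesA(String regex) {
        return Pattern.matches(regex, "A");
    }

    public static void main(String[] args) {
        // Each backslash is doubled inside the Java string literal.
        System.out.println("octal:   " + matchesA("\\0101"));
        System.out.println("hex:     " + matchesA("\\x41"));
        System.out.println("unicode: " + matchesA("\\u0041"));
    }
}
```

All three calls report true, since each escape stands for the character A.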
Character Classes
A character class is a construct that lets you specify a set of characters to match against instead of just one. In other words, a character class matches a single character in the input text against the several characters allowed by the class. For example, to match the character a, b or c, the expression is:
[abc]
A pair of square brackets [] marks a character class. The square brackets themselves are not part of what is matched.
Character classes are useful in many ways. For example, this expression matches the word John with either an uppercase or a lowercase J as the first letter:
[Jj]ohn
The character class [Jj] matches J or j, and the rest of the expression, ohn, matches exactly the characters ohn.
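A quick way to check how [Jj]ohn behaves is to count its matches in a sample text. In this sketch the class name, the helper countMatches() and the sample text are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {

    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "John" and "john" match; "JOHN" does not, because ohn must be lower case.
        System.out.println(countMatches("[Jj]ohn", "John met john and JOHN"));
    }
}
```

Only the first letter is covered by the character class, so the all-uppercase JOHN is not counted.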
Predefined Character Classes
Some predefined character classes are available in regular expressions. For example, \d represents any digit, \s any whitespace character, and \w any word character.
Predefined character classes do not have to be enclosed in square brackets, although they can be combined inside them:
\d
[\d\s]
The first expression matches any digit, and the second matches any digit or whitespace character.
The complete list of predefined character classes is given at the end of this article.
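The difference between the two expressions can be seen by counting their matches in the same text. This is a small sketch with an invented class name, helper and sample text; note again the doubled backslashes in the Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {

    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a1 b22";
        System.out.println("digits:               " + countMatches("\\d", text));
        System.out.println("digits or whitespace: " + countMatches("[\\d\\s]", text));
        System.out.println("word characters:      " + countMatches("\\w", text));
    }
}
```

For the text a1 b22 the counts are 3, 4 and 5: three digits, three digits plus one space, and five letters or digits.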
Boundary Matching
Regular expressions can also match boundaries, such as word boundaries and the beginning or end of the text. For example, \b matches a word boundary, ^ matches the beginning of a line, and $ matches the end of a line.
^This is a single line$
The expression above matches a line consisting of exactly the text This is a single line. Note the beginning-of-line and end-of-line markers: they mean that no other text may come before or after the text, only the beginning and the end of the line.
The complete list of matching boundaries is listed at the end of this article.
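As an illustration of word boundaries, the sketch below counts the string is with and without surrounding \b markers. The class name, the helper countMatches() and the sample text are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {

    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "This is the text which is to be searched";

        // Without boundaries, the "is" inside "This" is counted as well.
        System.out.println("is      -> " + countMatches("is", text));

        // With word boundaries, only the stand-alone word "is" is counted.
        System.out.println("\\bis\\b -> " + countMatches("\\bis\\b", text));

        System.out.println(Pattern.matches("^This is a single line$",
                                           "This is a single line"));
    }
}
```

The first count is 3 and the second is 2, because the word boundary before is rules out the match inside This.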
Quantifiers
Quantifiers let you match multiple occurrences of an expression. For example, the following expression matches the letter A occurring zero or more times:
A*
The quantifier * means zero or more times, + means one or more times, and ? means zero or one time. There are other quantifiers as well; see the list later in this article.
Quantifiers come in three modes: reluctant, greedy and possessive. A reluctant quantifier matches as little text as possible. A greedy quantifier matches as much text as possible while still letting the rest of the expression match. A possessive quantifier matches as much text as possible, even if that makes the rest of the expression fail to match.
The following examples show the differences between the reluctant, greedy and possessive modes. Assume the following text:
John went for a walk, and John fell down, and John hurt his knee.
An expression with a reluctant quantifier:
John.*?
This expression matches John followed by zero or more characters. The . represents any character, and * means zero or more times. The ? after the * makes the * reluctant.
A reluctant quantifier matches as few characters as possible, which here means zero characters. The expression therefore matches just the word John, which occurs 3 times in the input text.
Changed to greedy mode, the expression looks like this:
John.*
A greedy quantifier matches as many characters as possible. The expression now matches the first occurrence of John and, because of the greedy .*, all the remaining characters too. That leaves only one match.
Finally, we change to possessive mode:
John.*+hurt
The + after the * makes the quantifier possessive.
This expression has no matches in the input text, even though the text contains both John and hurt. Why? Because .*+ is possessive. A greedy quantifier matches as much text as possible while still allowing the entire expression to match. A possessive quantifier matches as much as possible without considering whether the rest of the expression can still match.
The .*+ matches all the characters after the first John, which leaves nothing for the remaining hurt in the expression to match. With a greedy quantifier there is a match. The expression is:
John.*hurt
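The four expressions from this section can be compared side by side by counting their matches in the sample text. The class name QuantifierModeDemo and the helper countMatches() are made up for this sketch; the text and the expressions are the ones used above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierModeDemo {

    static final String TEXT =
        "John went for a walk, and John fell down, and John hurt his knee.";

    // Counts how many times the regex is found in the text.
    static int countMatches(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("reluctant  John.*?     : " + countMatches("John.*?", TEXT));
        System.out.println("greedy     John.*      : " + countMatches("John.*", TEXT));
        System.out.println("possessive John.*+hurt : " + countMatches("John.*+hurt", TEXT));
        System.out.println("greedy     John.*hurt  : " + countMatches("John.*hurt", TEXT));
    }
}
```

The counts are 3, 1, 0 and 1, in line with the explanation above.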
Logical operators
Regular expressions support a small number of logical operations (AND, OR, NOT).
The AND operation is the default: the expression John means J and o and h and n.
The OR operation has to be specified explicitly, using |. For example, the expression John|hurt means John or hurt.
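A short sketch of the OR operator, applied to the sample sentence from the quantifier section. The class name AlternationDemo and the helper findAll() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {

    // Returns the text of every match, in the order the matches are found.
    static List<String> findAll(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        List<String> found = new ArrayList<String>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "John went for a walk, and John fell down, and John hurt his knee.";
        System.out.println(findAll("John|hurt", text));
    }
}
```

Each match is either John or hurt, whichever the engine finds next while scanning the text from left to right.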
Characters
Character Classes
Predefined Character Classes
Boundary Matching
Quantifiers