Regular Expression Tutorial: Positional Matching Explained

高洛峰
Release: 2023-03-04 17:44:01

This article explains positional matching in regular expressions through examples. The details are as follows:

Note: In all examples, the text matched by the regular expression is enclosed between [ and ] in the source text. Some examples are implemented in Java; where Java's own handling of regular expressions matters, it is explained at the relevant place. All Java examples were tested under JDK 1.6.0_13.

1. Introduction to the problem

If we want to match a particular word in a piece of text (ignoring multiline mode for now, which is introduced later), we might write something like the following:

Text: Yesterday is history, tomorrow is a mystery, but today is a gift.

Regular expression: is

Result: Yesterday [is] h[is]tory, tomorrow [is] a mystery, but today [is] a gift.

Analysis: The intent was to match only the word is, but the pattern also matched the is inside another word (history). To solve this problem, use boundary delimiters: metacharacters that specify where (at which position or boundary) a match is allowed to occur.

2. Word Boundary

A commonly used boundary is the word boundary, specified by the metacharacter \b, which matches the beginning or end of a word. More precisely, it matches the position between a character that can form part of a word (a letter, digit, or underscore, that is, a character matched by \w) and a character that cannot (a character matched by \W); the start or end of the string counts as a non-word neighbor. Let's revisit the previous example:

Text: Yesterday is history, tomorrow is a mystery, but today is a gift.

Regular expression: \bis\b

Result: Yesterday [is] history, tomorrow [is] a mystery, but today [is] a gift.

Analysis: In the original text, the word is has a space before and after it, so it matches the pattern \bis\b (a space is one of the characters that separate words). The word history also contains is, but there it sits between h and t, which are both word characters, so there is no word boundary on either side and \b cannot match.
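
The difference between the two patterns can be checked with a few lines of Java. This is a minimal sketch: the class name WordBoundaryDemo and the helper mark are hypothetical names chosen for illustration, and the helper simply wraps each match in [ and ] the way the examples in this article do.

```java
import java.util.regex.Pattern;

final class WordBoundaryDemo {
    // Hypothetical helper: wraps every match of regex in [ and ].
    // In the replacement string, $0 stands for the whole match.
    static String mark(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll("[$0]");
    }

    public static void main(String[] args) {
        String text = "Yesterday is history, tomorrow is a mystery, but today is a gift.";
        // Plain "is" also matches inside "history".
        System.out.println(mark("is", text));
        // prints: Yesterday [is] h[is]tory, tomorrow [is] a mystery, but today [is] a gift.

        // \b on both sides restricts the match to the standalone word.
        System.out.println(mark("\\bis\\b", text));
        // prints: Yesterday [is] history, tomorrow [is] a mystery, but today [is] a gift.
    }
}
```

Note that each backslash is doubled in the Java string literal, so the regular expression \bis\b is written as "\\bis\\b".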

To match a position that is not a word boundary, use \B. For example:

Text: Please enter the nine-digit id as it appears on your color - coded pass-key.

Regular expression: \B-\B

Result: Please enter the nine-digit id as it appears on your color [-] coded pass-key.

Analysis: \B-\B matches a hyphen that has no word boundary on either side. In nine-digit and pass-key the hyphen sits directly between word characters, so there is a word boundary on each side of it and \B fails. In color - coded the hyphen is surrounded by spaces; a space next to a hyphen is two non-word characters in a row, which is not a word boundary, so only that hyphen matches.
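
Because \B is easy to misread, it helps to run the pattern and see which hyphens it actually selects. The sketch below uses hypothetical names (NonBoundaryDemo, mark) and compares \B-\B with its opposite, \b-\b, on the same text.

```java
import java.util.regex.Pattern;

final class NonBoundaryDemo {
    // Hypothetical helper: wraps every match of regex in [ and ].
    static String mark(String regex, String text) {
        return Pattern.compile(regex).matcher(text).replaceAll("[$0]");
    }

    public static void main(String[] args) {
        String text = "Please enter the nine-digit id as it appears on your color - coded pass-key.";

        // \B-\B: no word boundary on either side, so only the hyphen
        // surrounded by spaces qualifies.
        System.out.println(mark("\\B-\\B", text));
        // prints: Please enter the nine-digit id as it appears on your color [-] coded pass-key.

        // \b-\b: a word boundary on both sides, so only the hyphens
        // placed directly between word characters qualify.
        System.out.println(mark("\\b-\\b", text));
        // prints: Please enter the nine[-]digit id as it appears on your color - coded pass[-]key.
    }
}
```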

3. String boundaries

Word boundaries can be used to match positions related to words (beginning of word, end of word, entire word, etc.). String boundaries have a similar purpose, but are used to match positions related to strings (beginning of string, end of string, entire string, etc.). There are two metacharacters used to define string boundaries: one is ^ used to define the beginning of the string, and the other is $ used to define the end of the string.

For example, suppose you want to check the validity of an XML document; valid XML documents normally start with an <?xml ?> declaration:

Text:

<?xml version="1.0" encoding="UTF-8"?>
<project basedir="." default="ear">
</project>

Regular expression: ^\s*<\?xml.*?\?>

Result:

[<?xml version="1.0" encoding="UTF-8"?>]
<project basedir="." default="ear">
</project>

Analysis: ^ matches the beginning of a string, so ^\s* matches the beginning of the string followed by zero or more whitespace characters. This makes the pattern tolerant of spaces, tabs, newlines, and other whitespace in front of the <?xml ?> declaration.

The $ metacharacter is used in exactly the same way as ^, except that it marks the end of the string rather than the beginning. For example, to check whether an HTML page ends with </html>, you can use the pattern: </html>\s*$
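
Both anchors can be tried out in Java with Matcher.find(). The following is a sketch under stated assumptions: the class and method names are hypothetical, and the end-of-page check assumes the page is expected to end with a lowercase </html> closing tag.

```java
import java.util.regex.Pattern;

final class StringBoundaryDemo {
    // ^ anchors the search to the start of the string; \s* allows leading whitespace.
    private static final Pattern XML_START = Pattern.compile("^\\s*<\\?xml.*?\\?>");
    // $ anchors the search to the end of the string; \s* allows trailing whitespace.
    // Assumption: the closing tag is written as </html> in lowercase.
    private static final Pattern HTML_END = Pattern.compile("</html>\\s*$");

    static boolean startsWithXmlDeclaration(String doc) {
        return XML_START.matcher(doc).find();
    }

    static boolean endsWithHtmlClose(String page) {
        return HTML_END.matcher(page).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithXmlDeclaration("  \n<?xml version=\"1.0\"?>\n<project/>")); // prints: true
        System.out.println(startsWithXmlDeclaration("<project/><?xml version=\"1.0\"?>"));      // prints: false
        System.out.println(endsWithHtmlClose("<html><body></body></html>\n"));                  // prints: true
        System.out.println(endsWithHtmlClose("<html></html> trailing text"));                   // prints: false
    }
}
```

find() is used instead of matches() because matches() requires the pattern to cover the entire input, while here only the beginning or the end of the text is being tested.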

4. Multiline matching mode

Some special metacharacter sequences change the behavior of other metacharacters. Multiline matching mode is enabled with (?m). It causes the regular expression engine to treat line separators as string separators: ^ matches not only at the normal beginning of the string but also at the position right after a line separator (newline character), and $ matches not only at the normal end of the string but also at the position right before a line separator.

(?m) should appear at the very beginning of the pattern so that it applies to the whole expression. For example, use a regular expression to find all the single-line comments (starting with //) in a piece of Java code.

Text:

public DownloadingDialog(Frame parent){
     //Call super constructor, specifying that dialog box is modal.
     super(parent, true);
     //Set dialog box title.
     setTitle("E-mail Client");
     //Instruct window not to close when the "X" is clicked.
     setDefaultCloseOperation(DO_NOTHING_ON_CLOSE);
     //Put a message with a nice border in this dialog box.
     JPanel contentPanel = new JPanel();
     contentPanel.setBorder(BorderFactory.createEmptyBorder(5, 5, 5, 5));
     contentPanel.add(new JLabel("Downloading messages..."));
     setContentPane(contentPanel);
     //Size dialog box to components.
     pack();
     //Center dialog box over application.
     setLocationRelativeTo(parent);
}

Regular expression: (?m)^\s*//.*$

Result:

public DownloadingDialog(Frame parent){
[     //Call super constructor, specifying that dialog box is modal.]
     super(parent, true);
[     //Set dialog box title.]
     setTitle("E-mail Client");
[     //Instruct window not to close when the "X" is clicked.]
     setDefaultCloseOperation(DO_NOTHING_ON_CLOSE);
[     //Put a message with a nice border in this dialog box.]
     JPanel contentPanel = new JPanel();
     contentPanel.setBorder(BorderFactory.createEmptyBorder(5, 5, 5, 5));
     contentPanel.add(new JLabel("Downloading messages..."));
     setContentPane(contentPanel);
[     //Size dialog box to components.]
     pack();
[     //Center dialog box over application.]
     setLocationRelativeTo(parent);
}

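Analysis: ^\s*//.*$ matches the beginning of a string, followed by any amount of whitespace, then //, then any text, and finally the end of a string. Without (?m), ^ and $ anchor only to the start and end of the whole text, so the pattern cannot pick out the individual comment lines (at best it could match a single comment). With the (?m) prefix, line breaks are treated as string separators, so every comment line is matched.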

The Java implementation is as follows (the text is saved in the file text.txt):

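// The two methods below belong inside a class of your choice.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Reads the entire file at the given path into one string.
public static String getTextFromFile(String path) throws Exception{
  BufferedReader br = new BufferedReader(new FileReader(new File(path)));
  StringBuilder sb = new StringBuilder();
  char[] cbuf = new char[1024];
  int len;
  try{
    // read() returns the number of characters read, or -1 at end of file.
    while((len = br.read(cbuf)) != -1){
      sb.append(cbuf, 0, len);
    }
  }finally{
    br.close();
  }
  return sb.toString();
}

// Prints every single-line comment found in the file.
public static void multilineMatch() throws Exception{
  String text = getTextFromFile("E:/text.txt");
  // (?m) makes ^ and $ match at the start and end of every line.
  String regex = "(?m)^\\s*//.*$";
  Matcher m = Pattern.compile(regex).matcher(text);
  while(m.find()){
    System.out.println(m.group());
  }
}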

The output is as follows:

//Call super constructor, specifying that dialog box is modal.
//Set dialog box title.
//Instruct window not to close when the "X" is clicked.
//Put a message with a nice border in this dialog box.
//Size dialog box to components.
//Center dialog box over application.
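
In Java, the same mode can also be switched on with the Pattern.MULTILINE flag instead of the (?m) prefix. The sketch below uses hypothetical names (MultilineDemo, findComments) and a short made-up code sample to compare the pattern with and without multiline mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultilineDemo {
    // Hypothetical helper: collects all single-line comments found in text.
    // When multiline is true, Pattern.MULTILINE has the same effect as (?m).
    static List<String> findComments(String text, boolean multiline) {
        int flags = multiline ? Pattern.MULTILINE : 0;
        Matcher m = Pattern.compile("^\\s*//.*$", flags).matcher(text);
        List<String> comments = new ArrayList<String>();
        while (m.find()) {
            // trim() drops the leading indentation consumed by \s*.
            comments.add(m.group().trim());
        }
        return comments;
    }

    public static void main(String[] args) {
        String code = "int a = 1;\n  // first comment\nint b = 2;\n  // second comment\n";
        System.out.println(findComments(code, true));  // prints: [// first comment, // second comment]
        // Without multiline mode, ^ matches only at the very start of the text,
        // which begins with "int", so nothing is found.
        System.out.println(findComments(code, false)); // prints: []
    }
}
```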

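5. Summary

Regular expressions can match not only blocks of text of arbitrary length but also text that appears at specific positions in a string. \b specifies a word boundary (\B does the opposite). ^ and $ specify string boundaries. When combined with (?m), ^ and $ also match at the start and end of each line, that is, at the positions next to a line break. The next article introduces subexpressions.

I hope this article helps you in learning regular expressions.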

source:php.cn