As I understand it, you want to capture everything inside the body tag.
You are using the regular expression below:
/\<body\>([\s\S].*?)\<\/body\>/
It fails to match because the pattern itself is wrong.
Let's break down the key part of the expression:
([\s\S].*?)
[\s\S] matches one whitespace or non-whitespace character. In other words, it matches any character at all, including newlines, spaces and tabs, but only a single one.
What does .*? mean?
. matches any single character except a newline.
.* means zero or more such characters (newlines excluded), and on its own it is greedy: it matches as many characters as possible.
The ? here modifies *. Together, *? means lazy matching: match as few characters as possible. The engine first tries zero characters and adds one more only when the rest of the pattern cannot match otherwise. So .*? still extends until </body> can follow, but because . never matches a newline it cannot get past the first line break.
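To make the difference between greedy and lazy matching concrete, here is a minimal sketch. The sample strings are made up for illustration. It shows that a lazy quantifier starts from zero characters but still grows as far as the rest of the pattern requires, and that neither form lets . cross a line break (assuming the s flag is not set).

```javascript
// Hypothetical sample text, chosen only for illustration.
const sample = "<b>one</b><b>two</b>";

// Greedy: .* takes as much as it can, so the match runs to the last </b>.
const greedy = sample.match(/<b>(.*)<\/b>/)[1];

// Lazy: .*? starts at zero characters and grows only until </b> can match.
const lazy = sample.match(/<b>(.*?)<\/b>/)[1];

// Neither form lets . cross a line break (no s flag here).
const acrossLines = /<b>(.*?)<\/b>/.test("<b>one\ntwo</b>");

console.log(greedy);      // one</b><b>two
console.log(lazy);        // one
console.log(acrossLines); // false
```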
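The entire expression
<body>([\s\S].*?)<\/body> // note that < and > do not need to be escaped
matches one arbitrary character after <body>, followed by as few non-newline characters as possible up to </body>. Only that first character may be a newline, so the match fails as soon as a line break appears anywhere later in the body. On a typical multi-line body it fails just like
<body>([\s\S])<\/body>
does, which means .*? does not help here.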
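A quick way to check how the pattern from the question behaves is to run it against a single-line and a multi-line body. This is only a sketch with made-up inputs. It suggests that the pattern works while the body stays on one line and returns no match once a line break appears after the first character.

```javascript
// The pattern from the question, without the unneeded escapes on < and >.
const original = /<body>([\s\S].*?)<\/body>/;

// Hypothetical inputs for illustration.
const oneLine = "<body><p>hi</p></body>";
const multiLine = "<body>\n<p>hi</p>\n</body>";

const oneLineMatch = original.exec(oneLine);
const multiLineMatch = original.exec(multiLine);

console.log(oneLineMatch[1]); // <p>hi</p>
console.log(multiLineMatch);  // null
```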
Why does simply removing the . fix it? Because once the . is gone, the lazy quantifier *? applies to the
[\s\S]
part, meaning zero or more characters of any kind, newlines included, taking as few as needed.
I suspect you understood
[\s\S]
as matching only newlines, and added . so that everything else would match too. If that were the case, it would have to be written like this:
<body>([\s\S.]*?)<\/body>
This also matches, but the . is redundant: inside a character class it is just a literal period, and
[\s\S]
already matches every character, the period included.
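The point about the redundant . can be checked with a small sketch. The character list is an arbitrary sample picked for illustration, so the comparison is a spot check, not a proof.

```javascript
// Inside a character class, . is a literal period, not "any character".
const dotInClass = /^[.]$/;
const matchesPeriod = dotInClass.test(".");
const matchesLetter = dotInClass.test("a");

// [\s\S] already accepts the period (it is a non-whitespace character),
// so adding . to the class should not change what is accepted.
const withDot = /^[\s\S.]$/;
const withoutDot = /^[\s\S]$/;
const chars = [".", "a", " ", "\n", "\t"];
const sameResult = chars.every((c) => withDot.test(c) === withoutDot.test(c));

console.log(matchesPeriod); // true
console.log(matchesLetter); // false
console.log(sameResult);    // true
```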
So the final answer is
<body>([\s\S]*?)<\/body>
This matches zero or more characters of any kind, newlines included, between <body> and the first </body> that follows, so the content is captured correctly.
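As a usage sketch, here is the final pattern applied to a small multi-line page. The page content and the variable names are hypothetical, used only for illustration.

```javascript
const bodyPattern = /<body>([\s\S]*?)<\/body>/;

// Hypothetical page used only for illustration.
const html = [
  "<html>",
  "<body>",
  "  <h1>Title</h1>",
  "  <p>Text</p>",
  "</body>",
  "</html>",
].join("\n");

// exec returns null when nothing matches, so guard before reading group 1.
const match = bodyPattern.exec(html);
const bodyContent = match ? match[1] : null;

console.log(bodyContent);
```

Note that the pattern expects a bare <body> tag. An opening tag with attributes, such as <body class="x">, would not be matched by it as written.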
That’s it.
PS: The layout is a bit messy, because escape characters are difficult to use in the SegmentFault editor