
Regular expressions and text mining--Text Mining

Dec 05, 2016 am 11:56 AM
text mining regular expression

When doing text mining, the wildcard characters in T-SQL are often insufficient. In that case, "CLR + regular expressions" is a very good choice. Regular expressions look complicated, but they are all built from the same small set of elements. Once you are proficient with the metacharacters of regular expressions, you can use regular expressions flexibly to complete complex text mining work.

1. Special characters of regular expressions

1. Commonly used metacharacters

Metacharacters are used to match specific characters (letters, digits, symbols). Note that the letters are case-sensitive:

. : matches any character except a line break
\w: matches a letter, digit, underscore, or Chinese character
\s: matches any whitespace character
\d: matches a digit
\b: matches the beginning or end of a word
^: matches the beginning of the string
$: matches the end of the string
\k<name>: references a group by its name; for example, \k<group_name> references the group named group_name
\group_number: references a group by its number, where group_number is the number of the group (1, 2, 3, and so on)
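
The metacharacters above can be tried out quickly. The following is a minimal sketch that assumes Python's built-in re module rather than the CLR regex engine the article targets; the sample strings are made up for illustration. In Python 3, \w also matches Chinese characters when the pattern is a str.

```python
import re

# Hypothetical sample text, chosen only to exercise each metacharacter.
text = "ID 42 北京 day_3"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # letters, digits, underscore, Chinese characters
fields = re.split(r"\s+", text)        # split on whitespace
any_char = re.findall(r"d.y", text)    # "." matches any single character except a line break

starts_with_id = re.search(r"^ID", text) is not None     # ^ anchors at the start
ends_with_day = re.search(r"day_3$", text) is not None   # $ anchors at the end

# \b marks a word boundary: only the standalone "cat" qualifies here.
whole_word = re.findall(r"\bcat\b", "cat concat cat_1")
substring = re.findall(r"cat", "cat concat cat_1")
```

Here whole_word holds a single match while substring holds three, which shows the effect of \b.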

2. Repeated characters or groups

Specify the number of times the previous character or group is repeated:

*: repeat zero or more times
+: repeat one or more times
?: repeat zero or one time
{n}: repeat exactly n times
{n,}: repeat n or more times
{n,m}: repeat n to m times
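
To see how each quantifier changes what is accepted, here is a small sketch assuming Python's re module. The sample strings and the helper name matching are hypothetical, introduced only for this illustration.

```python
import re

# Candidate strings that differ only in how many times "b" is repeated.
samples = ["ac", "abc", "abbc", "abbbc"]

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ab*c")            # zero or more b
plus = matching(r"ab+c")            # one or more b
optional = matching(r"ab?c")        # zero or one b
exactly_two = matching(r"ab{2}c")   # exactly two b
two_or_more = matching(r"ab{2,}c")  # two or more b
one_to_two = matching(r"ab{1,2}c")  # one to two b
```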

3. Grouping, escaping, branching, and character lists

These characters have specific meanings and uses:

(): parentheses define a group
<>: defines a group name; the string between < and > is the name of the group
\: escape character; the special character that follows it is treated literally, for example in \( and \) the parentheses are no longer special characters
|: branch; the expressions on either side are in an "or" relationship
[]: specifies a list of allowed characters in square brackets; one character must match any one character in the list. For example, [aeiou] matches one character that is any of a, e, i, o, u
[^]: specifies a list of excluded characters in square brackets; one character must not be any character in the list. For example, [^aeiou] matches one character that is none of a, e, i, o, u
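
A short sketch of these constructs, assuming Python's re module and made-up sample text:

```python
import re

text = "gray grey (note) green"

# | lets either branch match.
branch = re.findall(r"gray|grey", text)

# A character list does the same job for a single position.
char_list = re.findall(r"gr[ae]y", text)

# Escaped parentheses match literal "(" and ")" instead of forming a group.
literal_parens = re.findall(r"\(\w+\)", text)

# [aeiou] keeps vowels; [^aeiou\s] keeps runs of anything that is
# neither a vowel nor whitespace.
vowels = re.findall(r"[aeiou]", "text mining")
non_vowels = re.findall(r"[^aeiou\s]+", "text mining")

# A group can be repeated as a unit.
repeated_group = re.fullmatch(r"(ab)+", "ababab") is not None
```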

2. Group references

A group is a subexpression enclosed in parentheses. A group reference refers back to a group later in the same expression, which makes regular expressions more concise. By default, each group is automatically assigned a group number: numbering starts at 1 and increases by 1 from left to right (1-based), so the first group is number 1, the second group is number 2, and so on.

Three forms of grouping definition:

(exp): a group number is assigned automatically, and the group is referenced by its number;
(?<name>exp): names the group, and the group is referenced by its name;
(?:exp): the group only matches text at the current position; it cannot be referenced afterwards, and it has no group name and no group number;

1. Referencing a group by its number

Define a group (exp) earlier in the regular expression; later in the expression you can reference that group by its group number. The syntax for referencing a group is: \group_number.

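For example: \b(\w+)\b\s+\1\b. In this regular expression there is only one group, (\w+), and its group number is 1. After the group, \1 refers back to it. The expression has the same shape as \b(\w+)\b\s+(\w+)\b, with one important difference: \1 matches only the exact text that group 1 captured, not any text that \w+ could match. As a result, the expression finds a word that is immediately repeated, such as "the the".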
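
The difference between a numbered back-reference and simply writing the subexpression twice can be checked with a minimal sketch, assuming Python's re module and a made-up sentence:

```python
import re

# \1 must match the same text that group 1 captured.
with_backref = re.compile(r"\b(\w+)\b\s+\1\b")
# Writing the subexpression twice accepts any two consecutive words.
two_words = re.compile(r"\b(\w+)\b\s+(\w+)\b")

text = "This is is a test of the the pattern"

# Words that appear twice in a row.
repeated = [m.group(1) for m in with_backref.finditer(text)]

# The two patterns disagree on a pair of different words.
backref_on_pair = with_backref.search("this that")
plain_on_pair = two_words.search("this that")
```

On "this that" the back-reference pattern finds nothing, while the two-group pattern matches the whole pair.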

2. Referencing a group by its name

In a regular expression, a group can be named. The format of a named group is (?<name>exp), where name is the group name. The format for referencing the group by name is \k<name>. Referencing a group by name and referencing it by number produce the same text matching behavior.

For example: \b(?<name>\w+)\b\s+\k<name>\b. After the group, \k<name> refers back to it, so this expression behaves the same as \b(\w+)\b\s+\1\b and matches a word that is immediately repeated.
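
A sketch of named groups follows, with one caveat: it assumes Python's re module, whose syntax differs from the CLR form described above. Python writes the named group as (?P<name>exp) and the reference as (?P=name). The group name word is chosen here only for illustration.

```python
import re

# Python spelling of a named group and a reference to it by name.
pattern = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")

m = pattern.search("it was was late")

by_name = m.group("word")   # look the group up by name
by_number = m.group(1)      # the same group is also number 1
whole_match = m.group(0)    # the full matched text
```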

3. Groups that cannot be referenced

(?:exp): a group defined with this syntax cannot be referenced and only matches text at the current position. The regular expression does not assign a group number to it.
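
The effect on group numbering can be seen in a small sketch, assuming Python's re module and a made-up year-month string:

```python
import re

# Two ordinary groups: both receive numbers.
capturing = re.compile(r"(\d{4})-(\d{2})")
# The first group is written as (?:...), so it receives no number
# and the month becomes group 1.
non_capturing = re.compile(r"(?:\d{4})-(\d{2})")

captured = capturing.search("2016-12").groups()
captured_nc = non_capturing.search("2016-12").groups()

group_count = capturing.groups         # number of numbered groups
group_count_nc = non_capturing.groups
```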

3. Assertion searches

An assertion is a logical condition: the match succeeds only when the condition is true. The text returned by a successful match does not include the asserted prefix or suffix; in other words, assertions are used to find text that comes before or after a specific piece of text. There are four assertion syntaxes:

(?=exp): the text that follows must match exp; the text before the position of exp is returned
(?<=exp): the text that precedes must match exp; the text after the position of exp is returned
(?!exp): the text that follows must not match exp
(?<!exp): the text that precedes must not match exp

1. Suffix matching

(?=exp): the text that follows must match exp, and the text before the position of exp is returned. Suffix matching is similar to the T-SQL pattern "%ing".

For example, the regular expression: \b\w+(?=ing\b)

Analysis: the assertion requires the suffix ing followed by the end of a word (\b). The expression matches words ending in ing but returns only the front part of the word, the part before ing.

For example, searching "I'm reading a book" finds "reading" because the word ends with ing. The regular expression returns read; the text returned by an assertion does not include the suffix.
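
The suffix example can be reproduced with a minimal sketch, assuming Python's re module:

```python
import re

text = "I'm reading a book"

# The assertion checks for "ing" at the end of the word without consuming it.
stems = re.findall(r"\b\w+(?=ing\b)", text)

# For comparison, consuming the suffix returns the whole word.
whole_words = re.findall(r"\b\w+ing\b", text)
```

stems contains only the part before ing, while whole_words contains the full word.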

2. Prefix matching

(?<=exp): the text that precedes must match exp, and the text after the position of exp is returned. Prefix matching is similar to the T-SQL pattern "re%".

For example, the regular expression: (?<=\bre)\w+\b

Analysis: the assertion requires the beginning of a word (\b) followed by the prefix re. The expression matches words that start with re but returns only the second half of the word, the part after re.

For example, searching "I am reading a book" finds "reading" because the word starts with re. The regular expression returns ading; the text returned by an assertion does not include the prefix.
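
The prefix example can be reproduced the same way. This sketch assumes Python's re module, which only accepts fixed-width expressions inside (?<=...); \bre qualifies because \b has zero width.

```python
import re

text = "I am reading a book"

# The assertion checks for "re" at the start of a word without returning it.
tails = re.findall(r"(?<=\bre)\w+\b", text)

# For comparison, consuming the prefix returns the whole word.
whole_words = re.findall(r"\bre\w+\b", text)
```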

3. Find text whose prefix or suffix is not a specific text

These two assertion searches are the opposites of the previous two and are used less often. Here is a brief overview:

(?!exp): the text that follows must not match exp
(?<!exp): the text that precedes must not match exp
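
3.1 For example, the regular expression: \b\w+(?!ing\b)

Analysis: the intent is to skip words ending in ing, so that searching "I am reading a book" returns I, am, a, book. As written, however, \w+ consumes the whole word "reading", and the text that follows it is not ing, so "reading" is still returned. To exclude words ending in ing, place the assertion at the start of the word: \b(?!\w*ing\b)\w+\b returns I, am, a, book.

3.2 For example, the regular expression: (?<!\bre)\w+\b

Analysis: the intent is to skip words starting with re, so that searching "I am reading a book" returns I, am, a, book. As written, however, the assertion only checks the text before the starting position of the match, and the text before "reading" is not re, so "reading" is still returned. To exclude words starting with re, assert on the text that follows the word boundary instead: \b(?!re)\w+\b returns I, am, a, book.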
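
With negative assertions the position of the assertion matters, which a quick check makes visible. The sketch below assumes Python's re module; the two patterns that move the assertion to the start of the word are one possible way to exclude whole words, not the only one.

```python
import re

text = "I am reading a book"

# Assertion after \w+: \w+ consumes the whole word first, and what
# follows "reading" is not "ing", so the word is still returned.
after_word = re.findall(r"\b\w+(?!ing\b)", text)

# Assertion at the start of the word: rejects any word ending in "ing".
not_ending_ing = re.findall(r"\b(?!\w*ing\b)\w+\b", text)

# Negative look-behind at the start of the match only inspects the text
# before the match, so "reading" is still returned.
lookbehind = re.findall(r"(?<!\bre)\w+\b", text)

# Negative look-ahead right after the word boundary rejects words
# that start with "re".
not_starting_re = re.findall(r"\b(?!re)\w+\b", text)
```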

