When mining text, the wildcard characters that T-SQL offers (for example, with LIKE) are often insufficient. In that case, combining CLR with regular expressions is a very good choice. Regular expressions look complicated, but they are all built from the same small set of building blocks. Once you are familiar with the metacharacters, you can use regular expressions fluently and flexibly to carry out complex text mining work.
1. Special characters of regular expressions
1. Commonly used metacharacters
Metacharacters are used to match specific kinds of characters (letters, digits, symbols). Note that the letters in metacharacters are case-sensitive:
. : matches any character except a line break
\w: matches a letter, digit, underscore, or Chinese character
\s: matches any whitespace character
\d: matches a digit
\b: matches the beginning or end of a word
^: matches the beginning of a string
$: matches the end of a string
\k<group_name>: references a group by name; for example, \k<group_name> references the group named group_name
\group_number: references a group by number, where group_number is the number of the group; for example, \1, \2, \3
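The metacharacters above can be tried out quickly. The following is a minimal sketch that uses Python's re module only as a convenient stand-in for the CLR regex engine; the sample strings are made up for illustration, and the basic metacharacters shown here behave the same way in both engines for this input.

```python
import re

text = "Order 66 shipped to 张三 on 2024-01-15"  # hypothetical sample text

# \d+ : runs of digits
digits = re.findall(r"\d+", text)
# ['66', '2024', '01', '15']

# \w+ : runs of letters, digits, underscores (Chinese characters included)
words = re.findall(r"\w+", text)
# ['Order', '66', 'shipped', 'to', '张三', 'on', '2024', '01', '15']

# ^ and $ anchor the match to the beginning and end of the string
starts_with_order = re.search(r"^Order", text) is not None   # True
last_number = re.search(r"\d+$", text).group()                # '15'

# \b restricts the match to whole words
whole_word = re.findall(r"\bcat\b", "cat concat cats")        # ['cat']
```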
2. Repeated characters or groups
These specify how many times the preceding character or group is repeated:
*: repeat zero or more times
+: repeat one or more times
?: repeat zero or one time
{n}: repeat exactly n times
{n,}: repeat n or more times
{n,m}: repeat n to m times
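To see how the repetition characters differ, the sketch below tests each one against the same small set of strings. It is written in Python for illustration only; the helper name matching and the sample strings are assumptions made for this example.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]  # hypothetical test strings

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_more = matching(r"ab*c")      # ['ac', 'abc', 'abbc', 'abbbc']
one_or_more = matching(r"ab+c")       # ['abc', 'abbc', 'abbbc']
zero_or_one = matching(r"ab?c")       # ['ac', 'abc']
exactly_two = matching(r"ab{2}c")     # ['abbc']
two_or_more = matching(r"ab{2,}c")    # ['abbc', 'abbbc']
one_to_two = matching(r"ab{1,2}c")    # ['abc', 'abbc']
```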
3. Grouping, escaping, branching, qualifiers
These characters have specific meanings and uses:
(): parentheses define a group
<>: defines the group name; the string between < and > is the group name
\: escape character; for example, in \( the parenthesis is no longer treated as a special character
|: branch; the expressions on either side are in an "or" relationship
[]: specifies a list of allowed characters; one character must match any one character in the list given in the square brackets. For example, with [aeiou] the character must be one of a, e, i, o, u;
[^]: specifies a list of excluded characters; one character must not be any character in the list given in the square brackets. For example, with [^aeiou] the character cannot be any of a, e, i, o, u;
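The following sketch exercises grouping, escaping, branching, and character lists one at a time. It uses Python as a stand-in for the CLR regex engine, and the sample strings are invented for the example.

```python
import re

# (): the + applies to the whole group "ab", not just to "b"
group_repeats = re.fullmatch(r"(ab)+", "ababab") is not None   # True

# \( and \): escaped parentheses match literal parentheses
literal_parens = re.findall(r"\(\d+\)", "call (42) and (7)")   # ['(42)', '(7)']

# |: either branch may match
either = re.findall(r"cat|dog", "cat, dog, bird")              # ['cat', 'dog']

# []: one character from the list
vowels = re.findall(r"[aeiou]", "regex")                       # ['e', 'e']

# [^]: one character that is NOT in the list
non_vowels = re.findall(r"[^aeiou]", "regex")                  # ['r', 'g', 'x']
```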
2. Grouping references
(?<group_name>exp): names the group, so that it can be referenced by its group name;
(?:exp): the group only matches text at the current position; it cannot be referenced afterwards, because it has neither a group name nor a group number;
1. Referring to a group by its group number
Groups written with () are numbered from 1 in the order of their opening parentheses, and each can be referenced as \1, \2, \3, and so on.
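As an illustration of group references, the sketch below finds doubled words by number and by name. It is written in Python, whose spelling of named groups differs from the CLR one: Python uses (?P<word>exp) and (?P=word), where .NET uses (?<word>exp) and \k<word>. The group name word and the sample text are assumptions made for this example.

```python
import re

text = "this is is a test test"  # hypothetical sample text

# Reference by group number: \1 must repeat whatever group 1 captured.
by_number = re.findall(r"\b(\w+) \1\b", text)
# ['is', 'test']

# Reference by group name (Python spelling; see the note above for .NET).
by_name = [m.group("word")
           for m in re.finditer(r"\b(?P<word>\w+) (?P=word)\b", text)]
# ['is', 'test']

# (?:...) groups without capturing, so only (\d) receives a group number.
only_digit = re.search(r"(?:ab)+(\d)", "abab7").groups()
# ('7',)
```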
Assertions check the text before or after the matched text; exp itself is not included in the returned text:
(?=exp): the text must be followed by exp; the text before exp is returned
(?<=exp): the text must be preceded by exp; the text after exp is returned
(?!exp): the text must not be followed by exp
(?<!exp): the text must not be preceded by exp
1. Suffix matching
(?=exp): the text must be followed by exp, and the text before exp is returned;
For example, regular expression: \b\w+(?=ing\b)
Analysis: the assertion requires that the suffix is ing and that ing is at the end of the word (\b). The pattern matches words ending with ing, but returns only the front part of the word, the part before ing;
For example, searching "I am reading a book" matches "reading", because the word ends with ing. The regular expression returns read; the returned text does not contain the asserted suffix.
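The suffix assertion can be checked with a short sketch. Python is used here as a stand-in for the CLR regex engine; the pattern \b\w+(?=ing\b) is the suffix pattern described in the analysis above.

```python
import re

text = "I am reading a book"

# (?=ing\b) asserts that "ing" plus a word end follows, without consuming it.
stems = re.findall(r"\b\w+(?=ing\b)", text)
# ['read']

# For comparison, without the assertion the whole word is returned.
whole_words = re.findall(r"\b\w+ing\b", text)
# ['reading']
```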
2. Prefix matching
(?<=exp): the text must be preceded by exp, and the text after exp is returned. Prefix matching is similar to T-SQL's "re%";
For example, regular expression: (?<=\bre)\w+\b
Analysis: the assertion requires the beginning of a word (\b) followed by the prefix re. The pattern matches words starting with re, but returns only the second half of the word, the part after re;
For example, searching "I am reading a book" matches "reading", because the word starts with re. The regular expression returns ading; the returned text does not contain the asserted prefix.
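The prefix assertion can be checked the same way. This is a minimal Python sketch standing in for the CLR regex engine; Python only accepts fixed-width expressions inside (?<=...), which \bre satisfies.

```python
import re

text = "I am reading a book"

# (?<=\bre) asserts that a word start plus "re" precedes, without consuming it.
tails = re.findall(r"(?<=\bre)\w+\b", text)
# ['ading']

# For comparison, without the assertion the whole word is returned.
whole_words = re.findall(r"\bre\w+\b", text)
# ['reading']
```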
3. Find text whose prefix or suffix is not a specific text
These two assertions are the negated forms of the previous two and are used less often. A brief overview:
(?!exp): the text must not be followed by exp; text whose suffix is not exp is returned
(?<!exp): the text must not be preceded by exp; text whose prefix is not exp is returned
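3.1 For example, regular expression: \b\w+(?!ing\b)
Analysis: the intent is to skip words ending with ing. Note, however, that \w+ is greedy and consumes the whole word before the assertion is tested, so at the end of "reading" the following text is not ing and the assertion succeeds. Searching "I am reading a book" with this pattern therefore returns: I, am, reading, a, book. To actually exclude words ending with ing, place the assertion at the start of the word, \b(?!\w*ing\b)\w+\b, which returns: I, am, a, book
3.2 For example, regular expression: (?<!\bre)\w+\b
Analysis: the intent is to skip words starting with re. Here, too, the match for "reading" can begin at r, where the preceding text is not re, so searching "I am reading a book" returns: I, am, reading, a, book. To actually exclude words starting with re, use \b(?!re)\w+\b, which returns: I, am, a, book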
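Negative assertions are easy to get subtly wrong, because a greedy \w+ consumes the whole word before the assertion is tested, so it is worth checking what a pattern really returns. The sketch below, in Python as a stand-in for the CLR regex engine, runs the two patterns from 3.1 and 3.2 and then two stricter variants; the stricter variants are one possible formulation, not the only one.

```python
import re

text = "I am reading a book"

# Patterns from 3.1 and 3.2: "reading" is still returned, because the
# assertion is satisfied at the position where it happens to be tested.
suffix_as_written = re.findall(r"\b\w+(?!ing\b)", text)
# ['I', 'am', 'reading', 'a', 'book']
prefix_as_written = re.findall(r"(?<!\bre)\w+\b", text)
# ['I', 'am', 'reading', 'a', 'book']

# Stricter variants: test the assertion at the start of the word,
# before \w+ has consumed anything.
not_ending_in_ing = re.findall(r"\b(?!\w*ing\b)\w+\b", text)
# ['I', 'am', 'a', 'book']
not_starting_with_re = re.findall(r"\b(?!re)\w+\b", text)
# ['I', 'am', 'a', 'book']
```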