If your website accepts comments, you will almost certainly find it spammed with advertisements: part-time job offers, QQ numbers, Taobao part-time jobs, links to other websites. Let's take a look at how to filter this content.
Advertisements posted in comments or other user content generally fall into the following types:
1: Taobao part-time job, add QQ 123456789 or join a group (contains a QQ number, WeChat ID or other numeric account)
2: Taobao part-time job, add QQ number (contains English keywords)
3: Taobao part-time job, add QQ ①①①①①① (special digit characters)
4: 22222222 (full-width digits)
Filtering method:
Use regular expressions to match and replace the punctuation, digits and letters in the string, then check whether it contains consecutive digits or keywords (both full-width and half-width forms are handled), because advertisements usually carry contact information such as a QQ number. So we first have to "purify" the comment: strip out the "sand" such as punctuation marks and spaces, convert full-width characters to half-width, and turn special digit characters into ordinary digits, so that mainly letters, Chinese characters and plain digits remain.
Example:
$comment = 'This $% is a (1)8 artifact 三四 website, come and join ①④he@#heqq 1 2 3 4 5 6 7 8';
1: "Purify" the content and remove punctuation marks
$flag_arr = array('?','!','¥','(',')',':','‘','’','“','”','《','》',',','…','。','、','nbsp','】','【','~');
$comment = preg_replace('/\s/', '', preg_replace("/[[:punct:]]/", '', strip_tags(html_entity_decode(str_replace($flag_arr, '', $comment), ENT_QUOTES, 'UTF-8'))));
After processing, $comment becomes: "Thisisa18artifact三四websitecomeandjoin①④heheqq12345678" (punctuation and whitespace are gone; the full-width and special digits are still there)
2: The comment may contain full-width symbols or digits, so use the following code to convert full-width characters into half-width characters that a regular expression can match
$quanjiao = array(
    '0' => '0', '1' => '1', '2' => '2', '3' => '3', '4' => '4',
    '5' => '5', '6' => '6', '7' => '7', '8' => '8', '9' => '9',
    'A' => 'A', 'B' => 'B', 'C' => 'C', 'D' => 'D', 'E' => 'E',
    'F' => 'F', 'G' => 'G', 'H' => 'H', 'I' => 'I', 'J' => 'J',
    'K' => 'K', 'L' => 'L', 'M' => 'M', 'N' => 'N', 'O' => 'O',
    'P' => 'P', 'Q' => 'Q', 'R' => 'R', 'S' => 'S', 'T' => 'T',
    'U' => 'U', 'V' => 'V', 'W' => 'W', 'X' => 'X', 'Y' => 'Y',
    'Z' => 'Z', 'a' => 'a', 'b' => 'b', 'c' => 'c', 'd' => 'd',
    'e' => 'e', 'f' => 'f', 'g' => 'g', 'h' => 'h', 'i' => 'i',
    'j' => 'j', 'k' => 'k', 'l' => 'l', 'm' => 'm', 'n' => 'n',
    'o' => 'o', 'p' => 'p', 'q' => 'q', 'r' => 'r', 's' => 's',
    't' => 't', 'u' => 'u', 'v' => 'v', 'w' => 'w', 'x' => 'x',
    'y' => 'y', 'z' => 'z',
    '(' => '(', ')' => ')', '〔' => '[', '〕' => ']', '【' => '[',
    '】' => ']', '〖' => '[', '〗' => ']', '“' => '[', '”' => ']',
    '‘' => '[', '’' => ']', '{' => '{', '}' => '}', '《' => '<',
    '》' => '>',
    '%' => '%', '+' => '+', '—' => '-', '-' => '-', '~' => '-',
    ':' => ':', '。' => '.', '、' => ',', ',' => '.', '、' => '.',
    ';' => ',', '?' => '?', '!' => '!', '…' => '-', '‖' => '|',
    '”' => '"', '’' => '`', '‘' => '`', '|' => '|', '〃' => '"',
    ' ' => ' ');
$comment = strtr($comment, $quanjiao);
PHP's strtr function translates specific characters in a string. You can call it as
strtr(string,from,to)
or
strtr(string,array)
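As a small illustration of the two call forms (the sample strings and variable names are made up for this sketch): the three-argument form maps single bytes, so it only suits single-byte characters, while the array form replaces whole substrings, which is why a full-width table is passed as an array.

```php
<?php
// Three-argument form: each byte in the "from" string is replaced by the
// byte at the same position in the "to" string. Fine for ASCII, but not
// suitable for multi-byte characters such as full-width digits.
$ascii = strtr('l33t', '3', 'e');

// Array form: keys are replaced by their values. The longest matching key
// wins, and text that has already been replaced is not scanned again.
$map  = array('1' => '1', '2' => '2', '3' => '3', 'Q' => 'Q');
$half = strtr('加QQ123', $map);

echo $ascii, "\n";  // leet
echo $half, "\n";   // 加QQ123
```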
After the full-width conversion, $comment becomes: "Thisisa18artifact三四websitecomeandjoin①④heheqq12345678"
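The letter and digit part of such a table can also be generated instead of typed by hand, which avoids typos in a long literal. The sketch below relies on the fact that the full-width forms U+FF01 to U+FF5E sit at a fixed offset (0xFEE0) from ASCII 0x21 to 0x7E. The helper names utf8_chr and build_fullwidth_map are hypothetical, and the result is not identical to the hand-written table: it does not cover characters such as 《》, 【】 or Chinese quotation marks.

```php
<?php
// Encode a code point in the range U+0800..U+FFFF as a 3-byte UTF-8 string.
function utf8_chr($cp)
{
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Map every full-width ASCII variant (U+FF01..U+FF5E) to its half-width
// counterpart, plus the ideographic space U+3000 to a normal space.
function build_fullwidth_map()
{
    $map = array(utf8_chr(0x3000) => ' ');
    for ($ascii = 0x21; $ascii <= 0x7E; $ascii++) {
        $map[utf8_chr($ascii + 0xFEE0)] = chr($ascii);
    }
    return $map;
}

$converted = strtr('加QQ:12345678', build_fullwidth_map());
echo $converted, "\n";  // 加QQ:12345678
```

Whether to generate the table or keep it as a literal is a matter of taste; the literal makes custom mappings (for example 《 to <) easier to see.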
3: The comment may also contain special digit characters (you can add new ones to the array below)
$special_num_char = array(
    '①'=>'1','②'=>'2','③'=>'3','④'=>'4','⑤'=>'5','⑥'=>'6','⑦'=>'7','⑧'=>'8','⑨'=>'9','⑩'=>'10',
    '⑴'=>'1','⑵'=>'2','⑶'=>'3','⑷'=>'4','⑸'=>'5','⑹'=>'6','⑺'=>'7','⑻'=>'8','⑼'=>'9','⑽'=>'10',
    '一'=>'1','二'=>'2','三'=>'3','四'=>'4','五'=>'5','六'=>'6','七'=>'7','八'=>'8','九'=>'9','零'=>'0');
$comment=strtr($comment, $special_num_char);
After processing, $comment becomes: "Thisisa18artifact34websitecomeandjoin14heheqq12345678"
If traditional Chinese numerals such as '零', '壹', '贰', '叁', '肆', '伍', '陆', '柒', '捌', '玖', '拾' appear in your comments, just add them to the $special_num_char array above.
4: There may also be a mixture of normal numbers and Chinese character numbers in the comments. Just use the method in point 3 to convert them into normal numbers.
Example: This is an advertisement qq 1二二45六7899
After conversion:
This is an advertisement qq 1224567899
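To illustrate points 3 and 4 together, here is a minimal sketch that uses a shortened digit table and extends it with traditional numerals. The sample strings are only for illustration; the full table from point 3 would be extended the same way.

```php
<?php
// Shortened version of the digit table from point 3.
$special_num_char = array(
    '①' => '1', '②' => '2', '③' => '3', '④' => '4', '⑤' => '5',
    '一' => '1', '二' => '2', '三' => '3', '四' => '4', '五' => '5',
    '六' => '6', '七' => '7', '八' => '8', '九' => '9', '零' => '0',
);

// Traditional numerals, added with the array union operator:
// existing keys are kept and only new keys are appended.
$special_num_char += array(
    '壹' => '1', '贰' => '2', '叁' => '3', '肆' => '4', '伍' => '5',
    '陆' => '6', '柒' => '7', '捌' => '8', '玖' => '9',
);

$mixed = strtr('qq1二二45六7899', $special_num_char);
$trad  = strtr('qq壹贰叁肆伍陆', $special_num_char);

echo $mixed, "\n";  // qq1224567899
echo $trad, "\n";   // qq123456
```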
5: Use a regular expression to detect advertisements
Collect all runs of digits with preg_match_all('/\d+/', $comment, $match);
Then analyze the resulting $match[0] array:
$is_ad = false;
foreach ($match[0] as $val) // is there a numeric QQ number or WeChat ID?
{
    if (strlen($val) >= 6)
    { // a run of 6 or more consecutive digits: strong suspicion of advertising
        $is_ad = true;
        break;
    }
}
if (count($match[0]) >= 10)
{ // many separate groups of digits: also suspicious
    $is_ad = true;
}
That's it: you can now judge whether a piece of content is an advertisement, and this filters out most common ads.
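The check in point 5 can be wrapped in a small function so that the thresholds are easy to tune. This is only a sketch: the name has_ad_digits and its parameters are hypothetical, and it assumes the comment has already gone through steps 1 to 3.

```php
<?php
// Returns true when the normalized comment contains a long run of digits
// or a large number of separate digit groups.
function has_ad_digits($comment, $minRunLength = 6, $maxGroups = 10)
{
    preg_match_all('/\d+/', $comment, $match);

    foreach ($match[0] as $run) {
        if (strlen($run) >= $minRunLength) {
            return true;  // long digit run, for example a QQ number
        }
    }

    // Many short, separate digit groups are suspicious as well.
    return count($match[0]) >= $maxGroups;
}

var_dump(has_ad_digits('Thisisa18artifact34websitecomeandjoin14heheqq12345678')); // bool(true)
var_dump(has_ad_digits('Ihave2catsand1dog'));                                      // bool(false)
var_dump(has_ad_digits('a1b2c3d4e5f6g7h8i9j0'));                                   // bool(true)
```

The defaults mirror the values used in the article (6 digits, 10 groups); legitimate comments that quote order numbers or dates may need looser thresholds.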
Complete code:
// Step 1: strip punctuation, HTML and whitespace
$flag_arr = array('?','!','¥','(',')',':','‘','’','“','”','《','》',',','…','。','、','nbsp','】','【','~');
$comment = preg_replace('/\s/', '', preg_replace("/[[:punct:]]/", '', strip_tags(html_entity_decode(str_replace($flag_arr, '', $comment), ENT_QUOTES, 'UTF-8'))));
// Step 2: convert full-width characters to half-width
$quanjiao = array(
    '0' => '0', '1' => '1', '2' => '2', '3' => '3', '4' => '4',
    '5' => '5', '6' => '6', '7' => '7', '8' => '8', '9' => '9',
    'A' => 'A', 'B' => 'B', 'C' => 'C', 'D' => 'D', 'E' => 'E',
    'F' => 'F', 'G' => 'G', 'H' => 'H', 'I' => 'I', 'J' => 'J',
    'K' => 'K', 'L' => 'L', 'M' => 'M', 'N' => 'N', 'O' => 'O',
    'P' => 'P', 'Q' => 'Q', 'R' => 'R', 'S' => 'S', 'T' => 'T',
    'U' => 'U', 'V' => 'V', 'W' => 'W', 'X' => 'X', 'Y' => 'Y',
    'Z' => 'Z', 'a' => 'a', 'b' => 'b', 'c' => 'c', 'd' => 'd',
    'e' => 'e', 'f' => 'f', 'g' => 'g', 'h' => 'h', 'i' => 'i',
    'j' => 'j', 'k' => 'k', 'l' => 'l', 'm' => 'm', 'n' => 'n',
    'o' => 'o', 'p' => 'p', 'q' => 'q', 'r' => 'r', 's' => 's',
    't' => 't', 'u' => 'u', 'v' => 'v', 'w' => 'w', 'x' => 'x',
    'y' => 'y', 'z' => 'z',
    '(' => '(', ')' => ')', '〔' => '[', '〕' => ']', '【' => '[',
    '】' => ']', '〖' => '[', '〗' => ']', '“' => '[', '”' => ']',
    '‘' => '[', '’' => ']', '{' => '{', '}' => '}', '《' => '<',
    '》' => '>',
    '%' => '%', '+' => '+', '—' => '-', '-' => '-', '~' => '-',
    ':' => ':', '。' => '.', '、' => ',', ',' => '.', '、' => '.',
    ';' => ',', '?' => '?', '!' => '!', '…' => '-', '‖' => '|',
    '”' => '"', '’' => '`', '‘' => '`', '|' => '|', '〃' => '"',
    ' ' => ' ');
$comment = strtr($comment, $quanjiao);
// Step 3: convert special digit characters to ordinary digits
$special_num_char = array(
    '①'=>'1','②'=>'2','③'=>'3','④'=>'4','⑤'=>'5','⑥'=>'6','⑦'=>'7','⑧'=>'8','⑨'=>'9','⑩'=>'10',
    '⑴'=>'1','⑵'=>'2','⑶'=>'3','⑷'=>'4','⑸'=>'5','⑹'=>'6','⑺'=>'7','⑻'=>'8','⑼'=>'9','⑽'=>'10',
    '一'=>'1','二'=>'2','三'=>'3','四'=>'4','五'=>'5','六'=>'6','七'=>'7','八'=>'8','九'=>'9','零'=>'0');
$comment = strtr($comment, $special_num_char);
// Step 5: look for suspicious digit runs
preg_match_all('/\d+/', $comment, $match);
$is_ad = false;
foreach ($match[0] as $val) // is there a numeric QQ number or WeChat ID?
{
    if (strlen($val) >= 6)
    { // a run of 6 or more consecutive digits: strong suspicion of advertising
        $is_ad = true;
        break;
    }
}
if (count($match[0]) >= 10)
{ // many separate groups of digits: also suspicious
    $is_ad = true;
}