


Notes on processing scraped data with preg_match_all (encoding conversion and regular-expression matching)
1. Use curl to fetch pages from the remote site
For details, see my previous note: http://www.jb51.net/article/46432.htm
2. Encoding conversion
First check the source of the collected page to find out which encoding the site uses, then convert it with the mb_convert_encoding function.
Usage:
// The source string is $str
// When the original encoding is known to be GBK, convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");
// When the original encoding is unknown, "auto" asks mbstring to detect it before converting to UTF-8
// (the list of encodings that "auto" tries depends on the mbstring.language setting)
$str = mb_convert_encoding($str, "UTF-8", "auto");
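As a rough sketch of the conversion step, assuming the mbstring extension is loaded: the helper below (to_utf8 is a hypothetical name, not part of PHP) leaves input that is already valid UTF-8 untouched and otherwise converts it from an assumed fallback encoding, GBK here. A GBK byte sequence can occasionally also be valid UTF-8, so the check is a heuristic, not a guarantee.

```php
<?php
// Hypothetical helper: convert a scraped string to UTF-8.
// Assumption: anything that is not valid UTF-8 is in $fallback (GBK by default).
function to_utf8(string $str, string $fallback = 'GBK'): string
{
    // Already valid UTF-8: return unchanged to avoid double conversion.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }
    return mb_convert_encoding($str, 'UTF-8', $fallback);
}

$gbk = "\xD6\xD0\xCE\xC4";   // the two characters of "中文" encoded as GBK
echo to_utf8($gbk), "\n";    // prints 中文 on a UTF-8 terminal
```

When the site declares its encoding in the page source, passing that value explicitly is more reliable than any detection.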
3. Remove line breaks, spaces, and tab characters from the collected source before matching, so that differences in whitespace do not cause the regular expression to miss
// Method 1: replace with str_replace
$contents = str_replace("\r\n", '', $contents); // Remove Windows line breaks
$contents = str_replace("\n", '', $contents); // Remove Unix line breaks
$contents = str_replace("\t", '', $contents); // Remove tab characters
$contents = str_replace(" ", '', $contents); // Remove spaces
// Method 2: replace with a regular expression
$contents = preg_replace("/[\r\n\t ]+/", '', $contents);
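A small sketch of the two clean-up methods side by side, run on a made-up fragment ($raw and its contents are illustrative only). It shows that both approaches give the same result for this input.

```php
<?php
// Illustrative input: markup containing CRLF, LF, a tab and spaces.
$raw = "<ul>\r\n\t<li> item one </li>\n</ul>";

// Method 1: str_replace accepts an array of search strings.
$byReplace = str_replace(["\r\n", "\n", "\t", " "], '', $raw);

// Method 2: one character class covering CR, LF, tab and space.
$byRegex = preg_replace('/[\r\n\t ]+/', '', $raw);

echo $byRegex, "\n";              // <ul><li>itemone</li></ul>
var_dump($byReplace === $byRegex); // bool(true)
```

Stripping every space also removes spaces inside text and inside tags (for instance between a tag name and its attributes), so patterns applied afterwards have to be written against the compacted markup.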
4. Find the code segments you need with a regular expression, using preg_match_all to do the matching
Function signature:
int preg_match_all ( string pattern, string subject, array matches [ , int flags] )
pattern is the regular expression
subject is the text to be searched
matches is the array that receives the results
flags controls how the results are arranged in matches:
PREG_PATTERN_ORDER; // The default. matches is a two-dimensional array: $arr1[0] is the array of full matches (boundaries included), $arr1[1] is the array of strings captured by the first parenthesized group (boundaries excluded)
PREG_SET_ORDER; // matches is a two-dimensional array: $arr2[0][0] is the first full match (boundaries included), $arr2[0][1] is the first group of that match (boundaries excluded), and so on for later matches
PREG_OFFSET_CAPTURE; // matches is a three-dimensional array: $arr3[0][0][0] is the first full match (boundaries included) and $arr3[0][0][1] is its byte offset in subject; $arr3[1][0][0] is the first string captured by the first group (boundaries excluded) and $arr3[1][0][1] is its byte offset in subject
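To make the three layouts concrete, here is a minimal sketch that runs one pattern over one string with each flag. The subject and the pattern are made up for illustration; the variable names follow the descriptions above.

```php
<?php
$subject = '<li>one</li><li>two</li>';
$pattern = '/<li>(.*?)<\/li>/';

// Grouped by pattern part: [0] holds full matches, [1] holds group 1.
preg_match_all($pattern, $subject, $arr1, PREG_PATTERN_ORDER);
// $arr1[0] = ['<li>one</li>', '<li>two</li>'], $arr1[1] = ['one', 'two']

// Grouped by match: each element is [full match, group 1].
preg_match_all($pattern, $subject, $arr2, PREG_SET_ORDER);
// $arr2[0] = ['<li>one</li>', 'one'], $arr2[1] = ['<li>two</li>', 'two']

// Pattern order, but every string becomes [string, byte offset].
preg_match_all($pattern, $subject, $arr3, PREG_OFFSET_CAPTURE);
// $arr3[0][0] = ['<li>one</li>', 0], $arr3[1][1] = ['two', 16]

print_r($arr2);
```

PREG_SET_ORDER tends to be the most convenient layout when each match is processed as one record in a loop.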
//Application
preg_match_all('/
// $out receives all the matches (the indexing below is the PREG_SET_ORDER layout)
$out[0][0] is the entire first match, including the boundary text
$out[0][1] is only the part of the first match captured by the (.*?) in parentheses
// By analogy, the captured part of the nth match is
$out[n-1][1]
// If the regular expression has several parenthesized groups, the mth group of the nth match is
$out[n-1][m]
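The $out[n-1][m] indexing can be sketched with a pattern that has two groups. The markup and the pattern below are hypothetical stand-ins, not the ones from the original note, and PREG_SET_ORDER is assumed.

```php
<?php
// Hypothetical, already whitespace-stripped markup.
$contents = '<tr><td>PHP</td><td>8.3</td></tr><tr><td>Go</td><td>1.22</td></tr>';

// Two parenthesized groups: first cell and second cell of each row.
$count = preg_match_all(
    '/<td>(.*?)<\/td><td>(.*?)<\/td>/',
    $contents,
    $out,
    PREG_SET_ORDER
);

$n = 2; // second match
$m = 2; // second group
echo $out[$n - 1][$m], "\n"; // 1.22

// Walk every match as one record.
foreach ($out as $row) {
    echo $row[1], ' => ', $row[2], "\n";
}
```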
5. After obtaining the text you want, use PHP's built-in strip_tags function to remove any HTML tags from it
// Example
$result = strip_tags($out[0][1]);
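Putting the last two steps together, a short sketch that matches list items and then strips the tags left inside each captured segment. The input markup and the pattern are illustrative assumptions.

```php
<?php
// Illustrative input: list items that still contain inline tags.
$contents = '<li><a href="#">First</a></li><li><b>Second</b> item</li>';

preg_match_all('/<li>(.*?)<\/li>/', $contents, $out, PREG_SET_ORDER);

// Apply strip_tags to group 1 of every match.
$texts = array_map(
    function (array $match) {
        return strip_tags($match[1]);
    },
    $out
);

print_r($texts); // First, Second item
```

strip_tags removes the tags but leaves HTML entities such as &amp; in place, so a further decoding step may be needed depending on the source.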

