
Some notes on processing collected (scraped) data with preg_match_all: encoding conversion and regular expression matching

Jul 13, 2016 am 10:39 AM

1. Use curl to collect pages from a remote site

Please refer to my last note for details: http://www.jb51.net/article/46432.htm
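
As a rough illustration of this step, a minimal fetch with PHP's curl extension might look like the sketch below. The function name fetch_page and the timeout value are assumptions made for this example, not taken from the note linked above.

```php
<?php
// Hypothetical helper: download a page and return its raw body as a string.
function fetch_page(string $url, int $timeout = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl error: ' . curl_error($ch));
    }
    return $body;
}

// Usage (placeholder URL):
// $contents = fetch_page('http://example.com/list.html');
```

The returned string is still in the remote site's encoding, which is why the conversion in step 2 comes next.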

2. Encoding conversion
First, find the encoding used by the target website by viewing its page source (for example, the charset declared in a <meta> tag), then convert the content with the mb_convert_encoding function.

Specific usage:

The code is as follows:

//The source string is $str

//When the original encoding is known to be GBK, convert it to UTF-8
$str = mb_convert_encoding($str, "UTF-8", "GBK");

//When the original encoding is unknown, let "auto" detect it, then convert to UTF-8
//("auto" expands to a candidate list that depends on the mbstring.language setting, so detection is not guaranteed)
$str = mb_convert_encoding($str, "UTF-8", "auto");
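
Looking up the encoding by hand can also be automated. The sketch below reads the charset declared in the page's <meta> tag and falls back to a default when none is found. The helper names detect_declared_charset and to_utf8 are hypothetical, and the sketch assumes the declared charset is a name that mbstring recognises (such as GBK or GB2312).

```php
<?php
// Hypothetical helper: return the charset declared in a <meta> tag, or $default.
function detect_declared_charset(string $html, string $default = 'UTF-8'): string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return $default;
}

// Hypothetical helper: convert a collected page to UTF-8 using the declared charset.
function to_utf8(string $html, string $default = 'GBK'): string
{
    $charset = detect_declared_charset($html, $default);
    if ($charset === 'UTF-8') {
        return $html; // already UTF-8, nothing to convert
    }
    return mb_convert_encoding($html, 'UTF-8', $charset);
}

// "\xD6\xD0\xCE\xC4" is the GBK byte sequence for two Chinese characters.
$page = "<meta charset=\"gbk\"><p>\xD6\xD0\xCE\xC4</p>";
$utf8 = to_utf8($page);
```

After conversion the <meta> tag still says gbk; that only matters if the page is later re-served as HTML.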

3. To keep line breaks, spaces, and tabs from interfering with matching, first remove them from the collected source code

The code is as follows:

//Method 1: use str_replace
$contents = str_replace("\r\n", '', $contents); //Remove Windows-style line breaks
$contents = str_replace("\n", '', $contents); //Remove line feeds
$contents = str_replace("\t", '', $contents); //Remove tab characters
$contents = str_replace(" ", '', $contents); //Remove spaces

//Method 2: use a regular expression
$contents = preg_replace("/(\r\n|\n|\t| )+/", '', $contents);
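
To see what this clean-up does to real markup, here is a small sketch on a made-up HTML fragment. It also shows one side effect worth knowing about: removing every space also removes the space between a tag name and its attributes, so patterns written for the cleaned text have to account for that. Collapsing whitespace to a single space is shown as one possible alternative; it is not part of the method described above.

```php
<?php
// Made-up sample input for illustration.
$contents = "<div>\r\n\t<h2> Title </h2>\n</div>";

// Remove line breaks, tabs and spaces entirely, as in Method 2.
$stripped = preg_replace('/[\r\n\t ]+/', '', $contents);
// $stripped is '<div><h2>Title</h2></div>'

// Side effect: spaces inside tags are removed too.
$link = preg_replace('/[\r\n\t ]+/', '', '<a href="/x">go</a>');
// $link is '<ahref="/x">go</a>'

// Alternative (an assumption, not the article's method): collapse runs of whitespace to one space.
$collapsed = preg_replace('/\s+/', ' ', $contents);
// $collapsed is '<div> <h2> Title </h2> </div>'
```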

4. Find the code segment you need with a regular expression, using preg_match_all to perform the matching

The code is as follows:

Function explanation:
int preg_match_all ( string pattern, string subject, array matches [, int flags] )
pattern is the regular expression
subject is the original text to be searched
matches is the array that receives the results
flags controls how the results are arranged in matches:
PREG_PATTERN_ORDER; //matches is a two-dimensional array: $arr1[0] is the array of full matches (boundaries included), $arr1[1] is the array of strings captured by the first parenthesized group (boundaries excluded)
PREG_SET_ORDER; //matches is a two-dimensional array: $arr2[0][0] is the first full match (boundaries included), $arr2[0][1] is the first match with the boundaries removed, and so on for later matches
PREG_OFFSET_CAPTURE; //matches is a three-dimensional array: $arr3[0][0][0] is the first full match (boundaries included) and $arr3[0][0][1] is its byte offset in subject; $arr3[1][0][0] is the first match with the boundaries removed and $arr3[1][0][1] is its byte offset; and so on
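
The three layouts are easier to compare side by side. The sketch below runs the same pattern with each flag on a made-up string; the <h2> tags and the sample text are assumptions chosen only for illustration.

```php
<?php
// Made-up sample input and pattern for illustration.
$html    = '<h2>First</h2><h2>Second</h2>';
$pattern = '/<h2>(.*?)<\/h2>/';

// Grouped by capture group: $arr1[0] holds full matches, $arr1[1] holds group 1.
preg_match_all($pattern, $html, $arr1, PREG_PATTERN_ORDER);
// $arr1[0] is ['<h2>First</h2>', '<h2>Second</h2>'], $arr1[1] is ['First', 'Second']

// Grouped by match: $arr2[k] holds everything about the (k+1)th match.
preg_match_all($pattern, $html, $arr2, PREG_SET_ORDER);
// $arr2[0] is ['<h2>First</h2>', 'First'], $arr2[1] is ['<h2>Second</h2>', 'Second']

// Each entry becomes a [string, byte offset] pair.
preg_match_all($pattern, $html, $arr3, PREG_OFFSET_CAPTURE);
// $arr3[0][0] is ['<h2>First</h2>', 0], $arr3[1][0] is ['First', 4], $arr3[1][1] is ['Second', 18]
```

PREG_SET_ORDER tends to be the most convenient layout when looping over matches one at a time, which is the form used in the application that follows.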

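//Application (the <h2>...</h2> boundary here is only an illustrative example; use the tags that surround your target content)
preg_match_all('/<h2>(.*?)<\/h2>/', $contents, $out, PREG_SET_ORDER);
//$out receives all matches
//$out[0][0] is the first full match, including the boundary tags
//$out[0][1] is only the text matched by (.*?) inside the parentheses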

//By analogy, the text captured in the nth match is
$out[n-1][1]

//If the regular expression contains several parenthesized groups, the mth group of the nth match is
$out[n-1][m]
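
The $out[n-1][m] indexing can be checked with a pattern that has two groups. In the sketch below, the sample links and the array keys url and text are hypothetical; the string is used as-is, without the whitespace removal from step 3.

```php
<?php
// Made-up sample input: two links, each with a URL (group 1) and a label (group 2).
$contents = '<a href="/a.html">Alpha</a><a href="/b.html">Beta</a>';

preg_match_all('/<a href="(.*?)">(.*?)<\/a>/', $contents, $out, PREG_SET_ORDER);

// Second match (n = 2), second group (m = 2): $out[1][2] is 'Beta'.
$secondLabel = $out[1][2];

// Walk all matches and collect both groups.
$links = [];
foreach ($out as $match) {
    $links[] = ['url' => $match[1], 'text' => $match[2]];
}
```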

5. After obtaining the text you want, remove any HTML tags with PHP's built-in strip_tags function

The code is as follows:

//Example
$result = strip_tags($out[0][1]);
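
A short sketch of what strip_tags returns may help here. The fragment is made up; the second call uses the function's optional allowed-tags argument, and the last line shows that HTML entities are left untouched unless decoded separately.

```php
<?php
// Made-up sample fragment for illustration.
$fragment = '<p>Hello <b>world</b>!</p>';

// Remove every tag.
$result = strip_tags($fragment);
// $result is 'Hello world!'

// Keep <b> tags and remove the rest.
$keepBold = strip_tags($fragment, '<b>');
// $keepBold is 'Hello <b>world</b>!'

// strip_tags does not decode entities; html_entity_decode handles that step.
$text = html_entity_decode(strip_tags('<p>Tom &amp; Jerry</p>'), ENT_QUOTES, 'UTF-8');
// $text is 'Tom & Jerry'
```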
