


Using the regex processing function get_matches for curl-based data collection
Based on the previous two blog posts:
Usage of single page collection function get_html based on curl data collection
Usage of single page parallel collection function get_htmls based on curl data collection
We now have the HTML we need. The next step is to process it and extract the data we want to collect.
HTML has no parser class comparable to the ones available for XML, because HTML documents often contain unpaired tags and are not strictly well-formed. You therefore need some other helper class. simplehtmldom is a parsing class that operates on HTML documents in a jQuery-like way. It makes it very convenient to get the data you want, but unfortunately it is slow. It is not the focus of this post. I mainly use regular expressions to match the data I need to collect, which gets me the information quickly.
get_html can check the data it returns, but get_htmls cannot, so I wrote the following two functions to make debugging and calling easier:
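As a small illustration of the regex approach described above, the sketch below pulls a page title out of markup that a strict XML parser would reject. The HTML string and the title pattern are hypothetical and only stand in for a collected page.

```php
<?php
// Hypothetical, deliberately sloppy HTML: the <p> and <br> tags are never closed.
$html = '<html><head><title>Example page</title></head>'
      . '<body><p>First line<br>Second line</body></html>';

// "!" is used as the delimiter, so "/" inside the pattern needs no escaping.
// The "i" modifier ignores tag case, and "s" lets "." span line breaks.
$title = null;
if (preg_match('!<title>(.*?)</title>!is', $html, $m)) {
    $title = trim($m[1]); // group 1 holds the text between the tags
}

echo $title, "\n";
```

The unclosed tags do not matter here, because the pattern only looks at the text around the title element.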
function get_matches($pattern, $html, $err_msg, $multi = false, $flags = 0, $offset = 0) {
    if (!$multi) {
        // Single match: preg_match returns 0 on no match and false on error
        if (!preg_match($pattern, $html, $matches, $flags, $offset)) {
            echo $err_msg . "! Error message: " . get_preg_err_msg() . "\n";
            return false;
        }
    } else {
        // All matches: preg_match_all fills $matches with every occurrence
        if (!preg_match_all($pattern, $html, $matches, $flags, $offset)) {
            echo $err_msg . "! Error message: " . get_preg_err_msg() . "\n";
            return false;
        }
    }
    return $matches;
}

function get_preg_err_msg() {
    // Translate the code of the last PCRE error into its constant name
    $error_code = preg_last_error();
    switch ($error_code) {
        case PREG_NO_ERROR:
            $err_msg = 'PREG_NO_ERROR';
            break;
        case PREG_INTERNAL_ERROR:
            $err_msg = 'PREG_INTERNAL_ERROR';
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            $err_msg = 'PREG_BACKTRACK_LIMIT_ERROR';
            break;
        case PREG_RECURSION_LIMIT_ERROR:
            $err_msg = 'PREG_RECURSION_LIMIT_ERROR';
            break;
        case PREG_BAD_UTF8_ERROR:
            $err_msg = 'PREG_BAD_UTF8_ERROR';
            break;
        case PREG_BAD_UTF8_OFFSET_ERROR:
            $err_msg = 'PREG_BAD_UTF8_OFFSET_ERROR';
            break;
        default:
            return 'Unknown error !';
    }
    return $err_msg . ': ' . $error_code;
}
They can be called like this:
$url = 'http://www.baidu.com';
$html = get_html($url);
$matches = get_matches('!!', $html, 'No link found', true);
if ($matches) {
    var_dump($matches);
}
Or like this, for several pages:
$urls = array('http://www.baidu.com', 'http://www.hao123.com');
$htmls = get_htmls($urls);
foreach ($htmls as $html) {
    $matches = get_matches('!!', $html, 'No link found', true);
    if ($matches) {
        var_dump($matches);
    }
}
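Since get_matches appears to hand back the $matches array filled by preg_match or preg_match_all, the shape of the result depends on $multi. The sketch below calls the two PHP functions directly to show the difference. The HTML fragment and the link pattern are assumptions made for illustration and are not taken from the original post.

```php
<?php
// Hypothetical HTML fragment standing in for a collected page.
$html = '<a href="/a">First</a> <a href="/b">Second</a>';

// Assumed link pattern: group 1 is the href, group 2 is the link text.
$pattern = '!<a\s+href="([^"]*)"[^>]*>(.*?)</a>!is';

// Corresponds to $multi = false: only the first match is stored.
// $single[0] is the whole match, $single[1] the href, $single[2] the text.
preg_match($pattern, $html, $single);

// Corresponds to $multi = true: with the default PREG_PATTERN_ORDER,
// $all[1] lists every href and $all[2] every link text.
preg_match_all($pattern, $html, $all);

print_r($single);
print_r($all[1]);
print_r($all[2]);
```

Passing PREG_SET_ORDER as $flags would instead group the results per link, which can be easier to loop over when each link needs its href and text together.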
Either way you get the information you need. Whether you collect a single page or several pages, in the end PHP still processes one page at a time. Because get_matches returns either the matches or false, the return value can be tested before the data is used. While using regular expressions I ran into the backtrack limit being exceeded, so I added get_preg_err_msg to report the regex error.
When collecting data, you usually collect a list page first and then collect the content pages from the links found on the list page, sometimes across even more levels. This leads to many nested loops, and the code becomes hard to control. So can we separate the code that collects list pages from the code that collects content pages (or deeper levels), or even simplify the loops?
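The backtrack-limit failure mentioned above can be reproduced in a few lines. This is a minimal sketch under two assumptions that are not part of the original post: JIT is switched off, and pcre.backtrack_limit is lowered so the failure is cheap to trigger. The pattern and subject are chosen only because they force heavy backtracking.

```php
<?php
// Assumption: JIT is disabled for this demo, because the JIT engine
// applies the match limit differently from the interpreter.
ini_set('pcre.jit', '0');
// Assumption: a deliberately low limit, far below PHP's default.
ini_set('pcre.backtrack_limit', '1000');

// Nested quantifiers plus a subject without "!" or "?" make the engine
// try a very large number of ways to split the string before giving up.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

// preg_match returns false on an error and 0 on an ordinary "no match".
$hit_limit = ($result === false
    && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR);

var_dump($result);
var_dump($hit_limit);
```

Without a check of preg_last_error, this failure would look the same as a page that simply contains no match, which is the reason for reporting the error code alongside the caller's message.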
