Parse links in HTML using PHP

王林

Released: 2023-06-14 13:10:01

As the Internet grows, websites keep increasing in number and size. To improve accessibility and user experience, a web page often carries a large number of links. For sites that need batch processing, checking and modifying those links by hand is tedious and error-prone, so parsing the links in HTML with PHP is a faster and more reliable approach.

1. Get the HTML file

First, we need to load the HTML file to be processed. PHP offers several ways to read a file, such as the file_get_contents function or a combination of fopen and fread. Here we use file_get_contents.

$filename = 'example.html';
$html = file_get_contents($filename);
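
file_get_contents returns false when the file cannot be read, so it may be worth checking the result before parsing. The following is a minimal sketch; the helper name loadHtmlFile is an assumption for illustration and is not part of the original code.

```php
<?php
// Hypothetical helper: read an HTML file or fail loudly.
function loadHtmlFile(string $filename): string
{
    // Check first so that file_get_contents does not emit a warning.
    if (!is_readable($filename)) {
        throw new RuntimeException("Cannot read file: $filename");
    }
    $html = file_get_contents($filename);
    if ($html === false) {
        throw new RuntimeException("Failed to load file: $filename");
    }
    return $html;
}
```

With this helper, the read step becomes $html = loadHtmlFile('example.html'); and a missing file raises an exception instead of passing false to the parser.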

2. Parse the links in the HTML file

With the HTML loaded, we need to extract the links in it as accurately as possible. We can do this with regular expressions or with PHP's built-in DOM parser.

Regular expression to extract links

To extract links with a regular expression, we first need to understand the basic structure of a link in an HTML page. Generally, a link is an a tag wrapped around some text content, with the following basic structure:

<a href="link address">Link text content</a>

Therefore, we can match all links with a regular expression. The specific code is as follows:

$regexp = '/<a\s[^>]*href=[\'"]?([^\'" >]+)/i';
preg_match_all($regexp, $html, $match);
$link = array_unique($match[1]);

The code above uses the regular expression <a\s[^>]*href=['"]?([^'" >]+) to match each a tag and capture the link address in its href attribute. In it, [^>]* skips any other attributes inside the tag, and [^'" >]+ matches a run of characters that contains no single quote, double quote, space, or >. Finally, the array_unique function removes duplicate link addresses.
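
To see how such a pattern behaves, here is a small sketch run against a made-up HTML snippet. The function name extractLinksWithRegex and the sample markup are assumptions for illustration. Like any regex approach, it can be fooled by unusual markup, so treat it as a quick tool rather than a full parser.

```php
<?php
// Hypothetical helper wrapping the regular-expression approach.
function extractLinksWithRegex(string $html): array
{
    // Match "<a ... href=" and capture the value up to a quote, space, or ">".
    $regexp = '/<a\s[^>]*href=[\'"]?([^\'" >]+)/i';
    preg_match_all($regexp, $html, $match);
    // array_unique keeps the original keys, so reindex with array_values.
    return array_values(array_unique($match[1]));
}

// Made-up sample input: one duplicate link and one unquoted href.
$sample = '<p><a href="https://example.com/a">A</a> '
        . "<a class='x' href='https://example.com/b'>B</a> "
        . '<a href="https://example.com/a">A again</a> '
        . '<a href=/c>C</a></p>';

print_r(extractLinksWithRegex($sample));
```

The sample contains four a tags but only three distinct addresses, so the returned array has three entries.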

Use DOM parser to extract links

PHP’s built-in DOM parser provides a more convenient and accurate way to parse links in HTML files. It can convert HTML pages into a Document Object Model (DOM) tree structure, so that the document tree can be traversed to query and extract information.

The specific code is as follows:

$doc = new DOMDocument();
$doc->loadHTML($html);
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
    $href = $link->getAttribute('href');
}

In the code above, we first use DOMDocument to convert the $html string into a Document Object Model, then fetch every a tag with the getElementsByTagName('a') method, iterate over the tags, and read the value of each href attribute.
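
Real-world pages are often not valid HTML, and loadHTML reports each problem it finds as a warning. One common way to keep the output quiet is libxml_use_internal_errors. The sketch below assumes a hypothetical helper named extractLinksWithDom and collects the href values into an array.

```php
<?php
// Hypothetical helper wrapping the DOM approach.
function extractLinksWithDom(string $html): array
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        // Skip anchors without an href, such as <a name="top">.
        if ($link->hasAttribute('href')) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    // Deduplicate and reindex the result.
    return array_values(array_unique($hrefs));
}
```

Unlike the regular expression, this version only returns attributes that the parser actually recognized on a elements.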

3. Process the links

After obtaining all the links, we need to process these links. The specific processing method depends on the needs. The following are some common processing methods:

Replacement

Sometimes we need to modify part of every link in bulk, for example to remove the http:// prefix from the links. The str_replace function can do this string replacement.

foreach ($links as $link) {
    $href = $link->getAttribute('href');
    $new_href = str_replace('http://', '', $href);
    $link->setAttribute('href', $new_href);
}
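
Note that str_replace removes every occurrence of http://, including one that appears later in the link, for example inside a redirect parameter. If only the leading prefix should go, a sketch along these lines may help; the helper name stripHttpPrefix is hypothetical.

```php
<?php
// Hypothetical helper: remove "http://" only when it starts the link.
function stripHttpPrefix(string $href): string
{
    $prefix = 'http://';
    // Compare the first characters case-insensitively.
    if (strncasecmp($href, $prefix, strlen($prefix)) === 0) {
        return substr($href, strlen($prefix));
    }
    return $href;
}
```

Inside the loop, this would replace the str_replace call: $new_href = stripHttpPrefix($href);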

Addition

Sometimes we need to append a specific string or parameter to every link, such as adding a utm_campaign=xxx parameter to the end of each one. This can be done with string concatenation.

foreach ($links as $link) {
    $href = $link->getAttribute('href');
    $new_href = $href . '?utm_campaign=xxx';
    $link->setAttribute('href', $new_href);
}
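
Plain concatenation gives a wrong result when a link already has a query string (it would end up with two ? characters) or a fragment such as #top (the parameter would land inside the fragment). The following sketch handles both cases; the helper name addQueryParam is an assumption for illustration.

```php
<?php
// Hypothetical helper: append a query parameter to a link.
function addQueryParam(string $href, string $name, string $value): string
{
    // Split off the fragment (#...) so the parameter lands before it.
    $fragment = '';
    $hashPos = strpos($href, '#');
    if ($hashPos !== false) {
        $fragment = substr($href, $hashPos);
        $href = substr($href, 0, $hashPos);
    }
    // Use "&" when the link already carries a query string.
    $separator = (strpos($href, '?') === false) ? '?' : '&';
    return $href . $separator
        . rawurlencode($name) . '=' . rawurlencode($value)
        . $fragment;
}
```

Inside the loop, the concatenation line would then read $new_href = addQueryParam($href, 'utm_campaign', 'xxx');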

Filtering

Sometimes we need to filter out certain links, such as advertising links. An if statement can test each link and remove the ones that match.

// Copy the live node list so that removing nodes does not skip any links.
foreach (iterator_to_array($links) as $link) {
    $href = $link->getAttribute('href');
    if (strpos($href, 'ad.') !== false) {
        $link->parentNode->removeChild($link);
    }
}
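
A substring test such as 'ad.' also matches unrelated links; a host like upload.example.com contains the same characters. One possible refinement is to compare the hostname against a blocklist. In the sketch below, the helper isAdLink, the sample markup, and the blocked host ad.example.com are all made up for illustration.

```php
<?php
// Hypothetical helper: decide whether a link points at a blocked host.
function isAdLink(string $href, array $blockedHosts): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    if (!is_string($host)) {
        // Relative links and unparsable values have no host to check.
        return false;
    }
    return in_array(strtolower($host), $blockedHosts, true);
}

// Made-up sample document and blocklist.
$doc = new DOMDocument();
$doc->loadHTML('<p><a href="http://ad.example.com/x">Ad</a>'
    . '<a href="http://upload.example.com/y">Upload</a></p>');

// Copy the live node list first so removals do not disturb the loop.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $link) {
    if (isAdLink($link->getAttribute('href'), ['ad.example.com'])) {
        $link->parentNode->removeChild($link);
    }
}

// Print how many links remain after filtering.
echo $doc->getElementsByTagName('a')->length, PHP_EOL;
```

Here only the link to ad.example.com is removed, while the upload.example.com link survives even though its address contains "ad.".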

4. Save the HTML file

After processing all the links, we need to save the result to an HTML file. Mirroring the read step, we use the file_put_contents function to write the file.

$filename_new = 'example_new.html';
$html_new = $doc->saveHTML();
file_put_contents($filename_new, $html_new);
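
Both saveHTML and file_put_contents return false on failure, so a batch job may want to check them before moving on to the next file. A minimal sketch, assuming a hypothetical helper named saveHtmlFile:

```php
<?php
// Hypothetical helper: serialize the document and write it to disk.
function saveHtmlFile(DOMDocument $doc, string $filename): int
{
    $html = $doc->saveHTML();
    if ($html === false) {
        throw new RuntimeException('Could not serialize the document');
    }
    // LOCK_EX takes an exclusive lock on the file while writing.
    $bytes = file_put_contents($filename, $html, LOCK_EX);
    if ($bytes === false) {
        throw new RuntimeException("Could not write file: $filename");
    }
    // Return the number of bytes written.
    return $bytes;
}
```

The save step then becomes saveHtmlFile($doc, 'example_new.html'); which returns the number of bytes written.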

In summary, using PHP to parse links in HTML is an efficient and convenient way to process pages in bulk. Extract the links with regular expressions or the DOM parser, process them, and save the result to an HTML file to update a large number of links quickly.
