How to Efficiently Ignore HTML Tags During Regular Expression Replacement?

Mary-Kate Olsen
Release: 2024-11-12 06:24:02

Ignoring HTML Tags in Regular Expression Replacement

Regular expressions are a poor fit for tasks that depend on HTML structure, such as replacing text while leaving tags untouched, because a pattern sees markup and text as one flat string. For such scenarios it is generally better to parse the document and work on its text nodes with PHP's DOMDocument and DOMXPath classes.
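
A minimal sketch of the problem, using the sample markup from the example later in this article. The `<b>` wrapper and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
// Sample markup: the term "span" occurs both as a tag name and as plain text.
$html = '<html><body>This is some <span>text</span> that span</body></html>';

// Naive replacement: the pattern cannot tell markup from text,
// so it also rewrites the <span> and </span> tags themselves.
$naive = preg_replace('/span/', '<b>$0</b>', $html);

// DOM view: textContent exposes only the character data, without any tags.
$doc = new DOMDocument();
$doc->loadXML($html);
$text = $doc->documentElement->textContent;

echo $naive, PHP_EOL;
echo $text, PHP_EOL;
```

The first line of output shows the tags being corrupted, while the second shows the tag-free text that a DOM-based search operates on.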

DOMXPath-Based Approach

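To ignore HTML tags while performing replacements, DOMXPath can be used to locate the elements whose text should be searched. For example, the following query finds every element whose text content contains the search term "apple span" and that has at least one child element which does not contain it. The query returns elements, not text nodes; the text nodes are collected from each result in a second step: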

//*[contains(., "apple span")]/*[not(contains(., "apple span"))]/..
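
As a rough illustration of what such a query selects, the sketch below runs it against a small made-up document using the standard XPath `not(contains(...))` form. The sample markup and the `$term` variable are assumptions for demonstration only. Note that an ancestor is selected as well whenever it has some other child element lacking the term, so here both `body` and the first `p` should be returned.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<html><body><p>red <b>apple</b> span</p><p>other</p></body></html>');
$xp = new DOMXPath($doc);

// Assumption: the term contains no double quote, since it is inlined into the query.
$term = 'apple span';
$query = sprintf('//*[contains(., "%1$s")]/*[not(contains(., "%1$s"))]/..', $term);

// Collect the names of the selected elements.
$names = [];
foreach ($xp->query($query) as $element) {
    $names[] = $element->nodeName;
}

echo implode(', ', $names), PHP_EOL;
```

Because nested results overlap, code that modifies the document for each result may need to skip elements that an earlier result already covered.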

Creating a TextRange Class

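Next, a custom TextRange class is created to represent a list of DOM text nodes. TextRange is not built into PHP; it has to be written by hand. It lets string operations such as searching and splitting be performed on the text nodes as if they were a single string, even when the nodes sit inside different elements.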
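
The article does not show the class itself, so the following is only one possible sketch of it. The method names `split()` and `getNodes()` are taken from the sample code below; their exact behaviour here is an assumption inferred from how that code calls them. Offsets are byte offsets, so this sketch assumes single-byte (ASCII) text.

```php
<?php
// Hypothetical TextRange: treats a list of DOMText nodes as one string.
final class TextRange
{
    /** @var DOMText[] */
    private array $nodes = [];

    /** @param iterable $nodes text nodes in document order */
    public function __construct(iterable $nodes)
    {
        foreach ($nodes as $node) {
            if ($node instanceof DOMText) {
                $this->nodes[] = $node;
            }
        }
    }

    /** Concatenated text of all nodes, so string functions can search across tags. */
    public function __toString(): string
    {
        $text = '';
        foreach ($this->nodes as $node) {
            $text .= $node->data;
        }
        return $text;
    }

    /** @return DOMText[] */
    public function getNodes(): array
    {
        return $this->nodes;
    }

    /**
     * Cut the range at $offset: this range keeps the text before the offset,
     * and the returned range covers the text from the offset to the end.
     * A text node that straddles the offset is divided with DOMText::splitText().
     */
    public function split(int $offset): TextRange
    {
        $head = [];
        $tail = [];
        foreach ($this->nodes as $node) {
            $length = strlen($node->data);
            if ($offset <= 0) {
                $tail[] = $node;                      // already past the cut
            } elseif ($offset >= $length) {
                $head[] = $node;                      // entirely before the cut
                $offset -= $length;
            } else {
                $tail[] = $node->splitText($offset);  // cut falls inside this node
                $head[] = $node;
                $offset = 0;
            }
        }
        $this->nodes = $head;
        return new TextRange($tail);
    }
}

// Usage: the word "apple" is interrupted by a <b> tag in the markup.
$doc = new DOMDocument();
$doc->loadXML('<p>ap<b>ple</b> pie</p>');
$xp = new DOMXPath($doc);

$range = new TextRange($xp->query('//text()'));
$whole = (string) $range;   // text of all nodes, tags ignored
$tail = $range->split(5);   // $range keeps the first 5 bytes, $tail the rest

echo $whole, PHP_EOL;
echo $range, '|', $tail, PHP_EOL;
```

Splitting mutates the DOM only where a cut falls inside a text node; cuts on node boundaries just redistribute the existing nodes between the two ranges.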

Processing the Search Results

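For each match, a new element (a span in the example) is created for every text node in the matched range, and the text node is moved inside it. Because each text node is wrapped separately, a match that crosses element boundaries is highlighted without changing the existing tags or their nesting.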
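
The wrapping step on its own can be sketched as follows. The helper name `wrapTextNode`, its parameters, and the `highlight` class value are hypothetical choices for illustration.

```php
<?php
/**
 * Wrap a single text node in a new element (hypothetical helper).
 */
function wrapTextNode(DOMText $text, string $tag, string $class): DOMElement
{
    $wrapper = $text->ownerDocument->createElement($tag);
    $wrapper->setAttribute('class', $class);

    // replaceChild() puts the wrapper where the text node was
    // and returns the detached text node, which is then re-attached inside it.
    $detached = $text->parentNode->replaceChild($wrapper, $text);
    $wrapper->appendChild($detached);

    return $wrapper;
}

$doc = new DOMDocument();
$doc->loadXML('<p>plain <b>bold</b></p>');

// Wrap the text inside <b>; the <b> element itself is left untouched.
$boldText = $doc->getElementsByTagName('b')->item(0)->firstChild;
wrapTextNode($boldText, 'span', 'highlight');

$result = $doc->saveXML($doc->documentElement);
echo $result, PHP_EOL;
```

Wrapping node by node keeps the new elements inside whatever parent each text node already had, which is what preserves the original nesting.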

Example

The following sample code demonstrates this approach:

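$doc = new DOMDocument;
$doc->loadXML('<html><body>This is some <span>text</span> that span</body></html>');
$xp = new DOMXPath($doc);

$search = 'span';

// The query starts with //, so it is evaluated from the document root
// regardless of the context node passed as the second argument.
$anchor = $doc->getElementsByTagName('body')->item(0);
$r = $xp->query('//*[contains(., "span")]/*[not(contains(., "span"))]/..', $anchor);

foreach ($r as $element)
{
    // Collect all descendant text nodes of the matching element.
    // TextRange is the custom class described above, not a PHP built-in.
    $textNodes = $xp->query('.//text()', $element);
    $range = new TextRange($textNodes);
    while (FALSE !== $start = strpos((string) $range, $search))
    {
        // Assumed split() behaviour: the receiver keeps the text before the
        // offset, the returned range holds the text from the offset onward.
        $base = $range->split($start);           // match plus everything after it
        $range = $base->split(strlen($search));  // $base: the match, $range: the rest
        foreach ($base->getNodes() as $textNode)
        {
            // Wrap each text node of the match in its own highlight span.
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_highlight');
            $textNode = $textNode->parentNode->replaceChild($span, $textNode);
            $span->appendChild($textNode);
        }
    }
}

echo $doc->saveXML(); // Output the modified XML with highlighted text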

Because the search runs on text nodes rather than on the raw markup, tag names and attributes are never matched, and the document remains well-formed after the replacement.

source:php.cn