How can I extract and categorize text data from an HTML document based on specific element classes using PHP?-PHP Tutorial-php.cn

Mary-Kate Olsen

Release: 2024-11-12 15:48:01

Retrieve Text from Elements with Specified Class as a Comprehensive Array

The task is to extract and categorize text from an HTML document based on element classes. The document consists of paragraphs with classes such as "Heading1-P" and "Normal-P"; each paragraph wraps a span ("Heading1-H" or "Normal-H") that holds either a heading or its body text.

To accomplish this, we can use PHP's DOMDocument and DOMXPath classes: DOMDocument parses the HTML, and DOMXPath queries the resulting tree. We define a custom function, parseToArray(), that takes a DOMXPath object and a class name. It finds every element whose class attribute equals that name and collects the text of its child nodes into an array.

Here's the detailed solution:

<?php

$test = <<<HTML
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 1</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 1</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 2</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 2</span>
</p>
<p class="Heading1-P">
    <span class="Heading1-H">Chapter 3</span>
</p>
<p class="Normal-P">
    <span class="Normal-H">This is chapter 3</span>
</p>
HTML;

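// Parse the markup and build an XPath context over the resulting tree.
$dom = new DOMDocument();
$dom->loadHTML($test);
$xpath = new DOMXPath($dom);

// Collect the heading spans and the body-text spans separately.
$heading = parseToArray($xpath, 'Heading1-H');
$content = parseToArray($xpath, 'Normal-H');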

var_dump($heading);
echo "<br/>";
var_dump($content);
echo "<br/>";

function parseToArray(DOMXPath $xpath, string $class): array
{
    // Match any element (*) whose class attribute is exactly $class.
    $xpathquery = "//*[@class='$class']";
    $elements = $xpath->query($xpathquery);

    $resultarray = [];
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            $resultarray[] = $node->nodeValue;
        }
    }

    return $resultarray;
}

parseToArray() selects elements by class name and collects their text into an array. The two calls produce $heading, which holds the chapter titles, and $content, which holds the corresponding paragraph text. The var_dump() output looks like this (the <br/> separators are omitted):

array(3) {
  [0] =>
  string(9) "Chapter 1"
  [1] =>
  string(9) "Chapter 2"
  [2] =>
  string(9) "Chapter 3"
}
array(3) {
  [0] =>
  string(17) "This is chapter 1"
  [1] =>
  string(17) "This is chapter 2"
  [2] =>
  string(17) "This is chapter 3"
}
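
One limitation of comparing the whole class attribute is that an element carrying several classes, such as class="Heading1-H highlight", is not matched. The sketch below shows one common way around that: a token-aware XPath test. The helper name parseByClassToken() and the extra "highlight" class are assumptions made for illustration and are not part of the original solution. The sketch also assumes the class name contains no quote characters.

```php
<?php

// Hypothetical helper (not from the original solution): matches elements whose
// class attribute contains $class as one whitespace-separated token.
function parseByClassToken(DOMXPath $xpath, string $class): array
{
    // Pad the normalized class list with spaces so that ' Heading1-H '
    // can only match a complete class token, never a substring of one.
    $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $class ')]";

    $result = [];
    foreach ($xpath->query($query) as $element) {
        // textContent gathers all descendant text; trim() drops surrounding whitespace.
        $result[] = trim($element->textContent);
    }

    return $result;
}

// Sample markup; the second class on the first span is an assumption for the demo.
$html = <<<HTML
<p class="Heading1-P"><span class="Heading1-H highlight">Chapter 1</span></p>
<p class="Normal-P"><span class="Normal-H">This is chapter 1</span></p>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$headings = parseByClassToken($xpath, 'Heading1-H');
print_r($headings);
```

Because this variant reads textContent from the matched element itself, it returns one string per element even when the span contains nested markup, whereas iterating over childNodes returns one entry per child node.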

With this approach, you can retrieve text from an HTML document by class name and keep each category in its own array, ready for separate processing.
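
If each heading should be tied to its body text, the two arrays can be merged into a single associative array. The following is a minimal sketch, assuming every heading has exactly one content entry and that headings are unique (duplicate headings would overwrite each other as array keys). The literal arrays stand in for the $heading and $content values produced earlier, and the $chapters name is an assumption.

```php
<?php

// Stand-ins for the $heading and $content arrays returned by parseToArray().
$heading = ['Chapter 1', 'Chapter 2', 'Chapter 3'];
$content = ['This is chapter 1', 'This is chapter 2', 'This is chapter 3'];

// array_combine() fails when the arrays differ in length (ValueError in PHP 8,
// false with a warning in PHP 7), so check the counts explicitly first.
if (count($heading) !== count($content)) {
    throw new LengthException('Each heading needs exactly one content entry.');
}

// Use the headings as keys and the body text as values.
$chapters = array_combine($heading, $content);

foreach ($chapters as $title => $text) {
    echo $title, ': ', $text, PHP_EOL;
}
```

This relies on document order: the first heading is paired with the first content entry, and so on, which holds here because heading and body paragraphs strictly alternate in the sample HTML.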
