Try Simple HTML DOM Parser.
Note: As the name suggests, it is useful for simple tasks. It uses regular expressions instead of an HTML parser, so it will be considerably slower for more complex tasks. Most of its codebase was written in 2008, with only minor improvements made since then. It does not follow modern PHP coding standards and is difficult to incorporate into modern PSR-compliant projects.
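Example:
How to get HTML elements:
    // Load the library (adjust the path to wherever simple_html_dom.php lives)
    require_once 'simple_html_dom.php';
    // Create DOM from URL or file
    $html = file_get_html('http://www.example.com/');
    // Find all images
    foreach ($html->find('img') as $element) {
        echo $element->src . '<br>';
    }
    // Find all links
    foreach ($html->find('a') as $element) {
        echo $element->href . '<br>';
    }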
How to modify HTML elements:
    // Create DOM from string
    $html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');
    // Add a class to the second div (index 1) and replace the content of div#hello
    $html->find('div', 1)->class = 'bar';
    $html->find('div[id=hello]', 0)->innertext = 'foo';
    echo $html;
Extract content from HTML:
    // Dump contents (without tags) from HTML
    echo file_get_html('http://www.google.com/')->plaintext;
Grab Slashdot:
    // Create DOM from URL
    $html = file_get_html('http://slashdot.org/');
    // Find all article blocks
    $articles = [];
    foreach ($html->find('div.article') as $article) {
        // Start from an empty item so values never leak between iterations
        $item = [];
        $item['title'] = $article->find('div.title', 0)->plaintext;
        $item['intro'] = $article->find('div.intro', 0)->plaintext;
        $item['details'] = $article->find('div.details', 0)->plaintext;
        $articles[] = $item;
    }
    print_r($articles);
Native XML extensions
I prefer to use one of the native XML extensions because they come bundled with PHP, are usually faster than third-party libraries, and give me all the control I need over the markup.
DOM
DOM can parse and modify real-world (broken) HTML, and it can run XPath queries. It is based on libxml.
It takes some time to become productive with DOM, but in my opinion that time is well spent. Since DOM is a language-neutral interface, you'll find implementations in many languages, so if you ever need to change programming languages, chances are you will already know how to use that language's DOM API.
How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure that most of the problems you run into can be solved by searching or browsing Stack Overflow.
Basic usage examples and a general conceptual overview can be found in other answers.
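As a rough illustration of the points above, here is a minimal sketch that feeds deliberately broken markup to DOM, queries it with XPath, and modifies it. The sample markup and the `entry` class name are made up for the example.

```php
<?php
// Deliberately sloppy markup: unclosed <p> and <li> tags, no <html> or <body>.
$markup = '<p>Intro<ul><li><a href="/one">One</a><li><a href="/two">Two</a></ul>';

// Collect libxml's complaints internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($markup); // uses libxml's HTML parser, which repairs the markup
libxml_clear_errors();

// Query the repaired tree with XPath.
$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $links[$anchor->getAttribute('href')] = trim($anchor->textContent);
}

// Modify the tree, then serialize it back to HTML.
foreach ($xpath->query('//li') as $item) {
    $item->setAttribute('class', 'entry');
}
$output = $dom->saveHTML();

print_r($links);
```

Calling `libxml_use_internal_errors(true)` before `loadHTML()` is a common way to keep warnings about invalid markup out of the output; whether you want to inspect those errors instead of clearing them depends on your use case.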
XMLReader
XMLReader, like DOM, is based on libxml. I am not aware of a way to trigger the HTML parser module, so parsing broken HTML with XMLReader is likely to be less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.
A basic usage example is provided in another answer.
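For orientation, a minimal sketch of pull parsing with XMLReader on well-formed XML. The document structure and element names (`library`, `book`, `title`) are invented for this example.

```php
<?php
// Illustrative XML; the element names are made up for this sketch.
$xml = <<<XML
<library>
  <book id="1"><title>First</title></book>
  <book id="2"><title>Second</title></book>
</library>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$titles = [];
$currentId = null;
// read() advances the cursor one node at a time; the caller decides
// what to copy out, so the whole tree is never built in memory.
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->name === 'book') {
        $currentId = $reader->getAttribute('id');
    } elseif ($reader->name === 'title') {
        $titles[$currentId] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

Because your code pulls each node on demand, the loop stays in control of the traversal, which is what makes this style convenient for large documents.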
XML Parser
The XML Parser library is also based on libxml and implements a SAX-style XML push parser. It may be a better choice than DOM or SimpleXML for memory management, but it is harder to work with than the pull parser implemented by XMLReader.
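To show what "push" means in practice, here is a minimal sketch with the XML Parser functions: the parser calls your handlers as it encounters elements and text, and you keep track of state yourself. The sample document and variable names are assumptions for the example.

```php
<?php
$xml = '<feed><entry><title>First</title></entry><entry><title>Second</title></entry></feed>';

$titles = [];
$insideTitle = false;
$buffer = '';

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them (the default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, string $name, array $attributes) use (&$insideTitle, &$buffer) {
        if ($name === 'title') {
            $insideTitle = true;
            $buffer = '';
        }
    },
    // Called for every closing tag.
    function ($parser, string $name) use (&$insideTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $insideTitle = false;
        }
    }
);
xml_set_character_data_handler(
    $parser,
    // Text may arrive in several chunks, so append rather than assign.
    function ($parser, string $data) use (&$insideTitle, &$buffer) {
        if ($insideTitle) {
            $buffer .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

The state flags (`$insideTitle`, `$buffer`) are the price of the push model: the parser drives, and the handlers have to remember where they are.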
SimpleXML
SimpleXML is an option when you know that the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXML, as it will choke. Basic usage examples are provided, and there are many other examples in the PHP manual.
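A minimal sketch of both halves of that statement, using made-up sample markup: well-formed XHTML loads and is easy to traverse, while markup with unclosed tags makes `simplexml_load_string()` return `false`.

```php
<?php
// Well-formed XHTML fragment.
$xhtml = '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>';

$list = simplexml_load_string($xhtml);
if ($list === false) {
    throw new RuntimeException('Input is not well-formed XML.');
}

$links = [];
foreach ($list->li as $item) {
    // Attributes and text are SimpleXMLElement objects until cast to string.
    $links[(string) $item->a['href']] = (string) $item->a;
}

// The same list with unclosed <li> tags: parsing simply fails.
libxml_use_internal_errors(true);
$broken = simplexml_load_string('<ul><li>One<li>Two</ul>');
libxml_clear_errors();

print_r($links);
var_dump($broken);
```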
3rd party libraries (based on libxml)
If you prefer to use a 3rd party library, I recommend one that actually uses DOM/libxml underneath instead of string parsing.
FluentDom: This is described as "Abandoned software and bugs: use at your own risk" but appears to be minimally maintained.
laminas-dom
fDOMDocument
sabre/xml
FluidXML
3rd party (not based on libxml)
The benefit of building on top of DOM/libxml is that you get good performance out of the box because you're building on native extensions. However, not all third-party libraries go this route. Some of them are listed below.
PHP Simple HTML DOM Parser
I generally do not recommend this parser. The codebase is terrible and the parser itself is quite slow and memory-hungry. Not all jQuery selectors (such as child selectors) are possible. Any of the libxml-based libraries should easily outperform this.
PHP Html Parser
Again, I would not recommend this parser. It is rather slow and has high CPU usage. There is also no function to clear the memory of created DOM objects. These problems become especially severe in nested loops. The documentation itself is inaccurate and contains misspellings, and fixes have received no response since April 14, 2016.
HTML 5
You can use the above to parse HTML5, but there can be quirks because of the markup HTML5 allows. For HTML5 you may therefore want to consider a dedicated parser. Note that the parsers listed below are written in PHP, so they are slower and use more memory than a compiled extension written in a lower-level language.
HTML5DomDocument
HTML5
Regular expressions
Last and least recommended, you can extract data from HTML with regular expressions. In general, using regular expressions on HTML is discouraged.
Most of the code snippets you find on the web for matching tags are fragile. In most cases, they only work with very specific snippets of HTML. Small markup changes (such as adding a space somewhere, or adding or changing an attribute in the markup) can cause a regular expression to fail when written incorrectly. Before using RegEx on HTML, you should know what you are doing.
HTML parsers already know the syntactical rules of HTML. Regular expressions have to be taught those rules again with every new pattern you write. Regular expressions are fine in some cases, but it really depends on your use case.
You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the libraries above already exist and do a much better job.
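To make the fragility concrete, here is a small sketch comparing a typical copy-and-paste pattern with the DOM extension on two equivalent anchors. The pattern, the sample markup, and the helper `firstHref()` are hypothetical and exist only for this illustration.

```php
<?php
// Two anchors that mean the same thing; only the attribute syntax differs.
$tidy  = '<a href="/one">One</a>';
$messy = "<a class='nav' href='/one'>One</a>";

// A typical brittle pattern: assumes href comes first and is double-quoted.
$pattern = '/<a href="([^"]*)">/';

$regexOnTidy  = preg_match($pattern, $tidy, $m) ? $m[1] : null;
$regexOnMessy = preg_match($pattern, $messy, $m) ? $m[1] : null;

// Hypothetical helper for this sketch: read the first href via the DOM extension.
function firstHref(string $html): ?string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $anchor = $dom->getElementsByTagName('a')->item(0);
    return $anchor instanceof DOMElement ? $anchor->getAttribute('href') : null;
}

var_dump($regexOnTidy, $regexOnMessy, firstHref($tidy), firstHref($messy));
```

The pattern matches the first anchor but returns nothing for the second, while the parser-based helper yields `/one` for both. You could extend the pattern to cover single quotes and attribute order, but every such variation has to be anticipated by hand.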
See also Parsing Html The Cthulhu Way
Books
If you want to spend some money, you can take a look
I am not affiliated with PHP Architect or the authors.