A photographer friend of mine implored me to find and download images of picture frames from the internet. I eventually landed on a web page that had a number of them available for free but there was a problem: a link to download all the images together wasn’t present.
I didn’t want to go through the stress of downloading the images individually, so I wrote this PHP class to find, download and zip all images found on the website.
It searches a URL for images, downloads and saves the images into a folder, creates a ZIP archive of the folder and finally deletes the folder.
The class uses Symfony’s DomCrawler component to search for all image links found on the webpage and a custom zip function that creates the zip file. Credit to David Walsh for the zip function.
The class consists of five private properties and eight public methods including the __construct magic method.
Below is the list of the class properties and their roles.
1. $folder: stores the name of the folder that contains the scraped images.
2. $url: stores the webpage URL.
3. $html: stores the HTML document code of the webpage to be scraped.
4. $fileName: stores the name of the ZIP file.
5. $status: stores the status of the operation, i.e. whether it was a success or a failure.
Let’s get started building the class.
Create the class ZipImages containing the above five properties.
<?php

// Composer autoloader (see the installation notes further down)
require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

class ZipImages {
    private $folder;
    private $url;
    private $html;
    private $fileName;
    private $status;
Create a __construct magic method that accepts a URL as an argument.
It stores the URL, downloads the page's HTML with file_get_contents, and sets the default folder and ZIP file names.
public function __construct($url) {
    $this->url = $url;
    // download the page markup once so the crawler can work on it later
    $this->html = file_get_contents($this->url);
    // set the default folder and ZIP file names (both can be overridden later)
    $this->setFolder();
    $this->setFileName();
}
The scraped images are saved into a folder on disk before being compressed into the ZIP archive. The setFolder method below creates this folder. By default the folder is named image, but you can change that by passing a different name as the method's argument.
public function setFolder($folder = "image") {
    // if the folder doesn't exist, attempt to create it
    if (!file_exists($folder)) {
        mkdir($folder);
    }
    // store the folder name in the $folder property
    $this->folder = $folder;
}
setFileName provides an option to change the name of the ZIP file; the default name is zipImages:
public function setFileName($name = "zipImages") {
    $this->fileName = $name;
}
At this point, we instantiate Symfony's DomCrawler component to search the page for images, then download and save all of them into the folder.
public function domCrawler() {
    // instantiate the Symfony DomCrawler component
    $crawler = new Crawler($this->html);
    // create an array of all scraped image links
    $result = $crawler
        ->filterXPath('//img')
        ->extract(array('src'));
    // download and save each image into the folder
    foreach ($result as $image) {
        $path = $this->folder."/".basename($image);
        $file = file_get_contents($image);
        $insert = file_put_contents($path, $file);
        if (!$insert) {
            throw new \Exception('Failed to write image');
        }
    }
}
After the download is complete, we compress the image folder into a ZIP archive using our custom create_zip function.
public function createZip() {
    $folderFiles = scandir($this->folder);
    if (!$folderFiles) {
        throw new \Exception('Failed to scan folder');
    }
    $fileArray = array();
    foreach ($folderFiles as $file) {
        if (($file != ".") && ($file != "..")) {
            $fileArray[] = $this->folder."/".$file;
        }
    }
    if (create_zip($fileArray, $this->fileName.'.zip')) {
        $this->status = <<<HTML
File successfully archived. <a href="{$this->fileName}.zip">Download it now</a>
HTML;
    } else {
        $this->status = "An error occurred";
    }
}
Lastly, we delete the image folder once the ZIP file has been created.
public function deleteCreatedFolder() {
    $dp = opendir($this->folder) or die('ERROR: Cannot open directory');
    // remove every file inside the folder first
    while (($file = readdir($dp)) !== false) {
        if ($file != '.' && $file != '..') {
            if (is_file("$this->folder/$file")) {
                unlink("$this->folder/$file");
            }
        }
    }
    closedir($dp);
    // then remove the (now empty) folder itself
    rmdir($this->folder) or die('could not delete folder');
}
The getStatus method reports the status of the operation, i.e. whether it succeeded or an error occurred.
public function getStatus() {
    echo $this->status;
}
Finally, the process method runs all of the methods above in the right order.
public function process() {
    $this->domCrawler();
    $this->createZip();
    $this->deleteCreatedFolder();
    $this->getStatus();
}
You can download the full class from GitHub.
For the class to work, the DomCrawler component and the create_zip function need to be included. You can download the code for the zip function from David Walsh's site.
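If you just want something self-contained to experiment with, a rough equivalent of that helper, built on PHP's ZipArchive class, might look like the sketch below. It only covers what ZipImages actually needs (an array of file paths and a destination file name) and is not David Walsh's original.

<?php
// A minimal stand-in for the create_zip() helper the class expects.
// This is a sketch, not David Walsh's original function.
function create_zip($files = array(), $destination = '', $overwrite = false) {
    // refuse to clobber an existing archive unless told otherwise
    if (file_exists($destination) && !$overwrite) {
        return false;
    }

    $zip = new ZipArchive();
    if ($zip->open($destination, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
        return false;
    }

    // add each existing file under its base name
    foreach ($files as $file) {
        if (is_file($file)) {
            $zip->addFile($file, basename($file));
        }
    }

    $zip->close();

    // report success only if the archive was actually written
    return file_exists($destination);
}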
Download and install the DomCrawler component via Composer by adding a require entry for it to your composer.json file.
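The entry might look something like this (the version constraint is only an example; pin whichever release fits your project):

{
    "require": {
        "symfony/dom-crawler": "~2.5"
    }
}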
Run $ php composer.phar install to download the library and generate the vendor/autoload.php autoloader file.
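With everything in place, using the class could look something like the following sketch. The file names and the URL are placeholders; adjust them to wherever you saved the class and the zip helper.

<?php
require 'ZipImages.php';   // the class built above
require 'create_zip.php';  // the zip helper

$scraper = new ZipImages('http://example.com/picture-frames');
$scraper->setFolder('frames');    // optional, defaults to "image"
$scraper->setFileName('frames');  // optional, defaults to "zipImages"
$scraper->process();              // scrape, zip, clean up and print the status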
In this article, we learned how to create a simple PHP image scraper that automatically compresses the downloaded images into a ZIP archive. If you have alternative solutions or suggestions for improvement, please leave them in the comments below; all feedback is welcome!
To round things off, here are answers to a few frequently asked questions about Symfony’s DomCrawler Component. The component is a powerful tool that allows developers to traverse and manipulate HTML and XML documents. It provides an API that is easy to use and understand, making it a popular choice for web scraping tasks. It can be used to select specific elements on a page, extract data from them, and even modify their content.
Installing Symfony’s DomCrawler Component is straightforward. You can use Composer, a dependency management tool for PHP. Run the following command in your project directory: composer require symfony/dom-crawler. This will download and install the DomCrawler Component along with its dependencies.
To scrape images using Symfony’s DomCrawler Component, you first need to create a new instance of the Crawler class and load the HTML content into it. Then, you can use the filter method to select the image elements and extract their src attributes. Here’s a basic example:
$crawler = new Crawler($html);

$crawler->filter('img')->each(function (Crawler $node) {
    echo $node->attr('src');
});
You can use Symfony’s DomCrawler Component with Laravel, too. Laravel’s HTTP testing functionality actually uses the DomCrawler Component under the hood, which means you can apply the same methods and techniques to traverse and manipulate HTML content in your Laravel tests.
Symfony’s DomCrawler Component provides several methods to select elements, including filter, filterXPath, and selectLink. These let you select elements by a CSS selector, an XPath expression, or a link’s visible text, respectively.
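As a quick illustration, here is how those three methods might be used on the same document ($html is assumed to hold the page markup; note that filter needs the symfony/css-selector package installed):

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);

// select by CSS selector (requires symfony/css-selector)
$firstHeading = $crawler->filter('h1')->first()->text();

// select by XPath expression
$imageSources = $crawler->filterXPath('//img')->extract(array('src'));

// select a link by its visible text and read its href
$downloadHref = $crawler->selectLink('Download it now')->attr('href');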
You can modify the content of elements through Symfony’s DomCrawler Component, but note that the Crawler’s attr method only reads attribute values. To change something, use the each method to iterate over the selected elements and work on the underlying DOMElement directly. For example, to change the src attribute of every image element:

$crawler->filter('img')->each(function (Crawler $node) {
    $node->getNode(0)->setAttribute('src', 'new-image.jpg');
});
When using Symfony’s DomCrawler Component, errors and exceptions can be handled with try-catch blocks. Note that the filter method itself simply returns an empty Crawler when nothing matches; it is methods such as text or attr, called on that empty result, that throw an InvalidArgumentException. You can catch this exception and handle it appropriately.
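A minimal sketch of that pattern (the selector here is arbitrary):

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);

try {
    // throws InvalidArgumentException if no <h1> was found,
    // because text() cannot operate on an empty node list
    $title = $crawler->filter('h1')->text();
} catch (\InvalidArgumentException $e) {
    $title = null; // fall back gracefully instead of crashing
}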
Symfony’s DomCrawler Component can also be used to scrape websites that require authentication. This involves a few additional steps, such as submitting the login form (or sending a POST request with the credentials) and reusing the session cookie for the requests that follow.
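One way to do this is with Goutte, which wraps the DomCrawler Component in a browser-like client that keeps cookies between requests. The sketch below is only an outline: the login URL, the button label and the form field names are placeholders that depend entirely on the target site.

use Goutte\Client;

$client = new Client();

// load the login page and locate the form by its submit button
$crawler = $client->request('GET', 'https://example.com/login');
$form = $crawler->selectButton('Log in')->form(array(
    'username' => 'my-user',     // placeholder field names
    'password' => 'my-password',
));

// submitting the form stores the session cookie inside the client,
// so subsequent requests are authenticated
$crawler = $client->submit($form);
$crawler = $client->request('GET', 'https://example.com/members/images');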
You can extract attribute values using the attr method provided by Symfony’s DomCrawler Component. For example, to extract the src attribute of an image element, you can do the following:
$crawler->filter('img')->each(function (Crawler $node) {
    echo $node->attr('src');
});
Symfony’s DomCrawler Component cannot directly scrape AJAX-loaded content because it does not execute JavaScript. However, you can pair it with an HTTP client such as Guzzle, or a browser-like client such as Goutte, to request the underlying AJAX endpoints directly and then feed the responses to the DomCrawler Component.
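For example, if you can identify the endpoint a page calls via JavaScript, you could fetch it directly and hand the result to the crawler. Everything below (the endpoint URL and the assumption that it returns an HTML fragment) is hypothetical:

use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();

// call the AJAX endpoint the browser would normally hit in the background
$response = $client->request('GET', 'https://example.com/ajax/picture-frames?page=1');

// if the endpoint returns an HTML fragment, crawl it like any other markup
$crawler = new Crawler((string) $response->getBody());
$sources = $crawler->filterXPath('//img')->extract(array('src'));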