Collecting web content is a very common need. For traditional static pages, curl is enough. But some pages load content dynamically, for example article text fetched through Ajax, and some pages do extra processing after loading (replacing image addresses and so on). If you want to collect that processed content, even the mighty curl is helpless.
Anyone who has had a similar need may say: just use PhantomJS!
Yes, that is one way, and for a long time PhantomJS was one of the few tools that could handle this kind of job.
But what I want to introduce today is a latecomer that has overtaken it: puppeteer, which developed rapidly with the rise of headless Chrome. Importantly, puppeteer is developed and maintained by the official Chrome team, so it can be considered quite reliable.
puppeteer is a Node.js package. To use it from Laravel, you need another handy tool: spatie/browsershot.
Installation
Install spatie/browsershot
browsershot is a Composer package from the excellent spatie team.
$ composer require spatie/browsershot
Install puppeteer
$ npm i puppeteer --save
You can also install puppeteer globally, but from personal experience I recommend installing it inside the project, so that different projects are not all affected by one globally installed copy. A per-project install is also convenient when deploying with phpdeployer: installing or upgrading puppeteer during a deployment does not affect the project that is currently running online. Keep in mind that installing or upgrading puppeteer is very time-consuming and is not always guaranteed to succeed.
When puppeteer is installed, it downloads the Chromium browser. Given the network restrictions in mainland China, this download may well fail. If it does, you will have to find your own way around it...
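If the bundled Chromium cannot be downloaded, one possible workaround is to point Browsershot at a Chrome or Chromium binary that is already installed on the machine. The following is only a minimal sketch. It assumes that puppeteer itself was installed successfully (for instance with its browser download skipped) and that your Browsershot version provides setChromePath(). The helper names findChrome() and browsershotFor() are hypothetical, and the candidate paths you pass in will differ from system to system.

```php
<?php

use Spatie\Browsershot\Browsershot;

/**
 * Return the first candidate path that is an executable file, or null.
 * (Hypothetical helper, not part of Browsershot.)
 */
function findChrome(array $candidates): ?string
{
    foreach ($candidates as $path) {
        if (is_file($path) && is_executable($path)) {
            return $path;
        }
    }

    return null;
}

/**
 * Build a Browsershot instance that uses a locally installed browser
 * when one of the candidate paths exists; otherwise keep the default.
 */
function browsershotFor(string $url, array $candidates): Browsershot
{
    $shot = Browsershot::url($url);

    $chrome = findChrome($candidates);
    if ($chrome !== null) {
        // Use the existing binary instead of the Chromium bundled with puppeteer.
        $shot->setChromePath($chrome);
    }

    return $shot;
}
```

A call might look like browsershotFor($newsUrl, ['/usr/bin/chromium-browser', '/usr/bin/google-chrome'])->bodyHtml(), where both paths are assumptions about where the browser lives.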
Usage
As an example, let's collect the content of an article on the mobile version of Jinri Toutiao.
use Spatie\Browsershot\Browsershot;

public function getBodyHtml()
{
    $newsUrl = 'https://m.toutiao.com/i6546884151050502660/';

    // Render the page in headless Chrome as a mobile device would see it
    $html = Browsershot::url($newsUrl)
        ->windowSize(480, 800)
        ->userAgent('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Mobile Safari/537.36')
        ->mobile()
        ->touch()
        ->bodyHtml(); // returns the HTML after the page's JavaScript has run

    \Log::info($html);
}

After running it, you can see the following content in the log (the screenshot shows only part of it).
In addition, you can also save the page as an image or PDF document.
use Spatie\Browsershot\Browsershot;

public function getBodyHtml()
{
    $newsUrl = 'https://m.toutiao.com/i6546884151050502660/';

    Browsershot::url($newsUrl)
        ->windowSize(480, 800)
        ->userAgent('Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Mobile Safari/537.36')
        ->mobile()
        ->touch()
        ->setDelay(1000) // wait 1000 ms before taking the screenshot
        ->save(public_path('images/toutiao.jpg'));
}

The boxes in the image are a system font issue. The code uses setDelay() so that the screenshot is taken after the content has loaded. This is simple and crude, and may not be the best solution.
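One alternative to a fixed delay is to wait until the page's network activity has settled. The sketch below assumes that your installed Browsershot version provides waitUntilNetworkIdle(). The function name captureWhenIdle() and its parameters are hypothetical and only serve the illustration. Whether this works better than setDelay() for a given page is something you would need to test.

```php
<?php

use Spatie\Browsershot\Browsershot;

/**
 * Save a screenshot of $url to $path once network requests have settled.
 * (Hypothetical helper for illustration, not part of Browsershot.)
 */
function captureWhenIdle(string $url, string $path): void
{
    Browsershot::url($url)
        ->windowSize(480, 800)
        ->mobile()
        ->touch()
        ->waitUntilNetworkIdle() // wait for network activity to stop instead of sleeping a fixed time
        ->save($path);
}
```

It would be called in the same place as the example above, for instance captureWhenIdle($newsUrl, public_path('images/toutiao.jpg')).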
Possible problems
The system must be able to run the Chromium browser. Most systems can these days. If yours cannot, there is nothing to be done, and you are better off with PhantomJS.
After puppeteer is installed in the project, you may run into permission problems when it is called. In that case, grant appropriate permissions on the /node_modules/puppeteer directory under the project.
Summary
puppeteer is a very powerful tool for testing, content collection and similar scenarios. It is enough for light collection tasks, such as the one in this article, where it collects a few small pages from Laravel (PHP). But if you need to collect a large amount of content quickly, you are better off with Python or something similar.