This article shows how to batch-crawl and save Toutiao videos with Node, including the code. It may be a useful reference for anyone with the same need.
The usual routine for batch-crawling videos or images is to use a crawler to collect the file links and then save the files one by one, for example with writeFile. However, Toutiao's video links do not appear in the HTML returned by the server (the server-side rendered output). Instead, when the page renders on the client, scripts in certain JS files compute the link from the video's known key or hash, using some algorithm or decryption routine, and attach it to the video tag. This also serves as an anti-crawling measure for the website.
When we browse these pages, we can see the computed file address through the browser's Inspect Element tool. For batch downloads, though, collecting video links by hand one at a time is clearly impractical. Fortunately, puppeteer can drive a real Chrome instance, which lets us crawl the final page as rendered by the browser.
npm i
npm start
Note: Installing puppeteer is slow because it downloads a browser build along with the package, so please be patient.
Configuration file

// Configuration
module.exports = {
  originPath: 'https://www.ixigua.com', // URL of the page to request
  savePath: 'D:/videoZZ' // directory where videos are saved
}
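Writing a file into savePath fails if that directory does not exist yet. A minimal sketch of one way to guard against this, assuming a hypothetical helper named videoPath that is not part of the original code:

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helper: make sure the save directory exists, then return
// the full path of the .mp4 file for a given title.
function videoPath (savePath, title) {
  // With recursive: true, missing parent folders are created and an
  // already existing directory does not raise an error
  fs.mkdirSync(savePath, { recursive: true })
  return path.join(savePath, `${title}.mp4`)
}
```

Calling videoPath(config.savePath, video.title) before each write would keep the directory check in one place.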
Official API
puppeteer provides a high-level API to control Chrome or Chromium.
Main features of puppeteer:
Generate PDFs and screenshots of web pages
Crawl SPA applications and generate pre-rendered content (i.e. "SSR" server-side rendering)
Scrape content from websites
Automated form submission, UI testing, keyboard input, etc.
API used:
puppeteer.launch() Launches a browser instance
browser.newPage() Creates a new page
page.goto() Navigates to the specified web page
page.screenshot() Takes a screenshot
page.waitFor() Waits for a timeout, an element, or a function; newer Puppeteer releases deprecate it in favor of page.waitForSelector() and page.waitForFunction()
page.$eval() Runs document.querySelector in the page, passes the matched element to a callback, and returns the callback's result
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await page.screenshot({path: 'example.png'});
  await browser.close();
})();
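Building on the API calls listed above, reading the client-side computed video address might be sketched as follows. This is an illustration under assumptions, not the site's confirmed markup: the 'video' selector, the 'networkidle2' wait condition, and the names getVideoSrc and main are all hypothetical choices.

```javascript
// Sketch: read the src that client-side scripts attach to the <video> tag.
// `browser` is a Puppeteer Browser, e.g. from puppeteer.launch().
// The default 'video' selector is an assumption about the page markup.
async function getVideoSrc (browser, pageUrl, selector = 'video') {
  const page = await browser.newPage()
  try {
    // Wait until network activity settles so the page scripts have run
    await page.goto(pageUrl, { waitUntil: 'networkidle2' })
    // Wait for the element to exist in the rendered DOM
    await page.waitForSelector(selector)
    // The callback runs inside the page; its return value comes back to Node
    return await page.$eval(selector, el => el.src)
  } finally {
    await page.close()
  }
}

// Hypothetical entry point; it is not invoked here because it needs
// puppeteer to be installed.
async function main (pageUrl) {
  const puppeteer = require('puppeteer')
  const browser = await puppeteer.launch()
  try {
    console.log(await getVideoSrc(browser, pageUrl))
  } finally {
    await browser.close()
  }
}
```

Passing the browser in as a parameter lets one browser instance serve many pages, which matters when crawling in batches.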
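const fs = require('fs')
const config = require('./config') // assumed path to the configuration file

const downloadVideo = async video => {
  const filePath = `${config.savePath}/${video.title}.mp4`
  // Skip videos that have already been downloaded
  if (fs.existsSync(filePath)) {
    console.log(`Video file already exists: ${video.title}`)
    return
  }
  console.log('Downloading video:', video.title)
  const fileData = await getVideoData(video.src, 'binary')
  // Await the write so the caller knows when the file is on disk
  const res = await savefileToPath(video.title, fileData)
  console.log(`${res}: ${video.title}`)
}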
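For a whole batch, a function like downloadVideo has to be called once per collected video. One possible sketch runs the downloads one at a time and records failures instead of aborting; downloadAll and its downloadOne parameter are hypothetical names introduced here for illustration.

```javascript
// Sketch: download a list of videos sequentially.
// `downloadOne` stands in for a function such as downloadVideo.
async function downloadAll (videos, downloadOne) {
  const failed = []
  for (const video of videos) {
    try {
      // Awaiting inside the loop keeps only one download in flight
      await downloadOne(video)
    } catch (err) {
      // Remember the failure and carry on with the next video
      failed.push({ title: video.title, error: err.message })
    }
  }
  return failed
}
```

Sequential downloads are slower than parallel ones but keep memory use and request rate low, which is a reasonable default when each file is buffered in full.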
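const http = require('http')

// Note: http.get only handles http: URLs; https: links need the https module
function getVideoData (url, encoding) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, res => {
      let result = ''
      // With 'binary' (latin1) each byte maps to one character, so string
      // concatenation keeps the raw bytes intact
      encoding && res.setEncoding(encoding)
      res.on('data', d => { result += d })
      res.on('end', () => resolve(result))
      res.on('error', reject)
    })
    // Connection failures (DNS, refused connection) are emitted on the request
    req.on('error', reject)
  })
}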
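getVideoData holds the whole video in memory before it is written. As an alternative technique (not the approach used in this article), the response can be streamed straight to disk. The sketch below assumes a hypothetical helper named downloadToFile and picks the http or https module from the URL scheme.

```javascript
const fs = require('fs')
const http = require('http')
const https = require('https')
const { pipeline } = require('stream')

// Sketch: stream a response body into a file without buffering it in memory.
function downloadToFile (url, destPath) {
  return new Promise((resolve, reject) => {
    const client = url.startsWith('https:') ? https : http
    const req = client.get(url, res => {
      if (res.statusCode !== 200) {
        res.resume() // discard the body so the socket is released
        reject(new Error(`Unexpected status code ${res.statusCode}`))
        return
      }
      // pipeline forwards errors from either stream and cleans both up
      pipeline(res, fs.createWriteStream(destPath), err => {
        if (err) reject(err)
        else resolve(destPath)
      })
    })
    req.on('error', reject)
  })
}
```

This sketch does not follow redirects; a 3xx response is rejected like any other non-200 status.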
function savefileToPath (fileName, fileData) {
  const fileFullName = `${config.savePath}/${fileName}.mp4`
  return new Promise((resolve, reject) => {
    fs.writeFile(fileFullName, fileData, 'binary', err => {
      if (err) {
        console.log('savefileToPath error:', err)
        reject(err) // do not report success when the write failed
        return
      }
      resolve('Downloaded')
    })
  })
}
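Because the file name is built from the video title, a title containing characters such as ? or : would make the write fail on Windows. A minimal sketch of a sanitizing step, assuming a hypothetical helper named safeFileName:

```javascript
// Hypothetical helper: replace characters that Windows forbids in file
// names (\ / : * ? " < > |) and ASCII control characters.
function safeFileName (title, replacement = '_') {
  const cleaned = title
    .replace(/[\\\/:*?"<>|\x00-\x1f]/g, replacement)
    .trim()
  // Fall back to a fixed name when nothing usable is left
  return cleaned || 'untitled'
}

console.log(safeFileName('a/b:c?')) // prints "a_b_c_"
```

Applying the same helper both when checking whether a file exists and when saving it keeps the two paths consistent.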