I'm developing a website that uses Puppeteer to scrape data from another website. When I run the server on my local machine it scrapes the data just fine, but when I deploy it to Heroku it only gets through the first three classes I'm looking for and then stops.
I essentially want to scrape data about courses from my school's website, so I run this line inside a for loop:
    let data = await crawler.scrapeData(classesTaken[i].code)
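For context, the surrounding code looks roughly like this (a sketch: the exact shape of classesTaken and of getClassData, which appears in the stack trace below, is assumed):

    // server.mjs -- rough shape of the calling loop (sketch)
    import * as crawler from './crawler.js'

    async function getClassData(classesTaken) {
        for (let i = 0; i < classesTaken.length; i++) {
            // each iteration launches a fresh browser inside scrapeData
            let data = await crawler.scrapeData(classesTaken[i].code)
            console.log(data.title)
        }
    }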
This runs the function below. (I've replaced the actual website URL for privacy.)
    // crawler.js
    const puppeteer = require('puppeteer')

    module.exports.scrapeData = async (code) => {
        const browser = await puppeteer.launch({
            args: ['--no-sandbox', '--disable-setuid-sandbox']
        })
        const page = await browser.newPage()

        // search the course catalog for the given class code
        await page.goto("website url")
        await page.type('#crit-keyword', code)
        await page.click('#search-button')

        // open the first result and wait for the detail view to render
        await page.waitForSelector(".result__headline")
        await page.click(".result__headline")
        await page.waitForSelector("div.text:nth-child(2)")

        let data = await page.evaluate(() => {
            // title-case the class title, keeping roman numerals uppercase
            let classTitle = document.querySelector("div.text:nth-child(2)").textContent
                .toLowerCase()
                .split(' ')
                .map((s) => s.charAt(0).toUpperCase() + s.substring(1))
                .join(' ')
                .replace('Ii', 'II')
            let classDesc = document.querySelector(".section--description > div:nth-child(2)")
                .textContent.replace('Lec/lab/rec.', '').trim()
            return { title: classTitle, desc: classDesc }
        })

        console.log(`== Finished grabbing ${code}`)
        return data
    }
This works fine on my local machine. However, when I push to my Heroku site, it only scrapes the first three classes and then times out. I have a feeling this may be due to my dyno running out of memory, but I don't know how to make the scraper wait for available memory.
This is the deployment log
    2023-05-22T17:29:18.421015+00:00 app[web.1]: == Finished grabbing CS 475
    2023-05-22T17:29:19.098698+00:00 app[web.1]: == Finished grabbing CS 331
    2023-05-22T17:29:19.783377+00:00 app[web.1]: == Finished grabbing CS 370
    2023-05-22T17:29:49.992190+00:00 app[web.1]: /app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317
    2023-05-22T17:29:49.992208+00:00 app[web.1]:     const timeoutError = new Errors_js_1.TimeoutError(`waiting for ${taskName} failed: timeout ${timeout}ms exceeded`);
    2023-05-22T17:29:49.992209+00:00 app[web.1]:     ^
    2023-05-22T17:29:49.992209+00:00 app[web.1]:
    2023-05-22T17:29:49.992210+00:00 app[web.1]: TimeoutError: waiting for target failed: timeout 30000ms exceeded
    2023-05-22T17:29:49.992211+00:00 app[web.1]:     at waitWithTimeout (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317:26)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at Browser.waitForTarget (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:405:56)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at ChromeLauncher.launch (/app/node_modules/puppeteer/lib/cjs/puppeteer/node/ChromeLauncher.js:100:31)
    2023-05-22T17:29:49.992230+00:00 app[web.1]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async Object.scrapeData (/app/crawler.js:9:21)
    2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async getClassData (file:///app/server.mjs:40:16)

Note where the timeout is thrown: ChromeLauncher.launch. So it's the next puppeteer.launch() call itself that never completes, not a page navigation.
I read a suggestion to try clearing the build cache using these commands:
    $ heroku plugins:install heroku-builds
    $ heroku builds:cache:purge --app your-app-name
I've tried this, but nothing changed. I also followed the Heroku troubleshooting instructions on the Puppeteer GitHub.
The reason I believe this may be related to dyno memory is this related article. If that is the case, I'd like to figure out how to make the scraper wait until there is free memory to use.
EDIT: I'm now also running the browser in headless mode, which results in the exact same error.
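For reference, the launch call with that change looks something like this (a sketch; recent versions of Puppeteer run headless by default anyway):

    // explicitly request headless mode at launch
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    })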
After further logging, I discovered that the problem was that I opened the browser and then never closed it, causing a memory leak. By adding the line
    await browser.close()
before the return statement of the scrapeData()
function, the memory leak stops and the server is able to scrape all of the class codes correctly.
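In other words, the end of crawler.js now looks something like this (a sketch; the try/finally wrapper is my own addition so the browser also closes when a wait times out, and isn't required for the basic fix):

    // crawler.js -- fixed: every call now closes the browser it launched
    module.exports.scrapeData = async (code) => {
        const browser = await puppeteer.launch({
            args: ['--no-sandbox', '--disable-setuid-sandbox']
        })
        try {
            const page = await browser.newPage()
            // ... same navigation and scraping steps as above ...
            let data = await page.evaluate(() => { /* ... as above ... */ })
            console.log(`== Finished grabbing ${code}`)
            return data
        } finally {
            // releases the Chrome process and its memory on the dyno
            await browser.close()
        }
    }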