Heroku limits Puppeteer to running only three times
P粉986860950 2024-04-02 19:49:08

I'm developing a website that uses Puppeteer to scrape data from another website. When I run the server with npm on my local machine, it scrapes the data just fine, but when I deploy it to Heroku, it only scrapes the first three courses I'm looking for and then stops.

I essentially want to scrape data about courses from my school's website, so I run this line inside a for loop:

let data = await crawler.scrapeData(classesTaken[i].code)

That call runs the function body below. I've replaced the actual website URL for privacy.

const browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox'
      ]
    })
    const page = await browser.newPage()
    
    await page.goto("website url")
    await page.type('#crit-keyword', code)
    await page.click('#search-button')

    await page.waitForSelector(".result__headline")

    await page.click(".result__headline")

    await page.waitForSelector("div.text:nth-child(2)")

    let data = await page.evaluate(() => {
        let classTitle = document.querySelector("div.text:nth-child(2)").textContent
            .toLowerCase().split(' ')
            .map((s) => s.charAt(0).toUpperCase() + s.substring(1)).join(' ').replace('Ii', "II")
        let classDesc =  document.querySelector(".section--description > div:nth-child(2)").textContent.replace('Lec/lab/rec.', '').trim()

        return {
            title: classTitle,
            desc: classDesc
        }
    })

    console.log(`== Finished grabbing ${code}`)

    return data

This works fine on my local server. However, when I push to my Heroku site, it only processes the first three class codes. I have a feeling this may be because my dyno is running out of memory, but I don't know how to make it wait for available memory.

This is the deployment log:

2023-05-22T17:29:18.421015+00:00 app[web.1]: == Finished grabbing CS 475
2023-05-22T17:29:19.098698+00:00 app[web.1]: == Finished grabbing CS 331
2023-05-22T17:29:19.783377+00:00 app[web.1]: == Finished grabbing CS 370

2023-05-22T17:29:49.992190+00:00 app[web.1]: /app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317
2023-05-22T17:29:49.992208+00:00 app[web.1]:     const timeoutError = new Errors_js_1.TimeoutError(`waiting for ${taskName} failed: timeout ${timeout}ms exceeded`);
2023-05-22T17:29:49.992209+00:00 app[web.1]:                          ^
2023-05-22T17:29:49.992209+00:00 app[web.1]:
2023-05-22T17:29:49.992210+00:00 app[web.1]: TimeoutError: waiting for target failed: timeout 30000ms exceeded
2023-05-22T17:29:49.992211+00:00 app[web.1]:     at waitWithTimeout (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/util.js:317:26)
2023-05-22T17:29:49.992230+00:00 app[web.1]:     at Browser.waitForTarget (/app/node_modules/puppeteer/lib/cjs/puppeteer/common/Browser.js:405:56)
2023-05-22T17:29:49.992230+00:00 app[web.1]:     at ChromeLauncher.launch (/app/node_modules/puppeteer/lib/cjs/puppeteer/node/ChromeLauncher.js:100:31)
2023-05-22T17:29:49.992230+00:00 app[web.1]:     at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async Object.scrapeData (/app/crawler.js:9:21)
2023-05-22T17:29:49.992231+00:00 app[web.1]:     at async getClassData (file:///app/server.mjs:40:16)
2023-05-22T17:29:49.992234+00:00 app[web.1]:

I read somewhere that I should try clearing the build cache with these commands:

$ heroku plugins:install heroku-builds
$ heroku builds:cache:purge --app your-app-name

I tried this, but it made no difference. I also followed the Heroku troubleshooting instructions on the Puppeteer GitHub.

I believe this may be related to my dyno's memory because of a related article I read. If that is the case, I'd like to figure out how to wait until there is free memory to use.

EDIT: I'm now also running the browser in headless mode, which results in exactly the same error.

All replies (1)
P粉129168206

After adding more logging, I discovered that the problem was that I launched a new browser on every call and never closed it, which leaked memory. The log is consistent with this: the timeout is thrown from ChromeLauncher.launch, i.e. while the fourth browser is starting. After adding the line await browser.close() before the return statement of the scrapeData() function, the leak stops and the server is able to process all class codes correctly.
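
One caveat with putting await browser.close() just before the return: if page.goto or a waitForSelector call throws, that line is never reached and the browser still leaks. A minimal sketch of a more defensive variant uses try/finally. The helper name withBrowser and its launch parameter are hypothetical (they are not part of Puppeteer); only browser.close() is assumed from the Puppeteer API.

```javascript
// Hypothetical helper (not part of Puppeteer): runs `task` with a freshly
// launched browser and closes that browser whether `task` succeeds or throws.
// `launch` is any async function resolving to an object with a close() method,
// e.g. () => puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] })
async function withBrowser(launch, task) {
  const browser = await launch()
  try {
    // Do all scraping work inside the try block.
    return await task(browser)
  } finally {
    // Runs on both success and failure, so no browser process is left behind.
    await browser.close()
  }
}
```

Inside scrapeData() this could be used roughly as return withBrowser(() => puppeteer.launch({ args: [...] }), async (browser) => { const page = await browser.newPage(); /* scrape as before */ return data }). Launching a single browser once and opening a new page per class code would probably use less memory still, but I have not tested that on Heroku.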
