Scrape multiple URLs with PyQt QWebPage
PyQt's QWebPage renders web pages, including their JavaScript, which makes it suitable for scraping dynamically loaded content. However, rendering several pages in a row can lead to crashes or unexpected behavior if it is set up incorrectly.
Problem Identification
The problem arises when a new QApplication and a new QWebPage are created for every URL that is fetched. Instead, create a single instance of each, and let the WebPage use its loadFinished signal to start loading the next URL once the current page has been processed.
Solution
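The improved WebPage class, a QWebPage subclass, addresses the problem in three ways: it is created once, alongside a single QApplication; its process method accepts any iterable of URLs, including an endless iterator; and it emits an htmlReady signal carrying the page's HTML and URL each time a page finishes loading.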
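The control flow described above, one long-lived object that loads the next URL from its load-finished handler, can be modelled without Qt. The sketch below is only an illustration under stated assumptions, not the WebPage class itself: SequentialLoader, its method names, the fake event loop and the example.com URLs are all hypothetical. In a real QWebPage subclass, the load callback would roughly correspond to mainFrame().load(QUrl(url)), and handle_load_finished would be a slot connected to the loadFinished signal.

```python
from collections import deque


class SequentialLoader:
    """Qt-free stand-in (hypothetical) for a single, reusable WebPage."""

    def __init__(self, load, on_html_ready):
        self._load = load                    # stands in for mainFrame().load(QUrl(url))
        self._on_html_ready = on_html_ready  # stands in for an htmlReady signal
        self._urls = iter(())
        self._current = None

    def process(self, urls):
        """Accept any iterable of URLs and start loading the first one."""
        self._urls = iter(urls)
        return self._fetch_next()

    def _fetch_next(self):
        """Start the next load; return False when the URLs are exhausted."""
        try:
            self._current = next(self._urls)
        except StopIteration:
            self._current = None
            return False
        self._load(self._current)
        return True

    def handle_load_finished(self, html):
        """Report the finished page, then move on to the next URL."""
        self._on_html_ready(html, self._current)
        return self._fetch_next()


# A deque plays the role of the event loop: "loading" a URL just queues it.
pending = deque()
results = []

loader = SequentialLoader(
    load=pending.append,
    on_html_ready=lambda html, url: results.append((url, len(html))),
)
loader.process(['http://example.com/a', 'http://example.com/b'])

# Each iteration simulates one loadFinished notification.
while pending:
    url = pending.popleft()
    loader.handle_load_finished('<html>%s</html>' % url)

print(results)
```

Because the loader only pulls the next URL after the previous page has been reported, a single object handles an arbitrarily long (or endless) iterable without ever creating a second application or page instance.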
Usage
Example code demonstrating how to use the improved WebPage:
```python
import sys

# Assumes QApplication has been imported from PyQt and that the improved
# WebPage class, with its htmlReady(html, url) signal, is already defined.

def my_html_processor(html, url):
    print('loaded: [%d chars] %s' % (len(html), url))

app = QApplication(sys.argv)
webpage = WebPage(verbose=False)
webpage.htmlReady.connect(my_html_processor)

# example 1: process list of urls
urls = ['https://en.wikipedia.org/wiki/Special:Random'] * 3
print('Processing list of urls...')
webpage.process(urls)

# example 2: process one url continuously
# (an alternative to example 1; uncomment it and remove example 1 to use it)
# import signal, itertools
# signal.signal(signal.SIGINT, signal.SIG_DFL)  # let Ctrl+C stop the event loop
# print('Processing url continuously...')
# print('Press Ctrl+C to quit')
# url = 'https://en.wikipedia.org/wiki/Special:Random'
# webpage.process(itertools.repeat(url))

sys.exit(app.exec_())
```