How can Selenium be integrated with Scrapy to handle dynamic web pages?

Susan Sarandon

Nov 17, 2024, 01:14 PM

Integrating Selenium with Scrapy for Dynamic Web Pages

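Introduction
Scrapy is a powerful web scraping framework, but it downloads raw HTML and does not execute JavaScript, so it runs into limits on dynamic web pages. Selenium, a browser automation tool originally built for testing, fills this gap by driving a real browser, simulating user interaction, and rendering the page content. The sections below show how to integrate Selenium with Scrapy to handle dynamic web pages.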

Selenium Integration Options
There are two main options for integrating Selenium with Scrapy:

Option 1: Call Selenium from a Scrapy parser
- Start a Selenium session inside a Scrapy parser method.
- Use Selenium to navigate to the page, interact with it, and extract data as needed.
- This option gives you fine-grained control over Selenium.
Option 2: Use the scrapy-selenium middleware
- Install the scrapy-selenium middleware package.
- Configure the middleware to handle specific requests or all requests.
- The middleware automatically renders each page with Selenium before Scrapy's parser processes it.
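
For Option 2, the configuration step might look roughly like the fragment below. It is a sketch that follows the setting names documented in the scrapy-selenium README; the choice of Firefox, the `geckodriver` lookup, and the headless flag are assumptions made for illustration. The package was written against older Selenium releases, so check that it works with the Selenium version you have installed.

```python
# settings.py (Scrapy project settings) -- a sketch, assuming Firefox + geckodriver
from shutil import which

SELENIUM_DRIVER_NAME = "firefox"
# which() returns None if geckodriver is not on PATH
SELENIUM_DRIVER_EXECUTABLE_PATH = which("geckodriver")
# Run Firefox without opening a window
SELENIUM_DRIVER_ARGUMENTS = ["-headless"]

# Register the middleware so it can render requests with Selenium
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
```

According to the same README, a spider then yields `SeleniumRequest(url=..., callback=...)`, imported from `scrapy_selenium`, for the pages that should be rendered in the browser; ordinary `scrapy.Request` objects bypass Selenium.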

Example: Selenium in a Scrapy Spider
Consider the following Scrapy spider, which uses the first integration option:

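import logging

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from selenium import webdriver
from selenium.webdriver.common.by import By


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        # Follow the product links found in the product list
        Rule(LinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'), callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=logging.INFO)
        # Open the same URL in a real browser so JavaScript-rendered content is available
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            # Perform Selenium actions to extract product data
            product_data = driver.find_element(By.XPATH, '//h1').text
        finally:
            # quit() shuts down the browser process even if extraction fails
            driver.quit()
        # Yield the extracted data as a Scrapy item
        yield {'product_name': product_data}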
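
The spider above starts a new Firefox instance for every product page, which is slow on larger crawls. One common alternative is to create the browser once and reuse it across callbacks. The following is a minimal sketch of that idea, not part of Scrapy or Selenium: `SharedDriver` and its `factory` argument are hypothetical names, and in practice the factory would be something like `webdriver.Firefox`.

```python
class SharedDriver:
    """Lazily create one browser and reuse it for every callback."""

    def __init__(self, factory):
        # factory: any zero-argument callable that returns a WebDriver,
        # for example selenium.webdriver.Firefox (assumed, not imported here)
        self._factory = factory
        self._driver = None

    def get(self):
        # Start the browser only when the first page actually needs it
        if self._driver is None:
            self._driver = self._factory()
        return self._driver

    def close(self):
        # quit() ends the whole browser session, not just the current window
        if self._driver is not None:
            self._driver.quit()
            self._driver = None
```

A spider could hold one `SharedDriver` as an attribute, call `get()` inside `parse_product`, and call `close()` from its `closed(self, reason)` method, which Scrapy invokes when the spider finishes.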


Additional Examples and Alternatives

To handle pagination on eBay with Scrapy and Selenium:

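import scrapy
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By


class ProductSpider(scrapy.Spider):
    # ...
    def parse(self, response):
        # self.driver is assumed to be created in the elided part above
        self.driver.get(response.url)
        while True:
            try:
                # Get the next page link and click it
                next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                next_link.click()
                # Scrape data and write to items
            except WebDriverException:
                # No next link, or it cannot be clicked: the last page was reached
                break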
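
The loop above relies on an exception to detect the last page. A slightly more defensive variant, sketched below, uses `find_elements`, which returns an empty list when nothing matches, and caps the number of pages so a misbehaving site cannot trap the spider in an endless loop. The helper name `iter_pages` and the `max_pages` parameter are hypothetical; `driver` is any Selenium WebDriver, and the string `"xpath"` is the value of `By.XPATH`.

```python
def iter_pages(driver, next_xpath, max_pages=50):
    """Yield the page source of each result page, following the next link."""
    for _ in range(max_pages):
        yield driver.page_source
        # find_elements returns [] instead of raising when nothing matches,
        # so reaching the last page needs no exception handling
        links = driver.find_elements("xpath", next_xpath)
        if not links:
            break
        links[0].click()
```

Inside `parse`, this could be used as `for html in iter_pages(self.driver, '//td[@class="pagn-next"]/a'): ...`. A real spider would usually also wait for the new page to finish loading after each click before reading `page_source`.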
