Scraping Infinite Scroll Pages with a "Load More" Button: A Step-by-Step Guide

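Does your scraper get stuck when it tries to load data from dynamic web pages? Are you frustrated by infinite scroll or those pesky "Load more" buttons?

You are not alone. Many websites now use these designs to improve the user experience, but they can be challenging for web scrapers.

This tutorial walks you through a beginner-friendly exercise: scraping a demo page that uses a Load More button. The target page looks like this: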

Demo web page for scraping

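By the end, you will know how to:

Set up Selenium for web scraping.
Automate interaction with the "Load more" button.
Extract product data such as names, prices, and links.

Let's get started!

Step 1: Prerequisites

Before you begin, make sure you meet the following prerequisites:

Python installed: download and install the latest Python version from python.org, and include pip during installation.
Basics: familiarity with web scraping concepts, Python programming, and libraries such as requests, BeautifulSoup, and Selenium.

Required libraries:

Requests: for sending HTTP requests.
BeautifulSoup: for parsing HTML content.
Selenium: for simulating user interactions in a browser, such as button clicks.

You can install these libraries with the following command in your terminal: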

pip install requests beautifulsoup4 selenium

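Before using Selenium, you must install a web driver that matches your browser. This tutorial uses Google Chrome and ChromeDriver, but the steps are similar for other browsers such as Firefox or Edge.

Install the Web Driver

Check your browser version:
Open Google Chrome and go to Help > About Google Chrome from the three-dot menu to find your Chrome version.
Download ChromeDriver:
Visit the ChromeDriver download page.
Download the driver version that matches your Chrome version.
Add ChromeDriver to your system PATH:
Extract the downloaded file and place it in a directory such as /usr/local/bin (Mac/Linux) or C:\Windows\System32 (Windows).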
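To confirm that the driver ended up somewhere on your PATH, a quick check from Python can help. The sketch below uses only the standard library. The helper name find_driver is made up for illustration, and it assumes the executable is called chromedriver. Recent Selenium releases (4.6 and later) ship with Selenium Manager, which can locate or download a matching driver on its own, so a missing PATH entry is not necessarily fatal.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path:
        print(f"ChromeDriver found at: {path}")
    else:
        print("ChromeDriver is not on PATH; check where you placed the file.")
```

If the script reports that the driver is missing, double-check the directory you copied it to and open a new terminal so the updated PATH is picked up.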

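Verify the Installation

Create a Python file named scraper.py in your project directory and run the following snippet to check that everything is set up correctly: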

from selenium import webdriver
driver = webdriver.Chrome() # Ensure ChromeDriver is installed and in PATH
driver.get("https://www.scrapingcourse.com/button-click")
print(driver.title)
driver.quit()

You can run the file with the following command in your terminal:

python scraper.py

If the code runs without errors, it launches a browser window and opens the demo page URL, as shown below:

Demo Page in Selenium Browser Instance

Selenium then retrieves the HTML and prints the page title. You should see output like this:

Load More Button Challenge to Learn Web Scraping - ScrapingCourse.com

This confirms that Selenium is ready to use. With all the requirements installed, you can start accessing the content of the demo page.

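Step 2: Access the Content

The first step is to fetch the page's initial content, which gives you a baseline snapshot of the page's HTML. This helps you verify connectivity and gives the scraping process a valid starting point.

You retrieve the HTML content of the page URL by sending a GET request with the Requests library in Python. The code looks like this:

import requests
# URL of the demo page with products
url = "https://www.scrapingcourse.com/button-click"
# Send a GET request to the URL
response = requests.get(url)
# Check if the request was successful
if response.status_code == 200:
    html_content = response.text
    print(html_content) # Optional: Preview the HTML
else:
    print(f"Failed to retrieve content: {response.status_code}")

The code above outputs the raw HTML containing the data for the first 12 products.

A quick preview of the HTML confirms that the request succeeded and that you are working with valid data.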
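Rather than reading the raw HTML by eye, you can count the product containers to confirm that only the first batch came back. The following is a minimal sketch using the standard library's html.parser. It assumes each product sits in an element with the class product-item, as the page inspection in this tutorial suggests. The names ProductCounter and count_products and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class ProductCounter(HTMLParser):
    """Counts start tags whose class attribute includes a given class name."""

    def __init__(self, class_name="product-item"):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_products(html, class_name="product-item"):
    """Return how many elements in the HTML carry the given class."""
    parser = ProductCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical sample markup, not copied from the demo page
sample_html = """
<div class="product-item"><span class="product-name">A</span></div>
<div class="product-item featured"><span class="product-name">B</span></div>
<div class="other">not a product</div>
"""
print(count_products(sample_html))  # prints 2
```

Calling count_products(html_content) on the response from the GET request should then report the size of the initial batch, which the tutorial states is 12.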

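Step 3: Load More Products

To access the remaining products, you need to click the "Load more" button on the page programmatically until no more products are available. Because this interaction involves JavaScript, you will use Selenium to simulate the button clicks.

Before writing the code, inspect the page to locate:

The "Load more" button selector (load-more-btn).
The div that holds the product details (product-item).

Running the following code loads all the remaining products, giving you a larger dataset to work with:

from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# Set up the WebDriver (make sure you have the appropriate driver installed, e.g., ChromeDriver)
driver = webdriver.Chrome()
# Open the page
driver.get("https://www.scrapingcourse.com/button-click")
# Loop to click the "Load More" button until there are no more products
while True:
    try:
        # Find the "Load more" button by its ID and click it
        load_more_button = driver.find_element(By.ID, "load-more-btn")
        load_more_button.click()
        # Wait for the content to load (adjust time as necessary)
        time.sleep(2)
    except Exception:
        # If no "Load More" button is found (end of products), break out of the loop
        print("No more products to load.")
        break
# Get the updated page content after all products are loaded
html_content = driver.page_source
# Close the browser window
driver.quit()

This code opens the browser, navigates to the page, and interacts with the "Load more" button. It then extracts the updated HTML, which now contains more product data.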
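A click loop that relies only on an exception to stop can run indefinitely if the button stays on the page after the last batch. The sketch below shows one possible safeguard: cap the number of clicks and stop as soon as a click adds no new products. The helper load_all_products and its max_clicks and pause parameters are hypothetical. It accepts any object that offers find_element and find_elements, which a Selenium driver does, and the strings "id" and "css selector" are the values behind Selenium's By.ID and By.CSS_SELECTOR.

```python
import time


def load_all_products(driver, max_clicks=50, pause=2.0):
    """Click the "Load more" button until it is gone, a click adds no
    products, or max_clicks is reached. Returns the number of clicks."""
    clicks = 0
    while clicks < max_clicks:
        before = len(driver.find_elements("css selector", ".product-item"))
        try:
            # "id" is the string value behind selenium's By.ID
            driver.find_element("id", "load-more-btn").click()
        except Exception:
            # Button missing or not clickable: assume everything is loaded
            break
        clicks += 1
        time.sleep(pause)  # give the page time to append new products
        after = len(driver.find_elements("css selector", ".product-item"))
        if after <= before:
            # The click added nothing, so stop instead of looping forever
            break
    return clicks
```

With a real driver you would call load_all_products(driver) after driver.get(...) and then read driver.page_source as before. Selenium's explicit waits are another option in place of the fixed pause.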

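If you don't want Selenium to open a browser window every time you run this code, it also offers a headless mode. A headless browser has all the capabilities of a regular web browser but no graphical user interface (GUI).

You can enable Chrome's headless mode in Selenium by defining a ChromeOptions object and passing it to the WebDriver Chrome constructor, like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

import time

# instantiate a Chrome options object
options = webdriver.ChromeOptions()

# set the options to use Chrome in headless mode
options.add_argument("--headless=new")

# initialize an instance of the Chrome driver (browser) in headless mode
driver = webdriver.Chrome(options=options)

...

When you run the code above, Selenium launches a headless Chrome instance, so you no longer see a Chrome window. This is ideal for production environments where the scraping script runs on a server and you don't want to spend resources on a GUI.

Now that the full HTML content has been retrieved, it is time to extract specific details about each product.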

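Step 4: Parse Product Information

In this step, you use BeautifulSoup to parse the HTML and identify the product elements. You then extract the key details of each product, such as name, price, and link.

from bs4 import BeautifulSoup
# Parse the page content with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
# Extract product details
products = []
# Find all product items in the grid
product_items = soup.find_all('div', class_='product-item')
for product in product_items:
    # Extract the product name
    name = product.find('span', class_='product-name').get_text(strip=True)

    # Extract the product price
    price = product.find('span', class_='product-price').get_text(strip=True)

    # Extract the product link
    link = product.find('a')['href']

    # Extract the image URL
    image_url = product.find('img')['src']

    # Create a dictionary with the product details
    products.append({
        'name': name,
        'price': price,
        'link': link,
        'image_url': image_url
    })
# Print the extracted product details
for product in products[:2]:
    print(f"Name: {product['name']}")
    print(f"Price: {product['price']}")
    print(f"Link: {product['link']}")
    print(f"Image URL: {product['image_url']}")
    print('-' * 30)

In the output, you should see a structured list of product details, including the name, image URL, price, and product page link, like this:

Name: Chaz Kangeroo Hoodie
Price: 
Link: https://scrapingcourse.com/ecommerce/product/chaz-kangeroo-hoodie
Image URL: https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh01-gray_main.jpg
------------------------------
Name: Teton Pullover Hoodie
Price: 
Link: https://scrapingcourse.com/ecommerce/product/teton-pullover-hoodie
Image URL: https://scrapingcourse.com/ecommerce/wp-content/uploads/2024/03/mh02-black_main.jpg
------------------------------
…

The code above organizes the raw HTML data into a structured format, which makes it easier to work with and prepares it for further processing.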
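The parsed prices are still text, which gets in the way if you later want to compare or sort them. Here is a small sketch of one way to normalize them. It assumes prices look like "$52.00" with an optional comma as the thousands separator. The tutorial does not show the actual price strings, so adjust the pattern to what you see on the page. The function name parse_price is made up for this example.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract the first number from a price string, or return None."""
    # Assumption: commas are thousands separators, so drop them first
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return Decimal(match.group()) if match else None


print(parse_price("$52.00"))    # prints 52.00
print(parse_price("$1,299.5"))  # prints 1299.5
print(parse_price(""))          # prints None
```

Decimal is used instead of float so that amounts such as 52.10 keep their exact value. Returning None for unparseable text lets the caller decide how to treat products without a price.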

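Step 5: Export Product Information to CSV

You can now organize the extracted data into a CSV file, which makes it easier to analyze or share. Python's built-in csv module handles this: write a header row with the field names, then one row per product.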
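The export step could be sketched as follows. This is an illustrative version rather than the article's original listing. It assumes each product is a dictionary with the keys name, price, link, and image_url, matching the parsing step. The helper write_products and the sample record are invented, and the demo writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

# Column order for the CSV file; keys match the parsed product dictionaries
FIELDS = ["name", "price", "link", "image_url"]


def write_products(products, file_obj):
    """Write product dictionaries as CSV rows with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(products)


# Hypothetical sample record for illustration only
sample = [{
    "name": "Example Hoodie",
    "price": "$10",
    "link": "https://example.com/p/1",
    "image_url": "https://example.com/p/1.jpg",
}]

buffer = io.StringIO()
write_products(sample, buffer)
print(buffer.getvalue())

# To write a real file instead:
# with open("products.csv", "w", newline="", encoding="utf-8") as f:
#     write_products(products, f)
```

Opening the file with newline="" follows the csv module's recommendation and avoids blank lines between rows on Windows.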

The complete script, which combines the steps above, creates a products.csv file containing all the required product details.

Step 6: Get Extra Data for the Top Products

Now suppose you want to identify the 5 highest-priced products and extract additional data, such as the product description and SKU code, from their individual pages.

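The approach is to sort the products by price in descending order. Then, for each of the 5 highest-priced products, the script opens its product page and uses BeautifulSoup to extract the product description and SKU.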
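The sorting part of this step can be illustrated with plain Python. The sketch below covers only the selection of the highest-priced products. Fetching each product page and pulling out the description and SKU is left out, because the selectors for those pages are not given here. The helpers price_value and top_products, the assumed "$" price format, and the sample data are all hypothetical.

```python
def price_value(product):
    """Turn a price string such as "$52.00" into a float.
    Prices that cannot be parsed sort last."""
    cleaned = product["price"].replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return float("-inf")


def top_products(products, n=5):
    """Return the n highest-priced products, most expensive first."""
    return sorted(products, key=price_value, reverse=True)[:n]


# Hypothetical sample data, not taken from the demo page
sample = [
    {"name": "A", "price": "$20.00"},
    {"name": "B", "price": "$75.50"},
    {"name": "C", "price": ""},
    {"name": "D", "price": "$42.00"},
]
print([p["name"] for p in top_products(sample, n=2)])  # prints ['B', 'D']
```

Each dictionary returned by top_products still carries its link, so a follow-up loop can request that URL and parse the product page.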

The script then updates products.csv so that the entries for these products also include the extracted description and SKU.