How to Scrape Login-Protected Websites with Selenium (Step by Step Guide)

Barbara Streisand
Published: 2024-11-02 10:34:30

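My steps for scraping a password-protected website:

  1. Capture the HTML form elements: the username ID, the password ID, and the login button class
  2. Use a tool such as requests or Selenium to automate the login: fill in the username, wait, fill in the password, wait, click login
  3. Save the session cookies for authentication
  4. Continue scraping the authenticated pages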

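Disclaimer: I built an API for this specific use case at https://www.scrapewebapp.com/. If you want to get this done quickly, use it; otherwise, keep reading.

Let's use this example: say I want to scrape my own API key from my account at https://www.scrapewebapp.com/. It lives on this page: https://app.scrapewebapp.com/account/api_key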

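1. Login page

First, you need to find the login page. Most websites redirect you (typically with an HTTP 302 or 303 response) when you try to access a page that sits behind the login, so if you try to scrape https://app.scrapewebapp.com/account/api_key directly, you will automatically land on the login page https://app.scrapewebapp.com/login. This makes it a good way to find the login page automatically when it has not been provided.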
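
The redirect trick above can be sketched as a small helper. This is a minimal sketch, not the author's code: the function name `login_url_from_response` is made up for illustration, and the response object is simulated so the example runs without network access.

```python
from types import SimpleNamespace
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def login_url_from_response(response):
    """Return the absolute redirect target of a response, or None if it is not a redirect.

    `response` is anything with .status_code, .headers and .url, for example
    the object returned by requests.get(url, allow_redirects=False).
    """
    if response.status_code not in REDIRECT_CODES:
        return None
    location = response.headers.get("Location")
    if not location:
        return None
    # Location may be relative ("/login"), so resolve it against the requested URL
    return urljoin(response.url, location)


# Simulated response (an assumption), standing in for a real unauthenticated request
fake = SimpleNamespace(
    status_code=303,
    headers={"Location": "/login"},
    url="https://app.scrapewebapp.com/account/api_key",
)
print(login_url_from_response(fake))  # → https://app.scrapewebapp.com/login
```

With requests installed, the real call would be `login_url_from_response(requests.get(url, allow_redirects=False))`. Sites that redirect with JavaScript instead of an HTTP status will not be detected this way.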

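OK, now that we have the login page, we need to find where to enter the username or email, where to enter the password, and the actual login button. The best way is to write a simple script that looks for inputs of type "email" or "password" (or named "username") and for a button of type "submit", then returns a CSS selector for each. I wrote the code for you below: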

from bs4 import BeautifulSoup


def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button
    # Searching for buttons/input of type submit closest to the password or username field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )
    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector


def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)
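
To see what kind of output to expect, here is a simplified, dependency-free sketch of the same selector logic run against a toy login form. It uses the standard library's `html.parser` instead of BeautifulSoup, skips the `name="username"` lookup and the same-form check, and the sample HTML and all names in it are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Toy login form (an assumption, not the real page)
SAMPLE_LOGIN_HTML = """
<form action="/login" method="post">
  <input type="email" id="email" name="email">
  <input type="password" id="password" name="password">
  <button type="submit" class="btn">Log in</button>
</form>
"""


class LoginFieldCollector(HTMLParser):
    """Collects the attributes of every <input> and <button> start tag."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag_name, attrs_dict)

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "button"):
            self.elements.append((tag, dict(attrs)))


def css_selector(tag, attrs):
    """Prefer an #id selector, then tag[type='...'], then the bare tag name."""
    if attrs.get("id"):
        return f"#{attrs['id']}"
    if attrs.get("type"):
        return f"{tag}[type='{attrs['type']}']"
    return tag


def first_selector(elements, wanted_types, tags=("input",)):
    """Return the selector of the first element matching the earliest wanted type."""
    for wanted in wanted_types:
        for tag, attrs in elements:
            if tag in tags and attrs.get("type") == wanted:
                return css_selector(tag, attrs)
    return None


def find_login_selectors(html):
    collector = LoginFieldCollector()
    collector.feed(html)
    els = collector.elements
    return (
        first_selector(els, ("email", "text")),
        first_selector(els, ("password",)),
        first_selector(els, ("submit",), tags=("button", "input")),
    )


print(find_login_selectors(SAMPLE_LOGIN_HTML))
# → ('#email', '#password', "button[type='submit']")
```

These three selectors match the ones used in the Selenium login step: an email field with ID "email", a password field with ID "password", and a submit button.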

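2. Actually logging in with Selenium

Now you need to create a Selenium WebDriver. We will use headless Chrome and drive it from Python. The commands below assume a notebook environment such as Jupyter or Google Colab, where a leading ! runs a shell command. Here is how to install it: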

# Install selenium and chromium

!pip install selenium
!apt-get update 
!apt install chromium-chromedriver

!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

Then actually log in to our website and save the cookies. We will save all of the cookies, but you could save only the authentication cookies if you prefer.

# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import time

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Open the login page
    driver.get("https://app.scrapewebapp.com/login")

    # Find the email input field by ID and input your email
    email_input = driver.find_element(By.ID, "email")
    email_input.send_keys("******@gmail.com")

    # Find the password input field by ID and input your password
    password_input = driver.find_element(By.ID, "password")
    password_input.send_keys("*******")

    # Find the login button and submit the form
    login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
    login_button.click()

    # Wait for the login process to complete
    time.sleep(5)  # Adjust this depending on your site's response time


finally:
    # Close the browser
    driver.quit()
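
A fixed `time.sleep(5)` either wastes time or is too short on a slow day. One alternative is to poll for a condition, such as the browser having left the login URL. The helper below is a hand-rolled sketch of that idea, not part of the original article; `wait_until` and `FakeDriver` are hypothetical names, and the `/dashboard` URL is an assumption used only to simulate a successful login.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it is truthy or `timeout` seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakeDriver:
    """Hypothetical stand-in for a WebDriver whose URL changes on the third check."""

    def __init__(self):
        self._checks = 0

    @property
    def current_url(self):
        self._checks += 1
        if self._checks < 3:
            return "https://app.scrapewebapp.com/login"
        return "https://app.scrapewebapp.com/dashboard"  # assumed post-login URL


login_url = "https://app.scrapewebapp.com/login"
driver = FakeDriver()
wait_until(lambda: driver.current_url != login_url, timeout=5, poll_interval=0.01)
print("left the login page")
```

With a real driver the same lambda works unchanged, since `current_url` is a standard WebDriver property. Selenium also ships its own `WebDriverWait` class for this purpose, which is usually the better choice in production code.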

3. Saving the cookies

It is as simple as calling the driver.get_cookies() function and saving the result into a dictionary.

def save_cookies(driver):
    """Save cookies from the Selenium WebDriver into a dictionary."""
    cookies = driver.get_cookies()
    cookie_dict = {}
    for cookie in cookies:
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict

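Save the cookies from the WebDriver. Call this while the browser is still open, that is, inside the try block of the login script, before driver.quit() runs:

cookies = save_cookies(driver)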
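
If you would rather keep only the authentication cookies, a small filter over the list returned by `driver.get_cookies()` is enough. The sketch below is an illustration under assumptions: `select_cookies` is a hypothetical helper, and the cookie names and values in `sample` are invented, since the real names depend on the site.

```python
import time


def select_cookies(selenium_cookies, wanted_names=None, now=None):
    """Turn the list returned by driver.get_cookies() into a {name: value} dict.

    Keeps only cookies whose name is in wanted_names (all of them if None)
    and skips cookies whose 'expiry' timestamp is already in the past.
    """
    now = time.time() if now is None else now
    selected = {}
    for cookie in selenium_cookies:
        if wanted_names is not None and cookie["name"] not in wanted_names:
            continue
        expiry = cookie.get("expiry")  # absent for session cookies
        if expiry is not None and expiry <= now:
            continue
        selected[cookie["name"]] = cookie["value"]
    return selected


# Invented cookie list in the shape get_cookies() returns
sample = [
    {"name": "session", "value": "abc123", "domain": "app.scrapewebapp.com", "path": "/"},
    {"name": "_ga", "value": "GA1.2.3", "domain": ".scrapewebapp.com", "path": "/", "expiry": 4102444800},
    {"name": "old_token", "value": "stale", "domain": "app.scrapewebapp.com", "path": "/", "expiry": 1},
]
print(select_cookies(sample, wanted_names={"session", "old_token"}))  # → {'session': 'abc123'}
```

To find out which names matter for a given site, log in once in a normal browser and look at the cookies it sets in the developer tools.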

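4. Getting data from our logged-in session

In this part we will use the simple requests library, but you could also keep using Selenium.

Now we want to get the actual API key from this page: https://app.scrapewebapp.com/account/api_key.

So we create a session with the requests library and add each cookie to it. Then we request the URL and print the response text.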

def scrape_api_key(cookies):
    """Use cookies to scrape the /account/api_key page."""
    url = 'https://app.scrapewebapp.com/account/api_key'

    # Set up the session to persist cookies
    session = requests.Session()

    # Add cookies from Selenium to the requests session
    for name, value in cookies.items():
        session.cookies.set(name, value)

    # Make the request to the /account/api_key page
    response = session.get(url)

    # Check if the request is successful
    if response.status_code == 200:
        print("API Key page content:")
        print(response.text)  # Print the page content (could contain the API key)
    else:
        print(f"Failed to retrieve API key page, status code: {response.status_code}")
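
One caveat with checking only for status code 200: requests follows redirects by default, so if the cookies are missing or expired you may get a 200 response that is actually the login page. A minimal sketch of an extra check follows; `was_redirected_to_login` is a hypothetical helper, and the responses are simulated so the example runs offline.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


def was_redirected_to_login(response, login_path="/login"):
    """True if the final URL of a followed request is the login page.

    `response` needs a .url attribute holding the final URL,
    as a requests.Response does.
    """
    final_path = urlparse(response.url).path.rstrip("/")
    return final_path == login_path.rstrip("/")


# Simulated responses (assumptions) for the two possible outcomes
bounced = SimpleNamespace(status_code=200, url="https://app.scrapewebapp.com/login")
served = SimpleNamespace(status_code=200, url="https://app.scrapewebapp.com/account/api_key")
print(was_redirected_to_login(bounced))  # → True
print(was_redirected_to_login(served))   # → False
```

Inside `scrape_api_key`, this check could run right after `session.get(url)` and before the page content is treated as valid.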

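5. Getting the actual data you want (bonus)

We got the text of the page we wanted, but it contains a lot of data we don't care about. We only want the api_key.

The best and easiest way is to use an AI such as ChatGPT (the GPT-4o model).

Prompt the model like this: "You are an expert scraper and you will only extract the information asked from the context. I need the api-key value from {context}"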
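
Filling in `{context}` can be sketched as follows. This is only the prompt-building half, written under assumptions: the helper names, the `wanted` and `max_chars` parameters, and the sample page with its fake key are all invented for illustration. The call to the model itself is left out because it depends on which client library and account you use.

```python
from html.parser import HTMLParser

PROMPT_TEMPLATE = (
    "You are an expert scraper and you will only extract the information "
    "asked from the context. I need the {wanted} value from {context}"
)


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def build_extraction_prompt(page_html, wanted="api-key", max_chars=4000):
    """Strip the markup from page_html and insert the text into the prompt."""
    parser = TextExtractor()
    parser.feed(page_html)
    parser.close()
    # Truncate so a large page does not blow past the model's input limit
    context = " ".join(parser.chunks)[:max_chars]
    return PROMPT_TEMPLATE.format(wanted=wanted, context=context)


# Invented page content with a fake key
sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>API key: sk-demo-123</p></body></html>"
)
print(build_extraction_prompt(sample_html))
```

Stripping tags, scripts, and styles first keeps the prompt short, which matters because `response.text` for a full page is mostly markup.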


If you want a simple and reliable API that does all of this, try my new product https://www.scrapewebapp.com/

If you liked this article, please give it a clap and follow me. It really helps!

Source: dev.to