How to Scrape Login-Protected Websites with Selenium (A Step-by-Step Guide)
My steps for scraping a password-protected website:
- Capture the HTML form elements: the username field ID, the password field ID, and the login button class
- Use a tool like requests or Selenium to automate the login: fill in the username, wait, fill in the password, wait, click login
- Store the session cookies for authentication
- Keep scraping the authenticated pages
Disclaimer: I have built an API for this exact use case at https://www.scrapewebapp.com/. So if you just want it done quickly, use that; otherwise, read on.
Let's work through an example: say I want to scrape my own API key from my account on https://www.scrapewebapp.com/, which lives on this page: https://app.scrapewebapp.com/account/api_key
1. Find the Login Page
First, you need to find the login page. Most websites return a 303 redirect when you try to access a page that sits behind a login, so if you try to scrape https://app.scrapewebapp.com/account/api_key directly, you automatically land on the login page https://app.scrapewebapp.com/login. Following that redirect is therefore a good way to find the login page automatically if you don't already know it.
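Here is a minimal sketch of that redirect trick with requests, using the URLs from this example. requests follows redirects by default, so for an unauthenticated session the final URL is the login page:

```python
import requests

# Request a protected page without any authentication
response = requests.get("https://app.scrapewebapp.com/account/api_key")

# requests follows the redirect chain automatically, so response.url
# is where we finally landed -- typically the login page
print("Landed on:", response.url)
print("Redirect status codes:", [r.status_code for r in response.history])
```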
OK, now that we have the login page, we need to find where to type the username or email, where to type the password, and the actual login button. The best way is to write a small script that looks for inputs of type "email", "username", or "password" and for a button of type "submit", then grabs their IDs. I wrote the code for you below:
```python
from bs4 import BeautifulSoup


def extract_login_form(html_content: str):
    """
    Extracts the login form elements from the given HTML content
    and returns their CSS selectors.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Finding the username/email field
    username_email = (
        soup.find("input", {"type": "email"})
        or soup.find("input", {"name": "username"})
        or soup.find("input", {"type": "text"})
    )  # Fallback to input type text if no email type is found

    # Finding the password field
    password = soup.find("input", {"type": "password"})

    # Finding the login button
    # Searching for buttons/input of type submit closest to the password or username field
    login_button = None

    # First try to find a submit button within the same form
    if password:
        form = password.find_parent("form")
        if form:
            login_button = form.find("button", {"type": "submit"}) or form.find(
                "input", {"type": "submit"}
            )
    # If no button is found in the form, fall back to finding any submit button
    if not login_button:
        login_button = soup.find("button", {"type": "submit"}) or soup.find(
            "input", {"type": "submit"}
        )

    # Extracting CSS selectors
    def generate_css_selector(element, element_type):
        if "id" in element.attrs:
            return f"#{element['id']}"
        elif "type" in element.attrs:
            return f"{element_type}[type='{element['type']}']"
        else:
            return element_type

    # Generate CSS selectors with the updated logic
    username_email_css_selector = None
    if username_email:
        username_email_css_selector = generate_css_selector(username_email, "input")

    password_css_selector = None
    if password:
        password_css_selector = generate_css_selector(password, "input")

    login_button_css_selector = None
    if login_button:
        login_button_css_selector = generate_css_selector(
            login_button, "button" if login_button.name == "button" else "input"
        )

    return username_email_css_selector, password_css_selector, login_button_css_selector


def main(html_content: str):
    # Call the extract_login_form function and return its result
    return extract_login_form(html_content)
```
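To try it on our example, you could fetch the login page's HTML and pass it in like this (a hypothetical quick check; the printed selectors are just what you might expect to see):

```python
import requests

# Fetch the login page and extract the three selectors
html = requests.get("https://app.scrapewebapp.com/login").text
username_sel, password_sel, button_sel = extract_login_form(html)
print(username_sel, password_sel, button_sel)
# e.g. "#email" "#password" "button[type='submit']"
```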
2. Actually Log In with Selenium
Now you need to create a Selenium WebDriver. We will use headless Chrome and drive it from Python. Here is how to install it:
```python
# Install selenium and chromium (Colab/Jupyter-style shell commands)
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import sys
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')
```
Then actually log in to our website and save the cookies. We save all of them here, but you could keep only the authentication cookies if you prefer.
```python
# Imports
from selenium import webdriver
from selenium.webdriver.common.by import By
import requests
import time

# Set up Chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Initialize the WebDriver
driver = webdriver.Chrome(options=chrome_options)

try:
    # Open the login page
    driver.get("https://app.scrapewebapp.com/login")

    # Find the email input field by ID and input your email
    email_input = driver.find_element(By.ID, "email")
    email_input.send_keys("******@gmail.com")

    # Find the password input field by ID and input your password
    password_input = driver.find_element(By.ID, "password")
    password_input.send_keys("*******")

    # Find the login button and submit the form
    login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']")
    login_button.click()

    # Wait for the login process to complete
    time.sleep(5)  # Adjust this depending on your site's response time

    # NOTE: save the session cookies (step 3) here, while the driver is
    # still open -- once quit() runs below, the cookies are gone
finally:
    # Close the browser
    driver.quit()
```
3. Store the Cookies
It is as simple as saving them into a dictionary with the driver.get_cookies() function.
```python
def save_cookies(driver):
    """Save cookies from the Selenium WebDriver into a dictionary."""
    cookies = driver.get_cookies()
    cookie_dict = {}
    for cookie in cookies:
        cookie_dict[cookie['name']] = cookie['value']
    return cookie_dict


# Save cookies from the WebDriver (call this before driver.quit())
cookies = save_cookies(driver)
```
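If you want to reuse the session across runs without logging in again, one option (an addition beyond the original flow) is to persist the dictionary to disk as JSON:

```python
import json

# Persist the cookies so a later run can skip the Selenium login
with open("cookies.json", "w") as f:
    json.dump(cookies, f)

# ...and load them back in a later run
with open("cookies.json") as f:
    cookies = json.load(f)
```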
4. Get Data from the Logged-In Session
In this part we will use the simple requests library, but you could keep using Selenium as well.
Now we want to get the actual API key from this page: https://app.scrapewebapp.com/account/api_key.
So we create a session with the requests library, add each cookie to it, then request the URL and print the response text.
```python
def scrape_api_key(cookies):
    """Use cookies to scrape the /account/api_key page."""
    url = 'https://app.scrapewebapp.com/account/api_key'

    # Set up the session to persist cookies
    session = requests.Session()

    # Add cookies from Selenium to the requests session
    for name, value in cookies.items():
        session.cookies.set(name, value)

    # Make the request to the /account/api_key page
    response = session.get(url)

    # Check if the request is successful
    if response.status_code == 200:
        print("API Key page content:")
        print(response.text)  # Print the page content (could contain the API key)
    else:
        print(f"Failed to retrieve API key page, status code: {response.status_code}")
```
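Wiring it up with the cookie dictionary saved in step 3:

```python
# Hand the Selenium cookies to the requests-based scraper
scrape_api_key(cookies)
```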
5. Get the Actual Data You Want (Bonus)
We got the page text we wanted, but it contains a lot of data we do not care about. We only want the api_key.
The best and easiest way is to use an AI like ChatGPT (the GPT-4o model).
Prompt the model like this: "You are an expert scraper and you only extract the information asked from the context. I need the value of the api-key from {context}"
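Here is a minimal sketch of that step with the official openai Python client (an assumption on my part, since the article does not show this code; it expects an OPENAI_API_KEY environment variable):

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable


def extract_api_key_with_llm(context: str) -> str:
    """Ask GPT-4o to pull just the api-key value out of the scraped page text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "You are an expert scraper and you only extract the information "
                f"asked from the context. I need the value of the api-key from {context}"
            ),
        }],
    )
    return response.choices[0].message.content
```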
If you want a simple and reliable API that does all of this for you, try my new product: https://www.scrapewebapp.com/
If you enjoyed this article, please give me a clap and follow me. It really helps!