Penafian: Saya telah membina API untuk kes penggunaan khusus ini di https://www.scrapewebapp.com/. Jadi, jika anda ingin menyelesaikannya dengan cepat, gunakannya, jika tidak, baca terus.
Mari kita gunakan contoh ini: katakan saya mahu mengikis kunci API saya sendiri daripada akaun saya di https://www.scrapewebapp.com/. Ia ada di halaman ini: https://app.scrapewebapp.com/account/api_key
Pertama, anda perlu mencari halaman log masuk. Kebanyakan tapak web akan memberi anda ubah hala 303 jika anda cuba mengakses halaman di belakang log masuk, jadi jika anda cuba mengikis terus https://app.scrapewebapp.com/account/api_key, anda akan mendapat halaman log masuk https:// secara automatik app.scrapewebapp.com/login. Jadi ini adalah cara yang baik untuk mengautomasikan pencarian halaman log masuk jika belum disediakan.
Ok, sekarang kita mempunyai halaman log masuk, kita perlu mencari tempat untuk menambah nama pengguna atau e-mel serta kata laluan dan butang log masuk sebenar. Cara terbaik ialah mencipta skrip ringkas yang mencari ID input menggunakan jenis "e-mel", "nama pengguna", "kata laluan" dan mencari butang dengan jenis "serahkan". Saya membuat kod untuk anda di bawah:
from bs4 import BeautifulSoup def extract_login_form(html_content: str): """ Extracts the login form elements from the given HTML content and returns their CSS selectors. """ soup = BeautifulSoup(html_content, "html.parser") # Finding the username/email field username_email = ( soup.find("input", {"type": "email"}) or soup.find("input", {"name": "username"}) or soup.find("input", {"type": "text"}) ) # Fallback to input type text if no email type is found # Finding the password field password = soup.find("input", {"type": "password"}) # Finding the login button # Searching for buttons/input of type submit closest to the password or username field login_button = None # First try to find a submit button within the same form if password: form = password.find_parent("form") if form: login_button = form.find("button", {"type": "submit"}) or form.find( "input", {"type": "submit"} ) # If no button is found in the form, fall back to finding any submit button if not login_button: login_button = soup.find("button", {"type": "submit"}) or soup.find( "input", {"type": "submit"} ) # Extracting CSS selectors def generate_css_selector(element, element_type): if "id" in element.attrs: return f"#{element['id']}" elif "type" in element.attrs: return f"{element_type}[type='{element['type']}']" else: return element_type # Generate CSS selectors with the updated logic username_email_css_selector = None if username_email: username_email_css_selector = generate_css_selector(username_email, "input") password_css_selector = None if password: password_css_selector = generate_css_selector(password, "input") login_button_css_selector = None if login_button: login_button_css_selector = generate_css_selector( login_button, "button" if login_button.name == "button" else "input" ) return username_email_css_selector, password_css_selector, login_button_css_selector def main(html_content: str): # Call the extract_login_form function and return its result return extract_login_form(html_content)
2. Menggunakan Selenium untuk Log Masuk Sebenarnya
Kini anda perlu mencipta pemacu web selenium. Kami akan menggunakan chrome tanpa kepala untuk menjalankannya dengan Python. Begini cara memasangnya:
# Install selenium and chromium !pip install selenium !apt-get update !apt install chromium-chromedriver !cp /usr/lib/chromium-browser/chromedriver /usr/bin import sys sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
Kemudian log masuk ke laman web kami dan simpan kuki. Kami akan menyimpan semua kuki, tetapi anda hanya boleh menyimpan kuki pengesahan jika anda mahu.
# Imports from selenium import webdriver from selenium.webdriver.common.by import By import requests import time # Set up Chrome options chrome_options = webdriver.ChromeOptions() chrome_options.add_argument('--headless') chrome_options.add_argument('--no-sandbox') chrome_options.add_argument('--disable-dev-shm-usage') # Initialize the WebDriver driver = webdriver.Chrome(options=chrome_options) try: # Open the login page driver.get("https://app.scrapewebapp.com/login") # Find the email input field by ID and input your email email_input = driver.find_element(By.ID, "email") email_input.send_keys("******@gmail.com") # Find the password input field by ID and input your password password_input = driver.find_element(By.ID, "password") password_input.send_keys("*******") # Find the login button and submit the form login_button = driver.find_element(By.CSS_SELECTOR, "button[type='submit']") login_button.click() # Wait for the login process to complete time.sleep(5) # Adjust this depending on your site's response time finally: # Close the browser driver.quit()
Ia semudah menyimpannya ke dalam kamus daripada fungsi driver.getcookies().
def save_cookies(driver): """Save cookies from the Selenium WebDriver into a dictionary.""" cookies = driver.get_cookies() cookie_dict = {} for cookie in cookies: cookie_dict[cookie['name']] = cookie['value'] return cookie_dict
Simpan kuki daripada WebDriver
kuki = simpan_kuki(pemandu)
Dalam bahagian ini, kami akan menggunakan permintaan perpustakaan yang mudah, tetapi anda boleh terus menggunakan selenium juga.
Sekarang kami mahu mendapatkan API sebenar daripada halaman ini: https://app.scrapewebapp.com/account/api_key.
Jadi kami membuat sesi daripada perpustakaan permintaan dan menambah setiap kuki ke dalamnya. Kemudian minta URL dan cetak teks respons.
def scrape_api_key(cookies): """Use cookies to scrape the /account/api_key page.""" url = 'https://app.scrapewebapp.com/account/api_key' # Set up the session to persist cookies session = requests.Session() # Add cookies from Selenium to the requests session for name, value in cookies.items(): session.cookies.set(name, value) # Make the request to the /account/api_key page response = session.get(url) # Check if the request is successful if response.status_code == 200: print("API Key page content:") print(response.text) # Print the page content (could contain the API key) else: print(f"Failed to retrieve API key page, status code: {response.status_code}")
Kami mendapat teks halaman yang kami mahukan, tetapi terdapat banyak data yang kami tidak pedulikan. Kami hanya mahu api_key.
Cara terbaik dan termudah untuk melakukannya ialah menggunakan AI seperti ChatGPT (model GPT4o).
Gesa model seperti ini: “Anda pakar pengikis dan anda hanya akan mengeluarkan maklumat yang diminta daripada konteks. Saya memerlukan nilai kunci api saya daripada {context}”
from bs4 import BeautifulSoup def extract_login_form(html_content: str): """ Extracts the login form elements from the given HTML content and returns their CSS selectors. """ soup = BeautifulSoup(html_content, "html.parser") # Finding the username/email field username_email = ( soup.find("input", {"type": "email"}) or soup.find("input", {"name": "username"}) or soup.find("input", {"type": "text"}) ) # Fallback to input type text if no email type is found # Finding the password field password = soup.find("input", {"type": "password"}) # Finding the login button # Searching for buttons/input of type submit closest to the password or username field login_button = None # First try to find a submit button within the same form if password: form = password.find_parent("form") if form: login_button = form.find("button", {"type": "submit"}) or form.find( "input", {"type": "submit"} ) # If no button is found in the form, fall back to finding any submit button if not login_button: login_button = soup.find("button", {"type": "submit"}) or soup.find( "input", {"type": "submit"} ) # Extracting CSS selectors def generate_css_selector(element, element_type): if "id" in element.attrs: return f"#{element['id']}" elif "type" in element.attrs: return f"{element_type}[type='{element['type']}']" else: return element_type # Generate CSS selectors with the updated logic username_email_css_selector = None if username_email: username_email_css_selector = generate_css_selector(username_email, "input") password_css_selector = None if password: password_css_selector = generate_css_selector(password, "input") login_button_css_selector = None if login_button: login_button_css_selector = generate_css_selector( login_button, "button" if login_button.name == "button" else "input" ) return username_email_css_selector, password_css_selector, login_button_css_selector def main(html_content: str): # Call the extract_login_form function and return its result return extract_login_form(html_content)
Jika anda mahukan semua itu dalam API yang mudah dan boleh dipercayai, sila cuba produk baharu saya https://www.scrapewebapp.com/
Jika anda suka siaran ini, sila beri saya tepuk tangan dan ikuti saya. Ia sangat membantu!
Atas ialah kandungan terperinci Cara Mengikis Laman Web yang Dilindungi Log Masuk dengan Selenium (Panduan Langkah demi Langkah). Untuk maklumat lanjut, sila ikut artikel berkaitan lain di laman web China PHP!