HTML element cannot be located during web crawling, even though it is visible in the site inspection tool
P粉225961749 · 2023-09-19 12:41:46

I'm trying to crawl the titles of all tables from this URL: https://www.nature.com/articles/s41586-023-06192-4

I can find this HTML element on the website:

<b id="Tab1" data-test="table-caption">Table 1 Calculated Ct–M–Ct angles</b>

I cannot crawl this title because the element cannot be found. Even when I print the fetched HTML to the console, this element does not appear in it.

I use the following code to print the HTML source:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

url = 'https://www.nature.com/articles/s41586-023-06192-4'

session = HTMLSession()
response = session.get(url)

response.html.render()  # execute the page's JavaScript (uses pyppeteer; downloads Chromium on first run)

soup = BeautifulSoup(response.html.raw_html.decode('utf-8'), 'html.parser')
print(soup.prettify())
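Rather than scanning the `prettify()` output by eye, the absence of the element can be confirmed with a direct lookup. A small diagnostic sketch (the helper name is mine):

```python
from bs4 import BeautifulSoup

def caption_present(html, tab_id='Tab1'):
    """Return True if the table caption with the given id is in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('b', {'id': tab_id, 'data-test': 'table-caption'}) is not None

# e.g. caption_present(response.html.raw_html.decode('utf-8'))
```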

Crawling function using BeautifulSoup:

def get_tables(driver):
    tables = []
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    for i in range(1, 11):
        try:
            table_caption = soup.find('b', {'id': f'Tab{i}', 'data-test': 'table-caption'})
            table_text = table_caption.text if table_caption else "Not Available"
            if table_text != "Not Available":
                print(f"Found table {i}: {table_text}")
            else:
                print(f"Table {i} not found.")
            tables.append(table_text)
        except Exception as e:
            print(f"Error processing table {i}: {str(e)}")
            tables.append("Not Available")

    return tables

Crawling function using Selenium:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def get_tables(driver):
    tables = []

    for i in range(1, 11):
        try:
            # find_element_by_css_selector was removed in Selenium 4;
            # use find_element(By.CSS_SELECTOR, ...) instead
            table_caption = driver.find_element(By.CSS_SELECTOR, f'b#Tab{i}[data-test="table-caption"]')
            print(f"Found table {i}: {table_caption.text}")
            tables.append(table_caption.text)
        except NoSuchElementException:
            print(f"Table {i} not found.")
            tables.append("Not Available")
        except Exception as e:
            print(f"Error processing table {i}: {str(e)}")
            tables.append("Not Available")

    return tables

I have tried using both Selenium and BeautifulSoup to crawl the website. I've checked for iframes. I delayed the fetch by 40 seconds to ensure the page had loaded completely. Even GPT-4 cannot solve this problem.

Replies (1)
P粉920485285

The code you used looks fine. The problem that comes to mind is that the website may be loading the element you want to crawl via JavaScript or an XHR call, so when you send the request with the requests library, you cannot get that element.

The way to solve this is to use Selenium: open the website with Selenium, then load the page source into bs4, so that your code can work normally.

Note: load the page source into bs4 only after the entire website has finished loading. You may also need to create a login function using Selenium, as this website requires a login to view some content.
