Verwenden Sie Beautiful Soup, um Daten zu extrahieren, wenn das Div nicht vorhanden ist

Question

Ich versuche, Tabellendaten aus Tausenden von HTML-Dateien oder Site-Daten zu extrahieren, aber die Tabellen verfügen nicht über Divs, um dies zu vereinfachen, und ich bin sehr neu bei beautifulsoup. Im Moment bearbeite ich das gesamte in CSV konvertierte HTML manuell und füge es in meine Datenbank ein, um die Tabellen zu erstellen, aber ich nehme lieber einfach das, was ich bereits habe. <

P粉463291248 · Answer

BeautifulSoup 允许您搜索 div 以外的内容。

假设您显示的 html 想要检索看起来像跑步者的内容，您可以执行类似的操作。

from bs4 import BeautifulSoup

file_path = 'scrap.html'

with open(file_path, 'r',
          encoding='utf-8') as file:  # We simulate a return from an html request by just opening an .html file
    html_content = file.read()

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', {"class": "racetable"})  # We are looking for the table with the 'racetable' class
rows_table = table.find_all('tr')[1:]  # All lines in the table without the first one

columns_name = [
    row.get_text() for row in rows_table[0].find_all('td')
]  # We get the name of each column in a list

runners = []
for row in rows_table[1:]:  # We repeat on all the lines except the first one which is the one with the name of the columns
    data = [
        elem.get_text().strip() for elem in row.find_all('td')
    ]
    runner = {
        "place": data[columns_name.index("Place")],
        "name": data[columns_name.index("Name")],
        "city": data[columns_name.index("City")],
        "bib_no": data[columns_name.index("Bib No")],
        "age": data[columns_name.index("Age")],
        "gender": data[columns_name.index("Gender")],
        "age_group": data[columns_name.index("Age Group")],
        "total_time": data[columns_name.index("Total Time")],
        "pace": data[columns_name.index("Pace")]
    }
    print(runner)
    runners.append(runner)

打印的结果看起来像这样

{'place': '1', 'name': 'Runner 1', 'city': 'ANYTOWN  PA', 'bib_no': '390', 'age': '52', 'gender': 'M', 'age_group': '1:Overall', 'total_time': '18:43.93', 'pace': '6:03/M'}
{'place': '2', 'name': 'Runner 2', 'city': 'ANYTOWN  PA', 'bib_no': '380', 'age': '33', 'gender': 'M', 'age_group': '1:19-39', 'total_time': '19:31.27', 'pace': '6:18/M'}
{'place': '3', 'name': 'Runner 3', 'city': 'ANYTOWN  PA', 'bib_no': '389', 'age': '65', 'gender': 'F', 'age_group': '1:Overall', 'total_time': '45:45.20', 'pace': '14:46/M'}
{'place': '4', 'name': 'Runner 4', 'city': 'ANYTOWN  PA', 'bib_no': '381', 'age': '18', 'gender': 'F', 'age_group': '1: 1-18', 'total_time': '53:28.84', 'pace': '17:15/M'}
{'place': '5', 'name': 'Runner 5', 'city': 'ANYTOWN  PA', 'bib_no': '382', 'age': '41', 'gender': 'F', 'age_group': '1:40-59', 'total_time': '53:30.48', 'pace': '17:16/M'}
{'place': '6', 'name': 'Runner 6', 'city': 'ANYTOWN  PA', 'bib_no': '384', 'age': '14', 'gender': 'M', 'age_group': '1: 1-18', 'total_time': '57:38.66', 'pace': '18:36/M'}
{'place': '7', 'name': 'Runner 7', 'city': 'ANYTOWN  PA', 'bib_no': '385', 'age': '72', 'gender': 'F', 'age_group': '1:60-99', 'total_time': '57:40.11', 'pace': '18:36/M'}