Ich versuche, Tabellendaten aus ein paar tausend HTML-Dateien oder Site-Daten zu extrahieren, aber die Tabellen haben keine Divs, um das zu vereinfachen, und ich bin ein Neuling in Sachen „Beautiful Soup“. Im Moment bearbeite ich das gesamte in CSV konvertierte HTML manuell und füge es in meine Datenbank ein, um die Tabelle zu erstellen, aber ich nehme lieber einfach das, was ich bereits habe.
< <body style="margin-top:140px;"> <div id="container"> <!-- Left div --> <div> </div> <!-- Center div --> <div> <!-- Image Link --> <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a> </div> <!-- Right div --> <div> </div> </div> <A Name = "Top"></A> <H1>5k Run</H1> <H1>Overall Finish List</H1> <H2>September 24, 2022</H2> <HR noshade> <B><I> </I></B> <HR noshade> <table border=0 cellpadding=0 cellspacing=0 class="racetable"> <tr> <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td> </tr> <tr> <td class=h11>Place</td> <td class=h12>Name</td> <td class=h12>City</td> <td class=h11>Bib No</td> <td class=h11>Age</td> <td class=h11>Gender</td> <td class=h11>Age Group</td> <td class=h11>Total Time</td> <td class=h11>Pace</td> </tr> <tr> <td class=d01>1</td> <td class=d02>Runner 1</td> <td class=d02>ANYTOWN PA</td> <td class=d01>390</td> <td class=d01>52</td> <td class=d01>M</td> <td class=d01>1:Overall</td> <td class=d01> 18:43.93</td> <td class=d01>6:03/M</td> </tr> <tr> <td class=d01>2</td> <td class=d02>Runner 2</td> <td class=d02>ANYTOWN PA</td> <td class=d01>380</td> <td class=d01>33</td> <td class=d01>M</td> <td class=d01>1:19-39</td> <td class=d01> 19:31.27</td> <td class=d01>6:18/M</td> </tr> <tr> <td class=d01>3</td> <td class=d02>Runner 3</td> <td class=d02>ANYTOWN PA</td> <td class=d01>389</td> <td class=d01>65</td> <td class=d01>F</td> <td class=d01>1:Overall</td> <td class=d01> 45:45.20</td> <td class=d01>14:46/M</td> </tr> <tr> <td class=d01>4</td> <td class=d02>Runner 4</td> <td class=d02>ANYTOWN PA</td> <td class=d01>381</td> <td class=d01>18</td> <td class=d01>F</td> <td class=d01>1: 1-18</td> <td class=d01> 53:28.84</td> <td class=d01>17:15/M</td> </tr> <tr> <td class=d01>5</td> <td class=d02>Runner 5</td> <td class=d02>ANYTOWN PA</td> <td class=d01>382</td> <td class=d01>41</td> <td class=d01>F</td> <td class=d01>1:40-59</td> <td class=d01> 53:30.48</td> <td class=d01>17:16/M</td> </tr> <tr> <td class=d01>6</td> <td class=d02>Runner 6</td> <td class=d02>ANYTOWN PA</td> <td class=d01>384</td> <td class=d01>14</td> <td class=d01>M</td> <td class=d01>1: 1-18</td> <td class=d01> 57:38.66</td> <td class=d01>18:36/M</td> </tr> <tr> <td class=d01>7</td> <td class=d02>Runner 7</td> <td class=d02>ANYTOWN PA</td> <td class=d01>385</td> <td class=d01>72</td> <td class=d01>F</td> <td class=d01>1:60-99</td> <td class=d01> 57:40.11</td> <td class=d01>18:36/M</td> </tr> </table> <HR noshade> <p> <!-- 0c17 22.0 2e9 --> </BODY> </HTML> >
Ich habe versucht, Divs hinzuzufügen, ohne großen Erfolg.
BeautifulSoup 允许您搜索 div 以外的内容。
假设您显示的 html 想要检索看起来像跑步者的内容,您可以执行类似的操作。
打印的结果看起来像这样