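I'm trying to extract table data from a few thousand HTML files (saved site data), but the tables aren't wrapped in divs that would make this easy, and I'm still new to Beautiful Soup. Right now I manually edit each HTML file, convert it to CSV, and load it into my database to create the tables, but I'd rather just scrape what I already have.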
<body style="margin-top:140px;">
<div id="container">
  <!-- Left div -->
  <div> </div>
  <!-- Center div -->
  <div>
    <!-- Image Link -->
    <a href="http://www.website.com"><img src="http://website.com/wp-content/uploads/2016/12/Blue-Transparent.png" style = "max-width:100%; max-height:120px;" alt="Center Banner"></a>
  </div>
  <!-- Right div -->
  <div> </div>
</div>
<A Name = "Top"></A>
<H1>5k Run</H1>
<H1>Overall Finish List</H1>
<H2>September 24, 2022</H2>
<HR noshade>
<B><I> </I></B>
<HR noshade>
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
<tr> <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td> </tr>
<tr> <td class=h11>Place</td> <td class=h12>Name</td> <td class=h12>City</td> <td class=h11>Bib No</td> <td class=h11>Age</td> <td class=h11>Gender</td> <td class=h11>Age Group</td> <td class=h11>Total Time</td> <td class=h11>Pace</td> </tr>
<tr> <td class=d01>1</td> <td class=d02>Runner 1</td> <td class=d02>ANYTOWN PA</td> <td class=d01>390</td> <td class=d01>52</td> <td class=d01>M</td> <td class=d01>1:Overall</td> <td class=d01> 18:43.93</td> <td class=d01>6:03/M</td> </tr>
<tr> <td class=d01>2</td> <td class=d02>Runner 2</td> <td class=d02>ANYTOWN PA</td> <td class=d01>380</td> <td class=d01>33</td> <td class=d01>M</td> <td class=d01>1:19-39</td> <td class=d01> 19:31.27</td> <td class=d01>6:18/M</td> </tr>
<tr> <td class=d01>3</td> <td class=d02>Runner 3</td> <td class=d02>ANYTOWN PA</td> <td class=d01>389</td> <td class=d01>65</td> <td class=d01>F</td> <td class=d01>1:Overall</td> <td class=d01> 45:45.20</td> <td class=d01>14:46/M</td> </tr>
<tr> <td class=d01>4</td> <td class=d02>Runner 4</td> <td class=d02>ANYTOWN PA</td> <td class=d01>381</td> <td class=d01>18</td> <td class=d01>F</td> <td class=d01>1: 1-18</td> <td class=d01> 53:28.84</td> <td class=d01>17:15/M</td> </tr>
<tr> <td class=d01>5</td> <td class=d02>Runner 5</td> <td class=d02>ANYTOWN PA</td> <td class=d01>382</td> <td class=d01>41</td> <td class=d01>F</td> <td class=d01>1:40-59</td> <td class=d01> 53:30.48</td> <td class=d01>17:16/M</td> </tr>
<tr> <td class=d01>6</td> <td class=d02>Runner 6</td> <td class=d02>ANYTOWN PA</td> <td class=d01>384</td> <td class=d01>14</td> <td class=d01>M</td> <td class=d01>1: 1-18</td> <td class=d01> 57:38.66</td> <td class=d01>18:36/M</td> </tr>
<tr> <td class=d01>7</td> <td class=d02>Runner 7</td> <td class=d02>ANYTOWN PA</td> <td class=d01>385</td> <td class=d01>72</td> <td class=d01>F</td> <td class=d01>1:60-99</td> <td class=d01> 57:40.11</td> <td class=d01>18:36/M</td> </tr>
</table>
<HR noshade>
<p>
<!-- 0c17 22.0 2e9 -->
</BODY>
</HTML>
I tried adding divs, but without much success.
BeautifulSoup lets you search for elements other than divs: any tag name, attribute, or CSS class can be used as the search key.
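To illustrate searching without any divs, here is a minimal sketch using `find`, `find_all`, and `select` on a cut-down version of the table from the question. The shortened `html` string and the variable names are assumptions made for the example; the tag and class names (`racetable`, `h11`, `d02`) come from the posted HTML.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the posted page (assumption: same tag/class names).
html = """
<table class="racetable">
  <tr><td class=h11>Place</td><td class=h12>Name</td></tr>
  <tr><td class=d01>1</td><td class=d02>Runner 1</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Search by tag name plus CSS class.
table = soup.find("table", class_="racetable")

# Every <td> carrying class "d02" (the name/city cells in the sample).
name_cells = soup.find_all("td", class_="d02")

# The same kind of lookup expressed as a CSS selector.
header_cells = soup.select("table.racetable td.h11")

print(table is not None)
print([td.get_text(strip=True) for td in name_cells])
print([td.get_text(strip=True) for td in header_cells])
```

None of these lookups need a wrapping div; the `class` attribute on the table and its cells is enough to target them.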
Assuming, from the HTML you posted, that you want to retrieve the rows that look like runners, you could do something like this.
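One possible way to do this is sketched below; it is not necessarily the code originally posted with this answer. It assumes every results page has a table with class `racetable`, that header cells use classes `h11`/`h12`, and that runner cells use `d01`/`d02`, as in the sample. The function name `extract_runner_rows` and the shortened `sample` string are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_runner_rows(html):
    """Return (header, rows) from the first table with class 'racetable'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="racetable")
    if table is None:
        return [], []

    header, rows = [], []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        texts = [td.get_text(strip=True) for td in cells]
        # Collect the CSS classes used by this row's cells.
        classes = {c for td in cells for c in td.get("class", [])}
        if classes & {"h11", "h12"}:
            header = texts          # column-title row
        elif classes & {"d01", "d02"}:
            rows.append(texts)      # one runner per row
        # Anything else (e.g. the h01 race-title row) is skipped.
    return header, rows


# Shortened sample (assumption: same structure as the posted page).
sample = """
<table border=0 cellpadding=0 cellspacing=0 class="racetable">
<tr> <td class=h01 colspan="9"><H2>1st Alarm 5k</H2></td> </tr>
<tr> <td class=h11>Place</td> <td class=h12>Name</td> <td class=h11>Total Time</td> </tr>
<tr> <td class=d01>1</td> <td class=d02>Runner 1</td> <td class=d01> 18:43.93</td> </tr>
</table>
"""

header, rows = extract_runner_rows(sample)
print(header)
for row in rows:
    print(row)
```

For a few thousand files, the same function could be called in a loop over the saved pages, with each `rows` list passed to `csv.writer(...).writerows(...)` to produce the CSV for the database import.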
The printed result looks like this:
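As an illustration only (not the original poster's exact output), the self-contained sketch below prints each runner row from the first two data rows of the sample as a list of cell strings. The trimmed `html` string and the `printed` list are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# First two data rows copied from the sample table in the question.
html = """<table class="racetable">
<tr> <td class=d01>1</td> <td class=d02>Runner 1</td> <td class=d02>ANYTOWN PA</td> <td class=d01>390</td> <td class=d01>52</td> <td class=d01>M</td> <td class=d01>1:Overall</td> <td class=d01> 18:43.93</td> <td class=d01>6:03/M</td> </tr>
<tr> <td class=d01>2</td> <td class=d02>Runner 2</td> <td class=d02>ANYTOWN PA</td> <td class=d01>380</td> <td class=d01>33</td> <td class=d01>M</td> <td class=d01>1:19-39</td> <td class=d01> 19:31.27</td> <td class=d01>6:18/M</td> </tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

printed = []
for tr in soup.find_all("tr"):
    # strip=True removes the leading space in cells like " 18:43.93".
    row = [td.get_text(strip=True) for td in tr.find_all("td")]
    printed.append(row)
    print(row)

# ['1', 'Runner 1', 'ANYTOWN PA', '390', '52', 'M', '1:Overall', '18:43.93', '6:03/M']
# ['2', 'Runner 2', 'ANYTOWN PA', '380', '33', 'M', '1:19-39', '19:31.27', '6:18/M']
```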