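While scraping data from a website, I found that the older pages use a table format different from the current one. That in itself is not a big problem, but the layout of the old-format table is irregular. Because the site requires an account to log in, I won't provide the URL.
Details:
New format:
Old format:
In the old-format table, the header row and the first data row each have 6 td tags, while the remaining rows have only 5. The first data row has the extra cell because its first td carries rowspan="7" and covers the rows below it.
Neither the tr tags nor the td tags have any distinctive attributes that could be used to tell them apart.
My current approach:
New format:
Read the header row and build an ordered list of column names, e.g. ['厂家', '备注', '单位', '变化'].
Then turn each data row, in the same order, into a dictionary, e.g. {'厂家': 'ABC', '备注': 'ABC'}.
Then insert it into my database.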
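The new-format steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the cell texts of each row have already been extracted into plain lists; `rows_to_records` is a hypothetical helper name, and the sample cell values are placeholders, not real data from the site.

```python
def rows_to_records(header, data_rows):
    """Pair each data row with the header names, in order."""
    records = []
    for cells in data_rows:
        # zip stops at the shorter sequence, so a short row yields a
        # record with fewer keys instead of raising an error.
        records.append(dict(zip(header, cells)))
    return records


# Column names read from the header row (taken from the example above).
header = ['厂家', '备注', '单位', '变化']
# Placeholder cell values, for illustration only.
data_rows = [['ABC', 'ABC', 'kg', '-']]

records = rows_to_records(header, data_rows)
```

Each dictionary in `records` can then be passed to whatever database insert routine is in use.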
Old format:
The method is similar, except that for every row I have to check the number of cells to decide which part to read.
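The cell-count check described above might look roughly like this. It is a sketch under two assumptions: the only cell that spans several rows is the first column, and the cell texts have already been extracted into lists. `old_rows_to_records` is a hypothetical name; the sample values are copied from the old-format table posted below.

```python
def old_rows_to_records(header, rows):
    """Build one dict per row, reusing the first-column value for short rows."""
    records = []
    current_product = None
    for cells in rows:
        if len(cells) == len(header):
            # Full row: the first cell carries the (row-spanning) product name.
            current_product = cells[0]
            values = cells
        else:
            # Short row: the product cell is missing, so reuse the last one seen.
            values = [current_product] + cells
        records.append(dict(zip(header, values)))
    return records


header = ['产品', '厂家', '元/公斤', '涨/跌', '产地', '备注']
rows = [
    ['进口原生多晶硅', 'WackerChemie', '480', '-', '德国', ''],  # 6 cells
    ['Hemlock Semiconductor', '450', '-', '美国', ''],           # 5 cells
]

records = old_rows_to_records(header, rows)
```

This works for the table shown here, but it breaks as soon as a cell in another column also spans rows, which is why tracking the rowspan attribute itself is more robust than counting cells.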
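My question:
Is there a better way to read the table data according to its structure, one that can even handle a layout like the old-format table?
Old-format table: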
<tr style="height: 16.55pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: windowtext 1.5pt double; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 9.66%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="9%">
<p style="text-align: center; layout-grid-mode: char" align="center">产品</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p style="text-align: left; layout-grid-mode: char" align="left">厂家</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p style="text-align: center; layout-grid-mode: char" align="center">元<span>/公斤</span></p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">涨<span>/跌</span></p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">产地</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 16.55pt; border-top: windowtext 1.5pt double; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center">备注</p>
</td>
</tr>
<tr style="height: 9.3pt">
<td rowspan="7" style="border-bottom: windowtext 1.5pt double; border-left: windowtext 1.5pt double; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 9.66%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="9%">
<p style="text-align: center; layout-grid-mode: char" align="center">进</p>
<p style="text-align: center; layout-grid-mode: char" align="center">口</p>
<p style="text-align: center; layout-grid-mode: char" align="center">原</p>
<p style="text-align: center; layout-grid-mode: char" align="center">生</p>
<p style="text-align: center; layout-grid-mode: char" align="center">多</p>
<p style="text-align: center; layout-grid-mode: char" align="center">晶</p>
<p style="text-align: center; layout-grid-mode: char" align="center">硅</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left">WackerChemie</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">480</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">德国</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
<tr style="height: 9.3pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left">Hemlock Semiconductor</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">450</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">美国</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
<tr style="height: 9.3pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left">Tokuyama Corporation</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">460</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">日本</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
<tr style="height: 9.3pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left">MEMC Electronic Materials,Inc</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">450</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">美国</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
<tr style="height: 9.3pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left">MitsubishiPolysilicon</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">460</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">日本</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
<tr style="height: 9.3pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left">REC Group</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">430</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">美国</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 9.3pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
<tr style="height: 14.55pt">
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 45.2%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="45%">
<p align="left"><span style="color: black">DC Chemical</span></p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 13.4%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="13%">
<p align="center">430</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%" valign="top">
<p style="text-align: center; layout-grid-mode: char" align="center">-</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 11.7%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="11%">
<p style="text-align: left; layout-grid-mode: char" align="left">韩国</p>
</td>
<td style="border-bottom: windowtext 1.5pt double; border-left: #f0f0f0; padding-bottom: 0cm; background-color: transparent; padding-left: 5.4pt; width: 10.02%; padding-right: 5.4pt; height: 14.55pt; border-top: #f0f0f0; border-right: windowtext 1.5pt double; padding-top: 0cm" width="10%">
<p style="text-align: center; text-indent: 5.25pt; layout-grid-mode: char" align="center"> </p>
</td>
</tr>
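I wanted to post the new-format table as well, but it contains so many attributes that it exceeds the length limit.
It would be easier if you provided the HTML source code.
The HTML source is incomplete. Add a <table> tag around it, then paste it directly into Excel and it becomes a table.
It's easy if you know Excel.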
Python 3
Result:
Learn pandas:
read_html(url, match='text that appears in the table you want')
This fetches the data of the table you want directly. Very nice.
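To illustrate what a rowspan-aware reader has to do with the old-format table, here is a minimal sketch using only Python's standard library. `TableGrid` is a hypothetical class name, not part of any library. It copies each rowspan cell down into the rows it covers, and it assumes there is no colspan and no nested table, which matches the markup posted in the question. The sample HTML is a shortened stand-in for that markup.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Collect <td>/<th> text into rows, repeating rowspan cells downward."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell texts
        self._row = None     # cells of the row being read
        self._cell = None    # text fragments of the cell being read
        self._span = 1       # rowspan of the cell being read
        self._pending = {}   # column index -> [text, rows still to fill]

    def _fill_pending(self):
        # Copy cells carried over from earlier rows into the current row.
        while len(self._row) in self._pending:
            col = len(self._row)
            text, left = self._pending[col]
            self._row.append(text)
            if left > 1:
                self._pending[col][1] = left - 1
            else:
                del self._pending[col]

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._fill_pending()
            self._cell = []
            self._span = int(dict(attrs).get('rowspan') or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            text = ''.join(self._cell)
            col = len(self._row)
            self._row.append(text)
            if self._span > 1:
                # Remember this cell for the following rows it spans.
                self._pending[col] = [text, self._span - 1]
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self._fill_pending()
            self.rows.append(self._row)
            self._row = None


html = """
<table>
<tr><td><p>产品</p></td><td><p>厂家</p></td><td><p>元<span>/公斤</span></p></td></tr>
<tr><td rowspan="2"><p>进</p><p>口</p></td><td><p>WackerChemie</p></td><td><p>480</p></td></tr>
<tr><td><p>Hemlock Semiconductor</p></td><td><p>450</p></td></tr>
</table>
"""

parser = TableGrid()
parser.feed(html)
parser.close()

header, *data = parser.rows
records = [dict(zip(header, row)) for row in data]
```

After expansion every row has the same number of cells, so the old and new formats can go through the same header-to-dictionary step. As far as I know, recent pandas versions perform a comparable rowspan expansion inside read_html, so this hand-written version is mainly useful when pandas is not an option.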