Python crawler - Extracting web page information with a Python crawler

Question

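The URL to crawl is http://www.xici.net.co/nn/1. Above is the HTML content of the page. I need to get the IP address, port number, location, whether the proxy is high-anonymity, and the two times. Below is the Python I wrote, but it only does part of the job. Could anyone give me some pointers? Thanks. {code...} The result looks something like the following...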

高洛峰 · Answer

The following code solves the problem. Thank you all for your answers.

import requests
from bs4 import BeautifulSoup


def getInfo(url):
    """Return one list of field strings per proxy row in the page's first table."""
    proxy_info = []
    page_code = requests.get(url).text
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(page_code, 'html.parser')
    table_soup = soup.find('table')
    # Skip the header row.
    proxy_list = table_soup.find_all('tr')[1:]
    for tr in proxy_list:
        td_list = tr.find_all('td')
        ip = td_list[2].string
        port = td_list[3].string
        # The location is sometimes wrapped in a link.
        location = td_list[4].string or td_list[4].find('a').string
        anonymity = td_list[5].string
        proxy_type = td_list[6].string
        # Speed and connect time are stored in the title attribute of the bar element.
        speed = td_list[7].find('p', {'class': 'bar'})['title']
        connect_time = td_list[8].find('p', {'class': 'bar'})['title']
        validate_time = td_list[9].string

        # Strip surrounding whitespace; empty (None) fields are left as they are.
        row = [ip, port, location, anonymity, proxy_type, speed, connect_time, validate_time]
        row = [field.strip() if field else field for field in row]
        proxy_info.append(row)

    return proxy_info


if __name__ == '__main__':
    url = 'http://www.xici.net.co/nn/1'
    proxy_info = getInfo(url)
    for row in proxy_info:
        # Print the fields of one proxy on a single line, separated by spaces.
        print(*row)

大家讲道理 · Answer

Use XPath to locate the elements, with lxml for parsing.
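A minimal sketch of the XPath approach, assuming lxml is installed. The sample HTML is made up for illustration and only imitates the column order implied by the BeautifulSoup answer above (IP in the third cell, port in the fourth, and so on); the real page may be laid out differently. The function name `parse_rows` is hypothetical.

```python
from lxml import html

# Made-up sample that only imitates the assumed column layout of the proxy table.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th></th><th>Country</th><th>IP</th><th>Port</th><th>Location</th>
      <th>Anonymity</th><th>Type</th><th>Speed</th><th>Connect</th><th>Verified</th></tr>
  <tr>
    <td></td><td>CN</td><td>10.0.0.1</td><td>8080</td>
    <td><a href="#">Beijing</a></td><td>High</td><td>HTTP</td>
    <td><div class="bar" title="0.5s"></div></td>
    <td><div class="bar" title="0.1s"></div></td>
    <td>14-01-01 12:00</td>
  </tr>
</table>
</body></html>
"""


def parse_rows(page_source):
    """Return one dict per table row that has <td> cells."""
    tree = html.fromstring(page_source)
    rows = []
    # tr[td] keeps only rows with <td> cells, which skips the <th> header row.
    for tr in tree.xpath('//table//tr[td]'):
        cells = tr.xpath('./td')
        rows.append({
            'ip': cells[2].text_content().strip(),
            'port': cells[3].text_content().strip(),
            # text_content() also picks up text nested inside an <a> tag.
            'location': cells[4].text_content().strip(),
            'anonymity': cells[5].text_content().strip(),
            'proxy_type': cells[6].text_content().strip(),
            # string(...) yields the first matching title attribute, or '' if none.
            'speed': cells[7].xpath('string(.//*[@class="bar"]/@title)'),
            'connect_time': cells[8].xpath('string(.//*[@class="bar"]/@title)'),
            'validate_time': cells[9].text_content().strip(),
        })
    return rows


if __name__ == '__main__':
    for row in parse_rows(SAMPLE_HTML):
        print(row)
```

To run this against the live page, the text returned by `requests.get(url).text` could be passed to `parse_rows` in place of `SAMPLE_HTML`, provided the page really has this column layout.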

伊谢尔伦 · Answer

It looks like there may be something wrong with your regular expression.

First look at the document structure:

Each <tr>...</tr> tag contains one complete row of content, while each <td>...</td> tag contains a single item.

It is recommended to use regular expressions to parse each <td> tag, starting from the <tr> tag.

Probably something like this: r'(.*?(.*?).......)'

Here (.*?) is the parsed IP address, and the later fields are handled the same way.

It is a little troublesome to write, but it should work.

In fact, it would be much easier to use BeautifulSoup.
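The row-then-cell idea can be sketched with the standard `re` module. This is not the pattern from the answer, whose tags did not survive; it is one possible way to do it, assuming rows are wrapped in `<tr>` tags and cells in `<td>` tags. The sample HTML, the column positions, and the name `parse_with_regex` are all assumptions made for illustration.

```python
import re

# Made-up sample that only imitates the assumed column layout of the proxy table.
SAMPLE_HTML = """
<table>
  <tr><th></th><th>Country</th><th>IP</th><th>Port</th><th>Location</th>
      <th>Anonymity</th><th>Type</th><th>Speed</th><th>Connect</th><th>Verified</th></tr>
  <tr class="odd">
    <td></td><td>CN</td><td>10.0.0.1</td><td>8080</td>
    <td><a href="#">Beijing</a></td><td>High</td><td>HTTP</td>
    <td><div class="bar" title="0.5s"></div></td>
    <td><div class="bar" title="0.1s"></div></td>
    <td>14-01-01 12:00</td>
  </tr>
</table>
"""

# re.S lets "." match newlines, because rows and cells span several lines.
ROW_RE = re.compile(r'<tr\b[^>]*>(.*?)</tr>', re.S | re.I)
CELL_RE = re.compile(r'<td\b[^>]*>(.*?)</td>', re.S | re.I)
TAG_RE = re.compile(r'<[^>]+>')            # any remaining tag inside a cell
TITLE_RE = re.compile(r'title="([^"]*)"')  # value of a title attribute


def parse_with_regex(page_source):
    """Return one dict per row that has at least ten <td> cells."""
    rows = []
    for row_html in ROW_RE.findall(page_source):
        raw_cells = CELL_RE.findall(row_html)
        if len(raw_cells) < 10:
            # The header row uses <th>, so it yields no <td> cells and is skipped.
            continue
        # Drop nested tags such as <a> so that only the visible text remains.
        text = [TAG_RE.sub('', cell).strip() for cell in raw_cells]
        speed = TITLE_RE.search(raw_cells[7])
        connect = TITLE_RE.search(raw_cells[8])
        rows.append({
            'ip': text[2],
            'port': text[3],
            'location': text[4],
            'anonymity': text[5],
            'proxy_type': text[6],
            'speed': speed.group(1) if speed else None,
            'connect_time': connect.group(1) if connect else None,
            'validate_time': text[9],
        })
    return rows


if __name__ == '__main__':
    for row in parse_with_regex(SAMPLE_HTML):
        print(row)
```

Patterns like these break easily when the markup changes (nested tables, different attribute quoting), which is one reason an HTML parser is usually the more comfortable choice.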

大家讲道理 · Answer

Using re to process HTML is tedious; use XPath instead.

大家讲道理 · Answer

I recommend using BeautifulSoup.

大家讲道理 · Answer

BeautifulSoup is a good choice. Writing the regular expressions yourself is not elegant enough.

PHPz · Answer

......scrapy

迷茫 · Answer

scrapy...
