Python crawler - extracting information from a web page
黄舟 2017-04-17 15:48:25


The URL being crawled is: http://www.xici.net.co/nn/1
The page above is plain HTML. From it I need to extract the IP address, the port, the location, whether the proxy is high-anonymity, and the two timestamps.

Below is the Python I wrote, but it only extracts part of the data. Could anyone point out what is wrong?
Thanks.

import re
import urllib

a = raw_input('input url:')

s = urllib.urlopen(a)
s1 = s.read()


def getinfo(aaa):
    #reg = re.compile(r'(?<![\.\d])(?:\d{1,3}\.){3}\d{1,3}(?![\.\d])')
    #reg = re.compile(r'<td>(\d+)\.(\d+)\.(\d+)\.(\d+)</td>\s*<td>(\d+)</td>\s*<td>([/u4e00-/u9fa5]+)</td>')
    reg = re.compile(r'<td>(\w+)</td>\s*<td>([\u4e00-\u9fa5]+)</td>')
    l = re.findall(reg, aaa)
    print l
getinfo(s1)

The result should look something like the following (it does not have to be a table):

|ip|port|location|anonymity|type|speed|connect time|verify time|
|-|-|-|-|-|-|-|-|
|122.89.9.70|80|Taiwan|high-anonymity|HTTP|1.27s|0.325s|15-08-28 16:30|
|123.69.48.45|8080|Nanjing, Jiangsu|high-anonymity|HTTPS|1.07s|0.5s|15-08-28 17:30|

All replies (8)
小葫芦

The following code solves the problem. Thanks, everyone, for your answers.

import requests
from bs4 import BeautifulSoup


def getInfo(url):
    proxy_info = []
    page_code = requests.get(url).text
    soup = BeautifulSoup(page_code, 'html.parser')
    table_soup = soup.find('table')
    proxy_list = table_soup.find_all('tr')[1:]  # skip the header row
    for tr in proxy_list:
        td_list = tr.find_all('td')
        ip = td_list[2].string
        port = td_list[3].string
        # the location cell sometimes wraps the name in an <a> tag
        location = td_list[4].string or td_list[4].find('a').string
        anonymity = td_list[5].string
        proxy_type = td_list[6].string
        speed = td_list[7].find('p', {'class': 'bar'})['title']
        connect_time = td_list[8].find('p', {'class': 'bar'})['title']
        validate_time = td_list[9].string

        # strip surrounding whitespace from every non-empty field
        row = [ip, port, location, anonymity, proxy_type,
               speed, connect_time, validate_time]
        row = [s.strip() if s else s for s in row]
        proxy_info.append(row)

    return proxy_info


if __name__ == '__main__':
    url = 'http://www.xici.net.co/nn/1'
    for row in getInfo(url):
        print(' '.join(str(s) for s in row))
大家讲道理

Use XPath to find them, with lxml for parsing.
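A minimal sketch of the XPath approach suggested above, assuming lxml is installed and that the table keeps the column layout shown in the question; the `SAMPLE` fragment and the `ip_list` table id are stand-ins for the real page, which may differ:

```python
from lxml import html

# Hand-written fragment in the same shape as the xici proxy table.
SAMPLE = """
<table id="ip_list">
  <tr><th>country</th><th>ip</th><th>port</th><th>location</th></tr>
  <tr><td></td><td>122.89.9.70</td><td>80</td><td>Taiwan</td></tr>
  <tr><td></td><td>123.69.48.45</td><td>8080</td><td>Nanjing</td></tr>
</table>
"""

def extract_rows(page_source):
    tree = html.fromstring(page_source)
    rows = []
    # take every row of the table, then drop the header row
    for tr in tree.xpath('//table[@id="ip_list"]//tr')[1:]:
        cells = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(cells)
    return rows

print(extract_rows(SAMPLE))
```

On the real page you would replace `SAMPLE` with `requests.get(url).text`; slicing off the header in Python (rather than with `position()`) keeps the expression robust if the parser inserts a `<tbody>`.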

伊谢尔伦

It feels like there may be something wrong with the regular expression.

First look at the document structure: each <tr>...</tr> tag contains one complete row, while each <td>...</td> tag contains a single field.

I suggest writing the regular expression so that it starts at the <tr> tag and then parses each <td> tag in turn.

Roughly like this: r'(<tr class.*?>.*?<td.*?<td.*?<td>(.*?)</td>.......</tr>)'

Here the (.*?) captures the IP address, and the later fields work the same way. It is a little tedious to write, but it should not go wrong. In practice, though, BeautifulSoup makes this much easier.
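The row-by-row pattern described above can be sketched as follows. This is only an illustration of the technique: the `SAMPLE` string is a hand-written fragment in the same shape as the xici table, and the exact attributes of the real page may differ:

```python
import re

SAMPLE = (
    '<tr class="odd">'
    '<td><img src="tw.png"/></td>'
    '<td>122.89.9.70</td><td>80</td><td>Taiwan</td>'
    '<td>high-anonymity</td><td>HTTP</td>'
    '</tr>'
)

# One <td>(...)</td> capture group per field we want to keep; re.S lets
# .*? skip across newlines between cells on the real page.
row_re = re.compile(
    r'<tr class.*?>.*?'
    r'<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*'  # IP address
    r'<td>(\d+)</td>\s*'                      # port
    r'<td>(.*?)</td>\s*'                      # location
    r'<td>(.*?)</td>\s*'                      # anonymity
    r'<td>(.*?)</td>',                        # type
    re.S,
)

for ip, port, location, anonymity, ptype in row_re.findall(SAMPLE):
    print(ip, port, location, anonymity, ptype)
```

Note that the IP group anchors the match: the leading lazy `.*?` skips the flag-image cell because `<td><img .../></td>` cannot match `<td>(\d{1,3}...)</td>`.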
大家讲道理

Parsing HTML with re is tedious; use XPath instead.

大家讲道理

I recommend BeautifulSoup.

大家讲道理

BeautifulSoup is a good choice; hand-rolled regular expressions are not elegant enough for this.

PHPzhong

......scrapy

迷茫

scrapy...
