Python beginner wants to build a simple crawler, looking for a tutorial
PHP中文网 2017-04-17 14:27:26

Python beginner here, I want to build a simple crawler. Any tutorial recommendations? PS: what language do companies generally use for crawling and scraping?


All replies (21)
Ty80
  • Fetching the content: usually an HTTP request; requests +1
  • Processing the fetched page: string processing to extract the information you want; beautifulsoup, regular expressions, and str.find() all work

For ordinary web pages, the two points above are enough. For sites that load content via AJAX requests, you may not be able to scrape what you want from the HTML; finding the site's API may be more convenient.
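To make the two steps concrete, here is a minimal Python 3 sketch. The HTML is a hard-coded sample so it runs without a network request, and a regular expression stands in for beautifulsoup as the string-processing step:

```python
# Step 1 of a crawler would be the HTTP request (urllib.request or the
# third-party requests library); step 2 is string processing. Here the
# page is a hard-coded sample so the sketch runs offline.
import re

html_doc = """
<html><head><title>Example Page</title></head>
<body><ul>
  <li class="list-item" data-title="Movie A">Movie A</li>
  <li class="list-item" data-title="Movie B">Movie B</li>
</ul></body></html>
"""

# Extract every data-title attribute value from the sample page.
titles = re.findall(r'data-title="([^"]+)"', html_doc)
print(titles)  # ['Movie A', 'Movie B']
```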

小葫芦

A tutorial I put together back when I was learning:

Python crawler tutorial

小葫芦

Here's a scraping script the OP can use directly. It fetches the Douban ID and title of each movie currently showing on Douban. The script depends on the beautifulsoup library, which needs to be installed: Beautifulsoup Chinese documentation

Supplement: if the OP wants to build a real crawler that can crawl a whole site, or crawl specified pages on demand, I recommend studying scrapy.

Python sample code:

#!/usr/bin/env python
#coding:UTF-8

import urllib2
import traceback

from bs4 import BeautifulSoup

def fetchNowPlayingDouBanInfo():
    doubaninfolist = []

    try:
        # Uncomment the lines below to use a proxy
#         proxy_handler = urllib2.ProxyHandler({"http" : '172.23.155.73:8080'})
#         opener = urllib2.build_opener(proxy_handler)
#         urllib2.install_opener(opener)      

        url = "http://movie.douban.com/nowplaying/beijing/"

        # Set the HTTP User-Agent header
        useragent = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.57 Safari/537.36'}
        req = urllib2.Request(url, headers=useragent)    

        page = urllib2.urlopen(req, timeout=10)
        html_doc = page.read()

        soup = BeautifulSoup(html_doc, "lxml")

        try:
            # the "nowplaying" container is a <div> on the live Douban page
            nowplaying_ul = soup.find("div", id="nowplaying").find("ul", class_="lists")

            lilist = nowplaying_ul.find_all("li", class_="list-item")
            for li in lilist:
                doubanid = li["id"]
                title = li["data-title"]

                doubaninfolist.append({"douban_id" : doubanid, "title" : title, "coverinfolist" : [] })

        except TypeError, e:
            print('(%s)TypeError: %s.!' % (url, traceback.format_exc()))
        except Exception:
            print('(%s)generic exception: %s.' % (url, traceback.format_exc()))

    except urllib2.HTTPError, e:
        print('(%s)http request error code - %s.' % (url, e.code))
    except urllib2.URLError, e:
        print('(%s)http request error reason - %s.' % (url, e.reason))
    except Exception:
        print('(%s)http request generic exception: %s.' % (url, traceback.format_exc()))

    return doubaninfolist

if __name__ == "__main__":
    doubaninfolist = fetchNowPlayingDouBanInfo()
    print doubaninfolist

巴扎黑

For simple jobs that don't need a framework, look at the requests and beautifulsoup libraries. If you're comfortable with Python syntax, after reading those two you can pretty much write a simple crawler.


As for companies that use crawlers: the ones I have seen mostly use Java or Python.
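As a Python 3 counterpart to the urllib2 header setup in the Douban script above, here is how a request with a browser User-Agent is built using only the standard library. The URL is a placeholder and the request is never actually sent:

```python
# Build (but don't send) an HTTP request with a custom User-Agent,
# mirroring what the Python 2 script above does with urllib2.Request.
# The URL is a placeholder; urlopen(req) would actually send it.
from urllib.request import Request

url = "http://example.com/"
useragent = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36"}
req = Request(url, headers=useragent)

# urllib normalizes header names to capitalized form internally.
print(req.get_header("User-agent"))
```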

大家讲道理

Search Baidu for "python crawler"

小葫芦

For a simple crawler, use the simplest practical framework; have a look at the introductory posts online.
I recommend scrapy.

Ty80

There are indeed many articles online about writing a simple crawler in Python, but most of them are only toy examples; very few can actually be applied in practice. I think a crawler boils down to fetching content, parsing the content, and then storing it. If you're new, just Google it; if you want to go deeper, look for code on GitHub and study it.

I only know a little Python; I hope this helps.
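The fetch, parse, store pipeline described above can be sketched with only the standard library. Here the "fetched" page is a hard-coded string and the database is an in-memory sqlite3, so the sketch runs offline; the function and table names are made up for illustration:

```python
# Sketch of the three crawler stages: fetch -> parse -> store.
import re
import sqlite3

def fetch():
    # In a real crawler this would be an HTTP request.
    return '<li data-title="Movie A"></li><li data-title="Movie B"></li>'

def parse(html):
    # String processing: pull out the data-title attributes.
    return [{"title": t} for t in re.findall(r'data-title="([^"]+)"', html)]

def store(conn, items):
    # Persist the parsed records; here into an in-memory database.
    conn.execute("CREATE TABLE IF NOT EXISTS movies (title TEXT)")
    conn.executemany("INSERT INTO movies VALUES (?)",
                     [(i["title"],) for i in items])

conn = sqlite3.connect(":memory:")
store(conn, parse(fetch()))
print(conn.execute("SELECT title FROM movies").fetchall())
# [('Movie A',), ('Movie B',)]
```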

刘奇

You can take a look at my scrapy information

Peter_Zhu

Scrapy saves you a lot of time
There are many examples on github

迷茫

Here's a script that scrapes Tmall:

# This is a method from a larger class; writelog, writeInfoLog,
# insertSqlite and the self.* attributes are defined elsewhere.
def areaFlow(self, parturl, tablename, date):
    while True:
        url = parturl + self.lzSession + '&days=' + str(date) + '..' + str(date)
        print url
        try:
            html = urllib2.urlopen(url, timeout=30)
        except Exception, ex:
            writelog(str(ex))
            writelog(str(traceback.format_exc()))
            break
        responegbk = html.read()
        try:
            respone = responegbk.encode('utf8')
        except Exception, ex:
            writelog(str(ex))
        # If lzSession has expired, the server returns errcode:500
        if respone.find('"errcode":500') != -1:
            print 'nodata'
            break
        # If the date is wrong, the server returns errcode:100
        elif respone.find('"errcode":100') != -1:
            print 'login error'
            self.catchLzsession()
        else:
            try:
                resstr = re.findall(r'(?<=\<)(.*?)(?=\/>)', respone, re.S)
                writelog('地域名称    浏览量    访问量')  # region name / page views / visitors
                dictitems = []
                for iarea in resstr:
                    items = {}
                    areaname = re.findall(r'(?<=name=\\")(.*?)(?=\\")', iarea, re.S)
                    flowamount = re.findall(r'(?<=浏览量:)(.*?)(?=&lt)', iarea, re.S)
                    visitoramount = re.findall(r'(?<=访客数:)(.*?)(?=\\")', iarea, re.S)
                    print '%s %s %s' % (areaname[0], flowamount[0], visitoramount[0])
                    items['l_date'] = str(self.nowDate)
                    items['vc_area_name'] = str(areaname[0])
                    items['i_flow_amount'] = str(flowamount[0].replace(',', ''))
                    items['i_visitor_amount'] = str(visitoramount[0].replace(',', ''))
                    items['l_catch_datetime'] = str(self.nowTime)
                    dictitems.append(items)
                writeInfoLog(dictitems)
                insertSqlite(self.sqlite, tablename, dictitems)
                break
            except Exception, ex:
                writelog(str(ex))
                writelog(str(traceback.format_exc()))
        time.sleep(1)
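The extraction in that script relies on Python's fixed-width lookbehind `(?<=...)` and lookahead `(?=...)` regex syntax. Here is a minimal Python 3 illustration against a hand-written fragment shaped like the Tmall response (the place name and numbers are made up):

```python
# Demo of the (?<=...)(...)(?=...) extraction style used above, run
# against an invented fragment in the same escaped-quote format.
import re

fragment = r'name=\"Beijing\" 浏览量:1,234&lt;'
areaname = re.findall(r'(?<=name=\\")(.*?)(?=\\")', fragment)
flowamount = re.findall(r'(?<=浏览量:)(.*?)(?=&lt)', fragment)

print(areaname, flowamount)  # ['Beijing'] ['1,234']
```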