Home Backend Development Python Tutorial python正则表达式抓取成语网站

python正则表达式抓取成语网站

Jun 16, 2016 am 08:46 AM
python regular expression

1、首先找到一个在线成语网站

2、查看网页结构,定义正则式

看一下要抓的成语的标签有什么特点,查看源码,可以发现要抓的成语都在标签中,如:安如磐石,成语事实上就是一个瞄文本,不同成语指向的链接不同,其实也就"/cy0/93.html"中的数字不同,所以正则式里匹配两次数字就行了,定义正则式 reg =   "(.*?)"。
3、上代码吧

复制代码 代码如下:

#anthor jiqunpeng
#time 20121124
import urllib
import re

def getHtml(url): #从URL中读取html内容
    page = urllib.urlopen(url)
    html = page.read()
    page.close()
    return html

def getDictionary(html): #匹配成语
    reg = "(.*?)"  
    dicList = re.compile(reg).findall(html)
    return dicList

def getItemSite():#手工把每个字母开头的页面数统计下来
    itemSite = {}#申明为空字典
    itemSite["A"] = 3
    itemSite["B"] = 21
    itemSite["C"] = 19
    itemSite["D"] = 18
    itemSite["E"] = 2
    itemSite["F"] = 14
    itemSite["G"] = 13
    itemSite["H"] = 15
    itemSite["J"] = 23
    itemSite["K"] = 6
    itemSite["L"] = 15
    itemSite["M"] = 12
    itemSite["N"] = 5
    itemSite["O"] = 1
    itemSite["P"] = 6
    itemSite["Q"] = 16
    itemSite["R"] = 8
    itemSite["S"] = 26
    itemSite["T"] = 12
    itemSite["W"] = 13
    itemSite["X"] = 16
    itemSite["Y"] = 35
    itemSite["A"] = 21
    return itemSite
   

if __name__== "__main__":
    dicFile = open("dic.txt","w+")#保存成语的文件
    domainsite = "http://chengyu.itlearner.com/list/"
    itemSite = getItemSite()
    for key,values in itemSite.items():
        for index in range(1,values+1):
            site = key +"_"+str(index)+".html"             
            dictionary = getDictionary(getHtml(domainsite+site))
            for dic in dictionary:
                dicFile.write(dic[2]+"@@CY\n")#标记为成语,分词时使用
        print key+'字母成语抓取完毕'       
    dicFile.close()   
    print '全部成语抓取完毕'

把成语保存在了txt文本中,还添加了一个后缀标签。
最后注意,设计正则表达式时可能会出现明明认为是正确的,就是匹配不了,对空白字符要留意,比如说要解析:

复制代码 代码如下:

                kkun

           


你看不出第一行与第二行的空白字符是什么,可以index = html.find('avatar_name'),html[4677:4677+100]看到非空白字符。

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
2 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Repo: How To Revive Teammates
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
Hello Kitty Island Adventure: How To Get Giant Seeds
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Google AI announces Gemini 1.5 Pro and Gemma 2 for developers Google AI announces Gemini 1.5 Pro and Gemma 2 for developers Jul 01, 2024 am 07:22 AM

Google AI has started to provide developers with access to extended context windows and cost-saving features, starting with the Gemini 1.5 Pro large language model (LLM). Previously available through a waitlist, the full 2 million token context windo

How to download deepseek Xiaomi How to download deepseek Xiaomi Feb 19, 2025 pm 05:27 PM

How to download DeepSeek Xiaomi? Search for "DeepSeek" in the Xiaomi App Store. If it is not found, continue to step 2. Identify your needs (search files, data analysis), and find the corresponding tools (such as file managers, data analysis software) that include DeepSeek functions.

How do you ask him deepseek How do you ask him deepseek Feb 19, 2025 pm 04:42 PM

The key to using DeepSeek effectively is to ask questions clearly: express the questions directly and specifically. Provide specific details and background information. For complex inquiries, multiple angles and refute opinions are included. Focus on specific aspects, such as performance bottlenecks in code. Keep a critical thinking about the answers you get and make judgments based on your expertise.

How to search deepseek How to search deepseek Feb 19, 2025 pm 05:18 PM

Just use the search function that comes with DeepSeek. Its powerful semantic analysis algorithm can accurately understand the search intention and provide relevant information. However, for searches that are unpopular, latest information or problems that need to be considered, it is necessary to adjust keywords or use more specific descriptions, combine them with other real-time information sources, and understand that DeepSeek is just a tool that requires active, clear and refined search strategies.

How to program deepseek How to program deepseek Feb 19, 2025 pm 05:36 PM

DeepSeek is not a programming language, but a deep search concept. Implementing DeepSeek requires selection based on existing languages. For different application scenarios, it is necessary to choose the appropriate language and algorithms, and combine machine learning technology. Code quality, maintainability, and testing are crucial. Only by choosing the right programming language, algorithms and tools according to your needs and writing high-quality code can DeepSeek be successfully implemented.

How to use deepseek to settle accounts How to use deepseek to settle accounts Feb 19, 2025 pm 04:36 PM

Question: Is DeepSeek available for accounting? Answer: No, it is a data mining and analysis tool that can be used to analyze financial data, but it does not have the accounting record and report generation functions of accounting software. Using DeepSeek to analyze financial data requires writing code to process data with knowledge of data structures, algorithms, and DeepSeek APIs to consider potential problems (e.g. programming knowledge, learning curves, data quality)

The Key to Coding: Unlocking the Power of Python for Beginners The Key to Coding: Unlocking the Power of Python for Beginners Oct 11, 2024 pm 12:17 PM

Python is an ideal programming introduction language for beginners through its ease of learning and powerful features. Its basics include: Variables: used to store data (numbers, strings, lists, etc.). Data type: Defines the type of data in the variable (integer, floating point, etc.). Operators: used for mathematical operations and comparisons. Control flow: Control the flow of code execution (conditional statements, loops).

Problem-Solving with Python: Unlock Powerful Solutions as a Beginner Coder Problem-Solving with Python: Unlock Powerful Solutions as a Beginner Coder Oct 11, 2024 pm 08:58 PM

Pythonempowersbeginnersinproblem-solving.Itsuser-friendlysyntax,extensivelibrary,andfeaturessuchasvariables,conditionalstatements,andloopsenableefficientcodedevelopment.Frommanagingdatatocontrollingprogramflowandperformingrepetitivetasks,Pythonprovid

See all articles