
How to use regular expressions in Python to process html files

WBOY
Release: 2023-05-17 22:35:47


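The finditer method finds every match in a string. If you have used findall, you know that it returns a list of all the matching strings. finditer instead returns an iterator that yields one match object per match, in order of appearance. These match objects can be visited with a for loop, which is how group 1 of each match is printed in the code below.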
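To make the difference concrete, here is a minimal sketch that runs both methods over the same text. The sample line is made up for illustration; the pattern mirrors the testRE pattern used in the script further down.

```python
import re

# Hypothetical sample line (not taken from RGX_DATA.html).
line = "Logic programming in SICStus uses logic."
pattern = re.compile(r"(logic|sicstus)", re.I)

# findall returns plain strings: here, the text of group 1 for each match.
all_strings = pattern.findall(line)

# finditer yields match objects, so positions are available as well.
spans = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(all_strings)
print(spans)
```

A match object carries more than the matched text, which is why finditer is the more convenient choice when groups or offsets are needed.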

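You need to write Python regular expressions that recognize specific patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningful variable names), applies these REs to each line of the file, and prints out the matches found.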

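1. Write a pattern that recognizes HTML tags and print each one as "TAG: TAG string" (e.g. "TAG: b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try it out and work out why it is not a good solution. Then write a better solution that fixes the problem.

2. Modify the code so that it distinguishes opening tags from closing tags (e.g. p versus /p), printing OPENTAG and CLOSETAG.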
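As a rough illustration of both exercises, the sketch below compares the greedy pattern with two alternatives and then separates opening from closing tags. The sample line and the variable names are assumptions made for this example, not part of the exercise data.

```python
import re

# Hypothetical sample line with several tags on it.
line = "<p><b>bold</b> text</p>"

# Greedy: '.*' runs to the LAST '>' on the line, swallowing everything.
greedy = re.findall(r"<.*>", line)

# Non-greedy: '.*?' stops at the first '>' that completes a match.
lazy = re.findall(r"<.*?>", line)

# Negated character class: '[^>]+' can never cross a '>'.
negated = re.findall(r"<[^>]+>", line)

# Opening tags start with a word character, closing tags with '/'.
open_tags = re.findall(r"<(\w[^>]*)>", line)
close_tags = re.findall(r"</([^>]*)>", line)

print(greedy)
print(lazy)
print(negated)
print(open_tags, close_tags)
```

The greedy version returns the whole line as one "tag", which is the problem the exercise asks you to find.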

import re

#------------------------------

testRE = re.compile(r'(logic|sicstus)', re.I)
testI = re.compile(r'<[^>]+>')             # any tag; [^>] stops at the first '>'
testO = re.compile(r'<[^/](\S*?)[^>]*>')   # opening tag: first char is not '/'
testC = re.compile(r'</(\S*?)[^>]*>')      # closing tag: starts with '</'

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # finditer yields one match object for every match in the line
        mm = testRE.finditer(line)
        for m in mm:
            print('** TEST-RE:', m.group(1))

        index = testI.finditer(line)
        for i in index:
            print('Tag:', i.group().replace('<', '').replace('>', ''))

        open1 = testO.finditer(line)
        for m in open1:
            print('opening:', m.group().replace('<', '').replace('>', ''))

        close1 = testC.finditer(line)
        for n in close1:
            print('closing:', n.group().replace('<', '').replace('>', ''))

Note that some HTML tags have parameters, for example:

<table border=1 cellspacing=0 cellpadding=8>

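Once tags are found and printed correctly, make sure your patterns handle tags both with and without parameters. Now extend your code so that, for opening tags, it prints both the tag label and the parameters, e.g.: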

OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8

        open1 = testO.finditer(line)
        for m in open1:
            # Drop the angle brackets, then split on whitespace: the first
            # item is the tag label, the remaining items are parameters.
            firstm = m.group().replace('<', '').replace('>', '').split()
            for num, otherm in enumerate(firstm):
                if num == 0:
                    print('opening:', otherm)
                else:
                    print('param:', otherm)

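In regular expressions, a backreference indicates that the substring matched by an earlier part of the regex must appear again. It is written \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the exact string matched by the group (\w+) appears again at the position of the backreference \1. This would match the string "the the", for example, where "the" appears twice. Using a backreference, write a pattern that matches when a line contains a paired open and close tag, e.g. <b>in bold</b>.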
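A small sketch of both uses of \1 follows. The sample strings and the names doubled_word and tag_pair are invented for this illustration.

```python
import re

# A word followed by a space and the same word again.
doubled_word = re.compile(r"\b(\w+) \1\b")
found = doubled_word.search("Paris in the the spring")
repeated = found.group(1) if found else None

# An opening tag whose label (group 1) must reappear in the closing tag.
# '.*?' is non-greedy so two pairs on one line stay separate.
tag_pair = re.compile(r"<(\w+)\b[^>]*>(.*?)</\1\s*>")
line = 'some <b>bold</b> and <i class="x">italic</i> text'
pairs = [(p.group(1), p.group(2)) for p in tag_pair.finditer(line)]

print(repeated)
print(pairs)
```

Because the closing tag is written as </\1>, a mismatched pair such as <b>text</i> does not match.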

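Consider that we might want to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We are not going to do that here. Instead we consider a simpler case: deleting any HTML tags found in each line of the input data file.

If you have already defined an RE that recognizes HTML tags, you should be able to produce the text with the tags removed and print it on screen as STRIPPED.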
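One possible way to do this, sketched under the assumption that every tag opens and closes on the same line, is to substitute the empty string for each tag match with re.sub. The helper name strip_tags and the sample line are hypothetical.

```python
import re

# Matches an opening or closing tag, with optional parameters.
tag = re.compile(r"</?\w+\b[^>]*>")

def strip_tags(line):
    """Return the line with every tag match replaced by ''."""
    return tag.sub("", line)

# Hypothetical sample line.
line = '<p>Visit <a href="index.html">the <b>home</b> page</a>.</p>'
stripped = strip_tags(line)
print("STRIPPED:", stripped)
```

By default sub replaces every match, so no loop over the matches is needed.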

import sys, re

#------------------------------
# PART 1: 

# Key thing is to avoid matching strings that include
# multiple tags, e.g. treating '<p><b>' as a single
# tag. Can do this in several ways. Firstly, use
# non-greedy matching, so get shortest possible match
# including the two angle brackets:

tag = re.compile('</?(.*?)>')

# The above treats the '/' of a close tag as a separate
# optional component - so that this doesn't turn up as
# part of the match '.group(1)', which is meant to return
# the tag label.
# Following alternative solution uses a negated character
# class to explicitly prevent this including '>':

tag = re.compile('</?([^>]+)>')

# Finally, following version separates finding the tag
# label string from any (optional) parameters that might
# also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>')

# Note that use of '\b' (as word boundary anchor) here means
# we must mark the regex string as a 'raw' string (r'..').

#------------------------------
# PART 2: 

# Following closeTag definition requires the first char
# after the open angle bracket to be '/', while openTag
# definition excludes this by requiring first char to be
# a 'word char' (\w):

openTag  = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')

# Following revised definitions are more carefully stated
# for correct extraction of tag label (separately from
# any parameters):

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3: 

# Above openTag definition will already get the string
# encompassing any parameters, and return it as
# m.group(2), i.e. defn:

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')

# If we assume that parameters are continuous non-whitespace
# chars separated by whitespace chars, then we can divide
# them up using split - and that's how we handle them
# here. (In reality, parameter strings can be a lot more
# messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4: 

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

# Note use of non-greedy matching for the text falling
# *between* the open/close tag pair - to avoid false
# results where we have two similar tag pairs on same line.

#------------------------------
# PART 5: URLS

# This is quite tricky. The URL expressions in the file
# are of two kinds, of which the first is a string
# between double quotes ("..") which may include
# whitespace. For this case we might have a regex:

url = re.compile(r'href=("[^">]+")', re.I)

# The second case does not have quotes, and does not
# allow whitespace, consisting of a continuous sequence
# of non-whitespace material (that ends when you reach a
# space or close bracket '>'). This might be (as a raw
# string, because of the '\s'):

url = re.compile(r'href=([^">\s]+)', re.I)

# We can combine these two cases as follows, and still
# get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Note that I've done nothing here to exclude 'mailto:'
# links as being accepted as URLS.

#------------------------------

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)

        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))

        # PART 2,3: find open/close tags (+ params of open tags)

        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print('    PARAM:', param)

        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))

        # PART 4: find open/close tag pairs appearing on same line

        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))

        # PART 5: find URLs:

        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end='')

source:yisu.com