Regular expressions - Python regex: matching the content inside XML tags

Question

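Summary: A parser is general-purpose and handles well-formed XML; once parsing is done you can get information from any position in the XML document. Prefer it. Regex is targeted and handles XML that is not well-formed; when you know in advance where the information you need to match is located, try regex. An example is given in Update 3. I now have this...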

天蓬老师 · Answer

>>> import re
>>> str = "<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>"
>>> p3 = re.compile(r'(?<=<(?P<tag>a|b)>)(.*?)(?=</(?P=tag)>)')
>>> [m.group() for m in p3.finditer(str)]
['1', 'A', '2', 'B']
>>> p3.findall(str)
[('a', '1'), ('b', 'A'), ('a', '2'), ('b', 'B')]

高洛峰 · Answer

How many times have I said this... I'm tired of it...
XML has its own libraries: lxml, BS4.

Regular expressions should be used for the job they are suited to, instead of racking your brain to manipulate XML with them.
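As a minimal sketch of the parser route, assuming the input is well-formed: the example below uses the standard library's `xml.etree.ElementTree` as a stand-in for the lxml and BS4 libraries named above. The sample document, including the `root` wrapper element, is a hypothetical input made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: <a>/<b> elements wrapped in an assumed <root> element.
xml_text = "<root><a>1</a><b>A</b><a>2</a><b>B</b></root>"

# Parse the whole document, then walk the direct children of the root in order.
root = ET.fromstring(xml_text)
pairs = [(el.tag, el.text) for el in root if el.tag in ("a", "b")]

print(pairs)
```

The parser yields the tag name and text of each element without any pattern having to know what a closing tag looks like.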

ringa_lee · Answer

import re

str = '<a>1</a>...<b>A</b>...<a>2</a>...<b>B</b>'
p5 = re.compile(r'(?<=<[ab]>)(.*?)(?=</[ab]>)')
p5.findall(str)  # ['1', 'A', '2', 'B']

天蓬老师 · Answer

Supplement 3:

This is my direct answer to the question itself, split out from Supplement 2.

As for the matching problem itself, my suggestion is:

If A and B come in pairs, first check whether there are line breaks, parent tags, or similar markers that can be used to delimit each group. For example, it would be best to have a data source like this:

If not, you have to find another way. The central idea is still "try not to get fooled."
The main trap with <a> and <b> is that there may be consecutive <a> or consecutive <b> tags. For example, given ABABAAABAB, the first two of the three A's in the middle are best discarded.
So to be on the safe side, it is best not to do it all in one pass with something like <a>(?P<text_a>.*)</a>.*<b>(?P<text_b>.*)</b>.

The usage I recommend is: (?:<(?P<tag_a>a)>(?P<text_a>.*)</a>)|(?:<(?P<tag_b>b)>(?P<text_b>.*)</b>), which collects all the tags in one pass, whether they are A or B.

Then scan the result again, treating only an A immediately followed by a B as one set of valid data.

Note that all the code above was written off the top of my head. It has not been tested or even checked carefully, and is for reference only.
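The two-pass idea above (collect every tag first, then keep only adjacent A/B pairs) can be sketched as follows. This is an illustration under assumptions, not the answer's exact pattern: it uses a single pattern with a backreference and a non-greedy `.*?` instead of the alternation with greedy `.*`, and the group names `tag` and `value`, the helper `pair_adjacent`, and the sample string are all hypothetical.

```python
import re

# Assumed pattern: one named group for the tag, one for the text, with a
# backreference so <a> only closes with </a> and <b> only with </b>.
TAG_RE = re.compile(r"<(?P<tag>a|b)>(?P<value>.*?)</(?P=tag)>")


def pair_adjacent(text):
    """Pass 1: collect all (tag, value) tokens. Pass 2: keep only an <a>
    that is immediately followed by a <b>."""
    tokens = [(m.group("tag"), m.group("value")) for m in TAG_RE.finditer(text)]
    pairs = []
    for (tag1, val1), (tag2, val2) in zip(tokens, tokens[1:]):
        if tag1 == "a" and tag2 == "b":
            pairs.append((val1, val2))
    return pairs


# Hypothetical input with three consecutive <a> and two consecutive <b>.
sample = "<a>1</a><b>A</b><a>2</a><a>3</a><a>4</a><b>B</b><b>C</b>"
print(pair_adjacent(sample))
```

With this sample, the unpaired `2`, `3`, and `C` are dropped, which is the "discard the first two of three consecutive A's" behaviour described above.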

Supplement 2:

It is legitimate for the XML to be non-standard; there can be valid reasons for that. For this real-world situation, my suggestions are:

Try to use an XML parser that supports a lenient/tolerant mode. Tolerating certain XML flaws is in fact the foundation that many HTML parsers are built on.

Don't try to do everything in one step. First split the input into records, then analyze the fields within the scope of each record. That way every problem is at least contained within a single record, and one bad spot cannot derail everything. (Refer to this answer)

Always treat regex as your last resort.
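To illustrate the first suggestion, here is a rough sketch that uses the standard library's `html.parser.HTMLParser`, a tolerant tokenizer, on input with a missing closing tag. The class `TagTextCollector` and the broken sample are hypothetical, and this is only one possible tolerant parser; it is not necessarily the tool the answer had in mind.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect (tag, text) for the wanted tags, tolerating unclosed tags."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None   # wanted tag we are currently inside, if any
        self.items = []

    def handle_starttag(self, tag, attrs):
        # A new start tag replaces the current one, even if it was never closed.
        self.current = tag if tag in self.wanted else None

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.items.append((self.current, data.strip()))


# Hypothetical broken input: the first <b> is never closed.
broken = "<a>1</a><b>A<a>2</a><b>B</b>"
collector = TagTextCollector({"a", "b"})
collector.feed(broken)
collector.close()
print(collector.items)
```

A strict XML parser would reject this input outright, while the tolerant tokenizer still reports every tag and its text in order.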

In addition, I must criticize the poster very seriously: you are yet another textbook example of the XY problem.

At first you showed only a very simple, well-formed XML fragment, and only after two updates did you finally reveal the crucial fact that "the XML may not be well-formed".

Were you deliberately holding back a trump card to protect your fragile self-esteem in case you got criticized?!

Could you be any more fragile?!

Supplement 1:

I cannot agree with Update 2 of the question.

Using regular expressions to match XML means that as soon as someone digs a few small pits within the rules of XML, the lazy programmer will fall into them.

That regex-based XML parsing is "definitely not suitable for practical applications" should not even be in doubt. If you force it, the resulting program can only cope with certain specific cases. And if the data source changes even slightly (for example, someone temporarily comments out a few tags), a human may have to hotfix it. The result is a skyscraper built on loose sand: the program the programmer worked so hard on soon becomes unusable, and the cycle never ends.

"Nothing is absolute." Isn't that claim itself absolute? I think principles are principles. Some questions have a clear right and wrong, and some waters should not be muddied. If on matters of principle you give a little ground here and relax a little there, the program you end up with can only head toward an unpredictable outcome.

It is normal to find other opinions on SOF. Do you have to agree with them just because you saw them?!

The only thing that is certain is that using XML as practice material for learning regular expressions does no harm.

I would rather turn the question around.

Why do some people always like to use regular expressions to parse XML/HTML?!

Since when did "parse XML with a parser or with regex, each has its strengths" become a topic open for debate?!

Is this even something that needs to be discussed?!

Never use regex in place of an XML parser.

Iron principle!

Never step back!

No matter how simple XML is, it won’t work!

Because a simple regular expression cannot cover all the complex structures of XML. XML has many cases: some are odd but valid, some are merely tolerated, and some should simply be reported as errors. Regular expressions cannot cover them all.

For example, consider the following cases and ask yourself: if you did this with regex, would you have thought of everything?

Note:

Unparsed text section: <![CDATA[ "CDATA Section" should be ignored too ]]>

Escape of entity: content of A &lt; B is A < B instead of A &lt; B

Self-closing tag: <a/> is an equivalent to <a></a>, shouldn't be ignored

When an element has multiple attribute values, the order of the attributes may be arbitrary
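The cases listed above can be made concrete with a small comparison. This is a sketch on a hypothetical document (the element names `root`, `x`, `a` and the attributes `id` and `kind` are invented for illustration), contrasting a naive pattern with the standard library's `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document exercising comments, CDATA, entities,
# a self-closing tag, and attributes in different orders.
doc = """<root>
  <!-- <a>commented out</a> -->
  <x><![CDATA[<a>inside CDATA</a>]]></x>
  <a>A &lt; B</a>
  <a/>
  <a id="1" kind="k">attrs</a>
  <a kind="k" id="2">attrs reordered</a>
</root>"""

# Naive regex: picks up text from the comment and the CDATA section,
# leaves the entity undecoded, and misses <a/> and the tags with attributes.
naive = re.findall(r"<a>(.*?)</a>", doc)

# Parser: skips the comment, keeps CDATA as plain text of <x>,
# decodes the entity, and reports every real <a> element.
root = ET.fromstring(doc)
parsed = [(el.text, dict(el.attrib)) for el in root.iter("a")]

print(naive)
print(parsed)
```

The regex returns three strings, two of which are not elements at all, while the parser returns four real `a` elements with attributes as dictionaries, so attribute order no longer matters.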

So regex and XML parsers are two things of completely different complexity. Mix them up and the price will be paid back with interest one day. Don't give up on writing solid code just because "it serves the purpose for now." That is using physical "diligence" to cover up sheer mental laziness.

Anyone who has competed in the Informatics Olympiad in middle school or in ACM/ICPC at university understands a simple truth:

Passing the sample data and getting the whole problem Accepted are two completely different things.

The same goes for real-world programming. For this requirement, since XML is a standard, code that deals with XML must be "guaranteed" to work on any XML that conforms to the standard, rather than being tweaked endlessly until it "looks" like it handles the one-sided "sample data" you set up.

Read the article "An Interlude in Linux 2.6.39-rc3" and remember the lesson from Linus Torvalds:

This kind of “I broke things, so now I will jiggle things randomly until they unbreak” is not acceptable.