Python - The title of the web page contains a newline. How to extract it using regular expressions?
女神的闺蜜爱上我 2017-06-22 11:51:43

While writing a CSDN web crawler in Python, I have always extracted page titles with the regular expression (?<=\<title\>).+?(?=\<), but it does not work on CSDN. Looking at the CSDN page source, the title text is broken across several lines.

As a result, the original regular expression no longer matches. So the question is: when a page title contains a newline like this, how can it be extracted with a regular expression?
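The failure described above can be reproduced with a small made-up snippet. This is only a sketch: the `html` string is an assumption standing in for the real CSDN source, and the pattern is the one from the question with the optional backslashes dropped.

```python
import re

# Hypothetical page source: the title text sits on its own line inside <title>.
html = "<head>\n<title>\n    Some blog post - CSDN\n</title>\n</head>"

pattern = r"(?<=<title>).+?(?=<)"

# By default "." does not match "\n", so ".+?" cannot get past the line
# break that follows <title>, and the search finds nothing.
no_flag = re.search(pattern, html)
print(no_flag)  # None

# The same pattern works when the whole title is on one line.
single_line = re.search(pattern, "<title>Some blog post - CSDN</title>")
print(single_line.group())  # Some blog post - CSDN
```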

PS:

  1. I don't want to use XPath or BeautifulSoup; I only want a regular-expression solution.

  2. CSDN does have an anti-crawler mechanism, but that is not the reason the title could not be extracted.

Thank you all.

Update: following @caimaoy's method, I changed the regular expression to (?<=\<title\>)(?:.|\n)+?(?=\<), and the title is now extracted perfectly. Thank you all again.
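A minimal sketch of that alternation approach, again on a made-up snippet (the `html` string is an assumption, not the actual CSDN page). The group `(?:.|\n)` matches either an ordinary character or a newline, so the lazy match can cross line breaks without any flag.

```python
import re

# Hypothetical page source with the title split across lines.
html = "<head>\n<title>\n    Some blog post - CSDN\n</title>\n</head>"

# "(?:.|\n)" = any character, including a newline; "+?" keeps the match lazy
# so it stops at the first "<" after the title text.
pattern = r"(?<=<title>)(?:.|\n)+?(?=<)"

match = re.search(pattern, html)
raw_title = match.group()

# The raw match still carries the surrounding newlines and indentation.
print(repr(raw_title))    # '\n    Some blog post - CSDN\n'
print(raw_title.strip())  # Some blog post - CSDN
```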


All replies (2)
仅有的幸福
  1. re.M Multi-line mode

  2. Write the multi-line matching yourself: http://python3-cookbook.readt...
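For anyone trying the flags, here is a short comparison on a made-up snippet (the `html` string is an assumption). In Python's `re` module, `re.M` (`re.MULTILINE`) only changes where `^` and `$` match, while `re.S` (`re.DOTALL`) is the flag that lets `.` match a newline, so the two behave differently on this pattern.

```python
import re

# Hypothetical page source with the title on its own line.
html = "<title>\nSome blog post - CSDN\n</title>"
pattern = r"(?<=<title>).+?(?=<)"

# re.M leaves "." unchanged, so the pattern still cannot cross the newline.
with_multiline = re.findall(pattern, html, re.M)
print(with_multiline)  # []

# re.S makes "." match "\n" as well, so the title is found.
with_dotall = re.findall(pattern, html, re.S)
print(with_dotall)  # ['\nSome blog post - CSDN\n']

# What re.M does change: "^" can match at the start of every line.
line_starts = re.findall(r"^Some", html, re.M)
print(line_starts)  # ['Some']
```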

曾经蜡笔没有小新

Add a flag to the expression

import re
title = '......'  # the page source goes here
# re.S (DOTALL) lets "." match newlines as well
print(re.findall(r'(?<=\<title\>).+?(?=\<)', title, re.S))
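Since a title matched this way keeps its line breaks and indentation, it may be convenient to wrap the flag-based pattern in a small helper that also tidies the whitespace. The sketch below is one possible way to do it; the name `extract_title` and the sample input are hypothetical and not from the thread.

```python
import re

def extract_title(html):
    """Return the text inside <title>, with whitespace collapsed, or None."""
    matches = re.findall(r"(?<=<title>).+?(?=<)", html, re.S)
    if not matches:
        return None
    # split() with no arguments splits on any run of whitespace (including
    # newlines), so joining with single spaces flattens the title to one line.
    return " ".join(matches[0].split())

sample = "<html><head>\n<title>\n  A multi-line\n  title\n</title>\n</head></html>"
print(extract_title(sample))              # A multi-line title
print(extract_title("<p>no title</p>"))   # None
```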