For more targeted cases, you can make a simple judgment based on tags such as p and article. If you need something more general, you can analyze the collected page data, for example by writing an algorithm that calculates the density of Chinese text (that is, text outside of tags) to decide whether a block is the main text. I haven't implemented this myself, but that is the basic idea.
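The density idea can be sketched roughly as follows. This is only one possible reading of it, not a tested extractor: it works line by line, treats the share of characters left after stripping tags as the "density", and keeps lines that are both long enough and dense enough. The names text_density and extract_main_text and the thresholds min_density and min_chars are assumptions made up for this illustration.

```python
import re

# Matches any HTML tag such as <p>, </div>, <a href="...">.
TAG_RE = re.compile(r"<[^>]+>")
# Matches whole <script>...</script> and <style>...</style> elements.
SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style)\b.*?</\1>")


def text_density(line):
    """Return the share of characters on a line that are not part of tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def extract_main_text(html, min_density=0.5, min_chars=20):
    """Keep lines whose tag-free text is long and dense enough.

    min_density and min_chars are illustrative thresholds, not tuned values.
    len() counts characters, so Chinese text is measured the same way.
    """
    html = SCRIPT_STYLE_RE.sub("", html)
    kept = []
    for line in html.splitlines():
        text = TAG_RE.sub("", line).strip()
        if len(text) >= min_chars and text_density(line) >= min_density:
            kept.append(text)
    return "\n".join(kept)


sample = """<html><head><title>t</title><script>var x = 1;</script></head>
<body>
<div class="nav"><a href="/a">Home</a> <a href="/b">About</a></div>
<p>This is the first long paragraph of the article body, with plenty of plain text.</p>
<p>A second paragraph continues the main text and has very few tags in it.</p>
<div class="footer"><a href="/c">Contact</a></div>
</body></html>"""

main_text = extract_main_text(sample)
```

Navigation and footer lines are mostly markup with short link text, so they fall below the thresholds, while the two paragraph lines are kept. Real pages often put a whole article on one line or split it across many, so a practical version would probably need to work on blocks rather than raw lines.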
Source address: http://www.cnblogs.com/jasondan/p/3497757.html
HTTP protocol simulation (usually done with the requests or urllib2 module)
Information extraction (because of the particular structure of HTML documents, XPath or BeautifulSoup is generally used)
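The two items above can be illustrated with a small sketch that uses only the Python 3 standard library. It is a stand-in, not the tools named above: urllib.request is where urllib2's functionality lives in Python 3, and html.parser replaces XPath or BeautifulSoup here only so the example has no third-party dependencies. The helper names build_request and extract_links, the User-Agent string, and the example URL are all assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.request import Request


def build_request(url):
    """Build (but do not send) a GET request with a browser-like User-Agent."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}
    return Request(url, headers=headers)


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# The request object is only constructed here; nothing goes over the network.
req = build_request("http://example.com/")
links = extract_links('<p><a href="/a">A</a><a>no href</a><a href="/b">B</a></p>')
```

To actually fetch the page you would pass req to urllib.request.urlopen and decode the response body before feeding it to extract_links.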