Fetching the content is usually just an HTTP request; +1 for requests.
Once you have downloaded the page, you do some string processing to pull out the information you want. beautifulsoup, regular expressions, and str.find() all work.
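To make the string-processing step concrete, here is a minimal sketch using only str.find() and a regular expression. The HTML snippet, the `data-id` attribute, and the function names are made up for illustration; a real page will have different markup, and for anything messy beautifulsoup is usually the more robust choice.

```python
import re

# Hypothetical page content standing in for a downloaded response body.
SAMPLE_HTML = """
<html><head><title>Now Showing</title></head>
<body>
  <ul>
    <li class="movie" data-id="101">First Film</li>
    <li class="movie" data-id="202">Second Film</li>
  </ul>
</body></html>
"""


def title_with_find(html):
    """Extract the <title> text using only str.find() and slicing."""
    start = html.find("<title>")
    end = html.find("</title>", start)
    if start == -1 or end == -1:
        return None
    return html[start + len("<title>"):end].strip()


def movies_with_regex(html):
    """Return (id, title) pairs matched by a regular expression."""
    pattern = re.compile(r'<li class="movie" data-id="(\d+)">([^<]+)</li>')
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(html)]


page_title = title_with_find(SAMPLE_HTML)
movies = movies_with_regex(SAMPLE_HTML)
```

Both helpers break as soon as the markup changes shape, which is the usual trade-off of plain string matching against a real HTML parser.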
For ordinary web pages, the two points above are enough. For sites that load their data through ajax requests, the content you want may not be in the page you fetch. It is often easier to find the API the page calls and request that directly.
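When a site does load its data through an API, the response is often JSON, and the parsing step becomes much simpler than scraping HTML. The sketch below assumes a made-up response shape (`items`, `id`, `title`); the actual fields depend entirely on the site, and you would normally discover them by watching the requests the page makes.

```python
import json

# Hypothetical JSON body, standing in for what an ajax endpoint might return.
SAMPLE_RESPONSE = (
    '{"total": 2, "items": ['
    '{"id": "101", "title": "First Film"}, '
    '{"id": "202", "title": "Second Film"}]}'
)


def parse_items(body):
    """Decode the JSON text and return (id, title) pairs."""
    data = json.loads(body)
    return [(item["id"], item["title"]) for item in data.get("items", [])]


items = parse_items(SAMPLE_RESPONSE)
```

With the requests library, the body would typically come from the response of a GET to the endpoint instead of a hard-coded string.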
Here is a scraping script the asker can use directly. It fetches the Douban ID and title of each movie currently showing on Douban. The script depends on the beautifulsoup library, which has to be installed first (see the Beautifulsoup Chinese documentation).
Addendum: if the asker wants to build a real crawler that can crawl a whole site, or one where the pages to crawl can be customized, I recommend studying scrapy.
For simple jobs that don't need a framework, look at the requests and beautifulsoup libraries. If you already know Python syntax, reading up on these two is nearly enough to write a simple crawler.
Companies generally do use crawlers. The ones I have seen are mostly written in java or python.
There are indeed many articles online about writing a simple crawler in Python, but most of them are only toy examples, and very few can be used in practice. As I see it, a crawler comes down to three steps: fetch the content, analyze it, and store the result. If you are new to this, a Google search is enough to get started. If you want to go deeper, look for crawler code on Github and read through it.
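The fetch, analyze, store split can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_fetch` stands in for a real HTTP request so the example runs without network access, and the anchor-tag markup and the `links` table are made up.

```python
import re
import sqlite3


def fake_fetch(url):
    """Fetch: stand-in for an HTTP GET; returns canned HTML, no network."""
    return '<a href="/item/1">Alpha</a><a href="/item/2">Beta</a>'


def parse_links(html):
    """Analyze: pull (href, text) pairs out of simple anchor tags."""
    return re.findall(r'<a href="([^"]+)">([^<]+)</a>', html)


def store(conn, rows):
    """Store: write the parsed rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links (href TEXT PRIMARY KEY, text TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO links VALUES (?, ?)", rows)
    conn.commit()


def crawl(url, conn, fetch=fake_fetch):
    """Run the three steps for one URL and return how many rows were saved."""
    rows = parse_links(fetch(url))
    store(conn, rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
count = crawl("http://example.com/list", conn)
```

Passing the fetcher in as a parameter keeps the three steps separate, so swapping in a real HTTP client later should not touch the parsing or storage code.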
I only know a little Python, but I hope this helps.
A tutorial I put together back when I was learning:
Python crawler tutorial
Sample Python code for scraping:
Search Baidu for python + crawler.
For a simple crawler, use the simplest framework that is still practical. Have a look at the introductory posts online.
I recommend scrapy.
You can take a look at my scrapy material.
Scrapy saves you a lot of time.
There are many examples on Github.
Here is some code for crawling Tmall: