When learning Python, knowing how to fetch content from websites is an essential skill. This article walks through the basic workflow of a web crawler; once you understand the overall process, you can master each step one at a time.
A Python web crawler generally involves the following steps:
1. Obtain the website URL
Some website URLs are easy to obtain because they appear directly in the browser's address bar.
2. Analyze the URL
Other URLs are not obvious; they must be found by analyzing the page in the browser (for example, watching the network requests in the developer tools) before they can be requested.
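Once you have worked out how a site's addresses are built, you can construct and take apart URLs programmatically. A minimal sketch using the standard library (the example domain and parameter names are placeholders):

```python
from urllib.parse import urlencode, urlparse

# Hypothetical target; substitute the site you found in the browser.
base = "https://example.com/search"

# Many "hidden" URLs are just a base path plus query parameters,
# which you can rebuild once you have spotted them in the developer tools.
params = {"q": "python", "page": 2}
url = base + "?" + urlencode(params)
print(url)  # https://example.com/search?q=python&page=2

# urlparse takes an address apart again when you need to analyze it.
parts = urlparse(url)
print(parts.netloc)  # example.com
print(parts.query)   # q=python&page=2
```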
3. Request the URL
Sending a request is how we obtain the page's source code, which is what we will later extract data from.
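A minimal request sketch using the standard library's `urllib.request` (the popular third-party `requests` library is a common alternative). The User-Agent string is an arbitrary example:

```python
import urllib.request

def build_request(url: str) -> urllib.request.Request:
    """Attach a browser-like User-Agent so simple sites do not reject us."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; my-crawler/0.1)"}
    return urllib.request.Request(url, headers=headers)

req = build_request("https://example.com")

# Uncomment to actually fetch the page source:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8", errors="replace")
```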
4. Obtain the response
Getting a response is essential: only after the site responds can we extract content from it. When necessary, we obtain a cookie from the login URL to simulate a login.
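Keeping cookies between requests is what makes a simulated login work. One standard-library way to do this is a `CookieJar` wired into an opener; the login URL and form field names below are placeholders, not a real site's API:

```python
import http.cookiejar
import urllib.request

# A CookieJar lets the opener remember cookies across requests,
# which is the basis of simulating a login session.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Sketch of a login flow (URL and form fields are assumptions):
# import urllib.parse
# login_data = urllib.parse.urlencode({"user": "...", "pass": "..."}).encode()
# with opener.open("https://example.com/login", data=login_data) as resp:
#     print(resp.status)  # 200 means the request succeeded
# Later opener.open(...) calls automatically send the stored cookies.
```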
5. Extract the specified data from the source code
This is the data content we actually need. A page's source is large and complex, so we must pick out just the information of interest. The three main tools I currently use are re (regular expressions), XPath, and bs4 (BeautifulSoup).
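A small extraction sketch using `re` from the standard library on a made-up HTML snippet; the commented alternatives show the equivalent idea with BeautifulSoup and XPath, assuming those third-party packages are installed:

```python
import re

html = """
<ul>
  <li class="item">Apple</li>
  <li class="item">Banana</li>
</ul>
"""

# Regular expressions work for simple, regular markup like this sample.
items = re.findall(r'<li class="item">(.*?)</li>', html)
print(items)  # ['Apple', 'Banana']

# For real pages, a parser is more robust, e.g. (if installed):
#   from bs4 import BeautifulSoup
#   soup = BeautifulSoup(html, "html.parser")
#   items = [li.get_text() for li in soup.select("li.item")]
# or lxml's XPath: tree.xpath('//li[@class="item"]/text()')
```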
6. Process and clean the data
The extracted data is often messy, full of unnecessary whitespace, leftover tags, and so on. At this point we need to remove the parts we do not want.
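Two common clean-up moves, sketched with the standard library: stripping and collapsing whitespace, and removing leftover tags (the sample strings are invented):

```python
import re

raw = "  \n  Price:\t  19.99   USD  \n"

# strip() removes leading/trailing whitespace; re.sub collapses the rest.
clean = re.sub(r"\s+", " ", raw.strip())
print(clean)  # Price: 19.99 USD

# Leftover tags can be stripped the same way:
tagged = "<b>Hello</b> world"
text = re.sub(r"<[^>]+>", "", tagged)
print(text)  # Hello world
```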
7. Save the data
The last step is to save the data we obtained so that we can consult it at any time, usually in folders, text documents, databases, spreadsheets, and so on.
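A saving sketch using the standard library's `csv` and `json` modules (file names and sample records are placeholders; `sqlite3`, also in the stdlib, suits larger datasets):

```python
import csv
import json
import os
import tempfile

rows = [{"name": "Apple", "price": "1.20"}, {"name": "Banana", "price": "0.50"}]

# CSV: one line per record, readable by spreadsheet tools.
path = os.path.join(tempfile.gettempdir(), "crawl_results.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

# JSON keeps nested structure intact.
json_path = os.path.join(tempfile.gettempdir(), "crawl_results.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False, indent=2)
```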
The above covers the basic process of crawling data in Python. For more information, see other related articles on the PHP Chinese website!