I have read many of the Python crawler articles shared around my circle of friends, and they all strike me as rather childish. Processing content has always been PHP's strength. Python's only advantage is that it ships with Linux, as Perl does. In that respect I find Linux rather stingy, whereas the Mac is generous and comes with Python, Perl, PHP, and Ruby. Of course, I also hate arguing about which language is better. Every language has its reasons for existing. Anyway, PHP is the most used language in the world, as everyone knows ^_^
What made the rounds a few days ago was a multi-threaded crawler someone wrote in C#. It captured 30 million QQ users from QQ space, of which 3 million had QQ numbers, nicknames, space names and other information. In other words, only 3 million came with details, and it took two weeks. That is nothing special. To prove that PHP is the best language in the world (although everyone knows it already ^_^), I wrote a multi-process crawler in PHP. It took only one day to capture 1 million Zhihu users, and the crawl has now reached the 8th ring (depth=8) of users connected to each other through followee and follower relationships.
Crawler programming:
Zhihu requires a login to view the followers page, so after logging in with Chrome I copied the cookie and handed it to the curl program to simulate a logged-in session.
I used two independent looping process groups (a user index process group and a user details process group), built on PHP's pcntl extension and wrapped in a very easy-to-use class. It feels almost the same as golang's coroutines (goroutines).
The following is a screenshot of the user details code; the user index code is similar.
A digression here. In my tests, my 8-core Macbook was fastest when running 16 processes, while the 16-core Linux server was actually fastest when running 8 processes. This baffles me a bit. But since the best process count has been measured on each machine, I just use those settings.
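A minimal sketch of the cookie trick, assuming the cookie is pasted from Chrome as a raw "name=value; name2=value2" string. The function names and the user agent string are hypothetical and are not taken from the original program.

```php
<?php
// Build the cURL options for a request that reuses a browser session.
// $cookie is the raw cookie header value copied from the browser.
function build_login_options($url, $cookie)
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_COOKIE         => $cookie,   // sent as the "Cookie:" request header
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 30,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-crawler)', // placeholder
    ];
}

// Fetch one page; returns the body as a string, or false on failure.
function fetch_page($url, $cookie)
{
    $ch = curl_init();
    curl_setopt_array($ch, build_login_options($url, $cookie));
    return curl_exec($ch);
}
```

Keeping the options in their own function makes it easy to check what will be sent without touching the network.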
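A process group of this kind can be sketched roughly as follows. This is not the author's class, only a minimal illustration that assumes the pcntl extension is loaded (CLI only); run_process_group is a hypothetical name. The process count is a parameter, so the measured best value for each machine can be passed in.

```php
<?php
// Run $task in $count forked child processes and wait for all of them.
// Returns the exit status of each child, keyed by PID.
function run_process_group($count, callable $task)
{
    $pids = [];
    for ($i = 0; $i < $count; $i++) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('fork failed');
        }
        if ($pid === 0) {
            // Child: do the work, then exit so it never falls back into the parent's loop.
            $task($i);
            exit(0);
        }
        $pids[] = $pid; // Parent: remember the child and keep forking.
    }

    $statuses = [];
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);            // block until this child exits
        $statuses[$pid] = pcntl_wexitstatus($status);
    }
    return $statuses;
}
```

For example, run_process_group(16, $crawlOneBatch) would start 16 workers, where $crawlOneBatch is whatever callable does the actual crawling.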
1. The user index process group starts from one user, captures that user's followees and followers, and merges them into the database. Because there are multiple processes, two processes handling the same user would write duplicate users into the database, so a unique index must be created on the user name field. Of course, a third-party cache such as redis can also be used to guarantee atomicity. This is a matter of preference.
After step one, we get the following user list:
2. The user details process group picks the users whose records are oldest, fetches their details, and sets the update time to the current time. This forms an infinite loop, so the program can run endlessly, refreshing user information over and over.
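The unique-index approach can be sketched like this. The original database and schema are not shown, so this example assumes PDO with an in-memory SQLite database as a stand-in, and the table and column names are hypothetical. With MySQL the equivalent statement would be INSERT IGNORE.

```php
<?php
// Create a user table whose username column is protected by a unique index.
function create_user_table(PDO $db)
{
    $db->exec('CREATE TABLE IF NOT EXISTS user (
        username   TEXT    NOT NULL,
        updated_at INTEGER NOT NULL DEFAULT 0
    )');
    $db->exec('CREATE UNIQUE INDEX IF NOT EXISTS idx_user_username ON user (username)');
}

// Insert user names; rows that would violate the unique index are skipped
// by the database itself, so concurrent workers cannot create duplicates.
function add_users(PDO $db, array $usernames)
{
    $stmt = $db->prepare('INSERT OR IGNORE INTO user (username) VALUES (?)');
    foreach ($usernames as $name) {
        $stmt->execute([$name]);
    }
}
```

The point of the design is that the duplicate check happens inside the database, not in PHP, so no process needs to know what the others are doing.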
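The oldest-first loop in step 2 might look something like the sketch below. It assumes a PDO connection and a hypothetical user table with username and updated_at columns; next_user_to_refresh is an illustrative name. Note that the select and the update here are two separate statements, so with several processes two workers could still pick the same user; the sketch does not try to solve that.

```php
<?php
// Pick the user whose record was refreshed longest ago and stamp it with $now,
// so it moves to the back of the queue. Returns the username, or null if the
// table is empty.
function next_user_to_refresh(PDO $db, $now)
{
    $name = $db->query(
        'SELECT username FROM user ORDER BY updated_at ASC LIMIT 1'
    )->fetchColumn();

    if ($name === false) {
        return null; // nothing to do yet
    }

    $stmt = $db->prepare('UPDATE user SET updated_at = ? WHERE username = ?');
    $stmt->execute([$now, $name]);
    return $name;
}
```

Calling this in a loop with time() as $now cycles through every user and then starts over, which is the endless refresh described above.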
The program ran stably until the next day, and then suddenly no new data came in. I checked and found that Zhihu had changed the rules. I don't know whether it was aimed at me or just a coincidence, but the data returned to me looked like this
My first thought was that Zhihu was sending me random garbage so that I could not collect anything. I changed the IP and disguised some of the request data, but it was no use. Then it suddenly looked very familiar. Could it be gzip? Skeptical, I tried gzip. First, of course, I told Zhihu not to give me gzip-compressed data
Removing "Accept-Encoding: gzip,deflate\r\n"; made no difference!
It seems that Zhihu insists on sending me gzip-compressed data. In that case, I will decompress it. I looked up how to decompress gzip in PHP and found the function gzinflate, so I added this after fetching the content:
$content = substr($content, 10); // skip the 10-byte gzip header
$content = gzinflate($content);
Of course, you can also use the option built into curl:
curl_setopt( self::$ch, CURLOPT_ENCODING, 'gzip' );
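As a hedged alternative to cutting off the header by hand, the body can be checked for the gzip magic bytes (0x1f 0x8b) and passed to gzdecode, which is available since PHP 5.4 and parses the header itself. Skipping exactly 10 bytes only works when the gzip header carries no optional fields. decode_body is a hypothetical helper name, and the sketch assumes the zlib extension is loaded.

```php
<?php
// Return the decompressed body if $content looks like a gzip stream,
// otherwise return it unchanged.
function decode_body($content)
{
    // Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if (strncmp($content, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($content);
        // Fall back to the raw bytes if the stream turns out to be corrupt.
        return $decoded === false ? $content : $decoded;
    }
    return $content;
}
```

This way the same code path works whether or not the server decides to compress the response.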
I really have to say it here: PHP truly is the best language in the world. With just one or two functions the problem was completely solved, and the program ran happily again.
Zhihu's thoughtfulness also helped me countless times when matching content. For example, I need to determine the gender of the user:
Haha, just kidding. Actually, the markup contains the class names icon-profile-female and icon-profile-male ^_^
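Matching on those two class names can be sketched as below. detect_gender is a hypothetical helper, and the 'unknown' fallback is an assumption for profiles where neither class appears.

```php
<?php
// Guess the gender of a profile from the icon class in its HTML.
function detect_gender($html)
{
    if (strpos($html, 'icon-profile-female') !== false) {
        return 'female';
    }
    if (strpos($html, 'icon-profile-male') !== false) {
        return 'male';
    }
    return 'unknown'; // neither icon present
}
```

A plain substring search is enough here, because neither class name is contained in the other.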
Why go to all the trouble of capturing so many users? What's the use?
It's actually useless, I just had too much time on my hands ^_^
With this information, you can actually do some of the big data analysis that other people only ever talk about
The most common ones are of course:
1. Gender distribution
2. Geographical distribution
3. Occupational distribution, including which company people work for
4. The ratio of men to women in each occupation
5. When people usually use Zhihu: when they post questions and follow questions, and which questions deserve attention
Of course, you can also sort by the number of followers, viewers, questions, answers, and so on, to see what people are paying attention to, including people's livelihood, society, geography, politics, and the entire Internet.
Perhaps you can also analyze the avatars, run an open source pornography detection program over them to filter out the pornographic ones, and then save Dongguan? ^_^
Then, you can also look at what the people who graduated from a given university ended up doing.
With this data, can you let your imagination run wild? ^_^
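Most of the distributions listed above come down to counting how often each value of a field occurs. A minimal sketch, assuming each user is an associative array; the field names in the example are hypothetical, and distribution is an illustrative helper name.

```php
<?php
// Count how many users share each value of $field, most common value first.
// Users that lack the field are skipped.
function distribution(array $users, $field)
{
    $values = array_column($users, $field);   // pull out one column
    $counts = array_count_values($values);    // value => number of occurrences
    arsort($counts);                          // sort by count, descending
    return $counts;
}
```

The same function covers gender, location, or company by passing a different field name, for example distribution($users, 'location').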
The following are some interesting charts made using these data. The real-time chart data can be viewed at http://www.epooll.com/zhihu/