Zhihu is a very popular knowledge sharing community. Many users have contributed a large number of high-quality questions and answers. For people who study and work, this content is very helpful for solving problems and expanding their horizons. If you want to organize and utilize this content, you need to use scrapers to obtain relevant data. This article will introduce how to use PHP to write a program to crawl Zhihu questions and answers.
Question title
Question description
The number of followers, views, and answers to the question
Question tag
Related Questions
Questions on Zhihu have a very obvious feature, that is, each question has a unique URL. So we can get information about the problem by constructing a URL and sending an HTTP request.
The following is a PHP code demonstration:
<?php $url = 'https://www.zhihu.com/question/36189228'; $html = file_get_contents($url); $data = array(); preg_match('/<title>(.*?)</title>/', $html, $match); $data['title'] = $match[1]; preg_match('/<div class="QuestionHeader-detail">(.*?)</div>/', $html, $match); $data['description'] = $match[1]; preg_match('/<div class="NumberBoard-value">(.*?)</div><span class="NumberBoard-label">关注者</span>/', $html, $match); $data['followers'] = $match[1]; preg_match('/<div class="NumberBoard-value">(.*?)</div><span class="NumberBoard-label">浏览</span>/', $html, $match); $data['views'] = $match[1]; preg_match('/<div class="NumberBoard-value">(.*?)</div><div class="NumberBoard-label">回答</div>/', $html, $match); $data['answers'] = $match[1]; preg_match_all('/<a href="/topic/(.*?)">(.*?)</a>/', $html, $matches); $data['tags'] = implode(',', $matches[2]); preg_match_all('/<a class="RelatedQuestionItem-title" href="(.*?)" target="_blank">(.*?)</a>/', $html, $matches); $data['related_questions'] = array_combine($matches[1], $matches[2]); echo json_encode($data, JSON_UNESCAPED_UNICODE);
PHP's regular expression is used here to match the required information in the HTML text. Although this method depends on the HTML page structure, it can normally capture the required data in most cases. It can be seen that through simple code, we can obtain various information about this problem.
The author of the answer
The content of the answer
The answer Number of likes and comments
For each answer, we can also obtain its related information by constructing a URL and sending an HTTP request.
The following is a PHP code demonstration:
<?php $url = 'https://www.zhihu.com/question/36189228/answer/243147352'; $html = file_get_contents($url); $data = array(); preg_match('/<meta itemprop="name" content="(.*?)">/', $html, $match); $data['author'] = $match[1]; preg_match('/<div class="RichText ztext">(.*?)</div>/', $html, $match); $data['content'] = $match[1]; preg_match('/<button class="Button VoteButton VoteButton--up" aria-pressed="false" tabindex="0" aria-label="(.*?)">/', $html, $match); $data['upvotes'] = $match[1]; preg_match('/<button class="Button CommentButton" tabindex="0" aria-label="(.*?)">/', $html, $match); $data['comments'] = $match[1]; echo json_encode($data, JSON_UNESCAPED_UNICODE);
Similarly, we used PHP's regular expressions to match the required information in the HTML text. It is worth noting that getting the content of the answer requires using ztext instead of the AnswerItem-content class. This is because Zhihu changed the relevant CSS class names after the update.
The above is the detailed content of Use PHP to implement a program to capture Zhihu questions and answers. For more information, please follow other related articles on the PHP Chinese website!