This tutorial demonstrates how to build a SitePoint search engine that outperforms WordPress's built-in search by using Diffbot's structured data extraction. We'll use Diffbot's APIs for both crawling and searching, with a Homestead Improved environment for development.
Key Advantages:
Diffbot extracts structured data (title, author, and meta information) from every page it crawls, can index an entire domain rather than a single install, and exposes the indexed content through a Search API that supports far richer queries than WordPress's default search.
Implementation:
We'll create a SitePoint search engine in two steps:
The Diffbot Crawljob:
Creating a Crawljob (using the Diffbot PHP client):
composer require swader/diffbot-php-client
Then, create a job.php file:

<?php

include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token'); // Replace 'my_token' with your Diffbot token

$job = $diffbot->crawl('sp_search');

$job
    ->setSeeds(['https://www.sitepoint.com'])   // start crawling from the SitePoint homepage
    ->notify('your_email@example.com')          // Replace with your email
    ->setMaxToCrawl(1000000)                    // upper limit on pages crawled
    ->setMaxToProcess(1000000)                  // upper limit on pages processed
    ->setRepeat(1)
    ->setMaxRounds(0)
    ->setPageProcessPatterns([''])
    ->setOnlyProcessIfNew(1)                    // in repeat rounds, only process new pages
    ->setUrlCrawlPatterns(['^http://www.sitepoint.com', '^https://www.sitepoint.com']) // stay on sitepoint.com
    ->setApi($diffbot->createArticleAPI('crawl')->setMeta(true)->setDiscussion(false)); // extract articles with meta, skip comments

$job->call();
Running php job.php creates the Crawljob, which then appears in the Diffbot Crawlbot interface.
Searching with the Search API:
Use the Search API to query the indexed data:
$search = $diffbot->search('author:"Bruno Skvorc"');
$search->setCol('sp_search');
$result = $search->call();

// Display results (example)
echo '<table><thead><tr><td>Title</td><td>Url</td></tr></thead><tbody>';
foreach ($search as $article) {
    echo '<tr><td>' . $article->getTitle() . '</td><td><a href="' . $article->getResolvedPageUrl() . '">Link</a></td></tr>';
}
echo '</tbody></table>';
The Search API supports advanced queries: keywords, date ranges, specific fields, and boolean operators. Meta information about a search is accessible via $search->call(true), and the Crawljob's status can be checked with $diffbot->crawl('sp_search')->call().
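To make those query options concrete, here is a minimal sketch that reuses the client setup from job.php. The combined author-plus-keyword query and the printed field are hypothetical illustrations of the syntax described above, not queries taken from the original article:

<?php

include 'vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token'); // Replace 'my_token' with your Diffbot token

// Hypothetical advanced query: articles by Bruno Skvorc that also mention "crawling"
$search = $diffbot->search('author:"Bruno Skvorc" AND crawling');
$search->setCol('sp_search');
$result = $search->call();

foreach ($search as $article) {
    echo $article->getTitle() . "\n";
}

// Passing true returns meta information about the search itself
$meta = $search->call(true);

// The Crawljob's current status, retrieved the same way the job was created
$status = $diffbot->crawl('sp_search')->call();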
Conclusion:
Diffbot provides a powerful solution for creating custom search engines. While potentially costly for individuals, it offers significant benefits for teams and organizations managing large websites. Remember to respect website terms of service before crawling. The next part will focus on building the search engine's GUI.
Frequently Asked Questions:
This section answers common questions about crawling, indexing, and using Diffbot for large-scale data extraction.
How can a site owner restrict crawling? Site owners can use a robots.txt file to restrict access, so check a site's robots.txt (and its terms of service) before pointing a Crawljob at it.
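As a generic illustration (not SitePoint's actual rules), a site owner could block crawlers from part of a site with a robots.txt file like this:

User-agent: *
Disallow: /private/

Any path listed under Disallow is off-limits to compliant crawlers, so a Crawljob's URL patterns should never be used to work around such rules.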