Caterpillar is a PHP library for website crawling that can easily be adapted for screen scraping. It handles parallel requests using a modified version of Josh Fraser’s Rolling Curl library, which makes efficient use of PHP’s curl_multi_* functions. You can learn more about Josh and his current projects on his blog, Online Aspect.
Because requests are handled in parallel, each request that completes triggers enqueuing of any newly found URLs, so the fastest responses feed the queue first. This keeps the crawler running continuously and efficiently. Rolling Curl caps the number of simultaneous connections so that you do not inadvertently launch a denial-of-service (DoS) attack against the requested host.
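The rolling-window idea described above can be sketched roughly as follows. This is a minimal illustration built directly on PHP's curl_multi_* functions, not the code of Rolling Curl or Caterpillar; the function name rollingFetch(), its parameters, and the $discover callback are hypothetical names chosen for this example.

```php
<?php
/**
 * Fetch URLs with at most $maxConnections transfers open at once.
 * Hypothetical helper for illustration; not part of Caterpillar or Rolling Curl.
 *
 * $discover, if given, receives (string $url, string $body) and returns an
 * array of newly found URLs to enqueue.
 */
function rollingFetch(array $urls, int $maxConnections = 5, ?callable $discover = null): array
{
    $queue   = array_values(array_unique($urls));
    $seen    = array_fill_keys($queue, true); // every URL ever enqueued
    $results = [];
    $active  = 0;                             // transfers currently in flight
    $multi   = curl_multi_init();

    // Start queued URLs until the window is full or the queue is empty.
    $fill = function () use (&$queue, &$active, $multi, $maxConnections): void {
        while ($active < $maxConnections && $queue !== []) {
            $url    = array_shift($queue);
            $handle = curl_init($url);
            curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($handle, CURLOPT_TIMEOUT, 30);
            curl_setopt($handle, CURLOPT_PRIVATE, $url); // remember the requested URL
            curl_multi_add_handle($multi, $handle);
            $active++;
        }
    };

    $fill();
    while ($active > 0) {
        curl_multi_exec($multi, $stillRunning);

        // Handle every transfer that has finished since the last pass.
        while (($info = curl_multi_info_read($multi)) !== false) {
            $handle = $info['handle'];
            $url    = curl_getinfo($handle, CURLINFO_PRIVATE);
            $body   = (string) curl_multi_getcontent($handle);
            $results[$url] = [
                'status' => curl_getinfo($handle, CURLINFO_HTTP_CODE),
                'body'   => $body,
            ];
            curl_multi_remove_handle($multi, $handle);
            $active--;

            // A finished request may contribute new URLs to the queue.
            if ($discover !== null) {
                foreach ($discover($url, $body) as $found) {
                    if (!isset($seen[$found])) {
                        $seen[$found] = true;
                        $queue[]      = $found;
                    }
                }
            }
            $fill(); // reuse the freed slot immediately
        }

        // Wait for socket activity; back off briefly if select is unavailable.
        if ($active > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(10000);
        }
    }

    return $results;
}
```

Refilling the window as soon as any single transfer finishes, rather than waiting for a whole batch, is what keeps the number of open connections steady at the cap. A call such as rollingFetch(['http://www.url-to-crawl.org'], 5, $discover) would return an array keyed by requested URL, where $discover is whatever link-extraction callback you supply.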
- Import the caterpillar.sql file into the database of your choice.
- Copy the library into your application and include it.
- Modify the configuration file /caterpillar/inc/config.inc.php with your MySQL database login credentials.
- Your database user needs the privilege to create and drop TEMPORARY TABLES.
- Refer to the following for example instantiation and usage:
$caterpillar = new Caterpillar('http://www.url-to-crawl.org', $config['db_user'], $config['db_pass'], $config['db_name'], $config['db_host']);
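The instantiation above reads four keys from $config. A configuration file supplying them might look roughly like the fragment below. Only the key names come from the example; the array layout and all values are placeholder assumptions, so check the shipped config.inc.php for its actual structure.

```php
<?php
// Hypothetical shape of /caterpillar/inc/config.inc.php.
// Only the four key names are taken from the instantiation example;
// every value below is a placeholder to replace with your own credentials.
$config = [
    'db_user' => 'crawler',      // MySQL user with the temporary-table privilege
    'db_pass' => 'change-me',    // that user's password
    'db_name' => 'caterpillar',  // database that caterpillar.sql was imported into
    'db_host' => 'localhost',    // MySQL host
];
```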
This library was primarily created as a means of automatically generating your own Google XML Sitemaps. The crawler results stored in the database contain important information such as inbound links, a hash of the content to check for changes, the last modification date, and the URL itself. When you combine all of the stored data, you have the necessary information for generating a prioritized sitemap with fairly accurate modified dates. The crawler should never leave its starting domain.
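As a rough sketch of how the stored data could be combined into a prioritized sitemap, the function below turns crawl rows into sitemap XML. The function name buildSitemap(), the row keys (url, inbound_links, last_modified), and the rule of scaling priority by inbound link count are all assumptions made for this example; they do not reflect Caterpillar's actual schema or any method it provides.

```php
<?php
/**
 * Build sitemap XML from crawl rows.
 * Hypothetical helper: the row keys 'url', 'inbound_links' and
 * 'last_modified' are assumed names, not Caterpillar's real columns.
 */
function buildSitemap(array $rows): string
{
    // Largest inbound-link count, used to scale priorities (at least 1 to avoid dividing by zero).
    $maxInbound = 1;
    foreach ($rows as $row) {
        $maxInbound = max($maxInbound, (int) $row['inbound_links']);
    }

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

    foreach ($rows as $row) {
        // Assumed rule: priority proportional to inbound links, never below 0.1.
        $priority = max(0.1, round((int) $row['inbound_links'] / $maxInbound, 1));
        $lastmod  = (new DateTimeImmutable($row['last_modified']))->format('Y-m-d');

        $xml .= "  <url>\n";
        // Escape &, <, > and quotes so the URL is valid inside XML.
        $xml .= '    <loc>' . htmlspecialchars($row['url'], ENT_XML1 | ENT_QUOTES, 'UTF-8') . "</loc>\n";
        $xml .= '    <lastmod>' . $lastmod . "</lastmod>\n";
        $xml .= '    <priority>' . number_format($priority, 1, '.', '') . "</priority>\n";
        $xml .= "  </url>\n";
    }

    return $xml . "</urlset>\n";
}
```

Pages with more inbound links receive a higher priority here, which is only one plausible weighting; the content hash could likewise be used to update the modification date only when a page actually changes.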
It’s highly recommended that you review the resetIndex() method, as it performs database cleanup and garbage collection on broken or removed links. The current implementation assumes you run the crawler weekly, and it removes any URLs that were not found in the past two weeks.
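The two-week rule can be illustrated with a small in-memory sketch. This is not the body of resetIndex(), which works against the database; pruneStaleUrls() and its parameters are hypothetical names, and the array of last-seen dates stands in for whatever the library actually stores.

```php
<?php
/**
 * Keep only URLs seen within the last $maxAgeDays days.
 * Hypothetical helper for illustration; not Caterpillar's resetIndex().
 *
 * @param array<string, DateTimeImmutable> $lastSeen URL => time it was last found
 */
function pruneStaleUrls(array $lastSeen, DateTimeImmutable $now, int $maxAgeDays = 14): array
{
    // Anything last seen before this moment counts as broken or removed.
    $cutoff = $now->sub(new DateInterval('P' . $maxAgeDays . 'D'));

    // array_filter preserves keys, so the result is still URL => last-seen time.
    return array_filter($lastSeen, function (DateTimeImmutable $seen) use ($cutoff): bool {
        return $seen >= $cutoff;
    });
}
```

With a weekly crawl, a 14-day window means a URL has to be missed by two consecutive runs before it is dropped, so a single failed request does not remove a page from the index.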