Not really a support request...
• Attached is an example PHP cache warmer I made. It traverses the base XML sitemap (provided by the XMLsitemap module), then traverses each XML file for URLs and compares those URLs against the Boost cache files. If a URL has no Boost file, the script requests the page, which creates the Boost file; URLs that are already Boosted are skipped. The script runs until it reaches the maximum number of cache files it is allowed to create per run. It is meant to be run by cron.
• I think that traversing the XML sitemap to generate the Boost cache would be the way to go for an integrated crawler.
• I hope that this can help someone else and perhaps lend some ideas to a new D7 crawler. Please feel free to critique.

In reference to the attached PHP script:

  • note the variables up front, and that you need an XML sitemap
  • the script traverses the variable $base_sitemap that points to the sitemaps
  • traverses each sitemap for each URL
  • compares those URLs to Boost cache files located at $base_path
  • if no cache file exists, it makes a request to create it
  • limited to $max_updated files to Boost per run (currently 50)
  • meant to be run by cron
  • I detailed my production script with logging, but kept this example simpler with echoes (a rough sketch of the approach follows below)
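
The original attachment isn't reproduced in this thread, so here is a minimal standalone sketch of the approach described in the list above. It is illustrative only: it assumes Boost's default cache/normal/<host><path>_.html file layout, a base sitemap that is either a sitemap index or a flat urlset, and plain curl requests. The variable names mirror the notes above; the URL and paths are placeholders.

<?php
// Standalone Boost cache warmer sketch; meant to be run from cron, outside Drupal.
$base_sitemap = 'http://example.com/sitemap.xml'; // Base sitemap (index or urlset).
$base_path = '/var/www/html/cache/normal/';       // Boost cache root.
$max_updated = 50;                                // Max cache files to create per run.

$updated = 0;

// Collect sub-sitemaps from the index; fall back to the base sitemap itself
// if it is already a flat urlset.
$sitemaps = array();
$index = simplexml_load_file($base_sitemap);
if ($index === FALSE) {
  exit("Could not load sitemap: $base_sitemap\n");
}
foreach ($index->sitemap as $entry) {
  $sitemaps[] = (string) $entry->loc;
}
if (empty($sitemaps)) {
  $sitemaps[] = $base_sitemap;
}

foreach ($sitemaps as $sitemap) {
  $urlset = simplexml_load_file($sitemap);
  if ($urlset === FALSE) {
    continue;
  }
  foreach ($urlset->url as $entry) {
    if ($updated >= $max_updated) {
      echo "Reached limit of $max_updated pages.\n";
      exit;
    }
    $url = (string) $entry->loc;
    // Map the URL onto Boost's cache file name, e.g.
    // http://example.com/node/1 -> cache/normal/example.com/node/1_.html
    $cache_file = $base_path . preg_replace('#^https?://#', '', $url) . '_.html';
    if (file_exists($cache_file)) {
      continue; // Already Boosted, skip.
    }
    // Request the page so Boost writes the cache file.
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, 'boost-cache-warmer');
    curl_exec($ch);
    curl_close($ch);
    $updated++;
    echo "Warmed: $url\n";
  }
}
echo "Done, $updated page(s) warmed.\n";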

Comments

Anonymous

Quite frankly, the other threads about a crawler in 7.x that you've already referenced, like #1785292: Cron Crawler Not Running, cover this: cron plus wget or sed against the sitemap, external to Boost, gives better results.

The general scenario considered is a large site with many archived documents. Running everything through Boost to crawl it is going to use a lot of resources and probably time out cron; then, when an anonymous visitor hits one of the older pages, it will be such a rare event that the cached copy will already have expired and been removed, so the page is served from the database again. On a popular site the pages are always going to be fresh through continual regeneration, whether by the current (IMHO misnamed) crawler, by pages being edited, or just by anonymous visitors, so the load is going to be rather low. An external process on a cron job can generate the archived pages over a staggered period of time and split the site into sections if really necessary.

BeatnikDude

Status: Active » Closed (works as designed)

I suppose, but I needed to generate these pages up front. I was just dropping this out there for reference, sorry if I missed the proper thread.

RAWDESK

Very inventive and creative approach you've applied here, based on an XMLsitemap-driven site index!

I have been digging further into this, trying to optimize the parallel HTTP requests you are firing at your own website.
As predicted by Philip, I bounced against an insufficient-resources issue after approx. 20 requests processed via your curl approach.

From a previous project, where I had to post data in batches to a fulfilment REST service, I got familiar with this module:
https://www.drupal.org/project/background_process
It lets you queue internal processes or HTTP requests in predefined bundles of a certain size.
The module then takes care of the further handling in the background, more or less comparable to VBO (Views Bulk Operations).

(Screenshot attached: Schermafbeelding 2017-02-21 om 12.10.32.png)
The cron job itself took approx. 25 seconds to run, while the background HTTP request processing went on for about 2 minutes for 50+ pages.

So here's the recipe for my Boost Crawler, driven by Ultimate Cron and Background Process.

Ultimate Cron callback function:

/**
 * Ultimate Cron callback: warm the Boost cache from the XML sitemap.
 */
function _boost_cache_XML_sitemap_warmer($job) {
  global $base_url;
  $base_path = 'cache/normal/';               // Path to the Boost cache root.
  $base_sitemap = $base_url . '/sitemap.xml'; // Starting sitemap.
  $max_updated = 100;                         // Max pages to create cache for per run (could probably be removed).

  $index_reviewed = 0;
  $index_updated = 0;
  $urls = array();

  // Load the sitemap and convert it to a nested array.
  $xml = json_decode(json_encode((array) simplexml_load_file($base_sitemap, 'SimpleXMLElement')), TRUE);

  foreach ($xml['url'] as $url_list) {
    // Count of sitemap entries reviewed (informational).
    $index_reviewed++;
    if ($index_updated >= $max_updated) {
      break;
    }
    $url = $url_list['loc'];
    // Map the URL onto its Boost cache file, e.g.
    // http://example.com/node/1 -> cache/normal/example.com/node/1_.html
    $temp_file = $base_path . substr($url, 7) . '_.html';
    if (!file_exists($temp_file)) {
      // Queue the request; it is processed in the background below.
      $urls[] = background_process_http_request($url, array('postpone' => TRUE));
      // Count of URLs updated.
      $index_updated++;
    }
  }

  // Process the queued requests, 5 at a time, in the background.
  background_process_http_request_process($urls, array('limit' => 5));

  if ($index_updated > 0) {
    $email = variable_get('site_mail');
    $subject = 'Boost cache updated pages for ' . variable_get('site_name');
    $mail_content = 'Number of regenerated pages = ' . $index_updated;
    // Custom notification helper (not shown here).
    _notify_organizer_by_mail($subject, $mail_content, $email);
  }
}
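
For completeness, the callback above still has to be registered with Ultimate Cron. A minimal sketch, assuming Ultimate Cron 7.x-2.x's hook_cronapi() and a hypothetical custom module named mymodule; the schedule itself can then be configured per job in the Ultimate Cron UI:

/**
 * Implements hook_cronapi().
 */
function mymodule_cronapi() {
  $items['mymodule_boost_sitemap_warmer'] = array(
    'title' => 'Boost cache warmer driven by the XML sitemap',
    'callback' => '_boost_cache_XML_sitemap_warmer',
  );
  return $items;
}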

Thanks for your useful contribution!