Not really a support request...
• Attached is an example PHP cache warmer I made. It traverses the base XML sitemap (provided by the XMLsitemap module), then traverses each XML file for URLs and compares those URLs to the Boost cache files. If no Boost file exists for a URL, it requests the page, which creates the Boost file; URLs that are already Boosted are skipped. The script runs until it reaches the maximum number of cache files designated per run. It's meant to be run by cron.
• I think that traversing the XMLsitemap to generate the Boost cache would be the way to go for an integrated crawler.
• I hope that this can help someone else and perhaps lend some ideas to a new D7 crawler. Please feel free to critique.
In reference to the attached PHP script:
- note the variables up front, and that you need an XML sitemap
- the script traverses the variable $base_sitemap, which points to the sitemaps
- it traverses each sitemap for each URL
- it compares these URLs to the Boost cache files located at $base_path
- if no cache file exists, it makes a request to create one
- limited to $max_updated files to Boost per run (currently 50)
- meant to be run by cron
- my production script has detailed logging; I kept this example simpler, with echoes
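The steps above can be sketched roughly as follows. This is an illustrative sketch, not the attached script: the variable names mirror the list above, but the Boost cache-file layout assumed by `boost_cache_file()` is a guess and may differ from your Boost configuration.

```php
<?php
// Minimal sketch of the cache-warming loop described above.
// $base_sitemap, $base_path and $max_updated mirror the variables
// named in the list; the cache-file naming is an assumption.

$base_sitemap = 'http://example.com/sitemap.xml'; // XMLsitemap index
$base_path    = 'cache/normal/example.com';       // Boost cache root
$max_updated  = 50;                               // files to warm per run

// Map a page URL to the Boost cache file it should produce (assumed layout).
function boost_cache_file($url, $base_path) {
  $path = parse_url($url, PHP_URL_PATH);
  if ($path === null || $path === false || $path === '' || $path === '/') {
    $path = '/index';
  }
  return $base_path . rtrim($path, '/') . '_.html';
}

// Traverse the sitemap index, then each child sitemap, warming at most
// $max_updated missing pages per run.
function warm_boost_cache($base_sitemap, $base_path, $max_updated) {
  $warmed = 0;
  $index = simplexml_load_file($base_sitemap);
  foreach ($index->sitemap as $entry) {
    $sitemap = simplexml_load_file((string) $entry->loc);
    foreach ($sitemap->url as $url) {
      $loc = (string) $url->loc;
      if (file_exists(boost_cache_file($loc, $base_path))) {
        continue; // already Boosted, skip
      }
      file_get_contents($loc); // anonymous GET; Boost writes the cache file
      echo "warmed: $loc\n";
      if (++$warmed >= $max_updated) {
        return $warmed; // per-run limit reached
      }
    }
  }
  return $warmed;
}
```

A cron entry would then just invoke `warm_boost_cache($base_sitemap, $base_path, $max_updated);` once per run.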
Comment | File | Size | Author |
---|---|---|---|
#3 | Schermafbeelding 2017-02-21 om 12.10.32.png | 48.97 KB | RAWDESK |
 | BOOST_CACHE_WARMER_example.php_.txt | 2.45 KB | BeatnikDude |
Comments
Comment #1
Anonymous (not verified) CreditAttribution: Anonymous commented

Quite frankly, you've already referenced the other threads to do with a crawler in 7.x, like #1785292: Cron Crawler Not Running, and cron with wget or sed plus a sitemap external to Boost gives better results.
The general scenario to consider is a large site with many archived documents. Crawling everything through Boost is going to use a lot of resources and will probably time out cron. Then, when an anonymous visitor reaches one of the older pages, that visit is such a rare event that the page will already have expired from the cache, been removed, and must be taken from the database again. On a popular site, the pages are always going to be fresh through a continual regeneration process, whether that be the current (IMHO misnamed) Crawler and pages being edited, or just anonymous visitors, so the load is going to be rather low. An external process on a cron job can generate the archived pages over a staggered period of time, splitting the site into sections if really necessary.
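The staggered, sectioned approach suggested here could be sketched like this in PHP. All function and file names are illustrative, not from any attached script: the idea is simply to split the full URL list into sections and warm one section per cron run, tracking the current run in a small state file.

```php
<?php
// Illustrative sketch of staggered warming: one section of the site
// per cron run instead of the whole site at once. Names are hypothetical.

// Split $urls into $sections roughly equal chunks and return the chunk
// for this $run; run numbers wrap around, so every section gets visited.
function section_for_run(array $urls, $run, $sections) {
  $chunks = array_chunk($urls, (int) ceil(count($urls) / $sections));
  return $chunks[$run % count($chunks)];
}

// One cron invocation: read the run counter, warm that section, advance.
function warm_next_section(array $urls, $sections, $state_file) {
  $run = is_file($state_file) ? (int) file_get_contents($state_file) : 0;
  foreach (section_for_run($urls, $run, $sections) as $url) {
    file_get_contents($url); // anonymous GET creates the Boost file
  }
  file_put_contents($state_file, (string) ($run + 1));
}
```

Each cron run then touches only `count($urls) / $sections` pages, keeping the run well under the PHP/cron time limit.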
Comment #2
BeatnikDude CreditAttribution: BeatnikDude commented

I suppose, but I needed to generate these pages up front. I was just putting this out there for reference; sorry if I missed the proper thread.
Comment #3
RAWDESK CreditAttribution: RAWDESK for Colruyt Group Services commented

Very inventive and creative approach you've applied here, based on an XMLsitemap-driven site index!
I have been digging further into this, trying to optimize the parallel HTTP requests you are firing at your own website.
As predicted by Philip, I ran into an insufficient-resources issue after approx. 20 requests processed via your curl approach.
From a previous project, where I had to post data in batches to a fulfillment REST service, I was already familiar with this module:
https://www.drupal.org/project/background_process
It allows you to queue internal processes or HTTP requests in predefined bundles of a certain size.
The module then takes care of the further handling in the background, more or less comparable to VBO (Views Bulk Operations).
The cron job itself took approx. 25 seconds to run, while the background HTTP request processing went on for about 2 minutes for 50+ pages.
So here's the recipe for my Boost crawler, driven by Ultimate Cron and Background Process.
Ultimate Cron callback function:
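The callback code itself did not survive in this copy of the thread. As a hedged reconstruction of what such a callback might look like: it assumes Drupal 7, the Background Process module's `background_process_start()`, and Ultimate Cron invoking the callback; `mymodule_missing_boost_urls()` is a hypothetical helper that lists pages still missing a Boost file, not a real API.

```php
<?php
// Hypothetical reconstruction -- the original callback is not shown in
// this thread. Assumes Drupal 7, the Background Process module's
// background_process_start(), and Ultimate Cron calling this function.
// mymodule_missing_boost_urls() is an illustrative helper name.

function mymodule_boost_crawler_cron($job) {
  // Collect up to 50 URLs that have no Boost cache file yet.
  foreach (mymodule_missing_boost_urls(50) as $url) {
    // Hand each request off to a background process, so the cron run
    // itself returns in seconds while warming continues for minutes.
    background_process_start('mymodule_warm_url', $url);
  }
}

function mymodule_warm_url($url) {
  drupal_http_request($url); // anonymous GET; Boost writes the cache file
}
```

This matches the timings reported above: the cron job only queues the requests (~25 seconds), while the background processing of the actual HTTP requests continues for about 2 minutes.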
Thanks for your useful contribution !