I want to rebuild the cache manually whenever I clear it for any of my developments. To do this, I want to write a program that browses all the required nodes. The problem is that my website requires credentials to log in. I tried using drush and curl (the command is: drush -u 1 php-script myfile.php), but it looks like curl starts a new session every time. So I tried writing the page-download script with curl, passing the user ID/password and a cookie, but it's not working: it redirects to the login screen. Any help, please?
I wrote the following code:
<?php
// $url = page to POST the data to
// $ref_url = tell the server which page you came from (spoofing)
// $data = URL-encoded POST body (e.g. "name=...&pass=...")
// $login = 'true' will make a clean cookie-file.
// $proxy = proxy data
// $proxystatus = do you use a proxy? 'true'/'false'
function curl_grab_page($url,$ref_url,$data,$login,$proxy,$proxystatus){
if($login == 'true') {
$fp = fopen("/tmp/cookie.txt", "w");
fclose($fp);
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/cookie.txt");
// curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
if ($proxystatus == 'true') {
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
}
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, $ref_url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
// curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_POST, TRUE);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
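// CURLOPT_RETURNTRANSFER is set, so curl_exec() returns the response as a string.
// Close the handle before returning; code placed after a return never runs.
$result = curl_exec($ch);
curl_close($ch);
unset($ch);
return $result;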
}
echo curl_grab_page("http://mywebsiteaddress.com/signon?destination=node%2F761096", "http://mywebsiteaddress.com/signon", "name=myusername&pass=mypassword&op=Log%20in", "true", "null", "false");
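One thing worth checking, offered as a guess rather than a confirmed diagnosis: a stock Drupal login form also submits a hidden form_id field (user_login on the full-page login form), and a POST without it is generally not processed as a form submission, which can look like a bounce back to the login screen. The /signon path suggests a customised login page, so the real hidden fields may differ and should be read from the page's HTML source. The helper name build_login_fields below is hypothetical.

```php
<?php
// Hypothetical helper: builds the POST body for a Drupal-style login form.
// Assumption: the form carries a hidden "form_id" field ("user_login" on a
// stock Drupal login page). Verify against the HTML of the real sign-on page.
function build_login_fields($name, $pass, $form_id = 'user_login') {
  $fields = array(
    'name'    => $name,
    'pass'    => $pass,
    'form_id' => $form_id,
    'op'      => 'Log in',
  );
  // Pass '&' explicitly so the result does not depend on php.ini settings.
  return http_build_query($fields, '', '&');
}

// The resulting string can be passed as $data to curl_grab_page().
$data = build_login_fields('myusername', 'mypassword');
```

Building the body with http_build_query() also takes care of URL-encoding passwords that contain characters such as & or =.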
Keywords to search this topic: crawler, curl crawler, accessing password protected website using curl, crawling, cache builder, crawl
Comments
Additional info
The idea is to run this script after every cache-flush command so that users get better performance from the portal. For now I will run the script using drush, but later I will write a module to build the cache.
Any help in writing this crawler, please?
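As a rough sketch of the "browse all required nodes" step: once a login request has written the session cookie to /tmp/cookie.txt, each node page can be requested with a plain GET that reuses the same cookie file. The function names (build_node_urls, fetch_with_cookies, warm_urls) and the node/NID URL pattern are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper: turn a list of node IDs into absolute node URLs.
function build_node_urls($base_url, array $nids) {
  $urls = array();
  foreach ($nids as $nid) {
    $urls[] = rtrim($base_url, '/') . '/node/' . (int) $nid;
  }
  return $urls;
}

// GET one page, sending the cookies saved at login, and return the HTTP status.
function fetch_with_cookies($url, $cookie_file) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // read the session cookie
  curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);  // keep it up to date
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt($ch, CURLOPT_TIMEOUT, 40);
  curl_exec($ch);
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  curl_close($ch);
  return $status;
}

// Visit every URL once. $fetch is any callable that takes a URL and returns
// a status code, so the loop can be tried out without touching the network.
function warm_urls(array $urls, $fetch) {
  $results = array();
  foreach ($urls as $url) {
    $results[$url] = call_user_func($fetch, $url);
  }
  return $results;
}

// Usage sketch (the node ID is a placeholder):
// $urls = build_node_urls('http://mywebsiteaddress.com', array(761096));
// $statuses = warm_urls($urls, function ($url) {
//   return fetch_with_cookies($url, '/tmp/cookie.txt');
// });
```

A status other than 200 in the result (for example 403, or a 302 to the sign-on page) would suggest that the session cookie was not accepted for that request.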
_
Please don't post duplicate threads, I've deleted the dupe(s). Thanks.
Here is the solution
Cache warmer
I used the code from LetUsBePrecise on my Drupal site. I use authcache with cacherouter. When I log in through the crawler code with a username and password, the code runs. But afterwards, when I log in to my Drupal site with the same user ID and password, the pages are not cached. That's disappointing.
Furthermore, I should explain that I don't use CURLOPT_FOLLOWLOCATION because my site runs on a shared host with open_basedir set, so this cURL option cannot be enabled on my site. Instead I used a workaround that I found at http://www.bin-co.com/php/scripts/load/ (see the article from Itzco).
Who can help me figure this out? In the end, the solution could be very useful for many Drupal users who use authcache.
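For readers in the same situation, here is a minimal sketch of following redirects by hand when CURLOPT_FOLLOWLOCATION is unavailable. It is not the bin-co.com script; the function names are hypothetical, and it assumes the server sends absolute URLs in the Location header.

```php
<?php
// Return the Location header value from a raw header block, or NULL if absent.
function extract_location($raw_headers) {
  if (preg_match('/^Location:\s*(\S+)/mi', $raw_headers, $m)) {
    return $m[1];
  }
  return NULL;
}

// One request without automatic redirects; returns status, headers and body.
function curl_fetch($url, $cookie_file) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
  curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt($ch, CURLOPT_HEADER, TRUE); // include headers in the output
  $raw = curl_exec($ch);
  if ($raw === FALSE) {
    $raw = '';
  }
  $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
  curl_close($ch);
  return array(
    'status'  => $status,
    'headers' => substr($raw, 0, $header_size),
    'body'    => (string) substr($raw, $header_size),
  );
}

// Follow 3xx responses manually, up to $max_redirects hops.
// $fetch is a callable: URL in, array('status', 'headers', 'body') out.
function follow_redirects($url, $fetch, $max_redirects = 5) {
  for ($i = 0; $i <= $max_redirects; $i++) {
    $response = call_user_func($fetch, $url);
    $location = extract_location($response['headers']);
    $is_redirect = $response['status'] >= 300 && $response['status'] < 400;
    if (!$is_redirect || $location === NULL) {
      return $response;
    }
    $url = $location; // assumption: Location is an absolute URL
  }
  return $response; // redirect limit reached; return the last response
}

// Usage sketch:
// $page = follow_redirects('http://mywebsiteaddress.com/node/761096',
//   function ($url) { return curl_fetch($url, '/tmp/cookie.txt'); });
```

Because every hop goes through the same cookie file, a session cookie set during a redirect is sent on the next request.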
Kind regards,
Jan Walhof