Crawling a website which requires credentials to login

By LetUsBePrecise on 7 Jul 2011 at 06:10 UTC

I want to rebuild the cache manually whenever I clear it during development. To do this, I want to write a program that visits all the required nodes. The problem is that my website requires credentials to log in. I tried using drush with curl (the command is drush -u 1 php-script myfile.php), but it looks like curl starts a new session every time. So I tried writing the page-download script with curl, passing the user ID, password, and a cookie, but it's not working: it redirects to the login screen. Any help, please?

I wrote the following code:

<?php

// $url         = page to POST data to
// $ref_url     = referer sent to the server (spoofing the page you came from)
// $data        = URL-encoded POST body
// $login       = TRUE starts with a clean cookie file
// $proxy       = proxy address
// $proxystatus = TRUE to send the request through $proxy

function curl_grab_page($url, $ref_url, $data, $login, $proxy, $proxystatus) {
  if ($login) {
    // Truncate the cookie file so the login starts a fresh session.
    $fp = fopen("/tmp/cookie.txt", "w");
    fclose($fp);
  }
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/cookie.txt");
  curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/cookie.txt");
  // curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
  curl_setopt($ch, CURLOPT_TIMEOUT, 40);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  if ($proxystatus) {
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, $proxy);
  }
  curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);

  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_REFERER, $ref_url);

  curl_setopt($ch, CURLOPT_HEADER, TRUE);
  // curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
  curl_setopt($ch, CURLOPT_POST, TRUE);
  curl_setopt($ch, CURLOPT_POSTFIELDS, $data);

  // Run the request first, then clean up. Returning straight from
  // curl_exec() would leave the cleanup below unreachable.
  $result = curl_exec($ch);
  curl_close($ch);
  return $result;
}

echo curl_grab_page("http://mywebsiteaddress.com/signon?destination=node%2F761096", "http://mywebsiteaddress.com/signon", "name=myusername&pass=mypassword&op=Log%20in", TRUE, NULL, FALSE);

Keywords to search this topic: crawler, curl crawler, accessing password protected website using curl, crawling, cache builder, crawl

Comments

Additional info

LetUsBePrecise commented 8 July 2011 at 18:54

The idea is to run this script after every cache flush so that users get better performance from the portal. For now I will run the script using drush, but later I will write a module to build the cache.

Any help in writing this crawler please?

_

WorldFallz commented 7 July 2011 at 16:40

Please don't post duplicate threads, I've deleted the dupe(s). Thanks.

Here is the solution

LetUsBePrecise commented 15 July 2011 at 16:43

<?php
/* This script builds the cache by crawling the required nodes. */

include_once './includes/bootstrap.inc';
include_once './includes/common.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);
$j = 0;
$base_url = 'http://my.websiteaddress.com/';

if (user_access('administer nodes')) {
  // Open the file that will hold all URLs, one "nid,url" pair per line.
  $fh_csvfilename = fopen('/tmp/node.txt', 'w');

  // SQL to get only the required nodes.
  // This SQL is for my environment. Change it as per your requirement.
  // Table names are wrapped in {} so Drupal can apply a table prefix.
  $sqltext = "SELECT CONCAT(c.nid, ',', '%s', 'node/', c.nid) AS url FROM {node_counter} c, {node} n WHERE c.totalcount > 10 AND c.nid = n.nid";
  $sql = db_query($sqltext, $base_url);

  // Fetch all records of the SQL statement.
  while ($n = db_fetch_object($sql)) {
    $j++;
    fwrite($fh_csvfilename, $n->url . "\n");
  }
  fclose($fh_csvfilename);

  // Read all URLs from /tmp/node.txt and fetch each one with curl.
  // This runs only when the URL list was actually written.
  download_data_from_url('/tmp/node.txt');
}

// Print the number of nodes processed.
print $j . ' nodes.';
print "\n";

function download_data_from_url($csvfilenamefull) {
  // Read the URL list, dropping line endings and blank lines.
  $allurls = file($csvfilenamefull, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  $crl = curl_init();
  $login_url = "http://my.websiteaddress.com/user/login";
  $cookie = "/tmp/cookie.txt";
  // Remove the cookie, because we can't reuse old cookie files.
  if (file_exists($cookie)) {
    unlink($cookie);
  }

  curl_setopt($crl, CURLOPT_URL, $login_url);
  curl_setopt($crl, CURLOPT_COOKIEFILE, $cookie);
  curl_setopt($crl, CURLOPT_COOKIEJAR, $cookie);
  curl_setopt($crl, CURLOPT_FOLLOWLOCATION, 1);
  curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($crl, CURLOPT_POST, 1);

  // This array holds the login form's field names and values.
  $postdata = array(
    "name" => "myusername",
    "pass" => "mypassword",
    "form_id" => "user_login",
    "op" => "Log in"
  );
  // Tell curl to send $postdata as the POST data.
  curl_setopt($crl, CURLOPT_POSTFIELDS, $postdata);
  $result = curl_exec($crl);
  $headers = curl_getinfo($crl);
  // A successful login redirects away from the login page, so still
  // being on $login_url means the login failed.
  if ($result === FALSE || $headers['url'] == $login_url) {
    die("Cannot login.");
  }
  else {
    echo "We are in.";
    echo "\n";
  }

  // Switch back to GET; otherwise every page request below would
  // re-send the login form as a POST.
  curl_setopt($crl, CURLOPT_HTTPGET, 1);

  // Node data will be saved in this location.
  $save_to = "/tmp/sk/";
  if (!is_dir($save_to)) {
    mkdir($save_to, 0700, TRUE);
  }

  $k = 0;

  foreach ($allurls as $line) {
    // Each line is "nid,url"; split on the first comma only.
    list($nid, $url) = explode(",", $line, 2);
    $fp = fopen($save_to . $nid . '.html', "w");

    curl_setopt($crl, CURLOPT_FILE, $fp);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, 60);
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_exec($crl);
    fclose($fp);
    $k++;
    echo $k . "." . $url . "\n";
  }
  curl_close($crl);
}
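
The script above stores one "nid,url" pair per line in /tmp/node.txt and splits each line again before fetching it. As a minimal sketch of that parsing step on its own, the helper below (parse_url_list is a hypothetical name, not part of the original script) skips blank or malformed lines and splits on the first comma only, so a URL that itself contains a comma is kept whole. The sample URLs are placeholders.

```php
<?php
// Hypothetical helper, not part of the original script: turns "nid,url"
// lines (the format written to /tmp/node.txt above) into nid => url.
function parse_url_list(array $lines) {
  $urls = array();
  foreach ($lines as $line) {
    $line = trim($line); // drop the trailing newline left by file()
    if ($line === '') {
      continue; // skip blank lines
    }
    // Split on the first comma only; the URL may contain commas itself.
    $parts = explode(',', $line, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[0])) {
      continue; // skip lines that are not "number,url"
    }
    $urls[(int) $parts[0]] = $parts[1];
  }
  return $urls;
}
```

It could be used as parse_url_list(file('/tmp/node.txt')) in place of the explode() call inside the loop.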

Cache warmer

janwalhof commented 5 August 2011 at 21:42

I used the code from LetUsBePrecise on my Drupal site. I use authcache with cacherouter. When the crawler code logs in with a username and password, it runs fine. But afterwards, when I log in to my Drupal site with the same user ID and password, the pages are not cached. That's disappointing.

I should also explain that I don't use CURLOPT_FOLLOWLOCATION, because my site runs on a shared host with open_basedir set, so this curl option cannot be enabled on my site. Instead I used a workaround that I found on http://www.bin-co.com/php/scripts/load/ (see the article from Itzco).

Who can help me figure this out? In the end the solution could be very useful for a lot of Drupal users who are using authcache.

Kind regards,
Jan Walhof
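
For readers hitting the same open_basedir restriction on CURLOPT_FOLLOWLOCATION, here is a rough sketch of following redirects by hand. It is not the workaround from the linked page: extract_location and fetch_following_redirects are hypothetical names, and the sketch assumes the server sends absolute URLs in its Location headers.

```php
<?php
// Hypothetical helper: return the last Location header found in a raw
// header block, or NULL when there is none.
function extract_location($rawHeaders) {
  if (preg_match_all('/^Location:\s*(.+?)\s*$/mi', $rawHeaders, $m)) {
    return end($m[1]);
  }
  return NULL;
}

// Hypothetical helper: fetch $url on an existing curl handle, following
// up to $maxRedirects redirects without CURLOPT_FOLLOWLOCATION.
// Returns the final response body, or FALSE on error or too many redirects.
function fetch_following_redirects($ch, $url, $maxRedirects = 5) {
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt($ch, CURLOPT_HEADER, TRUE); // headers are needed to see Location
  for ($i = 0; $i <= $maxRedirects; $i++) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    if ($response === FALSE) {
      return FALSE;
    }
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $location = extract_location(substr($response, 0, $headerSize));
    if ($code < 300 || $code >= 400 || $location === NULL) {
      return substr($response, $headerSize); // not a redirect: return the body
    }
    $url = $location; // assumption: the Location value is an absolute URL
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE); // follow the redirect with GET
  }
  return FALSE;
}
```

Because the same handle, and therefore the same cookie file, is reused for every hop, the session cookie set by the login response is sent along with the redirected request.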
