Hi,

I know that serving the same content from different paths isn't good for SEO.

Right now I have this in my .htaccess:
RewriteCond %{REQUEST_FILENAME} !\.(gif|png|jpg|jpeg|jfif|bmp|css|js|zip|ico)$ [NC]
RewriteCond %{HTTP_HOST} ^madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn1\.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn2\.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn3\.madfanboy\.com$ [NC]
RewriteRule ^(.*)$ http://www.madfanboy.com/$1 [L,R=301]

Now when I type cdn1.madfanboy.com, it redirects to madfanboy.com.
But http://cdn1.madfanboy.com/site/sites/default/files/images_save/deadspace... doesn't redirect anywhere. Is that correct?

Thanks

Files:
Comment | File | Size | Author | Test result
#91 | seo_redirection-1060358-91.patch | 16.38 KB | bangpound | PASSED: [MySQL] 81 pass(es)
#90 | seo_redirection-1060358-90.patch | 16.31 KB | bangpound | PASSED: [MySQL] 81 pass(es)
#84 | seo_redirection-1060358-84.patch | 16.26 KB | ianthomas_uk | PASSED: [MySQL] 68 pass(es)
#77 | seo_redirection-1060358-77.patch | 16.61 KB | Wim Leers | PASSED: [MySQL] 64 pass(es)
#75 | seo_redirection-1060358-73.patch | 5.06 KB | jemond | FAILED: [MySQL] Unable to apply patch
#71 | seo_redirection-1060358-69.patch | 4.82 KB | jemond | FAILED: [MySQL] Unable to apply patch
#68 | cdn_seo_d6.zip | 43.2 KB | jemond | -
#68 | cdn_seo_d7.zip | 42.7 KB | jemond | -
#57 | cdn_seo_sandbox_cleanup.patch | 8.57 KB | Wim Leers | FAILED: [MySQL] Unable to apply patch
#57 | cdn_seo.zip | 22.75 KB | Wim Leers | -

Comments

Wim Leers’s picture

Assigned:Unassigned» Wim Leers
Status:Active» Fixed

Correct. A CDN does not necessarily run Apache, so a .htaccess file would have no effect there, nor would the .htaccess be copied to it.

Hence, this won't work.

Many sites use Origin Pull CDNs that could theoretically trigger this problem. In practice, there are no SEO issues.

superfedya’s picture

RewriteCond %{REQUEST_FILENAME} !\.(gif|png|jpg|jpeg|jfif|bmp|css|js|zip|ico)$ [NC]
RewriteCond %{HTTP_HOST} ^madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn1\.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn2\.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn3\.madfanboy\.com$ [NC]
RewriteRule ^(.*)$ http://www.madfanboy.com/$1 [L,R=301]

These rules aren't correct, because http://cdn2.madfanboy.com/site/sites/default/files/imagecache/small/pict... redirects to http://www.madfanboy.com. How can I fix that?

Wim Leers’s picture

Is cdn2.madfanboy.com an actual CDN or is it just your own file server?

superfedya’s picture

Actual CDN.

Wim Leers’s picture

So you're saying your CDN supports .htaccess files? What CDN is that then?

superfedya’s picture

cdn2.madfanboy.com is just a redirect to madfanboy.com.

Wim Leers’s picture

Then why do you say it is an actual CDN? Then it's NOT an actual CDN.

superfedya’s picture

Sorry, I don't know these things very well. So, if I understand correctly, after setting up a CDN in Origin Pull mode I don't need to modify my .htaccess for better SEO? Right now all my content is accessible from 4 domains (madfanboy.com, cdn1.madfanboy.com...), and I don't think the duplication is good for SEO...

mikeytown2’s picture

@superfedya
same rules will apply
http://drupal.org/node/597178#comment-2735418
Do these work for you?

@Wim Leers
Submodule .htaccess generator for CDN?

Wim Leers’s picture

@mikeytown2: but a .htaccess only works for alternative domains (or subdomains) that are self-hosted through Apache HTTPd. It's useless when you use a CDN, e.g. CloudFront. How do people deal with this in general with Origin Pull CDNs? I can't find it. Dynamically serve a different robots.txt to the CDN crawler?

mikeytown2’s picture

@Wim Leers
superfedya is self-hosted; I'm sure other requests like this will come in. Will an origin pull cdn copy html?

Wim Leers’s picture

An Origin Pull CDN will indeed copy HTML. That's the crux of the problem.

mikeytown2’s picture

Depends on how the origin pull is set up. If you have it pull from cdn.example.com then the .htaccess rules will work. I think this should be the standard way of doing it.

I did notice that HTTP_VIA is getting set when doing an origin pull. We are using limelight. If this is some sort of standard then we could detect it this way, but I have a feeling this isn't ideal.

<?php
$_SERVER['HTTP_VIA'] = '1.1 sw.cds152.sea.llnw.net:8000 (EdgePrism/3.8.0.3), 1.1 cds246.sea.llnw.net:80 (EdgePrism/3.8.0.3)';
?>

Thinking about this more, .htaccess rules are not needed; one could do it in cdn_init. It's not as fast, but it would make life easier. Just issue a 301 with a Location header. To let on-the-fly generation such as imagecache through, do a simple query:

SELECT *
FROM menu_router
WHERE path LIKE 'sites/default/files/%'

Replace sites/default/files with the value of file_directory_path(). Or use hook_menu_alter, as this only gets run on menu rebuilds... Something to think about, because I'm creating a 404 handler for css/js aggregates, so auto-detecting what should be allowed through is the smarter option.

superfedya’s picture

@mikeytown2
Hmmm, it works :)

Thanks

Wim Leers’s picture

@mikeytown2: care to roll a patch? :) And I don't think having a subdomain that serves the CDN's request is very typical. In any case, I never do that, because it's unnecessary.

superfedya’s picture

@mikeytown2

Nope, doesn't work.

#Parallel Redirect
RewriteCond %{REQUEST_URI} !(^/(.*)(.js|.css)) [NC]
RewriteCond %{HTTP_ACCEPT} !(.*image.*|.*css.*|.*javascript.*) [NC]
RewriteCond %{HTTP_HOST} ^(cdn1.example.com|cdn2.example.com|cdn3.example.com)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

I tried this code and some users got this problem:
http://www.webpagetest.org/results/11/02/19/MY/YK7/1_screen.jpg

I removed it and everything went back to normal:
http://www.webpagetest.org/result/110219_EY_YKA/1/screen_shot/

Any other solution?

mikeytown2’s picture

I think cdn_init is the best way to handle this; I should be able to program around the issues superfedya is having. Expect a patch in a couple of days; will be busy with other issues. This is a concern of ours because we recently put a couple of our sites on limelight.

superfedya’s picture

Thank you mikeytown2!

superfedya’s picture

Any news?

Thanks

mikeytown2’s picture

Nope; busy over here at the moment.
http://drupal.org/project/advagg

superfedya’s picture

Because Google can lower a site's position in search results if it finds 4 different addresses for the same site :(

mikeytown2’s picture

Status:Fixed» Active

still not fixed... still working over here, I got this issue to take care of and then I'll jump on this one #1078060: CSS Embedded Images - Add in support for advagg's hooks

Wim Leers’s picture

Any news, Mike? :)

mikeytown2’s picture

same as before; still pounding out issue for advagg

jmesam’s picture

subscribe

basicmagic.net’s picture

subscribe

mikeytown2’s picture

Version:6.x-2.1» 6.x-2.x-dev

Looking at this in more depth now. For us we have http://media.example.com in the CDN mapping setting, and that comes back to us as http://source.example.com from the CDN provider. $_SERVER['HTTP_VIA'] is unreliable, so that's out. Looks like we would need another field where one can blacklist certain domains; this gets complicated when one considers things like my files_proxy module, which will forward requests to the correct host based on the contents of the files path. Thinking about this, the files_proxy module needs to run before this code, and that should solve that issue.

We would like to 301 the request to the root domain, but there is an issue with a multisite using the same CDN domain. 301-ing to the correct domain gets a lot worse when you have domain access on top of a multisite with over 1.1k domains to pick from (our setup). Thus I think a 404 is the safest way to do this.

So here is the proposal. Default is to Fast 404 if the request is for a blacklisted domain & the path is not in the menu_router table. Blacklist would be a text field, as there can be multiple domains. I'm thinking this might be a CDN submodule, or even a standalone module, as this can get complex fairly quickly. The more advanced setup is to 301; this will not work in our case, but it is perfectly valid for other use cases. 301 would have a mapping where requests to this domain go here; maybe looking at the referrer as well. If it doesn't know what to do, then 404. I'll probably steal a bunch of code from this patch to accomplish this: #1056578: Allow for a domain whitelist/blacklist setting.
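
A minimal sketch of the decision described in this proposal, written in Python rather than Drupal PHP so it can be run on its own. It is not the module's code. The function name, the argument names and the idea of modelling the menu_router lookup as a list of path prefixes are assumptions made for illustration, and unlike a plain string compare it lowercases host names before matching.

```python
def cdn_seo_decision(host, path, blacklist, files_dir, generated_prefixes):
    """Return "pass" or "404" for one request (illustrative only)."""
    # Files that a registered callback can build on demand (for example
    # imagecache derivatives or css/js aggregates) must reach that callback.
    if path.startswith(files_dir) and path.startswith(tuple(generated_prefixes)):
        return "pass"
    # Host names are case-insensitive, so compare them in lowercase.
    if host.lower() in {h.lower() for h in blacklist}:
        return "404"
    return "pass"


# Hypothetical configuration, mirroring the source.example.com setup above.
BLACKLIST = ["source.example.com"]
FILES_DIR = "sites/default/files"
GENERATED = ["sites/default/files/imagecache/"]

page = cdn_seo_decision("source.example.com", "node/1", BLACKLIST, FILES_DIR, GENERATED)
image = cdn_seo_decision(
    "source.example.com",
    "sites/default/files/imagecache/small/a.jpg",
    BLACKLIST, FILES_DIR, GENERATED,
)
normal = cdn_seo_decision("www.example.com", "node/1", BLACKLIST, FILES_DIR, GENERATED)
```

Static files that already exist on disk never reach this code at all, since the web server serves them directly; only requests that fall through to Drupal are affected.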

Wim Leers’s picture

That sounds about right. And clearly, since you needed 3 paragraphs to explain something that is deceptively simple at first sight, this can quickly become fairly complex. I'd be happy to include this as a submodule with the CDN module.

Finally, your initial solution ("we have http://media.example.com in the CDN mapping setting, and that comes back to us as http://source.example.com from the CDN provider.") should work in most (simple, single-site) set-ups, right? It'd be as simple as returning a 404 for any HTML document served through Drupal.

bibo’s picture

Subscribe.

I'm facing the same challenge. Although .htaccess rules would be a fast solution, I would prefer a built-in, reusable (and simple) way to make sure that my self-hosted "cdn" subdomains only return files, and that all other traffic is forwarded to the main site.

Glad to see you guys are working on this :)

mr.j’s picture

++

mikeytown2’s picture

Initial work on this. I have the blacklist working, & imagecache/advagg works as well. Need to set the module weight to be fairly heavy, create an admin section, get the whitelist working, integrate with domain access, etc...

<?php
/**
 * Default value to see if the CDN SEO module is enabled.
 */
define('CDN_SEO_ENABLED', TRUE);

/**
 * Defined value for the whitelist.
 */
define('CDN_SEO_WHITELIST', 1);

/**
 * Defined value for the blacklist.
 */
define('CDN_SEO_BLACKLIST', 2);

/**
 * Default value to see what mode to use.
 */
define('CDN_SEO_HOST_MODE', CDN_SEO_BLACKLIST);

/**
 * Implementation of hook_init().
 */
function cdn_seo_init() {
  // Exit early if this is disabled.
  if (!variable_get('cdn_seo_enabled', CDN_SEO_ENABLED)) {
    return;
  }

  // Get context.
  $host = empty($_SERVER['HTTP_HOST']) ? $_SERVER['SERVER_NAME'] : $_SERVER['HTTP_HOST'];
  $file_dir = file_directory_path();

  // Special handling for requests to the files dir.
  if (strpos($_GET['q'], $file_dir) === 0) {
    // If this path has a menu item then exit here and let the callback work it.
    $router_item = menu_get_item();
    if (!empty($router_item)) {
      return;
    }
  }

  // Get all CDN domains.
  $mode = variable_get('cdn_seo_host_mode', CDN_SEO_HOST_MODE);
  if ($mode == CDN_SEO_BLACKLIST) {
    $blacklisted_domains = variable_get('cdn_seo_blacklist', cdn_get_domains());
  }
  if ($mode == CDN_SEO_WHITELIST) {
    $whitelisted_domains = variable_get('cdn_seo_whitelist', '*');
  }

  // If this host is blacklisted then fast 404.
  if (!empty($blacklisted_domains)) {
    foreach ($blacklisted_domains as $bad_domain) {
      if (strcmp($bad_domain, $host) === 0) {
        cdn_seo_fast404('Bad domain');
      }
    }
  }
}

/**
 * Send out a fast 404 and exit.
 *
 * @param $msg
 *   send this message in the header.
 */
function cdn_seo_fast404($msg = '') {
  global $base_path;
  if (!headers_sent()) {
    header($_SERVER['SERVER_PROTOCOL'] . ' 404 Not Found');
    header('X-CDN-SEO: ' . $msg);
  }
  print '<html>';
  print '<head><title>404 Not Found</title></head>';
  print '<body><h1>Not Found</h1>';
  print '<p>The requested URL was not found on this server.</p>';
  print '<p><a href="' . $base_path . '">Home</a></p>';
  print '</body></html>';
  exit();
}
?>

mikeytown2’s picture

Status:Needs review» Active

module code is in a sandbox for now
http://drupal.org/sandbox/mikeytown2/1213552
Blacklist only.
snapshot download link

mikeytown2’s picture

Status:Active» Needs review
mr.j’s picture

Looking into this a bit more, I don't think any of this is going to work in our situation which is an externally hosted origin pull CDN (edgecast) which is set up to have full access to everything in our root domain. I did a few tests:

- www.site.com : works as normal
- cdn.site.com : shows the content from www (this is what we want to prevent)
- de.site.com : German subdomain using Domain Access. works as normal
- de.site.com/german-url : works as normal
- www.site.com/german-url : redirects to de.site.com/german-url (as expected, using domain redirect module)
- cdn.site.com/german-url : redirects to de.site.com/german-url. This suggests that the CDN is trying to access www.site.com/german-url and is being told to redirect by the domain redirect, just like a normal user.

This last step suggests that we won't be able to tell at the Drupal level that the request is coming through the CDN unless something like $_SERVER['HTTP_VIA'] is set, but you said before that it's unreliable.

Adding to our .htaccess to catch requests to the cdn subdomain doesn't work as it is externally hosted and the origin pull server is requesting stuff off the www.

For now we have canonical URL output switched on using the nodewords and domain_meta modules (need this patch: #1245660: Support for Canonical URLS) and this seems to be the best solution, as everything can be served off the CDN domain with the canonical URL in the html source pointing to the correct domain. Fingers crossed that this will point the search engines in the right direction and clean things up, as we have noticed pages on our cdn domain being indexed by google recently.
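
To make the canonical-URL idea concrete, here is a small standalone sketch in Python. It is not how nodewords or domain_meta build the tag; the function name and the way the canonical host is passed in are assumptions for illustration only. It keeps the path and query of the requested URL and swaps in the host the page should be indexed under.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(request_url, canonical_host):
    """Build a rel="canonical" tag that points at the preferred host."""
    parts = urlsplit(request_url)
    # Keep scheme, path and query; replace the host and drop any fragment.
    canonical = urlunsplit(
        (parts.scheme, canonical_host, parts.path, parts.query, "")
    )
    return '<link rel="canonical" href="%s" />' % escape(canonical, quote=True)


# A page fetched through the CDN host still advertises the real domain.
tag = canonical_link("http://cdn.site.com/german-url", "de.site.com")
print(tag)
```

The page is still reachable on the CDN host; the tag only tells crawlers that honour it which URL to index.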

mikeytown2’s picture

Status:Active» Needs review

@mr.j
I have a sandbox module that will prevent cdn.site.com from showing content. http://drupal.org/sandbox/mikeytown2/1213552

The code (cdn_seo.module) is fairly straightforward; just set cdn.site.com to be in the blacklist at admin/settings/cdn/seo

mr.j’s picture

Thanks, I took a look at it before. That code is using $_SERVER['HTTP_HOST'] or $_SERVER['SERVER_NAME'].

If I understand things correctly, using our origin pull CDN which we do not host, it will request the page from the regular www domain before caching. So when the hook_init code runs, I expect that either/both of those variables will be pointing to our www domain, not the cdn subdomain. Therefore it will not stop the request and the CDN will cache and return the page.

I tried a quick test (without using that module) by putting a few lines in .htaccess to return a 404 if a page was requested off the cdn subdomain. I then requested it and it returned fine instead of returning a 404, which suggests that my server has no knowledge that the request is coming from the CDN - at least not from the host part of the request.

mikeytown2’s picture

If both regular and cdn traffic hit your servers with the same host name it will always be very hard to pick the 2 apart. We use source.example.com and blacklist it.

mr.j’s picture

Aha the penny has dropped! Create a local cdn subdomain and instruct the origin pull CDN to use that instead. Then block anything we don't want on that one.

Thanks for the help.

philsward’s picture

I've pulled some code from a post on the old parallel project, where someone suggested using something like the following:

RewriteCond %{HTTP_HOST} ^cdn(1|2|3)\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(css|js|swf|png|gif|jpe?g|pdf|zip)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R]

I find it to be a lot cleaner than some of the other suggestions posted and I have it working fine on several of my sites.

Anyone care to shoot some bullets at it?
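
One way to shoot at it without touching a server is to model the two conditions in Python and feed them sample requests. This is only a sketch of what the rules above say on a single pass: the function name and the sample URLs are made up, it ignores query strings, and it does not model internal rewrites (for example a later rewrite to index.php), which can change what the conditions see.

```python
import re

# The same patterns as the two RewriteCond lines, with [NC] as IGNORECASE.
CDN_HOST = re.compile(r"^cdn(1|2|3)\.example\.com$", re.IGNORECASE)
STATIC_EXT = re.compile(r"\.(css|js|swf|png|gif|jpe?g|pdf|zip)$", re.IGNORECASE)


def redirect_target(host, request_uri):
    """Return the redirect URL, or None when the request is served as-is."""
    if CDN_HOST.search(host) and not STATIC_EXT.search(request_uri):
        return "http://www.example.com" + request_uri
    return None


samples = [
    ("cdn1.example.com", "/node/1"),
    ("cdn2.example.com", "/sites/default/files/logo.PNG"),
    ("www.example.com", "/node/1"),
]
for host, uri in samples:
    print(host, uri, "->", redirect_target(host, uri))
```

Anything on a cdnN host that does not end in one of the listed extensions is sent to www; everything else is left alone.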

philsward’s picture

Edit: Sorry, I didn't mean to post twice...

mikeytown2’s picture

@philsward
Mind testing out a php module that has the same goal?
http://drupal.org/sandbox/mikeytown2/1213552

bibo’s picture

@mikeytown2, I would like to test your module if it was for Drupal 7.

There isn't that much code, so I guess it would probably be somewhat easy to upgrade it for D7, right?

mikeytown2’s picture

Looking it over and the sandbox might be able to run in D7 with no code changes.

Wim Leers’s picture

Status:Needs review» Reviewed & tested by the community

Looks great! :) Thanks again, mikeytown2! I'm currently cleaning up the code, adding the whitelist functionality and extending the admin UI to support both black- and whitelist (with nice ctools-powered dependent form items in D6). I'll also provide the D7 port.

My reroll of cdn_seo should be posted later today.

bibo’s picture

> Looks great! :) Thanks again, mikeytown2! I'm currently cleaning up the code, adding the whitelist functionality and extending the admin UI to support both black- and whitelist (with nice ctools-powered dependent form items in D6). I'll also provide the D7 port.
>
> My reroll of cdn_seo should be posted later today.

Oh my!

This sums up my feelings :)

Wim Leers’s picture

@bibo :)

I was on the road for the better part of the day yesterday, which prevented me from working on this further. Today I have other work scheduled. I should be able to finish it this weekend.

bibo’s picture

@ Wim Leers
As said, I'm very glad to hear you're working on this. I'll be the first to try it out once it's done :)

Wim Leers’s picture

I've done the port of the Far Future expiration functionality to D7 this weekend, which was sponsored and had a very tight deadline. This has culminated in the 2.2 release for D6 and D7.

Hence the CDN SEO stuff has been delayed. I hope to work on that next weekend.

Wim Leers’s picture

When I started working on this again, I realized that this goes against the spirit of the CDN module: make everything work nicely out-of-the-box.

So I did a bit more research, and found these:
- http://www.seomoz.org/q/how-was-cdn-seomoz-org-configured
- http://blog.maxcdn.com/news/cdns-and-duplicate-content/
- http://www.seroundtable.com/cloud-cdn-seo-13665.html
- http://www.webmasterworld.com/google/4390245.htm

It sounds like simply serving a different robots.txt and using rel="canonical" will go a long way in solving this problem; even though you could still access the origin content.

Besides that, I think it actually makes sense to just rely on the User-Agent header (and possibly Via, too) to detect CDN requests. Far less set-up needed. Won't cover 100% at first, but it'd be easy enough to update. It wouldn't be failproof, but at least it would be foolproof.
For now, I'd look for the following case-insensitive substrings: "cloudfront", "akamai". I know they are substrings of those CDN's user agent strings. I couldn't find anything about the other CDNs' user agents.
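
A sketch of that substring check, in Python so it runs standalone. Only the two substrings come from the comment above; the function name and the sample User-Agent strings in the usage lines are invented for illustration and are not taken from any CDN's documentation.

```python
# Case-insensitive substrings that mark a request as coming from a CDN.
CDN_USER_AGENT_MARKERS = ("cloudfront", "akamai")


def looks_like_cdn_request(user_agent):
    """Return True when the User-Agent contains a known CDN marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in CDN_USER_AGENT_MARKERS)


# Made-up header values, for illustration only.
print(looks_like_cdn_request("Example CloudFront fetcher"))
print(looks_like_cdn_request("Mozilla/5.0 (X11; Linux x86_64)"))
```

Supporting another CDN would mean adding one more substring to the tuple, which is the "easy enough to update" part; a client can of course send any User-Agent it likes, so this is a convenience check and not a security measure.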

bibo’s picture

Glad to hear you're still/again working at this.

> When I started working on this again, I realized that this goes against the spirit of the CDN module: make everything work nicely out-of-the-box.

I'm not sure which part you're referring to? Some configuration is required any way.

> It sounds like simply serving a different robots.txt and using rel="canonical" will go a long way in solving this problem; even though you could still access the origin content.

I studied those links and learned new things, thanks! So, SEO-wise it would be sufficient, good. Still, changing robots.txt for CDNs while keeping the same file structure requires some server tweaking (not "out-of-the-box" functionality).

> Besides that, I think it actually makes sense to just rely on the User-Agent header (and possibly Via, too) to detect CDN requests.

What about our own pull CDNs? I'm currently not using Akamai or any of those.

To be honest I was hoping this would solve more than just SEO, such as potential performance and credibility issues. Not all crawling is done by search engines. I've been watching traffic on some sites (usually live HTTP traffic with varnishlog), and found that there are shitloads of traffic generated by bots and sniffers of different types. A single site may have several active crawlers 24/7. They can generate a lot of traffic, which might bypass site caching (Varnish, for example, treats the sites as separate). Several CDNs could generate 5-10 full bootstrap requests per second of totally useless traffic. On many sites this can create more server load than normal traffic (and yes, I've seen this at least partially).

Additionally, on some sites the crawlers could leave a footprint, making some links etc. refer to the CDN site instead of the real site. I'm not saying exactly this has happened, but it could. Also, I simply don't want people to be able to visit my page with the wrong URL, intentionally or not.

The most KISS-approach I can think of is that 301-redirect that was discussed before.

But you know what you're doing. I should concentrate on configuring the web servers so that only requests for a limited set of file types go through (and that imagecache and js/css generation work).

Wim Leers’s picture

> I simply don't want people to be able to visit my page with the wrong URL, intentionally or not.

Yes. But that's exactly why I'm proposing to detect CDN spiders through the User Agent header!

mikeytown2’s picture

Wim Leers’s picture

#53: The problem with the Via header is that it also gets set for proxies, so I think it's likely more parsing work.

What are your thoughts on #50? Do you consider it feasible etc? I'm still going to add the CDN SEO module for people who want full control, and have the capabilities to do so. It's a much tighter solution, for sure. I'm merely trying to make it work as well as possible out of the box.

mr.j’s picture

> Several CDNs could generate 5-10 full bootstrap requests per second of totally useless traffic. On many sites this can create more server load than normal traffic (and yes, I've seen this at least partially).
>
> ...
>
> The most KISS-approach I can think of is that 301-redirect that was discussed before.

This all depends on the bot. If it is a bad bot that is not set to remove a URL from its crawl list after encountering a 301, then doing a 301 will achieve nothing as the URL will still be crawlable and instead just redirect to the content from the main domain. So it won't save you anything on server resources. You could even say it would add more load as the crawler has to make 2 requests for every URL that gets a 301.

If on the other hand you have the canonical links set up then good bots will understand not to bother crawling the CDN domain in future.

So if you're really concerned about server load then you need to block access to the CDN domain for non static content, not redirect it.

mr.j’s picture

I forgot to add: if you want to serve a custom robots.txt for your origin pull CDN subdomain and your provider doesn't allow you to configure one (ours doesn't), you can do it using .htaccess.

Create a robots_cdn.txt and put it in the root directory with your robots.txt file. robots_cdn.txt should have the following content:

User-agent: *
Disallow: /

Then add this to .htaccess and anything that attempts to crawl static.yourdomain.com/robots.txt will be served the content of robots_cdn.txt instead.

# This attempts to serve a custom robots.txt to the CDN subdomain
RewriteCond %{HTTP_HOST} ^static\.yourdomain\.com$ [NC]
RewriteRule ^robots\.txt$ robots_cdn.txt [L,NC]

That should at least put a stop to good bots crawling your CDN subdomains.
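
The same host-based choice can be sketched in Python, together with a check that the CDN file really shuts out well-behaved crawlers. This is a standalone illustration and not a replacement for the rewrite rule: the mapping and function name are assumptions, while the host name and file names are the ones used above. The check uses the standard library's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Which robots file each host should receive; other hosts get the default.
ROBOTS_BY_HOST = {"static.yourdomain.com": "robots_cdn.txt"}

# The content suggested above for robots_cdn.txt.
ROBOTS_CDN_BODY = "User-agent: *\nDisallow: /\n"


def robots_file_for(host):
    """Pick the robots file to serve for the given Host header."""
    return ROBOTS_BY_HOST.get(host.lower(), "robots.txt")


def allows(robots_body, url):
    """Ask the stdlib parser whether a generic crawler may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch("*", url)


print(robots_file_for("static.yourdomain.com"))
print(allows(ROBOTS_CDN_BODY, "http://static.yourdomain.com/node/1"))
```

As the comment says, this only helps with crawlers that read and respect robots.txt.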

Wim Leers’s picture

Attachments: cdn_seo.zip (22.75 KB), cdn_seo_sandbox_cleanup.patch (8.57 KB; FAILED: [MySQL] Unable to apply patch cdn_seo_sandbox_cleanup.patch)

I asked Aaron Peters for advice. He's a WPO guru and has experience with CDNs: he maintains CDN Planet.

He wonders if this is really a problem. For his clients, he doesn't deal with this, as this is really a non-issue. E.g. for http://etsy.com, there's http://site dot etsystatic dot com. As long as there aren't any links to the CDN domain (which is why I wrote it down so strangely), you'll be fine.

He does understand why you'd want to prevent it from happening though, and in that case he agrees that the approach outlined by mikeytown2 is the best one.

So, in the general case, this is simply not important. That's why I think it makes most sense to make this a separate project: in the common case, you won't want or need this. As such, there's no point for me to continue the development of this functionality.

I've taken mikeytown2's code from his sandbox (see #33) and cleaned it up. I've attached a patch that applies to his sandbox. I've also attached the entire cdn_seo module as a .zip file. It's written for Drupal 6. It's 90% ready, it only needs the "whitelist" functionality (which is trivial to implement).

Wim Leers’s picture

Assigned:Wim Leers» Unassigned
Category:support» feature
Priority:Normal» Minor
Status:Reviewed & tested by the community» Needs work
mikeytown2’s picture

@Wim Leers
You want me to promote the sandbox to a full project with your patch applied?

Edit:
Wim, you have been given full permissions to http://drupal.org/sandbox/mikeytown2/1213552

Wim Leers’s picture

I didn't make myself clear. This is what *I* think is best. I'm still looking for feedback.

What are your thoughts, mikeytown2? And yours, mr.j and bibo?

mr.j’s picture

Our site currently 301s any non static content on the CDN subdomain to the proper domain. In addition we have the canonical link tag and the custom robots.txt! The reason for all this is that google indexed pages on our CDN subdomain several months ago somehow. We never linked to it to our knowledge, but anyway it would be trivial for someone else to link to it maliciously if it was a known search ranking attack strategy. Despite all these countermeasures, the pages are still indexed by google now. So avoiding the situation is key, otherwise it could be a long time before it is cleaned up and who knows if google penalises sites for getting into this situation.

Personally I prefer handling the 301 with .htaccess, as you can avoid processing anything using Drupal code, which is costly in comparison. After all, it's 3 lines in .htaccess vs another module. But I see no harm in having the module for those who can't or don't want to fiddle with .htaccess. But I do think that something should be there, and from my own experience I would recommend that it is on by default.

Wim Leers’s picture

I'd actually prefer a .htaccess generator over a module as well.

It's interesting to see that you *have* seen the adverse effects! Thanks for sharing!

fiftyz’s picture

I have the following setup:
http://www.example.com (default site address)
http://static1.example.com (fake cdn with the same webroot as http://www.example.com)
http://static2.example.com (fake cdn with the same webroot as http://www.example.com)

In .htaccess (which is common to all three sites) I have the following lines:

 
...
  # RewriteBase /

  RewriteCond %{HTTP_HOST} ^static2.example.com$ [NC]
  RewriteCond %{REQUEST_URI} !\.(png|gif|jpg|jpeg|js|css|swf)$ [NC]
  RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

  # Pass all requests not referring directly to files in the filesystem to
  # index.php. Clean URLs are handled in drupal_environment_initialize().
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteCond %{REQUEST_URI} !=/favicon.ico
  RewriteRule ^ index.php [L]
...

But it is not working as it should: after the request is rewritten to index.php, the [INTERNAL REDIRECT] pass runs the first check again and redirects to the main domain. Does anyone know how I could solve this?

The rewrite log looks like this:

XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] add path info postfix: /path/to/drupal/cdn -> /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js -> cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] applying pattern '(^|/)\.' to uri 'cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] add path info postfix: /path/to/drupal/cdn -> /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js -> cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] applying pattern '^' to uri 'cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (4) [perdir /path/to/drupal/] RewriteCond: input='static2.example.com' pattern='^example\.com$' [NC] => not-matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] add path info postfix: /path/to/drupal/cdn -> /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js -> cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] applying pattern '^(.*)$' to uri 'cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (4) [perdir /path/to/drupal/] RewriteCond: input='static2.example.com' pattern='^static2.example.com$' [NC] => matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (4) [perdir /path/to/drupal/] RewriteCond: input='/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js' pattern='!\.(png|gif|jpg|jpeg|js|css|swf)$' [NC] => not-matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] add path info postfix: /path/to/drupal/cdn -> /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js -> cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] applying pattern '^' to uri 'cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (4) [perdir /path/to/drupal/] RewriteCond: input='/path/to/drupal/cdn' pattern='!-f' => matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (4) [perdir /path/to/drupal/] RewriteCond: input='/path/to/drupal/cdn' pattern='!-d' => matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (4) [perdir /path/to/drupal/] RewriteCond: input='/cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js' pattern='!=/favicon.ico' => matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (2) [perdir /path/to/drupal/] rewrite 'cdn/farfuture/7GI2kv8RAC3p0649WErME8_-sEVfe0mgFUCZdJnzJ4c/drupal-cache:m1ic77/sites/example.com/files/js/js_0xSOf1oN6BsJgLtVvcyShl4BsQ4So6JMSkF4OVIYLYQ.js' -> 'index.php'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (3) [perdir /path/to/drupal/] add per-dir prefix: index.php -> /path/to/drupal/index.php
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (2) [perdir /path/to/drupal/] strip document_root prefix: /path/to/drupal/index.php -> /index.php
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722fb0cd8/initial] (1) [perdir /path/to/drupal/] internal redirect with /index.php [INTERNAL REDIRECT]
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/index.php -> index.php
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (3) [perdir /path/to/drupal/] applying pattern '(^|/)\.' to uri 'index.php'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/index.php -> index.php
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (3) [perdir /path/to/drupal/] applying pattern '^' to uri 'index.php'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (4) [perdir /path/to/drupal/] RewriteCond: input='static2.example.com' pattern='^example\.com$' [NC] => not-matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (3) [perdir /path/to/drupal/] strip per-dir prefix: /path/to/drupal/index.php -> index.php
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (3) [perdir /path/to/drupal/] applying pattern '^(.*)$' to uri 'index.php'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (4) [perdir /path/to/drupal/] RewriteCond: input='static2.example.com' pattern='^static2.example.com$' [NC] => matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (4) [perdir /path/to/drupal/] RewriteCond: input='/index.php' pattern='!\.(png|gif|jpg|jpeg|js|css|swf)$' [NC] => matched
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (2) [perdir /path/to/drupal/] rewrite 'index.php' -> 'http://www.example.com/index.php'
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (2) [perdir /path/to/drupal/] explicitly forcing redirect with http://www.example.com/index.php
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (1) [perdir /path/to/drupal/] escaping http://www.example.com/index.php for redirect
XX.XX.XXX.XX - - [27/Mar/2012:14:48:36 +0300] [static2.example.com/sid#7f8722a5e3a8][rid#7f8722f95ba0/initial/redir#1] (1) [perdir /path/to/drupal/] redirect to http://www.example.com/index.php [REDIRECT/301]

doubouil’s picture

My 2 cents on this issue.

Config :
www.[domain].net (main website)
cdn-img.[domain].net (img cdn)
cdn-js.[domain].net (js/json CDN)
cdn-css.[domain].net (css CDN)
cdn-files.[domain].net (other stuff CDN)
hotlink-img.[domain].net (beautiful URL for sharing img)

All are fake CDNs routed to the same base directory (hotlink-img points to sites/[site]/files).

In settings.php

<?php
$url_parts = explode(".", $_SERVER['HTTP_HOST']); // split the host name on dots
$test = explode("-", $url_parts[0]); // split the first subdomain label on dashes

if (in_array($test[0], array("cdn", "hotlink"))) { // if 1st part of subdomain is cdn or hotlink
  // make a fast 403 page
  drupal_add_http_header('Status', '403 Forbidden');
  print '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>403 Forbidden</title></head><body><h1>Not Allowed</h1><p>You cannot access website via CDN-url.</p></body></html>';
  exit();
}
?>

Any time a PHP page should be generated, it basically checks the subdomains to see if it's the website. Calling a direct file is not part of a PHP process so it renders the request as usual (file if present, 404 if not).

The 403 header prevents Google from indexing duplicate content, and the array("cdn","hotlink") still allows subdomains to work if they don't have their own settings.php.
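
For readers who want to test the host check outside Drupal, here is a minimal sketch of the same logic in Python. The function name `is_cdn_host` and the sample host names are hypothetical; the lower-casing is an addition that the PHP snippet above does not perform.

```python
def is_cdn_host(host, blocked_prefixes=("cdn", "hotlink")):
    """Return True when the first host label starts with a blocked prefix.

    Mirrors the settings.php snippet: split the host on dots, split the
    first label on dashes, and compare the first part with the list.
    """
    first_label = host.split(".")[0]
    prefix = first_label.split("-")[0]
    # Lower-casing is an assumption added here; host names are case-insensitive.
    return prefix.lower() in blocked_prefixes


if __name__ == "__main__":
    for host in ("www.example.net", "cdn-img.example.net", "hotlink-img.example.net"):
        print(host, "->", "403" if is_cdn_host(host) else "serve page")
```

Only requests that reach PHP would hit this check; files served directly by the web server bypass it, as described above.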

Also, setting a base_url in settings.php gave me a hint about that trick: I had a maintenance page which worked at cdn-js.[domain].net, but any time I tried a different page (e.g. cdn-js.[domain].net/user) it kept redirecting me to www.[domain].net/user.

Am I missing something, or is this a viable solution? I'm not convinced by the "code in settings.php" part; should it be in a hook_init/hook_boot instead?

philsward’s picture

My rewrite rules in comment #40 work fine for D6, but after upgrading to D7, I've run into issues...

If an image style has not been generated for a given page (i.e. a thumbnail), then the system won't generate it. After the image has been generated on the backend, everything works as expected with the rewriting.

Because of the way D7 now handles images, if you apply a redirect to images before the resized style has been generated, you will receive a Notice: Use of undefined constant Y - assumed 'Y' in eval() (line 2 of /home/userdir/public_html/h/somesite.com/modules/php/php.module(80) : eval()'d code).

superfedya’s picture

With Apache the problem was resolved like that:

  RewriteCond %{HTTP_HOST} ^mysite\.com$ [NC]
  RewriteRule ^(.*)$ http://www.mysite.com/site/$1 [L,R=301]
 
  RewriteCond %{HTTP_HOST} ^cdn1\.mysite\.com$ [NC]
  RewriteRule ^(.*)$ http://www.mysite.com/site/$1 [L,R=301]

  RewriteCond %{HTTP_HOST} ^cdn2\.mysite\.com$ [NC]
  RewriteRule ^(.*)$ http://www.mysite.com/site/$1 [L,R=301]

  RewriteCond %{HTTP_HOST} ^cdn3\.mysite\.com$ [NC]
  RewriteRule ^(.*)$ http://www.mysite.com/site/$1 [L,R=301]

  RewriteCond %{HTTP_HOST} ^49\.4\.91\.32$ [NC]
  RewriteRule ^(.*)$ http://www.mysite.com/site/$1 [L,R=301]

Any idea how to rewrite this config to Nginx?

Thanks

federico’s picture

Hi,

I tried to use the module submitted in #57, but I couldn't add cdn.example.com to the blacklist. I got this error:

implode(): Invalid arguments passed on the line 20 of the file /var/www/sites/all/modules/cdn_seo/cdn_seo.admin.inc.

I'm using D6 and CDN 6.x-2.5

I also tried to put

RewriteCond %{HTTP_HOST} ^cdn\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(css|js|ico|png|gif|jpe?g|pdf|zip|swf)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R]

in .htaccess, but when I visit http://cdn.example.com/samplepath I'm redirected to: http://www.example.com/index.php?q=samplepath

I'll appreciate your help.

jemond’s picture

File status: new; file size: 42.7 KB
File status: new; file size: 43.2 KB

Attached is a quick-and-dirty port of the cdn_seo module for Drupal 7 from #57, and the Drupal 6 version with a bug fix for #67. The Drupal 7 version includes the fix as well. My initial tests with this did not work on my site, but my hosting platform is somewhat non-standard (Pantheon). I will debug further the week of 7/7 and report back (I am on vacation this coming week).

jemond’s picture

Status: Needs work » Needs review

After more debugging I don't think using the host to match the incoming request is the best approach. It's unreliable (it didn't work for me at all on Pantheon). As stated earlier, I think User-Agent detection is the way to go.

Below is the snippet I used to fix this problem. I dropped it into settings.php and confirmed it fixes the issue on my live site that uses Amazon Cloudfront. I'm using a 301 instead of a 404 because this tells requesters what the "correct" URL they are trying to reach is. If it happens to be a bad link on the main site, then the main site will return a 404 (as it should).

<?php
// getallheaders() is only available when using Apache. For things like
// nginx, we have to define our own function.
if (!function_exists('getallheaders')) {
  function getallheaders() {
    $headers = array();

    // Loop to get all of the headers in the request
    foreach ($_SERVER as $name => $value) {
      if (substr($name, 0, 5) == 'HTTP_') {
        // RFC2616 (HTTP/1.1) defines header fields as case-insensitive
        $headers[strtolower(str_replace(' ', '-', ucwords(strtolower(str_replace('_', ' ', substr($name, 5))))))] = $value;
      }
    }
    return $headers;
  }
}
// Lower-case the keys so the lookup below also works with the native
// getallheaders(), which keeps the header names as sent by the client.
$headers = array_change_key_case(getallheaders(), CASE_LOWER);
$cdn_user_agents = array('Amazon CloudFront');
if (isset($headers['user-agent']) && in_array($headers['user-agent'], $cdn_user_agents)) {
  // A 301 is SEO friendly, as it tells the search engine what the real URL is
  // for this content.
  header('HTTP/1.0 301 Moved Permanently');
  header('Location: http://www.example.com'. $_SERVER['REQUEST_URI']);

  // To ensure this redirect occurs immediately we don't use drupal_exit().
  exit();
}
?>

(D.o is messing up the header line by making it a link.)

I'm willing to roll a patch for the D7 branch if there aren't any objections to this approach. If this is the case, I think there are three to-do items:
1. Find the user agents for other CDNs
2. Make the user-agent check a partial match instead of an exact match
3. Track the list of user-agents in the config screen for the module, with a good default of as full a list as we can find of user-agents

Any concerns with this approach?

The last submitted patch, cdn_seo_sandbox_cleanup.patch, failed testing.

jemond’s picture

Version: 6.x-2.x-dev » 7.x-2.x-dev
File status: new; file size: 4.82 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch seo_redirection-1060358-69.patch.

D7 dev patch attached for #69. I still need to expand the list of user-agents by default, but I've deployed this live and it's working. After some thought, since we allow the user to enter in the user agent list that causes redirection I think a partial match would be too fragile, as it could produce false positives.

drupalsim91’s picture

I have the same problem: if the image style does not exist, I get a 404 error. Please resolve this.

Wim Leers’s picture

Status: Needs review » Needs work

You're essentially taking the path I was taking in #50.

User-Agent detection is always fragile (this is also what Aaron Peters said). But then again, it *is* the easiest route to solving this problem. The major advantage is that it doesn't require a separate domain, which the typical recommended solution does (i.e. have a separate CDN-facing cdnorigin.yoursite.com and visitor-facing yoursite.com.).

This solution may not be as complete as the one set forth by mikeytown2 and continued by me in #57, but I think it's better nevertheless, because it requires less configuration.

The one thing I would add is support for rel="canonical" HTTP headers: http://googlewebmastercentral.blogspot.be/2011/06/supporting-relcanonica....

So, in summary:
- yes, I'll commit your work (though it should meet Drupal's coding standards first)
- I'd like to see support for the aforementioned headers as well — if you don't have time for this, @jemond, then I'll do that :)

Thank you very much!

jemond’s picture

Thanks for the review @Wim.
1. Had no idea about the rel="canonical" HTTP headers. Cool! Will add them.
2. I will fix what I goofed in terms of coding standards.

I will get working on an updated patch. I might have it ready later today or early next week.

jemond’s picture

File status: new; file size: 5.06 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch seo_redirection-1060358-73.patch.

Attached is a new patch that adds the rel=canonical header for the redirects. I also added some line wrapping to comply with Drupal standards. I'm not certain which additional coding standards I've goofed, so any guidance would be appreciated. I can follow up with an updated patch.

I've deployed this patch live and verified it's working. Here are the response headers for:
http://assets1.zujava.com/how-to-get-your-kid-exercising-after-school

Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0
Connection: close
Content-Type: text/html
Date: Mon, 16 Jul 2012 18:22:05 GMT
Etag: "1342462925"
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Last-Modified: Mon, 16 Jul 2012 18:22:05 +0000
Link: <http://www.zujava.com/how-to-get-your-kid-exercising-after-school>; rel="canonical"
Location: http://www.zujava.com/how-to-get-your-kid-exercising-after-school
Server: nginx/1.0.15
Via: 1.0 7d2cfd509570f1fce6ed360cb72250b4.cloudfront.net (CloudFront)
X-Amz-Cf-Id: oBmUOIds_n4Jug3dwW3Q0JsDpa67Dl57llkCekAHvvaHvGY_PeHaag==,oc1Oju-OvGEuCswSlwDlymLSpedDeOG9p4ral2GNhpF26r7psw8eWQ==,9ZcFtcPEkljNAADeTSQI5FtWAiGWYtn0-LCVOaHKT8gMBAXjVv_NAA==
X-Cache: Miss from cloudfront
x-drupal-cache: MISS
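
The two SEO-relevant headers in that response (Location and Link with rel="canonical") can be sketched as follows. This is an illustration in Python, not the patch's PHP; the function name `seo_redirect` and the default base URL are assumptions.

```python
def seo_redirect(request_uri, canonical_base="http://www.example.com"):
    """Build the status line and headers for an SEO-friendly redirect.

    Both headers point at the same canonical URL: Location drives the 301,
    and Link advertises the canonical address of the content.
    """
    target = canonical_base.rstrip("/") + "/" + request_uri.lstrip("/")
    headers = [
        ("Location", target),
        ("Link", '<{}>; rel="canonical"'.format(target)),
    ]
    return "301 Moved Permanently", headers


if __name__ == "__main__":
    status, headers = seo_redirect("/how-to-get-your-kid-exercising-after-school")
    print(status)
    for name, value in headers:
        print("{}: {}".format(name, value))
```
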

jemond’s picture

Status: Needs work » Needs review

Wim Leers’s picture

Assigned: Unassigned » Wim Leers
File status: new; file size: 16.61 KB
PASSED: [[SimpleTest]]: [MySQL] 64 pass(es).

Rerolled patch. Addressed issues:

  • No test coverage. Patch now comes with full unit test coverage.
  • Breaks when used with e.g. image styles. The code assumes that Drupal never generates files, i.e. that whenever Drupal itself serves something to a CDN, it is by definition something that should be redirected. This does not hold true. Examples are generated images (core's image module), generated PDFs, and the CDN module's own Far Future expiration functionality.
  • Lack of substring matching for CDN user agents: case-insensitive, yet precise matches were required.
  • All the jumbling with getallheaders was not necessary at all.
  • Coding standards.
  • Advanced Help documentation.

It goes without saying that you will be credited as well, @jemond. You have been instrumental in pushing this issue forward. Thanks *so* much for your help :) I just pushed it the extra mile to ensure it won't break anything. I'm looking forward to your feedback!

This functionality will go into the 2.7 release, the 2.6 release will contain bugfixes only.

jemond’s picture

Thanks for the clean-up Wim! This looks good.

This is queued up on test now and set for deployment to live tomorrow. I will update next week if there are any problems in production.

Wim Leers’s picture

Thanks :)

jemond’s picture

Status: Needs review » Reviewed & tested by the community

This has been in production since 9/27 and I can confirm it's redirecting main page requests in an SEO-friendly way. Looks good. No problems.

Wim Leers’s picture

Perfect :) This will go in version 2.7 then :)

ianthomas_uk’s picture

Great news, and thanks to Wim for pointing me to this when I was doing a bit of housekeeping on another issue.

Is there a rough release date for 2.7? Or a list of issues blocking it? If it's going to be a while I'll apply the patch manually, but I'd prefer to use the released version if possible.

Wim Leers’s picture

2.7 will only contain this patch.

2.6 will be a bugfix release, and besides the many bugfixes already committed, I'd also like to see #1719568: CDN URLs are not properly encoded in some edge cases and #1790348: Far-Future mode: background images and font files referenced in CSS files incorrectly rewritten, but only in an Omega subtheme fixed before releasing 2.6.

ianthomas_uk’s picture

Status: Reviewed & tested by the community » Needs review
File status: new; file size: 16.26 KB
PASSED: [[SimpleTest]]: [MySQL] 68 pass(es).

The patch from #77 no longer applies to 7.x-2.x, so here is a slightly amended version. The changes should be identical, it's just the lines around those changes that are different.

Setting to needs review so the patch gets tested. Please can someone set it back to RTBC if the test succeeds and you agree that my patch is the same as #77.

Wim Leers’s picture

Woot, thanks @ianthomas_uk :)

pierrot’s picture

Version: 7.x-2.x-dev » 6.x-2.x-dev
Component: Miscellaneous » Origin Pull mode — Far Future expiration
Assigned: Wim Leers » Unassigned
Status: Needs review » Closed (fixed)

The d6 module didn't work for me. I implemented the #69 solution that worked like a charm. Only user-agent did the trick.

My settings : pressflow 6 + varnish + origin pull + imagecache + cloudfront

ianthomas_uk’s picture

Version: 6.x-2.x-dev » 7.x-2.x-dev
Component: Origin Pull mode — Far Future expiration » Miscellaneous
Status: Closed (fixed) » Needs review

Please never change status to "Closed (fixed)" - if you resolve a bug then it needs to be marked "Fixed" and the system will mark it closed after two weeks. Only do that if you've committed a fix to the correct repository. You've been able to work around on your own copy which is using a different version of Drupal - that is not going to do anything to help new people installing the D7 version, which is what this bug was about (although the fix may get back ported).

pierrot’s picture

My bad! Thanks for the explanation.

Wim Leers’s picture

With the 2.6 release finally out the door, I'm going to commit this soon and ship it as part of the 2.7 release. I first want to make sure that the 2.6 release is indeed a sufficiently stable (bugfix) release.

Somewhat related: http://www.cdnplanet.com/blog/how-protect-your-cdn-origin-server/

bangpound’s picture

File status: new; file size: 16.31 KB
PASSED: [[SimpleTest]]: [MySQL] 81 pass(es).

The patch applies cleanly to the 7.x-2.x branch, but if you include it in a drush make file, it fails.

This is the same patch as in comment #84 but updated to apply cleanly on 7.x-2.x branch.

bangpound’s picture

File status: new; file size: 16.38 KB
PASSED: [[SimpleTest]]: [MySQL] 81 pass(es).

I was getting PHP warnings about $_SERVER['HTTP_USER_AGENT'] not being set. Apparently this is a legitimate situation, so I've added a condition to check if this key exists in the array.

The patch attached is the merge of #90 and the patch below.

diff --git a/cdn.module b/cdn.module
index 5e147c3..ec6b4cb 100644
--- a/cdn.module
+++ b/cdn.module
@@ -898,11 +898,13 @@ function _cdn_seo_should_redirect($path) {

     // Use case-insensitive substring matching to match the current User-Agent
     // to the list of CDN user agents.
-    $ua = drupal_strtolower($_SERVER['HTTP_USER_AGENT']);
-    $cdn_user_agents = explode("\n", drupal_strtolower(variable_get(CDN_SEO_USER_AGENTS_VARIABLE, CDN_SEO_USER_AGENTS_DEFAULT)));
-    foreach ($cdn_user_agents as $cdn_ua) {
-      if (strstr($ua, trim($cdn_ua))) {
-        return url($path, array('absolute' => TRUE));
+    if (isset($_SERVER['HTTP_USER_AGENT'])) {
+      $ua = drupal_strtolower($_SERVER['HTTP_USER_AGENT']);
+      $cdn_user_agents = explode("\n", drupal_strtolower(variable_get(CDN_SEO_USER_AGENTS_VARIABLE, CDN_SEO_USER_AGENTS_DEFAULT)));
+      foreach ($cdn_user_agents as $cdn_ua) {
+        if (strstr($ua, trim($cdn_ua))) {
+          return url($path, array('absolute' => TRUE));
+        }
       }
     }
   }
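
For readers who do not use PHP, the guarded matching in this diff can be sketched in Python as follows. This is an illustration under assumptions, not the module's code: the function name and the default list are hypothetical ('Amazon CloudFront' is taken from the snippet in #69, and the actual value of CDN_SEO_USER_AGENTS_DEFAULT is not shown in this thread), and the sketch returns a boolean instead of the redirect URL.

```python
# Assumed default; the real CDN_SEO_USER_AGENTS_DEFAULT is not shown in the thread.
DEFAULT_CDN_USER_AGENTS = "Amazon CloudFront"


def matches_cdn_user_agent(user_agent, configured=DEFAULT_CDN_USER_AGENTS):
    """Case-insensitive substring match against a newline-separated list.

    Returns False when the request carried no User-Agent header
    (user_agent is None), which is the case the isset() check handles.
    """
    if user_agent is None:
        return False
    ua = user_agent.lower()
    for candidate in configured.lower().split("\n"):
        candidate = candidate.strip()
        # Skip blank lines: in Python an empty string is a substring of
        # every string, which would match every request.
        if candidate and candidate in ua:
            return True
    return False


if __name__ == "__main__":
    print(matches_cdn_user_agent("Amazon CloudFront"))
    print(matches_cdn_user_agent("Mozilla/5.0"))
    print(matches_cdn_user_agent(None))
```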

asb’s picture

After spending over an hour reading through this 3.5-year-old issue, I'm under the impression that the initial problems (possible duplicate content penalties and a generally unprofessional site design) are still unsolved. Or have I missed something?

I'm running Pressflow with cdn-6.x-2.x-dev, where the patches from #90 and #91 appear not to have been backported, and the whole site is accessible on all sharded subdomains (static.*, js.*, css.*).

If the 'cdn' module can not provide a solution to the problems it creates: What have other users done to prevent messed up search engine results and rogue bots crawling four sites instead of one?

standingtall’s picture

@asb

CDN is an SEO risky module. The best solution I found was to claim cdn.mydomain.com in Google webmaster tools and then remove all the URLs.

asb’s picture

Yes, search engine penalties are one problem. Another problem is bots and crawlers, which are currently responsible for about 60% of all web traffic, as a news magazine reported a couple of weeks ago. Judging from our logfiles, these numbers appear plausible: we have sh*tloads of bots on our servers, and they cause significant load (and mostly don't honor robots.txt).

With some Varnish logic it's possible at least to blacklist the most offending bots; however, we cannot block Googlebot, Msnbot, Slurp and the other legitimate crawlers just to prevent indexing of the sharded domains (and yes, they find these domains). The additional load is a major problem since the sites we operate have an average of 100k pages, so the crawlers create a lot of traffic and cause a significant load on the web heads.

mr.j’s picture

I posted a few times earlier in this thread. At the time we had a lot of pages on our cdn subdomain indexed in google as duplicates. These days they have all disappeared so I must have been successful in cleaning it all up. I don't believe that we are using any extra modules or patches proposed in this thread at all, rather it is all done in .htaccess or server config to avoid hitting Drupal bootstrap as much as possible.

Bearing in mind we use an Edgecast origin pull CDN and we're using Drupal 6.x, what I did was:
1. Create a subdomain specifically for the origin pull CDN to use: static.example.com. Never link to that subdomain anywhere or put it in your Drupal settings. The only place you use it is in your CDN server settings. i.e. tell it to pull content from your site using that subdomain. If you are really paranoid you could use a password generator to make up a 30 character random string for the subdomain and use that instead.
2. Assign a subdomain that will actually be published for static files in the html of your Drupal site, i.e. cdn.example.com, and set up a CNAME record so that requests to that subdomain are passed on to your CDN server.
3. Set up the Drupal CDN module to use that second subdomain for static content when it is published. i.e. cdn.example.com
4. Redirect any requests for non-static files on your subdomain you created in step 1 to www.example.com. This is what our .htaccess looks like:

# Valid subdomains
RewriteCond %{HTTP_HOST} !^(www|static)\.example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]

# Static subdomain is reserved for use by the origin pull CDN. These rules
# should stop someone requesting normal pages through the subdomain.
RewriteCond %{HTTP_HOST} ^static\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(png|gif|ico|jpe?g|bmp|pdf|swf|js|css|zip)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

5. Set up canonical URLs using the nodewords module just in case, so that all our published pages are indexed on the www subdomain.

The end result is this:
1. Your pages are served on www.example.com with static files served from cdn.example.com via the CDN module for Drupal.
2. Requests to cdn.example.com are routed to our CDN, which forwards the requests through to static.example.com.
3. If the request is for an allowed static file, it gets returned and cached by the CDN.
4. If it isn't, the request is 301-redirected back to the www subdomain.
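
The decision made by the two .htaccess rules above can be sketched as a small function, which is handy for checking which URLs would be redirected before touching the server. This is a Python illustration only; the function name `route` and the return shape are hypothetical, and query strings (which Apache re-appends on its own) are ignored.

```python
import re

# Patterns copied from the .htaccess rules above.
VALID_HOST = re.compile(r"^(www|static)\.example\.com$", re.IGNORECASE)
STATIC_EXT = re.compile(r"\.(png|gif|ico|jpe?g|bmp|pdf|swf|js|css|zip)$", re.IGNORECASE)
CANONICAL = "http://www.example.com"


def route(host, path):
    """Return ("serve", None) or ("301", url) for a host and a path such as "/node/1"."""
    # Rule 1: any host other than www or static goes to www.
    if not VALID_HOST.match(host):
        return ("301", CANONICAL + path)
    # Rule 2: the static host may only serve whitelisted file types.
    if host.lower() == "static.example.com" and not STATIC_EXT.search(path):
        return ("301", CANONICAL + path)
    return ("serve", None)


if __name__ == "__main__":
    print(route("static.example.com", "/sites/default/files/logo.png"))
    print(route("static.example.com", "/node/1"))
    print(route("foo.example.com", "/node/1"))
```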

In hindsight we possibly could have gotten away with not using the intermediary subdomain (static.example.com) if we could set up a rule in .htaccess to determine whether a request is being made by our CDN server (maybe by looking at the request headers), but it's working now so there's no point in me messing with it.

asb’s picture

@mr.j: Thanks a lot for your highly instructive reply! I believe your approach might be a useful workaround, until the 'cdn' module includes bangpound's patches.

"Never link to that subdomain anywhere or put it in your Drupal settings."

This will probably be futile. If a subdomain is registered in the DNS, it's public. Anything can query for CNAME or A records. Additionally, I suspect that nowadays bots do not even bother to look up DNS records anymore; they just guess. At least we had lots of requests for domains like vvv.example.com or wwww.example.com in our logfiles. Solution: a) Don't use wildcards, neither in DNS nor in the webserver config; b) Redirect all unwanted requests to something useful, e.g.:

  RewriteCond %{HTTP_HOST} ^mail\.example\.com$ [NC]
  RewriteRule ^(.*)$ http://example.com/$1 [L,R=301]

"4. Redirect any requests for non-static files on your subdomain you created in step 1 to www.example.com."

I adapted this, and it appears to work after modifying our homegrown hotlink protection:

  RewriteCond %{HTTP_REFERER} !^$
  RewriteCond %{HTTP_REFERER} !^http(s)?://(www\.)?example.com [NC]
  RewriteRule \.(jpg|jpeg|png|gif)$ - [NC,F,L]

…at least mostly (our sharded subdomains still have cookies for whatever reason).
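
To see which requests the hotlink rules above would reject, the three conditions can be modelled like this. It is a sketch in Python with a hypothetical function name; unlike the original pattern, the dot in example.com is escaped here.

```python
import re

# Referers that are allowed to embed images (dot escaped, unlike the rule above).
ALLOWED_REFERER = re.compile(r"^https?://(www\.)?example\.com", re.IGNORECASE)
IMAGE_EXT = re.compile(r"\.(jpg|jpeg|png|gif)$", re.IGNORECASE)


def is_hotlink(path, referer):
    """Return True when the request would receive a 403 from the rules."""
    if not IMAGE_EXT.search(path):
        return False  # the RewriteRule pattern only matches image files
    if not referer:
        return False  # an empty or missing Referer is let through
    return ALLOWED_REFERER.match(referer) is None


if __name__ == "__main__":
    print(is_hotlink("/files/photo.jpg", "http://other-site.test/page"))
    print(is_hotlink("/files/photo.jpg", "http://www.example.com/node/1"))
    print(is_hotlink("/files/photo.jpg", ""))
```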

"5. Set up canonical URLs using the nodewords module just in case […]"

Good pointer, I missed that.

Thanks again!

asb’s picture

This issue might be more complex than I assumed before. The .htaccess redirects suggested in the previous post actually slow down page loading. When inspecting the network traffic with Dragonfly (Opera's incarnation of Firefox's Firebug) or the YSlow plugin, the diagram shows a high number of 301 redirects within the page (those pointing to resources that are supposed to be managed by the 'cdn' module).

The duplicate content issue is discussed in #981148: Image Cache + Origin Pull mode + CNAME subdomains as well; in #8, mikeytown2 suggests another set of rewrite rules. It might be interesting to compare them with the suggestions from #95 in this issue. For the time being, I have disabled the rewrite rules.

mr.j’s picture

We actually have wildcard subdomains on so no-one knows what the CDN's specific static subdomain is. But you are right that if you don't have wildcards then you're making the subdomain visible through CNAME records. The first rule in our .htaccess above makes sure that only the subdomains that we want to use are allowed and everything else is redirected to www.

I don't see the same 301 issue you have described. Once the CDN caches the static content it is just served straight from the CDN's public subdomain. i.e. cdn.example.com without a 301.

asb’s picture

I'm currently testing these redirects in .htaccess:

<IfModule mod_rewrite.c>
  RewriteEngine on
  RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]
  …
  RewriteCond %{HTTP_HOST} ^www.example.com [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]
  RewriteCond %{HTTP_HOST} cdn.example.com [NC]
  RewriteCond %{REQUEST_URI} !\.(png|gif|jpg|jpeg|ico)$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]
  RewriteCond %{HTTP_HOST} js.example.com [NC]
  RewriteCond %{REQUEST_URI} !\.(js|…)$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]
  RewriteCond %{HTTP_HOST} css.example.com [NC]
  RewriteCond %{REQUEST_URI} !\.(css|…)$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]
  …
</IfModule>

This ugly redirect monster is supposed to a) avoid accesses without a subdomain (to enable cookieless subdomains), b) handle language-specific sub-domains, and c) redirect requests on the sharded sub-domains for static content to the main sub-domain if non-static content is requested.

Someone fluent in RegEx could definitely do this more efficiently, but for now it appears to work.

mikeytown2’s picture

Commenting what each rule does is a good idea.

<IfModule mod_rewrite.c>
  RewriteEngine on

  # Redirect example.com to de.example.com.
  RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]

  …
  # Redirect www.example.com to de.example.com.
  RewriteCond %{HTTP_HOST} ^www.example.com [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]

  # Redirect cdn.example.com to de.example.com for all non-image requests.
  RewriteCond %{HTTP_HOST} cdn.example.com [NC]
  RewriteCond %{REQUEST_URI} !\.(png|gif|jpg|jpeg|ico)$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]

  # Redirect js.example.com to de.example.com for all non-js requests.
  RewriteCond %{HTTP_HOST} js.example.com [NC]
  RewriteCond %{REQUEST_URI} !\.(js|…)$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]

  # Redirect css.example.com to de.example.com for all non-css requests.
  RewriteCond %{HTTP_HOST} css.example.com [NC]
  RewriteCond %{REQUEST_URI} !\.(css|…)$ [NC]
  RewriteRule ^(.*)$ http://de.example.com/$1 [L,R=301]
  …
</IfModule>

There might be a way to get the cdn js css rules in one but keeping it simple is usually a good idea. That way when you look at it again in 3 months you know what's going on.
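
One way to think about "getting the cdn js css rules in one" is as a single table that maps each sharded host to the file types it may serve. The sketch below shows that idea in Python purely as a model of the logic, not as a replacement .htaccess. The names are hypothetical, and only the extensions actually spelled out in the rules above are listed (the elided "…" entries are left out).

```python
CANONICAL = "http://de.example.com"

# Allowed extensions per sharded host, as far as the rules above spell them out.
SHARD_EXTENSIONS = {
    "cdn.example.com": ("png", "gif", "jpg", "jpeg", "ico"),
    "js.example.com": ("js",),
    "css.example.com": ("css",),
}


def redirect_for(host, path):
    """Return the URL to 301 to, or None when the request is served as-is."""
    host = host.lower()
    # Bare domain and www always go to the language subdomain.
    if host in ("example.com", "www.example.com"):
        return CANONICAL + path
    allowed = SHARD_EXTENSIONS.get(host)
    if allowed is None:
        return None  # not a sharded host, e.g. de.example.com itself
    ext = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return None if ext in allowed else CANONICAL + path


if __name__ == "__main__":
    print(redirect_for("cdn.example.com", "/files/logo.png"))
    print(redirect_for("js.example.com", "/node/1"))
    print(redirect_for("www.example.com", "/node/1"))
```

Adding a shard then means adding one table entry rather than another three-line rule, at the cost of the per-rule comments recommended above.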

asb’s picture

@mikeytown2: Actually I removed my (longish) comments from the code block in the attempt to make it look smaller (and thus easier to read). Wrong assumption ;)

I asked the experts over at http://stackoverflow.com/questions/24705187/different-redirects-based-on... for advice on whether my approach even makes sense (and borrowed your concise comments for that, thanks!)

kgil’s picture

Amazon CloudFront, Akamai...
Apache, Nginx, Varnish...
http://www.cliip.net/20140705/0101/drupal-cdn-module-and-seo-avoid-dupli...

asb’s picture

@kgil: That approach won't work with sharded subdomains for those having to migrate from 'parallels' module to 'CDN'.

… and my redirect attempts from #99 are crap, they do not work properly. The 'CDN' module plus sharded subdomains and 'AdvAgg' simply re-introduce the same issues we had before when there was no 'AdvAgg'. And it causes new SEO issues as a bonus for free.