Hi,
I know that having different paths for the same content isn't good for SEO.
Right now I have this in my .htaccess:
RewriteCond %{REQUEST_FILENAME} !\.(gif|png|jpg|jpeg|jfif|bmp|css|js|zip|ico)$ [NC]
RewriteCond %{HTTP_HOST} ^madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn1.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn2.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn3.madfanboy\.com$ [NC]
RewriteRule ^(.*)$ http://www.madfanboy.com/$1 [L,R=301]
Now when I type cdn1.madfanboy.com, it redirects to madfanboy.com.
But http://cdn1.madfanboy.com/site/sites/default/files/images_save/deadspace... doesn't redirect anywhere. Is that correct?
Thanks
Comment | File | Size | Author |
---|---|---|---|
#107 | interdiff.txt | 1.52 KB | Wim Leers |
#107 | 1060358-106.patch | 16.76 KB | Wim Leers |
#105 | interdiff.txt | 1.49 KB | Wim Leers |
#105 | 1060358-105.patch | 1.49 KB | Wim Leers |
#104 | 1060358-104.patch | 17.79 KB | Wim Leers |
Comments
Comment #1
Wim Leers
Correct: that can't have any effect. A CDN does not necessarily run Apache, so a .htaccess file would have no effect there, nor would the .htaccess be copied to it. Hence, this won't work.
Many sites use Origin Pull CDNs that could theoretically trigger this problem. In practice, there are no SEO issues.
Comment #2
superfedya CreditAttribution: superfedya commented
RewriteCond %{REQUEST_FILENAME} !\.(gif|png|jpg|jpeg|jfif|bmp|css|js|zip|ico)$ [NC]
RewriteCond %{HTTP_HOST} ^madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn1.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn2.madfanboy\.com$ [NC,OR]
RewriteCond %{HTTP_HOST} ^cdn3.madfanboy\.com$ [NC]
RewriteRule ^(.*)$ http://www.madfanboy.com/$1 [L,R=301]
These rules aren't correct, because http://cdn2.madfanboy.com/site/sites/default/files/imagecache/small/pict... redirects to http://www.madfanboy.com. How can I fix that?
Comment #3
Wim Leers
Is cdn2.madfanboy.com an actual CDN, or is it just your own file server?
Comment #4
superfedya CreditAttribution: superfedya commented
An actual CDN.
Comment #5
Wim Leers
So you're saying your CDN supports .htaccess files? Which CDN is that, then?
Comment #6
superfedya CreditAttribution: superfedya commented
cdn2.madfanboy.com is just a redirect to madfanboy.com.
Comment #7
Wim Leers
Then why do you say it is an actual CDN? It's NOT an actual CDN.
Comment #8
superfedya CreditAttribution: superfedya commented
Sorry, I don't know these things very well. So, if I understand correctly, after setting up a CDN in Origin Pull mode I don't need to modify my .htaccess for better SEO? Right now all my content is accessible from 4 domains (madfanboy.com, cdn1.madfanboy.com, ...), and I don't think that duplicate content is a good thing for SEO...
Comment #9
mikeytown2 CreditAttribution: mikeytown2 commented
@superfedya
The same rules will apply:
http://drupal.org/node/597178#comment-2735418
Do these work for you?
Wim Leers
submodule htaccess generator for cdn?
Comment #10
Wim Leers
@mikeytown2: but a .htaccess only works for alternative domains (or subdomains) that are self-hosted through Apache httpd. It's useless when you use a CDN, e.g. CloudFront. How do people deal with this in general with Origin Pull CDNs? I can't find it. Dynamically serve a different robots.txt to the CDN's crawler?
Comment #11
mikeytown2 CreditAttribution: mikeytown2 commented
@Wim Leers
superfedya is self-hosted; I'm sure other requests like this will come in. Will an Origin Pull CDN copy HTML?
Comment #12
Wim Leers
An Origin Pull CDN will indeed copy HTML. That's the crux of the problem.
Comment #13
mikeytown2 CreditAttribution: mikeytown2 commented
It depends on how the origin pull is set up. If you have it pull from cdn.example.com, then the .htaccess rules will work. I think this should be the standard way of doing it.
I did notice that HTTP_VIA is getting set when doing an origin pull. We are using Limelight. If this is some sort of standard then we could detect it this way, but I have a feeling this isn't ideal.
Thinking about this some more, .htaccess rules are not needed; one could do it in cdn_init. It's not as fast, but it would make life easier. Just issue a "Location:" header with a 301. In order to let custom generation through (like imagecache), do a simple check, replacing sites/default/files with file_directory_path(). Or use hook_menu_alter, as that only gets run on menu rebuilds... something to think about, because I'm creating a 404 handler for CSS/JS aggregates, so auto-detecting what should be allowed through is a smarter option.
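The cdn_init idea above could be sketched roughly as follows. This is an illustration only, not the eventual module's code: the variable name cdn_seo_blacklist, the helper logic, and the canonical URL are all assumptions.

```php
/**
 * Implements hook_init().
 *
 * Sketch of the idea from #13: if the request arrived on a CDN-facing
 * domain and is not a request for a static file, 301 it back to the
 * canonical domain. All names here are illustrative.
 */
function cdn_seo_init() {
  $host = strtolower($_SERVER['HTTP_HOST']);
  $blacklist = variable_get('cdn_seo_blacklist', array('cdn1.example.com'));
  if (in_array($host, $blacklist)) {
    // Let files (images, CSS, JS, imagecache derivatives) through.
    $files_dir = variable_get('file_directory_path', 'sites/default/files');
    if (strpos($_GET['q'], $files_dir) !== 0) {
      header('Location: http://www.example.com/' . $_GET['q'], TRUE, 301);
      exit;
    }
  }
}
```

The check against the files directory is what would allow imagecache to generate missing derivatives on the CDN domain while page requests get redirected.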
Comment #14
superfedya CreditAttribution: superfedya commented
@mikeytown2
Hmmm, it works :)
Thanks
Comment #16
Wim Leers
@mikeytown2: care to roll a patch? :) I don't think having a subdomain that serves the CDN's requests is very typical. In any case, I never do that, because it's unnecessary.
Comment #17
superfedya CreditAttribution: superfedya commented
@mikeytown2
Nope, doesn't work.
#Parallel Redirect
RewriteCond %{REQUEST_URI} !(^/(.*)(.js|.css)) [NC]
RewriteCond %{HTTP_ACCEPT} !(.*image.*|.*css.*|.*javascript.*) [NC]
RewriteCond %{HTTP_HOST} ^(cdn1.example.com|cdn2.example.com|cdn3.example.com)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
Tried this code and got this problem for some users:
http://www.webpagetest.org/results/11/02/19/MY/YK7/1_screen.jpg
Erased and everything become normal:
http://www.webpagetest.org/result/110219_EY_YKA/1/screen_shot/
Any other solution?
Comment #18
mikeytown2 CreditAttribution: mikeytown2 commented
I think cdn_init is the best way to handle this; I should be able to program around the issues superfedya is having. Expect a patch in a couple of days; I will be busy with other issues. This is a concern of ours because we recently put a couple of our sites on Limelight.
Comment #19
superfedya CreditAttribution: superfedya commented
Thank you mikeytown2!
Comment #20
superfedya CreditAttribution: superfedya commented
Any news?
Thanks
Comment #21
mikeytown2 CreditAttribution: mikeytown2 commented
Nope; busy over here at the moment.
http://drupal.org/project/advagg
Comment #22
superfedya CreditAttribution: superfedya commented
Because Google can lower a site's position in search results if they find 4 different addresses for the same site :(
Comment #23
mikeytown2 CreditAttribution: mikeytown2 commented
Still not fixed... still working over here. I've got this issue to take care of, and then I'll jump on this one: #1078060: CSS Embedded Images - Add in support for advagg's hooks
Comment #24
Wim Leers
Any news, Mike? :)
Comment #25
mikeytown2 CreditAttribution: mikeytown2 commented
Same as before; still pounding out issues for advagg.
Comment #26
Anonymous (not verified) CreditAttribution: Anonymous commented
subscribe
Comment #27
basicmagic.net CreditAttribution: basicmagic.net commented
subscribe
Comment #28
mikeytown2 CreditAttribution: mikeytown2 commented
Looking at this in more depth now. For us, we have http://media.example.com in the CDN mapping setting, and that comes back to us as http://source.example.com from the CDN provider. $_SERVER['HTTP_VIA'] is unreliable, so that's out. It looks like we would need another field where one can blacklist certain domains; this gets complicated when one considers things like my files_proxy module, which will forward requests to the correct host based on the contents of the file's path. Thinking about it, the files_proxy module needs to run before this code, and that should solve that issue. We would like to 301 the request to the root domain, but there is an issue with a multisite using the same CDN domain. 301-ing to the correct domain gets a lot worse when you have Domain Access on top of a multisite with over 1.1k domains to pick from (our setup). Thus I think a 404 is the safest way to do this.
So here is the proposal. The default is to Fast 404 if the request is for a blacklisted domain and the path is not in the menu_router table. The blacklist would be a text field, as there can be multiple domains. I'm thinking this might be a CDN submodule, or even stand-alone, as this can get complex fairly quickly. The more advanced setup is to 301; this will not work in our case, but it is perfectly valid for other use cases. The 301 setup would have a mapping where requests to one domain go to another, maybe looking at the referrer as well. If it doesn't know what to do, 404. I'll probably steal a bunch of code from this patch to accomplish this: #1056578: Allow for a domain whitelist/blacklist setting.
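The "Fast 404 on blacklisted domains" default proposed above could be sketched like this. This is a hedged illustration, not the eventual module's code: the variable name and the simplified files-directory check (in place of the menu_router lookup) are assumptions.

```php
/**
 * Implements hook_boot().
 *
 * Sketch of the fast-404 proposal from #28: requests arriving on a
 * blacklisted (CDN-facing) domain for anything other than the files
 * directory get an immediate 404 before a full bootstrap.
 * 'cdn_seo_blacklist' is an illustrative variable name.
 */
function cdn_seo_boot() {
  $host = strtolower($_SERVER['HTTP_HOST']);
  $blacklist = variable_get('cdn_seo_blacklist', array());
  if (in_array($host, $blacklist)) {
    $files_dir = variable_get('file_directory_path', 'sites/default/files');
    if (strpos($_GET['q'], $files_dir) !== 0) {
      header($_SERVER['SERVER_PROTOCOL'] . ' 404 Not Found');
      print '404 Not Found';
      exit;
    }
  }
}
```

Serving the 404 in hook_boot rather than hook_init avoids the cost of a full bootstrap for crawler traffic, which matters in the multi-domain setups described above.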
Comment #29
Wim Leers
That sounds about right. And clearly, since you needed three paragraphs to explain something that is deceptively simple at first sight, this can quickly become fairly complex. I'd be happy to include this as a submodule of the CDN module.
Finally, your initial solution ("we have http://media.example.com in the CDN mapping setting, and that comes back to us as http://source.example.com from the CDN provider") should work in most (simple, single-site) set-ups, right? It'd be as simple as returning a 404 for any HTML document served through Drupal.
Comment #30
bibo CreditAttribution: bibo commented
Subscribe.
I'm facing the same challenge. Although .htaccess rules would be a fast solution, I would prefer an inbuilt, reusable (and simple) way to make sure that my self-hosted "cdn" subdomains only return files, and that all other traffic is forwarded to the main site.
Glad to see you guys are working on this :)
Comment #31
mr.j CreditAttribution: mr.j commented
++
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commented
Initial work on this. I have the blacklist working, and imagecache/advagg work as well. I still need to set the module weight to be fairly heavy, create an admin section, get the whitelist working, integrate with Domain Access, etc.
Comment #33
mikeytown2 CreditAttribution: mikeytown2 commented
The module code is in a sandbox for now:
http://drupal.org/sandbox/mikeytown2/1213552
Blacklist only.
snapshot download link
Comment #34
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #35
mr.j CreditAttribution: mr.j commented
Looking into this a bit more, I don't think any of this is going to work in our situation, which is an externally hosted Origin Pull CDN (EdgeCast) set up to have full access to everything in our root domain. I did a few tests:
- www.site.com: works as normal
- cdn.site.com: shows the content from www (this is what we want to prevent)
- de.site.com: German subdomain using Domain Access; works as normal
- de.site.com/german-url: works as normal
- www.site.com/german-url: redirects to de.site.com/german-url (as expected, using the Domain Redirect module)
- cdn.site.com/german-url: redirects to de.site.com/german-url. This suggests that the CDN is trying to access www.site.com/german-url and is being told to redirect by Domain Redirect, just like a normal user.
This last test suggests that we won't be able to tell at the Drupal level that the request is coming through the CDN, unless something like $_SERVER['HTTP_VIA'] is set, but you said before it's unreliable.
Adding rules to our .htaccess to catch requests to the cdn subdomain doesn't work, as it is externally hosted and the origin pull server requests everything off the www domain.
For now we have canonical URL output switched on using the nodewords and domain_meta modules (this needs the patch from #1245660: Support for Canonical URLS), and this seems to be the best solution, as everything can be served off the CDN domain with the canonical URL in the HTML source pointing to the correct domain. Fingers crossed that this will point the search engines in the right direction and clean things up, as we have noticed pages on our cdn domain being indexed by Google recently.
Comment #36
mikeytown2 CreditAttribution: mikeytown2 commented
@mr.j
I have a sandbox module that will prevent cdn.site.com from showing content: http://drupal.org/sandbox/mikeytown2/1213552
The code (cdn_seo.module) is fairly straightforward; just set cdn.site.com to be in the blacklist at admin/settings/cdn/seo.
Comment #37
mr.j CreditAttribution: mr.j commented
Thanks, I took a look at it before. That code is using $_SERVER['HTTP_HOST'] or $_SERVER['SERVER_NAME'].
If I understand things correctly, with our Origin Pull CDN, which we do not host, the CDN will request the page from the regular www domain before caching it. So when the hook_init code runs, I expect that either or both of those variables will be pointing to our www domain, not the cdn subdomain. Therefore it will not stop the request, and the CDN will cache and return the page.
I tried a quick test (without using that module) by putting a few lines in .htaccess to return a 404 if a page was requested off the cdn subdomain. I then requested it and it returned fine instead of returning a 404, which suggests that my server has no knowledge that the request is coming from the CDN - at least not from the host part of the request.
Comment #38
mikeytown2 CreditAttribution: mikeytown2 commented
If both regular and CDN traffic hit your servers with the same host name, it will always be very hard to tell the two apart. We use source.example.com and blacklist it.
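In .htaccess terms, blacklisting a dedicated CDN-facing host could look roughly like this; the hostname source.example.com and the extension list are examples to adapt, not rules from this thread:

```apache
# Return 404 for anything on the CDN-facing host that is not a static asset.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^source\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(gif|png|jpg|jpeg|css|js|ico)$ [NC]
RewriteRule ^ - [R=404,L]
```

Because only the dedicated host is blocked, regular visitor traffic on www is unaffected, which is what makes the two streams distinguishable in the first place.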
Comment #39
mr.j CreditAttribution: mr.j commented
Aha, the penny has dropped! Create a local CDN subdomain and instruct the Origin Pull CDN to use that instead. Then block anything we don't want on that one.
Thanks for the help.
Comment #40
philsward CreditAttribution: philsward commented
I've pulled some code from a post on the old parallel project, where someone suggested using something like the following:
I find it a lot cleaner than some of the other suggestions posted, and I have it working fine on several of my sites.
Anyone care to shoot some bullets at it?
Comment #41
philsward CreditAttribution: philsward commented
Edit: Sorry, I didn't mean to post twice...
Comment #42
mikeytown2 CreditAttribution: mikeytown2 commented
@philsward
Mind testing out a PHP module that has the same goal?
http://drupal.org/sandbox/mikeytown2/1213552
Comment #43
bibo CreditAttribution: bibo commented
@mikeytown2, I would like to test your module if it were for Drupal 7.
There isn't that much code, so I guess it would probably be somewhat easy to upgrade it for D7, right?
Comment #44
mikeytown2 CreditAttribution: mikeytown2 commented
Looking it over, the sandbox might be able to run in D7 with no code changes.
Comment #45
Wim Leers
Looks great! :) Thanks again, mikeytown2! I'm currently cleaning up the code, adding the whitelist functionality, and extending the admin UI to support both blacklists and whitelists (with nice CTools-powered dependent form items in D6). I'll also provide the D7 port.
My reroll of cdn_seo should be posted later today.
Comment #46
bibo CreditAttribution: bibo commented
Oh my!
This sums up my feelings :)
Comment #47
Wim Leers
@bibo :)
I was on the road for the better part of the day yesterday, which prevented me from working on this further. Today I have other work scheduled. I should be able to finish it this weekend.
Comment #48
bibo CreditAttribution: bibo commented
@Wim Leers
As said, I'm very glad to hear you're working on this. I'll be the first to try it out once it's done :)
Comment #49
Wim Leers
I've done the port of the Far Future expiration functionality to D7 this weekend; it was sponsored and had a very tight deadline. This has culminated in the 2.2 release for D6 and D7.
Hence the CDN SEO stuff has been delayed. I hope to work on that next weekend.
Comment #50
Wim Leers
When I started working on this again, I realized that this goes against the spirit of the CDN module: make everything work nicely out-of-the-box.
So I did a bit more research, and found these:
- http://www.seomoz.org/q/how-was-cdn-seomoz-org-configured
- http://blog.maxcdn.com/news/cdns-and-duplicate-content/
- http://www.seroundtable.com/cloud-cdn-seo-13665.html
- http://www.webmasterworld.com/google/4390245.htm
It sounds like simply serving a different robots.txt and using rel="canonical" will go a long way toward solving this problem, even though you could still access the origin content.
Besides that, I think it actually makes sense to just rely on the User-Agent header (and possibly Via, too) to detect CDN requests. Far less set-up needed. It won't cover 100% of cases at first, but it'd be easy enough to update. It wouldn't be failproof, but at least it would be foolproof.
For now, I'd look for the following case-insensitive substrings: "cloudfront", "akamai". I know they are substrings of those CDNs' user agent strings. I couldn't find anything about the other CDNs' user agents.
Comment #51
bibo CreditAttribution: bibo commented
Glad to hear you're still (or again) working on this.
I'm not sure which part you're referring to? Some configuration is required either way.
I studied those links and learned new things, thanks! So SEO-wise it would be sufficient; good. Still, changing robots.txt for CDNs while keeping the same file structure requires some server tweaking (not "out-of-the-box" functionality).
What about our own pull CDNs? I'm currently not using Akamai or any of those.
To be honest, I was hoping this would solve more than just SEO, such as potential performance and credibility issues. Not all crawling is done by search engines. I've been watching traffic on some sites (usually live HTTP traffic with varnishlog), and found out there are shitloads of traffic generated by bots and sniffers of different types. A single site may have several active crawlers 24/7. They can generate a lot of traffic, which might bypass site caching (Varnish, for example, treats the sites as separate). Several CDNs could generate 5-10 full-bootstrap requests per second of totally useless traffic. On many sites this can create more server load than normal traffic (and yes, I've seen this, at least partially).
Additionally, on some sites the crawlers could leave a footprint, making some links etc. refer to the CDN site instead of the real site. I'm not saying exactly this has happened, but it could. Also, I simply don't want people to be able to visit my page with the wrong URL, intentionally or not.
The most KISS approach I can think of is the 301 redirect that was discussed before.
But you know what you're doing. I should concentrate on configuring the webservers so that only limited file-type requests go through (and so that imagecache and JS/CSS generation still work).
Comment #52
Wim Leers
Yes. But that's exactly why I'm proposing to detect CDN spiders through the User-Agent header!
Comment #53
mikeytown2 CreditAttribution: mikeytown2 commented
See #13 for other detection methods.
Note: http://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Requests
Comment #54
Wim Leers
#53: The problem with the Via header is that it also gets set by proxies, so I think it's likely more parsing work.
What are your thoughts on #50? Do you consider it feasible? I'm still going to add the CDN SEO module for people who want full control and have the capability to do so. It's a much tighter solution, for sure. I'm merely trying to make things work as well as possible out of the box.
Comment #55
mr.j CreditAttribution: mr.j commented
This all depends on the bot. If it is a bad bot that does not remove a URL from its crawl list after encountering a 301, then doing a 301 will achieve nothing: the URL will still be crawlable and will just redirect to the content on the main domain. So it won't save you anything in server resources. You could even say it adds more load, as the crawler has to make two requests for every URL that gets a 301.
If on the other hand you have the canonical links set up then good bots will understand not to bother crawling the CDN domain in future.
So if you're really concerned about server load then you need to block access to the CDN domain for non static content, not redirect it.
Comment #56
mr.j CreditAttribution: mr.j commentedI forgot to add, if you want to serve a custom robots.txt for your origin pull CDN subdomain your provider doesn't allow you to configure one (ours doesn't), you can do it using .htaccess.
Create a robots_cdn.txt and put it in the root directory with your robots.txt file. robots_cdn.txt should have the following content:
Then add this to .htaccess and anything that attempts to crawl
static.yourdomain.com/robots.txt
will be served the content of robots_cdn.txt instead.That should at least put a stop to good bots crawling your CDN subdomains.
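The actual files were not preserved in this thread, but rules matching that description might look like the following; static.yourdomain.com is an example host, and the disallow-all robots_cdn.txt content is an assumption about what mr.j intended:

```apache
# Serve robots_cdn.txt instead of robots.txt on the CDN-facing subdomain.
# robots_cdn.txt would presumably contain:
#   User-agent: *
#   Disallow: /
RewriteEngine On
RewriteCond %{HTTP_HOST} ^static\.yourdomain\.com$ [NC]
RewriteRule ^robots\.txt$ /robots_cdn.txt [L]
```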
Comment #57
Wim Leers
I asked Aaron Peters for advice. He's a WPO guru with CDN experience: he maintains CDN Planet.
He wonders if this is really a problem. For his clients, he doesn't deal with it, as it is really a non-issue. E.g. for http://etsy.com, there's http://site dot etsystatic dot com. As long as there aren't any links to the CDN domain (which is why I wrote it down so strangely), you'll be fine. He does understand why you'd want to prevent it from happening, though, and in that case he agrees that the approach outlined by mikeytown2 is the best one.
So, in the general case, this is simply not important. That's why I think it makes most sense to make this a separate project: in the common case, you won't want or need this. As such, there's no point in my continuing the development of this functionality.
I've taken mikeytown2's code from his sandbox (see #33) and cleaned it up. I've attached a patch that applies to his sandbox. I've also attached the entire cdn_seo module as a .zip file. It's written for Drupal 6. It's 90% ready; it only needs the "whitelist" functionality (which is trivial to implement).
Comment #58
Wim Leers
Comment #59
mikeytown2 CreditAttribution: mikeytown2 commented
@Wim Leers
You want me to promote the sandbox to a full project with your patch applied?
Edit:
Wim, you have been given full permissions to http://drupal.org/sandbox/mikeytown2/1213552
Comment #60
Wim Leers
I didn't make myself clear. This is what *I* think is best. I'm still looking for feedback.
What are your thoughts, mikeytown2? And yours, mr.j and bibo?
Comment #61
mr.j CreditAttribution: mr.j commented
Our site currently 301s any non-static content on the CDN subdomain to the proper domain. In addition, we have the canonical link tag and the custom robots.txt! The reason for all this is that Google somehow indexed pages on our CDN subdomain several months ago. We never linked to it to our knowledge, but in any case it would be trivial for someone else to link to it maliciously if this were a known search-ranking attack strategy. Despite all these countermeasures, the pages are still indexed by Google now. So avoiding the situation is key; otherwise it could be a long time before it is cleaned up, and who knows whether Google penalises sites for getting into this situation.
Personally, I prefer handling the 301 with .htaccess, as you avoid processing anything in Drupal code, which is costly by comparison. After all, it's 3 lines in .htaccess vs. another module. But I see no harm in having the module for those who can't or don't want to fiddle with .htaccess. I do think that something should be there, and from my own experience I would recommend that it be on by default.
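The "3 lines in .htaccess" weren't posted in this thread, but a host-based 301 of non-static requests along those lines could look like this; the hostnames and extension list are illustrative:

```apache
RewriteCond %{HTTP_HOST} ^cdn\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(gif|png|jpg|jpeg|css|js|ico)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
```

Note that, as discussed earlier in the thread, this only works when the CDN-facing hostname actually reaches your own Apache server.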
Comment #62
Wim Leers
I'd actually prefer a .htaccess generator over a module as well.
It's interesting to see that you *have* seen the adverse effects! Thanks for sharing!
Comment #63
SilviuChingaru CreditAttribution: SilviuChingaru commented
I have the following setup:
http://www.example.com (default site address)
http://static1.example.com (fake cdn with the same webroot as http://www.example.com)
http://static2.example.com (fake cdn with the same webroot as http://www.example.com)
In .htaccess (which is shared by all three sites) I have the following lines:
But it is not working as it should: after %{REQUEST_FILENAME} is rewritten to index.php, the [INTERNAL REDIRECT] runs the first check again and redirects to the main domain. Does anyone know how I could solve this?
The rewrite log looks like this:
Comment #64
doubouil CreditAttribution: doubouil commented
My 2 cents on this issue.
Config :
www.[domain].net (main website)
cdn-img.[domain].net (img cdn)
cdn-js.[domain].net (js/json CDN)
cdn-css.[domain].net (css CDN)
cdn-files.[domain].net (other stuff CDN)
hotlink-img.[domain].net (beautiful URL for sharing img)
All are fake CDNs routed to the same base directory (hotlink-img points to sites/[site]/files).
In settings.php
Any time a PHP page should be generated, it basically checks the subdomain to see if it's the main website. Requesting a file directly is not part of a PHP process, so the request is served as usual (the file if present, a 404 if not).
The 403 header keeps Google from indexing duplicate content, and the array("cdn", "hotlink") still allows subdomains to work if they don't have their own settings.php. Also, setting a base_url in settings.php gave me a hint for that trick: I had a maintenance page which worked at cdn-js.[domain].net, but any time I tried a different page (e.g. cdn-js.[domain].net/user) it kept redirecting me to www.[domain].net/user.
Am I missing something, or is this a viable solution? I'm not convinced by the "code in settings.php" part; should it be in hook_init/hook_boot?
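doubouil's settings.php snippet was lost from this thread, but the approach described (refuse to generate Drupal pages on the CDN subdomains while letting direct file requests through) might be sketched like this; the subdomain prefixes are assumptions taken from the setup listed above:

```php
// Sketch for settings.php: refuse to generate Drupal pages on the
// CDN-facing subdomains. Static files never reach PHP, so they are
// served as usual. The prefix list is illustrative.
$parts = explode('.', $_SERVER['HTTP_HOST']);
$prefix = $parts[0];
if (in_array($prefix, array('cdn-img', 'cdn-js', 'cdn-css', 'cdn-files', 'hotlink-img'))) {
  header($_SERVER['SERVER_PROTOCOL'] . ' 403 Forbidden');
  exit;
}
```

Putting this in settings.php means it runs before any Drupal bootstrap work, which is cheap; moving it to hook_boot would make it configurable but slightly more expensive.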
Comment #65
philsward CreditAttribution: philsward commented
My rewrite rules in comment #40 work fine for D6, but after upgrading to D7, I've run into issues...
If an image style has not been generated for a given page (ie a thumbnail), then the system won't generate it. After the image has been generated on the backend, everything works as expected with the re-writing.
Because of the way D7 now handles images, if you apply a redirect to images before the resized style has been generated, you will receive:
Notice: Use of undefined constant Y - assumed 'Y' in eval() (line 2 of /home/userdir/public_html/h/somesite.com/modules/php/php.module(80) : eval()'d code).
Comment #66
superfedya CreditAttribution: superfedya commented
With Apache, the problem was resolved like this:
Any idea how to rewrite this config to Nginx?
Thanks
Comment #67
federico CreditAttribution: federico commented
Hi,
I tried to use the module submitted in #57, but I couldn't add cdn.example.com to the blacklist; I got this error:
I'm using D6 and CDN 6.x-2.5
I also tried to put
in .htaccess, but when I visit http://cdn.example.com/samplepath I'm redirected to: http://www.example.com/index.php?q=samplepath
I'll appreciate your help.
Comment #68
jemond CreditAttribution: jemond commented
Attached is a quick-and-dirty port of the cdn_seo module from #57 to Drupal 7, and the Drupal 6 version with a bug fix for #67. The Drupal 7 version includes the fix as well. My initial tests did not work on my site, but my hosting platform is somewhat non-standard (Pantheon). I will debug further the week of 7/7 and report back (I am on vacation this coming week).
Comment #69
jemond CreditAttribution: jemond commented
After more debugging, I don't think using the host to match the incoming request is the best approach. It's unreliable (it didn't work for me at all on Pantheon). As stated earlier, I think User-Agent detection is the way to go.
Below is the snippet I used to fix this problem. I dropped it into settings.php and confirmed it fixes the issue on my live site that uses Amazon Cloudfront. I'm using a 301 instead of a 404 because this tells requesters what the "correct" URL they are trying to reach is. If it happens to be a bad link on the main site, then the main site will return a 404 (as it should).
(D.o is messing up the header line by making it a link.)
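jemond's actual snippet did not survive in this thread. A sketch of the approach he describes (a User-Agent check in settings.php that 301s CDN-crawler requests to the canonical domain) might look like this; the user-agent string and canonical host are assumptions for illustration, not jemond's code:

```php
// Sketch for settings.php: if the request comes from a CDN's origin-pull
// crawler (detected via User-Agent), 301 it to the canonical domain so
// the CDN caches a redirect instead of duplicate HTML.
// The user-agent list and canonical host are illustrative.
$cdn_user_agents = array('Amazon CloudFront');
if (isset($_SERVER['HTTP_USER_AGENT'])
    && in_array($_SERVER['HTTP_USER_AGENT'], $cdn_user_agents)) {
  header('Location: http://www.example.com' . $_SERVER['REQUEST_URI'], TRUE, 301);
  exit;
}
```

The isset() guard matters: as noted later in the thread (#91), requests without a User-Agent header are legitimate and would otherwise trigger PHP warnings.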
I'm willing to roll a patch for the D7 branch if there aren't any objections to this approach. If that's the case, I think there are three to-do items:
1. Find the user agents for other CDNs
2. Make the user-agent check a partial match instead of an exact match
3. Track the list of user-agents in the config screen for the module, with a good default of as full a list as we can find of user-agents
Any concerns with this approach?
Comment #71
jemond CreditAttribution: jemond commented
D7 dev patch attached for #69. I still need to expand the default list of user-agents, but I've deployed this live and it's working. After some thought, since we allow the user to enter the user-agent list that causes redirection, I think a partial match would be too fragile, as it could produce false positives.
Comment #72
drupalsim91 CreditAttribution: drupalsim91 commented
I have the same problem: if the image style does not exist, I get a 404 error. Please resolve this.
Comment #73
Wim Leers
You're essentially taking the path I was taking in #50.
User-Agent detection is always fragile (this is also what Aaron Peters said). But then again, it *is* the easiest route to solving this problem. The major advantage is that it doesn't require a separate domain, which the typical recommended solution does (i.e. have a separate CDN-facing cdnorigin.yoursite.com and visitor-facing yoursite.com.).
This solution may not be as complete as the one set forth by mikeytown2 and continued by me in #57, but I think it's better nevertheless, because it requires less configuration.
The one thing I would add is support for rel="canonical" HTTP headers: http://googlewebmastercentral.blogspot.be/2011/06/supporting-relcanonica...
So, in summary:
- yes, I'll commit your work (though it should meet Drupal's coding standards first)
- I'd like to see support for the aforementioned headers as well — if you don't have time for this, @jemond, then I'll do that :)
Thank you very much!
Comment #74
jemond CreditAttribution: jemond commented
Thanks for the review, @Wim.
1. I had no idea about the rel="canonical" HTTP headers. Cool! I will add them.
2. I will fix what I goofed in terms of coding standards.
I will get working on an updated patch. I might have it ready later today, or early next week.
Comment #75
jemond CreditAttribution: jemond commented
Attached is a new patch that adds the rel="canonical" header to the redirects. I also added some line wrapping to comply with Drupal standards. I'm not certain which additional coding standards I've goofed, so any guidance would be appreciated; I can follow up with an updated patch.
I've deployed this patch live and verified it's working. Here are the response headers for:
http://assets1.zujava.com/how-to-get-your-kid-exercising-after-school
Comment #76
jemond CreditAttribution: jemond commented
Comment #77
Wim Leers
Rerolled patch. Addressed issues:
getallheaders was not necessary at all.
It goes without saying that you will be credited as well, @jemond. You have been instrumental in pushing this issue forward. Thanks *so* much for your help :) I just pushed it the extra mile to ensure it won't break anything. I'm looking forward to your feedback!
This functionality will go into the 2.7 release, the 2.6 release will contain bugfixes only.
Comment #78
jemond CreditAttribution: jemond commented
Thanks for the clean-up, Wim! This looks good.
This is queued up on test now and set for deployment to live tomorrow. I will update next week if there are any problems in production.
Comment #79
Wim Leers
Thanks :)
Comment #80
jemond CreditAttribution: jemond commented
This has been in production since 9/27, and I can confirm it's redirecting main-page requests in an SEO-friendly way. Looks good. No problems.
Comment #81
Wim Leers
Perfect :) This will go in version 2.7 then :)
Comment #82
ianthomas_uk
Great news, and thanks to Wim for pointing me to this when I was doing a bit of housekeeping on another issue.
Is there a rough release date for 2.7? Or a list of issues blocking it? If it's going to be a while I'll apply the patch manually, but I'd prefer to use the released version if possible.
Comment #83
Wim Leers
2.7 will only contain this patch.
2.6 will be a bugfix release, and besides the many bugfixes already committed, I'd also like to see #1719568: CDN URLs are not properly encoded in some edge cases and #1790348: Far-Future mode: background images and font files referenced in CSS files incorrectly rewritten, but only in an Omega subtheme fixed before releasing 2.6.
Comment #84
ianthomas_uk

The patch from #77 no longer applies to 7.x-2.x, so here is a slightly amended version. The changes should be identical; it's just the lines around those changes that differ.

Setting to "needs review" so the patch gets tested. Please set it back to RTBC if the test succeeds and you agree that my patch is the same as #77.
Comment #85
Wim Leers

Woot, thanks @ianthomas_uk :)
Comment #86
pierrot CreditAttribution: pierrot commented

The D6 module didn't work for me. I implemented the #69 solution, which worked like a charm. Only the user-agent check did the trick.

My setup: Pressflow 6 + Varnish + origin pull + ImageCache + CloudFront.
Comment #87
ianthomas_uk

Please never change the status to "Closed (fixed)". If you resolve a bug, it needs to be marked "Fixed", and the system will mark it closed after two weeks. Only do that if you've committed a fix to the correct repository. You've worked around the issue on your own copy, which uses a different version of Drupal; that won't help new people installing the D7 version, which is what this bug is about (although the fix may get backported).
Comment #88
pierrot CreditAttribution: pierrot commented

My bad! Thanks for the explanation.
Comment #89
Wim Leers

With the 2.6 release finally out the door, I'm going to commit this soon and ship it as part of the 2.7 release. I first want to make sure that the 2.6 release is indeed a sufficiently stable (bugfix) release.
Somewhat related: http://www.cdnplanet.com/blog/how-protect-your-cdn-origin-server/
Comment #90
Anonymous (not verified) CreditAttribution: Anonymous commented

The patch applies cleanly to the 7.x-2.x branch, but it fails if you include it in a drush make file.

This is the same patch as in comment #84, updated to apply cleanly on the 7.x-2.x branch.
Comment #91
Anonymous (not verified) CreditAttribution: Anonymous commented

I was getting PHP warnings about $_SERVER['HTTP_USER_AGENT'] not being set. Apparently this is a legitimate situation, so I've added a condition to check whether this key exists in the array.

The attached patch is the merge of #90 and the patch below.
Comment #92
asb CreditAttribution: asb commented

After spending over an hour reading through this 3.5-year-old issue, I'm under the impression that the initial problems (possible duplicate content penalties and a generally unprofessional site setup) are still unsolved. Or have I missed something?

I'm running Pressflow with cdn-6.x-2.x-dev, where the patches from #90 and #91 appear not to have been backported, and the whole site is accessible on all sharded subdomains (static.*, js.*, css.*).

If the 'cdn' module cannot provide a solution to the problems it creates: what have other users done to prevent messed-up search engine results and rogue bots crawling four sites instead of one?
Comment #93
standingtall CreditAttribution: standingtall commented

@asb

CDN is an SEO-risky module. The best solution I found was to claim cdn.mydomain.com in Google Webmaster Tools and then remove all the URLs.
Comment #94
asb CreditAttribution: asb commented

Yes, search engine penalties are one problem. Another problem is bots and crawlers, which currently are responsible for about 60% of all web traffic, as a news magazine reported a couple of weeks ago. Judging from our logfiles, these numbers appear plausible: we have sh*tloads of bots on our servers, and they cause significant load (and mostly don't honor robots.txt).

With some Varnish logic it's possible at least to blacklist the most offending bots; however, we cannot block Googlebot, Msnbot, Slurp and the other legitimate crawlers just to prevent indexing of the sharded domains (and yes, they find these domains). The additional load is a major problem since the sites we operate average 100k pages, so the crawlers create a lot of traffic and cause significant load on the web heads.
Comment #95
mr.j CreditAttribution: mr.j commented

I posted a few times earlier in this thread. At the time we had a lot of pages on our CDN subdomain indexed in Google as duplicates. These days they have all disappeared, so I must have been successful in cleaning it all up. I don't believe we are using any of the extra modules or patches proposed in this thread; rather, it is all done in .htaccess or server config to avoid hitting the Drupal bootstrap as much as possible.
Bearing in mind we use an Edgecast origin pull CDN and we're using Drupal 6.x, what I did was:
1. Create a subdomain specifically for the origin pull CDN to use: static.example.com. Never link to that subdomain anywhere or put it in your Drupal settings. The only place you use it is in your CDN server settings. i.e. tell it to pull content from your site using that subdomain. If you are really paranoid you could use a password generator to make up a 30 character random string for the subdomain and use that instead.
2. Assign a subdomain that will actually be published for static files in your Drupal site's HTML, i.e. cdn.example.com, and set up a CNAME record so that requests to that subdomain are passed on to your CDN server.
3. Set up the Drupal CDN module to use that second subdomain for static content when it is published. i.e. cdn.example.com
4. Redirect any requests for non-static files on the subdomain you created in step 1 to www.example.com. This is what our .htaccess looks like:
5. Set up canonical URLs using the nodewords module just in case, so that all our published pages are indexed on the www subdomain.
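The .htaccess block mr.j refers to in step 4 wasn't preserved in this thread. A hedged reconstruction of what such a rule typically looks like, with all hostnames and the extension list as placeholder assumptions:

```apache
# On the origin-pull subdomain from step 1, let static assets through
# and 301 everything else back to the canonical www host.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^static\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(gif|png|jpe?g|css|js|ico|swf)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
```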
The end result is this:
1. Your pages are served on www.example.com with static files served from cdn.example.com via the CDN module for Drupal.
2. Requests to cdn.example.com are routed to our CDN, which forwards the requests through to static.example.com.
3. If the request is for an allowed static file, it gets returned and cached by the CDN.
4. If it isn't, the request is 301'd back to the www subdomain.
Looking back in hindsight, we could possibly have gotten away without the intermediary subdomain (static.example.com) if we could have set up an .htaccess rule to determine whether a request is being made by our CDN server (perhaps by looking at the request headers), but it's working now, so there's no point in me messing with it.
Comment #96
asb CreditAttribution: asb commented

@mr.j: Thanks a lot for your highly instructive reply! I believe your approach might be a useful workaround until the 'cdn' module includes bangpound's patches.

This will probably be futile, though. If a subdomain is registered in DNS, it's public; anyone can query for CNAME or A records. Additionally, I suspect that nowadays bots don't even bother to look up DNS records anymore, they just guess. At least we had lots of requests for domains like vvv.example.com or wwww.example.com in our logfiles. Solution: a) don't use wildcards, neither in DNS nor in the webserver config; b) redirect all unwanted requests to something useful, e.g.:
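The example that originally followed here wasn't preserved. A minimal sketch of the idea, with example.com and the allowed subdomains as placeholder assumptions:

```apache
# Hypothetical example: any host we don't explicitly serve
# (vvv.example.com, wwww.example.com, ...) gets a 301 to www.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^(www|cdn|static)\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
```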
I adapted this, and it appears to work after modifying our homegrown hotlink protection:
…at least mostly (our sharded subdomains still have cookies for whatever reason).
Good pointer, I missed that.
Thanks again!
Comment #97
asb CreditAttribution: asb commented

This issue might be more complex than I assumed. The .htaccess redirects suggested in the previous post actually slow page loading down. When inspecting the network traffic with Dragonfly (Opera's incarnation of Firefox's Firebug) or the YSlow plugin, the diagram shows a high number of 301 redirects within the page (those pointing to resources that are supposed to be managed by the 'cdn' module).

The duplicate content issue is also discussed in #981148: Image Cache + Origin Pull mode + CNAME subdomains; in #8 there, mikeytown2 suggests another set of rewrite rules. It might be interesting to compare them with the suggestions from #95 in this issue. For the time being, I have disabled the rewrite rules.
Comment #98
mr.j CreditAttribution: mr.j commented

We actually have wildcard subdomains enabled, so no one knows what the CDN's specific static subdomain is. But you are right that if you don't have wildcards, you're making the subdomain visible through CNAME records. The first rule in our .htaccess above makes sure that only the subdomains we want to use are allowed and everything else is redirected to www.
I don't see the same 301 issue you describe. Once the CDN caches the static content, it is served straight from the CDN's public subdomain (cdn.example.com) without a 301.
Comment #99
asb CreditAttribution: asb commented

I'm currently testing those redirects in .htaccess. This ugly redirect monster is supposed to a) avoid accesses without a subdomain (to enable cookieless subdomains), b) handle language-specific subdomains, and c) redirect requests for non-static content on the sharded static subdomains to the main subdomain.

Someone fluent in regex could definitely do this more efficiently, but for now it appears to work.
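asb's actual rules were not preserved in this thread. A hedged sketch of the three goals (a, b, c) described above, with example.com and the subdomain names as placeholder assumptions:

```apache
RewriteEngine On

# a) Bare domain -> www, so the sharded subdomains can stay cookieless.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]

# b) Language-specific subdomains (here de/fr, purely illustrative)
# are left alone so they can serve their own content.
RewriteCond %{HTTP_HOST} ^(de|fr)\.example\.com$ [NC]
RewriteRule ^ - [L]

# c) Non-static requests on the sharded static subdomains -> www.
RewriteCond %{HTTP_HOST} ^(static|js|css)\.example\.com$ [NC]
RewriteCond %{REQUEST_URI} !\.(gif|png|jpe?g|css|js|ico)$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [L,R=301]
```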
Comment #100
mikeytown2 CreditAttribution: mikeytown2 commented

Commenting what each rule does is a good idea.

There might be a way to combine the cdn/js/css rules into one, but keeping it simple is usually a good idea; that way, when you look at it again in three months, you know what's going on.
Comment #101
asb CreditAttribution: asb commented

@mikeytown2: Actually, I removed my (longish) comments from the code block in an attempt to make it look smaller (and thus easier to read). Wrong assumption ;)

I asked the experts over at http://stackoverflow.com/questions/24705187/different-redirects-based-on... for advice on whether my approach even makes sense (and borrowed your concise comments for that, thanks!)
Comment #102
kgil CreditAttribution: kgil commented

Amazon CloudFront, Akamai...
Apache, Nginx, Varnish...
http://www.cliip.net/20140705/0101/drupal-cdn-module-and-seo-avoid-dupli...
Comment #103
asb CreditAttribution: asb commented

@kgil: That approach won't work with sharded subdomains for those having to migrate from the 'parallels' module to 'CDN'.

… and my redirect attempts from #99 are crap; they do not work properly. The 'CDN' module plus sharded subdomains and 'AdvAgg' simply re-introduce the same issues we had before, when there was no 'AdvAgg'. And it causes new SEO issues as a free bonus.
Comment #104
Wim Leers

Quoting myself in #89:
Sadly, that then never happened — I got too busy with Drupal 8! Sorry, all :( This was probably the most important missing thing in the CDN module.
After #91, the discussion has been solely about stop-gap measures AFAICT, so there's nothing new to review, really.
First, rerolling the patch to chase HEAD.
Comment #105
Wim Leers

Oops, #104 was overwriting an existing test. We don't want that, of course.
Comment #107
Wim Leers

Ugh, patchfail. Wimfail. Sorry for all the noise.
Comment #108
Wim Leers

Finally :)
Comment #110
Wim Leers

This is now live on http://wimleers.com and working splendidly :)

(Including: CDN URLs with duplicate content that are already in Google's search index now redirect to the actual site when you click them.)
Comment #112
Wim Leers

For a follow-up of this, see #2678374: Document that SEO (duplicate content prevention) causes redirect loop in combination with reverse proxy between CDN and web server; it looks like there are edge cases where this causes problems.
Comment #113
Wim Leers

Turns out there's a significant problem with this when your setup is
See #2678374-14: Document that SEO (duplicate content prevention) causes redirect loop in combination with reverse proxy between CDN and web server.