Problem/Motivation
Google is starting to index JavaScript-rendered pages.
On October 27th, Google published new guidelines asking site owners to modify their robots.txt:
For optimal rendering and indexing, our new guideline specifies that you should allow Googlebot access to the JavaScript, CSS, and image files that your pages use. This provides you optimal rendering and indexing for your site. Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings.
One of the most critical files (in Drupal 7) is /misc/jquery.once.js, which is not accessible to Google with the default robots.txt. This prevents most JavaScript code inside Drupal.behaviors from being executed.
I tested this myself and verified manually with Webmaster Tools that the CSS and JS files under /misc and /modules are blocked for Google.
This is mitigated by the fact that it only affects sites with CSS and JS aggregation off. Once aggregation is enabled, the aggregated JS and CSS live in the /files folder, which is accessible to Google. (The default rules at issue are excerpted below.)
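For reference, the directives at issue look like this in a stock Drupal 7 robots.txt (a representative excerpt of the default file, not the full contents):
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /profiles/
Disallow: /scripts/
Disallow: /themes/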
Proposed resolution
Change robots.txt file
Remaining tasks
We need to rethink the best default robots.txt.
User interface changes
none
API changes
none
Comment | File | Size | Author
---|---|---|---
#89 | fix_robots_txt_to_allow-D7-2364343-89.patch | 952 bytes | k_zoltan |
#85 | fix_robots_txt_to_allow-D7-2364343-83.patch | 578 bytes | k_zoltan |
#80 | fix_robots_txt_to_allow-D8-2364343-80.patch | 752 bytes | Neograph734 |
#34 | fix_robots_txt_to_allow-2364343-34.patch | 540 bytes | joegraduate |
#32 | fix_robots_txt_to_allow-2364343-32.patch | 540 bytes | joegraduate |
Comments
Comment #1
ksenzee commented:
According to Wikipedia, major crawlers (including Google) will respect an Allow directive. Different crawlers have different ways of determining whether the Allow or Disallow directive wins out; some go by which directive comes first in the file, and some go by which directive is longer. So we can add Allow directives for *.js and *.css inside those folders and have them respected, as long as they 1) come before the Disallow directives for the folders themselves, and 2) are longer than the directives for the folders.
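To illustrate both conditions, a minimal sketch (hypothetical rules, not the actual patch) where the Allow lines come first and each is longer than the six-character /misc/ pattern:
User-agent: *
# Both Allow patterns precede the Disallow and are longer than "/misc/",
# so crawlers using either tie-breaking rule should honor them:
Allow: /misc/*.css
Allow: /misc/*.js
Disallow: /misc/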
Comment #3
ksenzee
Comment #5
ksenzee commented:
Sorry - I keep uploading the D7 patch.
Comment #6
corbacho CreditAttribution: corbacho commented:
Sweet. Thanks for the patch; I see you also have a D7 patch ready, cool!
I also saw this related document: the table at the bottom says "undefined" when using wildcards. It's strange that there is no more "official" documentation about it.
So, I tested manually with the robots.txt tester in Google Webmaster Tools, and with the new robots.txt, paths like
core/test.css
pass fine.
I'm marking this RTBC.
Comment #7
alexpott commented:
This issue is a normal bug fix, and doesn't include any disruptive changes, so it is allowed per #2350615: [policy, no patch] What changes can be accepted during the Drupal 8 beta phase?. Committed d3516dd and pushed to 8.0.x. Thanks!
Comment #9
corbacho CreditAttribution: corbacho commented:
Thanks Alex.
I see the patch was already back-ported to D7 and uploaded in #1 and #2 (it's the same patch).
I think the patch is correct: it allows Google to access JS/CSS files in the /modules, /profiles, /misc and /themes folders. I will let someone else do the final review and mark it RTBC.
Comment #10
xjm
Comment #12
klokie CreditAttribution: klokie commented:
Patch in #5 looks good to me, and changed my GWT review from failed to "Awesome! This page is mobile-friendly." :)
Thanks!
Comment #13
sjager CreditAttribution: sjager commented:
As of today, Google is now rolling out its mobile-friendly update.
Can this be given higher priority due to the rollout?
Comment #14
TravisJohnston CreditAttribution: TravisJohnston commented:
Just to add: the patch for D7 included in #5 does help, but I noticed that if my D7 site is compressing its CSS/JS, then it fails the Google Mobile Friendly Testing Tool. If I uncompress it, then it passes....
If I add the following, it works:
Allow: /sites/all/themes/*.css$
Allow: /sites/all/themes/*.js$
Comment #15
ksenzee commented:
That's odd for a couple of reasons. First, compressed CSS/JS files normally go in your files directory, so if anything I'd expect it to work with compression on even if it failed with compression off -- not the other way around. Second, nothing in robots.txt disallows sites/all/themes, so those two rules shouldn't have any effect. Have you customized your robots.txt file, or are you using any metatag directives that might affect it? And does this apply no matter what theme you're using?
Comment #16
TravisJohnston CreditAttribution: TravisJohnston commented:
Nothing special in my theme. I thought that was strange too, because I accidentally targeted my theme css/js folders instead of the default/files folder and tested it. I noticed I screwed up, but before I made the change, the test completed and passed for mobile friendly... Maybe a fluke; so it should include the default/files css and js folders for the cached versions.
Comment #17
TravisJohnston CreditAttribution: TravisJohnston commented:
Yeah, it was a fluke; I had to add the following in order for it to work. Using the $ at the end doesn't work either.
Comment #18
pounard commented:
Raising to major for this reason: #2476547: CSS and JS cannot be crawled by robots.txt; decreases SEO (a duplicate; I didn't manage to find this one). Google is going to lower the scores of all Drupal sites that don't use core aggregation. For example, we use a complete framework using Gulp on top of our themes to completely bypass the core aggregation mechanism, but the generated files are stored in the theme itself (so in the profiles/ dir). Enough people are doing that to justify raising this issue's priority.
Comment #19
pounard commented:
I had to play a game of trial and error with the Google mobile test until it worked, and on the site I tested, the trailing $ won't work; removing it makes Google happy. Patch attached. I guess this might be due to the additional query parameter Drupal adds to CSS and JS assets.
Comment #20
damien_vancouver CreditAttribution: damien_vancouver commented:
It looks like as of July 28th, 2015, Google has started e-mailing all Webmaster Tools account contacts for sites affected by this problem (i.e. sites with JS and/or CSS aggregation disabled on their Performance page). Because it goes to all the contacts, it will be the subject of many curious e-mails from end customers in the near future.
The email has the subject "Googlebot cannot access CSS and JS files on http://example.com".
Fetch and Render showed it was misc/*.js that it was unhappy about. On the site I was alerted about, JS aggregation cannot be enabled because it breaks custom JavaScript the customer wrote in-house.
Patch from #19 fixed the problem for me, Fetch and Render in Webmaster Tools shows as "Complete" with no missing JS.
Comment #21
jp.stacey CreditAttribution: jp.stacey at Magnetic Phield commented:
The patch in #19 "works" for everything on an example site homepage except feed.png. (By "works" I mean with regard to Google Mobile Friendly testing.) If you have a view on your homepage, for example, it will try to access feed.png. So there needs to be another line:
Allow: /misc/*.png
Re-rolled and attached for comment.
Comment #22
jenlampton commented:
Adding tag for D6 as well.
Comment #23
damien_vancouver CreditAttribution: damien_vancouver commented:
I tested the patch from #21 using Mobile: Smartphone and Desktop and received no errors.
Setting back to RTBC. Also attaching D6 backport.
Comment #24
ksenzee commented:
Are we at all concerned that this regex will match files with .js or .css anywhere in the filename?
Comment #25
damien_vancouver CreditAttribution: damien_vancouver commented:
That's a good point. There are no files in the core paths other than ones ending in .js or .css, but anchoring the patterns would be tidier and consistent with Drupal 8's robots.txt.
I've re-attached patches for D7 and D6 with this change.
Comment #26
ciss CreditAttribution: ciss at yousign GmbH commented:
@damien_vancouver: According to the specs, those rules would fail for cache-breaker URLs (that is, URLs with additional query parameters). We'd need "*.js?" as well.
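A sketch of the failure mode ciss describes, assuming Drupal's usual cache-busting query string (the URL shown is hypothetical):
# A real request looks like /misc/jquery.js?v=1.4.4 once the cache
# breaker is appended. The $-anchored rule only matches URLs that
# truly end in ".js", so it misses the query-string variant:
Allow: /misc/*.js$
# Covering the cache-breaker form requires a second pattern:
Allow: /misc/*.js?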
Comment #27
joegraduate commented:
On sites using a distribution like Panopoly, which installs modules and themes within the profiles directory, I'm finding that the Google tools report problems accessing image files inside the profiles directory as well (similar to the problem identified by @jp.stacey in #21). The attached patch adds additional lines allowing images within the profiles directory. I'm also attaching an interdiff between this patch and #21.
Comment #28
joegraduate
Comment #29
damien_vancouver CreditAttribution: damien_vancouver commented:
Yes, I can confirm #25 does not work any more (see attached screenshot) with the $ at the end. Allowing the *.js? line does fix it.
We can either double the number of lines for each entry, like so (it'll be even more lines after #27's images and anything else that is missing):
or go back to the #21 style, which works in all cases (but could inadvertently match future filenames with *.js or *.css or *.png in them):
I lean towards the earlier style, and that makes the latest greatest patch #27. These folders only contain files shipped with core, so we don't need to expect any weird filenames containing .js, .css, or .png in the middle of the name.
Does this mean there is the same cache-breaker problem with Drupal 8's robots.txt? It contains:
Comment #30
joegraduate commented:
@damien_vancouver, did you mean to delete the files from #27?
Comment #31
damien_vancouver CreditAttribution: damien_vancouver commented:
No, I didn't. I refreshed my preview as I was entering my comment and the file didn't show up in the list, which I hoped wasn't a problem... but I guess drupal.org got confused and saved the file list from the tab I was typing on (from before you uploaded your comment/file). I've re-attached it to this comment. :|
Comment #32
joegraduate commented:
Shamelessly re-uploading my own patch to restore my patch/contribution credit after the first patch was deleted. :P
This is identical to #27 & #31.
Comment #33
David_Rothstein CreditAttribution: David_Rothstein as a volunteer commented:
Looks like we still need to fix this in Drupal 8, then, also. As far as I know, Drupal 8 appends a query string at the end of CSS and JS URLs, so the `/*.js$` and `/*.css$` rules committed earlier aren't going to work there either.
There also seems to be a discrepancy between Drupal 8 and the latest Drupal 7 patch here regarding whether images, and anything in the `profiles` directory, are allowed.
Comment #34
joegraduate commented:
Good point(s), @David_Rothstein.
Miscellaneous images do exist in the D8 /core/ directory, so they should probably be allowed. Also, although I don't have any public sites running custom distributions/profiles on D8 to test this with, it does seem like CSS/JS/image files would exist in the D8 /profiles/ directory if a custom distribution were being used, and Google's tools may complain about not being able to access them (as in #21 & #27).
Attached is an updated patch for 8.0.x that removes the "$" from the existing "Allow" CSS & JS patterns, adds CSS & JS patterns for the /profiles/ directory, and adds image patterns for the /core/ and /profiles/ directories.
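A hedged sketch of the shape of those rules (illustrative only; the committed contents are in the actual patch):
# Unanchored CSS/JS patterns so cache-breaker query strings still match:
Allow: /core/*.css
Allow: /core/*.js
Allow: /profiles/*.css
Allow: /profiles/*.js
# Image patterns for both directories:
Allow: /core/*.gif
Allow: /core/*.jpg
Allow: /core/*.png
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.png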
Comment #35
pounard commented:
In my opinion it's not a good idea to fix this in D8 first; D8 has almost no production sites while D7 has thousands.
Comment #36
ciss CreditAttribution: ciss at yousign GmbH commented:
For new patches we should also add the Drupal version to the patch names, as it is already starting to get confusing (#32 and earlier is D7, #34 is D8).
Comment #37
Togas CreditAttribution: Togas commented:
Rather than using *.css$ or *.css?, can't we use *.css* ?
Like:
Allow: *.css*
Allow: *.js*
or:
Allow: /misc/*.css*
Allow: /misc/*.js*
Comment #38
pounard commented:
If I'm not mistaken, *.css* would allow non-CSS stuff like my_image.css.php, for example, so I don't think it is a good idea.
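A quick illustration of why the trailing wildcard is too broad (all filenames hypothetical):
Allow: /misc/*.css*
# matches /misc/style.css          (intended)
# matches /misc/my_image.css.php   (unintended: a PHP script)
# matches /misc/tokens.cssdata     (unintended)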
Comment #39
DonofGor CreditAttribution: DonofGor commented:
My tuppence on this matter. We tried to use the following with no success:
We were referencing files in sub-folders of the modules and misc folders. The problem in our case was that the generic rules had Disallows in place for /modules/ and /misc/.
Since these are longer in length, they trump the Allows. Go figure.
You can get around the issue in two ways:
1. Set up a specific user-agent group for Googlebot with only the following rules; but this will make Googlebot disregard the * user-agent group, since a crawler only follows one user-agent group, e.g.
or
2. Keep all the Disallows that are set up by default and make the Allow longer than the Disallow. Since the Allow is longer than the Disallow, it trumps it (see the sketch below), e.g.
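DonofGor's own example blocks did not survive on this page; what follows is a hedged reconstruction of the idea behind approach 2, using hypothetical paths:
User-agent: *
Disallow: /modules/
Disallow: /misc/
# Each Allow below is longer than the Disallow it overrides, so under
# the longest-match rule it wins for the matching files:
Allow: /modules/*.css
Allow: /modules/*.js
Allow: /misc/*.css
Allow: /misc/*.js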
Comment #40
joegraduate
Comment #41
droplet CreditAttribution: droplet commented:
I think we should reconsider the main reason why Drupal blocks these directories in robots.txt in the first place:
- Security?
- SEO?
- Performance?
- ???
Comment #42
pounard commented:
- Consistency? These directories are not supposed to contain any displayable content, so it's rather pointless to let any bot try to crawl them.
In my opinion, the simplest patch that makes it work with Google should be committed ASAP, because a lot of production sites are suffering from this; a broader discussion can be opened after the critical issue is fixed for production sites.
Comment #43
droplet CreditAttribution: droplet commented:
Yes!!
Comment #46
joegraduate commented:
Restoring useful issue summary.
Comment #47
criz commented:
The last patch (#43) works for the core folder, but not for the profiles folder (tested in Google Webmaster Tools).
Why? Because "/profiles/" is longer than the 7 characters of "/*/*.js"; "/core/" is not (see the illustration below).
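Spelled out with pattern lengths (an illustration of the point above, using hypothetical paths, not patch contents):
Disallow: /core/       # pattern length 6
Disallow: /profiles/   # pattern length 10
Allow: /*/*.js         # pattern length 7
# /core/misc/drupal.js:      Allow (7) beats Disallow (6)  -> crawlable
# /profiles/standard/foo.js: Disallow (10) beats Allow (7) -> blocked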
Why do we disallow /profiles and not /modules? I guess this is a leftover from Drupal 7, where the profiles folder contained core profiles, right? Allowing /profiles fixes our issue and we can go with these generic rules. Patch attached.
If we want to continue disallowing /profiles, we need to go back and use the patch from #34.
Comment #48
pounard commented:
@#47: last time I checked, Google was unhappy when adding $ at the end of lines. Are you sure this patch works as expected, or did Google change?
My use case was a site with Gulp-generated CSS and JS files stored under the profiles folder; we append a query string to every generated CSS and JS file (Drupal actually does this by itself) in order to force browsers to refresh those files when we regenerate them. Under those specific circumstances, adding $ wouldn't work for me.
Comment #49
droplet CreditAttribution: droplet commented:
This is interesting.
@#48: the patch removed the rules ending with `$`.
Comment #50
pounard commented:
Oh right, I misread the patch, sorry for the noise.
Comment #51
ciss CreditAttribution: ciss at yousign GmbH commented:
Using .css and .js without an ending delimiter is lazy and error-prone (e.g. this might match .json as well, as mentioned above). Instead we should match "*.js?" as well as "*.js$", as has been done in an earlier patch.
I don't think we should add any additional extensions. If we started adding those, we'd have to account for practically any extension that may appear in a subdirectory.
A robots.txt primer
Keep in mind that robots.txt is only a de-facto standard, with wildly varying implementations. There are basically three flavors:
(I've determined the guidelines for 3. via various tests, so I can't point to any reference for them. Test these via Google's robots.txt tester and you should come to the same conclusions.)
Comment #52
droplet CreditAttribution: droplet commented:
`$` isn't supported in the robots.txt standard, not even by Google. And .js won't match .json.
Comment #53
criz commented:
@droplet: At least Googlebot supports "$". And "*.js" does match "*.json" in this case. https://developers.google.com/webmasters/control-crawl-index/docs/robots...
Comment #54
droplet CreditAttribution: droplet commented:
Funnily enough, Google's robots.txt tester gives a different result.
Whatever; even if Google indexes .json, I don't think it's a problem. But if Google blocked .js?parameters, that would be a concern.
Comment #55
criz commented:
Here is a patch that blocks .json files in those folders.
But since composer.json in the root is not blocked so far either, and it is practicable to index .json files in some cases (http://searchengineland.com/ajax-killing-crawl-budget-226487), I guess the idea was to get a patch in as soon as possible that fixes at least the Google Webmaster Tools warnings.
@ciss
All three flavours would work with the current patch, right? I don't think we have to worry about robots that don't interpret Allow failing to index CSS and JS files in the core folder.
I can confirm your experience with the "Google flavour", but have not tested your last point (wildcards can be chained to increase specificity). Do you think we can rely on this? At least your first two guidelines are documented here: https://developers.google.com/webmasters/control-crawl-index/docs/robots...
About your 2nd concern:
We have to add these extensions if we want to fix the issue. Any alternatives?
Is there consensus that we can allow the "/profiles" folder?
Comment #56
ciss CreditAttribution: ciss at yousign GmbH commented:
Bing also supports "*" and "$" wildcards. I've read somewhere that Yahoo mainly uses Bing's index nowadays.
But to get back on track: may I suggest that we add the rules for *.js and *.css explicitly for Googlebot? That way we prevent confusion and possible incompatibilities with other crawlers.
Comment #57
ciss CreditAttribution: ciss at yousign GmbH commented:
By the way, here's what we're now using in our robots.txt. Google's robots.txt tester likes it. We'll have to wait and see if Googlebot itself likes it as well:
Edit: bad example. See the following comment.
Comment #58
criz commented:
@ciss
If I read this right, we would have to duplicate all rules for Googlebot:
https://developers.google.com/webmasters/control-crawl-index/docs/robots...
A quick test in Google Webmaster Tools confirms this (see the sketch below).
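An illustration of that behavior (hypothetical rules): a crawler obeys only the most specific matching user-agent group and ignores all others, so every generic rule has to be repeated in the Googlebot group:
User-agent: *
Disallow: /admin/
Disallow: /core/

User-agent: Googlebot
# Googlebot reads ONLY this group, so the generic Disallows
# must be duplicated here alongside the new Allow rules:
Disallow: /admin/
Disallow: /core/
Allow: /core/*.css
Allow: /core/*.js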
Comment #59
ciss CreditAttribution: ciss at yousign GmbH commented:
@criz: You're absolutely right, my bad. I should have tested/read more thoroughly.
Comment #60
erykolryko CreditAttribution: erykolryko commented:
Using #55.
Comment #62
rkent_87 CreditAttribution: rkent_87 commented:
Is the patch in #55 applicable to D7?
Comment #63
aken.niels@gmail.com commented:
@rkent_87: no, it is not. That patch was specifically created for D8, hence the /core folder, which is missing in D7.
Comment #64
droplet CreditAttribution: droplet commented:
I think and hope we can address this issue before the D8 release.
http://googlewebmastercentral.blogspot.hk/2015/10/deprecating-our-ajax-c...
Comment #65
Fabianx CreditAttribution: Fabianx at Tag1 Consulting commented
Comment #67
Neograph734 commented:
Updating the title to also include the fact that image files (png & gif) are blocked, as indicated earlier. And I'd like to emphasize this is becoming a big problem for existing D7 sites as well.
Comment #68
Neograph734 commented:
Re-assigning Drupal version.
Comment #69
cilefen CreditAttribution: cilefen commented:
Is this issue finished for D8 and back-portable to D7 now?
Comment #70
droplet CreditAttribution: droplet commented:
No! No commits in the last 7 months. Why 8.1?
Comment #71
cilefen CreditAttribution: cilefen commented:
@droplet I was just wondering if this issue is converging on a solution.
Comment #72
Neograph734 commented:
The last commit, from #66, doesn't make much sense, as Drupal appends a query string to all CSS and JS files (at least all that are in the /core/ folder). Currently all CSS and JS files from /core/ are still getting blocked.
@criz in #55: after taking a closer look at your patch, wouldn't it make more sense to keep disallowing the /profiles/ folder but to explicitly allow access to any CSS, JS or image file in there? @joegraduate did the same in #34.
As for @pounard in #38, I suppose it would be better to explicitly allow only 'real' CSS files and no files with '.css' elsewhere in the file name.
Also, there was this line in the latest patch, which I supposed was an error:
The combination of all comments and opinions has been bundled into the attached patch (sketched below).
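A hedged sketch of what the bundled approach could look like (not the literal patch; the exact lines are in the attachment): keep the folder disallowed but allow its front-end assets, relying on the longer Allow patterns winning under longest-match crawlers:
Disallow: /profiles/
# Each Allow pattern is longer than "/profiles/", so it overrides
# the Disallow for the matching asset types:
Allow: /profiles/*.css
Allow: /profiles/*.js
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.png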
Comment #73
criz commented:
@Neograph734: Sounds good, but I am not sure why /profiles should be disallowed and not /modules, /themes, /vendor or /sites. We should keep it simple. I would vote for just disallowing /core (or even allowing /core). See for example what Yoast recommends for WordPress: https://yoast.com/wordpress-robots-txt-example/
The article also shows once again why it is important to fix this issue soon:
And just a note: composer.json and composer.lock don't need to be blocked in robots.txt: #2392153: Disallow composer.json and composer.lock from being indexed
The attached patch is based on the patch from #72. The only change is that /profiles is not disallowed.
Comment #74
cilefen CreditAttribution: cilefen commented:
@criz Why did you change this to 8.1.x-dev? I am honestly just wondering. It is a major bug and therefore fixable in 8.0.x.
Comment #75
criz commented:
@cilefen: You're right. Sorry. Setting back to 8.0.x.
Comment #76
Neograph734 commented:
@criz, I've tried basically all combinations I could come up with and it seems good.
Comment #77
catch commented:
Please split the /profiles change out into a separate issue. It's very likely that /modules and /themes got missed when we added those as base-level folders, rather than that we forgot to take /profiles out.
Also, could someone run the patched robots.txt through a validator to make sure it passes?
Comment #78
droplet CreditAttribution: droplet commented:
@catch,
Any hints as to why profiles should be a separate issue? Then we can focus the discussion on those points.
Comment #79
catch commented:
@droplet: because it's not about CSS, JavaScript or image files, and I think it needs a separate decision whether we add modules/themes or drop profiles.
Comment #80
Neograph734 commented:
Re-attaching the patch from #72, which still included /profiles.
Validations
http://www.lxrmarketplace.com/robots-txt-validator-tool.html :
Since this issue is mainly due to Google, I'd assume we are good. The order is in the most compatible form for other crawlers.
Google robots.txt tester:
No errors or warnings given.
Page lookups
Yandex Webmaster (https://webmaster.yandex.com/robots.xml):
Tried some of them in Google Webmaster Tools successfully as well, but this allowed me to run multiple URLs at once.
Comment #81
criz commented:
Okay, let's add another issue to clean up robots.txt after fixing this one. :)
Tested patch from #80 in Google Search Console and all looks good.
Comment #82
catch commented:
Committed/pushed to 8.1.x and cherry-picked to 8.0.x. Thanks! Moving to 7.x for backport.
Comment #85
k_zoltan CreditAttribution: k_zoltan at Cylex commented:
Patch for Drupal 7. A similar solution to the one for D8.
Comment #86
Neograph734 commented:
I suppose the themes/ and profiles/ folders should be included as well. Those can also contain front-end resources Google might want to crawl.
Comment #87
catch commented:
@Neograph734 as discussed let's do that in a separate issue (which could also be backported to 7.x).
Comment #88
Neograph734 commented:
@catch, I realize that. The comment was intended for k_zoltan in #85. I suppose we'd want to allow resources in profiles/ and themes/ at first, and then later we can decide to remove them entirely in the other issue.
The current patch (#85) only allows resources in modules/ and misc/, which is only half a solution for D7 sites.
Follow-up issue for discussion on the removal of the profiles folder: #2677708: Remove profiles folder from robots.txt.
Comment #89
k_zoltan CreditAttribution: k_zoltan at Cylex commented:
Here is the new patch, based on the recommendations of @Neograph734.
Comment #90
Neograph734 commented:
Thanks! I guess this should work. I can test later today.
Comment #91
Neograph734 commented:
Entered some paths into the Google robots.txt tester and everything looks fine to me.
Comment #92
Neograph734 commented:
Marking this as RTBC; all my previously blocked resources are now available to Google and pass the robots.txt validation. (They still appear on the Blocked Resources page, but the last occurrence date is a few days ago and I suppose they will disappear automatically.)
Comment #93
xjm
Comment #94
Shin-en CreditAttribution: Shin-en commented:
Google Search Console reacts positively to patch #89. It still has a problem with icons and fonts. For that reason I would recommend whitelisting additional extensions (see the sketch after this list):
.svg
.ttf
.woff
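A minimal sketch of what such additions might look like for the D7 directories (illustrative only; the follow-up issue would decide the final form):
Allow: /misc/*.svg
Allow: /misc/*.ttf
Allow: /misc/*.woff
Allow: /modules/*.svg
Allow: /modules/*.ttf
Allow: /modules/*.woff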
Comment #95
Neograph734 commented:
@Shin-en, I guess this is better reported in #2677708: Remove profiles folder from robots.txt. That issue is supposed to be about whitelisting the `profiles/` folder in the D8 robots.txt, and perhaps backporting that to D7 along with the `modules/` and `themes/` folders. After that, this patch is probably no longer required.
Comment #96
Fabianx CreditAttribution: Fabianx at Tag1 Consulting commented:
We need to, at a minimum:
- Remove /profiles and discuss in #2677708: Remove profiles folder from robots.txt
- Ensure that we only list the resources actually present in misc/, themes/ and modules/.
I don't think misc has image files, for example.
Comment #97
Neograph734 commented:
Fabianx, please have a look at the misc folder; you'll see there are various png and gif images there. Most of them are only used for logged-in user interactions, but the menu bullets and forum icons are visible to anonymous users, and as such impact search engines.
As for the modules and themes folders, we do not know what to expect there, as people can insert their own files. In Drupal 8 these folders have been removed entirely from robots.txt, and perhaps the same should be done for 7. (But that could be a separate discussion.)
Maybe this decision was postponed for Drupal 8 (as requested by catch in #77 and #87) and the items in profiles were put on the allowed list, but as mentioned in #94, that does not include everything. I suppose we could do the same for Drupal 7, and always backport #2677708: Remove profiles folder from robots.txt to Drupal 7 later.
The situation is that Google is attempting to screenshot the page for previews and notices it is missing resources. There is a warning generated that important page elements should be allowed and that less important elements do not matter, but it is hard to tell whether these elements result in an SEO penalty. It would be quite bad if every Drupal site out there got negative SEO points because of a blocked folder in robots.txt.
I'm setting this back to needs review, as I personally believe this patch can (and if it does impact SEO, should) be used as is.
Comment #98
Fabianx CreditAttribution: Fabianx at Tag1 Consulting commented:
You are right:
I checked the list and indeed most file types are present.
If someone is curious:
Let's remove profiles at least. We want to have the same feature set as Drupal 8.
Comment #99
Neograph734 commented:
I am not sure if I understand you correctly...
In order to mimic the same behavior as Drupal 8, we should be removing the modules and themes folders from robots.txt. Profiles is still in the Drupal 8 robots.txt, and that discussion has not yet started.
Comment #100
Fabianx CreditAttribution: Fabianx at Tag1 Consulting commented:
I misunderstood the discussion taking place for 8.x and read the wrong patch.
This is fine as is and marked for commit.
Thanks for answering my questions!
Comment #101
Neograph734 commented:
Awesome, no problem! :)
Comment #102
David_Rothstein CreditAttribution: David_Rothstein as a volunteer commented:
Committed to 7.x - thanks!