Scenario: in our environment we don't have a base_url set, since our sites need to resolve to several different URLs. We also require https for authenticated users, but redirect to http for non-authenticated users.
Linkchecker currently doesn't work well, because all links are discovered using the https:// form of the link, since that is the base_url at the time of checking. When these links are tested in cron, they all return a 302 to the http:// form of the URL, instead of either a 200 or a 404.
Has anyone approached this before?
Comments
Comment #1
hass commented
This URL comes from the url() function when the node is saved.
I'm not sure what exactly you are doing there in the background, but it sounds very complex. How do all these rules work? I'm very interested to learn which API functions you are using to implement them. Maybe we can find a way to implement a generic solution to get the public URL used by the content at node_save(). What has been implemented with url() was just the best idea I had... I'm open to your ideas.
This issue overlaps with #1489132: "Permission restrictions deny" message on many broken links when https securepages.
Comment #2
joelcollinsdc commented
Thanks for the prompt reply!
Here is our scenario... we run lots of websites in our organization, all with the same settings. We have public-facing servers that have lots of things blocked for security reasons, and internal servers that allow users to log in and add content. These two server environments have different URLs:
http://publicwebsite.example.org
http://internalwebsite.example.org
Therefore, we cannot have a static base_url specified in settings.php.
Furthermore, since users authenticate with domain passwords, we need to encrypt this traffic across the network, so internally we use securepages to redirect all authenticated user requests, as well as /user, to https. All other requests are redirected back to http.
I'm not sure I follow the logic you are describing with the url() option. I'm not sure this is a good idea, but what if linkchecker had logic to follow a 302 if the host matches some kind of regex?
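To make the 302 idea concrete, here is a minimal sketch in Python (not Drupal's PHP, and not anything linkchecker actually does). The function name `should_follow_redirect` and the host pattern are hypothetical; the sketch only shows the decision "follow the redirect if the target host matches a configured regex".

```python
import re
from urllib.parse import urljoin, urlsplit

def should_follow_redirect(status, original_url, location, host_pattern):
    """Decide whether a redirect response should be followed.

    Hypothetical helper: follows 301/302 only when the redirect target's
    host fully matches host_pattern (a regex configured by the site admin).
    """
    if status not in (301, 302):
        return False
    # A Location header may be relative, so resolve it against the original URL.
    target = urlsplit(urljoin(original_url, location))
    if target.hostname is None:
        return False
    return re.fullmatch(host_pattern, target.hostname) is not None

# Assumed pattern: any host under example.org.
PATTERN = r"(.+\.)?example\.org"
print(should_follow_redirect(
    302,
    "https://internalwebsite.example.org/node/1",
    "http://internalwebsite.example.org/node/1",
    PATTERN,
))
```

With such a check, the https-to-http redirect described above would be followed and the final 200 or 404 recorded, while redirects to unrelated hosts would still be reported as redirects.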
Comment #3
hass commented
I do not really understand the reasons for what you are doing there, but I guess you have some. But now you are learning that these unusual ideas have side effects, and I cannot work wonders. If a module cannot use url(), you must be in trouble. You must already have tons of workarounds implemented to solve path issues everywhere. I cannot believe that this really works, as node_save(), filters and caches must cause you trouble, and from time to time the internal URL may become public... possibly unseen.
I have no idea what you are talking about regarding the regex. If you'd like to check a path like "/node/1" - I cannot issue an HTTP request without a hostname and protocol. Otherwise you need to disable checking internal links. You will run into the same issues if you run a backend in a subfolder. No idea why the heck someone would do this. If you need more speed, buy a load balancer, but run everything under the same URL.
However, if you have an idea, I'm open to adding a solution that can be used in a generic way. I have no clue how you solve all the issues on your site. This must be very tricky and I cannot look behind the scenes from here.
Comment #4
joelcollinsdc commented
You are correct, there are certain things that we cannot use, such as partial page caching (block cache, etc.), but full page caching works, as does the vast majority of modules. I think that using a dynamic base URL is (or should be) a not-uncommon Drupal configuration (that's why this feature is there...), and just because most Drupal sites aren't big enough to warrant multiple URLs to access them, that doesn't mean Drupal should not work for large sites. Having different subsets of your users access your site through a variety of URLs should be quite common for websites with security practices in mind.
Anyhow, one idea would be to not add in URLs with full paths until check time. So, it would work like this...
1) Let's say there is a node with /node/1 in the URL. (Side note: I think I could make an argument that linkchecker could use a different check mechanism for all links that are obviously Drupal links...)
2) The linkchecker_check function would add the URL '/node/1' to the link database, not http://whateverurlIhappenToBeOn.example.org/node/1
3) During cron or through drush or whatever, use $base_url to figure out what the absolute URL is that needs to be tested.
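The three steps above can be sketched as follows. This is an illustration in Python, not Drupal code: `resolve_at_check_time` is a hypothetical helper, and `base_url` stands in for whatever $base_url is in effect when cron or drush runs. It only handles root-relative paths such as '/node/1'.

```python
def resolve_at_check_time(stored_link: str, base_url: str) -> str:
    """Turn a stored link into the absolute URL to test.

    Root-relative links ('/node/1') were stored without scheme or host,
    so the current base_url is prepended. Anything else (already absolute,
    or protocol-relative '//host/...') is returned unchanged.
    """
    if stored_link.startswith("/") and not stored_link.startswith("//"):
        return base_url.rstrip("/") + stored_link
    return stored_link

# The same stored row yields a different check URL per environment.
stored = "/node/1"
print(resolve_at_check_time(stored, "http://publicwebsite.example.org"))
print(resolve_at_check_time(stored, "http://internalwebsite.example.org"))
```

The point of the design is that the link table never records which hostname happened to be in use when the node was saved.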
Comment #5
hass commented
I'm running some very large sites and we don't run more than one URL except for parallel downloading and similar cases, but not for security reasons. I'm not sure what you are referring to. I'm just interested in the arguments here for better understanding. Maybe I'm missing a detail, but I personally see no way to get higher security by running more than one URL with the same modules. It's OT here, but I have heard this so often and never heard good arguments, except obscurity without a real benefit. :-)
Going forward... We also had issues in one past case where someone used http://localhost:8081/ to access cron.php. I cannot find the case, but if I rely on this URL you will always get 301/302 codes. Logically, we must have the URL at save time. I do not like to reinvent the wheel, and if you look into the url() function you will see that there are tons of conditions that can change the URL. I'm thinking of inbound/outbound URL rewriting, conditional SSL/non-SSL paths, language-based domains, paths with a language prefix and other things. This is already very complex and nobody touches this core code voluntarily. Too many kittens could be killed... If you have such requirements, you should look into the code and see whether you can find a way to alter the base URL with a custom module.
I do not think your last idea works more reliably. However, I'm still looking for a solution to this design problem, which already hurts with SSL.
Comment #6
hass commented
Just an idea... Look into https://api.drupal.org/api/drupal/modules%21system%21system.api.php/func... or try to set the base URL on POST only and switch it back after the save.
Comment #7
joelcollinsdc commented
So if linkchecker worked by storing a relative URL in linkchecker_link and making it an absolute URL when it was time to check, it would respect the dynamic base URL properties of Drupal. I think this is the correct approach.
BTW, with the way linkchecker is currently invoked, running drush -l SomeFakeUrlThatDoesntResolve.local linkchecker-check would actually work, because linkchecker is not currently using a dynamic base_url. But it shouldn't.
Comment #8
hass commented
I have no idea how this could be possible, as linkchecker has no idea about your dynamic base URLs.
Comment #9
joelcollinsdc commented
So, a dynamic base URL basically just means "use whatever the base URL is from the current request".
These steps do exactly that...
1) Let's say there is a node with /node/1 in the URL. (Side note: I think I could make an argument that linkchecker could use a different check mechanism for all links that are obviously Drupal links...)
2) The linkchecker_check function would add the URL '/node/1' to the link database, not http://whateverurlIhappenToBeOn.example.org/node/1
3) During cron or through drush or whatever, use $base_url to figure out what the absolute URL is that needs to be tested.
This just assumes people are competent enough to run drush / cron with a proper url.
Comment #10
hass commented
Relative links also mean ../../foo/bar, ./foo and so on. Others are running filters like markdown, pathologic, etc. And I have no idea whether you force the node to SSL or not. I'm not loading the node for every link check, which means I don't know the path of the node or where a relative path points unless I read the node alias and node language and consider the language switching logic, too. If you run your backend in a subfolder this will break again, and I don't know why, but some people run cron with localhost and custom ports. All of this needs to be taken into consideration.
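The point about document-relative links can be illustrated with a short Python sketch using the standard library's `urljoin`. The page paths below (a raw /node/1 path and a language-prefixed alias /en/section/page) are made-up examples; the sketch only shows that the same relative link resolves to different targets depending on the URL of the page that contains it.

```python
from urllib.parse import urljoin

# The same node reached under two different paths (hypothetical examples).
raw_path = "http://publicwebsite.example.org/node/1"
aliased_path = "http://publicwebsite.example.org/en/section/page"

relative_link = "../foo/bar"

# Resolution depends on the containing page's URL, so a checker that only
# stores the link text cannot know which target is meant.
print(urljoin(raw_path, relative_link))      # http://publicwebsite.example.org/foo/bar
print(urljoin(aliased_path, relative_link))  # http://publicwebsite.example.org/en/foo/bar
print(urljoin(raw_path, "./foo"))            # http://publicwebsite.example.org/node/foo
```

This is why storing only root-relative paths covers one case, while ../ and ./ links would still need the alias and language of the containing node.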