I just installed and set up the weblink module for a site. I was surprised to discover that there is no handling of dead links.

If a link returns a 404, or can't be found because there is no longer a DNS entry or the server is no longer there, I need to have it removed from the active list shown on the site. (It might even make sense to track failures so that a link only gets removed if it returns two consecutive errors, to avoid removing links because of minor bandwidth/routing issues.)
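The "two consecutive errors" rule could be sketched roughly as follows. This is only an illustration in Python, not code from the weblink module; the names `LinkState`, `record_check` and `threshold` are hypothetical, and it assumes that a hidden link stays hidden until an administrator restores it.

```python
from dataclasses import dataclass


@dataclass
class LinkState:
    """Per-link bookkeeping (hypothetical structure, not the module's schema)."""
    url: str
    consecutive_failures: int = 0
    visible: bool = True


def record_check(link: LinkState, ok: bool, threshold: int = 2) -> LinkState:
    """Update a link after one check; hide it after `threshold` failures in a row."""
    if ok:
        # A single success resets the streak, so isolated network
        # hiccups never add up to a removal.
        link.consecutive_failures = 0
    else:
        link.consecutive_failures += 1
        if link.consecutive_failures >= threshold:
            # Assumption: hiding is sticky; only an admin makes it visible again.
            link.visible = False
    return link
```

With the default threshold, a failure followed by a success followed by another failure leaves the link visible; only two failures in a row hide it.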

I also need those links to go into a special administrative interface, so a dead link can be removed permanently from the site, held in the admin interface for a later decision, or added back into the site's live link list.

Has such work been started by someone else? If not, I will be working on it this week.

Any advice on where to start?

Comments

km’s picture

It would be very useful to have such a feature implemented!

Automatic checking of dead links is not a trivial task. Many sites do not return proper HTTP codes or do not support the HEAD method at all. Some years ago I used a Perl script to check links. Maybe I can find it in the backups...
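One common workaround for servers without HEAD support is to retry with GET. The sketch below is in Python rather than Perl or PHP and is not the script mentioned above; `http_fetch`, `link_status`, the 10-second timeout and the set of "HEAD unsupported" status codes are all assumptions made for illustration.

```python
import urllib.error
import urllib.request

# Assumption: these codes suggest the server rejected the HEAD method itself.
HEAD_UNSUPPORTED = {405, 501}


def http_fetch(method: str, url: str, timeout: float = 10.0) -> int:
    """Send one request and return the HTTP status code.

    DNS failures and refused connections raise urllib.error.URLError,
    which the caller has to treat as a failed check.
    """
    request = urllib.request.Request(url, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is still useful.
        return err.code


def link_status(url: str, fetch=http_fetch) -> int:
    """Try the cheap HEAD request first and fall back to GET if needed."""
    status = fetch("HEAD", url)
    if status in HEAD_UNSUPPORTED:
        status = fetch("GET", url)
    return status
```

This does nothing about sites that answer 200 for missing pages; those would still need some content-based check.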

ericG’s picture

Doing the validation/link checking outside of drupal/php is a great idea.

I managed to hack together what I need in the short term last night, but working on real link checking is going to take some time. Doing it with a separate Perl script should be easier than hacking it into the module (I am relatively new to PHP and much more comfortable with Perl).

I will upload a patch of what I have so far as soon as I clean it up a bit. It uses the current error detection but writes to a new field in the weblink table. The category view now ignores items flagged to not display because of an error, and the module generates a block that lists all weblinks currently flagged to not display, with links to their admin pages. (The block's content only shows for those in groups with the "admin dead links" permission.)

With a Perl script handling the testing of the links and deciding when to mark the node to not display, it would be so much better.

Anonymous’s picture

"monitor" does this. Once you set a weblink to be monitored, it is also checked for 404s. This is achieved quite simply through an fopen() call. So it is there. I am thus closing this issue. If this is not what you meant, please re-open.

The administration of those failed links could use some more attention though: a screen where all failed links are listed, with the amount of time they have been unavailable etc., and quick links to delete/edit a link from there.

Ber

ericG’s picture

Unless I installed the module wrong, I don't see how it does what I need.

What line of code in the module as it is currently released forces dead links to not show up in the category listing? If it is there I missed it and wasted my time adding it in.

I know that there is a check of link status in the module right now, the issue is that it is not useful and does not actually have an effect on the display of weblink nodes.

Am I missing something? I cannot find any code that removes dead links or even gives a consistent listing of them. The code that writes to watchdog gives very inconsistent results in terms of providing admin access to nodes that fail the limited checking that does happen in the module now.

As well, there is no way to set a preference to say that a link would be removed only after it fails x consecutive checks.

I am re-opening this issue since I really think that you misunderstood what I need/requested/am working on.

None of the features I am hacking into the module currently exist in the module.

Anonymous’s picture

2 points.
1: It would be nice to see the features you are working on as a part of the weblink module. Good luck.
2: To Ber/anonymous... closing out this bug as "wontfix" when it is needed is silly; closing this bug as "wontfix" as anonymous is just rude.

Bèr Kessels’s picture

2: To Ber/anonymous... closing out this bug as "wontfix" when it is needed is silly; closing this bug as "wontfix" as anonymous is just rude.

True, I should have logged on, but forgot to do so! However, I am currently the official maintainer of this module, and therefore know it quite well. So I also know that a big part of the proposed code is already implemented! The reason I set this to "won't fix" is that I am sure that adding something similar to what is already there is not good. It will only clutter the UI and make things worse. But, as I said in the comment, feel free to extend the current monitoring, because that can improve the current module instead of degrading it.
Adding third-party code that shows many parallels to what is already there is not wanted.

But sorry: I never wanted to offend anyone. It was just an unlucky coincidence of me forgetting to log on and not taking the time to explain in more detail why I marked it as wontfix.

When you look at the part that contains the weblink_moitor* functions you will see that there is code there already to handle four-o-fours, etc. The weblink_moitor_list() function can only be accessed by a 'hidden' function.
http://drupal.kollm.org/tmp/_contrib/drupal-contrib-phpdoc/modules_2webl...

I would really like to know from you:
What do you plan to add on top of this?
Or do you want to remove this code and *replace* it with something better?
For what version are you planning this? (Note that in about one and a half months there will be a new stable Drupal release, so you might as well code it for that.)
Did you consider adding another module for weblinks, called (for example) weblinks_monitor.module, that depends on weblinks.module? That way we could skim the current weblinks down to its basics, and provide depending modules for extra weblinks functionality.

Thank you for your interest and time. And again: sorry, I did not mean to be rude at all!

Ber (if you want to mail me about this: berkessels at gmx dot net)

km’s picture

A separate weblinks_monitor.module depending on weblinks.module seems to be a nice idea.

ericG’s picture

Hi Ber,

Thanks for the informative reply.

I have no desire to clutter up or confuse anything on the module you maintain.
My request/actions are motivated by a client that demanded certain "features".

A separate module (weblink_monitor.module) is a good idea, and I will take my efforts in that direction.

Mainly what I want to see in the module is this:
(1)
A page accessible via the admin menu:
"Weblinks in need of attention" or something like that.

Access would be controlled by an "admin dead links" setting on the standard permissions page.

This page would provide a very simple interface for dead link administration.

The hack I did over the weekend just makes a block with all dead links, where the content of that block is restricted by the "admin dead links" permission. The block lists the titles of all bad links, and each title links to that node's admin page. (I still have to add something to the admin page so that when a weblink node is updated, its linkstatus gets changed back to 'ok'.)

(2)
I added a new field, 'linkstatus', to the weblink table.
When the monitoring of links happens now in the module, it simply sets this field to either 'ok' or 'error'.

This is used to affect the category listings and link count by simply adding
WHERE linkstatus = 'ok'

I will work towards setting this status via the soon to exist weblink_monitor.module
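To make point (2) concrete, here is a rough sketch of the filtering idea using Python's sqlite3 module and an in-memory database. It is not the actual patch: only the weblink table name and the linkstatus field with its 'ok'/'error' values come from the description above, while the other columns and all function names are made up for the example.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway weblink table (columns other than linkstatus are hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weblink ("
        " nid INTEGER PRIMARY KEY,"
        " title TEXT, url TEXT,"
        " linkstatus TEXT NOT NULL DEFAULT 'ok')"
    )
    conn.executemany(
        "INSERT INTO weblink (title, url, linkstatus) VALUES (?, ?, ?)",
        [
            ("Alpha", "http://example.com/a", "ok"),
            ("Beta", "http://example.com/b", "error"),
            ("Gamma", "http://example.com/c", "ok"),
        ],
    )
    return conn


def live_links(conn: sqlite3.Connection) -> list:
    """Category listing: only links whose last check succeeded."""
    query = "SELECT title, url FROM weblink WHERE linkstatus = 'ok' ORDER BY title"
    return conn.execute(query).fetchall()


def dead_links(conn: sqlite3.Connection) -> list:
    """Admin listing: links currently flagged as broken."""
    query = "SELECT nid, title FROM weblink WHERE linkstatus = 'error' ORDER BY title"
    return conn.execute(query).fetchall()


def mark_ok(conn: sqlite3.Connection, nid: int) -> None:
    """Reset the flag, e.g. after an admin has edited and saved the node."""
    conn.execute("UPDATE weblink SET linkstatus = 'ok' WHERE nid = ?", (nid,))
```

The same WHERE clause would also have to go into whatever query produces the per-category link count, otherwise the count and the listing disagree.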

As far as you being rude or not... whatever... I have a rather thick skin and don't have time to get upset about minor bullshit.

Bèr Kessels’s picture

Project: Weblink » Links Package
Version: » master
Component: Miscellaneous » User interface

migas’s picture

I'm about to build a larger site for a community. Being new to Drupal, I thought that this great system would also alert me to dead links. Surprisingly, I found no module for that on drupal.org, not even one that simply checks all the links on my site. Fortunately I found the module I needed at http://drupal.org/node/72840 - I've tested it with some nodes in which I put wrong links. The module found them all and listed them in a table.

There is only one problem: the rows (node - link - issue) link to nodes by node number, but I'm using Pathauto, so those links don't work. I hope that there will be a solution for that.
Why not integrate it into drupal.org?
Greetings from Austria

Jarvis L’s picture

So how did it go? Did you manage to fix it? Would be nice to see the result if you succeeded. This would be really useful.

Jarvis, IT Professional currently working on the sleeping tablets project.

bomarmonk’s picture

Also see http://drupal.org/project/linkchecker

Should links package simply recommend use of the linkchecker module? It looks good and a new maintainer is about to start doing more work with it!

hass’s picture

Title: interface for dead links » Interface for dead links

I'm working on a new linkchecker version and we have new 2.x branches that need some testing... Your feedback is appreciated. Linkchecker scans all node type links. So no need to reinvent the wheel here...

syscrusher’s picture

Assigned: Unassigned » syscrusher

Hello! I'm sorry to be slow weighing in on this. I took the time to review the code from Linkchecker carefully before replying.

In a nutshell, I am very impressed with what I see in Linkchecker and definitely do not want to reinvent that wheel. On the other hand, it seems that Linkchecker has reinvented some wheels that are already in the Links Package API as well.

I would like to propose a discussion (possibly by phone) between hass and myself to talk about how we can work together to make both modules better. For example, I am willing to make additions to the Links table schema if that would allow Linkchecker to utilize the Links API for some of its back-end storage. Also, I think that hass has come very close to implementing a link-capturing feature that I've been wanting to add to links_related, and I would like to talk about how to leverage this excellent work.

I also have one specific patch suggestion for improving the cron behavior of Linkchecker and am willing to contribute a patch if you are interested. :-)

hass, would you be interested in arranging a telephone conversation so we can discuss technical details easily and efficiently?

Kind regards,

Scott (Syscrusher)

syscrusher’s picture

Version: master » 6.x-2.x-dev

Some clarification.... I am the owner of the Links Package, which includes the Links API (links.inc and links.module), links_admin.module, links_related.module, and links_weblink.module. I am not the maintainer of weblink.module, which is (as Ber Kessels indicates) a separate package. Ber was kind enough to contribute much of the code for the original version of links_weblink.module, so my work in that area is based on Ber's original code. (Open Source is a wonderful thing!)

Also, I'm retagging the version to the D6 developer branch, since that's where I'm working on new features, rather than HEAD.

Kind regards,

Syscrusher

hass’s picture

@Syscrusher: Not sure what you'd like to change... I would be happy to hear about it, but the cron logic and the functions inside will change within the next days/week... I've committed parts of the changes to D6, but not yet backported them to D5. I'm very busy adding cURL support, which is stressful but makes linkchecker *ultra* speedy (100+ simultaneous checks are heavy to implement), and catching all the status codes and so on is a bit tricky with non-blocking, multi-threaded cURL... Feedback is nevertheless welcome. If I have implemented something wrong regarding the links package, I would be happy to correct it.

hass’s picture

What wheel have I re-invented? The link package integration in linkchecker was only a ~3-6 lines addition and the rest of the code is required by the rest of the linkchecker module :-)

bomarmonk’s picture

This sounds like a great integration for the links package. In the end, great minds think alike, so I am looking forward to the functionality of both of your modules. Thanks for the work on this!

syscrusher’s picture

@hass: I didn't realize you were going to be adding non-blocking link checking to support parallel link checking. That's terrific! My suggestion was that in your while{} loop you should snapshot the epoch timestamp prior to the loop, and check the *total* elapsed time (not the time per loop) in seconds after each loop iteration, exiting after an administrator-configured maximum time is exceeded even if you haven't done as many links as they wanted in a cron run.

Even if you are doing parallel nonblocking HTTP requests, this may still be a valuable feature. Remember that the operating system may limit the number of outbound IP sockets for a single PHP process space (threaded or not). So you may want to have an administrator-defined setting to say, for instance, "Allow Linkchecker to do up to __X__ simultaneous HTTP requests while processing __Y__ maximum links per cron job, with the entire process lasting no more than __Z__ seconds regardless of whether we've done all of the __Y__ links."

Of course, it may not be desirable (or possible?) to interrupt a long-running HTTP request until cURL itself times out, so you can't *guarantee* the time limit. But you *can* guarantee that you won't send out any new requests after that time limit is expired.
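The time-budget idea could be sketched roughly like this. It is written in Python for brevity and is not Linkchecker code; `run_checks`, `check`, `max_links` and `max_seconds` are hypothetical names, and the checks run sequentially here rather than in parallel.

```python
import time
from typing import Callable, List, Tuple


def run_checks(
    urls: List[str],
    check: Callable[[str], bool],
    max_links: int,
    max_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[Tuple[str, bool]]:
    """Check up to `max_links` URLs, but start no new check once the
    *total* elapsed time since the start of the run exceeds `max_seconds`."""
    started = clock()  # snapshot taken once, before the loop
    results = []
    for url in urls[:max_links]:
        # Compare against the start of the whole run, not of this iteration.
        if clock() - started >= max_seconds:
            break
        # A check already in flight is allowed to finish (or time out on its
        # own), so the limit bounds new requests, not the total run time.
        results.append((url, check(url)))
    return results
```

Links that were not reached in one run would simply be picked up by the next cron run, so the budget only delays checks and never drops them.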

I hope I've explained this well enough. If you don't think this is useful, given what you are already doing with parallelization, I understand, but I wanted to offer the idea for your consideration.

Kind regards,

Scott

syscrusher’s picture

@hass:

What wheel have I re-invented? The link package integration in linkchecker was only a ~3-6 lines addition and the rest of the code is required by the rest of the linkchecker module :-)

I'm sorry if I gave the impression that I thought *you* had reinvented wheels. It's not a question of you or me, but rather that there are some wheels which each of us has invented independently, and in very similar (read: compatible) ways. I did not intend criticism of your work in any way; in fact, I consider your module design to be extremely good.

But we do have some duplicated functionality. I have designed a {links_monitor} table and a plan to build a link checking module, but I have not yet built the code. You clearly have already built something that is extremely good, and is very, very close to what I envisioned. I don't need to build a link checking API -- I should (and will) just use yours. :-) On the other hand, you have, as part of your link checker, built a link *storage* mechanism that replicates much of the functionality of the Links API from my package. And you have built a mechanism for attaching links to nodes, which again replicates functionality of Links API.

Link storage and node attachment are not "core functionality" to a Linkchecker module, and dead/moved link checking is not "core functionality" to Links API. If you look at our database schemas, we have many fields in common, and I think we can capitalize on that to allow each of us to maintain less code, and to focus on the code we personally find most interesting to work on.

My thoughts are that we should join forces, and let Links API (my package) be the back-end storage engine for the URLs that you capture, and let your excellent Linkchecker module provide the link monitoring and automatic update features.

Kind regards,

Scott (Syscrusher)

hass’s picture

Yeah, I also thought about time-limiting the link check process... this is much better than a fixed count. It's something for the near future... and as we have a 15-second HTTP time-out we can safely calculate the max time. I tried to get it working first, and thought about saving link check times for statistical reasons in future. I know you have a links_monitor table, but it wasn't used yet, and I saw many comments in this thread where people asked whether the links package shouldn't point to linkchecker. You have also written on the project home page that you don't have time to work on it.

Additionally, I wanted to build an independent module, as people may not use the links package but would still like a general solution. From the links package side it wouldn't make sense to search for links in comments, other content types, blocks and so on... so linkchecker is more of a general module that integrates with others. I'm very disappointed by the code quality in the linkchecker 1.x branch, but I got the idea, as I also have broken links on my site, and merged the Janode, link_checker and links package monitor functionality all into one module. Not sure if there are more modules that could be integrated, but if there are, I can do it... :-)

syscrusher’s picture

[...] I know you have a links_monitor table, but it wasn't used yet, and I saw many comments in this thread where people asked whether the links package shouldn't point to linkchecker. You have also written on the project home page that you don't have time to work on it.

That's certainly true right now. I'm working on getting Views fully supported (that's about 60% done right now in the dev snapshot) and building fields-in-core support for Links API in Drupal 7. The good news is, I no longer have to do the link checking module from scratch, because of your work.

Additionally, I wanted to build an independent module, as people may not use the links package but would still like a general solution. From the links package side it wouldn't make sense to search for links in comments, other content types, blocks and so on... so linkchecker is more of a general module that integrates with others.

Actually, I think adding that functionality to Links would make good sense, if you are amenable to working with me to integrate more closely. I already have a pending feature request for someone who wants Links API to be able to harvest links from nodes and add them to the {links} and {links_node} tables, replacing the original link with a Links API placeholder akin to the way {links_related} works.

I'm very disappointed by the code quality in the linkchecker 1.x branch, but I got the idea, as I also have broken links on my site, and merged the Janode, link_checker and links package monitor functionality all into one module. Not sure if there are more modules that could be integrated, but if there are, I can do it... :-)

I can't comment on the 1.x branch of your code; I downloaded the dev branch for Drupal 6 to examine, and am very pleased with what I see.

I still would like to arrange an interactive conversation with you to talk about options. :-)

Kind regards,

Scott (Syscrusher)

hass’s picture

Actually, I think adding that functionality to Links would make good sense, if you are amenable to working with me to integrate more closely. I already have a pending feature request for someone who wants Links API to be able to harvest links from nodes and add them to the {links} and {links_node} tables, replacing the original link with a Links API placeholder akin to the way {links_related} works.

Sounds like a good idea... I also missed this type of import/replacement in the links package :-). This can for sure work both independently of and together with linkchecker. You only need to implement the extraction and link placeholder replacement. If there is no URL in the content, it is going to be removed from linkchecker by the cleanup process, but linkchecker adds the new link node reference back... there is really nothing extra I need to do - this works :-).

I will send you an email about the phone call.