See #205202: Fix search index link handling for non-existent nodes for some background information.

If trying to build (for example) a Wikipedia-type site with Drupal, one might expect to have lots of nodes with links to URL aliases that don't exist yet on the site - these would be added with the intention that the content would be created someday later. For example:

Chickens are a type of <a href="/bird">bird</a>.

Someone might come along later and create the entry for "bird", at which point the link would start working.

Although the implementation of this might be complicated, it would be great if the search index could properly handle this case so that all these links are properly indexed and recorded - and therefore used to calculate search relevance - once the 'bird' node has been created. (Currently, this kind of link is only picked up if someone goes back and manually triggers a reindex of the chicken article - e.g., by resaving the node.)

Comments

jhodgdon

Just as a note, the "bird" link in this case probably would not be explicit in the text. What normally happens in a wiki is that whenever it encounters text that matches a page title, the link is generated by an input filter at runtime.

And even if it was explicit in the text, I'm unsure how this could possibly be implemented, and I'm inclined to mark it as "won't fix" because I can't think of any way it could be done in practice.

So I think that for a wiki site, you'd really just have to rely on the fact that the search/node cron process eventually comes back and reindexes nodes that haven't been indexed recently. Or a contributed module that would maybe have its own cron process that would check for new links in nodes and mark the nodes for reindexing?

David_Rothstein

> Just as a note, the "bird" link in this case probably would not be explicit in the text. What normally happens in a wiki is that whenever it encounters text that matches a page title, the link is generated by an input filter at runtime.

I don't know this for sure, but I thought (hoped?) that when a node is indexed, it's the rendered version of the node that actually gets indexed (i.e., the version that users will see), rather than the original content. If so, this shouldn't be a problem, since check_markup() will have already run on it before the search index gets it. If not, then there are much bigger problems, because it would mean that wiki-style and other alternative markup formats never work correctly with the search indexer.

> And even if it was explicit in the text, I'm unsure how this could possibly be implemented, and I'm inclined to mark it as "won't fix" because I can't think of any way it could be done in practice.

Well, the search module already handles URL aliases and already has a table to store node links, so I was thinking maybe there'd just be an extra column added to that table? Then when a node is indexed, any on-site link that is unknown gets a row in that table (with the 'nid' column empty), and when a new node is indexed, any URL aliases that it has get checked against that table to see if they match - if so, the appropriate nodes get marked for a reindex. I think that might work.
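
To make the idea concrete, here is a rough sketch of that bookkeeping as a toy in-memory model in Python. It is not Drupal code: the class name `LinkIndex`, its attributes, and the node IDs are all made up for illustration, and a real implementation would live in the search module's tables instead.

```python
from collections import defaultdict


class LinkIndex:
    """Toy model of the proposal; hypothetical, not Drupal's actual schema."""

    def __init__(self):
        self.alias_to_nid = {}           # known URL alias -> node ID
        self.pending = defaultdict(set)  # unknown alias -> IDs of nodes linking to it
        self.needs_reindex = set()       # nodes flagged for a reindex

    def index_node(self, nid, aliases, outgoing_links):
        # Register this node's aliases and flag every node that was
        # waiting for one of them to start existing.
        for alias in aliases:
            self.alias_to_nid[alias] = nid
            self.needs_reindex |= self.pending.pop(alias, set())
        # Remember on-site links whose target does not exist yet.
        for link in outgoing_links:
            if link not in self.alias_to_nid:
                self.pending[link].add(nid)
        # This node has just been indexed, so it is up to date.
        self.needs_reindex.discard(nid)


index = LinkIndex()
index.index_node(1, ["/chicken"], ["/bird"])  # "/bird" is unknown at this point
index.index_node(2, ["/bird"], [])            # the "bird" node gets created
print(index.needs_reindex)                    # the chicken node (1) is flagged
```

The point of the sketch is only that the lookup happens when the new node is indexed, so nobody has to rescan old nodes to find the broken links.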

> So I think that for a wiki site, you'd really just have to rely on the fact that the search/node cron process eventually comes back and reindexes nodes that haven't been indexed recently. Or a contributed module that would maybe have its own cron process that would check for new links in nodes and mark the nodes for reindexing?

It might be nice if a contrib module had an easy way to hook into this part of the search index. I'm not sure if it does or not. I agree this won't be the most-used feature ever, so it might be better to focus on making sure there are available hooks for contrib modules to easily intercept these kinds of broken links, rather than adding the feature directly. (Ideally they would not have to rescan the node on their own, but rather be able to play with the results.)

jhodgdon

NOTE: I edited this as I realized it wasn't clear...

I wasn't meaning to imply that the link wouldn't be there during indexing, if auto generated.

What I meant was that during the original node indexing, the link might not be created by the input filter, because a wiki might not recognize the text as a node title (the linked-to node doesn't exist yet). Then if the node was reindexed, and the bird node existed at that point, the filter might recognize it as a node title and create the link.

So, if you considered an implementation that made a table of broken links, and marked nodes with broken links for faster reindexing, this might not help in the case of auto-generated wiki links, because it wouldn't generate broken links, only working links.
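
A small sketch of the kind of filter I have in mind, in Python for brevity. The function `autolink` and the `titles` mapping are hypothetical stand-ins for a wiki input filter, not any real module's API.

```python
import re


def autolink(text, titles):
    """Link each word that matches an existing page title (hypothetical filter)."""
    def replace(match):
        word = match.group(0)
        path = titles.get(word.lower())
        # Words without a matching page are left as plain text, so no
        # broken link is ever emitted.
        return f'<a href="{path}">{word}</a>' if path else word
    return re.sub(r"[A-Za-z]+", replace, text)


text = "Chickens are a type of bird."
before = autolink(text, {})                # no "bird" page yet: text is unchanged
after = autolink(text, {"bird": "/bird"})  # "bird" page exists: link is generated
```

With a filter like this, the first indexing pass sees no link at all, so a table of broken links would have nothing to record for the chicken node.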

As far as a contrib module goes, it is quite easy for a contrib module to mark a node as needing reindexing. There's a table that keeps track of when a node was last indexed (look in the code to find it), and if you set that field to 0 it indicates the node needs reindexing. node.module does the same thing when the node is updated, I believe.
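
Roughly, and going only by my description above rather than the actual schema, the mechanism looks like the following Python sketch. The `last_indexed` mapping and both function names are hypothetical; check the real table and column names in the code before relying on any of this.

```python
def mark_for_reindex(last_indexed, nid):
    """Reset the 'last indexed' value so the node looks like the stalest one."""
    last_indexed[nid] = 0


def next_batch(last_indexed, limit):
    """Pick the least recently indexed nodes first, as a cron run might."""
    return sorted(last_indexed, key=lambda nid: (last_indexed[nid], nid))[:limit]


# Hypothetical data: node ID -> timestamp of the last indexing run.
last_indexed = {1: 1700000000, 2: 1700000500, 3: 1700000900}
mark_for_reindex(last_indexed, 3)
batch = next_batch(last_indexed, 2)  # node 3 now sorts ahead of nodes 1 and 2
```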

jhodgdon

Status: Active » Closed (won't fix)

The functionality that was in the Drupal Core search indexing for "tracking" links was doing the following, in a fairly buggy way:
- Trying to figure out if a link in node A was to a node B
- If so, while indexing node A, add the link text in A to node B's search text
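
As a rough illustration of those two steps (a Python sketch, not the removed core code), the idea was something like the following. The names `LinkTextCollector`, `extra_search_text`, and the `alias_to_nid` mapping are invented for this example.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect (href, link text) pairs from rendered node HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def extra_search_text(html, alias_to_nid):
    """Map each linked-to node ID to the link texts that point at it."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    extra = {}
    for href, text in parser.links:
        nid = alias_to_nid.get(href)
        if nid is not None:  # links to unknown targets are silently dropped
            extra.setdefault(nid, []).append(text)
    return extra


html = 'Chickens are a type of <a href="/bird">bird</a>.'
known = extra_search_text(html, {"/bird": 2})  # "bird" is credited to node 2
unknown = extra_search_text(html, {})          # target missing: nothing recorded
```

The second call shows the case this issue is about: if the target doesn't exist when node A is indexed, the link text is lost until A is indexed again.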

This was kind of silly really. The vast majority of links from A to B would use as link text something that was already in the text of B anyway, right? Plus, it was buggy (there were at least 5 open bugs related to this, including this one). So, in #2003482: Convert hook_search_info to plugin system, we actually ended up just getting rid of this functionality entirely. As a result, this issue will not be fixed, since the code doesn't exist any more.