Problem/Motivation

Currently, NodeSearch decides which content to index in a given cron run by running a query to find nodes that have been added or updated; a setting determines how many of them are indexed per run.

Since the time needed to index a node can depend on the node itself, a fixed per-run count is a poor proxy for how much work a cron run does. A queue system might work better than a query: nodes would be added to the queue as they are created or updated, and the queue system would then index them in order.

However, the current Drupal queue API has no way to tell whether a given node is already in the queue. We would need to know that so that each node is not added to the queue multiple times, which would make indexing terribly inefficient.

Proposed resolution

a) Add functionality to the queue API (perhaps a new type of queue) so that it has some way of determining "this item is already in the queue", possibly through a tagging system. Instead of just calling
QueueInterface::createItem($data);
we would call
TaggedQueueInterface::createTaggedItem($data, $tags);
with
$tags = array('nid' => 3);

and then there could be
TaggedQueueInterface::itemExists($tags);
which would return TRUE or FALSE depending on whether an item with matching tags already exists in the queue.
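
To make the idea concrete, here is a minimal in-memory sketch of what such a tagged queue might look like. TaggedQueueInterface and its methods are the hypothetical names from this proposal, not an existing core API, and MemoryTaggedQueue, claimItem() and numberOfItems() as written here are assumptions made for illustration only; a real implementation would need storage and item leasing.

```php
<?php

/**
 * Hypothetical interface from this proposal; it does not exist in core.
 */
interface TaggedQueueInterface {
  public function createTaggedItem($data, array $tags): bool;
  public function itemExists(array $tags): bool;
}

/**
 * Illustrative in-memory implementation (an assumption, not core code).
 */
final class MemoryTaggedQueue implements TaggedQueueInterface {

  /** @var array Items keyed by a canonical string form of their tags. */
  private array $items = [];

  /** Builds a stable key so array('nid' => 3) always maps to one slot. */
  private function key(array $tags): string {
    ksort($tags);
    return json_encode($tags);
  }

  /** Adds the item unless one with the same tags is already queued. */
  public function createTaggedItem($data, array $tags): bool {
    if ($this->itemExists($tags)) {
      return FALSE;
    }
    $this->items[$this->key($tags)] = $data;
    return TRUE;
  }

  public function itemExists(array $tags): bool {
    return array_key_exists($this->key($tags), $this->items);
  }

  /** Removes and returns the oldest item, or NULL when the queue is empty. */
  public function claimItem() {
    if (!$this->items) {
      return NULL;
    }
    $key = array_key_first($this->items);
    $data = $this->items[$key];
    unset($this->items[$key]);
    return $data;
  }

  public function numberOfItems(): int {
    return count($this->items);
  }

}

// Saving node 3 twice before cron runs queues it only once.
$queue = new MemoryTaggedQueue();
$queue->createTaggedItem(['nid' => 3], ['nid' => 3]);
$queue->createTaggedItem(['nid' => 3], ['nid' => 3]);
```

With something like this, the node save hook could call createTaggedItem() unconditionally and let the queue decide whether the node is already waiting.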

b) Use this queue system to index nodes in NodeSearch, rather than the existing query/setting system.

Remaining tasks

Make the new queue system and use it.

User interface changes

We'd get rid of the "number of nodes to index per cron run" setting.

API changes

This would be an API addition to the queue system, rather than a change to the existing API.

Original report by @Xano

We've had the queue API since D7, but it's not used for indexing content yet. Besides the well-known benefits of using the queue API for tasks like this, we could also remove the dreaded "Number of items to index per cron run" setting on the Search module configuration page.

Comments

aries’s picture

Assigned: Unassigned » aries

cpliakas’s picture

I don't think the queue system works well for search. What happens if you update a node three times before the queue is processed? Currently the queue system doesn't handle uniqueness as illustrated by the issue #1548286: API for handling unique queue items, so it would trigger three index operations on the same node in the same cron run. In addition, what happens when you want to re-index all content on your site? If you have 100k items, does that mean that 100k messages have to be sent to the queue?

I'm sure workarounds to these issues could be coded, but my sense is that the queue system plus workarounds would probably add too much complexity for too little gain over the current system. What are the big wins that would warrant moving to the queue API?

Thanks,
Chris

timmillwood’s picture

Interested in this, but have a feeling that search indexing could create a lot of data for the queue.

aries’s picture

I don't feel re-indexing is a problem. At the moment, we also maintain a special db table for this.

But on a highly interactive website, using the queue is overkill, because the same item would be in the queue several times. Do you see any gain in using the Queue API for this?

What the Batch and Queue APIs would solve is removing the need for a cron run for those who don't have access to Drush. I would simply add an "Index now" button right next to the Re-index button, which starts a batch that runs until all the items are indexed. Give a +1 if you agree.

cpliakas’s picture

aries,

I am not clear on your position. Are you in favor of the original post suggesting that we should leverage the Queue API, or are you saying that all that is needed is an "Index now" button similar to Apache Solr Search Integration and Search API? Or is it somewhere in between?

Thanks in advance for helping me understand,
Chris

aries’s picture

Chris,

Sorry, I wasn't clear enough. Somewhere in between. I meant that using Queue with the current implementation is overkill. But we could do it better.

If you check _batch_populate_queue(), you will see a trick with the name field. We can do the same in search, using names of the form search:[entity_type]:[entity_id]. With this, we can avoid duplicates in the queue. 100K entities in the queue is not a real problem, since the current implementation already does the same in the search_dataset table.
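
A rough sketch of how the search:[entity_type]:[entity_id] naming could prevent duplicates, assuming the queue storage treats the name as a unique key. The function search_queue_item_name() and the class NamedItemQueue are hypothetical names invented for this illustration; a PHP array stands in for the storage.

```php
<?php

/**
 * Builds the item name suggested above, for example "search:node:3".
 *
 * Hypothetical helper, not an existing core function.
 */
function search_queue_item_name(string $entity_type, int $entity_id): string {
  return "search:$entity_type:$entity_id";
}

/**
 * Hypothetical stand-in for queue storage with a unique name per item.
 */
final class NamedItemQueue {

  /** @var array Queued items keyed by their unique name. */
  private array $items = [];

  /** Queues the entity unless an item with the same name already exists. */
  public function enqueue(string $entity_type, int $entity_id): bool {
    $name = search_queue_item_name($entity_type, $entity_id);
    if (isset($this->items[$name])) {
      return FALSE;
    }
    $this->items[$name] = ['type' => $entity_type, 'id' => $entity_id];
    return TRUE;
  }

  /** Returns the names of all queued items, oldest first. */
  public function names(): array {
    return array_keys($this->items);
  }

}

// Three saves of node 3 and one save of user 7 leave two items queued.
$named_queue = new NamedItemQueue();
$named_queue->enqueue('node', 3);
$named_queue->enqueue('node', 3);
$named_queue->enqueue('node', 3);
$named_queue->enqueue('user', 7);
```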

So what I suggest:

  1. Organizing the current search backend into a separate module (e.g. search_dbindex), because it maintains expensive tables which might not be necessary with other engines. I would also separate the UI into a dedicated module.
  2. Queue API per item change as I described above.
  3. Cosmetic changes in the Search UI providing a button to consume the queue via Batch API.

cpliakas’s picture

Gotcha. Thanks for the clarification!

#2 and #3 sound viable as you laid it out. I also like the premise of #1 although I would like to have some more discussions about the implementation of that. I think the concept of a backend can be separated out into logical, reusable parts. For example, a "backend" consists of a few things. One is the actual connection to the backend service, such as SQL or Solr. The second piece is the listener, which implements the various hooks and updates the queue. The third piece is the indexer, which retrieves the data and prepares it for indexing before passing it to the connection piece. Ideally the "listener" and "indexer" could and should be backend agnostic so that modules such as Search API and Apache Solr Search Integration can build on top of them instead of having to re-implement very similar subsystems. I understand this is above and beyond what the scope of this issue is, but I would hope we have an eye towards the future when separating out the backend.
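
One way the three pieces could be separated, as a sketch only. Every name here (BackendConnectionInterface, SearchListener, SearchIndexer, ArrayConnection) is hypothetical and chosen for illustration; none of them exist in core, and the text preparation step is deliberately trivial.

```php
<?php

/** Hypothetical: the connection to a backend service such as SQL or Solr. */
interface BackendConnectionInterface {
  public function index(string $id, string $text): void;
}

/** Hypothetical listener: reacts to entity saves and records pending work. */
final class SearchListener {

  /** @var array Pending items keyed by "type:id" so repeats collapse. */
  private array $pending = [];

  public function onEntitySave(string $entity_type, int $entity_id): void {
    $this->pending["$entity_type:$entity_id"] = [$entity_type, $entity_id];
  }

  /** Returns all pending [type, id] pairs and clears the list. */
  public function drain(): array {
    $items = array_values($this->pending);
    $this->pending = [];
    return $items;
  }

}

/** Hypothetical indexer: loads and prepares text, then hands it to a backend. */
final class SearchIndexer {

  private BackendConnectionInterface $connection;

  /** @var callable Receives (entity type, entity ID) and returns raw text. */
  private $loadText;

  public function __construct(BackendConnectionInterface $connection, callable $loadText) {
    $this->connection = $connection;
    $this->loadText = $loadText;
  }

  public function indexItem(string $entity_type, int $entity_id): void {
    $text = ($this->loadText)($entity_type, $entity_id);
    // Backend-agnostic preparation: strip markup and normalize case.
    $this->connection->index("$entity_type:$entity_id", strtolower(strip_tags($text)));
  }

}

/** Toy backend that keeps documents in an array, for demonstration only. */
final class ArrayConnection implements BackendConnectionInterface {

  public array $documents = [];

  public function index(string $id, string $text): void {
    $this->documents[$id] = $text;
  }

}

// The listener and indexer never need to know which backend is in use.
$listener = new SearchListener();
$listener->onEntitySave('node', 3);
$connection = new ArrayConnection();
$indexer = new SearchIndexer($connection, fn($type, $id) => '<p>Hello World</p>');
foreach ($listener->drain() as [$type, $id]) {
  $indexer->indexItem($type, $id);
}
```

In this arrangement only the connection class would differ between the SQL backend and something like Solr.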

Regarding the 100k index problem, I do think there are a couple of differences between the queue API and the current core implementation. It is a difference of UPDATES vs. INSERTS + DELETES, which concerns me a little. I do understand that alternate queue systems can be used which would eliminate this as an issue, but I just want to make sure any performance impacts of moving to the queue system are fully vetted.

Thanks again,
Chris

aries’s picture

Chris,
+1 on your idea for the breakdown. Regarding the 100k index problem, we can introduce a mass-indexer batch operation based on the given threshold per cron run. So if the threshold were 100 per cron run, according to the defaults, it's only 1000 INSERTs (100,000 / 100) for a massive website.
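
The arithmetic can be sketched as follows, assuming each queue item carries one threshold-sized group of entity IDs rather than a single ID. The function name search_build_index_batches() is hypothetical.

```php
<?php

/**
 * Splits entity IDs into groups, one group per queue item.
 *
 * Hypothetical helper for illustration; not an existing core function.
 */
function search_build_index_batches(array $entity_ids, int $per_item): array {
  if ($per_item < 1) {
    throw new InvalidArgumentException('The per-item threshold must be at least 1.');
  }
  return array_chunk($entity_ids, $per_item);
}

// 100,000 entity IDs at 100 per queue item gives 1,000 queue items.
$batches = search_build_index_batches(range(1, 100000), 100);
$queue_items_needed = count($batches);
```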

jhodgdon’s picture

Regarding queue set-up, I think this needs to do something like the API module does for queueing its "reparse this file" jobs. Basically, each file has a database field that keeps track of when it was last parsed, and a Boolean field that keeps track of "is this already queued". When the module detects during a cron run that the file is newer than when it was last parsed, it is added to the parse queue, but only if it is not already marked as "queued".

We could do something similar here for node search.
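
A minimal sketch of that bookkeeping applied to nodes, with arrays standing in for the database table and the queue. The class NodeIndexTracker and its method names are hypothetical, and this is not how the API module is actually written; it only mirrors the description above.

```php
<?php

/**
 * Hypothetical tracker: one row per node with "changed", "indexed", "queued".
 */
final class NodeIndexTracker {

  /** @var array Rows keyed by node ID. */
  private array $rows = [];

  /** @var int[] Node IDs waiting to be indexed, oldest first. */
  private array $queue = [];

  /** Records that a node was created or updated at the given timestamp. */
  public function recordChange(int $nid, int $changed): void {
    $row = $this->rows[$nid] ?? ['changed' => 0, 'indexed' => 0, 'queued' => FALSE];
    $row['changed'] = $changed;
    $this->rows[$nid] = $row;
  }

  /** Cron step: queues stale nodes that are not already marked as queued. */
  public function queueStale(): int {
    $added = 0;
    foreach ($this->rows as $nid => $row) {
      if ($row['changed'] > $row['indexed'] && !$row['queued']) {
        $this->queue[] = $nid;
        $this->rows[$nid]['queued'] = TRUE;
        $added++;
      }
    }
    return $added;
  }

  /** Worker step: takes the next node, marks it indexed, clears the flag. */
  public function processNext(int $now): ?int {
    $nid = array_shift($this->queue);
    if ($nid === NULL) {
      return NULL;
    }
    $this->rows[$nid]['indexed'] = $now;
    $this->rows[$nid]['queued'] = FALSE;
    return $nid;
  }

}

// A node edited twice before the worker runs is still queued only once.
$tracker = new NodeIndexTracker();
$tracker->recordChange(3, 100);
$tracker->queueStale();
$tracker->recordChange(3, 110);
$tracker->queueStale();
```

The design choice here is that uniqueness lives in the tracking table rather than in the queue, so the existing queue API could be used unchanged.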

jhodgdon’s picture

Assigned: aries » Unassigned
Issue summary: View changes

I do not think that aries is actually working on this at this point, so unassigning.

I still think this is probably a good idea, although since we'd definitely need to make sure not to requeue nodes already in the queue, we'd need additions to the queue API. Updating issue summary.

jhodgdon’s picture

Version: 8.0.x-dev » 8.1.x-dev

Since 8.0.x-beta1 has been released, our policy at this point is No feature requests until 8.1.x. See #2350615: [policy, no patch] What changes can be accepted during the Drupal 8 beta phase?. Sorry, it's just too late for 8.0.x at this point, so even if we had a viable patch, the core committers would not commit it. So unless we decide this is a Task or a Bug (and I don't think it is), we'll have to delay it.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 10.1.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch, which currently accepts only minor-version allowed changes. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

sukr_s’s picture

#504012: Use a queue for node create/update indexing uses a queue to index content when it is created or updated. This works only when automated_cron is used. Otherwise the indexing is still done in cron.

After the changes in #504012: Use a queue for node create/update indexing are accepted, the cron function in node can be removed and the check for the automated_cron module in Node::postSave can be removed. That would move node indexing entirely to queues.

pwolanin’s picture

IMHO - a queue is actually the wrong solution unless core adds something like https://www.drupal.org/project/queue_unique

The problem is that a single content item may be updated multiple times while it's in the queue, but you only want to re-index it once when its turn comes around.

catch’s picture

Title: Use queue for indexing content » [PP-1] Use queue for indexing content
Related issues: +#1548286: API for handling unique queue items

Yeah I think we can postpone this on #1548286: API for handling unique queue items.

Version: 11.x-dev » main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Read more in the announcement.

catch’s picture

Status: Active » Postponed