[PP-1] Use queue for indexing content [#1560820]

Problem/Motivation

Currently, NodeSearch decides which content to index in a given cron run by querying for nodes that have been added or updated, and a setting determines how many of them are indexed per run.

Because the time needed to index a node can depend on the node itself, a fixed count is a poor fit. A queue might work better than a query: nodes could be added to the queue as they are created or updated, and the queue system would then index them in order.

However, the current Drupal queue API has no way to tell whether a given node is already in the queue. We would need that so that a node is not added to the queue multiple times, which would make indexing terribly inefficient.

Proposed resolution

a) Add functionality to the queue API (like a new type of queue?) so that it has some way of determining "this item is already in the queue". Possibly a tagging system? So that perhaps instead of just calling
QueueInterface::createItem($data);
we would call
TaggedQueueInterface::createTaggedItem($data, $tags)
with
$tags = array('nid' => 3)

and then there could be
TaggedQueueInterface::itemExists($tags)
which would return True/False after checking whether an item with matching tags already exists in the queue.
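
The tagging idea in (a) can be sketched in plain PHP. This is an in-memory illustration under stated assumptions, not an existing Drupal API: `TaggedQueueInterface` is the hypothetical interface proposed above, while `MemoryTaggedQueue`, the boolean return value of `createTaggedItem()`, and the simplified `claimItem()` are invented here for the example.

```php
<?php

/**
 * Hypothetical interface from the proposal above; not part of Drupal core.
 */
interface TaggedQueueInterface {
  public function createTaggedItem($data, array $tags): bool;
  public function itemExists(array $tags): bool;
}

/**
 * In-memory illustration only; a real backend would persist items and tags.
 */
class MemoryTaggedQueue implements TaggedQueueInterface {

  /** @var array Items keyed by a normalized tag string, in insertion order. */
  private array $items = [];

  /** Turns a tag array into a stable string key, independent of tag order. */
  private function key(array $tags): string {
    ksort($tags);
    return http_build_query($tags);
  }

  /** Adds the item unless one with the same tags is already queued. */
  public function createTaggedItem($data, array $tags): bool {
    if ($this->itemExists($tags)) {
      return FALSE;
    }
    $this->items[$this->key($tags)] = $data;
    return TRUE;
  }

  public function itemExists(array $tags): bool {
    return array_key_exists($this->key($tags), $this->items);
  }

  /** Removes and returns the oldest item, or NULL if the queue is empty. */
  public function claimItem() {
    if (empty($this->items)) {
      return NULL;
    }
    $key = array_key_first($this->items);
    $data = $this->items[$key];
    unset($this->items[$key]);
    return $data;
  }

}

$queue = new MemoryTaggedQueue();
$tags = array('nid' => 3);
$first = $queue->createTaggedItem(array('nid' => 3), $tags);  // Queued.
$second = $queue->createTaggedItem(array('nid' => 3), $tags); // Skipped.
```

With this shape, a node saved several times before cron runs still produces a single queue item, and it can be queued again once the item has been claimed.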

b) Use this queue system to index nodes in NodeSearch, rather than the existing query/setting system.

Remaining tasks

Make the new queue system and use it.

User interface changes

We'd get rid of the "number of nodes to index per cron run" setting.

API changes

This would be an API addition to the queue system, rather than a change to the existing API.

Original report by @Xano

We've had the queue API since D7, but it's not used for indexing content yet. Besides the well-known benefits of using the queue API for tasks like this, we could also remove the dreaded "Number of items to index per cron run" setting on the Search.module configuration page.

Comments

Comment #1

aries commented 8 May 2012 at 14:31

Assigned:	Unassigned	» aries

Comment #2

cpliakas commented 8 May 2012 at 19:41

I don't think the queue system works well for search. What happens if you update a node three times before the queue is processed? Currently the queue system doesn't handle uniqueness as illustrated by the issue #1548286: API for handling unique queue items, so it would trigger three index operations on the same node in the same cron run. In addition, what happens when you want to re-index all content on your site? If you have 100k items, does that mean that 100k messages have to be sent to the queue?

I'm sure workarounds to these issues could be coded, but my sense is that the queue system plus workarounds would probably add too much complexity for too little gain over the current system. What are the big wins that would warrant moving to the queue API?

Thanks,
Chris

Comment #3

timmillwood commented 8 May 2012 at 21:42

Interested in this, but have a feeling that search indexing could create a lot of data for the queue.

Comment #4

aries commented 10 May 2012 at 10:19

I don't feel re-indexing is a problem. At the moment, we also maintain a special db table for this.

But on a highly interactive website, using the queue is overkill, because the same item would be in the queue several times. Do you see any gain in using the Queue API for this?

What it would solve, via the Batch and Queue APIs, is eliminating the need for a cron run for those who don't have access to Drush. I would simply add an "Index now" button right next to the Re-index button, which starts a batch that runs until all the items are indexed. Give a +1 if you agree.

Comment #5

cpliakas commented 10 May 2012 at 13:48

aries,

I am not clear on your position. Are you in favor of the original post suggesting that we should leverage the Queue API, or are you saying that all that is needed is an "Index now" button similar to Apache Solr Search Integration and Search API? Or is it somewhere in between?

Thanks in advance for helping me understand,
Chris

Comment #6

aries commented 10 May 2012 at 18:07

Chris,

Sorry, I wasn't clear enough. Somewhere in between. I meant that using a queue with the current implementation is overkill. But we could do it better.

If you look at _batch_populate_queue(), you will see a trick with the name field. We can do the same here, using names of the form search:[entity_type]:[entity_id]. With this, we can avoid duplicates in the queue. 100k entities in the queue is not a real problem, since the current implementation already does the same in the search_dataset table.
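
The keying idea can be illustrated without any queue backend. The sketch below does not reproduce what `_batch_populate_queue()` does; it only shows that naming each job `search:[entity_type]:[entity_id]` collapses repeated saves into one entry. The helper name `search_queue_key()` and the plain array used as storage are assumptions made for the example.

```php
<?php

/**
 * Builds a key in the search:[entity_type]:[entity_id] form.
 *
 * Hypothetical helper, named here for illustration only.
 */
function search_queue_key(string $entity_type, int $entity_id): string {
  return sprintf('search:%s:%d', $entity_type, $entity_id);
}

// Pending index jobs, keyed so that each entity appears at most once.
// A plain array stands in for whatever storage a real queue would use.
$pending = array();

// Node 3 is saved three times and node 7 once before cron runs.
foreach (array(3, 3, 7, 3) as $nid) {
  $pending[search_queue_key('node', $nid)] = array(
    'entity_type' => 'node',
    'entity_id' => $nid,
  );
}

// Two jobs remain: search:node:3 and search:node:7.
print implode("\n", array_keys($pending)) . "\n";
```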

So what I suggest:

1. Organize the current search backend into a separate module (e.g. search_dbindex), because it maintains expensive tables which might not be necessary with other engines. I would also separate the UI into a dedicated module.
2. Change the Queue API usage to one item per entity, as I described above.
3. Make cosmetic changes in the Search UI, providing a button to consume the queue via Batch API.

Comment #7

cpliakas commented 10 May 2012 at 19:06

Gotcha. Thanks for the clarification!

#2 and #3 sound viable as you laid it out. I also like the premise of #1 although I would like to have some more discussions about the implementation of that. I think the concept of a backend can be separated out into logical, reusable parts. For example, a "backend" consists of a few things. One is the actual connection to the backend service, such as SQL or Solr. The second piece is the listener, which implements the various hooks and updates the queue. The third piece is the indexer, which retrieves the data and prepares it for indexing before passing it to the connection piece. Ideally the "listener" and "indexer" could and should be backend agnostic so that modules such as Search API and Apache Solr Search Integration can build on top of them instead of having to re-implement very similar subsystems. I understand this is above and beyond what the scope of this issue is, but I would hope we have an eye towards the future when separating out the backend.
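
One way to picture the three-part split is the sketch below. Every name in it (`SearchConnectionInterface`, `SearchListener`, `SearchIndexer`, `MemoryConnection`) is hypothetical and none exists in Drupal core; the point is only that the listener and indexer never need to know which backend sits behind the connection.

```php
<?php

/** Connection: talks to the backend service (SQL, Solr, ...). */
interface SearchConnectionInterface {
  public function send(string $id, array $document): void;
}

/** Listener: reacts to content changes and records what needs indexing. */
class SearchListener {
  /** @var array Pending item IDs as keys, so repeats collapse. */
  public array $pending = array();

  public function onSave(string $id): void {
    $this->pending[$id] = TRUE;
  }
}

/** Indexer: loads pending items, prepares them, hands them to the connection. */
class SearchIndexer {
  private SearchConnectionInterface $connection;
  /** @var callable Loads the raw text for an item ID. */
  private $loader;

  public function __construct(SearchConnectionInterface $connection, callable $loader) {
    $this->connection = $connection;
    $this->loader = $loader;
  }

  /** Indexes everything the listener recorded; returns the number indexed. */
  public function run(SearchListener $listener): int {
    $count = 0;
    foreach (array_keys($listener->pending) as $id) {
      $text = strtolower(($this->loader)($id));
      $this->connection->send($id, array('id' => $id, 'text' => $text));
      unset($listener->pending[$id]);
      $count++;
    }
    return $count;
  }
}

/** Stand-in backend that just remembers what it was sent. */
class MemoryConnection implements SearchConnectionInterface {
  public array $documents = array();

  public function send(string $id, array $document): void {
    $this->documents[$id] = $document;
  }
}

$listener = new SearchListener();
$listener->onSave('node:3');
$listener->onSave('node:3'); // Second save of the same node.

$connection = new MemoryConnection();
$indexer = new SearchIndexer($connection, fn($id) => "Body of $id");
$indexed = $indexer->run($listener);
```

Swapping `MemoryConnection` for an SQL or Solr implementation would leave the listener and indexer untouched, which is the reuse the comment is hoping for.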

Regarding the 100k index problem, I do think there are a couple of differences between the queue API and the current core implementation. It is a difference of UPDATES vs. INSERTS + DELETES, which concerns me a little. I do understand that alternate queue systems can be used which would eliminate this as an issue, but I just want to make sure any performance impacts of moving to the queue system are fully vetted.

Thanks again,
Chris

Comment #8

aries commented 10 May 2012 at 23:11

Chris,
+1 on your idea for the breakdown. Regarding the 100k index problem, we can introduce a mass-indexer batch operation based on the given threshold per cron run. So if the threshold is 100 items per cron run, as in the defaults, that is only 1000 INSERTs for a site with 100k items.
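
The arithmetic can be checked with a short sketch. It assumes, as a reading of the comment rather than a confirmed design, that each queue item would carry one batch of IDs instead of a single ID; the variable names are illustrative.

```php
<?php

// Numbers from the discussion: 100k items, threshold of 100 per cron run.
$item_ids = range(1, 100000);
$threshold = 100;

// Each queue item holds one batch of IDs rather than a single ID.
$batches = array_chunk($item_ids, $threshold);

print count($batches) . " queue items\n"; // Prints "1000 queue items".
```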

Comment #9

jhodgdon commented 27 October 2013 at 22:25

Regarding queue set-up, I think this needs to do something like the API module does for queueing its "reparse this file" jobs. Basically, each file has a database field that keeps track of when it was last parsed, and a Boolean field that keeps track of "is this already queued". When the module detects during a cron run that the file is newer than when it was last parsed, it is added to the parse queue, but only if it is not already marked as "queued".

We could do something similar here for node search.
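
A minimal sketch of that pattern applied to nodes, assuming a per-node record with `changed`, `last_indexed`, and `queued` fields. The record layout and the function name `search_find_items_to_queue()` are assumptions for illustration; they are not the API module's actual schema or code.

```php
<?php

/**
 * Returns the IDs that should be queued for indexing and marks them as queued.
 *
 * An item qualifies only if it changed after it was last indexed and is not
 * already flagged as queued.
 */
function search_find_items_to_queue(array &$records): array {
  $to_queue = array();
  foreach ($records as $nid => &$record) {
    if ($record['changed'] > $record['last_indexed'] && !$record['queued']) {
      $record['queued'] = TRUE;
      $to_queue[] = $nid;
    }
  }
  unset($record);
  return $to_queue;
}

$records = array(
  // Updated since last index, not yet queued.
  3 => array('changed' => 200, 'last_indexed' => 100, 'queued' => FALSE),
  // Updated since last index, but already queued.
  7 => array('changed' => 200, 'last_indexed' => 100, 'queued' => TRUE),
  // Index is current.
  9 => array('changed' => 50, 'last_indexed' => 100, 'queued' => FALSE),
);

$first_run = search_find_items_to_queue($records);  // Only node 3.
$second_run = search_find_items_to_queue($records); // Nothing new.
```

In a full version, the queue worker would presumably reset `queued` and update `last_indexed` once it has indexed the node, so that a later edit can queue it again.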

Comment #10

jhodgdon commented 23 September 2014 at 15:29

Assigned:	aries	» Unassigned
Issue summary:	View changes

I do not think that aries is actually working on this, at this point... so unassigning.

I still think this is probably a good idea, although since we'd definitely need to make sure not to requeue nodes already in the queue, we'd need additions to the queue API. Updating issue summary.

Comment #11

jhodgdon

Comment #27

sukr_s commented 14 August 2024 at 09:33

#504012: Use a queue for node create/update indexing uses a queue to index content upon creation or update. This works only when automated_cron is used. Otherwise the indexing is still done in cron.

After the changes in #504012: Use a queue for node create/update indexing are accepted, the cron function in node can be removed and the check for the automated_cron module in Node::postSave can be removed. This will move node indexing entirely to queues.

Comment #28

pwolanin commented 8 April 2025 at 14:18

IMHO - a queue is actually the wrong solution unless core adds something like https://www.drupal.org/project/queue_unique

The problem is that a single content item may be updated multiple times while it's in the queue, but you only want to re-index it once when its turn comes around.

Comment #29

catch commented 9 April 2025 at 03:43

Title:	Use queue for indexing content	» [PP-1] Use queue for indexing content
Related issues:		+#1548286: API for handling unique queue items

Yeah I think we can postpone this on #1548286: API for handling unique queue items.

Comment #30

9 April 2025 at 03:43

Version:	11.x-dev	» main

Drupal core is now using the main branch as the primary development branch. New developments and disruptive changes should now be targeted to the main branch.

Comment #31

catch commented 4 February 2026 at 17:34

Status:	Active	» Postponed
