Problem/Motivation
Currently, NodeSearch decides which content to index in a given cron run by running a query to find nodes that have been added or updated, and decides how many by a setting.
Since the time to index a node could depend on the nodes themselves, this setting is kind of problematic, and rather than using a query, a queue system seems like it might be better: nodes could be added to the queue as they are created or updated, and then the queue system would index them in order.
However, the current Drupal queue API has no way to tell whether a given node is already in the queue. We would need to know that so that each node is not added to the queue multiple times, which would make indexing terribly inefficient.
Proposed resolution
a) Add functionality to the queue API (like a new type of queue?) so that it has some way of determining "this item is already in the queue". Possibly a tagging system? So that perhaps instead of just calling
QueueInterface::createItem($data);
we would call
TaggedQueueInterface::createTaggedItem($data, $tags)
with
$tags = array('nid' => 3)
and then there could be
TaggedQueueInterface::itemExists($tags)
which would return True/False after checking whether an item with matching tags already exists in the queue.
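To make the proposal concrete, here is a minimal in-memory sketch of the tagged-queue idea. It is written in Python as a language-neutral model, not as Drupal code: the class TaggedQueue and its method names only mirror the hypothetical createTaggedItem()/itemExists() calls above, and skipping an item whose tags already match is one assumed policy among several.

```python
from collections import deque


class TaggedQueue:
    """Toy in-memory model of the proposed tagged queue (not Drupal code)."""

    def __init__(self):
        # (data, tags) pairs kept in FIFO order.
        self._items = deque()

    def item_exists(self, tags):
        """Return True if a queued item carries all of the given tags."""
        return any(
            all(item_tags.get(key) == value for key, value in tags.items())
            for _, item_tags in self._items
        )

    def create_tagged_item(self, data, tags):
        """Queue data unless an item with matching tags is already waiting."""
        if self.item_exists(tags):
            return False
        self._items.append((data, dict(tags)))
        return True

    def claim_item(self):
        """Remove and return the oldest item's data, or None if empty."""
        if not self._items:
            return None
        data, _ = self._items.popleft()
        return data

    def number_of_items(self):
        """Return how many items are waiting."""
        return len(self._items)


queue = TaggedQueue()
queue.create_tagged_item({"nid": 3}, {"nid": 3})
queue.create_tagged_item({"nid": 3}, {"nid": 3})  # ignored: already queued
queue.create_tagged_item({"nid": 7}, {"nid": 7})
```

A real backend would presumably need an indexed tag lookup rather than the linear scan used here, which is only acceptable for illustration.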
b) Use this queue system to index nodes in NodeSearch, rather than the existing query/setting system.
Remaining tasks
Make the new queue system and use it.
User interface changes
We'd get rid of the "number of nodes to index per cron run" setting.
API changes
This would be an API addition to the queue system, rather than a change to the existing API.
Original report by @Xano
We've had the queue API since D7, but it's not used for indexing content yet. Besides the well-known benefits of using the queue API for tasks like this, we could also remove the dreaded "Number of items to index per cron run" setting on the Search.module configuration page.
Comments
Comment #1
aries commented
Comment #2
cpliakas commented
I don't think the queue system works well for search. What happens if you update a node three times before the queue is processed? Currently the queue system doesn't handle uniqueness, as illustrated by issue #1548286: API for handling unique queue items, so it would trigger three index operations on the same node in the same cron run. In addition, what happens when you want to re-index all content on your site? If you have 100k items, does that mean that 100k messages have to be sent to the queue?
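The duplicate-work concern can be illustrated with a toy model. This is a Python sketch of a plain FIFO queue with no uniqueness check; on_node_save() and run_cron() are hypothetical stand-ins for a save hook and a cron worker, not Drupal functions.

```python
from collections import deque

queue = deque()    # plain FIFO queue with no uniqueness check
index_calls = []   # records each (hypothetical) index operation


def on_node_save(nid):
    """Called on every save; blindly queues the node for indexing."""
    queue.append(nid)


def run_cron():
    """Process everything currently queued, one index operation per item."""
    while queue:
        index_calls.append(queue.popleft())


for _ in range(3):  # node 3 is edited three times before cron runs
    on_node_save(3)
run_cron()
```

After the run, index_calls holds node 3 three times, so the same node is indexed three times in a single cron run.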
I'm sure workarounds to these issues could be coded, but my sense is that the queue system plus workarounds would probably add too much complexity for too little gain over the current system. What are the big wins that would warrant moving to the queue API?
Thanks,
Chris
Comment #3
timmillwood
Interested in this, but have a feeling that search indexing could create a lot of data for the queue.
Comment #4
aries commented
I don't feel re-indexing is a problem. At the moment, we also maintain a special db table for this.
But on a highly interactive website, using the queue is overkill, because the same item would be in the queue several times. Do you see any gain in using the Queue API for this?
What it would solve is the need for a cron run for those who don't have access to Drush, by using the Batch and Queue APIs instead. I would simply add an "Index now" button right next to the Re-index button, which starts a batch that runs until all the items are indexed. Give a +1 if you agree.
Comment #5
cpliakas commented
aries,
I am not clear on your position. Are you in favor of the original post suggesting that we should leverage the Queue API, or are you saying that all that is needed is an "Index now" button similar to Apache Solr Search Integration and Search API? Or is it somewhere in between?
Thanks in advance for helping me understand,
Chris
Comment #6
aries commented
Chris,
Sorry, I wasn't clear enough. Somewhere in between. I meant that using Queue with the current implementation is overkill. But we could do it better.
If you check _batch_populate_queue(), you'll see a trick with the name field. We can do the same in search, using the form search:[entity_type]:[entity_id]. With this, we can avoid duplicates in the queue. 100K entities in the queue is not a real problem, since the current implementation does the same in the search_dataset table.
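As a rough illustration of the naming idea, the sketch below keys pending work by a deterministic name in the search:[entity_type]:[entity_id] form, so queueing the same entity twice collapses into one entry. It is a Python model, not Drupal code; queue_name() and enqueue() are hypothetical helpers, and how such a key would map onto real queue storage is left open.

```python
def queue_name(entity_type, entity_id):
    """Build the deterministic key, e.g. 'search:node:3'."""
    return f"search:{entity_type}:{entity_id}"


# A dict preserves insertion order, so it can serve as a keyed FIFO here.
pending = {}


def enqueue(entity_type, entity_id):
    """Add the entity unless an entry with the same name already exists."""
    pending.setdefault(queue_name(entity_type, entity_id), (entity_type, entity_id))


enqueue("node", 3)
enqueue("node", 3)  # same name, so no second entry is created
enqueue("user", 3)  # different entity type, so a separate entry
```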
So what I suggest:
Comment #7
cpliakas commented
Gotcha. Thanks for the clarification!
#2 and #3 sound viable as you laid it out. I also like the premise of #1 although I would like to have some more discussions about the implementation of that. I think the concept of a backend can be separated out into logical, reusable parts. For example, a "backend" consists of a few things. One is the actual connection to the backend service, such as SQL or Solr. The second piece is the listener, which implements the various hooks and updates the queue. The third piece is the indexer, which retrieves the data and prepares it for indexing before passing it to the connection piece. Ideally the "listener" and "indexer" could and should be backend agnostic so that modules such as Search API and Apache Solr Search Integration can build on top of them instead of having to re-implement very similar subsystems. I understand this is above and beyond what the scope of this issue is, but I would hope we have an eye towards the future when separating out the backend.
Regarding the 100k index problem, I do think there are a couple of differences between the queue API and the current core implementation. It is a difference of UPDATES vs. INSERTS + DELETES, which concerns me a little. I do understand that alternate queue systems can be used which would eliminate this as an issue, but I just want to make sure any performance impacts of moving to the queue system are fully vetted.
Thanks again,
Chris
Comment #8
aries commented
Chris,
+1 on your idea for the breakdown. Regarding the 100k index problem, we can introduce a mass-indexer batch operation based on the given threshold per cron run. So if the threshold is 100 per cron run, as in the defaults, that is only 1,000 INSERTs for a massive website with 100k items.
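The arithmetic can be checked with a short sketch. It assumes one reading of the suggestion, namely that each queue entry covers one threshold-sized slice of the content rather than a single item; batch_ranges() is a hypothetical helper written in Python, not part of any Drupal API.

```python
def batch_ranges(total_items, threshold):
    """Yield (offset, limit) pairs, one per queue entry."""
    for offset in range(0, total_items, threshold):
        yield offset, min(threshold, total_items - offset)


# 100k items at 100 items per entry gives 1,000 queue entries.
ranges = list(batch_ranges(100_000, 100))
```

Under that assumption a full re-index of 100k items costs 100,000 / 100 = 1,000 inserts instead of 100,000.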
Comment #9
jhodgdon
Regarding queue set-up, I think this needs to do something like the API module does for queueing its "reparse this file" jobs. Basically, each file has a database field that keeps track of when it was last parsed, and a Boolean field that keeps track of "is this already queued". When the module detects during a cron run that the file is newer than when it was last parsed, it is added to the parse queue, but only if it is not already marked as "queued".
We could do something similar here for node search.
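A minimal sketch of that bookkeeping, applied to nodes, might look like the following. It is a Python model with hypothetical names (Record, cron_enqueue(), process_queue()); the real API module code is not reproduced here, and timestamps are plain integers for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Record:
    changed: int            # when the node was last modified
    last_indexed: int = 0   # when the node was last indexed
    queued: bool = False    # "is this already queued" flag


records = {1: Record(changed=10), 2: Record(changed=5, last_indexed=8)}
queue = []


def cron_enqueue():
    """Queue nodes that are newer than their index, unless already queued."""
    for nid, record in records.items():
        if record.changed > record.last_indexed and not record.queued:
            queue.append(nid)
            record.queued = True


def process_queue(now):
    """Index every queued node once and clear its queued flag."""
    while queue:
        record = records[queue.pop(0)]
        record.last_indexed = now
        record.queued = False


cron_enqueue()           # node 1 is stale, node 2 is up to date
records[1].changed = 12  # node 1 is edited again before it is processed
cron_enqueue()           # the flag prevents a second queue entry
queued_before_processing = list(queue)
process_queue(now=20)
```

The flag is what keeps node 1 from being queued twice even though it changed twice before processing.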
Comment #10
jhodgdon
I do not think that aries is actually working on this, at this point... so unassigning.
I still think this is probably a good idea, although since we'd definitely need to make sure not to requeue nodes already in the queue, we'd need additions to the queue API. Updating issue summary.
Comment #11
jhodgdon
Since 8.0.x-beta1 has been released, our policy at this point is No feature requests until 8.1.x. See #2350615: [policy, no patch] What changes can be accepted during the Drupal 8 beta phase?. Sorry, it's just too late for 8.0.x at this point, so even if we had a viable patch, the core committers would not commit it. So unless we decide this is a Task or a Bug (and I don't think it is), we'll have to delay it.
Comment #27
sukr_s commented
#504012: Use a queue for node create/update indexing uses a queue to index content upon creation or update. This works only when automated_cron is used. Otherwise the indexing is still done in cron.
After the changes in #504012: Use a queue for node create/update indexing are accepted, the cron function in node can be removed and the check for the automated_cron module in Node::postSave can be removed. Node indexing would then be done entirely via queues.
Comment #28
pwolanin commented
IMHO - a queue is actually the wrong solution unless core adds something like https://www.drupal.org/project/queue_unique
The problem is that a single content item may be updated multiple times while it's in the queue, but you only want to re-index it once when its turn comes around.
Comment #29
catch
Yeah I think we can postpone this on #1548286: API for handling unique queue items.
Comment #31
catch