Hello!
We have a Drupal 9 website with around 1.4 million indexed documents, and reindexing the site takes a long time. When we checked the code, we found that during reindexing the module calls the deleteByQuery function, which is slow on the Solr servers (roughly 60 to 90 seconds per call).
The problem with deleteByQuery is that Solr scans all documents, and the fields within each document, to find the ones that need to be deleted. The higher the document count, the more time and resources the request takes; at 1.4 million documents it has a large impact on our indexing process.
By comparison, deleteById is much faster. The search_api_solr module currently uses both deleteByQuery and deleteById in SearchApiSolrBackend.php.
Ref : https://git.drupalcode.org/project/search_api_solr/-/blob/4.x/src/Plugin...
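For context, the two operations map to different payloads in Solr's JSON update format; a minimal sketch in Python (the query string and document ids below are illustrative, not taken from the module):

```python
import json

# Payloads for Solr's /update endpoint (JSON update format).
# delete-by-query: Solr must first evaluate the query to find matching docs.
delete_by_query = {"delete": {"query": "index_id:default_index"}}

# delete-by-id: Solr removes the listed documents directly by unique key,
# with no query evaluation step.
delete_by_id = {"delete": ["entity:node/1:en", "entity:node/2:en"]}

print(json.dumps(delete_by_query))
print(json.dumps(delete_by_id))
```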
We propose that, instead of calling both functions, the code first check whether the document has child documents, collect their ids, and call deleteById to delete them.
Also, sometimes `_root_` doesn't contain any document ids, so before filtering on `_root_` we should first check whether it has any documents.
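A rough sketch of the proposed flow (Python pseudocode-level; `fetch_child_ids` is a hypothetical stand-in for a `_root_` lookup against Solr, and the payload shape follows Solr's JSON delete-by-id format):

```python
def build_delete_payload(parent_ids, fetch_child_ids):
    """Collect parent ids plus any child ids found via a _root_ lookup,
    and return a single delete-by-id payload instead of a delete-by-query."""
    ids = []
    for pid in parent_ids:
        ids.append(pid)
        # Only include children if the parent actually has documents
        # under _root_ (the check proposed above).
        children = fetch_child_ids(pid)  # hypothetical _root_ query
        if children:
            ids.extend(children)
    return {"delete": ids}

# Example with a stubbed child lookup:
children_map = {"node/1": ["node/1#page1", "node/1#page2"], "node/2": []}
payload = build_delete_payload(
    ["node/1", "node/2"], lambda p: children_map.get(p, [])
)
print(payload)
# {'delete': ['node/1', 'node/1#page1', 'node/1#page2', 'node/2']}
```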
Thank you!!
Comments
Comment #2
mpotdar commented

Comment #3
mpotdar commented

Comment #4
mkalkbrenner commented

I agree that there could be optimizations like that. So if you have a setup where you can easily do performance tests, you're welcome to provide a patch.
Comment #5
mkalkbrenner commented

Just to explain that this is not a bug: if a document is about to be deleted, you need to delete its children first.
Doing a query first to search for children might not be safe depending on the replication or commit strategy, but I'm not sure.
The question is why you trigger mass deletions. Or does Search API trigger them when you re-index?
In general you can just overwrite the existing document.
Comment #6
mkalkbrenner commented

Are you sure? The field `_root_` is indexed, so that query should be fast. Otherwise block joins for parents would be slow, too.
Comment #7
mkalkbrenner commented

Comment #9
mkalkbrenner commented

Comment #11
pdcarto commented

I'm not sure that this actually fixed the problem, or possibly I'm seeing a different problem. I see a `deleteItems` task in `search_api_task` with 1680 ids. Solr fails with a "too many boolean clauses" message.
In my case, a parent object (a PDF file) is being deleted, spawning the deletion of the indexed hOCR text for each of its 1680 children (pages).
I tried editing `maxBooleanClauses`, setting it to a very big number (the default is 1024). Initially I edited and re-installed `solrconfig_query.xml` and restarted Solr, which had no impact. Then I found search_api_solr's `search_api_solr.solr_cache.cache_queryresult_default_7_0_0` configuration and changed it there, which again had no impact.
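One possible workaround sketch for the "too many boolean clauses" failure, assuming the ids end up in a boolean query: chunk the id list so each delete request stays under the limit (the batch size and id names here are illustrative, and Python is used only for brevity):

```python
def chunk_ids(ids, batch_size=1000):
    """Split a large id list into batches small enough that an
    id-based delete query stays under Solr's maxBooleanClauses."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

# 1680 ids (as in the failing deleteItems task) split into two requests:
batches = chunk_ids([f"page{i}" for i in range(1680)], batch_size=1000)
print([len(b) for b in batches])  # [1000, 680]
```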
It seems to me that there is one problem here with two possible solutions: