Problem/Motivation
In the current version of the module the file is ignored if the extract service is unavailable.
In the Drupal watchdog logs you find the following error message if the service is unavailable:
Error extracting text from file XXXX for XXXX
In this case the file is not added to any queue and will only be indexed if the node where the file is linked is re-indexed.
In my case I used many large files and the Solr service was out of memory if I indexed all the nodes with the coupled files.
Proposed resolution
Please your feedback for this proposal.
See Patch.
src/Plugin/search_api_attachments/SolrExtractor.php
Check if the Solr service is available.
if (!$backend->isAvailable()) {
throw new \Exception('Solr Exctractor is not available.');
}
Remaining task 1 Add this to other extracting services.
src/Plugin/search_api/processor/FilesExtractor.php
Add file to queue
If there's a PHP error (try /catch), add the file to the Drupal queue.
if (static::ENABLE_QUEUE) {
$queue = \Drupal::queue('search_api_attachments');
$item = new \stdClass();
$item->fid = $file->id();
$queue->createItem($item);
$this->logger->log(LogLevel::WARNING, 'File is added to the queue for text extraction @file_id for @entity_type @entity_id.', $message_params);
}
Check if file is added in the past to the queue
Ignore this file.
It will be indexed in the queue (cron).
$extract = TRUE;
if (static::ENABLE_QUEUE) {
$database = \Drupal::database();
$result = $database->select('queue', 'q')
->fields('q', ['item_id', 'data'])
->condition('data', "%" . $database->escapeLike('"' . $file->id() . '"') . "%", 'LIKE')
->condition('name', 'search_api_attachments')
->execute()
->fetchAll();
$extract = ($result) ? FALSE : TRUE;
}
if ($extract) {
$extracted_data = $extractor_plugin->extract($file);
$this->keyValue->get($collection)->set($key, $extracted_data);
}
Remaining task 2 Other method to search in BLOB of a queue item?
src/Plugin/QueueWorker/ExtractorQueue.php
See patch.
1) Get chosen extractor and configuration.
2) Load file from by queue item data.
3) Concert file to text.
4) Save extracted text in key_value table.
5) Get a list of nodes/paragraph items/media items where the file is used
Remaining task 3 I used entity_usage to look where a media item of paragraph item is used.
6) Index the nodes (where the files are coupled) again.
The extracted text is in the key_value table so that the external service no longer needs to be called.
Remaining tasks
See Proposed resolution
User interface changes
None
API changes
None
Data model changes
None
Comment | File | Size | Author |
---|---|---|---|
#10 | 2991136-10-queue_service.patch | 10.35 KB | neclimdul |
#10 | interdiff-2991136-10.txt | 12 KB | neclimdul |
Comments
Comment #2
borisson_Very awesome, Great work @tomhollevoet!
@file document headers are not needed for files that contain classes/interfaces.
This is probably a copy/paste from another file? :)
We can revert this entire if statement, to make it less indented. This is probably just a pet-peeve but I prefer having as few levels of indentation as possible.
Should we check here if that already exists? So that multiple workers don't get the same file?
The entity type manager is used a couple of times. Extracting that to a variable (and probably using DI to inject it) seems like a good way to simplify this code.
Early returns can help here as well.
I'm not sure I understand why this is in a const, is this supposed be configurable?
I don't think it should be configurable, this is an awesome feature and probably it should be enabled all the time. In that case, creating a const as an alias for
TRUE
is probably overkill.We should probably add the entity type / entity id to this item as well, to help with debugging / information from the queue?
LogLevel::WARNING doesn't seem to be right, ::INFO is probably better?
Comment #3
tomhollevoet CreditAttribution: tomhollevoet at Calibrate commentedThanks for the comments @borisson_ !
I fixed these suggestions in a new patch.
Comment #4
tomhollevoet CreditAttribution: tomhollevoet at Calibrate commentedAdd maximum attempts in queue.
Comment #5
borisson_I think this looks really awesome! I would love to get feedback from @izus, because this adds a new feature - but it get's a +1 from me.
Comment #6
izus CreditAttribution: izus commentedHi,
thanks for the effort on this.
+1 for the idea of the feature added here but i would prefer that we avoid a hard dependency on an other module (entity_usage in .info.yml file ) : we shouldn't be obliged to install entity_usage to use search_api_attachments.
can we avoid the hard dependency please ?
Comment #7
borisson_I'm not sure, I don't see a way around using that module to get the same functionality.
Comment #8
drunken monkeySorry! Joris asked me for my opinion on this, but I fear I'm probably not familiar enough with this module (anymore) to really understand the patch. As far as I understand it, though, the only advantage compared to just throwing a
SearchApiException
(and having Search API deal with re-indexing the item at a later time) is that the rest of the item will still get indexed this way – which might actually be undesired and, as such, a disadvantage for some sites.(OK, and potentially, that other items can still get indexed, if they don't have files attached. I'm not certain, but I think there is no way in a processor to fail just for a single item and have that not be indexed.)
If that's the case, and izus would prefer not to add that dependency (which I totally understand and, in one of my modules, would probably see the same way), then just throwing an exception seems way easier?
Or, if there's a queue, why bother with all the files and not just remember the Search API item IDs that need to be re-indexed once the extractor becomes available again? It seems that would also eliminate the need for entity_usage.
Also, if you need introspection into the queue contents, I think you should just create your own table instead of using the queue system. Querying the
queue
table directly like this seems way too hack-y and dangerous. (E.g., if someone switches out the queue service to use MongoDB instead, this would just throw exceptions every time the statement is reached.)But, to re-iterate: I might just not have understood the patch, or the current workings of the module, and might have been spouting fantastic nonsense throughout this comment.
But what I can surely do is provide a re-roll, see attached.
Comment #9
izus CreditAttribution: izus commentedHi,
Can we please use state api (per example) to avoid the entity_usage module hard dependency.
state api is just a suggestion, please suggest if you think there are better core solutions
Thanks
Comment #10
neclimdulNot sure if this is what you're looking for. Removing
entity_usage
changes a lot of assumptions from the original patch so there are quite a few changes.Comment #11
neclimdulComment #13
izus CreditAttribution: izus commentedThis is now merged
thanks all