Problem/Motivation

In the current version of the module the file is ignored if the extract service is unavailable.

In the Drupal watchdog logs you find the following error message if the service is unavailable:
Error extracting text from file XXXX for XXXX

In this case the file is not added to any queue and will only be indexed if the node where the file is linked is re-indexed.

In my case I used many large files and the Solr service was out of memory if I indexed all the nodes with the coupled files.

Proposed resolution

Please your feedback for this proposal.
See Patch.

src/Plugin/search_api_attachments/SolrExtractor.php

Check if the Solr service is available.

if (!$backend->isAvailable()) {
  throw new \Exception('Solr Exctractor is not available.');
}

Remaining task 1 Add this to other extracting services.

src/Plugin/search_api/processor/FilesExtractor.php

Add file to queue

If there's a PHP error (try /catch), add the file to the Drupal queue.

if (static::ENABLE_QUEUE) {
   $queue = \Drupal::queue('search_api_attachments');
   $item = new \stdClass();
   $item->fid = $file->id();
  $queue->createItem($item);

  $this->logger->log(LogLevel::WARNING, 'File is added to the queue for text extraction @file_id for @entity_type @entity_id.', $message_params);
}

Check if file is added in the past to the queue

Ignore this file.
It will be indexed in the queue (cron).

$extract = TRUE;
if (static::ENABLE_QUEUE) {
  $database = \Drupal::database();
    $result = $database->select('queue', 'q')
      ->fields('q', ['item_id', 'data'])
      ->condition('data', "%" . $database->escapeLike('"' . $file->id() . '"') . "%", 'LIKE')
      ->condition('name', 'search_api_attachments')
      ->execute()
      ->fetchAll();

  $extract = ($result) ? FALSE : TRUE;
}

if ($extract) {
  $extracted_data = $extractor_plugin->extract($file);
  $this->keyValue->get($collection)->set($key, $extracted_data);
}

Remaining task 2 Other method to search in BLOB of a queue item?

src/Plugin/QueueWorker/ExtractorQueue.php

See patch.

1) Get chosen extractor and configuration.
2) Load file from by queue item data.
3) Concert file to text.
4) Save extracted text in key_value table.
5) Get a list of nodes/paragraph items/media items where the file is used

Remaining task 3 I used entity_usage to look where a media item of paragraph item is used.

6) Index the nodes (where the files are coupled) again.
The extracted text is in the key_value table so that the external service no longer needs to be called.

Remaining tasks

See Proposed resolution

User interface changes

None

API changes

None

Data model changes

None

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

tomhollevoet created an issue. See original summary.

borisson_’s picture

Status: Needs review » Needs work

Very awesome, Great work @tomhollevoet!

  1. +++ b/src/Plugin/QueueWorker/ExtractorQueue.php
    @@ -0,0 +1,99 @@
    +/**
    + * @file
    + * Contains \Drupal\search_api_attachments\Plugin\QueueWorker\ExtractorQueue.
    + */
    

    @file document headers are not needed for files that contain classes/interfaces.

  2. +++ b/src/Plugin/QueueWorker/ExtractorQueue.php
    @@ -0,0 +1,99 @@
    + * Processes Tasks for Learning.
    

    This is probably a copy/paste from another file? :)

  3. +++ b/src/Plugin/QueueWorker/ExtractorQueue.php
    @@ -0,0 +1,99 @@
    +    if ($file) {
    

    We can revert this entire if statement, to make it less indented. This is probably just a pet-peeve but I prefer having as few levels of indentation as possible.

    if ($file === NULL) {
      return FALSE;
    }
    
    try {
    // ... rest of code here.
    }
    
    
  4. +++ b/src/Plugin/QueueWorker/ExtractorQueue.php
    @@ -0,0 +1,99 @@
    +        $key = $collection . ':' . $file->id();
    +        $keyvalue_service->get($collection)->set($key, $extracted_data);
    

    Should we check here if that already exists? So that multiple workers don't get the same file?

  5. +++ b/src/Plugin/QueueWorker/ExtractorQueue.php
    @@ -0,0 +1,99 @@
    +              $entity = \Drupal::entityTypeManager()->getStorage($entity_type)->load($entity_id);
    

    The entity type manager is used a couple of times. Extracting that to a variable (and probably using DI to inject it) seems like a good way to simplify this code.

  6. +++ b/src/Plugin/QueueWorker/ExtractorQueue.php
    @@ -0,0 +1,99 @@
    +        // Update Search API record.
    +        if (!empty($nids)) {
    

    Early returns can help here as well.

    if (empty($nids)) {
      return;
    }
    
    foreach ($nids as $nid) {
    ...
    }
    
  7. +++ b/src/Plugin/search_api/processor/FilesExtractor.php
    @@ -59,6 +59,11 @@ class FilesExtractor extends ProcessorPluginBase implements PluginFormInterface
    +  const ENABLE_QUEUE = TRUE;
    

    I'm not sure I understand why this is in a const, is this supposed be configurable?

    I don't think it should be configurable, this is an awesome feature and probably it should be enabled all the time. In that case, creating a const as an alias for TRUE is probably overkill.

  8. +++ b/src/Plugin/search_api/processor/FilesExtractor.php
    @@ -266,8 +287,16 @@ class FilesExtractor extends ProcessorPluginBase implements PluginFormInterface
    +          $item = new \stdClass();
    +          $item->fid = $file->id();
    +          $queue->createItem($item);
    

    We should probably add the entity type / entity id to this item as well, to help with debugging / information from the queue?

  9. +++ b/src/Plugin/search_api/processor/FilesExtractor.php
    @@ -266,8 +287,16 @@ class FilesExtractor extends ProcessorPluginBase implements PluginFormInterface
    +          $this->logger->log(LogLevel::WARNING, 'File is added to the queue for text extraction @file_id for @entity_type @entity_id.', $message_params);
    

    LogLevel::WARNING doesn't seem to be right, ::INFO is probably better?

tomhollevoet’s picture

Status: Needs work » Needs review
FileSize
8.46 KB

Thanks for the comments @borisson_ !
I fixed these suggestions in a new patch.

tomhollevoet’s picture

Add maximum attempts in queue.

borisson_’s picture

I think this looks really awesome! I would love to get feedback from @izus, because this adds a new feature - but it get's a +1 from me.

izus’s picture

Hi,
thanks for the effort on this.
+1 for the idea of the feature added here but i would prefer that we avoid a hard dependency on an other module (entity_usage in .info.yml file ) : we shouldn't be obliged to install entity_usage to use search_api_attachments.

can we avoid the hard dependency please ?

borisson_’s picture

+1 for the idea of the feature added here but i would prefer that we avoid a hard dependency on an other module (entity_usage in .info.yml file ) : we shouldn't be obliged to install entity_usage to use search_api_attachments.

I'm not sure, I don't see a way around using that module to get the same functionality.

drunken monkey’s picture

FileSize
9.3 KB

Sorry! Joris asked me for my opinion on this, but I fear I'm probably not familiar enough with this module (anymore) to really understand the patch. As far as I understand it, though, the only advantage compared to just throwing a SearchApiException (and having Search API deal with re-indexing the item at a later time) is that the rest of the item will still get indexed this way – which might actually be undesired and, as such, a disadvantage for some sites.
(OK, and potentially, that other items can still get indexed, if they don't have files attached. I'm not certain, but I think there is no way in a processor to fail just for a single item and have that not be indexed.)

If that's the case, and izus would prefer not to add that dependency (which I totally understand and, in one of my modules, would probably see the same way), then just throwing an exception seems way easier?

Or, if there's a queue, why bother with all the files and not just remember the Search API item IDs that need to be re-indexed once the extractor becomes available again? It seems that would also eliminate the need for entity_usage.

Also, if you need introspection into the queue contents, I think you should just create your own table instead of using the queue system. Querying the queue table directly like this seems way too hack-y and dangerous. (E.g., if someone switches out the queue service to use MongoDB instead, this would just throw exceptions every time the statement is reached.)

But, to re-iterate: I might just not have understood the patch, or the current workings of the module, and might have been spouting fantastic nonsense throughout this comment.

But what I can surely do is provide a re-roll, see attached.

izus’s picture

Status: Needs review » Needs work

Hi,
Can we please use state api (per example) to avoid the entity_usage module hard dependency.
state api is just a suggestion, please suggest if you think there are better core solutions
Thanks

neclimdul’s picture

FileSize
12 KB
10.35 KB

Not sure if this is what you're looking for. Removing entity_usage changes a lot of assumptions from the original patch so there are quite a few changes.

neclimdul’s picture

Status: Needs work » Needs review

  • neclimdul authored ac750c9 on 8.x-1.x
    Issue #2991136 by tomhollevoet, neclimdul, drunken monkey, borisson_,...
izus’s picture

Status: Needs review » Fixed

This is now merged
thanks all

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.