Problem/Motivation

When attempting to index media entities with file attachments, we are running into a few different exceptions thrown by Solarium during the extraction. An example of one exception that we're seeing can be seen here - https://www.drupal.org/node/2859565. Regardless of the exception, it causes indexing to halt outright. If we try to start the indexing process again it will fail on the same item. If we unpublish or remove that item from the queue of items to be indexed indexing proceeds until it hits another item that throws an exception, and so on.

Core and module versions:
Drupal - 8.3.2
Search API - 1.0.0
Search API Solr - 1.0.0-beta3
Search API Attachments - 1.0-beta2

Proposed resolution

When an exception is thrown during extraction, the module should catch the exception, log the item that failed, and proceed with indexing the remainder of items in the queue.

Remaining tasks

Provide a patch that carries out the proposed resolution. This would involve wrapping the invocation of the extractOrGetFromCache method with a try/catch block in FilesExtrator.php. The catch should log helpful debugging information for the failed entity, as well as continue to the next file.

Relevant code:

          if (!empty($files)) {
            $extraction = '';

            foreach ($files as $file) {
              if ($this->isFileIndexable($file, $item, $field_name)) {
                $extraction .= $this->extractOrGetFromCache($file, $extractor_plugin);
              }
            }
            $field->addValue($extraction);
          }
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

malik.kotob created an issue. See original summary.

janusman’s picture

This may also be related: https://www.drupal.org/node/2343705 for apachesolr_attachments.module. The main idea is to track files that have failed in the past, so we stop trying to extract them each-and-every-time.

malik.kotob’s picture

Attaching a patch that allows indexing to continue when exceptions are thrown during extraction. The added code catches the exception thrown during extraction and logs the failure to a new table (search_api_attachments_log). The table stores the index_id, entity_id, fid, and exception message.

acbramley’s picture

Status: Active » Needs work
+++ b/src/Plugin/search_api/processor/FilesExtrator.php
@@ -518,4 +533,46 @@ class FilesExtrator extends ProcessorPluginBase implements PluginFormInterface {
+      $this->connection
+        ->insert('search_api_attachments_log')
+        ->fields([
+          'index_id' => $item->getIndex()->id(),
+          'datasource' => $item->getDatasourceId(),
+          'entity_id' => $entity->id(),
+          'fid' => $file->id(),
+          'message' => $exceptionMessage,
+          'timestamp' => REQUEST_TIME,
+        ])
+        ->execute();

Why not just use Drupal's logger service to log this instead of creating a db table for it? That way it's up to the site builder to determine where that log goes (be it syslog, stdout etc)

malik.kotob’s picture

@acbramley I actually took that route initially, but with @janusam's comment in mind from above, I think it would be nice to have a table down the road that we could potentially leverage for tracking files that have failed in the past. Additionally, this would give us the ability to more easily expose the content to views to allow for non-technical folk to sift through problematic files. Thoughts?

malik.kotob’s picture

FileSize
7.79 KB

The previously attached patch was incorrectly rolled. Attaching a new patch against 8.x-1.x.

malik.kotob’s picture

Status: Needs work » Needs review
malik.kotob’s picture

Version: 8.x-1.0-beta2 » 8.x-1.x-dev
malik.kotob’s picture

Attaching a patch that applies to the 8.x-1.0-beta2 version cleanly.

malik.kotob’s picture

FileSize
7.6 KB

Apologies, fixing incorrect spacing in patch from #9.

malik.kotob’s picture

Moving the exception handling location. Attaching new patches for 8.x-1.0-beta2 and 8.x-1.x.

acbramley’s picture

@malik.kotob re #5 understood, if further work is going to be done to ignore the problematic files then that makes sense.

Further review:

+++ b/src/Plugin/search_api/processor/FilesExtrator.php
@@ -518,4 +540,46 @@ class FilesExtrator extends ProcessorPluginBase implements PluginFormInterface {
+    try {
+      $this->connection
+        ->insert('search_api_attachments_log')
+        ->fields([
+          'index_id' => $item->getIndex()->id(),
+          'datasource' => $item->getDatasourceId(),
+          'entity_id' => $entity->id(),
+          'fid' => $file->id(),
+          'message' => $exceptionMessage,
+          'timestamp' => REQUEST_TIME,
+        ])
+        ->execute();
+    }
+    catch (Exception $e) {
+      throw $e;
+    }

Not sure what the point of this is?
EDIT: To clarify, the try/catch is what I'm questioning. It's just catching and throwing the exception anyway so there's no point having the try/catch in the first place?

acbramley’s picture

Fixed #12 and also fixed the update hook which was failing.

acbramley’s picture

Status: Needs review » Needs work
+++ b/src/Plugin/search_api/processor/FilesExtrator.php
@@ -183,7 +195,7 @@ class FilesExtrator extends ProcessorPluginBase implements PluginFormInterface {
-                $extraction .= $this->extractOrGetFromCache($file, $extractor_plugin);
+                $extraction .= $this->extractOrGetFromCache($entity, $file, $item, $extractor_plugin);

This change causes Fatals in the ExtractedText formatter since the $entity and $item parameters have not been added.

That formatter doesn't have access to the search api item object so we may have to remove that from the logging since that is the only reason it is being added here.

acbramley’s picture

Status: Needs work » Needs review
FileSize
9.09 KB

Here's a new patch that uses a simple logger approach.

malik.kotob’s picture

@acbramley, ah I didn't realize that method was called elsewhere. Yeah, I think we could live without the index information. Especially since these errors are possible outside of the context of indexing (didn't realize that until you pointed this issue out). The patch from #13 is missing the install file portion. What was the bug in the update hook? I'll post another patch shortly with your findings from there as well as from #14. Thanks for your help on this!

malik.kotob’s picture

New patches that address @acbramley's feedback in #12, #13, and #14 (pointless try/catch, failing update hook, fatals in ExtractedText formatter). I've already experienced a benefit of having a table with failing fids by being able to query and join on the file_managed table. Having all failing files in a single location has improved debugging and spotting of patterns in files that fail for various reasons.

acbramley’s picture

@malik.kotob no worries! #17 is looking much better, I do however think we need some form of tests for this.

acbramley’s picture

FileSize
9.02 KB
644 bytes

Removing the test log from #15

malik.kotob’s picture

Sorry @acbramley. I've been sidetracked and haven't had a chance to get back to this. For the tests, do you have anything in mind? Do you feel like we need tests that test an extraction against a corrupt file against an actual SOLR instance, or do you feel like unit tests might be sufficient? We're not really doing anything too crazy in #17, so maybe some unit tests that verify we write to the table when an exception is thrown are sufficient?

acbramley’s picture

FileSize
8.36 KB

Reroll of #19 against HEAD with a couple of minor fixes:

1) $extracted_data was undefined when an exception was thrown leading to warnings when setting the keyVale store.
2) extractFileContents wasn't returning NULL when isFileIndexable returned FALSE.

It'd be really great to get this committed as I imagine most sites using this module would want their indexing to continue to work when a file can't be extracted.

fenstrat’s picture

Status: Needs review » Reviewed & tested by the community

This does exactly what it says on the box, so marking as RTBC.

Again it'd be great to expand this module tests, but it's probably not the task for this issue.

  • acbramley authored 6cfceeb on 8.x-1.x
    Issue #2884453 by malik.kotob, acbramley, janusman, fenstrat, izus: Add...
izus’s picture

Status: Reviewed & tested by the community » Fixed

Thanks all
this is now merged

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.