Problem/Motivation
When attempting to index media entities with file attachments, we are running into several different exceptions thrown by Solarium during extraction. An example of one such exception can be seen here: https://www.drupal.org/node/2859565. Regardless of the exception, it causes indexing to halt outright. If we try to restart the indexing process, it fails on the same item. If we unpublish that item or remove it from the queue of items to be indexed, indexing proceeds until it hits another item that throws an exception, and so on.
Core and module versions:
Drupal - 8.3.2
Search API - 1.0.0
Search API Solr - 1.0.0-beta3
Search API Attachments - 1.0-beta2
Proposed resolution
When an exception is thrown during extraction, the module should catch the exception, log the item that failed, and proceed with indexing the remainder of items in the queue.
Remaining tasks
Provide a patch that carries out the proposed resolution. This involves wrapping the invocation of the extractOrGetFromCache() method in a try/catch block in FilesExtractor.php. The catch should log helpful debugging information for the failed entity and continue to the next file.
Relevant code:
```php
if (!empty($files)) {
  $extraction = '';
  foreach ($files as $file) {
    if ($this->isFileIndexable($file, $item, $field_name)) {
      $extraction .= $this->extractOrGetFromCache($file, $extractor_plugin);
    }
  }
  $field->addValue($extraction);
}
```
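The proposed try/catch could look roughly like the following. This is a minimal sketch, not the committed patch; the injected `$this->logger` channel and the exact log message are assumptions:

```php
if (!empty($files)) {
  $extraction = '';
  foreach ($files as $file) {
    if ($this->isFileIndexable($file, $item, $field_name)) {
      try {
        $extraction .= $this->extractOrGetFromCache($file, $extractor_plugin);
      }
      catch (\Exception $e) {
        // Log the failing file and keep indexing the remaining files
        // instead of letting the exception halt the whole queue.
        $this->logger->error(
          'Extraction failed for file @fid (@uri): @message',
          [
            '@fid' => $file->id(),
            '@uri' => $file->getFileUri(),
            '@message' => $e->getMessage(),
          ]
        );
      }
    }
  }
  $field->addValue($extraction);
}
```

The key point is that the catch block swallows the exception after recording it, so one corrupt file no longer blocks the rest of the queue.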
Comment | File | Size | Author |
---|---|---|---|
#21 | add_exception_handling-2884453-21.patch | 8.36 KB | acbramley |
#17 | add_exception_handling-beta2-2884453-17.patch | 10 KB | malik.kotob |
#17 | add_exception_handling-2884453-17.patch | 10.19 KB | malik.kotob |
Comments
Comment #2
janusman CreditAttribution: janusman at Acquia commented
This may also be related: https://www.drupal.org/node/2343705 for apachesolr_attachments.module. The main idea is to track files that have failed in the past, so we stop trying to extract them each and every time.
Comment #3
malik.kotob CreditAttribution: malik.kotob commented
Attaching a patch that allows indexing to continue when exceptions are thrown during extraction. The added code catches the exception thrown during extraction and logs the failure to a new table (search_api_attachments_log). The table stores the index_id, entity_id, fid, and exception message.
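A table like the one described would be declared via the Schema API in the module's .install file. A hedged sketch, assuming the columns listed above (the serial `id` key and field lengths are my additions, not taken from the patch):

```php
/**
 * Implements hook_schema().
 */
function search_api_attachments_schema() {
  $schema['search_api_attachments_log'] = [
    'description' => 'Stores files that failed text extraction.',
    'fields' => [
      'id' => ['type' => 'serial', 'not null' => TRUE],
      'index_id' => ['type' => 'varchar', 'length' => 128, 'not null' => TRUE],
      'entity_id' => ['type' => 'varchar', 'length' => 128, 'not null' => TRUE],
      'fid' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
      // The exception message captured during extraction.
      'message' => ['type' => 'text', 'not null' => FALSE],
    ],
    'primary key' => ['id'],
  ];
  return $schema;
}
```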
Comment #4
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Why not just use Drupal's logger service to log this instead of creating a DB table for it? That way it's up to the site builder to determine where that log goes (be it syslog, stdout, etc.).
Comment #5
malik.kotob CreditAttribution: malik.kotob commented
@acbramley I actually took that route initially, but with @janusman's comment in mind from above, I think it would be nice to have a table down the road that we could potentially leverage for tracking files that have failed in the past. Additionally, this would give us the ability to more easily expose the content to Views, allowing non-technical folk to sift through problematic files. Thoughts?
Comment #6
malik.kotob CreditAttribution: malik.kotob commented
The previously attached patch was incorrectly rolled. Attaching a new patch against 8.x-1.x.
Comment #7
malik.kotob CreditAttribution: malik.kotob commented
Comment #8
malik.kotob CreditAttribution: malik.kotob commented
Comment #9
malik.kotob CreditAttribution: malik.kotob commented
Attaching a patch that applies cleanly to the 8.x-1.0-beta2 version.
Comment #10
malik.kotob CreditAttribution: malik.kotob commented
Apologies, fixing incorrect spacing in the patch from #9.
Comment #11
malik.kotob CreditAttribution: malik.kotob commented
Moving the exception handling location. Attaching new patches for 8.x-1.0-beta2 and 8.x-1.x.
Comment #12
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
@malik.kotob re #5: understood. If further work is going to be done to ignore the problematic files, then that makes sense.
Further review:
Not sure what the point of this is?
EDIT: To clarify, the try/catch is what I'm questioning. It's just catching and rethrowing the exception, so there's no point in having the try/catch in the first place?
Comment #13
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Fixed #12 and also fixed the update hook, which was failing.
Comment #14
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
This change causes fatals in the ExtractedText formatter, since the $entity and $item parameters have not been added.
That formatter doesn't have access to the Search API item object, so we may have to remove that from the logging, since that is the only reason it is being added here.
Comment #15
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Here's a new patch that uses a simple logger approach.
Comment #16
malik.kotob CreditAttribution: malik.kotob commented
@acbramley, ah, I didn't realize that method was called elsewhere. Yeah, I think we could live without the index information, especially since these errors are possible outside the context of indexing (I didn't realize that until you pointed it out). The patch from #13 is missing the install file portion. What was the bug in the update hook? I'll post another patch shortly with your findings from there as well as from #14. Thanks for your help on this!
Comment #17
malik.kotob CreditAttribution: malik.kotob commented
New patches that address @acbramley's feedback in #12, #13, and #14 (pointless try/catch, failing update hook, fatals in the ExtractedText formatter). I've already experienced a benefit of having a table of failing fids: being able to query it and join on the file_managed table. Having all failing files in a single location has improved debugging and spotting patterns in files that fail for various reasons.
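The kind of query described could be sketched with Drupal's database API as follows (a hypothetical illustration; the log table's column names are assumed from the patch description in #3):

```php
// Join the failure log against file_managed to inspect which files
// fail extraction and why.
$query = \Drupal::database()->select('search_api_attachments_log', 'l');
$query->join('file_managed', 'f', 'f.fid = l.fid');
$query->fields('f', ['fid', 'filename', 'filemime']);
$query->fields('l', ['message']);
$failures = $query->execute()->fetchAll();
```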
Comment #18
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
@malik.kotob no worries! #17 is looking much better. I do, however, think we need some form of tests for this.
Comment #19
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Removing the test log from #15.
Comment #20
malik.kotob CreditAttribution: malik.kotob commented
Sorry @acbramley, I've been sidetracked and haven't had a chance to get back to this. For the tests, do you have anything in mind? Do you feel we need tests that run an extraction of a corrupt file against an actual Solr instance, or would unit tests be sufficient? We're not really doing anything too crazy in #17, so maybe some unit tests that verify we write to the table when an exception is thrown are sufficient?
Comment #21
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Reroll of #19 against HEAD with a couple of minor fixes:
1) $extracted_data was undefined when an exception was thrown, leading to warnings when setting the keyValue store.
2) extractFileContents wasn't returning NULL when isFileIndexable returned FALSE.
It'd be really great to get this committed as I imagine most sites using this module would want their indexing to continue to work when a file can't be extracted.
Comment #22
fenstrat commented
This does exactly what it says on the box, so marking as RTBC.
Again, it'd be great to expand this module's tests, but that's probably not a task for this issue.
Comment #24
izus CreditAttribution: izus commented
Thanks all, this is now merged.