Problem/Motivation
When attempting to index media entities with file attachments, we are running into several different exceptions thrown by Solarium during extraction. An example of one such exception can be seen here: https://www.drupal.org/node/2859565. Regardless of the exception, it causes indexing to halt outright. If we try to restart the indexing process, it fails on the same item. If we unpublish that item or remove it from the queue of items to be indexed, indexing proceeds until it hits another item that throws an exception, and so on.
Core and module versions:
Drupal - 8.3.2
Search API - 1.0.0
Search API Solr - 1.0.0-beta3
Search API Attachments - 1.0-beta2
Proposed resolution
When an exception is thrown during extraction, the module should catch the exception, log the item that failed, and proceed with indexing the remainder of items in the queue.
Remaining tasks
Provide a patch that carries out the proposed resolution. This involves wrapping the invocation of the extractOrGetFromCache() method in a try/catch block in FilesExtractor.php. The catch should log helpful debugging information for the failed entity and continue to the next file.
Relevant code:
```php
if (!empty($files)) {
  $extraction = '';
  foreach ($files as $file) {
    if ($this->isFileIndexable($file, $item, $field_name)) {
      $extraction .= $this->extractOrGetFromCache($file, $extractor_plugin);
    }
  }
  $field->addValue($extraction);
}
```
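The proposed try/catch could look roughly like the following. This is a minimal sketch, not the committed patch; the injected `$this->logger` channel and the exact log message are assumptions:

```php
if (!empty($files)) {
  $extraction = '';
  foreach ($files as $file) {
    if ($this->isFileIndexable($file, $item, $field_name)) {
      try {
        $extraction .= $this->extractOrGetFromCache($file, $extractor_plugin);
      }
      catch (\Exception $e) {
        // Log the failing file and keep indexing the remaining files
        // instead of letting the exception halt the whole queue.
        $this->logger->error(
          'Extraction failed for file @fid (@uri): @message',
          [
            '@fid' => $file->id(),
            '@uri' => $file->getFileUri(),
            '@message' => $e->getMessage(),
          ]
        );
      }
    }
  }
  $field->addValue($extraction);
}
```

The key point is that the catch block swallows the exception after recording it, so one corrupt file no longer blocks the rest of the queue.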
Comment | File | Size | Author |
---|---|---|---|
#21 | add_exception_handling-2884453-21.patch | 8.36 KB | acbramley |
#17 | add_exception_handling-beta2-2884453-17.patch | 10 KB | malik.kotob |
#17 | add_exception_handling-2884453-17.patch | 10.19 KB | malik.kotob |
Comments
Comment #2
janusman CreditAttribution: janusman at Acquia commented
This may also be related: https://www.drupal.org/node/2343705 for apachesolr_attachments.module. The main idea is to track files that have failed in the past, so we stop trying to extract them each and every time.
Comment #3
malik.kotob CreditAttribution: malik.kotob commented
Attaching a patch that allows indexing to continue when exceptions are thrown during extraction. The added code catches the exception thrown during extraction and logs the failure to a new table (search_api_attachments_log). The table stores the index_id, entity_id, fid, and exception message.
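A table like the one described would be declared via the Schema API in the module's .install file. A hedged sketch, assuming the columns listed above (the serial `id` key and field lengths are my additions, not taken from the patch):

```php
/**
 * Implements hook_schema().
 */
function search_api_attachments_schema() {
  $schema['search_api_attachments_log'] = [
    'description' => 'Stores files that failed text extraction.',
    'fields' => [
      'id' => ['type' => 'serial', 'not null' => TRUE],
      'index_id' => ['type' => 'varchar', 'length' => 128, 'not null' => TRUE],
      'entity_id' => ['type' => 'varchar', 'length' => 128, 'not null' => TRUE],
      'fid' => ['type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE],
      // The exception message captured during extraction.
      'message' => ['type' => 'text', 'not null' => FALSE],
    ],
    'primary key' => ['id'],
  ];
  return $schema;
}
```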
Comment #4
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Why not just use Drupal's logger service to log this instead of creating a DB table for it? That way it's up to the site builder to determine where that log goes (be it syslog, stdout, etc.).
Comment #5
malik.kotob CreditAttribution: malik.kotob commented
@acbramley I actually took that route initially, but with @janusman's comment in mind from above, I think it would be nice to have a table down the road that we could potentially leverage for tracking files that have failed in the past. Additionally, this would give us the ability to more easily expose the content to Views, allowing non-technical folk to sift through problematic files. Thoughts?
Comment #6
malik.kotob CreditAttribution: malik.kotob commented
The previously attached patch was incorrectly rolled. Attaching a new patch against 8.x-1.x.
Comment #7
malik.kotob CreditAttribution: malik.kotob commented
Comment #8
malik.kotob CreditAttribution: malik.kotob commented
Comment #9
malik.kotob CreditAttribution: malik.kotob commented
Attaching a patch that applies cleanly to the 8.x-1.0-beta2 version.
Comment #10
malik.kotob CreditAttribution: malik.kotob commented
Apologies, fixing incorrect spacing in the patch from #9.
Comment #11
malik.kotob CreditAttribution: malik.kotob commented
Moving the exception handling location. Attaching new patches for 8.x-1.0-beta2 and 8.x-1.x.
Comment #12
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
@malik.kotob re #5: understood. If further work is going to be done to ignore the problematic files, then that makes sense.
Further review:
Not sure what the point of this is?
EDIT: To clarify, the try/catch is what I'm questioning. It's just catching and rethrowing the exception, so there's no point in having the try/catch in the first place?
Comment #13
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Fixed #12 and also fixed the update hook, which was failing.
Comment #14
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
This change causes fatals in the ExtractedText formatter, since the $entity and $item parameters have not been added.
That formatter doesn't have access to the Search API item object, so we may have to remove that from the logging, since that is the only reason it is being added here.
Comment #15
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Here's a new patch that uses a simple logger approach.
Comment #16
malik.kotob CreditAttribution: malik.kotob commented
@acbramley, ah, I didn't realize that method was called elsewhere. Yeah, I think we could live without the index information, especially since these errors are possible outside the context of indexing (I didn't realize that until you pointed it out). The patch from #13 is missing the install file portion. What was the bug in the update hook? I'll post another patch shortly with your findings from there as well as from #14. Thanks for your help on this!
Comment #17
malik.kotob CreditAttribution: malik.kotob commented
New patches that address @acbramley's feedback in #12, #13, and #14 (pointless try/catch, failing update hook, fatals in the ExtractedText formatter). I've already experienced a benefit of having a table of failing fids: being able to query it and join on the file_managed table. Having all failing files in a single location has improved debugging and spotting patterns in files that fail for various reasons.
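The kind of query described could be sketched with Drupal's database API as follows (a hypothetical illustration; the log table's column names are assumed from the patch description in #3):

```php
// Join the failure log against file_managed to inspect which files
// fail extraction and why.
$query = \Drupal::database()->select('search_api_attachments_log', 'l');
$query->join('file_managed', 'f', 'f.fid = l.fid');
$query->fields('f', ['fid', 'filename', 'filemime']);
$query->fields('l', ['message']);
$failures = $query->execute()->fetchAll();
```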
Comment #18
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
@malik.kotob no worries! #17 is looking much better. I do, however, think we need some form of tests for this.
Comment #19
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Removing the test log from #15.
Comment #20
malik.kotob CreditAttribution: malik.kotob commented
Sorry @acbramley, I've been sidetracked and haven't had a chance to get back to this. For the tests, do you have anything in mind? Do you feel we need tests that run an extraction of a corrupt file against an actual Solr instance, or would unit tests be sufficient? We're not really doing anything too crazy in #17, so maybe some unit tests that verify we write to the table when an exception is thrown are sufficient?
Comment #21
acbramley CreditAttribution: acbramley at PreviousNext for Transport for NSW commented
Reroll of #19 against HEAD with a couple of minor fixes:
1) $extracted_data was undefined when an exception was thrown, leading to warnings when setting the keyValue store.
2) extractFileContents wasn't returning NULL when isFileIndexable returned FALSE.
It'd be really great to get this committed as I imagine most sites using this module would want their indexing to continue to work when a file can't be extracted.
Comment #22
fenstrat commented
This does exactly what it says on the box, so marking as RTBC.
Again, it'd be great to expand this module's tests, but that's probably not a task for this issue.
Comment #24
izus CreditAttribution: izus commented
Thanks all, this is now merged.