I'm running into a problem with indexing PDF files with the version of Tika that comes with Solr 1.4. As it turns out, Tika throws exceptions if the PDF it is trying to index has been password protected. Apparently, Tika tries to pass in an empty password string.
The problem isn't that exceptions are being thrown; the problem is that this prevents any other files from being indexed, which isn't good. There ought to be some way to mark a document as having failed to index, so it can be skipped the next time cron runs. An admin report of the failed files would help administrators identify and fix problem files.
I am also trying to upgrade Tika to 0.7 (which parses the PDFs just fine from the command line), but this may not always be possible for others.
| Comment | File | Size | Author |
|---|---|---|---|
| #10 | 815104-bad-file-handling-10.patch | 1.72 KB | pwolanin |
| #3 | 815104-reindex-failed-files.patch | 13.72 KB | ebeyrent |
Comments
Comment #1
ebeyrent commented:
Here's what I am proposing:
- Add a column to the apachesolr_attachments_files table named "failed" of type tinyint default value 0
- possibly also add columns for attempts (count of attempts to index the file) and last_attempted (timestamp)
- Modify apachesolr_attachments_add_documents() so that failed files are excluded from the query
- Add an admin screen that lists the failed files, with options to re-index each one or many
Any thoughts or suggestions?
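A minimal sketch of the proposed schema change, written as a hypothetical Drupal 6 hook_update_N() (the update number and descriptions are illustrative; the column names are the ones proposed above):

```php
<?php
// Sketch only: a hypothetical update hook adding the proposed columns
// to the apachesolr_attachments_files table. Depends on the Drupal 6 API.
function apachesolr_attachments_update_6101() {
  $ret = array();
  db_add_field($ret, 'apachesolr_attachments_files', 'failed', array(
    'type' => 'int',
    'size' => 'tiny',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Whether the last indexing attempt failed.',
  ));
  // The two optional columns from the proposal:
  db_add_field($ret, 'apachesolr_attachments_files', 'attempts', array(
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Number of indexing attempts for this file.',
  ));
  db_add_field($ret, 'apachesolr_attachments_files', 'last_attempted', array(
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Timestamp of the last indexing attempt.',
  ));
  return $ret;
}
```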
Comment #2
ebeyrent commented:
Comment #3
ebeyrent commented:
Here's a patch that adds the functionality described in #1.
The admin screen that lists the failed files can be accessed at admin/reports/apachesolr/attachments.
Comment #4
pwolanin commented:
I think a document with empty text in the cache table is skipped?
Comment #5
ebeyrent commented:
Not based on what I was seeing. My understanding is that the workflow is as follows:
- Get all the nodes to index
$rows = apachesolr_get_nodes_to_index('apachesolr_attachments', $cron_try);
- Index the nodes
$success = apachesolr_index_nodes($rows, 'apachesolr_attachments');
This calls the defined callbacks, one of which is apachesolr_attachments_add_documents(). That function gets all the files for each node and attempts to extract the text from each file, but it never checks whether the extracted text is empty. For each file, it calls:
apachesolr_attachments_get_attachment_text($file)
This function does the following: because there is no cached body, it attempts to extract the text from the file. If the file contains something Tika doesn't like, an exception is thrown and the row never gets updated. As a result, the same file gets parsed on each and every cron run, and fails every time.
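The failure mode described above can be illustrated with a sketch (not the module's actual code; the helper names are hypothetical, and the snippet depends on the Drupal 6 API):

```php
<?php
// Illustrative sketch of the retry loop described above. If extraction
// throws, the cache row is never written, so the same file is re-parsed
// on every subsequent cron run.
function example_get_attachment_text($file) {
  // Look for previously cached text (hypothetical column/table usage).
  $cached = db_result(db_query("SELECT body FROM {apachesolr_attachments_files} WHERE fid = %d", $file->fid));
  if (!empty($cached)) {
    return $cached;
  }
  // Extraction via Tika; a password-protected PDF makes this fail.
  $text = example_extract_with_tika($file->filepath);
  // This update is only reached on success. On failure the row stays
  // empty, and the file is retried forever.
  db_query("UPDATE {apachesolr_attachments_files} SET body = '%s' WHERE fid = %d", $text, $file->fid);
  return $text;
}
```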
Comment #6
ebeyrent commented:
I would also say that just because a document didn't produce any text doesn't necessarily mean that no error occurred. IMO, there should be a distinction between a document having no text and an exception being thrown. Do you agree? One benefit of the patch I posted is that admins can identify files that had problems, fix the files, and re-index the nodes. This would certainly be helpful if you needed to replace several files and didn't want to delete each old attachment and re-upload a new one.
Comment #7
pwolanin commented:
Ideally we should fix this for 6.x-1.x first, unless you want to help maintain the 2.x branch.
Overall the patch looks reasonable, but I think the logic and schema changes are a bit more complex than they need to be.
Comment #8
ebeyrent commented:
What do you mean by more complex? I was thinking it might be nice to add another column recording details about why a file failed; as an administrator, you currently only know that a file failed, not why.
What would you like to see axed from the patch?
Comment #9
pwolanin commented:
Re: Tika 0.7, see https://issues.apache.org/jira/browse/SOLR-1819 and lobby for an update to 0.7 in Solr 1.4.1.
Comment #10
pwolanin commented:
Here's a truly minimal patch to start: it just catches the exception and logs it, so we are not stuck in an infinite loop.
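The minimal approach described here might look roughly like the following sketch (illustrative only, not the actual #10 patch; the extraction helper name is hypothetical, and the snippet depends on the Drupal 6 API):

```php
<?php
// Sketch: detect a Tika extraction failure, log it with watchdog(),
// and cache an empty body so the file is not retried on every cron run.
$text = example_extract_with_tika($file->filepath);
if ($text === FALSE) {
  watchdog('apachesolr_attachments',
    'Tika failed to extract text from %filepath.',
    array('%filepath' => $file->filepath),
    WATCHDOG_WARNING);
  // An empty body means cron moves on instead of looping on this file.
  $text = '';
}
```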
Comment #11
pwolanin commented:
@ebeyrent - can you detail your steps for putting Tika 0.7 in place for Solr? Let's add that to the README.
Comment #12
pwolanin commented:
According to this issue: https://issues.apache.org/jira/browse/SOLR-1902, Tika 0.7 (and the 0.8 snapshot) is not working for content extraction with Solr at the moment.
Comment #13
pwolanin commented:
I'm going to commit #10 since it prevents the most horrific problem, but let's refine your initial patch, since I agree it would be useful to have more insight into which files are having problems.
Regarding simplifying the schema change: the total number of attempts for a file will in general be 1, right? I think we can simply record the number of successive failures - if that number is > 0, the file is in a failed state. On success, we reset it to 0.
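The simplified bookkeeping suggested here could be sketched as follows (assuming the column names from the #1 proposal; this reuses the proposed "failed" column as the counter, and depends on the Drupal 6 API):

```php
<?php
// Sketch: a single counter of successive failures, reset to 0 on
// success. A value > 0 means the file is in a failed state.
if ($extraction_succeeded) {
  db_query("UPDATE {apachesolr_attachments_files} SET failed = 0 WHERE fid = %d", $file->fid);
}
else {
  db_query("UPDATE {apachesolr_attachments_files} SET failed = failed + 1, last_attempted = %d WHERE fid = %d", time(), $file->fid);
}
// Excluding failed files from indexing then reduces to a WHERE clause
// on the indexing query, e.g. "... WHERE failed = 0".
```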
Comment #14
ebeyrent commented:
I'm fine with that as long as the column name matches the intent - the schema change I proposed already has a "failed" column, which is either 0 or 1. The intent behind the attempts column was to let administrators differentiate between a one-time server failure and something wrong with the file itself. You're right, though - it's probably not necessary, and I'd be fine dropping that field and its supporting code.
Comment #15
pwolanin commented:
Setting to CNW just to track this more accurately.
Comment #16
jpmckinney commented:
This is a feature request at this point.