I'm running into a problem with indexing PDF files with the version of Tika that comes with Solr 1.4. As it turns out, Tika throws exceptions if the PDF it is trying to index has been password protected. Apparently, Tika tries to pass in an empty password string.
The problem isn't that exceptions are being thrown; the problem is that this prevents any other files from being indexed, which isn't good. There ought to be some way to mark a document as having failed to index, so it can be skipped the next time cron runs. An admin report of the failed files would help administrators identify and fix problem files.
I am also trying to upgrade Tika to 0.7 (which parses the PDFs just fine from the command line), but this may not always be possible for others.
| Comment | File | Size | Author |
|---|---|---|---|
| #10 | 815104-bad-file-handling-10.patch | 1.72 KB | pwolanin |
| #3 | 815104-reindex-failed-files.patch | 13.72 KB | ebeyrent |
Comments
Comment #1
ebeyrent commented:
Here's what I am proposing:
- Add a column to the apachesolr_attachments_files table named "failed" of type tinyint default value 0
- possibly also add columns for attempts (count of attempts to index the file) and last_attempted (timestamp)
- Modify apachesolr_attachments_add_documents() so that failed files are excluded from the query
- Add an admin screen that lists the failed files, with options to re-index each one or many
Any thoughts or suggestions?
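A minimal sketch of the proposed schema change, written as a hypothetical Drupal 6 hook_update_N() (the update number and descriptions are illustrative; the column names are the ones proposed above):

```php
<?php
// Sketch only: a hypothetical update hook adding the proposed columns
// to the apachesolr_attachments_files table. Depends on the Drupal 6 API.
function apachesolr_attachments_update_6101() {
  $ret = array();
  db_add_field($ret, 'apachesolr_attachments_files', 'failed', array(
    'type' => 'int',
    'size' => 'tiny',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Whether the last indexing attempt failed.',
  ));
  // The two optional columns from the proposal:
  db_add_field($ret, 'apachesolr_attachments_files', 'attempts', array(
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Number of indexing attempts for this file.',
  ));
  db_add_field($ret, 'apachesolr_attachments_files', 'last_attempted', array(
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
    'description' => 'Timestamp of the last indexing attempt.',
  ));
  return $ret;
}
```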
Comment #2
ebeyrent commented:
Comment #3
ebeyrent commented:
Here's a patch that adds the functionality described in #1.
The admin screen that lists the failed files can be accessed at admin/reports/apachesolr/attachments.
Comment #4
pwolanin commented:
I think a document with empty text in the cache table is skipped?
Comment #5
ebeyrent commented:
Not based on what I was seeing. My understanding is that the workflow is as follows:
- Get all the nodes to index
$rows = apachesolr_get_nodes_to_index('apachesolr_attachments', $cron_try);
- Index the nodes
$success = apachesolr_index_nodes($rows, 'apachesolr_attachments');
This calls the defined callbacks, one of which is apachesolr_attachments_add_documents(). That function gets all the files for each node and attempts to extract the text from each file, but it never checks whether the extracted text is empty. For each file, it calls:
apachesolr_attachments_get_attachment_text($file)
This function does the following: because there is no cached body, it attempts to extract the text from the file. If the file contains something Tika doesn't like, an exception is thrown and the row never gets updated. As a result, the same file gets parsed on each and every cron run, and fails every time.
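The failure mode described above can be illustrated with a sketch (not the module's actual code; the helper names are hypothetical, and the snippet depends on the Drupal 6 API):

```php
<?php
// Illustrative sketch of the retry loop described above. If extraction
// throws, the cache row is never written, so the same file is re-parsed
// on every subsequent cron run.
function example_get_attachment_text($file) {
  // Look for previously cached text (hypothetical column/table usage).
  $cached = db_result(db_query("SELECT body FROM {apachesolr_attachments_files} WHERE fid = %d", $file->fid));
  if (!empty($cached)) {
    return $cached;
  }
  // Extraction via Tika; a password-protected PDF makes this fail.
  $text = example_extract_with_tika($file->filepath);
  // This update is only reached on success. On failure the row stays
  // empty, and the file is retried forever.
  db_query("UPDATE {apachesolr_attachments_files} SET body = '%s' WHERE fid = %d", $text, $file->fid);
  return $text;
}
```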
Comment #6
ebeyrent commented:
I would also say that just because a document didn't produce any text doesn't necessarily mean that no error occurred. IMO, there should be a distinction between a document having no text and an exception being thrown. Do you agree? One benefit of the patch I posted is that admins can identify files that had problems, fix the files, and re-index the nodes. This would certainly be helpful if you needed to replace several files and didn't want to delete each old attachment and re-upload a new one.
Comment #7
pwolanin commented:
Ideally we should fix this for 6.x-1.x first, unless you want to help maintain the 2.x branch.
Overall the patch looks reasonable, but I think the logic and schema changes are a bit more complex than they need to be.
Comment #8
ebeyrent commented:
What do you mean by more complex? I was thinking it might be nice to add another column recording details about why a file failed; as an administrator, you currently only know that a file failed, not why.
What would you like to see axed from the patch?
Comment #9
pwolanin commented:
Re: Tika 0.7, see https://issues.apache.org/jira/browse/SOLR-1819 and lobby for an update to 0.7 in Solr 1.4.1.
Comment #10
pwolanin commented:
Here's a truly minimal patch to start: it just catches the exception and logs it, so we are not stuck in an infinite loop.
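The minimal approach described here might look roughly like the following sketch (illustrative only, not the actual #10 patch; the extraction helper name is hypothetical, and the snippet depends on the Drupal 6 API):

```php
<?php
// Sketch: detect a Tika extraction failure, log it with watchdog(),
// and cache an empty body so the file is not retried on every cron run.
$text = example_extract_with_tika($file->filepath);
if ($text === FALSE) {
  watchdog('apachesolr_attachments',
    'Tika failed to extract text from %filepath.',
    array('%filepath' => $file->filepath),
    WATCHDOG_WARNING);
  // An empty body means cron moves on instead of looping on this file.
  $text = '';
}
```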
Comment #11
pwolanin commented:
@ebeyrent - can you detail your steps for putting Tika 0.7 in place for Solr? Let's add that to the README.
Comment #12
pwolanin commented:
According to this issue: https://issues.apache.org/jira/browse/SOLR-1902, Tika 0.7 (and the 0.8 snapshot) is not working for content extraction with Solr at the moment.
Comment #13
pwolanin commented:
I'm going to commit #10 since it prevents the most horrific problem, but let's refine your initial patch, since I agree it would be useful to have more insight into which files are having problems.
Regarding simplifying the schema change: the total number of attempts for a file will in general be 1, right? I think we can simply record the number of successive failures - if that number is > 0, the file is in a failed state. On success, we reset it to 0.
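The simplified bookkeeping suggested here could be sketched as follows (assuming the column names from the #1 proposal; this reuses the proposed "failed" column as the counter, and depends on the Drupal 6 API):

```php
<?php
// Sketch: a single counter of successive failures, reset to 0 on
// success. A value > 0 means the file is in a failed state.
if ($extraction_succeeded) {
  db_query("UPDATE {apachesolr_attachments_files} SET failed = 0 WHERE fid = %d", $file->fid);
}
else {
  db_query("UPDATE {apachesolr_attachments_files} SET failed = failed + 1, last_attempted = %d WHERE fid = %d", time(), $file->fid);
}
// Excluding failed files from indexing then reduces to a WHERE clause
// on the indexing query, e.g. "... WHERE failed = 0".
```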
Comment #14
ebeyrent commented:
I'm fine with that as long as the column name matches the intent - the schema change I proposed already has a "failed" column, which is either 0 or 1. The intent behind the attempts column was to let administrators differentiate between a one-time server failure and something wrong with the file itself. You're right, though - it's probably not necessary, and I'd be fine dropping that field and its supporting code.
Comment #15
pwolanin commented:
Setting to CNW just to track this more accurately.
Comment #16
jpmckinney commented:
This is a feature request at this point.