Description of the current apachesolr.index.inc:
Document processing relies on the document callback called during apachesolr_index_entity_to_documents().
This callback expects a string to be generated from the document.
Then a complex array_merge() happens, using dynamic callbacks that depend on the current $entity.
Then some hooks are invoked, a teaser is generated and tags are added.
I wasn't able to follow the process to the end, but I suspect that the actual indexing happens through SolrBaseQuery().
This process isn't flexible enough and is suboptimal in the case of Tika: the whole (even huge) document must be loaded into memory and posted to the extracting handler (using extractOnly), then the extracted values are processed several times and passed to several functions (like array_merge()) until they get POSTed to Solr for indexing.
The API should rather allow a callback to freely use the ExtractingRequest handler, bound directly to Solr indexing (no "extractOnly" flag).
Teaser, tags & co. may (or may not) be generated after indexing has happened, but more importantly you can post documents using exec(curl), which is the most efficient solution.
Of course it requires more Solr configuration, but that costs less than increasing server memory.
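To make the memory argument concrete, here is a minimal sketch (in Python rather than PHP, purely for illustration) of posting a file to the extracting handler in fixed-size chunks, so the whole document never sits in memory at once. The function names, the chunk size and the example URL are assumptions made for this sketch and are not taken from the patch or from the apachesolr module.

```python
import io
import urllib.request


def iter_chunks(stream, chunk_size=64 * 1024):
    """Yield the stream's content in pieces of at most chunk_size bytes."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def build_extract_request(url, stream, size, content_type):
    """Prepare (but do not send) a POST whose body is read lazily.

    `url` is assumed to point at a Solr extracting handler, for example
    http://localhost:8983/solr/update/extract?literal.id=... (hypothetical).
    """
    headers = {"Content-Type": content_type, "Content-Length": str(size)}
    return urllib.request.Request(
        url, data=iter_chunks(stream), headers=headers, method="POST"
    )


# Small in-memory demonstration of the chunking; no network access involved.
demo_chunks = list(iter_chunks(io.BytesIO(b"abcdefg"), chunk_size=3))
```

Passing the prepared request to `urllib.request.urlopen()` would send the body piece by piece. Peak memory then depends on the chunk size, not on the file size.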
Somewhat related: #1312858: Is there a roadmap? and probably many others.
Comment | File | Size | Author |
---|---|---|---|
#6 | 0001-issue-1773454-support-sending-file-entities-to-Apach.patch | 2.85 KB | drzraf |
#1 | 1773454-tika-apachesolr_file.patch | 611 bytes | drzraf |
Comments
Comment #1
drzraf commented:
Stub, but worth a look.
About the missing teaser, there are plenty of ways to grab it in an elegant manner.
Comment #3
pwolanin commented:
This should not be included in this module, and certainly not using the CLI curl command. There is a separate apachesolr_attachments module that does the loop for file attachments.
The logic is that it should work the same whether you are using Tika locally or Solr remotely. If that's not working for you, please look at writing an alternative attachments module. The problem is that the indexing needs to have a bunch of new logic in order to correctly POST all the needed Drupal fields, like the document ID, etc.
Please re-open if you have suggestions for non-breaking API changes that would add flexibility to the indexing logic, but I suspect you should be able to accomplish this already.
Comment #4
drzraf commented:
Right, but the point of the patch is to show how Tika must be requested in order to progress toward a more flexible apachesolr API.
Oh? Then how would you index a 30MB file without eating >30MB of PHP memory?
The only other way I can think of (the first one I tried, actually) suffers from an annoying limitation:
As of writing, the php-curl approach is limited by a PHP bug which prevents sending multivalued fields using multipart/form-data (https://bugs.php.net/bug.php?id=51634).
That's why the patch prefixes field names with "literal.", but that may (or may not) be out of the scope of the apachesolr module itself.
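As an illustration of the "literal." prefix, the sketch below (Python, for illustration only) turns a field map into repeated `literal.<field>` query parameters, which is one way to carry multivalued fields without relying on multipart/form-data. The helper names and the field names are hypothetical, and this is not necessarily what the patch does.

```python
from urllib.parse import urlencode


def literal_params(fields):
    """Flatten {name: value or list of values} into (literal.name, value) pairs."""
    pairs = []
    for name, value in fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            # A multivalued field simply repeats the same parameter name.
            pairs.append(("literal." + name, str(item)))
    return pairs


def literal_query(fields):
    """Encode the pairs as a URL query string."""
    return urlencode(literal_params(fields))


# Hypothetical field names, chosen only for the example.
query = literal_query({"id": "file/42", "tags": ["a", "b"]})
```

The resulting string can be appended to the extracting handler URL, leaving the request body free for the raw file content.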
The case of the teaser is anecdotal.
Adding functions can keep an API backward-compatible: there's no terrible breakage ahead, if done right.
I do agree that apachesolr_index_send_to_tika is certainly out of scope here... but as a maintainer of the apachesolr module you're the most knowledgeable person to give advice about the changes (and hooks) which may be added to apachesolr_index_send_to_solr.
At the very least, on the apachesolr side, it needs work.
At the apachesolr_file, apachesolr_media, apachesolr_attachment level it may be considered a feature request.
Comment #5
drzraf commented:
ping?
Comment #6
drzraf commented:
The apachesolr module part:
revamped / enhanced (replaces 1773454-tika-apachesolr.patch)
0001-issue-1773454-support-sending-file-entities-to-Apach.patch
Comment #7
drzraf commented:
One may change one's mind in a 2-year timespan.