API does not allow proper use of the ExtractingRequestHandler [#1773454]

Description of the current apachesolr.index.inc:

Document processing relies on the document callback invoked during apachesolr_index_entity_to_documents().
This callback expects a string to be generated from the document.
Then some complex array_merge() calls happen, using dynamic callbacks that depend on the current $entity.
Then some hooks are invoked, a teaser is generated and tags are added.
I wasn't able to follow the process to the end, but I suspect that indexing really happens through SolrBaseQuery().

This process isn't flexible enough and is suboptimal for Tika: the whole document, however large, must be loaded into memory and posted to the extracting handler (using extractOnly); the extracted values are then processed several times and passed through several functions (like array_merge()) before finally being POSTed to Solr for indexing.

The API should instead allow a callback to use the ExtractingRequestHandler directly, bound to Solr indexing (no "extractOnly" flag).
Teaser, tags & co. may (or may not) be generated after indexing has happened, but more importantly you can post documents using exec(curl), which is the most efficient solution.
Of course it requires more Solr configuration, but that costs less than increasing server memory.
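
To make the proposal concrete, here is a minimal sketch of a direct post to the extracting handler without "extractOnly". It is written in Python rather than the module's PHP, purely as a language-neutral illustration of the HTTP request. It assumes Solr's standard /update/extract endpoint and its literal.<field> parameters; the base URL, the field names and both function names are hypothetical and are not part of apachesolr.

```python
import os
import urllib.parse
import urllib.request


def build_extract_url(base_url, literals, commit=False):
    """Return an /update/extract URL carrying each field as literal.<name>.

    `literals` maps a Solr field name to a single value or a list of values.
    A list becomes repeated parameters, so multivalued fields need no
    multipart/form-data encoding at all.
    """
    params = []
    for name, values in literals.items():
        if not isinstance(values, (list, tuple)):
            values = [values]
        for value in values:
            params.append(("literal." + name, value))
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urllib.parse.urlencode(params)


def post_file(url, path, content_type="application/octet-stream"):
    """POST the file at `path` as the raw request body and return the HTTP status.

    The open file object is handed to urllib, which reads and sends it in
    small blocks, so the whole file is never held in memory at once.
    """
    with open(path, "rb") as body:
        request = urllib.request.Request(
            url,
            data=body,
            method="POST",
            headers={
                "Content-Type": content_type,
                # An explicit length lets the client stream a plain body.
                "Content-Length": str(os.path.getsize(path)),
            },
        )
        with urllib.request.urlopen(request) as response:
            return response.status
```

A call such as post_file(build_extract_url("http://localhost:8983/solr", {"id": "node/42"}), "/tmp/report.pdf") would then send the file in one request. Passing the Drupal fields in the query string is only one possible design; it sidesteps multipart encoding, at the cost of keeping all literal values small enough for a URL.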

Somewhat related: #1312858: Is there a roadmap? and probably many others.

Comment	File	Size	Author
#6	0001-issue-1773454-support-sending-file-entities-to-Apach.patch	2.85 KB	drzraf
#1	1773454-tika-apachesolr.patch	2.58 KB	drzraf
#1	1773454-tika-apachesolr_file.patch	611 bytes	drzraf

Comments

Comment #1

drzraf commented 4 September 2012 at 12:23

Status: Active » Needs review

File	Size
1773454-tika-apachesolr_file.patch	611 bytes
1773454-tika-apachesolr.patch	2.58 KB

This is a stub, but worth a look.
About the missing teaser: there are plenty of ways to grab it in an elegant manner.

Comment #2

4 September 2012 at 12:26

Status: Needs review » Needs work

The last submitted patch, 1773454-tika-apachesolr_file.patch, failed testing.

Comment #3

pwolanin commented 10 September 2012 at 14:35

Status: Needs work » Closed (works as designed)

This should not be included in this module, and certainly not using the CLI curl command - there is a separate apachesolr_attachments module that does the loop for file attachments.

The logic is that it should work the same whether you are using Tika locally or Solr remotely. If that's not working for you, please look at writing an alternative attachments module. The problem is that the indexing needs to have a bunch of new logic in order to correctly POST all the needed Drupal fields, like the document ID, etc.

Please re-open if you have suggestions for non-breaking API changes that would add flexibility to the indexing logic, but I suspect you should be able to accomplish this already.

Comment #4

drzraf commented 10 September 2012 at 15:40

"This should not be included in this module"

Right, but the point of the patch is to show how Tika must be requested in order to progress toward a more flexible apachesolr API.

"and certainly not using the CLI curl command"

Oh? Then how would you index a 30MB file without eating more than 30MB of PHP memory?
The only other way I can think of (the first one I tried, actually) suffers from an annoying limitation:
as of writing, the php-curl approach is limited by a PHP bug which prevents sending multivalued fields using multipart/form-data (https://bugs.php.net/bug.php?id=51634).

"needs to have a bunch of new logic in order to correctly POST all the needed Drupal fields like the document ID, etc."

That's why the patch prefixes field names with "literal.", but it may (or may not) be out of the scope of the apachesolr module itself.
The case of the teaser is anecdotal.

"re-open if you have suggestions for non-breaking API changes"

Adding functions can keep an API backward-compatible: there's no terrible breakage ahead if it is done right.

I do agree that apachesolr_index_send_to_tika is certainly out of scope here, but as a maintainer of the apachesolr module you're the most knowledgeable person to give advice about the changes (and hooks) which may be added to apachesolr_index_send_to_solr.
At the very least, on the apachesolr side, it needs work.

At the apachesolr_file, apachesolr_media, apachesolr_attachment level it may be considered a feature request.