Description of the current apachesolr.index.inc:

Document processing relies on the document callback invoked during apachesolr_index_entity_to_documents().
This callback expects a string to be generated from the document.
Then some complex array_merge() happens, using dynamic callbacks that depend on the current $entity.
Then some hooks are invoked, a teaser is generated and tags are added.
I wasn't able to follow the process to the end, but I suspect that the indexing itself happens through SolrBaseQuery().

This process isn't flexible enough and is suboptimal in the case of Tika: the whole document (even a huge one) must be loaded into memory and posted to the extracting handler (using extractOnly), then the extracted values are processed several times and passed through several functions (like array_merge()) before finally being POSTed to Solr for indexing.

The API should rather allow a callback to freely use the ExtractingRequestHandler directly, bound to Solr indexing (no "extractOnly" flag).
Teaser, tags & co may (or may not) be generated after indexing has happened, but more importantly you can post documents using exec(curl), which is the most efficient solution.
Of course it requires more Solr configuration, but that costs less than increasing server memory.
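
As a rough illustration of the direct approach described above, here is a minimal shell sketch. It assumes the extracting handler is mounted at /update/extract (the path used in Solr's example configuration) and that the schema's unique key is "id". SOLR_URL, the CURL override and the function name solr_index_file are hypothetical and not part of any module.

```shell
#!/bin/sh
# Hypothetical base URL of the Solr core; adjust to your setup.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Stream one file to the extracting handler (no extractOnly), so that Solr
# extracts and indexes in a single request. curl reads the file from disk,
# which keeps it out of PHP memory when this is called through exec().
# CURL is a hypothetical override, handy for a dry run.
solr_index_file() {
  doc_id=$1
  file=$2
  ${CURL:-curl} --silent --show-error --fail \
    --form-string "literal.id=${doc_id}" \
    -F "file=@${file}" \
    "${SOLR_URL}/update/extract"
}
```

A call would look like: solr_index_file "abc/file/7" /path/to/big.pdf. Passing the ID with --form-string rather than in the query string avoids URL-encoding it; a commit still has to be triggered separately or through Solr's autocommit settings.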

Somehow related: #1312858: Is there a roadmap? and probably many others.


Comments

drzraf’s picture

Status: Active » Needs review
FileSize
611 bytes
2.58 KB

A stub, but worth a look.
About the missing teaser: there are plenty of ways to grab it in an elegant manner.

Status: Needs review » Needs work

The last submitted patch, 1773454-tika-apachesolr_file.patch, failed testing.

pwolanin’s picture

Status: Needs work » Closed (works as designed)

This should not be included in this module, and certainly not using the CLI curl command - there is a separate apachesolr_attachments module that does the loop for file attachments.

The logic is that it should work the same whether you are using Tika locally or Solr remotely. If that's not working for you, please look at writing an alternative attachments module - the problem is that the indexing needs to have a bunch of new logic in order to correctly POST all the needed Drupal fields like the document ID, etc.

Please re-open if you have suggestions for non-breaking API changes that would add flexibility to the indexing logic, but I suspect you should be able to accomplish this already.

drzraf’s picture

> This should not be included in this module

Right, but the point of the patch is to show how Tika must be requested, in order to progress towards a more flexible apachesolr API.

> and certainly not using the CLI curl command

Oh? Then how would you index a 30MB file without eating more than 30MB of PHP memory?
The only other way I can think of (the first one I tried, actually) suffers from an annoying limitation:
as of writing, the php-curl approach is limited by a PHP bug which prevents sending multivalued fields using multipart/form-data (https://bugs.php.net/bug.php?id=51634).
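
For comparison, the curl CLI can repeat the same form field, which is how a multivalued field would be sent in one multipart request. The sketch below is only an illustration under assumptions: the handler path /update/extract, the "id" unique key, the SOLR_URL and CURL variables and the function name are hypothetical, and the target field must be declared multivalued in the Solr schema.

```shell
#!/bin/sh
# Hypothetical base URL of the Solr core; adjust to your setup.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Usage: solr_index_file_multi DOC_ID FILE FIELD VALUE...
# Sends FILE to the extracting handler, with one literal.FIELD form field
# per VALUE (the same field name is simply repeated).
solr_index_file_multi() {
  doc_id=$1
  file=$2
  field=$3
  shift 3
  # Rewrite each remaining argument (a value) into a
  # "--form-string literal.FIELD=VALUE" pair, preserving spaces.
  remaining=$#
  while [ "$remaining" -gt 0 ]; do
    set -- "$@" --form-string "literal.${field}=$1"
    shift
    remaining=$((remaining - 1))
  done
  # CURL is a hypothetical override, handy for a dry run.
  ${CURL:-curl} --silent --show-error --fail \
    --form-string "literal.id=${doc_id}" \
    "$@" \
    -F "file=@${file}" \
    "${SOLR_URL}/update/extract"
}
```

For example, solr_index_file_multi "abc/file/7" /path/to/big.pdf sm_tags "red" "dark blue" would send two literal.sm_tags fields; sm_tags is a made-up field name used only for illustration.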

> needs to have a bunch of new logic in order to correctly POST all the needed Drupal fields like the document ID, etc.

That's why the patch prefixes field names with "literal.", but it may (or may not) be out of the scope of the apachesolr module itself.
The case of the teaser is anecdotal.

> re-open if you have suggestions for non-breaking API changes

Adding functions can keep an API backward-compatible: there's no terrible breakage ahead, if done right.

I do agree that apachesolr_index_send_to_tika is certainly out of scope here... but as a maintainer of the apachesolr module you're the most knowledgeable person to give advice about the changes (and hooks) which may be added to apachesolr_index_send_to_solr.
At the very least, on the apachesolr side, it needs work.

At the apachesolr_file, apachesolr_media, apachesolr_attachment level it may be considered a feature request.

drzraf’s picture

ping ?

drzraf’s picture

FileSize
2.85 KB

The apachesolr module part:
revamped / enhanced (replaces 1773454-tika-apachesolr.patch)

0001-issue-1773454-support-sending-file-entities-to-Apach.patch

drzraf’s picture

Status: Closed (works as designed) » Needs review

One may change his mind in a two-year timespan.