hi - I'm testing out the AmazonS3 module and it's working well - all my attachments are saved to an S3 bucket... however none of that gets indexed by Solr. Any ideas on why not (I'm guessing there are plenty of reasons) and any suggestions for workarounds?

EDIT: See also #1430710: Integrate with Remote Stream Wrapper module

Comments

alibama created an issue. See original summary.

escuriola’s picture

Hello

We had to modify some functions because drupal_realpath(), file_exists(), etc. do not work correctly with non-local/mounted files, and these calls appear in several places in the module's files.
- First of all, we check whether the stream is non-local.
- Then we modify apachesolr_attachments_is_file(), because of drupal_realpath(), to call a custom function that downloads the S3 file and checks it in a temporary folder.
- Finally, in apachesolr_attachments_get_attachment_text() we send the downloaded local path instead of the URI.

I will prepare an improved patch with this. For now we are working with a similar solution that sends a rewritten URL instead of the URI, and it works nicely.
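Roughly, the idea could be sketched like this (a sketch only - the helper names and the list of remote schemes are illustrative, not the actual patch):

```php
// Step 1: decide whether a stream wrapper URI points at a non-local backend.
// The scheme list here is an assumption about the setup.
function example_uri_is_remote($uri) {
  $remote_schemes = array('s3', 's3fs');
  return in_array(parse_url($uri, PHP_URL_SCHEME), $remote_schemes, TRUE);
}

// Steps 2-3: copy the remote file into a temporary folder and return the
// local path, which can then be handed to the text extractor instead of
// the original URI.
function example_localize($uri, $temp_dir) {
  if (!example_uri_is_remote($uri)) {
    // Already local; drupal_realpath() can handle it as-is.
    return $uri;
  }
  $local = rtrim($temp_dir, '/') . '/' . basename($uri);
  return copy($uri, $local) ? $local : FALSE;
}
```

With a registered s3:// stream wrapper, copy() can read the remote file directly; without one, you would fetch the public URL instead.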

Regards.

alibama’s picture

way cool - look forward to testing it.... I'm also going to test elasticsearch and see if I can work through there... glad to test patches

digitalfrontiersmedia’s picture

escuriola & alibama,

Did you ever develop patches for this that you could share? I'm looking at a similar setup. I may have a solution, but I'm looking for alternate options, so any solutions or guidance would be greatly appreciated.

Thanks,
Stephen

escuriola’s picture

Hello.

I reviewed the changes and this is the only change I think affects AmazonS3:

diff --git a/apachesolr_attachments.index.inc b/apachesolr_attachments.index.inc
index 9e9517c..0496a98 100644
--- a/apachesolr_attachments.index.inc
+++ b/apachesolr_attachments.index.inc
@@ -61,8 +61,10 @@ function apachesolr_attachments_get_attachment_text($file) {
       array('@filesize' => $file->filesize, '@filename' => $file->filename, '@sizelimit' => $filesize_limit));
     return FALSE;
   }
-
   $filepath = drupal_realpath($file->uri);
+  if (is_amazons3($file->uri)) {
+    $filepath = i_get_s3_internal_path($file->uri);
+  }

With this, $filepath comes directly from the URI, in a form the S3 wrappers can understand and locate.

Regards!

digitalfrontiersmedia’s picture

Wow! Sweet little patch. I'll check it out! Many thanks!

krrishnajee’s picture

Is there a good solution for this issue? I tried the above patch and it doesn't work.

krrishnajee’s picture

My friend Ian came up with a trick and we tried it; finally it works. We are not sure this is the correct approach, and I know changing a module's source code is not a nice thing, but since we had no other option, we ended up here. What we did: before indexing, download the files from the S3 bucket to our local temp folder, then send the local file path to Tika to read the file and do the indexing. Check this out:

apachesolr_attachments.module

function apachesolr_attachments_is_file($entity) {
  if (!empty($entity->uri)) {

    // Get the path of the default/files folder.
    $files_folder_name = variable_get('file_public_path', conf_path() . '/files');

    // Create the full URL of the file (served from the S3 bucket).
    $file_url = file_create_url($entity->uri);

    // Get the relative path by stripping the 'public://' scheme.
    $file_url_name = str_replace('public://', '', $entity->uri);

    // Complete local file path (this is the temp folder where we store our PDFs).
    $files_full_name = $files_folder_name . "/" . $file_url_name;

    // Download the file from S3 and store it in the local temp folder.
    file_put_contents($files_full_name, fopen($file_url, 'r'));

    // Set the full file path for indexing.
    $filepath = $_SERVER['DOCUMENT_ROOT'] . "/" . $files_full_name;

    // Check that we have a valid filepath.
    if (!$filepath) {
      return FALSE;
    }
    // ... (rest of the function unchanged)

apachesolr_attachments.index.inc

function apachesolr_attachments_get_attachment_text($file) {
  $indexer_table = apachesolr_get_indexer_table('file');
  if (!apachesolr_attachments_is_file($file)) {
    return FALSE;
  }

  // Rebuild the local path of the copy downloaded in apachesolr_attachments_is_file().
  $file_url_name = str_replace('public://', '', $file->uri);
  $filename = variable_get('file_public_path', conf_path() . '/files');
  $filepath = $_SERVER['DOCUMENT_ROOT'] . "/" . $filename . "/" . $file_url_name;

  // Was: $filepath = drupal_realpath($file->uri);
  // No need to use Java for plain text files.
  if ($file->filemime == 'text/plain' || $file->filemime == 'text/x-diff') {
    // ... (rest of the function unchanged)

digitalfrontiersmedia’s picture

We ended up doing something similar. Dave ended up curling the file from S3, but instead of storing it locally he simply piped the response from curl directly into Tika.
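That pipe-straight-into-the-extractor idea might look something like this sketch (a generic helper, not Dave's actual code - the exact Tika invocation in the note below is an assumption about the setup):

```php
// Stream the bytes at $url straight into $command's stdin and return the
// command's stdout, without ever writing a local copy of the file.
function example_pipe_url_to_command($url, $command) {
  $spec = array(0 => array('pipe', 'r'), 1 => array('pipe', 'w'));
  $proc = proc_open($command, $spec, $pipes);
  if (!is_resource($proc)) {
    return FALSE;
  }
  $src = fopen($url, 'r');
  if ($src === FALSE) {
    fclose($pipes[0]);
    fclose($pipes[1]);
    proc_close($proc);
    return FALSE;
  }
  // Copy the response body into the command's stdin, then close it so the
  // command sees EOF and finishes.
  stream_copy_to_stream($src, $pipes[0]);
  fclose($src);
  fclose($pipes[0]);
  // For Tika, stdout is the extracted text.
  $out = stream_get_contents($pipes[1]);
  fclose($pipes[1]);
  proc_close($proc);
  return $out;
}
```

For Tika this would be called with something like `example_pipe_url_to_command($s3_url, 'java -jar tika-app.jar --text')` - check `java -jar tika-app.jar --help` for the stdin handling of your Tika version, as the jar path and flags here are assumptions.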

amontero’s picture

For files in storage backends such as Amazon S3 or Google Object Storage, also check the linked issue #2789569: Allow configuration of file hashing function, which would avoid re-downloading files for updated nodes already in the file extraction cache.

amontero’s picture

Also, @DigitalFrontiersMedia, do you have any patch which you can post to allow others to use and work on it? TIA.

digitalfrontiersmedia’s picture

@amontero, I will ask Dave to post what we ended up with if posting a patch makes sense (i.e. isn't too application specific).

digitalfrontiersmedia’s picture

FYI -- I've asked Dave to submit the work but he's been busy. I may post it on his behalf if he just doesn't have the time to box it up.