hi - I'm testing out the AmazonS3 module and it's working well - all my attachments are saved to an S3 bucket... however none of that gets indexed by Solr. Any ideas on why not (I'm guessing there are plenty of reasons) and any suggestions for workarounds?

EDIT: See also #1430710: Integrate with Remote Stream Wrapper module

Comments

alibama created an issue. See original summary.

escuriola’s picture

Hello

We had to modify some functions because drupal_realpath(), file_exists(), etc. do not work correctly with non-local/mounted files, and these calls appear in several places in the module's files.
- First of all, we check whether the stream is non-local.
- Then we modify apachesolr_attachments_is_file(), because of drupal_realpath(), to call a custom function that downloads the S3 file and checks it in a temporary folder.
- Finally, in apachesolr_attachments_get_attachment_text() we send the downloaded local path instead of the URI.

I will prepare an improved patch with this. For now we are working with a similar solution that sends a rewritten URL instead of the URI, and it works nicely.
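Roughly, the idea could be sketched like this (a sketch only - the helper names and the list of remote schemes are illustrative, not the actual patch):

```php
// Step 1: decide whether a stream wrapper URI points at a non-local backend.
// The scheme list here is an assumption about the setup.
function example_uri_is_remote($uri) {
  $remote_schemes = array('s3', 's3fs');
  return in_array(parse_url($uri, PHP_URL_SCHEME), $remote_schemes, TRUE);
}

// Steps 2-3: copy the remote file into a temporary folder and return the
// local path, which can then be handed to the text extractor instead of
// the original URI.
function example_localize($uri, $temp_dir) {
  if (!example_uri_is_remote($uri)) {
    // Already local; drupal_realpath() can handle it as-is.
    return $uri;
  }
  $local = rtrim($temp_dir, '/') . '/' . basename($uri);
  return copy($uri, $local) ? $local : FALSE;
}
```

With a registered s3:// stream wrapper, copy() can read the remote file directly; without one, you would fetch the public URL instead.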

Regards.

alibama’s picture

way cool - look forward to testing it.... I'm also going to test elasticsearch and see if I can work through there... glad to test patches

digitalfrontiersmedia’s picture

escuriola & alibama,

Did you ever develop patches for this that you could share? I'm looking at a similar setup. I may have a solution, but I'm looking for alternate options, so any solutions or guidance would be greatly appreciated.

Thanks,
Stephen

escuriola’s picture

Hello.

I reviewed the changes and this is the only change I think affects AmazonS3:

diff --git a/apachesolr_attachments.index.inc b/apachesolr_attachments.index.inc
index 9e9517c..0496a98 100644
--- a/apachesolr_attachments.index.inc
+++ b/apachesolr_attachments.index.inc
@@ -61,8 +61,10 @@ function apachesolr_attachments_get_attachment_text($file) {
       array('@filesize' => $file->filesize, '@filename' => $file->filename, '@sizelimit' => $filesize_limit));
     return FALSE;
   }
-
   $filepath = drupal_realpath($file->uri);
+  if (is_amazons3($file->uri)) {
+    $filepath = i_get_s3_internal_path($file->uri);
+  }

With this, $filepath comes directly from the URI, in a form the S3 wrappers can understand and locate.

Regards!

digitalfrontiersmedia’s picture

Wow! Sweet little patch. I'll check it out! Many thanks!

krrishnajee’s picture

Is there a good solution for this issue? I tried the above patch and it doesn't work.

krrishnajee’s picture

My friend Ian came up with a trick and we tried it; finally it works. We are not sure this is the correct approach, and I know changing a module's source code is not a nice thing, but since we had no other option, we ended up here. What we did: before indexing, download the files from the S3 bucket to our local temp folder, then send the local file path to Tika to read the file and do the indexing. Check this out:

apachesolr_attachments.module

function apachesolr_attachments_is_file($entity) {
  if (!empty($entity->uri)) {

    // Get the path of the default/files folder.
    $files_folder_name = variable_get('file_public_path', conf_path() . '/files');

    // Create the full URL of the file (served from the S3 bucket).
    $file_url = file_create_url($entity->uri);

    // Get the relative path by stripping the 'public://' scheme.
    $file_url_name = str_replace('public://', '', $entity->uri);

    // Complete local file path (this is the temp folder where we store our PDFs).
    $files_full_name = $files_folder_name . "/" . $file_url_name;

    // Download the file from S3 and store it in the local temp folder.
    file_put_contents($files_full_name, fopen($file_url, 'r'));

    // Set the full file path for indexing.
    $filepath = $_SERVER['DOCUMENT_ROOT'] . "/" . $files_full_name;

    // Check that we have a valid filepath.
    if (!$filepath) {
      return FALSE;
    }
    // ... (rest of the function unchanged)

apachesolr_attachments.index.inc

function apachesolr_attachments_get_attachment_text($file) {
  $indexer_table = apachesolr_get_indexer_table('file');
  if (!apachesolr_attachments_is_file($file)) {
    return FALSE;
  }

  // Rebuild the local path of the copy downloaded in apachesolr_attachments_is_file().
  $file_url_name = str_replace('public://', '', $file->uri);
  $filename = variable_get('file_public_path', conf_path() . '/files');
  $filepath = $_SERVER['DOCUMENT_ROOT'] . "/" . $filename . "/" . $file_url_name;

  // Was: $filepath = drupal_realpath($file->uri);
  // No need to use Java for plain text files.
  if ($file->filemime == 'text/plain' || $file->filemime == 'text/x-diff') {
    // ... (rest of the function unchanged)

digitalfrontiersmedia’s picture

We ended up doing something similar. Dave ended up curling the file from S3, but instead of storing it locally he simply piped the response from curl directly into Tika.
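That pipe-straight-into-the-extractor idea might look something like this sketch (a generic helper, not Dave's actual code - the exact Tika invocation in the note below is an assumption about the setup):

```php
// Stream the bytes at $url straight into $command's stdin and return the
// command's stdout, without ever writing a local copy of the file.
function example_pipe_url_to_command($url, $command) {
  $spec = array(0 => array('pipe', 'r'), 1 => array('pipe', 'w'));
  $proc = proc_open($command, $spec, $pipes);
  if (!is_resource($proc)) {
    return FALSE;
  }
  $src = fopen($url, 'r');
  if ($src === FALSE) {
    fclose($pipes[0]);
    fclose($pipes[1]);
    proc_close($proc);
    return FALSE;
  }
  // Copy the response body into the command's stdin, then close it so the
  // command sees EOF and finishes.
  stream_copy_to_stream($src, $pipes[0]);
  fclose($src);
  fclose($pipes[0]);
  // For Tika, stdout is the extracted text.
  $out = stream_get_contents($pipes[1]);
  fclose($pipes[1]);
  proc_close($proc);
  return $out;
}
```

For Tika this would be called with something like `example_pipe_url_to_command($s3_url, 'java -jar tika-app.jar --text')` - check `java -jar tika-app.jar --help` for the stdin handling of your Tika version, as the jar path and flags here are assumptions.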

amontero’s picture

For files in storage backends such as Amazon S3 or Google Object Storage, also check the linked issue #2789569: Allow configuration of file hashing function, which would avoid re-downloading files for updated nodes already in the file extraction cache.

amontero’s picture

Also, @DigitalFrontiersMedia, do you have any patch which you can post to allow others to use and work on it? TIA.

digitalfrontiersmedia’s picture

@amontero, I will ask Dave to post what we ended up with if posting a patch makes sense (i.e. isn't too application specific).

digitalfrontiersmedia’s picture

FYI -- I've asked Dave to submit the work but he's been busy. I may post it on his behalf if he just doesn't have the time to box it up.