Active
Project:
Apache Solr Attachments
Version:
7.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
19 Jan 2016 at 16:23 UTC
Updated:
8 Mar 2017 at 19:56 UTC
Comments
Comment #2
escuriola commented
Hello
We have to modify some functions because drupal_realpath, file_exists, etc. do not work correctly with non-local/mounted files, and these calls appear in several places in the module files.
- First of all, we check whether the stream is non-local.
- Then we modify "apachesolr_attachments_is_file" because of "drupal_realpath", and call a custom function that downloads the S3 file and checks it in a temporary folder.
- And in "apachesolr_attachments_get_attachment_text" we send the downloaded local path instead of the URI.
I will prepare an improved patch with this. For now we are working with a similar solution that sends a rewritten URL instead of the URI, and it works nicely.
Regards.
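The three steps above can be sketched in plain PHP. This is only an illustration: every function name below is hypothetical, the list of "local" schemes is an assumption, and the code is not the patch escuriola describes. Inside Drupal 7, file_uri_scheme() and drupal_tempnam() would likely replace the hand-rolled scheme check and tempnam().

```php
<?php
// Sketch only: these helpers are hypothetical and are not part of
// apachesolr_attachments or any S3 module.

/**
 * Returns the scheme of a URI such as "s3://bucket/file.pdf", or FALSE.
 */
function example_uri_scheme($uri) {
  $position = strpos($uri, '://');
  return $position ? substr($uri, 0, $position) : FALSE;
}

/**
 * Returns TRUE when the URI has a scheme that is not in the local list.
 * The default list of local schemes is an assumption for this sketch.
 */
function example_uri_is_remote($uri, array $local_schemes = array('file', 'public', 'private', 'temporary')) {
  $scheme = example_uri_scheme($uri);
  return $scheme !== FALSE && !in_array($scheme, $local_schemes, TRUE);
}

/**
 * Returns a path the extractor can read, or FALSE when the download fails.
 * Remote files are copied to a temporary file first.
 */
function example_local_path($uri, $temp_dir = NULL) {
  if (!example_uri_is_remote($uri)) {
    // Local files can be handed to the extractor unchanged.
    return $uri;
  }
  $target = tempnam($temp_dir ? $temp_dir : sys_get_temp_dir(), 'solr');
  if ($target === FALSE) {
    return FALSE;
  }
  // copy() reads through whatever stream wrapper is registered for the scheme.
  if (!@copy($uri, $target)) {
    @unlink($target);
    return FALSE;
  }
  return $target;
}
```

The caller is responsible for deleting the temporary file once the text has been extracted.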
Comment #3
alibama commented
Way cool - looking forward to testing it. I'm also going to test Elasticsearch and see if I can work through there. Glad to test patches.
Comment #4
digitalfrontiersmedia commented
escuriola & alibama,
Did you ever develop patches for this that you could share? I'm looking at a similar setup. I may have a solution but am looking for alternative options, so any solutions or guidance would be greatly appreciated.
Thanks,
Stephen
Comment #5
escuriola commented
Hello.
I reviewed the changes and this is the only change I think affects Amazon S3:
diff --git a/apachesolr_attachments.index.inc b/apachesolr_attachments.index.inc
index 9e9517c..0496a98 100644
--- a/apachesolr_attachments.index.inc
+++ b/apachesolr_attachments.index.inc
@@ -61,8 +61,10 @@ function apachesolr_attachments_get_attachment_text($file) {
       array('@filesize' => $file->filesize, '@filename' => $file->filename, '@sizelimit' => $filesize_limit));
     return FALSE;
   }
-
   $filepath = drupal_realpath($file->uri);
+  if (is_amazons3($file->uri)) {
+    $filepath = i_get_s3_internal_path($file->uri);
+  }
With this, $filepath comes directly from the URI, which the S3 wrappers can understand and locate.
Regards!
Comment #6
digitalfrontiersmedia commented
Wow! Sweet little patch. I'll check it out! Many thanks!
Comment #7
krrishnajee commented
Is there a good solution for this issue? I tried the above patch and it doesn't work.
Comment #8
krrishnajee commented
My friend Ian came up with a trick, we tried it, and it finally works. We are not sure whether this is the correct approach, and I know that changing the module source is not a nice thing, but since we had no other option, we ended up here. Before indexing, we download the files from the S3 bucket to our local temp folder, then send the local file path to Tika to read the file and do the indexing. Check this out:
apachesolr_attachments.module
apachesolr_attachments.index.inc
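For readers without the attachments, the "send the local file path to Tika" step might look roughly like the sketch below. It is not the code from the attached files: both function names are hypothetical, the jar location is a placeholder, and it assumes the Tika app jar is run with its -t (plain text) option.

```php
<?php
// Sketch only: hypothetical helpers, not taken from the attached files.

/**
 * Builds a shell command that asks Tika for the plain text of a local file.
 */
function example_tika_command($jar_path, $local_path, $java = 'java') {
  return escapeshellcmd($java)
    . ' -jar ' . escapeshellarg($jar_path)
    . ' -t ' . escapeshellarg($local_path);
}

/**
 * Runs the command, then removes the temporary download even on failure.
 * Returns the command output, or FALSE when nothing was produced.
 */
function example_extract_and_cleanup($command, $temp_path) {
  try {
    $output = shell_exec($command);
    return is_string($output) ? $output : FALSE;
  }
  finally {
    if (is_file($temp_path)) {
      unlink($temp_path);
    }
  }
}
```

Escaping the path matters here because file names stored in S3 can contain spaces or shell metacharacters.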
Comment #9
digitalfrontiersmedia commented
We ended up doing something similar. Dave curled the file from S3 but, instead of storing it locally, simply piped the curl response directly into Tika.
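The piping idea can be illustrated without a temporary copy of the source file. The sketch below is not Dave's code: it swaps curl for PHP's own stream functions and proc_open(), and the function name is hypothetical. It assumes the extraction command reads the document from standard input. Standard output goes to a temporary file so that a large result cannot block the pipe while input is still being written.

```php
<?php
// Sketch only: example_pipe_to_command() is a hypothetical helper.

/**
 * Streams a file into a command's stdin and returns the command's stdout.
 * Returns FALSE when the source cannot be opened or the command fails.
 */
function example_pipe_to_command($source_uri, $command) {
  $input = @fopen($source_uri, 'rb');
  if ($input === FALSE) {
    return FALSE;
  }
  $output_path = tempnam(sys_get_temp_dir(), 'tika');
  if ($output_path === FALSE) {
    fclose($input);
    return FALSE;
  }
  $descriptors = array(
    0 => array('pipe', 'r'),               // The child reads its stdin from us.
    1 => array('file', $output_path, 'w'), // The child writes stdout to a file.
  );
  $process = proc_open($command, $descriptors, $pipes);
  if ($process === FALSE) {
    fclose($input);
    unlink($output_path);
    return FALSE;
  }
  // Copy the source into the child in chunks, never holding it all in memory.
  stream_copy_to_stream($input, $pipes[0]);
  fclose($input);
  fclose($pipes[0]);
  $exit_code = proc_close($process);
  $text = file_get_contents($output_path);
  unlink($output_path);
  return $exit_code === 0 ? $text : FALSE;
}
```

A call might look like example_pipe_to_command('s3://bucket/report.pdf', 'java -jar /path/to/tika-app.jar -t -'), where the bucket, the jar path, and the trailing "-" for standard input are all assumptions to verify against your own setup.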
Comment #10
amontero commented
For files in storage backends such as Amazon S3 or Google Object Storage, check also linked issue #2789569: Allow configuration of file hashing function, which would avoid redownloading files for updated nodes already in the file extraction cache.
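To illustrate why a configurable hash helps with remote storage: a hash of the file contents requires reading the whole file, while a hash built from stored metadata does not. The sketch below is not the code from #2789569. The function names are hypothetical, and it assumes the file object carries the uri, filesize and timestamp properties of a Drupal 7 managed file.

```php
<?php
// Sketch only: hypothetical helpers, not the implementation from #2789569.

/**
 * Hashes the file contents. For a remote URI this reads the entire file.
 */
function example_content_hash($file) {
  return hash_file('sha256', $file->uri);
}

/**
 * Hashes stored metadata only, so no download is needed.
 * The trade-off: a change that keeps size and timestamp goes unnoticed.
 */
function example_metadata_hash($file) {
  return hash('sha256', implode('|', array($file->uri, $file->filesize, $file->timestamp)));
}

/**
 * Computes the cache key with whichever hashing callback is configured.
 */
function example_file_hash($file, $hasher = 'example_metadata_hash') {
  return call_user_func($hasher, $file);
}
```

With the metadata variant, the extraction cache can be checked before any bytes are fetched from the bucket.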
Comment #11
amontero commented
Also, @DigitalFrontiersMedia, do you have any patch which you can post to allow others to use and work on it? TIA.
Comment #12
amontero commented
Comment #13
amontero commented
See related issue: #1430710: Integrate with Remote Stream Wrapper module
Comment #14
digitalfrontiersmedia commented
@amontero, I will ask Dave to post what we ended up with if posting a patch makes sense (i.e. isn't too application specific).
Comment #15
digitalfrontiersmedia commented
FYI -- I've asked Dave to submit the work but he's been busy. I may post it on his behalf if he just doesn't have the time to box it up.