Hello,

I'm working on a use case where file entities exist on their own. I mean, you can upload files (e.g. videos, docs, images, etc.) directly using the Media module without attaching them to any node. The problem is that these files are not indexed. Apparently it's a viable but never-requested feature (ref).

Does anyone know if there is a hack to index these files?

Comments

David_Rothstein’s picture

Title: Index files that are not attached to nodes » Index files that are not attached to other entities

The Apache Solr File module does this, but it isn't compatible with the Apache Solr Attachments module.

I think I have a case where I'll need both features on the same site (i.e., some files indexed as attachments and linked in the search results to the entity they're attached to, and others - which aren't necessarily attached to anything - indexed as their own entities), so I'm interested in being able to do this with Apache Solr Attachments alone.

Given that the word "attachments" appears in the name of the module, I'm not sure the feature really fits here. But I'd still like it (and might have time to work on it), and it's encouraging to read in that blog post that it might be accepted as a patch :) Anyone know if it would have a chance of making it in?

David_Rothstein’s picture

Status: Active » Needs review
FileSize
20.48 KB

OK, I needed this, so I wound up writing a patch.

I tried to keep it as minimal as possible so it has a shot at the stable 7.x-1.x branch (in reality, I think a larger rearchitecting might make sense for this feature). So I use the same storage methods that the module currently uses for file attachments, and basically the same user interface also. It's still a large patch, but a lot of it is moving code around.

Features:

  • On the existing bundle administration page, there are now extra dropdowns for unattached files of each file type, with choices to either index these types of unattached files for search or not. (The default is the latter, for backwards compatibility reasons - although an update function might be a better way to guarantee this since I think some sites will already have the old default saved in the database even though it did nothing before.)
  • If these files are indexed for search, they will be presented in the search results as their own entities (with no link to any other entity, since by definition there's none they're attached to).
  • When unattached files are indexed, fields attached to the file entity (via the File Entity module) will be indexed along with it. (Whether the file contents themselves are extracted and indexed depends on the file mimetype, as always.) There are also facets available for these fields. Technically this bullet point could be a followup feature request, but it was a tiny part of the patch, and it made sense to me: if unattached files are showing up in search they are basically "first class" entities at that point, and so should be able to have their fields indexed in Solr just like any other entity. For example, this feature allows you to index videos for search, which normally wouldn't make sense (since you can't extract text from a video), but if the video file type has fields attached to it, such as a description field, transcript field, etc., it does make sense and is now possible.

In order to work correctly, this patch requires that #2014067: File types provided by the File Entity module are not always recognized by Apache Solr be applied to the Apache Solr module.

David_Rothstein’s picture

Note that if you're using the Apache Solr Access module, these unattached files might not show up in search results for non-administrative users (similar bug as already exists in this module for attached files: #1782936: Index file entities be with access grants from parent node if the apachesolr_access module is enabled).

Since these unattached files have no parent entities, the correct fix here probably involves something similar to the patches in #1665350: Only users with the "Bypass content access control" permission are able to search for users when Apache Solr Access is enabled for the Apachesolr User module.

David_Rothstein’s picture

Issue summary: View changes

minor modifications

Pedja Grujić’s picture

Not able to apply patch, fails:

patching file apachesolr_attachments.admin.inc
patching file apachesolr_attachments.index.inc
Hunk #2 succeeded at 211 (offset 5 lines).
Hunk #3 succeeded at 246 (offset 5 lines).
Hunk #4 FAILED at 249.
1 out of 4 hunks FAILED -- saving rejects to file apachesolr_attachments.index.inc.rej
patching file apachesolr_attachments.module
Hunk #4 succeeded at 252 (offset 4 lines).
Hunk #5 succeeded at 354 (offset 4 lines).
Hunk #6 succeeded at 371 (offset 4 lines).
Hunk #7 succeeded at 407 (offset 4 lines).
Hunk #8 succeeded at 447 (offset 4 lines).
Hunk #9 succeeded at 456 (offset 4 lines).
Hunk #10 FAILED at 744.
Hunk #11 succeeded at 783 (offset 24 lines).
Hunk #12 FAILED at 847.
2 out of 12 hunks FAILED -- saving rejects to file apachesolr_attachments.module.rej

undertext’s picture

I rerolled the patch.
It applies to the dev branch and to the current stable version, 7.3.
Thanks to @David_Rothstein for saving me hours of work :)

David_Rothstein’s picture

Status: Needs review » Needs work
FileSize
20.65 KB
1.42 KB

The reroll looks good, but was missing one or two things (most notably the code that exposes facets for the fields attached to file entities). So I fixed that in the attached patch.

However, in the interim there have been many commits to the 7.x-1.x-dev branch, so neither this patch nor the one above apply to that branch anymore (both still work against the 7.x-1.3 release). So, marking "needs work" for that.

ryantollefson’s picture

Thanks for this; I ran into a small bug... Cron was crashing (white screen) on me.

Checked Server logs and found:
PHP Fatal error: Call to a member function getExternalUrl() on a non-object in [path]\apachesolr_attachments.module on line 242, referer: [URL]

I don't really know PHP, so not sure how to fix, but here is line 242 from my file:

$path = file_stream_wrapper_get_instance_by_uri($file->uri)->getExternalUrl();
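
For anyone hitting the same fatal error: in Drupal 7, file_stream_wrapper_get_instance_by_uri() returns FALSE when no stream wrapper is registered for the URI's scheme, so chaining ->getExternalUrl() directly onto the result fails for such files. Below is a minimal sketch of a guard. The helper name example_external_url_or_null() is hypothetical (it is not part of the module or of any patch in this issue), and it takes the wrapper as an argument so that the snippet runs outside Drupal.

```php
<?php

/**
 * Hypothetical helper, for illustration only.
 *
 * @param object|false $wrapper
 *   The value returned by file_stream_wrapper_get_instance_by_uri().
 *
 * @return string|null
 *   The external URL, or NULL when no usable wrapper object is available.
 */
function example_external_url_or_null($wrapper) {
  // FALSE (or any non-object) means there is no wrapper to ask for a URL.
  if (!is_object($wrapper) || !method_exists($wrapper, 'getExternalUrl')) {
    return NULL;
  }
  return $wrapper->getExternalUrl();
}
```

Inside Drupal this would be called as example_external_url_or_null(file_stream_wrapper_get_instance_by_uri($file->uri)), and the caller would skip the file when NULL comes back. Whether skipping is the right behavior for this patch is an open question.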

undertext’s picture

nmillin’s picture

Patch #8 didn't apply cleanly to the latest dev. Tweaked patch to work with current dev.

vitalie’s picture

Thanks all. Patch #9 works ok if I add:

        if (empty($parent_entity_info->extraFields)) {
          continue;
        }

after line 357 of the patched file, which reads: foreach ($parent_entities as $parent_entity_info) {

vitalie’s picture

If anyone needs it, here is the patch from #9, which includes the code from #10 and applies against the 7.x-1.4 version of the module.

jenlampton’s picture

Status: Needs work » Needs review
FileSize
22.14 KB

Rerolled from the correct location.

jenlampton’s picture

The previous patch removed the check for file size, which was causing me some headaches. Rerolled here (it needed a reroll anyway) and added back the file size check. Note that the file size check has moved to apachesolr_attachments_get_attachment_text() so that the rest of the file entity will still be properly indexed.

pwolanin’s picture

Issue tags: +needs iss

Hmm, it's not clear to me why the function apachesolr_attachments_apachesolr_file_excluded() is removed, or why some of the other changes were made.

You are also changing the cleanup function in a way that might leave some orphan files?

Also - looks like I have a test fixture but no tests.

mausolos’s picture

Any particular reason you're setting the filesize limit before AND after the logic check? Was this unintended?

@@ -55,6 +54,13 @@ function apachesolr_attachments_get_attachment_text($file) {
 
   // Exclude files above the configured limit.
   $filesize_limit = variable_get('apachesolr_attachments_filesize_limit', '41943040');
+  if (isset($file->filesize) && $filesize_limit > 0 && $file->filesize > $filesize_limit) {
+    // File too large.
+    return FALSE;
+  }
+
+  // Exclude files above the configured limit.
+  $filesize_limit = variable_get('apachesolr_attachments_filesize_limit', '41943040');

nixar’s picture

I can't speak to #14 or #15, but I've been using the patch in #13 on a site with several hundred files and it's been working well.