We are finding that some Solr services freeze up during a cron run when trying to index a file that is too large. This ends up causing cron to fail for the Drupal site. Perhaps it would be useful to add a setting that lets us either skip files that are too large or only partially index them.

Comments

omaster’s picture

This is definitely needed. I have hit the same problem. Has anyone looked at it?

omaster’s picture

I guess not.

jhedstrom’s picture

Version: 6.x-2.x-dev » 7.x-1.x-dev
Status: Active » Needs review
Attachment: new, 1.56 KB

Here is a patch that adds a filesize limit. There is currently no UI for setting this, but it can easily be set via drush, settings.php, or Strongarm. The default is no limit, so existing sites will behave as they do now. This is a very important feature for sites that have no upload size limit, since even with enormous amounts of memory, a large enough file will still cause a fatal error during cron.

nick_vh’s picture

Status: Needs review » Needs work

I think it is better to use the exclude hook. We might get to a point with multiple environments and different file size limits (who knows, right?), and we do not want to mark this file as invalid and/or incomplete. We just want to ignore/exclude it.

/**
 * This is invoked for each entity that is being inspected to be added to the
 * index. if any module returns TRUE, the entity is skipped for indexing.
 *
 * @param integer $entity_id
 * @param string $entity_type
 * @param integer $row
 *   A complete set of data from the indexing table.
 * @param string $env_id
 * @return boolean
 */
function hook_apachesolr_exclude($entity_id, $entity_type, $row, $env_id) {
  // Never index media entities to core_1
  if ($entity_type == 'media' && $env_id == 'core_1') {
    return TRUE;
  }
  return FALSE;
}
sterndata’s picture

I've hacked the module to fix it on my system, but I can do a more general fix if someone can help me with a bit of code.

Essentially,

   $mem_left = amount_of_memory_still_available_to_php();
   $mem_needed = size_of_file_about_to_be_read($file);
   // Only read the file if it fits in the remaining memory, with some headroom.
   if (($mem_needed + $some_fudge_factor) <= $mem_left) {
     // do what needs to be done
   }
   else {
     log_an_error("File too large: " . $file);
   }

What I need is that first function, amount_of_memory_still_available_to_php().
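
One possible way to approximate that first function is to subtract PHP's current memory usage from the configured memory_limit. The sketch below is only an illustration: example_ini_size_to_bytes() and example_memory_left() are hypothetical names, not part of the module, and the size of a file on disk is at best a rough lower bound for the memory that text extraction will need.

```php
<?php

/**
 * Converts a php.ini shorthand size such as "128M" to bytes.
 *
 * Hypothetical helper for illustration only.
 */
function example_ini_size_to_bytes($value) {
  $value = trim($value);
  if ($value === '') {
    return 0;
  }
  $unit = strtolower(substr($value, -1));
  // Casting "128M" to float keeps the leading number: 128.
  $number = (float) $value;
  switch ($unit) {
    case 'g':
      $number *= 1024;
      // Fall through.
    case 'm':
      $number *= 1024;
      // Fall through.
    case 'k':
      $number *= 1024;
  }
  return (int) $number;
}

/**
 * Estimates how many bytes PHP can still allocate in this request.
 *
 * Hypothetical helper for illustration only.
 */
function example_memory_left() {
  $limit = ini_get('memory_limit');
  // A memory_limit of -1 means PHP enforces no limit.
  if ($limit === FALSE || (int) $limit === -1) {
    return PHP_INT_MAX;
  }
  $limit_bytes = example_ini_size_to_bytes($limit);
  // memory_get_usage(TRUE) reports memory allocated from the system.
  return max(0, $limit_bytes - memory_get_usage(TRUE));
}
```

Drupal 7 core also provides parse_size() for the shorthand conversion, which could replace the first helper inside a module.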

posulliv’s picture

Status: Needs work » Needs review
Attachment: new, 1.93 KB

This patch is based on the patch from #4 and uses the hook_apachesolr_exclude hook. A new variable is added to the configuration screen to control the filesize limit.

posulliv’s picture

Updated the patch from #7 with a tiny bit of validation for the variable that limits the filesize.

jessehs’s picture

Re-roll of #8 without removing the newline at end of file (which caused the patch to break against HEAD).

nick_vh’s picture

Issue summary: View changes
Attachment: new, 2.14 KB

Made the default 40 MB, as most of the problems we see come from files above that size. It's still configurable. Also moved the check into the exclude hook that was already implemented (apachesolr_attachments_apachesolr_file_exclude).

nick_vh’s picture

Committed it to the dev branch. Let me know if it works for you.

nick_vh’s picture

Status: Needs review » Fixed

  • Nick_vh committed 8f00ca3 on 7.x-1.x
    Issue #1251308 by posulliv, Nick_vh, jessehs, jhedstrom | aLearningGuy:...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

janusman’s picture

Status: Closed (fixed) » Needs work
+++ b/apachesolr_attachments.module
@@ -388,6 +388,19 @@ function apachesolr_attachments_apachesolr_file_exclude($entity_id, $row, $env_i
+  // Exclude files above the configured limit.
+  $filesize_limit = variable_get('apachesolr_attachments_filesize_limit', '40');
+  // load the entity.
+  $entities = entity_load('file', array($entity_id), NULL, TRUE);
+  // Take the first item.
+  $entity = reset($entities);
+
+  // Check if the filesize is higher than the allowed filesize.
+  if (isset($entity->filesize) && $filesize_limit > 0 && $entity->filesize > $filesize_limit) {
+    return TRUE;
+  }
+

It looks like the variable is an integer meaning megabytes, but $entity->filesize is stored in bytes. So the check runs, BUT with the default of '40' it will skip every file larger than 40 bytes :(
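
To illustrate the unit mismatch, here is a minimal sketch that converts a megabyte limit to bytes before comparing. The function name example_file_exceeds_limit() is hypothetical, and this is not necessarily how the follow-up patch resolves it. Note that 40 MB is 40 * 1024 * 1024 = 41943040 bytes.

```php
<?php

/**
 * Checks whether a file is larger than a limit expressed in megabytes.
 *
 * Hypothetical helper for illustration only.
 *
 * @param int $filesize_bytes
 *   The file size in bytes, as stored in $entity->filesize.
 * @param int|float $limit_mb
 *   The limit in megabytes. Zero or less means no limit.
 *
 * @return bool
 *   TRUE if the file should be excluded from indexing.
 */
function example_file_exceeds_limit($filesize_bytes, $limit_mb) {
  if ($limit_mb <= 0) {
    return FALSE;
  }
  // Convert the limit to bytes so both sides use the same unit.
  $limit_bytes = $limit_mb * 1024 * 1024;
  return $filesize_bytes > $limit_bytes;
}
```

With this conversion, a 1000-byte file passes a limit of 40, while a 50 MB file is excluded.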

janusman’s picture

Status: Needs work » Needs review
Attachment: new, 2.03 KB

Patch for 7.x-1.x.

pwolanin’s picture

@janusman - I think you posted an interdiff only?

janusman’s picture

Status: Needs review » Fixed

I didn't... my diff was against a fresh checkout of master :)

My changes are pretty tame and seem to work, so committing to 7.x-1.x-dev

(As soon as I get permissions to do so!)

  • janusman authored ef7477b on 7.x-1.x
    Issue #1251308 by posulliv, Nick_vh, janusman, jhedstrom, jessehs: File...
janusman’s picture

Committed.

n.dhuygelaere’s picture

I still get a "PHP Fatal error: Allowed memory size" when I index a file with Tika on the 7.x-1.3+9-dev version (2015-02-13).

In fact, the error occurs because Tika tries to extract the content of the file before hook_apachesolr_file_exclude is invoked.

To fix this bug, I altered the apachesolr_attachments_get_attachment_text() function as below:

function apachesolr_attachments_get_attachment_text($file) {
  $indexer_table = apachesolr_get_indexer_table('file');
  if (!apachesolr_attachments_is_file($file)) {
    return FALSE;
  }
  // Exclude files above the configured limit.
  $filesize_limit = variable_get('apachesolr_attachments_filesize_limit', '41943040');
  // Check if the filesize is higher than the allowed filesize.
  if (isset($file->filesize) && $filesize_limit > 0 && $file->filesize > $filesize_limit) {
    return false;
  }
....
pwolanin’s picture

Status: Fixed » Active
janusman’s picture

Status: Active » Needs review
Attachment: new, 2.37 KB

New patch, basically the same as n.dhuygelaere's from #21, but I added logging.

  • janusman authored 5e96575 on 7.x-1.x
    Followup to: Issue #1251308 by janusman, posulliv, Nick_vh, jhedstrom,...
janusman’s picture

Status: Needs review » Fixed

Committed to 7.x-1.x

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

nicholass’s picture

This is somehow STILL NOT RELEASED... I was bit by this problem this week and installing the dev version of the module did the trick. 👍 Module maintainers please release!