My site is running the apachesolr module 7.x-1.8 with the latest stable apachesolr_attachments module.

It is set up to extract using Solr.

Debugging cron shows that apachesolr_cron is causing cron (whether it runs normally or with drush) to fail with the following message:

Attempting to re-run cron while it is already running.
Cron run failed

I have narrowed it down to the indexing of files, or possibly of one file in particular (if so, I can't tell which one), because I was initially able to index all my nodes without the files just fine.

Since we use Solr to extract, the Tika version being called is 1.5. Would it help to update it to a more recent version? If so, do I only need to update the three tika-* files located in the lib folder?

Any suggestions would be greatly appreciated!

ps: Please note that I have another site using the exact same Solr configuration, only with a different core of course, and it's not showing the error.

TYIA,

resolution here: https://www.drupal.org/node/2774943

TL;DR: updated PDFBox to 1.8.5 and the issue has not reappeared

Edit/resolution:

The issue was indeed coming from a single file: the bundled Tika was having trouble indexing the content of that file.

I managed to identify the file in the logs by using a modified log4j.properties file:

#  Logging level
solr.log=${solr.solr.home}/logs
log4j.rootLogger=INFO, file, CONSOLE

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender

log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x \u2013 %m%n

#- size rotation with log cleanup.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=96MB
log4j.appender.file.MaxBackupIndex=9

#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n

log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop=WARN
log4j.logger.org.apache.solr.core.SolrCore=WARN
log4j.logger.org.apache.solr.search.SolrIndexSearcher=WARN

# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=INFO

Only then was I able to see in the logs that Tika was choking on this file, with the request being sent repeatedly until I excluded the node using the Solr Exclude module.
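
For anyone hunting for a similar file, here is a rough Python sketch of how the resulting solr.log could be scanned for messages that keep repeating. It assumes every log entry follows the file appender's ConversionPattern from the log4j.properties above (level, timestamp, class, message, separated by " - " and "; "). The function name repeated_problems and its parameters are made up for illustration, and the wording of the actual Tika error is not assumed: the script only groups identical WARN/ERROR messages and counts them.

```python
import re
from collections import Counter

# Matches lines written with the pattern
#   %-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m
# i.e. "LEVEL - timestamp; logging class; message".
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+- "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}); "
    r"(?P<cls>[^;]*); "
    r"(?P<msg>.*)$"
)


def repeated_problems(lines, levels=("WARN", "ERROR"), min_count=2):
    """Return (message, count) pairs for WARN/ERROR messages seen repeatedly.

    Hypothetical helper: lines that do not match the pattern (for example
    stack trace continuation lines) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue
        if match.group("level") in levels:
            counts[match.group("msg")] += 1
    # most_common() sorts by descending count, so the worst offender is first.
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]


def scan_log(path):
    """Print repeated WARN/ERROR messages found in the log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for message, count in repeated_problems(handle):
            print(f"{count:5d}  {message}")
```

Calling scan_log() on the file configured as ${solr.log}/solr.log lists the most frequent messages first. A message that comes back on every cron run is a good candidate for the file that extraction keeps failing on.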

Comments

nixar created an issue. See original summary.

Status: Active » Closed (fixed)
Status: Closed (fixed) » Active
Updating PDFBox didn't help, as the problem is still occurring.

Status: Active » Closed (works as designed)