My site is running the apachesolr module 7.x-1.8 with the latest stable apachesolr_attachments module.
It is set up to extract using Solr.
Debugging cron shows that apachesolr_cron is causing cron (whether it runs normally or with drush) to fail with the following message:
Attempting to re-run cron while it is already running.
Cron run failed
I have narrowed it down to file indexing, possibly a single file in particular (though I can't tell which one), because I initially indexed all my nodes without the files and that worked fine.
Since extraction is done by Solr, the Tika version that gets called is v1.5. Would it help to update it to a more recent version? If so, do I only need to update the 3 tika-* files located in the lib folder?
Any suggestions would be greatly appreciated!
PS: Please note that I have another site using the exact same Solr configuration, only with a different core of course, and it is not showing the error.
TYIA,
Resolution here: https://www.drupal.org/node/2774943
TL;DR: updated PDFBox to 1.8.5 and the issue has not reappeared
Edit/resolution:
The issue was indeed coming from a file: the bundled Tika was having trouble properly indexing the content of that file.
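For anyone wanting to check a suspect file in isolation, here is a minimal sketch (not what I actually ran) that posts a single file to Solr's extracting handler with `extractOnly=true`, so nothing gets indexed. The handler path `update/extract`, the `core_url` value and the function names are assumptions; adjust them to match your solrconfig.xml.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def extract_only_url(core_url, handler="update/extract"):
    """Build the URL for an extract-only request (handler path is an assumption)."""
    query = urlencode({"extractOnly": "true", "wt": "json"})
    return f"{core_url.rstrip('/')}/{handler}?{query}"


def probe_file(core_url, path, timeout=60):
    """POST one file to Solr and return the HTTP status.

    urlopen raises HTTPError on a 4xx/5xx response and a timeout error
    if extraction hangs, which is the behaviour we are trying to spot.
    """
    with open(path, "rb") as fh:
        req = Request(
            extract_only_url(core_url),
            data=fh.read(),
            # Generic type so that Tika does its own content detection.
            headers={"Content-Type": "application/octet-stream"},
        )
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling `probe_file("http://localhost:8983/solr/mycore", "/path/to/file.pdf")` on each attachment in turn (both values are placeholders) should show which file errors out or times out.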
I managed to identify the file in the logs by using a modified log4j.properties file:
# Logging level
solr.log=${solr.solr.home}/logs
log4j.rootLogger=INFO, file, CONSOLE
log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%-4r [%t] %-5p %c %x \u2013 %m%n
#- size rotation with log cleanup.
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.MaxFileSize=96MB
log4j.appender.file.MaxBackupIndex=9
#- File to log to and log format
log4j.appender.file.File=${solr.log}/solr.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m\n
log4j.logger.org.apache.zookeeper=WARN
log4j.logger.org.apache.hadoop=WARN
log4j.logger.org.apache.solr.core.SolrCore=WARN
log4j.logger.org.apache.solr.search.SolrIndexSearcher=WARN
# set to INFO to enable infostream log messages
log4j.logger.org.apache.solr.update.LoggingInfoStream=INFO
Only then was I able to see in the logs that Tika was choking on this file: the request kept being sent repeatedly until I excluded the node using the Solr Exclude module.
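To spot the repetition without reading the whole log by eye, something like the following sketch can help. It assumes the file appender layout from the log4j.properties above (`%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m`) and simply counts identical WARN/ERROR messages; the function name and thresholds are my own choices, and the messages for a failing file may not be strictly identical on your setup.

```python
import re
from collections import Counter

# Matches lines written with the layout:
#   %-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %C; %m
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s*- "
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}); "
    r"(?P<cls>[^;]*); "
    r"(?P<msg>.*)$"
)


def repeated_messages(lines, min_count=3, levels=("WARN", "ERROR")):
    """Return (message, count) pairs for messages seen at least min_count times."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        # Continuation lines (stack traces) do not match and are skipped.
        if m and m.group("level") in levels:
            counts[m.group("msg")] += 1
    return [(msg, n) for msg, n in counts.most_common() if n >= min_count]
```

Usage would be along the lines of `repeated_messages(open("solr.log", encoding="utf-8", errors="replace"))`, with the path pointing at wherever `${solr.log}/solr.log` resolves on your server. A message that shows up many times in a row is a good candidate for the file that keeps being retried.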
Comments
Comment #4
nixar commented: Updating PDFBox didn't help, as the problem is still occurring.