Before I forget: I just rebuilt the Solr/Nutch setup and stumbled on a discrepancy between the README and the Nutch Tutorial.

The Nutch Tutorial has a step on integrating Nutch with Solr by copying an XML file from Nutch into Solr.

Correct me if I am wrong, but I assume that step is not needed.

The Apache Solr Integration Drupal module ships config files that are copied into Solr, and those files provide the needed Nutch-compliant schema.

I will clarify this post later when I am back in front of my dev machine.

Comments

sgurlt

I am actually trying to work through the README and got stuck when crawling my first URL:

Indexing 1 documents
SolrIndexer: finished at 2013-03-04 10:28:31, elapsed: 00:00:21
SolrDeleteDuplicates: starting at 2013-03-04 10:28:31
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1265)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
    at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Could this have something to do with the config files?
I am using the files from the git archive for Nutch and the files from the Solr module for Solr, which should be correct, I think.

cilefen

Sorry for the long delay. You also need to look at the Solr server log file.
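
If the log is long, something like the following can help narrow it down. This is only a rough sketch: the marker words (SEVERE, ERROR, Exception) are an assumption about how errors appear in your Solr log, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: error lines in the Solr server log contain one of these markers.
ERROR_PATTERN = re.compile(r"\b(?:SEVERE|ERROR)\b|Exception\b")


def find_error_lines(log_text):
    """Return (line_number, line) pairs for lines that look like errors."""
    matches = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if ERROR_PATTERN.search(line):
            matches.append((number, line))
    return matches


def scan_log_file(path):
    """Read a log file (path supplied by the caller) and return its error lines."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_error_lines(text)
```

Call `scan_log_file()` with the path to your own Solr log (the location depends on how Solr was started) and check whether any of the returned lines fall around the time the SolrDeleteDuplicates job failed.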

niccolox

There is a dedup bug in Nutch 1.6.

cilefen

Status: Active » Closed (won't fix)

niccolo, thank you for the info. I am closing this issue.