ok, this is a stub, another one, for Nutch love

Anyone out there working on Nutch 1.4 and Drupal 6 or 7?

I am going to take another crack at some kind of integration with the BOA Aegir LEMP stack over the Christmas break

some relevant news below that may give some focus

it seems to me, Nutch could do with a HUGE Drupal integration boost

26 November 2011 - Apache Nutch 1.4 Released
The 1.4 release of Nutch is now available. This release includes several improvements including allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing. Please see the list of changes made in this version. The release is available here.

23 September 2011 - Apache Nutch focuses on 1.x series for main development
After some discussion and a vote about the issue, the Nutch development community decided to focus their efforts on maintaining and releasing the 1.x series of Nutch, and to branch the now former Nutch trunk based on Gora, allowing others to try and improve it, while the mainline development goes on.

Comments

niccolox’s picture

Just found this package from Australia's CSIRO (kind of like the National Science Foundation, but in Oz): http://www.atnf.csiro.au/computing/software/arch/

What is CSIRO Arch?

CSIRO Arch is an open source free extension of Apache Nutch, a popular general purpose search engine that is capable of indexing billions of web pages using clusters of computers. Arch uses Nutch software and adds additional features to provide a powerful and efficient search engine that is optimized for use in corporate web environments. Such environments typically have one or more web sites, with web content provided for external readers and internal use, and one or more "intranet" sites that provide content for internal use only. Arch can be used to search both the external access and restricted access sites and produces extremely high quality search results.

Corporate web environments are a challenging area for modern search engines. Whilst they may include multiple web sites and millions of pages, compared to the global Web they are much smaller and this makes them easier to index. However, the smaller scale of corporate environments and the more restricted access to information also make it harder to estimate the relative importance of documents found on corporate web sites. The search methods used to search the global Web generally do not work well on a smaller scale and this leads to frustration for companies who often find that searches on their intranets are of limited use.

Arch has been specifically designed to provide very high quality searches for intranet web environments. Arch makes use of web server logs and other information that is available within an organization, but not available to external search engines, to provide excellent search results. It is robust and easy for a webmaster to install and maintain, and is extremely efficient at providing relevant and up-to-date information.

niccolox’s picture

looks like the CSIRO is moving towards using Drupal for its own sites and so the Arch (Nutch/Solr) package will *have* to integrate with Drupal

http://lucene.472066.n3.nabble.com/Drupal-Integration-with-Nutch-via-CSI...
http://lucene.472066.n3.nabble.com/Drupal-Integration-with-Nutch-via-CSI...

dstuart’s picture

New Year's resolution: make the Nutch module kick ass and get the D7 version up to speed!

dstuart’s picture

Yeah, I saw those posts; they look very interesting. With the latest patches in 1.4, I think we can get full integration without having to change the schema.xml in Solr. Also, with the new work by @ygerasimov in Apache Solr Views 7, we have a good option for proper D7 integration.

niccolox’s picture

G'day dstuart, great New Year's resolution

I think that the Nutch project could get a lot of fresh interest and energy with an easy-to-use and documented package for Drupal integration

I guess if a network provider started implementing Nutch AND Solr, we would also see it take off

niccolox’s picture

OK. It's been a long time between drinks. Any movement on Nutch Drupal?

naeluh’s picture

Oops, that is from almost a year ago, haha, my bad. Also, I am interested in helping test anything, thanks.

niccolox’s picture

Check the Lucene, Nutch and Solr group for a major Nutch module sandbox release

niccolox’s picture

Solr Nutch Search Sandbox Project Updated to Integrate with Common Schema

http://groups.drupal.org/node/273813

niccolox’s picture

And the good news is that the Solr Nutch sandbox works with Solr 3.6 and Nutch 1.6:
http://groups.drupal.org/node/273813#comment-869623

Can we fold this work into the Nutch module?

avpaderno’s picture

Status: Active » Closed (outdated)

I am closing this issue, since Drupal 6 isn't supported anymore.