It would seem that Tika 0.4 was released in July.

Does ASA support Tika 0.4, or should I be looking at checking out a 0.3 version from svn?

Comment  File                  Size     Author
#10      docs-540974-10.patch  1.77 KB  pwolanin

Comments

pwolanin’s picture

It should work - please try it and we can update the README.

timatlee’s picture

Can't seem to get it going. When trying to run it from the command line, I get:

Failed to load Main-Class manifest attribute from D:/websites/drupal6/sites/all/modules/apachesolr_attachments/tika/tika-core-0.4.jar

Still using tika-0.3 from svn, revision 756979
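That error means the jar's manifest has no Main-Class entry, so `java -jar` has nothing to start. A minimal sketch for checking a jar before pointing the module at it, assuming `unzip` and `grep` are available (the function name is only illustrative):

```shell
# Print the Main-Class line from a jar's manifest,
# or a short notice if the manifest has no such entry.
has_main_class() {
  unzip -p "$1" META-INF/MANIFEST.MF 2>/dev/null | grep -i '^Main-Class:' \
    || echo "no Main-Class in $1"
}
```

For example, `has_main_class tika/tika-core-0.4.jar` should print the notice for a jar that fails with the error above.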

timatlee’s picture

Title: Tika 0.4 » Tika 0.4 / 0.5

Apparently I'm a newbie, and should have RTFM'd.

Checking out Tika from SVN at http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/lib will get you 0.4, except without tika-app-0.4.jar, which I guess is the java app that needs to be called.

I couldn't find a download that had tika-app-0.4.jar, so I wound up having to build it from source.

If you don't mind a few downloads, it's very easy to do. I have no idea what the license restrictions are specific to distributing a compiled jar, but I think if you're in this deep, you're probably OK having to go elsewhere to download some additional tools.

Oh, to note: I am doing this on Windows XP and 2003.

  1. Downloaded the Java SDK.
  2. Downloaded Maven 2 from http://maven.apache.org/download.html
  3. Downloaded the Tika 0.4 source from http://lucene.apache.org/tika/download.html (I actually used the SVN version by the end of it, which reports itself as version 0.5).
  4. Installed the Java SDK and Maven.
  5. Set JAVA_HOME=C:\Program Files\java\jdk1.6.0_13\ (Note: no quotes around the path, even though it contains a space. Maven exits very early if quotes are present.)
  6. Set MAVEN_OPTS=-Xmx256m (Tests fail with 64M memory; this increases it to 256M. Based on the thread at http://osdir.com/ml/tika-user.lucene.apache.org/2009-06/msg00000.html)
  7. In the Tika source directory, ran D:\Path-To-Maven\Bin\mvn.bat clean install. Brewed a coffee while Maven downloaded and did its thing; it took about 5 minutes.
  8. Copied the built tika-app-SNAPSHOT.jar from D:\Path-to-Tika-Source\tika-app\target to the Drupal module path apachesolr_attachments\tika.

I'm sure this would be a fraction of the difficulty if I were using a real OS, but I'm not so fortunate for that where I work.

Hope this helps someone else down the road.

pwolanin’s picture

Hmm, perhaps tika 0.4 added a new master application jar. Silly that Solr is not shipping it - but likely it's not needed for content extraction.

timatlee’s picture

Hum, does that mean that, in the long run, Tika will not be required?

pwolanin’s picture

Well, the long-run plan was to use tika via Solr (or at least have that as an option), but it would be nice to continue to have a local tika extraction option.

timatlee’s picture

Hmm, using Tika via Solr would be exceptional. I had a considerable amount of grief getting Tika to run properly when using IIS.... A lot of problems with quoting paths and such.

Any information out there on using tika via solr that I could experiment with?

Thanks,

pwolanin’s picture

I had the same experience with tika 0.4 - needed to build from source to get the app jar. PITA.

pwolanin’s picture

Status: Active » Needs review
File: new, 1.77 KB

Here's an update to the README

pwolanin’s picture

Status: Needs review » Fixed

Committed that patch + one more line.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

very_random_man’s picture

I've followed these instructions but unfortunately I'm still having a bit of related Tika trouble.

I've built Tika with Maven and it works fine from the command line (Mac OS X). However, when cron runs, the attachments module calls shell_exec to extract the text and gets no response. This error turns up in the Apache error log:

Exception in thread "main" java.lang.InternalError: Can't connect to window server - not enough permissions.
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1822)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1723)
        at java.lang.Runtime.loadLibrary0(Runtime.java:822)
        at java.lang.System.loadLibrary(System.java:993)
        at sun.security.action.LoadLibraryAction.run(LoadLibraryAction.java:50)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.awt.Toolkit.loadLibraries(Toolkit.java:1509)
        at java.awt.Toolkit.<clinit>(Toolkit.java:1530)
        at java.awt.Color.<clinit>(Color.java:250)
        at org.apache.pdfbox.pdmodel.graphics.color.PDColorState.<clinit>(PDColorState.java:48)
        at org.apache.pdfbox.pdmodel.graphics.PDGraphicsState.<init>(PDGraphicsState.java:40)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:182)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:367)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:291)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:247)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:180)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)

The problem appears to be that there aren't sufficient permissions to spawn the GUI. I notice that even when running via the command line, the GUI is still appearing very briefly.

Is there a way to suppress this GUI thing? Also, is this likely to only be a problem on my Mac environment? Would a linux server deal with this differently? Presumably anyone who has got this working must have worked around this.
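One thing that may be worth trying, offered as an assumption rather than something confirmed in this thread: the stack trace shows PDFBox loading java.awt classes, and the JVM has a standard `java.awt.headless` system property that tells AWT not to connect to a display. A minimal sketch, where the jar name and the `-t` (plain text) option are assumptions to adjust for your build:

```shell
# Hypothetical default; point TIKA_JAR at your own tika-app jar.
TIKA_JAR="${TIKA_JAR:-tika-app-0.5.jar}"

# Run Tika with AWT in headless mode so no window server is needed.
# The -t option is assumed to select plain-text output.
extract_text() {
  java -Djava.awt.headless=true -jar "$TIKA_JAR" -t "$1"
}
```

Running `extract_text some.pdf` by hand as the web server user would show whether the InternalError goes away before changing anything in the module.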

Any hints will be most appreciated! :-)

very_random_man’s picture

Status: Closed (fixed) » Active

pwolanin’s picture

Perhaps you are using the wrong jar?

I've used tika 0.4 with no problem on Mac OS 10.5.

very_random_man’s picture

I was using tika-app-0.5.jar. I've just downloaded and installed 0.4 and it works fine.

Also, when I run 0.4 from the command line I don't see any programs pop up on my dock, so my hunch is that 0.5 is running some kind of GUI thing by mistake, which shell_exec doesn't have permission to do.

Thanks for the module btw. Works like a charm now!

pwolanin’s picture

Status: Active » Fixed

hmm, 0.5 works ok for me to index some files with this module - built from a checkout of http://svn.apache.org/repos/asf/lucene/tika/tags/0.5

so, not sure what the issue is for you, but marking fixed in the absence of more info.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

lukus’s picture

@timatlee;

Your walkthrough worked for me, thanks.

But I needed to download the source using

svn checkout http://svn.apache.org/repos/asf/tika/tags/0.5

Yaron Tal’s picture

On Ubuntu:

svn checkout http://svn.apache.org/repos/asf/tika/tags/0.5 tika
sudo apt-get install maven2
cd tika
sudo mvn install

The last command failed on the first and second runs, but the third time it finished and it seems to work.

In Drupal I added the tika root dir (the checkout dir) as the Tika directory path and tika-app/target/tika-app-0.5.jar as the Tika jar file.

Searching in drupal gives me content from within pdf files now.

johnennew’s picture

Thanks Yaron,

Just a note that I managed to get tika version 0.9 working using these instructions in Drupal 7. I had some PDFs which earlier versions of Tika were not extracting contents from.

Download the 0.9 src code from the Apache tika website: http://tika.apache.org/download.html

On ubuntu you can then:

unzip apache-tika-0.9-src.zip
cd apache-tika-0.9-src
sudo apt-get install maven2
sudo mvn install
sudo mkdir /usr/local/share/tika
sudo cp tika-app/target/tika-app-0.9.jar /usr/local/share/tika

In the attachments setting screen I set the directory path as /usr/local/share/tika and the Tika jar file as tika-app-0.9.jar

After manually running cron and waiting 5 minutes, the pdf contents appeared in the search results

akshita’s picture

Version: 6.x-2.x-dev » 7.x-1.0
Priority: Normal » Critical
Status: Closed (fixed) » Active

Hi John

I followed exactly what you did but had no luck. Please help.

cd /var
unzip apache-tika-0.9-src.zip
cd apache-tika-0.9
sudo apt-get install maven2
sudo mvn install
sudo mkdir /usr/local/share/tika
sudo cp tika-app/target/tika-app-0.9.jar /usr/local/share/tika

root@xxxxx:/var# cp tika-app/target/tika-app-0.9.jar /usr/local/share/tika
cp: cannot stat `tika-app/target/tika-app-0.9.jar': No such file or directory
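Note that the prompt shows the cp being run from /var, while tika-app/target/... is a path relative to the unpacked source directory. A small sketch for locating the built jar wherever it ended up, assuming the Maven build itself succeeded (the function name is only illustrative):

```shell
# List any built tika-app jars below the given directory
# (defaults to the current directory).
find_tika_jar() {
  find "${1:-.}" -type f -name 'tika-app-*.jar' 2>/dev/null
}
```

For example, `find_tika_jar /var` prints the full path to copy from. If it prints nothing, the build probably did not produce the jar.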

Thanks
Revathi

nick_vh’s picture

Status: Active » Fixed

Closing because it's an old issue with no response. Also, most of the people have gotten it to work so it should work

pwolanin’s picture

Status: Fixed » Closed (fixed)

dahousecat’s picture

I had to add these 2 lines to settings.php to get it to work:

$conf['apachesolr_attachments_tika_path'] = '/var/www/sites/all/modules/thirdparty/apachesolr_attachments/tika/';
$conf['apachesolr_attachments_tika_jar'] = 'tika-app-1.4.jar';

I don't see how these variables were meant to be set, and there's nothing in the instructions about it?!?

glenshewchuck’s picture

I don't know about early versions of this module, but as of version 7.x-1.4 the $conf settings can be found at /admin/config/search/apachesolr/attachments