Thanks for this great module. It works great except for one thing. When a user searches on a word in an attached PDF file, the node that it belongs to is found (as expected). However, the snippet (excerpt) returns nothing. So the user has no idea where to look for the given search term in the result.

Can you add the excerpt from the attached file and add it to the parent entity?

Thanks a lot!


Comments

BarisW’s picture

Status: Active » Needs review
FileSize
1.06 KB

I believe it can be as simple as this. What's still missing is information about which file the snippet was found in. Maybe this would work (haven't tested it yet).

Instead of:

<?php
if (isset($response->extracted)) {
  $extraction = $response->extracted;
}
?>

use this

<?php
if (isset($response->extracted)) {
  $extraction = $file['filename'] . ': ' . $response->extracted;
}
?>
izus’s picture

Status: Needs review » Fixed

Hi,
I do get the extract with the latest code base using Tika. Maybe the issue was fixed in the meantime!
I committed the hl parameter as it may be interesting to have.
Thanks

tostinni’s picture

Status: Fixed » Needs work

Hi izus,
I'm still unable to get the excerpt using the latest dev and Tika 1.4.
I reindexed all the PDFs by clearing the cache_search_api_attachments table and confirming with the "top" command that Tika is running, but I still can't see the excerpt in Views.
Do you have some additional configuration to share?

Edit:
I noticed that the latest dev doesn't update the cache_search_api_attachments table, so I think something isn't right with this version.

izus’s picture

Hmm, I wonder how I considered this working in my last test, maybe I didn't drink enough coffee... I just tested it again and can't get the excerpt.
This definitely needs a patch.
Sorry for the confusion.

rovo’s picture

Hello, just checking in to see if there has been any new development with this? I'm running into the same issue where I can't get the snippet to show in the search results for terms searched that are in the PDF attachment.

izus’s picture

Hello rovo,
There is no commit doing this yet that I'm aware of.
Patches are welcome for this if anyone can contribute.
++

rovo’s picture

Hi Izus,

If you don't mind, I'd like to double-check that I understand the issue correctly and haven't just misconfigured something: is it the case that Search API Attachments does not create a snippet of text from what it finds in the PDF attachment for display on the search results page? Or is Search API Attachments able to do this, and I've just misconfigured my setup?

Greatly appreciate your insight.

izus’s picture

it doesn't do it yet and i'd love to review a patch for this and have it in :)

rovo’s picture

I'm looking into it, but I started thinking maybe I'm not making the best use of this module (or have it misconfigured). Since this module isn't returning a snippet for the search results, I'm wondering how others are making use of it. What do you have showing for a search result that matches? In my case, I've found that it returns the Summary of the node.

maximpodorov’s picture

In my case, Solr server returns excerpts for file attachments (and for other fields also). This requires the following query rewriting:

function MYMODULE_search_api_solr_query_alter(array &$call_args, SearchApiQueryInterface $query) {
  if (isset($call_args['params']['hl']) && ($call_args['params']['hl'] === 'true')) {
    $call_args['params']['hl.fl'] = '*'; // The very essence of the trick.
    $call_args['params']['hl.requireFieldMatch'] = 'true';
    $call_args['params']['hl.fragsize'] = 400;
    $call_args['params']['hl.maxAnalyzedChars'] = 300000;
  }
}
rovo’s picture

Max, this looks great. I did find that Solr would already return a preview snippet of the attached PDF. I was trying to bypass Solr and rely only on the Search API, Search API Attachments, Database search, and Search pages modules. I'm trying to avoid Solr because it has a default limit on the number of tokens indexed from the PDF, and on the host provider I'm using I can't edit solrconfig.xml to increase it. I've found that Search API, Search API Attachments, and Database search do index the entire PDF on their own without Solr; they just don't provide a contextual highlighted preview snippet for the search results page like Solr does.
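
For what it's worth, here is a rough sketch of how a contextual snippet could be built from already-extracted text without Solr. This is only an illustration: `example_build_excerpt()` and its `$radius` parameter are hypothetical names, not part of Search API, Search API Attachments or Database search, and it highlights a single occurrence of a single keyword.

```php
<?php
/**
 * Returns a short HTML excerpt around one occurrence of $keyword in $text.
 *
 * Hypothetical helper for illustration; not part of any module named here.
 */
function example_build_excerpt($text, $keyword, $radius = 60) {
  if ($keyword === '') {
    return '';
  }
  // Collapse the whitespace runs that PDF text extraction tends to leave.
  $text = trim((string) preg_replace('/\s+/u', ' ', $text));
  // Capture up to $radius characters on each side of a case-insensitive match.
  $range = '{0,' . max(0, (int) $radius) . '}';
  $pattern = '/(.' . $range . ')(' . preg_quote($keyword, '/') . ')(.' . $range . ')/isu';
  if (!preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE)) {
    return '';
  }
  // Offsets are in bytes, which is enough to tell whether text was cut off.
  $start = $m[0][1];
  $end = $start + strlen($m[0][0]);
  $excerpt = htmlspecialchars($m[1][0])
    . '<strong>' . htmlspecialchars($m[2][0]) . '</strong>'
    . htmlspecialchars($m[3][0]);
  return ($start > 0 ? '... ' : '') . $excerpt . ($end < strlen($text) ? ' ...' : '');
}
```

Called as `example_build_excerpt($extracted_text, 'fox', 6)`, it returns the matched word wrapped in `<strong>` with a few characters of context on each side, or an empty string when the keyword does not occur.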

Anonymous’s picture

I also get no excerpts for file attachments (PDF) with Solr.

izus’s picture

hi fku,

Did you test what #10 suggests?
If this is really what we need, we can maybe make the values configurable and add a patch for this.
Or, I don't know if someone has tried to do this outside of Solr, so that it is more general and can fit #11 too.
++
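
As a rough sketch of what "configurable values" could look like: the helper below merges optional overrides into the highlighting parameters from #10. `example_highlight_params()` and the `$settings` array are hypothetical names used only for illustration; the defaults are the values given in #10.

```php
<?php
/**
 * Merges highlighting overrides into Solr request parameters.
 *
 * Hypothetical helper; the default values are the ones from #10.
 */
function example_highlight_params(array $params, array $settings = array()) {
  // Only touch requests that already ask for highlighting, as #10 does.
  if (!isset($params['hl']) || $params['hl'] !== 'true') {
    return $params;
  }
  $defaults = array(
    'hl.fl' => '*',
    'hl.requireFieldMatch' => 'true',
    'hl.fragsize' => 400,
    'hl.maxAnalyzedChars' => 300000,
  );
  // Keep only known keys, so a stray setting cannot add other parameters.
  $overrides = array_intersect_key($settings, $defaults);
  return array_merge($params, $defaults, $overrides);
}
```

Inside the hook from #10, `$call_args['params']` could then be replaced by the return value of this helper, with the overrides read from wherever the settings end up being stored (for example via `variable_get()` with a variable name of the module's choosing).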

Anonymous’s picture

hi izus,

I'm not a Drupal programmer. I copied the code if (isset ... } to search_api_solr\includes\service.inc at line 1692 (version 7.x-1.x-dev 2014-05-12), into the protected function preQuery. No excerpts.

izus’s picture

hi,
try adding it to the .module file and replacing MYMODULE with your module name.
++

Anonymous’s picture

I don't understand. I don't have a module of my own. Should I do that in search_api_solr.module or in search_api_attachments.module?

izus’s picture

you can just test it like this and if it does the job, you can delete your tests and we can suggest a patch :)

Anonymous’s picture

I have added the function from #10 at the end of search_api_attachments.module:

function search_api_attachments_search_api_solr_query_alter(array &$call_args, SearchApiQueryInterface $query) {
  if (isset($call_args['params']['hl']) && ($call_args['params']['hl'] === 'true')) {
    $call_args['params']['hl.fl'] = '*'; // The very essence of the trick.
    $call_args['params']['hl.requireFieldMatch'] = 'true';
    $call_args['params']['hl.fragsize'] = 400;
    $call_args['params']['hl.maxAnalyzedChars'] = 300000;
  }
}

- clear all indexed data
- flush all caches
- index
- search with my view

No excerpts.

My dev Environment:
Win 8.1 64-bit, xampp 1.8.0, JRE 1.7, Tomcat 7.0.28, Apache Solr 4.8.0, search_api 7.x-1.x-dev 2014-05-12, search_api_solr 7.x-1.x-dev 2014-05-12, search_api_attachments 7.x-1.x-dev 2014-03-02, Tika 1.5.
The Drupal installation is a copy of an online website with 3138 PDF files of 30 KB - 200 KB and 64 PDF files of 4 MB - 12 MB, plus 299 nodes without files attached. The online installation has old versions of the search_api* modules. I would like to update and test this in the dev environment.

Robert_W’s picture

The code in #10 doesn't work with the latest Search API and Search API Attachments when using database search. My documents get indexed, as I can find text in the document, but the match is not displayed as a snippet. Doh, the code in #10 isn't supposed to work with database search.

  • izus committed dc866a4 on 8.x-1.x authored by BarisW
    Issue #2134163 by BarisW: Add the snippet to the search result.
    
izus’s picture

Marked #2068805: Multiple file field or multivalued file field : how to find which file contains which text ? as a duplicate of this, so the excerpt should somehow include the name of the file that contains the keywords we searched for.
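
A rough sketch of how the file name could be attached to the matching text when a field holds several files. Everything here is an assumption for illustration: `example_label_matches()` is a hypothetical helper, and `$extracted_by_file` is assumed to map each file name to its already-extracted text.

```php
<?php
/**
 * Returns "filename: context" strings for every file whose text has $keyword.
 *
 * Hypothetical helper; $extracted_by_file maps file names to extracted text.
 */
function example_label_matches(array $extracted_by_file, $keyword, $radius = 40) {
  $labels = array();
  if ($keyword === '') {
    return $labels;
  }
  // Match the keyword plus up to $radius characters of context on each side.
  $range = '{0,' . max(0, (int) $radius) . '}';
  $pattern = '/.' . $range . preg_quote($keyword, '/') . '.' . $range . '/isu';
  foreach ($extracted_by_file as $filename => $text) {
    if (preg_match($pattern, $text, $m)) {
      // Same "filename: text" shape as the snippet at the top of this thread.
      $labels[] = $filename . ': ' . trim($m[0]);
    }
  }
  return $labels;
}
```

Files that do not contain the keyword are simply left out, so the result also answers which file a hit came from.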

c3rberus’s picture

First off, this module is much needed, so thank you for maintaining it.

We're looking for this feature as well. It would be great to be able to return an excerpt of text from the indexed document in the search result, as this alone can answer the user's search query without them having to dig deeper into the document.

We're using Tika because it is pretty straightforward to set up (you only need a single file), and we're trying to avoid Solr due to its complexity.

Is this possible with Tika or only with Solr? I have this module using Tika but I don't get any search excerpt returned in my search result.

Is there any possibility of getting this added? It would make a world of difference to be able to stick to search_api, search_db, views, tika and search_api_attachments.

natew’s picture

I ran into the same issue with Search API Solr and attachments. The solution in #10 improved the excerpts with highlighting; however, highlighting of the search terms is still somewhat spotty with some PDFs.

c3rberus’s picture

Can #10 be used with search_api_attachments and the Tika backend, or is that not supported?

maximpodorov’s picture

I use #10 for search_api_attachments and tika to show excerpts.

Grimreaper’s picture

Hello,

What needs to be done to close this issue?

I think there is a duplication with #2503743: Enhance highlight support by increasing maxAnalyzedChars parameter for excerpts.

izus’s picture