I'm not sure whether this is a bug or I am doing something wrong, but when I search for a word that appears only once in a node, I end up with that line duplicated. For example:

Avdeling for teologi
Velkommen til en prat om studier i teologi! Avdeling for teologi
Velkommen til en prat om studier i teologi! Avdeling for teologi

I tracked this down to the fact that Solr is being told to create snippets for both content and i18n_content_nb. These snippets are identical, so when they are merged together the text is duplicated. Any help on this would be great, because at the moment I have to intercept the query and remove content from the hl.fl parameter.

Comments

mkalkbrenner’s picture

Category: Bug report » Support request

Are you talking about the search result?

I guess that you didn't configure the field bias correctly at /admin/config/search/apachesolr/settings/YOUR_INDEX/bias

The language-unspecific field "The full, rendered content" needs to be set to "Omit".
As stated in the field descriptions, you should do that for all language-unspecific fields:
"Unspecified language: recommendation is set this bias to Omit"

If you never saved such a configuration, you should see an error message at /admin/reports/status.
Can you please verify that before you modify anything?

Daemon_Byte’s picture

I didn't get an error message for that, but I already have that set to Omit.

mkalkbrenner’s picture

Please be more specific.
Are you talking about search results?
Are all language unspecific fields set to omit?
Can you please post the query string that is sent to solr?

Daemon_Byte’s picture

I didn't have any error message on the status page. I also have Omit set on everything that is not language-specific content, including all the h1, em, etc. fields. The query string is rather long because of the facets, but from what I can see my problem is this: hl.fl=content%2Ci18n_content_nb%2Ci18n_ts_nb_comments, which should be hl.fl=i18n_content_nb.

select?start=0&rows=10&fq=%28hash%3Afv5phl%20OR%20access__all%3A0%29&fq=%28ss_language%3Anb%20OR%20ss_language%3Aund%29&spellcheck=true&q=avdeling&fl=id%2Centity_id%2Centity_type%2Cbundle%2Cbundle_name%2Clabel%2Css_language%2Cis_comment_count%2Cds_created%2Cds_changed%2Cscore%2Cpath%2Curl%2Cis_uid%2Ctos_name%2Cteaser&mm=1&pf=content%5E2.0&ps=15&hl=true&hl.fl=content%2Ci18n_content_nb%2Ci18n_ts_nb_comments&hl.snippets=3&hl.mergeContigious=true&f.content.hl.alternateField=teaser&f.content.hl.maxAlternateFieldLength=256&spellcheck.q=avdeling&hl.simple.pre=%3Cmark%3E&hl.simple.post=%3C%2Fmark%3E&facet=true&facet.sort=count&facet.mincount=1&facet.field=%7B%21ex%3Dim_field_search_terms%7Dim_field_search_terms&facet.field=%7B%21ex%3Dbundle%7Dbundle&f.im_field_search_terms.facet.limit=50&f.im_field_search_terms.facet.mincount=1&facet.date=dm_field_date2&facet.date=dm_field_date&f.dm_field_date2.facet.date.start=2011-01-01T00%3A00%3A00Z%2FYEAR&f.dm_field_date2.facet.date.end=2014-01-01T00%3A00%3A00Z%2B1YEAR%2FYEAR&f.dm_field_date2.facet.date.gap=%2B1YEAR&f.dm_field_date2.facet.limit=-1&f.dm_field_date.facet.date.start=2012-01-01T00%3A00%3A00Z%2FYEAR&f.dm_field_date.facet.date.end=2014-01-01T00%3A00%3A00Z%2B1YEAR%2FYEAR&f.dm_field_date.facet.date.gap=%2B1YEAR&f.dm_field_date.facet.limit=-1&facet.query=ds_created%3A%5BNOW%2FDAY-7DAYS%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FMONTH-1MONTH%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FMONTH-1YEAR%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FHOUR-1HOUR%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FHOUR-24HOURS%20TO%20NOW%2FDAY%2B1DAY%5D&f.bundle.facet.limit=50&f.bundle.facet.mincount=1&qf=i18n_content_nb%5E40.0&qf=i18n_label_nb%5E5.0&qf=i18n_tags_nb_h1%5E5.0&qf=i18n_tags_nb_h2_h3%5E3.0&qf=i18n_tags_nb_h4_h5_h6%5E2.0&qf=i18n_tags_nb_inline%5E1.0&qf=i18n_taxonomy_names_nb%5E2.0&qf=i18n_tos_nb_name%5E3.0&qf=tm_vid_4_names%5E840&qf=tos_name_formatted%5E0.1&f.i18n_content_nb.hl.alternateField=i18n_teaser_nb&wt=json&json.nl=map
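
For anyone trying to read a query string like the one above, a small script can pull out just the highlighting fields. This is only a sketch for inspecting the request outside Drupal; the function name highlight_fields and the shortened sample string are made up for illustration and are not part of the apachesolr module.

```python
from urllib.parse import parse_qs


def highlight_fields(query_string):
    """Return the field names listed in the hl.fl parameter of a Solr query string."""
    # parse_qs percent-decodes the values, so %2C turns back into a comma.
    params = parse_qs(query_string)
    fields = []
    for value in params.get("hl.fl", []):
        fields.extend(name for name in value.split(",") if name)
    return fields


# Shortened excerpt of the query string posted above (assumption: only the
# highlighting-related parameters are kept to stay readable).
sample = (
    "q=avdeling&hl=true"
    "&hl.fl=content%2Ci18n_content_nb%2Ci18n_ts_nb_comments"
    "&hl.snippets=3"
)

print(highlight_fields(sample))
# ['content', 'i18n_content_nb', 'i18n_ts_nb_comments']
```

This makes it easy to see that both content and i18n_content_nb are requested as highlighting candidates.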

mkalkbrenner’s picture

The query string looks good and it should work without modifications of hl.fl. These are just candidates and there's some code to pick the right one for the search result.
Earlier versions of that code contained an error and your issue sounds exactly like the one I described at https://drupal.org/node/1946132

Which version of the apachesolr module are you running?

My patch from
https://drupal.org/comment/7505331#comment-7505331
to avoid these duplicates has been accepted and was released with Apache Solr integration 7.x-1.3

Daemon_Byte’s picture

Sorry for the delay but I am not able to access the site from outside the building. I'm using 1.6.

mkalkbrenner’s picture

OK, it seems like there is some debugging required. Can you post some var_dump() output?
Have a look at function apachesolr_search_preprocess_apachesolr_search_snippets().
I'm interested in the values of $vars['snippets'] and $vars['flattened_snippets'] at the end of that function.

Daemon_Byte’s picture

The attached vdump1 has the dpm() output from the bottom of that function.

mkalkbrenner’s picture

The problem seems to be the "." at the beginning of i18n_content_nb. Therefore the strings are not identical and both search result snippets remain.
It seems that you've installed the devel module. Please open the devel tab of that node and post the output of the "Apache Solr" tab. I'm interested in the values of i18n_content_nb and content.
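
To illustrate why a single leading "." is enough to defeat de-duplication that compares whole strings, here is a minimal sketch. It is not the code from apachesolr_search.module; the function names and the two sample snippets are hypothetical, loosely modelled on the example in the original report. The normalized variant is only one possible workaround and not what any patch in this issue does.

```python
def dedupe_exact(snippets):
    """Drop snippets that are character-for-character identical, keeping order."""
    seen = set()
    result = []
    for snippet in snippets:
        if snippet not in seen:
            seen.add(snippet)
            result.append(snippet)
    return result


def dedupe_normalized(snippets):
    """Like dedupe_exact, but ignore leading dots/spaces and runs of whitespace."""
    seen = set()
    result = []
    for snippet in snippets:
        # Comparison key only; the snippet itself is returned unchanged.
        key = " ".join(snippet.lstrip(" .").split())
        if key not in seen:
            seen.add(key)
            result.append(snippet)
    return result


# Hypothetical snippets: the second differs only by a leading ". ".
snippets = [
    "Velkommen til en prat om studier i teologi! <mark>Avdeling</mark> for teologi",
    ". Velkommen til en prat om studier i teologi! <mark>Avdeling</mark> for teologi",
]

print(len(dedupe_exact(snippets)))       # 2, both snippets survive
print(len(dedupe_normalized(snippets)))  # 1, the near-duplicate is dropped
```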

Daemon_Byte’s picture

I had to remove email addresses and company names, but here it is. The values don't seem to have a dot at that point.

Daemon_Byte’s picture

Fakta: Studieresepsjonen I studieresepsjonen i 3. etasje, som er åpen fra 10:00-14:00, kan du få svar på spørsmål i forhold til studiet du tar eller videre studier. Studieresepsjonen har ansvar for registrering, eksamensoppmelding og gir ut eksamensutskrift. Telefon: E: Studieveileder for de teologiske studieprogrammene er . Dersom du vil komme i kontakt med ham, kan du enten sende en e-post til adressen under, eller møte i studieresesepsjonen i 3. etasje for å avtale et tidspunkt for å treffe ham. har kontor i administrasjonsavdelingen i 3. etasje. Velkommen til en prat om studier i teologi! Avdeling for teologi

Fakta: Studieresepsjonen I studieresepsjonen i 3. etasje, som er åpen fra 10:00-14:00, kan du få svar på spørsmål i forhold til studiet du tar eller videre studier. Studieresepsjonen har ansvar for registrering, eksamensoppmelding og gir ut eksamensutskrift. Telefon: E: Studieveileder for de teologiske studieprogrammene er . Dersom du vil komme i kontakt med ham, kan du enten sende en e-post til adressen under, eller møte i studieresesepsjonen i 3. etasje for å avtale et tidspunkt for å treffe ham. Erlend har kontor i administrasjonsavdelingen i 3. etasje. Velkommen til en prat om studier i teologi! Avdeling for teologi

mkalkbrenner’s picture

Assigned: Unassigned » mkalkbrenner
Category: Support request » Bug report
Status: Active » Needs review

I don't know exactly what's going on in your setup, and it's difficult to debug any further this way.
Nevertheless, I think we should no longer rely on the de-duplication in apachesolr_search.module, which is designed for identical strings in fields of the same language, such as content and teaser.

I attached a patch that restricts highlighting to language-specific fields only. In the case of CLIR (cross-language information retrieval) we still have an issue, but that should have been the case in the previous version as well.

Can you test the patch?
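
The patch itself is not reproduced in this thread, so the following is only a sketch of the general idea of limiting highlighting to language-specific fields. It assumes the naming convention visible in the query string above (an i18n_ prefix with the language code as a separate name part); the function name and parameters are hypothetical.

```python
def language_specific_fields(fields, language, prefix="i18n_"):
    """Keep only fields that look language-specific for the given language code."""
    marker = "_" + language
    kept = []
    for name in fields:
        if not name.startswith(prefix):
            # e.g. the language-unspecific "content" field
            continue
        # Assumption: the language code appears as its own name part,
        # either at the end (i18n_content_nb) or in the middle (i18n_ts_nb_comments).
        if name.endswith(marker) or (marker + "_") in name:
            kept.append(name)
    return kept


candidates = ["content", "i18n_content_nb", "i18n_ts_nb_comments"]
print(",".join(language_specific_fields(candidates, "nb")))
# i18n_content_nb,i18n_ts_nb_comments
```

With content removed from the candidate list, the identical snippet from the language-unspecific field can no longer show up next to the language-specific one.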

Daemon_Byte’s picture

The patch seems to fix the issue and didn't cause any bugs that I could see.

mkalkbrenner’s picture

Title: duplicate snippets » avoid duplicate snippets in search result
Status: Needs review » Fixed

Committed to git.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.