I'm not sure if it's a bug or if I am doing something wrong, but when I search for a word that appears only once in a node, I end up with that line duplicated in the search result. For example:
Avdeling for teologi
Velkommen til en prat om studier i teologi! Avdeling for teologi
Velkommen til en prat om studier i teologi! Avdeling for teologi
I tracked this down to the fact that Solr is being told to create snippets for both content and i18n_content_nb. These snippets are identical, so when they are merged together the text is duplicated. Any help on this would be great, because at the moment I have to intercept the query and remove content from the hl.fl parameter.
Comment | File | Size | Author |
---|---|---|---|
#12 | 2217753.patch | 1.24 KB | mkalkbrenner |
#10 | vdump2.png | 59.64 KB | Daemon_Byte |
#8 | vdump1.png | 37.22 KB | Daemon_Byte |
#6 | modules.png | 108.19 KB | Daemon_Byte |
Comments
Comment #1
mkalkbrenner: Are you talking about the search result?
I guess that you didn't configure the field bias correctly at /admin/config/search/apachesolr/settings/YOUR_INDEX/bias
The language unspecific "The full, rendered content" needs to be set to "Omit".
As stated in the descriptions of the fields you should do that for all language unspecific fields:
"Unspecified language: recommendation is set this bias to Omit"
If you never saved such a configuration, you should see an error message at /admin/reports/status
Can you please verify that before you modify something?
Comment #2
Daemon_Byte: I didn't have an error message for that, but I already have that set to Omit.
Comment #3
mkalkbrenner: Please be more specific.
Are you talking about search results?
Are all language unspecific fields set to omit?
Can you please post the query string that is sent to solr?
Comment #4
Daemon_Byte: I didn't have any error message on the status page. I also have Omit set on all the non-language-specific content fields, including the h1, em, etc. tag fields. The query string is rather long because of facets, but from what I can see my problem is this: hl.fl=content%2Ci18n_content_nb%2Ci18n_ts_nb_comments, which should be hl.fl=i18n_content_nb
select?start=0&rows=10&fq=%28hash%3Afv5phl%20OR%20access__all%3A0%29&fq=%28ss_language%3Anb%20OR%20ss_language%3Aund%29&spellcheck=true&q=avdeling&fl=id%2Centity_id%2Centity_type%2Cbundle%2Cbundle_name%2Clabel%2Css_language%2Cis_comment_count%2Cds_created%2Cds_changed%2Cscore%2Cpath%2Curl%2Cis_uid%2Ctos_name%2Cteaser&mm=1&pf=content%5E2.0&ps=15&hl=true&hl.fl=content%2Ci18n_content_nb%2Ci18n_ts_nb_comments&hl.snippets=3&hl.mergeContigious=true&f.content.hl.alternateField=teaser&f.content.hl.maxAlternateFieldLength=256&spellcheck.q=avdeling&hl.simple.pre=%3Cmark%3E&hl.simple.post=%3C%2Fmark%3E&facet=true&facet.sort=count&facet.mincount=1&facet.field=%7B%21ex%3Dim_field_search_terms%7Dim_field_search_terms&facet.field=%7B%21ex%3Dbundle%7Dbundle&f.im_field_search_terms.facet.limit=50&f.im_field_search_terms.facet.mincount=1&facet.date=dm_field_date2&facet.date=dm_field_date&f.dm_field_date2.facet.date.start=2011-01-01T00%3A00%3A00Z%2FYEAR&f.dm_field_date2.facet.date.end=2014-01-01T00%3A00%3A00Z%2B1YEAR%2FYEAR&f.dm_field_date2.facet.date.gap=%2B1YEAR&f.dm_field_date2.facet.limit=-1&f.dm_field_date.facet.date.start=2012-01-01T00%3A00%3A00Z%2FYEAR&f.dm_field_date.facet.date.end=2014-01-01T00%3A00%3A00Z%2B1YEAR%2FYEAR&f.dm_field_date.facet.date.gap=%2B1YEAR&f.dm_field_date.facet.limit=-1&facet.query=ds_created%3A%5BNOW%2FDAY-7DAYS%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FMONTH-1MONTH%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FMONTH-1YEAR%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FHOUR-1HOUR%20TO%20NOW%2FDAY%2B1DAY%5D&facet.query=ds_created%3A%5BNOW%2FHOUR-24HOURS%20TO%20NOW%2FDAY%2B1DAY%5D&f.bundle.facet.limit=50&f.bundle.facet.mincount=1&qf=i18n_content_nb%5E40.0&qf=i18n_label_nb%5E5.0&qf=i18n_tags_nb_h1%5E5.0&qf=i18n_tags_nb_h2_h3%5E3.0&qf=i18n_tags_nb_h4_h5_h6%5E2.0&qf=i18n_tags_nb_inline%5E1.0&qf=i18n_taxonomy_names_nb%5E2.0&qf=i18n_tos_nb_name%5E3.0&qf=tm_vid_4_names%5E840&qf=tos_name_formatted%5E0.1&f.i18n_content_nb.hl.alternateField=i18n_teaser_nb&wt=json&json.nl=map
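For anyone reading along, the list of highlighted fields can be pulled out of a query string like this one by URL-decoding the hl.fl parameter. The following is a minimal Python sketch using only the standard library; the helper name `highlight_fields` is hypothetical, and the query string is a shortened excerpt of the one above.

```python
from urllib.parse import parse_qs

def highlight_fields(query_string):
    """Return the field names listed in the hl.fl parameter."""
    params = parse_qs(query_string)
    # parse_qs URL-decodes %2C to ",", so each value is a comma-separated list.
    values = params.get("hl.fl", [])
    return [name for value in values for name in value.split(",") if name]

# Shortened excerpt of the query string posted in this comment.
excerpt = (
    "q=avdeling&hl=true"
    "&hl.fl=content%2Ci18n_content_nb%2Ci18n_ts_nb_comments"
    "&hl.snippets=3"
)
print(highlight_fields(excerpt))
# prints ['content', 'i18n_content_nb', 'i18n_ts_nb_comments']
```

This shows that both the language-unspecific content field and the language-specific i18n_content_nb field are candidates for highlighting.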
Comment #5
mkalkbrenner: The query string looks good and it should work without modifications of hl.fl. These are just candidates and there's some code to pick the right one for the search result.
Earlier versions of that code contained an error and your issue sounds exactly like the one I described at https://drupal.org/node/1946132
Which version of the apachesolr module are you running?
My patch from
https://drupal.org/comment/7505331#comment-7505331
to avoid these duplicates has been accepted and was released with Apache Solr integration 7.x-1.3
Comment #6
Daemon_Byte: Sorry for the delay, but I am not able to access the site from outside the building. I'm using 1.6.
Comment #7
mkalkbrenner: OK, it seems like some debugging is required. Can you post some var_dump() output?
Have a look at function apachesolr_search_preprocess_apachesolr_search_snippets().
I'm interested in the values of $vars['snippets'] and $vars['flattened_snippets'] at the end of that function.
Comment #8
Daemon_Byte: vdump1 has a dpm() from the bottom of that function.
Comment #9
mkalkbrenner: The problem seems to be the "." at the beginning of i18n_content_nb. Therefore the strings are not identical and both search result snippets remain.
It seems that you've installed the devel module. Please open the devel tab of that node and post the output of the "Apache Solr" tab. I'm interested in the values of i18n_content_nb and content.
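To illustrate why a single extra character defeats the de-duplication, here is a small Python sketch. It is not the apachesolr module's code; both function names and the sample snippet with a leading "." are assumptions made for illustration only.

```python
def dedupe_exact(snippets):
    """Drop snippets that are exactly identical, keeping the first occurrence."""
    seen = set()
    result = []
    for snippet in snippets:
        if snippet not in seen:
            seen.add(snippet)
            result.append(snippet)
    return result

def dedupe_normalized(snippets):
    """Like dedupe_exact, but compare after stripping surrounding dots and whitespace."""
    seen = set()
    result = []
    for snippet in snippets:
        key = snippet.strip(" .\t\n")
        if key not in seen:
            seen.add(key)
            result.append(snippet)
    return result

# Hypothetical snippets: the same text, one with a leading ".".
from_content = "Velkommen til en prat om studier i teologi! Avdeling for teologi"
from_i18n = ". " + from_content

print(len(dedupe_exact([from_content, from_i18n])))       # prints 2
print(len(dedupe_normalized([from_content, from_i18n])))  # prints 1
```

An exact comparison keeps both snippets, which matches the duplicated output in the original report; a comparison on normalized text would keep only one.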
Comment #10
Daemon_Byte: I had to remove emails and company names, but here it is. They don't seem to have a dot at that point.
Comment #11
Daemon_Byte:
Fakta: Studieresepsjonen I studieresepsjonen i 3. etasje, som er åpen fra 10:00-14:00, kan du få svar på spørsmål i forhold til studiet du tar eller videre studier. Studieresepsjonen har ansvar for registrering, eksamensoppmelding og gir ut eksamensutskrift. Telefon: E: Studieveileder for de teologiske studieprogrammene er . Dersom du vil komme i kontakt med ham, kan du enten sende en e-post til adressen under, eller møte i studieresesepsjonen i 3. etasje for å avtale et tidspunkt for å treffe ham. har kontor i administrasjonsavdelingen i 3. etasje. Velkommen til en prat om studier i teologi! Avdeling for teologi
Fakta: Studieresepsjonen I studieresepsjonen i 3. etasje, som er åpen fra 10:00-14:00, kan du få svar på spørsmål i forhold til studiet du tar eller videre studier. Studieresepsjonen har ansvar for registrering, eksamensoppmelding og gir ut eksamensutskrift. Telefon: E: Studieveileder for de teologiske studieprogrammene er . Dersom du vil komme i kontakt med ham, kan du enten sende en e-post til adressen under, eller møte i studieresesepsjonen i 3. etasje for å avtale et tidspunkt for å treffe ham. Erlend har kontor i administrasjonsavdelingen i 3. etasje. Velkommen til en prat om studier i teologi! Avdeling for teologi
Comment #12
mkalkbrenner: I don't know exactly what's going on in your setup, and it's difficult to do more debugging this way.
Nevertheless I think that we should not rely on the de-duplication of apachesolr_search.module anymore which is designed for identical strings in fields like content and teaser which are of the same language.
I attached a patch that replaces the highlighting by language specific highlighting only. In case of CLIR we still have an issue, but this should have been the case in the previous version as well.
Can you test the patch?
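The idea of restricting highlighting to language-specific fields can be sketched as follows. This is a Python illustration, not the contents of the attached patch; the function name is hypothetical, and the rule that language-specific fields start with "i18n_" and contain the language code as an underscore-separated segment is an assumption based on the field names visible in the query string in comment #4.

```python
def language_specific_fields(fields, langcode):
    """Keep only fields that look language-specific for the given language code."""
    kept = []
    for name in fields:
        parts = name.split("_")
        # Assumed convention: "i18n" prefix, language code as a later segment.
        if parts[0] == "i18n" and langcode in parts[1:]:
            kept.append(name)
    return kept

candidates = ["content", "i18n_content_nb", "i18n_ts_nb_comments"]
print(",".join(language_specific_fields(candidates, "nb")))
# prints i18n_content_nb,i18n_ts_nb_comments
```

With the language-unspecific content field removed from the candidates, identical snippets from content and i18n_content_nb can no longer both reach the search result.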
Comment #13
Daemon_Byte: The patch seems to fix the issue and didn't cause any bugs that I saw.
Comment #14
mkalkbrenner: Committed to git.