I have been using Apache Solr Search together with Apache Solr Attachments for many days without problems, but as of today
every search attempt produces the following:
warning: strtotime() expects parameter 1 to be string, array given in /Library/WebServer/Drupal/drupal-6.22/sites/all/modules/apachesolr/apachesolr_search.module on line 414.
warning: strtotime() expects parameter 1 to be string, array given in /Library/WebServer/Drupal/drupal-6.22/sites/all/modules/apachesolr/apachesolr_search.module on line 415.
warning: Illegal offset type in isset or empty in /Library/WebServer/Drupal/drupal-6.22/includes/path.inc on line 64.
warning: Illegal offset type in /Library/WebServer/Drupal/drupal-6.22/includes/path.inc on line 69.
warning: htmlspecialchars_decode() expects parameter 1 to be string, array given in /Library/WebServer/Drupal/drupal-6.22/sites/all/modules/apachesolr/apachesolr_search.module on line 432.
warning: mb_strlen() expects parameter 1 to be string, array given in /Library/WebServer/Drupal/drupal-6.22/includes/unicode.inc on line 409.
warning: htmlspecialchars() expects parameter 1 to be string, array given in /Library/WebServer/Drupal/drupal-6.22/includes/bootstrap.inc on line 856.
The relevant lines in apachesolr_search.module are:
414: $doc->created = strtotime($doc->created);
415: $doc->changed = strtotime($doc->changed);
432: 'title' => htmlspecialchars_decode($doc->title, ENT_QUOTES),
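The "array given" warnings suggest that $doc->created, $doc->changed and $doc->title are arriving as arrays rather than strings by the time lines 414, 415 and 432 run. A minimal defensive sketch follows, assuming the first element of such an array is the value wanted. The helper name example_first_scalar() and the sample document are hypothetical and not part of the apachesolr module; this only illustrates a guard and does not explain why the fields become arrays.

```php
<?php
/**
 * Hypothetical helper (not part of apachesolr): return a single value
 * even when a result field has come back as an array.
 */
function example_first_scalar($value, $default = '') {
  if (is_array($value)) {
    // Use the first element, or the default if the array is empty.
    return count($value) ? reset($value) : $default;
  }
  return isset($value) ? $value : $default;
}

// Stand-in for a search result document; two fields arrive as arrays.
$doc = new stdClass();
$doc->created = array('2011-10-24T09:30:00Z');
$doc->changed = '2011-10-25T10:00:00Z';
$doc->title = array('Atoms &amp; molecules');

// The same calls as on lines 414, 415 and 432, but guarded.
$created = strtotime(example_first_scalar($doc->created));
$changed = strtotime(example_first_scalar($doc->changed));
$title = htmlspecialchars_decode(example_first_scalar($doc->title), ENT_QUOTES);
```

With a guard like this the string functions always receive a string, whether the field is single-valued or an array.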
Today I had been experimenting with different schema.xml files (see #1319516: How can i enforce whole word matching only (disable partial word and related word matching)), which may or may not be implicated.
After including the textgen field type (from the main Solr Search example) in the Drupal schema.xml, and substituting
all occurrences of type="text" with type="textgen", I reran the system with a clean Solr server start, cleared both
the main Solr index and the Attachments index, and used 'Delete the index' as well as 'Re-index'.
This seemed to work fine, although I was not happy with the hits it was giving me on words split across lines (it was giving hits on 'requirement' for a search on 'require'), so I decided to tune the schema.xml by adjusting the generateWordParts parameter:
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
I carefully reran, requesting deletion and reindexing of the entire search system. But soon after, the above warning messages started occurring on every search (note that the search results are still served and make sense, for both content hits and attachment hits).
Then I tried reverting the schema to generateWordParts="1" (as it is in the original schema.xml) and the error still occurred, so I don't think that parameter is the cause.
I have tried deleting the attachments cache also, as well as the main Drupal cache. Problem still occurs on every search.
I have nothing at all otherwise in the Drupal module system.
Until playing with the schema.xml I had indexed the entire site of over 20000 nodes with 800 file attachments without any problems at all.
Now as soon as any indexing has taken place the error above occurs.
Very glad for help,
Webel
Comments
Comment #1
webel commented
Update: I reverted to the schema.xml as distributed with Solr Search Integration, restarted with deletion and reindexing, and the problem vanished. It would seem to have something to do with the replacement of the type 'text' with the borrowed type 'textgen'.
From original schema.xml:
From schema.xml from Apache Solr Integration:
Apart from the textgen type not having the stemmer, and not removing duplicates, the only difference seems to be that it has splitOnCaseChange="0" and nothing about preserveOriginal.
I realise this involves the schema.xml and knowledge of Solr, but I am still treating this as a Drupal Apache Solr Integration bug, since I can't see how these settings should be able to provoke the error I am observing.
I would be most grateful for specific help on this (and I am certain there are other Solr Integration users who would like a functioning example of how to switch off stemming, which example I could provide if we can overcome this problem).
Webel
Comment #2
webel commented
PS: And textgen is missing the solr.MappingCharFilterFactory.
I will try another approach: working from the 'text' type and commenting out the solr.SnowballPorterFilterFactory bits etc., while preserving the other bits.
Comment #3
webel commented
EDIT: POSTSCRIPT: THE "SOLUTION" REFERENCED HERE COULD NOT BE REPRODUCED, SEE LATER COMMENTS
Ok I seem to have got it working, as reported as a solution to my own support request at http://drupal.org/node/1319516#comment-5158076 (#1319516: How can i enforce whole word matching only (disable partial word and related word matching))
I report this success cautiously, as I have not indexed all of my site, and it will take some days to do so.
So the culprit would seem to be either:
or:
That is to say, they seem to be needed to avoid the error I encountered.
May I ask the maintainer(s) to please consider this before closing? I would like to understand it, and I still think it can fairly be called a bug or issue.
Webel
Comment #4
webel commented
I spoke too soon. I made some further changes to the schema.xml (to try to deal with matches on hyphenated, line-broken words) and it's back:
Note the change I made to catenateWords in the analyzer type="query", switching it from 0 to 1, to match the analyzer type="index".
The reason I am doing this (trying it rather blindly) is that I am trying to prevent hits on words broken across a line. For example, 'procure' keeps yielding hits on 'procure-ment' when the latter is spread across 2 lines and hyphenated. Sounds rather pedantic, I am sure, but it is what my client wants (to prevent).
I've also switched off generateNumberParts in both analyser elements.
I will revert back to the adapted schema.xml I posted at: http://drupal.org/node/1319516#comment-5158076 and see whether the problem vanishes.
Comment #5
webel commented
Well, this is fairly driving me nuts.
The very schema.xml 'text' adaptation that I posted as definitely working (and which I let run, indexing lots of nodes and file attachments) is now giving the original error. The results I am reporting here are clearly inconsistent, and I can't draw clear conclusions about which parameters may be causing the problem.
I am being extremely careful each time I run a new schema.xml:
- I stop the cron job.
- I stop the Solr server, and then rerun it.
- I go to the Solr admin Search index tab and I select both 'Re-index all content' and 'Delete all content' (although I probably only need to delete).
- I go to Solr admin File attachments tab and I select both 'Delete files from index' and 're-index all files' (probably redundant).
- I do not always clear the attachments extracted text cache, as I can't see how this could possibly be responsible.
- I then restart cron, or I just use manual cron invokes.
Sometimes I get the initially reported error as soon as at least a few search hits are available; sometimes I do not.
The process is seemingly erratic, and isolating the factors is very time consuming.
I managed earlier today to get the "whole word" schema.xml snippet for 'text' at http://drupal.org/node/1319516#comment-5158076 to run ok without the error, indexing lots of nodes and file attachments, no worries. But then I played with the schema.xml and since then I can't restore my success. Something is affecting the state, something somewhere gets, for want of a better expression, "stuck".
Comment #6
nick_vh
Interesting topic, please keep us updated!
Comment #7
webel commented
I reverted once again to the original schema.xml distributed with Solr Search Integration and the problem definitely did not occur. I indexed 100s of nodes and file attachments, no worries.
Then I introduced again this minimal change. Every attribute is the same as in the Drupal schema.xml, except I have commented out the stemmer part:
Note that not even the generateNumberParts or catenateWords attributes (that I had previously been playing with) are changed. And once again, after a completely clean restart (see above), it fails, depending on the word.
For example, it fails for a search on 'atoms' but not for a search on 'atom' (and both give hits), where "Did you mean:" offered 'atoms'.
It fails for a search on 'procure', which offers lots of file attachment hits.
For every file attachment hit displayed it gives exactly once:
I would be most grateful if somebody could try out the "whole word" schema.xml snippet above and see whether they can reproduce this bug.
Comment #8
webel commented
@Nick_vh and maintainers.
I have spent all day on this, and there is nothing more I can do myself to overcome this problem.
Could you please see whether you can get it to run with the stemmer algorithm commented out,
and please try to figure out what could be causing the warning message chain.
It is clearly a genuine bug, and I am very much in need of your help.
Webel
Comment #9
webel commented
KNOWN FACT: the problem has not arisen once with the original schema.xml distributed with Apache Solr Search Integration, only with it slightly adapted, such as commenting out the stemmer in the 'text' field type.
Comment #10
nick_vh
Hi webel. Thanks for your very descriptive and thorough explanation. However I'd like you to go one step further and also provide a patch that can actually prevent the errors from popping up. This would make the code more robust and allow for further customizability.
Do you think you could make it?
Comment #11
webel commented
When the problem is present it gives hits on both regular content and attachments.
But it does not display the node link for the hits, nor the hit snippet text.
For File attachment type it does provide the mime type download link.
I therefore think it is not due to Attachments module.
Comment #12
nick_vh
Changed the title
Comment #13
webel commented
@Nick_vh Firstly, thanks for your replies and for your work on this module.
You wrote:
It's not really an explanation, it's a description of the circumstances under which the problem arises.
I have no insight at all into what actually causes the problem, and it is extremely difficult to get diagnostics on it (unless I start debugging from cron downwards). I now need help from the authors of the code, who might figure out what is causing the errors as reported.
No. I need the help of those already familiar with the code.
Webel
Comment #14
webel commented
Changed title from "Notices in apachesolr_search .." to "Errors in apachesolr_search .."
When the "warnings" appear, the search results no longer display the node title links or the snippet portion with the highlighted hit(s).
Comment #15
nick_vh
@webel, I am a bit unclear on what we can do to fix it. Could you summarize our options here in text? I'll try to discuss this with pwolanin and we'll come back to you. Currently I'm marking this as postponed because it is not on our current roadmap to support this level of customization.
Comment #16
nick_vh
Comment #17
webel commented
@Nick_vh
I appreciate your continued feedback.
I do not understand what you mean by your "options", and I do not know how I could possibly make the situation clearer.
When the stemmer part of the schema.xml is commented out errors arise as given above.
Your main option would be to find out why _simplifying_ the schema breaks the module.
This would involve testing on a site of your own, if possible with Attachments: restart Solr with the Porter stemmer part commented out, reindex the site, and see whether you can reproduce the error.
You could also see whether you can figure out how the errors could arise in code, i.e. why I get:
It's a very complex chain of events from search indexing through the search that could give rise to this, and it is extremely difficult for somebody not intimate with the code to diagnose it, no matter how capable or fluent in PHP and Drupal that outsider may be.
This is not a high level of customisation, it is a very basic level of customisation, namely just switching off the stemmer to facilitate whole word search (as required by my client), while maintaining other benefits of Solr search such as integration with Solr Attachments, content type filters etc.
Webel
Comment #18
webel commented
Please unpostpone, and please help by diagnosing how the errors could be caused by the code.
Comment #19
nick_vh
@webel
Since you are using Drupal 6 there is not much I personally can do for you in the short term. We are planning to refactor the Drupal 6 release so it is in line with the Drupal 7 version. Changing the schema is not a small modification, so the help we can give you is minimal. I'd advise you to hire a professional who can help you sort out the problems and hopefully return with a patch file.
Comment #20
nick_vh
Suggestion:
Index the content and modify the qf parameters to use a different field:
So in the update index hook you add it to the document
And in the query alter you add a qf param and remove the ones you do not want
$query->addParam('qf', 'tus_content^40');
This way you don't need to alter the schema for unstemmed results
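The suggestion above can be sketched roughly as follows. This is an illustration only: the Example* classes are hypothetical stand-ins so that the snippet runs on its own, the two mymodule_example_* functions and their signatures are assumptions standing in for the module's index and query-alter hooks (check them against the API of the installed apachesolr version), and tus_content is assumed to map to an unstemmed text field in the schema.

```php
<?php
// Hypothetical stand-in for the Solr document object handed to the index hook.
class ExampleDocument {
  public $fields = array();
  public function setField($name, $value) {
    $this->fields[$name] = $value;
  }
}

// Hypothetical stand-in for the query object. Only addParam() appears in
// the thread; removeParam() is an assumption made for this sketch.
class ExampleQuery {
  public $params = array();
  public function addParam($name, $value) {
    $this->params[$name][] = $value;
  }
  public function removeParam($name) {
    unset($this->params[$name]);
  }
}

// Index time: copy the node body into a separate, unstemmed field.
function mymodule_example_update_index($document, $node) {
  $document->setField('tus_content', strip_tags($node->body));
}

// Query time: drop the existing query fields and search only the unstemmed one.
function mymodule_example_query_alter($query) {
  $query->removeParam('qf');
  $query->addParam('qf', 'tus_content^40');
}

// Exercise both steps with sample data.
$node = new stdClass();
$node->body = '<p>The procurement of atoms</p>';
$document = new ExampleDocument();
mymodule_example_update_index($document, $node);

$query = new ExampleQuery();
$query->addParam('qf', 'body^40');
mymodule_example_query_alter($query);
```

The point of the design is that the stemmed 'text' field type in schema.xml stays untouched; whole-word matching comes from choosing which field the query searches.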
Comment #21
cpliakas commented
Oooooh. The approach in #1320574-20: Errors in apachesolr_search when changing schema.xml to only support whole words is really slick! I like it. Kind of like a dynamic protwords.txt file!