Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
I have installed the Module Search API with the Database Search and the Search-Pages. The Database Service is configured to not search on parts of a word.
When rebuilding the index (based on a Item type "Multible types" (Index now / all) I get a lot of entries in the watchdog-protocol:
Notice: Undefined offset: 1 in SearchApiHtmlFilter->parseText() (Line 125 von Myserver\sites\all\modules\search_api\includes\processor_html_filter.inc).
Notice: Undefined offset: 2 in SearchApiHtmlFilter->parseText() (Zeile 134 von E:\Webserver\wamp\www\multisite-test-1\sites\all\modules\search_api\includes\processor_html_filter.inc).
I have no clue what this means ...
Comment | File | Size | Author |
---|---|---|---|
#23 | 2567775-23--html_filter_undefined_index.patch | 585 bytes | drunken monkey |
#11 | search_api_undefined_offset_in_SearchApiHtmlFilter-2567775-12.patch | 1.27 KB | joseph.olstad |
Comments
Comment #2
joseph.olstadsubscribing.
I observe this issue, I am using solr 5.0.1
with search api version: 7.x-1.11
Looks like I have a really old version of search api, I am going to try upgrading!
Comment #3
joseph.olstadUpgrading did not help however ...
Here's a patch that helps, feel free to improve upon it.
Comment #4
joseph.olstadsetting to needs review. I have only done very simple testing on it so I cannot guarantee it's effectiveness.
Comment #6
joseph.olstadok, test this patch instead
Comment #8
joseph.olstadLooks like the test bot is broken, I bet a non-change would provoke an error.
Comment #9
joseph.olstadOk, one last try.
Comment #11
joseph.olstadOk, testbot looks broken, but I have another patch here.
Comment #12
joseph.olstadI tested my patch extensively in a dev environment, the unpatched issue occurs when indexing "archived web content" that may contain invalid html. We're not going to fix our archived content, however the patch does correctly handle the unset array and allows indexing the content correctly. I did a search on the content after applying patch 12 and it's indexed correctly although the contents of the index appears to have some left over > < characters with or without the patch.
The patch correctly prevents the watchdog from filling up with notices and warnings.
Comment #13
joseph.olstadFor now the patch does the job doesn't seem to have any bad side-effects. I'm not entirely happy with the patch because I'd like to improve it even more. The logic in SearchApiHtmlFilter->parseText() could be improved and made in a more structured way. With that said, I'll let a fresh set of eyes take over. From my debugging it looks like the parseText() function removes html tags, but it only does so correctly if the html is well structured.
To improve this: test parseText() with invalid html. This is where you'll see the need for perhaps a more robust/structured logic.
Comment #14
joseph.olstadsetting this to needs review in the hopes that the head is fixed.
This should pass testing, if it doesn't then HEAD is broken (before the patch even is tested)
Comment #16
joseph.olstadnot sure why this isn't passing testing, it fails on something that doesn't seem to be related to the patch. I retested another issue and it passed so it appears that HEAD is not broken.
I'll have a look at this later maybe.
Comment #17
joseph.olstadrunning a head test on a no-logic-change patch (add a comment, that's it)
Comment #18
joseph.olstadfire up the test bot
Comment #20
joseph.olstadfilename correction.
Comment #22
joseph.olstadOnce the patch for "HEAD is broken, simple test failure on adding a comment" is committed, then we can get back to testing patch 12.
Comment #23
drunken monkeyThanks for reporting!
I can't reproduce this and it's weird that the problem would appear only now, after the code hasn't been changed for more than a year.
In any case, I'd guess the proper solution here would be to just check whether
preg_match()
actually matches, and continue the loop otherwise, under the assumption that the HTML is just broken. The attached patch should implement this.It would just be good to know which input leads to this problem for you. Could you please find that out, and maybe at which position in the string the method encounters the not-matching (assumed) tag name?
Comment #24
joseph.olstad@drunken monkey , yes in my case I've got content classified as "archived" (aka, we don't maintain it or keep archived content up to date with the ever evolving w3c standards) , it's our archived content that has "broken" html and it's going to stay broken.
According to my debuging the undefined offset occurs only when processing broken html.
I'll have a look at your newer patch.
as for the simpletests failing according to the research I did calling
before assigning permissions resets the permissions cache and can reduce problems when drupalCreateUser is invoked after setUp loads a simpletest environment.
Thanks for the patch, I'll test it asap (when I get back to the office).
Comment #26
joseph.olstadHi @drunken_monkey, your fix seems to work.
I re-indexed 4000 nodes , including archived content, looks great, the archived content is indexed correctly, I was able to search for the content.
By the way, my patch on patch 12 also works and passed testing as well.
Feel free to give me a bit of credit on the commit :)
Looks good, thanks.
Comment #27
joseph.olstadoh yes, and no more undefined offset notices in watchdog (the reason for the patch in the first place), so this is good.
Comment #28
drunken monkeyOK, great to hear, thanks for testing!
Committed. Thanks again!
And sure, I always try to credit contributors fairly. In this case, you clearly did the bulk of the work, I just refactored your patch, so you get the credit.