I have installed the Module Search API with the Database Search and the Search-Pages. The Database Service is configured to not search on parts of a word.

When rebuilding the index (based on a Item type "Multible types" (Index now / all) I get a lot of entries in the watchdog-protocol:

Notice: Undefined offset: 1 in SearchApiHtmlFilter->parseText() (Line 125 von Myserver\sites\all\modules\search_api\includes\processor_html_filter.inc).

Notice: Undefined offset: 2 in SearchApiHtmlFilter->parseText() (Zeile 134 von E:\Webserver\wamp\www\multisite-test-1\sites\all\modules\search_api\includes\processor_html_filter.inc).

I have no clue what this means ...

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

smitty created an issue. See original summary.

joseph.olstad’s picture

subscribing.

I observe this issue, I am using solr 5.0.1
with search api version: 7.x-1.11

Looks like I have a really old version of search api, I am going to try upgrading!

joseph.olstad’s picture

Upgrading did not help however ...

Here's a patch that helps, feel free to improve upon it.

joseph.olstad’s picture

Status: Active » Needs review

setting to needs review. I have only done very simple testing on it so I cannot guarantee it's effectiveness.

Status: Needs review » Needs work
joseph.olstad’s picture

Status: Needs work » Needs review
FileSize
728 bytes

ok, test this patch instead

Status: Needs review » Needs work
joseph.olstad’s picture

Looks like the test bot is broken, I bet a non-change would provoke an error.

joseph.olstad’s picture

Status: Needs work » Needs review
FileSize
1.24 KB

Ok, one last try.

Status: Needs review » Needs work
joseph.olstad’s picture

Ok, testbot looks broken, but I have another patch here.

joseph.olstad’s picture

I tested my patch extensively in a dev environment, the unpatched issue occurs when indexing "archived web content" that may contain invalid html. We're not going to fix our archived content, however the patch does correctly handle the unset array and allows indexing the content correctly. I did a search on the content after applying patch 12 and it's indexed correctly although the contents of the index appears to have some left over > < characters with or without the patch.

The patch correctly prevents the watchdog from filling up with notices and warnings.

joseph.olstad’s picture

For now the patch does the job doesn't seem to have any bad side-effects. I'm not entirely happy with the patch because I'd like to improve it even more. The logic in SearchApiHtmlFilter->parseText() could be improved and made in a more structured way. With that said, I'll let a fresh set of eyes take over. From my debugging it looks like the parseText() function removes html tags, but it only does so correctly if the html is well structured.

To improve this: test parseText() with invalid html. This is where you'll see the need for perhaps a more robust/structured logic.

joseph.olstad’s picture

Status: Needs work » Needs review

setting this to needs review in the hopes that the head is fixed.

This should pass testing, if it doesn't then HEAD is broken (before the patch even is tested)

Status: Needs review » Needs work
joseph.olstad’s picture

not sure why this isn't passing testing, it fails on something that doesn't seem to be related to the patch. I retested another issue and it passed so it appears that HEAD is not broken.
I'll have a look at this later maybe.

joseph.olstad’s picture

FileSize
538 bytes

running a head test on a no-logic-change patch (add a comment, that's it)

joseph.olstad’s picture

Status: Needs work » Needs review

fire up the test bot

Status: Needs review » Needs work

The last submitted patch, 17: head_test-256775-17.patch, failed testing.

joseph.olstad’s picture

Status: Needs work » Needs review
FileSize
538 bytes

filename correction.

Status: Needs review » Needs work

The last submitted patch, 20: head_test-2567775-19.patch, failed testing.

joseph.olstad’s picture

drunken monkey’s picture

Version: 7.x-1.16 » 7.x-1.x-dev
Component: Framework » Plugins
Status: Needs work » Needs review
FileSize
585 bytes

Thanks for reporting!
I can't reproduce this and it's weird that the problem would appear only now, after the code hasn't been changed for more than a year.

In any case, I'd guess the proper solution here would be to just check whether preg_match() actually matches, and continue the loop otherwise, under the assumption that the HTML is just broken. The attached patch should implement this.

It would just be good to know which input leads to this problem for you. Could you please find that out, and maybe at which position in the string the method encounters the not-matching (assumed) tag name?

joseph.olstad’s picture

@drunken monkey , yes in my case I've got content classified as "archived" (aka, we don't maintain it or keep archived content up to date with the ever evolving w3c standards) , it's our archived content that has "broken" html and it's going to stay broken.

According to my debuging the undefined offset occurs only when processing broken html.

I'll have a look at your newer patch.

as for the simpletests failing according to the research I did calling

 checkPermission(array(), TRUE); 

before assigning permissions resets the permissions cache and can reduce problems when drupalCreateUser is invoked after setUp loads a simpletest environment.

Thanks for the patch, I'll test it asap (when I get back to the office).

joseph.olstad’s picture

Status: Needs review » Reviewed & tested by the community

Hi @drunken_monkey, your fix seems to work.

I re-indexed 4000 nodes , including archived content, looks great, the archived content is indexed correctly, I was able to search for the content.

By the way, my patch on patch 12 also works and passed testing as well.

Feel free to give me a bit of credit on the commit :)

Looks good, thanks.

joseph.olstad’s picture

oh yes, and no more undefined offset notices in watchdog (the reason for the patch in the first place), so this is good.

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

OK, great to hear, thanks for testing!
Committed. Thanks again!

And sure, I always try to credit contributors fairly. In this case, you clearly did the bulk of the work, I just refactored your patch, so you get the credit.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.