There is an existing issue, #2179755: "HTML filter leaves whitespaces", about whitespace and the HTML filter, but unfortunately it was not solved properly.

If, for example, the body field contains this text:

<h2>Introduction&nbsp;</h2>
<p>Introduction</p>

and the "Correct faulty and chopped off HTML" filter is enabled in the text format, the &nbsp; is replaced by a special character that trim() (search_api/includes/processor_html_filter.inc:116) cannot remove when the "HTML filter" processor is enabled in the Search API settings.

The output for the text above with the "Correct faulty and chopped off HTML" filter enabled (inspecting the $text variable at search_api/includes/processor_html_filter.inc:111):

 <h2> introduction  </h2> 
  introduction  
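
The behaviour can be reproduced in isolation. The following is a minimal sketch, assuming the special character is a UTF-8 non-breaking space (U+00A0), which is what decoding &nbsp; produces; the variable names are illustrative and not taken from processor_html_filter.inc.

```php
<?php
// Assumption: the "special char" is U+00A0 (non-breaking space), which is
// what html_entity_decode() yields for &nbsp; when the charset is UTF-8.
$decoded = html_entity_decode('Introduction&nbsp;', ENT_QUOTES, 'UTF-8');

// trim() with its default character list strips only ASCII whitespace and
// NUL bytes, so the two-byte UTF-8 sequence 0xC2 0xA0 survives.
$trimmed = trim($decoded);

var_dump($trimmed === 'Introduction');          // bool(false)
var_dump($trimmed === "Introduction\u{00A0}");  // bool(true)
```

The trimmed string still ends with the non-breaking space, so it is not the same PHP string as a plain "Introduction" token.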

This untrimmed space leads to the following error:

SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '85213-body:value-introduction' for key 'PRIMARY': INSERT INTO

I created a patch to fix this.

Comments

mibfire created an issue. See original summary.

mibfire commented:

The patch that solves this issue.

donquixote commented:

Issue summary: updated

mibfire commented:

Issue summary: updated

drunken monkey commented:

Component: Framework » Plugins
Priority: Major » Normal
Status: Active » Needs review
File sizes: 2.76 KB, 2.36 KB

Thanks for reporting this issue and providing a patch! You’re right, seems we hadn’t really taken non-breaking spaces and other “exotic” whitespace into account.
However, your solution seems a bit verbose. Also, if we are using the exact same code three times, we might want to split it off into its own helper method. (Can include the html_entity_decode() call there, too, it seems.)
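
To illustrate the helper-method idea, here is a rough sketch. It is not the code from the attached patch: the function name example_normalize_whitespace() is hypothetical, and it assumes the input is UTF-8. It uses PHP's html_entity_decode() for the decoding step.

```php
<?php
/**
 * Hypothetical helper, not the code from the attached patch.
 *
 * Decodes HTML entities, collapses every run of whitespace into a single
 * ASCII space and strips leading and trailing spaces.
 */
function example_normalize_whitespace($text) {
  $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
  // With the "u" modifier, \s is matched Unicode-aware in PHP, so it
  // also covers U+00A0 (non-breaking space).
  $collapsed = preg_replace('/\s+/u', ' ', $text);
  // preg_replace() returns NULL on failure (for example, invalid UTF-8);
  // fall back to the decoded text in that case.
  if ($collapsed === NULL) {
    $collapsed = $text;
  }
  return trim($collapsed);
}

var_dump(example_normalize_whitespace('Introduction&nbsp;'));
```

Collapsing first and trimming afterwards means the final trim() only ever has to deal with ordinary ASCII spaces.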

Please test/review my attached revision and see if it still resolves the problem for you!

PS: In the future, please also remember to set status to “Needs review” when posting a working patch. (And please don’t misuse the “Priority” field!)

drunken monkey commented:

Status: Needs review » Fixed
mibfire commented:

Status: Fixed » Active

> Also, if we are using the exact same code three times, we might want to split it off into its own helper method. (Can include the html_entity_decode() call there, too, it seems.)

I was thinking the same, but I was not sure where to put it. I didn't want to extend the current class because I thought of it as a global function that we might also use somewhere else.

I checked your patch, but it doesn't remove the extra spaces between words. Is there always only one word? Couldn't we have something like "word1(double spaces)word2"?

I also saw on my profile page that I already have 1 credit in "Search API", but I am not sure whether it is for this issue or for something I did earlier. I think I should have 2 with this one.

Thanks

drunken monkey commented:

Status: Active » Fixed

> I checked your patch, but it doesn't remove the extra spaces between words. Is there always only one word? Couldn't we have something like "word1(double spaces)word2"?

I don’t think that’s true, preg_replace('/\s+/u', ' ', $token) should replace all whitespace with a single classical space.
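
A quick way to check this claim, as a sketch: the token below is made up for illustration and contains both a double space and a non-breaking space (U+00A0). It relies on \s being matched Unicode-aware under the "u" modifier in PHP.

```php
<?php
// Illustrative token: a double ASCII space, then a non-breaking space.
$token = "word1  word2\u{00A0}word3";

// Every run of whitespace becomes exactly one ASCII space.
$normalized = preg_replace('/\s+/u', ' ', $token);

var_dump($normalized); // string(17) "word1 word2 word3"
```

Note that this only collapses whitespace; leading or trailing runs are reduced to a single space, not removed, so a trim() is still needed afterwards.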

> I also saw on my profile page that I already have 1 credit in "Search API", but I am not sure whether it is for this issue or for something I did earlier. I think I should have 2 with this one.

You can actually view the exact issues by clicking on “View all issue credits”. As you can see (and as I also see in the module’s commit log), it doesn’t seem like you have been credited for another Search API issue yet.

mibfire commented:

> I don’t think that’s true, preg_replace('/\s+/u', ' ', $token) should replace all whitespace with a single classical space.

Indeed, you are right! OK, thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.