search.module inconsistently uses multibyte safe functions (e.g., drupal_substr; drupal_strlen) and non-safe functions (preg_replace with an offset instead of mb_ereg_replace; strpos instead of mb_strpos). For instance, from line 1159 of the current version of search.module in HEAD:

      if (preg_match('/' . $boundary . $key . $boundary . '/iu', $text, $match, PREG_OFFSET_CAPTURE, $included[$key])) {
        $p = $match[0][1];
        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);
          if (($s = strrpos($end, ' ')) !== FALSE) {
            // Account for the added spaces.
            $q = max($q - 1, 0);
            $s = min($s, drupal_strlen($end) - 1);
            ...

I'm not sure if more Drupal wrapper functions are needed or if we should just use the mb versions provided by PHP.

This issue was originally raised in #916086: search_excerpt() doesn't highlight words that are matched via search_simplify()

CommentFileSizeAuthor
#56 drupal-search_multibyte_error_short_test_only_7.x-987472-59.patch1.16 KBdamien_vancouver
#56 drupal-search_multibyte_error_7.x_backport-987472-59.patch1.72 KBdamien_vancouver
#53 2011-12-22 Drupal Error.PNG52.54 KBWhiplashInfo
#52 drupal-search_multibyte_error_patched_7.10_search.module-987472-52.txt49.82 KBdamien_vancouver
#50 2011-12-20 after patching search module.PNG44.21 KBWhiplashInfo
#48 drupal-search_multibyte_error_with_long_test-987472-48.patch4.02 KBdamien_vancouver
#48 drupal-search_multibyte_error_short_test_only-987472-48.patch1.18 KBdamien_vancouver
#48 drupal-search_multibyte_error_with_short_test-987472-48.patch1.76 KBdamien_vancouver
#44 test-only.patch3.44 KBcatch
#42 drupal-search_multibyte_error-987472-42.patch4.02 KBdamien_vancouver
#41 drupal-search_multibyte_error-987472-41.patch4.02 KBdamien_vancouver
#40 drupal-search_multibyte_error-987472-40.patch3.98 KBdamien_vancouver
#39 drupal-search_multibyte_error-987472-39.patch1.89 KBdamien_vancouver
#37 drupal-search_multibyte_error-987472-37.patch2.34 KBdamien_vancouver
#36 drupal-search_multibyte_error-987472-36.patch2.08 KBdamien_vancouver
#26 drupal-search_multibyte_before-987472-26.jpg28.02 KBdamien_vancouver
#26 drupal-search_multibyte_after-987472-26.jpg32.06 KBdamien_vancouver
#26 drupal-search_multibyte_test_data-987472-26.txt2.65 KBdamien_vancouver
#26 drupal-search_multibyte_error-987472-26.patch676 bytesdamien_vancouver
#17 drupal-search_multibyte_error-987472-17_search.module.txt49.85 KBdamien_vancouver
#20 drupal-search_multibyte_error_example2-987472-20.jpg30.37 KBdamien_vancouver
#20 drupal-search_multibyte_error-987472-20_search.module.txt49.82 KBdamien_vancouver
#20 multibyte_content_search_errors-987472-20.patch676 bytesdamien_vancouver
#20 multibyte_content_search_errors_7.x-987472-20.patch656 bytesdamien_vancouver
#19 drupal-search_multibyte_error_example-987472-19.jpg98.02 KBdamien_vancouver
#19 drupal-search_multibyte_error_example-987472-19.jpg98.02 KBdamien_vancouver
#11 2.png13.55 KBOnkelTem
#11 multibyte_excert_fix.patch1011 bytesOnkelTem
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

jhodgdon’s picture

Thanks for filing this issue. It may be a wider Drupal issue, not just search.module.

Damien Tournoud’s picture

Using byte-based functions instead of character based functions doesn't necessarily imply a bug, as long as they are used consistently.

The only thing obviously wrong I see in this excerpt is:

$end = substr($text . ' ', $p, 80);

Which could truncate a multibyte character at the end of the string.

Damien Tournoud’s picture

We could easily fix that by doing a $end = drupal_strlen(substr($text . ' ', $p), 0, 80).

jhodgdon’s picture

That's not the only place in search.module. The whole thing needs some examination, I think.

My concern is that I think (a) some non-mb string functions are being used in a way that is not multi-byte safe and (b) mb and non-mb functions are being mixed. Both of those would be bugs. We need to take a close look at the module and clean this up for the whole module, not just this one spot (which could be fixed as you suggest).

Damien Tournoud’s picture

Actually, this whole function seems ok to me, except this part:

$s = min($s, drupal_strlen($end) - 1);

We are mixing character-based and byte-based here. This drupal_strlen() should probably be a simple strlen(), but I'm not completely sure of the intent of this line.

WilliamB’s picture

Subscring i'm facing this issue as well.
And the fix previously posted seems to work.

Floop’s picture

Subscribing.
There are many people who are facing this issue.
#1105168: Invalid multibyte sequence warning from check_plain when searching is now a duplicate of this issue.

In my opinion the entire function should use a drupal_ or mb_ prefixed functions. The programmer's comment above the lines sais: ... leaving about 60 characters extra .... However with PHP's integrated functions, we are talking about bytes, not characters.

There are two options to fix the issue:

1) multibyte safe
The strlen and strpos should be used with drupal_ or mb_ prefix, otherwise the number of characters in front of the searched term and behind it could be less with a string containing a lot of special unicode characters than a string containing mostly ASCII characters.
Substr should then also be used with mb_ or drupal_ prefix.

2) faster
The function can use PHP's strlen, strpos and substr as previous comments mentioned, but then the #5's edit should be done. Otherwise a unicode character can be cut in the middle: #1105168-22: Invalid multibyte sequence warning from check_plain when searching

I am not a PHP professional, so please correct me if I am wrong.

drupal_was_my_past’s picture

Subscribe

ChrisLaFrancis’s picture

Subscribe

OnkelTem’s picture

Subscribe

OnkelTem’s picture

FileSize
1011 bytes
13.55 KB

Also, search module makes incorrect formatting when outputing results, treating parts of words and even letters as words (see the screenshot - search result start with single "ha" (х).

Attaching the patch for D7.2 which makes warnings to go.

barami’s picture

I think drupal don't care multibyte language users. I have this problem and cjk filename problem and any others. And since php 5 was used, this problem still alive.
I used php 5 with drupal few years. But this problem is never fixed.

It's php problem. But drupal development team don't think about bypass them.
We alway's patch this or use only english. So, this problems are blocking to increase non english users.

If drupal development team don't care this feedbacks, drupal will be a only english tool.
The solution is two.
Wait for php 6 or change drupal code to bypass locale related functions.

damien_vancouver’s picture

The following error started popping up for me as soon as I launched a new Drupal 7.8 site, which has no international content:

Warning: htmlspecialchars(): Invalid multibyte sequence in argument in check_plain() (line 1521 of /var/www/bootstrap.inc).

#1105168: Invalid multibyte sequence warning from check_plain when searching , a duplicate of this issue, led me here. The fixes suggested there don't apply to me (My database and all tables are UTF8, and all the SQL dumping and loading was done via drush sql-dump and command line mysql client).

Anyway, through some trial and error, I determined that the problem was opening left quote characters, pasted in by the users when entering press releases. Searching for terms on these pages would cause them to sometimes show up in the results, and trigger the error.

I fixed it by manually applying the changes from patch #11 above to my Search module... and the error dissappeared.

I think this is a bug. Searching for content should not produce errors just because users put multibyte characters in there. I think we should move forward with the patch from #11, but it needs to be re-rolled properly.

I will upgrade the site where I'm seeing this to Drupal 7.9 and test again, then fix up the patch against 7.x-dev

Edit, 2011-11-06: I tested the problem site after upgrading to 7.9 and the error persists. I will get a proper patch ready.

jtse’s picture

#11's patch fixed the issue in my Drupal 7.9 install.

dimon00’s picture

Same here: #11's patch fixed the issue (version 7.9).

robertstaddon’s picture

#11's patch fixed the issue for me as well (Drupal 7.9). Can we please get this into the next Drupal release?

damien_vancouver’s picture

Priority: Major » Normal
Status: Needs review » Active
FileSize
49.85 KB

I just had this problem on another totally urelated site and fixed it with the #11 patch again. This is definitely a bug in the wild.

I'll try and get a patch here ASAP against Drupal 8 and backported to 7.x-dev that will fix the problem in time for next release.

For those needing a fix right away or who aren't good with patches: I've attached a working patched Drupal 7.9 serach.module to this reply. Save the attachment: drupal-search_multibyte_error-987472-17_search.module.txt. Then rename it to "search.module" and overwrite your existing Drupal 7.9 modules/search.module with the downloaded one and the problem will go away.

If we don't get the patch into the next version of Drupal, I will post a new patched one on this thread so no one is left hanging.

[ EDIT: Please use the updated version from #20, below ]

Damien Tournoud’s picture

Please read my comments on this thread first. Using byte-based functions is not necessarily a bug per say, as long as you are using them consistently. The only thing obviously wrong I found in the function was the line mentioned in #5.

Could someone come up with a test string that displays a broken behavior?

damien_vancouver’s picture

Yes, this seems to be exactly what is happening somehow - a multibyte sequence is truncated at 80 characters in one of the nodes returned as search results. Any of the search results on the page can trigger it, so it's not obvious from looking at the search results which node has the bad string.

As for a test string, I figured 79 spaces and any old multibyte value will do. It doesn't appear to be that easy. I just spent about an hour trying to reproduce and catch it - it doesn't happen on my local development machine (of course), only on the servers, soI have no Xdebug. I am totally unable to make a new node that has a string that sets it off. I added some PHP error logging to bootstrap.inc and logged one of the strings, and tried copying the body of the offending node into a new node then running cron and searching for it. I can't reproduce it on demand.

However, I think it should be enough that it's happening to me and others? I've attached two screenshots of the problem as it occurred to me. On the first, it's searching for the word "warranty". Yikes! In the second searching for the company's name causes it. Yikes again!

Now the good news!

I tried the fix you suggested in #5 - replacing the drupal_strlen in that line with regular strlen, and that fixed the error for me. As this meets the definition of a rare condition that causes a PHP error, I'm marking the issue as Major.

I will come up with a quick patch that just changes the line as suggested in #5 tonight if no one beats me to it.

damien_vancouver’s picture

I attached the same file twice above in my screenshots last post #19. Attached is the actual second example screenshot. The node it hates is that FAQ node, and it has blank content showing up as a result of the error.

Also Attached are patches with your suggested fix from #5, for Drupal 8 and Drupal 7. I've tested/applied the 7 patch on both production sites giving me problems and it fixes them. Sadly I can't test the 8 without the ever elusive test string.... But the wrong function was definitely being used as you said and it's a one line change to correct that so hopefully that's OK.

Finally, I attached a new patched search.module for Drupal 7.9. Anyone who can't apply patches can download drupal-search_multibyte_error-987472-20_search.module.txt and follow the instructions in #17 above to fix the error until the fix makes it into an upcoming Drupal version.

damien_vancouver’s picture

Status: Active » Needs review

Setting issue status to needs review. Sorry for the spam. :(

chx’s picture

Priority: Normal » Major
Status: Active » Needs review

Could someone come up with a test string that displays a broken behavior?

I think #12 is simply insulting. I have said: give me a usable bug report and I will fix the search Unicode problems. These are not empty words: I spent considerable time and effort fixing #604002: Poor search support of some Unicode scripts. We do care about proper multibyte support. The CJK filename problem is its own issue, there is one already ongoing, did you test it? Did you help?

chx’s picture

Status: Needs review » Postponed (maintainer needs more info)
chx’s picture

Finally, I forgot to thank @damien_vancouver for trying to fix the problem but it's fairly hard to test a bugfix without a reproducible bug report.

damien_vancouver’s picture

@chx, So if I understand correctly, nothing is going to happen until I come up with a test string to reproduce the problem? I guess I will keep working on that then.

Why is there no way we can move the patch forward based on the fact that search.module is clearly using a multibyte function on a non multibyte string? Why MUST we reproduce a difficult use case when we are correcting an obvious coding mistake - mixing drupal_strlen with PHP strpos etc.? Can we not just fix it based on inspection of the code?

Consider that:
- Multiple people have confirmed the bug here.
- It happens to end users and is caused not by misconfiguration or site builder error, but by perfectly valid content and bad luck.
- We can easily see by inspecting the code that search.module is using the wrong function.

I will keep looking for a test string if it's really required. That means adding debugging to my production server while it's happening. I really do not want to have to do that when we already know the problem and the solution.

Meanwhile... those who are experiencing this bug and know all too well that it is real can overwrite their Drupal 7.9 modules/search.module with the patched one from #20 above (drupal-search_multibyte_error-987472-20_search.module.txt). I will keep releasing these patched search.modules (and unfortunately spamming everyone on this issue in the process) with each Drupal release until this issue is fixed. Unless we can agree that we are fixing an obvious mistake and move forward based on that, rather than needing to reproduce the side effect, or I can figure out a way to reproduce the problem.

damien_vancouver’s picture

Version: 8.x-dev » 7.x-dev
Status: Needs review » Postponed (maintainer needs more info)
FileSize
676 bytes
2.65 KB
32.06 KB
28.02 KB

I have managed to reproduce the problem with test data on all my servers and a plain D8 install.

While I do not see the red PHP error message on every server, the contents of the node summary will be missing from the search results (due to htmlspecialchars() returning an empty string).

Screenshots, test data, and re-rolled 8.x patch are attached.

Steps to reproduce:

  1. Create a "good" node: Make a new node with title "Test Page good" and content: "This is good SampleContent. It will show up".
  2. Now create a "bad' node: Create a new node with title "Test Page Bad" and paste the attached drupal-search_multibyte_test_data-987472-26.txt in as the node body. Set the Text Format to "Full HTML" and save the node.
  3. Clear your search index (I'm not certain you must do this) and then manually run cron to get those nodes in the index
  4. Now search for "SampleContent"
  5. The results page MAY have a red PHP error like: Warning: htmlspecialchars(): Invalid multibyte sequence in argument in check_plain() (line 1548 of /var/www/includes/bootstrap.inc.
  6. (See drupal-search_multibyte_before-987472-26.jpg) The results page WILL NOT show the summary content for the "Test Page Bad", only an ellipsis. This is from htmlspecialchars() returning an empty string to check_plain() - see RETURN VALUE in http://www.php.net/htmlspecialchars.
  7. For Drupal 8, apply the attached drupal-search_multibyte_error-987472-26.patch. If you are testing with Drupal 7, apply multibyte_content_search_errors_7.x-987472-20.patch from #20.
  8. (see drupal-search_multibyte_after-987472-26.jpg) Retry your search for "test page" and now you will see the search summary. htmlspecialchars() is no longer failing becuase the string was not truncated in the middle of a multi-byte. If you were seeing the red PHP error, it will also be gone.

As an aside note, I noticed during debugging this that check_plain is missing an error condition from htmlspecialchars().
If check_plain were to notice that htmlspecialchars should not return an empty string when it is given a long string, then it could detect this error condition. The issue goes away if we fix the string so it's not bad. But perhaps check_plain() should be improved to notice the error happening and report it?

damien_vancouver’s picture

Status: Postponed (maintainer needs more info) » Needs review

Forgot to set issue status to needs review again.

Status: Needs review » Needs work

The last submitted patch, drupal-search_multibyte_error-987472-26.patch, failed testing.

damien_vancouver’s picture

Version: 7.x-dev » 8.x-dev
Status: Needs work » Needs review

setting version to 8.x-dev and setting back to needs review again, hopefully this time it'll pass testing.

Sorry, this is my first git patch!

damien_vancouver’s picture

chx’s picture

Something is strange: you have upload ordinary ASCII text. There's not a single > 127 byte in it.

damien_vancouver’s picture

Yes it's actually htmlspecialchars() that's failing... so I guess multibyte is technically not the issue, it's the fact that PHP's mb_strlen() is counting the string differently than php strlen() (despite the fact it's just HTML entity chars in there, not real multi-byte).

I can say this with certainty because while working on this yesterday I added a bunch error_log() debugging on my production server into the search.module. I added some output right around where the patch is, printing out calculated the values of $p, $q, and then the calculated values for $s based on the drupal_strlen vs. regular php strlen (to compare and see what is different without / with the patch applied). When the error was occurring, the two were indeed returning different values, causing one of those HTML entity codes to be truncated, and then causing htmlspecialchars() to fail (it outputs an empty string in this case)... which in turn returns the empty search summary from check_plain() (and may display a red PHP error to the user, depending on your PHP settings).

Here's the relevant snippet of code that's failing, followed by my debug output. Sadly I no longer have the debug code I used, but it's pretty obvious what I'm printing out: I commented below right where I'm doing the printing (lines HERE1 and HERE2). The output from the debugging when it's failing is immediately below:

// NOTE: This is from Drupal 8 search.module, function search_excerpt(), line 1159:
      // Now locate a space in front (position $q) and behind it (position $s),
      // leaving about 60 characters extra before and after for context.
      // Note that a space was added to the front and end of $text above.
      if ($p) {
        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {
          $end = substr($text . ' ', $p, 80);
          if (($s = strrpos($end, ' ')) !== FALSE) {
            // Account for the added spaces.
            $q = max($q - 1, 0);
// HERE1 - print values of $p, $q, $s before
// HERE2 - printed values of s as calculated below as s(incorrect), and then again calculated with PHP strlen() as s(correct)
            $s = min($s, drupal_strlen($end) - 1);

Here's what my debugging printed for one of the problem strings:

[error] here1: p=380, q=319, s before=79
[error] here2: drupal_strlen(end)(incorrect): 78, s(incorrect):77.   strlen(end)(correct): 80, s(correct): 79

I had a theory that part of the problem was whitespace at the end or start of the initial excerpt it took. But I was never able to construct test data that caused the issue, no matter what I tried. The only way I got the attached text data from #26 was to actually error_log the entire node $text, then base64_encode it in my error log. Otherwise error_log logged it like this:

\xc2\xa0 \n What is SampleContent? ....

But base64 encoding then decoding it yielded this, which when pasted into a node body causes the problem.

<h1>What is SampleContent?....

I think one could spend a lot of time analyzing the weird side-effect behaviour of using the wrong string function. But I'm not sure that it is really necessary - we are dealing with undefined behaviour from using an incorrect function, so it's not surprising that it's behaving strangely! The base cause of all this is that PHP's mb_strlen() (called by drupal_strlen()) is counting the HTML entity characters differently than PHP strlen(). Since we're using PHP strpos() and other non-mb functions for the rest of the substring calculation in that block of code, we MUST stick to using those for strlen too or we won't necessarily get consistent numbers. This bug SHOULD only happen when multibyte chars are in the text, but truncated HTML entity chars seem to cause it to.

It should be noted that in this function (search_exerpt()), all of the code in question is just finding spaces in the strings, and truncating empty spaces at the end of the string... so there is no need to be using mb safe functions because we're dealing with locating whitespace. It's OK using either normal or multibyte searching and string math, as long as we don't mix and accidentally get one or two-off values - which is exactly what is going on here.

I am still curious why mb_strlen() is counting HTML entities as multibyte... that could be a PHP bug, or it could be by design. The documentation on mb_strlen() doesn't really help much! http://php.net/manual/en/function.mb-strlen.php

But, searching the PHP documentation I did discover why this problem has not been noticed before! PHP 5.3! See the new flag ENT_IGNORE to htmlspecialchars():

ENT_IGNORE Silently discard invalid code unit sequences instead of returning an empty string. Added in PHP 5.3.0. This is provided for backwards compatibility; avoid using it as it may have security implications.

So... I bet in PHP <= 5.2 the problem of the empty search summary would not have been seen, htmlspecialchars() would have been returning the string with those bad characters discared. Since PHP 5.3, it would have been returning an empty string instead. We might want to consider adding ENT_IGNORE to check_plain's htmlspecialchars() call, if we can't figure out a good security reason not to. In fact I got convinced after writing that that ANY bad node with broken HTML entity characters should trigger this bug. So I confidently made one in my drupal 8 test site, but... it works fine. Even though it's full of things like &quot &nb &nbsp and so on, htmlspecialchars doesn't seem to mind them and I can see the search summary. Bizarre! I'm tempted to keep trying to make a test string that fails... but that's how I lost half of yesterday :)

Please let me know if there is anything else I can do testing/debugging wise to push this forward. I appreciate what a pain it is to work on when it's so hard to reproduce. Also please let me know if the above test data does not cause the problem of the empty summary for you.. if so I'll go back to where it's failing and try and dig deeper and get some that works for everybody.

Damien Tournoud’s picture

We still need to come up with a small, reproducible test case for this.

The problem is definitely around multi-byte characters, and is triggered with the HTML entities in your test case because we have a call to html_decode_entities() entering the search excerpt function.

Staring at the code once more, I'm convinced those two lines are wrong, but without a set of test cases, it's hard to tell what is the proper fix:

// Here, we are cutting a UTF-8 string at a fixed number of bytes.
$end = substr($text . ' ', $p, 80);
// Here, we are mixing byte-based ($s) and character-based (drupal_strlen) offsets. 
$s = min($s, drupal_strlen($end) - 1);
damien_vancouver’s picture

You are probably correct about the substr line too, and how it should not truncate based on an arbitrary number of bytes. But the code is supposed to account for that - the value of $end is trimmed down to the position of a space found with strrpos(). So it could be argued it's more efficient this way (if we fix the bug with patch from #26). It could also be argued that using the non-mb functions as a shortcut is a more failure prone design pattern, and this bug is an example of that. I agree with both and have no stronger feelings towards one than the other.

I stepped through this a lot in XDebug yesterday (once I had a reproducable test case), so here is some more commentary on how the code is intended / supposed to work and where it's failing exactly. As to whether we should do it the multibyte function way, I leave that to someone more qualified to decide!

Here's the code again with extra comments added left justified:

      // Now locate a space in front (position $q) and behind it (position $s),
      // leaving about 60 characters extra before and after for context.
      // Note that a space was added to the front and end of $text above.
      if ($p) {
        if (($q = strpos(' ' . $text, ' ', max(0, $p - 61))) !== FALSE) {

// As you point out, this is arbitrarily chopping a string that may have UTF8 or HTML entities in it
         $end = substr($text . ' ', $p, 80);

// now $end MIGHT finish with a truncated multibyte or HTML entity.
// But that should be OK because of the the next line -
// we search with a strrpos() for a space, and that will be used as the end of the string
          if (($s = strrpos($end, ' ')) !== FALSE) {
            // Account for the added spaces.
            $q = max($q - 1, 0);

// so now we've found a space to truncate the string at ($s), and any arbitrarily truncated HTML entities or
// multi-byte sequences beyond that would be safely trimmed by our substr().  
// EXCEPT... we're using the wrong strlen() function and thus we might get a 1-off or 2-off error here!
            $s = min($s, drupal_strlen($end) - 1);
// ...Leaving $s with a truncated multi-byte or HTML entity.  Our Fail-safe has Failed!  Hate That!
          }
          else {  

// Note that usually the only failure case would be no space found in $end, and that should cause
// this else block, which discards the whole thing (again avoiding a broken entity or multibyte).
            unset($workkeys[$k]);
          }

So either way, the strlen() and substr() functions have to be consistently multibyte, or not, or this error can be triggered.

As to a small reproducable test string: Although I had no success creating one (and had to fall back to logging and munging customer content), here are my thoughts what it has to be, and lessons learned. I wish anyone trying godspeed and good luck.

  • It must be contain multibytes or HTML entities. Most of my data was mostly these. Having lots didn't seem to help.
  • The string causing the error is not the HTML you paste into the node body, but already rendered content, ie. my <h1> tags were already converted to plain text for searching!
  • Where it's truncated depends on what the regexp matches on. I started searching on title so it would always be the same 256 bytes (the start of the node). [EDIT: I just realized that was preventing the bug from occurring. It only happens when $p has a value which means there is a space in the snippet string which means you MUST match on content, or the bug is not triggered ] Of course when you searched on content things get more complicated as the 256 byte snippet moves around!
  • The snippet is not always 256 characters if it had spaces at the start or end, and;
  • Only when there are spaces (multiple spaces) at the end does the value of $s change so that line 1168 has any effect:
    $s = min($s, drupal_strlen($end) - 1);
    Otherwise the min() function will ignore the value from the drupal_strlen(). It only causes $s to change iff there are spaces (multiple spaces??) at the end of the snippet string, ie. $end.

All that sounds simple enough. I made dozens of test nodes meeting those conditions or small variations thereof with no luck. It might be easier using a debugger... I may take another crack at it still!

damien_vancouver’s picture

Version: 7.x-dev » 8.x-dev
Status: Postponed (maintainer needs more info) » Needs review

I have made a concise working test string! It fails as either Filtered HTML or Full HTML.

123456789 HTMLTest +123456789+&lsquo; +&lsquo; +&lsquo; +&lsquo; +12345678 &nbsp;&nbsp; +&lsquo; +&lsquo; +&lsquo; &lsquo;

Each + marks a 10 byte boundary. The &nbsp's are right around the 80 char mark of the search snippet (string $end), if your search term is "HTMLTest".

To reproduce, create any node with that test string as the body. Then run cron manually for your site so the new node is in your search index. Then search for "HTMLTest". You will get an empty search summary for the node. The HTML entity that is failing is the final &lsquo; - except I believe that is already converted into a multibyte left opening quote by the time check_plain sees it, so what is really failing is that multibyte being truncated.

Revised Notes on making a test string: For the bug to be triggered...

  • Your search must match node content, not the node title, and it can't be the very beginning of the node.
  • You must have some multibyte characters (or HTML entities that decode to multibyte characters, like &lsquo; or &rsquo; ) within 80 bytes of the search term. Each one of these causes the drupal_strlen to be inacurrate. I got it working with four of them in my search snippet.
  • You must have some spaces at the end of your 80 byte snippet. These are spaces after it's rendered to HTML. I couldn't get the error to happen without using &nbsp;s in my trailing spaces string right around the 80 byte boundary (I'm not sure why)
  • After those spaces should come some multibyte or HTML entities you are hoping to break. A bunch in a row helps.
  • Just fiddle around adding or removing multibyte encoded stuff from the search snippet until you are lucky enough to trigger a truncated value that makes htmlspecialchars / check_plain fail. I had plenty of truncated values not failing, but when you get one that does it will be consistent, as with the test string above.

Useful debugging code:
I added the following helpful debugging to search.module at line 1168, printing out whenever there was a mismatch. It still took about 15 or 20 tries with mismatches showing until one caused the error.

  // Drupal 8 search.module line 1166.  Add the left justified debugging code.
            // Account for the added spaces.
            $q = max($q - 1, 0);
$strlenIncorrect = drupal_strlen($end);
$sIncorrect = min($s, drupal_strlen($end) - 1);
$strlenCorrect = min($s, strlen($end) - 1);
$sCorrect = min($s, strlen($end) - 1);
if ($sIncorrect != $sCorrect) {
  drupal_set_message("Error!  strlenIncorrect: $strlenIncorrect sIncorrect $sIncorrect / strlenCorrect $strlenCorrect sCorrect $sCorrect end: [$end], base64 end: [". base64_encode($end) ."]",'error');
 }
            $s = min($s, drupal_strlen($end) - 1);
            $ranges[$q] = $p + $s;
damien_vancouver’s picture

Here is a new patch of the #26 fix with a new simpletest. I added the new test to Search module's "Search Excerpt Extraction" test.

When it's used in a Drupal search, the search_excerpt() function is passed $node->rendered. This is different than the text you put in the node to cause the error. I logged it base64 encoded and use that to create the correct string for testing.

damien_vancouver’s picture

Here is an updated patch where the test is properly described. The old test description and comments did not adequately describe the problem.

chx’s picture

Status: Needs review » Needs work

Huge thanks for the patch! // See issue 987472: Should be @see http://drupal.org/node/987472

Why base64, again...? Edit: you could use maybe HTML entities or even the characters UTF-8 encoded, other search tests have those.

damien_vancouver’s picture

Good call, we didn't need the base64! I was able to convert the string over to a literal pretty easily. Although just one byte off and it stops causing the error... it's so fussy to reproduce!

I've reworked the test as well, it's much simpler/clearer now. re-rolled and attached! Please review it's OK as I want to make sure we do have a patch ready in case there's a Drupal security release / point release, so this can get in there.

But I'm leaving the issue as "needs work" now, because I'd like to come up with a way better test. I think the test should auto-create 100 (or so) nodes of dangerous content and then search for known strings in each node and verify there's an excerpt showing up in the search page results (properly, with assertText). Only if all 100 nodes return search excerpts containing the search term will the test pass. This would be a way more useful test than a functional test of search_excerpt with this one string. I'll have to come up with simpletest code that can make bad nodes first, and it will take quite a lot of nodes, as not all bad nodes trigger the error, it took me about 20 tries with bad ones to see it happen.

So - expect a better version of this patch soon (ie. sometime in the next 48h) with a way more exhaustive test. But if we are doing a point release let's not hold up the commit for a better test, let's fix the bug for Drupal 7.10+

damien_vancouver’s picture

Status: Needs work » Needs review
FileSize
3.98 KB

This version has a proper web test case that creates 25 dangerous nodes, updates the search index, and then searches for known terms in the node content and ensures that the search excerpt isn't empty.

Without the drupal_strlen() to strlen() fix, this test fails about 30-60% of the time.

I have set the test to run 25 times, and to also log failing node bodies with DrupalTestCase::verbose().

I think this might be ready to go now, if I haven't messed anything up with the test.

damien_vancouver’s picture

sorry, please review this one - it has a couple minor formatting fixes.

damien_vancouver’s picture

I fixed a function name that did not apply to coding standards. New patch attached.

Is there anyone out there that has time to test this? It'd be nice to get it committed before the next point release, if at all possible, as it's a major bug causing PHP errors. Does no one else find that as important a priority as me??

nevergone’s picture

Status: Needs review » Reviewed & tested by the community

I'm install pure Drupal 8, and ran two tests: http://drupal.org/node/987472#comment-5290702 and create node with random hungarian language text. The patch is works well.

catch’s picture

FileSize
3.44 KB

Uploading just the test so the bot can show the failures on this issue.

Status: Reviewed & tested by the community » Needs work

The last submitted patch, test-only.patch, failed testing.

catch’s picture

Status: Needs work » Reviewed & tested by the community

Back to RTBC, haven't reviewed the patch yet.

catch’s picture

Status: Reviewed & tested by the community » Needs work
+++ b/core/modules/search/search.testundefined
@@ -1623,6 +1623,82 @@ class SearchExcerptTestCase extends DrupalUnitTestCase {
 /**
+ * @see http://drupal.org/node/987472
+ * Tests the search_excerpt() function does not truncate or corrupt multi-byte sequences

I think we can skip the @see, git blame works fine for this and the test is self-documenting enough.

+++ b/core/modules/search/search.testundefined
@@ -1623,6 +1623,82 @@ class SearchExcerptTestCase extends DrupalUnitTestCase {
+
+    $numTests = 25;  // 10 tests is usually sufficient to cause the bug several times.
+
+    $node = array(

Comments should go before the line rather than at the end.

I'm not sure why we need to create 25 nodes and make random combinations - can't we make one node that always triggers the bug? Creating 25 nodes is not fast.

Also is there a function that creates the search excerpt that we can test directly (or factor out so it's possible to do so) - then we wouldn't need to create nodes at all.

+++ b/core/modules/search/search.testundefined
@@ -1623,6 +1623,82 @@ class SearchExcerptTestCase extends DrupalUnitTestCase {
+    // Each node body has 10-25 "words" consisting of a multi byte entity above and a space.

Exceeds 80 chars.

+++ b/core/modules/search/search.testundefined
@@ -1623,6 +1623,82 @@ class SearchExcerptTestCase extends DrupalUnitTestCase {
 /**
+ * @see http://drupal.org/node/987472
+ * Tests the search_excerpt() function does not truncate or corrupt multi-byte sequences
...
+    // Create $numTests nodes containing random multibyte encodings and a keyword.
+    // Each node body has 10-25 "words" consisting of a multi byte entity above and a space.

I don't think we need the @see here. git blame works fine for that.

+++ b/core/modules/search/search.testundefined
@@ -1623,6 +1623,82 @@ class SearchExcerptTestCase extends DrupalUnitTestCase {
+

Inline comments should go above the line, not afterwards.

+++ b/core/modules/search/search.testundefined
@@ -1623,6 +1623,82 @@ class SearchExcerptTestCase extends DrupalUnitTestCase {
+    // Create $numTests nodes containing random multibyte encodings and a keyword.
+    // Each node body has 10-25 "words" consisting of a multi byte entity above and a space.

Comment exceeds 80 chars.

19 days to next Drupal core point release.

damien_vancouver’s picture

Also is there a function that creates the search excerpt that we can test directly (or factor out so it's possible to do so) - then we wouldn't need to create nodes at all.

I did actually write a simple one line test like you describe, back in #39.

    $text = "<div class=\"field field-name-body field-type-text-with-summary field-label-hidden\"><div class=\"field-items\"><div class=\"field-item even\" property=\"content:encoded\"><p>123456789 HTMLTest +123456789+‘  +‘  +‘  +‘  +12345678      +‘  +‘  +‘   ‘</p>\n</div></div></div> ";
    $result = search_excerpt('HTMLTest', $text);
    $this->assertFalse(empty($result),  'Rendered Multi-byte HTML encodings are not corrupted in search excerpts');
I'm not sure why we need to create 25 nodes and make random combinations - can't we make one node that always triggers the bug? Creating 25 nodes is not fast.

I came up with the more general test and the 25 nodes approach in order that we could test the entire search system. The idea was, it would create nodes likely to trigger these kind of bugs, then test the entire search system for this class of error. A bug anywhere in any of the functions that truncated a multibyte sequence would have caused this test to fail given enough runs. I ran it hundreds of times here and it didn't fail again after this issue's fix... so I don't think any other errors exist. But that was the rationale behind that approach.

Anyhow: I've fixed the previous patch as per your notes in #47. Attached as: drupal-search_multibyte_error_with_long_test-987472-48.patch.

I also re-rolled the short one line test and attached it with just the test failing: drupal-search_multibyte_error_short_test_only-987472-48.patch and again with the fix as: drupal-search_multibyte_error_with_short_test-987472-48.patch.

We could go with that one if we prefer the short test.

andyboutte’s picture

subscribe

WhiplashInfo’s picture

Version: 8.x-dev » 7.10
Issue tags: +check_plain, +search.module, +htmlspecialchars(), +bootstrap.inc
FileSize
44.21 KB

Maybe I'm posting in a wrong thread, and if so - can you please tell me?
I have had an issue that I reported in the Swedish Drupal Group. I was told that the problem was in the search module and was provided with a link to this thread.

I run into this error
Warning: htmlspecialchars(): Invalid multibyte sequence in argument i check_plain() (rad 1552 av D:\Documents\My Web Sites\Drupal 7.2 Developer release1\includes\bootstrap.inc).
PDOException: i dblog_watchdog() (rad 157 av D:\Documents\My Web Sites\Drupal 7.2 Developer release1\modules\dblog\dblog.module).

The error occured in a Bitnami Drupal 7.10 installation, and in a WebMatrix Drupal installation.
I applied thedrupal-search_multibyte_error_with_long_test-987472-48 from #48, which worked like a charm, but the error warning still exist. So the patch didn't solved my problem.

So - two different installations of Drupal 7.10 ended up in errors
Thanks / Tomas

catch’s picture

Version: 7.10 » 8.x-dev
Status: Needs review » Reviewed & tested by the community
Issue tags: -check_plain, -search.module, -htmlspecialchars(), -bootstrap.inc +Needs backport to D7

@WhiplashInfo: Please don't change the version of issues without a good reason, nor add lots of new tags.

@damien_vancouver: that short test looks great to me, I'm going to mark the drupal-search_multibyte_error_with_short_test-987472-48.patch version back to RTBC, will commit in 3-4 days if not further objections from anyone.

damien_vancouver’s picture

@catch - Great news! As soon as it gets committed I'll write a 7.x backport and attach here.

@WhiplashInfo, and anyone else running Drupal 7.10...

I've attached a patched Drupal 7.10 search.module to this post. To fix the problem:

  • download the attached drupal-search_multibyte_error_patched_7.10_search.module-987472-52.txt.
  • Rename that downloaded file to search.module
  • Rename (backup) your existing modules/search/search.module to search.module.bak
  • Place the new search.module in the modules/search directory and test again - the error should be gone.

This issue will hopefully be fixed in the Drupal 7.11 release and beyond.

WhiplashInfo’s picture

FileSize
52.54 KB

@ damien_vancouver

Thanks for your patch! I tried it out , but sorry to say whithout success. Still the same error, but a little bit different. I'll attach a png-file.

I found a solution to the problem! It is described in this node and I have implemented the suggestion - and it worked like a charm!
http://drupal.org/node/1090290#comment-4485492

Thanks / Tomas

damien_vancouver’s picture

@WhiplashInfo,

That screenshot provides some good info!

You are seeing the same error as this thread! BUT it's triggered in a different spot. The second line of the error shows that the module causing is is the "dblog" module. The second line of error reads:

PDOException: dblog_watchdog() line 157 of modules/dblog/dblog.module

That's why this patch didn't fix it - it's a different bug and not in search.module.

So... Can you please create a new issue for your problem? Set Project to Drupal Core, Version to 8.x-dev, Component to dblog.module. Post the full text of your error on there, and describe what you are doing when the error is triggered.

Then reply here with a link to that issue (for anyone else with your same problem who finds this thread first).

Then I will help help you reproduce and fix it on the new thread.

You might have to add some debugging to your dblog.module to help me reproduce it.. but we should be able to get it fixed too!

Also, we might find right away that dblog is not the real problem, but some other contrib module that is logging a bad string. We will update the issue once we know. Once we properly log the string that is causing that error, we can figure out where it came from and which part of Drupal is failing.

catch’s picture

Version: 8.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)

Alright. Committed/pushed to 8.x.

damien_vancouver’s picture

@catch thanks for your help getting it to committed!

Attached are the 7.x backports. [Edit: I screwed up the filenames, and put -59, they should be -56 ]

  • drupal-search_multibyte_error_short_test_only_7.x-987472-59.patch: just the test (so we can see it fail against 7.x)
  • drupal-search_multibyte_error_7.x_backport-987472-59.patch The test and the fix, for committing to 7.x-dev
chx’s picture

Status: Needs review » Reviewed & tested by the community

And a huge round of cheers for damien_vancouver! Very nice work.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Oh, no! I am SO sorry, I thought I already committed this ages ago!

Well, fixed now. :) Committed and pushed to 7.x. Yaayyyy!!! GREAT job, Damien!!

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Version: 7.x-dev » 7.12

It seems that is not fixed in 7.12. I have the problem and the patch #11 does not work.
You can check my problem there: http://drupal.org/node/1457266

xjm’s picture

Version: 7.12 » 7.x-dev
chugot’s picture

Hello,

I have the same issue on drupal version 7.12, but only on my production server, not in local or dev server.
I applied the patch, but that doesn't solve the bug.

The prod server is an IIS, and others are Apache server. Perhaps it is the cause of my problem.

Anyone can help me ?

thanks

gunwald’s picture

I have the same problem with Drupal core 7.19. What am I supposed to do?