Currently, normal-length words in some Asian languages trigger this warning (for example, Thai, which takes three bytes per character in UTF-8):

    if (strlen($word) > 50) {
      watchdog('search_api_db', 'An overlong word (more than 50 characters) was encountered while indexing: %word.<br />' .
          'Database search servers currently cannot index such words correctly – the word was therefore trimmed to the allowed length.',
          array('%word' => $word), WATCHDOG_WARNING);
      $word = self::mbStrcut($word, 0, 50);
    }

Would this line still work correctly if we change the strlen() call to drupal_strlen()?

Comments

mfb created an issue.

drunken monkey’s picture

Category: Bug report » Support request

Would this line still work correctly if we change the strlen() call to drupal_strlen()?

No, not at all. The problem is that our underlying database table uses a column of size 50, so there is really no way to correctly store such a long word there.
What we could do, though, I think, is make the maximum word length a setting (either hidden variable or in the server settings). I'd gladly accept such a patch (plus port to the D8 version, which has the same problem), but unfortunately I don't have the time to work on this myself.

You can however easily just "hack" the module and change "50" everywhere it occurs in the context of fulltext fields to a higher value.

mfb’s picture

Category: Support request » Bug report

Let me give an example of the bug.

This word has 25 characters and can easily fit in a MySQL column that holds 50 UTF-8 characters: คุณพูดภาษาอังกฤษหรือเปล่า but the module is running strlen() on the word, which returns 75 bytes (because this word has 3 bytes per character).

So it's not an issue of 50 being too small or database limitation, it's an unnecessary warning being triggered because the wrong function is being called. The warning should test character length not byte length.
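To make the byte/character difference concrete, here is a minimal standalone sketch using the sample word above. It assumes plain PHP with the mbstring extension; `mb_strlen()` stands in for `drupal_strlen()`, which is not available outside Drupal, and `$max_length` is a hypothetical name for the 50-character limit.

```php
<?php
// Sample word from the comment above: 25 Unicode characters,
// each encoded as 3 bytes in UTF-8.
$word = 'คุณพูดภาษาอังกฤษหรือเปล่า';

// Hypothetical name for the 50-character column limit.
$max_length = 50;

// strlen() counts bytes, so a multi-byte word looks longer than it is.
$byte_length = strlen($word);

// mb_strlen() counts characters; used here as a stand-in for
// drupal_strlen(), which only exists inside Drupal.
$char_length = mb_strlen($word, 'UTF-8');

// The byte-based check fires the warning, the character-based one does not.
$warns_with_bytes = $byte_length > $max_length;
$warns_with_chars = $char_length > $max_length;

printf("bytes: %d, characters: %d\n", $byte_length, $char_length);
printf("byte check warns: %s, character check warns: %s\n",
  var_export($warns_with_bytes, TRUE),
  var_export($warns_with_chars, TRUE));
```

With this word, only the byte-based comparison exceeds the limit, which is why the warning shows up for a word that fits in the column.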

mfb’s picture

Actually I just dug in a bit more and realized that the mbStrcut() method is wrong too. It's cutting by byte count rather than by character count. This method could simply call drupal_substr() which cuts a string based on character count.
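A rough sketch of the truncation difference, again in plain PHP with mbstring rather than Drupal itself: `mb_strcut()` is used as an assumed approximation of what the module's `mbStrcut()` method does (cut by bytes), and `mb_substr()` stands in for `drupal_substr()` (cut by characters). The 60-character test word is made up for illustration.

```php
<?php
// Hypothetical test word: 60 Thai characters, 3 bytes each in UTF-8.
$word = str_repeat('ก', 60);

// Hypothetical name for the 50-character column limit.
$max_length = 50;

// mb_strcut() treats the length as a byte count and backs off to the
// previous character boundary, so far fewer than 50 characters survive.
$cut_by_bytes = mb_strcut($word, 0, $max_length, 'UTF-8');

// mb_substr() treats the length as a character count; used here as a
// stand-in for drupal_substr(), which only exists inside Drupal.
$cut_by_chars = mb_substr($word, 0, $max_length, 'UTF-8');

printf("cut by bytes: %d characters, %d bytes\n",
  mb_strlen($cut_by_bytes, 'UTF-8'), strlen($cut_by_bytes));
printf("cut by characters: %d characters, %d bytes\n",
  mb_strlen($cut_by_chars, 'UTF-8'), strlen($cut_by_chars));
```

Cutting by characters keeps the full 50 characters that the column can hold, while cutting by bytes discards most of the word.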

mfb’s picture

Status: Active » Needs review
Attached file (new): 922 bytes

I didn't change the mbStrcut() method as it says it's for cutting by byte count; just avoided calling it here instead.

drunken monkey’s picture

Attached files (new): 854 bytes, 1.63 KB

Oh, seems I got that wrong. I'd always assumed the character limit set in the database would be a byte count, but you're right, it's actually a character count.
Given that, you clearly have the right solution here: we should indeed use the character count functions, not the byte count ones. However, it seems you missed another place in the same method. Please see and test my attached revised patch.

drunken monkey’s picture

A test for this would also be great.
And we should apply the fix for string values, too.

drunken monkey’s picture

I want to test the "tests_only" patch, too, please!

The last submitted patch, 7: 2616804-7--database_character_limit_unicode.patch, failed testing.

Status: Needs review » Needs work

The last submitted patch, 8: 2616804-7--database_character_limit_unicode--tests_only.patch, failed testing.

arefen’s picture

I have the same problem. Patch #7 didn't solve the problem.

drunken monkey’s picture

Apparently our default parsing in the Database backend split that "word" we were testing with into six separate words – and since I cannot read that script, I don't even know if that's correct or another bug. (Firefox and PhpStorm even fail me completely in rendering the string – only my terminal is able to.)

So, I think I'll use characters I can actually read (and have verified are not considered whitespace/punctuation by the backend) for testing.
Revised patch attached, please test/review!

@arefen: Are you sure the word isn't actually longer than 50 (multi-byte) characters? If it is, my explanation in #2 applies.

Status: Needs review » Needs work

The last submitted patch, 12: 2616804-12--database_character_limit_unicode.patch, failed testing.

mfb’s picture

You should clarify the character and byte count of the string you are using, where the code comment says "The word has 25 Unicode characters but 75 bytes."

The Thai I used above is a sample sentence I found online, not a string I ran into in the real world error messages. I'm not sure how word splitting works in Thai.. maybe it's possible to split some words out of a sentence..?

drunken monkey’s picture

Status: Needs work » Needs review
Attached files (new): 2.17 KB, 4.27 KB

You're right, forgot to update that comment.
And while I explicitly checked that the string wouldn't be split further, it seems its lowercase variant still contained whitespace or punctuation. Changed the string to all-lowercase, now it should really work.

  • drunken monkey committed f749d72 on 7.x-1.x
    Issue #2616804 by drunken monkey, mfb: Fixed indexing of words with...

drunken monkey’s picture

Project: Search API Database Search » Search API
Version: 7.x-1.x-dev » 8.x-1.x-dev
Component: Code » Database backend
Status: Needs review » Patch (to be ported)

Since the test bot seems to be fine and no-one else complained, either: committed.
Thanks again for your help here!

Unless I'm mistaken, the D8 version has exactly the same problem.

drunken monkey’s picture

Status: Patch (to be ported) » Needs review
Attached files (new): 4.76 KB, 8.94 KB

Ported patch attached, including an additional test for string indexing.

drunken monkey’s picture

Status: Needs review » Fixed

Since the test bot agrees and no-one complained: committed.
Thanks again, everyone!

  • drunken monkey committed 6051636 on 8.x-1.x
    Issue #2616804 by drunken monkey: Fixed indexing of words with multi-...

mfb’s picture

\o/

If drupal.org supported utf8mb4 I'd put some emoji here :)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

drunken monkey’s picture

Status: Closed (fixed) » Needs review
Attached file (new): 1.79 KB

Just noticed that, while we did write a regression test method for this, it seems we somehow forgot to actually call it in our tests.
Even worse, for me locally, the tests failed when enabling it, so either the fix was wrong the first time, or we hit a regression – neither option a great prospect.
We have to enable this test and make it pass. (First step: check whether it would work with the module version from six months ago.)

Status: Needs review » Needs work

The last submitted patch, 26: 2616804-26--enable_regression_test.patch, failed testing.

mfb’s picture

completion failure - is it a bug in the test?

mfb’s picture

Status: Needs work » Needs review
Attached file (new): 955 bytes

drunken monkey’s picture

Attached files (new): 2.62 KB, 2.8 KB

Ah, thanks a lot for this fix!
However, seems the whole adding and removing of the body field is nonsense anyways, from some mistake several months back. Let's just get rid of that as well.

borisson_’s picture

Status: Needs review » Reviewed & tested by the community

Looks great!

  • drunken monkey committed 4ded835 on 8.x-1.x
    Follow-up to #2616804 by drunken monkey, mfb: Added backend tests for...

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

Good to hear, thanks for reviewing!
Committed.
Thanks again for your help, too, mfb!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

sukh.singh’s picture

Will there be any patch for Drupal 7?

tdurocher’s picture

This does not appear to be in the recent 7.x dev release. The bug does apply to D7, does it not?

salag’s picture

Should the request for a drupal 7 patch be posted elsewhere? Is one in the works?
Thanks for your work on this.

irodriguez’s picture

This problem seems to still exist today.

An overlong word (more than 50 characters) was encountered while indexing: eyjkzwj1zyi6zmfsc2usimrpc2fibgvkijpmywxzzswiznvsbhnjcmvlbii6dhj1zswia2v5ym9hcmqionrydwusinjhdglvijowlju2mjusimfkyxb0axzlumf0aw8iomzhbhnllcjydg1wijowlcjwcm94esi6imjlc3qilcjobhnrdwfsaxrpzxmionrydwusinnwbgfzaci6zmfsc2usimxpdmuiomzhbhnllcjsaxzlug9zaxrpb25pzmzzzxqiojeymcwic3dmijoily9yzwxlyxnlcy5mbg93cgxhewvylm9yzy83ljeumi9mbg93cgxhewvylnn3ziisinn3zkhscyi6ii8vcmvszwfzzxmuzmxvd3bsyxllci5vcmcvny4xljivzmxvd3bsyxllcmhscy5zd2yilcjzcgvlzhmiolswlji1ldaunswxldeunswyxswidg9vbhrpcci6dhj1zswibw91c2vvdxruaw1lb3v0ijo1mdawlcj2b2x1bwuiojesimvycm9ycyi6wyiilcjwawrlbybsb2fkaw5nigfib3j0zwqilcjozxr3b3jrigvycm9yiiwivmlkzw8gbm90ihbyb3blcmx5igvuy29kzwqilcjwawrlbybmawxlig5vdcbmb3vuzcisilvuc3vwcg9ydgvkihzpzgvviiwiu2tpbibub3qgzm91bmqilcjtv0ygzmlszsbub3qgzm91bmqilcjtdwj0axrszxmgbm90igzvdw5kiiwisw52ywxpzcbsve1qifvstcisilvuc3vwcg9ydgvkihzpzgvvigzvcm1hdc4gvhj5igluc3rhbgxpbmcgqwrvymugrmxhc2guil0simvycm9yvxjscyi6wyiilciilciilciilciilciilciilciilciilciilcjodhrwoi8vz2v0lmfkb2jllmnvbs9mbgfzahbsyxllci8ixswicgxhewxpc3qioltdlcjobhngaxgiomzhbhnllcjkaxnhymxlsw5saw5lijpmywxzzswiy2xpcci6eyjhdwrpbyi6dhj1zswiyxvkaw9pbmx5ijp0cnvllcjzb3vyy2vzijpbeyj0exblijoiyxvkaw8vbxaziiwic3jjijoiahr0cdovl3dly2tidwzmywxvlmnvbs9hc3nldhmvcg9ky2fzdgvylzeyntkvmjaxof8wov8yml8xmju5xzcyote2xzmzntqubxazin1dlcj0axrszsi6ijkvmjivmtgtienhc3rszxmgqwxvbmcgdghlifjoaw5lin0simfuywx5dgljcyi6ilvbltc2njq0nzutncisimhvc3ruyw1lijoid2vja2j1zmzhbg8uy29tiiwib3jpz2luijoiahr0cdovl3dly2tidwzmywxvlmnvbs9wb2rjyxn0cy9hywetdhjhdmvslxnob3c.
Since database search servers currently cannot index words of more than 50 characters, the word was truncated for indexing. If this should not be a single word, please make sure the "Tokenizer" processor is enabled and configured correctly for index Database Search Index.

sivaprasadc’s picture

I'm using the Drupal 7 version (7.x-1.25). It seems the issue still exists.

This is the message in the watchdog log.

An overlong word (more than 50 characters) was encountered while indexing: ag9tcnz46rerustx2gbhtzyfaqh8yeqpjswaiqtaw3khruptwdxss.
Database search servers currently cannot index such words correctly – the word was therefore trimmed to the allowed length.

Any help would be appreciated. Thanks in advance.

mfb’s picture

This bug report was about words that were not actually over 50 characters but still could not be indexed.

If a word really is over 50 characters, then that is not a bug: you should expect this warning.