Hi,

I have an index configured as follows:

* Datasources: content, "news" content type in all languages (EN, FR, NL)
* Server: main DB
* Do not index immediately
* Fields: rendered HTML output, title, status, uid, tid, publication_date
* Processors: entity status, html filter, ignore case, tokenizer and transliteration

I have more than 4400 items indexed so far. During indexing I get a recurring error; please see the log in the attached file.

Do you have any clue how I could fix that?

Regards,

Wanjee

Comments

wanjee’s picture

It seems related to some whitespace not being stripped.

In my tokens I get ' ' (1 space) but also '     ' (5 spaces), or strings like 'data' and 'data ' (trailing space).
MySQL strips trailing spaces when creating the key, which results in the duplicate entry errors in my log.
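
To make the collision concrete, here is a minimal Python sketch. It does not talk to MySQL; `pad_space_key` and `find_collisions` are hypothetical helpers that only approximate a comparison in which trailing spaces are ignored, as described above.

```python
from collections import defaultdict


def pad_space_key(token: str) -> str:
    """Approximate a key comparison that ignores trailing spaces (assumption)."""
    return token.rstrip(" ")


def find_collisions(tokens):
    """Group distinct tokens that would end up with the same key value."""
    groups = defaultdict(list)
    for token in tokens:
        key = pad_space_key(token)
        if token not in groups[key]:
            groups[key].append(token)
    # Only keys shared by more than one distinct token are a problem.
    return {key: members for key, members in groups.items() if len(members) > 1}


tokens = ["data", "data ", " ", "     ", "news"]
collisions = find_collisions(tokens)
for key, members in collisions.items():
    print(repr(key), "<-", members)
```

With these sample tokens, 'data' and 'data ' share one key, and the 1-space and 5-space tokens both collapse to the empty string, so a unique key over (item, field, token) would reject the second insert of each pair.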

I don't understand why those are not stripped by the tokenizer processor. The "Whitespace characters" setting is left empty to rely on the defaults, which are assumed to "be suitable for most languages with a Latin alphabet".

So the new question is: why would the tokenizer not strip whitespace characters?

wanjee’s picture

Digging even more...

Occurrences of (&)nbsp in the markup are not properly preprocessed. They are replaced by whitespace somewhere in preprocessing but are never stripped from the tokens. This leads to the duplicate entries.

wanjee’s picture

I created a custom processor that replaces nbsp occurrences with a standard space. It's quite a horrible solution, but at least my indexing works.
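
The idea behind that workaround can be sketched as follows. This is Python for illustration only, not the actual Drupal processor: `ascii_tokenize` is a hypothetical tokenizer that splits on ASCII whitespace only (an assumption about where the problem comes from), and `normalize_nbsp` is the hypothetical replacement step.

```python
import html
import re

NBSP = "\u00a0"  # the character that the &nbsp; entity decodes to


def ascii_tokenize(text):
    """Split on ASCII whitespace only; a non-breaking space is NOT a separator here."""
    return [t for t in re.split(r"[ \t\r\n]+", text) if t]


def normalize_nbsp(text):
    """Decode HTML entities, then turn every non-breaking space into a plain space."""
    return html.unescape(text).replace(NBSP, " ")


raw = "data&nbsp; breaking&nbsp;news"

without_fix = ascii_tokenize(html.unescape(raw))
with_fix = ascii_tokenize(normalize_nbsp(raw))

print(without_fix)  # tokens still carry the non-breaking space
print(with_fix)     # clean tokens
```

Without the normalization step the first token keeps a trailing non-breaking space and the last two words stay glued together; with it, the text splits into 'data', 'breaking' and 'news'.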

I'm not closing this issue yet, so that this can perhaps be properly addressed.

drunken monkey’s picture

Do you maybe have the "HTML filter" processor enabled and placed after the "Tokenizer" processor? That would explain it, I guess.
Otherwise, I'd probably need your complete index configuration in order to attempt to reproduce this behavior. (Needless to say, I've never encountered this – and no-one else has reported it, either.)

However, apart from all that – if the database will trim whitespace characters when creating the key, we should probably do the same in the Database backend regardless of all other considerations. So, even if this is a configuration problem on your side (or we find a bug somehow that causes this and fix that), this is something that shouldn't result in an error in any case.
The attached patch would implement this – but, as said, we should first try to find anything else that might be wrong here.
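
As a rough illustration of that idea (not the patch itself), a backend could trim each token before using it as a key and merge the entries that become identical. The Python sketch below uses a hypothetical `merge_trimmed_tokens` helper; summing the scores of merged tokens and dropping whitespace-only tokens are assumptions made for the example.

```python
def merge_trimmed_tokens(scored_tokens):
    """Trim each token and merge entries whose trimmed text is identical.

    scored_tokens: iterable of (token, score) pairs.
    Returns a dict mapping trimmed token -> combined score.
    """
    merged = {}
    for token, score in scored_tokens:
        key = token.strip()
        if not key:
            # A token made only of whitespace carries no searchable text.
            continue
        # Assumption: scores of tokens that collapse together are added up.
        merged[key] = merged.get(key, 0.0) + score
    return merged


rows = merge_trimmed_tokens(
    [("data", 1.0), ("data ", 0.5), ("     ", 1.0), (" news", 2.0)]
)
print(rows)
```

Because every key is trimmed before it reaches the database, two rows can no longer differ only by surrounding whitespace, whatever the server does when comparing keys.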

Also, I couldn't get the test to fail without the fix. It seems that, at least on my DB server (MariaDB 10.1.29-1), trailing spaces are not stripped by the database. Could you maybe try running the test (without the fix) on your server and see if that fails?

Status: Needs review » Needs work
drunken monkey’s picture

Component: General code » Database backend
Status: Needs work » Needs review

Ah, excellent, it failed on the test bot after all! Even better, then …

borisson_’s picture

Status: Needs review » Reviewed & tested by the community

Great work @drunken monkey!

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

Good to hear, thanks a lot for reviewing!
Committed.

@wanjee: If this doesn't fix the problem for you, please re-open the issue with more information!

  • drunken monkey committed e008593 on 8.x-1.x
    Issue #2926733 by drunken monkey, borisson_: Fixed indexing of leading/...
borisson_’s picture

Status: Reviewed & tested by the community » Fixed

Closing this issue.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.