Hi,
I have an index configured as follows:
* Datasources: content, "news" content type in all languages (EN, FR, NL)
* Server: main DB
* Do not index immediately
* Fields: rendered HTML output, title, status, uid, tid, publication_date
* Processors: entity status, html filter, ignore case, tokenizer and transliteration
More than 4400 items have been indexed so far. During indexing I get a recurring error; please see the log in the attached file.
Do you have any idea how I could fix it?
Regards,
Wanjee
| Comment | File | Size | Author |
|---|---|---|---|
| #6 | 2926733-6--db_backend_index_trailing_space--tests_only.patch | 2.36 KB | drunken monkey |
| #6 | 2926733-6--db_backend_index_trailing_space.patch | 3.22 KB | drunken monkey |
| #2 | issue_2926733_log.txt | 160.74 KB | wanjee |
Comments
Comment #2
wanjee commented
Comment #3
wanjee commented
It seems related to some whitespace not being stripped.
In my tokens I got ' ' (1 space) but also '     ' (5 spaces), as well as strings like 'data' and 'data ' (trailing space).
MySQL strips trailing spaces when creating the key, which results in the duplicate entry errors in my log.
I don't understand why those are not stripped by the Tokenizer processor. The "Whitespace characters" setting is left empty to rely on the defaults, which are assumed to "be suitable for most languages with a Latin alphabet".
So the new question is: why would the tokenizer not strip whitespace characters?
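The collision described above can be illustrated with a small Python sketch. It is only a model, not the Search API or MySQL code: `pad_space_key` and `find_collisions` are hypothetical helpers, and the model assumes a column collation that ignores trailing spaces when comparing strings (PAD SPACE behaviour), which is what makes 'data' and 'data ' count as the same key.

```python
def pad_space_key(token):
    """Approximate a PAD SPACE comparison key: trailing spaces are ignored.

    Assumption: the database column uses a collation with this behaviour.
    """
    return token.rstrip(" ")


def find_collisions(tokens):
    """Group distinct tokens that would share one key under pad_space_key."""
    groups = {}
    for token in tokens:
        groups.setdefault(pad_space_key(token), set()).add(token)
    # Keep only keys that more than one distinct token maps to.
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


# Tokens similar to the ones reported in this comment.
collisions = find_collisions(["data", "data ", " ", "     ", "news"])
print(collisions)
```

In this model, 'data' and 'data ' share one key, and so do the 1-space and 5-space tokens (both reduce to the empty string), so inserting both members of either pair under a unique key would be rejected as a duplicate.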
Comment #4
wanjee commented
Digging even more...
Occurrences of (&)nbsp in the markup are not properly preprocessed. They are replaced by whitespace somewhere in preprocessing but are never stripped out of the tokens. This leads to duplicate entries.
Comment #5
wanjee commented
I created a custom processor that replaces nbsp occurrences with a standard space. It's quite a horrible solution, but at least my indexing works.
I'm not closing this issue yet so that this can maybe be properly addressed.
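The idea behind that workaround can be sketched in Python. This is not the custom processor itself (which would be a PHP Search API processor and is not shown in this thread); `replace_nbsp` and `tokenize` are hypothetical names, and the tokenizer here is deliberately naive, splitting on ASCII spaces only, to show how a non-breaking space can survive inside a token.

```python
import html

NBSP = "\u00a0"  # the character that the &nbsp; entity decodes to


def replace_nbsp(text):
    """Decode HTML entities, then turn non-breaking spaces into plain spaces."""
    return html.unescape(text).replace(NBSP, " ")


def tokenize(text):
    """Naive tokenizer (assumption): split on ASCII spaces, drop empty tokens."""
    return [token for token in text.split(" ") if token]


raw = "data&nbsp; more&nbsp;data"

# Without the replacement, the non-breaking space stays glued to the tokens.
without_fix = tokenize(html.unescape(raw))

# With the replacement, the tokens come out clean.
with_fix = tokenize(replace_nbsp(raw))

print(without_fix)
print(with_fix)
```

Running the replacement before tokenizing mirrors the ordering concern in this thread: whatever converts markup into plain text has to run before the step that splits on whitespace.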
Comment #6
drunken monkey commented
Do you maybe have the "HTML filter" processor enabled and placed after the "Tokenizer" processor? That would explain it, I guess.
Otherwise, I'd probably need your complete index configuration in order to attempt to reproduce this behavior. (Needless to say, I've never encountered this – and no-one else has reported it, either.)
However, apart from all that – if the database will trim whitespace characters when creating the key, we should probably do the same in the Database backend regardless of all other considerations. So, even if this is a configuration problem on your side (or we find a bug somehow that causes this and fix that), this is something that shouldn't result in an error in any case.
The attached patch would implement this – but, as said, we should first try to find anything else that might be wrong here.
Also, I couldn't get the test to fail without the fix. It seems that, at least on my DB server (MariaDB 10.1.29-1), trailing spaces are not stripped by the database. Could you maybe try running the test (without the fix) on your server and see if that fails?
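As a rough illustration of trimming in the backend, here is a minimal Python sketch. The attached patch is not reproduced in this thread, so nothing below should be read as its actual contents: `normalize_tokens` is a hypothetical helper, and merging duplicates by summing their scores is an assumption made only for the example.

```python
def normalize_tokens(scored_tokens):
    """Trim each token, drop empty ones, and merge duplicates.

    Assumption: duplicate tokens are merged by summing their scores.
    """
    merged = {}
    for token, score in scored_tokens:
        key = token.strip()  # str.strip() also removes non-breaking spaces
        if not key:
            continue  # whitespace-only tokens carry no searchable text
        merged[key] = merged.get(key, 0.0) + score
    return merged


rows = normalize_tokens([
    ("data", 1.0),
    ("data ", 0.5),   # trailing space: would collide with "data" in the DB
    ("     ", 1.0),   # whitespace only: dropped
    ("news", 2.0),
])
print(rows)
```

Because every key is trimmed before it reaches the database, two tokens that differ only in surrounding whitespace can no longer produce a duplicate key error, whatever the processor configuration upstream.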
Comment #8
drunken monkey commented
Ah, excellent, it failed on the test bot after all! Even better, then …
Comment #9
borisson_ commented
Great work @drunken monkey!
Comment #10
drunken monkey commented
Good to hear, thanks a lot for reviewing!
Committed.
@wanjee: If this doesn't fix the problem for you, please re-open the issue with more information!
Comment #12
borisson_ commented
Closing this issue.