With several workarounds in place, I managed to use an argument in a view for the fulltext search on body and title of some nodes, based on the database search backend.
This will randomly work and not work.
After some debugging I found the issue being, that the key(s) are normalized in some way. Specifically, they are lower-cased. However in the database, they are stored as appearing in Text.
This leads to upper-cased words never being found.

So if the some Text is "Hello world". "Hello" will be stored capitalized as is and "world" will be lower-case in the database. When searching for "Hello" now (or "hello" or "HeLlO"), the search key is lower-cased and compared to the upper-case "Hello" in the database which yields no results.

How to proceed?
Should items be indexed lower-cased?
Should search strings NOT be normalized?
Should the DB-comparision be made case-insensitive by some magic way?

Comments

ChristianAdamski created an issue. See original summary.

christianadamski’s picture

And the reason is the function splitKeys() in the Database.php backend.
Specifically $proc = Unicode::strtolower(trim($keys))

Is this supposed to happen?

christianadamski’s picture

The indexItem() function in the Database search_api_backend should simply be taken out the back and shot.

The variable $value alone is used so often throughout the function and overwritten with completely distinct contents, it's just not funny anymore. There is even a foreach($value as $token) which then uses $value as variable being overwritten within the foreach.

This is just bad :( I have no idea how to rewrite this.

christianadamski’s picture

Status: Active » Needs review
StatusFileSize
new715 bytes

I took the low road now and just added what I needed. This whole function is messy. It's probably mostly fine from a result point of view, but it is just bad bad code.

I simply do not understand enough of what is happening or why, to rewrite this myself.

Status: Needs review » Needs work

The last submitted patch, 4: 2557291-4-lowercase-words-before-indexing.patch, failed testing.

drunken monkey’s picture

And the reason is the function splitKeys() in the Database.php backend.
Specifically $proc = Unicode::strtolower(trim($keys))

Is this supposed to happen?

It's complicated. In the D7 version, words are lowercased both at indexing and search time, which obviously works better but doesn't allow at all for case-sensitive search (in the rare case someone actually wants that).

The problem that made this lowercasing necessary is the database's collation, which is case-insensitive by default in Drupal. So, if we'd try to insert entries for both "word" and "Word" into the database, we'd get an exception. Therefore, we just lowercased all words (they have to be lowercased while being "collected" into the array, by the way, your code will probably just try to insert "word" twice, which is of course even worse), fixing the problem.

Later, to solve other problems (namely searches for accented characters, umlauts, etc.), we added code that switches the fulltext tables to a binary (i.e., case-sensitive) collation for some DBMSs, which makes the workaround unnecessary – for those DBMSs. But since there is no uniform way to change the collation, we cannot do this for all possible DBMSs, and we still need the workaround.

The proper solution would probably be:

  1. Add the workaround to the indexing process as well, as we did in D7.
  2. Make it more explicit for which DBMSs we apply the collation fix.
  3. Only apply the lowercasing workaround (during indexing and search) for the other DBMSs.

Hm, but I just see that we already lowercase all indexed words, in line 1032. Why doesn't that work for you?

The indexItem() function in the Database search_api_backend should simply be taken out the back and shot.

While I certainly agree with you that the code in that method is pretty ugly now that I look at it (it's also five years old, for the most part, so I don't really feel very defensive about it), I'd still appreciate it if you could voice that a bit more politely. But I understand, having to understand someone else's ugly code can get very frustrating very quickly. I've had to write more Views integration code than anyone should have to in a lifetime, I know. ;)
The good news is that we have pretty great test coverage at this point, so a re-write shouldn't be too hard to pull of, if you feel inclined. (Just create a new issue for it – maybe someone else will feel motivated at some point, and since the code is already there and working it might even be "Novice" level.)

christianadamski’s picture

At least here locally, the terms are by default indexed case sensitive. If I additionally enable the "Ignore case" processor, they are indexed lowercase.

I have no chance on working on this right now. I hope to join the search sprint at Barcelona.

drunken monkey’s picture

Priority: Normal » Major
Issue tags: +Needs tests

OK, I could reproduce the error now, indexing really doesn't work correctly. The problem is that the lowercasing is only done on the array key (to avoid indexing two tokens which are equal when compared case-insensitive), not on the value actually indexed. So, all in all, not very useful.

So, we should definitely fix this, and also add tests (i.e., disable the "Ignore case" processor in the DB backend test and see whether the search is then really case-sensitive).

I also created #2564837: Fix Database::indexItem() – it's really got a lot of problems.

berdir’s picture

drunken monkey’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests
StatusFileSize
new22.57 KB

Wasn't particularly easy, but I think I've solved it now. See the attached patch.
As you can see, I went all-out and formalized the various DBMS-specific peculiarities into a set of classes and an interface for dealing with them. (We can later also use this for other restrictions, cf. #2636872: Move DBMS-specific restrictions to the DBMS compatibility classes.) Pretty complicated, but on the other hand also clean solution, I think. Let's see what the test bot says (especially regarding SQLite and Postgres).

drunken monkey’s picture

Huh. That looks almost as if Postgres is case-sensitive by default, too. Trying that out now … (I don't have Postgres installed locally, I have to confess.)

Also, I realized that while we handle different case correctly now, we also need to transliterate the values, so "à" and "a" are treated as the same if the collation does that, too.

Status: Needs review » Needs work

The last submitted patch, 11: 2557291-11--case_sensitive_fulltext_searching.patch, failed testing.

drunken monkey’s picture

Status: Needs work » Needs review
StatusFileSize
new2.48 KB
new6.02 KB
new25.03 KB

Seems like that was a bug left-over from #2361329: Fix behavior of "is not" conditions on multi-valued fields – quite a glaring one, actually – which, for some reason, wasn't caught by our test suite. It seems there is kind of a glitch in the databases (all DBMSs, actually, not just SQLite) which allows the incorrect query to still run, even though the result is then of course not the intended one. Learned something again. (If anyone is interested, I can provide the complete explanation.)

Anyways, the attached patch fixes tests for me on both SQLite and MySQL.
Oh, let's also do a tests-only patch to see what we've uncovered here.

berdir’s picture

Is there a reason that you're not using the binary => TRUE flag, to force a case sensitivity for all backends?

Not introduced here but wondering if that utf8 conversion isn't going to cause trouble in case you actually have such characters (typical example are emojis) in node bodies or so?

borisson_’s picture

Status: Needs review » Needs work

I don't understand most of the things that happen in this patch yet, I'll have to look at it a few more times before I fully get it. I did find one small typo. I also wonder if we should work with a backend specific service instead of the switch in ::getDbmsCompatibilityHandler.

  1. +++ b/search_api_db/src/DatabaseCompatibility/CaseSensitiveDatabase.php
    @@ -0,0 +1,22 @@
    +   * @inheritDoc
    

    Should be {@inheritdoc}

  2. +++ b/search_api_db/src/Plugin/search_api/backend/Database.php
    @@ -153,10 +171,37 @@ class Database extends BackendPluginBase {
       /**
    +   * Retrieves a DBMS compatibility handler for the given database type.
    ...
    +   */
    +  protected function getDbmsCompatibilityHandler($databaseType) {
    

    Is there a way we can replace this behavior with a backend specific service? That looks more elegant compared to switch case. #2306083: Backend specific services now have a standard way to override?

drunken monkey’s picture

Thanks for both your reviews, great to get some more eyes on this!

Is there a reason that you're not using the binary => TRUE flag, to force a case sensitivity for all backends?

Oh. Yes, there is: I didn't know it existed. Thanks a lot for pointing that out!
(I actually had a Core issue for (more or less) that, #1144644: Enable specifying the collation when creating a database table, but seems like this was implemented in a different one.)
But it's not possible to change the charset there, too, right?
So, since two of our three supported DBMSs apparently use binary comparison by default, and we need a custom override in any case for the third, this doesn't really get us anything at all – but still good to have, I guess, and especially good to know.

Not introduced here but wondering if that utf8 conversion isn't going to cause trouble in case you actually have such characters (typical example are emojis) in node bodies or so?

Ah, that's where that change came from, of course … Thanks for pointing that out!
However, while I initially (after reading your comment) also thought this would probably mean it'd fail, I added a test for that and it stilled passed for me (see attached patch (couldn't find any Core test for this, so had to come up with my own way of getting a Unicode character in there)).

Is there a way we can replace this behavior with a backend specific service? That looks more elegant compared to switch case. #2306083: Backend specific services now have a standard way to override?

Huh. Good question. That would indeed look a lot cleaner, I guess. And would also certainly make it easier to extend for other DBMS drivers in contrib. However, I don't really have any clue how to do it. I understand how to make a service backend-overridable (we've already done that for the server task manager), but I have no idea how to then actually override it. It seems a bit like that would only be possible on a per-site basis, not for a certain database driver? But I might be mistaken, I really don't understand that text.

Anyways, I'd say we just create a follow-up issue for figuring that out once this issue gets committed. It's only the DB backend, not the Search API itself, so cleanness and extensibility aren't that critical.

(Two interdiffs here since I noticed a mistake myself before posting and was too lazy to create the interdiff anew.)

berdir’s picture

No, there's no API for the collation stuff, since AFAIK that's pretty much mysql-specific, other backends wouldn't be able to implement that ( I think, maybe I'm wrong).

Discussed the database backend thingy a bit with chx, here's the relevant extract:

(18:25:40) chx: berdir: flood is your friend
(18:25:51) chx: flood:
(18:25:52) chx: class: Drupal\Core\Flood\DatabaseBackend
(18:25:53) chx: arguments: ['@database', '@request_stack']
(18:25:55) chx: tags:
(18:25:56) chx: - { name: backend_overridable }
(18:25:58) chx: berdir: that's what you need to do
(18:26:59) berdir: chx: and then as a module, you could provide mysql.flood, postgresql.flood etc. ?
(18:27:07) berdir: by default, I mean
(18:27:12) chx: berdir: yes!~
(18:27:17) chx: berdir: pgsql.flood

So you define search_api_database.something and then provide mysql.search_api_database.something, pgsql.search_api_database.something etc. And based on the default_backend configuration (which defaults to the db driver name), it will pick the corresponding service.

drunken monkey’s picture

Thanks a lot for the information, very good to know!
With that, it really seems quite easy, and the attached patch also worked fine for me in the tests locally. (And since they use MySQL, which is the one that fails horribly without customizations, that's a pretty good sign I'd say.)

However, one complication I had to (quite ugly) work around is that we allow people to select a database other than the default for their search server – which, in this case, means the global service would be for the wrong database.
I used a work-around for that now, which just uses the generic handler in that case, which should result in a gentle degradation in all cases except MySQL, where it would fail horribly. However, as I'm not quite sure whether anyone ever uses that feature anyways, I think this should be good enough for now, until someone complains. (Otherwise, we would have needed the method with the switch statement after all, largely defeating the purpose of having a service in the first place.)

borisson_’s picture

I found a couple of really small documentation things that can be updated. Otherwise this is great and if you disagree with these, I'll RTBC.

  1. +++ b/search_api_db/src/DatabaseCompatibility/DatabaseInterface.php
    @@ -0,0 +1,65 @@
    +interface DatabaseInterface {
    

    While the namespace of this is really good, I'm a little bit unsure about the interface name.
    DatabaseCompatibilityHandlerInterface would be a better name, I think.

  2. +++ b/search_api_db/src/DatabaseCompatibility/MySql.php
    @@ -0,0 +1,40 @@
    +/**
    + * Represents a MySQL or MariaDB database.
    + */
    

    This should also work for percona.

    Represents a MySQL, MariaDB or Percona Server database.

  3. +++ b/search_api_db/src/DatabaseCompatibility/MySql.php
    @@ -0,0 +1,40 @@
    +    // The Drupal MySQL integration defaults to using a 4-byte-per-character
    

    Should this mention MySQL or all the databases, like in the docblock?

borisson_’s picture

Status: Needs review » Needs work

back to NW per #21.

borisson_’s picture

Status: Needs work » Needs review
StatusFileSize
new4.56 KB
new26.41 KB

Fixed my own remarks from #21.

Status: Needs review » Needs work

The last submitted patch, 23: fulltext_search_in_db-2557291-23.patch, failed testing.

The last submitted patch, 23: fulltext_search_in_db-2557291-23.patch, failed testing.

borisson_’s picture

Status: Needs work » Needs review
StatusFileSize
new26.62 KB

I did an oopsie with creating the patch, the interdiff was correct though.

drunken monkey’s picture

StatusFileSize
new517 bytes
new26.6 KB

While the namespace of this is really good, I'm a little bit unsure about the interface name. DatabaseCompatibilityHandlerInterface would be a better name, I think.

Well, that's really up for debate, see #2027221: [policy] Revisit class naming standards. Our current standards would probably require even more in the class name, while the saner standards now discussed seem to more veer to not duplicate the namespace in the class name.
But I'm fine with the change, it does look a lot clearer. And as long as the standards are a complete mess anyways, we might as well go with what feels right.

Should we then also rename the classes, to at least include the "Handler" suffix there, too?

This should also work for percona.

I think that's making it too verbose. Let's just use "Represents a MySQL-based database." I just found it unfair to always just mention MySQL, while MariaDB is now the "real" MySQL. ;)

Status: Needs review » Needs work

The last submitted patch, 27: 2557291-27--case_sensitive_fulltext_searching.patch, failed testing.

The last submitted patch, 27: 2557291-27--case_sensitive_fulltext_searching.patch, failed testing.

drunken monkey’s picture

StatusFileSize
new26.6 KB

Re-rollin', rollin', rollin' …

drunken monkey’s picture

Status: Needs work » Needs review
borisson_’s picture

Status: Needs review » Reviewed & tested by the community

Wooo!

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed

Excellent, finally!
Committed.
Thanks a lot again for all your help, everyone!

  • drunken monkey committed 65d2c50 on 8.x-1.x
    Issue #2557291 by drunken monkey, borisson_, ChristianAdamski: Fixed...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.