I think part of the speed problem with Drupal.org's issue queues is that full text searches query across multiple tables and UNION the results. The query can be simplified by merging the full text indexes into one table, with an added field column.
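
For illustration, here is a rough sketch (in Drupal 7 schema format) of what such a merged table could look like. The column names follow the discussion below; the exact types, especially for item_id, are assumptions rather than the patch's literal definition:

  // Hypothetical schema for the combined fulltext table: one table with a
  // field_name column instead of one table per fulltext field.
  $schema = array(
    'fields' => array(
      'item_id' => array(
        'description' => 'The primary identifier of the item.',
        'not null' => TRUE,
        'type' => 'varchar',
        'length' => 50,
      ),
      'field_name' => array(
        'description' => "Name of the item's field.",
        'not null' => TRUE,
        'type' => 'varchar',
        'length' => 255,
      ),
      'word' => array(
        'description' => 'The word that was indexed.',
        'not null' => TRUE,
        'type' => 'varchar',
        'length' => 50,
      ),
      'score' => array(
        'description' => 'The score for the word in this field of the item.',
        'not null' => TRUE,
        'type' => 'float',
      ),
    ),
    'primary key' => array('item_id', 'field_name', 'word'),
  );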


Comments

drumm’s picture

Status: Active » Needs review
Parent issue: » #2136119: Text search on issue queues is slow and sometimes WSODs
FileSize
21.58 KB

And here is the first draft of a patch to do that.

drunken monkey’s picture

Status: Needs review » Needs work
FileSize
21.42 KB

Hm, you're right, that does look way simpler and more performant than the previous code! Thanks a lot for the suggestion, and the huge patch!

I don't know what's up with the test bot; the branch test seemed to fail inexplicably, and now nothing is getting tested. However, a local test failed with the following two exceptions:

Undefined variable: db	Notice	service.inc	467	SearchApiDbService->fieldsUpdated()	Exception
…
SearchApiException: Unknown index test_index. in SearchApiDbService->search() (line 984 of /home/thomas/Dokumente/Programmieren/PHP/Drupal/search_api_db/service.inc).	Uncaught exception	service.inc	984	SearchApiDbService->search()

So this definitely still needs some work. Did the tests pass for you locally?

Other remarks regarding the patch:

  1. +++ b/search_api_db.install
    @@ -108,3 +108,100 @@ function search_api_db_update_7103() {
    +  $servers_query->innerJoin('search_api_index', 'i', 'i.server = s.machine_name');
    +  // todo from this table?
    +  $servers_query->fields('s', array('options'));
    

    What do you mean by that?

  2. +++ b/search_api_db.install
    @@ -108,3 +108,100 @@ function search_api_db_update_7103() {
    +      // Add new table.
    +      $text_table = 'search_api_db_' . $server->machine_name . '_text';
    

    This table name might already be taken. Please inline the code from SearchApiDbService::findFreeTable() (or copy it to a helper function in .install); see the sketch after this list.
    Or rather, if the table already exists, find the field that is using it and move that field's table (also correcting its settings in the server options, of course).

  3. +++ b/search_api_db.install
    @@ -108,3 +108,100 @@ function search_api_db_update_7103() {
    +      // Migrate data.
    +      foreach ($options['indexes'] as $fields) {
    +        foreach ($fields as $name => $field) {
    +          if (search_api_is_text_type($field['type'])) {
    +            $query = db_select($field['table'], 't')
    +              ->fields('t', array('item_id', 'word', 'score'));
    +            $query->addExpression(':field_name', 'field_name', array(':field_name' => $name));
    +            db_insert($text_table)->from($query)->execute();
    +            // todo
    +            // db_drop_table($field['table']);
    +          }
    +        }
    +      }
    

    This will not work at all if there is more than one index on the server. You have to do $fields = $options['indexes'][$server->machine_name] instead of the outer loop.

    Also, you should clean up the server's options, deleting the "table" keys for all the fulltext fields, I guess.

  4. +++ b/search_api_db.install
    @@ -108,3 +108,100 @@ function search_api_db_update_7103() {
    +          'field_name' => array(
    +            'description' => "Name of the item's field.",
    +            'not null' => TRUE,
    +            'type' => 'varchar',
    +            'length' => 32,
    +          ),
    

    This won't be nearly enough. Keep in mind that these aren't Field API field names, but Search API field identifiers, and they can have arbitrary length!
    Increasing this to 255 will probably be enough. However, the safest approach would be to also truncate values to the column length, and to remember that truncation for the field (the same as we currently do for the table).

  5. +++ b/search_api_db.install
    @@ -108,3 +108,100 @@ function search_api_db_update_7103() {
    +      db_query("ALTER TABLE {{$text_table}} CONVERT TO CHARACTER SET 'utf8' COLLATE 'utf8_bin'");
    

    This is MySQL-specific and will fail for all (or most) other DBMSs.
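
To illustrate point 2, a collision check along the lines of SearchApiDbService::findFreeTable() might look roughly like this (a sketch only – the helper name is made up, and the module's actual implementation may differ in details):

  /**
   * Finds an unused table name by appending a numeric suffix.
   *
   * Hypothetical helper for the update function, modeled on
   * SearchApiDbService::findFreeTable(). Keeps the name within the
   * 62-character table name limit.
   */
  function _search_api_db_update_find_free_table($prefix) {
    $base = $table = substr($prefix, 0, 62);
    $i = 0;
    while (db_table_exists($table)) {
      $suffix = '_' . ++$i;
      $table = substr($base, 0, 62 - strlen($suffix)) . $suffix;
    }
    return $table;
  }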

Attached is a patch fixing some of these issues. Unfortunately, I don't have any more time right now to look at this, I'll have to get back to it on Monday.

One question already, though: for use on d.o, would you require this to be committed, or even in a new release? I'll be on vacation for two months starting after Christmas, so I wouldn't want to make such a large change (and, especially, a stable release containing it) without then being there to fix any problems people might encounter.

In any case, thanks for your work! Really sounds like a great improvement!
(Note, however, that you could already achieve virtually the same result by using the "Aggregated fields" data alteration to combine all the fulltext fields into a single one. The only drawback would be that boosting would no longer work. (Though using several aggregated fields, one per boost value you want to assign, would work, and would probably still be an improvement, depending on how many fields you have now.) In any case, having this improvement in the module itself for all its users would of course be much better.)

drumm’s picture

Status: Needs work » Needs review
FileSize
23.8 KB

I believe I fixed the indexing issues, which I neglected to test before.

1. That was a note to myself which has since been followed up on; it was accidentally left behind.

2. The rest of the code assumes a static table name, which is convenient, although it did mean passing $index around a lot more. Maybe the suffix could be _search_api_text instead of _text? I suppose that would risk approaching the table name length limit.

3. Done.

4. Let me know if you want to stick to number suffixes, but I thought I'd try this: if the field name is too long, it uses the md5 of the field name (see the sketch after this list). I've seen this approach before in Views' block deltas. It makes the names human-unreadable, but they never have to be stored in any configuration management.

5. Thanks for fixing!
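
For reference, the md5 fallback from point 4 could look roughly like this (a sketch; the helper name and the 255-character limit are assumptions based on the discussion above, not necessarily what the patch uses):

  /**
   * Shortens a field identifier to fit the column's length limit.
   *
   * Hypothetical sketch: identifiers longer than the limit are replaced
   * by their md5 hash (32 characters), the same trick Views uses for
   * over-long block deltas. The hash never needs to be stored in any
   * exported configuration.
   */
  function _search_api_db_shorten_field_name($name, $max_length = 255) {
    return strlen($name) > $max_length ? md5($name) : $name;
  }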

I did consider aggregation, but didn't get as far as thinking about an aggregation of aggregations to use boosts. No one has spent time really optimizing boosts on Drupal.org, but they are useful to have. We actually only have title, body, and comment body being indexed for full text. Regardless, I'm not eager to do a full reindex, which takes multiple days.

We can of course deploy patches to Drupal.org, but only when really needed. In this case, I would want to be sure any future cleanup is as minimal as possible. #2 and #4 could change where/how things are stored, so they would be good to nail down. Running -dev for this is perfectly fine, we're already on a pre-1.1 dev.

(For testbot, I don't see a way to trigger a retest of the branch in the Drupal.org UI. Pushing a commit should trigger it, and hopefully get normal results.)

jthorson’s picture

I've queued up the 7.x-1.x branch test, which again failed with 7 passes, 7 fails, and 5 exceptions.

I then force-queued the above patch, which will certainly fail with at least the same failures listed above ... but the test run should at least flag if there are any new failures introduced by the patch.

Status: Needs review » Needs work

The last submitted patch, 3: 2155767.diff, failed testing.

drunken monkey’s picture

Status: Needs work » Needs review
FileSize
27.24 KB

2. The rest of the code assumes a static table name, which is convenient, although it did mean passing $index around a lot more. Maybe the suffix could be _search_api_text instead of _text? I suppose that would risk approaching the table name length limit.

Yes, of course – that's why I added the last sentence: instead of making the fulltext table name variable, check for collisions this one time when updating, and change the table name of the field that causes the collision instead. That check should definitely happen, since a field with the machine name "text" doesn't seem so far-fetched.
Also, now that you mention it: this could indeed already exceed the table name length limit of 62 characters, since the index machine name can have up to 50 characters. It's rather unlikely that anyone will actually have an index with such a long machine name, but I guess we'll have to make the table name variable in any case. Maybe just store it in a special options[indexes][INDEX][#fulltext] key.
Hm, I guess the same goes for the search_api_db_INDEX table introduced in #2083079: Store single-valued fields in a single, denormalized table, now that I think about it … Damn. Stupid DBMS creators … Maybe we should just document for now that indexes used with this module should have machine names with at most 43 characters, and only fix it if enough people complain? Hardly elegant, but still … What makes this change worse than the previous one, of course, is that it also affects existing indexes, so ones with overly long machine names would suddenly break when updating the module.

Complicated. What do you (or others) think?

3. Done.

Thanks!
However, since you save the server options right away after each index but load them fresh from the old result set for the next one, the changes get overwritten and only those for the last index are kept.

This should be fixed in the attached patch. It also makes removeIndex() and preDelete() remove the new fulltext table for the indexes, and fixes some bugs in fieldsUpdated().

4. Let me know if you want to stick to number suffixes, but I thought I'd try this: if the field name is too long, it uses the md5 of the field name. I've seen this approach before in Views' block deltas. It makes the names human-unreadable, but they never have to be stored in any configuration management.

Ah, yes, great idea! Normally I'm rather against this, as it complicates debugging (by a lot, sometimes – Facet API is really awful there, in my opinion), but for the extremely rare cases where field names exceed 255 characters, I think we can let it slip. Also, as you say, it won't be stored in any configuration.

I did consider aggregation, but didn't get as far as thinking about an aggregation of aggregations to use boosts. No one has spent time really optimizing boosts on Drupal.org, but they are useful to have. We actually only have title, body, and comment body being indexed for full text. Regardless, I'm not eager to do a full reindex, which takes multiple days.

I think, since you are indexing large amounts of text, you should probably give the title a rather high boost or it won't have much effect. You could also emulate this by increasing the stored boost for the words contained in the title – but that would probably be rather hard to do cleanly. And since you already have only three tables, decreasing them to two will likely not make a huge difference.
However, you could probably avoid the re-indexing by combining the two tables for body and comment body manually into a single new one, much like the update hook in this patch. Then just manually adapt the server and index options accordingly and you should be set. Rather hack-y, but comparatively safe and probably warranted.
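
A rough sketch of what that manual merge could look like, mirroring the migration code quoted earlier in this issue (all table and field names here are placeholders, not Drupal.org's actual ones):

  // Hypothetical one-off merge of two per-field word-index tables into a
  // combined text table, followed by dropping the old tables.
  $tables = array(
    'body:value' => 'search_api_db_example_body',
    'comments:comment_body:value' => 'search_api_db_example_comment_body',
  );
  foreach ($tables as $field_name => $table) {
    $query = db_select($table, 't')
      ->fields('t', array('item_id', 'word', 'score'));
    $query->addExpression(':name', 'field_name', array(':name' => $field_name));
    db_insert('search_api_db_example_text')->from($query)->execute();
    db_drop_table($table);
  }
  // The server and index options then have to be adapted by hand so the
  // fulltext fields point at the combined table.
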
In any case, as said, doing this with a proper patch to this module itself is of course the much better way, and since we are already pretty far along here, the manual workaround would probably take longer at this point anyway.

We can of course deploy patches to Drupal.org, but only when really needed. In this case, I would want to be sure any future cleanup is as minimal as possible. #2 and #4 could change where/how things are stored, so they would be good to nail down. Running -dev for this is perfectly fine, we're already on a pre-1.1 dev.

OK, as long as dev is fine I'm pretty sure we'll be able to commit this before I'm away.
And, in any case, I don't think the storage part would have to be changed, so even using a patch shouldn't cause too many problems.

I've queued up the 7.x-1.x branch test, which again failed with 7 passes, 7 fails, and 5 exceptions.

I then force-queued the above patch, which will certainly fail with at least the same failures listed above ... but the test run should at least flag if there are any new failures introduced by the patch.

Thanks for your help here!
However, the branch (still) fails by not being able to enable the dependency modules. Looking at the log, it seems that the testing framework is suddenly unaware that the module has any dependencies at all. (Which is weird, since it's working perfectly fine for, e.g., the Search API module and its external dependencies.) It would be great if you could look into this!

Status: Needs review » Needs work

The last submitted patch, 6: 2155767-6--single_text_table.patch, failed testing.

The last submitted patch, 1: 2155767.diff, failed testing.

drunken monkey’s picture

Status: Needs work » Needs review
FileSize
29.97 KB

OK, there are still a few things wrong with indexing/deleting (caused by the removal of the 'table' key for fulltext fields) and with complex searches (t.* now has one column too many).
The attached patch fixes most of these issues and should bring the test bot down to two fails, no exceptions. One complex query is still failing – sorry, I ran out of time.
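
For context, the t.* problem comes from code selecting every column of the text table; with the new field_name column present, such places now have to name their columns explicitly. A sketch of the general shape (illustrative only, not the actual call site; $text_table is a placeholder):

  // With the combined table, "SELECT t.*" also returns field_name, which
  // breaks queries that expect exactly the old column set. Naming the
  // columns explicitly avoids that.
  $text_table = 'search_api_db_example_text';
  $query = db_select($text_table, 't');
  $query->fields('t', array('item_id', 'word', 'score'));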

Consequently, I wasn't able to fix the potential table name clashes in the update function either. Also, as said, it would be good to get some other opinions on this.
In any case, that should be the only (major) thing preventing this patch from being committed. I haven't done a complete code review yet, though, so there may be other, minor issues. But those should be quick to fix, and once all tests pass there can hardly be anything major left.

Hm, thinking about it, maybe we should keep the 'table' key after all and just set it to the search_api_db_INDEX_text table – that way, we'd save a lot of special-case code, and it would probably also be easier to later go back and fix the table name length issue.

Status: Needs review » Needs work

The last submitted patch, 9: 2155767-9--single_text_table.patch, failed testing.

drumm’s picture

Issue tags: +Drupal.org 7.1
FileSize
23.64 KB

This patch checks for table name collisions, and stores the resulting name in the same spot as the old table name.

I haven't looked into the 2 fails in the last patch, so I expect they may happen again.

drumm’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 11: 2155767.diff, failed testing.

drumm’s picture

Status: Needs work » Needs review
FileSize
24.17 KB

It looks like the test failure is due to not including a key in the $fields array passed to createKeysQuery(), in the call from createFilterCondition().

drunken monkey’s picture

Sorry it took me a few days to review the patch completely, but now I can say that it already looks very good! Thanks a lot again for your hard work!

There is one larger problem, though, described in #2147361-7: Use Database::getConnection() instead of setDb() magic – the previous 1.1 release had a rather serious regression which makes the module unusable for people who want to use a different database. A rather stupid mistake, but I'm just completely overworked at the moment.
Anyway, since I need to create a new release containing that fix before committing the patch for this issue, I committed the other issue right away and re-rolled the patch here to fit on top of it. I'm very sorry for the inconvenience, but I hope you agree that it was necessary and still the best course of action.
To help you review my changes, I've attached two interdiffs against your patch which exclude the changes from the other issue. I hope that helps.

As you will see, apart from minor code style edits (mostly a matter of taste, but I like to keep this consistent in my modules), there was only one real issue:

+++ b/search_api_db.install
@@ -108,3 +108,125 @@ function search_api_db_update_7103() {
+      while (db_table_exists($text_table)) {

This and other places in the update function just use the db_*() functions, without taking the connection selected for the server in question into account.
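
A sketch of what taking the server's connection into account could look like – this assumes the server options store the connection as a 'key:target' string, which should be verified against the module's actual code:

  // Hypothetical: run the update's schema operations on the connection
  // configured for the server instead of using the default db_*() wrappers.
  // $schema is the table definition for the new fulltext table.
  list($key, $target) = explode(':', $server->options['database'], 2);
  $connection = Database::getConnection($target, $key);
  if (!$connection->schema()->tableExists($text_table)) {
    $connection->schema()->createTable($text_table, $schema);
  }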

However, one thing I also worry about is the use of search_api_get_datasource_controller() within the update function. From my understanding, the usage of any API functions there should be kept to an absolute minimum. While it seems to work fine, and I'm pretty sure it will also be future-proof, it might still be better to take the ID field definition from one of the index's existing tables and use that for the new table. But I'll leave that up to you to decide.

drumm’s picture

However, one thing I also worry about is the use of search_api_get_datasource_controller() within the update function.

Since the API is stable, I think this is okay. If the function does change in a breaking way while you still want to support running search_api_db_update_7104(), then we would want to either inline the old function or update the code to work with the new one. Since point releases are relatively frequent (thanks!), you can always recommend updating incrementally, if it becomes an issue.

I did think about getting the item_id column definition from an existing table, but I think that code would be roughly as complex. I don't think the change is worth it for now.

All the changes look good, will follow up again when I've run the update on a fresh dev site.

drumm’s picture

Status: Needs review » Reviewed & tested by the community

The upgrade went well on a Drupal.org dev site, along with some general testing. I think this is good.

drunken monkey’s picture

Great to hear! Committed.
Thanks again for your hard work, and sorry for the delay!

drunken monkey’s picture

Status: Reviewed & tested by the community » Fixed
drumm’s picture

The deployment of this on Drupal.org went okay. The snag we ran into was that adding the keys required more space than we have on our tmp partition. We were able to push through that.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.