I think part of the speed problems with Drupal.org's issue queues is querying across multiple tables and UNIONing the result for full text searches. The query can be simplified by merging the full text indexes into one table, with an added field column.
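The gain can be sketched with a toy example (illustrative schema and names only, not the module's actual one): instead of one word-index table per fulltext field combined via UNION, a single table with an added field column answers the same query directly, and per-field boosts survive as per-row scores.

```python
# Toy model of the proposed single fulltext table. Table and column
# names are invented for illustration; they are not the module's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE search_api_db_index_text (
        item_id    INTEGER,
        field_name TEXT,   -- which fulltext field the word came from
        word       TEXT,
        score      REAL    -- already includes the field's boost
    )
""")
rows = [
    (1, "title", "drupal", 10.0),  # title gets a high boost
    (1, "body",  "drupal",  1.0),
    (2, "body",  "search",  1.0),
]
conn.executemany("INSERT INTO search_api_db_index_text VALUES (?,?,?,?)", rows)

# One query instead of a UNION over per-field tables:
result = conn.execute("""
    SELECT item_id, SUM(score) AS total
    FROM search_api_db_index_text
    WHERE word = 'drupal'
    GROUP BY item_id
    ORDER BY total DESC
""").fetchall()
print(result)  # item 1 matched in both title and body -> [(1, 11.0)]
```

Item 1 scores the sum of its title and body hits, so boosting still works without any cross-table UNION.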
| Comment | File | Size | Author |
|---|---|---|---|
| #15 | 2155767-15--single_fulltext_table.patch | 26.2 KB | drunken monkey |
Comments
Comment #1
drumm

And here is the first draft of a patch to do that.
Comment #2
drunken monkey

Hm, you're right, that does look way simpler and more performant than the previous code! Thanks a lot for the suggestion, and the huge patch!
I don't know what's up with the test bot, the branch test seemed to fail inexplicably and now nothing is getting tested. However, a local test failed with the following two exceptions:
So this definitely still needs some work. Did the tests pass for you locally?
Other remarks regarding the patch:
What do you mean by that?
This table name might already be taken. Please inline the code from `SearchApiDbService::findFreeTable()` (or copy it to a helper function in `.install`). Or rather, if the table already exists, find the field using it and move its table (also correcting its settings in the server options, of course).
This will not work at all if there is more than one index on the server. You have to do `$fields = $options['indexes'][$server->machine_name]` instead of the outer loop. Also, you should clean up the server's options, deleting the "table" keys for all the fulltext fields, I guess.
This won't be nearly enough. Keep in mind that these aren't Field API field names, but Search API field identifiers, and they can have arbitrary length!
Increasing this to 255 will probably be enough. However, the safest way to do this would be to also add a truncation to the column length, and to remember that truncation for the field (same as we do currently for the table).
This is MySQL-specific and will fail for all (or most) other DBMSs.
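The collision check mentioned above (inlining `SearchApiDbService::findFreeTable()`) might look roughly like this, as a hypothetical Python re-implementation of the idea; the function name, suffix scheme, and 62-character limit are illustrative assumptions, not the module's exact code:

```python
# Hypothetical sketch of a collision-avoiding table name search:
# try the base name, then append numeric suffixes until an unused
# name is found, always staying within the DBMS name length limit.
def find_free_table(base, existing, max_len=62):
    candidate = base[:max_len]
    i = 1
    while candidate in existing:
        suffix = f"_{i}"
        candidate = base[:max_len - len(suffix)] + suffix
        i += 1
    return candidate

taken = {"search_api_db_index_text", "search_api_db_index_text_1"}
print(find_free_table("search_api_db_index_text", taken))
# -> search_api_db_index_text_2
```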
Attached is a patch fixing some of these issues. Unfortunately, I don't have any more time right now to look at this, I'll have to get back to it on Monday.
One question already, though: for use on d.o, would you require this to be committed, or even in a new release? I'll be on vacation for two months starting after Christmas, so I wouldn't want to make such a large change (and, especially, a stable release containing it) without then being there to fix any problems people might encounter.
In any case, thanks for your work! Really sounds like a great improvement!
(Note, however, that you could already achieve virtually the same by using the "Aggregated fields" data alteration to combine all the fulltext fields into a single one. The only drawback of that would be that boosting wouldn't work anymore, then. (Though using several aggregated fields for the different boosts you want to assign would work, and probably still be an improvement, depending on how many fields you have now.) In any case, though, having this improvement there for all users of this module would of course be much better.)
Comment #3
drumm

I believe I fixed the indexing issues, which I neglected to test before.
1. That was a note to myself which had already been followed up on; it was left behind by mistake.
2. The rest of the code assumes a static table name, which is convenient, although it did cause a lot more passing around of `$index`. Maybe the suffix could be `_search_api_text` instead of `_text`? I suppose that would risk approaching the name length limit.
3. Done.
4. Let me know if you want to stick to numeric suffixes, but I thought I'd try this: if the field name is too long, it uses the md5 of the field name. I've seen this before for Views' block deltas. It makes the names unreadable to humans, but they don't have to be stored in any configuration management.
5. Thanks for fixing!
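The md5 fallback described in point 4 can be sketched like this; the function name and the 63-character cutoff are illustrative assumptions, not the patch's actual values:

```python
# Sketch of the hash fallback for over-long identifiers: keep the
# readable name when it fits, otherwise substitute its md5 hex digest
# (always 32 characters, so it fits any sane column name limit).
import hashlib

def column_name(field_id, max_len=63):
    if len(field_id) <= max_len:
        return field_id
    return hashlib.md5(field_id.encode()).hexdigest()

print(column_name("title"))                 # short names stay readable
long_id = "comments:" + "x" * 100
print(len(column_name(long_id)))            # 32
```

The trade-off drumm and drunken monkey discuss is visible here: the mapping is deterministic (so nothing needs to be stored in configuration), but a hashed name tells a debugging developer nothing about which field it belongs to.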
I did consider aggregation, but didn't get as far as thinking about an aggregation of aggregations to use boosts. No one has spent time really optimizing boosts on Drupal.org, but they are useful to have. We actually only have title, body, and comment body being indexed for full text. Regardless, I'm not eager to do a full reindex, which takes multiple days.
We can of course deploy patches to Drupal.org, but only when really needed. In this case, I would want to be sure any future cleanup is as minimal as possible. #2 and #4 could change where/how things are stored, so they would be good to nail down. Running -dev for this is perfectly fine, we're already on a pre-1.1 dev.
(For testbot, I don't see a way to trigger a retest of the branch in the Drupal.org UI. Pushing a commit should trigger it, and hopefully get normal results.)
Comment #4
jthorson

I've queued up the 7.x-1.x branch test, which again failed with 7 passes, 7 fails, and 5 exceptions.
I then force-queued the above patch, which will certainly fail with at least the same failures listed above ... but the test run should at least flag if there are any new failures introduced by the patch.
Comment #6
drunken monkey

Yes, of course, that's why I added the last sentence: instead of making the fulltext table name variable, check for collisions this one time when updating, and instead change the table name of the field that causes the collision. That should be done, though, since a field with the machine name "text" doesn't seem so far-fetched.
Also, now that you mention it: this could indeed already exceed the table name length limit of 62 characters, since the index machine name can have up to 50 characters. It's rather unlikely that anyone will actually have an index with such a long machine name, but I guess we'll have to make the table name variable in any case. Maybe just store it in a special `options[indexes][INDEX][#fulltext]` options key.
Hm, I guess the same goes for the `search_api_db_INDEX` table introduced in #2083079: Store single-valued fields in a single, denormalized table, now that I think about it … Damn. Stupid DBMS creators … Maybe we should just document for now that indexes used with this module should have machine names of at most 43 characters, and only fix it if enough people complain? Hardly elegant, but still … The part where this change would be worse than the previous one, of course, is that it also affects existing indexes, so ones with overly long machine names would suddenly break when updating the module.
Complicated. What do you (or others) think?
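The 43-character figure follows from simple arithmetic, assuming the `search_api_db_` prefix and `_text` suffix discussed in this thread:

```python
# Why 43? The full table name "search_api_db_" + machine_name + "_text"
# must fit into the 62-character limit quoted above.
prefix, suffix, limit = "search_api_db_", "_text", 62
max_machine_name = limit - len(prefix) - len(suffix)
print(max_machine_name)  # -> 43
```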
Thanks!
However, when you save the server options right away after each index, and load it fresh from the old result set for the next index, then the changes will get overwritten and only be kept for the last index.
Should be fixed in the attached patch. It also fixes `removeIndex()` and `preDelete()` to also remove the new fulltext table for the indexes, and some bugs in `fieldsUpdated()`.
Ah, yes, great idea! Normally I'm rather against this, as it complicates debugging (by a lot, sometimes – Facet API is really awful there, in my opinion), but for the extremely rare cases where field names exceed 255 characters, I think we can let it slip. Also, as you say, it won't be stored in any configuration.
I think, since you are indexing large amounts of text, you should probably give the title a rather high boost or it won't have much effect. You could, however, also emulate this by just increasing the stored boost for the words which are contained in the title – but that would probably be rather hard to do cleanly. And when you already have only three tables, decreasing them to two will likely not make a huge difference.
However, you could probably avoid the re-indexing by combining the two tables for body and comment body manually into a single new one, much like the update hook in this patch. Then just manually adapt the server and index options accordingly and you should be set. Rather hack-y, but comparatively safe and probably warranted.
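That manual merge could look roughly like the following; SQLite and the table names are used purely for illustration (the real tables live in the site's configured database, and the actual column layout is the module's):

```python
# Rough sketch of merging two per-field word tables into one combined
# table without re-indexing: copy each table's rows across, tagging
# every row with the field it came from. Names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
for t in ("idx_body", "idx_comments"):
    conn.execute(f"CREATE TABLE {t} (item_id INTEGER, word TEXT, score REAL)")
conn.execute("INSERT INTO idx_body VALUES (1, 'drupal', 1.0)")
conn.execute("INSERT INTO idx_comments VALUES (1, 'patch', 1.0)")

conn.execute("""
    CREATE TABLE idx_text (item_id INTEGER, field_name TEXT,
                           word TEXT, score REAL)
""")
# One INSERT ... SELECT per old table; no content is re-processed:
conn.execute("INSERT INTO idx_text "
             "SELECT item_id, 'body', word, score FROM idx_body")
conn.execute("INSERT INTO idx_text "
             "SELECT item_id, 'comments', word, score FROM idx_comments")

count = conn.execute("SELECT COUNT(*) FROM idx_text").fetchone()[0]
print(count)  # -> 2 rows carried over
```

After the copy, the old tables can be dropped and the server/index options pointed at the new table, which is essentially what the patch's update hook does.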
In any case, as said, doing this with a proper patch to this module itself is of course the much better way, and since I think we are pretty far here already it would probably take longer now, too.
OK, as long as dev is fine I'm pretty sure we'll be able to commit this before I'm away.
And, in any case, I don't think the storage part would have to be changed, so even using a patch shouldn't make too many problems.
Thanks for your help here!
However, the branch (still) fails by not being able to enable the dependency modules. Looking at the log, it seems that the testing framework is suddenly unaware that the module has any dependencies at all. (Which is weird, since it's working perfectly fine for, e.g., the Search API module and its external dependencies.) It would be great if you could look into this!
Comment #9
drunken monkey

OK, there are still a few things wrong with indexing/deleting (caused by the removal of the `'table'` key for fulltext fields) and complex searches (`t.*` now has one column too many).
The attached patch fixes most of these issues and should bring the test bot down to two fails, no exceptions. One complex query is still failing – sorry, I ran out of time.
Therefore I also wasn't able to fix the potential clashes in the update function as well. Also, as said, it would be good to get some other opinions on this.
In any case, that should be the only (major) thing preventing this patch from being committed. I haven't done a complete code review yet, though, so there may be other, minor issues. But those should be quick to fix, and since all tests pass there can hardly be anything major.
Hm, thinking about it, maybe we should keep the `'table'` key after all and just set it to the `search_api_db_INDEX_text` table – that way, we'd save a lot of special-case code, and it would probably also be easier to later go back and fix the table name length issue.
Comment #11
drumm

This patch checks for table name collisions, and stores that name in the same spot as the old table name.
I haven't looked into the 2 fails in the last patch, so I expect they may happen again.
Comment #12
drumm

Comment #14

drumm

It looks like the test failure is due to not including a key in the `$fields` array passed to `createKeysQuery()`, in the call from `createFilterCondition()`.

Comment #15
drunken monkey

Sorry it took me a few days to review the patch completely, but now I can say that it already looks very good! Thanks a lot again for your hard work!
There is one larger problem though, described in #2147361-7: Use Database::getConnection() instead of setDb() magic – the previous 1.1 release had a rather serious regression which will make the module unusable for people who want to use a different database. A rather stupid mistake, but I'm just completely overworked at the moment.
Anyway, since I need to create a new release containing that fix before committing the patch for this issue, I committed the other issue right away and re-rolled the patch here to fit on top of it. I'm very sorry for the inconvenience, but I hope you agree that it was necessary and still the best course of action.
To help you review my changes, I've attached two interdiffs against your patch that don't include the changes from the other issue. I hope that helps somewhat.
As you will see, apart from minor code style edits (mostly a matter of taste, but I like to keep this consistent in my modules), there was only one real issue:
This and other places in the update function just use the `db_*()` functions, without taking the connection selected for the server in question into account.
However, one thing I also worry about is the use of `search_api_get_datasource_controller()` within the update function. From my understanding, the usage of any API functions there should be kept to an absolute minimum. While it seems to work fine, and I'm pretty sure will also be future-proof, it might still be better to take the ID field definition from one of the index's existing tables and use that for the new table. But I'll leave that up to you to decide.

Comment #16
drumm

Since the API is stable, I think this is okay. If the function does change in a breaking way while you still want to support running `search_api_db_update_7104()`, then we would want to either inline the old function or update it to work with the new one. Since point releases are relatively frequent (thanks!), you can always recommend updating incrementally if it becomes an issue.
I did think about getting the `item_id` column definition from an existing table, but I think that code would be roughly as complex. I don't think the change is worth it for now.
All the changes look good; I'll follow up again when I've run the update on a fresh dev site.
Comment #17
drumm

The upgrade went well for a Drupal.org dev site, and some general testing. I think this is good.
Comment #18
drunken monkey

Great to hear! Committed.
Thanks again for your hard work, and sorry for the delay!
Comment #19
drunken monkey

Comment #20

drumm

The deployment of this on Drupal.org went okay. The snag we ran into was that adding the keys wanted more space than we have on our tmp partition. We were able to push through that.