Hi.

I was wondering why my variable table had grown unusually large, over 2 MB, when I'm used to seeing less than 0.5 MB even on large sites.
The whole contents of the table are loaded on every page request, and for that reason they are cached in a single row in a cache bin. However, when that row becomes this large, it might not be usable by all cache backends. For example, memcache has a hardcoded 1 MB limit for a single cache entry. I remember seeing that the new cache sharding offers some aid here.

But this should be fixed at its origin, which is this module. A single variable is taking up over 0.5 MB of space: search_api_tasks

It consists of:

a:1:{s:20:"content_backend_solr";a:1:{s:21:"content_backend_index";a:19954:{i:0;s:13:"delete-366333";i:1;s:13:"delete-366335";i:2;s:13:"delete-366337";i:3;s:13:"delete-366339";i:4;s:13:"delete-366341";i:5;s:13:"delete-366343";i:6;s:13:"delete-366345";i:7;s:13:"delete-366347";i:8;s:13:"delete-366349";i:9;s:13:"delete-366351";i:10;s:13:"delete-366353";i:11;s:13:"delete-366355";i:12;s:13:"delete-366357";i:13;s:13:"delete-366359";i:14;s:13:"delete-366361";i:15;s:13:"delete-366363";i:16;s:13:"delete-366365";i:17;s:13:"delete-366233";i:18;s:13:"delete-366235";i:19;s:13:"delete-366237";i:20;s:13:"delete-366239";i:21;s:13:"delete-366241";i:22;s:13:"delete-366243";i:23;s:13:"delete-366245";i:24;s:13:"delete-366247";i:25;s:13:"delete-366249";i:26;s:13:"delete-366251";i:27;s:13:"delete-366253";i:28;s:13:"delete-366255";i:29;s:13:"delete-366257";i:30;s:13:"delete-366259";i:31;s:13:"delete-366261";i:32;s:13:"delete-366263";i:33;s:13:"delete-366265";i:34;s:13:"delete-366267";i:35;s:13:"delete-366269";i:36;s:13:"delete-366271";i:37;s:13:"delete-366273";i:38;s:13:"delete-366275";i:39;s:13:"delete-366277";i:40;s:13:"delete-366279";i:41;s:13:"delete-366281";i:42;s:13:"delete-366283";i:43;s:13:"delete-366285";i:44;s:13:"delete-366287";i:45;s:13:"delete-366289";i:46;s:13:"delete-366291";i:47;s:13:"delete-366293";i:48;s:13:"delete-366295";i:49;s:13:"delete-366297";i:50;s:13:"delete-366299";i:51;s:13:"delete-366301";i:52;s:13:"delete-366303";i:53;s:13:"delete-366305";i:54;s:13:"delete-366307";i:55;s:13:"delete-366309";i:56;s:13:"delete-366311";i:57;s:13:"delete-366313";i:58;s:13:"delete-366315";i:59;s:13:"delete-366317";i:60;s:13:"delete-366319";i:61;s:13:"delete-366321";i:62;s:13:"delete-366323";i:63;s:13:"delete-366325";i:64;s:13:"delete-366327";i:65;s:13:"delete-366329";i:66;s:13:"delete-366331";i:67;s:13:"delete-366033";i:68;s:13:"delete-366035";i:69;s:13:"delete-366037";i:70;s:13:"delete-366039";i:71;s:13:"delete-366041";i:72;s:13:"delete-366043";i:73;s:13:"delete-366045";i:74;s:13:"delete-366047";i:75;s:13:"delete-366049";i:76;s:13:"delete-366051";i:77;s:13:"delete-366053";i:78;s:13:"delete-366055";i:79;s:13:"delete-366057";i:80;s:13:"delete-366059";i:81;s:13:"delete-366061";i:82;s:13:"delete-366063";i:83;s:13:"delete-366065";i:84;s:13:"delete-366067";i:85;s:13:"delete-366069";i:86;s:13:"delete-366071";i:87;s:13:"delete-366073";i:88;s:13:"delete-366075";i:89;s:13:"delete-366077";i:90;s:13:"delete-366079";i:91;s:13:"delete-366081";i:92;s:13:"delete-366083";i:93;s:13:"delete-366085";i:94;s:13:"delete-366087";i:95;s:13:"delete-366089";i:96;s:13:"delete-366091";i:97;s:13:"delete-366093";i:98;s:13:"delete-366095";i:99;s:13:"delete-366097";i:100;s:13:"delete-366099";i:101;s: ... and many thousand lines more

I didn't investigate much further, but it seems this is where it's used:

  $tasks = variable_get('search_api_tasks', array());
  // When we add or remove an index, we can ignore all other tasks.
  $tasks[$old_server->machine_name][$index->machine_name] = array('remove');
  variable_set('search_api_tasks', $tasks);

http://drupalcontrib.org/api/drupal/contributions!search_api!search_api....

So, what to do, what to do? A simple fix would be to push this (and possibly similar variables) into its own table. Usually I'm against putting any configuration outside of variables (global $conf), but this doesn't look like configuration data to me.

Also, as an opposing notion: it seems the actual configuration that holds important data, such as which servers to connect to, does NOT use Drupal variables!
It uses the "search_api_server" table instead. IMO this sucks big time, because we cannot use $conf overrides in settings.php when switching between production/staging/dev environments, which have different Apache Solr instances etc.


Comments

drunken monkey’s picture

Title: "variable"-table filled with junk from variable set by search_api » Use separate table instead of variable for remembering server tasks
Version: 7.x-1.0 » 7.x-1.x-dev
Component: Miscellaneous » Framework
Category: bug » task

This will only occur if you leave an index and its associated server lying around disabled for some time (or when deleting huge amounts of items). Enabling the server for a short time and then disabling it again should clear the variable. On production sites, just don't leave unused servers lying around and you'll be fine.

If more people agree with you, though, I'd also not be totally against moving the tasks into their own table.

bibo’s picture

This will only occur if you leave an index and its associated server lying around disabled for some time (or when deleting huge amounts of items). Enabling the server for a short time and then disabling it again should clear the variable. On production sites, just don't leave unused servers lying around and you'll be fine.

Thanks for the answer. Now I at least know the reason for its existence. We indeed have/had a disabled index which we didn't actually use anymore. We also had large amounts of content that was deleted (we did a full content delete/reimport).

Surprisingly, our delete process was very heavy (the MySQL master created binlogs that filled the disk very fast, altogether over 200 GB), and apparently it managed to crash one Apache Solr server (which I only noticed later).

Anyway, somewhere during this process this single row in the variable table had grown to over 4 MB in size. We tried enabling the index and waited for the variable to clear. Something did happen (Solr was almost overloaded), but the variable still stayed oversized. It might even be caused by PHP crashing due to server limits (maybe it orders the deletion, but the script dies too soon to clear the variable).

As I said earlier, the whole variable table is cached in a single row called "variables" in the "cache_bootstrap" table. When this row grows over 1 MB, it won't fit in memcache. I've witnessed this same error before in other circumstances, for example when D6 Thickbox generated about 90 000 theme callbacks, which filled about 90% of the theme registry, making it not fit in memcache. The result was that the theme registry got regenerated on every page load. For more info, see #435694: >1M data writes incompatible with memcached (memcached -I 32M -m32) and #1257768: memcache set not working for theme registry cache.

In this case, however, search_api_tasks in fact fills about 97% of the variables blob (4200 KB vs. 90 KB), which again means that memcache silently fails and the whole variable table (with 97% clutter) is re-fetched from MySQL and processed on every single Drupal bootstrap.

If more people agree with you, though, I'd also not be totally against moving the tasks into their own table.

After deleting the variable manually and clearing caches, we saw a very noticeable 20-40% drop in page execution times. That's a good reason, right?
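
For anyone hitting the same problem, the manual cleanup we did boils down to something like the following (just a sketch of what we ran, e.g. via drush php-eval):

  // Drop the oversized variable. variable_del() should also rebuild the
  // cached "variables" row, but we cleared caches explicitly to be sure.
  variable_del('search_api_tasks');
  cache_clear_all('variables', 'cache_bootstrap');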

It's also good to know that the problem usually won't generate any hints or error logs, and it won't even show up directly in the devel query log. That means that a lot of people probably suffer from this but don't have a clue about the problem or its cause. The best hint they might get is that performance improves when memcache is disabled.

Since I see search_api as one of the modules making Drupal stand out (especially performance-wise), I'm also hoping to see this information live anywhere but the variable table :)

davidseth’s picture

@bibo,

I 100% agree with you. Whilst I don't have the same level of detail that you provided, I can add that we had severe problems on an index with 350,000 items in it. Massive spikes in the cache layer and immense heartache.

As this site is heavily trafficked and needs to stay highly available, we can't just disable and re-enable the index. Firstly, just indexing the 350k nodes takes days, because we have to do it in limited bursts (again due to the traffic of the site). So that is not an option.

So, to make Search API as bulletproof as possible, let's pursue this :)

cpliakas’s picture

Working for an organization that supports a lot of websites with high traffic and large numbers of documents, we often see this variable become way too large and crash entire sites if they aren't using memcache. Would love to see the architecture change to ensure the reliability of this module for larger sites.

Nick_vh’s picture

Marking this as major, as this is really crashing sites that rely on large-scale datasets.

Nick_vh’s picture

Priority: Normal » Major
drunken monkey’s picture

I see several possible solutions for this:

  • The basic option would be the one proposed here, to move this information to a dedicated table (see the schema sketch after this list), which would of course eliminate the huge performance impact this can have. The downside, though, is that the 4 MB of probably useless data would still be saved and use up disk space – much better than it being in the cache, but still unnecessary.
  • The other option would be to get rid of this whole subsystem, by handling the disabling of a server much like deleting it: all indexes will be removed from the server, and all data indexed on it (for these indexes, unless they're read-only) deleted. We could then trash the entire logic, also making the code in some parts of the module significantly less complex.
    Since disabling a server is never really necessary (except for preventing a Solr server from getting pinged), there is no reason to make it such a light-weight operation by ensuring that re-enabling it goes as smoothly as possible (and storing all this data for that purpose). By all your accounts in this issue, the function is currently mainly misused anyways, to permanently get rid of a server. By making disabling more like deleting, we could take that into account.
  • A variant of #1, potentially reducing the danger of amassing large amounts of useless data, would be to add a warning mechanism – when disabling a server, and/or when visiting the overview while such a disabled server is present – stating that indexes lying on a disabled server will use unnecessary resources and should be moved away from it.
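
Just to make the first option concrete, here is a minimal sketch of what such a dedicated table could look like in hook_schema() – table and field names are only placeholders, not a final design:

  function search_api_task_schema_sketch() {
    $schema['search_api_task'] = array(
      'description' => 'Stores pending tasks for search servers.',
      'fields' => array(
        'id' => array('type' => 'serial', 'unsigned' => TRUE, 'not null' => TRUE),
        'server_id' => array('type' => 'varchar', 'length' => 64, 'not null' => TRUE),
        'type' => array('type' => 'varchar', 'length' => 64, 'not null' => TRUE),
        'index_id' => array('type' => 'varchar', 'length' => 64),
        'data' => array('type' => 'text', 'size' => 'medium', 'serialize' => TRUE),
      ),
      'indexes' => array('server' => array('server_id')),
      'primary key' => array('id'),
    );
    return $schema;
  }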

What would you think the better course of action would be here?

drunken monkey’s picture

Title: Use separate table instead of variable for remembering server tasks » Fix server tasks system
Category: task » bug

Just had a short chat with Nick Veenhof, in the course of which we realized that the Search API currently doesn't handle the case when a search item gets deleted but the server, though enabled, isn't available (Solr not reachable, etc.). A watchdog message is logged, but the index will stay out of sync for eternity (or until the index is cleared – which might amount to the same for large sites) – which of course isn't good.

So, our proposed solution would be a combination of both options: make disabling the server a more permanent action*, removing the indexes and data from it so we don't have to worry about anything that happens there as long as the server is disabled. And migrate the tasks variable to its own table and use that to track attempted actions that failed (apart from indexing, which the tracking system already handles). In addition to when a server gets enabled, we would then also look at that table during cron runs and see what tasks still need to be done for all enabled servers.

* We also talked about completely dropping support for disabling servers, possibly replacing it with a read-only flag on the server, but this would be an unnecessarily large change – now that we have this functionality, we should keep it, but if most people are using this as an alternative to deleting the server completely, we should take that into account in how we handle it.
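
To make the cron part of this concrete, a very rough sketch (the task helper is only a placeholder name at this point, not the actual implementation):

  /**
   * Implements hook_cron() – sketch only.
   */
  function search_api_cron() {
    // Load all enabled servers and retry their pending tasks.
    foreach (search_api_server_load_multiple(FALSE, array('enabled' => 1)) as $server) {
      // Placeholder helper: reads the task table, retries each recorded
      // action and removes the entries that succeed.
      search_api_server_tasks_check($server);
    }
  }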

One remaining problem is what we should do in the update function with disabled servers that have indexes attached to them. Remove the indexes or not? Keep their tasks or not?
I'll have to see what makes the most sense, but thinking about it a bit, I guess removing the indexes and making those removals the sole tasks in the new tasks table would make the most sense.

Another idea here would be to include some function to check the search server for invalid old items and automatically purge them. However, that is another issue, and would probably have to live in the individual backend modules anyways.

Anyways, preparing a patch now.

Nick_vh’s picture

I totally approve of this change and I'm excited to see what we can do to fix this.

  • Make disabling the server a more permanent action, removing the indexes and data from it so we don't have to worry about anything that happens there as long as the server is disabled.
  • Migrate the tasks variable to its own table and use that to track attempted actions that failed.
  • Create an update function to move the task data from the variable to the table.
  • Look at that table during cron runs and see what tasks still need to be done for all enabled servers.
  • Change the tests to use the new task tracking table.

These are the five tasks we need to get done before we can consider this problem fixed. A significant change. If there's anyone else reading this, input is welcome! :)
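
Just to sketch what the update function (task 3) could roughly look like – the update number, table and column names here are placeholders, not the actual patch:

  function search_api_update_7xxx() {
    $tasks = variable_get('search_api_tasks', array());
    foreach ($tasks as $server_id => $indexes) {
      foreach ($indexes as $index_id => $index_tasks) {
        foreach ($index_tasks as $task) {
          // Ignoring here which task types actually need to be migrated;
          // see the discussion above.
          db_insert('search_api_task')
            ->fields(array(
              'server_id' => $server_id,
              'index_id' => $index_id,
              'type' => is_array($task) ? reset($task) : $task,
            ))
            ->execute();
        }
      }
    }
    // Once the tasks live in their own table, the variable can go away.
    variable_del('search_api_tasks');
  }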

drunken monkey’s picture

Change the tests to use the new task tracking table.

There are currently no tests to be updated, I think. But we could of course add some; that would definitely make sense.

drunken monkey’s picture

Issue summary: View changes
Issue tags: +API change

This would definitely involve some kind of API change – while the search_api_tasks variable isn't exactly exposed and probably not used by anyone, an index's server will now always be enabled, which is different from before. No code should break, though, as checking for the enabled field as before will of course continue to work.

Also, of course, there is the fact that the service class's delete() method can now throw an exception, and should do so when deleting failed.

drunken monkey’s picture

Status: Active » Needs review
FileSize
38.21 KB

This is a first stab at the patch. I haven't tested it yet, nor added any automatic tests, but this is the direction I want to take. The service class can now throw exceptions for some more operations, and the server entity class is set up to automatically catch them and add tasks for them. During cron, the tasks for enabled servers are collected and executed. (The same is done when enabling a server, for that server alone, to clean up older tasks. However, I'm not even sure that's needed – it would be done on the next cron run anyways. Still, better to be on the safe side. It might otherwise cause problems when re-adding an index to a newly re-enabled server.)

The tasks themselves have now been moved to a table, as discussed, and there are also shiny new API functions to deal with them. (For D8 those should probably be methods on a pluggable task manager service or something, but that's in the future.) The update function removes the indexes from all disabled servers and stores a removeIndex task for each. Those and the "clear all" tasks are migrated to the new system; all others are ignored, since removing the index should take care of them anyways.
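
To illustrate the new pattern (a simplified sketch, not the literal patch code – the task helper name follows the new API functions mentioned above):

  function example_delete_items(SearchApiServer $server, SearchApiIndex $index, array $ids) {
    try {
      // The service class may now throw SearchApiException when the backend
      // (e.g. Solr) is unreachable.
      $server->deleteItems($ids, $index);
    }
    catch (SearchApiException $e) {
      watchdog_exception('search_api', $e);
      // Remember the failed operation in the task table; it will be retried
      // during cron or when the server is enabled again.
      search_api_server_tasks_add($server, 'deleteItems', $index, $ids);
    }
  }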

Status: Needs review » Needs work

The last submitted patch, 12: 1551303-12--overhaul_tasks_system.patch, failed testing.

drunken monkey’s picture

Corrected patch, should pass now (only one silly mistake).

Adding new tests now.

drunken monkey’s picture

Status: Needs work » Needs review
drunken monkey’s picture

Oh, one question though, to any native speaker (or sufficiently confident US-resident Belgian): I've referred in all comments to the tasks that need to be executed as "outstanding tasks". To your ears, does the adjective sound right (pending, unresolved) or weird (prominent, exceptional)?

I first used it without thinking about it and only later realized it might sound weird.
If you think we should use some other adjective, which one would you suggest?

drunken monkey’s picture

And now even with additional tests.

Nick_vh’s picture

"Lying" is a bit weird to use in a server context. Also, please add the functionality that now only happens in the cron run to the "index now" / drush scripts as well, so that tasks are also executed when you are indexing the remaining items.

drunken monkey’s picture

OK, here is the patch, with "outstanding" replaced by "pending" and tasks also being executed when indexing items.
I opted here for aborting if the tasks couldn't be executed successfully, instead of trying to index items anyway. First off, this saves some time in the likely case that, when the tasks fail, indexing will fail as well. Secondly, if indexing were to succeed although some tasks could not be executed, the result might not be desirable. I.e., if we tried to clear the index before, it failed, and the deletion of those items is pending, then we want to make absolutely sure we do that before re-indexing any items.
So, your remark regarding also executing the tasks when indexing saved us from adding a new, annoying bug (though it could only have occurred under very peculiar circumstances).
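
In other words, the indexing path now roughly does the following (again a simplified sketch, not the literal patch code):

  function example_index_items(SearchApiIndex $index, array $items) {
    $server = $index->server();
    // First try to execute pending tasks (e.g. a "clear all" that failed
    // earlier).
    if (!search_api_server_tasks_check($server)) {
      // Abort instead of indexing: if a pending deletion could not be
      // carried out, indexing now could leave stale items on the server.
      throw new SearchApiException(t('Trying to index items while pending server tasks could not be executed.'));
    }
    return $server->indexItems($index, $items);
  }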

I hope this is now ready to go (and that the test bot will say so, too).

drunken monkey’s picture

Status: Needs review » Fixed

Since no-one seems to object, I've now committed this.
Thanks again for your input!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.