The following happened once on a high-traffic site:

The user edited a node, then saved it. At almost the same time (within milliseconds), the same node was loaded, and the site crashed. As we figured out, the slave wasn't in sync, so the old data was loaded from the slave and cached.
I need your advice: how can I check whether db_ignore_slave() works correctly? I'm not very familiar with this deep database layer, especially with master-slave architecture.

Comments

gielfeldt’s picture

db_ignore_slave() does not work with AutoSlave, because AutoSlave takes care of this on a per-table basis.

To control the "assumed maximum replication lag", use the autoslave options in settings.php. It sounds like you may also want to enable the "global replication lag" option, which will force the use of the master db for concurrent users for the affected tables.

Example:

<?php
$databases['default']['default'] = array(
  'driver' => 'autoslave',
  'master' => 'mymasterdb',
  'slave' => 'myslavedb',
  'replication lag' => 30, // Defaults to 2 seconds if not set. Standard Drupal's db_ignore_slave() defaults to 300 seconds (5 minutes).
  'global replication lag' => TRUE, // Defaults to FALSE.
);
?>
gielfeldt’s picture

Elaboration of "db_ignore_slave() does not work with AutoSlave": It has no effect :-)

gielfeldt’s picture

Status: Active » Postponed (maintainer needs more info)

Hi

Did this answer your question?

szantog’s picture

We are working on this; I'll keep you posted.

gielfeldt’s picture

Ok. Let me know if you have other questions.

szantog’s picture

Status: Postponed (maintainer needs more info) » Fixed

We added registry, system, and registry_file to the 'Always use "master" for tables' array, and it seems this solves our issue.
Thanks for your help!
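For reference, a sketch of how this might look in settings.php. Note that the 'always master' key name below is my assumption based on the wording above, not confirmed AutoSlave API; check the AutoSlave project documentation for the exact setting:

```php
<?php
// Hypothetical sketch: the 'always master' key is an assumed name.
// It illustrates forcing all reads of the registry, system and
// registry_file tables to the master database, bypassing any slave.
$databases['default']['default'] = array(
  'driver' => 'autoslave',
  'master' => 'mymasterdb',
  'slave' => 'myslavedb',
  'always master' => array('registry', 'system', 'registry_file'),
);
?>
```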

gielfeldt’s picture

Hi

No problem. Do you have any insight into why these tables need to be "always master" on your setup? Then I'll document it on the project page, and perhaps even add them to the default settings.

Thanks

gielfeldt’s picture

Status: Fixed » Postponed (maintainer needs more info)

Hi szantog

Not sure how notification works when status is fixed, so I'm changing the status just in case :-)

szantog’s picture

Hmm, it's a bit hard to say. Everything is theoretical, because we have limited debugging options on the live environment.

The point should be the memcache vs. database inconsistency. If the normal database cache backend is used, then e.g. when enabling a module, the cache and registry tables are flushed and rebuilt. Since this all happens in the database, if some lag exists, the wrong system and cache tables are loaded: wrong, but at least consistent. If we use memcache, then due to the lag, wrong memcache data is built and runs against the correct database:

1. Enable a module: the master database is correct, memcache is empty.
2. The next page request loads the data from the slave, which is not in sync yet, and this stale data goes into memcache.

But there is a flaw in this theory: what about the other cache bins? Obviously, wrong data in the cache_bootstrap bin would cause a site crash.
----------------
I can't believe it, but while I was writing this, we got another site crash.
When a node was updated, the parent entity was loaded immediately after saving, the old data got cached, this caused EntityMalformedExceptions, and within a few minutes the site was killed.
We need further investigation.

gielfeldt’s picture

The next page request loads the data from the slave, which is not in sync yet, and this stale data goes into memcache.

The replication lag mitigation feature should prevent exactly this.

How does your $databases['default']['default'] look? Did you try enabling the 'global replication lag'?

I have a vague suspicion that this is an isolation level issue, assuming you're using MySQL InnoDB. I'll investigate this further.

gielfeldt’s picture

Oh, by the way, are you also using the lock.inc or memcache-lock.inc bundled with AutoSlave?

gielfeldt’s picture

Hi szantog

I actually found something that could be the culprit also when using "global replication lag".

In the latest dev version, I've tried to address these issues, which resulted in these fixes:
* AutoSlave now uses a non-transactional connection for "global replication lag", including the possibility of a better isolation level.
* "Global replication lag" is now enabled by default.

Try installing the dev version and adding the new "init_commands" to the autoslave driver declaration:

<?php
$databases['default']['default'] = array(
  'driver' => 'autoslave',
  'master' => 'mymasterdb',
  'slave' => 'myslavedb',
  'replication lag' => 30, // defaults to 2 seconds if not set. Standard Drupal's db_ignore_slave() defaults to 300 seconds (5 mins.)
  'init_commands' => array('autoslave' => "SET SESSION tx_isolation='READ-COMMITTED'")
);
?>

Let me know how/if it works.

szantog’s picture

Thanks for your hard work; we will try those on Wednesday.

gielfeldt’s picture

Hi szantog

I think I finally figured out what's wrong, and I can't believe I didn't realize it before. Every query runs through AutoSlave, making AutoSlave able to detect which tables are being queried... except when using join() on a db_select(). Unfortunately, these are VERY common :-)

I'm currently thinking about how to hook into this. My initial thoughts are overriding SelectQuery altogether, or perhaps adding a tag or an extender to the query, thereby being able to alter it.

The latter seems easier to implement, though I'm not sure I can easily change the connection on an already instantiated SelectQuery object.

This problem seems to be what is interfering with the EntityCache ... and god knows what else.
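As an illustration of the tag/alter idea above (not the actual AutoSlave implementation), Drupal 7 already lets any module inspect the joined tables of a db_select() in hook_query_alter(); the module name and the commented-out marking step are hypothetical:

```php
<?php
/**
 * Sketch of the "tag + alter" approach using standard Drupal 7 hooks.
 *
 * Implements hook_query_alter().
 */
function mymodule_query_alter(QueryAlterableInterface $query) {
  if ($query instanceof SelectQueryInterface) {
    // getTables() exposes the base table and every join()ed table,
    // which is the information a driver needs to decide whether a
    // "dirty" table should force the query onto the master.
    foreach ($query->getTables() as $table) {
      if (is_string($table['table'])) {
        // e.g. mark $table['table'] as requiring the master connection.
      }
    }
  }
}
?>
```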

szantog’s picture

EntityCache is irrelevant for now, my mistake: entitycache is now turned off on one of our sites, and I missed it.

gielfeldt’s picture

There's a new dev version ready, where I've tried to address the issue of dirty tables not being recognized properly. Could you try it out?

Note, you'll have to copy the autoslave folder to /includes/database/ again, as it contains a new file.

Regarding EntityCache, I've discovered that this will only work properly when using the database as a cache backend. The reason is that e.g. node_save() clears the cache inside a transaction.

gielfeldt’s picture

I've been looking some more into this. Regarding the error with entity_extract_ids(), I think I've traced it to the inherent problem with transactions and non-database cache backends.

I've created a cache wrapper in autoslave, which should make cache queries transaction safe.

<?php
$conf['cache_backends'] = array(
  'sites/all/modules/memcache/memcache.inc',
  'sites/all/modules/autoslave/autoslave.cache.inc',
);

$conf['cache_default_class'] = 'AutoslaveCache';
$conf['autoslave_cache_default_class'] = 'MemCacheDrupal';
?>

Another way to solve it could be to just use the database as a cache backend.

This is partly theoretical, as I have been unable to reproduce your exact problem (not knowing precisely which modules you're using and how they are configured). However, it does fit the case, since modules like file_entity perform cache operations during node_save(), which is wrapped in a transaction.

With the cache wrapper, the core-patch I mentioned earlier in our chat should be unnecessary.
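For the database-backend alternative mentioned above, the default cache class can be set explicitly in settings.php; DrupalDatabaseCache is Drupal 7 core's standard backend:

```php
<?php
// Use core's database cache backend for all bins. Cache writes then
// participate in the surrounding transaction (e.g. during node_save()),
// avoiding the memcache-vs-database inconsistency described earlier.
$conf['cache_default_class'] = 'DrupalDatabaseCache';
?>
```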

gielfeldt’s picture

I've generalized the consistent cache wrapper: http://drupal.org/sandbox/gielfeldt/1946668

szantog’s picture

Now that we've started working together across all of our sites, can we close this issue, or should we just rename it to 'various fixes and improvements based on high-traffic live testing'? :)

gielfeldt’s picture

Status: Postponed (maintainer needs more info) » Fixed

Let's just close it. I've set it to fixed, since the original issue with the site actually crashing has been solved.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.