It appears that deploying Drupal in a large replicated MySQL environment poses a few problems because of the amount of database interactivity required.
What I would recommend is an optional read-only database connection alongside the read/write connection.
By default the read-only database would be used if it is available and defined in the settings.php file. However, if a request would cause a table update (INSERT, UPDATE, DELETE), the $db_conns['default'] variable would be set to a connection to the read/write database and db_set_active() would be called.
This way, all subsequent database queries in the request would be performed on the read/write database.
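In Drupal 4.6 terms, the setup could look roughly like this sketch (the hostnames and the 'read_write' key name are illustrative, not an existing convention; $db_url is the settings.php array from which Drupal builds its connections, and db_set_active() is the existing switch):

```php
<?php
// settings.php: 'default' points at the local read-only slave so
// ordinary page views never leave the machine; 'read_write' points
// at the master. Once a request needs an INSERT/UPDATE/DELETE,
// calling db_set_active('read_write') routes every later query in
// that request to the master.
$db_url['default']    = 'mysql://drupal:password@localhost/drupal';
$db_url['read_write'] = 'mysql://drupal:password@master.example.com/drupal';
```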
This scheme avoids the problem of a read that follows a write not reflecting that write because of replication delay. For instance:
INSERT INTO table VALUES ('something');
SELECT * FROM table;
would return inconsistent data if the SELECT were sent to a read-only slave and the INSERT to the master, as the insert would probably not have propagated by the time the select was executed.
Has anyone done something similar, and should it be built into the Drupal core -- or should I keep a tiny fork of the db code to deal with this for my application?
Comments
Drupal being an application
Drupal, being an application that uses a database, should only have one main responsibility in this relationship: making sure it has the most optimized code for accessing the database, issuing no more queries than it needs, and keeping its SQL queries optimized for the database structure.
As you pointed out, there can be problems with database scalability. But those are database server problems, and I don't think it is Drupal's responsibility to correct them.
A possible solution to database server scalability problems is to create a database cluster. This page, http://www.davz.net/static/howto/mysqlcluster/, has some ideas for building a MySQL cluster.
Edit: just one doubt -- can you explain further why you have a "replicated environment"? Drupal allows you to access different databases. For example, if you have a particular module (written by you) that does something heavy in a database, you can use Drupal's ability to address different databases from your module. This way you maintain two separate databases, not replicated ones, each at full capacity.
I have a replicated
I have a replicated environment with one master (read/write) and seven slaves (read-only). A clustered database would slow this down because each server would need to open a TCP connection, using bandwidth on the link they all share to the outside world, lowering my aggregate throughput and adding delay. If the client machine can talk to a database running locally over a Unix domain socket, it is much, much faster (no waiting for the TCP three-way handshake and slow start).
I need a replicated environment because one database server cannot serve the load that I need to serve and I can't have all of the database traffic going between machines in a cluster because of the latency that would cause and the maximum throughput that I'd be sacrificing.
Master/slave replication is really much more efficient if the application supports it.
There have been attempts
There have been attempts at using db replication to spread the load of large Drupal sites:
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/marco/databas...
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/crackerjm/rep...
The problem is that the majority of Drupal sites are smallish, so this topic hasn't gotten the attention it deserves. I think including this in core would make sense.
--
Drupal services
My Drupal services
This is exactly what I'm
This is exactly what I'm talking about ;) (and almost exactly my idea, though I would run a copy of the database on localhost and always choose that database for reads unless a write has been performed, which is accommodated by this scheme).
Thanks killes. I would suggest that this get another push toward integration into the core because it's something that's necessary for a large-scale deployment.
Pushing
From experience I can say that things won't get pushed until the people (such as you) who need this kind of functionality start pushing for it themselves. Most Drupal developers don't run sites with high enough volume to start working on this on their own.
--
Drupal services
My Drupal services
Per your request
Here is a very MySQL-specific patch against Drupal 4.6. Note that I haven't tested it yet, but it looks like it will work without a problem. It's late and I need a nap, so I'll test it out sometime tomorrow to see how it goes.
README:
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/crackerjm/rep...
PATCH:
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/crackerjm/rep...
New revision
OK, I have now committed a new version to CVS. It appears to be working fine (verified with MySQL query logging).
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/crackerjm/rep...
Apply the patch from your base drupal directory.
what I don't like about your
What I don't like about your solution is that it tries to find the queries that write or change content and issues them to the master, sending the rest to the slave(s).
The other way around would be safer, in my eyes, with respect to new commands in future MySQL versions:
find the queries that are guaranteed to only read (those starting with SELECT), issue them to the slaves, and issue everything else to the master.
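As a sketch, this conservative rule fits in a few lines (the function name and connection labels are mine, not Drupal API; a real patch would wire this into db_query()):

```php
<?php
// Whitelist routing: only queries that certainly read (they start
// with SELECT) go to a slave. Anything else -- including statement
// types added in future MySQL versions -- defaults to the master.
function db_target($query) {
  return preg_match('/^\s*SELECT\b/i', $query) ? 'slave' : 'master';
}
```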
i'm absolutely behind this
I've been looking at this as well, recently (as well as large-scale file handling)...
By all means, I think we should work on getting this into core, so that the various sites that could/should benefit from this can do so easily.
If you haven't already, get signed up for the drupal-devel mailing list, and post patches on http://drupal.org/node/add/project_issue/drupal .
Looking forward to some attention in this direction!
-James
--
James Walker :: http://walkah.net/
Some questions about your configuration.
I understand your need. Perhaps there is a need to document your thoughts now so someone can assemble a how-to page later on. I have a couple of questions already.
I set up one site with InnoDB tables for the financial stuff and MyISAM tables for the product catalogue. When people browsed products, they used the replicated MyISAM tables to read product information. When they clicked the Buy button, they went to a different database that handled their shopping cart.
MySQL replication required the use of certain tables and configurations. A future how-to page could use a link to MySQL's replication page.
Replication in MySQL
Test marketing suggested traffic would increase 60 times in the first month producing a lot of browsing but not a big increase in buying. That was a good reason to have two sets of servers and a read-only option.
There was a real question about storing session information for people who logged in to buy. When the buying traffic increased, we were looking at directing logged in users to a specific set of servers which could be clustered around a database containing their session and another database with their user profile.
When the site went live, most of the traffic was product images which normally are cached for a long time. We had daily changes that meant we had to limit cache on images. The images made the network connection a bottleneck and required frequent upgrades. We did everything possible to keep database traffic off the network.
The second heaviest workload was logging statistics. Because we had daily changes, we logged usage directly to databases that we could view during the day, over a separate connection. I noticed Drupal has something for logging, but I have not read the details. If the logging goes to a database, it would be a good candidate for a third connection in a large environment.
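A third connection along those lines might look like this sketch (the hostname, the 'logging' key name, and the accesslog columns are illustrative; db_set_active() is the existing Drupal switch, and $path stands in for whatever the caller is logging):

```php
<?php
// settings.php: a dedicated connection for statistics logging,
// keeping that write traffic off the content databases.
$db_url['logging'] = 'mysql://drupal:password@logdb.example.com/drupal_log';

// In module code, switch to it around a logging write and back:
db_set_active('logging');
db_query("INSERT INTO accesslog (path, timestamp) VALUES ('%s', %d)", $path, time());
db_set_active('default');
```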
When setting up a site for multiple servers, changing the design before construction is 100 times easier than changing a site after customers arrive. Your experience will help those who are about to expand their own site. I have the feeling that a newcomer will need more than “use 2 connections”. They will need to know something about your table choice and replication configuration so that they get the most out of the read-only connection.
How do you allocate work across the servers? Do you send every page request to a random server, or do you direct a page request to a specific server based on the user’s session? We used a round-robin Cisco setup. When a person logged on, they stayed on the same server, so their shopping cart activity was local to that server. The anonymous browsers stayed on servers with far less overhead.
petermoulding.com/web_architect
Latest Status?
What's the latest status with this? Also, should 'create' not be added to the array of statements to run on the master? So we have:
- insert
- delete
- update
- alter
- flush
- lock
- create
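Routing on that statement list can be sketched as follows (the function name is illustrative, and statements whose verb cannot be parsed fall back to the master to be safe):

```php
<?php
// Blacklist routing: a query whose first word is one of these verbs
// must run on the master; plain reads may go to a slave.
function db_needs_master($query) {
  $write_verbs = array('insert', 'delete', 'update', 'alter',
                       'flush', 'lock', 'create');
  if (preg_match('/^\s*(\w+)/', $query, $matches)) {
    return in_array(strtolower($matches[1]), $write_verbs);
  }
  return TRUE;  // could not parse a verb: play it safe, use the master
}
```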
Has this idea stalled, or is
Has this idea stalled, or is someone still working on it?
www.it-hq.org
I am not aware of any recent
I am not aware of any recent progress.
--
Drupal services
My Drupal services
Circular replication for failover
I would like to set up circular MySQL replication for failover (http://www.onlamp.com/lpt/a/6549). Drupal could support automated failover with a small change in database.inc to check whether the primary database is available and, if not, switch to the secondary. Can we include the ability for database failover in the Drupal core?
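A minimal sketch of the selection step, decoupled from database.inc (the helper name and the probe callback are hypothetical; in Drupal the probe would be an actual connection attempt with a short timeout):

```php
<?php
// Given db_url strings in priority order and a probe that reports
// whether a server answers, return the first usable URL, or FALSE
// if every server is down.
function db_url_failover(array $urls, $is_reachable) {
  foreach ($urls as $url) {
    if ($is_reachable($url)) {
      return $url;
    }
  }
  return FALSE;
}
```

database.inc would call this with the primary and secondary URLs before opening the active connection.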