So as I've been talking up a lot lately, I'm trying to add a PDO database layer to Drupal. PDO, for those who haven't read the link, is a unification of PHP's disparate SQL database APIs int a single, robust, OO-based API. It is an optional PECL component in PHP 5.1 and part of the default package as of PHP 5.2. It is not a full SQL syntax abstraction nor an ORM. Think of it as PEAR::DB implemented in C where it belongs. :-)

Reasons we want PDO:

  • PDO natively supports prepared statements, and emulates it on databases that don't support them natively. Because they're implemented in C, I believe they should be substantially faster than our current slow preg_replace_callback() prepared statement emulation.
  • C-level prepared statements can be compiled and cached once and then just "fill in the blank". That can save a lot of time on repeated queries (like, say, 80 calls to look up a path alias on a single page.).
  • PDO's prepared statements are type-agnostic. The type casting, quoting, escaping and other injection protection all happens down in database-specific C code where it's faster and more secure.
  • PDO offers a single user-space API to talk to MySQL, PostgreSQL, SQLite, Oracle, MS SQL, ODBC, and various other databases that no one I know actually uses. That means adding support for SQLite, Oracle, and MS SQL via a PDO layer becomes dramatically easier, and takes less code.
  • Because PDO is PHP 5-only, it allows Drupal to use its market share to nudge the world toward PHP 5 without leaving PHP 4 legacy sites out in the cold. Using a lame web host that is still on PHP 4? They all run MySQL anyway so you're fine with the existing driver. Want your Drupal to be faster? Want to use SQLite? Want to use MS SQL? Install PHP 5 and bask in code developed this century. :-)

In the long term, once we can require PHP 5 (sometime in 2012...) we can eliminate the existing mysql, mysqli, and pgsql backends and just have thin extensions to a common PDO base to handle SQL differences (db_query_range() and such). That opens up some really nice possibilities:

  • We drop the pseudo-sprintf syntax for queries, since the data type of the column is handled by the PDO driver down in C. That makes query writing easier.
  • We allow support for named parameters in queries, which means we can pass a keyed array into the query call. Given how heavily Drupal relies on keyed arrays these days, I'm sure we can find some way to leverage that. :-)
  • With a real OO database abstraction layer underneath us, we can leverage the database layer in more OO ways. That's not just OO fanboyism there; PHP 5 offers a lot of nice extras that are build on its OO functionality, such as SPL, that magically become ours if we expose the database layer as a well-designed object. PDO already uses some of it. More on what we can do with this in, oh, a year or so. :-)

There are, of course, a few snags, some of which I've mentioned before:

  • MySQL, being the great software that it is, doesn't support the query cache for prepared statements. Why? I don't know. However, PDO's original author recommends just using the PDO prepared statement emulation so the query cache still works. I am going to work on the assumption that he knows what he's talking about unless someone proves me otherwise. :-) That should still get us a decent performance boost just by eliminating the regexing.
  • Not all database engines support an equivalent of mysql_num_rows($result_set). PDO supports an "affected rows" call for INSERT, UPDATE, and DELETe statements, but for SELECT statements it may or may not returning a meaningful value. For MySQL, it seems it doesn't. The solution here, as discussed on the dev list recently, is to eliminate db_num_rows() and refactor code accordingly. I am hopeful that will be the only significant API change at this stage, but we'll see.
  • Formal prepared statements, unlike our current pseudo-prepared statements, only accept parameter insertion for actual values. That means current queries that use a variable table name can't use the prepared statement syntax anymore. To wit, this:
    db_query("DELETE FROM {%s} WHERE expire != %d AND expire <= %d", $table, CACHE_PERMANENT, $cache_flush);
    

    has to become this:

    db_query("DELETE FROM {" . $table . "} WHERE expire != %d AND expire <= %d", CACHE_PERMANENT, $cache_flush);
    

    I do not see that as a major problem.

  • Because this patch requires rewriting db_query() specifically for PDO, db_query will need to be moved down into the database-specific database .inc files. For mysql, mysqli, and postgresql, I don't plan to change it. The PDO version, of course, will be quite different.

So that's the sales pitch. Assuming I rolled it correctly, the attached patch is my first milestone in the process: The front page loads without being 90% errors. Yay! More useful changes will be forthcoming, but I wanted to get this part out now so that people could review it and so that it's available somewhere in case my computer melts. It's still full of debugging code and is not nicely factored yet. I'm still working on that part. :-)

To use, install a copy of HEAD using the mysql driver. Then apply this patch, and in your settings.php file change "mysql://blah blah" to "pdo.mysql://blah blah". If it works, you should be able to then reload the front page and not notice anything different. If you see a page full of errors, then something broke. :-) I'm also interested to see exactly what versions of PHP this works under. I've found it to work under PHP 5.2 and 5.2.1, but PHP 5.1.6 apparently had horrid segfault issues that are going to make this system a no-go. I'm curoius if it works under PHP 5.1.4 with PDO from PECL. If you have 5.1.4, please let us know.

I need help! OK that kind too, but I meant help with this patch. If you support this idea, then I think the biggest thing you can do to help at the moment is cleaning up existing queries in core. That is, remove any non-variable substitutions (like table names, above) and refactor any references to db_num_rows() to not need db_num_rows(). According to chx, most cases where it's used in core don't actually need an exact number; they just need "there were records or not". There are various cleaner ways of doing that than db_num_rows(). I don't know if Dries will accept separate patches for that or if he'll require them all be in one big patch, but it would probably be easier (and beneficial to the other database implementations working their way through the queue) to do those separately. Dries?

Comments/questions/input/help/flames welcome.

Files: 
CommentFileSizeAuthor
#19 drupal_pdo.patch403.66 KBrecidive
#3 pdo-4.patch46.62 KBCrell
#2 pdo-2.patch46.21 KBCrell
pdo.patch78 KBCrell

Comments

chx’s picture

Security notices must be added everywhere where you move the variable into the query (ie. the cache query) and the variable must sanitized throughly. Do not presume that 'only sane values are passed to cache_get'. Do not presume anything.

Crell’s picture

FileSize
46.21 KB

Take 2... Now working is installing with PDO-MySQL from the get-go. Code is also better factored. You should now be able to install Drupal, create the first account, and start poking around the admin area just with the pdomysql driver. (PHP 5.2 required, of course.)

Still to do: make Postgres work again (just need to move some code around), propagate all of the API changes, and fix any queries that I just managed to break. Hooray!

Crell’s picture

FileSize
46.62 KB

And here's a patch that should, you know, actually apply properly.

On the downside, this patch is at the point where I could successfully benchmark it. Specifically, I used devel generate to create a bunch of nodes and set the front page to show 30 of them. That's be 30 node_load() calls, each of which is ~4 db queries I believe, for a query count in the hundreds. Running apache bench on the site with a count of 100, I got the following results (my desktop is not the fastest server):

MySQL native driver:
Requests per second: 3.44 [#/sec] (mean)
Time per request: 290.948 [ms] (mean)

PDO-MySQL without prepared statement emulation:
Requests per second: 3.46 [#/sec] (mean)
Time per request: 289.347 [ms] (mean)

PDO-MySQL with prepared statement emulation:
Requests per second: 3.50 [#/sec] (mean)
Time per request: 285.823 [ms] (mean)

So there is a difference, but it's minor. I was very surprised (and somewhat disappointed) to see that.

I still think PDO is a good direction for us to move, since performance is not the only benefit (see above), but if it's not a big speed boost then I'm not going to try and rush it through for Drupal 6. There's other speed boosts we can get that I'm going to try and get through first. If someone else wants to jump in here and keep up work on this patch, please do! I'll come back to it later.

Dries’s picture

Too bad this doesn't provide a bit more of a boost. Maybe you still overlooked something.

It might be useful to use prepared statements (instead of emulating them) to see if that buys us something. One would expect it should.

Crell’s picture

PDO-MySQL supports using MySQL's prepared statement functionality or emulating it in the PDO module itself. I'd expected either one to be faster than our interpreted preg_callback()-based emulation. I'll have to see what else I can dig up research-wise, because it doesn't feel right to me. At the moment I don't expect to be able to get this into D6, but I will keep working on it.

Dries’s picture

I agree. These numbers don't feel right. More research would be in order. Would love to see this make progress though, if not for D6, then for D7. Thanks Crell. :)

Dries’s picture

I agree. These numbers don't feel right. More research would be in order. Would love to see this make progress though, if not for D6, then for D7. Thanks Crell. :)

Crell’s picture

Status: Needs review » Needs work

Some follow-up. I spoke with Wez Furlong, PDO's principle author, at php|tek this week. His theory is that our queries are simply too simple. Compiling them takes almost no time, so not compiling them doesn't save us any appreciable time. MySQL is also pretty good about query speed all on its own, so we shouldn't expect as a big of a boost as we would on, say, Oracle. Drat.

That still doesn't explain why there's no speed difference on the prepared statement insertion itself between C and preg_replace_callback(). I need to run some benchmarks on that. More on those when I have them.

lopolencastredealmeida’s picture

If you could check this with Sybase, Oracle or MSSQL would be great :)

Lopo

Frando’s picture

Version: 6.x-dev » 7.x-dev

Not in Drupal 6, but hopefully in D7!

Crell’s picture

@humaneasy: If I had access to any of those databases, I would. I do not, however, so any benchmarks you could help with on that front would be great. :-)

And yes, this will be back in D7.

JacobSingh’s picture

Hi Crell,

I'm very interested in seeing this happen. I moved to drupal land a year ago from a PEAR based DB_DataObjects system, and to be honest, I didn't miss writing SQL one bit :)

let me know if I can do anything to push this along.

Thanks!
Jacob

hswong3i’s picture

subscribe :)

justMatt’s picture

subscribing

oriol_e9g’s picture

subscribing

dharamgollapudi’s picture

Subscribing...

profix898’s picture

subscribing

alippai’s picture

subscribing

recidive’s picture

Status: Needs work » Needs review
FileSize
403.66 KB

As we will drop support for PHP 4 in Drupal 7 it is time to start 'plan b': make PDO the default database layer.

Attached is patch that replaces all custom placeholders (%s, %d, %b, etc) with PDO's '?' placeholder. I've also moved some common functions to database.inc and removed both database.mysqli.inc and database.mysql-common.inc. Stripped down database.mysql.inc and database.pgsql.inc to the bare minimum. Also I did some changes to make install recognize only 'mysql' and 'pgsql' as database options. Still need to refactor install.mysql.inc and install.pgsql.inc to make them use PDO functions.

Query placeholders now can be on any of these forms:

db_query('SELECT * FROM {table} WHERE foo = ? AND bar = ?', $foo, $bar);
db_query('SELECT * FROM {table} WHERE foo = ? AND bar = ?', array($foo, $bar));
db_query('SELECT * FROM {table} WHERE foo = :foo AND bar = :bar', array(':foo' => $foo, ':bar' => $bar));
Crell’s picture

recidive: Thanks, but I'm already way ahead of you. :-) I have a more complete overhaul waiting in my sandbox that was presented at DrupalCon. It won't be submitted as a patch until HEAD opens up again, because until then we should be focusing on getting D6 out the door. (And any patch rolled now is going to break horribly before then, guaranteed.) It also goes a step farther and introduces several new features (insert/update/delete handling, replication, etc.)

Sorry if you didn't know about it; I've been trying to publicize it without distracting from D6 too much.

I'm pondering if I should just close this issue, to avoid confusion. :-(

mfer’s picture

@Crell - don't close the issue. I don't think it's distracting from D6.

I am curious about performance issues though. I've read here and here that performance in practice might not be nearly as good as I had originally through it was and not as good as mysqli.

Yesterday, I realized that mysql and postgresql PDO extensions are not enabled as part of the standard PHP distribution as of PHP 5.2.4. One of the hosting solutions I use has them enabled and another one doesn't. Is there any pole or listing of hosting providers and what they include?

hass’s picture

subscribing

Crell’s picture

@mfer: Yes, an abstraction layer is going to have more overhead than a native call. That's unavoidable. However, the first article you link also makes it clear that a compiled extension abstraction layer is going to be faster than a PHP user-space one. Right now we have a user-space one built on the weakest option for MySQL, the old ext/mysql. The goal is to replace that with one built on PDO, the only widely-used C-native abstraction layer. From my earlier benchmarks it should be at worst no slower than our current preg_replace_callback() system. Adding an extra abstraction in there to support PDO and mysqli would only slow both down. From what I've seen, PDO is more widely deployed than mysqli, too, so even if we were to drop support for non-MySQL databases mysqli wouldn't be a great solution for shared hosting.

I don't know what stats are for web hosts to that level of detail. There's a nice big list of PHP 5.2.x supporting hosts at http://gophp5.org, but it doesn't go into detail about exactly what extensions they have. That's for a later project. :-) If I had to speculate, I'd say that a host with PHP 5.2 and MySQL of some sort is far more likely to enable PDO-MySQL than mysqli. (And if they don't enable PDO-MySQL on 5.2 with MySQL, I'd find a new host.)

antgiant’s picture

subscribing

hswong3i’s picture

I give an indeed study with both mysql, mysqli, pgsql, oci8, pdo_mysql and pdo_pgsql, and have some interesting founding:

  1. We may need to keep %d|%f|%s|%b|%c syntax, but not replace all of these with ?. This is because besides MySQL, PDO for all other databases need explicitly specify the variable type for LOB as PDO::PARAM_LOB. This is required for pdo_pgsql implementation; on the other hand, it is just a similar idea as using variable binding for oci8.
  2. There is NO outstanding performance boost by using ? directly instead of _db_query_callback(), in case of MySQL. It is proved by benchmarking.
  3. Reuse of prepared query statement don't give any outstanding performance boost up, but it is still useful for working as database-friendly. This seems a bit conflict; BTW, In case of Oracle, database will able to cache the query if 1) using variable binding, 2) reuse prepared statement. After using this method, Oracle EM no longer complain about our lossy query style. The case should be similar for PostgreSQL, and other enterprise-level database which support prepare statement.
  4. As we are not able to drop pseudo-sprintf syntax (LOB handling need this, not only for Oracle, but also PostgreSQL...), _db_query_callback() need to be enhance with arguments preparation, but not remove. Anyway, this enhancement don't affect overall performance, which proved by benchmarking, too.
  5. The support for PDO development is not enough. I face a lot of difficulty when working with pdo_pgsql BYTEA handling; there is lack of useful official document ;-(

The research also show that: a lot of abstraction is not able to be handled by the change of using PDO. As I split database.*.inc into 3 parts: common.*.inc for database-specific query syntax abstraction, schema.*.inc for schema-specific abstraction, and database.*.inc for API-specific abstraction, the functions included within database.*.inc is very little (in case of pdo_mysql, it only contain 10 functions...). Most of the works for enhance cross database compatibility need to be preform among core, or common.*.inc.

PDO really simplify the syntax for driver implementation; BTW, the change of using PDO can't solve some basic syntax different. Maybe we should review the needs of keeping legacy drivers supporting until PHP6, coming with PDO-ONLY implementation?

P.S. for more information about the above research founding and benchmarking, please refer to my recent blog post.

hswong3i’s picture

Well, a simple progress report after a 2 month indeed research and study: I am now able to make both mysql, mysqli, pgsql, oci8, pdo_mysql, pdo_pgsql and pdo_oci get works under a single code base. Since the change affect EVERY code queries, result as a very huge patch, I setup a personal research project: Siren, in order to prepare all stuff before D7 is open for public development. Here I would like to recall some PDO related stuff:

  • ? placeholder for pdo_mysql didn't provide ANY spotlight performance boost, whatever compare with legacy drivers (mysql, mysqli) or locater-style variable binding. I keep trying ? placeholder for 3 releases (Siren beta4, RC1 and both RC2), and finally swap to locater-style in Siren RC3, but all benchmarking result show there is no differences...
  • Manually reuse prepared statement in oci8, pdo_pgsql and pdo_oci didn't provide spotlight performance boost, again. After I clean up those legacy manual prepare statement handling in Siren RC3, the overall performance is even boost up a bit, too ;-p
  • pdo_oci is totally not stable enough. Since most code are shared among Siren's oci8 and pdo_oci implementation (around ~60%), where the rest share identical programming logic with different PHP function call only, pdo_oci face an annoying ORA-01461 error, where oci8 works correctly without any problem. It should belongs to the unstable of PHP pdo_oci implementation, and so that is can't help right now ;-(

For people who are interesting in PHP PDO research, please feel free to comment on this progress report. For more details, please visit my Siren research project page, or its related issues in D.O. Thanks :-)

hass’s picture

Have you measured the SQL load, too with ? placeholders? I think this must be faster, because SQL Server does not do a bind column check on every reques and this will save load. Aside recompiling the SQL statement is no more required after the first call and this is for sure faster (i know well from perl and coldfusion). I'm not sure if you see this difference in a test environment. It could be possible this is not very visible, but url_alias request are very often called and this must be faster. If this is not faster i think there should be a bug in PDO and how prepared statements are cached and so on...

hswong3i’s picture

@hass: I have no idea about why there is no performance boost up with ?, too. Such simple AB test may not figure out indeed different of their performance; BTW, as I mentioned in #26, I try this for 3 release but there is really no different observed (or just different within standard variation).

On the other hand, that MAYBE because of my automatic locater generate handling. Since the base of locater is ALWAYS generated based on the input query statement with MD5 hash, the locater are always mapped with that query. So pdo_mysql may cache it up in user space, and reuse it as like as that we use ? placeholder.

Here are some code snippet for Siren's _db_query_callback():

<?php
/**
 * Helper function for db_query().
 */
function _db_query_callback($match, $init = FALSE) {
  static $args = NULL, $ident = NULL, $count = 0;
  if ($init) {
    if ($match) {
      $count = 0;
      $ident = substr(md5($match['query']), 1, 8);
      $args = array_values($match['args']);
      return;
    }
    return $args;
  }

  $current = $count++;
  $locator = ":l". $ident . $current;
  switch ($match[1]) {
    case '%d':
      return db_escape_decimal($args[$current], $locator);
    case '%f':
      return db_escape_float($args[$current], $locator);
    case '%s':
      return db_escape_string($args[$current], $locator);
    case '%b': // Binary Large OBject.
      return db_encode_blob($args[$current], $locator);
  }
}
?>
hswong3i’s picture

According to my latest research progress, supporting SQLite with PDO is very easy (which I am able to complete ~70% within 2 days). Details and patch are mentioned in here. Within this issue, I would like to note some SQLite related PDO stuff: I skip sqlite but dig into pdo_sqlite directly.

This is because the functionality of SQLite is directly related to its libraries version; on the other hand, pdo_sqlite is the official SQLite v3 PHP driver. At least SQLite v3.x come with RANDOM(), I would like to skip sqlite with legacy SQLite v2.x supporting.

Well... this is just a clone of existing pdo_pgsql research progress... And now, we have 4 databases driver implementation on hand, which both coming with PDO support ;-)

Crell’s picture

Status: Needs review » Closed (duplicate)

This has been superseded by the new, holistic approach: http://drupal.org/node/225450