Hi,
this may or not be a bug, but I wasn't able to get an IN in static query work right. What works is:

$result = db_query_range('SELECT c.nid, c.subject, c.cid, c.timestamp FROM {comments} c INNER JOIN {node} n ON n.nid = c.nid WHERE c.nid IN (' . implode(',', $nids) . ') AND n.status = :nstatus AND c.status = :cstatus ORDER BY c.cid DESC',
      array(
      //':statement' => implode(',', $nids),
      ':nstatus' => 1,
      ':cstatus' => COMMENT_PUBLISHED),
      0, $number);

What doesn't is:

$result = db_query_range('SELECT c.nid, c.subject, c.cid, c.timestamp FROM {comments} c INNER JOIN {node} n ON n.nid = c.nid WHERE c.nid IN :statement AND n.status = :nstatus AND c.status = :cstatus ORDER BY c.cid DESC',
      array(
      ':statement' => implode(',', $nids),
      ':nstatus' => 1,
      ':cstatus' => COMMENT_PUBLISHED),
      0, $number);

And this one is missing results

$result = db_query_range('SELECT c.nid, c.subject, c.cid, c.timestamp FROM {comments} c INNER JOIN {node} n ON n.nid = c.nid WHERE c.nid IN (:statement) AND n.status = :nstatus AND c.status = :cstatus ORDER BY c.cid DESC',
      array(
      ':statement' => implode(',', $nids),
      ':nstatus' => 1,
      ':cstatus' => COMMENT_PUBLISHED),
      0, $number);
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Damien Tournoud’s picture

This probably should be a dynamic query.

Crell’s picture

db_placeholders() is the correct way to handle D6 IN statements. That's not been upgraded to the new API yet, though, and frankly I'm not entirely certain how we want to do so. For now, a dynamic select I know works properly and is very tight and readable. :-)

Crell’s picture

Title: Problem with static SELECTs and WHERE IN-clause » Convert db_placeholders() to DBTNG

OK, so I looked at db_placeholders(). We can convert it to the new API and just have it return incrementing placeholders fairly easily. However, we cannot do that without also updating everywhere it's used, as the API would be changing. So it's best to convert db_placeholders() in one shot across all of core. Renaming the issue accordingly.

The new API should, I think, take an sequential array and return a 2 element array that can be read using list(), containing a string of placeholders and an array of values to substitute for them. So something like this:

$values = array(1, 2, 3, 4, 5);
list($placeholders, $args) = db_placeholders($values);
$result = db_query("SELECT * FROM {foo} WHERE fid IN (" . $placeholders . ")", $args);

Sound like a plan?

CorniI’s picture

I'd just agree with #1 that the new standard for IN-Queries is the dynamic query builder, much easier :D

Crell’s picture

Looking at the patch in http://drupal.org/node/302207#comment-1044710, I'm not sure I'd call that easier. :-)

catch’s picture

It's very non intuitive not being able to get IN to work in static queries - I just spent a couple of hours trying to work out why the multiple version of comment_nodeapi_load() was only working on the first node in my $nids array. I've used a dynamic query for now, but unless we're going to say 'use a static query for static queries, unless you want to use a dynamic query instead' as our guidelines, it's likely to get confusing.

Crell’s picture

@catch: I agree. Now what do you think of the solution proposed in #3? I don't want to move forward with it until I have some idea that it will be accepted.

catch’s picture

Crell: I'd really like to be able to do array(':nids', implode(', ', $nids) - but I have a feeling you'd have suggested that if there was an easy way to do it. Also what if I want to do IN(1,2) AND :something - seems like $args would preclude that.

Damien Tournoud’s picture

What I would *love* to do is:

db_query("SELECT * FROM {foo} WHERE fid IN (:fids) AND type = :type", array(':fids' => array(1, 2, 3), ':type' => 'my_type'));

That would require to go thru $args and check for the type of each argument, which we may or may not want to do, based on performance considerations.

Another approach, perhaps with a lesser performance impact:

db_query("SELECT * FROM {foo} WHERE fid IN (:fids) AND type = :type", array(':fids' => array(1, 2, 3), ':type' => 'my_type'), WITH_PLACEHOLDERS);

My reasoning is that the $query part of db_query() should remain a static string as much as possible.

catch’s picture

Damien - that's even better.

Crell’s picture

Well WITH_PLACEHOLDERS is impossible as the third parameter is already the $options array. If anything it would be an additional options key.

Unfortunately either of those syntaxes would require us to string parse the query to find :fids and replace it with :fids_1, :fids_2, etc. One of the main design goals of DBTNG is to avoid string parsing of queries whenever possible. Right now the only place we do so is for the table prefixing, and if I could get rid of that, too, I would. :-)

(That said, I'm happy to see people putting out ideas for this issue as it is an important one!)

markus_petrux’s picture

A mix between #3 and #9:

function db_placeholders($arguments, $type = 'int') {
  $placeholders = implode(',', array_fill(0, count($arguments), db_type_placeholder($type)));
  return array($arguments, $placeholders);
}

$values = array(1, 2, 3, 4, 5);
$result = db_query("SELECT * FROM {foo} WHERE fid IN (:fids)", array(':fids' => db_placeholders($values)));

db_query could react accordingly when a query argument is an array and take what db_placeholders() did.

Crell’s picture

db_type_placeholder() is going away, as the new query syntax is type-agnostic by design. #12 still requires string parsing the query to find :fids and convert it, just like #9 does. That would require us to have a preprocessor on all queries, which would have to be done before we create the prepared statement object. I'm really not wild about that approach.

catch’s picture

Let's not add string processing back in, nice as it looks on the surface it'd be ugly inside. I'm surprised PDO doesn't have a way to deal with this internally to be honest.

If we're stuck with IN(" $placeholders "), could we perhaps have array(':placeholders' => $args) to allow for IN queries with other conditions?

markus_petrux’s picture

To evaluate performace impact in #9 option 1:


function tmp_query($query, $args = array(), $options = array()) {
  if (!is_array($args)) {
    $args = func_get_args();
    array_shift($args);
  }

  // Expand placeholders for IN() operators.
  foreach ($args as $key => $data) {
    if (is_array($data) && $key[0] == ':') {
      $keys = array();
      foreach ($data as $i => $value) {
        $args[$key . '_' . $i] = $value;
        $keys[] = $key . '_' . $i;
      }
      unset($args[$key]);
      $query = str_replace($key, implode(', ', $keys), $query);
    }
  }
//  list($query, $args, $options) = _db_query_process_args($query, $args, $options);

//  return Database::getActiveConnection($options['target'])->query($query, $args, $options);
}
timer_start('tmp_query');
for ($i=0; $i < 10000; $i++) {
tmp_query("SELECT * FROM {foo} WHERE fid IN (:fids) AND type = :type", array(':fids' => array(1, 2, 3), ':type' => 'my_type'));
}
print timer_read('tmp_query');

The loop took around 222 ms on one system, and 382 on another.

HTH

Crell’s picture

222 ms as compared to...?

Performance is part of the issue. The performance impact on queries that DON'T need an array is also part of it. And there's just general code cleanliness. String parsing of a serialized data structure (eg, SQL) inherently means you serialized too early.

markus_petrux’s picture

222 ms as compared to...?

Ok, I added a check for isset($options['placeholders']) to bypass the foreach loop above. It takes 34 ms when FALSE, and 230 ms when TRUE. The for loops 10000 times in both cases, so it's 0.0034 ms -vs- 0.023 ms per one single call tmp_query() in the example above.

If there are more ideas on how to deal with this, then that could be compared with this example.

String parsing of a serialized data structure (eg, SQL) inherently means you serialized too early.

Sure, but you have to write code that is not so easy to understand / maintain, maybe.

webchick’s picture

Ok, I'm going to have to put my foot down and say #3 is unacceptable. We put db_placeholders() in core for a reason, and that reason is because it's way too easy to introduce SQL injection attacks from not properly escaping your arguments. Removing that would be a huge step backwards in terms of security.

That leaves us with two options:

1. Damien's suggestion about ' ... IN(:fids) ...', array(':fids' => array(1, 2, 3)), which I personally find easiest to read and most logical. Would require an is_array() check on all placeholders on all queries.

2. Some sort of "magical" placeholder. Perhaps @placeholder to signify an array of values, or :db_placeholder to try and pick something that's not going to be a database column. Would require string parsing all placeholders on all queries.

I'd like to see benchmarks of both approaches to know which way is best. I can't tell from #15 / #17 if that's what's being tested... sorry, been a long week. :(

catch’s picture

OK, so my suggestion in irc, which was was very late at night so might have a fundamental flaw was this:

db_query("SELECT * FROM node WHERE nid IN(:placeholders) AND published = :published", array(':placeholders' => array(1, 2, 3, 4, 5), ':published' => 1));

function db_query($query, $args)
  if (isset($args[':placeholders'])) {
    unset($args[':placeholders']);
  }
[.. continue as normal ..]

If that's viable, then it's a lot less expensive than parsing for the magic placeholder query - we're just doing an isset, then only do the str_replace etc. after that.

The issue here is you can only use one IN() per query. Not pretty, but we can do the same thing for $args[':placeholders2'] etc. if we really have to.

Assuming no major holes, I reckon this is the least worst solution - it allows us to keep the current syntax with no serious performance implications.

webchick’s picture

I would really love to NOT have to do that. :( But we might be forced to if the other way proves too expensive.

We'd have to decide a depth to go to, for example 3 (placeholders, placeholders2, placeholders3) that we deemed "No query is possibly going to be crazy enough to need THREE IN() clauses!" But, if someone /was/ to write some bizarro query that required it, their only recourse would be to hack core.

There's also a DX problem in that people are not going to expect to need a "fancy" placeholder to do something that's standard SQL. I predict "Unknown column 'Array' in 'where clause' " being one of our top 10 Troubleshooting FAQs for D7 when your example ends up getting translated to "SELECT * FROM node WHERE nid IN(Array) AND published = 1" (or whatever it does).

pwolanin’s picture

What happened to the suggestion from last night of using @ or some other special char to indicate an array?

something like:

function db_query($query, $args = array(), $options = array()) {
  if (!is_array($args)) {
    $args = func_get_args();
    array_shift($args);
  }
  list($query, $args, $options) = _db_query_process_args($query, $args, $options);

  foreach ($args as $key => $data) {
    // is_array() is slow, so do an initial char-based check.
    if ($key[0] == '@' && is_array($data)) {
      $new_keys = array();
      $base = ':' . substr($key, 1);
      foreach ($data as $i => $value) {
        $p = $base . '_' . $i;
        while (isset($args[$p])) {
          $p .= mt_rand();
        }
        $new_keys[$p] = $value;
      }
      $query = str_replace($key, implode(', ', $new_keys), $query);
      unset($args[$key]);
    }
  }

  return Database::getActiveConnection($options['target'])->query($query, $args, $options);
}
markus_petrux’s picture

IMO, that's great idea :)

- Suggestion 1:


$key = '@abc';

timer_start('timer-1');
for ($i=0; $i < 10000; $i++) {
  $base = ':' . substr($key, 1);
}
print timer_read('timer-1') ."<br />\n";

timer_start('timer-2');
for ($i=0; $i < 10000; $i++) {
  $base = $key;
  $base[0] = ':';
}
print timer_read('timer-2') ."<br />\n";

First method took on my system 7.64 ms. Second method took on my system 3.63 ms. So it looks like $base = $key; $base[0] = ':'; is significantly faster than $base = ':' . substr($key, 1);.

- Suggestion 2: if using @ prefix, maybe it could be assumed that $data is an array, so checking for is_array($data) could be removed. Alternative solution: document that @ can only be used when argument is an array.

- Suggestion 3: maybe it could also be removed the while loop/mt_rand. Just document that these placeholders need to use a unique prefix.

pwolanin’s picture

quick benchmark, suggests that having the check on each arg isn't too bad:

elapsed with stub code: 0.585354089737
elapsed w/ no @: 0.983224153519
elapsed w/ 1 @: 2.87513208389
elapsed w/ 2 @: 5.26019191742
function db_query($query, $args = array(), $options = array()) {
  if (!is_array($args)) {
    $args = func_get_args();
    array_shift($args);
  }

  foreach ($args as $key => $data) {
    // is_array() is slow, so do an initial char-based check.
    if ($key[0] == '@' && is_array($data)) {
      $new_keys = array();
      $base = ':' . substr($key, 1);
      foreach ($data as $i => $value) {
        $p = $base . '_' . $i;
        while (isset($args[$p])) {
          $p .= mt_rand();
        }
        $new_keys[$p] = $value;
      }
      $query = str_replace($key, implode(', ', $new_keys), $query);
      unset($args[$key]);
    }
  }
//  list($query, $args, $options) = _db_query_process_args($query, $args, $options);
//  return Database::getActiveConnection($options['target'])->query($query, $args, $options);
}

function dummy_query($query, $args = array(), $options = array()) {
  if (!is_array($args)) {
    $args = func_get_args();
    array_shift($args);
  }

//  list($query, $args, $options) = _db_query_process_args($query, $args, $options);
//  return Database::getActiveConnection($options['target'])->query($query, $args, $options);
}

$start = microtime(TRUE);
for ($i=0; $i < 100000; $i++) {
  dummy_query("SELECT * FROM {foo} WHERE fid = :fid AND type = :type AND status = :status", array(':fid' => 1, ':type' => 'my_type', ':status' => 1));
}
$end = microtime(TRUE);
echo "elapsed with stub code: ". ($end - $start) ."\n";

$start = microtime(TRUE);
for ($i=0; $i < 100000; $i++) {
  db_query("SELECT * FROM {foo} WHERE fid = :fid AND type = :type AND status = :status", array(':fid' => 1, ':type' => 'my_type', ':status' => 1));
}
$end = microtime(TRUE);
echo "elapsed w/ no @: ". ($end - $start) ."\n";

$start = microtime(TRUE);
for ($i=0; $i < 100000; $i++) {
  db_query("SELECT * FROM {foo} WHERE fid IN (@fids) AND type = :type AND status = :status", array('@fids' => array(1, 2, 3), ':type' => 'my_type', ':status' => 1));
}
$end = microtime(TRUE);
echo "elapsed w/ 1 @: ". ($end - $start) ."\n";

$start = microtime(TRUE);
for ($i=0; $i < 100000; $i++) {
  db_query("SELECT * FROM {foo} WHERE fid IN (@fids) AND nid in (@nids) AND type = :type AND status = :status", array('@fids' => array(1, 2, 3), '@nids' => array(1, 2, 3, 4, 5, 6), ':type' => 'my_type', ':status' => 1));
}
$end = microtime(TRUE);
echo "elapsed w/ 2 @: ". ($end - $start) ."\n";
pwolanin’s picture

If we don't do the check on the 1st char, but just do:

    if (is_array($data)) {

the results are like:

elapsed w/ no @: 1.32884788513
elapsed w/ 1 @: 3.0096218586
elapsed w/ 2 @: 5.33900809288
markus_petrux’s picture

I tried to mean just checking for if ($key[0] == '@') {. :-|

markus_petrux’s picture

hmm... just noticed that your code could only be used for numeric values. It would fail for string_col IN (:values).

The following includes my previous suggestions and fixes that (adding dynamically generated arguments to $args array).

function tmp_query($query, $args = array(), $options = array()) {
  if (!is_array($args)) {
    $args = func_get_args();
    array_shift($args);
  }

  foreach ($args as $key => $data) {
    if ($key[0] == '@') {
      $new_args = array();
      $base = $key;
      $base[0] = ':';
      foreach ($data as $i => $value) {
        $new_args[$base . '_' . $i] = $value;
      }
      $query = str_replace($key, implode(', ', array_keys($new_args)), $query);
      unset($args[$key]);
      $args += $new_args;
    }
  }
//  list($query, $args, $options) = _db_query_process_args($query, $args, $options);
//  return Database::getActiveConnection($options['target'])->query($query, $args, $options);
}
markus_petrux’s picture

Benchmarks...

Code in #23 took on my system:

elapsed with stub code: 0.278710126877
elapsed w/ no @: 0.61524105072
elapsed w/ 1 @: 2.38443303108
elapsed w/ 2 @: 4.19608592987

Code in #26 took on my system:

elapsed with stub code: 0.280394077301
elapsed w/ no @: 0.611225128174
elapsed w/ 1 @: 2.18950986862
elapsed w/ 2 @: 4.24191999435
Crell’s picture

OK, so if I'm following correctly, the @ solution doubles the time for a query but we're dealing with a tiny number to start with. It then has an incrementally larger impact on queries that actually DO have an array in them, by a linear amount.

Are any of the above micro-benchmarks for just the is_array() method? I agree with webchick that I'd rather magically detect arrays than magically detect @placeholder, if at all possible. Personally I'd veto :magic_name unless there's absolutely no alternative.

Whatever mechanism we go with I think it's logical to say that the overall cost for when there IS an array will be about equal, since it's essentially the same loop either way. So the main question is what the performance impact is on the 90% of queries that do not need the array handling.

Can someone try making these modifications to core as a patch and then benchmarking a normal page load? That will do a better job of telling us what the real world cost is than micro-benchmarks.

Damien Tournoud’s picture

/me bet that the full page benchmark impact will be below the margin of error.

pwolanin’s picture

whoops - yes - apparently I forgot the line to add the new args to the existing args.

Probably Damien is right - that the rest of the overhead involved with executing a query will case any of the proposed changes to have a nearly unmeasureable effect on overall page serving time.

markus_petrux’s picture

Crell, please note that the above benchmarks lack a small detail that may raise less difference between the proposal and the current method. It is the fact that "elapsed w/ X @" samples ought to be compared with db_placeholders(). The sample "elapsed with stub code" does not use IN () conditions.

Also, these tests loop 100,000 times. When 100,000 iterations take 2 seconds, that means one single step takes 0.00002 seconds.

Crell’s picture

@#31: As I said, let's see benchmarks on a full page load on HEAD to see what the real impact is. We shouldn't need to do any query conversion for that, just modify the conditional, since whatever we do the cost when we DO need to process an array should be a constant between the mechanisms being considered. We just want to test the cost of the extra checking.

markus_petrux’s picture

I don't have the time now to test with a real page on D7 now, but this is something that may give an idea on how much overhead would be added with the foreach loop for checking @ in db_query().

/**
 * This function contains the code that would be added to db_query().
 */
function dummy($args) {
  foreach ($args as $key => $data) {
    if ($key[0] == '@') {
      $new_args = array();
      $base = $key;
      $base[0] = ':';
      foreach ($data as $i => $value) {
        $new_args[$base . '_' . $i] = $value;
      }
      $query = str_replace($key, implode(', ', array_keys($new_args)), $query);
      unset($args[$key]);
      $args += $new_args;
    }
  }
}

/**
 * Loop 100,000 times with an array of arguments with no @.
 */
$start = microtime(TRUE);
for ($i=0; $i < 100000; $i++) {
  dummy(array(':fid' => 1, ':type' => 'my_type', ':status' => 1));
}
$end = microtime(TRUE);
echo "elapsed with no @: ". ($end - $start) ."\n";

elapsed with no @: 0.451392889023

That's 451 ms to loop 100,000 times.

HTH :)

Crell’s picture

Assigned: Unassigned » Crell
Status: Active » Needs review
FileSize
4.14 KB
4.12 KB
4.12 KB

OK, getting back to this...

Attached are three patches. All use @, although for one of them it wouldn't actually matter. All are based on the code from #21 above, and there is a unit test included to confirm that it works.

One patch checks only based on the use of @.

One patch checks only based on whether the data is an array. (If we went this route we would not use the @ sign, but for now it provides a more direct speed comparison since it's just the if statement we're really concerned about.)

One patch checks both for an @ and for whether the data is an array.

I'm uploading all three to see if the bot can handle all of them at once. :-) But we can now benchmark the total effect on a Drupal page load for normal queries. Once we know which approach we want to take we can micro-optimize the effectively constant cost for processing array-based placeholders and decide whether we want the @ sign or not, etc.

Dave Reid’s picture

Yay testing bot! Subscribing to help benchmark in the morning.

markus_petrux’s picture

If @ is required, then checking for ($key[0] == '@') is faster and checking for is_array() might not be needed if the the execute method fail if it finds a remainig array in any $args item?

Another potential advantage when @ is required is that it may help identify this kind of queries when scanning code for whatever purpose.

Crell’s picture

@ #36: That's all dependent on what the benchmarks tell us. If the difference between the two methods is huge, then that answers it for us. If it's small, then we can pick our mechanism based on the DX factors of the resulting syntax rather than on performance. Let's wait for the numbers and see.

Dave Reid’s picture

Here's my preliminary benchmarking results.

I used the following code (adjusted a little for each patch's syntax and detailed in each benchmarking result) and ran with ab -c 1 -n 1000:

define('DRUPAL_ROOT', dirname(realpath(__FILE__)));
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

$nids = db_query("SELECT nid FROM {node} ORDER BY created DESC LIMIT 10")->fetchAll(PDO::FETCH_COLUMN);
$nodes = db_query("SELECT * FROM {node} WHERE nid IN (@nids)", array('@nids' => $nids))->fetchAll();
$uids = db_query("SELECT uid FROM {users} ORDER BY name DESC LIMIT 25")->fetchAll(PDO::FETCH_COLUMN);
$users = db_query("SELECT * FROM {users} WHERE uid IN (@uids)", array('@uids' => $uids))->fetchAll();

exit();

From fastest to slowest:
122.481 ms: $key[0] == '@'
123.259 ms: $key[0] == '@' && is_array($data)
123.692 ms: is_array($data)
127.999 ms: db_placeholders()\

Crell’s picture

Thanks, Dave. What we really need, though, is a benchmark of a normal Drupal page load with each of these patches applied as compared to an unpatched HEAD using ab (and indicating what the settings were). That way we can gauge the "real world" impact of each method.

Interestingly it looks like any of these is an improvement over db_placeholders(), which is nice.

Dave Reid’s picture

I was hesitant to run the benchmarks on a normal Drupal page since it wouldn't show the true comparison benchmarks until the queries that use db_placeholder are replaced with '@'. BTW, I ran the previous tests with ab -c 1 -n 1000 http://mysql.drupalhead.local/test.php.

Crell’s picture

Well at the moment all we're looking for is the impact of the extra if() check. That's going to run for every query either way, so that will at least give us a comparison between the different mechanisms. You're right that a full benchmark against core would require a full conversion, which is a PITA, but from the benchmarks you already did it looks like we'll probably come out ahead either way.

catch’s picture

The three patches are mixed up with some query object caching stuff it seems.

Crell’s picture

That's deliberate. Trying to cache the statement object for a processed query would be very difficult, as we can't guarantee the same placeholders are generated each time. So instead I just disabled caching for those queries. The odds of them being run multiple times on the same page request is fairly low anyway.

Dave Reid’s picture

FileSize
2.14 KB

Here's my benchmarks for current HEAD with 50 users, 50 nodes (10 on front page), lots of taxonomy terms, nearly all core module enabled, and a few blocks on the front page (recent blog posts, who's new, who's online). Run with ab -c 1 -n 500 -C MY_USER1_SESSION_COOKIE http://mysql.drupalhead.local/, so logged in as user 1. I restarted apache before each test.

Crell’s picture

Hm. If I'm reading that right, the answer is "they're all virtually identical and have nearly no performance impact after all". I find that somewhat surprising, but I'd be very happy if it's true. :-)

If that's the case, then do we want to use @ flagging or the presence of an array? I think I would marginally prefer the array check, as it reduces the funky characters in the query string. OTOH, the @ makes it more obvious what you're expecting. Hm...

markus_petrux’s picture

Benchmarking against a normal page may encapsulate the fact than one method is faster than the other. It may not have significant impact on normal page, but it may for some kind of crons, importing/exporting nodes, when processing a lot of data, certain views, etc.

The first benchmarks just compared the cost of using @ and/or is_array, etc. so I would take that into account as well.

Crell’s picture

Well the cost should be a constant multiplied by the number of queries executed, so any difference should be there. Even if we take the benchmarks in #38, there's a 0.8% difference between @ and is_array(), and both are faster than db_placeholder(). So I think the verdict is "the performance difference isn't big enough for us to care".

catch’s picture

Seems like the string comparison vs. is_array won't have any practical impact, so all other things being equal, my vote is for

db_query("SELECT name FROM {test} WHERE age IN (:ages) ORDER BY age", array(':ages' => array(25, 26, 27)))->fetchAll();

Since it's one less thing to remember, and in general very, very nifty.

markus_petrux’s picture

Having no significant reason to do it otherwise, then it looks great like #48, TBH.

Status: Needs review » Needs work

The last submitted patch failed testing.

Dave Reid’s picture

Status: Needs work » Needs review

Testing slave #8 failure.

Dries’s picture

Status: Needs review » Fixed

Committed to CVS HEAD. Thanks all.

catch’s picture

Status: Fixed » Needs work

The tests included with the patch which was committed don't match the :placeholder syntax which (I think) was agreed on.

Dave Reid’s picture

Status: Needs work » Active

Let's make sure we get a followup to correct this doc change:

Index: includes/database/database.inc
@@ -349,10 +349,14 @@ abstract class DatabaseConnection extend
    * @param $query
    *   The query string as SQL, with curly-braces surrounding the
    *   table names.
+   * @param $query
+   *   Whether or not to cache the prepared statement for later reuse in this
+   *   same request.  Usually we want to, but queries that require preprocessing
+   *   cannot be safely cached.

:)

Crell’s picture

Wait, Dries, what did you commit? The patches posted earlier were for benchmarking, not for applying. They all needed more tweaking before they were commit ready.

Are you agreed on the is-array-only check then? I'm fine with that, but there was other code in that patch that is totally irrelevant if we go that route. (It was retained for simplicity in testing and benchmarking.)

Crell’s picture

Status: Active » Needs review

Attached patch is a follow-up cleanup.

1) Fix lots of documentation bugs, including #54.

2) Fix the unit test to use :, not @, which is what we're going to use.

3) Remove the rand-based duplicate key checking from the expandArguments() method. It's not really needed unless the caller is doing something extremely stupid with his placeholders, which therefore falls into the "don't babysit broken code" category. A big block o' comments was added to explain that fact. Net result, this is probably the most efficient we can reasonably make that algorithm.

Crell’s picture

FileSize
2.97 KB

Sigh.

drewish’s picture

Status: Needs review » Needs work

Fails under PostgreSQL:

Array to string conversion	Notice	database.inc	1513	DatabaseStatementBase->execute()	
SELECT name FROM {test} WHERE age IN (:ages) ORDER BY age - Array ( [:ages] => Array ( [0] => 25 [1] => 26 [2] => 27 ) ) SQLSTATE[22P02]: Invalid text representation: 7 ERROR: invalid input syntax for integer: "Array"	Uncaught exception	database.inc	70	DatabaseConnection_pgsql->query()	
David Strauss’s picture

Subscribing.

Crell’s picture

Status: Needs work » Needs review
FileSize
3.67 KB

Stupid PostgreSQL... Try this.

drewish’s picture

FileSize
4.19 KB

Crell suggested changing the line in expandArguments: $query = str_replace($key, implode(', ', $new_keys), $query); to $new_keys to array_keys($new_keys)

That got the test to pass in both MySQL and PostgreSQL.

markus_petrux’s picture

One potential problem using :placeholder here is that the str_replace may potentially affect other pleaceholders. Please, consider this:

db_query("SELECT name FROM {test} WHERE foo IN (:foo) AND bar = :foobar", array(
  ':foo' => array(25, 26, 27),
  ':foobar' => 28,
));

The following str_replace will affect both placeholders:

$query = str_replace($key, implode(', ', array_keys($new_keys)), $query);

If we're using @, then the str_replace would be safer in this case because @ would only be used for this kind of arguments. The above statement using @ would work:

db_query("SELECT name FROM {test} WHERE foo IN (@foo) AND bar = :foobar", array(
  '@foo' => array(25, 26, 27),
  ':foobar' => 28,
));

Though, it would still fail on queries like this:

db_query("SELECT name FROM {test} WHERE foo IN (@foo) AND bar = @foobar", array(
  '@foo' => array(25, 26, 27),
  '@foobar' => array(28, 29),
));

Another possibility would be using preg_replace instead of str_replace. But, then that would be slower.

Maybe doxygen could document this potential problems?

David Strauss’s picture

It needs to be documented or totally fixed, not just with an "@" prefix for the placeholders. Using "@" for the placeholders only mitigates and masks a very real problem.

Dave Reid’s picture

I'm pretty sure that we solved this problem in the SQLite driver with preg_replace.

Crell’s picture

I'm inclined to just document that issue. The latest patch already removes the rand-key-generator, on the grounds that we expect the caller to use sane placeholders. I don't see why we can't do the same here. Trying to bullet-proof that code is going to slow it down.

markus_petrux’s picture

Quick benchmark to compare str_replace with a possible way of using preg_replace:

$query = 'SELECT name FROM {test} WHERE age IN (:ages) ORDER BY age';
$key = ':ages';
$new_keys = array(
  ':ages_0' => 25,
  ':ages_1' => 26,
  ':ages_2' => 27,
);
$times = 10000;

timer_start('timer-1');
for ($i=0; $i < $times; $i++) {
  $dummy = str_replace($key, implode(', ', array_keys($new_keys)), $query);
}
$ms = timer_read('timer-1');
print 'Using str_replace: '. $ms .'ms for '. $times .' times, '. round($ms / $times, 6) ."ms average.<br />\n";

timer_start('timer-2');
for ($i=0; $i < $times; $i++) {
  $dummy = preg_replace('#'. $key .'(\W|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
}
$ms = timer_read('timer-2');
print 'Using preg_replace: '. $ms .'ms for '. $times .' times, '. round($ms / $times, 6) ."ms average.<br />\n";

Result:

Using str_replace: 34.86ms for 10000 times, 0.003486ms average.
Using preg_replace: 47.61ms for 10000 times, 0.004761ms average.

Edited to fix a bug, last timer_read was using 'timer-1', oops. Results fixed as well. Sorry.

drewish’s picture

Wouldn't strtr() do the trick?

markus_petrux’s picture

Wouldn't strtr() do the trick?

Nope, it produces a similar result than using str_replace.

$query = 'SELECT name FROM {test} WHERE foo IN (:foo) AND bar = :foobar';
$key = ':foo';
$new_keys = array(
  ':foo_0' => 25,
  ':foo_1' => 26,
  ':foo_2' => 27,
);
$result = str_replace($key, implode(', ', array_keys($new_keys)), $query);
print 'Using str_replace: '. $result ."\n";
$result = strtr($query, array($key => implode(', ', array_keys($new_keys))));
print 'Using strtr: '. $result ."\n";
$result = preg_replace('#'. $key .'(\W|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
print 'Using preg_replace: '. $result ."\n";

Results:

Using str_replace: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foo_0, :foo_1, :foo_2bar
Using strtr: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foo_0, :foo_1, :foo_2bar
Using preg_replace: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foobar
catch’s picture

I think it's reasonable to just document this, queries are going to blow up when this bug gets hit and the comments should make it clear why.

markus_petrux’s picture

Using preg_replace means adding just less than 0.005 milliseconds per query using argument with array(). Not fixing this, even if documented, may potentially cause annoying bugs. It's a matter of DX versus a little overhead penalty.

Dries’s picture

Sorry for pulling the trigger too fast on this. I committed Crell's and drewish' patch in #61. We can continue to refine as necessary, so I'm not marking this 'fixed' yet. Thanks for cleaning up behind me. ;-)

Crell’s picture

Thanks, Dries.

So the only remaining question, I think, is do we:

1) Document that using placeholders that are substrings of each other is a bad idea because they'll collide and expect module developers to not be dumb.

2) Switch from str_replace() to preg_replace() to have a less error-prone string replacement mechanism that won't create collisions (as much?) at the cost of a bit more processing time (but only when doing array expansion in queries).

3) All of the above.

Input?

David Strauss’s picture

I prefer correctness over documenting pitfalls. The performance impact is pretty negligible.

markus_petrux’s picture

won't create collisions (as much?)

As far as $key contains ":" followed by any number of letters, numbers or underscores, it won't. In the regular expression \W and $ should catch anything that is not a letter, number, underscore or end of subject.

Maybe one possible optimization would be using [^_a-zA-Z0-9] instead of \W. Here's a quick benchmark:

$query = 'SELECT name FROM {test} WHERE age IN (:ages) ORDER BY age';
$key = ':ages';
$new_keys = array(
  ':ages_0' => 25,
  ':ages_1' => 26,
  ':ages_2' => 27,
);
$times = 10000;

timer_start('timer-1');
for ($i=0; $i < $times; $i++) {
  $dummy = str_replace($key, implode(', ', array_keys($new_keys)), $query);
}
$ms = timer_read('timer-1');
print 'Using str_replace: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";

timer_start('timer-2');
for ($i=0; $i < $times; $i++) {
  $dummy = preg_replace('#'. $key .'(\W|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
}
$ms = timer_read('timer-2');
print 'Using preg_replace v1: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";

timer_start('timer-3');
for ($i=0; $i < $times; $i++) {
  $dummy = preg_replace('#'. $key .'([^_a-zA-Z0-9]|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
}
$ms = timer_read('timer-3');
print 'Using preg_replace v2: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";

Results:

Using str_replace: 34.81 ms for 10000 times, 0.003481 ms average.
Using preg_replace v1: 49.17 ms for 10000 times, 0.004917 ms average.
Using preg_replace v2: 48 ms for 10000 times, 0.0048 ms average.

It seems preg_replace v2 tends to cost a bit less than v1.

Testing results:

$query = 'SELECT name FROM {test} WHERE foo IN (:foo) AND bar = :foobar';
$key = ':foo';
$new_keys = array(
  ':foo_0' => 25,
  ':foo_1' => 26,
  ':foo_2' => 27,
);
$result = str_replace($key, implode(', ', array_keys($new_keys)), $query);
print 'Using str_replace: '. $result ."<br />\n";
$result = strtr($query, array($key => implode(', ', array_keys($new_keys))));
print 'Using strtr: '. $result ."<br />\n";
$result = preg_replace('#'. $key .'(\W|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
print 'Using preg_replace v1: '. $result ."<br />\n";
$result = preg_replace('#'. $key .'([^_a-zA-Z0-9]|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
print 'Using preg_replace v2: '. $result ."<br />\n";
Using str_replace: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foo_0, :foo_1, :foo_2bar
Using strtr: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foo_0, :foo_1, :foo_2bar
Using preg_replace v1: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foobar
Using preg_replace v2: SELECT name FROM {test} WHERE foo IN (:foo_0, :foo_1, :foo_2) AND bar = :foobar

preg_replace v1 and v2 are equivalent.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Looking at those benchmarks, and since we only need to do that tiny extra bit of work on queries using the array anyway, I agree we should just go ahead and fix this.

John Morahan’s picture

If we use preg_replace with \b (as we did with SQLite), it is a bit faster than the other preg_replace versions, although still not quite as fast as str_replace:

$query = 'SELECT name FROM {test} WHERE age IN (:ages) ORDER BY age';
$key = ':ages';
$new_keys = array(
  ':ages_0' => 25,
  ':ages_1' => 26,
  ':ages_2' => 27,
);
$times = 1000000;

timer_start('timer-1');
for ($i=0; $i < $times; $i++) {
  $dummy = str_replace($key, implode(', ', array_keys($new_keys)), $query);
}
$ms = timer_read('timer-1');
print 'Using str_replace: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";

timer_start('timer-2');
for ($i=0; $i < $times; $i++) {
  $dummy = preg_replace('#'. $key .'(\W|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
}
$ms = timer_read('timer-2');
print 'Using preg_replace v1: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";

timer_start('timer-3');
for ($i=0; $i < $times; $i++) {
    $dummy = preg_replace('#'. $key .'([^_a-zA-Z0-9]|$)#', implode(', ', array_keys($new_keys)) .'\1', $query);
}
$ms = timer_read('timer-3');
print 'Using preg_replace v2: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";

timer_start('timer-4');
for ($i=0; $i < $times; $i++) {
  $dummy = preg_replace('#' . $key . '\b#', implode(',', array_keys($new_keys)), $query);
}
$ms = timer_read('timer-4');
print 'Using preg_replace v3: '. $ms .' ms for '. $times .' times, '. round($ms / $times, 6) ." ms average.<br />\n";
Using str_replace: 4251.93 ms for 1000000 times, 0.004252 ms average.
Using preg_replace v1: 5701.79 ms for 1000000 times, 0.005702 ms average.
Using preg_replace v2: 5795.97 ms for 1000000 times, 0.005796 ms average.
Using preg_replace v3: 5215.83 ms for 1000000 times, 0.005216 ms average.
Crell’s picture

OK, it looks like there's a consensus for using preg, and it looks like the \b of SQLite is the fastest by a slim margin. Can someone roll the appropriate patch and update the inline docs as necessary?

markus_petrux’s picture

FileSize
998 bytes

John, that's a good one. \b is perfect here.

I tried to roll a patch that contains an explanation about the issue, but it probably could be made it easier. Anyway, HTH.

markus_petrux’s picture

Status: Needs work » Needs review

Oops, I forgot to change the status. Sorry.

Dave Reid’s picture

IMHO, we should probably be using the following since we don't have anything specifically document that says "You cannot use the '#' character in a placeholder."
$query = preg_replace('/' . preg_quote($key) . '\b/', implode(', ', array_keys($new_keys)), $query);

markus_petrux’s picture

hmm... maybe this should be documented that placeholders should only contain letters, numbers and underscores?

If placeholders could contain other characters, then \b could find another placeholder that should not be affected.

Dries’s picture

I think it is fine not to accept # as part of a placeholder.

David Strauss’s picture

Can we throw an exception if the placeholder contains illegal values? I know we're not supposed to "babysit broken code" and all that, but I've long said that's not my personal priority/agenda.

Crell’s picture

I think PDO will already throw an exception if you pass it garbage as a placeholder. No need for us to do so as well.

David Strauss’s picture

@Crell Can we get unit tests to check for such exceptions with garbage placeholders?

catch’s picture

Status: Needs review » Reviewed & tested by the community

Can we do that in a separate issue? We don't try to lock down t() placeholders to stop people using @!#@^^ instead of @foo and I don't see that this is much different.

David Strauss’s picture

Status: Reviewed & tested by the community » Needs work

@catch The task of adding tests should never be relegated to a separate issue, and our lack of enforcement and tests for invalid placeholders in t() is not argument that we should lack the same tests here.

I just don't find "I *think* PDO throws an exception" to be an acceptable answer. We need to at least know what should happen with invalid placeholders, even if that handling is embedding in PDO, and I'd prefer that we test for that expected outcome properly propagating through DB-TNG.

catch’s picture

FileSize
12.41 KB

Discussed this with David in irc, and I pointed out that afaik we still allow ? placeholders in code if not by convention, despite them breaking devel query logging, so I stand by moving that discussion to a separate issue.

On another note, db_placeholders() still lives. Attached patch removes it and all calls to it except taxonomy_select_nodes, which is a bit of a pig and needs a full conversion to db_select() before we can do it properly - but we'll need to do it before we can close this so leaving at CNW. It's getting late so I'm not taking that on tonight. Some of the other queries need converting to db_select for other reasons too, but this issue is long enough I think and they ought to be caught by the general dbtng conversions.

catch’s picture

I've moved the conversion of static queries with db_placeholders to #352054: Convert calls to db_placeholders() in static queries.

Dries’s picture

I committed #352054: Convert calls to db_placeholders() in static queries so we should be back on track now. Thanks catch! :)

markus_petrux’s picture

I did a small test with sqllite driver for PDO and it seems it does not like # as part of a placeholder.

try {
  $dbh = new PDO('sqlite:/tmp/foo.db');
  $dbh->query('CREATE TABLE foo (x int);');
  $dbh->prepare('SELECT * FROM foo WHERE x = :foo#');
  var_dump($dbh->errorInfo());
} catch (PDOException $e) {
    echo 'Connection failed: ' . $e->getMessage();
}

Result:

array(3) { [0]=>  string(5) "HY000" [1]=>  int(1) [2]=>  string(23) "unrecognized token: "#"" }

When trying with :foo#bar, it reported a syntax error.

However, it was possible to use placeholders like :fòó (note the accents). If someone uses non-word characters, the preg_replace statement will stop at the first non-word character, probably resulting in SQL syntax error, most likely detected during development, but who knows...

If the syntax for placeholders is checked, then it will penalize runtime performance. Maybe this kind of checks could be delegated to coder module?

Crell’s picture

I am totally cool with documenting "ASCII alphanumerics and underscores only" and adding that to the coder module as well. Otherwise we run into "babysitting" territory.

markus_petrux’s picture

Status: Needs work » Needs review
FileSize
998 bytes

Here's the patch in #79 updated with an explanation on how named placeholders need to be formatted.

markus_petrux’s picture

Sorry, I attached the wrong one.

Dries’s picture

Status: Needs review » Fixed

I've committed the patch in comment #95. If necessary, we can follow-up with additional patches. I've marked it 'fixed' but feel free to re-open with a patch.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.