So some bright soul did optimize locale module to always store empty translations of string for all languages. This was not enforced actually:

- when a new language was added, all source strings (in the db at the time) got an empty translation in the target language
- BUT when later new strings were identified in locale(), an empty translated was added only in the current language (later to the other languages with more overhead)
- also when importing a PO file, only the current language was affected
- the new JS translation also only affected the current language

This type of thinking let locale queries use INNER JOINs though, and this was buggy due to the above misbehavior.

While preparing/reviewing some patches, it repeatedly turned out that this is not a right practice (and as it seems it is even not a performance helper as the original intention indicated). Having these empty translations in the database makes the tables bigger, and makes it hard to manage the tables as we need to take care of all languages on all occasions to properly have the empty translations there.

So I went on and created this patch, which does the following:

- we did not store the empty translations anymore (update path even included to clean these out)

- INNER JOINs converted to LEFT JOINs for this very reason

- cleaned up PO file import, removed lots of code duplication by reusing the same logic for plural and non-plural stuff (this should make brmassa a bit more happier, as he suggested even more refactoring (http://drupal.org/node/147915), which could happen after this patch lands)
-> made this consistent to remove strings from the DB if in overwrite mode and the "new" translation is empty
-> used constants instead of magic names for the import modes

- cleaned up locale() to do *LESS* SQL queries if a non-existent string is found (it did use an INNER join and then had no idea whether the translation or source string was missing if there was no result obviously, so it did play around both tables to find it out)

- cleaned up the locale cache code, so it does a LEFT JOIN which is just enough to get empty strings for the nonexistent translations of source strings, these don't need to be actually in the database

I would welcome comments and reviews, and would really love to get this in soon, as clean and simple use of the textgroups is dependent of this cleanup (eg. http://drupal.org/node/147041 and http://drupal.org/node/146033 which deal with variable translation include small parts of this cleanup).

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Gábor Hojtsy’s picture

Oh, yeah and to add to the performance gains, I did remove a cache cleanup from locale() previously executed if an unknown string was found. This is debatable as the con of not including it there is that we don't need to cleanup and generate the cache 20 times a page if we have 20 new strings, but on the other hand we did not have the cache updated with the empty translations for short strings, so later page views also get direct hits on the DB for each new short string.

All in all I think the performance improvements/changes of this patch are dependent on whether LEFT JOIN is slower then INNER JOIN in these cases. But the code cleanup is mostly dependent on using LEFT JOIN, to get rid of some empty string mangling.

Jose Reyero’s picture

Status: Needs review » Needs work

Hi Gabor,

I think the feature would be great, and the patch looks good and appliesl. There's a bug though, after applying the patch in _locale_import_parse_header, looks like some messed up variable names and it breaks the .po import. I don't understand that part of the patch.

Array to string conversion in /home/workspace/drupal-i18n/includes/locale.inc on line 1294.

Gábor Hojtsy’s picture

Status: Needs work » Needs review
FileSize
24.45 KB

Jose, thanks for the heads up. Indeed, I did not test that part properly. Now I fixed those issues and tested both automatic and manual imports, all work just fine with plural forms and everything.

I also noticed that we actually need that cache remove on the locale() INSERT operation, as that only removes the cache field from the database, does not regenerate the cache (and the cache is in memory already in that request), so it is not a hit on the page but makes the new string possibly appear in the cache the next request. So it is fine there.

I also tried to find claims to back up this change. First went into MySQL and EXPLAIN-ed the LEFT JOIN and INNER JOIN version of the queries. I have done this on an updated Drupal test site, which had this patch applied (so no empty translations) and a not updated site, which had the old empty strings approach. I have imported all translations in the installer for all active modules and added a couple of "empty languages", so thousands of empty strings exist in the database. All I get for the EXPLAIN is the same for the before-the-patch version and after-the-patch versions and same for both joins. This shows that MySQL is clever enough to use indexes to axe the stuff which is not relevant. An example with a LEFT JOIN:

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s LEFT JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Home' AND t.language = 'hu' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | lnoinsert.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)

So what we see is that it basically selects by source (this string fit into the 32 char source string index) and then matches the lid found with the target table's lid index. Then it actually looks through the results found, so we are better off not to have the empty strings there, as it means less strings with the same lid to look through! That seems to back up my points.

I also went on to look at the cvs blame log so I can get an idea why this empty string approach was introduced the first place. Unfortunately this was not in a standalone patch, but included with "the new locale system" (Wed Aug 11 11:26:20 2004 UTC, commits: http://cvs.drupal.org/viewcvs/drupal/drupal/includes/locale.inc?r1=1.14&... and http://cvs.drupal.org/viewcvs/drupal/drupal/modules/locale/locale.module...) which introduced the managed database backend for translations.

So I did not found a reason *against* removing superfluous empty translation storage from the code.

I would welcome two types of reviews:

1. functionality: test the code!
2. performance: scrutinize if this could be worse for whatever reason then the current approach.

killes@www.drop.org’s picture

The reason for saving empty translation were added was the ability to use INNER joins instead of LEFT joins. LEFT joins are _:much:_ slower than INNER ones.

Gábor Hojtsy’s picture

My EXPLAINs above showed that it should really make no difference to switch between LEFT and INNER join in this case. Anyway, I went on and coded a short benchmark script:

include_once 'includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

$langcode = 'hu';
$iterations = 100;
$join = 'LEFT';

timer_start('join_queries');
for ($i = 0; $i < $iterations; $i++) {
  $result = db_query("SELECT source FROM {locales_source}");
  $count = db_num_rows($result);
  while ($source = db_fetch_object($result)) {
    db_query("SELECT s.lid, t.translation FROM {locales_source} s $join JOIN {locales_target} t ON s.lid = t.lid WHERE s.source = '%s' AND t.language = '%s' AND s.textgroup = 'default'", $source->source, $langcode);
  }
}
$time = timer_stop('join_queries');

echo $join .' : '. $time['time'] / $iterations / $count ."\n";
echo $count ." strings\n";
echo db_result(db_query("SELECT COUNT(*) FROM {locales_target}")) ." translations\n";
echo db_result(db_query("SELECT COUNT(DISTINCT(language)) FROM {locales_target}")) ." languages\n";
echo db_result(db_query("SELECT COUNT(*) FROM {locales_target} WHERE translation = ''")) ." empty translations\n\n";

Then I run this on a DB with many empty strings and no empty strings at all:

goba$ php bench.php 
LEFT : 0.11561778563
1698 strings
1672 translations
1 languages
0 empty translations
-----------------------------
goba$ php bench.php 
INNER : 0.114272320377
1698 strings
1672 translations
1 languages
0 empty translations
-----------------------------
goba$ php bench.php
LEFT : 0.129041671372
1771 strings
5306 translations
4 languages
3607 empty translations
-----------------------------
goba$ php bench.php
INNER : 0.125906437041
1771 strings
5306 translations
4 languages
3607 empty translations

This shows that INNER and LEFT JOIN usage on the same data set is different with small error margins. (Other runs I did showed similar numbers sometimes in favor of the INNER sometimes in favor of the LEFT JOINs). It also shows that by not storing empty translations there, we have a small performance gain (but not a noticably huge one).

I don't see that the argument stands that we should not use LEFT JOINs because they are slower by any means in this case. Maybe Gerhard can run this bench on his data set or point out what I was missing in the benchmarks.

To summarize, by applying my patch:
1. we use less database rows, smaller database, smaller dump, smaller backups to save
2. we use a lot less code which is more clear
3. we gain slightly better performance by the looks of the above benchmark

Gábor Hojtsy’s picture

By the way, measurements done with MySQL 5.0.24a-Debian_9ubuntu2-log and all tables being MyISAM. The strings were real imports of Hungarian and Czech translations, so they represent real life data sets.

Dries’s picture

I looked at this patch and I've no comments. The research done is this patch is great (much appreciated) which makes me feel 100% confident to delegate this to Gabor. So Gabor, feel free to commit this patch when you think its RTBC. :)

killes@www.drop.org’s picture

Goba, can you post both an EXPLAIN for INNER and LEFT Joins on the same dataset?

killes@www.drop.org’s picture

Actually, the explain using INNER should be done on a set which contains empty strings and the LEFT on a set that doesn't.

Gábor Hojtsy’s picture

OK, I have the two datasets in the 'installer' and 'lnoinsert' databases (= test sites). Note that the 'installer' site has Czech string imported and the 'lnoinsert' site has Hungarians, but that really makes no difference.

mysql> use installer;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> select count(*) from locales_target;
+----------+
| count(*) |
+----------+
|     5306 | 
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from locales_target where translation = '';
+----------+
| count(*) |
+----------+
|     3607 | 
+----------+
1 row in set (0.05 sec)

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s LEFT JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Home' AND t.language = 'cs' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | installer.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.03 sec)

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s INNER JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Home' AND t.language = 'cs' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | installer.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)

Search for a nonexistent translation (in this case empty value for all languages!)
----------------------------------------------------------------------------------

mysql> select * from locales_source s left join locales_target t on s.lid = t.lid where s.source = "Direction";
+------+---------------------------------------+-----------+-----------+------+-------------+----------+------+--------+
| lid  | location                              | textgroup | source    | lid  | translation | language | plid | plural |
+------+---------------------------------------+-----------+-----------+------+-------------+----------+------+--------+
| 1751 | /installer/?q=admin/settings/language | default   | Direction | 1751 |             | cs       |    0 |      0 | 
| 1751 | /installer/?q=admin/settings/language | default   | Direction | 1751 |             | ak       |    0 |      0 | 
| 1751 | /installer/?q=admin/settings/language | default   | Direction | 1751 |             | ka       |    0 |      0 | 
+------+---------------------------------------+-----------+-----------+------+-------------+----------+------+--------+
3 rows in set (0.00 sec)

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s INNER JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Direction' AND t.language = 'cs' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | installer.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)
mysql> use lnoinsert;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select count(*) from locales_target;
+----------+
| count(*) |
+----------+
|     1672 | 
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from locales_target where translation = '';
+----------+
| count(*) |
+----------+
|        0 | 
+----------+
1 row in set (0.04 sec)

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s INNER JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Home' AND t.language = 'hu' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | lnoinsert.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.01 sec)

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s LEFT JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Home' AND t.language = 'hu' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | lnoinsert.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)

Search for a nonexistent translation (in this case no row in locales_target!)
-----------------------------------------------------------------------------

mysql> select * from locales_source s left join locales_target t on s.lid = t.lid where s.source = "Unspecified error";
+-----+----------------+-----------+-------------------+------+-------------+----------+------+--------+
| lid | location       | textgroup | source            | lid  | translation | language | plid | plural |
+-----+----------------+-----------+-------------------+------+-------------+----------+------+--------+
|   1 | misc/drupal.js | default   | Unspecified error | NULL | NULL        | NULL     | NULL |   NULL | 
+-----+----------------+-----------+-------------------+------+-------------+----------+------+--------+
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT s.lid, t.translation FROM locales_source s LEFT JOIN locales_target t ON s.lid = t.lid WHERE s.source = 'Unspecified error' AND t.language = 'hu' AND s.textgroup = 'default';
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
| id | select_type | table | type | possible_keys  | key    | key_len | ref             | rows | Extra       |
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
|  1 | SIMPLE      | s     | ref  | PRIMARY,source | source | 32      | const           |    1 | Using where | 
|  1 | SIMPLE      | t     | ref  | language,lid   | lid    | 4       | lnoinsert.s.lid |   18 | Using where | 
+----+-------------+-------+------+----------------+--------+---------+-----------------+------+-------------+
2 rows in set (0.00 sec)

I think this pretty much shows (again) that we are better of with less data, swapping the joins make no difference.

Gábor Hojtsy’s picture

Status: Needs review » Fixed

OK, committed to unblock others working on i18n and l10n issues.

Anonymous’s picture

Status: Fixed » Closed (fixed)
JirkaRybka’s picture

This patch broke localization admin interface - search for untranslated strings. My bugreport already created elsewhere: http://drupal.org/node/171562