Needs work
Project:
Drupal core
Version:
main
Component:
locale.module
Priority:
Normal
Category:
Task
Assigned:
Unassigned
Reporter:
Created:
12 Jul 2010 at 08:06 UTC
Updated:
8 Apr 2019 at 09:44 UTC
Comments
Comment #1
damien tournoud commented
Hm. I'm not sure about that.
1. The hash is not guaranteed to be unique. As a consequence, it should not be a unique key
2. We should have both hash = :hash and source = :source
One method I would consider is to combine the partial-index feature of MySQL and the functional index feature of other database engines by making an index on (context, source(30)) and querying on:
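The combined-index approach can be sketched as follows, using SQLite's expression indexes as a stand-in for MySQL's prefix indexes and PostgreSQL's functional indexes. The table mirrors a simplified {locales_source}; this is an illustration of the idea, not core code.

```python
import sqlite3

# Simplified stand-in for Drupal's {locales_source} table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locales_source (lid INTEGER PRIMARY KEY, source TEXT, context TEXT)"
)
# Expression index playing the role of MySQL's (context, source(30)) prefix
# index or a PostgreSQL functional index.
conn.execute(
    "CREATE INDEX source_context ON locales_source (context, substr(source, 1, 30))"
)
conn.execute("INSERT INTO locales_source (source, context) VALUES ('Add content', '')")

# The prefix comparison lets the index narrow the candidates; the full source
# comparison keeps the result exact even when two strings share a prefix.
row = conn.execute(
    "SELECT lid FROM locales_source "
    "WHERE context = ? AND substr(source, 1, 30) = substr(?, 1, 30) AND source = ?",
    ("", "Add content", "Add content"),
).fetchone()
print(row)  # → (1,)
```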
This query can be optimized by at least MySQL, PostgreSQL and SQL Server: MySQL will satisfy context = :context AND source = :source with its partial index, while PostgreSQL and SQL Server should be able to satisfy context = :context AND SUBSTRING(source, 1, 30) = SUBSTRING(:source, 1, 30) using a functional index (or what SQL Server calls an indexed computed column, but the idea is the same).
Comment #2
andypost
Here's a benchmark of the front page with 20 nodes and locale enabled:
ab -n 50 -c 2 http://drupal7/
before patch
after patch
Comment #4
andypost
Agreed about uniqueness: because we have a version column, the same source could be stored for different versions.
A functional index is the current implementation, but import could be much faster because less data is transferred between PHP and MySQL.
Also, some contrib modules change these indexes, like i18n does: #803380: locales_source.location index
Comment #5
andypost
Patch with a normal index.
Comment #6
andypost
Marked as duplicate of #811158: We should not index on (source, context)
Comment #7
damien tournoud commented
Could we have a reroll of this one?
Comment #8
andypost
For D8 we don't need hook_update_N().
For D7 there's trouble with the update, because update.php tries to use t().
Comment #9
c960657 commented
No, we don't keep individual sources for different versions. The version field is for keeping a record of when we last asked for a translation of the string.
I guess Damien's point is that you cannot be sure that the hash function will produce a unique string (?). There is a (theoretical) possibility that two strings will generate the same hash. If your code assumes that the hashes are unique, you might as well mark the column as unique.
I suggest we add a separator, e.g. \0, to avoid collisions between e.g. $source='foo', $context='bar' and $source='foobar', $context=''.
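A minimal demonstration of the collision the separator prevents, using Python's hashlib as a stand-in for the PHP hashing (purely illustrative, not core code):

```python
import hashlib

def hash_key(source, context, sep=""):
    # Concatenate source and context with an optional separator and hash.
    return hashlib.md5((source + sep + context).encode()).hexdigest()

# Without a separator, ('foo', 'bar') and ('foobar', '') concatenate to the
# same string "foobar" and therefore collide.
print(hash_key("foo", "bar") == hash_key("foobar", ""))              # True
# With a "\0" separator the two concatenations differ, so the hashes differ.
print(hash_key("foo", "bar", "\0") == hash_key("foobar", "", "\0"))  # False
```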
It is no longer an MD5 hash.
drupal_hash_base64() seems to always return a 44-character string.
I suggest we name the column just "hash" (or "hash_key", assuming it is unique).
Comment #10
andypost
Not sure about uniqueness - data could be migrated. Also, "hash" is a reserved word in SQL.
Agreed on "\0" separator!
Field comment fixed.
Added a fix to not preprocess JS files in MAINTENANCE_MODE, because preprocessing tries to access the table by the new hash column.
Comment #11
c960657 commented
I think we should kill the duplicates in the upgrade hook. Having duplicates leads to undefined results (with and without the patch). For each duplicated source we can easily delete those that have not been translated. If more than one has been translated, we should keep the one that is returned by the default SELECT query used in locale() (assuming that the query always returns the same row - if it doesn't, then we can't do better than just keeping a random row).
Hmm, the name is used in {registry_file}.
Comment #12
andypost
A unique hash column could be a PITA if we don't take VERSION into account.
The idea of cleaning up duplicates is great, but out of scope for this patch. OTOH I found no contrib tools to do it; suppose this could be a separate task within the upgrade process. We have a lot of sites that moved 5->6->7, so they have a lot of outdated strings.
EDIT: "hash" is a reserved word in Oracle PL/SQL.
Comment #13
droplet commented
Sub from #705284: Locale strings Import Performance
ref: #622500: Add hashkey to strings and translations for quicker comparison
In the l10n.drupal.org case, it compares 100 times as many strings as Drupal core, but string import is still very fast (3 times faster than Drupal core at importing the same .po). It seems we still have some room for improvement.
Comment #14
catch
Haven't reviewed the patch here, but someone opened #1131048: Interface translation import times out during install (and there are duplicates) suggesting the installer won't complete on some hosting due to locale strings import - would this help with that?
Comment #15
droplet commented
Is drupal_hash_base64() too complex in this case? We only need a unique key here.
Using CRC32 is another choice.
From l10n data, we have only a few collisions; CRC32 saves more space & comparisons.
My l10n test data: ~196310 rows
SELECT CRC32(value) AS crc, COUNT(value) AS total FROM l10n_server_string GROUP BY crc HAVING total > 1
Result: 22 rows
** I haven't done a performance test.
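The same collision check can be approximated outside the database. A sketch using Python's zlib.crc32 on a synthetic corpus (the real test above ran against the l10n_server_string table, so the exact counts will differ):

```python
import zlib
from collections import Counter

# Synthetic stand-in for ~200k distinct source strings.
strings = ["String number %d" % i for i in range(200_000)]

# Count how many CRC32 values are shared by more than one string, mirroring
# the GROUP BY ... HAVING total > 1 query from the comment above.
counts = Counter(zlib.crc32(s.encode()) for s in strings)
collisions = sum(1 for c in counts.values() if c > 1)
print(collisions)  # a handful at most: CRC32 is 32-bit, so rare but real
```

The birthday bound predicts on the order of n²/2³³ colliding values for n strings, which for 200k strings is only a few, matching the 22-in-196310 observation above in order of magnitude.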
Comment #16
gábor hojtsy
Comment #17
gábor hojtsy
Adding the UI language translation tag.
Comment #18
andypost
SQLite does not support md5() and other such functions, so it doesn't matter which algorithm we use.
For performance reasons there's only one bottleneck: translation import.
Comment #19
gerhard killesreiter commented
Gabor asked me for input on this.
He asked about using md5: I don't see this as a problem, since we already use it for cache keys (e.g. filter cache).
I think benchmarks in this case should be done by evaluating queries directly in the SQL CLI, since Apache/PHP add too much fluctuation.
Comment #20
gábor hojtsy
OK, so it looks like we don't want to use drupal_hash_base64() then.
Comment #21
sutharsan commented
Rerolled #10. None of the recent comments have been included yet.
Comment #22
droplet commented
See #15 & #20.
Comment #23
sutharsan commented
crc32 hash implemented. Known problem: the upgrade test fails because D7 tables have no hashkey field (yet).
For performance tests with different hash algorithms I suggest a few scenarios:
The first test gives a normal situation. The second test disables the short string cache. The third test puts the highest load on the database by requesting many translations.
Comment #25
droplet commented
Just querying on CRC32 is not safe; it should add a comparison, e.g.:
SELECT lid FROM {locales_source} WHERE hashkey = :hashkey AND source = :source
Theoretically, other hash functions are the same, but most scripts (and Drupal core?) assume they never hit collisions. (I'm not sure.)
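A hedged sketch of this hash-then-verify lookup pattern, using SQLite and CRC32 in Python. Table and index names are illustrative, not the actual core schema:

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE locales_source (lid INTEGER PRIMARY KEY, source TEXT, hashkey INTEGER)"
)
# A plain (non-unique) index: CRC32 can collide, so uniqueness is not assumed.
conn.execute("CREATE INDEX hashkey_idx ON locales_source (hashkey)")

def add_string(source):
    conn.execute(
        "INSERT INTO locales_source (source, hashkey) VALUES (?, ?)",
        (source, zlib.crc32(source.encode())),
    )

def lookup(source):
    # The hash narrows the index scan; comparing the full source guards
    # against the rare CRC32 collision.
    return conn.execute(
        "SELECT lid FROM locales_source WHERE hashkey = ? AND source = ?",
        (zlib.crc32(source.encode()), source),
    ).fetchone()

add_string("Add content")
print(lookup("Add content"))    # → (1,)
print(lookup("No such string")) # → None
```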
Comment #26
andypost
"\0" is a bad separator; more info can be found at #532512: Plural string storage is broken, editing UI is missing
In that issue we introduced LOCALE_PLURAL_DELIMITER, which is "\3". Also, "\0" is not compatible with pgsql.
Not sure we should proceed with crc32 - all of core uses md5() and drupal_hash_base64().
So there's no reason to implement a third hash algorithm. Let's convert this to md5().
Comment #27
gábor hojtsy
Well, the \0 will never actually reach the database in this patch, right? So whether it is compatible with some database does not matter.
Comment #28
sutharsan commented
@droplet, fully agree that crc32 alone is not safe. It was not meant to be; it is only meant to be a database accelerator. My criteria for a hash are 1. database performance and 2. PHP performance. We don't need a secure or (absolutely) unique result; we only need a relatively unique result to speed up the database index. I think 22 duplicates in 200k strings is unique enough. The combination with 'source' will do the trick. But performance tests may have the final answer.
@gabor, agreed that "\0" does not hit the database. Additionally, I checked that CRC32 produces a different result with and without this separator.
Comment #29
sutharsan commented
Source, context and hash included in queries where applicable.
Patch re-rolled because of #746240: Race condition in locale() - duplicates in {locales_source}
Comment #30
sutharsan commented
I carried out performance tests. I have found no grounds to use a hashkey for performance improvement.
The tests
Two kinds of test were performed:
With three variations each:
All tests are carried out with increasing number of records in the locales_source and locales_target database tables.
The results
1. No significant difference in performance between crc32 and md5.
2. Import is slightly slower when using an additional hashkey column.
3. Reading translations is not measurably slower when using an additional hashkey column.
Note that all measurements show a large variation in results; +/- 20% is no exception! The variation is smaller when measuring with MySQL only (test 2), and larger when using both PHP and MySQL (test 1).
For details see the attachments.
One series of tests was carried out using a different patch in which the locales_source table had only an index on the 'hashkey' column. No significant performance difference was found.
Comment #32
andypost
This query should run without s.source = 'Add content' AND s.context = ''. The reason to use a hash is to eliminate querying on the source blob field! The string used for the search is md5($source . "\3" . $context), which should be faster.
source & context should be removed
same
context is always a string
source and context should be removed
Comment #33
sutharsan commented
@andypost, your assumption is that the hashkey is unique. As @droplet showed in #15, CRC32 is not usable. It would be nice if we could test md5($source . "\0" . $context) with all source strings of l.d.o. But chances are good: this probability table calculates the chance of two strings having the same md5 hash as less than 10^-18 (l.d.o contains approx. 350000 source strings; md5 is a 128-bit hash).
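The quoted probability follows from the birthday bound; a quick back-of-the-envelope check (the 350000 figure comes from the comment above):

```python
# Birthday bound: for n random strings and a b-bit hash, the probability of
# at least one collision is approximately n^2 / 2^(b+1).
n = 350_000   # approximate number of source strings on l.d.o
b = 128       # MD5 output width in bits
p = n * n / 2 ** (b + 1)
print(p)  # on the order of 1e-28, comfortably below 10^-18
```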
Comment #34
droplet commented
@Sutharsan,
LDO also makes an assumption on the hashkey.
If we assume the md5 hashkey is unique, I think we can move the hashkey into the locales_target table and remove the LEFT JOIN?
Comment #35
sutharsan commented
@droplet,
What assumption does l.d.o make on the hashkey?
We could do away with lid, but not with the LEFT JOIN. And replacing lid with hashkey would significantly increase the amount of data in locales_target. And less is faster.
Comment #36
sutharsan commented
Hashkey changed to md5, source and context removed from the queries, source_context index removed. Todo: upgrade path.
New performance tests show approximately equal results with and without the hashkey. Executing t() and format_plural() with the hashkey takes 2-6% longer.
Comment #37
sutharsan commented
Comment #38
andypost
LDO uses a hashkey to look up an existing string in l10n_drupal_callback_save_string(), so its import is much faster.
Also see other places: _l10n_community_import_one_string() and l10n_gettext_store_string().
As committed in #532512: Plural string storage is broken, editing UI is missing, it's better to use \3 as the separator, but LDO does:
Comment #39
sutharsan commented
_locale_import_one_string_db() does use the hashkey to look up an existing source. Note that #1189184: OOP & PSR-0-ify gettext .po file parsing and generation will overhaul the import process. My bigger worry is the reduced performance of t() and format_plural() due to this change. If we can't fix that, let's drop the issue.
Comment #40
andypost
Changed hash('md5') to md5() - my tests show it's about 2 times faster.
Added an update hook; let's wait for the upgrade tests.
I think the hashkey field should be pre-populated on update, and we actually need a unique index on it.
Comment #41
andypost
Suppose we should add the column somewhere in update_prepare_d8_bootstrap().
EDIT: In #12 I mentioned the VERSION key.
Comment #42
andypost
Let's discuss a measurement strategy, taking into account that we have pgsql and sqlite too.
The patch adds an update function, locale_update_8006(), with a batch to calculate hashes. Probably we should run this update right after locale_update_8002(), before we get new duplicates in strings.
This hunk is removed in the patch.
Suppose we need another approach to prevent string duplicates, because update.php could be run in non-English, and while no hashes are calculated we get duplicates that were killed in locale_update_8002().
Comment #43
andypost
Another idea is to use this hash in the upcoming core i18n_server integration, so having the same algorithm as the server is preferable. Related: #1445004: Implement customized translation bit on translations
The bot is happy; now we need performance tests to pass the gate.
Comment #44
podarok
#43 @andypost
What kind of tests do we need?
Just ab?
Or possibly just a menu callback with some kind of locale code?
Comment #45
sutharsan commented
My strategy was to use focused testing with minimum influence from processes we are not working on. Testing with 'realistic' situations does not give useful information. As an example, let's test the response time of the Drupal front page using ab. The variation in the response time is large (StdDev > 10%), approx. 50 strings are loaded through t(), translated strings are cached in t() per page call, and translated strings are cached by locale() in the database. A single (uncached) call of t() takes approx. 0.2 msec, so 50 t() calls on the home page take approx. 10 msec. Even large variations of 20% in t() execution (= 2 msec) will never be noticed in ab tests of the home page (70 - 100 msec). Testing realistic situations is not usable; we need focused testing.
I executed three kinds of tests:
All tests were performed with a varying number of strings in the database, simulating small and medium size sites.
No. 1 tests the database performance of a set of select queries which could be executed by t().
No. 2 tests the execution time of the import function _locale_import_read_po(). Time is measured by adding time_start() and time_read() functions before and after the function call.
No. 3 tests the execution time of a large number of t() and format_plural() function calls. Internal caching is turned off as much as possible to focus on the database retrieval. For this I have written a tTest module which you can find in my sandbox. The module calls t() or format_plural() with a fixed set of strings taken from D7 core. Locale string caching was turned off during all tests.
Comment #46
droplet commented
Back to needs work first:
- update.php fails when Localize is enabled.
- the hashkey field is not added in locale_update_8006.
- switching between patched/non-patched leaves some strings without a hashkey (this may also affect benchmark results).
Comment #47
andypost
Also we need to discuss a hash strategy - I think it's good to have the same hash-generation algorithm as l10n_server.
@droplet, re 1 & 2: the field is added in update_prepare_d8_language().
Comment #48
andypost
I'm going to change the current storage: remove lid and add hashkey as the primary key for locales_source, and hashkey + langcode as the primary key for locales_target. What to do with the version column? The only possible performance degradation could appear on joins... we can test that later.
Benefits:
1) easy to sync with l.d.o
2) kills duplicates by the nature of the PK
3) probably makes mass string import easier
Comment #49
gábor hojtsy
The version column, once again, is only used in locale() for lookup I believe (and is updated on use), to limit the size of the in-memory cache for short strings (if you had strings left over from previous versions). I'm not sure of its effectiveness: with Drupal releases spread this far apart (3 years in between them), and with many more contribs in use which also change a lot (but are not represented in this version tracking), I think it's likely that the version tracking does not have much usefulness (and it slows down pages on major version updates as long as these updates happen for strings).
Comment #50
gábor hojtsy
Correction: it looks like the version column is also used for minor versions, in which case this might still be a good way to weed out outdated strings on a site. That is really the only use of it that I know of.
Comment #51
soul88
As I'm already aware of this: http://drupal.org/node/1189184 I know that the following code will be thrown away at some point. But as I've already written it, I'll post it here. It may show how batch inserts/updates could be implemented in the current situation.
1. I didn't write it to the end and didn't test it, so it's more a sketch of the idea than code ready for use.
2. It relies on the schema changes posted by andypost and modified at the code sprint today. We assumed that the hash should be the substitution for "lid" as the PK.
P.S. As far as I can tell, db_merge() can't do batch updates, which is why I decided to do insert+delete instead of one update.
P.P.S. The motivation for this code comes from here: http://drupal.org/node/361597#comment-5899520
Comment #52
jair commentedComment #53
alansaviolobo commented
42: 851362-locale-hashkey-42.patch queued for re-testing.
Comment #55
alansaviolobo commented
Re-rolled the patch.
Comment #57
alansaviolobo commented
Comment #58
alansaviolobo commented
Ignore the previous comment.
Comment #59
alansaviolobo commented
Comment #61
gábor hojtsy
Comment #64
rodrigoaguilera
I will reroll this.
Comment #65
rodrigoaguilerarerrol with no conflicts
Comment #66
rodrigoaguilera
Comment #67
gábor hojtsy
OK, how does this change the performance of imports and lookups? @Sutharsan in #25 explained a strategy but did not explain the results AFAICS. We need performance results.
Comment #68
andypost
Comment #70
damienmckenna
Bumping to the 8.2.x branch.
Comment #77
mpp commented
Note that #65 is still using the old array notation.