Updated: as of comment #56
It is unclear whether our transliteration data is correct, and whether it is the best source long-term.
- Ideally, find a reliable upstream data source that completely eliminates downstream maintenance in Drupal core.
- Alternatively, ensure that we're using the best of available data sources.
Possible data sources
Here is a list of possible data sources. I've given each one a nickname for reference below.
- "core current"
- The data that is currently in the Transliteration component in Drupal core. This came from the contrib Transliteration module, whose notes name the following as its data source:
- The Text::Unidecode CPAN module mentioned as the source for the Transliteration contrib module.
- The data in the PHP intl extension's Transliterator class (documented on php.net), which is a wrapper around the ICU project's C++-based Transform code. Some links and info:
- PECL classes: http://php.net/manual/book.intl.php
- ICU overview: http://userguide.icu-project.org/transforms/general
- Info on ICU data: http://userguide.icu-project.org/icudata
- To transliterate with this extension, you have to define a set of transformations to apply, and they are really meant for transliterating bodies of text rather than single characters. That is probably OK here: within the scope of this issue, we are looking for better character-by-character data tables for our existing Transliteration component, which is used for things like making machine names for files, not for producing really good transliterations of, for instance, Russian prose. This topic is discussed in comments #13 and #16-#19.
- The Unicode.org CLDR data that the ICU project apparently bases its transform code on. This is available from http://cldr.unicode.org/index/downloads
- The data in the clean URL generator for Midgard from https://github.com/bergie/midgardmvc_helper_urlize
- The node.js data at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data (in YAML format).
- The JUnidecode project, which claims to be based on the Perl Text::Unidecode module (see above) with some additional data: http://www.ippatsuman.com/projects/junidecode/index.html
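To make the "character-by-character" use case concrete, here is a minimal sketch in Python (not the component's actual PHP code). The table entries, the `transliterate` and `machine_name` functions, and the `?` placeholder for unknown characters are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical miniature table: code point -> ASCII replacement.
# A real table would cover whole Unicode blocks; these entries are examples only.
SAMPLE_TABLE = {
    0x00E9: "e",   # é
    0x00DF: "ss",  # ß
    0x0416: "Zh",  # Ж
}


def transliterate(text, table, unknown="?"):
    """Replace each non-ASCII character on its own, ignoring context."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)  # ASCII passes through unchanged
        else:
            out.append(table.get(cp, unknown))  # no entry -> placeholder
    return "".join(out)


def machine_name(text, table):
    """Hypothetical helper: lowercase, then collapse everything else to '_'."""
    ascii_text = transliterate(text, table).lower()
    return re.sub(r"[^a-z0-9_]+", "_", ascii_text).strip("_")
```

Because each character is looked up independently, the quality of the result depends entirely on the data table, which is why the choice of data source matters for this issue.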
Tasks
- Read in these various data sources, and output each one into the format of our current files.
- Note what the differences are between the data from each source.
- Come up with a patch that replaces our current data.
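The comparison step could look roughly like the following Python sketch. It is not the readdata.php script. It assumes each source has already been read into a dictionary mapping a code point to its replacement string, and the `diff_sources` function and its category names are hypothetical.

```python
def diff_sources(ours, other):
    """Classify every code point on which two data tables disagree.

    Both arguments map an integer code point to a replacement string.
    """
    report = {
        "case_only": {},         # same letters, different capitalization
        "different": {},         # genuinely different replacements
        "missing_in_ours": {},   # the other source has data, we do not
        "missing_in_other": {},  # we have data, the other source does not
    }
    for cp in sorted(set(ours) | set(other)):
        a, b = ours.get(cp), other.get(cp)
        if a == b:
            continue  # the sources agree, nothing to report
        if a is None:
            report["missing_in_ours"][cp] = b
        elif b is None:
            report["missing_in_other"][cp] = a
        elif a.lower() == b.lower():
            report["case_only"][cp] = (a, b)
        else:
            report["different"][cp] = (a, b)
    return report
```

Separating case-only differences from other differences keeps the two kinds of disagreement apart, so each can be reviewed on its own.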
Notes on these tasks and the data sources
- jhodgdon wrote a script, "readdata.php", that reads in various data sources and outputs them in a standard format (the format used in the Drupal Transliteration component) for comparison. As of this update, the latest version of the script is attached to comment #56.
- The data sources that can be compared with this script are: "core current", "midgard", "cpan", "node.js", "JUnidecode", and "intl". For the "intl" data source, a hopefully canonical list of transformations was chosen that covers many Unicode scripts.
- In comment #5, an output file was generated that highlights differences between these data sets (except "intl"). In comment #11, these differences were analyzed and summarized. The conclusion there was that:
- Where our current data set and the other data sets differed only in case, ours was generally correct.
- There were also many other differences (not just case) between our data set and others; these were not systematically analyzed.
- Some of the other data sets have data where we don't, and we should probably import this data if we can so at least we have something to work with in those character ranges.
- In comment #20, the first viable pass at the "intl" data set was made, after choosing a set of hopefully canonical transformations. After spot checking, the conclusion was that the intl output is more accurate, so we should use it in place of our data where the two differ, and as an addition where ours is missing. The other conclusion was that data from the other data sets was not trustworthy enough to adopt where ours was missing.
- After comment #20, there was a side excursion into 5-byte characters that ended up being put on a different issue.
- Then there were several iterations on a patch that fixes up our data set using the hopefully canonical set of "intl" transformations. The latest patch at this point is in comment #56, along with the script used to generate it.
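The merge rule from comment #20 (prefer "intl" where the two differ, add "intl" where ours is missing, keep ours elsewhere) can be sketched in Python as follows. This is an illustration only, not the script attached to comment #56. The `merge_with_intl` function and the dictionary representation of the tables are assumptions.

```python
def merge_with_intl(ours, intl):
    """Apply the rule: intl replaces ours where they differ and fills our gaps.

    Both tables map an integer code point to a replacement string.
    Returns the merged table plus a record of what changed, for review.
    """
    merged = dict(ours)  # start from our data so untouched entries survive
    replaced = {}        # code point -> (old value, new value)
    added = {}           # code point -> new value
    for cp, value in intl.items():
        if cp not in ours:
            added[cp] = value
        elif ours[cp] != value:
            replaced[cp] = (ours[cp], value)
        merged[cp] = value
    return merged, replaced, added
```

Returning the replaced and added entries separately makes it easier to spot check the changes before they go into a patch.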