Verify transliteration data sources and their quality, and potentially eliminate maintenance [#1823454]

Updated: as of comment #56

Follow-up to #567832: Transliteration in core:

Problem

It is unclear whether our transliteration data is correct, and whether it is the best source long-term.

Goal

Ideally, find a reliable upstream data source that completely eliminates downstream maintenance in Drupal core.
Alternatively, ensure that we're using the best of available data sources.

Possible Data sources

Here is a list of possible data sources. I've given each one a nickname for reference below.

"core current"

The data that is currently in the Transliteration component in Drupal core. This came from the contrib Transliteration module, which had the following notes on the data source:

UTF-8 normalization is based on UtfNormal.php from MediaWiki and transliteration uses data from Sean M. Burke's Text::Unidecode module.

"cpan"

The Text::Unidecode CPAN module mentioned as the source for the Transliteration contrib module.

"intl"

The data behind the Transliterator class in PHP's intl extension (php.net/PECL), which is a wrapper around the ICU project's C++-based transform code. Some links and info:

PECL classes: http://php.net/manual/book.intl.php
ICU overview: http://userguide.icu-project.org/transforms/general
Info on ICU data: http://userguide.icu-project.org/icudata
To transliterate with this extension, you have to define a set of transformations to apply, and these transformations are really meant for bodies of text rather than single characters. That is probably OK within the scope of this issue: we are trying to find better character-by-character data tables for our existing Transliteration component, which is meant for things like making machine names for files, not for producing really good transliterations of, for instance, Russian prose. This topic is discussed in comment #13 and in comments #16-#19.
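To make "character-by-character" concrete, here is a minimal Python sketch of a table lookup of the kind the component relies on. It is only an illustration: the function name, the "?" placeholder for unknown characters, and the three sample table entries are assumptions made up for this example, not data from any of the sources listed here.

```python
# Hypothetical sample table: Unicode code point -> ASCII replacement.
# These entries are illustrative only, not copied from any data source.
SAMPLE_TABLE = {
    0x00E9: "e",   # LATIN SMALL LETTER E WITH ACUTE
    0x00DF: "ss",  # LATIN SMALL LETTER SHARP S
    0x0416: "Zh",  # CYRILLIC CAPITAL LETTER ZHE
}


def transliterate(text, table, unknown="?"):
    """Replace each non-ASCII character on its own, with no context."""
    out = []
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:
            # ASCII characters pass through unchanged.
            out.append(ch)
        else:
            # Anything missing from the table becomes the placeholder.
            out.append(table.get(code_point, unknown))
    return "".join(out)


print(transliterate("caf\u00e9 Stra\u00dfe", SAMPLE_TABLE))
```

Because every character is looked up independently, this kind of table cannot capture context-dependent rules, which is why it suits machine names better than prose.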

"unicode.org"

The Unicode.org data that the ICU project apparently bases their transform code on. This is available from http://cldr.unicode.org/index/downloads

"midgard"

The data in the clean URL generator for Midgard from https://github.com/bergie/midgardmvc_helper_urlize

"node.js"

The node.js data at https://github.com/bitwalker/stringex/tree/master/lib/unidecoder_data (in YAML format).

"JUnidecode"

The JUnidecode project, which claims to be based on the Perl Text::Unidecode module (see "cpan" above) with some additional data: http://www.ippatsuman.com/projects/junidecode/index.html

Tasks

Read in these various data sources, and output each one into the format of our current files.
Note what the differences are between the data from each source.
Come up with a patch that replaces our current data.
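Once two sources have been read into the same shape, the comparison step could look roughly like the Python sketch below. This is not the readdata.php script; the function name, the report keys, and the sample data are assumptions for illustration, and each source is assumed to be a plain mapping from code point to replacement string.

```python
def compare_tables(ours, theirs):
    """Classify every code point on which two tables disagree."""
    report = {"only_ours": [], "only_theirs": [], "case_only": [], "other": []}
    for code_point in sorted(set(ours) | set(theirs)):
        a = ours.get(code_point)
        b = theirs.get(code_point)
        if a == b:
            continue  # identical entries are not reported
        if b is None:
            report["only_ours"].append(code_point)
        elif a is None:
            report["only_theirs"].append(code_point)
        elif a.lower() == b.lower():
            # Same letters, different capitalization.
            report["case_only"].append(code_point)
        else:
            report["other"].append(code_point)
    return report


# Hypothetical sample data, not taken from any real source.
ours = {0x0416: "Zh", 0x00E9: "e", 0x00DF: "ss"}
theirs = {0x0416: "ZH", 0x00E9: "e", 0x00DF: "sz", 0x0429: "Shch"}
print(compare_tables(ours, theirs))
```

Separating case-only differences from the rest keeps the much larger "other" bucket from hiding them during review.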

Notes on these tasks and the data sources

jhodgdon made a script called "readdata.php" that reads in various data sources and outputs them in a standard format (the format used in the Drupal Transliteration component) for comparison. The latest version of this script is attached to comment #56 at this time.
The data sources that can be compared with this script are: "core current", "midgard", "cpan", "node.js", "JUnidecode", and "intl". For the "intl" data source, a hopefully canonical list of transformations was chosen that covers many Unicode scripts.
In comment #5, an output file was generated that highlights differences between these data sets (except "intl"). In comment #11, these differences were analyzed and summarized. The conclusion there was that:
- Case differences between our current data set and other data sets were generally correct in ours.
- There were also many other differences (not just case) between our data set and others; these were not systematically analyzed.
- Some of the other data sets have data where we don't, and we should probably import this data if we can so at least we have something to work with in those character ranges.
In comment #20, the first viable pass at the "intl" data set was done (after choosing a set of hopefully canonical transformations). After spot checking, the conclusion was that the intl output is more accurate, so we should use it in place of our data where the two differ, and as additions where ours is missing. The other conclusion was that data from the other data sets was not trustworthy enough to adopt where ours was missing.
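The merge rule from comment #20 can be sketched in Python as follows. This is an illustration under assumptions, not the code that generated the patch: the function name and sample data are made up, both tables are assumed to map code points to replacement strings, and a value of None is assumed to mean "intl produced nothing for this character".

```python
def merge_tables(ours, intl):
    """Prefer intl where it has a value; keep ours everywhere else."""
    merged = dict(ours)
    changed = []
    for code_point, value in intl.items():
        if value is None:
            continue  # assumption: None means intl had no result
        if merged.get(code_point) != value:
            merged[code_point] = value
            changed.append(code_point)
    # Return the changed code points too, so they can be spot checked.
    return merged, sorted(changed)


# Hypothetical sample data, not taken from any real source.
ours = {0x00DF: "sz", 0x00E9: "e"}
intl = {0x00DF: "ss", 0x0429: "Shch", 0x4E2D: None}
merged, changed = merge_tables(ours, intl)
print(merged, changed)
```

Entries that only exist in our data survive untouched, which matches the conclusion that the other data sets should not be used to fill gaps.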
After comment #20, there was a side excursion into 5-byte characters, which was moved to a separate issue.
After that, there were several iterations on a patch that fixes up our data set using the hopefully canonical set of "intl" transformations. The latest patch at this point is in comment #56, along with the script used to generate it.

Comment	File	Size	Author
#61	readdata.php_.txt	18.5 KB	jhodgdon
#61	transdata-1823454-61.patch	725.64 KB	jhodgdon
#61	interdiff-transdata-56-61.txt	427.91 KB	jhodgdon
#56	transdata-1823454-56.patch	285.71 KB	jhodgdon
#56	readdata.php_.txt	18.47 KB	jhodgdon
#56	patchok.png	93.79 KB	jhodgdon
#43	readdata.php_.txt	18.41 KB	jhodgdon
#43	1823454-intldata-43.patch	254.34 KB	jhodgdon
#39	1823454-intldata-37.patch	300.2 KB	jhodgdon
#36	1823454-intldata-35.patch	300.91 KB	jhodgdon
#33	1823454-intldata.patch	307.78 KB	jhodgdon
#33	readdata.php_.txt	17.94 KB	jhodgdon
#27	5-byte-transliteration-test.patch	2.72 KB	jhodgdon
#20	differences.txt	1.84 MB	jhodgdon
#20	readdata.php_.txt	15.67 KB	jhodgdon
#13	rules.txt	1.44 KB	jhodgdon
#11	readdata.php_.txt	14.85 KB	jhodgdon
#11	differences.txt	1.51 MB	jhodgdon
#5	differences.txt	1.59 MB	jhodgdon
#5	differences-all.txt	2.16 MB	jhodgdon
#5	readdata.php_.txt	14.56 KB	jhodgdon

Title:	Test data sources and have better data for transliteration	» Verify transliteration data sources and their quality, and potentially eliminate maintenance
Issue tags:		+transliteration

Assigned:	damien tournoud	» jhodgdon
Status:	Needs review	» Needs work

Parent issue:		» #567832: Transliteration in core
