Problem
- iso.inc is a misleading file name. ISO has 18500 standards; we only have a few well-defined things here, and they are not all governed by ISO standards.
- The first time HTML included language support (in HTML 4), it already did not use ISO codes but a derivative standard that composed language identifiers from them in different combinations (http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.1.1)
- HTML 5, which is the target for Drupal 8, uses language codes as defined by BCP 47, which moved on from using ISO codes as its reference long ago
Goal
- Standardize function names, the file name, code and documentation to avoid referring to outdated standards
- Improve developer experience by using less magic names
Proposed solution
- Refer to W3C language tags as a user/developer friendly name for current standard language codes in the UI and code.
- Rename iso.inc to standard.inc to clear up the confusion.
- Rename functions in the file to use a common naming scheme.
Comment | File | Size | Author |
---|---|---|---|
#31 | iso_to_standard_reroll.patch | 38.02 KB | Gábor Hojtsy |
#29 | iso-to-standard.patch | 37.88 KB | Gábor Hojtsy |
#17 | iso-to-standard.patch | 37.98 KB | Gábor Hojtsy |
#15 | iso-and-rfc-to-w3c.patch | 2.06 KB | Gábor Hojtsy |
Comments
Comment #1
Gábor Hojtsy
The includes/iso.inc file contains two things: a list of languages and a list of countries. The list of languages is certainly not based on ISO; it uses language tags as defined by the W3C (or locale identifiers if you don't like W3C's terminology). The code comments for country codes do not suggest it is related to ISO country codes, is it?
Given this information, iso.inc is pretty misleading and should be renamed. Given that it has the country list, which IS used on stock Drupal sites by system module, and the language list, which is only used if you have Locale module enabled (but might be used by contrib without locale module, although that sounds very unlikely), we are even looking at a file that has a dual purpose.
One of the possibilities is that we move the language list code under locale module's tree, and the country list stays on as countries.inc. Reasonable? Better ideas?
Comment #2
Gábor Hojtsy
Comment #3
sun
No idea what you're talking about...
The country codes are official ISO 3166-1-alpha-2 codes:
http://www.iso.org/iso/english_country_names_and_code_elements
The language codes are official ISO 639 codes: (though incomplete and partially outdated)
http://www.loc.gov/standards/iso639-2/php/code_list.php
http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
Won't fix for me.
Comment #4
Gábor Hojtsy
@sun: is it just me, or do those language code lists include en-gb, pt-br, zh-hans, xx-lolcat and the other languages we have, and are going to have, with these patterns? A large number of our language codes do match the ISO 639-1 identifiers; however, even the help text on the locale UI has this to say:
<a href="@rfc4646">RFC 4646</a> compliant language identifier. Language codes typically use a country code, and optionally, a script or regional variant name. <em>Examples: "en", "en-US" and "zh-Hant".</em>
So it asks for RFC 4646 language tags (but it does avoid calling them language tags, yeah!), which, now that I'm looking, were actually obsoleted by RFC 5646 language tags 2 years ago (see http://www.w3.org/International/articles/language-tags/). Select ISO 639 language codes are the first part of a language tag, but language tags are obviously much broader and richer. They are also defined by the IETF.
I'm not happy about naming our include files based on standards bodies, but in this case it would be iso-ietf.inc? Not sure that makes sense.
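To make the distinction above concrete, here is a minimal sketch in Python (illustrative only, not Drupal code) of how a simple language tag breaks down into a primary language subtag plus optional script and region subtags. The function name `split_language_tag` and the pattern are assumptions made for this example; the pattern deliberately covers only the `language[-script][-region]` shape and ignores variants, extensions and private-use subtags, so it is not a full BCP 47 parser.

```python
import re

# Simplified pattern (an assumption for illustration, not the full grammar):
# a 2-3 letter primary language subtag, an optional 4-letter script subtag,
# and an optional region subtag (2 letters or 3 digits).
_SIMPLE_TAG = re.compile(
    r"(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?",
    re.IGNORECASE,
)


def split_language_tag(tag):
    """Split a simple language tag into subtags; return None if it does not fit."""
    match = _SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    return match.groupdict()


for example in ("en", "en-US", "zh-Hant", "pt-br"):
    print(example, split_language_tag(example))
```

Only the `language` part of each result corresponds to an ISO 639 style code; the script and region parts are what the plain ISO language lists do not cover.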
Comment #5
Gábor Hojtsy
BTW the fact that we are using language tags in Drupal not language codes (even if we attempt to call them language codes at places) was discussed multiple times in #1216094: We call too many things 'language', clean that up especially deeply by @plach.
Comment #6
sun
As Wikipedia and RFC 5646 explain in detail,
They are based on ISO codes, so I don't see how iso.inc would be "misleading" in any way.
I think what you rather discovered is that our definitions are partially outdated and incomplete, and not up to current standards/recommendations/best practices (in terms of ISO/RFC).
However, that's a separate issue, and a tough one to resolve. While we can easily add new entries, it's going to be close to impossible to change existing language codes/tags. The situation is comparable to text format identifiers, which are equally scattered across the entire system, without having any solid and reliable knowledge about "Where?".
Comment #7
Gábor Hojtsy
Yes, they are *derived* from ISO language standards; obviously making up totally new things was not a good idea. However, they "forked" the ISO standards for use in language tags and do not reference the original ISO standards for meaning, but have their own registry which is *derived* from the original ISO data. From the Wikipedia article:
Per the explanation, language tags inherit from ISO data but are forked off of ISO. The authoritative list for these values is in fact maintained by the IANA (http://www.iana.org/assignments/language-subtag-registry). As explained by the W3C at http://www.w3.org/International/articles/language-tags/#registry:
(Emphasis mine). This IANA registry search tool verifies if the language identifiers we use now match the standard. Some examples:
http://rishida.net/utils/subtags/index.php?check=pt-br&submit=Check
http://rishida.net/utils/subtags/index.php?check=en-gb&submit=Check
http://rishida.net/utils/subtags/index.php?check=zh-hant&submit=Check
http://rishida.net/utils/subtags/index.php?check=xx-lolspeak&submit=Check (xx not in the registry, they do have tlh for Klingon though, ha!)
http://rishida.net/utils/subtags/index.php?check=gsw-berne&submit=Check (gsw in itself is Swiss German, berne unknown)
Looks like we are not doing bad at all there generally.
includes/used-to-be-iso-ietf-iana-w3c.inc? :)
Comment #8
sun
Gábor, honestly, I don't think it's worth pursuing this discussion. In my opinion, this is extreme nitpicking and borderline bikeshedding -- regardless of what other name we come up with, it will 1) always be not 100% correct, 2) much more troublesome to explain and understand, and lastly 3) a totally, completely, and entirely unnecessary API change.
We should update and improve the phpDoc documentation for the file and functions, and we should also make sure to update our definitions according to the latest standards (as much as possible, that is). But I don't think that renaming this file buys us anything that would justify this lengthy discussion, and even less so, a relatively major API change.
Having these definitions separately in iso.inc is beneficial for all use-cases that need them, and is also beneficial for run-time performance.
In the long term (D9?), we should rather investigate whether we can replace our custom definitions with a Not Invented Here™ shared ISO codes library FOSS effort.
Comment #9
Gábor Hojtsy
A little history:
- HTML 4 defined RFC 1766 as reference for language codes (http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.1.1)
- XHTML 1.0 (that Drupal 7 generates) refers to RFC 3066 as reference for language codes, which obsoleted 1766 at the time (http://www.w3.org/TR/xhtml1/)
- HTML 5 (that Drupal 8 will generate) defines BCP 47 (the permanent name for W3C's language tags) as reference for language codes, which as discussed above is at the moment defined by RFC 5646 (obsoleted 4646 which previously obsoleted 3066) (http://dev.w3.org/html5/spec/elements.html#the-lang-and-xml:lang-attributes)
So to put it another way, Drupal 7 is a bit ahead of its time by referring to RFC 4646 (the standard it uses defines RFC 3066 to be used). However, for Drupal 8 to be an HTML 5 supporting CMS, we should follow their lead and use BCP 47 (currently defined by RFC 5646). It is not an ISO thing once again :)
Comment #10
Gábor Hojtsy
Sorry, cross post. I get how you think iso.inc is the best place for a country and language list. Ok. My motivations with this issue are to
(a) bring all that belongs to locale under one roof (I also have other issues that move locale.inc stuff under locale module, fix function names, etc).
(b) clear up the impression that Drupal is an ancient CMS bound to ISO standards that the web moved on from years ago; the W3C says we should think of them as obsolete (http://www.w3.org/WAI/ER/IG/ert/iso639.htm), and only sites like the hardly accurate w3schools site say HTML uses them (http://www.w3schools.com/tags/ref_language_codes.asp)
(c) get rid of magic names like iso.inc, which can mean anything and are not good for DX; ISO defined so many things, they have 18500 standards in their catalog (ref: http://www.iso.org/iso/store.htm)
Comment #11
sun
Well, one issue I have with this is that iso.inc doesn't strictly belong to Locale module. It is also used by Locale module, but there are countless modules in contrib that are using the definitions, and I've written a couple of custom modules for some sites in the past that also relied on iso.inc -- without being related to Locale module.
That's the major issue for me. You seem to look at this from the Locale module perspective, but I understand iso.inc as a "general purpose file containing standards definition lists provided by core, which each and everyone may use to stay consistent throughout Drupal and to avoid duplicating efforts."
That said, in light of the Framework initiative, I do welcome a change to the contained function names to be in the iso_ namespace ('cos they're in iso.inc). In turn, if we can agree on leaving the functions in that file, then that would be the point at which you could make me agree that iso_country_list() and iso_language_list() would be slightly misleading.
Comment #12
Gábor Hojtsy
@sun: yeah, see... The main problem with saying these are ISO codes is that if you search for ISO language codes in relation to the web, all you find is outdated crap. The web moved on. ISO never had standards for combining their codes the way en-gb and pt-br do, and if people want to abide by the standards and use ISO codes for their languages, they will not be compatible with the rest of the web. Google indexing, browser preferences, etc. all use the BCP 47 language tags. To be able to interface with the web, we need to embrace the change that happened a long time ago.
The last mention I found of ISO with regard to HTML languages is at http://www.w3.org/International/O-help-lang.html (before HTML 4, I don't know when the page was published), where they wrote "it's a good idea"; it was not part of a standard yet. By the time they released HTML 4 (1999, http://www.w3.org/TR/html4/types.html#type-langcode) they referred to RFC 1766, which defines a combination of ISO codes (not ISO codes per se) as language codes (http://www.ietf.org/rfc/rfc1766.txt).
12 years passed since the W3C stopped recommending we use ISO language codes and they forked to refer to derivative standards.
Comment #13
sun
Proposal
Comment #14
xmacinfo
I like the way this discussion is turning. I now feel outdated.
Sun's proposal is sound. Renaming to standard.inc makes a lot of sense.
Comment #15
Gábor Hojtsy
Here is a quick patch that just updates the code comments for the predefined language list. I was about to post this to underline the irony that "language codes" as understood by Drupal are defined by "W3C language tags", of which the predefined ones are in "iso.inc" in a function named "_locale_get_predefined_list". Language code, language tag, ISO and locale all in there in concert. Yeah!
Anyway, I like the proposal to rename the file, so will include that in the next edition of the patch as well.
Comment #17
Gábor Hojtsy
Here is one that renames iso.inc to standard.inc and does the suggested renames +
- updates code comments on both the country list and the language list
- removes the useless grouping on the language list (discussed with @jhodgdon and @sun)
- updates the language instructions as well to refer to up to date stuff like code comments in standard.inc
Comment #18
Gábor Hojtsy
Comment #19
sun
Alright. Let's move forward with this as-is. And let's improve and adjust the phpDoc in standard.inc in a separate follow-up patch.
Comment #20
Gábor Hojtsy
Tagging.
Comment #21
sun
Comment #22
DamienMcKenna
Would it be wrong for this to be two separate files, one function per file? Or is it just assumed that if you want one you'll either want the other or not care about the (probably minimal) overhead?
Comment #23
Gábor Hojtsy
Yes, this was the idea as discussed above.
Comment #24
Gábor Hojtsy
Tagging Usability for the simplified explanation of what we consider language codes.
Comment #25
Gábor Hojtsy
The usability portion of this is:
The thinking is that RFC 4646 is about as alien to a user as it can be; leading a description with it most probably turns them off. "W3C language tags" is less jargon. W3C is maybe still a very geeky thing, but less so than RFC 4646 :D The patch also fixes the description claiming that language codes are typically country based, illustrated with the examples "en" and "zh", which are hardly countries. This part of the patch will be a candidate for backport once committed.
Comment #26
yoroy
Quick rewrite:
Tighter first sentence, then directly the examples. Removed "Language codes typically have a language and optionally, a script or regional variant name." as that's a whole lot of words to describe what is easier explained through the examples.
Comment #27
Gábor Hojtsy
Well, yeah, well, unfortunately W3C calls them language *tags*, and part of language *tags* are language *codes*, which are you know followed by script or regional variants optionally :) So we sacrifice accuracy here if we refer to W3C language *codes*, when that is in fact misleading for people who actually try to understand the trail of consequences here. However, if we refer to "language *tags*" (the complete language identifier as named by W3C), then we need a bit of explanation for this, since these things in our terminology are language codes. Alternatively of course, we can spread this and name it language *tag* everywhere in Drupal, but that is easily not as straightforward terminology, is it? BTW this was discussed to hell in #1216094: We call too many things 'language', clean that up (language codes vs. tags vs. identifiers).
Comment #28
yoroy
Ok understood, then either keep the original sentence or avoid 'tags' completely and say
Comment #29
Gábor Hojtsy
Ok, updated patch with exactly that, should be back to RTBC then? :)
Comment #31
Gábor Hojtsy
A little piece of the affected code got moved to modules/locale/locale.bulk.inc in the meantime. Here is a full reroll with the same changes.
Comment #32
yoroy
A good cleanup of the description. Back to rtbc.
Comment #33
johnv
I'd like to draw your attention to module http://drupal.org/project/countries , since it doesn't show in the discussion.
This module takes the countries from iso.inc, and stores them in a table. It also contains a Field 'Country'.
The maintainer Alan D. releases new versions when new countries arise or change names, and these are updated in the table.
Comment #34
Gábor Hojtsy
Tagging for HTML5 given we align our language code support with the spec here.
Comment #35
Dries
Good clean-up. Committed to 8.x.
Comment #36
Dave Reid
git add standard.inc
This broke 8.x :)
Comment #37
webchick
I went ahead and added, committed/pushed standard.inc so as not to stall other patches.
As a side note though, not to be "that guy" again, but standard.inc? According to the docblock, it "Provides a list of countries and languages". Why then is it not called "languages.inc" or "locale-support.inc" or "something-that-is-remotely-descriptive-of-what-it-does.inc"?
The only thing in Drupal called "standard" is the standard profile, and this has nothing to do with it, even though the name makes it look like it does.
Comment #38
webchick
Comment #39
xmacinfo
Webchick just committed the new missing file. :-)
Comment #40
Dave Reid
Agreed with webchick - standard.inc makes no sense at all.
Comment #41
Gábor Hojtsy
@webchick, @Dave Reid: I think you might agree with the unofficial framework initiative plan to name functions/APIs prefixed with the file name (not just the name of the module they are in). So standard.inc functions became standard_*(). My original proposal above was to move the language list to locale and leave countries in includes as countries.inc => countries_*(). Sun argued that (1) the language list might be used outside of locale module and (2) we should have one file with multiple lists of standard stuff put all together, not separate files for individual things like countries and languages; therefore (3) he suggested standard.inc, which nobody has disputed so far.
We cannot really name it languages.inc because (1) it's not only languages and (2) there is also a language.inc with language API functions, and this can quickly get confusing. We cannot really call it locale-support.inc because (1) I think it's as vague as standard.inc; locale has LOTs of functionality, and it has supporting code in bootstrap.inc, languages.inc, locale.inc, etc. and (2) the country list is used by system even if no locale module exists, so it's really not a locale support file.
Any better suggestions for code organization or file naming?
Comment #42
johnv
A suggestion for file naming, since both are part of the 'locale' concept:
- locale_countries.inc
- locale_languages.inc
This makes room for locale_currencies etc.
However, IMO these lists are better served as a table.
Comment #43
Gábor Hojtsy
@johnv: as I've explained above, the country list is definitely not "locale specific" (ATM at least, not in terms of how Drupal has "locale module" == "UI translation"), system module uses it even when locale module is not enabled. It also makes lots of sense to have that without locale module, since for an international shop for example with all English UI, recording country information is still vital. Also, @sun argued above that even the language list should not be confined to locale module, that might be used by other modules (though no direct example in core).
It is clearly an option that we rename locale module wholesale to something else and then we'll free up "locale" as the industry accepted concept of collections of data to use depending on locale, so we could move there then. However, as it is now, "locale" in core means the language list and UI translation and mixing it with other understandings of locale sounds like a mistake.
Comment #44
xmacinfo
There are more things to a truly multi-language/multi-country site than only locale, countries and languages.
There are:
- date formats
- sorting order (transliterated or not, i.e. whether É and E are seen as equal in sorting)
- number decimal
- number thousand separator
- Drupal hardcoded paths (user or "utilisateur")
- languages (for example fr)
- localized languages (fr-CA)
- timezone
- l10n
- i18n
- currency symbol
- metric or imperial measurement units
- first day of week
- calendar (Gregorian?)
- etc.
Some of these things are already taken care of by core or, if not, by contrib modules.
But for most, there are no easy ways to localize all these formats (or no known way yet).
For example, for each country, by default we know which is the currency symbol, the first day of the week, the default country language sort order, etc.
So since there are more things to cover than just locale, countries and languages, I believe that standard.inc is the best name.
Comment #45
webchick
Well, if that's what this eventually gets expanded to include, I guess I would've just chucked these into includes/utility.inc, since I would consider those utility functions.
Anyway, probably not worth harping on it. Maybe we can look at this again closer to code freeze and see if it makes sense given the evolution of the code between now and then.
Comment #46
jhodgdon
Um. This issue is tagged "API change" and it is marked "fixed". But there is no change node. Does one need to be created? I don't want to read the issue and the summary doesn't make it clear...
Comment #47
Gábor Hojtsy
Sorry for that, added http://drupal.org/node/1276626.
Comment #48
Alan D.
@Gábor - totally off topic, but kind of related. What would be the best way to supply user defined country translations? I'm maintaining the countries module that extends the core countries list into the db. I was thinking about using the old t() paradigm:
t('!country-ISO2-name', array('!country-ISO2-name' => check_plain($db['name'])));
eg:
!country-us-name = United States
!country-au-name = Australia
@Everyone else: Sorry for the unrelated noise.
Comment #49
jhodgdon
You should never ever ever ever use t() on anything except hard-coded strings in code. So your suggestion of passing information that is stored in the database and entered by the user into t() is not a good one (assuming I have interpreted correctly that the countries are user-supplied).
Comment #50
Alan D.
@jhodgdon - yes but the iso2 codes are limited to two char a-z only and are (almost) static. ie: "country-ISO2-property identifier".
I've opened this issue #1279774: I18n support (i18n_string integration) to continue the discussion.
Comment #52
Gábor Hojtsy
Tagging for base language system. Lowering priority since the original issue was definitely not critical.
Comment #53
sir_squall
Hi,
Could you please respect RFC 4646 in terms of letter case for language codes?
Instead of defining language codes in all lowercase, it should use uppercase for the region part and title case for the script part.
Example: zh-hans => zh-Hans.
Is there a backported patch for Drupal 7?
Comment #54
Gábor Hojtsy
@sir_squall: RFC 4646 has long been obsoleted by BCP 47. See http://www.w3.org/International/articles/language-tags/ for more information.
(Emphasis from me). I'm not sure there is anything to fix here.
Comment #55
sir_squall
BCP 47 states "Although case distinctions do not carry meaning in language tags, consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. "
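The recommended formatting quoted above can be sketched in a few lines. This is a minimal illustration in Python, not Drupal code, and `format_language_tag` is a hypothetical name. It assumes the convention described in RFC 5646: every subtag is lowercase, except that two-letter subtags become uppercase and four-letter subtags become title case when they are neither the first subtag nor located after a single-character (singleton) subtag. It only changes case and does not check the tag against the IANA registry.

```python
def format_language_tag(tag):
    """Apply the recommended letter case to a language tag (case only, no validation)."""
    parts = tag.lower().split("-")
    formatted = []
    seen_singleton = False
    for index, part in enumerate(parts):
        if len(part) == 1:
            # A singleton such as "x" starts an extension or private-use
            # sequence; everything after it stays lowercase.
            seen_singleton = True
        elif index > 0 and not seen_singleton:
            if len(part) == 2:
                part = part.upper()        # region, e.g. "gb" -> "GB"
            elif len(part) == 4:
                part = part.capitalize()   # script, e.g. "hans" -> "Hans"
        formatted.append(part)
    return "-".join(formatted)


for example in ("zh-hans", "en-gb", "pt-br", "ZH-HANT-TW"):
    print(example, "=>", format_language_tag(example))
```

Because case carries no meaning in language tags, a change like this would affect presentation only; comparisons between tags should still be done case-insensitively.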
Comment #56
Gábor Hojtsy
Feel free to open an issue for this that would let us discuss in detail and link it back here. Thanks!
Comment #57
sir_squall
Ok thanks, if you want to follow the discussion:
http://drupal.org/node/1941732
Comment #57.0
sir_squall
Update to current status.