Problem

  • iso.inc is a misleading file name. ISO has 18500 standards; we only have a few well-defined things here, and not all of them are governed by ISO standards.
  • When HTML first included language support (in HTML 4), it already did not use ISO codes directly but a derivative standard that composed language identifiers from them in different combinations (http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.1.1)
  • HTML 5, which is the target for Drupal 8, defines language codes according to BCP 47, which long ago moved past using ISO codes as its reference

Goal

  • Standardize function names, file name, code and documentation to avoid referring to outdated standards
  • Improve developer experience by using less magic names

Proposed solution

  • Refer to W3C language tags as a user/developer friendly name for current standard language codes in the UI and code.
  • Rename iso.inc to standard.inc to clear up the confusion.
  • Rename functions in the file to use a common naming scheme.

Comments

Gábor Hojtsy’s picture

Issue tags: +D8MI

The includes/iso.inc file contains two things: a list of languages and a list of countries. The list of languages is certainly not based on ISO, it uses language tags as defined by the W3C (or locale identifiers if you don't like W3C's terminology). The code comments for country codes do not suggest it is related to ISO country codes, is it?

Given this information, iso.inc is pretty misleading and should be renamed. Given that it has the country list which IS used on stock Drupal sites by system module and the language list which is only used if you have Locale module enabled (but might be used by contrib without locale module, although that sounds very unlikely), we are even looking at a file that has a dual purpose.

One of the possibilities is that we move the language list code under locale module's tree, and the country list stays on as countries.inc. Reasonable? Better ideas?

Gábor Hojtsy’s picture

Title: Rename iso.inc, because it is misleading » Rename/restructure iso.inc, because it is misleading
sun’s picture

No idea what you're talking about...

The country codes are official ISO 3166-1-alpha-2 codes:
http://www.iso.org/iso/english_country_names_and_code_elements

The language codes are official ISO 639 codes: (though incomplete and partially outdated)
http://www.loc.gov/standards/iso639-2/php/code_list.php
http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Won't fix for me.

Gábor Hojtsy’s picture

@sun: is it just me, or do those language code lists include en-gb, pt-br, zh-hans, xx-lolcat and the other languages we have and are going to have in there with these patterns? A big number of our language codes do match the ISO 639-1 identifiers, however, even the help text on the locale UI has this to say:

<a href="@rfc4646">RFC 4646</a> compliant language identifier. Language codes typically use a country code, and optionally, a script or regional variant name. <em>Examples: "en", "en-US" and "zh-Hant".</em>

So it asks for RFC 4646 language tags (but it does avoid calling them language tags, yeah!), which, now that I'm looking, were actually obsoleted by RFC 5646 language tags 2 years ago (see http://www.w3.org/International/articles/language-tags/). Select ISO 639 language codes are the first part of a language tag, but language tags are obviously much broader and richer. They are also defined by the IETF.
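
To make the "first part of a language tag" point concrete, here is a minimal sketch in PHP that splits a tag into its language, script and region subtags. It is not Drupal code: example_parse_language_tag() is a hypothetical helper, and it only covers the simple language[-script][-region] shape, so tags with variants, extensions or private-use subtags are rejected here even where BCP 47 would allow them. It also checks shape only and does not consult the IANA registry.

```php
<?php

/**
 * Splits a simple language tag into language, script and region subtags.
 *
 * Hypothetical helper for illustration only. It recognizes just the
 * language[-script][-region] shape and does not look anything up in the
 * IANA subtag registry.
 *
 * @param string $tag
 *   A language tag such as "en", "pt-br" or "zh-hans".
 *
 * @return array|false
 *   An array with 'language', 'script' and 'region' keys (NULL when the
 *   subtag is absent), or FALSE if the tag does not have the simple shape.
 */
function example_parse_language_tag($tag) {
  // Case carries no meaning in language tags, so compare in lowercase.
  $tag = strtolower($tag);
  // Language: 2-3 letters. Script: 4 letters. Region: 2 letters or 3 digits.
  $pattern = '/^([a-z]{2,3})(?:-([a-z]{4}))?(?:-([a-z]{2}|[0-9]{3}))?$/D';
  if (!preg_match($pattern, $tag, $matches)) {
    return FALSE;
  }
  return array(
    'language' => $matches[1],
    // Unmatched optional groups are either missing or empty strings.
    'script' => !empty($matches[2]) ? $matches[2] : NULL,
    'region' => !empty($matches[3]) ? $matches[3] : NULL,
  );
}

// example_parse_language_tag('pt-br') returns
// array('language' => 'pt', 'script' => NULL, 'region' => 'br').
```

Only the 'language' element corresponds to an ISO 639 style code; the script and region parts come from other lists, which is why a full tag like "zh-hans" never shows up in an ISO 639 code list.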

I'm not happy about naming our include files based on standards bodies, but in this case it would be iso-ietf.inc? Not sure that makes sense.

Gábor Hojtsy’s picture

BTW the fact that we are using language tags in Drupal not language codes (even if we attempt to call them language codes at places) was discussed multiple times in #1216094: We call too many things 'language', clean that up especially deeply by @plach.

sun’s picture

As Wikipedia and RFC 5646 explain in detail,

Components of language tags are drawn from ISO 639, ISO 15924, ISO 3166-1, and UN M.49.

They are based on ISO codes, so I don't see how iso.inc would be "misleading" in any way.

I think what you rather discovered is that our definitions are partially outdated and incomplete, and not up to current standards/recommendations/best practices (in terms of ISO/RFC).

However, that's a separate issue, and a tough one to resolve. While we can easily add new entries, it's going to be close to impossible to change existing language codes/tags. The situation is comparable to text format identifiers, which are equally scattered across the entire system, without having any solid and reliable knowledge about "Where?".

Gábor Hojtsy’s picture

Yes, they are *derived* from ISO language standards, obviously making up totally new things was not a good idea. However, they "forked" the ISO standards for use in language tags and are not referencing the original ISO standards for meaning but have their own registry which is *derived* from the original ISO data. From the wikipedia article:

Although subtags are often derived from ISO standards, they do not follow these standards absolutely, as this could lead to the meaning of language tags changing over time. In particular, a subtag derived from a code assigned by ISO 639, ISO 15924, ISO 3166 (or UN M.49 only for supranational geographical regions) remains a valid (though deprecated) subtag even if the code is withdrawn from the corresponding ISO standard. If the ISO standard later assigns a new meaning to the withdrawn code, the corresponding subtag will still retain its old meaning. This stability was introduced in the (now obsolete) RFC 4646 (and confirmed in its successor). Before RFC 4646, changes in the meaning of ISO codes could cause changes in the meaning of language tags.

Per the explanation, language tags inherit from ISO data but are forked off of ISO. The authoritative list for these values is in fact maintained by the IANA (http://www.iana.org/assignments/language-subtag-registry). As explained by the W3C at http://www.w3.org/International/articles/language-tags/#registry:

As mentioned above, you used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in one place. The IANA registry looks a little complicated at first, compared to the ISO code lists, but it is easy enough to use once you understand its structure.

(Emphasis mine). This IANA registry search tool verifies if the language identifiers we use now match the standard. Some examples:

http://rishida.net/utils/subtags/index.php?check=pt-br&submit=Check
http://rishida.net/utils/subtags/index.php?check=en-gb&submit=Check
http://rishida.net/utils/subtags/index.php?check=zh-hant&submit=Check
http://rishida.net/utils/subtags/index.php?check=xx-lolspeak&submit=Check (xx not in the registry, they do have tlh for Klingon though, ha!)
http://rishida.net/utils/subtags/index.php?check=gsw-berne&submit=Check (gsw in itself is Swiss German, berne unknown)

Looks like we are not doing bad at all there generally.

includes/used-to-be-iso-ietf-iana-w3c.inc? :)

sun’s picture

Priority: Normal » Minor
Issue tags: +API change

Gábor, honestly, I don't think it's worth pursuing this discussion. In my opinion, this is extreme nitpicking and borderline bikeshedding -- regardless of what other name we come up with, it will 1) always be not 100% correct, 2) much more troublesome to explain and understand, and lastly 3) a totally, completely, and entirely unnecessary API change.

We should update and improve the phpDoc documentation for the file and functions, and we should also make sure to update our definitions according to the latest standards (as much as possible, that is). But I don't think that renaming this file buys us anything that would justify this lengthy discussion, and even less so, a relatively major API change.

Having these definitions separately in iso.inc is beneficial for all use-cases that need them, and is also beneficial for run-time performance.

In the long term (D9?), we should rather investigate whether we can replace our custom definitions with a Not Invented Here™ shared ISO codes library FOSS effort.

Gábor Hojtsy’s picture

Priority: Minor » Normal
Issue tags: -API change

A little history:

- HTML 4 defined RFC 1766 as reference for language codes (http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.1.1)
- XHTML 1.0 (that Drupal 7 generates) refers to RFC 3066 as reference for language codes, which obsoleted 1766 at the time (http://www.w3.org/TR/xhtml1/)
- HTML 5 (that Drupal 8 will generate) defines BCP 47 (the permanent name for W3C's language tags) as reference for language codes, which as discussed above is at the moment defined by RFC 5646 (obsoleted 4646 which previously obsoleted 3066) (http://dev.w3.org/html5/spec/elements.html#the-lang-and-xml:lang-attributes)

So to put it another way, Drupal 7 is a bit ahead of its time by referring to RFC 4646 (the standard it uses defines RFC 3066 to be used). However, for Drupal 8 to be an HTML 5 supporting CMS, we should follow their lead and use BCP 47 (currently defined by RFC 5646). It is not an ISO thing once again :)

Gábor Hojtsy’s picture

Priority: Normal » Minor
Issue tags: +API change

Sorry, cross post. I get how you think iso.inc is the best place for a country and language list. Ok. My motivations with this issue are to

(a) bring all that belongs to locale under one roof (I also have other issues that move locale.inc stuff under locale module, fix function names, etc).

(b) clear up this confusion that Drupal is an ancient CMS that is bound to ISO standards that the web moved over from years ago; the W3C says we should think of them as obsolete (http://www.w3.org/WAI/ER/IG/ert/iso639.htm), only sites like the hardly accurate w3schools site say HTML uses them (http://www.w3schools.com/tags/ref_language_codes.asp)

(c) get rid of magic names like iso.inc, that can be anything and not good for DX; ISO defined so many things, they have 18500 standards in their catalog (ref: http://www.iso.org/iso/store.htm)

sun’s picture

(a) bring all that belongs to locale under one roof (I also have other issues that move locale.inc stuff under locale module, fix function names, etc).

Well, one issue I have with this is that iso.inc doesn't strictly belong to Locale module. It is also used by Locale module, but there are countless modules in contrib that are using the definitions, and I've written a couple of custom modules for some sites in the past that also relied on iso.inc -- without being related to Locale module.

That's the major issue for me. You seem to look at this from the Locale module perspective, but I understand iso.inc as a "general purpose file containing standards definition lists provided by core, which each and everyone may use to stay consistent throughout Drupal and to not duplicate efforts."

That said, in light of the Framework initiative, I do welcome a change to the contained function names to be in the iso_ namespace ('cos they're in iso.inc). In turn, if we can agree on leaving the functions in that file, then that would be the point at which you could make me agree that iso_country_list() and iso_language_list() would be slightly misleading.

Gábor Hojtsy’s picture

@sun: yeah, see... The main problem with saying these are ISO codes is that if you search for ISO language codes in relation to the web, all you find is outdated crap. The web moved on. ISO never had standards for the combination of their codes the way en-gb and pt-br use them, and if people want to abide by the standards and use ISO codes for their languages, they will not be compatible with the rest of the web. Google indexing, browser preferences, etc. all use the BCP 47 language tags. To be able to interface with the web, we need to embrace the change that happened a long time ago.

The last mention I found of ISO with regard to HTML languages is at http://www.w3.org/International/O-help-lang.html (before HTML 4, don't know when the page was published) when they wrote "it's a good idea", it was not part of a standard yet. By the time they released HTML 4 (1999, http://www.w3.org/TR/html4/types.html#type-langcode) they refer to RFC 1766 which defines a combination of ISO codes (not ISO codes per se) as language codes (http://www.ietf.org/rfc/rfc1766.txt).

12 years passed since the W3C stopped recommending we use ISO language codes and they forked to refer to derivative standards.

sun’s picture

Title: Rename/restructure iso.inc, because it is misleading » Rename iso.inc, because it is misleading

Proposal

  1. Rename iso.inc to standard.inc
  2. Rename the contained functions into standard_country_list() and standard_language_list().
xmacinfo’s picture

I like the way this discussion is turning. I now feel outdated.

Sun's proposal is sound. Renaming to standard.inc makes a lot of sense.

Gábor Hojtsy’s picture

Status: Active » Needs work
FileSize
2.06 KB

Here is a quick patch that just updated the code comments for the predefined language list. I was about to post this to underline the irony that "language codes" as understood by Drupal are defined by "W3C language tags" of which the predefined ones are in "iso.inc" in a function named "_locale_get_predefined_list". Language code, language tag, ISO and locale all in there in concert. Yeah!

Anyway, I like the proposal to rename the file, so will include that in the next edition of the patch as well.

Gábor Hojtsy’s picture

Title: Rename iso.inc, because it is misleading » Drupal does not use ISO language codes, iso.inc is misleading
FileSize
37.98 KB

Here is one that renames iso.inc to standard.inc and does the suggested renames +

- updates code comments on both the country list and the language list
- removes the useless grouping on the language list (discussed with @jhodgdon and @sun)
- updates the language instructions as well to refer to up to date stuff like code comments in standard.inc

Gábor Hojtsy’s picture

Status: Needs work » Needs review
sun’s picture

Priority: Minor » Normal
Status: Needs review » Reviewed & tested by the community
Issue tags: +Framework Initiative

Alright. Let's move forward with this as-is. And let's improve and adjust the phpDoc in standard.inc in a separate follow-up patch.

Gábor Hojtsy’s picture

Tagging.

sun’s picture

Issue tags: +API clean-up
DamienMcKenna’s picture

Would it be wrong for this to be two separate files, one function per file? Or is it just assumed that if you want one you'll either want the other or not care about the (probably minimal) overhead?

Gábor Hojtsy’s picture

Yes, this was the idea as discussed above.

Gábor Hojtsy’s picture

Issue tags: +Usability

Tagging Usability for the simplified explanation of what we consider language codes.

Gábor Hojtsy’s picture

The usability portion of this is:

-      '#description' => t('<a href="@rfc4646">RFC 4646</a> compliant language identifier. Language codes typically use a country code, and optionally, a script or regional variant name. <em>Examples: "en", "en-US" and "zh-Hant".</em>', array('@rfc4646' => 'http://www.ietf.org/rfc/rfc4646.txt')),
+      '#description' => t('Use language codes as defined by <a href="@w3ctags">W3C language tags</a> for interoperability. Language codes typically have a language and optionally, a script or regional variant name. <em>Examples: "en", "en-gb" and "zh-hant".</em>', array('@w3ctags' => 'http://www.w3.org/International/articles/language-tags/')),

The thinking is that RFC 4646 is as alien to a user as it can be. Leading a description with that most probably turns them off. W3C language tags is less jargon; W3C is maybe still a very geeky thing, but less so than RFC 4646 :D Also, this fixes the description that language codes are typically country based with examples of "en" and "zh", which are hardly countries. This part of the patch will be a candidate for backport once committed.

yoroy’s picture

Status: Reviewed & tested by the community » Needs review

Quick rewrite:

Use <a href="@w3ctags">W3C language codes</a> for interoperability. <em>Examples: "en", "en-gb" and "zh-hant".</em>

Tighter first sentence, then directly the examples. Removed "Language codes typically have a language and optionally, a script or regional variant name." as that's a whole lot of words to describe what is easier explained through the examples.

Gábor Hojtsy’s picture

Well, yeah, well, unfortunately W3C calls them language *tags*, and part of language *tags* are language *codes*, which are you know followed by script or regional variants optionally :) So we sacrifice accuracy here if we refer to W3C language *codes*, when that is in fact misleading for people who actually try to understand the trail of consequences here. However, if we refer to "language *tags*" (the complete language identifier as named by W3C), then we need a bit of explanation for this, since these things in our terminology are language codes. Alternatively of course, we can spread this and name it language *tag* everywhere in Drupal, but that is easily not as straightforward terminology, is it? BTW this was discussed to hell in #1216094: We call too many things 'language', clean that up (language codes vs. tags vs. identifiers).

yoroy’s picture

Ok understood, then either keep the original sentence or avoid 'tags' completely and say

Use language codes as <a href="@w3ctags">defined by the W3C</a> for interoperability.
Gábor Hojtsy’s picture

FileSize
37.88 KB

Ok, updated patch with exactly that, should be back to RTBC then? :)

Status: Needs review » Needs work

The last submitted patch, iso-to-standard.patch, failed testing.

Gábor Hojtsy’s picture

Status: Needs work » Needs review
FileSize
38.02 KB

A little piece of the affected code got moved to modules/locale/locale.bulk.inc in the meantime. Here is a full reroll with the same changes.

yoroy’s picture

Status: Needs review » Reviewed & tested by the community

A good cleanup of the description. Back to rtbc.

johnv’s picture

I'd like to draw your attention to module http://drupal.org/project/countries , since it doesn't show in the discussion.
This module takes the countries from iso.inc, and stores them in a table. It also contains a Field 'Country'.
The maintainer Alan D. releases new versions when new countries arise or change names, and these are updated in the table.

Gábor Hojtsy’s picture

Issue tags: +html5

Tagging for HTML5 given we align our language code support with the spec here.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Good clean-up. Committed to 8.x.

Dave Reid’s picture

Priority: Normal » Critical
Status: Fixed » Needs review

git add standard.inc

This broke 8.x :)

webchick’s picture

I went ahead and added, committed/pushed standard.inc so as not to stall other patches.

As a side note though, not to be "that guy" again, but standard.inc? According to the docblock, it "Provides a list of countries and languages". Why then is it not called "languages.inc" or "locale-support.inc" or "something-that-is-remotely-descriptive-of-what-it-does.inc"?

The only thing in Drupal called "standard" is the standard profile, and this has nothing to do with it, even though the name makes it look like it does.

webchick’s picture

Priority: Critical » Normal
xmacinfo’s picture

Priority: Normal » Critical
Status: Needs review » Fixed

Webchick just committed the new missing file. :-)

Dave Reid’s picture

Agreed with webchick - standard.inc makes no sense at all.

Gábor Hojtsy’s picture

@webchick, @Dave Reid: I think you might agree with the unofficial framework initiative plan to name functions/APIs prefixed with the file name (not just module name they are in). So standard.inc functions became standard_*(). My original proposal above was to move the language list to locale and leave countries in includes as countries.inc => countries_*(). Sun argued that (1) the language list might be used outside of locale module (2) we should have a file with multiple lists of standard stuff put all together, not for individual things like countries and languages therefore (3) he suggested standard.inc which nobody disputed so far.

We cannot really name it languages.inc because (1) it's not only languages (2) there is also a language.inc with language API functions and this can quickly get confusing. We cannot really call it locale-support.inc because (1) I think it's as vague as standard.inc, locale has LOTS of functionality, and it has supporting code in bootstrap.inc, languages.inc, locale.inc, etc. (2) the country list is used by system even if no locale module exists, so it's really not a locale support file.

Any better suggestions for code organization or file naming?

johnv’s picture

A suggestion for filenaming, since both are part of the 'locale' concept:
- locale_countries.inc
- locale_languages.inc
This makes room for locale_currencies etc.
However, IMO these lists are better served as a table.

Gábor Hojtsy’s picture

@johnv: as I've explained above, the country list is definitely not "locale specific" (ATM at least, not in terms of how Drupal has "locale module" == "UI translation"), system module uses it even when locale module is not enabled. It also makes lots of sense to have that without locale module, since for an international shop for example with all English UI, recording country information is still vital. Also, @sun argued above that even the language list should not be confined to locale module, that might be used by other modules (though no direct example in core).

It is clearly an option that we rename locale module wholesale to something else and then we'll free up "locale" as the industry accepted concept of collections of data to use depending on locale, so we could move there then. However, as it is now, "locale" in core means the language list and UI translation and mixing it with other understandings of locale sounds like a mistake.

xmacinfo’s picture

There are more things to a truly multi-language/multi-country site than only locale, countries and languages.

There are:
— date formats
— sorting order (transliterated or not: É and E seen as equal in sorting or not)
— number decimal
— number thousand separator
— drupal hardcoded path (user or “utilisateur”)
— languages (for example fr)
— localized languages (fr-CA)
— timezone
— l10n
— i18n
— currency symbol
— metric or imperial measurement units
— first day of week
— calendar (gregorian?)
— etc.

Some of these things are already taken care of by core, or if not, by contribs.

But for most, there are no easy ways to localize all these formats (or no known way yet).

For example, for each country, by default we know which is the currency symbol, the first day of the week, the default country language sort order, etc.

So since there are more things to cover than just locale, countries and languages, I believe that standard.inc is the best name.

webchick’s picture

Well, if that's what this eventually gets expanded to include, I guess I would've just chucked these into includes/utility.inc, since I would consider those utility functions.

Anyway, probably not worth harping on it. Maybe we can look at this again closer to code freeze and see if it makes sense given the evolution of the code between now and then.

jhodgdon’s picture

Um. This issue is tagged "API change" and it is marked "fixed". But there is no change node. Does one need to be created? I don't want to read the issue and the summary doesn't make it clear...

Gábor Hojtsy’s picture

Sorry for that, added http://drupal.org/node/1276626.

Alan D.’s picture

@Gábor - totally off topic, but kind of related. What would be the best way to supply user defined country translations? I'm maintaining the countries module that extends the core countries list into the db. I was thinking about using the old t() paradigm:

t('!country-ISO2-name', array('!country-ISO2-name' => check_plain($db['name'])));

eg:

!country-us-name = United States
!country-au-name = Australia

@Everyone else: Sorry for the unrelated noise.

jhodgdon’s picture

You should never ever ever ever use t() on anything except hard-coded strings in code. So your suggestion of passing information that is stored in the database and entered by the user into t() is not a good one (assuming I have interpreted correctly that the countries are user-supplied).

Alan D.’s picture

@jhodgdon - yes but the iso2 codes are limited to two char a-z only and are (almost) static. ie: "country-ISO2-property identifier".

I've opened this issue #1279774: I18n support (i18n_string integration) to continue the discussion.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Gábor Hojtsy’s picture

Priority: Critical » Normal
Issue tags: -API change +language-base

Tagging for base language system. Lowering priority since the original issue was definitely not critical.

sir_squall’s picture

Hi,

Could you please respect RFC 4646 in terms of letter case for language codes?
Instead of defining language codes in all lowercase, the script part should be capitalized and the region part should be uppercase.
Example: zh-hans => zh-Hans.

Is there any patch backport for Drupal 7 ?

Gábor Hojtsy’s picture

@sir_squall: RFC 4646 has long been obsoleted by BCP 47. See http://www.w3.org/International/articles/language-tags/ for more information.

The entries in the registry follow certain conventions with regard to upper and lower letter-casing. For example, language tags are lower case, alphabetic region subtags are upper case, and script tags begin with an initial capital. This is only a convention! When you use these subtags you are free to do as you like, unless you are constrained by the rules of the system you are working with. For HTML and XML language markup, the case should not matter.

(Emphasis from me). I'm not sure there is anything to fix here.

sir_squall’s picture

BCP 47 states "Although case distinctions do not carry meaning in language tags, consistent formatting and presentation of language tags will aid users. The format of subtags in the registry is RECOMMENDED as the form to use in language tags. "
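
The recommended formatting quoted above can be sketched in a few lines of PHP. This is only an illustration, assuming the rule in RFC 5646 section 2.1.1: everything is lowercase, except that two-letter subtags become uppercase and four-letter subtags become title case when they are neither the first subtag nor placed after a singleton. example_format_language_tag() is a hypothetical name, not an existing Drupal function, and it does not validate the tag.

```php
<?php

/**
 * Applies the casing convention recommended for language tags.
 *
 * Hypothetical helper for illustration only. It changes letter case and
 * nothing else: the tag is not validated against the IANA registry.
 *
 * @param string $tag
 *   A language tag in any letter case, e.g. "zh-hans" or "EN-gb".
 *
 * @return string
 *   The same tag with the recommended casing, e.g. "zh-Hans" or "en-GB".
 */
function example_format_language_tag($tag) {
  $subtags = explode('-', strtolower($tag));
  foreach ($subtags as $index => $subtag) {
    if ($index === 0) {
      // The primary language subtag stays lowercase.
      continue;
    }
    if (strlen($subtag) === 1) {
      // A singleton starts an extension or private-use sequence;
      // everything from here on stays lowercase.
      break;
    }
    if (strlen($subtag) === 2) {
      // Two-letter subtags in this position are regions.
      $subtags[$index] = strtoupper($subtag);
    }
    elseif (strlen($subtag) === 4) {
      // Four-letter subtags in this position are scripts. ucfirst() leaves
      // a leading digit (four-character variants) untouched.
      $subtags[$index] = ucfirst($subtag);
    }
  }
  return implode('-', $subtags);
}

// example_format_language_tag('zh-hans') returns 'zh-Hans'.
// example_format_language_tag('EN-gb') returns 'en-GB'.
```

Because case carries no meaning in a tag, a conversion like this could be applied for display only while stored codes stay lowercase; whether that is desirable is exactly the open question here.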

Gábor Hojtsy’s picture

Feel free to open an issue for this that would let us discuss in detail and link it back here. Thanks!

sir_squall’s picture

Ok thanks, if you want to follow the discussion:

http://drupal.org/node/1941732

sir_squall’s picture

Issue summary: View changes

Update to current status.