Updated: Comment #50

Problem/Motivation

php ucwords() and lcfirst() can corrupt utf8 text under non-utf8 locales.

Drupal has incomplete Unicode support relative to full set of PHP core string manipulation functions as it has no equivalent for ucwords() or lcfirst().

Proposed resolution

Add Unicode::ucwords() and Unicode::lcfirst() to Drupal\Component\Utility\Unicode as direct replacements for native PHP ucwords() and lcfirst().

Remaining tasks

- RTBC patch
- Backport to D7

User interface changes

None.

API changes

2 minor API additions: Unicode::ucwords() and Unicode::lcfirst()

Original report by @msameer

php ucwords() can corrupt utf8 text under non-utf8 locales.

I've implemented a drupal_ucwords() using the drupal mbstring emulation layer

function drupal_ucwords($text)
{
$upper_words = array();
$words = explode(" ", $text);

foreach ($words as $word) {
$upper_words[] = drupal_ucfirst($word);
}

return implode(" ", $upper_words);
}
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

msameer’s picture

Version: 4.7.2 » x.y.z

oops, meant CVS

beginner’s picture

I see that simplenews.module uses ucwords, but not drupal core.
Will this function actually be used?

msameer’s picture

Yes I think it will be used. I had an issue with the article module corrupting utf8 text and I had to remove the call to ucwords.

webchick’s picture

Makes sense to me. Although drupal_ucfirst just calls drupal_strtoupper ... would it make sense to call drupal_strtoupper directly and avoid another function call?

@beginner, in addition to core functionality, Drupal is also supplying an API for developers. So I think it's appropriate, even if there's nothing directly in core that uses it.

msameer’s picture

drupal_strtoupper() turns all the string to uppercase. That's why I can't use it
I can avoid calling drupal_ucfirst() and call drupal_strtoupper() directly. But In that case we will have duplicate code.

webchick’s picture

Duh. I shouldn't review patches at 9am on a Sunday. :P Nevermind. ;)

LAsan’s picture

Version: x.y.z » 7.x-dev

Does it applies to current version?

neclimdul’s picture

Status: Active » Needs review
FileSize
2.07 KB

hm.. this seems like a glaring utf8 support omission along with drupal_lcfirst(). What are our chances of getting it in I wonder...

Attached is the code from original poster in a patch. I had concerns that is would cause it will change the spacing of a string with multiple spaces which would be incorrect. It seems like it works though. Test included.

chx’s picture

Status: Needs review » Needs work

use PREG_CLASS_UNICODE_WORD_BOUNDARY instead of a space.

neclimdul’s picture

Status: Needs work » Needs review
FileSize
2.31 KB

With some help from chx we have a significantly better patch.

neclimdul’s picture

I should give chx more credit... most of the patch was really written by him with code snippets in IRC :-D

chx’s picture

Some help? I wrote the patch in IRC for you :)

aspilicious’s picture

+/**
+ * Uppercase matched strings.
+ * 
+ * @see drupal_ucwords().
+ * @param $matches
+ *   An array of matched characters.
+ */

This needs some love...

- "Uppercase matched string" is not rly a description of what the function does.
It also has to start with a 3th person verb
- Second of all: the @see part needs to be below the @param block (sepperated by a newline)

 /**
+ * Capitalize the first letter of each word in a UTF-8 string.
+ *

- Capitalize has to be capitalizes, already explained why

Question: why don't we need an @return statement?

neclimdul’s picture

FileSize
3.29 KB

Yeah... I guess I can be a lazy dev too. Too much copy and pasting without reviewing documentation... *tsk*tsk drupal_ucfirst btw...

Anyways, updated docs with suggestions and extra descriptions of what's going on. Actually running the test function I wrote too... *cough*

aspilicious’s picture

 first chacter of a word.

I think this is wrong. The only other chacter I could find on the internet is a "totally spice chacter" ;). [1]

Sources
------
[1] http://www.proprofs.com/quiz-school/story.php?title=what-totally-spies-c...

neclimdul’s picture

FileSize
3.29 KB

That's totally what I meant and I'm like totally Sam. Wait? What?

aspilicious’s picture

Ok looks fine to me, search for an other reviewer on irc and this is rtbc.

kscheirer’s picture

Status: Needs review » Needs work

All tests passed, applies without offset.

I don't usually say this, but there's too much documentation :) _drupal_ucwords_callback() is a one-line private function with no use case besides helping drupal_ucwords() - the one line explanation is sufficient.

If we do keep the long paragraph explanation, "Because of the way our regular expression is build" should be "Because of the way our regular expression is built".

neclimdul’s picture

Status: Needs work » Needs review
FileSize
3.11 KB

Thanks for the review, based on similar callbacks(_decode_entities for example) some explanation of what we're getting from the regular expression is reasonable. I've condensed this down to the param and return documentation though.

neclimdul’s picture

FileSize
3.11 KB

Thanks for the review, based on similar callbacks(_decode_entities for example) some explanation of what we're getting from the regular expression is reasonable. I've condensed this down to the param and return documentation though.

kscheirer’s picture

Status: Needs review » Reviewed & tested by the community

Looks good to me. We're at over 4 years for this patch, I think it's ready.

Dries’s picture

Version: 7.x-dev » 8.x-dev
TR’s picture

Status: Reviewed & tested by the community » Needs review

I have to ask, is this still a problem? The original post was almost 5 years ago. Since then, PHP has undergone many changes, and so has Drupal. We shouldn't take it for granted that this PHP function is still broken in PHP 5.2, which is the minimum required for D7. This thread has been all about the implementation, and no one's gone back and checked the initial assumption or provided an example that fails. I'm not fond of re-implementing PHP functions in Drupal, so I don't think this should go in unless there's verification that the bug still exists. ucwords() is not used in Drupal core. It is used in a few contributions, including Views (views/handlers/views_handler_argument_string.inc and views/includes/admin.inc). I found one issue in the Views queue describing a similar problem three years ago which was fixed in Views in a different manner #327604: case_transform() should use multibyte string functions.

chx’s picture

Status: Needs review » Reviewed & tested by the community

I will not trust any PHP's internal string functions. PCRE is good but PHP strings re UTF8? Gimme a break. On more concrete grounds, ucwords is not UTF-8 aware so it will never uppercase á to Á because it lacks this table.

TR’s picture

Let's test it against D8 head then.

#20: 70719-drupal_ucwords.patch queued for re-testing.

catch’s picture

This is something we could backport as an API addition if we wanted to. Tagging as such and cross-linking #1221904: RFC: forwards compatibility in D7 / 'backports' defgroup.

klausi’s picture

Status: Reviewed & tested by the community » Needs work

this can be tested equally with DrupalUnitTest, no need to use a fully blown DrupalWebTestCase here.

pfrenssen’s picture

Wouldn't it be a good idea to expand the scope of this issue to provide drupal_lcfirst() as well?

superspring’s picture

Status: Needs work » Needs review
FileSize
3.59 KB

This is a D8 version of the patch. Includes replacing ucwords with drupal_ucwords.

Lars Toomre’s picture

+++ b/core/includes/unicode.incundefined
@@ -462,6 +462,36 @@ function drupal_ucfirst($text) {
+function _drupal_ucwords_callback($matches) {
+  return $matches[1] . drupal_strtoupper($matches[2]);

Can this be changed to '(array $matches)'?

+++ b/core/modules/system/lib/Drupal/system/Tests/System/UnicodeUnitTest.phpundefined
@@ -118,6 +120,30 @@ function helperTestUcFirst() {
+      $this->assertEqual(drupal_ucwords($input), $output, t('%input is ucword-ed as %output', array(
+        '%input' => $input,
+        '%output' => $output,

There is no need to translate test assertion messages. This should be changed to format_string().

superspring’s picture

Same patch as above with Lars Toomre's review.

kscheirer’s picture

Ok at 6.5 years we have to finally be getting close... :) Patch works as advertised for me.

@pfrenssen: I was gonna say we don't need it, I've never had a use for lcfirst. But searching through core reveals:

core/vendor/doctrine/common/lib/Doctrine/Common/Persistence/PersistentObject.php:        $field = lcfirst(substr($method, 3));
core/vendor/doctrine/common/lib/Doctrine/Common/Util/Inflector.php:        return lcfirst(self::classify($word));
core/vendor/symfony/serializer/Symfony/Component/Serializer/Normalizer/GetSetMethodNormalizer.php:                $attributeName = lcfirst(substr($method->name, 3));
core/vendor/symfony/serializer/Symfony/Component/Serializer/Normalizer/GetSetMethodNormalizer.php:                $paramName = lcfirst($constructorParameter->name);

I think that's code we adopted from Symfony, do we need to update those instances? Leaving as Needs review for this question, the actual patch is RTBC from me.

kscheirer’s picture

Status: Needs review » Needs work

Argh, we have many more instances of ucwords() in core...

Karls-MacBook-Pro:drupal-8 karl$ sudo grep -rbs ucwords *
core/modules/views/lib/Drupal/views/Plugin/views/argument/String.php:2838:        'ucwords' => t('Capitalize each word'),
core/modules/views/lib/Drupal/views/Plugin/views/argument/String.php:3443:        'ucwords' => t('Capitalize each word'),
core/modules/views/lib/Drupal/views/Plugin/views/field/FieldPluginBase.php:26651:          'ucwords' => t('Capitalize each word'),
core/modules/views/lib/Drupal/views/Plugin/views/HandlerBase.php:6620:   *      - ucwords: Make each word in the string uppercase.
core/modules/views/lib/Drupal/views/Plugin/views/HandlerBase.php:7125:      case 'ucwords':
core/modules/views/lib/Drupal/views/Plugin/views/HandlerBase.php:7277:          return ucwords($string);
core/modules/views/lib/Drupal/views/Tests/Handler/FieldTest.php:11581:    // Switch to ucfirst and ucwords.
core/modules/views/lib/Drupal/views/Tests/Handler/FieldTest.php:11879:    $id_field->options['alter']['path_case'] = 'ucwords';
core/vendor/doctrine/common/lib/Doctrine/Common/Util/Inflector.php:2265:        return str_replace(" ", "", ucwords(strtr($word, "_-", "  ")));
core/vendor/twig/twig/lib/Twig/Extension/Core.php:32649:        return ucwords(strtolower($string));
core/vendor/twig/twig/lib/Twig/Extension/Core.php:34027:        return ucwords(strtolower($string));

We also seem to be making a lot of use of ucfirst() in core - should all those be changed over to drupal_ucfirst() ?

superspring’s picture

Status: Needs work » Needs review

Hey @kscheirer,

Out of your grep list the only important one I saw there was in HandlerBase.php, the others were comments, select options and vender specific code. HandlerBase.php is in the patch above.

Is this your only objection to the patch?

superspring’s picture

After review with @chx, now using closures.

chx’s picture

Even if there are more , we can always do more in a followup. This is good to go but given that practically I wrote it some years ago I can't really RTBC it.

kscheirer’s picture

Status: Needs review » Reviewed & tested by the community

Yeah, that's my only objection. The patch itself is RTBC if we don't need to worry about FieldTest.php:11879. I guess we can't do anything about the vendor code.
Update: just saw the latest patch takes care of FieldTest as well, all good!

Status: Reviewed & tested by the community » Needs work

The last submitted patch, drupal_ucwords-70719-35.patch, failed testing.

kscheirer’s picture

I take back what I said, I was very wrong. These 2 pieces are Views module dealing with paths, which I now think we should leave alone. The purpose of drupal_ucwords() is to add to our Developer API when dealing with text strings, not start changing generated urls.

As webchick stated in #4:

Drupal is also supplying an API for developers. So I think it's appropriate, even if there's nothing directly in core that uses it.

and neclimdul in #8:

this seems like a glaring utf8 support omission along with drupal_lcfirst()

Let's add drupal_lcfirst() and a test and call it done!

+++ b/core/modules/system/lib/Drupal/system/Tests/System/UnicodeUnitTest.phpundefined
index 0b897ea..50a508d 100644
--- a/core/modules/views/lib/Drupal/views/Plugin/views/HandlerBase.php

This hunk should be removed from the patch.

+++ b/core/modules/views/lib/Drupal/views/Plugin/views/HandlerBase.phpundefined
index 967eae8..122dc4a 100644
--- a/core/modules/views/lib/Drupal/views/Tests/Handler/FieldTest.php

This hunk should be removed from the patch.

superspring’s picture

Status: Needs work » Needs review
FileSize
3.2 KB

Same patch as above with test fixes.

kscheirer’s picture

Title: drupal_ucwords() implementation. » Add drupal_ucwords() and drupal_lcfirst() implementation and make use of them in core

This patch builds on the previous, with the following changes:

  • Added the $multibyte check and mb_convert_case() to drupal_ucwords() so it matches our other php string wrappers.
  • Added drupal_lcfirst() to improve our developer API.
  • Addes unit test for drupal_lcfirst().
  • Cleaned up views' caseTransform(), no need for mb check, and made use of drupal_ucfirst() as well.
  • Caught instance of ucfirst() in format_backtrace(). This change might not be needed, but it seems reasonable.
  • Found 2 more instances of ufcirst() in Field's buildGroupByForm().
kscheirer’s picture

aaaand a patch!

Status: Needs review » Needs work

The last submitted patch, drupal-fix-up-mb-string-usage-70719-42.patch, failed testing.

kscheirer’s picture

Status: Needs work » Needs review
FileSize
6.32 KB

removed the call to mb_convert_case(), I was never able to get that to correctly convert 3 of our tests.

pfrenssen’s picture

Status: Needs review » Needs work
Issue tags: +Needs reroll

This needs to be rerolled. For D8 this should be reworked as well, as we now have \Drupal\Component\Utility\Unicode this should be placed in Unicode::ucwords() and Unicode::lcfirst() instead of drupal_ucwords() and drupal_lcfirst().

InternetDevels’s picture

Issue summary: View changes
Status: Needs work » Needs review
Issue tags: -Needs reroll
FileSize
6.83 KB

Rerolled.

Status: Needs review » Needs work

The last submitted patch, 46: drupal-fix-up-mb-string-usage-70719-46.patch, failed testing.

neclimdul’s picture

FileSize
6.83 KB
820 bytes

anonymous function can't use static sadly.

neclimdul’s picture

Status: Needs work » Needs review

Really don't like this new form... think I need to go back to only commenting until its fixed. ;)

thedavidmeister’s picture

Issue summary: View changes
Status: Needs review » Reviewed & tested by the community

The patch seems fine to me and it still applies, setting this to RTBC but I needed to update the issue summary as it was pretty old.

#2094585: [policy, no patch] Core review bonus for #2093161: Remove all calls to drupal_json_encode() in favour of \Drupal\Component\Utility\Json::encode().

thedavidmeister’s picture

Title: Add drupal_ucwords() and drupal_lcfirst() implementation and make use of them in core » Add Unicode::ucwords() and Unicode::lcfirst() implementation and make use of them in core

updated issue title.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs change record

We need a draft change record for this change as per the new policy - Draft change records are now required before commit

thedavidmeister’s picture

Status: Needs work » Reviewed & tested by the community
Issue tags: -Needs change record

@alexpott - I'm new to this policy. It would help if there was a field that references the change notice like the existing "related issues" field.

Anyway, I created a draft at https://drupal.org/node/2196751 so setting back to RTBC.

xjm’s picture

@thedavidmeister, one change record can already reference many issues, so you can add a reference to this issue on any existing change record and that change record will then show up in the issue sidebar.

You can search existing change records at:
https://drupal.org/list-changes/drupal

I searched for "unicode" and found this:
https://drupal.org/node/1992584
So we can edit that one to add a reference to this issue. :)

xjm’s picture

FileSize
58.89 KB

@alexpott apparently didn't know where the sidebar was, so here is screenshot:

alexpott’s picture

Category: Feature request » Task
Status: Reviewed & tested by the community » Needs work
  1. +++ b/core/modules/field/lib/Drupal/field/Plugin/views/field/Field.php
    @@ -8,6 +8,7 @@
    +use Drupal\Component\Utility\Unicode;
    
    @@ -602,7 +603,7 @@ public function buildGroupByForm(&$form, &$form_state) {
    -    ) + MapArray::copyValuesToKeys(array_keys($this->field_info->getColumns()), 'ucfirst');
    +    ) + MapArray::copyValuesToKeys(array_keys($this->field_info->getColumns()), 'Unicode::ucfirst');
    
    @@ -612,7 +613,7 @@ public function buildGroupByForm(&$form, &$form_state) {
    -    $options = MapArray::copyValuesToKeys(array('bundle', 'language', 'entity_type'), 'ucfirst');
    +    $options = MapArray::copyValuesToKeys(array('bundle', 'language', 'entity_type'), 'Unicode::ucfirst');
    

    The should be the fully namespaced string. It only works because Unicode and MapArray are in the same namespace. So we can remove the use statement. And replace 'Unicode... with '\Drupal\Component\Utility\Unicode...

  2. +++ b/core/lib/Drupal/Component/Utility/Unicode.php
    @@ -315,6 +315,40 @@ public static function ucfirst($text) {
    +   * Lowercase the first letter of a UTF-8 string.
    

    Should a something like "Lowercases"

  3. +++ b/core/lib/Drupal/Core/Utility/Error.php
    @@ -9,6 +9,7 @@
    +use Drupal\Component\Utility\Unicode;
    
    @@ -164,7 +165,7 @@ public static function formatBacktrace(array $backtrace) {
    -            $call['args'][] = ucfirst(gettype($arg));
    +            $call['args'][] = Unicode::ucfirst(gettype($arg));
    

    This can not possibly by a unicode string as far as I know.

  4. +++ b/core/modules/field/lib/Drupal/field/Plugin/views/field/Field.php
    @@ -8,6 +8,7 @@
    +use Drupal\Component\Utility\Unicode;
    
    @@ -602,7 +603,7 @@ public function buildGroupByForm(&$form, &$form_state) {
    -    ) + MapArray::copyValuesToKeys(array_keys($this->field_info->getColumns()), 'ucfirst');
    +    ) + MapArray::copyValuesToKeys(array_keys($this->field_info->getColumns()), 'Unicode::ucfirst');
    
    @@ -612,7 +613,7 @@ public function buildGroupByForm(&$form, &$form_state) {
    -    $options = MapArray::copyValuesToKeys(array('bundle', 'language', 'entity_type'), 'ucfirst');
    +    $options = MapArray::copyValuesToKeys(array('bundle', 'language', 'entity_type'), 'Unicode::ucfirst');
    

    Unicode::ucfirst() is not being introduced by this patch so I think this is out of scope. Additonally, can column names be utf - I don't think so.

neclimdul’s picture

Status: Needs work » Needs review
FileSize
3.12 KB
5.63 KB

56.1 + 56.4) Yeah, they didn't use the Unicode because the strings are static and clearly not Unicode so we can take that out.
56.2) ok
56.3) Yeah until PHP 6 that shouldn't be a thing we need to worry about.

ParisLiakos’s picture

  1. +++ b/core/lib/Drupal/Component/Utility/Unicode.php
    @@ -280,7 +280,7 @@ public static function strtoupper($text) {
    -   * Lowercase a UTF-8 string.
    

    Irrelevant change, but i guess we can keep it

  2. +++ b/core/lib/Drupal/Component/Utility/Unicode.php
    @@ -315,6 +315,40 @@ public static function ucfirst($text) {
    +   * Lowercase the first letter of a UTF-8 string.
    

    this should be the one that changes to "Lowercases" which is introduced here

  3. +++ b/core/lib/Drupal/Core/Utility/Error.php
    @@ -125,7 +125,7 @@ public static function getLastCaller(array &$backtrace) {
    -    // The second call gives us the function where the call originated.
    +    // The second call gi\Drupal\Component\Utility\Unicode::ves us the function where the call originated.
    

    by accident, should be reverted

longwave’s picture

FileSize
5.11 KB
1.43 KB

Fixed #58.2 and #58.3. If we are to change "Lowercase" to "Lowercases" here, we should also change "Uppercase" to "Uppercases" at the same time, so I made that change as well, even if it's slightly out of scope.

TR’s picture

"Lowercase" and "Uppercase" are adjectives and should not be used as verbs. Likewise, UTF-8 strings are composed of characters, not letters.

Instead of "Lowercases the first letter of a UTF-8 string.", I would write "Converts the first character of a UTF-8 string to lowercase."

HeikeT’s picture

Issue tags: +#Drupal8nz
longwave’s picture

FileSize
5.61 KB
1.86 KB

Made changes suggested in #60

hdmaestro’s picture

i have same issue. now fixed.

yemek tarifleri

ParisLiakos’s picture

Status: Needs review » Reviewed & tested by the community

thanks, looks great

Status: Reviewed & tested by the community » Needs work

The last submitted patch, 62: 70719-unicode-62.patch, failed testing.

longwave’s picture

ianthomas_uk’s picture

62: 70719-unicode-62.patch queued for re-testing.

longwave’s picture

Status: Needs work » Reviewed & tested by the community

Back to RTBC

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

Committed f130cc3 and pushed to 8.x. Thanks!

  • Commit f130cc3 on 8.x by alexpott:
    Issue #70719 by neclimdul, superspring, longwave, kscheirer,...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.