Updated: Comment #50
Problem/Motivation
php ucwords() and lcfirst() can corrupt utf8 text under non-utf8 locales.
Drupal has incomplete Unicode support relative to full set of PHP core string manipulation functions as it has no equivalent for ucwords() or lcfirst().
Proposed resolution
Add Unicode::ucwords() and Unicode::lcfirst() to Drupal\Component\Utility\Unicode as direct replacements for native PHP ucwords() and lcfirst().
Remaining tasks
- RTBC patch
- Backport to D7
User interface changes
None.
API changes
2 minor API additions: Unicode::ucwords() and Unicode::lcfirst()
Original report by @msameer
php ucwords() can corrupt utf8 text under non-utf8 locales.
I've implemented a drupal_ucwords() using the drupal mbstring emulation layer
function drupal_ucwords($text)
{
$upper_words = array();
$words = explode(" ", $text);
foreach ($words as $word) {
$upper_words[] = drupal_ucfirst($word);
}
return implode(" ", $upper_words);
}
Comment | File | Size | Author |
---|---|---|---|
#62 | interdiff-70719.txt | 1.86 KB | longwave |
#62 | 70719-unicode-62.patch | 5.61 KB | longwave |
#48 | interdiff.txt | 820 bytes | neclimdul |
#48 | drupal-fix-up-mb-string-usage-70719-48.patch | 6.83 KB | neclimdul |
#31 | drupal_ucwords-70719-31.patch | 3.61 KB | superspring |
Comments
Comment #1
msameer CreditAttribution: msameer commentedoops, meant CVS
Comment #2
beginner CreditAttribution: beginner commentedI see that simplenews.module uses ucwords, but not drupal core.
Will this function actually be used?
Comment #3
msameer CreditAttribution: msameer commentedYes I think it will be used. I had an issue with the article module corrupting utf8 text and I had to remove the call to ucwords.
Comment #4
webchickMakes sense to me. Although drupal_ucfirst just calls drupal_strtoupper ... would it make sense to call drupal_strtoupper directly and avoid another function call?
@beginner, in addition to core functionality, Drupal is also supplying an API for developers. So I think it's appropriate, even if there's nothing directly in core that uses it.
Comment #5
msameer CreditAttribution: msameer commenteddrupal_strtoupper() turns all the string to uppercase. That's why I can't use it
I can avoid calling drupal_ucfirst() and call drupal_strtoupper() directly. But In that case we will have duplicate code.
Comment #6
webchickDuh. I shouldn't review patches at 9am on a Sunday. :P Nevermind. ;)
Comment #7
LAsan CreditAttribution: LAsan commentedDoes it applies to current version?
Comment #8
neclimdulhm.. this seems like a glaring utf8 support omission along with drupal_lcfirst(). What are our chances of getting it in I wonder...
Attached is the code from original poster in a patch. I had concerns that is would cause it will change the spacing of a string with multiple spaces which would be incorrect. It seems like it works though. Test included.
Comment #9
chx CreditAttribution: chx commenteduse PREG_CLASS_UNICODE_WORD_BOUNDARY instead of a space.
Comment #10
neclimdulWith some help from chx we have a significantly better patch.
Comment #11
neclimdulI should give chx more credit... most of the patch was really written by him with code snippets in IRC :-D
Comment #12
chx CreditAttribution: chx commentedSome help? I wrote the patch in IRC for you :)
Comment #13
aspilicious CreditAttribution: aspilicious commentedThis needs some love...
- "Uppercase matched string" is not rly a description of what the function does.
It also has to start with a 3th person verb
- Second of all: the @see part needs to be below the @param block (sepperated by a newline)
- Capitalize has to be capitalizes, already explained why
Question: why don't we need an @return statement?
Comment #14
neclimdulYeah... I guess I can be a lazy dev too. Too much copy and pasting without reviewing documentation... *tsk*tsk drupal_ucfirst btw...
Anyways, updated docs with suggestions and extra descriptions of what's going on. Actually running the test function I wrote too... *cough*
Comment #15
aspilicious CreditAttribution: aspilicious commentedI think this is wrong. The only other chacter I could find on the internet is a "totally spice chacter" ;). [1]
Sources
------
[1] http://www.proprofs.com/quiz-school/story.php?title=what-totally-spies-c...
Comment #16
neclimdulThat's totally what I meant and I'm like totally Sam. Wait? What?
Comment #17
aspilicious CreditAttribution: aspilicious commentedOk looks fine to me, search for an other reviewer on irc and this is rtbc.
Comment #18
kscheirerAll tests passed, applies without offset.
I don't usually say this, but there's too much documentation :)
_drupal_ucwords_callback()
is a one-line private function with no use case besides helping drupal_ucwords() - the one line explanation is sufficient.If we do keep the long paragraph explanation, "Because of the way our regular expression is build" should be "Because of the way our regular expression is built".
Comment #19
neclimdulThanks for the review, based on similar callbacks(_decode_entities for example) some explanation of what we're getting from the regular expression is reasonable. I've condensed this down to the param and return documentation though.
Comment #20
neclimdulThanks for the review, based on similar callbacks(_decode_entities for example) some explanation of what we're getting from the regular expression is reasonable. I've condensed this down to the param and return documentation though.
Comment #21
kscheirerLooks good to me. We're at over 4 years for this patch, I think it's ready.
Comment #22
Dries CreditAttribution: Dries commentedComment #23
TR CreditAttribution: TR commentedI have to ask, is this still a problem? The original post was almost 5 years ago. Since then, PHP has undergone many changes, and so has Drupal. We shouldn't take it for granted that this PHP function is still broken in PHP 5.2, which is the minimum required for D7. This thread has been all about the implementation, and no one's gone back and checked the initial assumption or provided an example that fails. I'm not fond of re-implementing PHP functions in Drupal, so I don't think this should go in unless there's verification that the bug still exists. ucwords() is not used in Drupal core. It is used in a few contributions, including Views (views/handlers/views_handler_argument_string.inc and views/includes/admin.inc). I found one issue in the Views queue describing a similar problem three years ago which was fixed in Views in a different manner #327604: case_transform() should use multibyte string functions.
Comment #24
chx CreditAttribution: chx commentedI will not trust any PHP's internal string functions. PCRE is good but PHP strings re UTF8? Gimme a break. On more concrete grounds, ucwords is not UTF-8 aware so it will never uppercase á to Á because it lacks this table.
Comment #25
TR CreditAttribution: TR commentedLet's test it against D8 head then.
#20: 70719-drupal_ucwords.patch queued for re-testing.
Comment #26
catchThis is something we could backport as an API addition if we wanted to. Tagging as such and cross-linking #1221904: RFC: forwards compatibility in D7 / 'backports' defgroup.
Comment #27
klausithis can be tested equally with DrupalUnitTest, no need to use a fully blown DrupalWebTestCase here.
Comment #28
pfrenssenWouldn't it be a good idea to expand the scope of this issue to provide drupal_lcfirst() as well?
Comment #29
superspring CreditAttribution: superspring commentedThis is a D8 version of the patch. Includes replacing ucwords with drupal_ucwords.
Comment #30
Lars Toomre CreditAttribution: Lars Toomre commentedCan this be changed to '(array $matches)'?
There is no need to translate test assertion messages. This should be changed to format_string().
Comment #31
superspring CreditAttribution: superspring commentedSame patch as above with Lars Toomre's review.
Comment #32
kscheirerOk at 6.5 years we have to finally be getting close... :) Patch works as advertised for me.
@pfrenssen: I was gonna say we don't need it, I've never had a use for lcfirst. But searching through core reveals:
I think that's code we adopted from Symfony, do we need to update those instances? Leaving as Needs review for this question, the actual patch is RTBC from me.
Comment #33
kscheirerArgh, we have many more instances of ucwords() in core...
We also seem to be making a lot of use of ucfirst() in core - should all those be changed over to drupal_ucfirst() ?
Comment #34
superspring CreditAttribution: superspring commentedHey @kscheirer,
Out of your grep list the only important one I saw there was in HandlerBase.php, the others were comments, select options and vender specific code. HandlerBase.php is in the patch above.
Is this your only objection to the patch?
Comment #35
superspring CreditAttribution: superspring commentedAfter review with @chx, now using closures.
Comment #36
chx CreditAttribution: chx commentedEven if there are more , we can always do more in a followup. This is good to go but given that practically I wrote it some years ago I can't really RTBC it.
Comment #37
kscheirerYeah, that's my only objection. The patch itself is RTBC if we don't need to worry about FieldTest.php:11879. I guess we can't do anything about the vendor code.
Update: just saw the latest patch takes care of FieldTest as well, all good!
Comment #39
kscheirerI take back what I said, I was very wrong. These 2 pieces are Views module dealing with paths, which I now think we should leave alone. The purpose of drupal_ucwords() is to add to our Developer API when dealing with text strings, not start changing generated urls.
As webchick stated in #4:
and neclimdul in #8:
Let's add drupal_lcfirst() and a test and call it done!
This hunk should be removed from the patch.
This hunk should be removed from the patch.
Comment #40
superspring CreditAttribution: superspring commentedSame patch as above with test fixes.
Comment #41
kscheirerThis patch builds on the previous, with the following changes:
Comment #42
kscheireraaaand a patch!
Comment #44
kscheirerremoved the call to mb_convert_case(), I was never able to get that to correctly convert 3 of our tests.
Comment #45
pfrenssenThis needs to be rerolled. For D8 this should be reworked as well, as we now have
\Drupal\Component\Utility\Unicode
this should be placed inUnicode::ucwords()
andUnicode::lcfirst()
instead ofdrupal_ucwords()
anddrupal_lcfirst()
.Comment #46
InternetDevels CreditAttribution: InternetDevels commentedRerolled.
Comment #48
neclimdulanonymous function can't use static sadly.
Comment #49
neclimdulReally don't like this new form... think I need to go back to only commenting until its fixed. ;)
Comment #50
thedavidmeister CreditAttribution: thedavidmeister commentedThe patch seems fine to me and it still applies, setting this to RTBC but I needed to update the issue summary as it was pretty old.
#2094585: [policy, no patch] Core review bonus for #2093161: Remove all calls to drupal_json_encode() in favour of \Drupal\Component\Utility\Json::encode().
Comment #51
thedavidmeister CreditAttribution: thedavidmeister commentedupdated issue title.
Comment #52
alexpottWe need a draft change record for this change as per the new policy - Draft change records are now required before commit
Comment #53
thedavidmeister CreditAttribution: thedavidmeister commented@alexpott - I'm new to this policy. It would help if there was a field that references the change notice like the existing "related issues" field.
Anyway, I created a draft at https://drupal.org/node/2196751 so setting back to RTBC.
Comment #54
xjm@thedavidmeister, one change record can already reference many issues, so you can add a reference to this issue on any existing change record and that change record will then show up in the issue sidebar.
You can search existing change records at:
https://drupal.org/list-changes/drupal
I searched for "unicode" and found this:
https://drupal.org/node/1992584
So we can edit that one to add a reference to this issue. :)
Comment #55
xjm@alexpott apparently didn't know where the sidebar was, so here is screenshot:
Comment #56
alexpottThe should be the fully namespaced string. It only works because Unicode and MapArray are in the same namespace. So we can remove the use statement. And replace
'Unicode...
with'\Drupal\Component\Utility\Unicode...
Should a something like "Lowercases"
This can not possibly by a unicode string as far as I know.
Unicode::ucfirst()
is not being introduced by this patch so I think this is out of scope. Additonally, can column names be utf - I don't think so.Comment #57
neclimdul56.1 + 56.4) Yeah, they didn't use the Unicode because the strings are static and clearly not Unicode so we can take that out.
56.2) ok
56.3) Yeah until PHP 6 that shouldn't be a thing we need to worry about.
Comment #58
ParisLiakos CreditAttribution: ParisLiakos commentedIrrelevant change, but i guess we can keep it
this should be the one that changes to "Lowercases" which is introduced here
by accident, should be reverted
Comment #59
longwaveFixed #58.2 and #58.3. If we are to change "Lowercase" to "Lowercases" here, we should also change "Uppercase" to "Uppercases" at the same time, so I made that change as well, even if it's slightly out of scope.
Comment #60
TR CreditAttribution: TR commented"Lowercase" and "Uppercase" are adjectives and should not be used as verbs. Likewise, UTF-8 strings are composed of characters, not letters.
Instead of "Lowercases the first letter of a UTF-8 string.", I would write "Converts the first character of a UTF-8 string to lowercase."
Comment #61
HeikeT CreditAttribution: HeikeT commentedComment #62
longwaveMade changes suggested in #60
Comment #63
hdmaestro CreditAttribution: hdmaestro commentedi have same issue. now fixed.
yemek tarifleri
Comment #64
ParisLiakos CreditAttribution: ParisLiakos commentedthanks, looks great
Comment #66
longwaveSeems this was accidentally committed in #2169765: Redesign update.php to be more consistent with the installation process?
http://drupalcode.org/project/drupal.git/commitdiff/728aad1
Comment #67
ianthomas_uk62: 70719-unicode-62.patch queued for re-testing.
Comment #68
longwaveBack to RTBC
Comment #69
alexpottCommitted f130cc3 and pushed to 8.x. Thanks!