Hi,

I had a user post an issue for Textimage (#465492: Text wrapper splits words (seemingly at random)) where the text wrapping wasn't working properly. It turned out that preg_match() was reporting a byte offset rather than a character position: a multi-byte Unicode character ('æ') earlier in the string shifted the reported position, and with it the cut point.
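
To illustrate the mismatch, here is a minimal sketch in plain PHP. The sample string is made up for this example, and the mb_*() calls assume the mbstring extension is available; they stand in for Drupal's own Unicode helpers.

```php
<?php
// "\xC3\xA6" is the two-byte UTF-8 encoding of 'æ'.
$subject = "f\xC3\xA6 bar";

// With PREG_OFFSET_CAPTURE, preg_match() reports where the match starts,
// counted in bytes from the beginning of the subject.
preg_match('/bar/', $subject, $matches, PREG_OFFSET_CAPTURE);
$byte_offset = $matches[0][1];

// Counting the characters in the bytes before the match gives the
// character position that character-based string functions expect.
$char_offset = mb_strlen(substr($subject, 0, $byte_offset), 'UTF-8');

// The two values differ by one because 'æ' takes two bytes.
printf("byte offset: %d, character offset: %d\n", $byte_offset, $char_offset);
```

Feeding the byte offset into a character-based function such as drupal_substr() would move the cut point one character too far to the right for every two-byte character before it.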

After a bit of searching I discovered a user-submitted mb_preg_match() function that fixes the issue: http://php.nusa.net.id/manual/en/function.preg-match.php#71571.

Is there any chance that this function, once modified to better suit Drupal's current Unicode functions, will make it into core?
Or am I barking up the wrong tree and should be solving the above issue with a cleaner solution that I've clearly overlooked?

Cheers,
Deciphered.

Comments

Damien Tournoud’s picture

Hi Deciphered,

I cannot say I understand completely what the issue is, but it seems clear that preg_match() is only dealing with offsets in bytes, not in characters, so you have to use substr() and strlen(), not drupal_substr() and drupal_strlen().

One other thing you might want to consider is using preg_split() to cut your string first at punctuation marks, and then deal with the wrapping. The advantage is that after the string is cut, you don't have to worry about bytes and characters anymore: you can do all the other operations using characters (i.e. with drupal_*() functions).
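
A rough sketch of that split-first approach, under a few assumptions: example_wrap_words() is a hypothetical helper (not part of Drupal or Textimage), it splits on whitespace rather than punctuation to keep the example short, and it uses mb_strlen() from the mbstring extension where Drupal code would use drupal_strlen().

```php
<?php
/**
 * Hypothetical helper: wrap text to a maximum width counted in characters.
 */
function example_wrap_words($text, $width) {
  // Split on runs of whitespace. The /u modifier makes PCRE treat the
  // pattern and subject as UTF-8. No offsets are involved after this point.
  $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

  $lines = array();
  $current = '';
  foreach ($words as $word) {
    $candidate = ($current === '') ? $word : $current . ' ' . $word;
    // Lengths are measured in characters, so a multi-byte letter counts once.
    if ($current !== '' && mb_strlen($candidate, 'UTF-8') > $width) {
      $lines[] = $current;
      $current = $word;
    }
    else {
      $current = $candidate;
    }
  }
  if ($current !== '') {
    $lines[] = $current;
  }
  return $lines;
}

// "blåbær og fløte", written with byte escapes so the file encoding
// does not matter.
$lines = example_wrap_words("bl\xC3\xA5b\xC3\xA6r og fl\xC3\xB8te", 8);
```

Here "og fløte" is eight characters but nine bytes, so it fits on one line only because the length check counts characters. A single word longer than the width is left unbroken in this sketch.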

Regardless of everything, a function like this could indeed be useful:

<?php
/**
 * Unicode-safe preg_match().
 *
 * Search subject for a match to the regular expression given in pattern,
 * but take and return offsets in characters, where preg_match() would use
 * offsets in bytes.
 *
 * @see http://php.net/manual/en/function.preg-match.php
 */
function drupal_preg_match($pattern, $subject, &$matches, $flags = 0, $offset = 0) {
  // Convert the offset value from characters to bytes.
  $offset = strlen(drupal_substr($subject, 0, $offset));

  $return_value = preg_match($pattern, $subject, $matches, $flags, $offset);

  if ($return_value && ($flags & PREG_OFFSET_CAPTURE)) {
    foreach ($matches as &$match) {
      // Unmatched subpatterns are reported with an offset of -1; leave
      // those untouched.
      if ($match[1] >= 0) {
        // Convert the offset returned by preg_match() from bytes back to
        // characters.
        $match[1] = drupal_strlen(substr($subject, 0, $match[1]));
      }
    }
    // Break the reference left behind by the foreach loop.
    unset($match);
  }
  return $return_value;
}
?>
Deciphered’s picture

Hi Damien,

I had originally fixed the issue by using substr() over drupal_substr() and had also considered breaking the words into an array, but this code does the job quite nicely and feels like the right way to go about Unicode support.

Would be nice to see this code make its way into core, but based on my searches it's clearly not a common issue.

Cheers,
Deciphered.

mdupont’s picture

Version: 7.x-dev » 8.x-dev

Bumping feature request to 8.x-dev.