I use the views_natural_sort_remove_symbols transformation with the following symbols:

#"'\()[]«?!»¡¿

Somehow it manages to turn

Cuando...se abre, ¿dará algún tipo de señal?

into

Cuando...se abre, dar? algún tipo de señal

which produces the following error when the field is inserted into the database:

WD node: PDOException:  in views_natural_sort_store() (line 203 of /Users/nirbhasa/Documents/htdocs/libry/sites/all/modules/views_natural_sort/views_natural_sort.module).
WD php: PDOException:  in views_natural_sort_store() (line 203 of /Users/nirbhasa/Documents/htdocs/libry/sites/all/modules/views_natural_sort/views_natural_sort.module).

I fixed it by adding the Unicode (u) modifier to the preg_replace regex, but I am still not 100% sure what is happening. It does strip ¿ from other fields, but some combination of ¿ and á makes it go wrong:

My modified function:

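function views_natural_sort_remove_symbols($string) {
  // Symbols to strip, as configured in the module settings.
  $symbols = variable_get('views_natural_sort_symbols_remove', '');
  if (strlen($symbols) == 0) {
    return $string;
  }
  // The u modifier makes PCRE treat pattern and subject as UTF-8, so the
  // character class matches whole characters instead of single bytes.
  // Passing '/' to preg_quote() also escapes the pattern delimiter.
  return preg_replace(
    '/[' . preg_quote($symbols, '/') . ']/u',
    '',
    $string
  );
}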
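
A likely explanation for the ¿/á interaction, sketched as a standalone script (no Drupal needed; it assumes the file is saved as UTF-8, and the variable names are made up for the demo). Without the u modifier the character class is a set of single bytes. ¿ is the bytes C2 BF and ¡ is C2 A1, so the class ends up containing the byte A1. á is C3 A1, so its second byte gets stripped and a lone C3 is left behind, which is not valid UTF-8 and would plausibly make the database insert fail:

```php
<?php
// Symbols and title taken from the report above.
$symbols = "#\"'\\()[]«?!»¡¿";
$title = 'Cuando...se abre, ¿dará algún tipo de señal?';

// Without /u: the class matches individual bytes, including 0xA1 (from ¡).
$bytewise = preg_replace('/[' . preg_quote($symbols, '/') . ']/', '', $title);

// With /u: the class matches whole code points.
$unicode = preg_replace('/[' . preg_quote($symbols, '/') . ']/u', '', $title);

// preg_match('//u', ...) returns 1 only when the subject is valid UTF-8.
$bytewise_valid = preg_match('//u', $bytewise) === 1;
$unicode_valid = preg_match('//u', $unicode) === 1;

echo 'bytewise valid UTF-8: ' . var_export($bytewise_valid, TRUE) . "\n";
echo 'unicode valid UTF-8:  ' . var_export($unicode_valid, TRUE) . "\n";
echo 'unicode result:       ' . $unicode . "\n";
// Hex dump of the damaged string; look for a c3 byte followed directly by 20.
echo 'bytewise hex:         ' . bin2hex($bytewise) . "\n";
```

Fields that contain ¿ but none of the characters sharing a trailing byte with the symbols (á here) come out fine either way, which would match the observation that ¿ is stripped correctly elsewhere.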

Comments

nirbhasa created an issue. See original summary.

generalredneck commented:

Can you always reproduce it with that string, for example in other fields or on a fresh install? It's just a quick question to rule out a database encoding problem. If possible, can you email me a sanitized SQL dump? Generalredneck at Gmail dot com. Or at the very least, double-check the database encoding for me and make sure it's utf8_general_ci.

But I bet it's me using a Unicode-unsafe function somewhere.

In the meantime, I'll see what I can do as far as testing goes. I may not get a fast turnaround on this one, though. Good call on the /u anyway; it probably should be there.

generalredneck commented:

I was revisiting this issue. Here is some info I found on PHP.net:

If the _subject_ contains utf-8 sequences the 'u' modifier should be set, otherwise a pattern such as /./ could match a utf-8 sequence as two to four individual ASCII characters. It is not a requirement, however, as you may have a need to break apart utf-8 sequences into single bytes. Most of the time, though, if you're working with utf-8 strings you should use the 'u' modifier.

http://php.net/manual/en/reference.pcre.pattern.modifiers.php#107498
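
To make the quoted note concrete, here is a minimal sketch (the sample string is just an assumption for illustration, and the script assumes it is saved as UTF-8) that counts how many times /./ matches with and without the modifier:

```php
<?php
// Two characters: 'd' (1 byte) and 'á' (2 bytes, C3 A1).
$subject = 'dá';

// Without /u the dot matches single bytes.
$byte_matches = preg_match_all('/./', $subject);

// With /u the dot matches whole code points.
$char_matches = preg_match_all('/./u', $subject);

echo "without u: $byte_matches, with u: $char_matches\n";
// prints "without u: 3, with u: 2"
```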

Since this is the case, it would be prudent to put this on all the preg_replace calls I have. I'm going to do that and run my tests against it.

This might also explain some of the funkiness you found, though I couldn't reproduce it at the time.


  • generalredneck committed bdd421d on 7.x-2.x
    Issue #2775643 by generalredneck, nirbhasa: Unicode issue with...
generalredneck commented:

Status: Needs review » Fixed

So I went ahead and committed this without review, because I wrote a test to double-check the removal function. See bdd421d.

  • generalredneck committed 6b8f779 on 8.x-2.x
    Issue #2775643 by generalredneck, nirbhasa: Unicode issue with...
generalredneck commented:

Ported to D8 as well.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.