The transliteration system seems to be rather memory inefficient for large strings.
For instance, a random 1,000,000-character string passed to the transliterator produces memory usage of more than 200 MB.
This is due almost entirely to the use of preg_split in PhpTransliteration::transliterate(). It would be nice if there were a more efficient way to handle this.
This is causing a problem on our project because very large node bodies are transliterated during search indexing. 1,000,000 characters may seem like a lot, but it's actually not that uncommon if content editors are copying/pasting from other sources like Word.
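For reference, here is a minimal sketch of how the blow-up can be reproduced; the test string, its length, and the memory measurement are illustrative assumptions rather than part of the original report, and it assumes Drupal's class loader is available (e.g. when run through drush php:script):

<?php

use Drupal\Component\Transliteration\PhpTransliteration;

// Build a large multi-byte string of roughly 1,000,000 characters, similar
// in size to a very large node body pasted in from Word.
$input = str_repeat('Příliš žluťoučký kůň ', 50000);

$before = memory_get_peak_usage(TRUE);

// PhpTransliteration::transliterate() splits the whole string into an array
// of one-character strings via preg_split('//u', ...), so a 1,000,000
// character input becomes an array of 1,000,000 tiny PHP strings. That array
// is where most of the 200 MB+ goes.
$result = (new PhpTransliteration())->transliterate($input);

$after = memory_get_peak_usage(TRUE);
printf("Peak memory increase: %.1f MB\n", ($after - $before) / 1048576);

The exact figure will vary by PHP version and configuration, but the shape of the problem is the same.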
Comments
Comment #7
Matroskeen: I was curious and decided to reproduce the described issue.
Here are some numbers based on Blackfire profiling:
As we can see, memory usage grows with the string length, so it can indeed exceed 200 MB in some circumstances.
I did some research and found the following alternatives to preg_split('//u'):
- mb_str_split: the profiling numbers are the same;
- a for loop combined with mb_strlen and mb_substr (https://stackoverflow.com/a/57748023): much slower (6.51 s and 7.41 MB for 32,768 characters) because of the expensive mb_substr calls.
I don't see any other options, but if there were a cheap way to retrieve Unicode characters from a string one at a time, it could work. (A rough sketch of the three approaches is shown below.)
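To make the comparison concrete, here is a rough sketch of the three ways of walking over the Unicode characters of a string; the test string is an assumption for illustration, not the exact input used in the profiling above:

<?php

// Roughly 32,768 characters of multi-byte text.
$input = str_repeat('žluťoučký ', 3277);

// 1. Current core approach: preg_split('//u') materializes the full array of
//    one-character strings up front, which is what dominates memory usage.
$characters = preg_split('//u', $input, 0, PREG_SPLIT_NO_EMPTY);

// 2. mb_str_split() (PHP 7.4+): also builds the full array, so the profiling
//    numbers come out about the same as with preg_split().
$characters = mb_str_split($input, 1, 'UTF-8');

// 3. for loop with mb_strlen()/mb_substr(): avoids the big array, but every
//    mb_substr() call has to re-scan the UTF-8 string from the start to find
//    the character offset, so the whole loop is effectively O(n^2) and much
//    slower on large input.
$length = mb_strlen($input, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
  $character = mb_substr($input, $i, 1, 'UTF-8');
  // ... process $character ...
}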
Comment #8
Matroskeen: As was pointed out by @chx in Slack, this can be solved by using the intl extension. I did a quick test using the following line:
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $this->generateRandomString(XXX));
The resulting numbers are very promising.
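For completeness, here is a minimal sketch of the intl-based approach; the input string and the object-oriented Transliterator usage are illustrative assumptions that go slightly beyond the one-liner above:

<?php

// Requires the intl PHP extension (ICU); the input string here is only a
// stand-in for the random string generated in the test above.
$input = str_repeat('Příliš žluťoučký kůň ', 50000);

// Same transform rule as above: convert any script to Latin, strip accents,
// lowercase. ICU walks the string internally, so PHP never has to build a
// per-character array, which is why the memory numbers stay low. Creating
// the transliterator once also lets it be reused for many strings.
$transliterator = \Transliterator::create('Any-Latin; Latin-ASCII; Lower()');
$output = $transliterator->transliterate($input);

// Equivalent to the procedural one-liner from the comment above:
// $output = transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', $input);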