Tokenizer settings for the Arabic language
The tokenizer defaults do not work well with non-English strings. If you enable the tokenizer on content that contains Arabic strings, you risk losing all of the Arabic letters and ending up with an English-only search index.
To resolve this issue, the following values can be useful for tokenizing long Arabic strings:
Whitespace Characters:
[\p{P}|\p{C}|\p{Z}|\p{S}]
Ignorable Characters:
[\p{M}|ـ]
These settings worked well in mixed Arabic-English content, as well as in Arabic-only or English-only content.
The only problem found so far with these settings is that the decimal point gets stripped, because the period is a punctuation character matched by \p{P}. A string like "1234.4567" therefore becomes "1234 4567".
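These settings use Unicode character properties: \p{P} matches punctuation, \p{C} matches control and other invisible characters, \p{Z} matches separators such as spaces, \p{S} matches symbols, and \p{M} matches combining marks, which include the Arabic diacritics. The literal ـ in the ignorable class is the tatweel (kashida), the stroke used to elongate Arabic letters.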
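To see what these two character classes do to a string, the following Python sketch approximates them with the standard `unicodedata` module. It is not the tokenizer module's code. The function names are made up for illustration, and the order of operations (ignorable characters dropped first, then separators replaced by spaces) is an assumption. Results may also differ slightly from PCRE if the two use different Unicode versions.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic tatweel (kashida), the literal character in the ignorable class


def is_ignorable(ch: str) -> bool:
    # Mirrors [\p{M}|ـ]: any combining mark, a literal "|", or tatweel.
    # Inside a character class the "|" is a literal, not an alternation.
    return unicodedata.category(ch).startswith("M") or ch in ("|", TATWEEL)


def is_separator(ch: str) -> bool:
    # Mirrors [\p{P}|\p{C}|\p{Z}|\p{S}]: punctuation, control/other,
    # separators, and symbols ("|" is itself a symbol, so it is covered).
    return unicodedata.category(ch)[0] in "PCZS"


def simplify(text: str) -> str:
    # Assumption: ignorable characters are removed before separators
    # are turned into spaces; runs of spaces are then collapsed.
    kept = (ch for ch in text if not is_ignorable(ch))
    spaced = "".join(" " if is_separator(ch) else ch for ch in kept)
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(simplify("1234.4567"))        # 1234 4567 (the decimal point is lost)
    print(simplify("Hello, عالم!"))     # Hello عالم
    print(simplify("جمـــيل"))           # tatweel removed: جميل
```

The first call reproduces the decimal point issue described above: the period falls under punctuation, so it is treated as a word boundary.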