Tokenizer settings for the Arabic language

Last updated on
24 March 2017

The tokenizer defaults do not work well with non-English strings. If you enable the tokenizer on content that contains Arabic strings, you risk losing all of the Arabic letters and ending up with an English-only search index.

To resolve this issue, the following values can be useful for tokenizing long Arabic strings:

Whitespace Characters:
[\p{P}|\p{C}|\p{Z}|\p{S}]
Ignorable Characters:
[\p{M}|ـ]

These settings worked well in mixed Arabic-English content, as well as in Arabic-only or English-only content.

The only problem found so far with these settings is that the decimal point gets stripped, because it counts as punctuation, so a string like "1234.4567" becomes "1234 4567".
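
To illustrate what these character classes do, here is a minimal Python sketch that approximates them with the standard unicodedata module. It is not the tokenizer's own code: the function names are hypothetical, and the order of operations (drop ignorable characters first, then split on whitespace characters) is an assumption made for the example.

```python
import unicodedata

# Arabic tatweel (kashida), the literal character listed in the ignorable class.
TATWEEL = "\u0640"


def is_ignorable(ch):
    """Approximates [\\p{M}|ـ]: combining marks (e.g. Arabic diacritics) and tatweel."""
    return ch == TATWEEL or unicodedata.category(ch).startswith("M")


def is_separator(ch):
    """Approximates [\\p{P}|\\p{C}|\\p{Z}|\\p{S}]: punctuation, control/other,
    separators, and symbols are all treated as whitespace."""
    return unicodedata.category(ch)[0] in "PCZS"


def tokenize(text):
    # Assumed order: remove ignorable characters, then split on separators.
    kept = "".join(ch for ch in text if not is_ignorable(ch))
    spaced = "".join(" " if is_separator(ch) else ch for ch in kept)
    return spaced.split()


print(tokenize("Hello, عالم!"))   # mixed English and Arabic content
print(tokenize("كتـــاب"))         # tatweel elongation is dropped
print(tokenize("1234.4567"))      # the decimal point acts as a separator
```

Under these assumptions, the last call shows the decimal-point limitation: the period belongs to the punctuation category, so the number is split into two tokens.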

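These settings use Unicode Character Properties: \p{P} matches punctuation, \p{C} control and other invisible characters, \p{Z} separators, \p{S} symbols, and \p{M} combining marks such as Arabic diacritics. The literal character ـ is the Arabic tatweel (kashida, U+0640).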