Tokenizer settings for the Arabic language

Last updated on 24 March 2017


The Tokenizer defaults do not work well with non-English strings. If you enable the tokenizer on content that contains Arabic strings, you risk losing all of the Arabic letters and ending up with an English-only search index.

To resolve this issue, the following values can be useful for tokenizing long Arabic strings:

Whitespace Characters:
[\p{P}|\p{C}|\p{Z}|\p{S}]
Ignorable Characters:
[\p{M}|ـ]

These settings worked well with mixed Arabic-English content, as well as with Arabic-only or English-only content.

The only problem found so far with these settings is that the decimal point is treated as punctuation and gets stripped, so a string like "1234.4567" becomes "1234 4567".

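These settings use Unicode character properties: \p{P} matches punctuation, \p{C} matches control, format, and other characters in the "Other" category, \p{Z} matches separators such as spaces, \p{S} matches symbols, and \p{M} matches combining marks, which include the Arabic diacritics. The character ـ is the Arabic tatweel (kashida), which only elongates a word visually.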
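To see what the two character classes do to a string, here is a minimal Python sketch that imitates them using the standard `unicodedata` module. It is not the Tokenizer's own code: the function names are hypothetical, and the order of operations (drop ignorable characters first, then split on whitespace characters) is an assumption about how the settings are applied.

```python
import unicodedata

TATWEEL = "\u0640"  # the Arabic elongation character "ـ"


def is_ignorable(ch):
    """Approximates [\\p{M}|ـ]: combining marks (e.g. Arabic diacritics) or tatweel."""
    return unicodedata.category(ch).startswith("M") or ch == TATWEEL


def is_whitespace(ch):
    """Approximates [\\p{P}|\\p{C}|\\p{Z}|\\p{S}]: punctuation, other, separators, symbols."""
    return unicodedata.category(ch)[0] in "PCZS"


def tokenize(text):
    """Hypothetical helper: remove ignorable characters, then split on whitespace characters."""
    # Assumption: ignorable characters are removed without leaving a gap.
    kept = "".join(ch for ch in text if not is_ignorable(ch))
    # Assumption: every "whitespace" character acts as a token boundary.
    spaced = "".join(" " if is_whitespace(ch) else ch for ch in kept)
    return spaced.split()


if __name__ == "__main__":
    # Diacritics are dropped, and the Arabic and English words both survive.
    print(tokenize("\u0643\u0650\u062A\u064E\u0627\u0628 Drupal"))
    # The decimal point is punctuation, so the number is split in two.
    print(tokenize("1234.4567"))
```

With these rules, a word written with diacritics or with a tatweel is indexed in the same form as the plain word, which is usually what you want for search. The second call reproduces the decimal point limitation described above.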

