CONTEXT/NEED
I am dealing with indexing texts in Sanskrit. As inconvenient as it may be, words can contain hundreds or even thousands of characters. This is, of course, because there are no spaces between the "real" words.

TECHNICAL
This module truncates such words to their first 50 characters before indexing them (in file "service.inc", function convert(), case 'tokens'). I have read about using the Tokenizer processor, but I am afraid it is not what I need. The Tokenizer, if I understand it correctly, only lets me specify which characters to treat as word separators.
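
To illustrate the effect (a minimal sketch only, not the module's actual code; drupal_strlen() and drupal_substr() are Drupal 7's multibyte-safe string functions):

  <?php
  // Sketch of the truncation behaviour described above: any token
  // longer than 50 characters is cut off before indexing, so the
  // rest of a long Sanskrit word is simply lost.
  $max_length = 50;
  foreach ($tokens as $i => $token) {
    if (drupal_strlen($token['value']) > $max_length) {
      $tokens[$i]['value'] = drupal_substr($token['value'], 0, $max_length);
    }
  }
  ?>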

QUESTIONS
1. Can I achieve my goal of indexing such long words with this module? If so, how?
2. If not, will another service class (such as the Solr one, also based on Search API) let me do that?
3. Any other ideas?

THANKS!

Comments

drunken monkey’s picture

Version: 7.x-1.4 » 7.x-1.x-dev

Hm, OK, that will really not work well out-of-the-box. I'm sorry, but as you can imagine it's well-nigh impossible to support all languages well, with all their different characteristics.

If there is some other way to tell automatically (without a dictionary or the like) where the borders of words are (e.g., a specific set of characters always being at the beginning of a word), the "Tokenizer" processor could still solve your problem. You'd just need someone familiar enough with regular expressions to write one that properly matches word boundaries in Sanskrit.
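
For example (a purely hypothetical sketch – I have no idea whether such a character class actually exists for Sanskrit), if every word began with a character from some known set, the split pattern could look like this in PHP:

  <?php
  // Hypothetical: assumes every word starts with a character in the
  // (made-up) class [क-ह]. The zero-width lookahead splits before
  // each such character, keeping it as part of the following word.
  $words = preg_split('/(?=[क-ह])/u', $text, -1, PREG_SPLIT_NO_EMPTY);
  ?>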

However, if that's not the case (and I fear it won't be), it's more complicated. Using a Solr server, configuring it correctly for Sanskrit (you'd probably need to modify the definition of the "text" type in schema.xml – search the internet for instructions on what to put in there for proper handling of Sanskrit) and disabling all the processors in the Search API (except, possibly, "HTML filter") could be a solution – I haven't tried it, but Solr is usually capable of handling all languages reasonably well, if configured correctly, so there's a good chance that might work.
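As a rough sketch of the kind of schema.xml change meant here (untested, and the exact analyzers appropriate for Sanskrit are an open question – treat the filter choices below as assumptions), the "text" field type could be redefined to index character n-grams:

  <!-- Hypothetical, untested sketch – not a known-good Sanskrit setup. -->
  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- Index overlapping 2–4 character grams so that long,
           unsegmented words can be matched by their parts. -->
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
    </analyzer>
  </fieldType>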
Otherwise, you would probably have to write your own Search API processor that handles tokenizing in Sanskrit – maybe by just indexing overlapping chunks of a few characters as pseudo-words and preprocessing search keywords to match these, akin to simple CJK handling.
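
A minimal sketch of that chunking idea (a hypothetical helper, not an existing Search API processor; drupal_strlen()/drupal_substr() assumed for multibyte safety):

  <?php
  // Hypothetical: index every overlapping 3-character chunk of a long
  // word as a pseudo-word, similar in spirit to simple CJK handling.
  function example_ngram_tokens($word, $size = 3) {
    $tokens = array();
    $length = drupal_strlen($word);
    for ($i = 0; $i <= $length - $size; $i++) {
      $tokens[] = drupal_substr($word, $i, $size);
    }
    // Words shorter than $size are kept as-is.
    return $tokens ? $tokens : array($word);
  }
  // The same chunking would have to be applied to the search keywords
  // at query time, so that queries match the indexed pseudo-words.
  ?>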

In any case, I unfortunately know next to nothing about Sanskrit, so I cannot help you further. You'll probably have to find someone more knowledgeable about this problem, or search the internet for solutions.
If you do manage to solve the problem, please post back here (and mark the issue as "Fixed") so that others with the same problem can find a solution more easily in the future.

Amir Simantov’s picture

Thanks, Thomas, for taking the time to think about this and for putting it into words so clearly.

I will probably go with Solr. As I understand from your words, my request is not among the problems this service class is meant to solve (and, indeed, it should not be, as Sanskrit is not a common need).

For any newcomers to this issue, I suggest looking into a (non-Drupal) group I have stumbled upon: the Google group sanskrit-programmers.

Regarding the status of this issue, it can actually be changed to "Closed (won't fix)", because I cannot see how - or why - this (very useful, yet humble) module would answer this problem in a good way.

Thanks again, Amir

drunken monkey’s picture

Status: Active » Fixed

Good to hear I answered your question, thanks.
Since this is a support request and it was sufficiently answered, "Fixed" is the right status, I think.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.