I'd like to use this module for a German site. Obviously, the stopwords need to be different for the German version. So far, I have simply commented out the English version of the list and started to create my German version. It would be much nicer to have a clean way to have multiple lists.

I think the easiest way would be to add a variable to the settings where one could choose the language, and then have the code pick the list for the selected language.

Do you have time to implement this or should I have a go myself? I could provide a German list of words.

Comment | File | Size | Author
#11 | related_block.module.patch | 3.57 KB | rhys

Comments

rhys’s picture

I would definitely like to add this capability, so if you could provide a German list, I'd be happy to add this functionality.

nath’s picture

OK, one problem I already noticed with German texts is that words are split at umlauts (äöü) in the following way:

"Präsentation" is split to the following:
[16] => Pr [17] => auml [18] => sentation

This is likely because of the HTML encoding, which has "Präsentation" as "Pr&auml;sentation". I guess there would be similar problems in other languages.

nath’s picture

Ok, the problem with the encoding of umlauts can be solved by adding the following line to the beginning of _related_block_strip:

$text = html_entity_decode($text,ENT_QUOTES,"UTF-8");
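To illustrate the effect, here is a minimal sketch. The split_words() helper is hypothetical and only stands in for whatever tokenizing _related_block_strip actually does (it is assumed here to split on non-letter characters). Only the html_entity_decode() call is taken from the comment above.

```php
<?php
// Hypothetical stand-in for the module's tokenizer: split on anything
// that is not a Unicode letter. The real _related_block_strip may differ.
function split_words($text) {
  return preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$encoded = 'Pr&auml;sentation';

// Without decoding, the "&" and ";" of the entity act as separators,
// so the word falls apart into three pieces.
$before = split_words($encoded);

// With decoding, the entity is turned into "ä" first and the word
// survives as a single token.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
$after = split_words($decoded);

print_r($before);
print_r($after);
```

The same approach should help with entity-encoded accented characters in other languages, as long as the text is treated as UTF-8 throughout.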

nath’s picture

A first version of the German word list:

$overusedwords = array( '', 'aber', 'alle', 'als', 'auch', 'auf', 'aus', 'bei', 'beim', 'bis', 'brauchen', 'da', 'damit', 'dann', 'das', 'dass', 'dem', 'den', 'denn', 'der', 'des', 'die', 'dies', 'diese', 'doch', 'drei', 'durch', 'eigene', 'ein', 'eine', 'einem', 'einen', 'einer', 'es', 'gut', 'für', 'haben', 'hat', 'ich', 'ihnen', 'ihr', 'ihre', 'ihren', 'ihrer', 'im', 'ins', 'ist', 'kann', 'können', 'kommt', 'man', 'mit', 'müssen', 'muss', 'nach', 'neue', 'nicht', 'nur', 'oder', 'per', 'schon', 'sehr', 'seine', 'sich', 'sie', 'sind', 'so', 'sollten', 'sowie', 'über', 'und', 'unter', 'von', 'was', 'welche', 'wenn', 'werden', 'wird', 'wie', 'zum', 'zur');

rhys’s picture

Thanks. I'm quite busy with another module that I'm about to finish, so I'll try to get to this idea as soon as possible.

spiderman’s picture

I'm interested in making this work in a Drupal-ish way. My suspicion is that we should follow these instructions to make the module "translatable", and then work out a way to grab the list of stopwords from the relevant .po file, somehow. At the very least, we should take care to integrate with the i18n mechanisms which are in core for D6, for providing language selection options, etc.

Perhaps the porterstemmer module has tackled this problem already?

rhys’s picture

Agreed that the porterstemmer module would be useful for making the related block more relevant, at least for English. I'm not sure that this process could be applied to multiple languages.

Also, the list that nath provided suggests that the common words are similar across languages, but not necessarily the same as a direct translation. This doesn't fit the .pot file system well, and it would also leave users unable to specify additional words that should be considered common.

To solve this, I suggest we implement a locale-specific list of common words, using variable_get for the common words.
This could be set up as an array with the locale as the key and a list of common words, similar to the current one, as the value.
Since we're only dealing with single words, each list could be a string separated by spaces, which, combined with explode(' ', $variable), would provide the array needed to strip out the relevant words for whatever locale is currently selected. This would allow it to be integrated with modules such as the localizer module.

This would also allow an admin-configurable way to edit the appropriate strings, which could then use the .pot file to provide the default common-word string.
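A rough sketch of that idea follows. The plain $stored array is an assumption that stands in for a value the module would read with variable_get(), so that the snippet runs outside Drupal. The helper name related_block_common_words() and the sample word strings are hypothetical as well.

```php
<?php
// Stand-in for the value that would come from variable_get(); the
// structure (locale => space-separated words) follows the proposal above,
// but the contents are illustrative only.
$stored = array(
  'en' => 'a an and the of to',
  'de' => 'aber alle als auch auf aus',
);

// Hypothetical helper: return the common words for a locale as an array,
// or an empty array when the locale has no entry.
function related_block_common_words(array $stored, $locale) {
  if (!isset($stored[$locale]) || trim($stored[$locale]) === '') {
    return array();
  }
  // explode() on a single space, as suggested in the comment above.
  return explode(' ', trim($stored[$locale]));
}

// Strip the common words for the selected locale from a list of tokens.
$tokens = array('auch', 'Drupal', 'aus', 'Modul');
$kept = array_values(array_diff($tokens, related_block_common_words($stored, 'de')));

print_r($kept);
```

One caveat with this layout: explode(' ', ...) produces empty entries if an admin types two spaces in a row, so a real implementation would probably want to normalize the string when it is saved.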

nath’s picture

Any news on how we could proceed?

rhys’s picture

So I'm going to do it in a somewhat Drupal-ish way, which is to have separate files that are included on the basis of locale. I will have a commit sometime soon.

rhys’s picture

Status: Active » Needs work

It's totally untested, even for syntax errors. You'll need the files that should be stored within the updated module. These are located in the "ignore" directory, which will contain the word lists for the various languages.

If you have problems, please let me know immediately, so I can do something about it.

rhys’s picture

Status | File | Size
new | related_block.module.patch | 3.57 KB

spiderman’s picture

Version: 5.x-0.x-dev » 6.x-1.x-dev
Assigned: Unassigned » spiderman
Status: Needs work » Needs review

I've re-rolled this patch and committed it here: http://drupalcode.org/project/related_block.git/commit/8e987cb
This will effectively pull in a language-specific stopwords list to use when filtering out words to search for related content.

spiderman’s picture

Category: support » feature

Also added the German list from http://drupal.org/node/191811#comment-629787 above, in this commit here: http://drupalcode.org/project/related_block.git/commit/8fc7dc1.

The process of adding new languages' stopwords is to create a file in the module's stopwords/ directory with the 2-letter prefix for the language in question, with a .inc suffix. The file should create a simple $words array containing all words which should be ignored by the related search algorithm. Thus related_block/stopwords/de.inc looks like this:

$words = array( '', 'aber', 'alle', 'als', 'auch', 'auf', 'aus', 'bei', 'beim', 'bis', 'brauchen', 'da', 'damit', 'dann', 'das', 'dass', 'dem', 'den', 'denn', 'der', 'des', 'die', 'dies', 'diese', 'doch', 'drei', 'durch', 'eigene', 'ein', 'eine', 'einem', 'einen', 'einer', 'es', 'gut', 'für', 'haben', 'hat', 'ich', 'ihnen', 'ihr', 'ihre', 'ihren', 'ihrer', 'im', 'ins', 'ist', 'kann', 'können', 'kommt', 'man', 'mit', 'müssen', 'muss', 'nach', 'neue', 'nicht', 'nur', 'oder', 'per', 'schon', 'sehr', 'seine', 'sich', 'sie', 'sind', 'so', 'sollten', 'sowie', 'über', 'und', 'unter', 'von', 'was', 'welche', 'wenn', 'werden', 'wird', 'wie', 'zum', 'zur');
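
For readers who want to see how such a file might be picked up, here is a hedged sketch. It is not the code from the commits above: the function name related_block_load_stopwords(), the empty-array fallback, and the temporary demo directory are all assumptions made for illustration. It also assumes the real .inc file starts with a <?php tag, which the listing above does not show.

```php
<?php
// Hypothetical loader: include <dir>/<langcode>.inc and return the $words
// array that the file defines. Returns an empty array if the language code
// is not a 2-letter code or if no file exists for it.
function related_block_load_stopwords($dir, $langcode) {
  // Accept only two lowercase letters, so the value cannot be used to
  // reach files outside the stopwords directory.
  if (!preg_match('/^[a-z]{2}$/', $langcode)) {
    return array();
  }
  $file = $dir . '/' . $langcode . '.inc';
  if (!is_file($file)) {
    return array();
  }
  $words = array();
  // The included file assigns $words in this function's local scope.
  include $file;
  return is_array($words) ? $words : array();
}

// Demo: a temporary directory stands in for related_block/stopwords/.
$dir = sys_get_temp_dir() . '/stopwords_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/de.inc', "<?php\n\$words = array('', 'aber', 'alle', 'als');\n");

$de = related_block_load_stopwords($dir, 'de');
$fr = related_block_load_stopwords($dir, 'fr');

print_r($de);
print_r($fr);
```

Because the file is included inside a function, the $words it defines stays local and does not leak into the global scope.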
spiderman’s picture

Title: German version » Add localized stopword lists

Updating title to reflect what this feature actually adds. It is now possible to create a .inc file in the stopwords/ subfolder of this module which simply defines a list of "stopwords" to be ignored when comparing the text of the content to determine relevance.

spiderman’s picture

Status: Needs review » Closed (fixed)
spiderman’s picture

Version: 6.x-1.x-dev » 6.x-1.0
spiderman’s picture

Status: Closed (fixed) » Fixed

erf.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.