Here's my promised search patch. It's not 100% commit-ready yet, but it's time to solicit some feedback and get this tested ;). Note that this patch requires a db update, which will wipe the search index. You will then need to call cron.php enough times for the site to be indexed completely again. This could take a while for large databases, but you can control the throttle and see the progress at admin/settings/search.

Features

  • AND keyword matching by default ('all of the words'), instead of OR ('any of the words').
  • OR support through keyword1 OR keyword2 OR ...
  • Phrase searching through "quoted strings".
  • Negative matching through -"minus prefix" -word.
  • Restrict search by taxonomy or node type(s) using taxonomy:1,2 and type:blog,page.

The options are built into the keyword string through a Google-like syntax, but there is an expandable "advanced settings" form below the search box which acts as a 'query builder':

This example will result in the following search string (of course not a practical example):

test type:forum,story category:1 "tinky winky" OR "dipsy" -"uh oh" "teletubby bye bye"
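To make the syntax concrete, here is a minimal Python sketch of how such a keyword string could be split into its parts. It is an illustration only, not the patch's PHP code: `parse_query`, the two regular expressions and the returned structure are all hypothetical, and edge cases such as unbalanced quotes are handled naively.

```python
import re

# Hypothetical tokenizer: an optional leading minus, then either a
# "quoted phrase" or a run of non-space characters.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')
# Hypothetical option pattern, e.g. type:blog,page or category:1
OPTION = re.compile(r'^(\w+):([\w,]+)$')


def parse_query(query):
    """Return (options, groups, negative) for a keyword string.

    groups is a list of OR-groups that are ANDed together, so
    [['a'], ['b', 'c']] means: a AND (b OR c).
    """
    options, groups, negative = {}, [], []
    join_next = False  # True when the previous token was OR
    for minus, phrase, word in TOKEN.findall(query):
        term = phrase or word
        if not term:
            continue  # empty quoted string
        if word and not minus:
            if word == "OR":
                # Only meaningful if there is a group to extend.
                join_next = bool(groups)
                continue
            option = OPTION.match(word)
            if option:
                options[option.group(1)] = option.group(2).split(",")
                continue
        if minus:
            negative.append(term.lower())
        elif join_next:
            groups[-1].append(term.lower())
        else:
            groups.append([term.lower()])
        join_next = False
    return options, groups, negative


example = ('test type:forum,story category:1 "tinky winky" OR "dipsy" '
           '-"uh oh" "teletubby bye bye"')
options, groups, negative = parse_query(example)
```

For the example string this yields the options type (forum, story) and category (1), three ANDed groups (test; "tinky winky" OR "dipsy"; "teletubby bye bye") and one negative phrase ("uh oh").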

On a different note, I removed the wildcard matching. An important reason is that there were significant performance problems with leading wildcards. Such queries were not able to use any indices, and the resulting full-table scan took a long time. Even Google does not have intra-word wildcards; theirs can only be used as placeholders for entire words in phrases.

Trailing wildcards, on the other hand, are usually used to accommodate grammatical variations on a word. But wildcards are not really the best tool for this, as they put a burden on the user. If you need this feature, you should instead tie in an algorithm like the Porter Stemmer through the search_preprocess hook.
That way you can reduce related words to a single common root (e.g. "walker", "walking", "walked" to "walk"). The search system will then index and search on the reduced words. You will even benefit from a reduced database size because there are fewer unique words.

Because such algorithms are very language-specific, I didn't build any in. But it should be trivial to make a Porter Stemmer module for Drupal search, which can be used on English sites.
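As a rough illustration of what such a preprocessing step does, here is a small Python sketch. Note that this is a toy suffix stripper, not the Porter Stemmer, and `toy_stem` and `preprocess` are made-up names; a real module would implement the full algorithm inside the search_preprocess hook.

```python
# Toy suffix list, checked in order. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word):
    """Strip one common English suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase the text and reduce every word to its (toy) root."""
    return " ".join(toy_stem(word) for word in text.lower().split())


before = "Walker walking walked walk".lower().split()
after = preprocess("Walker walking walked walk").split()
```

Here the four distinct words collapse into a single root, which is where the smaller index comes from. The same function has to be applied to the search keywords, so that a query for "walking" matches the indexed "walk".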

Database
To implement the above searches, I added a 'search_dataset' table that is independent of the keyword index. Each dataset row contains the entire contents of the indexed item, but filtered, cleaned up and reduced to space-separated tokens (words, numbers, dates, ...). This table is used to resolve the exact conditions, which means the keyword index is not as essential anymore. Because searches are AND by default, the OR method of search_index acts as an initial filter to eliminate the majority of items immediately. That subset is then further reduced through the search_dataset table. All of this means that the search_index table can now be indexed at a much higher minimum word length (e.g. 5), which means a reduced database size. Even with the new dataset table, the net database size shrinks slightly.
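The two-stage idea can be sketched in a few lines of Python. This is only a model of the description above, using in-memory dictionaries as stand-ins for the search_index and search_dataset tables; the function names and data layout are assumptions, and the real patch does this in SQL.

```python
MIN_WORD_LENGTH = 5  # words shorter than this are not put in the index


def build_tables(items):
    """Build stand-ins for search_index and search_dataset."""
    index = {}    # word -> set of item ids
    dataset = {}  # item id -> space-separated tokens, padded with spaces
    for sid, text in items.items():
        tokens = text.lower().split()
        dataset[sid] = " " + " ".join(tokens) + " "
        for token in tokens:
            if len(token) >= MIN_WORD_LENGTH:
                index.setdefault(token, set()).add(sid)
    return index, dataset


def search(index, dataset, words):
    """AND search: OR over the index first, exact check on the dataset."""
    words = [word.lower() for word in words]
    # Stage 1: any indexed word selects a candidate (cheap OR filter).
    candidates = set()
    for word in words:
        if len(word) >= MIN_WORD_LENGTH:
            candidates |= index.get(word, set())
    # Stage 2: every word or phrase must occur in the item's dataset row.
    return sorted(sid for sid in candidates
                  if all(f" {word} " in dataset[sid] for word in words))


items = {
    1: "Drupal theme development guide",
    2: "Drupal core news",
    3: "theme ideas",
}
index, dataset = build_tables(items)
```

In this model a short word such as "core" is still matched in stage 2, but only if some longer word in the same query produced candidates in stage 1.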

I also implemented the searching as two selects into temporary tables. This allows me to avoid doing a costly counting query for the pager and a range-limited query for the actual results. I added support for temporary tables to database.(my|pg)sql. The db api itself takes a normal SELECT and a table name, and turns it into an appropriate platform specific temporary table query (CREATE TABLE ... AS, CREATE TABLE ... SELECT).
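Here is a hedged sketch of the temporary table approach in Python. SQLite is used only so the example runs on its own (the patch targets MySQL and PostgreSQL), the example shows a single temporary table rather than the two the patch uses, and `temporary_table_sql`, `search_hits` and `temp_results` are hypothetical names.

```python
import sqlite3


def temporary_table_sql(select_sql, table, platform):
    """Wrap a plain SELECT in the platform's temporary-table syntax."""
    if platform == "mysql":
        # MySQL form: CREATE TEMPORARY TABLE ... SELECT
        return f"CREATE TEMPORARY TABLE {table} {select_sql}"
    # PostgreSQL (and SQLite) form: CREATE TEMPORARY TABLE ... AS SELECT
    return f"CREATE TEMPORARY TABLE {table} AS {select_sql}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_hits (sid INTEGER, score INTEGER)")
conn.executemany("INSERT INTO search_hits VALUES (?, ?)",
                 [(sid, sid) for sid in range(1, 8)])

# Run the expensive matching query once, into a temporary table.
select_sql = "SELECT sid, score FROM search_hits WHERE score >= 3"
conn.execute(temporary_table_sql(select_sql, "temp_results", "sqlite"))

# The pager count and the current page both read the small temporary table.
total = conn.execute("SELECT COUNT(*) FROM temp_results").fetchone()[0]
page = conn.execute(
    "SELECT sid FROM temp_results ORDER BY score DESC LIMIT 2 OFFSET 0"
).fetchall()
```

The point of the design is that the matching conditions are evaluated once; the count and the range-limited page query then only touch the already-filtered rows.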

I still need to do detailed benchmarking, but at least for the same queries as before, this patch should be faster. Of course, pre-patch, all searches were OR, not AND, so a direct comparison needs to take this into account (the pre-patch query "drupal theme development" is now "drupal OR theme OR development").

One feature request that I did not do is date-based searching (before X/X/X, after X/X/X), mostly because we don't have a good date widget yet. I've been toying with making a simple in-page JS date picker, but it's not done yet and I think the patch is good enough already. Date restrictions can be added on later without any problems.


Comments

Steven’s picture

Oh and in case this wasn't clear, the syntax of putting extra conditions into the search keywords ("type:blog") means that each search result page can be linked to directly. They all have clean URLs: search/node/type:blog+keyword for example.
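A small Python sketch of that round trip, for illustration: the keyword string, extra conditions included, becomes the last part of the path and can be recovered from it. The helper names are hypothetical, and the exact escaping Drupal applies to the path is an assumption here.

```python
from urllib.parse import quote_plus, unquote_plus


def search_path(keys, module="node"):
    """Build a linkable search path; ':' and ',' are left readable."""
    return "search/" + module + "/" + quote_plus(keys, safe=":,")


def keys_from_path(path):
    """Recover the keyword string from a search path."""
    return unquote_plus(path.split("/", 2)[2])


path = search_path("type:blog keyword")
```

Because everything lives in the keyword string, no separate query parameters are needed to reproduce a search from a link.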

Steven’s picture

FileSize
37.05 KB

Sorry, the patch was malformed because wincvs wrapped those really long preg classes :P. Fixed patch attached.

stevryn’s picture

This looks great, can't wait till it's fully ready. I tried trip_search, but couldn't get it to work, and the regular search definitely needed some better features! I would like to test it, but I have no idea how to apply a patch. Can you give me simple, for a Unix dummy, instructions on how to go about it?

Tx
T

webchick’s picture

> Can you give me simple, for a Unix dummy, instructions on how to go about it?

I can help you there, I think. Follow step 2 here if you don't already have a CVS version of Drupal up and running (you can't use this patch against 4.6.2, for example): http://www.planetsoc.com/node/164

Then, switch to your Drupal CVS root directory, for example:

cd ~/drupal-cvs

Use wget to retrieve a copy of the most recent patch (in this case, search_3.patch):

wget http://drupal.org/files/issues/search_3.patch

Execute the following command to apply the patch to your Drupal installation:

patch -p0 -u < search_3.patch

This will patch all the files with the updated search.

Then go through the normal steps you would go through to get a new Drupal system up and running. Step 3 of the aforementioned link has some info on how to get a table prefix going if you want to keep this test version separate from your "normal" Drupal installation.

My problem is I've done all of this, but am still getting strange errors (even on a "normal" unpatched version of the search), so I need to figure out if I have a problem on my end or what's going on.

killes@www.drop.org’s picture

Status: Needs review » Reviewed & tested by the community

@Jeremy: Sorting by two fields does not seem to work.

@Moshe: This code does not rely on the fact that wid is an auto_increment field in any way. Just some concerns did.

killes@www.drop.org’s picture

Status: Reviewed & tested by the community » Needs review

Oops, that comment should have been for another issue.

matt_paz’s picture

Nice addition! It seems to be working great. It would be nice to be able to select which vocabularies and node types are (or aren't) available in the advanced search, or to be able to turn them off altogether. It would also be nice to be able to display the total node count for each type/category in parens.

stevryn’s picture

Thanks webchick! I have it working now. Great work Steven. My live site is 4.6.1; I haven't wanted to take the great leap and update, since the last time I did, it was *not* pretty. I assume this will work with that version once it's completed and submitted? Seriously, the search functionality needs this sort of advanced feature!! Tx for all your work!

Kobus’s picture

Hi!

Like the previous replier to this post, I can't apply patches myself, simply because I don't do it frequently enough and forgot how to do it, but I will catch up with this and test if I get a chance. In principle this is a great patch! Definite +1.

I have another feature (not sure it belongs to this thread, so if not, I apologize) that I would love to see in the search module. That is to provide hooks so that you can create a customized search form, for example, I need the following three field sets in a search form for a property website:

  1. Price range "Start price" -> "End price", using some MIN() and MAX() functions if the user selects the wrong way around. (dropdowns with certain ranges).
  2. Area where the property is supposed to be located in (taxonomy category).
  3. Features that the property MUST have, for example, your requirement would be "TWO BATHROOMS" or "FOUR BEDROOMS". (Search through checkboxes, radios and text fields that were defined in the "property.module" file and database structure.)

(This module is written by an amateur (me), that's why not contributed, but, should there be interest in it, I will contribute it.)

These extra search forms would be way different for each different application, so I don't expect the search module to be able to actually do this, but at least to provide functionality that a coder can write such an extension in his module, in other words, the queries that will define the search, and the form that the user will see, should be definable in the module, and called instead of the default search form if required.

Will this be possible at all?

Regards,

Kobus

Crell’s picture

@Kobus: Actually you can do that now. See the Location module for an example of a fully custom search function.

Kobus’s picture

Hi!

Thanks lgarfiel. I have downloaded the module and will play around with it over the weekend.

Regards,

Kobus

Bèr Kessels’s picture

* My update failed. You had/have a problem in update.inc in the updates/callback array. I hand-modified the database and it works fine now.
* I am not happy with the 5 character default limit. It took me quite some time to find out why my words were not indexed. But that should not hold back this patch. More something to have a closer look at in the future. I think of intelligent auto-blacklisting or so.
* When I do an advanced search, the results are printed in the box: very good! But my advanced form is empty. That is not good usability, imo. That form should represent what I was searching for.
* I still do not like the way results are returned by default: "weblink - Bèr Kessels - 05/08/2005 - 14:20 - 0 comments" is far too much data. And I even have "do not show data for foo" set in the theme settings! Can we please rethink the *default* styles for search results? Without all the CSS bloat, and without all the details? Why not go for a default teaser view?

$node->title = $title; // the name, subject or title of the element
$node->teaser = $content; // the nicely highlighted data of the search result body

That way it will be consistent with the rest of the site, save a *lot* of code and be better themeable too.

* "Your search yielded no results" should be a set_message. Yes, we discussed that it should be in the place where people look for the results. But I tried it, and it is just as visible above the search as below! Really. (see screenie)

Overall, i think introducing a complex search is very nice. And i think this is a great step forward.
But I would much rather see this implemented with a MUCH easier API as well as MUCH easier hooks. Preprocessing as a single hook. What /is/ preprocessing? Form, index, etc. in one hook? Why put preprocess in a single hook, but the others in one huge nodeapi-like construction? What do they do, these nodeapi things? Why do I need them and when do I need them?
I say this because I spent hours and hours with trial-and-error methods to get some form of advanced flexinode search going. The current system is just too hard to grok for an average developer like me.

I know these things are easy to say, but very hard to implement. But as long as we cannot allow searching for the obvious data people see (I am sure I read that that search guru has winamp as favorite, why can't I find his username when I search for winamp) I think we should be careful with extending the search into advanced search.

Can we not first think of a general solution, one that will fix ALL of Drupal's search problems, and then dive into advanced searching, that will make it even more complex?

Steven: I still think you did a marvelous job!

Steven’s picture

> When I do an advanced search, the results are printed in the box: very good! But my advanced form is empty. That is not good usability, imo. That form should represent what I was searching for.

I debated over how to implement this, and I went for the current method of not showing anything in the advanced form.

The first reason is that parsing a query back into the advanced options is not generally possible. For example, you could have more than one phrase in your query, yet the query builder only accommodates one for simplicity. It is a tool, not a complete equivalent to the keyword syntax.

Secondly, there is a conflict between what is in the keywords box and what is in the advanced form. For example, I might build an advanced query, and then start modifying the 'baked in' keyword version. At that point we need to decide which takes precedence, and arguments can be found for both sides. At first I tried to reconcile this by only respecting the advanced controls if you pressed "advanced search", but then there is ambiguity of which button to default to if the user submits the form by pressing enter.

I also thought of making it so that the query builder always adds to the keywords (ignoring duplicates), but that means you can't loosen a query except by manually removing parts of the keywords box /and/ manually unchecking the options in the advanced form.

After trying out the various options, I found that the usability of the current method is the clearest and results in the least amount of inconsistencies perceived as 'buggy behaviour'.

> What /is/ preprocessing? Why put preprocess in a single hook, but the others in one huge nodeapi-like construction?

Preprocessing is what it says: a transformation applied to text before it is inserted in and matched against the search index. You can use it to add in stemming algorithms, soundex as well as word-splitting for non-spaced languages like Chinese and Japanese.

As for why to put it in a separate hook: for performance reasons.

> I am not happy with the 5 character default limit. It took me quite some time to find out why my words were not indexed. But that should not hold back this patch. More something to have a closer look at in the future. I think of intelligent auto-blacklisting or so.

About the limit: it only has any effect if none of the words in your search query is 5 letters or longer. As soon as one word in the query is 5 letters, all results can be found. Judging from drupal.org searches, this was a correct trade-off to make. If you introduce a stemming algorithm through the preprocessor, you can probably afford to set the size limit smaller. But until then, imo 5 is a good default.

About intelligent blacklisting: this is not really possible. The goal of blacklisting is to keep the index size down. But to find out which words to blacklist, you need to include them in some sort of index. Chicken and egg.

> Can we not first think of a general solution, one that will fix ALL of Drupal's search problems, and then dive into advanced searching, that will make it even more complex?

I await your proposals. But until people start getting involved in the architecture rather than just bombarding me with feature-requests and saying that 'search sucks', I can only do things my way.

In fact, the biggest complexity hurdle right now is simply that search is optional. This means that all search operations should go through search.module. Modules with 100% custom searches on the other hand can simply put a tab on "search/module" and do everything their way. But that means that that tab will be there even if search.module is disabled and that these modules cannot take advantage of the various utility functions provided by search.module.

For example, searching profile fields for user search is perfectly possible even without this patch, though it would require that the user search be moved into profile.module. A nodeapi-like system where profile.module can add to user.module's search would only result in huge joins and queries being added to other queries en masse. The code would be even more complex than it is now.

As far as the hook_search ops, I'm sorry but I think they are quite clear. I specifically added comments to node_search to explain each op when this would not be done normally (e.g. everyone is assumed to know that nodeapi('validate') is used to validate extra node fields):

'name': return the name of this search
'search': perform a search
(new) 'form': add form items to the standard 'enter your keywords' search
(new) 'post': process a search form and inject the parameters into the keyword string (needed because of the clean URLs)

And only for index-using modules:
'reset': reset the indexing progress
'status': return status of the indexing progress (% completed)

Bèr Kessels’s picture

Thanks. So let's get the focus back on the patch then, and move my proposal for a new search architecture to a new thread: http://drupal.org/node/28275

I agree about the complexity of re-filling the search box. But still, it somehow feels odd that it's not filled anymore. It's a usability no-no to do that, though your rationale for not filling it makes more sense. Maybe we just need to rethink the interface in general then? Does anyone see any other options?
Putting advanced search in an "advanced" tab could be an option, but that might be a no-no because we hide it away then.

matt_paz’s picture

I have the code running at http://connect.educause.edu

"100% of the site has been indexed."
Minimum word length to index: 4

Nonetheless, I'm getting mixed results.

If I search for 'Hawkins' I get no results ... even though I know (I think) it should be indexing the word hawkins from the body of this content ...
http://connect.educause.edu/blog/mpasiewicz/317

Any ideas?

cyberchucktx’s picture

Looks great! I may give it a shot this weekend ... and since I'm using
CivicSpace I'll be really daring and try it out within that system as well.

I haven't read *all* the comments yet ... but would it be possible to do some
admin work to set up different search types similar to the "Input Formats"
section?

By this I mean that different search types could "lock down" various parts of the
interface to restrict the search to specific categories ...

Charlie

matt_paz’s picture

sorry, the link for the example where the text does exist should have been ...
http://connect.educause.edu/blog/mpasiewicz/podcast_coverage_of_brian_ha...

Jo Wouters’s picture

Careful when using temporary tables: not all hosting companies give their users the necessary privilege.
(Mine doesn't even give me LOCK TABLES privilege.)

"From MySQL 4.0.2 on, you must have the CREATE TEMPORARY TABLES privilege to be able to create temporary tables."(MySQL Reference Manual)

Maybe we should provide a (slower, but working) alternative, so that they don't get an ugly error message.

nedjo’s picture

Nice work Steven. These advanced operands are a great addition.

If you haven't already, you might wish to look at how I've implemented some of these advanced search functionalities in SQL Search (formerly trip search). Specifically:

* While it's handy having advanced as a form group right on the search form, all in all I think it's better to have "advanced" as a separate page. Google is a good guide here. The advanced page feeds into a search results with just the simple form, containing any advanced operands. This encourages users to recognize and use the operands in the simple form, and solves the problem of whether to replicate the advanced elements on the results page.
* If you want to add help text on advanced operands, you'll find some in SQL Search
* By providing a link for advanced search for each content type, it's possible to limit the taxonomy terms to vocabularies associated with that content type.
* Have a look at the filtering within results (by both content type and category). You might want to do something similar, or else I'd be happy to contribute a port of that to core after your patch hits.

In general, the new approach seems to move the core search module closer to what's done in SQL Search--that is, relying on native SQL matching rather than custom indexing. It may make sense to phase out SQL Search, and possibly introduce the options (native MySQL full text indexing, etc.) into the core module.

Steven’s picture

FileSize
43.99 KB

Keeping up to date.

druvision’s picture

Recent results must appear first. Please order the search results by descending date.

Have you tried to search drupal for info lately? Then you will understand what I mean. Tons of results for 2002 and 2003 are mixed with recent results.

Amnon

jsaints’s picture

Title: Advanced search features » Searching taxonomies patch

I have made a patch to the search module to restrict searching within given taxonomies. I thought that this might be of use to you. See http://drupal.org/node/28933

Bèr Kessels’s picture

Title: Searching taxonomies patch » Advanced search features

Do not change the title, unless you have a good reason to!

jakeg’s picture

I can't get the patch to work against HEAD. I get errors about failed hunks, malformed patch at line 443 etc.

I'd really love to see this make 4.7. Also, I agree that search could be ordered by modified date.

m3avrck’s picture

I get a similar error trying to use this patch... line 442 is malformed. I checked that in my diff program and it highlights it weirdly; I bet one of those character encodings is causing a problem. Maybe, because of such extensive work on search.module, it would make more sense to post an *entirely* new file to avoid these problems we are having? Just a thought :)

jakeg’s picture

yeah, would be good if full working copies of the changed files and mysql changes could be posted here.

killes@www.drop.org’s picture

FileSize
43.94 KB

I've updated the patch to current cvs.

Gerhard Killesreiter’s picture

FileSize
43.94 KB

fixed a small formatting glitch.

sepeck’s picture

Testing on CVS as of 10.02.05, Windows OS, IIS6 with no issues.

m3avrck’s picture

updates.inc code won't work with MySQLi ... please reroll using the switch syntax as defined in the comment at the top of the file, this allows for support with MySQL and MySQLi.

killes@www.drop.org’s picture

FileSize
44.02 KB

done

Gerhard Killesreiter’s picture

FileSize
44.03 KB

Dries discovered an error.

Bèr Kessels’s picture

First and foremost: this is not a -1 nor a +1.
This is a general 'but what do I gain from it'. I talked about this to Steven in short already, so my questions might be (for him) a bit awkward, yet I still think they might be important to all of the readers. And yes, I read the patch, but it was too big (err, difficult) for me to really answer these questions.

1) What do I, as module developer gain from this?
2) Will hook_search change?
3) As flexinode lover (and so it seems maintainer), I introduced advanced search in that module. How will that conflict with this patch? How will I be able to use these improvements in flexinode?
4) As an admin and frequent user of a Drupal site: Will I be able to make sure my users find their nodes better? Or do they need even more hoops to jump through to get to the goal: That One Preferred Node?
5) As a user of Drupal.org: will I (finally) be able to find that one manual page about how to get FooBar to work? Or do I still need to resort to Google? Or should I still hope that drupal.org starts using trip_search, one day?
6) How do I, as developer of Drupal sites, and as admin of some sites, use this improvement to get my users to read the posts I want them to read?

Well, 6 is a nasty one, but I (really personally) think that that is the aim of a search: to make sure the admin of a site gets his/her users to the right posts. And certainly not to make searching drupal.org easier (5).

But, hey, again, this is not a - nor a + 1. I like any improvement, so I like this one a lot! I just hope it is an improvement in general, not just one for Drupal core.

killes@www.drop.org’s picture

Ber some short answers:

1) there are no new hooks, so probably not at all.

2) Apparently not. user_search does not change.

3) I don't think you will have to make any changes.

4) That is the idea. If it really does that needs testing.

5) see 4)

6) Simply install it?

Dries’s picture

FileSize
120.52 KB

Looks buggy. See attached screenshot.

killes@www.drop.org’s picture

FileSize
43.86 KB

Sorry, fixed.

Dries’s picture

Uploaded a screenshot to http://flickr.com/photos/dries/49130662/. Feel free to review it.

killes@www.drop.org’s picture

I really like the interface.

The icing on the cake would be if we could have something similar for users too, exploiting the profile fields (or CiviCRM, or ...).

Steven’s picture

Ber, to address your points:

> 1) What do I, as module developer gain from this?

As a module developer: you can now put custom form elements on your own search tab. The only downside at the moment is that each search tab always comes with an "enter your keywords" box. I was unsure of whether to leave it in. It's a bit like a node's title I suppose. Moving it into the modules themselves is not really a problem, it's just a matter of code duplication and whether this flexibility is needed.

I decided on the current approach (each tab has the box) with the idea that a search should always be possible by simply entering keywords. Additional conditions should add to that.

> 2) Will hook_search change?

hook_search gains a few extra subhooks (form, post) which allow you to create and handle custom form fields. The existing hooks don't change, though do_search()'s syntax/arguments do change due to the different queries.

> 3) As flexinode lover (and so it seems maintainer), I introduced advanced search in that module. How will that conflict with this patch? How will I be able to use these improvements in flexinode?

Well, at the same time, not much, and quite a lot. We talked on IRC a couple months ago about whether we should make a search "mega api" where you can extend the node search with custom fields and conditions from other modules. It was decided then that it would be too much work and that it would add way too much complexity to the already complicated system: it would come down to piling up subqueries upon subqueries. Still, many of the features used (e.g. extracting/inserting "foo:bar" arguments) are contained in re-usable function calls, so it is relatively easy to make a customized search in a contrib module.

> 5) As a user of Drupal.org: will I (finally) be able to find that one manual page about how to get FooBar to work? Or do I still need to resort to Google? Or should I still hope that drupal.org starts using trip_search, one day?

After this patch, search.module is pretty much in line with what trip_search does, with only a couple of detail features missing. While building this patch I experimented with doing full-table scans like trip_search does, and concluded that they are simply too slow for a large site like drupal.org, so it is unlikely that it will be installed. I know, we tried it out a long time ago and didn't notice much slowdown, but in the meantime Drupal.org has grown 10 times. Trip_search scales with O(n), search.module with less than that. Also, if we switch over Drupal.org to this patch, I'd like to install a stemming module which will reduce the index size and increase relevancy.

> 4) As an admin and frequent user of a Drupal site: Will I be able to make sure my users find their nodes better? Or do they need even more hoops to jump through to get to the goal: That One Preferred Node?
> 6) How do I, as developer of Drupal sites, and as admin of some sites, use this improvement to get my users to read the posts I want them to read?

This is one issue which is not addressed at the moment.

I think we all know that many people complain that the search does not pay attention to date, so old posts turn up between new posts. A common suggestion is then to sort posts by date. But this is really not a good solution. While newer posts tend to be better than older ones, this is not a rule set in stone.

Take for example a FAQ that gets asked a lot. In that case, we want to favor an older post with actual answers rather than all the repeated questions later. And just because an old book page hasn't been changed in a while doesn't mean it's not relevant anymore.

Furthermore, as you say, the search requirements change from site to site.

The best solution then is to work with different ranking factors, which the admin can assign different weights. You sum them together, and sort by that single result. For example, you could say that "ranking score = relevancy * [3] + freshness * [1]" which means that textual relevancy is three times as important as freshness.

It was really hard to fit something like this in before with the old search queries, but now with the temporary tables I think it's possible. In the admin UI you'd just get a set of fields and weights.

This does mean that only the admin and not the person doing the searching gets to decide, but I don't think it's too bad. Also, the way I currently see the implementation, changing the score weights would not require re-indexing, so you can experiment on test queries easily to make sure you get the desired results.

The only caveat is that we can't just use any database column for this. We need scores that are normalized to a fixed range (e.g. 0..1), otherwise the weighting factors don't represent anything. Raw data, like timestamps or read counts need to be mapped somehow. This is not impossible, but not trivial either (though I have some ideas about this).
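To illustrate what "mapped somehow" could look like, here is a Python sketch of two possible normalizations. These particular formulas are my own assumptions for illustration and are not taken from the patch: a reciprocal decay for timestamps, and a logarithmic scale for counts such as comments or reads.

```python
import math

SECONDS_PER_DAY = 86400.0


def freshness(changed, now, scale_days=90.0):
    """Map a timestamp to (0, 1]: 1.0 for brand new, 0.5 at scale_days."""
    age_days = max(0.0, (now - changed) / SECONDS_PER_DAY)
    return 1.0 / (1.0 + age_days / scale_days)


def count_score(count, max_count):
    """Map a raw count to [0, 1] on a log scale, relative to max_count."""
    if max_count <= 0:
        return 0.0
    clamped = min(max(count, 0), max_count)
    return math.log1p(clamped) / math.log1p(max_count)


now = 1_130_000_000  # arbitrary example timestamp
scores = {
    "new post": freshness(now, now),
    "90 days old": freshness(now - 90 * 86400, now),
    "10 of 100 comments": count_score(10, 100),
}
```

Both functions stay within a fixed range regardless of how old a post is or how large the counts get, which is what makes the weights comparable across factors.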

Steven’s picture

FileSize
50.34 KB

Okay, I went ahead and implemented a first version of the advanced ranking to see how it goes. The only tricky part is mapping each factor to a nice 0...1 interval with a sensible distribution.

The admin bit looks like this:
(Image not shown: only local images are allowed.)

What this means is: each result is assigned 4 scores, one for each factor, in the range 0..1. Each score is then multiplied by its factor's weight and the products are summed per item to get an overall score, which the results are sorted on. Note that the rank is a unitless number which is only used internally and whose scale depends on the coefficients set.

E.g. in this setup, a post with relevance 0.23, freshness 0.8, comment ranking 0.4 and view ranking 0.2 scores 10*0.23 + 3*0.8 + 2*0.4 + 1*0.2 = 5.7.

Setting all weights except one to zero simply sorts by the non-zero factor. Increasing the weight of a factor means the results will be sorted more according to it, and vice versa.
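The arithmetic can be checked with a few lines of Python. This only models the weighted sum described above (the patch itself computes it in SQL); the dictionary keys and the second example post are made up for illustration.

```python
def rank_score(scores, weights):
    """Weighted sum of per-factor scores, each expected in 0..1."""
    return sum(weight * scores.get(name, 0.0)
               for name, weight in weights.items())


weights = {"relevance": 10, "freshness": 3, "comments": 2, "views": 1}

post_a = {"relevance": 0.23, "freshness": 0.8, "comments": 0.4, "views": 0.2}
post_b = {"relevance": 0.50, "freshness": 0.1, "comments": 0.0, "views": 0.9}

score_a = rank_score(post_a, weights)  # 2.3 + 2.4 + 0.8 + 0.2
score_b = rank_score(post_b, weights)  # 5.0 + 0.3 + 0.0 + 0.9

# With every weight but one set to zero, sorting follows that one factor.
only_freshness = {"relevance": 0, "freshness": 1, "comments": 0, "views": 0}
by_freshness = sorted([("a", post_a), ("b", post_b)],
                      key=lambda item: rank_score(item[1], only_freshness),
                      reverse=True)
```

With the weights from the example, post_a comes out at about 5.7 and post_b at about 6.2, so post_b ranks first; sorted by freshness alone, post_a wins.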

Now, currently the different possible factors are hardcoded in node_search() (checking for the presence of resp. comment and statistics module for the last two factors). In theory we could make a hook for this, but it would be ugly (a hook called from a node_search, which is an instance of hook_search already): all sorting is done in SQL, so you get ugly pieces of code where an SQL query, its joins and its arguments are assembled.

It works pretty well. Obviously it is a little bit slower than before if you use more than one non-zero factor, because extra joins are made. But the actual sorting happened before as well, it simply calculates a slightly more complicated key now.

Steven’s picture

Bah, crappy project format bug :P

Image:
http://acko.net/dumpx/searchranky.png

Steven’s picture

Here's Drupal HEAD with this patch. Note that commenting is broken in HEAD, and that I left in a tiny bit of debug code which makes the scores appear. But it's 7am, and this puppy's off to bed. :P

http://acko.net/dumpx/searchpatched.zip

Tobias Maier’s picture

I have just one comment on the screenshot on Dries' flickr page:
This "Only of the Type" search may confuse the user.
Most Drupal administrators don't even know (or see) the difference between story and page.
How are my visitors supposed to see it?
It would be nice if we were able to disable them (or configure them: merge content types together, rename them for the user, disable them...)

cu tobias

Tobias Maier’s picture

It would also be good if I could disable the category view.

It would be good if I could use the menu for my search
(especially when "Integrate primary links with menu navigation" gets committed),
for example "every page under 'Products'".

Whether this is possible at the moment is another question.
Maybe we would have to assign an (auto-generated) taxonomy term to every menu item, which is the phrase we are looking for...

Dries’s picture

The category view doesn't work for me either. It gets really long, especially if you enabled free tagging. FactoryJoe is right when he says search should be context sensitive. It makes it a _lot_ faster and a _lot_ easier to search forums. Fortunately, that should be fairly easy to accomplish with this patch. The question becomes: how to integrate context sensitive search in the forum module UI, etc ...

Dries’s picture

As for order/relevance of search results: IMO it is the _user_ who wants to control the search order, not the administrator. Search engines like http://blogsearch.google.com/ feature two links in the top-right corner: "sort by relevance" and "sort by date". Depending on what I'm looking for, I switch between them. It might be too complex to implement though.

Dries’s picture

Tobias: yes, the node type checkboxes are confusing for the average user. They require the user to understand the underlying implementation details of the website in order to use them. I suggest removing these.

Kobus’s picture

I don't think removing these options is the solution, unless you can weight the returned search results (invisible to the user) such that it suits the particular site. For example, on one site the administrator may put his content in books while users submit stuff by means of stories, so the order would be set to "books, stories, forums, comments". Another admin may not need books, and support is done mostly on the forums, so the order could be "forum, books, stories, comments".

If you simply remove them without the above plan, comments and nodes will be mixed. Although comments often have great tidbits of information, most of them are useless or just plain ole ranting, which makes them less valuable in the search results.

If the above plan is not possible/feasible, then you can, instead of removing the items "Story", "Page", blah blah, rather make them into groups, namely, Content, Forums, Files, Comments, or whatever, which will be more understandable to users. I am using the term "Content" to describe the node system because I lack a better word for it, and I am aware that forum postings and comments are also content, but this is only to show a general thought I have. Content (or the term finally used) can describe a collection of nodes, such as stories, pages, book pages, and so on.

Furthermore, after a further evaluation of the screenshot on flickr, I have the following comments:

The screenshot shows a problem also mentioned by factoryjoe on the page. When I viewed the screenshot, I thought - "I wonder how many categories are there... It sure looks like a lot!".

That is all (besides the comment about the content types above) I could really see that is bothering me, the latter to a lesser degree, as this will not be a big problem on smaller sites, which is what most of my sites and many Drupal sites in general are.

My proposed solution to this: put the vocabularies in one dropdown list, and upon selecting a vocabulary, fill the category dropdown with the terms for that vocabulary. This, of course, comes at a price. The price being:

  1. An extra click in the UI (probably negated by the fact that I don't have to search that long to find what I am looking for).
  2. Not sure about this one - depending on Drupal's architecture, would the page need to be reloaded after selecting the category? Can AJAX be used to overcome this?
  3. If AJAX is used, can it be degraded if the client machine doesn't support it?

I will attempt to test the functionality after this.

Regards,

Kobus

Bèr Kessels’s picture

Steven, I really like your ranking idea.
But, (there is always a but) should we not rather store a rank inside a node?
$node->rank = 0 ... 100 is and will be VERY useful for a lot of modules and cases (teaser lists, search results, title lists, forum lists, galleries, etc.).
You can then always still add your keyword ranking to that in the search.

It would be a pity if such a useful variable were accessible only to search.

Bèr

Tobias Maier’s picture

Steven, I really like your ranking idea.

Me too :D

But, (there is always a but) should we not rather store a rank inside a node?

But then we should be able to preconfigure a value for the specific node type and/or taxonomy.
gsitemap already has a ranking for every node.
Maybe we should adopt this...

kbahey’s picture

I like the ranking improvements you made. This is a step in the right direction on making search Googly ...

As for the node types, they can be confusing to the novice, but remember this is advanced search anyway, and is collapsible, so I say leave it in. This will come in handy for things like the job search module, where employers can search nodes of type resume, while job seekers can search job postings ..

(Aside: We need to do something about the node names, book is not really a book, but a hierarchy/outline. Story is not a story, ...etc.)

Kobus’s picture

Lo and behold! A non-techy has tested this patch! It can be done!

I think this patch is marvelous! Besides my comments in my previous post (#49), I think this is just dandy!

Regards,

Kobus

Steven’s picture

Dries: I'm not sure that ranking should be in the hands of the user. Providing a user choice is really a solution to the problem of having to sort on multiple fields. My weighting approach is another. As mr "drupal.org's #1 problem" pointed out: people want to type in search terms and find the answer immediately.

The type selection is indeed confusing if there is 'page', 'story' and 'blog'. But on the other hand, on drupal.org it would be massively useful, because it would list: "forum topic", "book page", "project", "issue", ... Same for e.g. flexinodes.

As far as the taxonomy selection goes, I suppose we could move to individual selectors per vocabulary, but it would only make the form bigger and more complicated.

Also, keep in mind that this patch is primarily an algorithmic update. The UI is really an issue on its own, and it seems a single, non-configurable UI isn't going to cut it for most people. In that case I would like to push to get at least the basic functionality in, and make it prettier/more usable later.

Ber: the node ranking is dynamic, so it is hard to store. The relevancy (obviously) depends on the search keywords. Freshness is a continuously decaying value: what is fresh today will score slightly less tomorrow. Comment/view scores depend on the popularity of your site. On drupal.org, a 10 comment thread will not score as much as e.g. on a small blog, where discussion is rare. Same for view counts. This is all part of the problem of mapping a bunch of arbitrary data into neat 0..1 values.
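To make the mapping problem concrete, here is one possible sketch in Python. It is not the formula used in the patch: the exponential half-life, the saturating ratio, the function names and all constants are assumptions chosen only to show how a decaying timestamp and a site-relative count can each land in 0..1.

```python
def freshness(age_days, half_life_days=180.0):
    """1.0 for a brand-new post, halving every half_life_days (assumed constant)."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)

def popularity(count, site_typical):
    """Saturating map into 0..1 that reaches 0.5 when count equals the
    site's typical count (a hypothetical per-site scale)."""
    if count <= 0:
        return 0.0
    return count / (count + max(site_typical, 1))

# What is fresh today scores slightly less tomorrow.
print(freshness(0), freshness(1) < freshness(0))  # 1.0 True

# The same 10-comment thread counts for less on a busy site than on a quiet blog.
busy_site = popularity(10, site_typical=50)
small_blog = popularity(10, site_typical=2)
print(busy_site < small_blog)  # True
```

The per-site scale is what keeps the score meaningful across sites of different sizes; how to pick it (a maximum, an average, something else) is left open here.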

Right now, the search APIs are a bit complicated because the search has outgrown the old 4.5- model. It is a topic I would like to discuss at DrupalCon Amsterdam.

Kobus: ranking node types is a possibility, but the problem lies in how to set it up. Do we simply provide a "favor these node types" set of checkboxes? An ordered preference list? It has to fit into an SQL expression, which might become a bit dirty: IF(n.type IN ('forum', 'book'),1.0,0.0) is still manageable, but IF (n.type = 'forum', 1.0, IF(n.type = 'book', 0.8, IF(n.type ...... ))) gets complicated real fast. Oh, and search has been counting comments as part of the node since 4.6.
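As a rough illustration of the per-type preference idea, the sketch below generates the expression from a preference map instead of nesting IF() calls by hand. It uses a CASE expression and an in-memory SQLite database as stand-ins for the MySQL IF() above; the helper name, the minimal node table and the scores are all hypothetical.

```python
import sqlite3

def type_rank_sql(preferences):
    """Build a CASE expression and its bind values from {node_type: score}.
    Types that are not listed score 0.0."""
    if not preferences:
        return "0.0", []
    clauses = " ".join("WHEN ? THEN ?" for _ in preferences)
    args = [value for pair in preferences.items() for value in pair]
    return f"CASE n.type {clauses} ELSE 0.0 END", args

# Minimal mock of a node table, just enough to run the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (nid INTEGER, type TEXT)")
conn.executemany("INSERT INTO node VALUES (?, ?)",
                 [(1, "story"), (2, "forum"), (3, "book")])

expr, args = type_rank_sql({"forum": 1.0, "book": 0.8})
rows = conn.execute(
    f"SELECT n.nid, {expr} AS type_score FROM node n "
    "ORDER BY type_score DESC, n.nid",
    args,
).fetchall()
print(rows)  # [(2, 1.0), (3, 0.8), (1, 0.0)]
```

Generating the expression keeps the SQL flat however many types are listed, but it does not answer the UI question of how an administrator would enter the preferences.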

sepeck’s picture

The problem with removing the node search boxes is that on some sites people are familiar with them and they help narrow down your search... Drupal.org's handbook for instance.

Steven’s picture

FileSize
61.43 KB

Okay, here's a patch that is IMO ready to go in.

Changes since last time:

  • Converted to the new form API.
  • Added normalization for the relevancy factor in ranking.
  • The node-link tracker is now even better... if someone posts a link to a node using the URL as the link text, it will recognize this, and fetch the node's title to boost rather than the URL itself (which is usually useless). I also got rid of some false positives where sometimes an external link would still be considered as an internal one.
  • I discovered that badly formed HTML was making the word ranking inaccurate on Drupal.org (e.g. an unclosed tag marked everything after it as important). I improved the tag counter so it recognizes such cases and adapts.
  • I tweaked the ranking algorithm more so that really long threads don't get an unnatural advantage. The further down a word is along a page, the smaller its score is.
  • I improved the first pass of the search to reject more nodes early. The difference is huge for searches with noise words.
  • The index now explicitly includes numbers and strips away leading zeroes.
  • To improve the quality of search snippets, the node's comment bodies are now also fetched for each result. It is a great improvement in usability, but it does mean a large increase in queries (because each comment body needs to be fetched from the filter cache). It remains to be seen whether this will work on Drupal.org, but we can disable it if needed.
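To illustrate the number handling mentioned in the list above, here is a small Python sketch of indexing numeric tokens with their leading zeroes stripped. The tokenizer regex and both function names are assumptions for illustration, not the code in the patch.

```python
import re

def normalize_number(token):
    """Strip leading zeroes from purely numeric tokens; leave other tokens alone."""
    if token.isascii() and token.isdigit():
        return token.lstrip("0") or "0"  # "000" should still index as "0"
    return token

def index_tokens(text):
    """Very simplified tokenizer: lowercase ASCII words and numbers only."""
    return [normalize_number(t) for t in re.findall(r"[a-z0-9]+", text.lower())]

tokens = index_tokens("Issue 00036242 in release 007")
print(tokens)  # ['issue', '36242', 'in', 'release', '7']
```

With this kind of normalization a search for 36242 and a search for 00036242 hit the same index entry, provided the query keywords go through the same function.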

I also made a stemming.module which is a wrapper around the Porter-Stemmer algorithm. It reduces English words to their root (blogging/blogger/blogs -> blog), which we can install on Drupal.org to make the index smaller, faster and more accurate.

Steven’s picture

Oh two more notes:

  • As far as UI goes, me and Dries agreed that there is too much discussion about this to get something better done in time for 4.7. As it is now, the search UI is not perfect, but it is definitely acceptable. We can improve it in the future.
  • I tested this patch extensively on a mirror of Drupal.org, comparing the old with the new. The difference is definitely noticeable, especially after the last round of improvements.
moshe weitzman’s picture

Status: Needs review » Fixed

this patch has landed. please open new feature requests or bugs as needed.

Thomas Sewell’s picture

For reference, a minor bug was introduced by this patch. See http://drupal.org/node/36242.

Anonymous’s picture

Status: Fixed » Closed (fixed)
rjung’s picture

Version: » 4.6.3

Is it possible to backport this advanced search to Drupal 4.6 or 4.5? Or is it strictly a CVS/4.7 feature?

chx’s picture

Version: 4.6.3 » x.y.z
Category: feature » bug
Priority: Normal » Minor
Status: Closed (fixed) » Reviewed & tested by the community
FileSize
785 bytes

I suspect Steven copy-pasted this line from a multiple select (maybe the taxonomy element two lines above) and forgot to delete the #multiple => TRUE. No harm in it being there, but it's dead code.

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed. Thanks.

Anonymous’s picture

Status: Fixed » Closed (fixed)