I was reading the blog post http://pingv.com/blog/announcing-apache-solr-pages and the last comment by Robin Monks (6 April 2011 - 8:21am). He states that they shied away from search_api for performance reasons. What might those reasons be? Is there a lot of database traffic on the search / facet pages?

I was using the apachesolr module before switching to search_api. I did this because of the Views integration and the ability to create custom search / facet pages. But for a bigger project coming up, performance will be key.

I would just like to have your thoughts on this. Thanks.

Cheers,

David

Comments

drunken monkey’s picture

This is a criticism I've heard several times already; most notably, Peter Wolanin from the apachesolr module doesn't tire of repeating it. The claimed issue is that, while apachesolr loads the result data directly from Solr without touching the database, the Search API will issue an entity_load() for the results to get their data. (Also, some specific features will load additional entities, but the main issue is with the results.) This is of course a bit slower, as both the Solr and the database server have to be queried to get the data, and entity_load() may carry a lot of additional overhead.
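In rough pseudo-form, that extra step looks like this (the 'node' type and the variable names are just for illustration):

// Roughly what happens per page of results: Solr returns the item IDs,
// and the entities are then loaded, by default from the database.
$ids = array_keys($results['results']);
$entities = entity_load('node', $ids);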
There are, however, very good reasons for doing this. Entities that are loaded will always be up to date, will invoke Field API magic (including support for translations) as well as other hooks correctly, etc. The Solr index contains only one version of each item, the one that got indexed, and won't invoke all those entity and Field API hooks that make up part of the flexibility of Drupal 7.
Most importantly, however, any site that faces performance issues in Drupal 7 should install some kind of entity cache anyway. With well-configured entity caching, the performance disadvantage will pretty much vanish. Also, an entity cache implementation will (or at least should) take care of all those pitfalls of entities and fields, and invalidate / reload cached entities when appropriate. What the apachesolr module does is, in principle, implement such an entity cache, but without taking care of that flexibility, and using an underlying system that wasn't built for that purpose.
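To illustrate the principle (the contributed entitycache module is the real solution; the mymodule_ names here are invented), such a cache boils down to:

/**
 * Sketch only: load entities through the cache system, falling back to
 * entity_load() for anything not cached yet.
 */
function mymodule_cached_entity_load($type, array $ids) {
  $entities = array();
  $missing = array();
  foreach ($ids as $id) {
    if ($cached = cache_get($type . ':' . $id)) {
      $entities[$id] = $cached->data;
    }
    else {
      $missing[] = $id;
    }
  }
  if ($missing) {
    foreach (entity_load($type, $missing) as $id => $entity) {
      cache_set($type . ':' . $id, $entity);
      $entities[$id] = $entity;
    }
  }
  return $entities;
}

/**
 * Implements hook_entity_update().
 *
 * Invalidates the cached copy whenever an entity is saved, so stale data
 * is never served.
 */
function mymodule_entity_update($entity, $type) {
  list($id) = entity_extract_ids($type, $entity);
  cache_clear_all($type . ':' . $id, 'cache');
}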

I've also discussed this already with some other people, fago for instance, who also agreed with me on this. When he gets back from his vacation, maybe we can get him to chime in here and add some additional arguments or elaborate on mine.

Then I could maybe even link this issue on the project page, so this myth gets debunked once and for all. ;) Or we have at least a central place to discuss this.

davidseth’s picture

Thanks for that.

In response to:

Entities that are loaded will always be up-to-date, invoke Field API magic (including support for translations) as well as other hooks correctly, etc.

I understand why this is good, but could there be some advanced setting to turn this off? If my content is read-heavy with far fewer writes, I would be happy to potentially show content that is a few minutes out of date and reap the performance increase: essentially using the results straight from my Solr index and serving them to Views without any intervention. This could be configured on a view-by-view basis.

drunken monkey’s picture

I still think a dedicated entity cache would be better suited for the task.

However, the basic building blocks for doing this are already there and should be easily usable by a custom module. What would have to change in any case is the schema.xml, because you have to set the needed fields to stored="true". Then you just have to alter the Solr request to contain "fl=*,score", load the fields into an object in $return['results'][$i]['entity'], and you should be good to go. You can also do this conditionally, e.g. when Views sets a certain query option. You'd then also have to implement the necessary alteration to Views, so it exposes that option to admins.
On the presentation end, however, the Search API already defines that the 'entity' key in results can be used to return the already loaded entity, so you shouldn't have any problems there. (As long as you set all fields that are needed by the view, of course.)
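A rough sketch of those two pieces (the mymodule_ names and the 'mymodule_direct' query option are invented; where exactly you put the loaded objects into the results depends on how you hook into the result processing):

/**
 * Implements hook_search_api_solr_query_alter().
 *
 * Sketch only: when the (invented) 'mymodule_direct' query option is set,
 * ask Solr to return all stored fields instead of just the item IDs.
 */
function mymodule_search_api_solr_query_alter(array &$call_args, SearchApiQueryInterface $query) {
  if ($query->getOption('mymodule_direct')) {
    $call_args['params']['fl'] = '*,score';
  }
}

/**
 * Sketch only: build a stub object from one returned Solr document, to be
 * placed in $return['results'][$i]['entity']. Assumes the fields were
 * indexed with stored="true".
 */
function mymodule_result_to_stub(array $doc) {
  $entity = new stdClass();
  foreach ($doc as $solr_field => $value) {
    // With a dynamic-field schema, strip the type prefix (e.g. "ss_title"
    // becomes "title") to recover the original property name.
    $property = preg_replace('/^[a-z]+_/', '', $solr_field);
    $entity->$property = $value;
  }
  return $entity;
}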

All in all, the necessary module shouldn't take more than two or three hours to code, and could probably be written generically enough to be useful for all sites where the site builder wants this (i.e., you could then release the module on d.o).

davidseth’s picture

This sounds very promising indeed. I have been thinking about it quite a bit, and it would actually allow search_api along with Solr to work with essentially any arbitrary Solr schema. An entity would then be created in Drupal with whatever fields the user wants to query on and create Views on. The next step would be to create a mapping of the Drupal entity and its fields to the respective Solr fields.

Then, as you said, a conditional flag in the Views admin screen is set. If set, the mapping would be used to map the returned Solr fields onto the Drupal entity.
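Something like this hypothetical mapping (all field names invented) is what I have in mind:

// Declared by the site builder: Drupal entity properties / fields on the
// left, arbitrary Solr schema fields on the right.
$mapping = array(
  'title'     => 'product_name_s',
  'created'   => 'release_date_dt',
  'field_sku' => 'sku_s',
);

// Applying the mapping to one returned Solr document ($doc).
$entity = new stdClass();
foreach ($mapping as $drupal_field => $solr_field) {
  if (isset($doc[$solr_field])) {
    $entity->$drupal_field = $doc[$solr_field];
  }
}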

Am I anywhere near correct in this? If so, I would be very keen to start work on this module.

Cheers,

David

drunken monkey’s picture

Yes, this sounds about right. I'm not sure whether a mapping (beyond the one that's already there, implicitly) is actually needed, though. Just setting all returned fields on an object (with appropriate handling for nested fields) should work alright, I guess.

Oh, but that's only for working with the provided schema. For working with an arbitrary schema, I think that considerably more work would be needed, as you'd have to rewrite practically all requests to Solr. And yes, then you'd of course also need your own mapping.

drunken monkey’s picture

Issue tags: +search_api performance

pwolanin’s picture

@drunken monkey - re: #1: the way apachesolr tries to treat data is *potentially* going to give better performance, but I see it more as a different philosophical approach. We have always considered multi-site and federated search as use cases for apachesolr, and those use cases are more feasible if you avoid, as much as possible, relying on the local DB to give you the data for processing and displaying your result set.

dawehner’s picture

One key reason for loading the entity hasn't been mentioned yet: you actually need the loaded entity to display field data.
For simple fields this might be possible without it, but as soon as you use something complex like Drupal Commerce, you are lost without the $entity.

Damien Tournoud’s picture

Status: Active » Fixed

Loading the full entities is always the right thing to do. Massaging database data directly is just gross.

pwolanin’s picture

@DamZ - see #7 - the idea is not to touch the DB at all, if possible.

drunken monkey’s picture

@DamZ - see #7 - the idea is not to touch the DB at all, if possible.

This is certainly a valid approach for most data, but when discussing Field API fields I don't think that this is really possible. In theory, you could of course store the rendered field in Solr, but then you lose the flexibility of using configurable field formatters. Those always take the entity as a parameter, so there is really no proper way around loading it.
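For example, even rendering a single field through a formatter in D7 requires the loaded entity (the node ID and formatter here are only illustrative):

// field_view_field() needs the full $node, not just the stored values,
// before a formatter such as 'text_summary_or_trimmed' can run.
$node = node_load(1);
$output = field_view_field('node', $node, 'body', array(
  'type' => 'text_summary_or_trimmed',
));
print drupal_render($output);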

And with #1154116: Search API Solr retrieving search results data directly from SOLR, avoid going through MySQL we now have the option of retrieving the data from Solr in other cases anyway, so I think Damien is right in marking this as "fixed".

Automatically closed -- issue fixed for 2 weeks with no activity.