So I'm keeping track of node download counts in my Solr index (similar to http://drupal.org/node/1149398) for sorting purposes. For these counts to stay relevant, the nodes need to be reindexed on a regular basis. My question is how to do this, and what the best implementation would be.

I run cron every ten minutes, and I can index 50 nodes per cron run. To keep things simple, I'm thinking of just looping through all nodes and pushing, say, 30 for reindexing every cron run (on top of whatever nodes have already been queued because they've been created or updated). Not an ideal solution, because with thousands of nodes it could take a couple of days to loop through and reindex all of them, so download spikes wouldn't be reflected very well in search results.

Another option would be to have a table keep track of which nodes have had the most downloads (or the highest percentage change) since the last reindexing, and push 30 nodes from the top of that list every cron run. This method would provide more relevant results, but it would add some overhead.

A third option would be to update the Solr download_count for a given node every time it is downloaded, but I'm not sure whether it's possible or efficient to update a single field in the Solr index (and these updates could be coming at a rate of hundreds per minute)...

Regardless of which method makes the most sense to implement, I'm fairly stumped as to how to go about it in terms of programmatically queuing nodes for reindexing. I've skimmed through the API and the apachesolr indexing functions, but haven't found what I need. Something like apachesolr_index_mark_for_reindex() seems relevant... but if I understand correctly, that marks all entities of a type for reindexing, which is too broad for my needs... I feel like I'm missing something obvious here :o

Any help is very much appreciated!

Comments

Nick_vh’s picture

$indexer_table = apachesolr_get_indexer_table($type);

// If we haven't seen this entity before it may not be there, so merge
// instead of update.
db_merge($indexer_table)
  ->key(array(
    'entity_type' => $type,
    'entity_id' => $id,
  ))
  ->fields(array(
    'bundle' => $bundle,
    'status' => 1,
    'changed' => REQUEST_TIME,
  ))
  ->execute();

In essence, the indexing algorithm takes all nodes whose changed timestamp is newer than the last-indexed position recorded by apachesolr_get_last_index_position($env_id, $entity_type), limited by the number it can process per cron run. If you update the changed timestamp to the current time, the node will be queued for indexing.
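
As a rough sketch of how that could be applied to the download-count case (the {mymodule_download_counts} table with nid and downloads columns, and the module name mymodule, are hypothetical), a cron hook could bump the changed timestamp for the most-downloaded nodes each run:

/**
 * Implements hook_cron().
 *
 * Sketch only: queues the 30 most-downloaded nodes for reindexing by
 * bumping their changed timestamp in apachesolr's indexer table.
 */
function mymodule_cron() {
  $indexer_table = apachesolr_get_indexer_table('node');

  // Pick the 30 nodes with the most downloads since their last reindex.
  $nids = db_query_range('SELECT nid FROM {mymodule_download_counts} ORDER BY downloads DESC', 0, 30)->fetchCol();

  foreach ($nids as $nid) {
    // A changed value newer than the last index position gets the node
    // picked up again on the next indexing run.
    db_update($indexer_table)
      ->fields(array('changed' => REQUEST_TIME))
      ->condition('entity_type', 'node')
      ->condition('entity_id', $nid)
      ->execute();
  }
}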

JordanMagnuson’s picture

Status: Active » Fixed

Ah, okay, thanks.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Paul Kim Consulting’s picture

Status: Closed (fixed) » Active

Is there a way to avoid using the changed timestamp as the trigger for reindexing? The reason is that some user-facing actions (a rating, for example) update the entity/node and require reindexing, but we don't want to change the node's last-updated timestamp during those actions (a view on the site might be sorted by last-updated timestamp, for instance).

Any objections to hooking into hook_node_update() and storing all updated nodes in a table that the module uses for reindexing purposes?
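
For what it's worth, a minimal sketch of that idea (mymodule and the {mymodule_reindex_queue} table are hypothetical; the table would just need nid and queued columns):

/**
 * Implements hook_node_update().
 *
 * Sketch only: records updated nodes in a custom queue table so they can
 * be pushed to apachesolr later, without touching the node's own changed
 * timestamp.
 */
function mymodule_node_update($node) {
  db_merge('mymodule_reindex_queue')
    ->key(array('nid' => $node->nid))
    ->fields(array('queued' => REQUEST_TIME))
    ->execute();
}

A separate hook_cron() implementation could then read from that table and push those nodes into apachesolr's indexer table.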

Paul Kim Consulting’s picture

Never mind, I see that we can just insert into the apachesolr_index_entities_node table. It would be nice if this were an API function though :)

ianthomas_uk’s picture

Status: Active » Fixed

Yes, it's the last updated date in apachesolr's queue of entities to index, rather than on the entity itself.

Seems you've solved your problem, so I'll close this.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

JordanMagnuson’s picture

For future reference, the snippet in #1 isn't quite correct (it does not provide values for entity_type and entity_id in the insert case)... I believe it should be:

  // Queue node for reindexing by apachesolr.
  $type = 'node'; // Or whatever type you want to index.
  $indexer_table = apachesolr_get_indexer_table($type);
  db_merge($indexer_table)
    ->key(array(
      'entity_type' => $type,
      'entity_id' => $id,
    ))
    // insertFields acts only on insert.
    ->insertFields(array(
      'entity_type' => $type,
      'entity_id' => $id,
      'bundle' => $bundle,
      'status' => 1,
    ))
    // fields() acts on insert or update.
    ->fields(array(
      'changed' => REQUEST_TIME,
    ))
    ->execute();

jmehta’s picture

Hi Jordan, is this applicable to Drupal 6? And where can I get $id and $bundle from?

thanks

JordanMagnuson’s picture

Not sure about Drupal 6, but the entity id should be the node id of the node you want to index, and the bundle should be the node type (e.g. 'article').
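
In Drupal 7, for example, given a loaded node object, a minimal illustration:

  // $node as returned by node_load().
  $id = $node->nid;       // The entity id is the node id.
  $bundle = $node->type;  // The bundle is the node type, e.g. 'article'.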

HiMyNameIsSeb’s picture

Hopefully someone will find this useful.

You can re-index certain nodes on any node activity using Rules. You could create a custom rule using the Rules API for whatever you want (see the sketch after the steps below).

As standard though, Rules will let you re-index on events it already defines, e.g.:

- Comment created on node
- Node viewed
- Node flagged ...etc...

For the event, select one of the above or an event of your choice.
For the trigger, select 'Search API - index node'.
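
If the predefined events and actions don't cover your case, here is a hedged sketch of a custom Rules action (Drupal 7 Rules API; mymodule and the action name are placeholders) that queues a node for apachesolr reindexing using the same db_merge() pattern shown earlier in this thread:

/**
 * Implements hook_rules_action_info().
 */
function mymodule_rules_action_info() {
  return array(
    'mymodule_queue_node_for_solr' => array(
      'label' => t('Queue node for Apache Solr reindexing'),
      'group' => t('Search'),
      'parameter' => array(
        'node' => array(
          'type' => 'node',
          'label' => t('Node'),
        ),
      ),
    ),
  );
}

/**
 * Rules action callback: bump the node's changed timestamp in the
 * apachesolr indexer table so it gets picked up on the next cron run.
 */
function mymodule_queue_node_for_solr($node) {
  $indexer_table = apachesolr_get_indexer_table('node');
  db_merge($indexer_table)
    ->key(array(
      'entity_type' => 'node',
      'entity_id' => $node->nid,
    ))
    ->insertFields(array(
      'entity_type' => 'node',
      'entity_id' => $node->nid,
      'bundle' => $node->type,
      'status' => 1,
    ))
    ->fields(array(
      'changed' => REQUEST_TIME,
    ))
    ->execute();
}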

phponwebsites’s picture

How can this be done in Drupal 7?

subhojit777’s picture

The code snippet in #1 is working, but there is a problem. When I go to the indexing page, it shows the number of items remaining to be re-indexed. After I reindex the content, the remaining count does not change to zero; it still shows the same number.

subhojit777’s picture

I found the API function apachesolr_index_mark_for_reindex($env_id, $entity_type = NULL), but this will mark all entities for reindexing.
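
If you only need specific nodes requeued, a narrower alternative (a sketch based on the indexer-table approach discussed above; $nid is a placeholder for the node id) is to bump just that node's changed value instead of marking everything:

  // Requeue a single node rather than every entity.
  db_update(apachesolr_get_indexer_table('node'))
    ->fields(array('changed' => REQUEST_TIME))
    ->condition('entity_type', 'node')
    ->condition('entity_id', $nid)
    ->execute();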