So I'm keeping track of node download counts in my Solr index (similar to http://drupal.org/node/1149398) for sorting purposes. Obviously, for these counts to stay relevant, the nodes need to be reindexed on a regular basis. My question is how to do this, and what the best implementation would be.
I run cron every ten minutes, and I can index 50 nodes per cron run. To keep things simple, I'm thinking of just looping through all nodes and pushing e.g. 30 for reindexing every cron run (on top of whatever nodes have already been queued because they've been created or updated). That's not an ideal solution: with thousands of nodes, it could take a couple of days to loop through and reindex all of them, so download spikes wouldn't be reflected very well in search results.
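To make the round-robin idea and its timing concrete, here is a rough, language-neutral sketch in Python (not Drupal code). The function names and the 8,640-node figure are assumptions chosen for the example; only the batch size of 30 and the ten-minute cron interval come from the description above.

```python
import math

def next_batch(node_ids, cursor, batch_size=30):
    """Return (batch, new_cursor), wrapping around at the end of the list."""
    if not node_ids:
        return [], 0
    n = len(node_ids)
    count = min(batch_size, n)
    # Take `count` ids starting at the cursor, wrapping past the end.
    batch = [node_ids[(cursor + i) % n] for i in range(count)]
    return batch, (cursor + count) % n

def hours_for_full_cycle(total_nodes, batch_size=30, cron_minutes=10):
    """Hours needed to push every node through once at this batch rate."""
    runs = math.ceil(total_nodes / batch_size)
    return runs * cron_minutes / 60

# Hypothetical site with 8,640 nodes: 288 cron runs, i.e. 48 hours per cycle.
print(hours_for_full_cycle(8640))
```

The cursor would have to be stored between cron runs (for example in a persistent variable) so that each run continues where the previous one stopped.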
Another option would be to have a table keep track of which nodes have had the most downloads (or the highest percentage change) since the last reindexing, and push 30 nodes from the top of the list every cron run. This method would provide more relevant results, but would add some overhead.
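A minimal sketch of that second option, again in Python rather than Drupal code. The class and method names are hypothetical, and an in-memory dictionary stands in for the tracking table; it only shows the "count downloads since the last reindex, then take the top 30" logic.

```python
import heapq

class DownloadTracker:
    """Tracks downloads per node since that node was last pushed for reindexing."""

    def __init__(self):
        self.pending = {}  # node id -> downloads since last reindex

    def record_download(self, nid, count=1):
        self.pending[nid] = self.pending.get(nid, 0) + count

    def pop_top(self, limit=30):
        """Return up to `limit` node ids with the most pending downloads,
        and reset their counters so they start from zero again."""
        top = heapq.nlargest(limit, self.pending.items(), key=lambda kv: kv[1])
        nids = [nid for nid, _ in top]
        for nid in nids:
            del self.pending[nid]
        return nids
```

Each cron run would call pop_top() and queue the returned nodes; nodes with no downloads since their last reindex never enter the list, which is where the relevance gain over a plain loop comes from.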
A third option would be to just update the Solr download_count for a given node every time that node is downloaded, but I'm not sure whether it's possible or efficient to update a single field in the Solr index (and these updates would be coming in at a rate of potentially hundreds per minute)...
Regardless of which method makes the most sense to implement, I'm fairly stumped as to how to go about it, in terms of programmatically queuing nodes for reindexing. I've skimmed through the API and the apachesolr indexing methods, but haven't found what I need. Something like apachesolr_index_mark_for_reindex seems relevant... but if I understand correctly, that marks all entities of a type for reindexing, which is obviously too broad for my needs... I feel like I'm missing something obvious here :o
Any help is very much appreciated!
Comments
Comment #1
Nick_vh commented:
In essence, the algorithm takes all nodes that are newer than the timestamp registered in apachesolr_get_last_index_position($env_id, $entity_type), limited by the number it can process per cron run. If you update the changed timestamp to the current time, the node will be queued for indexing.
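A simplified model of the selection described in this comment, written as a Python sketch rather than the module's actual code. The dictionary standing in for the index queue, the function names, and the single-timestamp position are assumptions for illustration; the real module may handle ties and positions differently.

```python
def select_for_indexing(queue, last_position, limit=50):
    """queue maps node id -> changed timestamp.
    Return up to `limit` node ids changed after `last_position`, oldest first."""
    due = [(changed, nid) for nid, changed in queue.items() if changed > last_position]
    due.sort()
    return [nid for _, nid in due[:limit]]

def touch(queue, nid, now):
    """Set a node's changed timestamp to `now` so the next run picks it up."""
    queue[nid] = now
```

In this model, touching a node moves it past the last index position, which is why updating the timestamp is enough to get it queued again.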
Comment #2
JordanMagnuson commented:
Ah, okay, thanks.
Comment #4
Paul Kim Consulting commented:
Is there a way to trigger reindexing without relying on the changed timestamp? Some user-facing actions (a rating, for example) may update the entity/node and require reindexing, but we don't want to change the last updated timestamp during these actions (a view on the site might be ordered by last update timestamp, for instance).
Any objections to implementing hook_node_update() and storing all updated nodes in a table that the module uses for reindexing purposes?
Comment #5
Paul Kim Consulting commented:
Never mind, I see how we can just insert into the apachesolr_index_entities_node table. It would be nice if this were an API function though :)
Comment #6
ianthomas_uk commented:
Yes, it's the last updated date in apachesolr's queue of entities to index, rather than on the entity itself.
Seems you've solved your problem, so I'll close this.
Comment #8
JordanMagnuson commented:
For future reference, the snippet in #1 isn't quite correct (it does not provide values for entity type and id in the case of an insert)... I believe it should be:
Comment #9
jmehta commented:
Hi Jordan, is this applicable to Drupal 6? And where can I get the $id and $bundle from?
Thanks
Comment #10
JordanMagnuson commented:
Not sure about Drupal 6, but the entity id should be the node id of the node you want to index, and the bundle should be the node type (e.g. 'article').
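To illustrate the "insert if missing, otherwise update the timestamp" behaviour discussed in this thread, here is a self-contained Python/SQLite sketch. It is not the Drupal snippet from the comments above: the index_queue table, its column names, and the mark_for_reindex function are stand-ins assumed for the example, not the apachesolr schema or API.

```python
import sqlite3
import time

def make_queue_table(conn):
    # Hypothetical stand-in for an index queue table; columns are assumed.
    conn.execute(
        "CREATE TABLE index_queue ("
        " entity_type TEXT NOT NULL,"
        " entity_id INTEGER NOT NULL,"
        " bundle TEXT NOT NULL,"
        " changed INTEGER NOT NULL,"
        " PRIMARY KEY (entity_type, entity_id))"
    )

def mark_for_reindex(conn, entity_id, bundle, now=None):
    """Insert the node into the queue, or refresh its row if already present."""
    now = int(time.time()) if now is None else now
    # All key and value columns are supplied, so a first-time insert works too.
    conn.execute(
        "INSERT OR REPLACE INTO index_queue"
        " (entity_type, entity_id, bundle, changed) VALUES ('node', ?, ?, ?)",
        (entity_id, bundle, now),
    )
    conn.commit()
```

Here entity_id would be the node id and bundle the node type, as described in the comment above; the node's own last updated timestamp is never touched.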
Comment #11
HiMyNameIsSeb commented:
Hopefully someone will find this useful.
You can re-index certain nodes on any node activity using Rules. You could create a custom rule using the Rules API for whatever you want.
Out of the box, though, Rules will let you re-index on events it already defines, e.g.:
- Comment created on node
- Node viewed
- Node flagged ...etc...
On the event select one of the above, or an event of your choice.
On the trigger select 'Search API - index node'.
Comment #12
phponwebsites commented:
How can this be done in Drupal 7?
Comment #13
subhojit777 commented:
The code snippet in #1 works, but there is a problem. When I go to the indexing page, it shows the number of items remaining to be re-indexed. After I reindex the content, the remaining count does not change to zero; it still shows the same number.
Comment #14
subhojit777 commented:
I found this API function, apachesolr_index_mark_for_reindex($env_id, $entity_type = NULL), but this will reindex all entities.