I have a DB index which is missing some nodes: they are not in the {search_api_item} table, and I can't see a way to refresh this table.

I thought at first this was because I added some node bundles to the index after initially creating it, but my git log shows this:

       bundles:
+        blog: blog
+        event: event
         page: page
+        press_release: press_release
         community_index: '0'
         cta_type: '0'
         section_index: '0'

-- I started with just page nodes and added three further types in that commit.

There are blog, page, & event nodes in the item table, but no press_release, as this query shows me:

SELECT si.item_id,
n.nid, n.type
FROM search_api_item si
JOIN node n 
  ON n.nid = SUBSTRING_INDEX(SUBSTRING_INDEX(si.item_id, ':', 2), '/', -1)
WHERE `datasource` = 'entity:node'
GROUP BY n.type
;

and this shows me that quite a few blog nodes are missing:

SELECT 
n.nid, n.type, si.item_id
FROM node n 
LEFT JOIN search_api_item si
  ON n.nid = SUBSTRING_INDEX(SUBSTRING_INDEX(si.item_id, ':', 2), '/', -1)
  AND si.`datasource` = 'entity:node'
;
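
For anyone who would rather do this comparison in PHP than in SQL, the nested SUBSTRING_INDEX() calls amount to roughly the following. This is a minimal sketch that assumes item IDs always look like "entity:node/23886:en" (datasource, slash, entity ID, colon, language code); split_item_id() is a hypothetical helper name, not something the Search API module provides.

```php
<?php

/**
 * Splits an item ID such as "entity:node/23886:en" into its parts.
 *
 * Hypothetical helper: assumes the "<datasource>/<entity id>:<langcode>"
 * layout and returns NULL for anything that does not match it.
 */
function split_item_id(string $item_id): ?array {
  // The datasource ID itself contains a colon ("entity:node"), so split on
  // the first slash rather than on colons.
  $slash = strpos($item_id, '/');
  if ($slash === FALSE) {
    return NULL;
  }
  $datasource = substr($item_id, 0, $slash);
  $raw_id = substr($item_id, $slash + 1);

  // The language code follows the last colon of the remaining part.
  $colon = strrpos($raw_id, ':');
  if ($colon === FALSE) {
    return NULL;
  }

  return [
    'datasource' => $datasource,
    'entity_id' => substr($raw_id, 0, $colon),
    'langcode' => substr($raw_id, $colon + 1),
  ];
}

// $parts['entity_id'] is '23886' and $parts['langcode'] is 'en'.
$parts = split_item_id('entity:node/23886:en');
```

The 'entity_id' value corresponds to what the join condition in the queries above compares against n.nid.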

Comments

joachim created an issue. See original summary.

joachim’s picture

I'm not managing to get to the bottom of this, as I can't find where this table gets initially populated when a new index is created (possibly it's not at the moment, due to #2747763: crash on adding a new index).

If anyone else has the same problem, this PHP code refills my table (will need tweaking for your own index!!)

  $index_id = 'combined_index';
  $index = \Drupal\search_api\Entity\Index::load($index_id);
  $datasources = $index->getDatasources();
  foreach ($datasources as $datasource_id => $datasource) {
    $entity_type_id = $datasource->getEntityTypeId();
    $query = \Drupal::service('entity.query')->get($entity_type_id);
    
    // WARNING!!! Assumptions about our own index here -- node datasource is
    // 'include' selected bundles; other datasources just have all bundles.
    if ($entity_type_id == 'node') {
      $config = $datasource->getConfiguration();
      $bundles = $config['bundles']['selected'];
      
      $query->condition('type', $bundles, 'IN');
    }
      
    $ids = $query->execute();
    
    $combine_id = function ($id) use ($datasource_id) {
      // Assume all in 'en'.
      return "$datasource_id/$id:en";
    };
    $search_api_item_ids = array_map($combine_id, $ids);
    
    // Go direct to the tracker, as calling this on the index will index
    // everything and take ages.
    $index->getTrackerInstance()->trackItemsInserted($search_api_item_ids);
  }

swentel’s picture

I've had issues with my tracker as well, especially during development, where I'm playing around a lot. This code at least helps to recalculate that table when, for instance, items are stuck in there but the actual entity no longer exists. And it seems there's no other way right now to tell Search API to retrack the table from the UI (unless I'm missing something).

drunken monkey’s picture

> And it seems there's no other way right now to tell search api to retrack the table from the UI (unless I'm missing something)

You can disable and re-enable the index, that should do it. If you set it to "Read only" before that, it shouldn't even clear your index.
But, of course, that's more of a workaround. Having something for this in the UI (or at least in Drush) could make sense – feel free to create an issue for that.

joachim’s picture

I can confirm that disabling then enabling the index got me a 'Track this index' UI button, which fixes the problem.

geerlingguy’s picture

@drunken monkey - Thanks for your tip! I was getting the following errors when I was indexing:

160 items could not be indexed. Check the logs for details.

The logs showed items like:

Could not load the following items for indexing on index Site Index: "entity:node/23886:en", "entity:node/23891:en", "entity:node/23896:en", "entity:node/23901:en", "entity:node/23906:en", "entity:node/23911:en", "entity:node/23916:en", "entity:node/23921:en", "entity:node/23926:en", "entity:node/23931:en

Looking up those nodes, they were all previously deleted, but still existed in the search_api_item table. After disabling and enabling the index, I got the 'Track' button... but after clicking it, the page still says 0/0 items.

Granted, I'm running on Search API 8.1.0-alpha14, so it could be a bug related to that older version...

joachim’s picture

> After disabling and enabling the index, I got the 'Track' button... but after clicking it, the page still says 0/0 items.

When you enabled it, did you use the ajaxy link? What happened to me with that is that I got a timeout (#2805285: re-enabling an index times out), but because it's ajax, I didn't notice it. That leaves the index improperly re-enabled. The workaround is to disable and enable it with Drush.

geerlingguy’s picture

@joachim - As it turns out, #2609200: Tracking (?) broken was the other issue—I didn't have the tracking_page_size config setting on the site, so it would get stuck at 0/0.

acbramley’s picture

@drunken monkey thanks for the workaround of disabling and re-enabling. We have a handful of nodes (5 out of 2430) that, for whatever reason, aren't in the search_api_item table. I'm not sure how they were removed in the first place. It happened within the last few weeks, since our other environments, whose databases were synced from production, do have those nodes in the tracker table.

Can you think of any way they would be removed?

drunken monkey’s picture

Priority: Critical » Normal

> Can you think of any way they would be removed?

Items are automatically removed from tracking when loading fails for them, even once. This is a safeguard for when items are deleted without triggering the appropriate "track delete" method, for whatever reason. But if, for any reason, you sometimes have issues where nodes can't get loaded (though it's hard to imagine such a scenario – if the DB connection is lost, removing the nodes from tracking also wouldn't work, after all), this might cause them to be falsely removed from tracking.

Other than that, I can't really think of any reason, no. If you play around with datasource settings, we of course add and remove nodes from tracking, as appropriate – but I can't really think of a way how just a few nodes would slip through the cracks in that case. (Unless they are the only nodes with a certain type, or language, of course.)

jummonk’s picture

I have a use case where this is a huge issue. The rendered HTML output of the node, which is the main field I'm indexing, gets populated with data using SOAP calls to a particular service.
When this SOAP call fails during indexing, those nodes are removed from the tracker table and are never indexed again.
Updating the node does not help in this case.

drunken monkey’s picture

Component: General code » Framework

Hm, yes, I guess that's a problem. We could probably introduce some config setting to make the current behavior optional.
Having some API way of marking a datasource (or entity type) as "unreliable" would also work, but would probably be a bit of overkill.
But even for the "soft" solution, it's questionable how many people would find it useful. It would be a very small change, though, so maybe it'd be fine either way. What's your opinion?

In any case, for your custom scenario, you could also remember which entities (or items, in case it's not entities?) failed loading and manually do a trackItemsInserted() for them later, when they are available again. (Or, instead of returning no object for them, just returning an incomplete one and queuing them for later reindexing.)
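
The first suggestion above (remembering which items failed to load and calling trackItemsInserted() for them later) could look roughly like this. It is a plain-PHP sketch, not Search API code: FailedItemLog and its methods are hypothetical names, the list is kept in memory for brevity where a real module would need persistent storage, and the two callbacks stand in for "can this item be loaded now?" and for the real trackItemsInserted() call.

```php
<?php

/**
 * Remembers item IDs that failed to load so they can be re-tracked later.
 *
 * Hypothetical class: the list lives in memory here; a real module would
 * keep it in persistent storage so that it survives between requests.
 */
class FailedItemLog {

  /** @var string[] Item IDs that failed to load, keyed by item ID. */
  private $failed = [];

  /** Records one item ID whose loading failed. */
  public function remember(string $item_id): void {
    $this->failed[$item_id] = $item_id;
  }

  /**
   * Re-tracks every remembered item that is available again.
   *
   * @param callable $is_available
   *   Receives an item ID; returns TRUE if the item can now be loaded.
   * @param callable $track
   *   Receives an array of item IDs; in Drupal this would wrap the
   *   trackItemsInserted() call.
   *
   * @return string[]
   *   The item IDs that were handed to $track.
   */
  public function retrack(callable $is_available, callable $track): array {
    $ready = array_values(array_filter($this->failed, $is_available));
    if ($ready) {
      $track($ready);
      // Forget only what was re-tracked; the rest stays for the next run.
      $this->failed = array_diff_key($this->failed, array_flip($ready));
    }
    return $ready;
  }

  /** Returns the item IDs still waiting to be re-tracked. */
  public function pending(): array {
    return array_values($this->failed);
  }

}

// Usage with stand-in callbacks: only node 1 is "available" again.
$log = new FailedItemLog();
$log->remember('entity:node/1:en');
$log->remember('entity:node/2:en');

$tracked = [];
$log->retrack(
  function (string $id): bool {
    return $id === 'entity:node/1:en';
  },
  function (array $ids) use (&$tracked): void {
    $tracked = $ids;
  }
);
```

Items that are still unavailable stay in the log, so the same call can simply be repeated on the next cron run.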

wouters_f’s picture

Drupal core 8.8
"drupal/search_api": "^1.15",
"drupal/search_api_elasticsearch_attachments": "^6",
"drupal/search_api_exclude_entity": "^1.0",

We ran this code, as this still seems relevant.
We have two languages, and some translations are not indexed.
As you can see in the query, we have nl and fr.

Note that we do both an insert and an update.
These are the functions that we suspect are not being triggered correctly in search_api on node insert (especially trackItemsInserted).

You can run this query and the fixing function in a cron hook or in a Drush command.

  /**
   * Finds untracked entities.
   */
  public function findUntrackedEntities() {
    $query = 'SELECT 
SUBSTRING_INDEX(item_id, \':\', 2)  as node_identifier,
SUBSTRING_INDEX(SUBSTRING_INDEX(item_id, \':\', -1), \'/\', 1) as lang,
SUBSTRING_INDEX(SUBSTRING_INDEX(item_id, \':\', 2), \'/\', -1) as id
from search_api_item 
WHERE
item_id LIKE \'%:nl\' 
AND SUBSTRING_INDEX(item_id, \':\', 2) NOT IN (
SELECT SUBSTRING_INDEX(item_id, \':\', 2) 
from search_api_item 
WHERE
item_id LIKE \'%:fr\'
)
UNION
SELECT 
SUBSTRING_INDEX(item_id, \':\', 2)  as node_identifier,
SUBSTRING_INDEX(SUBSTRING_INDEX(item_id, \':\', -1), \'/\', 1) as lang,
SUBSTRING_INDEX(SUBSTRING_INDEX(item_id, \':\', 2), \'/\', -1) as id

from search_api_item 
WHERE
item_id LIKE \'%:fr\' 
AND SUBSTRING_INDEX(item_id, \':\', 2) NOT IN (
SELECT SUBSTRING_INDEX(item_id, \':\', 2) 
from search_api_item 
WHERE
item_id LIKE \'%:nl\'
)';
    $database = \Drupal::database();
    $query = $database->query($query);
    return $query->fetchAll();
  }

  /**
   * Fix Untracked nodes in the tracking table.
   *
   * @throws \Drupal\Component\Plugin\Exception\InvalidPluginDefinitionException
   * @throws \Drupal\Component\Plugin\Exception\PluginNotFoundException
   */
  public function fixUntrackedNodes() {
    $untrackedEntities = $this->findUntrackedEntities();
    foreach ($untrackedEntities as $untrackedEntity) {
      // If it is a node.
      if (strpos($untrackedEntity->node_identifier, 'entity:node') !== FALSE) {
        $entity_manager = \Drupal::entityTypeManager();
        $node = $entity_manager->getStorage('node')
          ->load($untrackedEntity->id);
        if ($node) {
          // ContentEntity is
          // \Drupal\search_api\Plugin\search_api\datasource\ContentEntity;
          // import it with a "use" statement at the top of the file.
          /** @var \Drupal\search_api\IndexInterface[] $indexes */
          $indexes = ContentEntity::getIndexesForEntity($node);
          foreach ($indexes as $index) {
            // Row language 'fr': fr is tracked but nl is missing.
            if ($node->hasTranslation('nl') && $untrackedEntity->lang == 'fr') {
              $index->trackItemsInserted('entity:node', [$node->id() . ':nl']);
              $index->trackItemsUpdated('entity:node', [$node->id() . ':nl']);
            }
            // Row language 'nl': nl is tracked but fr is missing.
            if ($node->hasTranslation('fr') && $untrackedEntity->lang == 'nl') {
              $index->trackItemsInserted('entity:node', [$node->id() . ':fr']);
              $index->trackItemsUpdated('entity:node', [$node->id() . ':fr']);
            }
          }
        }
        else {
          echo 'node ' . $untrackedEntity->id . ' could not be loaded.';
        }
      }
    }
  }

I was actually surprised to find this unreliability in search_api, which is otherwise pretty reliable.
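
For sites with more than two languages, the nl/fr comparison in the query above can be generalised in plain PHP. The following is only a sketch under assumptions: find_missing_translations() is a hypothetical helper, and it expects the item IDs to have been fetched from search_api_item already. Like the SQL, it reports a language as missing even if the entity has no such translation, so the hasTranslation() check is still needed before tracking anything, and it cannot see entities that are not tracked in any language.

```php
<?php

/**
 * Finds tracked entities that lack an item for an expected language.
 *
 * Hypothetical helper, not part of Search API.
 *
 * @param string[] $item_ids
 *   Item IDs as stored in search_api_item, e.g. "entity:node/12:nl".
 * @param string[] $langcodes
 *   The language codes every entity is expected to be tracked in.
 *
 * @return array
 *   Missing language codes, keyed by "<datasource>/<entity id>".
 */
function find_missing_translations(array $item_ids, array $langcodes): array {
  // Group the tracked language codes by entity.
  $seen = [];
  foreach ($item_ids as $item_id) {
    $slash = strpos($item_id, '/');
    $colon = strrpos($item_id, ':');
    // Skip IDs that do not end in ":<langcode>" after the slash.
    if ($slash === FALSE || $colon === FALSE || $colon < $slash) {
      continue;
    }
    $seen[substr($item_id, 0, $colon)][] = substr($item_id, $colon + 1);
  }

  // Report every expected language that an entity is not tracked in.
  $missing = [];
  foreach ($seen as $entity_key => $tracked) {
    $gap = array_values(array_diff($langcodes, $tracked));
    if ($gap) {
      $missing[$entity_key] = $gap;
    }
  }
  return $missing;
}

// Node 1 is tracked in both languages; nodes 2 and 3 each lack one.
$missing = find_missing_translations(
  ['entity:node/1:nl', 'entity:node/1:fr', 'entity:node/2:nl', 'entity:node/3:fr'],
  ['nl', 'fr']
);
```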