I just a launched a D8 site this morning which receives ~3 requests per second to the web servers. I did a code deploy a few hours later which involves a few things:

1. Changes the value of $settings['deployment_identifier'] (as a workaround for #2752961: No reliable method exists for clearing the Twig cache)
2. Runs a config-import
3. Runs a cache-rebuild
4. Run a "cache warmer" which just crawls a bunch of pages.

Immediately after this deploy, all path aliases on the site were seemingly lost. They still existed in the database, but all links throughout the site for nodes were using the node/[nid] path instead of their path alias.

This happened once before to us during beta testing of the site, but was unable to come up with a definitive reason and wrote it off as a fluke. When it happened, I spent some time trying to debug it, including cloning the database down to my local environment, but I couldn't reproduce it locally. It had to have something to do with the cache. I couldn't easily inspect the cache because we were using Memcache on the web servers. So, I just decided to run `drush cr` and that resolved it. All links were using the proper aliases like they should.

This time around, I began my investigation based on behavior I noticed in my cache warmer script. It works by first loading all nodes of a certain type and getting their alias like so:

  // Now add all nodes of types we care about. We leave out news and events.
  $alias_manager = \Drupal::service('path.alias_manager');
  $node_storage = \Drupal::entityTypeManager()->getStorage('node');
  $paths = [];
  $nodes = $node_storage->loadByProperties([
    'type' => [
      'landing_page',
      'chaptered',
    ],
    'status' => Node::PUBLISHED,
  ]);
  foreach ($nodes as $node) {
    $paths[] = $alias_manager->getAliasByPath('/node/' . $node->id());
  }

It then makes an HTTP request to that path and outputs each path as it's doing it. All of the paths it collected were just the /node/[nid] instead of the alias. So the alias manager was NOT returning their alias, even though their alias certainly exists and were not deleted.

So, it has something to do with the AliasManager service, caching, and probably some race condition. Note that we've done several other deploys today with the same amount of traffic hitting the site and did not experience this issue.

This is quite scary for us. It caused several infinite redirect loops due to caching in Varnish and the Redirect module which was redirecting users to the "canonical" path.

I'm going to keep investigating and update the issue with any findings.

Comments

bkosborne created an issue. See original summary.

bkosborne’s picture

Issue summary: View changes
bkosborne’s picture

AliasManager performs the following check when looking up an alias for a provided path:

if ($path === '/' || !$this->whitelist->get(strtok(trim($path, '/'), '/'))) {
  return $path;
}

So it's checks against a whitelist of paths that Drupal knows are eligible for aliases. If the root path (in this case "node") is not in the whitelist, it just returns the original path that was requested because it's not in the whitelist of paths that could have an alias.

I added a debug line like so:

if ($path === '/' || !$this->whitelist->get(strtok(trim($path, '/'), '/'))) {
  \Drupal::logger('alias_debugging')->error('WHITELIST FAIL: ' . $path);
  return $path;
}

I've been through a dozen or so deployments without the issue occurring, but the other day it happened, and indeed, I saw tons of these log entries in watchdog for /node/[nid] paths. So this is definitely where the failure is occurring. The AliasWhitelist service is claiming that "node" is not in the whitelist when in fact it is.

Next I'll have to pepper some logging into the AliasWhitelist and see if I can find anything from there.

bkosborne’s picture

Happened again during a deployment his morning.

Here's some findings:

  • "node" was NOT part of the cached path alias whitelist. Instead there were just a few paths, like "user", "contact", and "views". It should have much more than that.
  • I suspected that the cache was being populated with invalid data because the menu router state data was maybe incomplete, but that was not the case either. My logging shows that the state data that was loaded always had "node" in it.
  • This means that the path alias whitelist is being saved to the cache with incomplete data somehow for some reason.
bkosborne’s picture

Finally, I'm pretty sure I got this figured out, thanks to tons of debug logging!

I've confirmed that the whitelist cache data becomes corrupted due to a race condition with cache clears.

Some context: AliasManager is a service used to lookup path aliases for nodes (or to look up system paths for an alias). AliasWhitelist is a cache collector service that caches information about what "root" paths do or not not have any aliases, to prevent tons of DB queries to check for path aliases. An example of a root path is "/node" or "/admin".

Preface: The AliasWhitelist (and it's parent CacheCollector) are complex service classes. I'll try to explain the best I can, but it will help if you have both the AliasWhitelist and CacheCollector code opened up.

Here's the setup:
1. Caches are cold.
2. Something requests a path alias for a path, like /node/123.
3. AliasManager::getAliasByPath checks the AliasWhitelist service to see if the root segment of the path ('node') is whitelisted (if it's not, there for sure is no alias and it returns early). It does this check by calling AliasWhitelist->get($root_path_segment).
4. AliasWhitelist consults with a cache that keeps this data of which path roots have aliases or not. It has a cold cache, and primes it by loading all the menu path roots (stored in key/value service 'router.path_roots') into the local in-memory array (to be later written to cache storage). It sees that "node" does indeed have aliases, so it sets it's value to TRUE in the cache.
5. Request ends, and the AliasWhitelist::updateCache method saves all this data to cache storage.

And here's the bug & race condition:
1. AliasManager receives a request to return an alias for a specific path that was not previously looked up, like /user/1
2. AliasManager::getAliasByPath checks the AliasWhitelist service to see if the root segment of the path ('user') is whitelisted (if it's not, there for sure is no alias and it returns early). It does this check by calling AliasWhitelist->get($root_path_segment).
3. AliasWhitelist loads its list from cache, which is now warm thanks to the setup steps above. However, AliasWhitelist does not indicate that each of these paths should be persisted to cache storage again when the AliasWhitelist cache collector is destroyed. (this is important for later)
4. AliasWhitelist checks if 'user' has already been looked up as a path root that has aliases in the databases. It hasn't so it looks it up, determines if there are or are not aliases, and then saves that data for the 'user' path root to the local storage array and indicates that data for this path should be persisted back to the cache at the end of the request.
5. Meanwhile, while this request is still processing, another process clears the cache.
6. The original request ends, and the AliasWhitelist cache collector's updateCache method is called (it's the destructor for the service). This method's job is to update the cache with any new data (like the new information about the 'user' path root not having any aliases). It first loads the data from cache again, which is now gone due to the cache clear. Normally it would merge the new data it got with the existing cache data and write that back to cache storage. But instead, it just writes the new data to the cache.
7. Now the cache is corrupted. It only has data in it for one path root, 'user'. All future lookups via AliasManager for a path alias for any 'node' path will NOT return the alias, because this corrupted cache data is loaded into the AliasWhitelist's local memory storage array, and the logic in AliasWhitelist::get() says that if a path root does not exist in the local storage, it must not have an alias.

So, the problem occurs when a cache clear happens between the first cache lookup in CacheCollector::lazyLoadCache() and the second cache lookup in CacheCollector::updateCache().

Something that I believe fixes this is that when the CacheCollector does its lazy load (to bring the cache data from cache storage into local memory array) it needs to loop through each item and call ->persist() on each of them. I'm not quite sure if this is the right fix or not. Perhaps something better is to refactor to somehow not require two cache lookups in the same request.

bkosborne’s picture

Title: Path aliases sometimes "forgotten" after a cache rebuild » Path aliases can be "forgotten" after a cache rebuild due to race condition in CacheCollector
bkosborne’s picture

Here's a well-commented test that proves the bug exists. It will pass but it shouldn't.

bkosborne’s picture

Here's a decent fix I think. Detect if the cache that was present at the beginning of the request no longer exists. If so, don't persist new cache data.

The last submitted patch, 8: 2879512-test-only.patch, failed testing. View results

Status: Needs review » Needs work

The last submitted patch, 8: 2879512.patch, failed testing. View results

catch’s picture

The tests need fixing up but this seems like a decent approach, and #5 explains the race condition well.

bkosborne’s picture

Status: Needs work » Needs review
FileSize
5.22 KB

Thanks for reviewing. I think you and Berdir are the ones that worked on this component most, so it's good to have your eyes on it.

This fixes the failing unit test from CacheCollector. The test needed to mock a returned cache object when it wasn't previously. No interdiff since this is the only change from prev patch.

bkosborne’s picture

I've been using this patch in production for a few days now with no issues.