If I edit one page multiple times before the purge queue has been processed, I can end up with duplicate invalidations in the queue.

When an item is enqueued, the queue should be checked for duplicates. If a duplicate exists, the item should not be enqueued.

Comments

adam.weingarten created an issue. See original summary.

pedrop’s picture

+1 for this. We are struggling with overly long queues and this would be a significant help. I'm seeing the same URLs many times in the queue.

nielsvm’s picture

Project: URLs queuer » Purge
Version: 8.x-1.x-dev » 8.x-3.x-dev

This is not a purge_queuer_url problem, as all it does is consume Purge's APIs, and it has no way of finding out whether something was queued before. That makes it a generic Purge problem, and Purge may be the only place to fix it, but there are serious performance risks if we precheck the queue for each set of (group) inserts we make.

Moving to the purge project, I'll look at a potential fix later.

In the meantime, for users annoyed by this: process your queues more often, for instance with the late runtime processor!

Niels

rbayliss’s picture

+1 for this, although I can certainly see how it would be difficult to implement properly.

wim leers’s picture

Hah, I talked to @nielsvm in chat yesterday about exactly this!

I did something similar for Fileconveyor: https://github.com/wimleers/fileconveyor/issues/68 -> https://github.com/wimleers/fileconveyor/blob/master/fileconveyor/arbitr....

jonhattan’s picture

I've created a module that provides a database queue that avoids creating duplicate items - https://www.drupal.org/project/purge_queues

hanoii’s picture

Interesting that this hasn't come up more often. I also went looking for this with purge_queuer_url, as it adds a lot of URLs to the queue. The module from #6 seems to work and I am currently using it. Thanks, although I would push for something like this to be added to the module itself somehow.

inversed’s picture

Could this be related to issue #3034525 "Clean up duplicate cache tags created by invalidation tokens"? Note that there's also the #2952277: Minify the cache tags sent in the header issue.

rosk0’s picture

Thanks a lot for the purge_queues module Jonathan!

That's a real game changer! My queue was growing to millions of items, outpacing the purge cron job that runs every minute. Local tests look great; I will see what it looks like on prod.

I believe that the purge_queues module could be a great addition to Purge itself.

ericgsmith’s picture

We have been investigating performance issues caused by duplicate items when using purge in combination with purge_queuer_url module.

We have encountered issues in two areas: 1. duplicates in the buffer and 2. duplicates in the queue.

Duplicate items in the buffer

I can see that when an invalidation is created in the InvalidationsService, it uses an instanceCounter to generate a unique integer ID for the invalidation object. When the invalidation is added to the buffer, the buffer calls has to see whether that ID has already been added.

Queuers seem to make some attempt to reduce duplicates, e.g. by filtering out previously requested tags, but certain situations such as config importing can push thousands of duplicates into the buffer, which can lead to high memory consumption.

While I have been looking at this only in the context of the url/path queuer, I wonder if it would be possible for the queuers themselves to set either an ID or another property on the invalidation that can be used to dedupe it. E.g. the url registry maintains a list of urls, so the url id could be considered unique. Individual cache tags could also consider themselves unique. Other plugins may have difficulty determining their uniqueness, but opening up the possibility to set an ID, with a fallback to the instance counter, could help plugins where this is problematic (e.g. the url queuer) to be more efficient.

Without having looked through all the code, I would be interested in the maintainers' thoughts, as it appears that getId on the invalidation plugin is (according to my IDE) mainly used by the buffer and tests.

Would there be any reasons against:

  1. changing the getId return type in InvalidationInterface to string
  2. introducing a third, optional parameter to InvalidationsService->get to allow an ID to be provided on creation
  3. introducing fallback behaviour so that a unique ID is generated if none is provided

That would then allow queuers to make changes to provide a unique value when creating an invalidation, and the existing buffer deduping code may not need to change.
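
To make the three points above more concrete, here is a minimal sketch in Python rather than Purge's PHP. The names Invalidation, get_invalidation and Buffer are hypothetical stand-ins, not Purge's real classes or signatures; the only assumption carried over from the description above is that the buffer dedupes on the invalidation ID.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for the instance counter used as the fallback ID source.
_instance_counter = itertools.count(1)


@dataclass(frozen=True)
class Invalidation:
    """Hypothetical invalidation object with a string ID."""
    type: str
    expression: str
    id: str


def get_invalidation(type_: str, expression: str, id_: Optional[str] = None) -> Invalidation:
    """Create an invalidation; fall back to a counter-based ID when none is given."""
    if id_ is None:
        id_ = f"auto:{next(_instance_counter)}"
    return Invalidation(type_, expression, id_)


class Buffer:
    """Hypothetical buffer that keeps at most one invalidation per ID."""

    def __init__(self) -> None:
        self._items = {}

    def has(self, invalidation: Invalidation) -> bool:
        return invalidation.id in self._items

    def set(self, invalidation: Invalidation) -> bool:
        """Add the invalidation unless its ID is already buffered."""
        if self.has(invalidation):
            return False
        self._items[invalidation.id] = invalidation
        return True

    def __len__(self) -> int:
        return len(self._items)
```

In this sketch a queuer that passes a stable ID (for example the url id from its registry) gets deduplication from the unchanged has check, while callers that pass nothing keep today's behaviour of one buffer entry per created object.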

Duplicate items in the queue

We are using the module @jonhattan provided, but the checks for duplicate items can be problematic for repeated large updates (e.g. in our case, multiple batch calls that each invalidated the media_list tag).

@RoSk0 raised an idea (offline) of storing a unique identifier for each queued item so that a database queue can use upsert queries instead of insert queries. We have a proof of concept that does this by hashing the type and expression value of the data, but it would be easier with an enforced / persisted unique ID for an invalidation item. We would be interested in any thoughts on this approach.
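
As a rough illustration of the upsert idea, here is a minimal Python/sqlite3 sketch. It is not the proof of concept mentioned above: the table purge_queue_sketch, its columns, and the choice of SHA-256 are all assumptions made for the example, and the ON CONFLICT clause needs SQLite 3.24 or newer.

```python
import hashlib
import sqlite3


def item_hash(type_: str, expression: str) -> str:
    """Hash the invalidation type and expression into a stable unique key."""
    return hashlib.sha256(f"{type_}\x00{expression}".encode("utf-8")).hexdigest()


def create_queue(conn: sqlite3.Connection) -> None:
    """Create a hypothetical queue table with a UNIQUE hash column."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS purge_queue_sketch (
            item_id INTEGER PRIMARY KEY AUTOINCREMENT,
            hash TEXT NOT NULL UNIQUE,
            type TEXT NOT NULL,
            expression TEXT NOT NULL,
            created INTEGER NOT NULL
        )
        """
    )


def enqueue(conn: sqlite3.Connection, type_: str, expression: str, now: int) -> None:
    """Insert the item, or only refresh its timestamp if it is already queued."""
    conn.execute(
        """
        INSERT INTO purge_queue_sketch (hash, type, expression, created)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(hash) DO UPDATE SET created = excluded.created
        """,
        (item_hash(type_, expression), type_, expression, now),
    )
```

The point of the design is that the database enforces uniqueness in a single statement, so no separate lookup is needed before each insert, which is where repeated large updates hurt.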

o'briat’s picture

I confirm that duplicate invalidations occur when Drupal regularly imports or updates large volumes of content.

A simple solution could be to delete all items with identical "data" when purging an item?

Or just add a global duplicate deletion at the end of every purge. Here's some pseudo code:

SELECT MAX(item_id) AS item_id, data FROM purge_queue GROUP BY data HAVING COUNT(*) > 1
foreach ($item_id => $data)
  DELETE FROM purge_queue WHERE data = $data AND item_id != $item_id

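
For anyone who wants to try the idea, here is a runnable Python/sqlite3 sketch of the same cleanup. It swaps the per-row loop for a single DELETE that keeps the highest item_id per distinct data value; the two-column table layout (item_id, data) is assumed from the pseudo code and is not Purge's full schema.

```python
import sqlite3


def delete_duplicates(conn: sqlite3.Connection) -> int:
    """Delete every row except the newest (highest item_id) per data value.

    Returns the number of rows removed.
    """
    cursor = conn.execute(
        """
        DELETE FROM purge_queue
        WHERE item_id NOT IN (
            SELECT MAX(item_id) FROM purge_queue GROUP BY data
        )
        """
    )
    return cursor.rowcount
```

Running this once at the end of a purge run costs one grouped scan of the table, so it trades a periodic bulk cleanup for not checking on every insert.
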
santhoshkumar’s picture

Attachment (new): 3.86 KB

We have identified a similar issue when using the purge_queuer_coretags module. There are two problems:

  1. The same cache tags are inserted into the purge_queue table multiple times.
  2. Because of these duplicates in the purge_queue table, we frequently hit the "your queue exceeded 100 000 items! Purge shut down" error.

To fix this we have added the patch duplicate_purge_tags.patch. It does a DB lookup before inserting into purge_queue, and it also keeps the tags in a static array to prevent multiple database calls for the same tag.
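
The approach described above (lookup before insert, plus an in-process cache) can be sketched roughly like this in Python/sqlite3. This is not the contents of the patch: the class DedupingTagQueue, the lookups counter, and the plain-text data column are hypothetical and only illustrate the technique.

```python
import sqlite3


class DedupingTagQueue:
    """Hypothetical queue wrapper: DB lookup before insert, with a per-process cache."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self._seen = set()  # tags already known to be queued (the "static array")
        self.lookups = 0    # number of database lookups, for illustration only

    def _is_queued(self, tag: str) -> bool:
        if tag in self._seen:
            return True  # answered from the cache, no database call
        self.lookups += 1
        row = self.conn.execute(
            "SELECT 1 FROM purge_queue WHERE data = ? LIMIT 1", (tag,)
        ).fetchone()
        if row is not None:
            self._seen.add(tag)
            return True
        return False

    def enqueue(self, tag: str) -> bool:
        """Insert the tag unless it is already queued; return True if inserted."""
        if self._is_queued(tag):
            return False
        self.conn.execute("INSERT INTO purge_queue (data) VALUES (?)", (tag,))
        self._seen.add(tag)
        return True
```

One caveat with any such cache: it has to be cleared when items are processed and removed from the queue within the same process, otherwise a tag that was purged would be wrongly skipped the next time it is queued.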

hchonov’s picture

Attachment (new): 3.06 KB

Re-roll.

hchonov’s picture

Status: Active » Reviewed & tested by the community
Attachments (new): 967 bytes, 3.02 KB

Fixed an issue in the query logic: a search for the cache tag "paragraph_list" was also returning items like "paragraph_list:text". After testing this I can confirm that no duplicate queue items are created anymore, which drastically reduces the queue length for us.
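
The patch's actual query is not shown in this thread, so the following Python/sqlite3 sketch is only a hypothetical illustration of this class of bug: a pattern match finds "paragraph_list:text" when looking for "paragraph_list", while an exact comparison does not. The function names and the plain-text data column are assumptions.

```python
import sqlite3


def find_loose(conn: sqlite3.Connection, tag: str) -> list:
    """Pattern match: also returns tags that merely contain the search term."""
    rows = conn.execute(
        "SELECT data FROM purge_queue WHERE data LIKE ? ORDER BY item_id",
        (f"%{tag}%",),
    )
    return [row[0] for row in rows]


def find_exact(conn: sqlite3.Connection, tag: str) -> list:
    """Exact match: returns only rows whose data equals the tag."""
    rows = conn.execute(
        "SELECT data FROM purge_queue WHERE data = ? ORDER BY item_id",
        (tag,),
    )
    return [row[0] for row in rows]
```

With the loose variant, a queued "paragraph_list:text" would make the lookup report "paragraph_list" as already queued, so the shorter tag would never be inserted.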

hchonov’s picture

Status: Reviewed & tested by the community » Needs work

It turns out the patch does not work for new site installations, as it queries the purge_queue database table before it is created. We are simply switching to the unique queues provided by https://www.drupal.org/project/purge_queues.