Allow Entity Mesh to analyze content using configurable user roles
Problem/Motivation
Currently, the Entity Mesh module only analyzes links that are visible to anonymous users.
However, some sites we want to analyze are not accessible to anonymous users, so we cannot use the module in this context.
We need to extend Entity Mesh so that site administrators can configure which user role(s) should be used when crawling and analyzing links.
By default, the module should continue to behave as it does now (use the anonymous role), but it should allow selecting other roles through configuration.
This would make the module usable for a wider range of scenarios, especially for websites where access control is based on authenticated roles.
Steps to reproduce
- Install and enable the Entity Mesh module.
- Attempt to analyze a site where most or all content is restricted to authenticated users.
Observe that:
- Entity Mesh only analyzes links visible to anonymous users.
- Links restricted to authenticated or custom roles are skipped.
There is currently no configuration option to analyze the site using a different role.
Proposed resolution
Add a module configuration setting to choose which user role(s) Entity Mesh should use when analyzing links. By default, use the anonymous role for backward compatibility.
Reuse the functionality already proposed in issue #3535302, which implements the ability to generate a fake account object and assign roles dynamically.
At this point, there are two options:
- either continue with the approach used in this issue, which involves changing the account globally in each processing step,
- or implement a more refined solution, which would require at least the following two actions:
  - Replace new AnonymousUserSession() calls with the configurable fake account object.
  - Audit all access checks in the module to ensure they use the configured roles.
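As a language-neutral illustration of the idea (the module itself is Drupal/PHP), the following minimal Python sketch shows an account object whose roles come from configuration and default to anonymous. The names CrawlAccount, account_from_config, visible_links and the crawl_roles setting are hypothetical and are not taken from the module or from issue #3535302.

```python
from dataclasses import dataclass, field

ANONYMOUS = "anonymous"


@dataclass(frozen=True)
class CrawlAccount:
    """Hypothetical stand-in for the fake account object."""
    roles: frozenset = field(default_factory=lambda: frozenset({ANONYMOUS}))


def account_from_config(config):
    """Build the crawl account from settings, defaulting to anonymous."""
    # "crawl_roles" is an assumed setting name used only for this sketch.
    roles = config.get("crawl_roles") or [ANONYMOUS]
    return CrawlAccount(frozenset(roles))


def visible_links(links, account):
    """Keep the links whose allowed roles intersect the account's roles."""
    return [url for url, allowed in links if account.roles & set(allowed)]


links = [
    ("/about", {"anonymous", "authenticated"}),
    ("/intranet/handbook", {"authenticated"}),
    ("/admin/reports", {"administrator"}),
]

# No configuration: behaves as today and sees only anonymous-visible links.
default_view = visible_links(links, account_from_config({}))
# Configured role: links restricted to authenticated users are also analyzed.
staff_view = visible_links(
    links, account_from_config({"crawl_roles": ["authenticated"]})
)
```

Passing the account explicitly to every access check corresponds to the second, more refined option; the first option would instead swap a global current account before each processing step.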
Remaining tasks
Add a configuration form to select the role(s) used for crawling.
Extend the fake account functionality from issue #3535302 to support configured roles.
Decide between the two options listed under Proposed resolution: changing the account globally in each processing step, or replacing the new AnonymousUserSession() calls with the configurable fake account object and auditing all access checks in the module.
Add tests to confirm:
- Default behavior remains anonymous-only.
- Configured roles are respected during link analysis.
Entity Mesh Tracker System - Technical Overview
Purpose
The Tracker system provides a queue-based mechanism to manage and process entity
link analysis asynchronously, replacing the previous approach of truncating and
rebuilding the entire entity_mesh table during batch operations.
Database Schema
The entity_mesh_tracker table tracks entities pending analysis with
the following structure:
- id: Primary key (auto-increment)
- entity_type: Entity type identifier (e.g., 'node', 'taxonomy_term')
- entity_id: Entity identifier
- operation: Operation type (1 = process/update, 2 = delete)
- status: Processing status (1 = pending, 2 = processing, 3 = processed, 4 = failed)
- timestamp: Unix timestamp of last update
- retry_count: Number of failed processing attempts
Indexes: entity_lookup (entity_type, entity_id), status, timestamp
Unique constraint: entity_type + entity_id combination
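The table layout and the effect of the unique constraint can be tried out with a small sketch. This one uses Python's sqlite3 module rather than Drupal's schema API; the column types, the index names and the add_entity helper are assumptions made for illustration, and only the column names and numeric codes come from the description above.

```python
import sqlite3
import time

# Operation and status codes as listed in the schema description.
OP_PROCESS, OP_DELETE = 1, 2
STATUS_PENDING = 1

# Column types and index names are assumptions for this sketch.
SCHEMA = """
CREATE TABLE entity_mesh_tracker (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  entity_type TEXT NOT NULL,
  entity_id TEXT NOT NULL,
  operation INTEGER NOT NULL,
  status INTEGER NOT NULL,
  timestamp INTEGER NOT NULL,
  retry_count INTEGER NOT NULL DEFAULT 0,
  UNIQUE (entity_type, entity_id)
);
CREATE INDEX status_idx ON entity_mesh_tracker (status);
CREATE INDEX timestamp_idx ON entity_mesh_tracker (timestamp);
"""


def add_entity(conn, entity_type, entity_id, operation=OP_PROCESS, now=None):
    """Insert the entity, or set its existing row back to pending (upsert)."""
    now = int(time.time()) if now is None else now
    conn.execute(
        """
        INSERT INTO entity_mesh_tracker
          (entity_type, entity_id, operation, status, timestamp)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT (entity_type, entity_id) DO UPDATE SET
          operation = excluded.operation,
          status = excluded.status,
          timestamp = excluded.timestamp
        """,
        (entity_type, str(entity_id), operation, STATUS_PENDING, now),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_entity(conn, "node", 42, now=100)
add_entity(conn, "node", 42, OP_DELETE, now=200)  # same entity: row is updated
rows = conn.execute(
    "SELECT entity_type, entity_id, operation, status, timestamp "
    "FROM entity_mesh_tracker"
).fetchall()
```

Because of the unique constraint, tracking the same entity twice leaves a single row that carries the latest operation and timestamp. Whether the real service also resets retry_count on such an update is not stated here, so the sketch leaves that column untouched.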
Core Components
TrackerInterface
Defines the service contract, with constants for operations and statuses.
Tracker Service (entity_mesh.tracker)
Implements tracking functionality:
- addEntity(): Adds/updates entity in tracker (uses MERGE for upsert behavior)
- addMultipleEntities(): Batch adds entities within transaction
- getPendingEntities(): Retrieves entities awaiting processing (ordered by timestamp)
- getFailedEntities(): Retrieves failed entities for retry logic
- markAsProcessed(): Updates status to processed
- markAsFailed(): Updates status to failed and increments retry_count
- deleteEntity(): Removes entity from tracker
- deleteProcessedRecords(): Cleanup of old processed records
- getPendingCount()/getTotalCount(): Statistics methods
- truncate(): Clears entire tracker table
Integration Points
Entity Hooks: Entity CRUD operations (insert/update/delete)
automatically add entries to tracker via entity hooks
Batch Processing: Refactored to populate tracker instead of
directly processing all entities
Cron Processing: Configurable cron job processes pending
entities with limit control (default: 50 per run, configurable via
entity_mesh.settings.cron_limit)
Drush Commands: New commands for manual tracker management and
processing
Processing Flow
- Entity operation (create/update/delete) triggers tracker entry
- Entry status = PENDING (1)
- Cron or manual processing picks up pending entries
- Status changes to PROCESSING (2) during analysis
- On success: status = PROCESSED (3), on failure: status = FAILED (4) + retry_count incremented
- Failed entities can be reprocessed based on retry limits
- Processed records older than configured days are automatically purged
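The status transitions in this flow can be sketched as a small state machine. This is an illustrative Python model, not the module's code: TrackerEntry, process_entries, the max_retries value of 3, and the choice to retry failed entries in the same run as pending ones are all assumptions. Only the status codes and the default limit of 50 come from the text.

```python
from dataclasses import dataclass

PENDING, PROCESSING, PROCESSED, FAILED = 1, 2, 3, 4


@dataclass
class TrackerEntry:
    entity_type: str
    entity_id: str
    status: int = PENDING
    retry_count: int = 0


def process_entries(entries, analyze, limit=50, max_retries=3):
    """Process up to `limit` entries; failed ones are retried below max_retries."""
    eligible = [
        e for e in entries
        if e.status == PENDING
        or (e.status == FAILED and e.retry_count < max_retries)
    ][:limit]
    for entry in eligible:
        entry.status = PROCESSING
        try:
            analyze(entry)
            entry.status = PROCESSED
        except Exception:
            # Any analysis error marks the entry as failed and counts a retry.
            entry.status = FAILED
            entry.retry_count += 1
    return len(eligible)


def flaky_analyze(entry):
    """Hypothetical analyzer that always fails for entity "2"."""
    if entry.entity_id == "2":
        raise RuntimeError("simulated analysis failure")


entries = [TrackerEntry("node", "1"), TrackerEntry("node", "2")]
first_run = process_entries(entries, flaky_analyze)
second_run = process_entries(entries, flaky_analyze)
```

In this model the first run handles both entries, while the second run picks up only the failed one, because processed entries are no longer eligible. Purging of old processed records is left out of the sketch.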
Configuration
- cron_enabled: Enable/disable automatic cron processing (default: TRUE)
- cron_limit: Maximum entities to process per cron run (default: 50)
- processing_mode: Controls synchronous vs asynchronous processing behavior
- synchronous_limit: Threshold for immediate vs queued processing
Benefits
- Incremental processing: Only changed entities are analyzed
- Performance: Avoids full table truncation and rebuild
- Reliability: Retry mechanism for failed processing
- Flexibility: Manual and automatic processing options
- Scalability: Configurable limits prevent timeout issues on
large sites
Issue fork entity_mesh-3544912
Comments
Comment #2
lpeidro commented
Comment #3
lpeidro commented
Comment #4
lpeidro commented
Comment #6
lpeidro commented
The option to configure the user profile under which content processing is executed has been implemented.
The configuration now includes three mutually exclusive options:
This is useful for intranet environments or systems where authenticated users access the website.
Relevant functional tests have been added to ensure that access permissions to content are properly validated.
Ready for testing and suggestions for improvements.
Comment #7
lpeidro commented
Comment #8
lpeidro commented
Because this task needed better performance and a change in the way content is processed, we have also implemented a tracking system along with a cron job to make the process more stable, as well as a specific cron for the module. I'm also taking this opportunity to update the task title and the description.
Comment #59
lpeidro commented
Functionality merged.