In ContentLoader::entityExists() on line 617, the existence check matches all fields on the entity that aren't arrays when checking for the existence of an entity.

        // Load entity type handler.
        $entity_handler = $this->entityTypeManager->getStorage($entity_type);

        // @todo Load this through dependency injection instead.
        $query = \Drupal::entityQuery($entity_type);
        foreach ($content_data as $key => $value) {
          if ($key != 'entity' && !is_array($value)) {
            $query->condition($key, $value);
          }
        }
        $entity_ids = $query->execute();

I have a Taxonomy (a bunch of locations) where the terms have several plain-text fields (field_room_code, field_building_code) as well as more complex fields (e.g. an address in field_address). The array-check properly excludes the address from the query, but the plain-text fields are included query. Because my plain-text fields are included in the query, any changes to them result in duplicate taxonomy terms.

One potential fix for this would be to change:

 if ($key != 'entity' && !is_array($value)) {

To:

if ($key != 'entity' && !is_array($value) && !preg_match('/^field_/', $key)) {
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

adamfranco created an issue. See original summary.

adamfranco’s picture

Attached is a patch that makes the change suggested.

If using custom fields for existence-checking is a desired feature, it would be nice to have those fields be specified in some way so that other fields can be updated and the content re-imported with the updates.

adamfranco’s picture

Status: Active » Needs review
slucero’s picture

Status: Needs review » Needs work

The main concern I have with this approach is the assumption that all fields to be excluded will follow the pattern of field_*. While this is true of fields created through the field UI, it's not universally true since field created from code may not be prefixed in this way.

The question of whether to include content fields in the existence identifications, however, comes down to the use case and structure being used for the content imports.

In most cases, the content is being made available for installation, but it isn't expected to be the sole source of the content for the site. Based on this workflow, defining IDs within the content file opens up a lot of risk for content overlap and ID clashing. In this scenario all of the content should be defined without assigned ID values and existence checking would require comparison of at least some subset of the content fields. To your point, being able to configure which fields are to be used for this comparison could be a very helpful feature.

slucero’s picture

As an alternative workaround for this issue of which fields are being used in the entity comparison you should be able to define even the plain text field values with an array structure which would exclude them from the comparison. For example, a plain text field may be populated in either of these ways:

  # A single-value field may be assigned directly.
  # This field will be included in entity existence comparisons.
  field_plain_text: 'value assigned directly'

  # Alternatively, any field may be assigned using a value list structure.
  # This field will NOT be included in entity existence comparisons.
  field_plain_text:
    - value: 'value assigned in list'
adamfranco’s picture

Thanks for the feedback and work-around suggestion. My use-case for the yaml_content module is to allow version-controlling of taxonomy terms and similar throughout the life of a site and allow these to be migrated from a development server to production.

The eventual workflow will be:

  1. Create a new term in the UI in development, update -- update any views or other references to it.
  2. Export the taxonomy to YAML (a feature I hope to write for yaml_content at some point :-) ).
  3. Commit the configuration and content changes to the git repository
  4. Deploy the changes to production and run drush cim and drush yaml-content-import-profile profile_name to activate the new configuration and content.

Since YAML content doesn't support export now, I'm just manually updating the YAML taxonomy and then importing.

I appreciate the downside of basing decisions off field names. As for identity determination, what do you think about a configuration that maps an entity_type string to a set of fields to use to build an identifier when importing? The default case could be the current behavior of all fields, with a configuration entry for an entity_type reducing the fields compared. I can work up a patch that implements this if that would be helpful.

slucero’s picture

I think you're identifying a very good opportunity to make the system more flexible through optional configuration. I think it would be very important to maintain the current behavior as a default to keep the system working with minimal overhead, but added options for customization is a good direction to go.

This is the route I've been going with the v2 rewrite to support a more plug-able system for the processing commands. Depending how it was written, this added behavior for customizing and extending the behavior for identity matching could be very valuable and something I agree would be very helpful.

Providing a patch to start this work would be very helpful if still interested. I could see a number of ways for this to be implemented cleanly including:

  • A dispatched event to alter evaluation criteria
  • A helper service to identify matching entities
  • Potentially a combination of both for easier extension

I'm definitely open to other suggestions as well, or even just a proof of concept, but I would prefer to extract the content matching logic from the ContentLoader to make it more maintainable.

slucero’s picture

Status: Needs work » Active
Related issues: +#2893055: Expand logic for identifying content to update

Setting back to "Active" and relating to #2893055: Expand logic for identifying content to update since discussion is relating back to identifying existing content.

slucero’s picture

Category: Bug report » Feature request

Changing this to a feature request since the functionality currently works as expected, but a new configuration option is being proposed.

slucero’s picture

Title: Taxonomy terms with plain-text fields get duplicated » Allow configuration of properties used to match existing entities
Issue tags: +Needs tests

Relabeling the ticket to more effectively describe the new feature being requested.

slucero’s picture

In line with these comments above in #7, I've created #2949625: Move Entity Loading and Lookup into Helper Service to work on extracting this logic into a centralized and reusable service.

I could see a number of ways for this to be implemented cleanly including:

  • A dispatched event to alter evaluation criteria
  • A helper service to identify matching entities
  • Potentially a combination of both for easier extension

I'm definitely open to other suggestions as well, or even just a proof of concept, but I would prefer to extract the content matching logic from the ContentLoader to make it more maintainable.

Next steps

  1. Complete work in #2949625: Move Entity Loading and Lookup into Helper Service to separate entity loading logic
  2. Introduce alter-able system for matching entities
slucero’s picture

Status: Active » Postponed
Related issues: +#2949625: Move Entity Loading and Lookup into Helper Service

Marking this as postponed until work in #2949625: Move Entity Loading and Lookup into Helper Service can be completed and stabilized.