Problem/Motivation

We need a way to detect duplicate contacts, merge them, and update references when multiple contact records represent the same contact.

Steps to reproduce

Proposed resolution

Remaining tasks

| Field Type | Plugin ID | Description / Scoring Strategy | Notes / Config Options |
|---|---|---|---|
| Integer | integer_exact | Exact match; 1 if equal, 0 otherwise | None |
| | integer_linear | Linear decay based on difference | max_delta = maximum difference for score 0 |
| | integer_exponential | Exponential decay with difference | decay_base |
| | integer_bucket | Compare in discrete ranges | bucket_size |
| Float / Decimal | number_exact | Exact match | Same as integer_exact |
| | number_linear | Linear decay | max_delta |
| | number_exponential | Exponential decay | decay_base |
| | number_bucket | Discrete buckets | bucket_size |
| Short String | string_exact | Case-insensitive exact match | None |
| | string_jaro_winkler | Fuzzy string similarity | Optional threshold |
| | string_metaphone | Phonetic similarity | For names / nicknames |
| | string_soundex | Phonetic similarity | Alternative to metaphone |
| Text / Long Form | text_exact | Exact string match | Optional min length |
| | text_tfidf | TF-IDF vector + cosine similarity | Config: min_length, max_comparisons, stopwords |
| | text_semantic | Embedding / ML-based similarity | Optional / heavy dependency |
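
To make the numeric rows of the table concrete, here is a minimal Python sketch of the four strategies. The function names are hypothetical, and the exact formulas are assumptions: the table only names `max_delta`, `decay_base`, and `bucket_size`, so the linear, exponential, and bucket definitions below are one plausible reading, not a specification.

```python
import math


def score_exact(a: float, b: float) -> float:
    """1.0 if the values are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def score_linear(a: float, b: float, max_delta: float) -> float:
    """Fall linearly from 1.0 (no difference) to 0.0 at max_delta or beyond.

    Assumption: max_delta is the difference at which the score reaches 0.
    """
    if max_delta <= 0:
        raise ValueError("max_delta must be positive")
    return max(0.0, 1.0 - abs(a - b) / max_delta)


def score_exponential(a: float, b: float, decay_base: float) -> float:
    """Multiply the score by decay_base for each unit of difference.

    Assumption: 0 < decay_base < 1, so a larger difference scores lower.
    """
    if not 0 < decay_base < 1:
        raise ValueError("decay_base must be between 0 and 1")
    return decay_base ** abs(a - b)


def score_bucket(a: float, b: float, bucket_size: float) -> float:
    """1.0 if both values fall in the same bucket_size-wide range, else 0.0.

    Assumption: buckets are aligned at multiples of bucket_size.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    same = math.floor(a / bucket_size) == math.floor(b / bucket_size)
    return 1.0 if same else 0.0
```

With these definitions, `score_linear(10, 15, max_delta=10)` gives 0.5 and `score_exponential(5, 7, decay_base=0.5)` gives 0.25. Note that aligned buckets treat 29 and 31 as different for `bucket_size=10` even though they are close, which is a known trade-off of bucketing.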

User interface changes

API changes

Data model changes

Comments

bluegeek9 created an issue.

bluegeek9 commented:

I think we should look to the community for frameworks that detect duplicate entities and can merge them.

bluegeek9 commented:

Status: Active » Postponed

Need a community module as a framework.

bluegeek9 commented:

Title: Contact Merge » Contact Duplicate + Merge
Issue summary: View changes

bluegeek9 commented:

Maybe we can use this namespace.

https://www.drupal.org/project/similarity

bluegeek9 commented:

Deduping when the fields are encrypted is an additional consideration.

mortona2k commented:

Title: Contact Duplicate + Merge » Contact deduplicate + merge

Dedupe, not duplicate.

This came up for a project today: if a contact registers and the email doesn't match but the name does, how do we consolidate?
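
The scenario above (name matches, email does not) can be sketched with per-field string comparison. This is only an illustration: the function names and sample records are hypothetical, and the fuzzy scorer uses Python's `difflib.SequenceMatcher` as a stand-in for a real similarity plugin such as Jaro-Winkler, which it does not implement.

```python
from difflib import SequenceMatcher


def string_exact(a: str, b: str) -> float:
    """Case-insensitive exact match: 1.0 if equal, 0.0 otherwise."""
    return 1.0 if a.casefold() == b.casefold() else 0.0


def string_fuzzy(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] from difflib (a stand-in, not Jaro-Winkler)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


# Hypothetical records: an existing contact and a new registration.
existing = {"name": "Jane Doe", "email": "jane@example.com"}
registered = {"name": "jane doe", "email": "jdoe@example.org"}

name_score = string_exact(existing["name"], registered["name"])     # names agree
email_score = string_exact(existing["email"], registered["email"])  # emails differ

# A near-miss spelling still scores high with the fuzzy scorer.
name_fuzzy = string_fuzzy("Jon Smith", "John Smith")
```

Here the name field scores 1.0 and the email field scores 0.0, so whether the pair is consolidated depends on how the fields are weighted and on the threshold chosen for a match.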

bluegeek9 commented:

I think it would be best to make this generic for all entities, as a separate project. The system would show a score so a user can decide whether to merge contacts. CRM would integrate with, but not require, the duplicate detection project.

I want this to work with, but not require, workspace, so contacts can be merged and reviewed without impacting other users.

There are shared emails, like support@acme.com. There should be a way to mark some things as not the same, with some kind of rules.

There is an existing project for merging entities; not sure of its current state.

I think each entity should be able to assign weights to different fields and use different plugins for scoring.
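
As a rough sketch of per-field weights with different scoring plugins, the following Python combines field scores into a weighted average. Everything here is an assumption for illustration: the `weighted_score` helper, the configuration shape, the field names, and the choice of a weighted average are not taken from any existing module.

```python
from typing import Callable, Mapping, Tuple

# A scorer takes two field values and returns a similarity in [0, 1].
Scorer = Callable[[object, object], float]


def exact_ci(a: object, b: object) -> float:
    """Case-insensitive exact match on the string form of the values."""
    return 1.0 if str(a).casefold() == str(b).casefold() else 0.0


def linear(max_delta: float) -> Scorer:
    """Build a linear-decay scorer that reaches 0.0 at max_delta."""
    def score(a: object, b: object) -> float:
        return max(0.0, 1.0 - abs(float(a) - float(b)) / max_delta)
    return score


def weighted_score(
    a: Mapping[str, object],
    b: Mapping[str, object],
    field_config: Mapping[str, Tuple[float, Scorer]],
) -> float:
    """Weighted average of per-field scores; fields missing on either side are skipped."""
    total, total_weight = 0.0, 0.0
    for field_name, (weight, scorer) in field_config.items():
        if field_name not in a or field_name not in b:
            continue
        total += weight * scorer(a[field_name], b[field_name])
        total_weight += weight
    return total / total_weight if total_weight else 0.0


# Hypothetical configuration for a contact entity: (weight, scorer) per field.
CONTACT_CONFIG = {
    "name": (3.0, exact_ci),
    "email": (5.0, exact_ci),
    "birth_year": (2.0, linear(max_delta=5)),
}

a = {"name": "Jane Doe", "email": "jane@example.com", "birth_year": 1980}
b = {"name": "jane doe", "email": "jdoe@example.org", "birth_year": 1980}
score = weighted_score(a, b, CONTACT_CONFIG)  # (3*1 + 5*0 + 2*1) / 10
```

Normalizing by the total weight keeps the result in the 0 to 1 range, so the same threshold can be reused across entity types with different field sets.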

Handling duplicate detection during contact creation, such as during registration, is a good scenario. I assume that if auto-merge is a feature, a match would need to meet a minimum threshold, and we would need to provide the exemption rules mentioned earlier.
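
A minimal sketch of how a minimum threshold and an exemption rule for shared emails might interact is shown below. The `MergePolicy` class, the threshold values, and the three outcomes are all hypothetical. It also assumes that an exempted pair is never auto-merged but may still be queued for manual review.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class MergePolicy:
    # Assumed thresholds; real values would be configurable per entity type.
    auto_merge_threshold: float = 0.95
    review_threshold: float = 0.6
    # Shared mailboxes that must not count as proof of identity (stored lowercase).
    shared_emails: Set[str] = field(default_factory=set)

    def decide(self, score: float, email_a: str, email_b: str) -> str:
        """Return 'auto_merge', 'review', or 'ignore' for a candidate pair."""
        same_email = email_a.casefold() == email_b.casefold()
        exempt = same_email and email_a.casefold() in self.shared_emails
        if score >= self.auto_merge_threshold and not exempt:
            return "auto_merge"
        if score >= self.review_threshold:
            return "review"
        return "ignore"


policy = MergePolicy(shared_emails={"support@acme.com"})
decision = policy.decide(0.97, "support@acme.com", "Support@acme.com")
```

In this sketch, two contacts that share support@acme.com are sent to review even with a score of 0.97, while the same score on a personal address would be merged automatically.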

In terms of priorities, migrating data into CRM is my current priority. I don't know when this will be the priority. It is a big elephant. Many features.

Is this something you are looking to take on?

mortona2k commented:

That's a great idea, but I don't have the bandwidth atm.

I just discovered a dedupe submodule in redhen that might have a useful UI: https://git.drupalcode.org/project/redhen/-/tree/2.x/modules/redhen_dedu...