Problem/Motivation
We need a way to detect duplicate contacts, merge them, and update references that point to them.
Steps to reproduce
Proposed resolution
Remaining tasks
| Field Type | Plugin ID | Description / Scoring Strategy | Notes / Config Options |
|---|---|---|---|
| Integer | integer_exact | Exact match; 1 if equal, 0 otherwise | None |
| | integer_linear | Linear decay based on difference | max_delta = maximum difference for score 0 |
| | integer_exponential | Exponential decay with difference | decay_base |
| | integer_bucket | Compare in discrete ranges | bucket_size |
| Float / Decimal | number_exact | Exact match | Same as integer_exact |
| | number_linear | Linear decay | max_delta |
| | number_exponential | Exponential decay | decay_base |
| | number_bucket | Discrete buckets | bucket_size |
| Short String | string_exact | Case-insensitive exact match | None |
| | string_jaro_winkler | Fuzzy string similarity | Optional threshold |
| | string_metaphone | Phonetic similarity | For names / nicknames |
| | string_soundex | Phonetic similarity | Alternative to metaphone |
| Text / Long Form | text_exact | Exact string match | Optional min length |
| | text_tfidf | TF-IDF vector + cosine similarity | Config: min_length, max_comparisons, stopwords |
| | text_semantic | Embedding / ML-based similarity | Optional / heavy dependency |
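
To make the numeric strategies in the table concrete, here is a minimal, language-neutral sketch in Python. Nothing here is an existing API: the function names simply reuse the proposed plugin IDs, and the exact formulas are assumptions. In particular, the table does not say how `decay_base` is applied, so this sketch assumes a base between 0 and 1 raised to the absolute difference.

```python
def integer_exact(a: int, b: int) -> float:
    """Score 1.0 if the values are equal, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


def integer_linear(a: int, b: int, max_delta: float) -> float:
    """Score falls linearly from 1.0 to 0.0 as the difference approaches max_delta."""
    if max_delta <= 0:
        raise ValueError("max_delta must be positive")
    return max(0.0, 1.0 - abs(a - b) / max_delta)


def integer_exponential(a: int, b: int, decay_base: float) -> float:
    """Assumed formula: decay_base ** |a - b|, with 0 < decay_base < 1."""
    if not 0.0 < decay_base < 1.0:
        raise ValueError("decay_base must be between 0 and 1")
    return decay_base ** abs(a - b)


def integer_bucket(a: int, b: int, bucket_size: int) -> float:
    """Score 1.0 if both values fall into the same fixed-width bucket."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return 1.0 if a // bucket_size == b // bucket_size else 0.0
```

Every function returns a score between 0 and 1, which would let a caller combine scores from different plugins on one scale.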
Comments
Comment #2
bluegeek9 commented
I think we should look to the community for frameworks that detect duplicate entities and can merge them.
Comment #3
bluegeek9 commented
We need a community module to serve as the framework.
Comment #4
bluegeek9 commented
Comment #5
bluegeek9 commented
Maybe we can use this namespace.
https://www.drupal.org/project/similarity
Comment #6
bluegeek9 commented
Deduping when the fields are encrypted is an additional consideration.
Comment #7
mortona2k commented
Dedupe, not duplicate.
This came up for a project today: if a contact registers and the email doesn't match but the name does, how do we consolidate?
Comment #8
bluegeek9 commented
I think it would be best to make this generic for all entities, as a separate project. The system would show a score so a user can decide whether to merge contacts. CRM would integrate with, but not require, the duplicate detection project.
I want this to work with, but not require, workspace, so contacts can be merged and reviewed without impacting other users.
There are shared emails, like support@acme.com. There should be a way to mark some things as not the same; some kind of rules.
There is an existing project for merging entities; not sure of its current state.
I think each entity should be able to assign weights to different fields and use different plugins for scoring.
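
As a rough illustration of per-field weights with pluggable scorers, here is a minimal Python sketch. It is only an assumption about how such a system could combine scores: `weighted_score`, the config shape, and the field names are hypothetical, and `string_exact` just mirrors the plugin ID from the issue summary.

```python
from typing import Callable, Dict, Tuple

# A scorer takes two field values and returns a similarity between 0.0 and 1.0.
Scorer = Callable[[object, object], float]


def string_exact(a: object, b: object) -> float:
    """Case-insensitive exact match, as described for the string_exact plugin."""
    return 1.0 if str(a).casefold() == str(b).casefold() else 0.0


def weighted_score(
    left: Dict[str, object],
    right: Dict[str, object],
    config: Dict[str, Tuple[float, Scorer]],
) -> float:
    """Weighted average of per-field scores; fields missing on either side are skipped."""
    total = 0.0
    total_weight = 0.0
    for field, (weight, scorer) in config.items():
        if field not in left or field not in right:
            continue
        total += weight * scorer(left[field], right[field])
        total_weight += weight
    return total / total_weight if total_weight else 0.0


# Hypothetical configuration: email counts three times as much as name.
contact_config = {
    "email": (3.0, string_exact),
    "name": (1.0, string_exact),
}

a = {"email": "ann@example.org", "name": "Ann Lee"}
b = {"email": "a.lee@example.org", "name": "ann lee"}
score = weighted_score(a, b, contact_config)  # name matches, email does not
```

With these assumed weights, the registration case from comment #7 (same name, different email) scores 0.25, which would show up as a weak candidate for review.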
Handling duplicate detection during contact creation, such as during registration, is a good scenario. I assume that if auto-merge is a feature, a match would need to meet a minimum threshold, and we would need to provide the exemption rules mentioned earlier.
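
A minimal Python sketch of how a threshold and an exemption rule could interact. Everything here is hypothetical: the `Decision` values, the threshold numbers, and the shared-address rule are assumptions for illustration, not an agreed design.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Decision(Enum):
    AUTO_MERGE = "auto_merge"
    REVIEW = "review"
    DISTINCT = "distinct"


# Example shared address taken from this thread; a real list would be configurable.
SHARED_EMAILS: FrozenSet[str] = frozenset({"support@acme.com"})


def decide(
    score: float,
    left: Dict[str, str],
    right: Dict[str, str],
    auto_threshold: float = 0.9,    # assumed value
    review_threshold: float = 0.5,  # assumed value
    shared_emails: FrozenSet[str] = SHARED_EMAILS,
) -> Decision:
    """Map a similarity score to a decision, never auto-merging on a shared email."""
    email_left = left.get("email", "").casefold()
    email_right = right.get("email", "").casefold()
    # Exemption rule: two contacts sharing a known shared mailbox are not
    # merged automatically, however high the score is.
    exempt = email_left == email_right and email_left in shared_emails
    if score >= auto_threshold and not exempt:
        return Decision.AUTO_MERGE
    if score >= review_threshold:
        return Decision.REVIEW
    return Decision.DISTINCT
```

In this sketch an exempt pair is sent to human review instead of being treated as distinct; whether that is the right behavior is an open design question.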
In terms of priorities, migrating data into CRM is my current priority. I don't know when this will be the priority. It is a big elephant. Many features.
Is this something you are looking to take on?
Comment #9
mortona2k commented
That's a great idea, but I don't have the bandwidth atm.
I just discovered a dedupe submodule in redhen that might have a useful UI: https://git.drupalcode.org/project/redhen/-/tree/2.x/modules/redhen_dedu...