Prevent duplicate sibling contact capture #9601

Open
ChinHairSaintClair opened this issue Nov 1, 2024 · 3 comments
Labels
Type: Feature Add something new

Comments

@ChinHairSaintClair
Contributor

ChinHairSaintClair commented Nov 1, 2024

Is your feature request related to a problem? Please describe.
Currently, there appears to be no built-in deterrent against creating records with names similar to existing siblings.

Describe the solution you'd like
Prevent duplicate place/person creation and display possible duplicates for consideration. On record submission (through create or edit flow), we want to show the possible duplicate items to our user. They can then navigate to the possible duplicate item via a link, and proceed with record changes there, or circumvent the duplicate check & proceed with record submission.

Describe alternatives you've considered
Despite improving our search functionality, and training the CHWs to use it, usage of the search feature before creating new records remains low, leading to frequent duplicates.

Additional context
We noticed that our CHWs either forget they've previously captured items or miss previously captured items because they were slightly mistyped. This has resulted in quite a few duplicate records being created on all user-created levels of our hierarchy. We want to fix this at the source before tasks are rolled out, to make sure no unnecessary or incorrect tasks fill up our CHWs' worklists. This will also improve the accuracy of our data for reporting purposes.

We have a working prototype that we will soon upstream, which can produce the following:
[Image: screenshot from the prototype]

Please see the following related discussions for more info:
https://forum.communityhealthtoolkit.org/t/mitigate-duplicate-data-capture/3313
#6363

As a "damage control" step, as discussed with the medic team, we plan to use Databricks (our tool primarily responsible for pulling CouchDB data into our community database) to also push "flags" to potential duplicate items. That will, in turn, cause tasks to trigger in the app. CHWs are then expected to confirm or deny the possible duplicate and determine what should happen to the record (delete, merge, other).

@jkuester
Contributor

Overview

First, I want to lay out the space and the existing problem set so that it is clear exactly what should be addressed in this issue. I think there are three separate, but related, problems to solve under the heading of "Duplicate prevention":

  1. Prevent user from creating a duplicate contact - (this current issue)
    • Runs in the webapp. Works offline.
    • Only checks against contact's siblings due to performance and data-access considerations.
  2. Flag/merge duplicate contacts on the server - Prevent and/or merge duplicate contacts #6363
    • Sentinel transition? Historically, not possible to do much interesting here because of the limitations of Couch views. However, Lucene may change the game.
    • Checks contacts across entire instance. Flags for resolution by admin (or auto-merges if possible).
  3. Prevent user from creating duplicate reports - Workflow to merge likely duplicates upon submission #6309
    • Proposed approach is to warn/block users when they try to create a new report for a contact that already has the same report existing in a particular time frame.

Details

With that summary in mind, we can dig into the details of how to specifically prevent a (potentially offline) user from creating a duplicate contact. The prototype PR provides a great starting point for this conversation. My goal here is to synthesize/generalize the details from that prototype into a design summary that is easier for folks to understand and discuss (also including some of my own suggestions and editorializing). Once we converge on a particular design approach, we can return to the actual code and make that happen.

Configuration

Since different types of contacts contain different levels of data that might need to be checked for duplication, it makes sense to configure the rules for dupe checking individually for each contact type. I think it would be logical to include this config in the app-settings contact_types config.

For each contact type, we need to define the rules for what constitutes a duplicate. The most flexible way to do this would be to allow a custom function to be included in the config that accepts two contacts and returns a boolean indicating if the contacts are duplicate or not. The Levenshtein library could be in-context for this function's logic to make use of. This would allow for the dupe logic to be as complex or simple as necessary. The main downside of this function-based approach is that I guess it would be almost impossible to re-use this configuration for any kind of server-side dupe checking (in any future solution for #6363). It is not feasible from a performance perspective to run a function like this against every contact in the db. (Honestly, though, the more I consider this, the less I think we should try to optimize this config for any kind of server-side reuse... It seems like the most likely possibilities for server-side dupe checking functionality in the near future are not going to happen with Couch data, but via an external data store, e.g. DOT or Databricks)
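To make the function-based option a bit more concrete, here is a minimal TypeScript sketch. Everything in it is an assumption for illustration: the `Contact` shape, the `isDuplicate` name and signature, and the threshold of 2 edits are hypothetical, and the `levenshtein` helper is a plain textbook implementation standing in for whatever library would actually be in context.

```typescript
// Hypothetical minimal contact shape; real contact docs carry many more fields.
interface Contact {
  name: string;
  phone?: string;
}

// Textbook dynamic-programming Levenshtein distance (stand-in for a library).
function levenshtein(a: string, b: string): number {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(
        (prev[j] ?? 0) + 1,        // deletion
        (curr[j - 1] ?? 0) + 1,    // insertion
        (prev[j - 1] ?? 0) + cost  // substitution (free when the chars match)
      ));
    }
    prev = curr;
  }
  return prev[b.length] ?? 0;
}

const normalize = (value: string): string => value.trim().toLowerCase();

// Hypothetical rule: same phone number, OR names within 2 edits of each other.
function isDuplicate(candidate: Contact, existing: Contact): boolean {
  const samePhone = candidate.phone !== undefined && candidate.phone === existing.phone;
  const closeName = levenshtein(normalize(candidate.name), normalize(existing.name)) <= 2;
  return samePhone || closeName;
}
```

Because the rule is ordinary code, it can combine fields freely (here with an OR), which is the flexibility this option is meant to buy.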

The prototype PR presents an alternative approach where, instead of a function, for each contact type we define which fields should be compared from the contact docs and which algorithm (e.g. Levenshtein vs NormalizedLevenshtein) should be used to compare them. (Note that I think it would be best to hold off on including the queryParams functionality or any other kind of override config in the initial MVP for duplicate checking. The goal is to add the minimum viable functionality first and then we can extend it further in future PRs.) One limitation of this declarative approach is that it does not really allow for inter-field dependencies (e.g. specifying that contacts could have the same name or the same phone but not both). Also, unlike the approach in the PR, we probably need to be able to define the comparison algorithm for each field. Levenshtein is great for string fields, but we might need to compare things like dates or numbers in the future.
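For comparison, a declarative config with a per-field algorithm might look roughly like the sketch below. This is not the prototype's schema: the `FieldRule` shape, the algorithm names, the `maxDistance` property, and the `duplicateCheck` object are all hypothetical, and the distance functions are textbook implementations rather than a specific library.

```typescript
type Doc = Record<string, unknown>;

// Textbook Levenshtein distance (stand-in for a library implementation).
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min((prev[j] ?? 0) + 1, (curr[j - 1] ?? 0) + 1, (prev[j - 1] ?? 0) + cost));
    }
    prev = curr;
  }
  return prev[b.length] ?? 0;
}

// Distance scaled to [0, 1] by the longer string's length; 0 means identical.
function normalizedLevenshtein(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : levenshtein(a, b) / longest;
}

// Hypothetical rule shape: each field names its own comparison algorithm.
type FieldRule =
  | { field: string; algorithm: "exact" }
  | { field: string; algorithm: "levenshtein"; maxDistance: number }
  | { field: string; algorithm: "normalizedLevenshtein"; maxDistance: number };

function fieldMatches(rule: FieldRule, a: Doc, b: Doc): boolean {
  const left = a[rule.field];
  const right = b[rule.field];
  if (left === undefined || right === undefined) return false; // missing data never matches
  if (rule.algorithm === "exact") return left === right;
  const distance = rule.algorithm === "levenshtein"
    ? levenshtein(String(left), String(right))
    : normalizedLevenshtein(String(left), String(right));
  return distance <= rule.maxDistance;
}

// A contact is flagged only when every configured field matches (fields are ANDed).
function matchesAllRules(rules: FieldRule[], a: Doc, b: Doc): boolean {
  return rules.length > 0 && rules.every((rule) => fieldMatches(rule, a, b));
}

// Hypothetical config, keyed by contact type id.
const duplicateCheck: Record<string, FieldRule[]> = {
  person: [
    { field: "name", algorithm: "normalizedLevenshtein", maxDistance: 0.2 },
    { field: "date_of_birth", algorithm: "exact" },
  ],
};
```

Note that the rules are simply ANDed together, so this shape cannot express "same name or same phone but not both"; that would need either a richer config grammar or the function-based option.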

I think more discussion about the best form of configuration would be valuable here! One thing I would love to see is a simple way to enable some "default" dupe checks (e.g. something that just validates the name field).

Functionality

Moving on to how this should actually work in the webapp: I think the example from the PR is a solid approach. Opening a new contact form also triggers a lookup of the new contact's existing sibling contacts (via medic-client/contacts_by_parent). When the contact form is submitted, the configured duplicate checking is run against those sibling contacts to determine if the new contact is a duplicate of any existing ones. Checking against the direct siblings seems like the most reasonable balance of functionality and performance. An offline user can only know about what is visible on their device, anyway, so it is impossible to guarantee the new contact will be unique for the whole instance. This means we have to stop checking somewhere. Also, we need to do the check at the end of the form (once the new contact info has been entered), but before the form has been closed (so the user can go back and change data if necessary).

I am not sure if this is covered in the current PR functionality or not, but I think it will be important to do the same duplicate checking when editing a contact.
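The flow described in this section can be sketched as follows. This is only an outline under stated assumptions: `prepareDuplicateCheck`, `SiblingLookup`, and `DuplicateRule` are hypothetical names, the lookup is injected rather than calling any real webapp service, and the contact shape is reduced to the minimum.

```typescript
// Hypothetical minimal contact shape; a contact being created has no _id yet.
interface Contact {
  _id?: string;
  name: string;
}

// Hypothetical lookup of a parent's children. In the webapp this would be
// backed by the medic-client/contacts_by_parent view mentioned above.
type SiblingLookup = (parentId: string) => Promise<Contact[]>;

// Whatever rule ends up being configured, reduced here to a predicate.
type DuplicateRule = (candidate: Contact, existing: Contact) => boolean;

// Call when the form opens: the sibling lookup starts right away. The returned
// function is meant to be called at submit time, before the form is closed.
function prepareDuplicateCheck(
  parentId: string,
  lookup: SiblingLookup,
  rule: DuplicateRule
): (candidate: Contact) => Promise<Contact[]> {
  const siblingsPromise = lookup(parentId);
  return async (candidate: Contact): Promise<Contact[]> => {
    const siblings = await siblingsPromise;
    return siblings.filter(
      // When editing, skip the stored copy of the contact being edited.
      (sibling) => sibling._id !== candidate._id && rule(candidate, sibling)
    );
  };
}
```

Comparing `_id` values is one simple way to let the same check serve both the create and edit flows, since an edited contact would otherwise always match its own stored doc.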

UX

As demonstrated in the screenshot above, the current UX in the PR is to present the user with a list of the found duplicates (along with links to go to their profile page). One incremental enhancement that might make sense here would be to present the duplicates more as a proper list of contacts (or even "contact cards") with various important identifying information displayed (instead of highlighting specifically which data is being matched for the duplicate check). If a user clicks into one of the other contacts, they will lose all the data they have entered into the contact form, so we want to give them as much info as is feasible about the contacts before they navigate away from the form.

Instead of including the list of contacts inline in the form page, it might be better to pop a modal containing the list. 🤔 (Either way, we should be able to use an xforms-value-changed listener to clear the dupe error and list of contacts when the user updates a value in the form.)

Another consideration is what the default behavior should be if duplicate contacts are found. Should we warn the user, but still let them submit the new contact? Or, should we totally disable the submit button and prevent the new contact from being added? Ultimately, this is something that we could make configurable, but it would be good to have a simple approach in the MVP and add configuration later...


@ChinHairSaintClair @fardarter Please weigh in here with anything that I have missed or mistaken, or any additional thoughts or considerations that you have!

@garethbowen let me know what you think about this proposed approach. What other stake-holders should we pull into this conversation to make sure we can maintain momentum on this feature?

@garethbowen
Member

Should we warn the user, but still let them submit the new contact?

I think this is a must. We can't assume anything about naming conventions and it's quite possible for two people in the same family to have the same name. The point of this feature is to stop a CHW doing the wrong thing accidentally, not prevent them from doing an action on purpose.

Prevent user from creating duplicate reports

I'm really interested in seeing if we can make this workflow generic enough that it works for all report types, not just contact creation. We have so many examples of duplicates being created accidentally across all types that this would be powerful. I worry that some of the thoughts here (like Levenshtein distance, using multiple fields, etc) are over and above what's actually needed. Can we just check the exact name and family? For reports, can we just check report code, reported date, and subject? If we can simplify it enough, can it work out-of-the-box without configuration?

When the contact form is submitted, the configured duplicate checking is run against those sibling contacts to determine if the new contact is a duplicate of any existing ones.

The ideal solution would warn about duplicates as early as possible. Some forms are very long and forcing a user to enter all the details before telling them about dupes we could have found after the first input field was complete would be a very frustrating UX. In my head this looks like a validation error on the name input with a checkbox to bypass the check, but implementing it as an enketo validation would be difficult I think? But however it's done, notifying as early as possible would be a huge win.

What other stake-holders should we pull into this conversation to make sure we can maintain momentum on this feature?

I think the eCHIS Kenya team would be interested in this too.

@jkuester
Contributor

I'm really interested in seeing if we can make this workflow generic enough that it works for all report types, not just contact creation. ... For reports, can we just check report code, reported date, and subject? If we can simplify it enough, can it work out-of-the-box without configuration?

These were my initial thoughts as well, but then I was convinced by your comment on the other issue that the "duplicate report" workflow was quite different from duplicate contacts and perhaps there is not much overlap in the configs/logic. Specifically, detecting a "duplicate report" is probably much less about the contents of the report than, just as you said, the type of report, who it is for, and when it is submitted. Basically, for reports we would be looking for other reports of the same type that were created for the same contact in a particular timeframe. These checks can happen up-front before even loading the form. All of this is pretty different from the "duplicate contact" flow where the most important thing is the content that the user enters for the new contact. So we cannot do an upfront check for a duplicate contact. Also, it is likely that more config is necessary to allow the contact dupe checking to be really useful. (Maybe we can find some sensible pre-sets, but it seems like lots of tuning may be needed for some cases...)
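To show how different the report check is, here is a minimal sketch of the "same type, same contact, within a timeframe" lookup. The flattened `Report` shape (in particular `subjectId`) and the `findRecentSameReports` name are assumptions for illustration, not how report docs are actually structured or queried.

```typescript
// Hypothetical flattened report shape used only for this sketch.
interface Report {
  form: string;          // report type, e.g. the form code
  subjectId: string;     // who the report is about
  reported_date: number; // milliseconds since the epoch
}

// Returns existing reports of the same type for the same subject that were
// submitted within the last windowMs milliseconds before `now`.
function findRecentSameReports(
  existing: Report[],
  form: string,
  subjectId: string,
  now: number,
  windowMs: number
): Report[] {
  return existing.filter((report) => {
    const age = now - report.reported_date;
    return report.form === form
      && report.subjectId === subjectId
      && age >= 0
      && age <= windowMs;
  });
}
```

None of the inputs depend on what the user types into the form, which is why this check could run before the form is even loaded, unlike the contact check.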

Because of this, I am skeptical of a "one-size-fits-all" solution for dupe-doc checking that covers both reports and contacts. (And even if we do decide to go that route, we would not need to support dupe-checking reports in this MVP PR.) It seems like the most important thing to decide at this point is if we think report dupe checking will need to allow for the same level of flexible configuration as contacts (e.g. specifying which fields should be dupe checked). If so, then yeah it probably makes sense to at least design the contact dupe-checking to be extended later for also checking reports. If not, then I think we probably just leave the report dupe checking to its own issue and not worry about it here.

The ideal solution would warn about duplicates as early as possible. .... but implementing it as an enketo validation would be difficult I think?

Okay, this got me thinking that maybe I have been coming at this from the wrong direction! What if, instead of configuring the dupe-checking in the app-settings, we did it in the actual form xlsx files? In the form config we could use a custom column to mark all the fields that should be dupe-checked (and maybe even indicate what comparison algorithm to use). Then we could have a custom Enketo widget that would listen for changes to any of these fields and trigger the dupe-checking logic when any of the values change. (Then, the widget could trip the Enketo validation logic to make the error look like a constraint violation if we wanted to go in that direction.) The data/logic flow is going to be quite a bit more complex (e.g. how to get the docs to check against, how the dupe-checking logic knows all the fields that are supposed to be included, what to do if we actually find a duplicate, etc).

The main downside I see to configuring things in the form is that for contact forms it would be important that only the fields that map directly to the contact doc are eligible for dupe checking. Fields in the inputs group or the intro group cannot be dupe-checked. But, it seems feasible to include a validation in cht-conf to help prevent this.
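A validation along those lines could be sketched as below. This is purely hypothetical: the `dupe_check` column name, the flattened `SurveyRow` shape, and the `collectDupeCheckFields` function are invented for illustration and do not correspond to anything that exists in cht-conf today.

```typescript
// Hypothetical flattened view of one XLSForm survey row.
interface SurveyRow {
  path: string;        // e.g. "person/name" or "inputs/user/contact_id"
  dupe_check?: string; // invented column: algorithm name, empty when unchecked
}

interface DupeCheckFields {
  fields: Map<string, string>; // field name on the contact doc -> algorithm
  errors: string[];            // marked fields that cannot be dupe-checked
}

// Collects the fields marked for dupe checking and rejects any marked field
// that lives outside the group that maps onto the contact doc.
function collectDupeCheckFields(rows: SurveyRow[], contactGroup: string): DupeCheckFields {
  const prefix = `${contactGroup}/`;
  const fields = new Map<string, string>();
  const errors: string[] = [];
  for (const row of rows) {
    const algorithm = (row.dupe_check ?? "").trim();
    if (algorithm === "") continue; // field is not marked for dupe checking
    if (row.path.startsWith(prefix)) {
      fields.set(row.path.slice(prefix.length), algorithm);
    } else {
      errors.push(`"${row.path}" is marked for dupe checking but is outside the "${contactGroup}" group`);
    }
  }
  return { fields, errors };
}
```

The same pass that builds the list of checked fields also surfaces misplaced markers, so a config tool could fail the upload when `errors` is non-empty.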

(Tell me if I am wrong here, though! 😅 I feel like I am seeing Enketo widget + custom xlsx column as the solution to all problems lately....)

I think the eCHIS Kenya team would be interested in this too.

@eljhkrr just putting this on your radar! Please jump in to the discussion here if you have any specific concerns, requirements, or ideas!
