KYC-Match: Normalisation - how to normalise values of attributes before comparison #157
Comments
I think I heard during the meeting on 2024-10-15 that Mobile Connect has some normalisation guidelines. If that is correct, could anyone share these MC guidelines with us, please? BR |
In the Netherlands, we use the following 5 normalisation steps for Mobile Connect KYC Match, for the family name:
For Mobile Connect we also truncate the result to 20 characters and apply a hash function. However, we think this is not needed for the CAMARA version. Step 2 and step 4 in the above normalisation are highly language- or country-specific. We use step 2 because these characters are not used much in Dutch, and sometimes our CRM systems do not support entering them, or, with manual entry done in the past, people entered these characters instead of what it should be because they did not know how to type them on a keyboard.
Step 4 is used mainly for Dutch, because many Dutch family names contain prepositions, like "van der" in "van der Hout". Many of these can be abbreviated, so in practice "van der" also appears as "vd" or "v.d.", or with spelling mistakes such as "van de". A Dutch reader would typically accept these differences as trivial and look only at the "Hout" in the family name. Because comparing hashes is very sensitive to spelling mistakes, we decided to remove the prepositions in step 4. For the CAMARA Jaro-Winkler comparison, step 2 and step 4 may no longer be needed, but we would first like to A/B test this in practice with some customers to see the effect on the match rate. So for now, we are keeping steps 2 and 4.
In Dutch family names there is also one issue that we have never been able to solve: when somebody gets married, they can also use the family name of their partner, and they can choose the order in which they place the family names. So you can, for example, see "Jansen", "Jansen-Hout", "Hout-Jansen", and sometimes also an abbreviation, as in "Jansen ev Hout". We never managed to solve this in Mobile Connect, but with plain-text comparison we may give it a try by matching all allowed combinations and seeing which gives the highest match rate. |
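The two Dutch-specific points above (dropping prepositions, and accepting either order of a married double name) can be sketched in Python as follows. This is only an illustration under stated assumptions: the preposition and joiner lists are hypothetical and incomplete, the function names are invented for this example, and exact equality between variants stands in for whatever similarity score an operator would actually use.

```python
import re

# Hypothetical, incomplete list of Dutch family-name prepositions and
# abbreviations; a real deployment would need a reviewed list.
PREPOSITIONS = {"van", "de", "der", "den", "vd", "v.d.", "ter", "ten", "te"}
# Hypothetical tokens joining two family names, as in "Jansen ev Hout".
JOINERS = {"ev", "e.v."}


def strip_prepositions(name: str) -> str:
    """Lowercase a family name and drop whole-token prepositions."""
    tokens = name.lower().split()
    return " ".join(t for t in tokens if t not in PREPOSITIONS)


def name_variants(name: str) -> set[str]:
    """Return the forms a (possibly double) family name may be stored in."""
    parts = [
        p
        for p in re.split(r"[-\s]+", strip_prepositions(name))
        if p and p not in JOINERS
    ]
    variants = set(parts)  # each family name on its own
    if len(parts) == 2:
        a, b = parts
        variants |= {a + b, b + a}  # both orders of the double name
    elif parts:
        variants.add("".join(parts))
    return variants


def any_variant_matches(provided: str, on_record: str) -> bool:
    """True if the two names share at least one allowed variant."""
    return bool(name_variants(provided) & name_variants(on_record))
```

With these assumptions, `any_variant_matches("Hout-Jansen", "Jansen ev Hout")` and `any_variant_matches("v.d. Hout", "van der Hout")` both hold. Whether accepting a single shared family name is too permissive is exactly the kind of question the proposed A/B test would have to answer.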
Sharing for completeness: we have the Vodafone normalisation rules (for the Mobile Connect KYC version) documented here: We're keen on the concept (I think it was @GillesInnov35 who mentioned it) for plain-text KYC Match with Score of allowing the submitter to simply provide the text without normalisation, with the MNO then applying the normalisation rules to both inputs.
@HuubAppelboom We would be very interested to understand how your Jaro-Winkler scores perform with and without normalisation in your study (we only have UK-based results today). |
@KevScarr From current results with NL customers, we see a difference in match rates between the name fields and the date of birth field. The name fields are typically 10% lower in match rate, which can probably be fixed by the Telco doing the normalisation. Jaro-Winkler may close part of this gap, but we may also need to solve the issue of people getting married and combining family names to close it even further. I fully support the idea that the Telcos should do the normalisation for all the reasons you mention above. |
We've been reviewing the normalisation rules in the context of 'score', and we don't think it makes any sense to truncate the fields, so we would propose the following ruleset for adoption as a base standard:
And we would support the principle that the MNO applies this to both input sets, i.e. the calling party doesn't need to normalise. |
Regarding the prepositions, the best approach is probably to do an A/B test with the first customers that have fuzzy name matching logic, and see whether the removal of prepositions is still necessary. |
Thanks @KevScarr, @HuubAppelboom for your proposal.
At Orange, the recommended normalization steps are the same. |
@KevScarr @GillesInnov35 Do you also remove any spaces when you remove any non-alphanumeric characters? For the Netherlands we probably still need to remove the prepositions in the family name. We propose to A/B test this with the first customers that will use the CAMARA version, so we can make an informed decision on this. |
Hi @HuubAppelboom, yes, spaces are removed in our solution. |
@HuubAppelboom Same as @GillesInnov35 we (Vodafone) also remove spaces. |
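The behaviour agreed above (the API provider normalises both inputs, spaces included, and then scores them) could look roughly like the Python sketch below. The exact ruleset is not reproduced in this thread, so the three steps in `normalise` (lowercasing, stripping diacritics, dropping every non-alphanumeric character including spaces) are assumptions, as are the function names and the scaling of the score to 0-100. The Jaro-Winkler code is a plain textbook implementation with the usual prefix weight of 0.1, not any operator's production logic.

```python
import unicodedata


def normalise(value: str) -> str:
    """Assumed base normalisation: lowercase, strip diacritics, keep only
    letters and digits (so spaces and punctuation are removed)."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks.lower() if c.isalnum())


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro-Winkler similarity: Jaro boosted by a common prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def match_score(provided: str, on_record: str) -> int:
    """Normalise BOTH inputs on the provider side, then score 0-100."""
    return round(100 * jaro_winkler(normalise(provided), normalise(on_record)))


print(match_score("O'Brien", "OBRIEN"))  # 100: both normalise to "obrien"
```

Because the provider normalises both sides, the calling party can submit the text exactly as it was captured.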
Hi all, In Telefonica, the normalization we apply follows the same Mobile Connect rules shared above:
Telefonica applies normalization to both the MNO data and the customer data. Regarding the proposed changes:
Question about the new attributes:
|
@claraserranosolsona Purely a "simplification" step: with 'score' the fields don't need to be truncated, since a high score would also be considered a match (i.e. 85+ by customers). |
Hi @claraserranosolsona, I see that truncation is one of the steps proposed by TEF. Is it only a security requirement for the back-end service? As far as I know, there isn't any RFC or rule for string truncation (length value) that depends on the kind of data the string contains. If I understand correctly, any string truncation will depend on the MNO's solution and so will be specific to it. The point of email string normalization has also been raised at Orange. As mentioned by @claraserranosolsona, normalization should be specialized for the email format. For example, skip the following characters: “@”, “.”, “-”, “_” BR |
Hi all, Thank you for your discussions on normalization. I don't think we need email string normalization. I feel that the example of skipping the characters “@”, “.”, “-”, “_” may have a negative impact on comparison results. I also have another question. The way of normalizing attribute values that has been discussed seems to be very language-dependent, so, first of all, it will be an optional feature. And how do you think it will be specified? Will it be specified in the API specification (YAML), or in some guideline document separate from the YAML? Thanks, |
@ToshiWakayama-KDDI In the Netherlands we have recorded things like normalisation in a local spec, in a local industry organisation between the active MNOs, and KPN has also included it in the specs we provide to aggregators (including all the extras and specials we provide on top of the standard specification). I don't think it is a good idea to try to put every normalisation in the YAML, also because it is probably not that relevant for each and every developer. It may be a good idea to also discuss this with the GSMA and the CAMARA project organisation. |
Hi @ToshiWakayama-KDDI, @HuubAppelboom, I agree with Huub; I also think that normalization should not be specified in the YAML file. BR |
@ToshiWakayama-KDDI, I'm sorry, I made a big mistake in my previous comment: for email addresses the idea was to avoid skipping the following characters: “@”, “.”, “-”, “_”. |
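A small Python sketch may make the concern about email addresses concrete. It assumes a generic normaliser that drops every non-alphanumeric character, and a hypothetical email-specific one that only trims whitespace and lowercases (treating the local part as case-insensitive is itself an assumption). Neither function is taken from any operator's rules.

```python
def strip_non_alnum(value: str) -> str:
    """Generic normaliser (assumed): lowercase, keep letters and digits only."""
    return "".join(c for c in value.lower() if c.isalnum())


def normalise_email(value: str) -> str:
    """Email-specific normaliser (assumed): trim and lowercase only,
    preserving "@", ".", "-" and "_"."""
    return value.strip().lower()


first = "a.b@example.com"
second = "ab@example.com"

# Dropping the separators makes two different addresses indistinguishable.
print(strip_non_alnum(first) == strip_non_alnum(second))  # True
# Preserving them keeps the addresses distinct.
print(normalise_email(first) == normalise_email(second))  # False
```

The first comparison is the negative impact mentioned above: two distinct mailboxes would be reported as a match.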
All, what about the following as a general recommendation for normalisation steps to be carried out by the API Provider:
Regards |
Thanks @HuubAppelboom, I wonder whether the order of the normalization steps is important for the score calculation. On our side it is not exactly the same. I think it could have an impact in a few specific cases. I'll check. Gilles |
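One case where the order of the steps does change the result can be shown with a short Python sketch. The two steps and the preposition list below are hypothetical stand-ins chosen for the illustration, not the ruleset proposed in this thread.

```python
# Hypothetical, incomplete preposition list for the illustration.
PREPOSITIONS = {"van", "de", "der", "den"}


def drop_prepositions(name: str) -> str:
    """Remove whole-token prepositions; relies on spaces still being there."""
    return " ".join(t for t in name.split() if t not in PREPOSITIONS)


def drop_non_alnum(name: str) -> str:
    """Remove everything except letters and digits, including spaces."""
    return "".join(c for c in name if c.isalnum())


def apply_steps(name: str, steps) -> str:
    """Lowercase the name, then apply the steps in the given order."""
    result = name.lower()
    for step in steps:
        result = step(result)
    return result


print(apply_steps("van der Hout", [drop_prepositions, drop_non_alnum]))
# hout
print(apply_steps("van der Hout", [drop_non_alnum, drop_prepositions]))
# vanderhout
```

Once the spaces are gone, the prepositions are no longer separate tokens, so a provider that removes spaces first would compare "vanderhout" where another compares "hout".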
Problem description
Some normalisation of the input attributes may be beneficial to harmonise the inputs, where applicable, in order to minimise false negatives when dealing with special characters of a language, or even to make different languages compatible. This topic has not been studied yet in the SP.
Possible evolution
It is understood that normalisation is up to MNOs/API providers; however, some normalisation guidelines may be beneficial for globally located MNOs/API providers. So, this issue is to identify whether such guidelines are beneficial and, if so, to create them.
-> Please correct the above if my understanding is not correct.