KYC-Match: Normalisation - how to normalise values of attributes before comparison #157

Open
ToshiWakayama-KDDI opened this issue Oct 22, 2024 · 19 comments
Labels
enhancement New feature or request

Comments

@ToshiWakayama-KDDI
Collaborator

Problem description

Some normalisation of the input attributes may be beneficial to harmonize the inputs, where applicable, and to minimize false negatives when dealing with the special characters of a language, or even to make different languages compatible. This topic has not been studied yet in the SP.

Possible evolution

It is understood that normalisation is up to MNOs/API providers; however, some normalisation guidelines may be beneficial for globally located MNOs/API providers. So, this issue is to identify whether such guidelines are beneficial and, if so, to create them.

->Please correct the above if my understanding is not correct.

@ToshiWakayama-KDDI ToshiWakayama-KDDI added the enhancement New feature or request label Oct 22, 2024
@ToshiWakayama-KDDI
Collaborator Author

I think I heard during the meeting on 2024-10-15 that Mobile Connect has some normalisation guidelines. If that is correct, could anyone share these MC guidelines with us, please?

BR
Toshi

@HuubAppelboom
Collaborator

In the Netherlands, we use the following 5 normalisation steps for Mobile Connect KYC Match, for the family name:

  1. convert to lowercase
  2. convert characters àèìòùëïâêîôûáéíóúðýãñõšžçåø to aeioueiaeiouaeioudyanoszcao and ä, ö, ü, ß to ae, oe, ue and ss
  3. replace any non-alphanumeric character by spaces
  4. replace prepositions; any substring from the set should be replaced with a space:
    {" af ", " aan ", " auf ", " bij ", " boven ", " de ", " den ", " der ", " d ", " en ", " ev ", " e v ", " het ", " t ", " in ", " onder ", " op ", " over ", " s ", " te ", " ten ", " ter ", " tot ", " uit ", " uijt ", " van ", " vd ", " v ", " ver ", " vom ", " von ", " voor ", " vor ", " zu ", " zum ", " zur "}
    The prepositions should always be separate from the rest of the text, hence the leading and trailing spaces. A leading and a trailing space should also be added at the beginning and end of the string, to avoid missing a preposition there.
  5. remove all spaces
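
The five steps above can be sketched as follows. This is a minimal illustration, not KPN's implementation: the function name `normalise_family_name` is hypothetical, `PREPOSITIONS` is only a single-word subset of the step 4 list, and prepositions are dropped token by token rather than by substring replacement, which gives the same result for single-word prepositions.

```python
# Step 2 tables, copied from the list above.
SINGLE = str.maketrans(
    "àèìòùëïâêîôûáéíóúðýãñõšžçåø",
    "aeioueiaeiouaeioudyanoszcao",
)
MULTI = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

# Step 4: a subset of the preposition list, without the surrounding spaces.
PREPOSITIONS = {"aan", "d", "de", "den", "der", "het", "op", "t", "te",
                "ten", "ter", "van", "vd", "v", "von", "zu"}


def normalise_family_name(name: str) -> str:
    s = name.lower()                                       # step 1
    for src, dst in MULTI.items():                         # step 2 (two-letter cases)
        s = s.replace(src, dst)
    s = s.translate(SINGLE)                                # step 2 (one-letter cases)
    s = "".join(ch if ch.isalnum() else " " for ch in s)   # step 3
    tokens = [t for t in s.split() if t not in PREPOSITIONS]  # step 4
    return "".join(tokens)                                 # step 5


print(normalise_family_name("van der Hout"))  # hout
print(normalise_family_name("v.d. Hout"))     # hout
print(normalise_family_name("Müller"))        # mueller
```

Splitting on whitespace makes the explicit padding of the string with a leading and trailing space unnecessary, because a preposition at the start or end of the name is still its own token.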

For Mobile Connect we also truncate the result to 20 characters and apply a hash function. However, we think this is not needed for the CAMARA version.

Step 2 and step 4 in the above normalisation are highly language or country specific.

We use step 2 because these characters are not much used in the Dutch language, and sometimes our CRM systems do not support the entry of these characters, or, with manual entry done in the past, people used these characters instead of what it should be, because they did not know how to enter them on a keyboard.

Step 4 is used mainly for the Dutch language, because we have many prepositions in Dutch family names, like "van der" in "van der Hout". Many of these can be abbreviated, so in practice, for "van der" you also see "vd" or "v.d.", or spelling mistakes such as "van de". A Dutch reader would typically accept these as being trivial, and would only look at the "Hout" in the family name. Because comparing hashes is very sensitive to spelling mistakes, we decided to remove the prepositions in step 4.
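
To illustrate the point about hashes, here is a small sketch. SHA-256 is an assumption (the hash function used by Mobile Connect is not named in this thread), and `difflib.SequenceMatcher` is only a stand-in for a fuzzy score, not the Jaro-Winkler comparison.

```python
import hashlib
from difflib import SequenceMatcher


def digest(value: str) -> str:
    # Assumption: SHA-256; the thread does not say which hash is used.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


a = "vanderhout"  # "van der Hout" with spaces removed
b = "vandehout"   # same name with the "van de" spelling mistake

# One missing letter gives a completely different hash, so the match fails.
print(digest(a) == digest(b))

# A plain-text similarity score still sees the two values as close.
print(SequenceMatcher(None, a, b).ratio())

# With the prepositions removed, both spellings reduce to the same value.
print(digest("hout") == digest("hout"))
```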

For the CAMARA Jaro-Winkler comparison it may be that step 2 and step 4 are no longer needed, but we would first like to A/B test this in practice with some customers to see what the effect on the match rate is. So for now, we are keeping step 2 and 4.

In Dutch family names we also still have one issue that we have never been able to solve: when somebody gets married, they can use the family name of their partner as well, and they can choose in which order they place the family names. So you can, for example, see "Jansen", "Jansen-Hout", "Hout-Jansen" and sometimes also an abbreviation like "Jansen ev Hout". We never managed to solve this in Mobile Connect, but with plain-text comparison we may give it a try: match all allowed combinations and see what gives the highest match rate.
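
One way to "match all allowed combinations" could look like the sketch below. All names (`squash`, `variants`, `best_match`) are hypothetical, only the hyphen and a standalone "ev" are treated as joiners, and `difflib.SequenceMatcher` stands in for the Jaro-Winkler score.

```python
import re
from difflib import SequenceMatcher
from itertools import permutations


def squash(name):
    """Lowercase and drop everything that is not a letter or digit."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def variants(registered):
    """Every ordering of every non-empty subset of the registered family names."""
    pieces = re.split(r"-|\bev\b", registered.lower())
    parts = [squash(p) for p in pieces if squash(p)]
    out = set()
    for r in range(1, len(parts) + 1):
        for combo in permutations(parts, r):
            out.add("".join(combo))
    return out


def best_match(submitted, registered):
    """Return (score, variant) for the best-scoring variant of the registered name."""
    target = squash(submitted)
    scored = ((SequenceMatcher(None, target, v).ratio(), v) for v in variants(registered))
    return max(scored, default=(0.0, ""))


print(sorted(variants("Jansen-Hout")))       # ['hout', 'houtjansen', 'jansen', 'jansenhout']
print(best_match("Hout", "Jansen ev Hout"))  # (1.0, 'hout')
```

Taking the best score over all variants raises the match rate, but it also accepts a partner's name on its own, so whether that is acceptable is something the A/B tests would need to show.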

@KevScarr
Collaborator

Sharing for completeness: we have the Vodafone normalisation rules (for the Mobile Connect KYC version) documented here:
https://developer.vodafone.com/api-catalogue/match/additional-information#preparing-hashes (UK, DE, ES, NL). UK aligns to the MC global standard, and we've extended it in some countries to deal with the same types of issues Huub mentions above.

We're keen on the concept (I think it was @GillesInnov35 who mentioned it) for plain-text KYC Match with Score: the submitter simply provides the text without normalisation, and the MNO then applies the normalisation rules to both inputs.

  • Easier to control changes to the algorithm (centrally maintained)
  • Easier for the Customer/Calling party (i.e. in some markets MNOs may differ on the rule-set)
  • Ensures consistency
  • Easier to A/B test changes to the algorithm or run parallel tests for offline feedback

@HuubAppelboom We would be very interested to understand how your Jaro-Winkler scores perform with and without normalisation (we only have UK-based results today).

@HuubAppelboom
Collaborator

@KevScarr From current results with NL customers, we see a difference in match rates between the name fields and the date of birth field. The name fields are typically 10% lower in match rate, and that can probably be fixed by the Telco doing the normalisation. Jaro-Winkler may close part of this gap, but we may also need to solve the issue of people getting married and combining family names to close it even further.

I fully support the idea that the Telcos should do the normalisation, for all the reasons you mention above.

@KevScarr
Collaborator

KevScarr commented Oct 24, 2024

We've been reviewing the normalisation rules in the context of 'score', and we don't think it makes any sense to truncate the fields, so we would propose the following ruleset for adoption as a base standard:

  1. Truncation: No truncation, apart from postalCode: 10 chars
  2. Convert to lowercase
  3. “Downgrade” all non-standard characters:
    "àèìòù äëïöü âêîôû áéíóú ðý ãñõ šžçåø"
    into
    "aeiou aeiou aeiou aeiou dy ano szcao"
  4. Downgrade any "ß" as "ss"
  5. Remove any non-alphanumeric characters
  6. The result is the “normalised” value

We would also support the principle that the MNO applies this to both input sets, i.e. the calling party doesn't need to normalise.
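
A minimal sketch of this proposed base ruleset follows. The function name `normalise` and its `field` parameter are hypothetical, spaces are removed together with the other non-alphanumeric characters, and truncating the postal code after the other steps (rather than before) is an assumption, since the list does not say at which point truncation applies.

```python
# Step 3 table, copied from the ruleset above (spaces between groups dropped).
DOWNGRADE = str.maketrans(
    "àèìòùäëïöüâêîôûáéíóúðýãñõšžçåø",
    "aeiouaeiouaeiouaeioudyanoszcao",
)


def normalise(value: str, field: str = "") -> str:
    s = value.lower()                            # step 2
    s = s.translate(DOWNGRADE)                   # step 3
    s = s.replace("ß", "ss")                     # step 4
    s = "".join(ch for ch in s if ch.isalnum())  # step 5
    if field == "postalCode":                    # step 1 (assumed to apply last)
        s = s[:10]
    return s


# The provider applies the same function to both inputs before comparing.
print(normalise("Müller-Straße"))           # mullerstrasse
print(normalise("SW1A 2AA", "postalCode"))  # sw1a2aa
```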

@HuubAppelboom
Copy link
Collaborator

Regarding the prepositions, the best approach is probably to do an A/B test with the first customers that have fuzzy name matching logic, and see whether the removal of prepositions is still necessary.

@GillesInnov35
Collaborator

Thanks @KevScarr, @HuubAppelboom for your proposal.

  • Truncation: No truncation, apart from postalCode: 10 chars
  • Convert to lowercase
  • “Downgrade” all non-standard characters: "àèìòù äëïöü âêîôû áéíóú ðý ãñõ šžçåø" into "aeiou aeiou aeiou aeiou dy ano szcao"
  • Downgrade any "ß" as "ss"
  • Remove any non-alphanumeric characters

At Orange, the recommended normalization steps are the same.
BR
Gilles

@HuubAppelboom
Collaborator

@KevScarr @GillesInnov35 Do you also remove any spaces when you remove any non-alphanumeric characters?

For the Netherlands we probably still need to remove the prepositions in the family name. We propose to A/B test this with the first customers that will use the CAMARA version, so we can make an informed decision on this.

@GillesInnov35
Collaborator

hi @HuubAppelboom , yes spaces are removed in our solution.
Gilles

@KevScarr
Collaborator

@HuubAppelboom Same as @GillesInnov35 we (Vodafone) also remove spaces.

@claraserranosolsona

Hi all,

In Telefonica, the normalization we apply follows the same Mobile Connect rules shared above:

  1. Convert the name string to lower case and downgrade all non-standard characters ("àèìòù äëïöü âêîôû áéíóú ðý ãñõ šžçåø ß" become "aeiou aeiou aeiou aeiou dy ano szcao ss"), with additional particularities for the German market when downgrading non-standard characters (ä->ae, ö->oe, ü->ue)
  2. Remove any articles/prepositions (defined in an appendix)
  3. Remove any non-alphanumeric character (including spaces)
  4. Length is truncated (length varies depending on the attribute)

Telefonica applies normalization to both the MNO data and the customer data.

Regarding the proposed changes:

  • Truncation: @KevScarr, you have proposed to remove it. What is the rationale behind this? Does it improve the score?
  • Removal of articles/prepositions: this step was defined as an evolution of the original MC normalization rules, after the results of some pilots performed in Germany. They reported an improvement in the ratio of true responses when doing it. I believe Vodafone and DT were applying the same rules, in Germany at least. We could revisit this step, as mentioned by @HuubAppelboom, if we think it is difficult to define a global article/preposition list for all countries and/or it does not show improvement in the new results.

Question about the new attributes:

  • Email: should we apply the above normalization rules? Or simply validate that it complies with the RFC format "{local-part}@{domain}"?

@KevScarr
Collaborator

@claraserranosolsona Purely a "simplification" step: with 'score' introduced, the fields don't need to be truncated; a high score (i.e. 85+) would also be considered a match by customers.
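
To make the "85+" idea concrete, here is a sketch of a Jaro-Winkler score with a customer-side threshold. Mapping the similarity to a 0-100 integer by rounding, the function names, and the default threshold parameter are assumptions for illustration; the standard prefix scale of 0.1 and a maximum prefix length of 4 are used.

```python
def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1 = [False] * len(s1)
    hit2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within the matching window.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * scale * (1 - j)


def score(s1: str, s2: str) -> int:
    # Assumption: the score is the similarity scaled to 0-100 and rounded.
    return round(100 * jaro_winkler(s1, s2))


def is_match(value: int, threshold: int = 85) -> bool:
    return value >= threshold


print(score("martha", "marhta"))  # 96
print(is_match(score("martha", "marhta")))  # True
```

Because the score degrades gradually with each differing character, comparing the full normalised strings loses nothing, which is the reason truncation is no longer needed here.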

@GillesInnov35
Collaborator

Hi @claraserranosolsona, I see that truncation is one of the steps proposed by TEF. Is it only a security requirement for the back-end service? As far as I know, there isn't any RFC or rule for string truncation (length value) that depends on the kind of data the string contains. If I understand correctly, any string truncation will depend on the MNO's solution and so will be specific.

The point of email string normalization has also been raised at Orange. As mentioned by @claraserranosolsona, normalization should be specialized for the email format. For example, skip the following characters: "@", ".", "-", "_"
To be discussed
Thanks

BR
Gilles

@ToshiWakayama-KDDI
Collaborator Author

Hi all,

Thank you for your discussions on normalization.

I don't think we need email string normalization. I feel that the example, skipping the characters "@", ".", "-", "_", may have a negative impact on comparison results.

I also have another question. The way of normalizing attribute values that has been discussed seems very much language dependent, so, first of all, it will be an optional feature. And how do you think it will be specified? Will it be specified in the API specification (YAML), or will it be some guideline document separate from the YAML?

Thanks,
Toshi

@HuubAppelboom
Collaborator

@ToshiWakayama-KDDI In the Netherlands we have recorded things like normalisation in a local spec, in a local industry organisation of the active MNOs, and KPN has also included it in the specs we provide to aggregators (including all the extras and specials we provide on top of the standard specification). I don't think it is a good idea to try to put every normalisation in the YAML, also because it is probably not that relevant for each and every developer. It may be a good idea to also discuss this with the GSMA and the CAMARA project organisation.

@GillesInnov35
Collaborator

Hi @ToshiWakayama-KDDI, @HuubAppelboom, I agree with Huub; I also think that normalization should not be specified in the YAML file.

BR
Gilles

@GillesInnov35
Collaborator

@ToshiWakayama-KDDI, I'm sorry, I made a big mistake in my previous comment: for email addresses the idea was to avoid skipping the following characters: "@", ".", "-", "_".
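
A small sketch of what an email-specific rule could look like under this reading. Lowercasing the address, the function names, and the loose shape check are assumptions for illustration, not an agreed rule; the check only tests the "{local-part}@{domain}" shape and is not a full RFC validation.

```python
KEEP = set("@.-_")  # the four characters that should not be skipped


def normalise_email(value: str) -> str:
    s = value.strip().lower()
    return "".join(ch for ch in s if ch.isalnum() or ch in KEEP)


def strip_all(value: str) -> str:
    """The name-style rule: remove every non-alphanumeric character."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def looks_like_email(value: str) -> bool:
    local, sep, domain = value.partition("@")
    return bool(local) and sep == "@" and bool(domain) and "@" not in domain


a, b = "john.doe@example.com", "johndoe@example.com"
print(strip_all(a) == strip_all(b))              # True: two different addresses collapse
print(normalise_email(a) == normalise_email(b))  # False: they stay distinct
```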

@HuubAppelboom
Collaborator

All, what about the following as a general recommendation for normalisation steps to be carried out by the API Provider:

  1. convert all characters to lowercase
  2. country specific: convert characters "àèìòù ëï âêîôû áéíóú ðý ãñõ šžçåø" to "aeiou ei aeiou aeiou dy ano szcao"
  3. country specific: downgrade ä, ö, ü, ß to ae, oe, ue and ss
  4. if you want to remove prepositions at step 5, replace any non-alphanumeric character by a space; otherwise, remove all non-alphanumeric characters
  5. country specific: remove all prepositions following a defined list of prepositions
  6. remove all remaining spaces
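
The recommendation above could be expressed as one function driven by per-country settings. This is only a sketch: the `Rules` dataclass, its field names and the example Dutch preposition subset are hypothetical, and the step 2 table leaves ä, ö and ü to step 3.

```python
from dataclasses import dataclass

ACCENTS = str.maketrans(
    "àèìòùëïâêîôûáéíóúðýãñõšžçåø",
    "aeioueiaeiouaeioudyanoszcao",
)
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


@dataclass(frozen=True)
class Rules:
    downgrade_accents: bool = True         # step 2 (country specific)
    expand_umlauts: bool = False           # step 3 (country specific)
    prepositions: frozenset = frozenset()  # step 5 (country specific)


def normalise(value: str, rules: Rules) -> str:
    s = value.lower()                                         # step 1
    if rules.downgrade_accents:
        s = s.translate(ACCENTS)                              # step 2
    if rules.expand_umlauts:
        for src, dst in UMLAUTS.items():                      # step 3
            s = s.replace(src, dst)
    if rules.prepositions:
        s = "".join(ch if ch.isalnum() else " " for ch in s)  # step 4, keeping word gaps
        words = [w for w in s.split() if w not in rules.prepositions]  # step 5
        return "".join(words)                                 # step 6
    return "".join(ch for ch in s if ch.isalnum())            # step 4, no prepositions


NL = Rules(expand_umlauts=True, prepositions=frozenset({"van", "de", "den", "der", "vd"}))
print(normalise("Van der Hout", NL))       # hout
print(normalise("Van der Hout", Rules()))  # vanderhout
```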

Regards
Huub

@GillesInnov35
Collaborator

Thanks @HuubAppelboom. I wonder whether the order of the normalization steps is important for the score calculation. On our side it is not exactly the same. I think it could have an impact in a few specific cases. I'll check.
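
Two cases where the order of steps changes the result can be shown with a small sketch. The step functions, the shortened accent table and the preposition set are illustrative assumptions, not any operator's actual rules.

```python
ACCENTS = str.maketrans("àèìòùáéíóú", "aeiouaeiou")  # lowercase-only subset
PREPOSITIONS = {"van", "de", "der"}


def lowercase(s):
    return s.lower()


def downgrade(s):
    return s.translate(ACCENTS)


def punct_to_space(s):
    return "".join(ch if ch.isalnum() else " " for ch in s)


def strip_non_alnum(s):
    return "".join(ch for ch in s if ch.isalnum())


def drop_prepositions(s):
    return " ".join(w for w in s.split() if w not in PREPOSITIONS)


def run(steps, value):
    for step in steps:
        value = step(value)
    return value


# Case 1: removing spaces before the preposition step hides the prepositions.
print(run([lowercase, punct_to_space, drop_prepositions, strip_non_alnum], "Van der Hout"))  # hout
print(run([lowercase, strip_non_alnum, drop_prepositions], "Van der Hout"))  # vanderhout

# Case 2: a lowercase-only character table misses capitals if it runs first.
print(run([lowercase, downgrade], "École"))  # ecole
print(run([downgrade, lowercase], "École"))  # école
```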

Gilles

5 participants