KYC-Match: Normalisation - how to normalise values of attributes before comparison #157

ToshiWakayama-KDDI · 2024-10-22T02:24:51Z

Problem description

Some normalisation of the input attributes may be beneficial to harmonise the inputs where applicable, to minimise false negatives when dealing with a language's special characters, or even to make different languages compatible. This topic has not been studied yet in the SP.

Possible evolution

It is understood that normalisation is up to MNOs/API providers; however, some normalisation guidelines may be beneficial for globally located MNOs/API providers. So this issue is to identify whether such guidelines are beneficial and, if so, to create them.

-> Please correct the above if my understanding is not correct.

ToshiWakayama-KDDI · 2024-10-22T02:27:39Z

I think I heard during the meeting on 2024-10-15 that Mobile Connect has some normalisation guidelines. If that is correct, could anyone share these MC guidelines with us, please?

BR
Toshi

HuubAppelboom · 2024-10-22T05:30:11Z

In the Netherlands, we use the following 5 normalisation steps for Mobile Connect KYC Match, for the family name:

1. Convert to lowercase.
2. Convert the characters àèìòùëïâêîôûáéíóúðýãñõšžçåø to aeioueiaeiouaeioudyanoszcao, and ä, ö, ü, ß to ae, oe, ue and ss.
3. Replace any non-alphanumeric character with a space.
4. Replace prepositions; any substring from the following set should be replaced with a space:
   {" af ", " aan ", " auf ", " bij ", " boven ", " de ", " den ", " der ", " d ", " en ", " ev ", " e v ", " het ", " t ", " in ", " onder ", " op ", " over ", " s ", " te ", " ten ", " ter ", " tot ", " uit ", " uijt ", " van ", " vd ", " v ", " ver ", " vom ", " von ", " voor ", " vor ", " zu ", " zum ", " zur "}
   The prepositions should always be separate from the rest of the text, hence the leading and trailing spaces. A leading and a trailing space should be added at the beginning and end of the string, to avoid missing a preposition.
5. Remove all spaces.
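The five steps above can be sketched in Python as follows. This is only an illustration of the list as posted, not KPN's actual implementation. The function name is hypothetical, and two details are assumptions: "non-alphanumeric" is read as "anything outside a-z and 0-9 after step 2", and the prepositions are tried in the order they appear in the set, repeating until nothing changes.

```python
import re

# Step 2 tables, copied from the list above.
CHAR_MAP = str.maketrans("àèìòùëïâêîôûáéíóúðýãñõšžçåø",
                         "aeioueiaeiouaeioudyanoszcao")
MULTI_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

# Step 4 set, stored without the surrounding spaces and kept in source
# order so that the two-token entry "e v" is tried before "v".
PREPOSITIONS = (
    "af", "aan", "auf", "bij", "boven", "de", "den", "der", "d", "en",
    "ev", "e v", "het", "t", "in", "onder", "op", "over", "s", "te",
    "ten", "ter", "tot", "uit", "uijt", "van", "vd", "v", "ver", "vom",
    "von", "voor", "vor", "zu", "zum", "zur",
)


def normalise_family_name_nl(name: str) -> str:
    """Hypothetical helper applying steps 1 to 5 to a family name."""
    s = name.lower()                          # step 1
    s = s.translate(CHAR_MAP)                 # step 2, one-to-one characters
    for src, dst in MULTI_MAP.items():        # step 2, ä/ö/ü/ß
        s = s.replace(src, dst)
    s = re.sub(r"[^a-z0-9]", " ", s)          # step 3 (ASCII is an assumption)
    s = f" {s} "                              # pad so edge prepositions match
    changed = True
    while changed:                            # step 4, repeat until stable
        changed = False
        for prep in PREPOSITIONS:
            token = f" {prep} "
            if token in s:
                s = s.replace(token, " ")
                changed = True
    return s.replace(" ", "")                 # step 5


print(normalise_family_name_nl("van der Hout"))  # hout
print(normalise_family_name_nl("v.d. Hout"))     # hout
```

With these rules "van der Hout", "vd Hout", "v.d. Hout" and "van de Hout" all reduce to the same string.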

For Mobile Connect we also truncate the result to 20 characters and apply a hash function. However, we think this is not needed for the CAMARA version.

Step 2 and step 4 in the above normalisation are highly language or country specific.

We use step 2 because these characters are not much used in the Dutch language, and sometimes our CRM systems do not support the entry of these characters, or with manual entry done in the past, people use these characters instead of what it should be, because they don't know how to enter these on a keyboard.

Step 4 is used mainly for the Dutch language, because we have many prepositions in Dutch family names, like "van der" in "van der Hout". Many of these can be abbreviated, so what you see in practice for "van der" is also "vd" or "v.d.", or they contain spelling mistakes like "van de". A Dutch reader would typically accept these as being trivial and would only look at the "Hout" in the family name. Because comparing hashes is very sensitive to spelling mistakes, we decided to remove the prepositions in step 4.
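A small sketch of why hash comparison is so sensitive: a one-letter difference in the preposition produces an unrelated digest, while the same names with the preposition removed hash identically. The thread only says "a hash function", so SHA-256 here is an assumption, as is the helper name.

```python
import hashlib


def truncate_and_hash(normalised: str) -> str:
    """Hypothetical helper: truncate to 20 characters, then hash.

    SHA-256 is an assumption; the thread does not name the hash function.
    """
    return hashlib.sha256(normalised[:20].encode("utf-8")).hexdigest()


# Prepositions kept: "van der" vs the misspelt "van de" no longer match.
print(truncate_and_hash("vanderhout") == truncate_and_hash("vandehout"))  # False

# Prepositions removed in step 4: both inputs are "hout", so they match.
print(truncate_and_hash("hout") == truncate_and_hash("hout"))  # True
```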

For the CAMARA Jaro-Winkler comparison it may be that step 2 and step 4 are no longer needed, but we would first like to A/B test this in practice with some customers to see what the effect on the match rate is. So for now, we are keeping step 2 and 4.

In Dutch family names we also still have one issue that we have never been able to solve: when somebody gets married, they can use the family name of their partner as well. However, they can choose the order in which they place the family names. So you can, for example, see "Jansen", "Jansen-Hout", "Hout-Jansen" and sometimes also an abbreviation like "Jansen ev Hout". We never managed to solve this in Mobile Connect, but with plain text comparison we may give it a try: match all allowed combinations and see what gives the highest match rate.
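One way the "try all allowed combinations" idea could look is sketched below. Everything here is hypothetical: the splitting rule (hyphen or "ev"/"e v"), the set of variants, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure (it is not Jaro-Winkler).

```python
import re
from difflib import SequenceMatcher
from itertools import permutations


def name_variants(family_name: str) -> set[str]:
    """Split a combined name and return each part plus both orderings."""
    parts = [p for p in re.split(r"\s*-\s*|\s+e\.?\s?v\.?\s+",
                                 family_name.lower()) if p]
    variants = set(parts)
    for first, second in permutations(parts, 2):
        variants.add(first + second)
    return variants


def best_score(provided: str, on_record: str) -> float:
    """Highest similarity over all variant pairs (stand-in metric)."""
    return max(
        SequenceMatcher(None, p, r).ratio()
        for p in name_variants(provided)
        for r in name_variants(on_record)
    )


print(best_score("Jansen-Hout", "Hout-Jansen"))  # 1.0
print(best_score("Jansen ev Hout", "Jansen"))    # 1.0
```

Note that this is deliberately permissive: the partner's name on its own also matches a combined name, which may or may not be acceptable for a given use case.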

KevScarr · 2024-10-22T08:20:26Z

Sharing for completeness: we have the Vodafone normalisation rules (for the Mobile Connect KYC version) documented here:
https://developer.vodafone.com/api-catalogue/match/additional-information#preparing-hashes (UK,DE,ES,NL). UK aligns to the MC global standard and then we've extended it in some countries to deal with the same types of issues Huub mentions above.

We're keen on the concept (I think it was @GillesInnov35 who mentioned it) that, for plain-text KYC Match with Score, the submitter simply provides the text without normalisation, and the MNO then applies the normalisation rules to both inputs.

Easier to control changes to the algorithm (centrally maintained)
Easier for the Customer/Calling party (ie in some markets MNOs may differ on the rule-set)
Ensures consistency
Easier to A/B test changes to the algorithm or run parallel tests for offline feedback
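As a rough illustration of the last point, a provider-side A/B comparison could run two rule-sets over the same pairs of (submitted value, value on record) and compare match rates. The rule-sets, the short preposition list and the sample pairs below are made up for the sketch and are not results from any operator.

```python
import re
from typing import Callable, Iterable, Tuple


def ruleset_a(value: str) -> str:
    """Lowercase and drop non-alphanumerics (hypothetical variant A)."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def ruleset_b(value: str) -> str:
    """Variant A plus removal of a few prepositions (hypothetical variant B)."""
    tokens = re.sub(r"[^a-z0-9]", " ", value.lower()).split()
    return "".join(t for t in tokens if t not in {"van", "der", "de", "vd"})


def match_rate(pairs: Iterable[Tuple[str, str]],
               normalise: Callable[[str], str]) -> float:
    """Share of pairs that are equal after the MNO normalises both sides."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(normalise(a) == normalise(b) for a, b in pairs)
    return hits / len(pairs)


# Illustrative inputs only.
sample = [
    ("van der Hout", "Van Der Hout"),
    ("vd Hout", "van der Hout"),
    ("Jansen", "Janssen"),
]
print(match_rate(sample, ruleset_a), match_rate(sample, ruleset_b))
```

Because both inputs go through the same function on the provider side, swapping `ruleset_a` for `ruleset_b` needs no change from the calling party.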

@HuubAppelboom We would be very interested to understand how your Jaro-Winkler scores perform with and without normalisation (we only have UK-based results today).

HuubAppelboom · 2024-10-23T10:21:36Z

@KevScarr From current results with NL customers, we see a difference in match rates for the name fields compared to the date of birth field. The name fields are typically 10% lower in match rate, and that can probably be fixed by the Telco doing the normalisation. Jaro-Winkler may close part of this gap, but we may also need to solve the issue of people getting married and combining family names to close the gap even further.

I fully support the idea that the Telcos should do the normalisation, for all the reasons you mention above.

KevScarr · 2024-10-24T10:16:51Z

We've been reviewing the normalisation rules in the context of 'score' and we don't think it makes any sense to truncate the fields so we would propose the following ruleset for adoption as a base standard:-

Truncation: No truncation, apart from postalCode: 10 chars
Convert to lowercase
“Downgrade” all non-standard characters: "àèìòù äëïöü âêîôû áéíóú ðý ãñõ šžçåø" into "aeiou aeiou aeiou aeiou dy ano szcao"
Downgrade any "ß" as "ss"
Remove any non-alphanumeric characters
The result is the “normalised” value

And would support the principle that the MNO applies this to both input sets, ie the calling party doesn't need to normalise.
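A minimal sketch of this base ruleset, applied by the provider to both inputs. The function names are hypothetical. Two points are assumptions: the list order is taken as the processing order (so `postalCode` is truncated before the other steps), and "non-alphanumeric" is read as anything outside a-z and 0-9 after downgrading.

```python
import re

# Character table from the proposal (30 characters on each side).
BASE_MAP = str.maketrans("àèìòùäëïöüâêîôûáéíóúðýãñõšžçåø",
                         "aeiouaeiouaeiouaeioudyanoszcao")


def normalise_base(value: str, attribute: str = "") -> str:
    """Hypothetical helper applying the proposed base ruleset."""
    if attribute == "postalCode":
        value = value[:10]                    # only postalCode is truncated
    value = value.lower()                     # convert to lowercase
    value = value.translate(BASE_MAP)         # downgrade non-standard characters
    value = value.replace("ß", "ss")          # downgrade ß
    return re.sub(r"[^a-z0-9]", "", value)    # remove non-alphanumerics


def provider_side_equal(request_value: str, record_value: str,
                        attribute: str = "") -> bool:
    """The MNO normalises both the request and its own record."""
    return (normalise_base(request_value, attribute)
            == normalise_base(record_value, attribute))


print(normalise_base("Müller-Strauß"))                 # mullerstrauss
print(provider_side_equal("José O'Neill", "jose oneill"))  # True
```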

HuubAppelboom · 2024-10-29T13:30:01Z

Regarding the prepositions, the best approach is probably to do an A/B test with the first customers that have fuzzy name matching logic, and see whether the removal of prepositions is still necessary.

GillesInnov35 · 2024-10-29T17:56:15Z

Thanks @KevScarr, @HuubAppelboom for your proposal.

Truncation: No truncation, apart from postalCode: 10 chars

Convert to lowercase

“Downgrade” all non-standard characters: "àèìòù äëïöü âêîôû áéíóú ðý ãñõ šžçåø" into "aeiou aeiou aeiou aeiou dy ano szcao"

Downgrade any "ß" as "ss"

Remove any non-alphanumeric characters

At Orange, the recommended normalization steps are the same.
BR
Gilles

HuubAppelboom · 2024-10-30T08:16:07Z

@KevScarr @GillesInnov35 Do you also remove any spaces when you remove any non-alphanumeric characters?

For the Netherlands we probably still need to remove the prepositions in the family name. We propose to A/B test this with the first customers that will use the CAMARA version, so we can make an informed decision on this.

GillesInnov35 · 2024-10-30T08:40:00Z

hi @HuubAppelboom , yes spaces are removed in our solution.
Gilles

KevScarr · 2024-10-30T08:54:36Z

@HuubAppelboom Same as @GillesInnov35 we (Vodafone) also remove spaces.

claraserranosolsona · 2024-11-11T13:54:35Z

Hi all,

In Telefonica, the normalization we apply follows the same Mobile Connect rules shared above:

Convert the name string to lower case and downgrade all non-standard characters ("àèìòù äëïöü âêîôû áéíóú ðý ãñõ šžçåø ß" become "aeiou aeiou aeiou aeiou dy ano szcao ss"), with additional particularities for the German market when downgrading non-standard characters (ä->ae, ö->oe, ü->ue)
Remove any articles/prepositions (defined in an appendix)
Remove any non-alphanumeric character (including spaces)
Length is truncated (length varies depending on the attribute)

Telefonica applies normalization to both the MNO data and the customer data.

Regarding the proposed changes:

Truncation: @KevScarr, you have proposed to remove it. What is the rationale behind this? Does it improve the score?
Removal of articles/prepositions: this step was defined as an evolution of the original MC normalization rules, following the results of some pilots performed in Germany. They reported an improvement in the ratio of true responses when doing it. I believe Vodafone and DT were applying the same rules, in Germany at least. We could revisit this step, as mentioned by @HuubAppelboom, if we think it is difficult to define a global article/preposition list for all countries and/or it does not show improvement in the new results.

Question about the new attributes:

Email: should we apply the above normalization rules? Or simply validate that it complies with the RFC format "{local-part}@{domain}"?

KevScarr · 2024-11-11T14:55:10Z

@claraserranosolsona Purely a "simplification" step: with 'score' introduced, the fields no longer need to be truncated, and a high score (i.e. 85+) would also be considered a match by customers.
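For reference, a plain-Python sketch of a Jaro-Winkler score with a match threshold is shown below. It follows the textbook definition with a prefix scale of 0.1 and a prefix length capped at 4; whether an operator's implementation uses the same parameters is an assumption, as is scaling the result to 0-100 and the helper names. The 85 threshold is the figure mentioned above.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a common prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def match_score(s1: str, s2: str) -> int:
    """Score scaled to 0-100 (the scaling is an assumption)."""
    return round(jaro_winkler(s1, s2) * 100)


def is_match(s1: str, s2: str, threshold: int = 85) -> bool:
    return match_score(s1, s2) >= threshold


print(match_score("martha", "marhta"))  # 96
print(match_score("dwayne", "duane"))   # 84
```

Since the score degrades gradually with extra characters, there is no need to cut the strings to a fixed length first, which is the simplification described above.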

GillesInnov35 · 2024-11-13T13:04:19Z

Hi @claraserranosolsona, I see that truncation is one of the steps proposed by TEF. Is it only a security requirement for the back-end service? As far as I know, there is no RFC or rule for string truncation (length value) based on the kind of data the string contains. If I understand correctly, any string truncation will depend on the MNO's solution and so will be specific.

The point of email string normalization has also been raised at Orange. As mentioned by @claraserranosolsona, normalization should be specialized for the email format. For example skip the following characters: "@", ".", "-", "_"
To be discussed
Thanks

BR
Gilles

ToshiWakayama-KDDI · 2024-11-26T04:44:49Z

Hi all,

Thank you for your discussions on normalization.

I don't think we need email string normalization. I feel that the example of skipping the characters "@", ".", "-", "_" may have a negative impact on comparison results.

I also have another question. It seems the way to normalize values of attributes that has been discussed is very much language dependent, so, first of all, it will be an optional feature. And, how do you think it will be specified? Will it be specified in the API specification (YAML), or will it be some guideline document separate from the YAML?

Thanks,
Toshi

HuubAppelboom · 2024-11-26T07:10:34Z

@ToshiWakayama-KDDI In the Netherlands we have recorded things like normalisation in a local spec, in a local industry organisation between the active MNOs, and KPN has also included it in the specs we provide to aggregators (including all the extras and specials we provide on top of the standard specification). I don't think it is a good idea to try to put every normalisation in the YAML, also because it is probably not that relevant for each and every developer. It may be a good idea to also discuss this with the GSMA and the CAMARA project organisation.

GillesInnov35 · 2024-11-26T07:29:05Z

hi @ToshiWakayama-KDDI , @HuubAppelboom , I agree with Huub , I think also that normalization should not be specified in the yaml file.

BR
Gilles

GillesInnov35 · 2024-11-26T08:14:52Z

@ToshiWakayama-KDDI, I'm sorry, I made a big mistake in my previous comment: for email addresses the idea was to avoid skipping the following characters: "@", ".", "-", "_".
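A short sketch of why these characters matter for email: applying the name-style rule (drop every non-alphanumeric) makes distinct addresses collide, while an email-specific rule keeps them apart. The helper names are hypothetical, and lowercasing the whole address is an assumption that the thread does not settle.

```python
import re


def strip_like_names(value: str) -> str:
    """Name-style rule: lowercase and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def normalise_email(email: str) -> str:
    """Email-specific rule: keep "@", ".", "-" and "_" (lowercasing assumed)."""
    return email.strip().lower()


def has_basic_email_shape(email: str) -> bool:
    """Very loose "{local-part}@{domain}" check, not a full RFC validation."""
    local, sep, domain = email.partition("@")
    return sep == "@" and local != "" and domain != "" and "@" not in domain


a, b = "anna_b@site.org", "annab@site.org"   # two different mailboxes
print(strip_like_names(a) == strip_like_names(b))  # True  (false positive)
print(normalise_email(a) == normalise_email(b))    # False
```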

HuubAppelboom · 2024-11-26T09:46:16Z

All, what about the following as a general recommendation for normalisation steps to be carried out by the API Provider:

1. convert all characters to lowercase
2. country specific: convert characters "àèìòù äëïöü âêîôû áéíóú ðý ãñõšžçåø" to "aeioueiaeiouaeioudyanoszcao"
3. country specific: downgrade ä, ö, ü, ß to ae, oe, ue and ss
4. In case you want to remove prepositions at step 5, replace any non-alphanumeric character by spaces, otherwise remove all non-alphanumeric characters
5. country specific: remove all prepositions following a defined list of prepositions
6. remove all remaining spaces

Regards
Huub
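The recommendation above could be expressed as one routine driven by a per-country profile, roughly as sketched here. The `NormalisationProfile` class and all names are hypothetical. The one-to-one table uses the 27-character source string from the earlier Dutch list (without ä, ö, ü), which is an assumption made so that the source and replacement strings line up.

```python
import re
from dataclasses import dataclass, field

ONE_TO_ONE = dict(zip("àèìòùëïâêîôûáéíóúðýãñõšžçåø",
                      "aeioueiaeiouaeioudyanoszcao"))
MULTI = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


@dataclass
class NormalisationProfile:
    """Hypothetical container for the country-specific choices."""
    char_map: dict = field(default_factory=lambda: dict(ONE_TO_ONE))  # step 2
    multi_char_map: dict = field(default_factory=lambda: dict(MULTI))  # step 3
    prepositions: tuple = ()                                          # step 5


def normalise(value: str, profile: NormalisationProfile) -> str:
    s = value.lower()                                    # step 1
    s = s.translate(str.maketrans(profile.char_map))     # step 2
    for src, dst in profile.multi_char_map.items():      # step 3
        s = s.replace(src, dst)
    if profile.prepositions:
        s = " " + re.sub(r"[^a-z0-9]", " ", s) + " "     # step 4, keep spaces
        for prep in profile.prepositions:                # step 5
            while f" {prep} " in s:
                s = s.replace(f" {prep} ", " ")
    else:
        s = re.sub(r"[^a-z0-9]", "", s)                  # step 4, remove all
    return s.replace(" ", "")                            # step 6


plain = NormalisationProfile()
dutch = NormalisationProfile(prepositions=("van", "der", "de", "vd"))

print(normalise("Van der Hout", plain))  # vanderhout
print(normalise("Van der Hout", dutch))  # hout
```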

GillesInnov35 · 2024-11-26T14:46:28Z

Thanks @HuubAppelboom, I wonder whether the order of the normalization steps is important for score calculation. On our side it is not exactly the same. I think it could have an impact in a few specific cases. I'll check.

Gilles
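One specific case where the order does change the result, sketched under the assumption that the character table only lists lowercase letters (as in the tables posted in this thread): if the table is applied before lowercasing, an uppercase accented letter is not downgraded and is then dropped as non-alphanumeric. The function names and the tiny table are illustrative only.

```python
import re

# Lowercase-only table, as in the thread (shortened for the example).
LOWER_ONLY_MAP = str.maketrans("áéíóú", "aeiou")


def strip_non_alnum(value: str) -> str:
    return re.sub(r"[^a-z0-9]", "", value)


def lowercase_then_map(value: str) -> str:
    return strip_non_alnum(value.lower().translate(LOWER_ONLY_MAP))


def map_then_lowercase(value: str) -> str:
    return strip_non_alnum(value.translate(LOWER_ONLY_MAP).lower())


print(lowercase_then_map("JOSÉ"))  # jose
print(map_then_lowercase("JOSÉ"))  # jos
```

The two outputs differ by one character, so any similarity score computed against a record stored as "jose" would differ as well.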

ToshiWakayama-KDDI added the enhancement New feature or request label Oct 22, 2024
