Improve speaker mapping algorithm #9

ninpnin · 2024-06-10T14:05:45Z

Mediocre fuzzy matching, among other things, causes issues in the speaker matching algorithm. False negatives in particular seem to become a problem as we add more metadata.

Draw a sample of maybe 150 intros
Check whether intro_to_dict works properly
- Fix issues if there are any
Develop matching algorithm against the sample
- Use e.g. https://pypi.org/project/fuzzywuzzy/ for names
- Use similar fuzzy matching for i-ort as well
- Generate potential alternate spellings (e.g. A. B. Andersson) instead of fuzzy matching them
Draw another sample to test that the algorithm works
- Compare with current predictions
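A rough sketch of what the matching step above could look like: alternate spellings are generated for names and matched exactly, while i-ort is compared fuzzily. Everything here is an assumption for illustration: the function names (`name_variants`, `similarity`, `match_speaker`) and the dict keys (`name`, `i_ort`, `id`) are hypothetical and not taken from the existing code, and the standard library's `difflib` stands in for fuzzywuzzy so the snippet runs without extra dependencies.

```python
from difflib import SequenceMatcher
from itertools import product


def name_variants(full_name):
    """Return spellings of full_name with any subset of first names abbreviated."""
    *first_names, surname = full_name.split()
    # Each first name may appear in full or as an initial, e.g. "Anders" or "A."
    options = [(name, name[0] + ".") for name in first_names]
    return {" ".join(combo + (surname,)) for combo in product(*options)}


def similarity(a, b):
    """Similarity ratio in [0, 1]; a stdlib stand-in for a fuzzywuzzy score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_speaker(intro, candidates, place_threshold=0.8):
    """Return ids of candidates whose name variants contain the intro name
    and whose i-ort is similar enough to the intro's i-ort."""
    matches = []
    for candidate in candidates:
        # Names: exact match against generated variants, no fuzzy matching
        if intro["name"] not in name_variants(candidate["name"]):
            continue
        # i-ort: fuzzy comparison to tolerate OCR noise (threshold is a guess)
        if similarity(intro["i_ort"], candidate["i_ort"]) >= place_threshold:
            matches.append(candidate["id"])
    return matches
```

With this, `name_variants("Anders Bertil Andersson")` includes "A. B. Andersson", so an abbreviated intro name can be matched without loosening the name comparison itself. The threshold of 0.8 is arbitrary and would need tuning against the sample.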

ninpnin · 2024-10-25T11:36:16Z

intro-to-dict-sample-2024-10-25.csv

We got 93% accuracy. Of the remaining 7%, roughly 2 percentage points are due to OCR errors and multiple people in the intro (which is hard to solve at the moment). The other 5 percentage points can be corrected.

Some clear issues are: processing hyphens, multiple first names, and name abbreviations (like A. B. Lundberg etc.).
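As a starting point for the hyphen and abbreviation issues, here is a minimal normalisation sketch. It is an assumption about what a fix might look like, not the current behaviour of the code; `normalize_name` and `DASHES` are hypothetical names.

```python
import re

# Dash-like characters that OCR or typesetting may produce instead of "-"
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"


def normalize_name(raw):
    """Unify dashes, tighten hyphens, and space out initials in a name."""
    name = raw.translate({ord(ch): "-" for ch in DASHES})
    name = re.sub(r"\s*-\s*", "-", name)    # "Karl - Erik" -> "Karl-Erik"
    name = re.sub(r"\.(?=\w)", ". ", name)  # "A.B." -> "A. B."
    return re.sub(r"\s+", " ", name).strip()
```

Running both the intro name and the metadata names through the same normalisation before comparing them should make hyphenated double names and initials line up more often.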

Let's write a test that checks this file, and fix the algorithm so that the test passes.
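The test could be sketched roughly like this. The column names (`intro`, `expected_name`) are placeholders, since the actual layout of the CSV is not described here, and `parse` stands for whatever callable ends up wrapping the parsing and matching step. The accuracy threshold is left as a parameter rather than hard-coded.

```python
import csv


def load_sample(path):
    """Read the annotated sample into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def accuracy(rows, parse, input_col="intro", expected_col="expected_name"):
    """Share of rows where parse(input) equals the annotated value."""
    correct = sum(parse(row[input_col]) == row[expected_col] for row in rows)
    return correct / len(rows)


def check_sample(path, parse, min_accuracy):
    """Fail with a readable message if accuracy drops below min_accuracy."""
    rows = load_sample(path)
    acc = accuracy(rows, parse)
    assert acc >= min_accuracy, f"accuracy {acc:.1%} is below {min_accuracy:.1%}"
    return acc
```

A test function in the suite would then call `check_sample("intro-to-dict-sample-2024-10-25.csv", parse, min_accuracy)` with a threshold we agree on.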

ninpnin assigned ninpnin and BobBorges Jun 10, 2024
