Improve speaker mapping algorithm #9

Open
2 of 9 tasks
ninpnin opened this issue Jun 10, 2024 · 1 comment

ninpnin commented Jun 10, 2024

The current fuzzy matching is mediocre, and it causes issues in the speaker matching algorithm. False negatives in particular seem to become a problem as we add more metadata.

  • Draw a sample of maybe 150 intros
  • Check whether intro_to_dict works properly
    • Fix issues if there are any
  • Develop matching algorithm against the sample
    • Use e.g. https://pypi.org/project/fuzzywuzzy/ for names
    • Use similar fuzzy matching for i-ort as well
    • Generate potential alternate spellings (e.g. A. B. Andersson) instead of fuzzy matching them
  • Draw another sample to test that the algorithm works
    • Compare with current predictions
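
A rough sketch of how the "generate alternate spellings, then fall back to fuzzy matching" idea could look. This is not the current implementation: `name_variants`, `similarity`, `match_speaker`, the 0.85 threshold and the example names are all made up for illustration, and the standard library's `difflib.SequenceMatcher` stands in for fuzzywuzzy so the snippet runs without extra dependencies.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Optional


def name_variants(full_name: str) -> set[str]:
    """Spellings where each first name is either written out or cut to an initial."""
    parts = full_name.split()
    if len(parts) < 2:
        return {full_name}
    *first_names, surname = parts
    # For every first name there are two options: the full name or "X."
    options = [(name, name[0] + ".") for name in first_names]
    return {" ".join(choice + (surname,)) for choice in product(*options)}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (difflib stand-in for fuzzywuzzy)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_speaker(
    intro_name: str,
    candidates: dict[str, str],  # hypothetical shape: person id -> full name
    threshold: float = 0.85,     # assumed cut-off, would need tuning on the sample
) -> Optional[str]:
    """Return the id of the best matching candidate, or None if nothing is close."""
    variants = {pid: name_variants(name) for pid, name in candidates.items()}
    # 1. Exact hit against a generated spelling, no fuzziness involved.
    for pid, spellings in variants.items():
        if intro_name in spellings:
            return pid
    # 2. Fuzzy fallback, e.g. for OCR noise in the intro.
    best_id, best_score = None, 0.0
    for pid, spellings in variants.items():
        score = max(similarity(intro_name, s) for s in spellings)
        if score > best_score:
            best_id, best_score = pid, score
    return best_id if best_score >= threshold else None
```

The same `similarity` helper could be reused for the i-ort strings. Checking the generated spellings first keeps abbreviations like A. B. Andersson out of the fuzzy step, where they tend to score poorly against the full name.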

ninpnin commented Oct 25, 2024

intro-to-dict-sample-2024-10-25.csv

We got 93% accuracy on this sample. Of the 7% that are wrong, roughly 2 percentage points are due to OCR errors and multiple people in the intro (which is hard to solve atm). The remaining 5 percentage points can be corrected.

Some clear issues are: processing hyphens, multiple first names, and name abbreviations (like A. B. Lundberg etc.).
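
One possible way to handle these three cases before comparing names, as a sketch only. `name_tokens` and `first_names_compatible` are hypothetical helpers, not functions from the repo, and the example names are invented. Hyphens and abbreviation dots are treated as separators, and an intro is accepted if its first names (or initials) appear in order among the candidate's first names. Note the simplification: a hyphenated surname would be split too, so only its last part is compared as the surname.

```python
import re


def name_tokens(raw: str) -> list[str]:
    """Lowercase tokens; whitespace, dots and hyphens all act as separators."""
    return [tok for tok in re.split(r"[\s.\-]+", raw.lower()) if tok]


def first_names_compatible(intro: str, candidate: str) -> bool:
    """True if surnames agree and the intro's first names fit the candidate's.

    Each first-name token of the intro must match, in order, one of the
    candidate's first names, either in full or as a one-letter initial.
    """
    intro_toks, cand_toks = name_tokens(intro), name_tokens(candidate)
    if not intro_toks or not cand_toks or intro_toks[-1] != cand_toks[-1]:
        return False
    remaining = iter(cand_toks[:-1])
    for tok in intro_toks[:-1]:
        # any() consumes the iterator up to the match, which enforces the order.
        if not any(
            name == tok or (len(tok) == 1 and name.startswith(tok))
            for name in remaining
        ):
            return False
    return True
```

With this, `"A.B. Lundberg"`, `"Bernhard Lundberg"` and `"Axel Bernhard Lundberg"` would all be compatible with a candidate named Axel Bernhard Lundberg, while `"B. A. Lundberg"` would not.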

Let's write a test that checks this file, and fix the algorithm so that the test passes.
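
A minimal sketch of what such a test could look like. The column names `intro` and `name`, the `"name"` key in the parsed dict, and the assumption that `intro_to_dict` takes the intro text and returns a dict are all guesses, since the layout of the CSV and the function signature are not spelled out here. The 0.98 threshold in the commented-out test is also only a placeholder.

```python
import csv
from typing import Callable, Iterable


def load_rows(lines: Iterable[str]) -> list[dict]:
    """Read the annotated sample; accepts an open file or any iterable of lines."""
    return list(csv.DictReader(lines))


def accuracy(rows: list[dict], parse: Callable[[str], dict]) -> float:
    """Share of rows where the parsed name equals the annotated name."""
    if not rows:
        raise ValueError("empty sample")
    # Column names "intro" and "name" are assumptions about the CSV layout.
    correct = sum(parse(row["intro"]).get("name") == row["name"] for row in rows)
    return correct / len(rows)


# Hypothetical pytest test, assuming intro_to_dict(text) returns a dict:
#
# def test_intro_to_dict_sample():
#     path = "intro-to-dict-sample-2024-10-25.csv"
#     with open(path, newline="", encoding="utf-8") as f:
#         assert accuracy(load_rows(f), intro_to_dict) >= 0.98  # placeholder
```

Keeping `accuracy` separate from the file loading makes it easy to run the same check on the second sample later.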
