Expand Supported Top-Level Domains (TLDs) in split_sentences Function #1194

Open
robertvy opened this issue Dec 9, 2024 · 0 comments
Labels
bug Something isn't working

Comments

robertvy commented Dec 9, 2024

Description

The split_sentences function in the livekit.agents.tokenize.basic module mishandles domain names and email addresses whose Top-Level Domain (TLD) is not in its predefined list. For these TLDs, a space is inserted before the TLD, so websites, email addresses, and other domain-like strings are split incorrectly. As a result, Text-to-Speech (TTS) systems mispronounce them, which makes them unrecognizable and disrupts the user experience.

I worked around this with a monkey-patch in my own project, but I believe it should be fixed upstream to serve LiveKit's global audience.

Proposed Solution

  1. Expand the TLD List:
    • Update the websites regex pattern to include a comprehensive list of both generic and country-code TLDs.

  2. Make TLDs Configurable:
    • Allow users to provide a custom list of TLDs or fetch the latest TLDs from a reliable source to ensure up-to-date and extensive coverage.

  3. Modify the split_sentences Function:
    • Integrate the expanded TLD list directly into the existing sentence-splitting logic to prevent improper splits.
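
To illustrate points 2 and 3, here is a minimal, self-contained sketch of a splitter that takes the TLD list as a parameter. It is not the LiveKit implementation: the function name split_sentences_sketch, the DEFAULT_TLDS tuple, and the `<prd>` / `<stop>` placeholder technique are all assumptions made for illustration, and the real function's signature and defaults may differ.

```python
import re
from typing import Iterable, List

# Assumed default list, for illustration only; the real default may differ.
DEFAULT_TLDS = ("com", "net", "org", "io", "gov", "edu", "me")

PRD = "<prd>"    # placeholder for a dot that must not end a sentence
STOP = "<stop>"  # marker for a sentence boundary


def split_sentences_sketch(text: str, tlds: Iterable[str] = DEFAULT_TLDS) -> List[str]:
    """Split text into sentences, protecting dots that precede a known TLD."""
    tlds = tuple(tlds)
    if tlds:
        alternation = "|".join(re.escape(t) for t in tlds)
        websites = re.compile(r"\.(" + alternation + r")\b", re.IGNORECASE)
        # 1. Hide dots that belong to a domain name or email address.
        text = websites.sub(lambda m: PRD + m.group(1), text)
    # 2. Treat every remaining terminator as a sentence boundary.
    text = re.sub(r"([.!?])", r"\1" + STOP, text)
    # 3. Restore the protected dots and split on the boundary marker.
    text = text.replace(PRD, ".")
    return [s.strip() for s in text.split(STOP) if s.strip()]


sample = "Mail info@example.de today. Thanks!"
print(split_sentences_sketch(sample))
# ['Mail info@example.', 'de today.', 'Thanks!']
print(split_sentences_sketch(sample, tlds=DEFAULT_TLDS + ("de",)))
# ['Mail info@example.de today.', 'Thanks!']
```

With the assumed defaults, the address is broken at ".de"; passing a list that includes "de" keeps it intact. Making the list a parameter lets callers extend it without patching the module.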

A more comprehensive list of TLDs from the IANA registry

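# Pipe-separated regex alternation: common generic TLDs first, then
# two-letter and regional TLDs in alphabetical order. A few entries
# (ai, co, io, me, tv) appear twice, which is harmless in an alternation.
tlds = (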
"com|net|org|edu|gov|mil|int|info|biz|name|pro|museum|coop|aero|"
"dev|app|io|ai|co|me|tv|xyz|cloud|tech|design|studio|online|store|"
"ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|"
"ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|"
"ca|cat|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|"
"de|dj|dk|dm|do|dz|"
"ec|ee|eg|eh|er|es|et|eu|eus|"
"fi|fj|fk|fm|fo|fr|"
"ga|gal|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|"
"hk|hm|hn|hr|ht|hu|"
"id|ie|il|im|in|io|iq|ir|is|it|"
"je|jm|jo|jp|"
"ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|"
"la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|"
"ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|"
"na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|"
"om|"
"pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|"
"qa|"
"re|ro|rs|ru|rw|"
"sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|"
"tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|"
"ua|ug|uk|us|uy|uz|"
"va|vc|ve|vg|vi|vn|vu|"
"wf|ws|"
"ye|yt|"
"za|zm|zw"
)
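
One caveat when turning a list like the one above into a regex: alternation order matters when one TLD is a prefix of another (for example "ca" and "cat"). The sketch below uses a short excerpt rather than the full list, and compile_tld_pattern is a hypothetical helper name, not part of LiveKit. It removes duplicates, sorts longer TLDs first, and adds a trailing word boundary so that partial matches are rejected.

```python
import re

# Short excerpt of a pipe-separated TLD string, including a duplicate ("io")
# and a prefix pair ("ca" / "cat").
tlds_excerpt = "com|io|ca|cat|io|de"


def compile_tld_pattern(tlds: str) -> re.Pattern:
    """Build a regex matching '.<tld>' only when the TLD ends at a word boundary."""
    # Deduplicate, then sort longest-first (ties broken alphabetically).
    unique = sorted(set(tlds.split("|")), key=lambda t: (-len(t), t))
    return re.compile(r"\.(" + "|".join(map(re.escape, unique)) + r")\b", re.IGNORECASE)


# Naive pattern: no boundary, so the first alternative that fits wins.
naive = re.compile(r"\.(" + tlds_excerpt + r")")
safe = compile_tld_pattern(tlds_excerpt)

print(naive.search("example.cat").group(1))  # ca
print(safe.search("example.cat").group(1))   # cat
print(safe.search("example.dev"))            # None
```

The naive pattern stops at ".ca" inside "example.cat", while the bounded pattern matches the whole TLD and ignores ".dev", which is not in the excerpt.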

robertvy added the bug label on Dec 9, 2024