Expand Supported Top-Level Domains (TLDs) in split_sentences Function #1194
Labels: bug
Description
The split_sentences function in the livekit.agents.tokenize.basic module does not correctly handle domain names and email addresses whose Top-Level Domain (TLD) is not in its predefined list. For those TLDs a space is inserted before the TLD, so websites, email addresses, and other domain-like strings are split incorrectly. Text-to-Speech (TTS) systems then mispronounce them, making them unrecognizable and disrupting the user experience.

I have worked around this with a monkey-patch in my case, but I believe it should be addressed in the library itself to serve LiveKit's global audience.
Proposed Solution
Expand the TLD List: Extend the websites regex pattern to include a comprehensive list of both generic and country-code TLDs.
Make TLDs Configurable:
Modify the split_sentences Function:

# More comprehensive list of TLDs from the IANA registry
tlds = (
"com|net|org|edu|gov|mil|int|info|biz|name|pro|museum|coop|aero|"
"dev|app|io|ai|co|me|tv|xyz|cloud|tech|design|studio|online|store|"
"ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|"
"ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|"
"ca|cat|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|"
"de|dj|dk|dm|do|dz|"
"ec|ee|eg|eh|er|es|et|eu|eus|"
"fi|fj|fk|fm|fo|fr|"
"ga|gal|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|"
"hk|hm|hn|hr|ht|hu|"
"id|ie|il|im|in|io|iq|ir|is|it|"
"je|jm|jo|jp|"
"ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|"
"la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|"
"ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|"
"na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|"
"om|"
"pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|"
"qa|"
"re|ro|rs|ru|rw|"
"sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|"
"tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|"
"ua|ug|uk|us|uy|uz|"
"va|vc|ve|vg|vi|vn|vu|"
"wf|ws|"
"ye|yt|"
"za|zm|zw"
)
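
As a rough illustration of how an expanded, configurable TLD list could be plugged into a websites pattern, here is a minimal sketch. It is not the library's actual code: the names DEFAULT_TLDS, build_website_pattern, and protect_domains are hypothetical, and the `<prd>` placeholder and the short default list are assumptions made for the example.

```python
import re

# Hypothetical short default; the full `tlds` string above could be passed instead.
DEFAULT_TLDS = "com|net|org|io|gov|edu|me"


def build_website_pattern(tlds: str = DEFAULT_TLDS) -> re.Pattern:
    """Compile a pattern matching a dot followed by one of the given TLDs."""
    # Drop empty entries and duplicates (the list above repeats a few TLDs).
    unique = {t for t in tlds.split("|") if t}
    # Longest-first ordering lets "com" be tried before "co"; the trailing \b
    # keeps "in" from matching inside a longer word such as "india".
    ordered = sorted(unique, key=lambda t: (-len(t), t))
    return re.compile(r"[.](" + "|".join(ordered) + r")\b", re.IGNORECASE)


def protect_domains(text: str, tlds: str = DEFAULT_TLDS) -> str:
    """Swap the dot before a known TLD for a placeholder (assumed: <prd>)
    so a later sentence-splitting step does not treat it as a sentence end."""
    return build_website_pattern(tlds).sub(r"<prd>\1", text)


if __name__ == "__main__":
    # ".de" is untouched with the short default list, protected with an extended one.
    print(protect_domains("Visit example.de today."))
    print(protect_domains("Visit example.de today.", tlds="com|de"))
```

Passing the TLD list as a parameter is one way to make it configurable: callers who only need a handful of domains can keep the pattern small, while others can supply the full list.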