Skip to content

ionite34/h2p-parser

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Heteronym to Phoneme Parser

Python package CodeQL codecov FOSSA Status

Fast parsing of English Heteronyms to Phonemes using contextual part-of-speech.

Provides the ability to convert heteronym graphemes to their phonetic pronunciations.

Designed to be used in conjunction with other fixed grapheme-to-phoneme dictionaries such as CMUdict

This package also offers a combined Grapheme-to-Phoneme dictionary, combining the functionality of fixed lookups handled by CMUdict and context-based parsing as offered by this module.

Usage

1. Combined Grapheme-to-Phoneme dictionary

The CMUDictExt class combines a pipeline for context-based heteronym parsing to phonemes and a fixed dictionary lookup replacement using the CMU Pronouncing Dictionary.

Example:

from h2p_parser.cmudictext import CMUDictExt

CMUDictExt = CMUDictExt()

# Parsing replacements for a line. This can be one or more sentences.
line = CMUDictExt.convert("The cat read the book. It was a good book to read.")
# -> "{DH AH0} {K AE1 T} {R EH1 D} {DH AH0} {B UH1 K}. {IH1 T} {W AA1 Z} {AH0} {G UH1 D} {B UH1 K} {T UW1} {R IY1 D}."

Additional optional parameters are available when defining a CMUDictExt instance:

Parameter Type Default Value Description
cmu_dict_path str None Path to a custom CMUDict file in .txt format
h2p_dict_path str None Path to a custom H2p Dictionary file in .json format. See the example.json for the expected format.
cmu_multi_mode int 0 Default selection index for CMUDict entries with multiple pronunciations as donated by the (1) or (n) format
process_numbers bool True Toggles conversion of some numbers and symbols to their spoken pronunciation forms. See numbers.py for details on what is covered.
phoneme_brackets bool True Surrounds phonetic words with curly brackets i.e. {R IY1 D}
unresolved_mode str keep Unresolved word resolution modes:
keep - Keeps the text-form word in the output.
remove - Removes the text-form word from the output.
drop - Returns the line as None if any word is unresolved.

2. Heteronym-to-Phoneme parsing only

To use only the core heteronym-to-phoneme parsing functions, without fixed dictionary support, use H2p class.

Example:

from h2p_parser.h2p import H2p

h2p = H2p(preload=True) # preload flag improves first-inference performance

# checking if a line contains a heteronym
state = h2p.contains_het("There are no heteronyms in this line.")
# -> False

# replacing a single line
line = h2p.replace_het("I read the book. It was a good book to read.")
# -> "I {R EH1 D} the book. It was a good book to {R IY1 D}."

# replacing a list of lines
lines = h2p.replace_het_list(["I read the book. It was a good book to read.",
                              "Don't just give the gift; present the present.",
                              "If you were to reject the product, it would be a reject."])
# -> ["I {R EH1 D} the book. It was a good book to {R IY1 D}.",
#     "Don't just give the gift; {P R IY0 Z EH1 N T} the {P R EH1 Z AH0 N T}.",
#     "If you were to {R IH0 JH EH1 K T} the product, it would be a {R IY1 JH EH0 K T}."]

Note: Depending on your performance requirements, there is a speed improvement for processing large text line batches by using replace_het_list() with a list of all text lines, instead of making repeated calls to replace_het(). See the performance section for more details and guidelines for optimizations.

License

The code in this project is released under Apache License 2.0.

FOSSA Status