The voice database format (*.vdb
) is an on-disk file format used by Cainteoir
Text-to-Speech to store the information needed for a synthetic voice to produce
audio when given phoneme and prosodic information. This format acts as a container
for different voice formats in a form suitable for reading directly from the file.
| u8 | An 8-bit unsigned integer |
| u16 | A 16-bit unsigned integer |
| u32 | A 32-bit unsigned integer |
| u64 | A 64-bit unsigned integer |
| f8:8 | A fixed point number (8-bit integral part, 8-bit fraction part) |
| f16:16 | A fixed point number (16-bit integral part, 16-bit fraction part) |
| str | A variable-length UTF-8 string terminated by a NULL (0
) character |
| pstr | A u32
containing the offset from the start of the file to a str
. |
The voice database file has the following general structure:
Header
Section
...
Section
A String Table
section follows any section (including the Header
section)
that contains pstr
values.
The specific layout is dependent on the synthesizer type.
These files describe how Cainteoir Text-to-Speech phonemes are mapped to the
MBROLA PHO file format and the voice-specific phonemes. This is then sent to
the mbrola
program to synthesize using the MBROLA voice database file.
The id
in the Header section is the name of the MBROLA voice database file
to use.
A Duration Table section is used to specify the duration model to be used.
The Phoneme Table and Phoneme Unit Table sections map Cainteoir Text-to-Speech phonemes to the voice-specific MBROLA phonemes.
The Pitch Data section specifies the mean and standard deviation of the voice's pitch.
The header section identifies the file as a VoiceDB file and provides information about the voice. This information is used to construct the RDF metadata associated with the voice.
Field | Type | Offset |
---|---|---|
magic | u8[7] | 0 |
endianness | u16 | 7 |
rdfns | pstr | 9 |
id | pstr | 13 |
name | pstr | 17 |
synthesizer | pstr | 21 |
author | pstr | 25 |
locale | pstr | 29 |
gender | u8 | 33 |
volume-scale | f8:8 | 34 |
frequency | u16 | 36 |
channels | u8 | 38 |
sample-format | pstr | 39 |
END OF HEADER | 43 |
The magic
field identifies the file as a voice database file. This is the
string "VOICEDB".
The endianness
field contains the value 0x3031
. It is used to identify
whether the file is in little endian (10
) or big endian (01
) order.
The rdfns
field contains the URI used as the RDF namespace for generated RDF
metadata. It is also the URI associated with information about the synthesizer.
The id
field is the programatic identifier used to reference the voice. Any
RDF metadata associated with the voice will be associated with rdfns:id
.
The name
field is the name of the voice that will be displayed to the user.
The synthesizer
field is the name of the synthesizer this voice uses. This is
used to select the synthesis algorithm, as well as being displayed to the user.
The author
field is the name of the authors/companies that created the voice.
The locale
field is the name of the BCP 47 language code for the default
language and accent supported by the voice.
The gender
field is the gender of the voice, with M
used for male voices
and F
used for female voices.
The volume-scale
field is the value used to scale the audio to 0.5 on a scale
of 0.0 to 1.0. This represents 100% volume.
This provides information about the fundamental frequency (F0) of the voice, used to determine the pitch envelopes of the different phonemes. It describes the default speaking pitch of the voice.
Field | Type | Offset |
---|---|---|
magic | u8[3] | 0 |
baseline | f16:16 | 3 |
step | f16:16 | 7 |
sdev | f16:16 | 11 |
END OF HEADER | 15 |
The magic
field identifies the file as a voice database file. This is the
string "PTC".
The baseline
field is the average low pitch of the voice in Hertz. That is,
this represents the lowest pitch the voice is comfortable speaking.
The step
field is the frequency gap between each tone levels.
The sdev
field is the standard deviation of the pitch around each tone level
in Hertz.
This model is used as it can be adapted to the different prosody algorighms.
To calculate the average pitch for the IPA tone levels, you can use the following:
- Top Tone =
baseline + 4*step
- High Tone =
baseline + 3*step
- Mid Tone =
baseline + 2*step
- Low Tone =
baseline + step
- Bottom Tone =
baseline
To use the INTSINT
prosody model, T
is the Top Tone, M
is the Mid Tone
and B
is the Bottom Tone.
The baseline
, step
and sdev
values can be derived from a minimum (min
)
and maximum (max
) pitch value, such that:
sdev = (max - min)/ 20
baseline = min + 2*sdev
step = 4 *sdev
This is because there are 5 tone levels, each with 4 sdev
sections, with 2
sdev
sections either side of the mean value of each tone level.
This is the generic structure of a tabular section. That is, a fixed sized array of entities of the same type.
Field | Type | Offset |
---|---|---|
magic | u8[3] | 0 |
num-entries | u16 | 3 |
END OF SECTION | 5 |
The magic
field identifies the section as a data table. See the sub-sections
for the actual value.
The num-entries
field is the number of entries there are in this table.
After the section block, num-entries
entry blocks are written out in order.
An associated String Table section occurs after the last entry, with the pstr
strings from all the entry blocks included.
This is the binary representation of a (mean, standard-deviation) duration model.
A duration table has the "DUR" magic string, and each entry has the form:
Field | Type | Offset |
---|---|---|
phoneme1 | u64 | 0 |
phoneme2 | u64 | 8 |
duration-mean | u8 | 16 |
duration-stdev | u8 | 17 |
END OF ENTRY | 18 |
The phoneme1
field is the value of the phoneme as represented by the
cainteoir::ipa::phoneme
class.
The phoneme2
field is the value of the phoneme as represented by the
cainteoir::ipa::phoneme
class. This is for affricates and diphthongs.
Theduration-mean
field is the average duration of the phoneme in milliseconds.
Theduration-stdev
field is the standard deviation for the duration of the phoneme
in milliseconds.
This is the information about how a given unit should be constructed when synthesizing the associated phoneme.
A phoneme unit table has the "PUT" magic string, and each entry has the form:
Field | Type | Offset |
---|---|---|
name | pstr | 0 |
phoneme-start | u8 | 4 |
unit-start | u8 | 5 |
unit-end | u8 | 6 |
END OF ENTRY | 7 |
The name
field is the name of the unit as used to select the unit in the
phone or diphone synthesis.
The phoneme-start
field is the percentage offset of the phoneme at which this
unit is generated. The end of the phoneme is the phoneme-start
of the next
unit in the sequence, or 100%
if this is the last unit.
The unit-start
field is the percentage offset at which this unit starts. This
allows for partial units to be synthesized.
The unit-end
field is the percentage offset at which this unit ends. This
allows for partial units to be synthesized.
This is the information for how to map phonemes to units in unit-based phoneme synthesis.
A phoneme table has the "PHO" magic string, and each entry has the form:
Field | Type | Offset |
---|---|---|
phoneme1 | u64 | 0 |
phoneme2 | u64 | 8 |
first-unit | u16 | 16 |
num-units | u8 | 18 |
END OF ENTRY | 19 |
The phoneme1
field is the value of the phoneme as represented by the
cainteoir::ipa::phoneme
class.
The phoneme2
field is the value of the phoneme as represented by the
cainteoir::ipa::phoneme
class. This is for affricates and diphthongs.
The first-unit
field is the index of the first unit in the Phoneme Unit
Table.
Thenum-units
field is the number of sound units this phoneme is composed of.
A string table is a data table that does not contain a num-elements
field.
Instead, it contains an offset to the start of the next section. Each entry
is a str
value that is referenced by a pstr
field in the previous section.
This is designed to make it easy to traverse over the variable-length string data.
It has the form:
Field | Type | Offset |
---|---|---|
magic | u8[3] | 0 |
next-section | u32 | 3 |
END OF SECTION | 7 |
The magic
field identifies the section as a data table. This is the string
"STR".
The next-section
field is the offset to the next data block.
This is the list of 3-letter magic values used to identify the different section and table types. This list is non-normative and is useful when creating a new section type to avoid collisions in the magic values.
Magic | Usage |
---|---|
DUR | Duration Table |
PHO | Phoneme Table |
PTC | Pitch Data |
PUT | Phoneme Unit Table |
STR | String Table |
Copyright (C) 2014 Reece H. Dunn