word timestamp #7

Open
dangvansam opened this issue May 6, 2021 · 1 comment
Comments

@dangvansam

Can I get word timestamps when predicting an audio file?

@iceychris
Owner

iceychris commented May 6, 2021

Hey!

You can extract alignment information by adding some code after line 886:

# iterate through all timesteps
y_seq, log_p = [], 0.0
for t, h_t_enc in enumerate(encoder_out):

    iters = 0
    while iters < max_iters:
        iters += 1

        # h_t_enc is of shape [H]
        # go through the joint network
        _h_t_pred = h_t_pred[None]
        _h_t_enc = h_t_enc[None, None, None]
        joint_out = self.joint(
            _h_t_pred, _h_t_enc, temp=temp_model, softmax=True, log=False
        )

        # decode one character
        # extra["outs"].append(joint_out.clone())
        prob, pred = joint_out.max(-1)
        pred = int(pred)
        log_p += float(prob)

        # if blank, advance encoder state
        # if not blank, add to the decoded sequence so far
        # and advance predictor state
        if pred == self.blank:
            break
        else:
            # fuse with lm
            _, prob, pred = fuser.fuse(joint_out, prob, pred, alpha=alpha)
            # print(iters)
            y_seq.append(pred)
            y_one_char[0][0] = pred

            # advance predictor
            h_t_pred, pred_state = self.predictor(y_one_char, state=pred_state)

            # advance lm
            fuser.advance(y_one_char, temp=temp_lm)

Save the current encoder timestamp index t (from line 855) together with the current output token pred in a list.
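For reference, a minimal sketch of that change (the alignment list is an illustrative name, not something in the repo; t and pred are the variables from the loop above):

# before the decoding loop, next to y_seq
alignment = []  # collected (encoder_timestep, token_id) pairs

# inside the non-blank branch, right after y_seq.append(pred)
alignment.append((t, pred))  # token pred was emitted at encoder step t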

To convert t to seconds, you could use something like:

# encoder input freq, depends on the model architecture
#  usually 80ms
encoder_freq = 0.08

# rough alignment estimate for an output at encoder output index t
t_seconds = t * encoder_freq

Note that this is just a rough estimate; the actual alignment is usually slightly off with RNN-T-based models, as shown in Figure 1 of this paper.
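If you need word-level rather than token-level timestamps, you could then group the collected (t, pred) pairs into words. Here is a minimal sketch assuming a character-level vocabulary in which a space token marks word boundaries; the helper name and the itos id-to-string mapping are hypothetical, not part of this repo:

def words_with_timestamps(alignment, itos, encoder_freq=0.08):
    # alignment: list of (encoder_timestep, token_id) pairs collected above
    # itos: maps a token id to its string (single characters here)
    words, chars, start_t = [], [], None
    for t, token_id in alignment:
        ch = itos[token_id]
        if ch == " ":
            # a space closes the current word, if any
            if chars:
                words.append(("".join(chars), start_t * encoder_freq, t * encoder_freq))
            chars, start_t = [], None
        else:
            if start_t is None:
                start_t = t  # first character of a new word
            chars.append(ch)
    # flush the final word
    if chars:
        end_t = alignment[-1][0]
        words.append(("".join(chars), start_t * encoder_freq, end_t * encoder_freq))
    return words  # [(word, start_seconds, end_seconds), ...]

Each returned tuple gives the word together with rough start and end times in seconds, subject to the same alignment caveat as above.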
