word timestamp #7

Open
dangvansam opened this issue May 6, 2021 · 1 comment
Comments

@dangvansam

Can I get word timestamps when predicting an audio file?

@iceychris
Owner

iceychris commented May 6, 2021

Hey!

You can extract alignment information by adding some code after line 886:

# iterate through all timesteps
y_seq, log_p = [], 0.0
for t, h_t_enc in enumerate(encoder_out):

    iters = 0
    while iters < max_iters:
        iters += 1

        # h_t_enc is of shape [H]
        # go through the joint network
        _h_t_pred = h_t_pred[None]
        _h_t_enc = h_t_enc[None, None, None]
        joint_out = self.joint(
            _h_t_pred, _h_t_enc, temp=temp_model, softmax=True, log=False
        )

        # decode one character
        # extra["outs"].append(joint_out.clone())
        prob, pred = joint_out.max(-1)
        pred = int(pred)
        log_p += float(prob)

        # if blank, advance encoder state
        # if not blank, add to the decoded sequence so far
        # and advance predictor state
        if pred == self.blank:
            break
        else:
            # fuse with lm
            _, prob, pred = fuser.fuse(joint_out, prob, pred, alpha=alpha)
            # print(iters)
            y_seq.append(pred)
            y_one_char[0][0] = pred

            # advance predictor
            h_t_pred, pred_state = self.predictor(y_one_char, state=pred_state)

            # advance lm
            fuser.advance(y_one_char, temp=temp_lm)

Save the current encoder timestamp index t (from line 855) together with the current output token pred in a list.
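For reference, a minimal sketch of that change (the alignment list is an illustrative name, not something in the repo; t and pred are the variables from the loop above):

# before the decoding loop, next to y_seq
alignment = []  # collected (encoder_timestep, token_id) pairs

# inside the non-blank branch, right after y_seq.append(pred)
alignment.append((t, pred))  # token pred was emitted at encoder step t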

To convert t to seconds, you could use something like:

# encoder input freq, depends on the model architecture
#  usually 80ms
encoder_freq = 0.08

# rough alignment estimate for an output at encoder output index t
t_seconds = t * encoder_freq

Note that this is just a rough estimate; the actual alignment is usually slightly off with RNN-T-based models, as shown in Figure 1 of this paper.
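If you need word-level rather than token-level timestamps, you could then group the collected (t, pred) pairs into words. Here is a minimal sketch assuming a character-level vocabulary in which a space token marks word boundaries; the helper name and the itos id-to-string mapping are hypothetical, not part of this repo:

def words_with_timestamps(alignment, itos, encoder_freq=0.08):
    # alignment: list of (encoder_timestep, token_id) pairs collected above
    # itos: maps a token id to its string (single characters here)
    words, chars, start_t = [], [], None
    for t, token_id in alignment:
        ch = itos[token_id]
        if ch == " ":
            # a space closes the current word, if any
            if chars:
                words.append(("".join(chars), start_t * encoder_freq, t * encoder_freq))
            chars, start_t = [], None
        else:
            if start_t is None:
                start_t = t  # first character of a new word
            chars.append(ch)
    # flush the final word
    if chars:
        end_t = alignment[-1][0]
        words.append(("".join(chars), start_t * encoder_freq, end_t * encoder_freq))
    return words  # [(word, start_seconds, end_seconds), ...]

Each returned tuple gives the word together with rough start and end times in seconds, subject to the same alignment caveat as above.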
