-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Word-Level Alignment #215
base: main
Are you sure you want to change the base?
Conversation
|
Update 12/09: Update 12/17: |
@fabiocat93 Would you kindly allow the workflow to run tests on this? All the forced alignment-relevant tests pass. I haven't run all the tests locally because I still have to address the squim issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirm forced alignment tests don't pass (the rest of the tests pass with no problem). I have also left some more comments. Moreover, I would suggest uncommenting and fixing transcribe_timestamped
and relative tests.
@@ -182,7 +172,8 @@ def test_align_transcriptions_fixture(resampled_mono_audio_sample: Audio, script | |||
aligned_transcriptions = align_transcriptions(audios_and_transcriptions_and_language) | |||
assert len(aligned_transcriptions) == 2 | |||
assert len(aligned_transcriptions[0]) == 1 | |||
assert aligned_transcriptions[0][0].text == "test" | |||
if aligned_transcriptions[0][0]: | |||
assert aligned_transcriptions[0][0].text == "test" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ibevers, how is the expected transcript text "test"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fabiocat93 Good point--will change to match the audio file
audios_and_transcriptions_and_language = [ | ||
(resampled_had_that_curiosity_audio_sample, script_line_fixture_curiosity, Language(language_code="en")) | ||
] | ||
aligned_transcriptions = align_transcriptions(audios_and_transcriptions_and_language) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aligned_transcriptions is [[ScriptLine(text='I had that curiosity beside me at this moment', speaker=None, start=0.664, end=3.098, chunks=[ScriptLine(text='I had that curiosity beside me at this moment.', speaker=None, start=0.664, end=3.098, chunks=[ScriptLine(text='I', speaker=None, start=0.664, end=0.684, chunks=[ScriptLine(text='I', speaker=None, start=0.664, end=0.684, chunks=None)]), ScriptLine(text='had', speaker=None, start=0.684, end=0.845, chunks=[ScriptLine(text='', speaker=None, start=0.684, end=0.704, chunks=None), ScriptLine(text='h', speaker=None, start=0.704, end=0.765, chunks=None), ScriptLine(text='a', speaker=None, start=0.765, end=0.805, chunks=None), ScriptLine(text='d', speaker=None, start=0.805, end=0.845, chunks=None)]), ScriptLine(text='that', speaker=None, start=0.845, end=1.026, chunks=[ScriptLine(text='', speaker=None, start=0.845, end=0.885, chunks=None), ...
Something is wrong with this. Chunks should include the words and not the entire sentence again... @ibevers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why do words have chunks? What's the point? Can you clarify that to me please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is wrong with this. Chunks should include the words and not the entire sentence again...
@fabiocat93 Nothing is wrong here. Input transcriptions are aligned at the sentence level, so a transcription that is one sentence long will have a chunk for the sentence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why do words have chunks? What's the point? Can you clarify that to me please?
@fabiocat93 Words have chunks for transparency into how the alignment was done and because having character-level alignments will allow easy mapping to phoneme-level alignments in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should conceptually and internally agree on what ScriptLine contains. If you imagine a theatre script, you have a collection of sentences by potentially multiple speakers represented as an ordered list of sentences. Analogously, in our case, a script can be represented as a collection of ScriptLine objects, and a ScriptLine contains one sentence (which contains words). I would have max one sentence per ScriptLine for a matter of consistency at least among us in the group
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you're thinking--in a theater script, a particular character might say multiple sentences in a row though, so if we make a list of sentences, it won't exactly correspond to the structure of a theater script. Previously, I was thinking of ScriptLine as a generic data structure for hierarchically representing text--I can change it to be an ordered list of sentence ScriptLine objects. Perhaps we can discuss this at Biometrics?
@fabiocat93 Are you referring to these tests or other tests? They pass on my machine. |
@fabiocat93 Will do, thanks! |
@fabiocat93 This is outside the scope of this PR. I will fix the |
@fabiocat93 I'm trying to figure out why the tests passed on my machine but not the GitHub actions server. I'm seeing that the forced alignment tests failed because of worker segfaults. According to ChatGPT:
I'll try to confirm that this is the source of the issue. |
@fabiocat93 This may be related as I believe wav2vec2 uses some C++ modules under-the-hood: |
@fabiocat93 I realized you gave me permission to run the tests on my own--thanks! I tried running them serially, and got a segfault again. |
@fabiocat93 The tests all call |
@fabiocat93 I figured out the issue!
I changed the devicetype used to CPU and macOS-tests worked. I believe this is because the |
54cdcaa
to
0429de0
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #215 +/- ##
==========================================
+ Coverage 60.24% 66.39% +6.15%
==========================================
Files 113 120 +7
Lines 4017 4232 +215
==========================================
+ Hits 2420 2810 +390
+ Misses 1597 1422 -175 ☔ View full report in Codecov by Sentry. |
Related: #232 |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks ok to me, but would ask you to address my comment above related to ScriptLine before merging it. I would maybe ask some others to see what they think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see some useful scenarios for using compare_alignments. Thank you for implementing it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, glad to hear!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good overall, but I recommend addressing my earlier comment about ScriptLine before proceeding with the merge. You might also consider gathering feedback from others in the group to ensure alignment
@fabiocat93 Ah, okay, good to know. I will update according! |
I'll add it to the Biometrics agenda to ensure alignment :) |
Description
Related Issue(s)
#14
Motivation and Context
Corrects buggy forced alignment and adds more granular alignment (word and character), which is useful for various purposes. This is a character-based approach. I plan to add a phoneme-based approach in the future to compare to this.
How Has This Been Tested?
Added test to compare with publicly available alignment. Plan to do more comprehensive evaluation on TIMIT as soon as possible.
Screenshots (if appropriate):
Types of changes
Checklist: