Add Word-Level Alignment #215

Open · wants to merge 28 commits into main

Conversation

@ibevers (Collaborator) commented Dec 5, 2024

Description

  • Fixes a bug that was leading to incorrect alignments.
  • Updates the output to include transcription-, sentence-, word-, and character-level timestamps.
  • Adds a recursive evaluation function that checks that the timestamp differences between two alignments stay under a given threshold (see the sketch after this list).
  • Adds a test validating that the generated alignments match a publicly available alignment within a tolerance of 0.1 seconds for every start and end timestamp at every aligned level present in both alignments.
  • Makes greater use of SenseLab native data types (the code can be fully converted to SenseLab types in the future).
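
A minimal sketch of the recursive threshold check described above; the real senselab compare_alignments may have a different signature, and the AlignmentNode stand-in below only mirrors the ScriptLine fields used in this PR (text, start, end, chunks):

```python
# Hedged sketch only, not the PR's exact implementation.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlignmentNode:
    text: str
    start: Optional[float] = None
    end: Optional[float] = None
    chunks: Optional[List["AlignmentNode"]] = None


def compare_alignments(a: AlignmentNode, b: AlignmentNode, tolerance: float = 0.1) -> None:
    """Recursively assert that every shared start/end timestamp differs by <= tolerance seconds."""
    if a.start is not None and b.start is not None:
        assert abs(a.start - b.start) <= tolerance, f"start mismatch for {a.text!r}"
    if a.end is not None and b.end is not None:
        assert abs(a.end - b.end) <= tolerance, f"end mismatch for {a.text!r}"
    # Only recurse into levels that are present in both alignments.
    if a.chunks and b.chunks:
        for child_a, child_b in zip(a.chunks, b.chunks):
            compare_alignments(child_a, child_b, tolerance)
```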

Related Issue(s)

#14

Motivation and Context

Corrects buggy forced alignment and adds more granular (word- and character-level) alignment, which is useful for a variety of downstream purposes. This is a character-based approach; I plan to add a phoneme-based approach in the future to compare against it.

How Has This Been Tested?

Added a test that compares the generated alignments with a publicly available alignment. A more comprehensive evaluation on TIMIT is planned as soon as possible.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • My code follows the code style of this project.

@ibevers ibevers linked an issue Dec 5, 2024 that may be closed by this pull request
@ibevers (Collaborator Author) commented Dec 5, 2024

  • Identify where the timestamps are calculated for indices
  • Ensure indices for words are present at that point
  • Update the code to calculate timestamps for words (see the sketch after this list)
  • Ensure word timestamps are stored in the final result
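
A rough sketch of the word-timestamp step above, illustrative only: names like CharSpan and frame_duration_s are assumptions, not the PR's actual code. The idea is to group character-level spans from a CTC forced alignment into words and convert frame indices to seconds.

```python
# Illustrative sketch, not the PR's implementation.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CharSpan:
    char: str
    start_frame: int
    end_frame: int


def word_timestamps(chars: List[CharSpan], frame_duration_s: float) -> List[Tuple[str, float, float]]:
    """Return (word, start_s, end_s) tuples by grouping character spans at spaces."""
    words: List[List[CharSpan]] = []
    current: List[CharSpan] = []
    for span in chars:
        if span.char == " ":
            if current:
                words.append(current)
                current = []
        else:
            current.append(span)
    if current:
        words.append(current)
    return [
        ("".join(s.char for s in w), w[0].start_frame * frame_duration_s, w[-1].end_frame * frame_duration_s)
        for w in words
    ]
```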

@ibevers (Collaborator Author) commented Dec 17, 2024

Update 12/09:
Reviewed code and identified what needs to be changed. Still figuring out how exactly it should be changed.

Update 12/17:
The code works and returns an aligned ScriptLine for sentences, words, and characters. I am now fixing discrepancies with the previous tests. I also plan to add a correctness test for at least one labeled example.

@ibevers ibevers marked this pull request as ready for review December 19, 2024 17:49
@ibevers (Collaborator Author) commented Dec 19, 2024

@fabiocat93 Would you kindly allow the workflow to run tests on this? All the forced alignment-relevant tests pass. I haven't run all the tests locally because I still have to address the squim issue.

@fabiocat93 (Collaborator) left a comment

I confirm the forced alignment tests don't pass (the rest of the tests pass with no problem). I have also left some more comments. Moreover, I would suggest uncommenting and fixing transcribe_timestamped and its related tests.

```diff
@@ -182,7 +172,8 @@ def test_align_transcriptions_fixture(resampled_mono_audio_sample: Audio, script
     aligned_transcriptions = align_transcriptions(audios_and_transcriptions_and_language)
     assert len(aligned_transcriptions) == 2
     assert len(aligned_transcriptions[0]) == 1
-    assert aligned_transcriptions[0][0].text == "test"
+    if aligned_transcriptions[0][0]:
+        assert aligned_transcriptions[0][0].text == "test"
```
Collaborator:

@ibevers, how is the expected transcript text "test"?

Collaborator Author:

@fabiocat93 Good point--will change to match the audio file

```python
audios_and_transcriptions_and_language = [
    (resampled_had_that_curiosity_audio_sample, script_line_fixture_curiosity, Language(language_code="en"))
]
aligned_transcriptions = align_transcriptions(audios_and_transcriptions_and_language)
```
Collaborator:

aligned_transcriptions is [[ScriptLine(text='I had that curiosity beside me at this moment', speaker=None, start=0.664, end=3.098, chunks=[ScriptLine(text='I had that curiosity beside me at this moment.', speaker=None, start=0.664, end=3.098, chunks=[ScriptLine(text='I', speaker=None, start=0.664, end=0.684, chunks=[ScriptLine(text='I', speaker=None, start=0.664, end=0.684, chunks=None)]), ScriptLine(text='had', speaker=None, start=0.684, end=0.845, chunks=[ScriptLine(text='', speaker=None, start=0.684, end=0.704, chunks=None), ScriptLine(text='h', speaker=None, start=0.704, end=0.765, chunks=None), ScriptLine(text='a', speaker=None, start=0.765, end=0.805, chunks=None), ScriptLine(text='d', speaker=None, start=0.805, end=0.845, chunks=None)]), ScriptLine(text='that', speaker=None, start=0.845, end=1.026, chunks=[ScriptLine(text='', speaker=None, start=0.845, end=0.885, chunks=None), ...

Something is wrong with this. Chunks should include the words and not the entire sentence again... @ibevers

Collaborator:

Also, why do words have chunks? What's the point? Can you clarify that to me please?

@ibevers (Collaborator Author), Dec 23, 2024:

Something is wrong with this. Chunks should include the words and not the entire sentence again...

@fabiocat93 Nothing is wrong here. Input transcriptions are aligned at the sentence level, so a transcription that is one sentence long will have a chunk for the sentence.

@ibevers (Collaborator Author), Dec 23, 2024:

Also, why do words have chunks? What's the point? Can you clarify that to me please?

@fabiocat93 Words have chunks for transparency into how the alignment was done and because having character-level alignments will allow easy mapping to phoneme-level alignments in the future.
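
For concreteness, a hedged illustration of that nesting using plain dicts that mirror the ScriptLine repr pasted above (not the senselab class itself); the values come from the word "had" in that output:

```python
# Illustration only: the aligned output nests sentence -> word -> character,
# so each word's chunks are its character-level alignments.
word_had = {
    "text": "had",
    "start": 0.684,
    "end": 0.845,
    "chunks": [  # character-level spans, useful for later phoneme mapping
        {"text": "", "start": 0.684, "end": 0.704},   # leading blank from the space token
        {"text": "h", "start": 0.704, "end": 0.765},
        {"text": "a", "start": 0.765, "end": 0.805},
        {"text": "d", "start": 0.805, "end": 0.845},
    ],
}
```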

Collaborator:

We should conceptually and internally agree on what ScriptLine contains. If you imagine a theatre script, you have a collection of sentences by potentially multiple speakers, represented as an ordered list of sentences. Analogously, in our case, a script can be represented as a collection of ScriptLine objects, where each ScriptLine contains one sentence (which in turn contains words). I would keep at most one sentence per ScriptLine, if only for consistency within the group.
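
A hedged illustration of the convention proposed here, using plain dicts rather than the actual ScriptLine API: the script is an ordered list of entries, each holding exactly one sentence whose chunks are its words (timestamps taken from the output pasted above).

```python
# Illustration of the proposed convention, not an agreed-upon API.
script = [
    {
        "text": "I had that curiosity beside me at this moment.",
        "speaker": None,
        "start": 0.664,
        "end": 3.098,
        "chunks": [
            {"text": "I", "start": 0.664, "end": 0.684},
            {"text": "had", "start": 0.684, "end": 0.845},
            # ... one entry per word
        ],
    },
    # ... one ScriptLine-like entry per sentence
]
```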

Collaborator Author:

I see what you're thinking--in a theater script, a particular character might say multiple sentences in a row though, so if we make a list of sentences, it won't exactly correspond to the structure of a theater script. Previously, I was thinking of ScriptLine as a generic data structure for hierarchically representing text--I can change it to be an ordered list of sentence ScriptLine objects. Perhaps we can discuss this at Biometrics?

@ibevers (Collaborator Author) commented Dec 23, 2024

I confirm the forced alignment tests don't pass (the rest of the tests pass with no problem). I have also left some more comments. Moreover, I would suggest uncommenting and fixing transcribe_timestamped and its related tests.

@fabiocat93 Are you referring to these tests or other tests? They pass on my machine.
[Screenshot: local test results, 2024-12-23]

@ibevers (Collaborator Author) commented Dec 23, 2024

Moreover, I would suggest uncommenting and fixing transcribe_timestamped and its related tests.

@fabiocat93 Will do, thanks!

@ibevers (Collaborator Author) commented Jan 2, 2025

Moreover, I would suggest uncommenting and fixing transcribe_timestamped and its related tests.

@fabiocat93 Will do, thanks!

@fabiocat93 This is outside the scope of this PR. I will fix the transcribe_timestamped workflow here: #231

@ibevers (Collaborator Author) commented Jan 2, 2025

@fabiocat93 I'm trying to figure out why the tests passed on my machine but not on the GitHub Actions server. I'm seeing that the forced alignment tests failed because of worker segfaults. According to ChatGPT:

Using pytest-xdist can spawn multiple workers. Wav2Vec2 (and underlying native libraries) occasionally segfaults when run in parallel on certain hardware/OS combos. On a local machine, you may be running tests serially (or on different hardware), so it doesn’t crash.

I'll try to confirm that this is the source of the issue.
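
One hedged way to check that suspicion locally (illustrative only; the test path comes from this PR, but the exact CI invocation may differ): run the forced alignment tests once with xdist disabled and once with parallel workers, and see whether only the parallel run segfaults.

```python
# Illustrative sketch: compare a serial run (xdist disabled) with a parallel run.
import sys

import pytest

TEST_PATH = "src/tests/audio/tasks/forced_alignment_test.py"

if __name__ == "__main__":
    parallel = "--parallel" in sys.argv
    flags = ["-n", "2"] if parallel else ["-p", "no:xdist"]  # two workers vs. fully serial
    raise SystemExit(pytest.main([TEST_PATH, *flags]))
```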

@ibevers ibevers mentioned this pull request Jan 2, 2025
@ibevers (Collaborator Author) commented Jan 2, 2025

@fabiocat93 This may be related, as I believe wav2vec2 uses some C++ modules under the hood:
pytest-dev/pytest-xdist#126

@ibevers (Collaborator Author) commented Jan 2, 2025

@fabiocat93 I realized you gave me permission to run the tests on my own--thanks! I tried running them serially, and got a segfault again.

@ibevers (Collaborator Author) commented Jan 2, 2025

@fabiocat93 The tests all call align_transcriptions. They all fail with a segfault in the same place: "/Users/runner/work/senselab/senselab/.venv/lib/python3.10/site-packages/transformers/models/wav2vec2/modeling_wav2vec2.py", line 2229 in forward

@ibevers (Collaborator Author) commented Jan 2, 2025

@fabiocat93 I figured out the issue!

Nested-virtualization and Metal Performance Shaders (MPS) are not supported due to the limitation of Apple's Virtualization Framework.
https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

I changed the device type used to CPU and the macOS tests worked. I believe this is because the _select_device_and_dtype() function selects MPS on the GitHub Actions macOS runner.
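
A hedged sketch of the workaround; this is not senselab's _select_device_and_dtype() (the torch calls below are standard, but the helper name pick_device is made up for illustration):

```python
# Sketch: fall back to CPU unless CUDA is available, and only consider MPS when
# explicitly allowed (e.g. never on GitHub-hosted macOS runners, where MPS is
# unusable under Apple's Virtualization Framework).
import torch


def pick_device(allow_mps: bool = False) -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda")
    if allow_mps and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


device = pick_device(allow_mps=False)  # CI-safe default
```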

@codecov-commenter commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 96.26168% with 4 lines in your changes missing coverage. Please review.

Project coverage is 66.39%. Comparing base (113721a) to head (650e5f5).
Report is 84 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...b/audio/tasks/forced_alignment/forced_alignment.py     94.73%    3 Missing ⚠️
src/tests/audio/tasks/forced_alignment_test.py            97.05%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #215      +/-   ##
==========================================
+ Coverage   60.24%   66.39%   +6.15%     
==========================================
  Files         113      120       +7     
  Lines        4017     4232     +215     
==========================================
+ Hits         2420     2810     +390     
+ Misses       1597     1422     -175     


@ibevers (Collaborator Author) commented Jan 2, 2025

Related: #232

@ibevers ibevers requested a review from fabiocat93 January 2, 2025 22:06
@fabiocat93 (Collaborator):

@fabiocat93 I figured out the issue!

Nested-virtualization and Metal Performance Shaders (MPS) are not supported due to the limitation of Apple's Virtualization Framework.
https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

I changed the device type used to CPU and the macOS tests worked. I believe this is because the _select_device_and_dtype() function selects MPS on the GitHub Actions macOS runner.

_select_device_and_dtype() selects the fastest device among those supported by the model. You can specify the supported devices and exclude MPS; that way, the model will run only on CUDA or CPU, depending on availability.

@fabiocat93 (Collaborator) left a comment


It looks OK to me, but I would ask you to address my comment above related to ScriptLine before merging it. I would maybe also ask some others to see what they think.

Collaborator:

I can see some useful scenarios for using compare_alignments. Thank you for implementing it!

Collaborator Author:

No problem, glad to hear!

@fabiocat93 (Collaborator) left a comment

It looks good overall, but I recommend addressing my earlier comment about ScriptLine before proceeding with the merge. You might also consider gathering feedback from others in the group to ensure alignment

@ibevers (Collaborator Author) commented Jan 10, 2025

_select_device_and_dtype() selects the fastest device among those supported by the model. You can specify the supported devices and exclude MPS; that way, the model will run only on CUDA or CPU, depending on availability.

@fabiocat93 Ah, okay, good to know. I will update accordingly!

@ibevers (Collaborator Author) commented Jan 10, 2025

It looks good overall, but I recommend addressing my earlier comment about ScriptLine before proceeding with the merge. You might also consider gathering feedback from others in the group to ensure alignment

I'll add it to the Biometrics agenda to ensure alignment :)

Successfully merging this pull request may close these issues.

Task: Word-Level Alignment