MalformedCaptionError: Invalid Time Format #44

Open
ZhijingEu opened this issue Oct 16, 2022 · 5 comments · May be fixed by #45

Comments

ZhijingEu commented Oct 16, 2022

Hey everyone - I just wanted to share a quick fix for a problem I noticed: webvtt-py fails to parse timestamps written in the format 0:1:5.2 rather than the zero-padded format 00:01:05.002.

I have written a regex find-and-replace that converts the format and shared it in this repo https://github.com/ZhijingEu/VTT_File_Cleaner together with a video tutorial https://www.youtube.com/watch?v=iZ0pOSL8JZw
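For readers who just want the idea, here is a minimal sketch of that kind of find-and-replace. It is not the code from the linked repo: the names `LOOSE_TS` and `normalize_timestamps` are made up for illustration, and it assumes the fraction is decimal seconds (so `.2` is padded to `.200`) and that only cue timing lines, the ones containing `-->`, should be touched.

```python
import re

# Loosely formatted timestamp: optional hours, 1-2 digit minutes and seconds,
# "." or "," as separator, 1-3 fractional digits (e.g. 0:1:5.2 or 1:5,25).
LOOSE_TS = re.compile(r'\b(?:(\d+):)?(\d{1,2}):(\d{1,2})[.,](\d{1,3})\b')


def _pad(match):
    hours, minutes, seconds, frac = match.groups()
    # Left-pad the integer fields; right-pad the fraction (".2" -> ".200").
    return (f"{int(hours or 0):02d}:{int(minutes):02d}:"
            f"{int(seconds):02d}.{frac.ljust(3, '0')}")


def normalize_timestamps(vtt_text):
    """Zero-pad timestamps on cue timing lines; leave every other line alone."""
    lines = []
    for line in vtt_text.splitlines():
        if '-->' in line:
            line = LOOSE_TS.sub(_pad, line)
        lines.append(line)
    return "\n".join(lines)
```

The normalized text can then be handed to a strict parser. Note that `splitlines` plus `join` drops a trailing newline, which should not matter for parsing.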

Hope this helps anyone who runs into this issue in the future.

apetresc commented Nov 7, 2022

Thank you @ZhijingEu - this is certainly helpful, but I think the real solution to the problem is for webvtt-py to be much more forgiving in the way it parses VTTs. I don't know what the precise VTT spec says about time formats, but given that mainstream sources such as the Microsoft Teams autogenerated transcripts exhibit this behaviour, it would behoove webvtt-py to accommodate this relatively trivial deviation.

I'll hopefully open a PR for that soon.

filipsworks commented Dec 21, 2022

https://www.w3.org/TR/webvtt1/#webvtt-timestamp

The standard requires exactly 3 digits for the fractional seconds (and exactly 2 each for minutes and seconds). Otherwise things like VideoJS will stop execution.
The only optional part is the hours field, and only when it is 0.
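As an illustration of the rule above, here is a small sketch of a strict check written against my reading of the linked spec section: optional hours of two or more digits, two-digit minutes and seconds in the range 00-59, a full stop, then three digits. The names `STRICT_TS` and `is_strict_timestamp` are hypothetical and not part of webvtt-py or any other library.

```python
import re

# Strict WebVTT timestamp: [hh+:]mm:ss.ttt
# Hours are optional (omitted means 0); minutes and seconds are 00-59.
STRICT_TS = re.compile(r'(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}')


def is_strict_timestamp(text):
    """Return True only if the whole string is a spec-conformant timestamp."""
    return STRICT_TS.fullmatch(text) is not None
```

With this check, `00:01:05.002` and `01:05.002` pass, while the Teams-style `0:1:5.2` and a comma-separated `00:01:05,002` are rejected.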

Basically, Teams, AWS and many other services are breaking the standard, and instead of getting it fixed there, everyone is writing their own hacks to handle the broken output.

apetresc commented Jan 3, 2023

I get that, but pragmatically speaking, it's probably best for tools to be as permissive as they reasonably can, especially for spec violations that are widely common in the wild. Users of webvtt-py likely can't choose to just consume transcripts from some other source, but they can choose to just use some other VTT parser.

jrowen commented Jan 19, 2023

I've found this works as a temporary fix for Teams time formats.

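```python
import io
import re

from webvtt import WebVTT, structures
from webvtt.parsers import WebVTTParser

# Monkey-patch webvtt-py's patterns so they accept 1-2 digit minutes/seconds
# and 1-3 digit fractions (e.g. Teams' 0:1:5.2). Raw strings avoid
# invalid-escape warnings; [.,] replaces the unescaped "." separator.
structures.TIMESTAMP_PATTERN = re.compile(r'(\d+)?:?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')
WebVTTParser.TIMEFRAME_LINE_PATTERN = re.compile(
    r'\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})\s*-->\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})'
)

# tcontent is assumed to be a str holding the contents of the .vtt file.
for caption in WebVTT.read_buffer(io.StringIO(tcontent)):
    print(caption.start)
    print(caption.end)
    print(caption.text)
```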
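To see what the relaxed patterns accept without installing webvtt-py, they can be tried standalone. This is only a sketch: the two patterns are copies of the ones in the comment above (with the separator written as `[.,]`), while `split_timeframe` and the sample timing lines are made up for illustration and are not part of the library.

```python
import re

# Standalone copies of the relaxed patterns from the workaround.
TIMESTAMP = re.compile(r'(\d+)?:?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')
TIMEFRAME = re.compile(
    r'\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})\s*-->\s*'
    r'((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})'
)


def split_timeframe(line):
    """Return ((h, m, s, frac), (h, m, s, frac)) as strings, or None if no match."""
    m = TIMEFRAME.match(line)
    if m is None:
        return None
    # Each captured timestamp is split into its hour/minute/second/fraction parts.
    return tuple(TIMESTAMP.match(ts).groups() for ts in m.groups())


print(split_timeframe("0:0:3.5 --> 0:0:7.25"))
print(split_timeframe("00:00:03.500 --> 00:00:07.250"))
```

Both the Teams-style line and the zero-padded line are accepted; a line without a timing arrow yields `None`. When the hours field is omitted, the first group comes back as `None`.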

apetresc commented Jan 20, 2023

Ah yeah, clever - you can just monkey-patch those variables directly in the module.

... still would be nicer not to have to do that, though 😅
