MalformedCaptionError: Invalid Time Format #44

ZhijingEu · 2022-10-16T15:27:26Z

Hey everyone - I just wanted to share a quick fix for a problem I noticed: webvtt-py does not handle timestamps in the format 0:1:5.2, as opposed to 00:01:05.002.

I have written a regex find-and-replace to convert the format, which I've shared in this repo https://github.com/ZhijingEu/VTT_File_Cleaner along with a video tutorial https://www.youtube.com/watch?v=iZ0pOSL8JZw
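
For readers who want the general idea without opening the repo, here is a minimal sketch of that kind of regex find-and-replace. It is not the code from the linked repository; the names LOOSE_TIMESTAMP and normalize_vtt_timestamps are hypothetical, and it assumes a short fraction such as ".2" means two tenths of a second (200 ms), so it pads on the right.

```python
import re

# Lenient pattern (an assumption, not taken from the linked repo):
# optional hours, 1-2 digit minutes and seconds, 1-3 digit fraction.
LOOSE_TIMESTAMP = re.compile(r'(?:(\d+):)?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')


def _pad(match):
    """Rebuild one matched timestamp as hh:mm:ss.ttt."""
    hours, minutes, seconds, fraction = match.groups()
    hours = hours or '0'
    # Right-pad the fraction so ".2" becomes ".200" (assumed to mean 200 ms).
    return f'{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}.{fraction.ljust(3, "0")}'


def normalize_vtt_timestamps(text):
    """Pad timestamps on cue timing lines; leave every other line untouched."""
    lines = []
    for line in text.splitlines():
        if '-->' in line:  # only cue timing lines contain the arrow
            line = LOOSE_TIMESTAMP.sub(_pad, line)
        lines.append(line)
    return '\n'.join(lines)
```

The cleaned text can then be written back to a .vtt file or handed to the parser. Restricting the substitution to lines containing "-->" keeps caption text that happens to look like a time from being rewritten.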

Hope this helps someone out there in the future facing this issue

apetresc · 2022-11-07T20:05:29Z

Thank you @ZhijingEu - this is certainly helpful, but I think the real solution to the problem is for webvtt-py to be much more forgiving in the way it parses VTTs. I don't know what the precise VTT spec says about time formats, but judging by the fact that mainstream sources like, e.g., the Microsoft Teams autogenerated transcripts, exhibit this behaviour, it would behoove webvtt-py to accommodate this relatively trivial change.

I'll hopefully open a PR for that soon.

filipsworks · 2022-12-21T16:30:58Z

https://www.w3.org/TR/webvtt1/#webvtt-timestamp

The standard requires exactly 3 digits for the milliseconds. Otherwise, things like VideoJS will stop execution.
The only optional part is the hours field, and only if it is 0.

Basically, Teams, AWS, and many other services are breaking the standard, and instead of getting it fixed there, everyone is writing their own hacks to handle the broken output.
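
To make the rule concrete, here is a small sketch of a strict check based on my reading of the timestamp definition linked above. The names STRICT_TIMESTAMP and is_spec_compliant are hypothetical and are not part of webvtt-py or VideoJS.

```python
import re

# Hours are optional (two or more digits when present); minutes and seconds
# are exactly two digits in the range 00-59; then a full stop and exactly
# three digits for the thousandths of a second.
STRICT_TIMESTAMP = re.compile(r'(?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3}')


def is_spec_compliant(timestamp):
    """Return True if the whole string is a well-formed WebVTT timestamp."""
    return STRICT_TIMESTAMP.fullmatch(timestamp) is not None
```

With this check, 00:01:05.002 and 01:05.002 pass, while 0:1:5.2 fails.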

apetresc · 2023-01-03T03:56:46Z

I get that, but pragmatically speaking, it's probably best for tools to be as permissive as they reasonably can, especially for spec violations that are widely common in the wild. Users of webvtt-py likely can't choose to just consume transcripts from some other source, but they can choose to just use some other VTT parser.

jrowen · 2023-01-19T18:53:16Z

I've found this works as a temporary fix for Teams time formats.

import io
import re

from webvtt import WebVTT, structures
from webvtt.parsers import WebVTTParser

# Relax both patterns to accept 1-2 digit minutes/seconds and a 1-3 digit fraction.
# Raw strings keep the \d escapes intact; [.,] matches only a literal separator.
structures.TIMESTAMP_PATTERN = re.compile(r'(\d+)?:?(\d{1,2}):(\d{1,2})[.,](\d{1,3})')
WebVTTParser.TIMEFRAME_LINE_PATTERN = re.compile(
    r'\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})\s*-->\s*((?:\d+:)?\d{1,2}:\d{1,2}[.,]\d{1,3})'
)

# tcontent is the text of the .vtt file, already loaded as a string
for caption in WebVTT.read_buffer(io.StringIO(tcontent)):
    print(caption.start)
    print(caption.end)
    print(caption.text)

apetresc · 2023-01-20T22:45:19Z

Ah yeah, clever - you can just monkey-patch those variables directly in the module.

... still would be nicer not to have to do that, though 😅

apetresc linked a pull request Nov 7, 2022 that will close this issue

Add support for WebVTT timeframes in MS Teams' non-compliant format #45

Open
