Very slow JSON serialization and deserialization and blocking event loop #489

Open
Luksalos opened this issue Nov 25, 2024 · 1 comment

@Luksalos

What is the current behavior?

PrerecordedResponse.from_json(result) is very slow, especially for larger inputs. The cause is the dataclasses-json library, whose maintainers have been aware of the performance issue since 2020 but haven't addressed it. Besides .from_json(), the .to_dict() operation is also very slow; that is the call one would use to parse the output of the Deepgram SDK into their own Pydantic model.

In our case, for recordings lasting around 1 hour:

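from deepgram import PrerecordedOptions

# `deepgram` is an already initialized client; `signed_url` points to the recording.
source = {"url": signed_url}
options = PrerecordedOptions(
    model="nova-2-general",
    diarize=True,
    utterances=True,
    paragraphs=True,
)
response = deepgram.listen.rest.v("1").transcribe_url(source, options=options)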

For these recordings, .from_json() takes over 10 seconds, whereas Pydantic parsing takes ~30 ms.
For a 7-minute recording, .from_json() took ~1.7 seconds, while Pydantic parsing took ~5 ms.
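
Timings like the ones above can be reproduced with a small harness. The following is a minimal sketch, not the measurement code used for this issue: `time_parse` is a hypothetical helper, the payload is synthetic with made-up field names, and `json.loads` stands in for the parser under test.

```python
import json
import time
from typing import Any, Callable, Tuple


def time_parse(parse: Callable[[str], Any], payload: str, repeats: int = 3) -> Tuple[Any, float]:
    """Call parse(payload) `repeats` times; return the last result and the best time in seconds."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = parse(payload)
        best = min(best, time.perf_counter() - start)
    return result, best


# Synthetic stand-in for a transcription response (field names are assumptions).
payload = json.dumps(
    {"words": [{"word": f"w{i}", "start": i * 0.5, "end": i * 0.5 + 0.4} for i in range(10_000)]}
)

parsed, seconds = time_parse(json.loads, payload)
print(f"json.loads: {seconds * 1000:.1f} ms for {len(parsed['words'])} words")
```

To compare parsers, pass `PrerecordedResponse.from_json` or a Pydantic model's JSON validation method as `parse` and feed it the raw response body instead of the synthetic payload.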

This issue also affects the asynchronous version, where the problem is even more significant because the parsing runs on the event loop and blocks it for a long time.
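
One possible mitigation on the caller's side is to move the synchronous parsing step off the event loop. The sketch below assumes the raw JSON string is available to the caller; `parse_response` and `parse_off_loop` are hypothetical names, and `json.loads` stands in for the slow parser. It is not how the SDK currently behaves.

```python
import asyncio
import json
from typing import Any


def parse_response(raw: str) -> Any:
    # Stand-in for the slow, synchronous parsing step (e.g. a from_json call).
    return json.loads(raw)


async def parse_off_loop(raw: str) -> Any:
    # Run the blocking parser in a worker thread so the event loop can keep
    # scheduling other coroutines while the parse is in progress.
    return await asyncio.to_thread(parse_response, raw)


async def main() -> Any:
    raw = json.dumps({"results": {"utterances": [{"id": i} for i in range(1000)]}})
    return await parse_off_loop(raw)


result = asyncio.run(main())
print(len(result["results"]["utterances"]))
```

A worker thread keeps the loop responsive, but pure-Python parsing still competes for the GIL, so it does not make the parse itself faster. For heavier workloads, `loop.run_in_executor` with a `concurrent.futures.ProcessPoolExecutor` is an alternative worth measuring.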

Expected behavior

JSON serialization and deserialization shouldn't take that long, and CPU-heavy operations should definitely not block the event loop. Please consider using Pydantic or raw dataclasses.
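
To illustrate the "raw dataclasses" option, here is a minimal sketch that parses JSON with the standard library and builds plain dataclasses through hand-written constructors, with no per-call type introspection. The `Word` and `Transcript` classes and their fields are illustrative assumptions, not the SDK's actual response types.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Word:
    word: str
    start: float
    end: float

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "Word":
        # Explicit field mapping: no reflection over type hints at parse time.
        return cls(word=data["word"], start=data["start"], end=data["end"])


@dataclass
class Transcript:
    words: List[Word]

    @classmethod
    def from_json(cls, raw: str) -> "Transcript":
        data = json.loads(raw)
        return cls(words=[Word.from_dict(w) for w in data["words"]])


raw = '{"words": [{"word": "hello", "start": 0.0, "end": 0.4}]}'
transcript = Transcript.from_json(raw)
print(transcript.words[0].word)
```

The trade-off is that each class needs its own mapping code, which is more to maintain than a decorator-based approach.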

@Luksalos changed the title from "Blocking" to "Very slow JSON serialization and deserialization and blocking event loop" on Nov 25, 2024
@jjmaldonis
Contributor

Adding __slots__ to the dataclasses may help -- this is worth a quick try. I have not tested, and I don't know if dataclasses actually support __slots__, but adding the class variable can result in dramatic speed improvements.

Overall, my opinion is that dataclasses begin to break down once their usage extends past their immediate value proposition, at which point a different implementation tends to work better. Pydantic tends to be used for input validation, which isn't a critically important feature within this SDK because responses do not need to be validated. That said, I'm a big fan of pydantic in general, and choosing a different class implementation may give us the speed and flexibility wins we're looking for. However, moving away from dataclasses would be a major breaking change.
