Serializations for mixed content documents #94

Closed
funkyfuture wants to merge 17 commits

@funkyfuture funkyfuture commented Sep 18, 2024

so, i'm mostly done with what took its departure in #54. given how long this sat on my desk and in my drawers, and the several moments when i was under the impression that what i sought wasn't sanely doable, i'm very happy to finally be at this point.

a review of these changes is imo sufficiently done by studying and criticising:

my guess is that the latter one is functioning, as each code iteration surfaced errors from the 360k-something documents (total volume ~4.1GB) that i then fixed. to be explicit: all these documents were parsed, one unaltered and two whitespace-altering variants were produced, each of these was reparsed (where the latter two received whitespace normalization as per the TEI recommendation) and finally compared successfully against the originating documents.
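to illustrate the shape of such a round-trip check, here is a minimal sketch. it is not delb's API and not the test code from this branch: it uses the standard library's `xml.etree.ElementTree` as a stand-in, produces only one whitespace-altering variant (indentation via `ET.indent`, Python 3.9+), and the helpers `normalized`, `same_tree` and `roundtrip_ok` are hypothetical names. the normalization is a crude placeholder, not the rules of the TEI recommendation.

```python
import xml.etree.ElementTree as ET


def normalized(text):
    # crude placeholder normalization (an assumption, not the TEI rules):
    # collapse whitespace runs and drop leading/trailing whitespace
    return " ".join((text or "").split())


def same_tree(a, b, normalize=False):
    # compare tags, attributes, text and tails recursively
    norm = normalized if normalize else (lambda t: t or "")
    if a.tag != b.tag or a.attrib != b.attrib or len(a) != len(b):
        return False
    if norm(a.text) != norm(b.text) or norm(a.tail) != norm(b.tail):
        return False
    return all(same_tree(x, y, normalize) for x, y in zip(a, b))


def roundtrip_ok(source):
    original = ET.fromstring(source)

    # unaltered variant: serialize, reparse, compare without normalization
    plain = ET.tostring(original, encoding="unicode")
    if not same_tree(original, ET.fromstring(plain)):
        return False

    # whitespace-altering variant: indent a fresh copy, serialize, reparse,
    # then compare with whitespace normalization applied to both sides
    altered = ET.fromstring(source)
    ET.indent(altered)
    indented = ET.tostring(altered, encoding="unicode")
    return same_tree(original, ET.fromstring(indented), normalize=True)
```

calling `roundtrip_ok('<p>Hello <hi rend="b">mixed</hi> content<lb/>here</p>')` runs both variants against a small mixed content sample. a real check would need a normalization that preserves significant spaces between inline elements, which this placeholder deliberately ignores.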

(just two unimportant insights from the process: if one had tried to achieve that based on lxml's data model, they'd certainly have gone nuts; and the := operator can be a super powerful tool for concise expressions. what was all the fuss about?)

anyway, don't look too much at the implementation. its architecture is fundamentally wrong (we really need an event-based writer and some state-machine-ish connectors) and inefficient.

but the current structure allows targeted debugging, which is what i did at length. i would consider this kind of a breakthrough (showing what is possible) and the establishment of a distinguishing feature among libraries that operate on the basic level. in that regard, you can pitch me other suitable libraries (regardless of their language) to include in the comparison.

hence i'd say the implementation is good enough to move on.

i promise not to force-push to this branch. but i may consolidate and merge it locally at the end.

please contact me directly if you'd like an in-person discussion.

@funkyfuture funkyfuture added enhancement New feature or request design Proposals and discussion of API changes labels Sep 18, 2024
@funkyfuture funkyfuture added this to the 0.5 milestone Sep 18, 2024
_delb/nodes.py Outdated
@funkyfuture funkyfuture marked this pull request as ready for review September 22, 2024 18:29
JKatzwinkel previously approved these changes Sep 24, 2024
@funkyfuture
Contributor Author

thanks! i'll resolve the conflicts locally and push the result to the main branch.

@funkyfuture funkyfuture closed this Nov 2, 2024
@funkyfuture funkyfuture deleted the serialization branch January 1, 2025 17:54
2 participants