Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
0 parents
commit 5284177
Showing 17 changed files with 1,664 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
coverage: | ||
range: 80..90 | ||
round: up | ||
precision: 4 | ||
ignore: | ||
- "tests/*" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
version: 2 | ||
updates: | ||
- package-ecosystem: github-actions | ||
directory: / | ||
schedule: | ||
interval: weekly |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
name: Coverage | ||
on: | ||
push: | ||
branches: [ master ] | ||
pull_request: | ||
branches: [ master ] | ||
jobs: | ||
ubuntu: | ||
runs-on: ubuntu-22.04 | ||
timeout-minutes: 10 | ||
strategy: | ||
matrix: | ||
python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12' ] | ||
steps: | ||
- name: Clone | ||
uses: actions/checkout@v4 | ||
- name: Python | ||
uses: actions/setup-python@v5 | ||
with: | ||
python-version: ${{ matrix.python-version }} | ||
- name: Build | ||
run: pip install --verbose .[test] | ||
- name: Test | ||
run: pytest --cov=msglc tests/ | ||
- name: Upload | ||
uses: codecov/codecov-action@v4 | ||
if: matrix.python-version == '3.10' | ||
with: | ||
token: ${{ secrets.CODECOV_TOKEN }} | ||
slug: TLCFEM/msglc | ||
plugin: pycoverage |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
name: Wheels | ||
on: | ||
push: | ||
branches: [ master ] | ||
pull_request: | ||
branches: [ master ] | ||
jobs: | ||
build_sdist: | ||
name: Build | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v4 | ||
- name: Build | ||
run: pipx run build --sdist | ||
- name: Check | ||
run: pipx run twine check dist/* | ||
- uses: actions/upload-artifact@v4 | ||
with: | ||
name: msglc-sdist | ||
path: dist/*.tar.gz | ||
upload_all: | ||
name: Upload | ||
needs: [ build_sdist ] | ||
runs-on: ubuntu-latest | ||
if: contains(github.event.head_commit.message, '[publish]') | ||
steps: | ||
- uses: actions/download-artifact@v4 | ||
with: | ||
pattern: msglc* | ||
path: dist | ||
- uses: pypa/gh-action-pypi-publish@release/v1 | ||
with: | ||
user: __token__ | ||
password: ${{ secrets.PYPI }} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
.venv | ||
.idea | ||
*.pyc | ||
**/*.egg-info |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,163 @@ | ||
# msglc --- (de)serialize json objects with lazy/partial loading containers using msgpack | ||
|
||
[![codecov](https://codecov.io/gh/TLCFEM/msglc/graph/badge.svg?token=JDPARZSVDR)](https://codecov.io/gh/TLCFEM/msglc) | ||
|
||
## Quick Start | ||
|
||
Use `dump` to serialize a json object to a file. | ||
|
||
```python | ||
from msglc import dump | ||
|
||
data = {"a": [1, 2, 3], "b": {"c": 4, "d": 5, "e": [0x221548313] * 10000}} | ||
dump("data.msg", data) | ||
``` | ||
|
||
Use `Reader` to read a file. | ||
|
||
```python | ||
from msglc import Reader, to_obj | ||
|
||
with Reader("data.msg") as reader: | ||
data = reader.read("b/c") | ||
print(data) # 4 | ||
b_dict = reader.read("b") | ||
print(b_dict.__class__) # <class 'msglc.reader.Dict'> | ||
for k, v in b_dict.items(): | ||
if k != "e": | ||
print(k, v) # c 4, d 5 | ||
b_json = to_obj(b_dict) # ensure plain dict | ||
``` | ||
|
||
Note that all data operations must be performed inside the `with` block. | ||
|
||
## What | ||
|
||
`msglc` is a Python library that provides a way to serialize and deserialize json objects with lazy/partial loading | ||
containers using `msgpack` as the serialization format. | ||
|
||
## Why | ||
|
||
The `msgpack` specification and the corresponding Python library `msgpack` provide a tool to serialize json objects into | ||
binary data. | ||
However, the encoded data has to be fully decoded to reveal what is inside. | ||
This becomes an issue when the data is large and only a small part of it is needed. | ||
|
||
`msglc` provides an enhanced format to embed structure information into the encoded data. | ||
This allows lazy and partial decoding of the data of interest, which can be a significant performance improvement. | ||
|
||
## How | ||
|
||
### Overview | ||
|
||
`msglc` packs tables of contents and data into a single binary blob, laid out as follows. | ||
|
||
```text | ||
##################################################################### | ||
# magic bytes # 20 bytes # encoded data # encoded table of contents # | ||
##################################################################### | ||
``` | ||
|
||
1. The magic bytes are used to identify the format of the file. | ||
2. The 20 bytes are used to store the start position and the length of the encoded table of contents. | ||
3. The encoded data is the original msgpack encoded data. | ||
|
||
The table of contents is placed at the end of the file to allow direct writing of the encoded data to the file. | ||
This makes the memory footprint small. | ||
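The layout described above can be sketched with the standard library alone. This is a minimal illustration, not the actual `msglc` writer: the `MAGIC` value, the split of the 20 bytes into two 10-byte big-endian integers, the helper names `write_blob` and `read_toc`, and the use of `json` as a stand-in for msgpack encoding are all assumptions made for the example.

```python
import io
import json

MAGIC = b"demo-magic"  # hypothetical marker, not the real msglc magic bytes
HEADER_SIZE = 20  # assumed split: 10 bytes for the TOC start, 10 for its length


def write_blob(stream, payload: bytes, toc: dict) -> None:
    """Write magic bytes, a placeholder header, the payload, then the TOC."""
    stream.write(MAGIC)
    header_at = stream.tell()
    stream.write(b"\0" * HEADER_SIZE)  # reserve the header, fill it in last
    stream.write(payload)  # data goes straight to the stream
    toc_start = stream.tell()
    toc_bytes = json.dumps(toc).encode()  # stand-in for msgpack encoding
    stream.write(toc_bytes)
    stream.seek(header_at)  # go back and record where the TOC lives
    stream.write(toc_start.to_bytes(10, "big") + len(toc_bytes).to_bytes(10, "big"))


def read_toc(stream) -> dict:
    """Locate and decode the TOC without touching the payload."""
    stream.seek(0)
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("unrecognised file format")
    header = stream.read(HEADER_SIZE)
    toc_start = int.from_bytes(header[:10], "big")
    toc_size = int.from_bytes(header[10:], "big")
    stream.seek(toc_start)
    return json.loads(stream.read(toc_size))


buffer = io.BytesIO()
write_blob(buffer, b"payload-bytes", {"p": [0, 13]})
recovered = read_toc(buffer)
```

Because the header is reserved up front and patched at the end, the payload never has to be held in memory while the table of contents is being built.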
|
||
### Buffering | ||
|
||
One can configure the buffer size for reading and writing. | ||
|
||
```python | ||
from msglc import configure | ||
|
||
configure(write_buffer_size=2 ** 23) | ||
configure(read_buffer_size=2 ** 16) | ||
``` | ||
|
||
### Table of Contents | ||
|
||
There are two types of containers in json objects: array and object. | ||
They correspond to `list` and `dict` in Python, respectively. | ||
|
||
The table of contents mimics the structure of the original json object. | ||
However, only containers that exceed a certain size are included in the table of contents. | ||
This size is configurable and can often be set to the block size of the storage system. | ||
|
||
```python | ||
from msglc import configure | ||
|
||
configure(small_obj_optimization_threshold=8192) | ||
``` | ||
|
||
The basic structure of the table of contents of any object is a `dict` with two keys: `t` (toc) and `p` (position). | ||
The `t` field only exists when the object is a **sufficiently large container**. | ||
|
||
If all the elements in the container are small, the `t` field will also be omitted. | ||
|
||
For the purpose of demonstration, the size threshold is set to 2 bytes in the following examples. | ||
|
||
```python | ||
# an integer is not a container | ||
data = 2154848 | ||
toc = {"p": [0, 5]} | ||
|
||
# a string is not a container | ||
data = "a string" | ||
toc = {"p": [5, 14]} | ||
|
||
# the inner lists contain small elements, so the `t` field is omitted | ||
# the outer list is larger than 2 bytes, so the `t` field is included | ||
data = [[1, 1], [2, 2, 2, 2, 2]] | ||
toc = {"t": [{"p": [15, 18]}, {"p": [18, 24]}], "p": [14, 24]} | ||
|
||
# the outer dict is larger than 2 bytes, so the `t` field is included | ||
# the `b` field is not a container | ||
# the `aa` field is a container, but all its elements are small, so the `t` field is omitted | ||
data = {'a': {'aa': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]}, 'b': 2} | ||
toc = {"t": {"a": {"t": {"aa": {"p": [31, 42]}}, "p": [27, 42]}, "b": {"p": [44, 45]}}, "p": [24, 45]} | ||
``` | ||
|
||
Due to the presence of the size threshold, the table of contents only requires a small amount of extra space. | ||
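To make the role of `t` and `p` concrete, the following sketch walks a table of contents along a slash-separated path and returns the byte range of the deepest node it can address. The helper name `locate` is hypothetical and is not part of `msglc`. Reading `p` as a `[start, end)` pair is an inference from the example positions above. The two tables are copied from the examples.

```python
def locate(toc: dict, path: str):
    """Return the byte range of the deepest addressable node and the unresolved path parts."""
    node = toc
    remaining = [part for part in path.split("/") if part]
    # descend only while the current node carries a nested table of contents
    while remaining and "t" in node:
        children = node["t"]
        key = remaining[0]
        # list containers are indexed by position, dict containers by key
        node = children[int(key)] if isinstance(children, list) else children[key]
        remaining = remaining[1:]
    return tuple(node["p"]), remaining


dict_toc = {"t": {"a": {"t": {"aa": {"p": [31, 42]}}, "p": [27, 42]}, "b": {"p": [44, 45]}}, "p": [24, 45]}
list_toc = {"t": [{"p": [15, 18]}, {"p": [18, 24]}], "p": [14, 24]}

aa_range, aa_left = locate(dict_toc, "a/aa")  # fully resolved by the table
deep_range, deep_left = locate(dict_toc, "a/aa/3")  # `aa` has no `t`, so "3" is left over
inner_range, inner_left = locate(list_toc, "1/0")  # inner list has no `t` either
```

Any path parts left over have to be resolved by decoding the returned byte range, since small containers are stored without their own table.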
|
||
### Reading | ||
|
||
The table of contents is read first. The actual data is represented by `Dict` and `List` classes, which have similar | ||
interfaces to the original `dict` and `list` classes in Python. | ||
|
||
As long as the table of contents contains the `t` field, no actual data is read. | ||
Each piece of data is read only when it is accessed, and it is cached for future use. | ||
Thus, the data is read lazily and will only be read once (unless fast loading is enabled). | ||
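The read-on-access behaviour can be illustrated with a small stand-in class. This is a sketch under stated assumptions and not the real `msglc.reader.Dict`: the class name `LazyDict`, the `reads` counter, and the use of `json` in place of msgpack are all made up for the example.

```python
import io
import json


class LazyDict:
    """Hypothetical lazily loaded mapping: values are decoded on first access only."""

    def __init__(self, stream, toc: dict):
        self._stream = stream
        self._toc = toc  # maps each key to {"p": [start, end]}
        self._cache = {}
        self.reads = 0  # counts how often the stream is actually read

    def __getitem__(self, key):
        if key not in self._cache:
            start, end = self._toc[key]["p"]
            self._stream.seek(start)
            # json stands in for msgpack decoding here
            self._cache[key] = json.loads(self._stream.read(end - start))
            self.reads += 1
        return self._cache[key]


# build a tiny blob and its table of contents
parts = {"c": 4, "e": [1, 2, 3]}
blob = bytearray()
toc = {}
for key, value in parts.items():
    encoded = json.dumps(value).encode()
    toc[key] = {"p": [len(blob), len(blob) + len(encoded)]}
    blob += encoded

lazy = LazyDict(io.BytesIO(bytes(blob)), toc)
first = lazy["e"]  # triggers one read
second = lazy["e"]  # served from the cache
```

Nothing is read when `LazyDict` is constructed, and the second access to `"e"` does not touch the stream.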
|
||
### Fast Loading | ||
|
||
There are two ways to read a container into memory: | ||
|
||
1. Read the entire container into memory. | ||
2. Read each element of the container into memory one by one. | ||
|
||
The first way only requires one system call, but data may be repeatedly read if some of its children have been read | ||
before. | ||
The second way requires multiple system calls, but it ensures that each piece of data is read only once. | ||
Depending on various factors, one may be faster than the other. | ||
|
||
Fast loading is a feature that allows the entire data to be read into memory at once. | ||
This helps to avoid issuing multiple system calls to read the data, which can be slow if the latency is high. | ||
|
||
```python | ||
from msglc import configure | ||
|
||
configure(fast_loading=True) | ||
``` | ||
|
||
One should also configure the threshold for fast loading. | ||
|
||
```python | ||
from msglc import configure | ||
|
||
configure(fast_loading_threshold=0.5) | ||
``` | ||
|
||
The threshold is a fraction between 0 and 1. A value of 0.5 means that if more than half of the children of a container | ||
have already been read, `to_obj` will use the second way to read the whole container. Otherwise, it will use the first way. |
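The threshold rule can be written as a small predicate. This is only an illustration of the rule as described: the function name `choose_strategy`, its return labels, and the strict comparison (taken from the phrase "more than half") are assumptions, not the library's internals.

```python
def choose_strategy(children_read: int, children_total: int, threshold: float = 0.5) -> str:
    """Pick how to load a container given how many of its children are already cached."""
    if children_total == 0:
        return "whole"  # nothing cached, a single read is enough
    fraction_read = children_read / children_total
    # many children already cached: read the rest one by one to avoid re-reading
    # few children cached: one read of the whole container is cheaper
    return "per-element" if fraction_read > threshold else "whole"


mostly_cached = choose_strategy(6, 10)
exactly_half = choose_strategy(5, 10)
```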
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
[build-system] | ||
requires = ["setuptools"] | ||
build-backend = "setuptools.build_meta" | ||
|
||
[tool.cibuildwheel] | ||
archs = ["auto64"] | ||
|
||
[project] | ||
dynamic = ["version"] | ||
name = "msglc" | ||
description = "msgpack with lazy/partial loading containers" | ||
readme = "README.md" | ||
requires-python = ">=3.8" | ||
license = { file = "LICENSE" } | ||
keywords = ["msgpack", "serialization", "lazy loading"] | ||
authors = [{ name = "Theodore Chang", email = "[email protected]" }] | ||
maintainers = [{ name = "Theodore Chang", email = "[email protected]" }] | ||
classifiers = [ | ||
"Development Status :: 5 - Production/Stable", | ||
"Intended Audience :: Developers", | ||
"Topic :: Software Development :: Build Tools", | ||
"License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", | ||
"Programming Language :: Python :: 3", | ||
"Programming Language :: Python :: 3.8", | ||
"Programming Language :: Python :: 3.9", | ||
"Programming Language :: Python :: 3.10", | ||
"Programming Language :: Python :: 3.11", | ||
"Programming Language :: Python :: 3.12", | ||
"Programming Language :: Python :: 3 :: Only", | ||
] | ||
dependencies = [ | ||
"msgpack>=1", | ||
] | ||
|
||
[project.optional-dependencies] | ||
test = [ | ||
"pytest-cov", | ||
"black", | ||
] | ||
|
||
[project.urls] | ||
"Homepage" = "https://github.com/TLCFEM/msglc" | ||
"Bug Reports" = "https://github.com/TLCFEM/msglc/issues" | ||
"Source" = "https://github.com/TLCFEM/msglc" | ||
|
||
[tool.black] | ||
line-length = 120 | ||
fast = true |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
# | ||
# This file is autogenerated by pip-compile with Python 3.11 | ||
# by the following command: | ||
# | ||
# pip-compile pyproject.toml | ||
# | ||
msgpack==1.0.8 | ||
# via msglc (pyproject.toml) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Copyright (C) 2024 Theodore Chang | ||
# | ||
# This program is free software: you can redistribute it and/or modify | ||
# it under the terms of the GNU General Public License as published by | ||
# the Free Software Foundation, either version 3 of the License, or | ||
# (at your option) any later version. | ||
# | ||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU General Public License for more details. | ||
# | ||
# You should have received a copy of the GNU General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
from datetime import datetime | ||
|
||
from setuptools import setup | ||
|
||
setup( | ||
version=datetime.now().strftime("%y%m%d"), | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# Copyright (C) 2024 Theodore Chang | ||
# | ||
# This program is free software: you can redistribute it and/or modify | ||
# it under the terms of the GNU General Public License as published by | ||
# the Free Software Foundation, either version 3 of the License, or | ||
# (at your option) any later version. | ||
# | ||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU General Public License for more details. | ||
# | ||
# You should have received a copy of the GNU General Public License | ||
# along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
|
||
from .config import configure | ||
from .reader import Reader, to_obj | ||
from .writer import Writer | ||
|
||
|
||
def dump(file: str, obj, **kwargs): | ||
with Writer(file, **kwargs) as msglc_writer: | ||
msglc_writer.write(obj) |