Commit

TLCFEM · Mar 6, 2024 · commit 5284177
Showing 17 changed files with 1,664 additions and 0 deletions.
diff --git a/.github/.codecov.yml b/.github/.codecov.yml
@@ -0,0 +1,6 @@
+coverage:
+  range: 80..90
+  round: up
+  precision: 4
+ignore:
+  - "tests/*"
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
@@ -0,0 +1,6 @@
+version: 2
+updates:
+  - package-ecosystem: github-actions
+    directory: /
+    schedule:
+      interval: weekly
diff --git a/.github/workflows/coverage.yml b/.github/workflows/coverage.yml
@@ -0,0 +1,31 @@
+name: Coverage
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: [ master ]
+jobs:
+  ubuntu:
+    runs-on: ubuntu-22.04
+    timeout-minutes: 10
+    strategy:
+      matrix:
+        python-version: [ '3.8', '3.9', '3.10', '3.11', '3.12' ]
+    steps:
+      - name: Clone
+        uses: actions/checkout@v4
+      - name: Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Build
+        run: pip install --verbose .[test]
+      - name: Test
+        run: pytest --cov=msglc tests/
+      - name: Upload
+        uses: codecov/codecov-action@v4
+        if: matrix.python-version == '3.10'
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          slug: TLCFEM/msglc
+          plugin: pycoverage
diff --git a/.github/workflows/wheels.yml b/.github/workflows/wheels.yml
@@ -0,0 +1,35 @@
+name: Wheels
+on:
+  push:
+    branches: [ master ]
+  pull_request:
+    branches: [ master ]
+jobs:
+  build_sdist:
+    name: Build
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Build
+        run: pipx run build --sdist
+      - name: Check
+        run: pipx run twine check dist/*
+      - uses: actions/upload-artifact@v4
+        with:
+          name: msglc-sdist
+          path: dist/*.tar.gz
+  upload_all:
+    name: Upload
+    needs: [ build_sdist ]
+    runs-on: ubuntu-latest
+    if: contains(github.event.head_commit.message, '[publish]')
+    steps:
+      - uses: actions/download-artifact@v4
+        with:
+          pattern: msglc*
+          path: dist
+      - uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          user: __token__
+          password: ${{ secrets.PYPI }}
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,4 @@
+.venv
+.idea
+*.pyc
+**/*.egg-info
diff --git a/LICENSE b/LICENSE
diff --git a/README.md b/README.md
@@ -0,0 +1,163 @@
+# msglc --- (de)serialize json objects with lazy/partial loading containers using msgpack
+
+[![codecov](https://codecov.io/gh/TLCFEM/msglc/graph/badge.svg?token=JDPARZSVDR)](https://codecov.io/gh/TLCFEM/msglc)
+
+## Quick Start
+
+Use `dump` to serialize a json object to a file.
+
+```python
+from msglc import dump
+
+data = {"a": [1, 2, 3], "b": {"c": 4, "d": 5, "e": [0x221548313] * 10000}}
+dump("data.msg", data)
+```
+
+Use `Reader` to read a file.
+
+```python
+from msglc import Reader, to_obj
+
+with Reader("data.msg") as reader:
+    data = reader.read("b/c")
+    print(data)  # 4
+    b_dict = reader.read("b")
+    print(b_dict.__class__)  # <class 'msglc.reader.Dict'>
+    for k, v in b_dict.items():
+        if k != "e":
+            print(k, v)  # c 4, d 5
+    b_json = to_obj(b_dict)  # ensure plain dict
+```
+
+Note that all data operations must be performed inside the `with` block.
+
+## What
+
+`msglc` is a Python library that provides a way to serialize and deserialize json objects with lazy/partial loading
+containers using `msgpack` as the serialization format.
+
+## Why
+
+The `msgpack` specification and the corresponding Python library `msgpack` provide a tool to serialize json objects into
+binary data.
+However, the encoded data has to be fully decoded to reveal what is inside.
+This becomes an issue when the data is large and only a small part of it is needed.
+
+`msglc` provides an enhanced format to embed structure information into the encoded data.
+This allows lazy and partial decoding of the data of interest, which can be a significant performance improvement.
+
+## How
+
+### Overview
+
+`msglc` packs tables of contents and data into a single binary blob. The detailed layout can be shown as follows.
+
+```text
+#####################################################################
+# magic bytes # 20 bytes # encoded data # encoded table of contents #
+#####################################################################
+```
+
+1. The magic bytes are used to identify the format of the file.
+2. The 20 bytes are used to store the start position and the length of the encoded table of contents.
+3. The encoded data is the original msgpack encoded data.
+
+The table of contents is placed at the end of the file to allow direct writing of the encoded data to the file.
+This makes the memory footprint small.
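Note on the layout above: the following is a minimal sketch of how such a blob could be written and how the table of contents could be recovered without touching the data. It uses only the standard library. The magic value `b"demo"`, the split of the 20 bytes into two 10-byte big-endian integers, the helper names `write_blob` and `read_toc`, and the use of `json` as a stand-in for msgpack are all assumptions for illustration, not msglc's actual encoding.

```python
import io
import json

MAGIC = b"demo"  # hypothetical magic bytes, not the real msglc value
FIELD = 10  # assumption: the 20 bytes hold two 10-byte big-endian integers


def write_blob(stream, data, toc):
    """Write magic, a 20-byte placeholder, the data, then the table of contents."""
    stream.write(MAGIC)
    header_at = stream.tell()
    stream.write(b"\0" * (2 * FIELD))  # reserve space; filled in at the end
    stream.write(json.dumps(data).encode())  # stand-in for msgpack-encoded data
    toc_start = stream.tell()
    toc_bytes = json.dumps(toc).encode()
    stream.write(toc_bytes)
    # go back and record where the table of contents lives
    stream.seek(header_at)
    stream.write(toc_start.to_bytes(FIELD, "big") + len(toc_bytes).to_bytes(FIELD, "big"))


def read_toc(stream):
    """Read only the header and the table of contents, skipping the data."""
    stream.seek(0)
    if stream.read(len(MAGIC)) != MAGIC:
        raise ValueError("unrecognized format")
    header = stream.read(2 * FIELD)
    toc_start = int.from_bytes(header[:FIELD], "big")
    toc_size = int.from_bytes(header[FIELD:], "big")
    stream.seek(toc_start)
    return json.loads(stream.read(toc_size))


buffer = io.BytesIO()
write_blob(buffer, {"a": [1, 2, 3]}, {"p": [0, 16]})
recovered = read_toc(buffer)
```

Because the data is streamed out before the table of contents is known, only the fixed-size header needs to be patched afterwards, which is what keeps the writer's memory footprint small.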
+
+### Buffering
+
+One can configure the buffer size for reading and writing.
+
+```python
+from msglc import configure
+
+configure(write_buffer_size=2 ** 23)
+configure(read_buffer_size=2 ** 16)
+```
+
+### Table of Contents
+
+There are two types of containers in json objects: array and object.
+They correspond to `list` and `dict` in Python, respectively.
+
+The table of contents mimics the structure of the original json object.
+However, only containers that exceed a certain size are included in the table of contents.
+This size is configurable and can often be set to the block size of the storage system.
+
+```python
+from msglc import configure
+
+configure(small_obj_optimization_threshold=8192)
+```
+
+The basic structure of the table of contents of any object is a `dict` with two keys: `t` (toc) and `p` (position).
+The `t` field only exists when the object is a **sufficiently large container**.
+
+If all the elements in the container are small, the `t` field will also be omitted.
+
+For the purpose of demonstration, the size threshold is set to 2 bytes in the following examples.
+
+```python
+# an integer is not a container
+data = 2154848
+toc = {"p": [0, 5]}
+
+# a string is not a container
+data = "a string"
+toc = {"p": [5, 14]}
+
+# the inner lists contain small elements, so the `t` field is omitted
+# the outer list is larger than 2 bytes, so the `t` field is included
+data = [[1, 1], [2, 2, 2, 2, 2]]
+toc = {"t": [{"p": [15, 18]}, {"p": [18, 24]}], "p": [14, 24]}
+
+# the outer dict is larger than 2 bytes, so the `t` field is included
+# the `b` field is not a container
+# the `aa` field is a container, but all its elements are small, so the `t` field is omitted
+data = {'a': {'aa': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]}, 'b': 2}
+toc = {"t": {"a": {"t": {"aa": {"p": [31, 42]}}, "p": [27, 42]}, "b": {"p": [44, 45]}}, "p": [24, 45]}
+```
+
+Due to the presence of the size threshold, the table of contents only requires a small amount of extra space.
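Note on the examples above: a table of contents of this shape can be walked with a few lines of Python. The sketch below is hypothetical (the helper `locate` is not part of msglc) and reads `p` as a `[start, end)` byte range, which is inferred from the examples: the 9-byte encoding of `"a string"` spans positions 5 to 14.

```python
def locate(toc, path):
    """Follow a slash-separated path through the `t` fields and return the `p` span."""
    node = toc
    for key in filter(None, path.split("/")):
        children = node.get("t")
        if children is None:
            # below the size threshold: no finer-grained positions are recorded
            raise KeyError(f"{key!r} is not indexed; decode the parent span instead")
        # lists are indexed by integer, dicts by key
        node = children[int(key)] if isinstance(children, list) else children[key]
    start, end = node["p"]
    return start, end


# the last table of contents from the examples above
toc = {"t": {"a": {"t": {"aa": {"p": [31, 42]}}, "p": [27, 42]}, "b": {"p": [44, 45]}}, "p": [24, 45]}
span_aa = locate(toc, "a/aa")
span_b = locate(toc, "b")
span_root = locate(toc, "")
```

When a path reaches a node without a `t` field, the whole span of that node has to be decoded to go any deeper, which is the trade-off the size threshold makes.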
+
+### Reading
+
+The table of contents is read first. The actual data is represented by `Dict` and `List` classes, which have similar
+interfaces to the original `dict` and `list` classes in Python.
+
+As long as the table of contents contains the `t` field, no actual data is read.
+Each piece of data is read only when it is accessed, and it is cached for future use.
+Thus, the data is read lazily and will only be read once (unless fast loading is enabled).
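Note on the reading behaviour above: the read-on-access-then-cache pattern can be sketched as follows. `LazySpans` is a hypothetical class written for illustration, not the `List` or `Dict` class from msglc, and it returns raw bytes rather than decoded msgpack values.

```python
import io


class LazySpans:
    """Read byte spans from a stream on first access and cache them afterwards."""

    def __init__(self, stream, spans):
        self._stream = stream
        self._spans = spans  # list of (start, end) pairs, as in the `p` field
        self._cache = {}
        self.reads = 0  # counts actual stream reads, for demonstration only

    def __getitem__(self, index):
        if index not in self._cache:
            start, end = self._spans[index]
            self._stream.seek(start)
            self._cache[index] = self._stream.read(end - start)
            self.reads += 1
        return self._cache[index]


items = LazySpans(io.BytesIO(b"alphabetagamma"), [(0, 5), (5, 9), (9, 14)])
first = items[1]
second = items[1]  # served from the cache, no second read
```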
+
+### Fast Loading
+
+There are two ways to read a container into memory:
+
+1. Read the entire container into memory.
+2. Read each element of the container into memory one by one.
+
+The first way only requires one system call, but data may be repeatedly read if some of its children have been read
+before.
+The second way requires multiple system calls, but it ensures that each piece of data is read only once.
+Depending on various factors, one may be faster than the other.
+
+Fast loading is a feature that allows the entire data to be read into memory at once.
+This helps to avoid issuing multiple system calls to read the data, which can be slow if the latency is high.
+
+```python
+from msglc import configure
+
+configure(fast_loading=True)
+```
+
+One shall also configure the threshold for fast loading.
+
+```python
+from msglc import configure
+
+configure(fast_loading_threshold=0.5)
+```
+
+The threshold is a fraction between 0 and 1. The above 0.5 means if more than half of the children of a container have
+been read already, `to_obj` will use the second way to read the whole container. Otherwise, it will use the first way.
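Note on the threshold above: the decision rule as the README states it can be written as a small function. `choose_strategy` and its return values are hypothetical names for illustration; how msglc counts children internally is not shown in this commit excerpt.

```python
def choose_strategy(children_read, children_total, threshold=0.5):
    """Return which of the two ways described above would be used."""
    if children_total == 0:
        return "whole"
    fraction_read = children_read / children_total
    # more than `threshold` already read: re-reading everything would be wasteful
    return "one_by_one" if fraction_read > threshold else "whole"


mostly_read = choose_strategy(6, 10)
exactly_half = choose_strategy(5, 10)
```

With the default of 0.5, a container with exactly half of its children already read is still loaded in one go, since the README wording is "more than half".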
diff --git a/pyproject.toml b/pyproject.toml
@@ -0,0 +1,48 @@
+[build-system]
+requires = ["setuptools"]
+build-backend = "setuptools.build_meta"
+
+[tool.cibuildwheel]
+archs = ["auto64"]
+
+[project]
+dynamic = ["version"]
+name = "msglc"
+description = "msgpack with lazy/partial loading containers"
+readme = "README.md"
+requires-python = ">=3.8"
+license = { file = "LICENSE" }
+keywords = ["msgpack", "serialization", "lazy loading"]
+authors = [{ name = "Theodore Chang", email = "[email protected]" }]
+maintainers = [{ name = "Theodore Chang", email = "[email protected]" }]
+classifiers = [
+    "Development Status :: 5 - Production/Stable",
+    "Intended Audience :: Developers",
+    "Topic :: Software Development :: Build Tools",
+    "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.8",
+    "Programming Language :: Python :: 3.9",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3 :: Only",
+]
+dependencies = [
+    "msgpack>=1",
+]
+
+[project.optional-dependencies]
+test = [
+    "pytest-cov",
+    "black",
+]
+
+[project.urls]
+"Homepage" = "https://github.com/TLCFEM/msglc"
+"Bug Reports" = "https://github.com/TLCFEM/msglc/issues"
+"Source" = "https://github.com/TLCFEM/msglc"
+
+[tool.black]
+line-length = 120
+fast = true
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,8 @@
+#
+# This file is autogenerated by pip-compile with Python 3.11
+# by the following command:
+#
+#    pip-compile pyproject.toml
+#
+msgpack==1.0.8
+    # via msglc (pyproject.toml)
diff --git a/setup.py b/setup.py
@@ -0,0 +1,22 @@
+#  Copyright (C) 2024 Theodore Chang
+#
+#  This program is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+from datetime import datetime
+
+from setuptools import setup
+
+setup(
+    version=datetime.now().strftime("%y%m%d"),
+)
diff --git a/src/msglc/__init__.py b/src/msglc/__init__.py
@@ -0,0 +1,23 @@
+#  Copyright (C) 2024 Theodore Chang
+#
+#  This program is free software: you can redistribute it and/or modify
+#  it under the terms of the GNU General Public License as published by
+#  the Free Software Foundation, either version 3 of the License, or
+#  (at your option) any later version.
+#
+#  This program is distributed in the hope that it will be useful,
+#  but WITHOUT ANY WARRANTY; without even the implied warranty of
+#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+#  GNU General Public License for more details.
+#
+#  You should have received a copy of the GNU General Public License
+#  along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+from .config import configure
+from .reader import Reader, to_obj
+from .writer import Writer
+
+
+def dump(file: str, obj, **kwargs):
+    with Writer(file, **kwargs) as msglc_writer:
+        msglc_writer.write(obj)