From 509387d6e88c9e22e08ab67e58c12f40a728d953 Mon Sep 17 00:00:00 2001
From: MoooCat <141886018+MooooCat@users.noreply.github.com>
Date: Thu, 29 Feb 2024 17:55:50 +0800
Subject: [PATCH] Update readme.md (#150)

* update 2 readme

* shorten example line

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---
 README.md       | 48 +++++++++++++++++++++++++++++-------------------
 README_ZH_CN.md | 44 +++++++++++++++++++++++++++-----------------
 2 files changed, 56 insertions(+), 36 deletions(-)

diff --git a/README.md b/README.md
index 1aa91c45..68001837 100644
--- a/README.md
+++ b/README.md
@@ -25,53 +25,63 @@
-  View Colab Examples:
+  Colab Examples:
   LLM: Data Synthesis
   |
-  LLM: Off-table Feature Inference
+  LLM: Off-Table Inference
   |
-  CTGAN
+  Billion-Level-Data supported CTGAN
-The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data. It incorporates a wide range of single-table, multi-table data synthesis algorithms and LLM-based synthetic data generation models.
+The Synthetic Data Generator (SDG) is a specialized framework designed to generate high-quality structured tabular data.
 
-Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.
+Synthetic data does not contain any sensitive information, yet it retains the essential characteristics of the original data, making it exempt from privacy regulations such as GDPR and ADPPA.
 
-High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc. Read [**latest API docs**](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details!
+High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc.
 
-## 🔧 Features
+## 💥News
 
-- Technological advancements
-  - Supports a wide range of statistical data synthesis algorithms, LLM-based synthetic data generation model is also integrated;
-  - Optimised for big data scenarios, effectively reducing memory consumption;
-  - Continuously tracking the latest advances in academia and industry, and introducing support for excellent algorithms and models in a timely manner.
-- Privacy enhancements
-  - SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.
-- Easy to extend
-  - Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages
+Our current key achievements and timelines are as follows:
+
+🔥 Feb 20, 2024: a single-table data synthesis model based on LLM is included, view colab example: LLM: Data Synthesis and LLM: Off-table Feature Inference.
+
+🔶 Dec 20, 2023: v0.1.0 released, a CTGAN model that supports billions of data processing capabilities is included, view colab example: Billion-Level-Data supported CTGAN.
+
+🔆 Aug 10, 2023: First line of SDG code committed.
 
-### 🎉 LLM-integrated synthetic data generation
+## 🎉 LLM-integrated synthetic data generation
 
 For a long time, LLM has been used to understand and generate various types of data. In fact, LLM also has certain capabilities in tabular data generation. Also, it has some abilities that cannot be achieved by traditional (based on GAN methods or statistical methods) .
 
 Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new features:
 
-#### Synthetic data generation without Data
+### Synthetic data generation without Data
 
 No training data is required, synthetic data can be generated based on metadata data, view in our colab example.
 
 ![Synthetic data generation without Data](assets/LLM_Case_1.gif)
 
-#### Off-Table feature inference
+### Off-Table feature inference
 
 Infer new column data based on the existing data in the table and the knowledge mastered by LLM, view in our colab example.
 
 ![Off-Table feature inference](assets/LLM_Case_2.gif)
 
-## 🔛 Quick Start
+## 💫 Why SDG ?
+
+- Technological advancements:
+  - Supports a wide range of statistical data synthesis algorithms, LLM-based synthetic data generation model is also integrated;
+  - Optimised for big data scenarios, effectively reducing memory consumption;
+  - Continuously tracking the latest advances in academia and industry, and introducing support for excellent algorithms and models in a timely manner.
+- Privacy enhancements:
+  - SDG supports differential privacy, anonymization and other methods to enhance the security of synthetic data.
+- Easy to extend:
+  - Supports expansion of models, data processing, data connectors, etc. in the form of plug-in packages.
+
+## 🌀 Quick Start
 
 ### Pre-build image
 
diff --git a/README_ZH_CN.md b/README_ZH_CN.md
index 3da4964b..f50c4c7c 100644
--- a/README_ZH_CN.md
+++ b/README_ZH_CN.md
@@ -37,43 +37,53 @@
-合成数据生成器(Synthetic Data Generator,SDG)是一个专注于快速生成高质量的结构化表格数据的数据组件。SDG支持单表和多表数据合成算法,并集成了基于大语言模型(LLM)的合成数据生成模型。
+合成数据生成器(Synthetic Data Generator,SDG)是一个专注于快速生成高质量的结构化表格数据的数据组件。
 
-合成数据(Synthetic Data)是由计算机使用真实数据、元数据和算法生成的合成数据不包含任何敏感信息,但它保留了原始数据的基本特性。合成数据和真实数据之间没有直接的关联,使其免于GDPR和ADPPA等隐私法规的约束,消除实际应用中的隐私泄露风险。
+合成数据(Synthetic Data)不包含任何敏感信息,但它保留了原始数据的基本特性,使其免于GDPR和ADPPA等隐私法规的约束,消除实际应用中的隐私泄露风险。
 
-高质量的合成数据可以安全、多样化地在各种领域中使用,包括数据共享、模型训练和调试、系统开发和测试等应用。阅读 [**最新API文档**](https://synthetic-data-generator.readthedocs.io/en/latest/) 获取更多细节。
+高质量的合成数据可以安全、多样化地在各种领域中使用,包括数据共享、模型训练和调试、系统开发和测试等应用。
 
-## 🔧 主要特性
+## 💥 相关信息
 
-- 无限进步:
-  - 支持多种统计学数据合成算法,支持基于LLM的仿真数据生成方法;
-  - 为大数据场景优化,有效减少内存消耗;
-  - 持续跟踪学术界和工业界的最新进展,及时引入支持优秀算法和模型。
-- 隐私增强:
-  - 提供中文敏感数据自动识别能力,包括姓名、身份证号、人名等17种常见敏感字段;
-  - 支持差分隐私、匿名化等方法,加强合成数据安全性。
-- 易扩展:
-  - 支持以插件包的形式拓展模型、数据处理、数据连接器等功能。
+我们的里程碑和时间节点如下所示:
+
+🔥 2024年2月20日:基于LLM的单表数据合成模型已包含,查看colab示例:LLM:数据合成 和 LLM:表外特征推断。
+
+🔶 2023年12月20日:v0.1.0版本发布,包含支持数十亿数据处理能力的CTGAN模型,查看colab示例:支持数十亿数据的CTGAN。
+
+🔆 2023年8月10日:第一行SDG代码提交。
 
-### 🎉 借助LLM进行合成数据生成
+## 🎉 借助LLM进行合成数据生成
 
 长期以来,LLM一直被用来理解和生成各种类型的数据。 事实上,LLM在表格数据生成方面也有较强的性能。 且LLM还具有一些传统(基于GAN方法或统计方法)无法实现的能力。
 
 我们的 `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` 实现了两个新功能:
 
-#### 无原始记录的数据合成功能
+### 无原始记录的数据合成功能
 
 无需原始训练数据,可以根据元数据生成合成数据,查看 Colab 例子。
 
 ![Synthetic data generation without Data](assets/LLM_Case_1.gif)
 
-#### 表外特征推断功能
+### 表外特征推断功能
 
 根据表中已有的数据以及LLM掌握的知识推断表外特征,即新的列数据,查看 Colab 例子。
 
 ![Off-Table feature inference](assets/LLM_Case_2.gif)
 
-## 🔛 快速开始
+## 💫 Why SDG ?
+
+- 无限进步:
+  - 支持多种统计学数据合成算法,支持基于LLM的仿真数据生成方法;
+  - 为大数据场景优化,有效减少内存消耗;
+  - 持续跟踪学术界和工业界的最新进展,及时引入支持优秀算法和模型。
+- 隐私增强:
+  - 提供中文敏感数据自动识别能力,包括姓名、身份证号、人名等17种常见敏感字段;
+  - 支持差分隐私、匿名化等方法,加强合成数据安全性。
+- 易扩展:
+  - 支持以插件包的形式拓展模型、数据处理、数据连接器等功能。
+
+## 🌀 快速开始
 
 ### 预构建镜像