Commit

Add Colab Examples, Update Readme (#147)
* add ctgan example

* update html a tag

* Update sdgx_example_ctgan.ipynb

* Update README.md

* add sdgx LLM example

* add off table feature inference example

* Update README.md

* add wechat QR code

* update QR code width

* update zh-cn readme

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* Update motivation.rst

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
MooooCat and pre-commit-ci[bot] authored Feb 28, 2024
1 parent 54ee09b commit f82ca57
Showing 10 changed files with 3,813 additions and 123 deletions.
39 changes: 27 additions & 12 deletions README.md
@@ -20,8 +20,18 @@
# 🚀 Synthetic Data Generator

<p style="font-size: small;">Switch Language:
<a href="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/README_ZH_CN.md" target="_blank">简体中文</a>
</p>
<a href="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/README_ZH_CN.md" target="_blank">简体中文</a> &nbsp;| &nbsp;Latest
<a href="https://synthetic-data-generator.readthedocs.io/en/latest/" target="value">API Docs</a>&nbsp;| &nbsp; Join <a href="assets/wechat_QR_code.JPG" target="value">WeChat Group</a>
</p>

<p style="font-size: small;">
View Colab Examples:&nbsp;
<a href="https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing" target="value"> LLM: Data Synthesis</a>
&nbsp;| &nbsp;
<a href="https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing" target="value"> LLM: Off-table Feature Inference</a>
&nbsp;| &nbsp;
<a href="https://colab.research.google.com/drive/1cMB336jN3kb-m_pr1aJjshnNep_6bhsf?usp=sharing" target="value">CTGAN</a>
</p>

</p>
</div>
@@ -30,7 +40,7 @@ The Synthetic Data Generator (SDG) is a specialized framework designed to genera

Synthetic data, generated by machines using real data, metadata, and algorithms, does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, making it exempt from privacy regulations such as GDPR and ADPPA. This eliminates the risk of privacy breaches in practical applications.

High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc. Read [the latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details!
High-quality synthetic data can be safely utilized across various domains including data sharing, model training and debugging, system development and testing, etc. Read the [**latest API docs**](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details!

## 🔧 Features

@@ -51,13 +61,13 @@ Our `sdgx.models.LLM.single_table.gpt.SingleTableGPTModel` implements two new fe

#### Synthetic data generation without Data

No training data is required; synthetic data can be generated from metadata alone.
No training data is required; synthetic data can be generated from metadata alone. See our <a href="https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing" target="value"> Colab example</a>.

![Synthetic data generation without Data](assets/LLM_Case_1.gif)

#### Off-Table feature inference

Infer new column data from the existing data in the table and the knowledge held by the LLM.
Infer new column data from the existing data in the table and the knowledge held by the LLM. See our <a href="https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing" target="value"> Colab example</a>.

![Off-Table feature inference](assets/LLM_Case_2.gif)

@@ -161,13 +171,6 @@ Synthetic data are as follows:
[1000 rows x 15 columns]
```

## 🤝 Join Community

The SDG project was initiated by the **Institute of Data Security, Harbin Institute of Technology**. If you are interested in our project, you are welcome to join our community. We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source:

- Read [CONTRIBUTING](./CONTRIBUTING.md) before drafting a pull request.
- Submit an issue via [View First Good Issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.

## 👩‍🎓 Related Work

- CTGAN:[Modeling Tabular Data using Conditional GAN](https://proceedings.neurips.cc/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html)
@@ -177,6 +180,18 @@ The SDG project was initiated by **Institute of Data Security, Harbin Institute
- CTAB-GAN:[CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)

## 🤝 Join Community

The SDG project was initiated by the **Institute of Data Security, Harbin Institute of Technology**. If you are interested in our project, you are welcome to join our community. We welcome organizations, teams, and individuals who share our commitment to data protection and security through open source:

- Read [CONTRIBUTING](./CONTRIBUTING.md) before drafting a pull request.
- Submit an issue via [View First Good Issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.
- Join our WeChat group via the QR code below.

<div align="left">
<img src="assets/wechat_QR_code.JPG" width="400" >
</div>

## 📄 License

The SDG open source project is released under the Apache-2.0 license; please refer to the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE).
37 changes: 26 additions & 11 deletions README_ZH_CN.md
@@ -21,17 +21,27 @@
# 🚀 Synthetic Data Generator -- Quickly Generate High-Quality Synthetic Data!

<p style="font-size: small;">Switch Language:
<a href="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/README.md" target="_blank">English</a>
<a href="https://github.com/hitsz-ids/synthetic-data-generator/blob/main/README.md" target="_blank">English</a> &nbsp;| &nbsp;Latest
<a href="https://synthetic-data-generator.readthedocs.io/en/latest/" target="value">API Docs</a>&nbsp;| &nbsp; Join <a href="assets/wechat_QR_code.JPG" target="value">WeChat Group</a>
</p>

<p style="font-size: small;">
View Colab Examples:&nbsp;
<a href="https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing" target="value"> Synthesize Data with an LLM</a>
&nbsp;| &nbsp;
<a href="https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing" target="value"> Off-table Feature Inference with an LLM</a>
&nbsp;| &nbsp;
<a href="https://colab.research.google.com/drive/1cMB336jN3kb-m_pr1aJjshnNep_6bhsf?usp=sharing" target="value">CTGAN Model</a>
</p>

</p>
</div>

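The Synthetic Data Generator (SDG) is a data component focused on quickly generating high-quality structured tabular data. SDG supports single-table and multi-table data synthesis algorithms and integrates synthetic data generation models based on large language models (LLMs).

Synthetic data is generated by computers from real data, metadata, and algorithms. It does not contain any sensitive information, yet it retains the essential characteristics of the original data. There is no direct correlation between synthetic data and real data, which exempts it from privacy regulations such as GDPR and ADPPA and eliminates the risk of privacy breaches in practical applications.

High-quality synthetic data can be used safely and in diverse ways across many domains, including data sharing, model training and debugging, and system development and testing. Read the [latest API docs](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.
High-quality synthetic data can be used safely and in diverse ways across many domains, including data sharing, model training and debugging, and system development and testing. Read the [**latest API docs**](https://synthetic-data-generator.readthedocs.io/en/latest/) for more details.

## 🔧 Key Features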

@@ -53,13 +63,13 @@

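#### Data synthesis without original records

No original training data is required; synthetic data can be generated from metadata.
No original training data is required; synthetic data can be generated from metadata. See the <a href="https://colab.research.google.com/drive/1VFnP59q3eoVtMJ1PvcYjmuXtx9N8C7o0?usp=sharing" target="value"> Colab example</a>.

![Synthetic data generation without Data](assets/LLM_Case_1.gif)

#### Off-table feature inference

Infer off-table features, i.e., new column data, from the existing data in the table and the knowledge held by the LLM.
Infer off-table features, i.e., new column data, from the existing data in the table and the knowledge held by the LLM. See the <a href="https://colab.research.google.com/drive/1_chuTVZECpj5fklj-RAp7ZVrew8weLW_?usp=sharing" target="value"> Colab example</a>.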

![Off-Table feature inference](assets/LLM_Case_2.gif)

@@ -163,13 +173,6 @@ print(sampled_data)
[1000 rows x 15 columns]
```

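## 🤝 How to Contribute

The SDG open source project was initiated by the **Institute of Data Security, Harbin Institute of Technology (Shenzhen)**. If you are interested in the SDG project and would like to help improve it, you are welcome to join our open source community:

- You are very welcome to join! [Open an issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.
- For development environment setup, see the [developer documentation](./CONTRIBUTING.md).

## 👩‍🎓 Related Work

### Papers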
@@ -181,6 +184,18 @@ SDG开源项目由**哈尔滨工业大学(深圳)数据安全研究院**发
- CTAB-GAN:[CTAB-GAN: Effective Table Data Synthesizing](https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf)
- OCT-GAN: [OCT-GAN: Neural ODE-based Conditional Tabular GANs](https://arxiv.org/pdf/2105.14969.pdf)

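## 🤝 How to Contribute

The SDG open source project was initiated by the **Institute of Data Security, Harbin Institute of Technology (Shenzhen)**. If you are interested in the SDG project and would like to help improve it, you are welcome to join our open source community:

- You are very welcome to join! [Open an issue](https://github.com/hitsz-ids/synthetic-data-generator/issues/new) or submit a Pull Request.
- For development environment setup, see the [developer documentation](./CONTRIBUTING.md).
- Join the WeChat group: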

<div align="left">
<img src="assets/wechat_QR_code.JPG" width="400" >
</div>

## 📄 License

The SDG open source project uses the Apache-2.0 license; for details, see the [LICENSE](https://github.com/hitsz-ids/synthetic-data-generator/blob/main/LICENSE).
Binary file added assets/wechat_QR_code.JPG
4 changes: 2 additions & 2 deletions docs/source/design/motivation.rst
@@ -32,7 +32,7 @@ SDV: SOTA, and not perfact
In this case, we found `SDV <https://github.com/sdv-dev/SDV>`_,
a Python library designed to be your one-stop shop for creating tabular synthetic data.
`In this research <https://dai.lids.mit.edu/wp-content/uploads/2018/03/SDV.pdf>`_,
they propose techniques for data simulation against associative relationships in relational databases and open source it as SDV.
they propose techniques for data synthesis against associative relationships in relational databases and open source it as SDV.

However, while SDV is a powerful tool for generating synthetic data, it is not without its limitations.
One of the main challenges we encountered during our usage was related to performance.
@@ -42,7 +42,7 @@ This can lead to longer processing times and increased demand on system resource
which might not be feasible for all use cases or environments.

Second, the architecture of SDV presents certain limitations when it comes to extending its capabilities.
While SDV is designed to support a variety of data simulation techniques for relational databases,
While SDV is designed to support a few data synthesis techniques for relational databases,
its architecture makes it difficult to incorporate additional algorithms or support different modalities of data.
This restricts our ability to expand upon its functionality and adapt it to a wider range of use cases.

43 changes: 0 additions & 43 deletions example/1_ctgan_example.py

This file was deleted.

19 changes: 0 additions & 19 deletions example/2_guassian_copula_example.py

This file was deleted.

36 changes: 0 additions & 36 deletions example/3_save_load_synthesizer.py

This file was deleted.
