Format conversion tools for post tuning datasets (#514)

* + add sharegpt <--> dj format conversion tools
* - move multimodal into fmt_conversion
* + add basic docs for format conversion tools and post tuning dialog format conversion tools
* * rename tools
* + add messages <--> dj conversion tools
* + add messages <--> dj conversion tools
* - reorganize the directory
* * rename functions
* + add conversion tools for ModelScope-Swift ShareGPT format
* + add conversion tools for Alpaca format
* * fix typos in doc strings
* Update post_tuning_dialog/README.md
* Update pos_tuning_dialog/README_ZH.md align with en version
* clearly point out the DJ format
* clearly point out the DJ format in zh
* minor typo fix

---------

Co-authored-by: Daoyuan Chen <[email protected]>
modelscope · Dec 26, 2024 · 1554138
1 parent 36af193
commit 1554138
Showing 32 changed files with 1,490 additions and 18 deletions.
diff --git a/README.md b/README.md
@@ -55,7 +55,7 @@ In this new version, we support more features for **multimodal data (including v
 - [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
 - [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
 - [2024-01-05] We release **Data-Juicer v0.1.3** now!
-In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
+In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/fmt_conversion/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
 Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
 - [2023-10-13] Our first data-centric LLM competition begins! Please
   visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.

diff --git a/README_ZH.md b/README_ZH.md
@@ -47,7 +47,7 @@ Data-Juicer正在积极更新和维护中，我们将定期强化和新增更多
 - [2024-02-05] 我们的论文被SIGMOD'24 industrial track接收！
 - [2024-01-10] 开启“数据混合”新视界——第二届Data-Juicer大模型数据挑战赛已经正式启动！立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532174)，了解赛事详情。
 - [2024-01-05] **Data-Juicer v0.1.3** 版本发布了。 
-在这个新版本中，我们支持了**更多Python版本**（3.8-3.10），同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)（包括文本、图像和音频。更多模态也将会在之后支持）！
+在这个新版本中，我们支持了**更多Python版本**（3.8-3.10），同时支持了**多模态**数据集的[转换](tools/fmt_conversion/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)（包括文本、图像和音频。更多模态也将会在之后支持）！
 此外，我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。
 - [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了！
   请访问大赛官网，FT-Data Ranker（[1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ，了解更多信息。

diff --git a/tools/fmt_conversion/README.md b/tools/fmt_conversion/README.md
@@ -0,0 +1,54 @@
+# Format Conversion Tools
+
+Data-Juicer provides more than ten format conversion tools for diverse datasets, including multimodal datasets, post-tuning datasets, and so on.
+These tools convert a dataset from its original format into a unified, intermediate format used in Data-Juicer, which we call the "DJ format".
+An overview of the DJ format is shown below:
+
+```python
+{
+  # >>> core contents: texts, dialogs, ...
+  "text": "xxx",
+  "query": "xxx",
+  "response": "xxx",
+  # ... (other core fields)
+  # <<< core contents
+
+  # >>> extra data contents: multimodal data paths, ...
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  # <<< extra data contents
+
+  # >>> meta infos and stats, which could be primitive or produced by Data-Juicer
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    # ... (other stats)
+  },
+  # <<< meta infos and stats
+}
+```
+
+The DJ format consists of three parts:
+1. Core contents: such as texts in LLM pretraining datasets, dialogs in post-tuning datasets, and so on. They are directly related to the training or fine-tuning procedures in the downstream usage of the dataset.
+2. Extra data contents: such as the paths to the multimodal data in multimodal datasets. They are organized as path lists.
+3. Meta infos & stats: such as version or source information inherited from the original dataset, or category tags and stats produced by Data-Juicer OPs.
+
+The 2nd and 3rd parts are commonly used and organized in nearly the same structure across diverse datasets.
+In contrast, the 1st part, the core contents, can differ considerably between kinds of datasets.
+The following documents describe this part in more detail for each kind of dataset:
+- [Multimodal datasets](multimodal/README.md)
+- [Post Tuning](post_tuning_dialog/README.md)
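
To make the three-part layout concrete, here is a minimal sketch (not part of this commit) that splits a DJ-format sample into core contents, extra data contents, and meta infos/stats. The helper name `split_dj_sample` and the two key tuples are assumptions for illustration only; the README lists these keys as examples, not as a fixed schema.

```python
from typing import Any, Dict, Tuple

# Assumed key sets, taken from the example keys shown in the README above.
EXTRA_DATA_KEYS = ("images", "audios", "videos")
META_KEYS = ("meta", "stats")


def split_dj_sample(
    sample: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split one DJ-format sample into (core, extra_data, meta_and_stats).

    Hypothetical helper: any key that is not a known extra-data or
    meta key is treated as core content.
    """
    core: Dict[str, Any] = {}
    extra: Dict[str, Any] = {}
    meta: Dict[str, Any] = {}
    for key, value in sample.items():
        if key in EXTRA_DATA_KEYS:
            extra[key] = value
        elif key in META_KEYS:
            meta[key] = value
        else:
            core[key] = value
    return core, extra, meta


example = {
    "text": "xxx",
    "images": ["path/to/the/image/of/antarctica_map"],
    "meta": {"src": "customized"},
    "stats": {"lang": "en"},
}
core_part, extra_part, meta_part = split_dj_sample(example)
print(core_part)
print(extra_part)
print(meta_part)
```

Treating every unrecognized key as core content mirrors the README's point that only the first part varies between kinds of datasets.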
diff --git a/tools/fmt_conversion/README_ZH.md b/tools/fmt_conversion/README_ZH.md
@@ -0,0 +1,54 @@
+# 格式转换工具
+
+在这里，Data-Juicer 为各式各样的数据集提供了十数种格式转换工具，包括多模态数据集，后微调数据集等等。
+这些工具帮助我们将原始格式的数据集转换为 Data-Juicer 使用的一种统一的、中间的格式表示，我们将其称为"DJ 格式"。
+DJ 格式的一个示例如下所示：
+
+```python
+{
+  // >>> 核心内容：文本，对话，......
+  "text": "xxx",
+  "query": "xxx",
+  "response": "xxx",
+  ......
+  // <<< 核心内容
+
+  // >>> 额外数据内容：多模态数据路径，......
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  // <<< 额外数据内容
+
+  // >>> meta 信息和 stats，它们可能是数据集原生的，也可以由 Data-Juicer 产出
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  },
+  // <<< meta 信息和 stats
+}
+```
+
+在 DJ 格式中大概包括三个部分：
+1. 核心内容：例如 LLM 的预训练数据集中的文本内容，后微调数据集中的对话内容等。它们与数据集的下游使用的训练或者微调过程直接相关。
+2. 额外数据内容：例如多模态数据集中的多模态数据路径。它们被组织为路径列表。
+3. Meta 信息和 Stats：例如从原始数据集中继承而来的数据集版本或来源信息，或者由 Data-Juicer 的算子产出的类别 tags 和 stats 信息。
+
+其中，第 2 和第 3 部分对于不同的数据集来说是通用的，而且都会被组织为几乎相同的结构。
+作为对比，第 1 部分，也就是核心内容部分，对于各种数据集来说可能非常不同。
+这里列举了针对不同种类数据集介绍这个部分更多细节的对应的文档：
+- [多模态数据集](multimodal/README_ZH.md)
+- [后微调数据集](post_tuning_dialog/README_ZH.md)
diff --git a/tools/multimodal/README.md → tools/fmt_conversion/multimodal/README.md b/tools/multimodal/README.md → tools/fmt_conversion/multimodal/README.md
@@ -10,7 +10,7 @@ Both input and output of this utility conform to Data-Juicer's data format. If y
 To learn more about the usage of the absolute to relative path conversion tool, you can execute the following command:
 
 ```shell
-python tools/multimodal/absolute_path_to_relative_path.py --help
+python tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py --help
 ```
 
 ## Dataset Format Conversion
@@ -94,7 +94,7 @@ For all tools, you can run the following command to find out the usage of them:
 
 ```shell
 # e.g. llava_to_dj.py
-python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
+python tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
 ```
 
 Before using these tools, you might need to take a glance at the reference

diff --git a/tools/multimodal/README_ZH.md → tools/fmt_conversion/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md → tools/fmt_conversion/multimodal/README_ZH.md
@@ -10,7 +10,7 @@
 可以运行以下命令来了解绝对路径转化相对路径工具的详细用法：
 
 ```shell
-python tools/multimodal/absolute_path_to_relative_path.py --help
+python tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py --help
 ```
 
 ## 数据集格式转换
@@ -86,7 +86,7 @@ python tools/multimodal/absolute_path_to_relative_path.py --help
 
 ```shell
 # 例如：llava_to_dj.py
-python tools/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
+python tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py --help
 ```
 在使用这些工具之前，您可能需要查看上表中每个格式的参考资料，以更好地了解详细的格式信息，并理解每个工具的参数含义。
 

diff --git a/...timodal/absolute_path_to_relative_path.py → ...timodal/absolute_path_to_relative_path.py b/...timodal/absolute_path_to_relative_path.py → ...timodal/absolute_path_to_relative_path.py
diff --git a/...ormat_to_target_format/dj_to_internvid.py → ...ormat_to_target_format/dj_to_internvid.py b/...ormat_to_target_format/dj_to_internvid.py → ...ormat_to_target_format/dj_to_internvid.py
@@ -35,7 +35,7 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import remove_dj_special_tokens
+from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens
 
 
 def main(

diff --git a/...er_format_to_target_format/dj_to_llava.py → ...er_format_to_target_format/dj_to_llava.py b/...er_format_to_target_format/dj_to_llava.py → ...er_format_to_target_format/dj_to_llava.py
diff --git a/...cer_format_to_target_format/dj_to_mmc4.py → ...cer_format_to_target_format/dj_to_mmc4.py b/...cer_format_to_target_format/dj_to_mmc4.py → ...cer_format_to_target_format/dj_to_mmc4.py
diff --git a/...r_format_to_target_format/dj_to_msrvtt.py → ...r_format_to_target_format/dj_to_msrvtt.py b/...r_format_to_target_format/dj_to_msrvtt.py → ...r_format_to_target_format/dj_to_msrvtt.py
@@ -44,7 +44,7 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import remove_dj_special_tokens
+from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens
 
 
 def main(

diff --git a/...t_to_target_format/dj_to_video_chatgpt.py → ...t_to_target_format/dj_to_video_chatgpt.py b/...t_to_target_format/dj_to_video_chatgpt.py → ...t_to_target_format/dj_to_video_chatgpt.py
@@ -38,7 +38,7 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import remove_dj_special_tokens
+from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens
 
 
 def main(

diff --git a/..._format_to_target_format/dj_to_wavcaps.py → ..._format_to_target_format/dj_to_wavcaps.py b/..._format_to_target_format/dj_to_wavcaps.py → ..._format_to_target_format/dj_to_wavcaps.py
diff --git a/...er_format_to_target_format/dj_to_youku.py → ...er_format_to_target_format/dj_to_youku.py b/...er_format_to_target_format/dj_to_youku.py → ...er_format_to_target_format/dj_to_youku.py
@@ -59,7 +59,7 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import remove_dj_special_tokens
+from tools.fmt_conversion.multimodal.utils import remove_dj_special_tokens
 
 
 def main(

diff --git a/..._to_data_juicer_format/internvid_to_dj.py → ..._to_data_juicer_format/internvid_to_dj.py b/..._to_data_juicer_format/internvid_to_dj.py → ..._to_data_juicer_format/internvid_to_dj.py
@@ -42,8 +42,8 @@
 from data_juicer.utils.file_utils import add_suffix_to_filename
 from data_juicer.utils.mm_utils import (SpecialTokens, cut_video_by_seconds,
                                         timecode_string_to_seconds)
-from tools.multimodal.utils import (check_args_load_to_dj_data,
-                                    convert_text_to_dj)
+from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
+                                                   convert_text_to_dj)
 
 
 def main(

diff --git a/...rmat_to_data_juicer_format/llava_to_dj.py → ...rmat_to_data_juicer_format/llava_to_dj.py b/...rmat_to_data_juicer_format/llava_to_dj.py → ...rmat_to_data_juicer_format/llava_to_dj.py
diff --git a/...ormat_to_data_juicer_format/mmc4_to_dj.py → ...ormat_to_data_juicer_format/mmc4_to_dj.py b/...ormat_to_data_juicer_format/mmc4_to_dj.py → ...ormat_to_data_juicer_format/mmc4_to_dj.py
diff --git a/...mat_to_data_juicer_format/msrvtt_to_dj.py → ...mat_to_data_juicer_format/msrvtt_to_dj.py b/...mat_to_data_juicer_format/msrvtt_to_dj.py → ...mat_to_data_juicer_format/msrvtt_to_dj.py
@@ -43,8 +43,8 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import (check_args_load_to_dj_data,
-                                    convert_text_to_dj)
+from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
+                                                   convert_text_to_dj)
 
 
 def main(

diff --git a/...data_juicer_format/video_chatgpt_to_dj.py → ...data_juicer_format/video_chatgpt_to_dj.py b/...data_juicer_format/video_chatgpt_to_dj.py → ...data_juicer_format/video_chatgpt_to_dj.py
@@ -37,8 +37,8 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import (check_args_load_to_dj_data,
-                                    convert_text_to_dj)
+from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
+                                                   convert_text_to_dj)
 
 
 @logger.catch(reraise=True)

diff --git a/...at_to_data_juicer_format/wavcaps_to_dj.py → ...at_to_data_juicer_format/wavcaps_to_dj.py b/...at_to_data_juicer_format/wavcaps_to_dj.py → ...at_to_data_juicer_format/wavcaps_to_dj.py
diff --git a/...rmat_to_data_juicer_format/youku_to_dj.py → ...rmat_to_data_juicer_format/youku_to_dj.py b/...rmat_to_data_juicer_format/youku_to_dj.py → ...rmat_to_data_juicer_format/youku_to_dj.py
@@ -58,8 +58,8 @@
 from tqdm import tqdm
 
 from data_juicer.utils.mm_utils import SpecialTokens
-from tools.multimodal.utils import (check_args_load_to_dj_data,
-                                    convert_text_to_dj)
+from tools.fmt_conversion.multimodal.utils import (check_args_load_to_dj_data,
+                                                   convert_text_to_dj)
 
 
 @logger.catch(reraise=True)

diff --git a/tools/multimodal/utils.py → tools/fmt_conversion/multimodal/utils.py b/tools/multimodal/utils.py → tools/fmt_conversion/multimodal/utils.py
diff --git a/tools/fmt_conversion/post_tuning_dialog/README.md b/tools/fmt_conversion/post_tuning_dialog/README.md
@@ -0,0 +1,96 @@
+# Post Tuning Tools
+
+For post-tuning datasets, we mainly consider 4 formats in order to support [ModelScope-Swift](https://github.com/modelscope/ms-swift/blob/main/docs/source_en/Customization/Custom-dataset.md) and [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory/blob/main/data/README.md).
+
+- Swift's Messages format (very similar to LLaMA-Factory's ShareGPT format, but with different key names):
+
+```python
+{
+  "messages": [
+    {
+      "role": "system",
+      "content": "<system>"
+    },
+    {
+      "role": "user",
+      "content": "<query1>"
+    },
+    {
+      "role": "assistant",
+      "content": "<response1>"
+    },
+    {
+      "role": "user",
+      "content": "<query2>"
+    },
+    {
+      "role": "assistant",
+      "content": "<response2>"
+    }
+  ]
+}
+```
+
+- Swift's ShareGPT format:
+
+```python
+{
+  "system": "<system>",
+  "conversation": [
+    {
+      "human": "<query1>",
+      "assistant": "<response1>"
+    },
+    {
+      "human": "<query2>",
+      "assistant": "<response2>"
+    }
+  ]
+}
+```
+
+- Alpaca format (defined identically in Swift and LLaMA-Factory):
+
+```python
+{
+  "system": "<system>",
+  "instruction": "<query-inst>",
+  "input": "<query-input>",
+  "output": "<response>"
+}
+```
+
+- Swift's Query-Response format:
+
+```python
+{
+  "system": "<system>",
+  "query": "<query2>",
+  "response": "<response2>",
+  "history": [
+    [
+      "<query1>",
+      "<response1>"
+    ]
+  ]
+}
+```
+
+In Data-Juicer, we pre-set fields that align with the last two formats (Alpaca and Query-Response), which serve as our intermediate format for post-tuning dialog datasets. Correspondingly, we provide several tools to convert datasets in other formats to the DJ format below, and vice versa.
+
+- DJ default format for post-tuning OPs:
+
+```python
+{
+  "system": "<system>",
+  "instruction": "<query-inst>",
+  "query": "<query2>",
+  "response": "<response2>",
+  "history": [
+    [
+      "<query1>",
+      "<response1>"
+    ]
+  ]
+}
+```
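
As an illustration of how the Messages format shown earlier could map onto the DJ default format, here is a minimal sketch. It is not the conversion tool added in this commit: the function name `messages_to_dj` is hypothetical, and it assumes that user and assistant messages strictly alternate, that the last pair becomes `query`/`response`, that earlier pairs go into `history`, and that `instruction` is left empty.

```python
from typing import Any, Dict, List, Optional


def messages_to_dj(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one Messages-format sample into the DJ post-tuning layout.

    Hypothetical helper for illustration; raises ValueError when the
    user/assistant turns do not pair up.
    """
    system = ""
    turns: List[List[str]] = []  # completed [query, response] pairs
    pending_query: Optional[str] = None

    for msg in sample["messages"]:
        role, content = msg["role"], msg["content"]
        if role == "system":
            system = content
        elif role == "user":
            if pending_query is not None:
                raise ValueError("two user messages in a row")
            pending_query = content
        elif role == "assistant":
            if pending_query is None:
                raise ValueError("assistant message without a user message")
            turns.append([pending_query, content])
            pending_query = None
        else:
            raise ValueError(f"unsupported role: {role!r}")

    if pending_query is not None or not turns:
        raise ValueError("sample must end with a complete user/assistant turn")

    query, response = turns[-1]
    return {
        "system": system,
        "instruction": "",  # assumption: Messages has no separate instruction
        "query": query,
        "response": response,
        "history": turns[:-1],
    }


messages_sample = {
    "messages": [
        {"role": "system", "content": "<system>"},
        {"role": "user", "content": "<query1>"},
        {"role": "assistant", "content": "<response1>"},
        {"role": "user", "content": "<query2>"},
        {"role": "assistant", "content": "<response2>"},
    ]
}
print(messages_to_dj(messages_sample))
```

Raising on unpaired turns rather than silently dropping them is a design choice of this sketch; a real converter might instead skip or log malformed samples.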