10 more post-tuning OPs, regarding dialog data analysis from multiple aspects (#513)

* add api call

* add call_api ops

* clean

* minor update

* more tests

* update tests

* update prompts

* fix unittest

* update tests

* add docs

* minor fix

* add API processor

* refine API  processor

* refine

* chunk and extract events

* fix bugs

* fix tests

* refine tests

* extract nickname

* nickname test done

* lightRAG to OP

* doc done

* remove extra test

* relavant -> relevant

* fix minor error

* group by op done

* ValueError -> Exception

* fix config_all error

* fix prepare_api_model

* fix rank sample None

* constant fix key

* aggregator op

* init python_lambda_mapper

* set default arg

* fix init

* add python_file_mapper

* support text & most relavant entities

* coverage ignore_errors

* index sample

* role_playing_system_prompt_yaml

* system_prompt begin

* support batched

* remove unforkable

* support batched & add docs

* add docs

* fix docs

* update docs

* pre-commit done

* fix batch bug

* fix batch bug

* fix filter batch

* fix filter batch

* system prompt recipe done

* not rank for filter

* limit pyav version

* add test for op

* tmp

* doc done

* skip api test

* add env dependency

* install by recipe

* dialog sent intensity

* add query

* change to dj_install

* change to dj_install

* developer doc done

* query sent_int mapper

* query sentiment test done

* change meta pass

* doc done

* sentiment detection

* diff label

* sentiment

* test done

* dialog intent label

* fix typo

* prompt adjust

* add more test

* query intent detection

* for test

* for test

* change model

* fix typo

* fix typo

* for test

* for test

* doc done

* dialog topic detection

* dialog topic detection

* dialog topic detection

* dialog topic detection

* dialog topic detection

* dialog topic detection

* query topic detection

* query topic detection

* query topic detection

* query topic detection

* query topic detection

* doc done

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* meta tags aggregator

* naive reverse grouper

* naive reverse grouper

* tags specified field

* doc done

* - rename tests/ops/Aggregator into tests/ops/aggregator for right linking;
- minor fix for OP doc

* rename for right doc linking in test dir

* fix bad dingtalk link

---------

Co-authored-by: null <[email protected]>
Co-authored-by: gece.gc <[email protected]>
Co-authored-by: daoyuan <[email protected]>
4 people authored Dec 26, 2024
1 parent 7d5f37d commit 9466c73
Showing 43 changed files with 2,621 additions and 57 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -34,7 +34,7 @@ We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try
[Platform for AI of Alibaba Cloud (PAI)](https://www.aliyun.com/product/bigdata/learn) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: [PAI-Data Processing for Large Models](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX).

Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets.
We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw) channel, [DingDing](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) group, ...), in promoting data-model co-development along with research and applications of (multimodal) LLMs!
We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw) channel, [DingDing](https://qr.dingtalk.com/action/joingroup?code=v1,k1,YFIXM2leDEk7gJP5aMC95AfYT+Oo/EP/ihnaIEhMyJM=&_dt_no_comment=1&origin=11) group, ...), in promoting data-model co-development along with research and applications of (multimodal) LLMs!

----

2 changes: 1 addition & 1 deletion README_ZH.md
@@ -27,7 +27,7 @@ Data-Juicer 是一个一站式**多模态**数据处理系统,旨在为大语

[阿里云人工智能平台 PAI](https://www.aliyun.com/product/bigdata/learn) 已引用我们的工作,将Data-Juicer的能力集成到PAI的数据处理产品中。PAI提供包含数据集管理、算力管理、模型工具链、模型开发、模型训练、模型部署、AI资产管理在内的功能模块,为用户提供高性能、高稳定、企业级的大模型工程化能力。数据处理的使用文档请参考:[PAI-大模型数据处理](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX)

Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多的功能和数据菜谱。热烈欢迎您加入我们(issues/PRs/[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) /[钉钉群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11)/...),一起推进LLM-数据的协同开发和研究!
Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多的功能和数据菜谱。热烈欢迎您加入我们(issues/PRs/[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) /[钉钉群](https://qr.dingtalk.com/action/joingroup?code=v1,k1,YFIXM2leDEk7gJP5aMC95AfYT+Oo/EP/ihnaIEhMyJM=&_dt_no_comment=1&origin=11)/...),一起推进LLM-数据的协同开发和研究!


----
95 changes: 95 additions & 0 deletions configs/config_all.yaml
@@ -77,6 +77,68 @@ process:
- clean_ip_mapper: # remove ip addresses from text.
- clean_links_mapper: # remove web links from text.
- clean_copyright_mapper: # remove copyright comments.
- dialog_intent_detection_mapper: # Mapper to generate user's intent labels in dialog.
api_model: 'gpt-4o' # API model name.
intent_candidates: null # The output intent candidates. Use the intent labels of the open domain if it is None.
max_round: 10 # The max num of round in the dialog to build the prompt.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # System prompt for the task.
query_template: null # Template for query part to build the input prompt.
response_template: null # Template for response part to build the input prompt.
candidate_template: null # Template for intent candidates to build the input prompt.
analysis_template: null # Template for analysis part to build the input prompt.
labels_template: null # Template for labels to build the input prompt.
analysis_pattern: null # Pattern to parse the return intent analysis.
labels_pattern: null # Pattern to parse the return intent labels.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- dialog_sentiment_detection_mapper: # Mapper to generate user's sentiment labels in dialog.
api_model: 'gpt-4o' # API model name.
max_round: 10 # The max num of round in the dialog to build the prompt.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # System prompt for the task.
query_template: null # Template for query part to build the input prompt.
response_template: null # Template for response part to build the input prompt.
analysis_template: null # Template for analysis part to build the input prompt.
labels_template: null # Template for labels part to build the input prompt.
analysis_pattern: null # Pattern to parse the return sentiment analysis.
labels_pattern: null # Pattern to parse the return sentiment labels.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- dialog_sentiment_intensity_mapper: # Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in dialog.
api_model: 'gpt-4o' # API model name.
max_round: 10 # The max num of round in the dialog to build the prompt.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # System prompt for the task.
query_template: null # Template for query part to build the input prompt.
response_template: null # Template for response part to build the input prompt.
analysis_template: null # Template for analysis part to build the input prompt.
intensity_template: null # Template for intensity part to build the input prompt.
analysis_pattern: null # Pattern to parse the return sentiment analysis.
intensity_pattern: null # Pattern to parse the return sentiment intensity.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- dialog_topic_detection_mapper: # Mapper to generate user's topic labels in dialog.
api_model: 'gpt-4o' # API model name.
max_round: 10 # The max num of round in the dialog to build the prompt.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # System prompt for the task.
query_template: null # Template for query part to build the input prompt.
response_template: null # Template for response part to build the input prompt.
analysis_template: null # Template for analysis part to build the input prompt.
labels_template: null # Template for labels part to build the input prompt.
analysis_pattern: null # Pattern to parse the return topic analysis.
labels_pattern: null # Pattern to parse the return topic labels.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- expand_macro_mapper: # expand macro definitions in Latex text.
- extract_entity_attribute_mapper: # Extract attributes for given entities from the text.
api_model: 'gpt-4o' # API model name.
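For orientation, a minimal recipe sketch using one of the new dialog mappers could look as follows. The project name and paths are hypothetical placeholders, the samples are assumed to carry the dialog history and query fields the mapper expects, and an OpenAI-compatible API key is assumed to be configured in the environment:

project_name: 'dialog-analysis-demo'            # hypothetical project name
dataset_path: 'path/to/dialog_samples.jsonl'    # placeholder input path
export_path: 'path/to/output.jsonl'             # placeholder output path

process:
  - dialog_sentiment_intensity_mapper:
      api_model: 'gpt-4o'                       # any OpenAI-compatible chat model should work
      max_round: 10                             # only the last 10 dialog rounds are used to build the prompt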
@@ -277,6 +339,21 @@ process:
- python_lambda_mapper: # executing Python lambda function on data samples.
lambda_str: '' # A string representation of the lambda function to be executed on data samples. If empty, the identity function is used.
batched: False # A boolean indicating whether to process input data in batches.
- query_intent_detection_mapper: # Mapper to predict user's intent label in query.
hf_model: 'bespin-global/klue-roberta-small-3i4k-intent-classification' # Hugging Face model ID to predict the intent label.
zh_to_en_hf_model: 'Helsinki-NLP/opus-mt-zh-en' # Translation model from Chinese to English. If not None, translate the query from Chinese to English.
model_params: {} # model params for hf_model.
zh_to_en_model_params: {} # model params for zh_to_en_hf_model.
- query_sentiment_detection_mapper: # Mapper to predict user's sentiment label ('negative', 'neutral' or 'positive') in query.
hf_model: 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis' # Hugging Face model ID to predict the sentiment label.
zh_to_en_hf_model: 'Helsinki-NLP/opus-mt-zh-en' # Translation model from Chinese to English. If not None, translate the query from Chinese to English.
model_params: {} # model params for hf_model.
zh_to_en_model_params: {} # model params for zh_to_en_hf_model.
- query_topic_detection_mapper: # Mapper to predict user's topic label in query.
hf_model: 'dstefa/roberta-base_topic_classification_nyt_news' # Hugging Face model ID to predict the topic label.
zh_to_en_hf_model: 'Helsinki-NLP/opus-mt-zh-en' # Translation model from Chinese to English. If not None, translate the query from Chinese to English.
model_params: {} # model params for hf_model.
zh_to_en_model_params: {} # model params for zh_to_en_hf_model.
- relation_identity_mapper: # identify the relation between two entities in the text.
api_model: 'gpt-4o' # API model name.
source_entity: '孙悟空' # The source entity of the relation to be identified.
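Unlike the API-based dialog mappers above, the query_*_detection_mapper OPs run local Hugging Face classifiers, so no API key is needed. A sketched fragment is shown below; the model IDs are the defaults listed above, and setting the translation model to null to skip translation is an assumption based on the comments:

process:
  - query_sentiment_detection_mapper:
      hf_model: 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis'
      zh_to_en_hf_model: 'Helsinki-NLP/opus-mt-zh-en'   # translate Chinese queries to English before classification
  - query_topic_detection_mapper:
      hf_model: 'dstefa/roberta-base_topic_classification_nyt_news'
      zh_to_en_hf_model: null                           # assumed to skip translation for English-only queries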
@@ -715,6 +792,9 @@ process:
upper_percentile: # the upper bound of the percentile to be sampled
lower_rank: # the lower rank of the percentile to be sampled
upper_rank: # the upper rank of the percentile to be sampled
- tags_specified_field_selector: # Selector to select samples based on the tags of the specified field.
field_key: '__dj__meta__.query_sentiment_label' # the target keys corresponding to multi-level field information need to be separated by '.'
target_tags: ['happy', 'sad'] # Target tags to be selected.
- topk_specified_field_selector: # selector to select top samples based on the sorted specified field
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top samples
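To illustrate how the new selector pairs with the query mappers, the sketch below keeps only samples with a given sentiment tag; it assumes the sentiment mapper writes its label under '__dj__meta__.query_sentiment_label', as the default field_key above suggests:

process:
  - query_sentiment_detection_mapper:
      hf_model: 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis'
  - tags_specified_field_selector:
      field_key: '__dj__meta__.query_sentiment_label'
      target_tags: ['negative']                 # keep only samples tagged 'negative'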
@@ -723,6 +803,7 @@

# Grouper ops.
- naive_grouper: # Group all samples into one batched sample.
- naive_reverse_grouper: # Split one batched sample into individual samples.
- key_value_grouper: # Group samples into batched samples according to values in given keys.
group_by_keys: null # Group samples according to values in the keys. Support for nested keys such as "__dj__stats__.text_len". It defaults to [self.text_key].

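The new naive_reverse_grouper is the counterpart of naive_grouper. A typical sketched pattern packs all samples into one batched sample, runs an aggregator on it, and then unpacks the result:

process:
  - naive_grouper:                  # pack all samples into one batched sample
  # ... an aggregator OP would typically run here on the batched sample ...
  - naive_reverse_grouper:          # split the batched sample back into individual samples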
@@ -744,6 +825,20 @@
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- meta_tags_aggregator: # Merge similar meta tags into one tag.
api_model: 'gpt-4o' # API model name.
meta_tag_key: '__dj__meta__.query_sentiment_label' # The key of the meta tag to be mapped.
target_tags: ['开心', '难过', '其他'] # The target tags that the original meta tags are supposed to be mapped to.
api_endpoint: null # URL endpoint for the API.
response_path: null # Path to extract content from the API response. Defaults to 'choices.0.message.content'.
system_prompt: null # The system prompt.
input_template: null # The input template.
target_tag_template: null # The template for target tags.
tag_template: null # The template for each tag and its frequency.
output_pattern: null # The output pattern.
try_num: 3 # The number of retry attempts when there is an API call error or output parsing error.
model_params: {} # Parameters for initializing the API model.
sampling_params: {} # Extra parameters passed to the API call. e.g {'temperature': 0.9, 'top_p': 0.95}
- most_relavant_entities_aggregator: # Extract entities closely related to a given entity from some texts, and sort them in descending order of importance.
api_model: 'gpt-4o' # API model name.
entity: '孙悟空' # The given entity.
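Tying the pieces together, a hedged end-to-end fragment for tag normalization might chain a query mapper, the grouper pair, and the new meta_tags_aggregator. The field layout mirrors the defaults shown above; running the aggregator between the two groupers is an assumption about the intended usage:

process:
  - query_sentiment_detection_mapper:
      hf_model: 'mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis'
  - naive_grouper:
  - meta_tags_aggregator:
      api_model: 'gpt-4o'
      meta_tag_key: '__dj__meta__.query_sentiment_label'
      target_tags: ['开心', '难过', '其他']      # raw labels such as 'positive'/'negative' are merged into these target tags
  - naive_reverse_grouper: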
3 changes: 2 additions & 1 deletion data_juicer/ops/aggregator/__init__.py
@@ -1,8 +1,9 @@
from .entity_attribute_aggregator import EntityAttributeAggregator
from .meta_tags_aggregator import MetaTagsAggregator
from .most_relavant_entities_aggregator import MostRelavantEntitiesAggregator
from .nested_aggregator import NestedAggregator

__all__ = [
'NestedAggregator', 'EntityAttributeAggregator',
'NestedAggregator', 'MetaTagsAggregator', 'EntityAttributeAggregator',
'MostRelavantEntitiesAggregator'
]
4 changes: 0 additions & 4 deletions data_juicer/ops/aggregator/entity_attribute_aggregator.py
@@ -8,14 +8,10 @@
from data_juicer.utils.common_utils import (avg_split_string_list_under_limit,
is_string_list, nested_access,
nested_set)
from data_juicer.utils.lazy_loader import LazyLoader
from data_juicer.utils.model_utils import get_model, prepare_model

from .nested_aggregator import NestedAggregator

torch = LazyLoader('torch', 'torch')
vllm = LazyLoader('vllm', 'vllm')

OP_NAME = 'entity_attribute_aggregator'

