YaYi Large Language Model

Introduction

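YaYi was obtained by instruction fine-tuning on a million-scale set of manually constructed, high-quality domain data. The training data covers five domains (media publicity, public opinion analysis, public security, financial risk control, and urban governance) and more than a hundred natural-language instruction tasks. As YaYi was iterated from its pre-trained initialization weights into a domain model, we progressively strengthened its basic Chinese capabilities and its domain analysis capabilities, and added some plugin capabilities. In addition, continuous human feedback from hundreds of users during internal testing helped us further improve the model's performance and safety.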

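By open-sourcing YaYi, we hope to contribute to the development of the open-source community for Chinese pre-trained large language models, and to build the YaYi ecosystem together with every partner.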

Usage

Environment Setup

Clone this repository to a local machine or a remote server

git clone https://github.com/wenge-research/YaYi.git
cd YaYi

Create a conda environment

conda create --name yayi python=3.8
conda activate yayi

Install dependencies

pip install -r requirements.txt

We recommend that the torch and transformers versions be no lower than the recommended versions.
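One way to confirm the installed versions before running inference is a small check like the one below. This is a minimal sketch, not part of the repository: the minimum versions in MINIMUMS are placeholders (an assumption) and should be replaced with the versions recommended in requirements.txt.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Turn '2.0.1+cu118' into (2, 0, 1); local and pre-release suffixes are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Return True if the installed version is not lower than the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(minimums: dict) -> dict:
    """Map each package name to True/False, or None if it is not installed."""
    results = {}
    for name, minimum in minimums.items():
        try:
            results[name] = is_at_least(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            results[name] = None
    return results


# Placeholder minimums (assumption): copy the real values from requirements.txt.
MINIMUMS = {"torch": "0.0", "transformers": "0.0"}

if __name__ == "__main__":
    print(check_installed(MINIMUMS))
```

A result of False or None for either package suggests reinstalling it before moving on to inference.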

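Model Inference

The model weights (7B version) are open-sourced in our Huggingface model repository and are available for download. Below is a simple example of calling yayi-7b for inference on a downstream task. It can run on a single GPU such as an A100, A800, or 3090, and uses about 20 GB of GPU memory with FP16 inference: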

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

yayi_7b_path = "wenge-research/yayi-7b"
tokenizer = AutoTokenizer.from_pretrained(yayi_7b_path)
model = AutoModelForCausalLM.from_pretrained(yayi_7b_path, device_map="auto", torch_dtype=torch.bfloat16)

prompt = "你好"
# Wrap the user prompt in the system/human/assistant template.
formatted_prompt = f"<|System|>:\nA chat between a human and an AI assistant named YaYi.\nYaYi is a helpful and harmless language model developed by Beijing Wenge Technology Co.,Ltd.\n\n<|Human|>:\n{prompt}\n\n<|YaYi|>:"
# Tokenize the formatted prompt (not the raw prompt) and move the tensors to the model's device.
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    max_new_tokens=100,
    temperature=0.3,
    repetition_penalty=1.1,
    no_repeat_ngram_size=0
)
response = model.generate(**inputs, generation_config=generation_config)
# Decode the first (and only) generated sequence, prompt included.
print(tokenizer.decode(response[0]))
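The prompt template used in the example can be factored into a small helper so that the same format is applied to every request. The sketch below only reproduces the template string shown above; the names SYSTEM_PROMPT and build_prompt are introduced here for illustration and are not part of the repository.

```python
# System prompt text copied from the inference example above.
SYSTEM_PROMPT = (
    "A chat between a human and an AI assistant named YaYi.\n"
    "YaYi is a helpful and harmless language model developed by "
    "Beijing Wenge Technology Co.,Ltd."
)


def build_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Wrap a single user message in the <|System|> / <|Human|> / <|YaYi|> template."""
    return (
        f"<|System|>:\n{system_prompt}\n\n"
        f"<|Human|>:\n{user_message}\n\n"
        f"<|YaYi|>:"
    )


if __name__ == "__main__":
    print(build_prompt("你好"))
```

The returned string can be passed to the tokenizer in place of the hand-written f-string.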

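Note that the special token <|End|> was added as the end-of-sequence marker during training. If generation with the code above does not stop automatically, you can define a KeywordsStoppingCriteria class and pass an instance of it to model.generate().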

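import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class KeywordsStoppingCriteria(StoppingCriteria):
    """Stop generation once the last generated token is one of the given token ids."""
    def __init__(self, keywords_ids: list):
        self.keywords = keywords_ids

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # Only the first sequence in the batch is checked.
        return input_ids[0][-1].item() in self.keywords

# Reuses the tokenizer, model, inputs and generation_config from the inference example.
stop_criteria_7b = KeywordsStoppingCriteria([tokenizer.encode(w)[0] for w in ["<|End|>"]])
...
response = model.generate(**inputs, generation_config=generation_config, stopping_criteria=StoppingCriteriaList([stop_criteria_7b]))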
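Because the decoded output contains the prompt as well as the generated text, a post-processing step is often useful. The following is a minimal sketch, assuming the decoded string still contains the <|YaYi|>: tag and possibly the <|End|> token as plain text (this depends on how tokenizer.decode is called); extract_reply is a hypothetical helper, not a function from the repository.

```python
def extract_reply(decoded: str, reply_tag: str = "<|YaYi|>:", end_token: str = "<|End|>") -> str:
    """Return only the assistant's reply from a decoded generation."""
    # Keep what follows the last reply tag; if the tag is absent, keep everything.
    _, _, reply = decoded.rpartition(reply_tag)
    # Cut the text at the first end token, if one was generated.
    reply = reply.split(end_token, 1)[0]
    return reply.strip()


if __name__ == "__main__":
    sample = "<|System|>:\nsys\n\n<|Human|>:\n你好\n\n<|YaYi|>:\n你好！<|End|>"
    print(extract_reply(sample))
```

This trims text on the string level only; it does not replace the stopping criterion, which is what actually ends generation early.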

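Model Fine-tuning

This project trains the model with the deepspeed framework. Once the environment is set up, run the following command to start fine-tuning (single node, multiple GPUs).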

deepspeed --num_gpus=8 \
    --module training.trainer \
    --data-path ./data/yayi_train_example.json \
    --input-model ./checkpoints/yayi-7b \
    --deepspeed ./config/deepspeed_zero2_bf16.json \
    --epochs 2 \
    --local-output-dir ./checkpoints \
    --per-device-train-batch-size 8 \
    --per-device-eval-batch-size 8 \
    --logging-steps 1 \
    --save-steps 100 \
    --save-total-limit 10 \
    --eval-steps 100 \
    --warmup-steps 100 \
    --test-size 400 \
    --lr 5e-7 \
    --seed 515
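To get a feel for the scale implied by these flags, the global batch size and the number of optimizer steps per epoch can be estimated as below. This is a rough sketch under stated assumptions: gradient accumulation is taken to be 1 (it is not set on the command line above and may be configured elsewhere), and the training-set size used in the example is hypothetical.

```python
import math


def effective_batch_size(num_gpus: int, per_device_batch_size: int, grad_accum_steps: int = 1) -> int:
    """Number of samples consumed per optimizer step across all GPUs."""
    return num_gpus * per_device_batch_size * grad_accum_steps


def steps_per_epoch(num_train_examples: int, global_batch_size: int) -> int:
    """Optimizer steps needed to see every training example once (last batch may be partial)."""
    return math.ceil(num_train_examples / global_batch_size)


if __name__ == "__main__":
    # --num_gpus=8 and --per-device-train-batch-size 8 from the command above.
    global_bs = effective_batch_size(num_gpus=8, per_device_batch_size=8)
    # Hypothetical dataset: 50,000 samples with 400 held out by --test-size.
    n_train = 50_000 - 400
    print(global_bs, steps_per_epoch(n_train, global_bs))
```

Comparing the step count with --save-steps and --eval-steps gives an idea of how many checkpoints and evaluations a run will produce.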

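Training Data

YaYi was trained on Wenge's (中科闻歌) million-scale, high-quality domain instruction fine-tuning dataset. In this release we open-source 50,000 training samples, which can be downloaded from our Huggingface dataset repository. The dataset mainly covers finance, security, public opinion, media, and other domains. For most of the instruction data in each domain's tasks, we added a discrete prompt prefix to distinguish the data of different domains.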
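The idea of a discrete prompt prefix can be illustrated with a short sketch. Everything here is hypothetical: the prefix strings, the domain keys, and the add_domain_prefix helper are made up for illustration, and the actual prefixes and record format used in the released dataset should be checked in the data itself.

```python
# Hypothetical prefixes, for illustration only; the real prefixes used in the
# dataset are not specified in this README.
DOMAIN_PREFIXES = {
    "finance": "[Finance]",
    "security": "[Security]",
    "public_opinion": "[Public opinion]",
    "media": "[Media]",
}


def add_domain_prefix(instruction: str, domain: str, prefixes: dict = DOMAIN_PREFIXES) -> str:
    """Prepend the domain's discrete prefix; leave the instruction unchanged if none is defined."""
    prefix = prefixes.get(domain)
    if prefix is None:
        return instruction
    return f"{prefix} {instruction}"


if __name__ == "__main__":
    print(add_domain_prefix("Summarize the risk factors.", "finance"))
```

Instructions without a known domain pass through unchanged, which mirrors the statement that only most, not all, of the data carries a prefix.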

Todo

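Instruction fine-tuning of a 15B-parameter model
Enhanced multi-turn dialogue and logical reasoning capabilities
Technical report, plugin tools, and model capability evaluation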

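Acknowledgements

This project uses the bloomz-7b-mt model weights from BigScience as initialization weights and extends the vocabulary on that basis.
The training code of this project draws on the dolly project from Databricks and the Huggingface transformers library.
Distributed training in this project uses the DeepSpeed distributed training tool from Microsoft and the ZeRO stage 2 configuration file from the Huggingface transformers documentation.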
