数据类型	数据描述	数据链接
指令微调	self-instruct，GPT3自动生成&过滤得到指令集	https://github.com/yizhongw/self-instruct
指令微调	Standford Alpaca：52K text-davinci-003生成的self-instruct指令数据集	https://github.com/tatsu-lab/stanford_alpaca
指令微调	GPT4-for-LLM 中文+英文+对比指令	https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM
指令微调	GPTTeacher更多样的通用指令，角色扮演和代码指令	https://github.com/teknium1/GPTeacher/tree/main
指令微调	中文翻译Alpaca还有一些其他指令数据集	https://github.com/hikariming/alpaca_chinese_dataset https://github.com/carbonz0/alpaca-chinese-dataset
指令微调	alpaca指令GPT4生成，和以上几版对比显著质量更高，回复更长	https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/tree/main
指令微调	Guanaco数据：对Alphca指令重写后以不同语言生成总共534K，有对话和非对话类型，还有补充的QA生成样本	https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
指令微调	OIG中文指令包括翻译alpaca+natural+unnatural，多轮对话，考试，leetcode指令	https://github.com/BAAI-Zlab/COIG
指令微调	Vicuna训练使用的样本，用API获取了sharegpt上用户和chatgpt对话历史，部分网友整理到了HF	https://github.com/domeccleston/sharegpt https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main
指令微调	HC3指令数据中英文，包括金融，开放QA，百科，DBQA，医学等包含人工回复	https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/tree/main
指令微调	MOSS开源的SFT数据包含使用plugin的对话数据	https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/tree/main
指令微调	InstructWild数据：用四处爬取的chatgpt指令作为种子self-instruct扩充生成，中英双语	https://github.com/XueFuzhao/InstructionWild/tree/main/data
指令微调	BELLE100万指令数据，参考Alpaca用ChatGPT生成，有数学，多轮对话，校色对话等等	https://github.com/LianjiaTech/BELLE
指令微调	PromptCLUE多任务提示数据集：模板构建，只包含标准NLP任务	https://github.com/CLUEbenchmark/pCLUE
指令微调	TK-Instruct微调用的指令数据集, 全人工标注1600+NLP任务	https://instructions.apps.allenai.org/
指令微调	T0微调用的指令数据集（P3）	https://huggingface.co/datasets/bigscience/P3
指令微调	p3衍生的46种多语言数据集（xmtf）	https://github.com/bigscience-workshop/xmtf
指令微调	Unnatural Instruction使用GPT3生成后改写得到240k	https://github.com/orhonovich/unnatural-instructions
指令微调	alpaca COT对多个数据源进行了清理并统一格式放到的了HF, 重点是人工整理的COT数据	https://github.com/PhoebusSi/Alpaca-CoT
指令微调	人工编写包含23种常见的中文NLP任务的指令数据，中文写作方向	https://github.com/yangjianxin1/Firefly
指令微调	Amazon COT指令样本包括各类QA，bigbench，math等	https://github.com/amazon-science/auto-cot
指令微调	CSL包含 396,209 篇中文核心期刊论文元信息（标题、摘要、关键词、学科、门类）可做预训练可构建NLP指令任务	https://github.com/ydli-ai/CSL
指令微调	alpaca code 20K代码指令数据	https://github.com/sahil280114/codealpaca#data-release
指令微调	GPT4Tools 71K GPT4指令样本	https://github.com/StevenGrove/GPT4Tools
指令微调	GPT4指令+角色扮演+代码指令	https://huggingface.co/datasets/teknium/OpenHermes-2.5?row=0
指令微调	Mol-Instructions 2043K 分子+蛋白质+生物分子文本指令，覆盖分子设计、蛋白质功能预测、蛋白质设计等任务	https://github.com/zjunlp/Mol-Instructions
指令微调	Inifity-Instruct 智源千万级指令微调数据集	https://github.com/FlagOpen/Infinity-Instruct/tree/main
指令微调	OpenHermes: GPT4生成数据做了过滤筛选	https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B
指令微调	AgentInstruct:微软浪潮开源Agent指令数据	https://huggingface.co/datasets/THUDM/AgentInstruct
数学	腾讯人工智能实验室发布网上爬取的数学问题APE210k	https://github.com/Chenny0808/ape210k
数学	猿辅导 AI Lab开源小学应用题Math23K	https://github.com/SCNU203/Math23k/tree/main
数学	grade school math把OpenAI的高中数学题有改造成指令样本有2-8步推理过程	https://huggingface.co/datasets/qwedsacf/grade-school-math-instructions
数学	数学问答数据集有推理过程和多项选择	https://huggingface.co/datasets/math_qa/viewer/default/test?row=2
数学	AMC竞赛数学题	https://huggingface.co/datasets/competition_math
数学	线性代数等纯数学计算题	https://huggingface.co/datasets/math_dataset
代码	APPS从不同的开放访问编码网站Codeforces、Kattis 等收集的问题	https://opendatalab.org.cn/APPS
代码	Lyra代码由带有嵌入式 SQL 的 Python 代码组成，经过仔细注释的数据库操作程序，配有中文评论和英文评论。	https://opendatalab.org.cn/Lyra
代码	Conala来自StackOverflow问题,手动注释3k，英文	https://opendatalab.org.cn/CoNaLa/download
代码	code-alpaca ChatGPT生成20K代码指令样本	https://github.com/sahil280114/codealpaca.git
代码	32K, 四种不同类型、不同难度的代码相关中文对话数据，有大模型生成，	https://github.com/zxx000728/CodeGPT
对话	LAION 策划的开放指令通用数据集中手动选择的组件子集已开源40M 3万个,100M在路上	https://github.com/LAION-AI/Open-Instruction-Generalist
对话	Baize基于Chat GPT构建的self-chat数据	https://github.com/project-baize/baize-chatbot/tree/main/data
对话	FaceBook开源BlenderBot训练对话数据~6K	https://huggingface.co/datasets/blended_skill_talk
对话	AllenAI开源38.5万个对话高质量数据集SODA	https://realtoxicityprompts.apps.allenai.org/
对话	InstructDial在单一对话任务类型上进行指令微调	https://github.com/prakharguptaz/Instructdial
对话	Ultra Chat 两个独立的 ChatGPT Turbo API 进行对话，从而生成多轮对话数据	https://github.com/thunlp/UltraChat
对话	Awesome Open-domain Dialogue Models提供多个开放域对话数据	https://github.com/cingtiye/Awesome-Open-domain-Dialogue-Models#%E4%B8%AD%E6%96%87%E5%BC%80%E6%94%BE%E5%9F%9F%E5%AF%B9%E8%AF%9D%E6%95%B0%E6%8D%AE%E9%9B%86
对话	Salesforce开源超全DialogStudio	https://github.com/salesforce/DialogStudio
对话	基于事实Reference的多轮问答中文数据，已开源5万，之后会开源更多	https://github.com/sufengniu/RefGPT
对话	lmsys-chat: Vicuna demo和Chatbot Arena上真实用户和模型的对话数据	https://huggingface.co/datasets/lmsys/lmsys-chat-1m
RLFH	北大河狸开源RLHF数据集10K，1M需要申请	https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-10K
RLHF	Anthropic hh-rlhf数据集	https://huggingface.co/datasets/Anthropic/hh-rlhf
RLHF	Stack-exchange上问题对应多个答案，每个答案有打分	https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences/tree/main
RLHF	Facebook Bot Adversarial Dialogues数据集5K	https://github.com/facebookresearch/ParlAI
RLHF	AllenAI Real Toxicity prompts	https://github.com/facebookresearch/ParlAI
RLHF	OpenAssistant Conversations 160K消息，13500人工生成, 英文为主	https://huggingface.co/datasets/OpenAssistant/oasst1
RLHF	知乎问答偏好数据集	https://huggingface.co/datasets/liyucheng/zhihu_rlhf_3k
RLHF	hh-rlhf中文翻译偏好数据	https://huggingface.co/datasets/liswei/rm-static-zhTW
RLHF	面壁智能开源大规模偏好数据，基于64Kprompt使用不同模型生成4个回答使用GPT-4评估	https://github.com/OpenBMB/UltraFeedback
RLHF	HelpSteer:英伟达推出的10K偏好数据	https://huggingface.co/datasets/nvidia/HelpSteer
评估集	BigBench(Beyond the Imitation Game Benchmark)	https://github.com/google/BIG-bench
评估集	Complex QA：用于ChatGPT的评测指令集	https://github.com/tan92hl/Complex-Question-Answering-Evaluation-of-ChatGPT
评估集	Langchain开源评估数据集	https://huggingface.co/LangChainDatasets
评估集	2010-2022年全国高考卷的题目	https://github.com/OpenLMLab/GAOKAO-Bench
评估集	中文通用大模型综合性评测基准SuperCLUE	https://github.com/CLUEbenchmark/SuperCLUE
英文预训练	RedPajama：开源的复刻llama的预训练数据集，1.21万亿Token	https://github.com/togethercomputer/RedPajama-Data
英文预训练	SlimPajama：Cerebras基于RedPajama进行清洗去重后得到的高质量数据集, 6270亿Token	https://huggingface.co/datasets/cerebras/SlimPajama-627B/tree/main/train
英文预训练	The Pile：22个高质量数据集混合的预训练数据集800G,全量开放下载	https://pile.eleuther.ai/
英文预训练	Fineweb：Huggingface发布从CC清洗消重后的15T tokens web数据，超越C4，pile，pajama	https://huggingface.co/datasets/HuggingFaceFW/fineweb
英文预训练	Finweb-EDU：从FineWeb中通过分类器筛选得到的高质量教育水平的数据集 5.4T Token	https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
英文预训练	1.3T高质量小规模混合预训练数据集	https://huggingface.co/datasets/Zyphra/Zyda
通用预训练	UER整理CLUECorpusSmall+News Commentary中英	https://github.com/dbiir/UER-py/wiki/%E9%A2%84%E8%AE%AD%E7%BB%83%E6%95%B0%E6%8D%AE
中文预训练	智源人工智能开源的wudao 200G预训练数据	https://github.com/BAAI-WuDao/WuDaoMM
中文预训练	里屋社区发起开源力量收集中文互联网语料集MNBVC目标是对标ChatGPT的40T	https://github.com/esbatmop/MNBVC
中文预训练	复旦开源15万中文图书下载和抽取方案	https://github.com/FudanNLPLAB/CBook-150K
中文预训练	书生万卷数据集来自公开网页多模态数据集，包括文本，图文和视频，其中文本1T，图文150G	https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0
中文预训练	昆仑天工开源3.2TB中英语料	https://github.com/SkyworkAI/Skywork
中文预训练	浪潮开源的用于Yuan1.0训练的预训练中文语料	https://www.airyuan.cn/home
领域预训练	度小满开源60G金融预训练语料	https://github.com/Duxiaoman-DI/XuanYuan
领域预训练	首个中文科学文献数据集CSL,也有多种NLP任务数据	https://github.com/ydli-ai/CSL
平行语料	news-commentary中英平行语料，用于中英间知识迁移	https://data.statmt.org/news-commentary/v15/training/
多源数据集整合	opendatalab整合了预训练阶段的多个数据源	https://opendatalab.org.cn/?industry=9821&source=JUU3JTlGJUE1JUU0JUI5JThF
Tool-搜索增强	webCPM开源的和搜索工具进行交互问答的数据集，包括网页抽取式摘要，多事实内容回答等人工标注数据	https://github.com/thunlp/WebCPM
Tool-多工具	BmTools开源的多工具调用指令数据集	https://github.com/OpenBMB/BMTools
Tool-多工具	AgentInstruct包含6项Agent任务，包括REACT式COT标注	https://thudm.github.io/AgentTuning/
Tool-多工具	MSAgent-Bench 大模型调用数据集 598k训练数据	https://modelscope.cn/datasets/damo/MSAgent-Bench/summary
Tool-多工具	MOSS开源的知识搜索，文生图，计算器，解方程等4个插件的30万条多轮对话数据	https://github.com/OpenLMLab/MOSS#%E6%95%B0%E6%8D%AE
NL2SQL	DB-GPT-Hub梳理了多源text-to-sql数据集	https://github.com/eosphoros-ai/DB-GPT-Hub
长文本	清华开源的长文本对齐数据集LongAlign-10k	https://huggingface.co/datasets/THUDM/LongAlign-10k
多模态-图表	MMC图表理解问答数据集	https://github.com/FuxiaoLiu/MMC
表格数据	汇总了各类表格数据	https://github.com/SpursGoZmy/Tabular-LLM

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

开源数据.MD

开源数据.MD

Files

开源数据.MD

Latest commit

History

开源数据.MD

File metadata and controls