Commit

add suggestions code review
davidberenstein1957 committed Dec 20, 2024
1 parent ae8ff94 commit c088086
Showing 3 changed files with 35 additions and 13 deletions.
2 changes: 1 addition & 1 deletion 6_synthetic_datasets/README.md
@@ -1,6 +1,6 @@
# Synthetic Datasets

Synthetic data is artificially generated data that mimics real-world usage. It allows overcoming data limitations by expanding or enhancing datasets. Even though synthetic data was already used for some usescases, large language models have made synthetic datasets more popular for both training and evaluation of predictive and generative models.
Synthetic data is artificially generated data that mimics real-world usage. It allows overcoming data limitations by expanding or enhancing datasets. Even though synthetic data was already used for some use cases, large language models have made synthetic datasets more popular for pre- and post-training, and the evaluation of language models.

We'll use [`distilabel`](https://distilabel.argilla.io/latest/), a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers. For a deeper dive into the package and best practices, check out the [documentation](https://distilabel.argilla.io/latest/).
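The actual `distilabel` pipelines are covered in the notebooks below; as a library-free illustration of the underlying idea (expanding a small seed set into instruction/response records), here is a minimal sketch. All names are hypothetical, and the stub stands in for a real LLM call:

```python
def generate_response(instruction: str) -> str:
    """Stand-in for an LLM call; a real pipeline would query a model here."""
    return f"[synthetic answer to: {instruction}]"


def synthesize_dataset(seed_instructions, variants_per_seed=2):
    """Expand each seed instruction into several instruction/response records."""
    dataset = []
    for seed in seed_instructions:
        for i in range(variants_per_seed):
            instruction = f"{seed} (variant {i + 1})"
            dataset.append({
                "instruction": instruction,
                "response": generate_response(instruction),
            })
    return dataset


if __name__ == "__main__":
    for record in synthesize_dataset(["What is synthetic data?"]):
        print(record["instruction"])
```

In a real pipeline the variant step and the response step would each be an LLM call (as `distilabel` does with its generation tasks), but the expand-then-generate shape stays the same.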

27 changes: 18 additions & 9 deletions 6_synthetic_datasets/notebooks/instruction_sft_dataset.ipynb
@@ -46,13 +46,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start synthetsizing\n",
"## Start synthesizing\n",
"\n",
"As we've seen in the previous course content, we can create a distilabel pipeline for insturction dataset generation. The bare minimum pipline is already provided. Make sure to scale up this pipeline to generate a large dataset for instruction tuning. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n",
"As we've seen in the previous course content, we can create a distilabel pipeline for instruction dataset generation. The bare minimum pipeline is already provided. Make sure to scale up this pipeline to generate a large dataset for instruction tuning. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n",
"\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them.\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. \n",
"\n",
"Don't forget to push your dataset to the Hub!"
"An example of loading data from the Hub instead of dictionaries is provided below.\n",
"\n",
"```python\n",
"from datasets import load_dataset\n",
"\n",
"with Pipeline(...) as pipeline:\n",
" ...\n",
"\n",
"if __name__ == \"__main__\":\n",
" dataset = load_dataset(\"my-dataset\", split=\"train\")\n",
" distiset = pipeline.run(dataset=dataset)\n",
"```\n",
"\n",
"Don't forget to push your dataset to the Hub after running the pipeline!"
]
},
{
@@ -75,11 +88,7 @@
"\n",
"if __name__ == \"__main__\":\n",
" distiset = pipeline.run(use_cache=False)\n",
" distiset.push_to_hub(\"huggingface-smol-course-instruction-tuning-dataset\")\n",
"# [{\n",
"# \"instruction\": \"What is the purpose of Smol-Course?\",\n",
"# \"response\": \"The Smol-Course is a platform designed to learning computer science concepts.\"\n",
"# }]"
" distiset.push_to_hub(\"huggingface-smol-course-instruction-tuning-dataset\")"
]
},
{
19 changes: 16 additions & 3 deletions 6_synthetic_datasets/notebooks/preference_dpo_dataset.ipynb
@@ -46,13 +46,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start synthetsizing\n",
"## Start synthesizing\n",
"\n",
"As we've seen in the previous notebook, we can create a distilabel pipeline for preference dataset generation. The bare minimum pipeline is already provided. You can continue working on this pipeline to generate a large dataset for preference alignment. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n",
"\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. You could for example use the [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/) to start with the instruction dataset we generated in the previous notebook.\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. \n",
"\n",
"Don't forget to push your dataset to the Hub!"
"An example of loading data from the Hub instead of dictionaries is provided below.\n",
"\n",
"```python\n",
"from datasets import load_dataset\n",
"\n",
"with Pipeline(...) as pipeline:\n",
" ...\n",
"\n",
"if __name__ == \"__main__\":\n",
" dataset = load_dataset(\"my-dataset\", split=\"train\")\n",
" distiset = pipeline.run(dataset=dataset)\n",
"```\n",
"\n",
"Don't forget to push your dataset to the Hub after running the pipeline!"
]
},
{
