Commit

add suggestions code review
davidberenstein1957 committed Dec 20, 2024
1 parent ae8ff94 commit c088086
Showing 3 changed files with 35 additions and 13 deletions.
2 changes: 1 addition & 1 deletion 6_synthetic_datasets/README.md
@@ -1,6 +1,6 @@
# Synthetic Datasets

Synthetic data is artificially generated data that mimics real-world usage. It allows overcoming data limitations by expanding or enhancing datasets. Even though synthetic data was already used for some usescases, large language models have made synthetic datasets more popular for both training and evaluation of predictive and generative models.
Synthetic data is artificially generated data that mimics real-world usage. It allows overcoming data limitations by expanding or enhancing datasets. Even though synthetic data was already used for some use cases, large language models have made synthetic datasets more popular for pre- and post-training, and the evaluation of language models.

We'll use [`distilabel`](https://distilabel.argilla.io/latest/), a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers. For a deeper dive into the package and best practices, check out the [documentation](https://distilabel.argilla.io/latest/).
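The actual `distilabel` pipelines are covered in the notebooks below; as a library-free illustration of the underlying idea (expanding a small seed set into instruction/response records), here is a minimal sketch. All names are hypothetical, and the stub stands in for a real LLM call:

```python
def generate_response(instruction: str) -> str:
    """Stand-in for an LLM call; a real pipeline would query a model here."""
    return f"[synthetic answer to: {instruction}]"


def synthesize_dataset(seed_instructions, variants_per_seed=2):
    """Expand each seed instruction into several instruction/response records."""
    dataset = []
    for seed in seed_instructions:
        for i in range(variants_per_seed):
            instruction = f"{seed} (variant {i + 1})"
            dataset.append({
                "instruction": instruction,
                "response": generate_response(instruction),
            })
    return dataset


if __name__ == "__main__":
    for record in synthesize_dataset(["What is synthetic data?"]):
        print(record["instruction"])
```

In a real pipeline the variant step and the response step would each be an LLM call (as `distilabel` does with its generation tasks), but the expand-then-generate shape stays the same.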

27 changes: 18 additions & 9 deletions 6_synthetic_datasets/notebooks/instruction_sft_dataset.ipynb
@@ -46,13 +46,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start synthetsizing\n",
"## Start synthesizing\n",
"\n",
"As we've seen in the previous course content, we can create a distilabel pipeline for insturction dataset generation. The bare minimum pipline is already provided. Make sure to scale up this pipeline to generate a large dataset for instruction tuning. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n",
"As we've seen in the previous course content, we can create a distilabel pipeline for instruction dataset generation. The bare minimum pipeline is already provided. Make sure to scale up this pipeline to generate a large dataset for instruction tuning. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n",
"\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them.\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. \n",
"\n",
"Don't forget to push your dataset to the Hub!"
"An example of loading data from the Hub instead of dictionaries is provided below.\n",
"\n",
"```python\n",
"from datasets import load_dataset\n",
"\n",
"with Pipeline(...) as pipeline:\n",
" ...\n",
"\n",
"if __name__ == \"__main__\":\n",
" dataset = load_dataset(\"my-dataset\", split=\"train\")\n",
" distiset = pipeline.run(dataset=dataset)\n",
"```\n",
"\n",
"Don't forget to push your dataset to the Hub after running the pipeline!"
]
},
{
@@ -75,11 +88,7 @@
"\n",
"if __name__ == \"__main__\":\n",
" distiset = pipeline.run(use_cache=False)\n",
" distiset.push_to_hub(\"huggingface-smol-course-instruction-tuning-dataset\")\n",
"# [{\n",
"# \"instruction\": \"What is the purpose of Smol-Course?\",\n",
"# \"response\": \"The Smol-Course is a platform designed to learning computer science concepts.\"\n",
"# }]"
" distiset.push_to_hub(\"huggingface-smol-course-instruction-tuning-dataset\")"
]
},
{
19 changes: 16 additions & 3 deletions 6_synthetic_datasets/notebooks/preference_dpo_dataset.ipynb
@@ -46,13 +46,26 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Start synthetsizing\n",
"## Start synthesizing\n",
"\n",
"As we've seen in the previous notebook, we can create a distilabel pipeline for preference dataset generation. The bare minimum pipeline is already provided. You can continue working on this pipeline to generate a large dataset for preference alignment. Swap out models, model providers and generation arguments to see how they affect the quality of the dataset. Experiment small, scale up later.\n",
"\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. You could for example use the [LoadDataFromHub](https://distilabel.argilla.io/latest/components-gallery/steps/loaddatafromhub/) to start with the instruction dataset we generated in the previous notebook.\n",
"Check out the [distilabel components gallery](https://distilabel.argilla.io/latest/components-gallery/) for information about the processing classes and how to use them. \n",
"\n",
"Don't forget to push your dataset to the Hub!"
"An example of loading data from the Hub instead of dictionaries is provided below.\n",
"\n",
"```python\n",
"from datasets import load_dataset\n",
"\n",
"with Pipeline(...) as pipeline:\n",
" ...\n",
"\n",
"if __name__ == \"__main__\":\n",
" dataset = load_dataset(\"my-dataset\", split=\"train\")\n",
" distiset = pipeline.run(dataset=dataset)\n",
"```\n",
"\n",
"Don't forget to push your dataset to the Hub after running the pipeline!"
]
},
{
