Create a GeneratorStep from a dataset using a helper function #812

Merged: 29 commits merged into develop from pipeline-with-dataset on Jul 29, 2024

@plaguss plaguss commented Jul 23, 2024

Description

This PR simplifies creating a generator step from a dataset:

1. A helper function that creates the GeneratorStep from an already processed dataset:

Taking the example by @dvsrepo, the following code:

from datasets import load_dataset
from distilabel.steps import LoadDataFromDicts

dataset = load_dataset(
  "DIBT/10k_prompts_ranked",
  split="train"
).filter(
  lambda r: r['avg_rating']>=4 and r['num_responses']>=2
)

dataset = dataset.to_list()
load_dataset = LoadDataFromDicts(
  name="load_dataset",
  data=dataset[0:500], # during development I used [0:1] to iterate quickly
  output_mappings={"prompt": "instruction"}
)

can be integrated in a pipeline as:

from datasets import load_dataset
from distilabel.pipeline import Pipeline
from distilabel.steps import make_generator_step

dataset = load_dataset(
  "DIBT/10k_prompts_ranked",
  split="train"
).filter(
  lambda r: r['avg_rating']>=4 and r['num_responses']>=2
)

with Pipeline() as pipeline:
    load_dataset = make_generator_step(
        dataset,
        output_mappings={"prompt": "instruction"}
    )
    dummy = DummyStep()  # placeholder for any downstream step or task
    load_dataset >> dummy

distiset = pipeline.run()
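
To make the effect of the filter and of `output_mappings={"prompt": "instruction"}` concrete, here is a minimal plain-Python sketch. It does not use distilabel or `datasets`: `keep_row`, `rename_columns` and the sample rows are hypothetical names and data, chosen only to illustrate that downstream steps would see an `instruction` column instead of `prompt`.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def keep_row(row: Row) -> bool:
    """Same predicate as the `.filter(...)` call in the snippet above."""
    return row["avg_rating"] >= 4 and row["num_responses"] >= 2


def rename_columns(rows: List[Row], mappings: Dict[str, str]) -> List[Row]:
    """Rename keys in every row; keys missing from `mappings` are kept as-is."""
    return [
        {mappings.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical rows shaped like the columns referenced above.
rows = [
    {"prompt": "Explain recursion.", "avg_rating": 4.5, "num_responses": 3},
    {"prompt": "Say hi.", "avg_rating": 2.0, "num_responses": 5},
    {"prompt": "Summarise this text.", "avg_rating": 5.0, "num_responses": 1},
]

filtered = [row for row in rows if keep_row(row)]
mapped = rename_columns(filtered, {"prompt": "instruction"})
print(mapped)
```

Only the first row passes both conditions, and its `prompt` key is renamed to `instruction`.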

New entry in the docs:

image

2. Pass the dataset via pipeline.run(dataset=....)

Internally, the pipeline creates the generator step for you. This is less flexible than the helper function, but more direct and easier when you don't need that flexibility:

from datasets import load_dataset
from distilabel.pipeline import Pipeline

dataset = load_dataset(
  "DIBT/10k_prompts_ranked",
  split="train"
).filter(
  lambda r: r['avg_rating']>=4 and r['num_responses']>=2
)

with Pipeline() as pipeline:
    dummy = DummyStep()  # placeholder for any downstream step or task

distiset = pipeline.run(dataset=dataset)
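
The idea of passing the data at run time can be sketched with a toy pipeline. This is not distilabel's implementation: `ToyStep` and `ToyPipeline` are hypothetical classes that only show the control flow, where `run(dataset=...)` supplies the rows that a generator step would otherwise have produced.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


class ToyStep:
    """Stand-in for a pipeline step; it tags each row with its own name."""

    def __init__(self, name: str) -> None:
        self.name = name

    def process(self, rows: List[Row]) -> List[Row]:
        return [
            {**row, "seen_by": row.get("seen_by", []) + [self.name]}
            for row in rows
        ]


class ToyPipeline:
    """Toy linear pipeline (an illustration, not distilabel's Pipeline)."""

    def __init__(self) -> None:
        self.steps: List[ToyStep] = []
        self.source: Optional[List[Row]] = None  # rows from a "generator"

    def add(self, step: ToyStep) -> ToyStep:
        self.steps.append(step)
        return step

    def run(self, dataset: Optional[List[Row]] = None) -> List[Row]:
        # A dataset passed here plays the role of the generator step.
        if dataset is not None:
            self.source = list(dataset)
        if self.source is None:
            raise ValueError("No generator step and no dataset were provided.")
        rows = self.source
        for step in self.steps:
            rows = step.process(rows)
        return rows


pipeline = ToyPipeline()
pipeline.add(ToyStep("dummy"))
result = pipeline.run(dataset=[{"instruction": "Explain recursion."}])
print(result)
```

The trade-off mentioned above shows up here too: the caller cannot configure the implicit source (for example, with column mappings), which is what the helper function is for.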

Example in the docs with the new functionality:

image

A new example that will appear in the quick start with the new simplifications (the original pipeline can be compared here):

image


The pipeline from your blog post, @dvsrepo, could now be written in fewer lines:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    GroupColumns,
    KeepColumns,
-    LoadDataFromDicts
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

from datasets import load_dataset

dataset = load_dataset(
    "DIBT/10k_prompts_ranked",
    split="train"
).filter(
    lambda r: r['avg_rating']>=4 and r['num_responses']>=2
+).select(range(100))
- dataset = dataset.to_list()

with Pipeline(
-    name="prefs-with-llama-3",
    description="Pipeline for building preference datasets using Llama 3",
) as pipeline:
-    load_dataset = LoadDataFromDicts(
-        name="load_dataset",
-        data=dataset[0:100],
-        output_mappings={"prompt": "instruction"}
-    )
    generate_with_llama3_70B = TextGeneration(
-        name="generate_with_llama3",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
        ),
    )

    generate_with_llama3_8B = TextGeneration(
-        name="generate_with_llama3_8B",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
        ),
    )

    combine_columns = GroupColumns(
-      name="combine_columns",
      columns=["generation", "model_name"],
      output_columns=["generations", "generation_models"],
    )

    ultrafeedback = UltraFeedback(
-      name="ultrafeedback",
      aspect='overall-rating',
      llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
    )
    keep_columns = KeepColumns(
-        name="keep_columns",
        columns=[
            "instruction",
            "generations",
            "generation_models",
            "ratings",
            "rationales",
        ],
    )

-    generate_with_llama3_70B.connect(combine_columns)
-    generate_with_llama3_8B.connect(combine_columns)
-    load_dataset.connect(generate_with_llama3_70B)
-    load_dataset.connect(generate_with_llama3_8B)
-    combine_columns.connect(ultrafeedback)
-    ultrafeedback.connect(keep_columns)
+    (
+        [generate_with_llama3_70B, generate_with_llama3_8B]
+        >> combine_columns
+        >> ultrafeedback
+        >> keep_columns
+    )


if __name__ == "__main__":
    distiset = pipeline.run(
+       dataset=dataset,
        parameters={
-            "load_dataset": {
-                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
-                "split": "test",
-            },
            "generate_with_llama3": {
                "llm": {
                    "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.7, "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"]}
                }
            },
            "generate_with_llama3_8B": {
                "llm": {
                    "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.7, "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"]}
                }
            },
            "ultrafeedback": {
                "llm": {
                    "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.1, "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"]}
                }
            },
        }
    )
    distiset.push_to_hub(
        "dvilasuero/distillama3-prompts10k",
+        include_script=True
    )
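
The diff replaces the `.connect(...)` calls with `>>`, including a list on the left-hand side. A minimal sketch of how such syntax can be supported in Python is shown below. `Node` and `EDGES` are hypothetical and are not distilabel's classes. The sketch relies only on the standard operator protocol: `list` does not implement `>>`, so Python falls back to the right operand's `__rrshift__`.

```python
from typing import List, Tuple

# Registry of (source, target) connections, filled in as `>>` is evaluated.
EDGES: List[Tuple[str, str]] = []


class Node:
    """Toy step that records connections made with the `>>` operator."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __rshift__(self, other: "Node") -> "Node":
        # Handles `node >> other`; returning `other` allows chaining.
        EDGES.append((self.name, other.name))
        return other

    def __rrshift__(self, others: List["Node"]) -> "Node":
        # Handles `[a, b] >> node`: every node in the list connects to `self`.
        for source in others:
            EDGES.append((source.name, self.name))
        return self


gen_70b = Node("generate_with_llama3_70B")
gen_8b = Node("generate_with_llama3_8B")
combine = Node("combine_columns")
judge = Node("ultrafeedback")
keep = Node("keep_columns")

# Evaluated left to right, mirroring the chained expression in the diff.
[gen_70b, gen_8b] >> combine >> judge >> keep
print(EDGES)
```

Both generation nodes end up connected to `combine_columns`, and the rest of the chain is linked pairwise, which is the same graph the removed `.connect(...)` calls described for these steps.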

@plaguss plaguss added the enhancement New feature or request label Jul 23, 2024
@plaguss plaguss added this to the 1.3.0 milestone Jul 23, 2024
@plaguss plaguss requested review from dvsrepo and gabrielmbmb July 23, 2024 12:28
@plaguss plaguss self-assigned this Jul 23, 2024
Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-812/

codspeed-hq bot commented Jul 23, 2024

CodSpeed Performance Report

Merging #812 will not alter performance

Comparing pipeline-with-dataset (64e4ff2) with develop (25601bb)

Summary

✅ 1 untouched benchmarks

@plaguss plaguss changed the title from "Pass a dataset to a" to "Create a GeneratorStep from a dataset using a helper function" Jul 23, 2024
@plaguss plaguss marked this pull request as ready for review July 24, 2024 11:57

@gabrielmbmb gabrielmbmb left a comment

LGTM! Just some comments regarding the quickstart in the docs, the docstrings, and some suggestions for the code.

docs/sections/getting_started/quickstart.md (outdated, resolved)
docs/sections/getting_started/quickstart.md (outdated, resolved)

I think that we could leave here the new example that you've added @plaguss as it's the quicker way to get started. WDYT @dvsrepo @davidberenstein1957 ?

src/distilabel/pipeline/base.py (outdated, resolved)
src/distilabel/pipeline/base.py (outdated, resolved)
src/distilabel/pipeline/ray.py (outdated, resolved)
src/distilabel/pipeline/ray.py (resolved)
src/distilabel/steps/__init__.py (outdated, resolved)
src/distilabel/steps/generators/utils.py (outdated, resolved)
src/distilabel/steps/generators/utils.py (outdated, resolved)
plaguss and others added 3 commits July 24, 2024 14:15
Co-authored-by: Gabriel Martín Blázquez <[email protected]>
@MandyLoc60 MandyLoc60 mentioned this pull request Jul 26, 2024
@gabrielmbmb gabrielmbmb force-pushed the pipeline-with-dataset branch from 710337c to 64e4ff2 Compare July 29, 2024 07:38
@gabrielmbmb gabrielmbmb merged commit fc7e82e into develop Jul 29, 2024
8 of 13 checks passed
@gabrielmbmb gabrielmbmb deleted the pipeline-with-dataset branch July 29, 2024 08:09
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

[FEATURE] Create a direct way of passing a dataset as part of a pipeline
3 participants