Create a `GeneratorStep` from a dataset using a helper function #812

plaguss · 2024-07-23T12:28:54Z

Description

This PR simplifies creating a generator step from a dataset in two ways:

1. A helper function, `make_generator_step`, that creates the `GeneratorStep` from an already processed dataset:

Starting from the example by @dvsrepo, this code:

from datasets import load_dataset
from distilabel.steps import LoadDataFromDicts

dataset = load_dataset(
  "DIBT/10k_prompts_ranked",
  split="train"
).filter(
  lambda r: r['avg_rating']>=4 and r['num_responses']>=2
)

# LoadDataFromDicts takes a list of dictionaries, so convert the dataset first
dataset = dataset.to_list()
load_dataset = LoadDataFromDicts(
  name="load_dataset",
  data=dataset[0:500], # during development I used [0:1] to iterate quickly
  output_mappings={"prompt": "instruction"}
)

can be integrated in a pipeline as:

from datasets import load_dataset
from distilabel.pipeline import Pipeline
from distilabel.steps import make_generator_step

dataset = load_dataset(
  "DIBT/10k_prompts_ranked",
  split="train"
).filter(
  lambda r: r['avg_rating']>=4 and r['num_responses']>=2
)

with Pipeline() as pipeline:
    # Build the generator step directly from the dataset
    load_dataset = make_generator_step(
        dataset,
        output_mappings={"prompt": "instruction"}
    )
    # DummyStep is a placeholder for any downstream step; it is not defined in this snippet
    dummy = DummyStep()
    load_dataset >> dummy

distiset = pipeline.run()
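
To make the effect of `output_mappings={"prompt": "instruction"}` concrete, here is a minimal sketch in plain Python of what a helper like this has to do conceptually: normalize the dataset into rows and rename the mapped columns. The functions `to_records` and `apply_output_mappings` are hypothetical names used only for illustration; they are not part of distilabel, and this is not how `make_generator_step` is implemented.

```python
from typing import Any, Dict, List, Optional


def to_records(dataset: Any) -> List[Dict[str, Any]]:
    """Normalize a dataset-like object into a list of row dictionaries.

    Hypothetical helper for illustration only.
    """
    if isinstance(dataset, list):
        return dataset
    # Duck-typed: `datasets.Dataset` exposes `to_list()`.
    if hasattr(dataset, "to_list"):
        return dataset.to_list()
    raise TypeError(f"Unsupported dataset type: {type(dataset).__name__}")


def apply_output_mappings(
    rows: List[Dict[str, Any]],
    output_mappings: Optional[Dict[str, str]] = None,
) -> List[Dict[str, Any]]:
    """Rename columns according to `output_mappings`; other columns are kept as-is."""
    mappings = output_mappings or {}
    return [{mappings.get(key, key): value for key, value in row.items()} for row in rows]


rows = [{"prompt": "What is 2 + 2?", "avg_rating": 4.5}]
renamed = apply_output_mappings(to_records(rows), {"prompt": "instruction"})
print(renamed)  # [{'instruction': 'What is 2 + 2?', 'avg_rating': 4.5}]
```

Downstream steps then see an `instruction` column instead of `prompt`, which is what tasks such as `TextGeneration` are fed in the examples here.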

New entry in the docs:

2. Pass the dataset via `pipeline.run(dataset=...)`

The generator step is then created internally. This is less flexible than the helper function, but more direct and easier when you don't need that flexibility:

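from datasets import load_dataset
from distilabel.pipeline import Pipeline

dataset = load_dataset(
  "DIBT/10k_prompts_ranked",
  split="train"
).filter(
  lambda r: r['avg_rating']>=4 and r['num_responses']>=2
)

with Pipeline() as pipeline:
    # DummyStep is a placeholder for any step; it is not defined in this snippet
    dummy = DummyStep()

# The generator step that feeds `dummy` is created internally from `dataset`
distiset = pipeline.run(dataset=dataset)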
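
As a rough mental model of "creating the step internally", the sketch below treats a pipeline as a plain adjacency mapping and prepends a generator node that feeds every step without a predecessor. The names `find_root_steps`, `insert_root_step` and `load_data_from_dataset` are made up for this illustration; the actual logic lives in the pipeline classes and may differ.

```python
from typing import Dict, List


def find_root_steps(graph: Dict[str, List[str]]) -> List[str]:
    """Return the steps that no other step points to (no incoming edges)."""
    targets = {succ for successors in graph.values() for succ in successors}
    return [step for step in graph if step not in targets]


def insert_root_step(graph: Dict[str, List[str]], name: str) -> Dict[str, List[str]]:
    """Return a copy of `graph` with a new step `name` connected to every former root."""
    if name in graph:
        raise ValueError(f"Step '{name}' already exists in the graph")
    new_graph = {name: find_root_steps(graph)}
    new_graph.update({step: list(successors) for step, successors in graph.items()})
    return new_graph


# Toy pipeline: two generation steps feed one combining step.
graph = {
    "generate_70b": ["combine"],
    "generate_8b": ["combine"],
    "combine": ["ultrafeedback"],
    "ultrafeedback": [],
}
with_loader = insert_root_step(graph, "load_data_from_dataset")
print(with_loader["load_data_from_dataset"])  # ['generate_70b', 'generate_8b']
```

After the insertion the new node is the only root, so every previously unfed step receives the rows of the dataset.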

Example in the docs with the new functionality:

A new example, using the new simplifications, will appear in the quickstart (the original pipeline can be compared here):

@dvsrepo, the pipeline from this blog post of yours can now be written in fewer lines:

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    GroupColumns,
    KeepColumns,
-    LoadDataFromDicts
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback

from datasets import load_dataset

dataset = load_dataset(
    "DIBT/10k_prompts_ranked",
    split="train"
).filter(
    lambda r: r['avg_rating']>=4 and r['num_responses']>=2
+).select(range(100))
- dataset = dataset.to_list()

with Pipeline(
-    name="prefs-with-llama-3",
    description="Pipeline for building preference datasets using Llama 3",
) as pipeline:
-    load_dataset = LoadDataFromDicts(
-        name="load_dataset",
-        data=dataset[0:100],
-        output_mappings={"prompt": "instruction"}
-    )
    generate_with_llama3_70B = TextGeneration(
-        name="generate_with_llama3",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
        ),
    )

    generate_with_llama3_8B = TextGeneration(
-        name="generate_with_llama3_8B",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-8B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
        ),
    )

    combine_columns = GroupColumns(
-      name="combine_columns",
      columns=["generation", "model_name"],
      output_columns=["generations", "generation_models"],
    )

    ultrafeedback = UltraFeedback(
-      name="ultrafeedback",
      aspect='overall-rating',
      llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
        ),
    )
    keep_columns = KeepColumns(
-        name="keep_columns",
        columns=[
            "instruction",
            "generations",
            "generation_models",
            "ratings",
            "rationales",
        ],
    )

-    generate_with_llama3_70B.connect(combine_columns)
-    generate_with_llama3_8B.connect(combine_columns)
-    load_dataset.connect(generate_with_llama3_70B)
-    load_dataset.connect(generate_with_llama3_8B)
-    combine_columns.connect(ultrafeedback)
-    ultrafeedback.connect(keep_columns)
+    (
+        [generate_with_llama3_70B, generate_with_llama3_8B]
+        >> combine_columns
+        >> ultrafeedback
+        >> keep_columns
+    )


if __name__ == "__main__":
    distiset = pipeline.run(
+       dataset=dataset,
        parameters={
-            "load_dataset": {
-                "repo_id": "distilabel-internal-testing/instruction-dataset-mini",
-                "split": "test",
-            },
            "generate_with_llama3": {
                "llm": {
                    "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.7, "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"]}
                }
            },
            "generate_with_llama3_8B": {
                "llm": {
                    "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.7, "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"]}
                }
            },
            "ultrafeedback": {
                "llm": {
                    "generation_kwargs": {"max_new_tokens": 1024, "temperature": 0.1, "stop_sequences": ["<|eot_id|>", "<|end_of_text|>"]}
                }
            },
        }
    )
    distiset.push_to_hub(
        "dvilasuero/distillama3-prompts10k",
+        include_script=True
    )
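
For readers unfamiliar with the `[step_a, step_b] >> step_c` form used in the diff above: a list on the left-hand side of `>>` works in Python because, when `list` does not implement the operator for that operand, the interpreter falls back to the right operand's `__rrshift__`. The toy `Step` class below is an assumption-laden sketch of that mechanism only; it is not distilabel's `Step` class.

```python
from typing import List


class Step:
    """Toy step that only records which steps it feeds (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.successors: List["Step"] = []

    def connect(self, other: "Step") -> None:
        self.successors.append(other)

    def __rshift__(self, other: "Step") -> "Step":
        # `self >> other`: connect and return `other` so calls can be chained.
        self.connect(other)
        return other

    def __rrshift__(self, others: List["Step"]) -> "Step":
        # `[a, b] >> self`: `list` has no `>>` for Step, so Python calls this method.
        for step in others:
            step.connect(self)
        return self


gen_70b, gen_8b = Step("gen_70b"), Step("gen_8b")
combine, rate, keep = Step("combine"), Step("rate"), Step("keep")

# Same shape as the chained expression in the diff.
last = [gen_70b, gen_8b] >> combine >> rate >> keep

print([s.name for s in gen_70b.successors])  # ['combine']
print(last.name)  # keep
```

The six explicit `.connect(...)` calls removed in the diff collapse into one such expression, minus the two `load_dataset` connections, which are no longer written by hand once the dataset is passed to `pipeline.run`.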

github-actions · 2024-07-23T12:30:18Z

Documentation for this PR has been built. You can view it at: https://distilabel.argilla.io/pr-812/

codspeed-hq · 2024-07-23T12:37:09Z

CodSpeed Performance Report

Merging #812 will not alter performance

Comparing pipeline-with-dataset (64e4ff2) with develop (25601bb)

Summary

✅ 1 untouched benchmarks

gabrielmbmb

LGTM! Just a few comments on the quickstart in the docs and on the docstrings, plus some suggestions for the code.

docs/sections/getting_started/quickstart.md

gabrielmbmb · 2024-07-24T12:03:58Z

I think we could keep the new example you've added here, @plaguss, as it's the quickest way to get started. WDYT @dvsrepo @davidberenstein1957?

src/distilabel/pipeline/base.py

src/distilabel/pipeline/ray.py

src/distilabel/steps/__init__.py

src/distilabel/steps/generators/utils.py

Co-authored-by: Gabriel Martín Blázquez

plaguss added 4 commits July 23, 2024 14:17

Add helper function to create generator step from dataset

1a71823

Add integration tests for make_generator_step

911c9e2

Redirect import

2928845

Update LoadDataFromHub to not call load if a dataset is already defined

484c568

plaguss added the enhancement New feature or request label Jul 23, 2024

plaguss added this to the 1.3.0 milestone Jul 23, 2024

plaguss requested review from dvsrepo and gabrielmbmb July 23, 2024 12:28

plaguss self-assigned this Jul 23, 2024

plaguss added 3 commits July 23, 2024 16:26

Update docs

fcab8b6

Add unit tests for the new helper function

7b4a15a

Update filename to utils

3c046e5

plaguss changed the title ~~Pass a dataset to a~~ Create a GeneratorStep from a dataset using a helper function Jul 23, 2024

plaguss added 9 commits July 24, 2024 12:59

Add helper method to insert a root step

1eb7e9c

Add logic to create a generator step internally from a dataset

608da33

Pass the dataset variable from all the pipeline implementations

2d4aa49

Add type for the input datasets

50cbac5

Avoid circular imports

b734fe8

Add test for pipelines with generator step and dataset

0c32758

Add integration tests for dataset passed via run method

941d8bb

Fix error evaluation dataframe

47906cc

Add example on quickstart and entry on how to guide

4fe0b34

plaguss marked this pull request as ready for review July 24, 2024 11:57

plaguss linked an issue Jul 24, 2024 that may be closed by this pull request

[FEATURE] Create a direct way of passing a dataset as part of a pipeline #779

Closed

gabrielmbmb approved these changes Jul 24, 2024

plaguss and others added 3 commits July 24, 2024 14:15

Update docs/sections/getting_started/quickstart.md

4c414fc

Co-authored-by: Gabriel Martín Blázquez

Update docs/sections/getting_started/quickstart.md

28435b2

Co-authored-by: Gabriel Martín Blázquez

Update src/distilabel/pipeline/base.py

0c32127

Co-authored-by: Gabriel Martín Blázquez

plaguss and others added 7 commits July 24, 2024 14:16

Update src/distilabel/pipeline/ray.py

1acc0e1

Co-authored-by: Gabriel Martín Blázquez

Update src/distilabel/steps/generators/utils.py

dd6d427

Co-authored-by: Gabriel Martín Blázquez

Update src/distilabel/steps/generators/utils.py

da72207

Co-authored-by: Gabriel Martín Blázquez

Update src/distilabel/pipeline/local.py

5a23778

Co-authored-by: Gabriel Martín Blázquez

Respect import order

da3deed

Move functionality to a proper internal method

dab2662

Run linter

d7b4768

plaguss mentioned this pull request Jul 25, 2024

Add default name for a pipeline #809

Merged

Merge branch 'develop' into pipeline-with-dataset

ae85770

MandyLoc60 mentioned this pull request Jul 26, 2024

[BUG] #831

Closed

gabrielmbmb added 2 commits July 29, 2024 09:14

Merge branch 'develop' into pipeline-with-dataset

11c1412

Fix format

64e4ff2

gabrielmbmb force-pushed the pipeline-with-dataset branch from 710337c to 64e4ff2 July 29, 2024 07:38

gabrielmbmb merged commit fc7e82e into develop Jul 29, 2024
8 of 13 checks passed

gabrielmbmb deleted the pipeline-with-dataset branch July 29, 2024 08:09
