
Long context evals using hugging face hosted datasets #709

Closed

Conversation


@maxisawesome maxisawesome commented Nov 1, 2023

Not ready to merge!

Adds long context eval tasks: naive support for LEval QA tasks as well as generated long context tasks (padded to 2k, 4k, and 8k token context lengths). Both have been uploaded to HF datasets to avoid checking large files into our repo.

LEval:

Supported tasks are in scripts/eval/yamls/leval_tasks.yaml.
Adds very basic HF dataset parsing. I wrote a specific function for LEval tasks. Eventually, the per-dataset parsing that will be required for most arbitrary HF tasks should likely live in Composer. Otherwise, the yaml logic is as follows:

```yaml
label: kv_pairs_middle_4k
dataset_uri: hf://maxisawesome/long_context_eval
num_fewshot: [0]
icl_task_type: question_answering
hf_vars:
  name: kv_pairs
  context_length: 4096
  section: middle
hf_cols:
  inputs: ["context"]
  outputs: ["answer"]
```

  • llm-foundry will strip the hf:// prefix from dataset_uri and load that dataset (here, it will load maxisawesome/long_context_eval).
  • Everything under hf_vars will be passed into the load_dataset func as keyword args.
  • llm-foundry will concatenate the HF dataset columns listed under hf_cols.inputs into the context for the model.
  • llm-foundry will concatenate the HF dataset columns listed under hf_cols.outputs into the expected answer.
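The steps above can be sketched as a couple of small helpers. This is illustrative only, not llm-foundry's actual API: the function names (`strip_hf_prefix`, `parse_row`) and the sample row are made up for the example.

```python
def strip_hf_prefix(dataset_uri: str) -> str:
    """Remove the hf:// scheme to recover the HF hub path."""
    prefix = 'hf://'
    if not dataset_uri.startswith(prefix):
        raise ValueError(f'expected an hf:// URI, got {dataset_uri!r}')
    return dataset_uri[len(prefix):]


def parse_row(row: dict, hf_cols: dict) -> dict:
    """Concatenate the columns named in hf_cols.inputs / hf_cols.outputs."""
    context = ' '.join(str(row[c]) for c in hf_cols['inputs'])
    answer = ' '.join(str(row[c]) for c in hf_cols['outputs'])
    return {'context': context, 'answer': answer}


# hf_vars from the yaml would then be forwarded as keyword args, roughly:
#   datasets.load_dataset(strip_hf_prefix(uri), **hf_vars)
path = strip_hf_prefix('hf://maxisawesome/long_context_eval')
example = parse_row(
    {'context': 'key1: val1 key2: val2', 'answer': 'val2'},
    {'inputs': ['context'], 'outputs': ['answer']},
)
```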

If pivot_col is specified under hf_cols, llm-foundry will treat pivot_col as the main context of each row, inputs as the instruction, and outputs as the desired answer.
(For clarity, LEval has many tasks set up where one row consists of one col of 15 questions, one col of a single document, and one col of 15 answers. The current form of this setup is not the final version, just a temporary working solution.)
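A minimal sketch of how such a row could be expanded, one example per question (the function name and column names are hypothetical; this is not the final implementation described above):

```python
def expand_pivot_row(row: dict, pivot_col: str,
                     input_cols: list, output_cols: list) -> list:
    """Expand one LEval-style row (one document, parallel lists of
    questions and answers) into one example per question."""
    doc = row[pivot_col]
    questions = row[input_cols[0]]
    answers = row[output_cols[0]]
    return [
        {'context': doc, 'instruction': q, 'answer': a}
        for q, a in zip(questions, answers)
    ]


row = {
    'document': 'A long wikipedia article ...',
    'instructions': ['Q1?', 'Q2?'],
    'outputs': ['A1', 'A2'],
}
examples = expand_pivot_row(row, 'document', ['instructions'], ['outputs'])
```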

Previous notes for generated tasks:

  • HotPotQA - 10+ concatenated "documents" (3-4 sentences from a wikipedia article) with a single question at the end. Sometimes this requires two "documents" (multi-hop QA) to properly answer the question. While 1 or 2 docs are needed to answer the question, the rest are randomly chosen "distractor" docs that are not relevant to the question.
  • WikiQA (WikiQA-Altered_Numeric_QA) - Single documents from wikipedia at different context lengths. There are significantly more HTML tags in this dataset. All answers are some sort of number. This needs to be reuploaded to HF.
  • KV_Pairs - Constructed JSONs with key-value pairs. The question is a single key, and the expected answer is the corresponding value listed in the JSON.
  Caveats:
  • Fewshot versions of these tasks should be pregenerated (right now every example is approximately the full context length, so fewshot = 5 for the 4k context length task would be ~20k total tokens).
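The arithmetic behind that caveat is simple: when every example is roughly one full context length, k fewshot examples cost k context lengths of tokens before the evaluated example is even added. A rough sketch (function name illustrative):

```python
def fewshot_tokens(context_length: int, num_fewshot: int) -> int:
    """Approximate token cost of the fewshot examples alone, assuming
    each example fills roughly one full context length."""
    return context_length * num_fewshot


# fewshot = 5 on the 4k task: 5 * 4096 = 20480 tokens, ~20k,
# far beyond the 4k window the task is meant to evaluate.
total = fewshot_tokens(4096, 5)
```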

Generation scripts for these datasets are not included.

@maxisawesome maxisawesome changed the title Long context from hugging face Long context evals using hugging face hosted datasets Nov 1, 2023
Contributor

Hi @maxisawesome!

It might be worth passing in the hugging face variables into the get_icl_task_dataloader function. Maybe add

```python
hf_loading_vars=icl_cfg.get('hf_loading_vars', {}),
hf_parsing_map=icl_cfg.get('hf_parsing_map', {})
```

at line 304 originally, and at line 358 in your new commit. This allows you to pass parameters into Hugging Face's load_dataset function. In particular, this was helpful for specifying which split of the Hugging Face dataset to evaluate, e.g. hf_loading_vars = {'split': 'train'}.
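The pattern suggested above can be shown in isolation: pulling optional HF kwargs out of the ICL task config with safe defaults, then forwarding them. The icl_cfg contents here are a made-up example, and the forwarding comment is a sketch, not Composer's actual call site.

```python
# Hypothetical ICL task config entry with optional HF loading kwargs.
icl_cfg = {
    'label': 'kv_pairs_middle_4k',
    'hf_loading_vars': {'split': 'train'},
}

# .get with a {} default keeps tasks that don't set these keys working.
hf_loading_vars = icl_cfg.get('hf_loading_vars', {})
hf_parsing_map = icl_cfg.get('hf_parsing_map', {})

# These would then be forwarded to get_icl_task_dataloader, and from
# there into datasets.load_dataset(path, **hf_loading_vars).
```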

@maxisawesome
Contributor Author

Outdated. The main content of this branch was merged in #925.
