Adding Simplified Coding Tasks #645

Merged 28 commits on Oct 11, 2023
Commits
7968ca4
adding simple tasks
mcarbin Sep 22, 2023
0f8b160
Merge branch 'main' into human_eval_simple
bmosaicml Oct 3, 2023
a84dda0
add simple human_eval
bmosaicml Oct 3, 2023
90a2dca
fix yaml
bmosaicml Oct 3, 2023
4c54e3d
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 3, 2023
bcda9ef
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 4, 2023
48060e1
fix yaml
bmosaicml Oct 4, 2023
4eee5c1
Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found…
bmosaicml Oct 4, 2023
9534d73
remove breakpoint
bmosaicml Oct 5, 2023
51d1b72
remove breakpoint
bmosaicml Oct 5, 2023
e11bb34
change bsz
bmosaicml Oct 6, 2023
fb750f3
Merge branch 'main' into human_eval_simple
bmosaicml Oct 9, 2023
b50da7d
merge main
bmosaicml Oct 9, 2023
4b29f92
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 9, 2023
211def8
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 9, 2023
043f473
Merge branch 'main' into mike/human-eval-simple
dakinggg Oct 10, 2023
cdc2065
add udpated readme
bmosaicml Oct 10, 2023
eb09668
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 10, 2023
cdbd3d1
Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found…
bmosaicml Oct 10, 2023
2fd5c16
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 10, 2023
71cd420
fix precommit
bmosaicml Oct 10, 2023
ff68b53
Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found…
bmosaicml Oct 10, 2023
2d17293
restor line
bmosaicml Oct 11, 2023
8b7abf1
restor line
bmosaicml Oct 11, 2023
26b2845
add link to codegeex
bmosaicml Oct 11, 2023
dbcae51
Merge branch 'main' into mike/human-eval-simple
bmosaicml Oct 11, 2023
464d831
restor hf eval
bmosaicml Oct 11, 2023
523fe47
Merge branch 'mike/human-eval-simple' of github.com:mcarbin/llm-found…
bmosaicml Oct 11, 2023
39 changes: 37 additions & 2 deletions scripts/eval/local_data/MODEL_GAUNTLET.md
@@ -257,8 +257,43 @@ Language understanding tasks evaluate the model’s ability to understand the st
### Programming
Programming tasks evaluate the model's ability to understand code, write functionally correct code given a specification, simulate code, and document code. Right now we just have HumanEval but later versions will include more.

35. HumanEval code generation
- Description: HumanEval consists of 164 python programming challenges, in which the model is presented with the method signature and docstring comment for a python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
35. HumanEval Python code generation
- Description: HumanEval Python consists of 164 Python programming challenges, in which the model is presented with the method signature and docstring comment for a Python program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs.
- Year released: 2022
- Number of few shot examples: 0
- Random baseline accuracy: 0%
36. HumanEval C++ code generation
- Description: HumanEval C++ consists of 161 C++ programming challenges, in which the model is presented with the method signature and docstring comment for a C++ program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs. The C++ translation of HumanEval comes from the [CodeGeeX](https://huggingface.co/datasets/THUDM/humaneval-x/viewer/cpp) project.
- Year released: 2022
- Number of few shot examples: 0
- Random baseline accuracy: 0%
37. HumanEval JS code generation
- Description: HumanEval JS consists of 164 JavaScript programming challenges, in which the model is presented with the method signature and docstring comment for a JavaScript program and is expected to complete the program. We then test the resultant code’s functional correctness on a number of test input/output pairs. The JS translation of HumanEval comes from the [CodeGeeX](https://huggingface.co/datasets/THUDM/humaneval-x/viewer/cpp) project.
- Year released: 2022
- Number of few shot examples: 0
- Random baseline accuracy: 0%
38. HumanEval Python 25% code generation
- Description: HumanEval Python 25% is an easier variant of HumanEval Python in which, in addition to the original method signature, the model is also provided with 25% of the lines in the canonical solution and is expected to complete the remainder of the program. It consists of 164 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
39. HumanEval Python 50% code generation
- Description: HumanEval Python 50% is an easier variant of HumanEval Python in which, in addition to the original method signature, the model is also provided with 50% of the lines in the canonical solution and is expected to complete the remainder of the program. It consists of 164 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
40. HumanEval Python 75% code generation
- Description: HumanEval Python 75% is an easier variant of HumanEval Python in which, in addition to the original method signature, the model is also provided with 75% of the lines in the canonical solution and is expected to complete the remainder of the program. It consists of 164 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
41. HumanEval Python simple return statement code generation
- Description: HumanEval Python simple return statement is an easier variant of HumanEval Python in which the model is provided with all of the canonical solution except the return statement and is expected to complete the return statement. Additionally, this set contains only the problems for which the canonical solution has a "simple" return statement consisting only of a line of the form `return VARIABLE_NAME`. There are 37 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
42. HumanEval Python complex return statement code generation
- Description: HumanEval Python complex return statement is an easier variant of HumanEval Python in which the model is provided with all of the canonical solution except the return statement and is expected to complete the return statement. Additionally, this set contains only the problems for which the canonical solution does not have a "simple" return statement as defined above. There are 127 samples.
- Year released: 2023
- Number of few shot examples: 0
- Random baseline accuracy: 0%
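
The simplified variants described above can be illustrated with a minimal sketch. This is not the script used to build the `.jsonl` files in this PR: the helper names `partial_prompt` and `is_simple_return` are hypothetical, and the choices to take the *leading* lines of the canonical solution, to round the line count down, and to treat any bare identifier after `return` as "simple" are assumptions made for illustration.

```python
import re


def partial_prompt(prompt: str, canonical_solution: str, fraction: float) -> str:
    """Append a leading fraction of the canonical solution's lines to the prompt.

    Assumption: the hint is the first `fraction` of the lines, rounded down.
    The README only says that a percentage of the lines is provided.
    """
    lines = canonical_solution.splitlines(keepends=True)
    n_keep = int(len(lines) * fraction)
    return prompt + "".join(lines[:n_keep])


# Assumption: "simple" means the whole line is `return <identifier>`.
_SIMPLE_RETURN = re.compile(r"^\s*return\s+[A-Za-z_][A-Za-z0-9_]*\s*$")


def is_simple_return(line: str) -> bool:
    """Return True when the line has the form `return VARIABLE_NAME`."""
    return bool(_SIMPLE_RETURN.match(line))


# Hypothetical problem, not taken from HumanEval.
example_prompt = "def add(a, b):\n"
example_solution = (
    "    total = a\n"
    "    total += b\n"
    "    total += 0\n"
    "    return total\n"
)
half_prompt = partial_prompt(example_prompt, example_solution, 0.5)
```

With the four-line solution above, a fraction of 0.5 keeps the first two solution lines, and the final line `return total` would be classified as a simple return statement under this definition.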
164 changes: 164 additions & 0 deletions scripts/eval/local_data/programming/human_eval-0.25.jsonl

Large diffs are not rendered by default.

164 changes: 164 additions & 0 deletions scripts/eval/local_data/programming/human_eval-0.5.jsonl

Large diffs are not rendered by default.

164 changes: 164 additions & 0 deletions scripts/eval/local_data/programming/human_eval-0.75.jsonl

Large diffs are not rendered by default.

127 changes: 127 additions & 0 deletions scripts/eval/local_data/programming/human_eval_return_complex.jsonl

Large diffs are not rendered by default.

Large diffs are not rendered by default.

This file was deleted.

44 changes: 42 additions & 2 deletions scripts/eval/yamls/coding_tasks.yaml
@@ -5,22 +5,62 @@ icl_tasks:
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation

-
label: human_eval_cpp
dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_js
dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_simple
dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_complex
dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_25
dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_50
dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_75
dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
15 changes: 15 additions & 0 deletions scripts/eval/yamls/eval_gauntlet.yaml
@@ -123,6 +123,21 @@ eval_gauntlet:
- name: human_eval_js
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_return_simple
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_return_complex
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_25
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_50
num_fewshot: 0
random_baseline: 0.0
- name: human_eval_75
num_fewshot: 0
random_baseline: 0.0
- name: world_knowledge_lm_task_subscore
benchmarks:
- name: jeopardy
44 changes: 42 additions & 2 deletions scripts/eval/yamls/tasks.yaml
@@ -179,21 +179,61 @@ icl_tasks:
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_cpp
dataset_uri: eval/local_data/programming/processed_human_eval_cpp.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
icl_task_type: code_evaluation
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_js
dataset_uri: eval/local_data/programming/processed_human_eval_js.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_simple
dataset_uri: eval/local_data/programming/human_eval_return_simple.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_return_complex
dataset_uri: eval/local_data/programming/human_eval_return_complex.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_25
dataset_uri: eval/local_data/programming/human_eval-0.25.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_50
dataset_uri: eval/local_data/programming/human_eval-0.5.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
-
label: human_eval_75
dataset_uri: eval/local_data/programming/human_eval-0.75.jsonl # ADD YOUR OWN DATASET URI
num_fewshot: [0]
pass_at_k: 1
num_beams: 20
batch_size: 1
icl_task_type: code_evaluation
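
To make the `pass_at_k: 1` setting in these task entries more concrete, here is a small sketch of the unbiased pass@k estimator commonly used with HumanEval, 1 - C(n-c, k) / C(n, k), where n is the number of generations per problem and c is the number that pass the tests. Whether this codebase computes the metric in exactly this way is not shown in the diff, and reading `num_beams: 20` as 20 generations per problem is an assumption; the function name `pass_at_k` below is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: generations sampled for the problem
    c: generations that passed all tests
    k: the k in pass@k
    """
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Assumed setup: 20 generations for one problem, 5 of which pass.
score = pass_at_k(n=20, c=5, k=1)
```

For k = 1 the estimator reduces to c / n, so under the assumed setup the per-problem score is simply the fraction of generations that pass.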