Example files generated for the `dev` split using `naver_splade-cocondenser-selfdistil` as a retriever:
- `config.yaml`: The parameters of the experiment in YAML format.
- `eval_dev_generation_time.json`: The generation time in JSON format.
- `eval_dev_metrics.json`: Generation evaluation metrics in JSON format.
- `eval_dev_out.json`: Output of the generation; contains `q_id` (str), `response` (str) the generated response, `label` (list(str)) the reference answers (multiple possible), `instruction` (str) the instruction given to the generator, and `ranking_label` (list(list(str)), optional) the ids of the reference paragraphs (again, multiple references possible).
- `run.retrieve.top_5.kilt_nq.dev.naver_splade-cocondenser-selfdistil.trec`: The retrieval run in `trec` format.
- `eval_dev_ranking_metrics.json`: Retrieval evaluation metrics in JSON format.
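For quick inspection, here is a minimal Python sketch for reading the generation output. The experiment folder name below is a placeholder, and the per-sample record layout is assumed from the field list above; adjust it if the actual file structure differs.

```python
import json

# Example path: replace "my_run" with your experiment folder under experiments/.
path = "experiments/my_run/eval_dev_out.json"

with open(path) as f:
    data = json.load(f)

# Assuming one record per generated sample (inspect `data` if the layout differs).
for record in data[:3]:
    print(record["q_id"])               # question id (str)
    print(record["response"])           # generated response (str)
    print(record["label"])              # reference answers (list of str)
    print(record["instruction"])        # instruction given to the generator (str)
    print(record.get("ranking_label"))  # optional: ids of reference paragraphs (list of list of str)
```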
Non-neural metrics are calculated automatically. Neural metrics such as BEM and LLM-based evaluation need to be invoked separately.
By default `eval.py` will scan all folders in `experiments/` and evaluate them sequentially. To evaluate a single folder, pass it using `--folder`. To avoid running out of memory, either run BEM using `--bem` or run the LLM evaluation using `--llm`. A CSV file containing the results table will automatically be saved to `results/`.
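For example, to evaluate a single experiment folder and compute BEM on its own (a sketch based on the flags described above; the folder name is a placeholder):

```bash
# Evaluate only one experiment folder (example path) and run BEM separately
python3 eval.py --folder experiments/my_run --bem --split 'dev'
```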
When using `--llm` you have a choice of how LLM predictions are transformed into the final score:
- directly check the generated answer for an occurrence of the expected labels (default Yes/No) and assign the corresponding score (default 1/0); when no expected label is found, or more than one expected label is matched, the sample is assigned a score of -100 and excluded from the mean score computation;
- rely on the logits assigned to the first token (not available with vllm generators): take the values corresponding to the expected labels and normalize them to obtain a probability distribution over the possible labels, $p(label)$; the final score is then $\sum_{label} score(label) \cdot p(label)$.

The choice of score interpretation is controlled by the `use_logits` parameter in the evaluation config file. The default value is `True` (corresponding to the second option).
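As an illustration only, here is a sketch of the two interpretations described above (not the actual implementation in `eval.py`), assuming the default Yes/No labels with scores 1/0:

```python
import re

LABEL_SCORES = {"Yes": 1.0, "No": 0.0}  # default labels and scores; configurable in practice

def score_from_text(generated: str) -> float:
    """Option 1: match the expected labels in the generated answer.
    Returns -100 if no label, or more than one label, is matched;
    such samples are excluded from the mean score."""
    matched = [lbl for lbl in LABEL_SCORES
               if re.search(rf"\b{lbl}\b", generated, re.IGNORECASE)]
    if len(matched) != 1:
        return -100.0
    return LABEL_SCORES[matched[0]]

def score_from_logits(label_probs: dict) -> float:
    """Option 2 (use_logits=True): normalize the first-token probabilities of the
    expected labels and return the sum over labels of score(label) * p(label).
    Assumes at least one expected label has nonzero probability."""
    total = sum(label_probs.get(lbl, 0.0) for lbl in LABEL_SCORES)
    return sum(LABEL_SCORES[lbl] * label_probs.get(lbl, 0.0) / total
               for lbl in LABEL_SCORES)
```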
```bash
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
```
Similarly to `--generator`, you can specify which LLM you want to use as the first option of `--llm`, as well as a short name used in the metric naming (use the name of the configuration file as the name of the LLM).
```bash
# use llama2-7b-chat to run evaluation, output metric will be named VLLMeval_l2_7b
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "vllm_llama-2-7b-chat" "l2_7b"
# use tinyllama to run evaluation, output metric will be named LLMeval_tinyllama
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm "tinyllama-chat" "tinyllama"
# in default settings (with no arguments specified) we use SOLAR-107B for evaluation and the output metric is named LLMeval
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm
```
You can specify the prompt and other parameters for `--llm` in an evaluation config file in the `config/evaluator` directory. By default, evaluation relies on the `default_qa.yaml` configuration, which assigns a binary (Yes/No) value to each Question/Response/Gold Response triple. You can specify finer-grained options and the prompt (aka the rubric section); see an example of a more fine-grained configuration at `config/evaluator/default_multi_qa.yaml`.
```bash
python3 eval.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm --llm_prompt default_multi_qa
```
If you have a local Ollama server running, you can call models installed on this server as follows:
```bash
python3 eval.py --experiments_folder experiments/ --llm_ollama "phi3:latest" --ollama_url "http://localhost:11434" --llm_prompt default_multi_qa
```