Skip to content

Latest commit

 

History

History
45 lines (31 loc) · 2.76 KB

README.md

File metadata and controls

45 lines (31 loc) · 2.76 KB

更适合中国宝宝体质的本地化模板

适配windows版ollama的本地api,并增加run_models.bat模板

开始之前

建议使用anaconda虚拟环境。进入anaconda环境后,cd进项目目录,运行pip install -r requirements.txt命令安装环境依赖。

.bat模板默认已存在anaconda环境,并且配好依赖。python版本建议3.10以上,我的环境是3.10.bat模板默认使用anaconda的默认环境base,如果你打算使用别的环境,自行更改模板中的命令:

:: 激活conda环境
call conda activate base

Ollama-MMLU-Pro

This is a modified version of run_gpt4o.py from TIGER-AI-Lab/MMLU-Pro, and it lets you run MMLU-Pro benchmark via the OpenAI Chat Completion API. It's tested on Ollama 和 Llama.cpp, but it should also work with LMStudio, Koboldcpp, Oobabooga with openai extension, etc.

Open In Colab

I kept the testing and scoring method exactly the same as the original script, adding only a few features to simplify running the test and displaying the results. To see the exact changes, compare between mmlu-pro branch against main with git diff:

git diff mmlu-pro..main -- run_openai.py

Usage

Change the config.toml according to your setup.

pip install -r requirements.txt
python run_openai.py

You can also override settings in configuration file with command line flags like --model, ----category, etc. For example, if you specify --model phi3, all the settings from configuration file will be loaded except model. See python run_openai.py -h for more info.

Additional Notes

  • If an answer cannot be extracted from the model's response, the script will randomly assign an answer. It's the same way as the original script.
  • The total score represents the number of correct answers out of the total number of attempts including random guess attempts. This is the score from the original script.
  • "Random Guess Attempts" indicates the total number of random guesses out of the total attempts.
  • "Correct Random Guesses" shows the number of random guesses that were correct out of all the random guesses.
  • "Adjusted Score Without Random Guesses" subtracts all random guesses from the correct answers and the total answers.
  • The last overall score in the table is calculated as: the total number of correct answers across all categories / the total number of all attempts across all categories * 100.
  • All the scores in percentage are rounded numbers.