C-Eval performance and script
This project evaluates the relevant models on the recently released C-Eval benchmark. Its test set consists of 12.3K multiple-choice questions covering 52 subjects. Below are the validation and test set evaluation results (averages) for some of the models; for the complete results, please refer to our technical report.
| Model | Valid (zero-shot) | Valid (5-shot) | Test (zero-shot) | Test (5-shot) |
|---|---|---|---|---|
| Chinese-Alpaca-33B | 43.3 | 42.6 | 41.6 | 40.4 |
| Chinese-LLaMA-33B | 34.9 | 38.4 | 34.6 | 39.5 |
| Chinese-Alpaca-Plus-13B | 43.3 | 42.4 | 41.5 | 39.9 |
| Chinese-LLaMA-Plus-13B | 27.3 | 34.0 | 27.8 | 33.3 |
| Chinese-Alpaca-Plus-7B | 36.7 | 32.9 | 36.4 | 32.3 |
| Chinese-LLaMA-Plus-7B | 27.3 | 28.3 | 26.9 | 28.4 |
In the following, we describe the prediction workflow for the C-Eval dataset. Users can also refer to our Colab Notebook:
Download the dataset from the path specified by the official C-Eval, and unzip the file to the `data` folder:

```bash
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
unzip ceval-exam.zip -d data
```
Move `data` to the `scripts/ceval` directory of this project.
Run the following script:

```bash
model_path=path/to/chinese_llama_or_alpaca
output_path=path/to/your_output_dir

cd scripts/ceval
python eval.py \
    --model_path ${model_path} \
    --cot False \
    --few_shot False \
    --with_prompt True \
    --constrained_decoding True \
    --temperature 0.2 \
    --n_times 1 \
    --ntrain 5 \
    --do_save_csv False \
    --do_test False \
    --output_dir ${output_path}
```
Parameter descriptions:

- `model_path`: Path to the model to be evaluated (a model merged with the LoRA weights, in HF format)
- `cot`: Whether to use chain-of-thought prompting
- `few_shot`: Whether to use few-shot in-context examples
- `ntrain`: The number of few-shot demos when `few_shot=True` (5-shot: `ntrain=5`); has no effect when `few_shot=False`
- `with_prompt`: Whether the input to the model contains the instruction prompt for Alpaca models
- `constrained_decoding`: Since the standard answer format for C-Eval is a single option letter ('A'/'B'/'C'/'D'), we provide two methods for extracting answers from model outputs (see the sketch after this list):
  - `constrained_decoding=True`: Compute the probability that the first token generated by the model is 'A', 'B', 'C', or 'D', and choose the one with the highest probability as the answer
  - `constrained_decoding=False`: Extract the answer token from the model's output with regular expressions
- `temperature`: Temperature for decoding
- `n_times`: The number of repeated evaluations; a corresponding number of folders will be generated under `output_dir`
- `do_save_csv`: Whether to save the model outputs, extracted answers, etc. to CSV files
- `output_dir`: Output path for the results
- `do_test`: Whether to evaluate on the valid or test set: evaluate on the valid set when `do_test=False` and on the test set when `do_test=True`
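
To make the two extraction modes concrete, here is a minimal sketch, assuming a merged HF-format model; the function names and tokenization details are hypothetical, and the authoritative implementation is `scripts/ceval/eval.py`:

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path for illustration; use your merged HF-format model
model_path = "path/to/chinese_llama_or_alpaca"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

OPTIONS = ["A", "B", "C", "D"]

def answer_by_constrained_decoding(prompt: str) -> str:
    """constrained_decoding=True: compare the probabilities of 'A'/'B'/'C'/'D'
    for the first generated token and return the most probable option."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Take the last id in case the letter is tokenized with a leading marker
    option_ids = [tokenizer.encode(o, add_special_tokens=False)[-1] for o in OPTIONS]
    probs = torch.softmax(next_token_logits[option_ids], dim=-1)
    return OPTIONS[int(probs.argmax())]

def answer_by_regex(generated_text: str) -> str:
    """constrained_decoding=False: extract the first standalone option letter
    from the model's free-form output with a regular expression."""
    match = re.search(r"\b([ABCD])\b", generated_text)
    return match.group(1) if match else ""
```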
When it finishes, the evaluation script creates the directories `outputs/take*`, where `*` is a number ranging from 0 to `n_times-1`, storing the results of the `n_times` repeated evaluations respectively.

Each `outputs/take*` contains a `submission.json` and a `summary.json`. If `do_save_csv=True`, there will also be 52 CSV files containing the model outputs, extracted answers, etc. for each subject.
`submission.json` stores the generated answers in the official submission format and can be submitted for evaluation:
```json
{
    "computer_network": {
        "0": "A",
        "1": "B",
        ...
    },
    "marxism": {
        "0": "B",
        "1": "A",
        ...
    },
    ...
}
```
`summary.json` stores the model evaluation results for the 52 subjects, the 4 broader categories, and an overall average. For instance, the 'All' key at the end of the JSON file shows the overall average score: `"All": {"score": 0.36701337295690933, "num": 1346, "correct": 494.0}`, where `score` is the overall accuracy, `num` is the total number of evaluation examples, and `correct` is the number of correct predictions.

When evaluating on the test set (`do_test=True`), `score` and `correct` are 0, since no labels are available. Obtaining test set results requires submitting the `submission.json` file to the official C-Eval; for detailed instructions, please refer to the official submission process provided by C-Eval.
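
Since the overall accuracy is simply `correct / num`, a short snippet like the following can be used to inspect a finished run; the file path is an assumption based on the `outputs/take*` layout described above:

```python
import json

# Read the summary of the first evaluation run (take0)
with open("outputs/take0/summary.json") as f:
    summary = json.load(f)

overall = summary["All"]
print(f"accuracy = {overall['correct']:.0f}/{overall['num']} = {overall['score']:.4f}")
# For the example above: 494/1346 ≈ 0.3670
```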