# Evaluation
This repository contains code for evaluating the performance of LLMs in both single-turn and multi-turn scenarios.
To set up the environment, install the required dependencies by running:
```bash
pip install -r evaluation/requirements.txt
```
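If you prefer to keep the dependencies isolated, they can also be installed into a virtual environment first; a minimal sketch using Python's built-in `venv` module (the environment name `evalenv` is arbitrary):
```bash
# Create and activate a throwaway virtual environment (name is arbitrary),
# then install the evaluation dependencies into it.
python3 -m venv evalenv
source evalenv/bin/activate
pip install -r evaluation/requirements.txt
```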
## Single-turn Evaluation
For single-turn evaluation, the processes of inference, post-processing, and result aggregation are separated as follows (the steps can also be chained, as sketched after the list):
1. Execute `bash evaluation/evaluate/scripts/01_gen_single.sh` to generate model results.
2. Perform post-processing on the model output by executing `bash evaluation/evaluate/scripts/02_sanitize_single.sh`.
3. Finally, compute evaluation metrics by executing `bash evaluation/evaluate/scripts/03_eval_single.sh`.
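To run the full single-turn pipeline in one go, a minimal sketch, assuming the scripts are executed from the repository root:
```bash
# Run generation, post-processing, and metric computation back to back;
# `set -e` aborts the pipeline if any step fails.
set -e
bash evaluation/evaluate/scripts/01_gen_single.sh
bash evaluation/evaluate/scripts/02_sanitize_single.sh
bash evaluation/evaluate/scripts/03_eval_single.sh
```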
## Multi-turn Evaluation
### Multi-turn Evaluation with Execution Feedback
Evaluate the performance of the models with execution feedback using the provided scripts:
- For OpenCodeInterpreter:
```bash
bash evaluation/evaluate/scripts/04_execution_feedback_multiround_OpenCodeInterpreter.sh
```
- For OpenAI's GPT Models:
Before proceeding with the evaluation, make sure to implement the `get_predict` function in `chat_with_gpt.py` to enable interaction with the GPT models. Then execute the following script (an invocation sketch, with assumed credential handling, follows below):
```bash
bash evaluation/evaluate/scripts/05_execution_feedback_multiround_gpt.sh
```
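A minimal invocation sketch, assuming your `get_predict` implementation authenticates through the `OPENAI_API_KEY` environment variable (the default convention of the official `openai` Python client); adapt it to however your implementation actually reads credentials:
```bash
# Export the API key (assumed credential mechanism) before launching
# the multi-round execution-feedback evaluation against the GPT models.
export OPENAI_API_KEY="sk-..."  # replace with your own key
bash evaluation/evaluate/scripts/05_execution_feedback_multiround_gpt.sh
```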
### Multi-turn Evaluation with GPT-4 Simulated Human Feedback
Execute either of the following scripts to evaluate the models with simulated human feedback:
- For OpenCodeInterpreter:
```bash
bash evaluation/evaluate/scripts/06_human_feedback_multiround_OpenCodeInterpreter.sh
```
- For Oracle OpenCodeInterpreter:
```bash
bash evaluation/evaluate/scripts/07_human_feedback_multiround_Oracle_OpenCodeInterpreter.sh
```
These scripts facilitate the multi-turn evaluation with simulated human feedback.
This evaluation code is based on [EvalPlus](https://github.com/evalplus/evalplus) and has been adapted for the purposes of this project. We extend our gratitude to the contributors of EvalPlus for their foundational work.