	---
	license: llama3
	library_name: transformers
	datasets:
	- aqua_rat
	- microsoft/orca-math-word-problems-200k
	- m-a-p/CodeFeedback-Filtered-Instruction
	model-index:
	- name: Smaug-Llama-3-70B-Instruct-32K
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: IFEval (0-Shot)
	      type: HuggingFaceH4/ifeval
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: inst_level_strict_acc and prompt_level_strict_acc
	      value: 77.61
	      name: strict accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: BBH (3-Shot)
	      type: BBH
	      args:
	        num_few_shot: 3
	    metrics:
	    - type: acc_norm
	      value: 49.07
	      name: normalized accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MATH Lvl 5 (4-Shot)
	      type: hendrycks/competition_math
	      args:
	        num_few_shot: 4
	    metrics:
	    - type: exact_match
	      value: 21.22
	      name: exact match
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: GPQA (0-shot)
	      type: Idavidrein/gpqa
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 6.15
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MuSR (0-shot)
	      type: TAUR-Lab/MuSR
	      args:
	        num_few_shot: 0
	    metrics:
	    - type: acc_norm
	      value: 12.43
	      name: acc_norm
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
	      name: Open LLM Leaderboard
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: MMLU-PRO (5-shot)
	      type: TIGER-Lab/MMLU-Pro
	      config: main
	      split: test
	      args:
	        num_few_shot: 5
	    metrics:
	    - type: acc
	      value: 41.83
	      name: accuracy
	    source:
	      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=abacusai/Smaug-Llama-3-70B-Instruct-32K
	      name: Open LLM Leaderboard
	---

	# Smaug-Llama-3-70B-Instruct-32K

	### Built with Meta Llama 3

	This is a 32K version of Smaug-Llama-3-70B-Instruct. It uses PoSE (https://arxiv.org/abs/2309.10400) and LoRA (https://arxiv.org/abs/2106.09685) adapter transfer. More details are coming soon.

	Needle-In-A-Haystack (https://github.com/jzhang38/EasyContext) heatmap:

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/64c14f6b02e1f8f67c73bd05/8Z5XgqrZXKcb2hmeTKTT6.png)

	### Model Description

	- Developed by: [Abacus.AI](https://abacus.ai)
	- License: https://llama.meta.com/llama3/license/
	- Finetuned from model: [meta-llama/Meta-Llama-3-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct).

	## How to use

	The prompt format is unchanged from Llama 3 70B Instruct.
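
For readers who want to see what that format looks like, here is a minimal sketch that renders a message list by hand. The layout and special tokens reflect our understanding of the published Llama 3 Instruct format, and `format_llama3_prompt` is a hypothetical helper written for illustration. The tokenizer's own chat template remains the authoritative source and should be preferred in practice.

```python
def format_llama3_prompt(messages):
    """Render chat messages in the Llama 3 Instruct layout (illustrative sketch)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model writes the reply next.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
prompt = format_llama3_prompt(messages)
print(prompt)
```

If the output of `apply_chat_template` ever differs from this sketch, trust the template shipped with the model.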

	### Use with transformers

	See the snippet below for usage with Transformers:

	```python
	import transformers
	import torch

	# Hub ID of this 32K-context model
	model_id = "abacusai/Smaug-Llama-3-70B-Instruct-32K"

	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model_id,
	    model_kwargs={"torch_dtype": torch.bfloat16},
	    device_map="auto",
	)

	messages = [
	    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
	    {"role": "user", "content": "Who are you?"},
	]

	# Render the messages with the tokenizer's chat template
	prompt = pipeline.tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Stop on either the EOS token or the end-of-turn token
	terminators = [
	    pipeline.tokenizer.eos_token_id,
	    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
	]

	outputs = pipeline(
	    prompt,
	    max_new_tokens=256,
	    eos_token_id=terminators,
	    do_sample=True,
	    temperature=0.6,
	    top_p=0.9,
	)
	# Print only the newly generated text, without the prompt
	print(outputs[0]["generated_text"][len(prompt):])
	```


	## Evaluation

	### Arena-Hard

	Scores versus selected other models, sourced from the Arena-Hard leaderboard with GPT-4-Turbo as judge (https://lmsys.org/blog/2024-04-19-arena-hard/#full-leaderboard-with-gpt-4-turbo-as-judge). GPT-4o and Gemini-1.5-pro-latest were missing from the original blog post, so we produced those numbers from a local run using the same methodology.

	| Model | Score | 95% Confidence Interval | Average Tokens |
	| :---- | ---------: | ----------: | ------: |
	| GPT-4-Turbo-2024-04-09 | 82.6 | (-1.8, 1.6) | 662 |
	| GPT-4o | 78.3 | (-2.4, 2.1) | 685 |
	| Gemini-1.5-pro-latest | 72.1 | (-2.3, 2.2) | 630 |
	| Claude-3-Opus-20240229 | 60.4 | (-3.3, 2.4) | 541 |
	| Smaug-Llama-3-70B-Instruct-32K | 60.0 | (-2.6, 2.1) | 844 |
	| Smaug-Llama-3-70B-Instruct | 56.7 | (-2.2, 2.6) | 661 |
	| GPT-4-0314 | 50.0 | (-0.0, 0.0) | 423 |
	| Claude-3-Sonnet-20240229 | 46.8 | (-2.1, 2.2) | 552 |
	| Llama-3-70B-Instruct | 41.1 | (-2.5, 2.4) | 583 |
	| GPT-4-0613 | 37.9 | (-2.2, 2.0) | 354 |
	| Mistral-Large-2402 | 37.7 | (-1.9, 2.6) | 400 |
	| Mixtral-8x22B-Instruct-v0.1 | 36.4 | (-2.7, 2.9) | 430 |
	| Qwen1.5-72B-Chat | 36.1 | (-2.5, 2.2) | 474 |
	| Command-R-Plus | 33.1 | (-2.1, 2.2) | 541 |
	| Mistral-Medium | 31.9 | (-2.3, 2.4) | 485 |
	| GPT-3.5-Turbo-0613 | 24.8 | (-1.6, 2.0) | 401 |

	Note that we believe the model's verbosity (its number of output tokens) strongly influences the GPT-4 judge in this case, and at least partially explains the improvement in Arena-Hard score for the 32K model.
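
To put numbers on this, the sketch below compares the two Smaug rows from the table above. It assumes the confidence-interval column holds (lower, upper) offsets from the score, which is our reading of the table rather than something stated in the source. The `interval` helper is introduced only for illustration.

```python
# (score, lower offset, upper offset, average tokens), copied from the Arena-Hard table
results = {
    "Smaug-Llama-3-70B-Instruct-32K": (60.0, -2.6, 2.1, 844),
    "Smaug-Llama-3-70B-Instruct": (56.7, -2.2, 2.6, 661),
}


def interval(score, lower, upper):
    """Turn a score and its offsets into absolute interval bounds (assumed convention)."""
    return (round(score + lower, 1), round(score + upper, 1))


long_ctx = results["Smaug-Llama-3-70B-Instruct-32K"]
base = results["Smaug-Llama-3-70B-Instruct"]

long_interval = interval(*long_ctx[:3])
base_interval = interval(*base[:3])

# The intervals overlap if the 32K lower bound does not exceed the base upper bound.
intervals_overlap = long_interval[0] <= base_interval[1]
# Ratio of average response lengths between the two models.
token_ratio = long_ctx[3] / base[3]

print(long_interval, base_interval, intervals_overlap)
print(f"32K responses are {100 * (token_ratio - 1):.0f}% longer on average")
```

Under that assumption the two intervals overlap, while the 32K model's responses are roughly 28% longer, which is consistent with the caution expressed above.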

	### OpenLLM Leaderboard Manual Evaluation

	| Model | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K* | Average |
	| :---- | ---: | ------: | ---: | ---: | ---: | ---: | ---: |
	| Smaug-Llama-3-70B-Instruct-32K | 70.1 | TBA | TBA | 61.9 | 82.2 | TBA | TBA |
	| Llama-3-70B-Instruct | 71.4 | 85.7 | 80.0 | 61.8 | 82.9 | 91.1 | 78.8 |

	\*GSM8K: The GSM8K numbers quoted here are computed using a recent release
	of the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/).
	The commit used by the leaderboard has a significant issue that affects models that
	tend to use `:` in their responses, due to a bug in the stop-word configuration for
	GSM8K. The issue is covered in more detail in this
	[GSM8K evaluation discussion](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/770).
	The scores for both Llama-3 and this model are significantly different when evaluated
	with the updated harness, in which the stop-word issue has been addressed.
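
The effect of such a stop-word bug can be illustrated with a small sketch. The stop lists, the sample response, and the `truncate_at_stop` helper below are all hypothetical and chosen for illustration. They are not copied from the harness, so treat this as a picture of the failure mode rather than a reproduction of it.

```python
def truncate_at_stop(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Hypothetical model response that uses a colon before reaching its answer.
response = "Let's work it out: 3 apples + 4 apples = 7 apples. The answer is 7."

buggy_stops = [":", "Question:"]  # hypothetical list that includes a bare colon
fixed_stops = ["Question:"]       # hypothetical corrected list

truncated = truncate_at_stop(response, buggy_stops)
complete = truncate_at_stop(response, fixed_stops)

print(repr(truncated))
print(repr(complete))
```

With the bare colon in the stop list, the response is cut before the final answer appears, so a correct solution would be scored as wrong.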

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_abacusai__Smaug-Llama-3-70B-Instruct-32K)

	| Metric              | Value |
	|---------------------|------:|
	| Avg.                | 34.72 |
	| IFEval (0-Shot)     | 77.61 |
	| BBH (3-Shot)        | 49.07 |
	| MATH Lvl 5 (4-Shot) | 21.22 |
	| GPQA (0-shot)       |  6.15 |
	| MuSR (0-shot)       | 12.43 |
	| MMLU-PRO (5-shot)   | 41.83 |
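
As a quick sanity check, the reported average appears to be the unweighted mean of the six benchmark scores. That is an assumption on our part, but the numbers in the table are consistent with it:

```python
# Benchmark scores copied from the table above
scores = {
    "IFEval (0-Shot)": 77.61,
    "BBH (3-Shot)": 49.07,
    "MATH Lvl 5 (4-Shot)": 21.22,
    "GPQA (0-shot)": 6.15,
    "MuSR (0-shot)": 12.43,
    "MMLU-PRO (5-shot)": 41.83,
}

# Unweighted mean across the six benchmarks
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # 34.72
```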