---
license: apache-2.0
library_name: transformers
tags:
- mistral
- alpaca
datasets:
- tatsu-lab/alpaca
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
model-index:
- name: Mistral-7B-Alpaca-52k-v0.1
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 60.92
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 82.13
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.41
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 41.5
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 77.35
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 37.45
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1
      name: Open LLM Leaderboard
---
# Description
`mistralai/Mistral-7B-v0.1` fine-tuned on the 52k-example Alpaca dataset (`tatsu-lab/alpaca`).
# How to use it
```python
# pip install transformers==4.35.2
# device_map="auto" below also requires the accelerate package
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

model_id = "MaziyarPanahi/Mistral-7B-Alpaca-52k-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Print tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)

# Load the weights in half precision and spread them over the available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    temperature=0.1,
    do_sample=True,
    top_p=0.95,
    repetition_penalty=1.15,
    return_full_text=False,  # return only the completion, not the prompt
    streamer=streamer,
)

# Alpaca-style instruction prompt
prompt = """Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
describe about pros and cons of docker system. Answer in bullet point
### Response:
"""
res = pipe(prompt)[0]['generated_text']
```
Streamed output (the `TextStreamer` echoes the prompt before the completion):
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
describe about pros and cons of docker system. Answer in bullet point
### Response:
Pros of Docker System:
- Improved portability - Docker containers can be easily moved between different environments, making it easier to deploy applications across multiple platforms.
- Increased security - Containers are isolated from each other, which helps prevent malicious code from spreading throughout the system.
- Better resource utilization - Containers allow for better resource management by allowing users to run multiple applications on a single host without having to worry about conflicts or performance issues.
Cons of Docker System:
- Learning curve - It takes time to learn how to use Docker effectively, as there are many commands and concepts involved.
- Limited customization options - While Docker provides some basic configuration options, more advanced features such as network routing require additional tools.
- Performance overhead - Running multiple containers on a single host may result in slower performance due to increased memory usage.</s>
```
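The prompt above follows an Alpaca-style template: a fixed preamble, an `### Instruction:` section, and an empty `### Response:` section for the model to complete. A minimal sketch of a helper that fills this template is shown below. The function name `build_alpaca_prompt` is hypothetical and not part of this model's repository, and the line layout is copied from the prompt as it appears in this card.

```python
# Hypothetical helper (not from the model repository): fills the Alpaca-style
# template used in the example prompt above.
PREAMBLE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_alpaca_prompt(instruction: str) -> str:
    """Return the prompt text for a single instruction without extra input."""
    return (
        f"{PREAMBLE}\n"
        "### Instruction:\n"
        f"{instruction.strip()}\n"
        "### Response:\n"
    )


prompt = build_alpaca_prompt("List three uses of Docker volumes.")
print(prompt)
```

The resulting string can be passed to `pipe(...)` in place of the hard-coded prompt.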
## Eval
```json
{
"all": {
"acc": 0.6309850839451187,
"acc_stderr": 0.032333688535613636,
"acc_norm": 0.6368691004374645,
"acc_norm_stderr": 0.03298401757997533,
"mc1": 0.29008567931456547,
"mc1_stderr": 0.01588623687420952,
"mc2": 0.41501661742948026,
"mc2_stderr": 0.014285902986671931
},
"harness|arc:challenge|25": {
"acc": 0.5750853242320819,
"acc_stderr": 0.014445698968520767,
"acc_norm": 0.6092150170648464,
"acc_norm_stderr": 0.01425856388051378
},
"harness|hellaswag|10": {
"acc": 0.6221868153754232,
"acc_stderr": 0.0048384969668239025,
"acc_norm": 0.8212507468631747,
"acc_norm_stderr": 0.0038235918141330347
},
"harness|hendrycksTest-abstract_algebra|5": {
"acc": 0.32,
"acc_stderr": 0.046882617226215034,
"acc_norm": 0.32,
"acc_norm_stderr": 0.046882617226215034
},
"harness|hendrycksTest-anatomy|5": {
"acc": 0.6,
"acc_stderr": 0.04232073695151589,
"acc_norm": 0.6,
"acc_norm_stderr": 0.04232073695151589
},
"harness|hendrycksTest-astronomy|5": {
"acc": 0.6447368421052632,
"acc_stderr": 0.038947344870133176,
"acc_norm": 0.6447368421052632,
"acc_norm_stderr": 0.038947344870133176
},
"harness|hendrycksTest-business_ethics|5": {
"acc": 0.57,
"acc_stderr": 0.04975698519562428,
"acc_norm": 0.57,
"acc_norm_stderr": 0.04975698519562428
},
"harness|hendrycksTest-clinical_knowledge|5": {
"acc": 0.6792452830188679,
"acc_stderr": 0.02872750295788027,
"acc_norm": 0.6792452830188679,
"acc_norm_stderr": 0.02872750295788027
},
"harness|hendrycksTest-college_biology|5": {
"acc": 0.7430555555555556,
"acc_stderr": 0.03653946969442099,
"acc_norm": 0.7430555555555556,
"acc_norm_stderr": 0.03653946969442099
},
"harness|hendrycksTest-college_chemistry|5": {
"acc": 0.49,
"acc_stderr": 0.05024183937956912,
"acc_norm": 0.49,
"acc_norm_stderr": 0.05024183937956912
},
"harness|hendrycksTest-college_computer_science|5": {
"acc": 0.56,
"acc_stderr": 0.04988876515698589,
"acc_norm": 0.56,
"acc_norm_stderr": 0.04988876515698589
},
"harness|hendrycksTest-college_mathematics|5": {
"acc": 0.36,
"acc_stderr": 0.048241815132442176,
"acc_norm": 0.36,
"acc_norm_stderr": 0.048241815132442176
},
"harness|hendrycksTest-college_medicine|5": {
"acc": 0.653179190751445,
"acc_stderr": 0.036291466701596636,
"acc_norm": 0.653179190751445,
"acc_norm_stderr": 0.036291466701596636
},
"harness|hendrycksTest-college_physics|5": {
"acc": 0.4019607843137255,
"acc_stderr": 0.048786087144669955,
"acc_norm": 0.4019607843137255,
"acc_norm_stderr": 0.048786087144669955
},
"harness|hendrycksTest-computer_security|5": {
"acc": 0.79,
"acc_stderr": 0.04093601807403326,
"acc_norm": 0.79,
"acc_norm_stderr": 0.04093601807403326
},
"harness|hendrycksTest-conceptual_physics|5": {
"acc": 0.5702127659574469,
"acc_stderr": 0.03236214467715564,
"acc_norm": 0.5702127659574469,
"acc_norm_stderr": 0.03236214467715564
},
"harness|hendrycksTest-econometrics|5": {
"acc": 0.49122807017543857,
"acc_stderr": 0.047028804320496165,
"acc_norm": 0.49122807017543857,
"acc_norm_stderr": 0.047028804320496165
},
"harness|hendrycksTest-electrical_engineering|5": {
"acc": 0.5862068965517241,
"acc_stderr": 0.04104269211806232,
"acc_norm": 0.5862068965517241,
"acc_norm_stderr": 0.04104269211806232
},
"harness|hendrycksTest-elementary_mathematics|5": {
"acc": 0.3915343915343915,
"acc_stderr": 0.025138091388851116,
"acc_norm": 0.3915343915343915,
"acc_norm_stderr": 0.025138091388851116
},
"harness|hendrycksTest-formal_logic|5": {
"acc": 0.4444444444444444,
"acc_stderr": 0.04444444444444449,
"acc_norm": 0.4444444444444444,
"acc_norm_stderr": 0.04444444444444449
},
"harness|hendrycksTest-global_facts|5": {
"acc": 0.32,
"acc_stderr": 0.04688261722621504,
"acc_norm": 0.32,
"acc_norm_stderr": 0.04688261722621504
},
"harness|hendrycksTest-high_school_biology|5": {
"acc": 0.7419354838709677,
"acc_stderr": 0.02489246917246283,
"acc_norm": 0.7419354838709677,
"acc_norm_stderr": 0.02489246917246283
},
"harness|hendrycksTest-high_school_chemistry|5": {
"acc": 0.5024630541871922,
"acc_stderr": 0.035179450386910616,
"acc_norm": 0.5024630541871922,
"acc_norm_stderr": 0.035179450386910616
},
"harness|hendrycksTest-high_school_computer_science|5": {
"acc": 0.67,
"acc_stderr": 0.047258156262526066,
"acc_norm": 0.67,
"acc_norm_stderr": 0.047258156262526066
},
"harness|hendrycksTest-high_school_european_history|5": {
"acc": 0.7575757575757576,
"acc_stderr": 0.03346409881055953,
"acc_norm": 0.7575757575757576,
"acc_norm_stderr": 0.03346409881055953
},
"harness|hendrycksTest-high_school_geography|5": {
"acc": 0.7929292929292929,
"acc_stderr": 0.028869778460267042,
"acc_norm": 0.7929292929292929,
"acc_norm_stderr": 0.028869778460267042
},
"harness|hendrycksTest-high_school_government_and_politics|5": {
"acc": 0.8601036269430051,
"acc_stderr": 0.025033870583015184,
"acc_norm": 0.8601036269430051,
"acc_norm_stderr": 0.025033870583015184
},
"harness|hendrycksTest-high_school_macroeconomics|5": {
"acc": 0.6358974358974359,
"acc_stderr": 0.024396672985094764,
"acc_norm": 0.6358974358974359,
"acc_norm_stderr": 0.024396672985094764
},
"harness|hendrycksTest-high_school_mathematics|5": {
"acc": 0.362962962962963,
"acc_stderr": 0.029318203645206865,
"acc_norm": 0.362962962962963,
"acc_norm_stderr": 0.029318203645206865
},
"harness|hendrycksTest-high_school_microeconomics|5": {
"acc": 0.6218487394957983,
"acc_stderr": 0.03149930577784906,
"acc_norm": 0.6218487394957983,
"acc_norm_stderr": 0.03149930577784906
},
"harness|hendrycksTest-high_school_physics|5": {
"acc": 0.32450331125827814,
"acc_stderr": 0.038227469376587525,
"acc_norm": 0.32450331125827814,
"acc_norm_stderr": 0.038227469376587525
},
"harness|hendrycksTest-high_school_psychology|5": {
"acc": 0.8146788990825689,
"acc_stderr": 0.016659279700295838,
"acc_norm": 0.8146788990825689,
"acc_norm_stderr": 0.016659279700295838
},
"harness|hendrycksTest-high_school_statistics|5": {
"acc": 0.49537037037037035,
"acc_stderr": 0.03409825519163572,
"acc_norm": 0.49537037037037035,
"acc_norm_stderr": 0.03409825519163572
},
"harness|hendrycksTest-high_school_us_history|5": {
"acc": 0.7892156862745098,
"acc_stderr": 0.028626547912437406,
"acc_norm": 0.7892156862745098,
"acc_norm_stderr": 0.028626547912437406
},
"harness|hendrycksTest-high_school_world_history|5": {
"acc": 0.7552742616033755,
"acc_stderr": 0.027985699387036423,
"acc_norm": 0.7552742616033755,
"acc_norm_stderr": 0.027985699387036423
},
"harness|hendrycksTest-human_aging|5": {
"acc": 0.6636771300448431,
"acc_stderr": 0.031708824268455,
"acc_norm": 0.6636771300448431,
"acc_norm_stderr": 0.031708824268455
},
"harness|hendrycksTest-human_sexuality|5": {
"acc": 0.7862595419847328,
"acc_stderr": 0.0359546161177469,
"acc_norm": 0.7862595419847328,
"acc_norm_stderr": 0.0359546161177469
},
"harness|hendrycksTest-international_law|5": {
"acc": 0.7933884297520661,
"acc_stderr": 0.03695980128098824,
"acc_norm": 0.7933884297520661,
"acc_norm_stderr": 0.03695980128098824
},
"harness|hendrycksTest-jurisprudence|5": {
"acc": 0.7592592592592593,
"acc_stderr": 0.04133119440243838,
"acc_norm": 0.7592592592592593,
"acc_norm_stderr": 0.04133119440243838
},
"harness|hendrycksTest-logical_fallacies|5": {
"acc": 0.803680981595092,
"acc_stderr": 0.031207970394709218,
"acc_norm": 0.803680981595092,
"acc_norm_stderr": 0.031207970394709218
},
"harness|hendrycksTest-machine_learning|5": {
"acc": 0.5178571428571429,
"acc_stderr": 0.047427623612430116,
"acc_norm": 0.5178571428571429,
"acc_norm_stderr": 0.047427623612430116
},
"harness|hendrycksTest-management|5": {
"acc": 0.8252427184466019,
"acc_stderr": 0.03760178006026621,
"acc_norm": 0.8252427184466019,
"acc_norm_stderr": 0.03760178006026621
},
"harness|hendrycksTest-marketing|5": {
"acc": 0.8632478632478633,
"acc_stderr": 0.022509033937077816,
"acc_norm": 0.8632478632478633,
"acc_norm_stderr": 0.022509033937077816
},
"harness|hendrycksTest-medical_genetics|5": {
"acc": 0.74,
"acc_stderr": 0.04408440022768078,
"acc_norm": 0.74,
"acc_norm_stderr": 0.04408440022768078
},
"harness|hendrycksTest-miscellaneous|5": {
"acc": 0.8173690932311622,
"acc_stderr": 0.013816335389973136,
"acc_norm": 0.8173690932311622,
"acc_norm_stderr": 0.013816335389973136
},
"harness|hendrycksTest-moral_disputes|5": {
"acc": 0.7023121387283237,
"acc_stderr": 0.024617055388677,
"acc_norm": 0.7023121387283237,
"acc_norm_stderr": 0.024617055388677
},
"harness|hendrycksTest-moral_scenarios|5": {
"acc": 0.2335195530726257,
"acc_stderr": 0.014149575348976269,
"acc_norm": 0.2335195530726257,
"acc_norm_stderr": 0.014149575348976269
},
"harness|hendrycksTest-nutrition|5": {
"acc": 0.7450980392156863,
"acc_stderr": 0.024954184324879905,
"acc_norm": 0.7450980392156863,
"acc_norm_stderr": 0.024954184324879905
},
"harness|hendrycksTest-philosophy|5": {
"acc": 0.7106109324758842,
"acc_stderr": 0.025755865922632945,
"acc_norm": 0.7106109324758842,
"acc_norm_stderr": 0.025755865922632945
},
"harness|hendrycksTest-prehistory|5": {
"acc": 0.7191358024691358,
"acc_stderr": 0.025006469755799215,
"acc_norm": 0.7191358024691358,
"acc_norm_stderr": 0.025006469755799215
},
"harness|hendrycksTest-professional_accounting|5": {
"acc": 0.4716312056737589,
"acc_stderr": 0.029779450957303062,
"acc_norm": 0.4716312056737589,
"acc_norm_stderr": 0.029779450957303062
},
"harness|hendrycksTest-professional_law|5": {
"acc": 0.4498044328552803,
"acc_stderr": 0.012705721498565107,
"acc_norm": 0.4498044328552803,
"acc_norm_stderr": 0.012705721498565107
},
"harness|hendrycksTest-professional_medicine|5": {
"acc": 0.6580882352941176,
"acc_stderr": 0.02881472242225418,
"acc_norm": 0.6580882352941176,
"acc_norm_stderr": 0.02881472242225418
},
"harness|hendrycksTest-professional_psychology|5": {
"acc": 0.6519607843137255,
"acc_stderr": 0.019270998708223974,
"acc_norm": 0.6519607843137255,
"acc_norm_stderr": 0.019270998708223974
},
"harness|hendrycksTest-public_relations|5": {
"acc": 0.6636363636363637,
"acc_stderr": 0.04525393596302506,
"acc_norm": 0.6636363636363637,
"acc_norm_stderr": 0.04525393596302506
},
"harness|hendrycksTest-security_studies|5": {
"acc": 0.7224489795918367,
"acc_stderr": 0.028666857790274645,
"acc_norm": 0.7224489795918367,
"acc_norm_stderr": 0.028666857790274645
},
"harness|hendrycksTest-sociology|5": {
"acc": 0.8557213930348259,
"acc_stderr": 0.02484575321230604,
"acc_norm": 0.8557213930348259,
"acc_norm_stderr": 0.02484575321230604
},
"harness|hendrycksTest-us_foreign_policy|5": {
"acc": 0.86,
"acc_stderr": 0.03487350880197771,
"acc_norm": 0.86,
"acc_norm_stderr": 0.03487350880197771
},
"harness|hendrycksTest-virology|5": {
"acc": 0.5481927710843374,
"acc_stderr": 0.03874371556587953,
"acc_norm": 0.5481927710843374,
"acc_norm_stderr": 0.03874371556587953
},
"harness|hendrycksTest-world_religions|5": {
"acc": 0.8421052631578947,
"acc_stderr": 0.027966785859160896,
"acc_norm": 0.8421052631578947,
"acc_norm_stderr": 0.027966785859160896
},
"harness|truthfulqa:mc|0": {
"mc1": 0.29008567931456547,
"mc1_stderr": 0.01588623687420952,
"mc2": 0.41501661742948026,
"mc2_stderr": 0.014285902986671931
},
"harness|winogrande|5": {
"acc": 0.7734806629834254,
"acc_stderr": 0.011764149054698332
},
"harness|gsm8k|5": {
"acc": 0.37452615617892343,
"acc_stderr": 0.013331774158491393
}
}
```
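The per-task numbers above can be reduced to the six headline scores reported in this card's metadata. The sketch below assumes the metric choice listed there (`acc_norm` for ARC and HellaSwag, `mc2` for TruthfulQA, `acc` elsewhere) and assumes that the MMLU score is the unweighted mean of `acc` over the `hendrycksTest-*` subtasks. The function name `leaderboard_scores` is hypothetical, and the sample dict is truncated to two MMLU subtasks for brevity, so its MMLU value will not match the full result.

```python
from statistics import mean


def leaderboard_scores(results: dict) -> dict:
    """Map harness results (shaped like the dict above) to percentage scores."""
    # Assumption: MMLU is the unweighted mean of acc over all hendrycksTest subtasks.
    mmlu = [
        task["acc"]
        for name, task in results.items()
        if name.startswith("harness|hendrycksTest-")
    ]
    scores = {
        "ARC": results["harness|arc:challenge|25"]["acc_norm"],
        "HellaSwag": results["harness|hellaswag|10"]["acc_norm"],
        "MMLU": mean(mmlu),
        "TruthfulQA": results["harness|truthfulqa:mc|0"]["mc2"],
        "Winogrande": results["harness|winogrande|5"]["acc"],
        "GSM8k": results["harness|gsm8k|5"]["acc"],
    }
    return {name: round(100 * value, 2) for name, value in scores.items()}


# Truncated sample: values copied from the results above, with only two MMLU subtasks.
sample = {
    "harness|arc:challenge|25": {"acc_norm": 0.6092150170648464},
    "harness|hellaswag|10": {"acc_norm": 0.8212507468631747},
    "harness|hendrycksTest-abstract_algebra|5": {"acc": 0.32},
    "harness|hendrycksTest-anatomy|5": {"acc": 0.6},
    "harness|truthfulqa:mc|0": {"mc2": 0.41501661742948026},
    "harness|winogrande|5": {"acc": 0.7734806629834254},
    "harness|gsm8k|5": {"acc": 0.37452615617892343},
}
print(leaderboard_scores(sample))
```

Running the function on the full results dict, rather than the truncated sample, is what would be comparable to the leaderboard table.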
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_MaziyarPanahi__Mistral-7B-Alpaca-52k-v0.1).
| Metric |Value|
|---------------------------------|----:|
|Avg. |60.46|
|AI2 Reasoning Challenge (25-Shot)|60.92|
|HellaSwag (10-Shot) |82.13|
|MMLU (5-Shot) |63.41|
|TruthfulQA (0-shot) |41.50|
|Winogrande (5-shot) |77.35|
|GSM8k (5-shot) |37.45|
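
The `Avg.` row is consistent with the plain arithmetic mean of the six benchmark scores in the table. A short check, using only the values listed above:

```python
# Scores copied from the table above (percentages).
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 60.92,
    "HellaSwag (10-Shot)": 82.13,
    "MMLU (5-Shot)": 63.41,
    "TruthfulQA (0-shot)": 41.50,
    "Winogrande (5-shot)": 77.35,
    "GSM8k (5-shot)": 37.45,
}

# Unweighted mean over the six benchmarks, rounded to two decimals.
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 60.46
```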