metadata
language:
  - en
license: mit
tags:
  - nlp
  - code
  - mlx
datasets:
  - teknium/openhermes
license_link: https://huggingface.co/microsoft/phi-2/resolve/main/LICENSE
pipeline_tag: text-generation
model-index:
  - name: phi-2-openhermes-30k
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 61.01
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=marcel/phi-2-openhermes-30k
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 74.72
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=marcel/phi-2-openhermes-30k
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 57.17
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=marcel/phi-2-openhermes-30k
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 45.38
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=marcel/phi-2-openhermes-30k
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 74.9
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=marcel/phi-2-openhermes-30k
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 49.05
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=marcel/phi-2-openhermes-30k
          name: Open LLM Leaderboard

marcel/phi-2-openhermes-30k

This model was converted to MLX format from microsoft/phi-2. Refer to the original model card for more details on the model.

Use with mlx

pip install mlx
git clone https://github.com/ml-explore/mlx-examples.git
cd mlx-examples/llms/hf_llm
python generate.py --model marcel/phi-2-openhermes-30k --prompt "My name is"
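
Use with transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in half precision; device_map="auto" places it on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "marcel/phi-2-openhermes-30k",
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
# Load the tokenizer from the same repository as the model.
tokenizer = AutoTokenizer.from_pretrained("marcel/phi-2-openhermes-30k")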

input_text = "### Human: Give me a good recipe for a chinese dish\n\n### Assistant:"

outputs = model.generate(
    tokenizer(input_text, return_tensors="pt").to(model.device)['input_ids'],
    max_length=1024,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
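
The example prompt follows a "### Human: ... ### Assistant:" pattern. As a minimal sketch, the helper below builds prompts in that shape. The name `build_prompt` is hypothetical (it is not part of this repository or of transformers), and the template is inferred only from the single example string above, so treat it as an assumption rather than a documented chat format.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the "### Human / ### Assistant" template.

    Hypothetical helper: the template is inferred from the one example
    prompt in this model card, not from a documented specification.
    """
    return f"### Human: {user_message.strip()}\n\n### Assistant:"


prompt = build_prompt("Give me a good recipe for a chinese dish")
print(repr(prompt))
```

The returned string can be passed to the tokenizer in place of a hand-written `input_text`.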

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 60.37 |
| AI2 Reasoning Challenge (25-Shot) | 61.01 |
| HellaSwag (10-Shot) | 74.72 |
| MMLU (5-Shot) | 57.17 |
| TruthfulQA (0-shot) | 45.38 |
| Winogrande (5-shot) | 74.90 |
| GSM8k (5-shot) | 49.05 |
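
The "Avg." row is consistent with an unweighted arithmetic mean of the six benchmark scores. The short check below recomputes it from the values listed above; that the leaderboard uses a plain mean is an assumption here, supported only by the numbers agreeing.

```python
# Scores copied from the results above (Open LLM Leaderboard, in percent).
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 61.01,
    "HellaSwag (10-Shot)": 74.72,
    "MMLU (5-Shot)": 57.17,
    "TruthfulQA (0-shot)": 45.38,
    "Winogrande (5-shot)": 74.90,
    "GSM8k (5-shot)": 49.05,
}

# Assumption: the reported average is an unweighted mean over all six tasks.
average = sum(scores.values()) / len(scores)
print(f"Avg. {average:.2f}")  # prints "Avg. 60.37"
```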