metadata
language:
  - en
license: llama2
library_name: transformers
tags:
  - llama-2
  - code
datasets:
  - jondurbin/airoboros-2.2
  - Open-Orca/OpenOrca
  - garage-bAInd/Open-Platypus
  - WizardLM/WizardLM_evol_instruct_V2_196k
  - TokenBender/python_eval_instruct_51k
pipeline_tag: text-generation
model-index:
  - name: SpeechlessCoder
    results:
      - task:
          type: text-generation
        dataset:
          name: HumanEval
          type: openai_humaneval
        metrics:
          - type: pass@1
            value: 52.439
            name: pass@1
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 41.21
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-coding-7b-16k-tora
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 64.45
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-coding-7b-16k-tora
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 39.14
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-coding-7b-16k-tora
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 44.91
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-coding-7b-16k-tora
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 63.61
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-coding-7b-16k-tora
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 17.29
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=uukuguy/speechless-coding-7b-16k-tora
          name: Open LLM Leaderboard

speechless-coding-7b-16k-tora

The following datasets were used to fine-tune llm_agents/tora-code-7b-v1.0 in order to improve the model's reasoning and planning abilities.

context window length: 16,384
prompt_type = "alpaca"
max_tokens > 128 && < 16384
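The `max_tokens > 128 && < 16384` condition reads like a length filter applied to training samples. A minimal sketch of such a filter follows, assuming strict bounds on both sides. The `count_tokens` helper is a hypothetical stand-in that splits on whitespace; the real pipeline presumably counts tokens with the model tokenizer, which is not shown in this card.

```python
MIN_TOKENS = 128
MAX_TOKENS = 16384


def count_tokens(text: str) -> int:
    """Hypothetical token counter: whitespace split, not the model tokenizer."""
    return len(text.split())


def keep_sample(text: str, counter=count_tokens) -> bool:
    """Keep a sample only if its length is strictly between the two bounds."""
    return MIN_TOKENS < counter(text) < MAX_TOKENS


samples = ["word " * 50, "word " * 500, "word " * 20000]
kept = [s for s in samples if keep_sample(s)]
print(f"kept {len(kept)} of {len(samples)} samples")
```

Passing a different `counter` (for example a function wrapping a tokenizer's encode call) would swap in real token counts without changing the filter logic.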

Total: 177,333 samples, 316 MB

  • jondurbin/airoboros-2.2: Filtered to categories related to coding, reasoning, and planning. 21,923 samples.
  • Open-Orca/OpenOrca: Filtered to the 'cot' category of the 1M GPT4 dataset. 62,973 samples.
  • garage-bAInd/Open-Platypus: 100%, 22,760 samples.
  • WizardLM/WizardLM_evol_instruct_V2_196k: Coding conversation part. 30,081 samples.
  • TokenBender/python_eval_instruct_51k: Samples with "python" in the output. 39,596 samples.
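As a quick consistency check, the per-dataset counts in the list can be summed and compared with the stated total. The short sketch below does that and prints each dataset's share of the mixture; the dictionary simply restates the numbers from the list.

```python
# Sample counts as listed in this card.
DATASET_SAMPLES = {
    "jondurbin/airoboros-2.2": 21_923,
    "Open-Orca/OpenOrca": 62_973,
    "garage-bAInd/Open-Platypus": 22_760,
    "WizardLM/WizardLM_evol_instruct_V2_196k": 30_081,
    "TokenBender/python_eval_instruct_51k": 39_596,
}

total = sum(DATASET_SAMPLES.values())

for name, count in DATASET_SAMPLES.items():
    print(f"{name}: {count:,} samples ({count / total:.1%})")

print(f"Total: {total:,} samples")
```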

50 samples/T=0.2/MaxTokens=512/Top_P=0.95
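With 50 samples drawn per problem, pass@1 is usually computed with the unbiased pass@k estimator common in code-generation benchmarks. The sketch below assumes that estimator was used here; the card itself does not say so. `n` is the number of samples per problem, `c` the number that pass the tests, and `k` the budget.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem outcomes, for illustration only.
example = [(50, 25), (50, 0), (50, 50)]
print(mean_pass_at_k(example, k=1))
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n, averaged over problems.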

Code: https://github.com/uukuguy/speechless

How to Prompt the Model

This model accepts the Alpaca instruction format.

For example:

You are an intelligent programming assistant.

### Instruction:
Implement a linked list in C++

### Response:
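The example prompt can be assembled programmatically. The helper below is a minimal sketch: `build_prompt` and `DEFAULT_SYSTEM` are illustrative names rather than part of any library, and the exact whitespace (blank lines between sections, a newline after `### Response:`) is an assumption based on the example shown.

```python
DEFAULT_SYSTEM = "You are an intelligent programming assistant."


def build_prompt(instruction: str, system: str = DEFAULT_SYSTEM) -> str:
    """Format an instruction in the Alpaca style used in the example above."""
    return (
        f"{system}\n\n"
        f"### Instruction:\n{instruction.strip()}\n\n"
        "### Response:\n"
    )


prompt = build_prompt("Implement a linked list in C++")
print(prompt)
```

The resulting string can be passed to whichever generation API is in use; the model's completion is expected to follow the `### Response:` line.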

HumanEval

| Metric | Value |
| --- | --- |
| humaneval-python | 52.44 |

Big Code Models Leaderboard

CodeLlama-34B-Python: 53.29

CodeLlama-34B-Instruct: 50.79

CodeLlama-13B-Instruct: 50.6

CodeLlama-34B: 45.11

CodeLlama-13B-Python: 42.89

CodeLlama-13B: 35.07

MultiPL-E

| Metric | Value |
| --- | --- |
| python | 55.96 |
| java | 37.84 |
| javascript | 46.93 |
| cpp | 37.48 |
| rust | 29.01 |
| go | 28.99 |
| sh | 12.11 |
| julia | 31.47 |
| typescript | 47.80 |

LMEval

Open LLM Leaderboard

Metric Value
ARC
HellaSwag
MMLU
TruthfulQA
Average

Parameters

lr 2e-4
lr_scheduler_type cosine
weight_decay 0.0
optim paged_adamw_8bit
flash_attention True
rerope False
max_new_tokens 16384
num_train_epochs 2
bits 4
lora_r 64
lora_alpha 256
lora_dropout 0.05
double_quant True
quant_type nf4
dataset_format sharegpt
mini_batch_size 2
gradient_accumulation_steps 32
bf16 True

Hardware: A100-40G x 4
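Two derived quantities follow from these settings. The sketch below computes them under stated assumptions: that `mini_batch_size` is the per-GPU batch size and that the four GPUs run data-parallel (neither is stated in the card), and that LoRA updates are scaled by the conventional factor `lora_alpha / lora_r`. `TrainingSetup` is an illustrative container, not a class from the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    mini_batch_size: int = 2               # assumed to be per GPU
    gradient_accumulation_steps: int = 32
    num_gpus: int = 4                      # A100-40G x 4, assumed data-parallel
    lora_r: int = 64
    lora_alpha: int = 256

    @property
    def effective_batch_size(self) -> int:
        """Samples contributing to each optimizer step."""
        return (
            self.mini_batch_size
            * self.gradient_accumulation_steps
            * self.num_gpus
        )

    @property
    def lora_scaling(self) -> float:
        """Conventional LoRA scaling factor, alpha / r."""
        return self.lora_alpha / self.lora_r


setup = TrainingSetup()
print(f"effective batch size: {setup.effective_batch_size}")
print(f"LoRA scaling: {setup.lora_scaling}")
```

If the GPUs were used for model sharding rather than data parallelism, the effective batch size would instead be 2 x 32 = 64.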

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
| --- | --- |
| Avg. | 45.10 |
| AI2 Reasoning Challenge (25-Shot) | 41.21 |
| HellaSwag (10-Shot) | 64.45 |
| MMLU (5-Shot) | 39.14 |
| TruthfulQA (0-shot) | 44.91 |
| Winogrande (5-shot) | 63.61 |
| GSM8k (5-shot) | 17.29 |
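The reported average appears to be the unweighted mean of the six benchmark scores. A short check, using only the values listed here:

```python
# Benchmark scores as reported in this card.
SCORES = {
    "AI2 Reasoning Challenge (25-Shot)": 41.21,
    "HellaSwag (10-Shot)": 64.45,
    "MMLU (5-Shot)": 39.14,
    "TruthfulQA (0-shot)": 44.91,
    "Winogrande (5-shot)": 63.61,
    "GSM8k (5-shot)": 17.29,
}

average = sum(SCORES.values()) / len(SCORES)
print(f"Avg. {average:.2f}")
```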