metadata
language:
  - en
  - zh
license: llama2
library_name: transformers
tags:
  - llama
  - merge
  - medical
datasets:
  - GBaker/MedQA-USMLE-4-options
  - cognitivecomputations/samantha-data
  - shibing624/medical
base_model:
  - Severus27/BeingWell_llama2_7b
  - ParthasarathyShanmugam/llama-2-7b-samantha
pipeline_tag: text-generation
model-index:
  - name: Dr_Samantha-7b
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 53.84
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/Dr_Samantha-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 77.95
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/Dr_Samantha-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 47.94
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/Dr_Samantha-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 45.58
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/Dr_Samantha-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 73.56
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/Dr_Samantha-7b
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 18.8
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sethuiyer/Dr_Samantha-7b
          name: Open LLM Leaderboard

Dr. Samantha


Overview

Dr. Samantha is a language model made by merging Severus27/BeingWell_llama2_7b and ParthasarathyShanmugam/llama-2-7b-samantha using mergekit.

It combines the capabilities of a medical knowledge-focused model (trained on USMLE databases and doctor-patient interactions) with the philosophical, psychological, and relational understanding of the Samantha-7b model.

Acting as both a medical consultant and a personal counselor, Dr. Samantha could support both physical and mental wellbeing, which is important for whole-person care.

YAML Config


slices:
  - sources:
      - model: Severus27/BeingWell_llama2_7b
        layer_range: [0, 32]
      - model: ParthasarathyShanmugam/llama-2-7b-samantha
        layer_range: [0, 32]

merge_method: slerp
base_model: TinyPixel/Llama-2-7B-bf16-sharded

parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
tokenizer_source: union

dtype: bfloat16
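
The `merge_method: slerp` setting refers to spherical linear interpolation, and each `t` list gives interpolation weights that change across the 32 layers. The following is a rough sketch of the idea on plain Python lists, not mergekit's actual implementation. The helper names `slerp` and `layer_t` are hypothetical, and spreading the anchor values evenly and linearly over the layers is an assumption made for illustration.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors.

    t = 0 returns v0 and t = 1 returns v1. Falls back to plain linear
    interpolation when a vector is near zero or the two are nearly colinear.
    """
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    if norm0 < eps or norm1 < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    # Angle between the two directions, clamped for numerical safety.
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    sin_theta = math.sin(theta)
    if abs(sin_theta) < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def layer_t(anchors, layer_idx, num_layers):
    """Assumed scheme: linearly interpolate anchor values across the layers."""
    if len(anchors) == 1 or num_layers == 1:
        return anchors[0]
    pos = layer_idx / (num_layers - 1) * (len(anchors) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(anchors) - 1)
    frac = pos - lo
    return (1 - frac) * anchors[lo] + frac * anchors[hi]


# Anchor values copied from the config above.
self_attn_t = [layer_t([0, 0.5, 0.3, 0.7, 1], i, 32) for i in range(32)]
mlp_t = [layer_t([1, 0.5, 0.7, 0.3, 0], i, 32) for i in range(32)]
```

Under this reading, the `self_attn` and `mlp` weights move in opposite directions through the stack, so layers that lean toward one parent for attention lean toward the other for the MLP blocks.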

Prompt Template

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is your name?

### Response:
My name is Samantha.
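
The template above can be filled in programmatically. The sketch below is one possible way to do it: `build_prompt`, `extract_response`, and `generate` are hypothetical helper names, and the generation settings (such as `max_new_tokens=256`) are assumptions rather than recommended values. The model ID is the one used in the leaderboard links, and `transformers` is imported lazily so the prompt helpers work without it.

```python
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the prompt template shown above."""
    return PROMPT_TEMPLATE.format(instruction=instruction.strip())


def extract_response(generated_text: str) -> str:
    """Return the text after the last '### Response:' marker."""
    return generated_text.rsplit("### Response:", 1)[-1].strip()


def generate(instruction: str,
             model_id: str = "sethuiyer/Dr_Samantha-7b",
             max_new_tokens: int = 256) -> str:
    """Generate a reply with transformers (downloads the full model)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(instruction), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return extract_response(text)
```

Calling `generate("What is your name?")` would load the weights and return only the text after the response marker.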

⚡ Quantized models

Thanks to TheBloke for making this available!

Dr. Samantha is also available on Ollama. To use it, run `ollama run stuehieyr/dr_samantha` in your terminal. If you have limited computing resources, check out this video to learn how to run it on a Google Colab backend.

OpenLLM Leaderboard Performance

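| T | Model | Average | ARC | Hellaswag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|-------|---------|-----|-----------|------|------------|------------|-------|
| 1 | sethuiyer/Dr_Samantha-7b | 52.95 | 53.84 | 77.95 | 47.94 | 45.58 | 73.56 | 18.8 |
| 2 | togethercomputer/LLaMA-2-7B-32K-Instruct | 50.02 | 51.11 | 78.51 | 46.11 | 44.86 | 73.88 | 5.69 |
| 3 | togethercomputer/LLaMA-2-7B-32K | 47.07 | 47.53 | 76.14 | 43.33 | 39.23 | 71.9 | 4.32 |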
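
The Average column appears to be the unweighted mean of the six benchmark scores. That is an assumption about how the leaderboard aggregates, but the short check below, using only the numbers from the table, agrees with each reported average to within 0.01.

```python
# Scores copied from the table: ARC, Hellaswag, MMLU, TruthfulQA, Winogrande, GSM8K.
leaderboard = {
    "sethuiyer/Dr_Samantha-7b": (52.95, [53.84, 77.95, 47.94, 45.58, 73.56, 18.8]),
    "togethercomputer/LLaMA-2-7B-32K-Instruct": (50.02, [51.11, 78.51, 46.11, 44.86, 73.88, 5.69]),
    "togethercomputer/LLaMA-2-7B-32K": (47.07, [47.53, 76.14, 43.33, 39.23, 71.9, 4.32]),
}

# Recompute each average as a plain mean of the six scores.
recomputed = {
    model: sum(scores) / len(scores)
    for model, (_, scores) in leaderboard.items()
}

for model, (reported, _) in leaderboard.items():
    diff = abs(recomputed[model] - reported)
    print(f"{model}: reported={reported}, recomputed={recomputed[model]:.3f}, diff={diff:.3f}")
```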

Subject-wise Accuracy

| Subject | Accuracy (%) |
|---------|--------------|
| Clinical Knowledge | 52.83 |
| Medical Genetics | 49.00 |
| Human Aging | 58.29 |
| Human Sexuality | 55.73 |
| College Medicine | 38.73 |
| Anatomy | 41.48 |
| College Biology | 52.08 |
| College Medicine | 38.73 |
| High School Biology | 53.23 |
| Professional Medicine | 38.73 |
| Nutrition | 50.33 |
| Professional Psychology | 46.57 |
| Virology | 41.57 |
| High School Psychology | 66.60 |
| Average | 48.85 |
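
The reported average can be reproduced as the unweighted mean of the 14 rows exactly as listed, including the repeated College Medicine row. The sketch below is only a consistency check on the numbers above, and the variable names are illustrative.

```python
# Rows copied verbatim from the subject-wise table, in order.
subject_accuracy = [
    ("Clinical Knowledge", 52.83),
    ("Medical Genetics", 49.00),
    ("Human Aging", 58.29),
    ("Human Sexuality", 55.73),
    ("College Medicine", 38.73),
    ("Anatomy", 41.48),
    ("College Biology", 52.08),
    ("College Medicine", 38.73),
    ("High School Biology", 53.23),
    ("Professional Medicine", 38.73),
    ("Nutrition", 50.33),
    ("Professional Psychology", 46.57),
    ("Virology", 41.57),
    ("High School Psychology", 66.60),
]

# Unweighted mean over every listed row.
average = sum(score for _, score in subject_accuracy) / len(subject_accuracy)
print(f"Mean over {len(subject_accuracy)} rows: {average:.2f}")
```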

Evaluation by GPT-4 across 25 random prompts from ChatDoctor-200k Dataset

Overall Rating: 83.5/100

Pros:

  • Demonstrates extensive medical knowledge through accurate identification of potential causes for various symptoms.
  • Responses consistently emphasize the importance of seeking professional diagnoses and treatments.
  • Advice to consult specialists for certain concerns is well-reasoned.
  • Practical interim measures provided for symptom management in several cases.
  • Consistent display of empathy, support, and reassurance for patients' well-being.
  • Clear and understandable explanations of conditions and treatment options.
  • Prompt responses addressing all aspects of medical inquiries.

Cons:

  • Could occasionally place stronger emphasis on urgency when symptoms indicate potential emergencies.
  • Discussion of differential diagnoses could explore a broader range of less common causes.
  • Details around less common symptoms and their implications need more depth at times.
  • Opportunities exist to gather clarifying details on symptom histories through follow-up questions.
  • Consider exploring full medical histories to improve diagnostic context where relevant.
  • Caution levels and risk factors associated with certain conditions could be underscored more.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|--------|-------|
| Avg. | 52.95 |
| AI2 Reasoning Challenge (25-Shot) | 53.84 |
| HellaSwag (10-Shot) | 77.95 |
| MMLU (5-Shot) | 47.94 |
| TruthfulQA (0-shot) | 45.58 |
| Winogrande (5-shot) | 73.56 |
| GSM8k (5-shot) | 18.80 |