---
license: llama3
---

This model is a fine-tuned Llama 3 model, trained on the training set of PromptEvals (https://huggingface.co/datasets/user104/PromptEvals). It is fine-tuned to generate high-quality assertion criteria for prompt templates.
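
A minimal usage sketch with the `transformers` library is shown below. The Hub repository id, the chat-style instruction wording, and the generation settings are illustrative assumptions, not the exact format used during fine-tuning.

```python
# Minimal sketch: generating assertion criteria for a prompt template.
# The repo id and instruction wording are illustrative assumptions, not the
# exact format used during fine-tuning.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "user104/promptevals_llama"  # assumed Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_template = (
    "You are a customer support assistant. Answer the user's question "
    "using only the provided documentation: {docs}"
)
messages = [{
    "role": "user",
    "content": f"Generate assertion criteria for this prompt template:\n{prompt_template}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```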

Model Card:
Model Details
– Person or organization developing model: Meta (base model); fine-tuned by [Redacted for submission]
– Model date: The base model was released on April 18, 2024; fine-tuning was performed in July 2024
– Model version: 3.1
– Model type: decoder-only Transformer
– Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 8 billion parameters, fine-tuned by us using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)
– Paper or other resource for more information: https://arxiv.org/abs/2310.06825
– Citation details: Redacted for submission
– License: Meta Llama 3 Community License
– Where to send questions or comments about the model: [Redacted for submission]
Intended Use. Use cases that were envisioned during development. (Primary intended uses, Primary intended users, Out-of-scope use cases)
Intended to be used by developers to generate high-quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs to generate such assertion criteria.
Factors. Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3.
We do not collect demographic, phenotypic, or any of the other data types listed in Section 4.3 in our dataset.
Metrics. Metrics should be chosen to reflect potential real-world impacts of the model. (Model performance measures, Decision thresholds, Variation approaches)

|      | Base Mistral | Mistral (FT) | Base Llama | Llama (FT) | GPT-4o |
|------|--------------|--------------|------------|------------|--------|
| p25  | 0.3608 | 0.7919 | 0.3211 | **0.7922** | 0.6296 |
| p50  | 0.4100 | 0.8231 | 0.3577 | **0.8233** | 0.6830 |
| Mean | 0.4093 | 0.8199 | 0.3607 | **0.8240** | 0.6808 |
| p75  | 0.4561 | 0.8553 | 0.3978 | **0.8554** | 0.7351 |

Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.
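
For intuition, the sketch below shows one way a semantic F1 over sets of criteria can be computed, by matching generated and reference criteria with sentence-embedding similarity. It is an illustrative approximation assuming the `sentence-transformers` library and a hypothetical similarity threshold, not necessarily the exact metric implementation behind the numbers above.

```python
# Illustrative sketch of a semantic F1 between two lists of assertion criteria.
# Embedding model and threshold are assumptions; this may differ from the exact
# metric used for the table above.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def semantic_f1(generated, reference, threshold=0.7):
    gen_emb = embedder.encode(generated, convert_to_tensor=True)
    ref_emb = embedder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)  # shape: len(generated) x len(reference)
    # Precision: fraction of generated criteria that match some reference criterion.
    precision = float((sims.max(dim=1).values >= threshold).float().mean())
    # Recall: fraction of reference criteria covered by some generated criterion.
    recall = float((sims.max(dim=0).values >= threshold).float().mean())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(semantic_f1(
    ["Output must be valid JSON", "Answer only from the provided documentation"],
    ["The response should be JSON-formatted", "Do not use outside knowledge"],
))
```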

|      | Mistral (FT) | Llama (FT) | GPT-4o  |
|------|--------------|------------|---------|
| p25  | 1.8717 | 2.3962 | 6.5596  |
| p50  | 2.3106 | 3.0748 | 8.2542  |
| Mean | 2.5915 | 3.6057 | 8.7041  |
| p75  | 2.9839 | 4.2716 | 10.1905 |

Latency for criteria generation. We compared the runtimes for all three models (in seconds) and report the 25th, 50th, and 75th percentiles along with the mean. We found that our fine-tuned Mistral model had the lowest runtime across all metrics.
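
Latency numbers like these can be collected with simple wall-clock timing around each generation call. The sketch below assumes a hypothetical `generate_criteria` function standing in for whichever model is being benchmarked.

```python
# Sketch: wall-clock latency percentiles for a generation function.
# `generate_criteria` is a hypothetical placeholder for the model call.
import statistics
import time

def benchmark_latency(generate_criteria, prompts):
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_criteria(prompt)
        latencies.append(time.perf_counter() - start)
    p25, p50, p75 = statistics.quantiles(latencies, n=4)  # quartile cut points
    return {"p25": p25, "p50": p50, "mean": statistics.mean(latencies), "p75": p75}
```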

|              | Average | Median | 75th percentile | 90th percentile |
|--------------|---------|--------|-----------------|-----------------|
| Base Mistral | 14.5012 | 14 | 18.5 | 23 |
| Mistral (FT) | 6.28640 | 5  | 8    | 10 |
| Base Llama   | 28.2458 | 26 | 33.5 | 46 |
| Llama (FT)   | 5.47255 | 5  | 6    | 9 |
| GPT-4o       | 7.59189 | 6  | 10   | 14.2 |
| Ground Truth | 5.98568 | 5  | 7    | 10 |

Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.

Evaluation Data: Evaluated on PromptEvals test set
Training Data: Fine-tuned on PromptEvals train set
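
The dataset is hosted at the URL above; a minimal loading sketch with the `datasets` library follows. The split names shown are assumptions about the dataset schema, so inspect the loaded object to confirm them.

```python
# Minimal sketch: loading PromptEvals from the Hugging Face Hub.
# Split names below are assumptions; print the dataset to see the real schema.
from datasets import load_dataset

promptevals = load_dataset("user104/PromptEvals")
print(promptevals)            # lists the available splits and columns
train = promptevals["train"]  # assumed fine-tuning split
test = promptevals["test"]    # assumed evaluation split
print(train[0])               # one prompt template with its assertion criteria
```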

Quantitative Analyses (Unitary results, Intersectional results):

| Domain | Similarity | Precision | Recall |
|--------|------------|-----------|--------|
| General-Purpose Chatbots | 0.8140 | 0.8070 | 0.8221 |
| Question-Answering | 0.8104 | 0.8018 | 0.8199 |
| Text Summarization | 0.8601 | 0.8733 | 0.8479 |
| Database Querying | 0.8362 | 0.8509 | 0.8228 |
| Education | 0.8388 | 0.8498 | 0.8282 |
| Content Creation | 0.8417 | 0.8480 | 0.8358 |
| Workflow Automation | 0.8389 | 0.8477 | 0.8304 |
| Horse Racing Analytics | 0.8249 | 0.8259 | 0.8245 |
| Data Analysis | 0.7881 | 0.7940 | 0.7851 |
| Prompt Engineering | 0.8441 | 0.8387 | 0.8496 |

Fine-Tuned Llama Score Averages per Domain (for the 10 most represented domains in our test set).

Ethical Considerations:
PromptEvals is open-source and is intended to be used as a benchmark for evaluating models' ability to identify and generate assertion criteria for prompts. However, because it is open-source, it may end up in model pre-training data, which can reduce the effectiveness of the benchmark. Additionally, PromptEvals uses prompts contributed by a variety of users, and these prompts may not represent all domains equally. Despite this, we believe our benchmark still provides value and can be useful for evaluating models on generating assertion criteria.
Caveats and Recommendations: None