This model is a Mistral model fine-tuned on the training set of PromptEvals (https://huggingface.co/datasets/user104/PromptEvals) to generate high-quality assertion criteria for prompt templates.
Model Card:
Model Details
- Person or organization developing model: MistralAI; fine-tuned by [Redacted for submission]
- Model date: Base model released in September 2023, fine-tuned in July 2023
- Model version: version 3
- Model type: decoder-only Transformer
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 7.3 billion parameters, fine-tuned by us using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)
- Paper or other resource for more information: https://arxiv.org/abs/2310.06825
- Citation details: Redacted for submission
- License: Apache 2.0
- Where to send questions or comments about the model: [Redacted for submission]
Intended Use. Use cases that were envisioned during development. (Primary intended uses, Primary intended users, Out-of-scope use cases)
Intended to be used by developers to generate high-quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs to generate such criteria.
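As a usage illustration, the sketch below loads the fine-tuned model with the Hugging Face `transformers` library and asks it to generate assertion criteria for an example prompt template. The repo id, instruction wording, and generation settings are placeholders and assumptions; they may not match the exact prompt format used during fine-tuning.

```python
# Minimal inference sketch. "your-org/mistral-7b-promptevals" is a hypothetical
# repo id; the actual model path and prompt format may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/mistral-7b-promptevals"  # hypothetical placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_template = (
    "You are a customer-support assistant. Answer the user's question using "
    "only the provided context, and respond in JSON with keys 'answer' and 'sources'."
)
instruction = (
    "Generate a list of assertion criteria that outputs produced by the "
    f"following prompt template should satisfy:\n\n{prompt_template}"
)

inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens (the assertion criteria).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```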
Factors. Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3.
We do not collect any demographic, phenotypic, or other data of the kinds listed in Section 4.3 in our dataset.
Metrics. Metrics should be chosen to reflect potential real-world impacts of the model. (Model performance measures, Decision thresholds, Variation approaches)
| | Base Mistral | Mistral (FT) | Base Llama | Llama (FT) | GPT-4o |
|---|---|---|---|---|---|
| p25 | 0.3608 | 0.7919 | 0.3211 | **0.7922** | 0.6296 |
| p50 | 0.4100 | 0.8231 | 0.3577 | **0.8233** | 0.6830 |
| Mean | 0.4093 | 0.8199 | 0.3607 | **0.8240** | 0.6808 |
| p75 | 0.4561 | 0.8553 | 0.3978 | **0.8554** | 0.7351 |
Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.
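The exact semantic F1 computation is described in the PromptEvals paper; as a rough illustration, the sketch below shows one common embedding-based formulation, where each generated criterion is matched to its most similar reference criterion and vice versa. The sentence-transformers encoder and matching scheme here are assumptions, not necessarily the metric used to produce the numbers above.

```python
# Rough sketch of embedding-based semantic precision/recall/F1 between a set of
# generated criteria and a set of ground-truth criteria. The encoder choice and
# max-similarity matching are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def semantic_f1(generated: list[str], reference: list[str]) -> float:
    gen_emb = encoder.encode(generated, convert_to_tensor=True)
    ref_emb = encoder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)              # (len(generated), len(reference))
    precision = sims.max(dim=1).values.mean().item()   # each generated vs best reference
    recall = sims.max(dim=0).values.mean().item()      # each reference vs best generated
    return 2 * precision * recall / (precision + recall)

print(semantic_f1(
    ["Output must be valid JSON", "Answer must cite the provided context"],
    ["Response should be formatted as JSON", "Only use information from the context"],
))
```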
| | Mistral (FT) | Llama (FT) | GPT-4o |
|---|---|---|---|
| p25 | 1.8717 | 2.3962 | 6.5596 |
| p50 | 2.3106 | 3.0748 | 8.2542 |
| Mean | 2.5915 | 3.6057 | 8.7041 |
| p75 | 2.9839 | 4.2716 | 10.1905 |
Latency for criteria generation. We compared runtimes (in seconds) for all three models and report the 25th, 50th, and 75th percentiles along with the mean. Our fine-tuned Mistral model had the lowest runtime on every metric.
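For reference, latency percentiles like these can be collected by timing each generation call; in the sketch below, `generate_criteria` is a hypothetical stand-in for whichever model is being benchmarked.

```python
# Sketch of collecting per-example generation latency percentiles.
import time
import numpy as np

def time_generations(prompts, generate_criteria):
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_criteria(prompt)  # hypothetical generation call being timed
        latencies.append(time.perf_counter() - start)
    return {
        "p25": float(np.percentile(latencies, 25)),
        "p50": float(np.percentile(latencies, 50)),
        "mean": float(np.mean(latencies)),
        "p75": float(np.percentile(latencies, 75)),
    }
```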
| | Average | Median | 75th percentile | 90th percentile |
|---|---|---|---|---|
| Base Mistral | 14.5012 | 14 | 18.5 | 23 |
| Mistral (FT) | 6.28640 | 5 | 8 | 10 |
| Base Llama | 28.2458 | 26 | 33.5 | 46 |
| Llama (FT) | 5.47255 | 5 | 6 | 9 |
| GPT-4o | 7.59189 | 6 | 10 | 14.2 |
| Ground Truth | 5.98568 | 5 | 7 | 10 |
Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.
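Counting criteria requires parsing a model's output into individual criteria; the sketch below assumes a newline-separated, optionally numbered or bulleted list, which may not match every model's output format.

```python
# Sketch of counting how many criteria a model produced, assuming each
# criterion is on its own line (possibly prefixed with "-", "*", or "1.").
import re

def count_criteria(generated_text: str) -> int:
    lines = [line.strip() for line in generated_text.splitlines()]
    # Strip leading bullets or numbering, then keep non-empty entries.
    criteria = [re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line) for line in lines if line]
    return len([c for c in criteria if c])

print(count_criteria("1. Output must be valid JSON\n2. Response must be under 200 words"))  # -> 2
```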
Evaluation Data: Evaluated on the PromptEvals test set.
Training Data: Fine-tuned on the PromptEvals train set.
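The dataset can be loaded directly from the Hugging Face Hub; the split names below are assumptions, so check the dataset card for the actual schema.

```python
# Sketch of loading the PromptEvals splits used for fine-tuning and evaluation.
from datasets import load_dataset

dataset = load_dataset("user104/PromptEvals")
print(dataset)                                   # shows the available splits
train, test = dataset["train"], dataset["test"]  # assumed split names
print(train[0])                                  # inspect one example's fields
```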
Quantitative Analyses (Unitary results, Intersectional results):
| Domain | Similarity | Precision | Recall |
|---|---|---|---|
| General-Purpose Chatbots | 0.8171 | 0.8023 | 0.8338 |
| Question-Answering | 0.8216 | 0.8183 | 0.8255 |
| Text Summarization | 0.8785 | 0.8863 | 0.8725 |
| Database Querying | 0.8312 | 0.8400 | 0.8234 |
| Education | 0.8599 | 0.8636 | 0.8564 |
| Content Creation | 0.8184 | 0.8176 | 0.8205 |
| Workflow Automation | 0.8304 | 0.8258 | 0.8351 |
| Horse Racing Analytics | 0.8216 | 0.8109 | 0.8336 |
| Data Analysis | 0.7865 | 0.7793 | 0.7952 |
| Prompt Engineering | 0.8534 | 0.8330 | 0.8755 |
Fine-Tuned Mistral Score Averages per Domain (for the 10 most represented domains in our test set)
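Per-domain averages like these can be computed by grouping per-example scores by domain; in the sketch below, the column names and example rows are illustrative assumptions about the evaluation output format.

```python
# Sketch of aggregating per-example scores into per-domain averages.
import pandas as pd

results = pd.DataFrame([
    {"domain": "Question-Answering", "similarity": 0.84, "precision": 0.83, "recall": 0.85},
    {"domain": "Question-Answering", "similarity": 0.80, "precision": 0.81, "recall": 0.80},
    {"domain": "Education", "similarity": 0.86, "precision": 0.87, "recall": 0.85},
])

per_domain = results.groupby("domain")[["similarity", "precision", "recall"]].mean()
top_domains = results["domain"].value_counts().head(10).index  # 10 most represented domains
print(per_domain.loc[per_domain.index.intersection(top_domains)])
```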
Ethical Considerations:
PromptEvals is open-source and is intended to serve as a benchmark for evaluating models' ability to identify and generate assertion criteria for prompts. However, because it is open-source, it may be included in model pre-training data, which can reduce the benchmark's effectiveness.
Additionally, PromptEvals uses prompts contributed by a variety of users, and these prompts may not represent all domains equally.
Despite these limitations, we believe our benchmark still provides value for evaluating models on generating assertion criteria.
Caveats and Recommendations: None