
This model is a Mistral model fine-tuned on the training set of PromptEvals (https://huggingface.co/datasets/user104/PromptEvals) to generate high-quality assertion criteria for prompt templates.

Model Card:

Model Details
- Person or organization developing model: MistralAI (base model); fine-tuned by [Redacted for submission]
- Model date: Base model released in September 2023, fine-tuned in July 2023
- Model version: version 3
- Model type: decoder-only Transformer
- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 7.3 billion parameters, fine-tuned by us using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)
- Paper or other resource for more information: https://arxiv.org/abs/2310.06825
- Citation details: Redacted for submission
- License: Apache 2.0
- Where to send questions or comments about the model: [Redacted for submission]
Intended Use. Use cases that were envisioned during development. (Primary intended uses, Primary intended users, Out-of-scope use cases)
Intended to be used by developers to generate high quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs in generating these assertion criteria.
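
For illustration, here is a minimal usage sketch with the Hugging Face `transformers` library. The repository id and the instruction wording are placeholders (the exact fine-tuning prompt format is not documented in this card), so treat this as an assumed pattern rather than the definitive interface.

```python
# Minimal usage sketch. Assumptions: the repo id below is a placeholder and the
# instruction wording may not match the actual fine-tuning prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/mistral-prompteval-ft"  # hypothetical repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_template = (
    "Summarize the following support ticket in at most three bullet points, "
    "and respond only in JSON with keys 'summary' and 'priority'.\n\n{ticket}"
)

messages = [{
    "role": "user",
    "content": f"Generate assertion criteria for this prompt template:\n{prompt_template}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```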
Factors. Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3.
We do not collect demographic, phenotypic, or any of the other data listed in Section 4.3 in our dataset.
Metrics. Metrics should be chosen to reflect potential real-world impacts of the model. (Model performance measures, Decision thresholds, Variation approaches)

|      | Base Mistral | Mistral (FT) | Base Llama | Llama (FT) | GPT-4o |
|------|--------------|--------------|------------|------------|--------|
| p25  | 0.3608       | 0.7919       | 0.3211     | **0.7922** | 0.6296 |
| p50  | 0.4100       | 0.8231       | 0.3577     | **0.8233** | 0.6830 |
| Mean | 0.4093       | 0.8199       | 0.3607     | **0.8240** | 0.6808 |
| p75  | 0.4561       | 0.8553       | 0.3978     | **0.8554** | 0.7351 |

Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.
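
For context on how such a score can be computed, the sketch below is an assumed approximation: it matches generated criteria to ground-truth criteria by embedding similarity and derives precision, recall, and F1 from the matches. The embedding model, the greedy threshold matching, and the 0.7 cutoff are all assumptions for illustration, not the paper's exact procedure.

```python
# Assumed approximation of a semantic F1 metric over assertion criteria.
# The actual matching rule, embedding model, and threshold may differ from the paper.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_f1(generated, reference, threshold=0.7):
    gen_emb = encoder.encode(generated, convert_to_tensor=True)
    ref_emb = encoder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)  # shape: (num_generated, num_reference)

    # A generated criterion counts toward precision if it is close to some reference
    # criterion; a reference criterion counts toward recall if some generated one is close.
    precision = (sims.max(dim=1).values >= threshold).float().mean().item()
    recall = (sims.max(dim=0).values >= threshold).float().mean().item()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(semantic_f1(
    ["Output must be valid JSON", "Summary has at most three bullet points"],
    ["Response is formatted as JSON", "Limit the summary to three bullets"],
))
```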

|      | Mistral (FT) | Llama (FT) | GPT-4o  |
|------|--------------|------------|---------|
| p25  | 1.8717       | 2.3962     | 6.5596  |
| p50  | 2.3106       | 3.0748     | 8.2542  |
| Mean | 2.5915       | 3.6057     | 8.7041  |
| p75  | 2.9839       | 4.2716     | 10.1905 |

Latency for criteria generation. We compared the runtimes for all 3 models (in seconds) and included the 25th, 50th, and 75th percentile along with the mean. We found that our fine-tuned Mistral model had the lowest runtime for all metrics.

|              | Average | Median | 75th percentile | 90th percentile |
|--------------|---------|--------|-----------------|-----------------|
| Base Mistral | 14.5012 | 14     | 18.5            | 23              |
| Mistral (FT) | 6.28640 | 5      | 8               | 10              |
| Base Llama   | 28.2458 | 26     | 33.5            | 46              |
| Llama (FT)   | 5.47255 | 5      | 6               | 9               |
| GPT-4o       | 7.59189 | 6      | 10              | 14.2            |
| Ground Truth | 5.98568 | 5      | 7               | 10              |

Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.

Evaluation Data: Evaluated on PromptEvals test set
Training Data: Fine-tuned on PromptEvals train set
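
Both splits can be pulled from the Hub with the `datasets` library. The split names below (`train`/`test`) are assumed from the standard layout and should be checked against the dataset card.

```python
# Load the PromptEvals splits used for fine-tuning and evaluation.
# Split names are assumed; check the dataset card for the exact configuration.
from datasets import load_dataset

dataset = load_dataset("user104/PromptEvals")
train_split = dataset["train"]  # used for fine-tuning
test_split = dataset["test"]    # used for evaluation
print(train_split)
```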

Quantitative Analyses (Unitary results, Intersectional results):

| Domain                   | Similarity | Precision | Recall |
|--------------------------|------------|-----------|--------|
| General-Purpose Chatbots | 0.8171     | 0.8023    | 0.8338 |
| Question-Answering       | 0.8216     | 0.8183    | 0.8255 |
| Text Summarization       | 0.8785     | 0.8863    | 0.8725 |
| Database Querying        | 0.8312     | 0.8400    | 0.8234 |
| Education                | 0.8599     | 0.8636    | 0.8564 |
| Content Creation         | 0.8184     | 0.8176    | 0.8205 |
| Workflow Automation      | 0.8304     | 0.8258    | 0.8351 |
| Horse Racing Analytics   | 0.8216     | 0.8109    | 0.8336 |
| Data Analysis            | 0.7865     | 0.7793    | 0.7952 |
| Prompt Engineering       | 0.8534     | 0.8330    | 0.8755 |

Fine-Tuned Mistral Score Averages per Domain (for the 10 most represented domains in our test set)

Ethical Considerations:
PromptEvals is open source and is intended as a benchmark for evaluating models' ability to identify and generate assertion criteria for prompts. Because it is open source, however, it may end up in models' pre-training data, which can reduce the benchmark's effectiveness. Additionally, PromptEvals uses prompts contributed by a variety of users, and these prompts may not represent all domains equally. Despite these limitations, we believe the benchmark still provides value for evaluating models on generating assertion criteria.
Caveats and Recommendations: None
