|
--- |
|
language: |
|
- en |
|
- de |
|
license: apache-2.0 |
|
tags: |
|
- text-generation-inference |
|
- transformers |
|
- unsloth |
|
- llama |
|
- trl |
|
- orpo |
|
base_model: cstr/phi-3-orpo-v8_16 |
|
--- |
|
|
|
# Model details |
|
|
|
This is a quick experiment on llamafied phi-3 with only 1000 orpo steps from an azureml translated german orca binarized-dataset (johannhartmann/mistralorpo), with original phi-3 prompt template. The immediate result is not really good, but also not bad enough to disencourage further experiments. |
|
|
|
# Benchmark results |
|
|
|
This was an experiment on a german dataset snippet which, as expected, worsened results on english benchmarks: |
|
|
|
| Metric |Value| |
|
|---------------------------------|----:| |
|
|Avg. |64.40| |
|
|AI2 Reasoning Challenge (25-Shot)|60.41| |
|
|HellaSwag (10-Shot) |78.37| |
|
|MMLU (5-Shot) |65.26| |
|
|TruthfulQA (0-shot) |49.76| |
|
|Winogrande (5-shot) |70.24| |
|
|GSM8k (5-shot) |62.32| |
|
|
|
On german EQ-Bench (v2_de) 51.82 (insignificant over 51.41 for original llamafied but significantly better than intermediate cstr/phi-3-orpo-v8_16 which after initial 150 test steps achieved 46.38) but with still only 164/171 correctly parsed. |
|
|
|
Note: We can improve the correctness of parsing, i.a., by only a few SFT steps, as shown with cas/phi3-mini-4k-llamafied-sft-v3 (170/171 correct but with then only 39.46 score in v2_de, which was also an experiment in changing the prompt template). |
|
All that was quickly done with bnb and q4 quants only, which might, in theory, affect especially such small dense models significantly. |
|
But it served the intention for both proof-of-concept-experiments at least. Probably it would easily be possible to further improve results, but that would take some time and compute. |
|
|
|
# Training setup |
|
|
|
This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library. |
|
|