---
datasets:
- lars1234/story_writing_benchmark
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
---

# Mistral-Small-24B-Instruct-2501-writer

Mistral-Small-24B-Instruct-2501-writer is a fine-tuned version of `mistralai/Mistral-Small-24B-Instruct-2501`, optimized specifically for creative writing tasks.

## Performance

The following table was generated by creating 568 stories based on the same prompts as in the [lars1234/story_writing_benchmark](https://huggingface.co/datasets/lars1234/story_writing_benchmark) dataset and then evaluating them using the benchmark's evaluator models.

| Model | Average | Grammar & Spelling | Clarity | Logical Connection | Scene Construction | Internal Consistency | Character Consistency | Character Motivation | Sentence Variety | Avoiding Clichés | Natural Dialogue | Avoiding Tropes | Character Depth | Character Interactions | Reader Interest | Plot Resolution |
|-------|---------|-------------------|---------|-------------------|-------------------|---------------------|----------------------|---------------------|-----------------|----------------|-----------------|----------------|----------------|----------------------|----------------|-----------------|
| Mistral-2501 | 49.3% | 82.1% | 63.0% | 57.7% | 56.1% | 67.2% | 50.7% | 44.6% | 57.7% | 24.6% | 42.9% | 28.6% | 35.7% | 45.0% | 54.1% | 35.3% |
| Mistral-Writer | **56.5%** | 83.3% | 64.1% | 64.1% | 62.0% | 73.1% | 54.0% | **49.8%** | **64.4%** | **33.3%** | **51.9%** | 37.4% | **46.4%** | **52.0%** | **63.1%** | **45.3%** |
| Gemma-Ataraxy | 56.1% | **88.8%** | **65.8%** | **66.0%** | **64.1%** | **75.1%** | **54.3%** | 49.2% | 64.0% | 31.2% | 48.3% | **40.0%** | 45.4% | 51.7% | 63.0% | 44.9% |

Mistral-Small-24B-Instruct-2501-writer outperforms the base Mistral model on every metric. Gemma-2-Ataraxy still leads in several categories, for example "Avoiding Tropes."

## DPO Dataset Creation

The model was fine-tuned using Direct Preference Optimization (DPO), which requires pairs of responses where one is preferred over the other. The pairs were created from the [lars1234/story_writing_benchmark](https://huggingface.co/datasets/lars1234/story_writing_benchmark) dataset using two approaches:

### 1. Language-Based Pairs
- **Correct vs. Incorrect Language**: For prompts requesting stories in specific languages (English, Spanish, or German), we identified cases where models incorrectly generated text in the wrong language.
- **Verification Process**: fast_langdetect was used to automatically verify each story's language, keeping only detections with confidence ≥ 0.8.
- **Pair Creation**: Stories with correctly detected language were paired as "chosen" against stories with incorrectly detected language as "rejected" for the same prompt.
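
The steps above can be sketched as follows. `make_language_pairs` and its inputs are illustrative names, and the `detect` callable stands in for the real language detector (fast_langdetect in the actual pipeline); this is a simplified sketch of the approach, not the benchmark's code.

```python
from collections import defaultdict

def make_language_pairs(stories, detect, threshold=0.8):
    """Pair correct-language stories ("chosen") against wrong-language
    stories ("rejected") generated for the same prompt.

    `stories` is a list of dicts with "prompt", "text", and "target_lang";
    `detect(text)` returns a (language_code, confidence) tuple.
    """
    by_prompt = defaultdict(lambda: ([], []))  # prompt -> (correct, wrong)
    for s in stories:
        lang, confidence = detect(s["text"])
        if confidence < threshold:
            continue  # discard low-confidence detections entirely
        bucket = 0 if lang == s["target_lang"] else 1
        by_prompt[s["prompt"]][bucket].append(s["text"])

    pairs = []
    for prompt, (correct, wrong) in by_prompt.items():
        # one preference pair per (correct, wrong) combination, zipped 1:1
        for chosen, rejected in zip(correct, wrong):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```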

### 2. Quality-Based Pairs
- **Quality Scoring**: For stories with correctly detected language, we calculated quality differences based on four metrics:
  - q1: Grammar and spelling
  - q11: Avoiding tropes
  - q12: Character depth
  - q14: Reader interest
- **Minimum Threshold**: Only story pairs with a quality difference of at least 0.4 (on a 1-5 scale) were considered.
- **Greedy Selection**: The highest-rated story was selected as "chosen" and paired with a lower-rated story as "rejected" for the same prompt.
- **Uniqueness**: Each story was used in at most one pair.
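
A minimal sketch of this greedy selection, assuming per-story scores are stored under the metric keys above (function and field names are illustrative, not the actual implementation):

```python
def make_quality_pairs(stories, metrics=("q1", "q11", "q12", "q14"), min_diff=0.4):
    """Greedily pair high-scoring stories against lower-scoring ones
    for the same prompt; each story is used in at most one pair.

    `stories` is a list of dicts with "prompt", "text", and per-metric
    scores on a 1-5 scale.
    """
    def quality(s):
        return sum(s[m] for m in metrics) / len(metrics)

    by_prompt = {}
    for s in stories:
        by_prompt.setdefault(s["prompt"], []).append(s)

    pairs = []
    for prompt, group in by_prompt.items():
        group = sorted(group, key=quality, reverse=True)  # best first
        used = set()
        for i, high in enumerate(group):
            if i in used:
                continue
            # pair the best remaining story with the lowest-rated story
            # that is at least min_diff below it
            for j in range(len(group) - 1, i, -1):
                if j in used:
                    continue
                if quality(high) - quality(group[j]) >= min_diff:
                    pairs.append({"prompt": prompt,
                                  "chosen": high["text"],
                                  "rejected": group[j]["text"]})
                    used.update({i, j})
                    break
    return pairs
```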

The final JSONL dataset contained these pairs in the format:
```json
{"prompt": "Write a story about...", "chosen": "High quality story text...", "rejected": "Lower quality story text..."}
```

See [this script](https://github.com/lars76/story-evaluation-llm/blob/main/create_dpo_pairs.py) for the code.

## Training Methodology

The model was fine-tuned using Axolotl with the following parameters:

- **Base Model**: mistralai/Mistral-Small-24B-Instruct-2501
- **Adapter**: LoRA with r=16, alpha=32
- **DPO Beta**: 0.1
- **Learning Rate**: 1e-4
- **Optimizer**: AdamW with cosine scheduler
- **Training Epochs**: 1
- **Gradient Accumulation Steps**: 4
- **Micro Batch Size**: 2
- **Sequence Length**: 2048
- **Quantization**: 4-bit
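
For illustration, these settings roughly correspond to an Axolotl config fragment like the one below. This is a sketch, not the actual training config: field names follow Axolotl's schema as commonly documented, but should be checked against the Axolotl documentation for the version used.

```yaml
base_model: mistralai/Mistral-Small-24B-Instruct-2501
load_in_4bit: true

adapter: lora
lora_r: 16
lora_alpha: 32

rl: dpo
# DPO beta (0.1) is set via Axolotl's RL options; the exact key is omitted here.

learning_rate: 1e-4
optimizer: adamw_torch
lr_scheduler: cosine
num_epochs: 1
gradient_accumulation_steps: 4
micro_batch_size: 2
sequence_len: 2048
```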

## Inference Parameters

A grid search was performed on inference parameters to find optimal generation settings:
- **min_p**: 0.05 (fixed)
- **temperature**: 0.5, 0.75, 1.0, 1.25
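
The search amounts to scoring stories generated at each temperature and keeping the best setting. In this sketch, `generate` and `score` are hypothetical stand-ins for the story-generation and evaluator pipeline:

```python
def grid_search(generate, score, temperatures=(0.5, 0.75, 1.0, 1.25),
                min_p=0.05, n_samples=8):
    """Average evaluator scores over n_samples stories per temperature
    and return (best_temperature, scores_by_temperature)."""
    results = {}
    for t in temperatures:
        stories = [generate(temperature=t, min_p=min_p) for _ in range(n_samples)]
        results[t] = sum(score(s) for s in stories) / len(stories)
    best = max(results, key=results.get)
    return best, results
```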

The most significant quality improvement was observed when increasing temperature from 0.5 to 0.75. Beyond this point, other quality aspects began to suffer.
|