---
license: apache-2.0
datasets:
- ToastyPigeon/roselily-v0
- PocketDoc/Dans-Systemmaxx
- allenai/tulu-3-sft-personas-instruction-following
- ZeusLabs/WizardLM_evol_instruct_fuzzy_dedup_sharegpt
base_model:
- mistralai/Mistral-Small-24B-Base-2501
---
This is a two-stage fine-tune of Mistral Small 24B Base 2501.

Stage 1 was shoving 30M tokens of human-written story content into it using completion training ([ToastyPigeon/ms3-base-roselily](https://huggingface.co/ToastyPigeon/ms3-base-roselily)), which is about half of my WIP Roselily dataset (~60M tokens total).

Stage 2 (this model) was instruct training, using a mix of public instruction-following data and a private instruct dataset from ZeusLabs.

In theory, this model should accept any of the following instruct formats:

**Tekken v7**
```
[SYSTEM_PROMPT]{system prompt}[/SYSTEM_PROMPT][INST]{user message}[/INST]{assistant response}</s>
```
**ChatML**
```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>
```
**Fizzpaca**
```
### System:
{system prompt}

### Instruction:
{user message}

### Response:
{assistant response}</s>
```
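
For convenience, here is a minimal sketch of assembling a single-turn prompt in each of the three templates above. `build_prompt` is a hypothetical helper written for this card, not something shipped with the model:

```python
# Hypothetical helper: assemble a single-turn prompt in each supported format.
def build_prompt(fmt: str, system: str, user: str) -> str:
    if fmt == "tekken":
        return (f"[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT]"
                f"[INST]{user}[/INST]")
    if fmt == "chatml":
        return (f"<|im_start|>system\n{system}<|im_end|>\n"
                f"<|im_start|>user\n{user}<|im_end|>\n"
                f"<|im_start|>assistant\n")
    if fmt == "fizzpaca":
        return (f"### System:\n{system}\n\n"
                f"### Instruction:\n{user}\n\n"
                f"### Response:\n")
    raise ValueError(f"unknown format: {fmt}")

prompt = build_prompt("chatml", "You are a helpful assistant.", "Write a haiku about rain.")
```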

The Tekken tokens were already in the tokenizer; unused special tokens #20 and #21 were repurposed for the ChatML tokens. Fizzpaca did not require any new tokens.

You may need to add both `</s>` and `<|im_end|>` as stop tokens for it to work properly with all formats.
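
Below is a minimal sketch of wiring both stop tokens into a `transformers` generation call. The placeholder model ID and the list form of `eos_token_id` (supported in recent `transformers` releases) are assumptions, not part of this repo:

```python
# Sketch: stop generation on both </s> and the repurposed <|im_end|> token.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-model-repo>"  # replace with this model's Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

stop_ids = [
    tokenizer.eos_token_id,                         # </s>
    tokenizer.convert_tokens_to_ids("<|im_end|>"),  # repurposed ChatML token
]

prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, eos_token_id=stop_ids)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```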