HiroseKoichi committed
Commit: b51d891
Parent: 94b39c0

Create README.md

---
license: llama3
library_name: transformers
tags:
- nsfw
- not-for-all-audiences
- llama-3
- text-generation-inference
---

# Llama-Salad-4x8B
This is a MoE merge of several Llama-3 models that aims to create a solid role-play model without sacrificing logic and reasoning. Meta-Llama-3-8B-Instruct was already quite good at role-play, but the overuse of `*wink*` and `*giggle*`, along with the tiptoeing around NSFW topics by spamming ellipses, really started to piss me off. L3-8B-Stheno-v3.1 is definitely the best role-play model of all the Llama-3 fine-tunes so far; however, it's a little *too* horny for my liking, it has a hard time keeping track of what's going on (like all other role-play fine-tunes), and it has a serious issue with over-responding: there's a high chance you'll get a response of four or more paragraphs to something basic.

Combining the two models kept both of their strengths and none of their weaknesses: it completely flattened out the over-responding issue, improved logic and reasoning, and toned down the horniness a bit. Don't be mistaken, though; this hasn't affected its ability to generate NSFW content; it's just no longer overwhelmingly horny. It's the first model I've used, aside from base L3-8B-Stheno-v3.1, that actually treats sexual content as normal instead of as something esoteric and shameful.

While role-play was the main focus of this merge, its base capabilities weren't affected at all, so there's no need to swap models for other tasks unless you require a bigger model. In fact, with the addition of Tess-2.0-Llama-3-8B, I found a small overall improvement. There isn't any particular reason Llama3-OpenBioLLM-8B is in the merge; I needed a fourth model, it seemed like a high-quality fine-tune, and its knowledge of anatomy might come in handy for role-play.

Unfortunately, I can't compare it with 70B models because they're too slow on my machine, but this is the best sub-70B model I have used so far; I haven't felt the need to regenerate a single response, which I can't say of any other model. This is my first attempt at any kind of merge, and I want to share what I've learned, but this section is already longer than I intended, so I've placed the rest at the bottom of the page.

# Details
- **License**: [llama3](https://llama.meta.com/llama3/license/)
- **Instruct Format**: [llama-3](https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3/) (see the loading sketch after this list)
- **Context Size**: 8K

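As a quick reference for the format listed above, here is a minimal, hedged sketch of loading the model with transformers and letting the tokenizer apply the llama-3 chat template. The repo id, the sampling settings, and the assumption of a bfloat16-capable GPU are mine, not values stated on this card:

```python
# Minimal loading sketch; the repo id below is assumed, not taken from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiroseKoichi/Llama-Salad-4x8B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The llama-3 instruct format is applied by the tokenizer's chat template.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the plot of Hamlet in two sentences."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Llama-3 uses <|eot_id|> as an end-of-turn marker; include it as a stop token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

# Keep prompt plus response within the 8K context window listed above.
output = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```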
## Models Used
- [Meta-Llama-3-8B-Instruct](https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct)
- [Tess-2.0-Llama-3-8B](https://huggingface.co/migtissera/Tess-2.0-Llama-3-8B)
- [L3-8B-Stheno-v3.1](https://huggingface.co/Sao10K/L3-8B-Stheno-v3.1)
- [Llama3-OpenBioLLM-8B](https://huggingface.co/aaditya/Llama3-OpenBioLLM-8B)

## Merge Config
```yaml
base_model: NousResearch/Meta-Llama-3-8B-Instruct
gate_mode: hidden
dtype: bfloat16
experts_per_token: 2
experts:
  - source_model: NousResearch/Meta-Llama-3-8B-Instruct
    positive_prompts:
      - "summarize"
      - "explain"
      - "define"
      - "translate"
      - "multilingual"
      - "chat"
      - "conversation"
  - source_model: migtissera/Tess-2.0-Llama-3-8B
    positive_prompts:
      - "programming language"
      - "math"
      - "code"
      - "step-by-step"
      - "logic"
  - source_model: Sao10K/L3-8B-Stheno-v3.1
    positive_prompts:
      - "role-play"
      - "characters"
      - "narration"
      - "story writing"
      - "scene"
  - source_model: aaditya/Llama3-OpenBioLLM-8B
    positive_prompts:
      - "anatomy"
      - "diagnosis"
      - "symptom"
      - "biomedical"
      - "health"
      - "medicine"
      - "medication"
      - "physiology"
```

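A config in this shape is what mergekit's MoE merge script consumes. As a rough sketch only (the file names, the output directory, and the `mergekit-moe` entry point are assumptions about mergekit, not steps documented on this card), a merge like this could be driven as follows:

```python
# Hedged sketch: assumes mergekit is installed and exposes the `mergekit-moe`
# console script; file and directory names here are made up.
import subprocess
from pathlib import Path

config_path = Path("llama-salad-4x8b.yaml")  # the YAML config above, saved to disk
output_dir = Path("Llama-Salad-4x8B")        # where the merged MoE model is written

# Invocation shape: mergekit-moe <config> <output-dir>; run
# `mergekit-moe --help` to see the flags your mergekit version supports.
subprocess.run(["mergekit-moe", str(config_path), str(output_dir)], check=True)
```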
# What I Have Learned
After testing over a hundred different configurations, I have concluded that the name "Mixture of Experts" is misleading for merges: despite the name, using only domain-specific models leads to an overall worse model. With `experts_per_token: 2`, two of the merged models contribute to the prediction of every single token; if every model in the merge is domain-specific, then a model from a completely unrelated domain is always contributing to the prediction. Fine-tuning a model for a specific task degrades its quality on other tasks, so a merge built entirely from narrow specialists makes that degradation more prominent.

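To make the two-models-per-token point concrete, here is a small illustrative sketch of top-2 routing in PyTorch. It is not mergekit or Llama code, and the dimensions are made up; it only shows that exactly two experts are mixed for every token, whether or not their domains are relevant:

```python
# Illustrative top-2 MoE routing; dimensions and weights are made up.
import torch

num_experts = 4          # one per source model in the merge above
experts_per_token = 2    # matches `experts_per_token: 2` in the config
hidden_size = 16

torch.manual_seed(0)
hidden_state = torch.randn(hidden_size)           # one token's hidden state
gate = torch.nn.Linear(hidden_size, num_experts)  # router for this MoE layer

# Score all experts, keep the top two, and renormalize their weights.
scores = torch.softmax(gate(hidden_state), dim=-1)
weights, chosen = torch.topk(scores, experts_per_token)
weights = weights / weights.sum()

# Exactly two experts are mixed for this token, regardless of their domains.
print(f"chosen experts: {chosen.tolist()}, mixing weights: {weights.tolist()}")

# expert_outputs[i] stands in for expert i's FFN output for this token.
expert_outputs = torch.randn(num_experts, hidden_size)
token_output = (weights[:, None] * expert_outputs[chosen]).sum(dim=0)
```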
The solution I have found is quite simple: make sure that a general-purpose model is always chosen as one of the prediction models. I initially thought that adding a general-purpose model to the prediction pool would reduce the effectiveness of the domain-specific models, but in practice it keeps the strengths of both with none of their weaknesses. I can't really judge whether this extends to non-creative domains, such as coding, but for role-play, Llama-Salad-4x8B is proof that it works.

It might be tempting to mix models that use different instruct formats, but this should be avoided as much as possible. Using the wrong instruct format degrades a model's output quality; merging in a domain-specific model that uses a different instruct format from the rest will still improve output in that domain, but in a degraded form. Merging models with different instruct formats can be useful if there's no other option, but for the best results, stick with models that share the same instruct format.

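A quick way to check this before merging (my own suggestion, not a step from this card) is to render the same conversation with each candidate model's tokenizer and compare the results; differing strings mean differing instruct formats:

```python
# Hedged sketch: compare the prompt each candidate tokenizer renders for the
# same conversation; identical strings mean identical instruct formats.
from transformers import AutoTokenizer

candidates = [
    "NousResearch/Meta-Llama-3-8B-Instruct",
    "Sao10K/L3-8B-Stheno-v3.1",
]
messages = [{"role": "user", "content": "Hello!"}]

for repo in candidates:
    tok = AutoTokenizer.from_pretrained(repo)
    prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print(f"--- {repo} ---\n{prompt}")
```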
These are just my own observations from my experiments, so I could very well be wrong, but given how well Llama-Salad-4x8B turned out, the approach is definitely worth further testing.