SanjiWatsuki committed on
Commit b7b8fc0 • 1 Parent(s): e26a7bf

Update README.md

Files changed (1): README.md +129 -0

README.md CHANGED

---
license: apache-2.0
base_model: rishiraj/CatPPT-base
language:
- en
tags:
- merge
---

# 🐈🐈🐈🐈 LongCAT - **Elevating Performance with Interwoven Depth UP Scaling!** 🐈🐈🐈🐈

Introducing "LongCAT" - the purrfect alternative to that other 10.7B frankenmerge in town! Our long feline friend here was created by passthrough-merging rishiraj/CatPPT-base with itself using a new process called Interwoven Depth Up-Scaling, resulting in the longest cat!

We developed the Interwoven Depth Up-Scaling technique. Built on the Mistral architecture, LongCAT incorporates the innovative Interwoven Upstage Depth Up-Scaling. We then interwove Cat 7B weights into the upscaled layers, and finally, did absolutely no extended pre-training.

## The Sauce

All joking aside, this is an attempt to merge Mistral-7B models together more coherently than the Undi95-style "Depth UP Scaling" technique that is typically used. The usual approach is to lay out the front 75% of one model and then place the back 75% of a second model after it, i.e. [0, 24] + [8, 32] for a 7B merge (sketched below). Laid out flat, this breaks down as [0, 8] + [8, 24] + [8, 24] + [24, 32], with the same discrete 16-layer block ([8, 24]) appearing twice in a row.
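
For reference, that typical layout corresponds to a mergekit passthrough config along these lines (a sketch only; the model names below are placeholders, not models involved in this merge):

```
# Typical "Depth UP Scaling" layout: front 24 layers of model A, back 24 layers of model B.
# Model names are placeholders.
slices:
  - sources:
    - model: someuser/model-a-7b   # placeholder
      layer_range: [0, 24]
  - sources:
    - model: someuser/model-b-7b   # placeholder
      layer_range: [8, 32]
merge_method: passthrough
dtype: bfloat16
```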

This typically works better than laying the entirety of one model out flat, ostensibly because the duplicated layers stay close to their original locations. Taking this to its logical conclusion, we could theoretically lay the duplicated layers directly next to each other, maximizing locality - which is exactly what the config below does: every layer from 8 through 23 is duplicated back-to-back.

Also, I picked CatPPT-base because I wanted to make a longcat joke.

```
slices:
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [0, 8]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [8, 9]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [8, 9]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [9, 10]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [9, 10]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [10, 11]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [10, 11]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [11, 12]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [11, 12]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [12, 13]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [12, 13]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [13, 14]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [13, 14]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [14, 15]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [14, 15]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [15, 16]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [15, 16]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [16, 17]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [16, 17]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [17, 18]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [17, 18]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [18, 19]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [18, 19]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [19, 20]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [19, 20]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [20, 21]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [20, 21]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [21, 22]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [21, 22]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [22, 23]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [22, 23]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [23, 24]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [23, 24]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [24, 32]
merge_method: passthrough
dtype: bfloat16
```
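
The config above is in mergekit's YAML format; with mergekit installed, an invocation along the lines of `mergekit-yaml longcat.yml ./LongCAT` should reproduce the merge (the config filename and output directory here are just placeholders).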

Don't try to merge this with other 10.7Bs - the layer mismatch will probably create a completely broken model.