---
license: apache-2.0
base_model: rishiraj/CatPPT-base
language:
- en
tags:
- merge
---

# 🐈🐈🐈🐈 LongCAT - **Elevating Performance with Interwoven Depth UP Scaling!** 🐈🐈🐈🐈 

Introducing "LongCAT" - the purrfect alternative to that other 10.7B Frankenmerger in town! Our long feline friend here was created from rishiraj/CatPPT-base via a passthrough merge, using a new process called Interwoven Depth Up-Scaling, resulting in the longest cat!

We developed the Interwoven Depth Up-Scaling technique. Built on the Mistral architecture, LongCAT incorporates the innovative Interwoven Upstage Depth Up-Scaling. We then interwove Cat 7B weights into the upscaled layers, and finally, did absolutely no extended pre-training.

## The Sauce

All joking aside, this is an attempt to merge Mistral-7B models together more coherently than the typical Undi95/"Depth UP Scaling" technique. The typical approach lays out the front 75% of one model followed by the back 75% of a second model: i.e. [0, 24] + [8, 32] for a 7B merge. When laid out flat, this can be broken down as [0, 8] + [8, 24] + [8, 24] + [24, 32], with a single discrete 16-layer block repeated twice in a row.

This typically is better than laying the entirety of one model out flat, ostensibly because of the locality of the duplicated layers to their original location. Taking this to its logical conclusion, we could theoretically lay out the duplicated layers directly next to each other, maximizing locality.

Also, I picked CatPPT-base because I wanted to make a longcat joke.

```
slices:
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [0, 8]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [8, 9]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [8, 9]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [9, 10]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [9, 10]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [10, 11]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [10, 11]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [11, 12]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [11, 12]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [12, 13]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [12, 13]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [13, 14]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [13, 14]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [14, 15]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [14, 15]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [15, 16]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [15, 16]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [16, 17]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [16, 17]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [17, 18]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [17, 18]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [18, 19]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [18, 19]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [19, 20]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [19, 20]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [20, 21]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [20, 21]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [21, 22]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [21, 22]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [22, 23]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [22, 23]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [23, 24]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [23, 24]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [24, 32]
merge_method: passthrough
dtype: bfloat16
```
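The slice list above follows a simple pattern - keep the first 8 layers once, duplicate each of layers 8-23 back-to-back, keep the last 8 layers once - so it can be generated programmatically. Below is a minimal sketch; the `interwoven_slices` helper is illustrative, not part of mergekit:

```python
MODEL = "rishiraj/CatPPT-base"

def interwoven_slices(front=8, back=24, total=32):
    """Build the mergekit slice list for Interwoven Depth Up-Scaling:
    keep [0, front) once, repeat each single layer in [front, back)
    twice back-to-back, then keep [back, total) once."""
    ranges = [[0, front]]
    for i in range(front, back):
        ranges.append([i, i + 1])
        ranges.append([i, i + 1])  # immediate duplicate, maximizing locality
    ranges.append([back, total])
    return [{"sources": [{"model": MODEL, "layer_range": r}]} for r in ranges]

slices = interwoven_slices()
# Output depth: 8 + 2*16 + 8 = 48 layers, the same depth as other
# SOLAR-style 10.7B stacks.
depth = sum(s["sources"][0]["layer_range"][1] - s["sources"][0]["layer_range"][0]
            for s in slices)
print(depth)  # 48
```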

Don't try to merge this with other 10.7Bs - the layer mismatch will probably create a completely broken model.