longcat-10.7B / README.md
metadata
license: apache-2.0
base_model: rishiraj/CatPPT-base
language:
  - en
tags:
  - merge

🐈🐈🐈🐈 LongCAT - Elevating Performance with Interwoven Depth Up-Scaling! 🐈🐈🐈🐈

Introducing "LongCAT" - the purrfect alternative to that other 10.7B Frankenmerger in town! Our long feline friend here is a passthrough merge of rishiraj/CatPPT-base with itself, built with a new process called Interwoven Depth Up-Scaling, resulting in the longest cat!

We developed the Interwoven Depth Up-Scaling technique. Built on the Mistral architecture, LongCAT incorporates the innovative Interwoven Depth Up-Scaling. We then interwove Cat 7B weights into the upscaled layers, and finally, did absolutely no extended pre-training.

The Sauce

All joking aside, this is an attempt to merge Mistral-7B models together more coherently than the Undi95/"Depth UP Scaling" technique that is typically used. That approach lays out the front 75% of one model and then the back 75% of the second model: i.e. [0, 24] + [8, 32] for a 32-layer 7B merger. When laid out flat, this can be broken down as [0, 8] + [8, 24] + [8, 24] + [24, 32], with the 16-layer block [8, 24] appearing twice in a row.
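To make the layout concrete, here is a minimal sketch that flattens [0, 24] + [8, 32] into a list of source-layer indices and measures how far apart the two copies of each duplicated layer end up. The function names are hypothetical, and the ranges are treated as half-open (end excluded), matching how the layer_range entries in the config further down read.

```python
def typical_depth_upscale(n_layers=32, keep=24):
    """Flat source-layer order for [0, keep] + [n_layers - keep, n_layers].

    Hypothetical helper; ranges are half-open, so [0, 24] means layers 0..23.
    """
    front = list(range(0, keep))                    # layers 0..23
    back = list(range(n_layers - keep, n_layers))   # layers 8..31
    return front + back


def copy_distances(layout):
    """Map each duplicated source layer to the gap between its two positions."""
    positions = {}
    for pos, layer in enumerate(layout):
        positions.setdefault(layer, []).append(pos)
    return {layer: p[1] - p[0] for layer, p in positions.items() if len(p) == 2}


layout = typical_depth_upscale()
gaps = copy_distances(layout)

print(len(layout))            # 48 layers in the merged model
print(min(gaps), max(gaps))   # 8 23: the duplicated source layers
print(set(gaps.values()))     # {16}: every copy sits 16 positions after its original
```

Under these assumptions, each of the 16 duplicated layers reappears a full 16 positions later in the stack.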

This typically works better than laying the entirety of one model out flat, ostensibly because the duplicated layers stay close to their original location. Taking this to its logical conclusion, we could theoretically lay out each duplicated layer directly next to its original, maximizing locality.
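The "logical conclusion" can be sketched in a few lines: walk the 32 source layers in order and emit each layer in the middle block twice, back to back. The function name and its defaults (duplicating layers 8 through 23) are assumptions chosen to match the config that follows, not code taken from the actual merge.

```python
def interwoven_layout(n_layers=32, dup_start=8, dup_end=24):
    """Flat source-layer order with each layer in [dup_start, dup_end) doubled.

    Hypothetical helper illustrating the interwoven idea.
    """
    layout = []
    for layer in range(n_layers):
        layout.append(layer)
        if dup_start <= layer < dup_end:
            layout.append(layer)  # second copy directly after the first
    return layout


layout = interwoven_layout()

print(len(layout))     # 48, the same depth as the typical layout
print(layout[6:14])    # [6, 7, 8, 8, 9, 9, 10, 10]
```

Every duplicated layer now sits one position after its original, which is the smallest gap possible.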

Also, I picked CatPPT-base because I wanted to make a longcat joke.

slices:
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [0, 8]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [8, 9]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [8, 9]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [9, 10]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [9, 10]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [10, 11]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [10, 11]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [11, 12]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [11, 12]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [12, 13]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [12, 13]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [13, 14]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [13, 14]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [14, 15]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [14, 15]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [15, 16]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [15, 16]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [16, 17]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [16, 17]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [17, 18]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [17, 18]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [18, 19]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [18, 19]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [19, 20]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [19, 20]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [20, 21]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [20, 21]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [21, 22]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [21, 22]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [22, 23]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [22, 23]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [23, 24]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [23, 24]
  - sources:
    - model: rishiraj/CatPPT-base
      layer_range: [24, 32]
merge_method: passthrough
dtype: bfloat16
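
Since the config above is highly repetitive, it could also be generated rather than written by hand. The following is a rough sketch, not the script used for this model: the helper names are made up for illustration, and the output is built with plain string formatting so it needs no YAML library.

```python
def longcat_slices(n_layers=32, dup_start=8, dup_end=24):
    """Return [start, end] layer ranges in the order the config lists them."""
    ranges = [[0, dup_start]]                      # untouched front block
    for layer in range(dup_start, dup_end):
        # each middle layer becomes two single-layer slices in a row
        ranges += [[layer, layer + 1], [layer, layer + 1]]
    ranges.append([dup_end, n_layers])             # untouched back block
    return ranges


def render_config(model, ranges):
    """Render the slice list in the same YAML shape as the config above."""
    lines = ["slices:"]
    for start, end in ranges:
        lines += [
            "  - sources:",
            f"    - model: {model}",
            f"      layer_range: [{start}, {end}]",
        ]
    lines += ["merge_method: passthrough", "dtype: bfloat16"]
    return "\n".join(lines) + "\n"


ranges = longcat_slices()
config_text = render_config("rishiraj/CatPPT-base", ranges)

print(len(ranges))                                 # 34 slices
print(sum(end - start for start, end in ranges))   # 48 layers in total
```

The 34 slices add up to 48 layers, the same depth as the usual [0, 24] + [8, 32] layout.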

Don't try to merge this with other 10.7Bs - the layer mismatch will probably create a mangled model.
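
One way to read "layer mismatch", offered here as an interpretation rather than a measured result, is that the same position in two 48-layer stacks can originate from different source layers. This sketch compares the interwoven order against the typical [0, 24] + [8, 32] order position by position. Both layouts are assumptions reconstructed from the description above.

```python
# Typical layout: layers 0..23 followed by layers 8..31 (half-open ranges).
typical = list(range(0, 24)) + list(range(8, 32))

# Interwoven layout: layers 8..23 each appear twice in a row.
longcat = []
for layer in range(32):
    longcat.append(layer)
    if 8 <= layer < 24:
        longcat.append(layer)

# Positions where the two stacks draw from different source layers.
mismatched = [
    pos for pos, (a, b) in enumerate(zip(typical, longcat)) if a != b
]

print(len(mismatched), "of", len(longcat))   # 30 of 48
print(mismatched[0], mismatched[-1])         # 9 38
```

Under this reading, only the first nine and last nine positions line up, so a position-wise merge would blend weights from unrelated depths across most of the stack.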