---
license: mit
---

### SuperHOT Prototype 2 w/ 4-8K Context

This is the second prototype of SuperHOT, this time with 4K context and no RLHF. In my testing it holds up all the way to 6K without breaking down, and I made the change with the intention of reaching 8K, so I expect it to work at 8K even though I only trained on 4K sequences.

#### Looking for Merged & Quantized Models?
- 13B 8K GGML: [tmpupload/superhot-13b-8k-no-rlhf-test-GGML](https://huggingface.co/tmpupload/superhot-13b-8k-no-rlhf-test-GGML)
- 13B 8K CUDA (no groupsize): [tmpupload/superhot-13b-8k-no-rlhf-test-GPTQ](https://huggingface.co/tmpupload/superhot-13b-8k-no-rlhf-test-GPTQ)
- 13B 8K CUDA 32g: [tmpupload/superhot-13b-8k-no-rlhf-test-32g-GPTQ](https://huggingface.co/tmpupload/superhot-13b-8k-no-rlhf-test-32g-GPTQ) 

#### Using the monkey-patch?
You will **NEED** to **apply the monkeypatch**, or, if you are already using the monkeypatch, **change the scaling factor to 0.25 and the maximum sequence length to 8192**.

#### Using Oobabooga with Exllama?
- `python server.py --max_seq_len 8192 --compress_pos_emb 4 --loader exllama_hf`

To use the 8K context, you will need to apply the monkeypatch I have added to this repo or follow the instructions above for oobabooga's text-generation-webui -- **without it, it will not work**.

I will repeat: **Without the patch with the correct scaling and max sequence length, it will not work!**

The patch is very simple, and you can make the changes yourself (see the sketch after this list):
- Increase `max_position_embeddings` to 8192 to stretch the sinusoidal encoding
- Stretch the frequency steps by a scale of `0.25`
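
The core of the change looks roughly like this (a minimal sketch of the idea, not the exact monkeypatch shipped in this repo; `build_rope_cache` and its arguments are illustrative names):

```python
import torch

def build_rope_cache(max_seq_len=8192, dim=128, base=10000.0, scale=0.25):
    # Standard RoPE inverse frequencies
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # The only change: stretch the position indices by the scaling factor,
    # so 8192 interpolated positions fit inside the original 2048 range
    t = torch.arange(max_seq_len).float() * scale
    freqs = torch.outer(t, inv_freq)         # (max_seq_len, dim // 2)
    emb = torch.cat((freqs, freqs), dim=-1)  # (max_seq_len, dim)
    return emb.cos(), emb.sin()

cos, sin = build_rope_cache()
# cos/sin are then used in the usual rotary attention computation
```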

The intuition is to keep the model within the learned positions of the pre-trained model, since the model may be overfit to the token-position relationship (not my idea; it comes from [Ofir Press](https://ofir.io/)). By interpolating the encodings, we stay within the bounds of the pre-trained model (working with the overfitting rather than against it): for example, with a 0.25 scale, position 8000 maps to an effective position of 2000, inside the original 2048-token range. The monkeypatch will work on the pre-trained model without fine-tuning, but you will need to fine-tune, as the results are not that good without it.

It can probably be pushed even further with a few other modifications I am testing (swapping softmax for ReLU, increasing the head dimension).

In my testing I tried random positional encoding, but I was not able to replicate the results of [Jianlin Su](https://kexue.fm/archives/9444), so maybe I implemented it incorrectly. I also tried shifted positions, log-n scaling, log-sigmoid, and increasing the head dimension, but this dilated RoPE (DoPE :) ) is the only one that worked for me consistently. Note that these are all based on fine-tuning, since the goal is to extend the context of the pre-trained model; pre-training would paint a different picture.

I trained the LoRA with the following configuration (a PEFT-style sketch follows the list):
- 1200 samples (~400 of them longer than 2048 tokens)
- learning rate of 3e-4 
- 3 epochs
- The exported modules are:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - all bias
- Rank = 2
- Alpha = 8
- no dropout
- weight decay of 0.1
- AdamW with beta1 of 0.9, beta2 of 0.99, and epsilon of 1e-5
- Trained on 4-bit base model
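
For reference, the settings above map roughly onto the following PEFT / transformers configuration (a sketch under the stated hyperparameters; the output directory and the surrounding training loop are assumptions, not the exact script used here):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration matching the list above
lora_config = LoraConfig(
    r=2,
    lora_alpha=8,
    lora_dropout=0.0,
    bias="all",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Optimizer and schedule settings matching the list above
training_args = TrainingArguments(
    output_dir="superhot-8k-lora",  # illustrative path
    learning_rate=3e-4,
    num_train_epochs=3,
    weight_decay=0.1,
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-5,
)

# The base model is loaded in 4 bit (e.g. via bitsandbytes) before the
# LoRA adapters are attached and trained.
```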