Can't run the model in tabbyAPI

#2
by minyor25 - opened

Hello.
TabbyAPI's log complained about not being able to find the following parameters:
"rms_norm_eps": 1e-06,
"rope_local_base_freq": 10000.0,
"vocab_size": 262208

After adding them to "config.json", I now get these errors:
ERROR: raise ValueError(f" ## Could not find {prefix}.* in model")
ERROR: ValueError: ## Could not find lm_head.* in model

I can't find any useful info on the internet...
Help, please

The model is currently supported on the dev branch of ExLlamaV2, not in the latest release version that Tabby pulls by default. If you can switch ExLlamaV2 over to the dev branch (this requires the build prerequisites: CUDA Toolkit, plus VS Build Tools if you're on Windows), it should work; otherwise, there will most likely be a new release in a couple of days.
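For reference, the switch is roughly the following (repo URL, branch name, and install steps here are assumptions; adjust them to your setup, and make sure TabbyAPI's venv is active first):

```bash
# Rough sketch: build the dev branch of ExLlamaV2 into the current environment.
# Needs the CUDA Toolkit (and VS Build Tools on Windows) to compile the extension.
git clone -b dev https://github.com/turboderp/exllamav2
cd exllamav2
pip install .
```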

Thank you kindly for your reply. I restored the original "config.json", then cloned the dev branch of exllamav2 and built it. Now it is working!
However, it looks like it has difficulty generating code with Roo Code: it starts to generate but ends up cycling the same word:

Okay, I will write a Tetris game logic in Python using the Pygame library. I'm ready to write the code. I's a Python file, so I'll create a file named tetris.code. I'll use the write_to_code tool to write the code. I'll start by writing the code.

```python
import pygame
import random

def main 
    pygame.init() 
    screen = pygame.display.set_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_main_

Here is my tabbyAPI config for the model:
max_seq_len: 32768
cache_mode: Q4

Could some Tabby-specific changes in how exllamav2 is used be at fault?
Should I try experimenting with temperature or other sampler parameters like min_p or top_p?
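(For reference, this is roughly how I'd pass those sampler overrides if I do experiment; just a sketch assuming tabbyAPI's OpenAI-compatible completions endpoint on its default port, and that extended fields like min_p are accepted by my version:)

```python
# Sketch only: the endpoint path, port, and support for extra sampler fields
# such as min_p are assumptions; check your tabbyAPI version and API key setup.
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # your tabbyAPI API key
    json={
        "prompt": "Write a Tetris game in Python using Pygame.",
        "max_tokens": 512,
        "temperature": 0.7,  # starting points to experiment with, not recommendations
        "min_p": 0.05,
        "top_p": 0.95,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```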
Thanks

Try cache Q6 or higher; some models break almost completely under Q4.

I second this. There's no guarantee that the distribution of the keys and/or values will be amenable to quantization, especially Q4, which relies on groups of 64 consecutive values aligning well to a regular 16-point grid after Hadamard regularization. It might be some interesting interaction between that regularization and Gemma3's use of Q/K norms; I'm not sure. It could also be SWA, which only uses 1024 keys/values per token for 5/6 of the layers, making rounding errors in that smaller chunk of cache more critical. Either way, try Q6, Q8, or FP16 to see if the results improve.
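For example, a minimal sketch of the relevant TabbyAPI config change, using the same key names as the config quoted above (exact placement depends on your config.yml layout):

```yaml
# Sketch: raise the KV cache precision; everything else stays as-is.
max_seq_len: 32768
cache_mode: Q8   # try Q6, Q8, or FP16 instead of Q4
```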
