good
I think it achieves the goal. It plays roleplay more interestingly than basic Command R 2024 32B and is not as lewd and horny as Star Command. I tried the exl2 4bpw quant and it works OK, though it would be nice to have some GGUFs with higher precision (like Q6).
I can make a Q6 soon, if one of the GGUF quantizers doesn't pick it up before then.
Is iMatrix worth using on Q6 these days? Guess I'll see...
Not sure about imatrix and Q6. From what I have heard it produces different results, but is it worth it / actually better? Generally I only use imatrix quants up to 4-bit.
I tried the exl2 4bpw a bit more. It is not bad, but it is very inconsistent, often contradicting itself, sometimes even within one message. I don't know if that is because of such a low quant or a general problem with this Star Command finetune. Either way it is unfortunate, because the exl2 4bpw quant could in theory be used for 60-80k context with 24GB VRAM, but if it can't really follow even 8k, then long context is not very useful.
Is the regular Command R at 4bpw working better in that respect?
I've been trying both back to back, still not sure yet. I feel like both mess up in different areas, and like you said it may be due to the extreme quantization.
One thing I've found, btw, is that Command R likes extremely low temperature, especially if you use quadratic smoothing or something. It stays non-deterministic even below 0.1, though I'm not sure about an optimal spot or anything.
Another note, the HF to GGUF conversion script errors out with this model for some reason:
`ValueError: Can not map tensor 'lm_head.weight'`
But not with regular Command R?
There's also another quirk I discovered earlier where this raw model is a few gigabytes larger than regular Command R. It seemed to quantize to exl2 fine and end up at the same size, so I wrote it off... but now I'm not so sure. A linear merge seems to have the same result, as does manually specifying a tokenizer source.
Something might be messed up with mergekit and Command-R, not sure yet.
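For what it's worth, a quick way to pin down where the extra gigabytes come from would be to diff the tensor lists in the two models' safetensors indexes, something like the sketch below (the merge repo id is a placeholder, not the actual upload):

```python
# Rough sketch: compare the tensors listed in each model's safetensors index.
import json
from huggingface_hub import hf_hub_download

def tensor_names(repo_id: str) -> set[str]:
    # Sharded HF checkpoints list every tensor in model.safetensors.index.json.
    path = hf_hub_download(repo_id, "model.safetensors.index.json")
    with open(path) as f:
        return set(json.load(f)["weight_map"])

merged = tensor_names("your-name/star-command-lite-merge")  # placeholder repo id
base = tensor_names("CohereForAI/c4ai-command-r-08-2024")   # base Command R

print("only in merge:", merged - base)  # any extra tensors here explain the size gap
print("only in base: ", base - merged)
```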
Yeah, you are probably right. Checking my notes from my original testing of c4ai-command-r-08-2024 32B Q6_K_L, I have "seems not very smart" written there.
I did not try low temperature (actually I later went on to experiment with higher temperature, as base Command R 2024 often just gets stuck in a scene, but yes, that makes it even more chaotic). Maybe low temperature can work for Star Command / lite though, as they are no longer as dry and repetitive as base Command R 2024.
I no longer use quadratic smoothing. In general I try not to mess with the token distribution much nowadays (except temperature, a low min-p like 0.02 to remove the tail, and DRY). As someone pointed out, the models train hard for a long time on insane hardware to learn token predictions. A simple sampler function that changes the distribution is not going to improve that, but will more likely just mess with what they learned.
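For reference, a rough sketch of what that min-p tail cut does (illustrative only, not any particular backend's implementation):

```python
# Min-p keeps tokens whose probability is at least min_p times the top token's
# probability, masks the rest, and you sample from what's left.
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.02) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    top = probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < min_p * top, float("-inf"))
```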
> A simple sampler function that changes the distribution is not going to improve that, but will more likely just mess with what they learned.
I tend to agree with this. That being said, there's no "true" distribution, as sampling is largely about picking the not-most-likely token to keep it from looping... but now that you say it, I will try skipping the distribution warping.
But yes, I find this model does not like a lot of rep penalty, nor a lot of temperature (I am using 0.05 for short completions atm). Unfortunately I am not using DRY atm, as text-generation-webui is mega laggy at long context :(
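For what it's worth, here is roughly what that near-greedy setup looks like if you drive it from plain transformers instead (the prompt and model id are just examples, and min_p needs a fairly recent transformers release):

```python
# Sketch of the "low temperature + small min_p, nothing else" settings above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/c4ai-command-r-08-2024"  # example; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "The tavern door creaked open and"
inputs = tok(prompt, return_tensors="pt").to(model.device)

out = model.generate(
    **inputs,
    do_sample=True,          # keep sampling on, just barely above greedy
    temperature=0.05,        # very low temperature, as discussed
    min_p=0.02,              # trim the low-probability tail
    repetition_penalty=1.0,  # i.e. no rep penalty; DRY lives in other backends
    max_new_tokens=200,
)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```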
> Another note, the HF to GGUF conversion script errors out with this model for some reason:
> ...
> There's also another quirk I discovered earlier where this raw model is a few gigabytes larger than regular Command R.
> ...
> Something might be messed up with mergekit and Command-R, not sure yet.
@Downtown-Case
I looked into this issue here and mradermacher was able to make GGUF quants with my suggestion.
Thanks!
I think it may have been a bug with mergekit, actually. It's possible the model is a little off, but I am waiting for another Command-R finetune before trying a new merge.
> I think it may have been a bug with mergekit, actually.
Yes, I linked a mergekit issue in my comment; they only fixed it for Gemma by marking lm_head.weight as optional, but they did not do it for Command-R, which does the same thing.
> It's possible the model is a little off
The only thing preventing it from being made into a GGUF was the lm_head.weight tensor created by mergekit, which is redundant because embed_tokens.weight contains the same data, combined with the HF to GGUF script not accepting that.
If your exl2 version works, the GGUFs that were made should also work. I haven't tested them yet, but I do plan to.
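For anyone hitting the same error, one way to strip the redundant tensor could look roughly like the sketch below. The path is a placeholder, it assumes the usual sharded safetensors layout, and it is not necessarily the exact fix that was used, so back up the files first:

```python
# Rough sketch: drop the redundant lm_head.weight from a sharded safetensors
# checkpoint so the HF-to-GGUF converter stops choking on it.
import json
from pathlib import Path
import torch
from safetensors.torch import load_file, save_file

model_dir = Path("./star-command-lite-merge")  # placeholder local checkout
index_path = model_dir / "model.safetensors.index.json"
index = json.loads(index_path.read_text())
weight_map = index["weight_map"]

if "lm_head.weight" in weight_map:
    shard_file = weight_map["lm_head.weight"]
    shard = load_file(model_dir / shard_file)
    embeds = load_file(model_dir / weight_map["model.embed_tokens.weight"])

    # Sanity check: with tied embeddings, lm_head should duplicate embed_tokens.
    assert torch.equal(shard["lm_head.weight"], embeds["model.embed_tokens.weight"])

    # Rewrite the shard without the duplicate and update the index
    # (the index's total_size metadata is left as-is; it is only informational).
    del shard["lm_head.weight"]
    save_file(shard, model_dir / shard_file, metadata={"format": "pt"})
    del weight_map["lm_head.weight"]
    index_path.write_text(json.dumps(index, indent=2))
```

Marking the tensor optional for Command-R as well, like the Gemma fix in that mergekit issue, would of course be the cleaner long-term solution.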
Ah, good. Thanks! The exl2 version does indeed work fine.
@tdh111 BTW, if you are looking out for more Command-R finetunes, one I have my eye on is here: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b/discussions
The tuner has only modified the old command-r, but stated the new version with GQA should be next.
> BTW, if you are looking out for more Command-R finetunes, one I have my eye on is here: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b/discussions
> The tuner has only modified the old command-r, but stated the new version with GQA should be next.
Thanks for the recommendation. I found this on that page: "The dataset consisted of approximately 1000 pre-2012 books", which makes me pretty interested.
On that note, at the mid-30B size do you now prefer EVA (and other Qwen-based stuff) to Command-R stuff? Or do you think they have different tradeoffs? (Also, v0.2 of EVA came out, which seems to be a strict improvement over 0.1.)
My first impression of the new GQA 35B Command-R was that it lacked, or severely reduced, the unique flavor the original Command-R had; now it just feels like another synthetic-data-trained LLM. I haven't tried most of the new stuff that's been coming out, so maybe there is something better out there, but I still find myself constantly going back to Midnight Miqu. It is pretty bad at picking up style from context, is not that smart, lacks context size, and is much slower on my machine, but it writes well, isn't boring, and isn't absurdly horny like so many of the finetunes I've tried.