Request: WestLake-10.7B-v2 (quants)

#11
by DatToad - opened

[Required] Model name:
WestLake-10.7B-v2

[Required] Model link
https://huggingface.co/froggeric/WestLake-10.7B-v2

[Required] Brief description:
Based on Mistral-7B and trained on a new dataset, it benchmarks very well for RP/ERP/creative writing and, in my experience, plays quite well in chat/instruct. It is a self-merge of the original 7B, but one that saw significant improvements in quality.

[Required] An image/direct image link to represent the model (square shaped):
From the 7B version:
jnqnl8a_zYYMqJoBpX8yS.png

[Optional] Additional quants (if you want any):

Quants from the model's creator are pretty limited, but I am interested in testing PP performance on the Tesla P40, especially at large context sizes. The model doesn't seem to respond well to anything larger than 8k, but I am hoping RoPE scaling can improve this; it otherwise plays fantastically.
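
For reference, a minimal sketch of what that RoPE scaling test could look like, launching llama.cpp's server from Python. The flag names are llama.cpp's, but the model filename, the 8k-to-16k stretch, and the scaling factor are assumptions on my part:

    # Hypothetical launch of llama.cpp's server with linear RoPE scaling,
    # stretching an assumed 8k native window to 16k. Filename is a placeholder.
    import subprocess

    subprocess.run([
        "./server",
        "-m", "WestLake-10.7B-v2-Q4_K_M-imat.gguf",  # placeholder filename
        "-c", "16384",               # requested context size
        "--rope-scaling", "linear",  # linear position interpolation
        "--rope-freq-scale", "0.5",  # 8k native / 16k target
        "-ngl", "99",                # offload all layers to the P40
    ])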

Beyond improving the model's short-term memory, I suspect K quants may be slower than legacy quants, and that the gap may be widened by the P40's limited compute power relative to current-generation GPUs (a P40 is comparable to a 1080, just with 24GB of VRAM). I'd like to test this theory, and also benchmark what real effect i-quants and imatrix calibration have on PP and TG on these cards (the aforementioned limited compute may make any difference more pronounced).
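
As a sketch of that comparison (not an exact plan), llama.cpp's llama-bench tool reports PP and TG throughput per model; the quant filenames below are assumptions:

    # Compare prompt processing (-p) and token generation (-n) throughput
    # across a few quants on the P40. Filenames are placeholders.
    import subprocess

    for quant in ["Q4_0", "Q4_K_M", "IQ4_XS"]:
        subprocess.run([
            "./llama-bench",
            "-m", f"WestLake-10.7B-v2-{quant}-imat.gguf",
            "-p", "512",   # prompt tokens (PP)
            "-n", "128",   # generated tokens (TG)
            "-ngl", "99",  # fully offload to the GPU
        ])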

Finally, this model could potentially fit at very low context sizes (4k or less), but it will need every advantage it can get to produce quality output in 6GB (Q3_K_S would just barely fit) or even 4GB of RAM. While there's no guarantee this is attainable, it seems like a valuable target if we can succeed.
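
As a rough sanity check on that 6GB target, here is a back-of-the-envelope estimate. The bits-per-weight figures are approximate llama.cpp values, and the KV-cache math assumes a Solar-style 10.7B architecture (48 layers, GQA with a 1024-dim KV projection), which is an assumption about this merge:

    # Approximate VRAM needed for weights plus an fp16 KV cache at 4k context.
    PARAMS = 10.7e9

    def weights_gb(bpw: float) -> float:
        return PARAMS * bpw / 8 / 1e9

    def kv_cache_gb(n_ctx: int, n_layers: int = 48, kv_dim: int = 1024) -> float:
        # 2 tensors (K and V) * layers * tokens * dim * 2 bytes (fp16)
        return 2 * n_layers * n_ctx * kv_dim * 2 / 1e9

    for name, bpw in [("Q3_K_S", 3.5), ("IQ3_XXS", 3.06), ("IQ2_XS", 2.31)]:
        total = weights_gb(bpw) + kv_cache_gb(4096)
        print(f"{name}: ~{total:.1f} GB at 4k context")

Under those assumptions Q3_K_S lands around 5.5GB (barely inside 6GB) and an IQ2 quant around 3.9GB, which is where the 4GB hope comes from.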

Default list of quants for reference:

"Q4_K_M", "Q4_K_S", "IQ4_XS", "Q5_K_M", "Q5_K_S",
"Q6_K", "Q8_0", "IQ3_M", "IQ3_S", "IQ3_XXS"

The existing GGUF release only has Q4_K_S, Q5_K_S, Q6_K, and Q8_0. Any additional ones beyond those requested for my benchmarks would likely be useful to the community. Comments also mention that this imatrix may have been used in preparing the 7B model: https://huggingface.co/Joseph717171/Imatrices/blob/main/WestLake-7B-v2_wikitext-2-raw.imatrix

More ambitious future revisions of the model could include LASER or DPO improvements, but those are outside my scope for the moment. I'd like to encourage its development as a showcase of just how good small models can be at specific tasks, and maybe help pave the way for more such specialized models that users can swap in as needed.

Thanks for your help, you're doing the Lewd's work. :)

Heya! Sure. Will do quants when idle, likely this evening. Model looks very promising.

> Finally, this model could potentially fit at very low context sizes (4k or less), but it will need every advantage it can get to produce quality output in 6GB (Q3_K_S would just barely fit) or even 4GB of RAM. While there's no guarantee this is attainable, it seems like a valuable target if we can succeed.

IQ3-imat quants should do nicely for this. I may add some IQ2-imat quants if that's not enough.

> The existing GGUF release only has Q4_K_S, Q5_K_S, Q6_K, and Q8_0.

I'll do all of them since they are regular GGUF quants. Hopefully we can improve on those quants with the imatrix calibration data; feedback has been positive, especially for the smaller quants like the IQ and Q4/Q5 variants.

New imatrix data will be generated from the FP16 10.7B-v2 model, using Kalomaze's general groups_merged.txt with some roleplay chats added for slightly more diversity.
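
For reference, a sketch of that imatrix generation step using llama.cpp's imatrix tool; the calibration and output filenames here are placeholders, not the actual ones:

    # Generate importance-matrix data from the FP16 model over the
    # calibration text. Filenames are assumptions.
    import subprocess

    subprocess.run([
        "./imatrix",
        "-m", "WestLake-10.7B-v2-F16.gguf",  # FP16 source model
        "-f", "groups_merged-plus-rp.txt",   # assumed calibration file name
        "-o", "WestLake-10.7B-v2.imatrix",   # output imatrix
    ])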

If you have particular calibration data in mind, let me know.


Do you also want the old _0 and _1 quants to test performance on the P40?

Will add those.

    quantization_options = [
        "Q4_0", "Q4_1", "Q5_0", "Q5_1",
        "Q4_K_M", "Q4_K_S", "IQ4_XS", "Q5_K_M", "Q5_K_S",
        "Q6_K", "Q8_0", "Q3_K_M", "IQ3_M", "IQ3_S", "IQ3_XXS"
    ]
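
One plausible way a list like that drives the batch job (a sketch mirroring common GGUF release scripts, not necessarily the script used here; paths are placeholders):

    # Quantize the FP16 model once per requested type, applying the
    # shared imatrix. quantization_options is the list above.
    import subprocess

    for quant in quantization_options:
        subprocess.run([
            "./quantize",
            "--imatrix", "WestLake-10.7B-v2.imatrix",
            "WestLake-10.7B-v2-F16.gguf",
            f"WestLake-10.7B-v2-{quant}-imat.gguf",
            quant,
        ])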

Yes, please include legacy _0 and _1 quants for testing purposes. Knowing the results could influence future optimizations for older cards of all varieties. Thanks. :)

@DatToad The default list of quants will be uploaded alongside the _0 and _1 quants, which will go inside a legacy folder to discourage people from using them unnecessarily, as they are worse quality performers compared to the IQ/K quants.

Since I have a Pascal card, I'll do some quick speed benchmarks when I get to testing on my end.

Lewdiculous changed discussion status to closed
