General discussion and testing feedback.
This has to compete with, and I'm hoping beat, InfinityRP, which is a 7B, so let's see how it does at the same VRAM usage.
InfinityRP is a tough one to beat.
Working on the first round of testing: comparing the P40 and 4070Ti with MMQ on/off and Imatrix vs. non-Imatrix, purely in terms of performance, while making sure the model doesn't do anything unexpected. 8k context.
Early eyeball test suggests that the imatrix quants are slightly more uncensored than the original (the original tries to sneak 'consent' into the play more than it originally had in mind). Meanwhile my current favorite model, Midnight-Miqu 70B 1.5 i1 at 12k context, is only just a little better in terms of quality and complexity, and takes 4-5 minutes to get a full 384 token response, in comparison to, uh... 16 seconds on the 4070Ti?
This ain't writing code; if something only 90% as good can produce the same results in 90% less time, that's going to be a winner for most purposes, and imatrix overall seems to be an improvement over the original. I'll give some comparable InfinityRP quants a go (matching VRAM for VRAM in the same uncensored scenario) later tonight.
If you want something that will almost never refuse/blabber about consent, there is Layla-V4 (https://huggingface.co/Lewdiculous/mistral-7b-v0.1-layla-v4-GGUF-IQ-Imatrix), which is a decent RP model and will really do anything you ask, no matter how 'bad', haha. Of course, in my opinion (and I imagine from others too), InfinityRP is the better pick overall, as with a decent character card it will do just fine.
I'd need to spend more time with Westlake-10.7B to get a better feel for it, but it kept trying to bleed ``` into the roleplay chat responses at the end. I don't know what it was trying to add in a code block there, as I was just chatting with a character that doesn't have them in their normal responses; that's something InfinityRP or BuRP won't do.
Early eyeball test suggests that the imatrix quants are slightly more uncensored than the original (the original tries to sneak 'consent' into the play more than it originally had in mind)
So, Imatrix quants should have weights "closer" to the original FP16 model than regular quants, since the calibration data is computed against the original model.
For example, when looking at smaller quants like IQ3, but especially IQ2, Imatrix is probably the only way to make them even decently usable.
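For anyone who wants to reproduce these quants locally, the rough llama.cpp workflow is a two-step process; this is just a minimal sketch with placeholder file names, assuming the stock ./imatrix and ./quantize binaries, so check the flags against your own build:

# 1) compute the importance matrix from the FP16 GGUF and a calibration text file
./imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat -ngl 20 -c 512
# 2) quantize using that imatrix (same idea for IQ2/IQ3/Q4_K_M and friends)
./quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS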
4-5 minutes to get a full 384 token response, in comparison to, uh... 16 seconds on the 4070Ti?
I know the struggle. It's impressive the level of quality and speed you can get from models in this size range; even if they can't be fully trusted for math or coding, they perform great in other aspects.
First set of minimally scientific tests completed (I wrote it down, that counts, doesn't it?). Major takeaways:
On the P40, best performance was obtained with MMQ On, with the legacy Q4_0-imat, 7.85 T/s.
Worst case scenario was the Q4_K_M-imat, total PP/TG of 7.11 T/s.
Long story short, you're not going to perceive a performance difference between these, and unless you can perceive a quality difference going up to Q6_K, it's not worth taking the speed hit. Both PP and TG were a little slower, combined 6.78 T/s.
I'll leave it to someone else to decide what constitutes acceptable quality for the tiny quants. :)
4070Ti was a little surprising.
MMQ Off resulted in significantly faster TG speeds, going from about 17 T/s to 24 T/s on the KS quants, with and without imat.
All other factors weren't meaningfully significant, but there was also only about a 1 T/s drop in performance going to the Q6_K, so why not.
TL;DR: MMQ a little better for old jank, MMQ bad for new jank. No reason not to fill up your VRAM.
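For anyone wanting to replicate the MMQ on/off comparison, it's just the cuBLAS flag in KoboldCpp; a rough sketch from memory with a placeholder model filename, so double check the exact argument names against --help for your version:

# MMQ on (what helped the P40 here)
python koboldcpp.py --model WestLake-10.7B-v2-Q4_K_M-imat.gguf --usecublas mmq --gpulayers 49 --contextsize 8192
# MMQ off (what was faster on the 4070Ti): drop the mmq argument
python koboldcpp.py --model WestLake-10.7B-v2-Q4_K_M-imat.gguf --usecublas --gpulayers 49 --contextsize 8192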
Will give some of the others a quick once-over to see how they behave just on the 4070Ti for comparison in terms of quality at comparable VRAM.
You can see us arguing about Quants here:
I don't think the quality hit is worth going from the K or IQ quants to the legacy _0/_1 quants at all.
Anything starting from the IQ4_XS and up should be pretty comparable -- especially when using Imatrix data, as the gap from the original weights is minimized, considering this blind test:
Table:
If you want something that will almost never refuse/blabber about consent, there is Layla-V4 (https://huggingface.co/Lewdiculous/mistral-7b-v0.1-layla-v4-GGUF-IQ-Imatrix), which is a decent RP model and will really do anything you ask, no matter how 'bad', haha. Of course, in my opinion (and I imagine from others too), InfinityRP is the better pick overall, as with a decent character card it will do just fine.
I'd need to spend more time with Westlake-10.7B to get a better feel for it, but it kept trying to bleed ``` into the roleplay chat responses at the end. I don't know what it was trying to add in a code block there, as I was just chatting with a character that doesn't have them in their normal responses; that's something InfinityRP or BuRP won't do.
'Bad' in the opinion of people with concerns about what others do in their personal time and keep between their own ears and a computer in their own home.
WestLake didn't interrupt the RP over consent, but it tried to push things back in that direction; it wasn't a refusal, and it kept going from there.
I haven't seen any of those code block issues you mentioned, with a relatively stock ST and the Alpaca-Roleplay presets. Sounds like something that's more likely to be in the frontend than the model?
'Bad' in the opinion of people with concerns about what others do in their personal time and keep between their own ears and a computer in their own home.
Absolutely right. I want my models raw and real; I hate the general BS 'safety' alignment.
For Westlake, I'll go back and test with other system prompts and see how it goes next time. Other models under the same circumstances didn't act that way, but of course, using the right preset for the right model is important; I just didn't spend too much time mixing and matching.
I don't think the quality hit is worth going from the K or IQ quants to the legacy _0/_1 quants at all.
I didn't suspect it would, but this was for a particular niche (yet common enough) scenario involving lots of VRAM but limited compute. I'd say my numbers led to the conclusion that even though legacy is ever so slightly faster, it's not enough to justify going out of anyone's way to use them. Potentially more relevant for 8x7B models, but even then those only fit in 24GB if you go to IQ3 anyway.
Potentially more relevant for 8x7B models, but even then those only fit in 24GB if you go to IQ3 anyway.
Speaking of IQ quants, they can be more compute intensive (I don't find them much different from K quants, to be honest), and they seem to be particularly slow on Apple Silicon, but otherwise they're a good option, depending on your VRAM budget. More options are better than fewer.
@Nitral-AI @jeiku - Would trying to go for the same merge strategy as Westlake-10.7B as shown here...
dtype: float16
merge_method: passthrough
slices:
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [0,9]
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [5,14]
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [10,19]
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [15,24]
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [20,32]
...be feasible? Just wondering how compute intensive this arrangement is given current capacities, and, if it's not a huge pain to do, whether a similar merge could be performed for InfinityRP or BuRP... InfinityRP would be easier to test for me and some others.
I want to hear your thoughts before committing to anything, and as always no need to oblige; this is more of a morbid curiosity of mine, but sometimes the rocks stick when we smash them, after all.
InfinityRP is a tough one to beat.
Limited sample size, but I can definitely concur that this is a strong model, and it easily goes toe to toe with WestLake even with its 10.7B params. Only thing I could wish for is a larger context, and I'm not seeing that anywhere unless it involves the 8x7B or 70B+ models.
@Lewdiculous The problem with a layout like that is that it will not be merge compatible with other 10.7B models. The reason that the basic 11B passthrough is preferred is because it has become a standard ever since Undi's original experiments. Making a model like this would be a one off and only compatible with other models using the exact same layer extension scheme. Furthermore, I have been researching the efficacy of extensions like this or the one used in Sanji's longcat and it seems to rather effectively degrade performance while also making the model larger and thus more compute intensive. These types of layer trickery essentially cost more compute for lower performance. If you'll notice, this user has not submitted any models to the leaderboard and there is likely a reason for that.
Normally I'd consider this, but honestly I'm feeling pretty burned out on the LLM space right now, taking a little extended break.
Only thing I could wish for is a larger context
@DatToad
You and me, brother.
There actually is a good merge for this, Kunocchini-128k-test (by Nitral):
https://huggingface.co/Lewdiculous/Kunocchini-7b-128k-test-GGUF-Imatrix
(Only use the V2 quants)
It's my go to long context RPer.
@Nitral-AI Hey, all good mate, I was just checking up, have a good recovery, it has its ups and downs.
These types of layer trickery essentially cost more compute for lower performance. If you'll notice, this user has not submitted any models to the leaderboard and there is likely a reason for that.
👀
I see. Welp...
The reason that the basic 11B passthrough is preferred is because it has become a standard ever since Undi's original experiments.
Would a regular 9B/11B passthrough of only InfinityRP layers be alright or just pointless?
These wouldn't be too intensive but lemme know if the bandwidth situation is an issue.
Honestly it's fine if you're busy or with your mind somewhere else.
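Just so we're talking about the same recipe, this is roughly what I mean by the 'regular' 11B passthrough; a sketch only, using the usual Undi-style layer split as I understand it. The InfinityRP repo path and the mergekit-yaml invocation are assumptions on my part, not a tested config:

# write the hypothetical config and run it through mergekit
cat > infinityrp-11b.yml <<'EOF'
dtype: float16
merge_method: passthrough
slices:
  - sources:
      - model: Endevor/InfinityRP-v1-7B   # assumed repo path, swap in the real one
        layer_range: [0, 24]
  - sources:
      - model: Endevor/InfinityRP-v1-7B
        layer_range: [8, 32]
EOF
mergekit-yaml infinityrp-11b.yml ./InfinityRP-11B-passthrough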
The reason that the basic 11B passthrough is preferred is because it has become a standard ever since Undi's original experiments.
Would a regular 9B/11B passthrough of only InfinityRP layers be alright?
Honestly it's fine if you're busy or with your mind somewhere else.
Simply copying layers will not improve a model. I have run various experiments on 3B and 7B layer extensions, and the only time I saw an improvement was when disparate models were mixed either during the layer extension (preferable) or after extension in a merge between two likewise extended models (inefficient). It's not that I don't want to do it; I already have, and it introduces grammatical errors like repeated words, improper plurals, and improperly worded phrases.
There actually is a good merge for this, Kunocchini-128k-test (by Nitral):
Just gave it a quick whirl and threw 24k of context in it to see what would happen. Decent text output, but it doesn't seem to track multiple characters very well. Definitely a solid go-to type of model for your longer term AI fling.
@jeiku Thanks for the insight, this makes things clearer. I'll think about what could be interesting with that in mind, likely looking at 7B Slerps.
@DatToad Something has to give, haha, but it 'works'. It's hard to handle long context; other than this I can only think of Toppy-M, which some people seem to have used at 32k, but I'm personally not a fan of it.
LeroyDyer/Mixtral_AI_128K_B <<< the merge is here.
I was able to merge this model into my pool and extend the context to 128k (132).
yanismiraoui/Yarn-Mistral-7b-128k-sharded <<< this guy! A good sharded model, easy merging; the main thing was to keep the 128k model as the main weighted bias.
Nous-Yarn-Mistral-7b-128k is a state-of-the-art language model for long context, further pretrained on long context data for 1500 steps using the YaRN extension method. It is an extension of Mistral-7B-v0.1 and supports a 128k token context window. So it's mainly a clean copy, but extended... just unsharded.
slices:
  - sources:
      - model: yanismiraoui/Yarn-Mistral-7b-128k-sharded
        layer_range: [0, 32]
      - model: LeroyDyer/Mixtral_AI
        layer_range: [0, 32]
# or, the equivalent models: syntax:
# models:
#   - model: mistralai/Mistral-7B-Instruct-v0.2
#     # LARGER MODEL MUST BE BASE
#   - model: yanismiraoui/Yarn-Mistral-7b-128k-sharded
merge_method: slerp
base_model: yanismiraoui/Yarn-Mistral-7b-128k-sharded
parameters:
  t:
    - filter: self_attn
      value: [0.3, 0.6, 0.4, 0.6, 0.7]
    - filter: mlp
      value: [0.7, 0.4, 0.6, 0.4, 0.3]
    - value: 0.5 # fallback for rest of tensors
dtype: float16
t: - filter: self_attn value: [0.3, 0.6, 0.4, 0.6, 0.7] - filter: mlp value: [0.7, 0.4, 0.6, 0.4, 0.3]
Is there a reason you changed the values from the default example?
Is it to retain more of the model you want to extend? If so, how does this affect the maximum usable context size?
[WestLake-10.7B-v2-IQ4_XS-imat]
49/49 LAYERS
llm_load_tensors: ggml ctx size = 0.39 MiB
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 49/49 layers to GPU
llm_load_tensors: CPU buffer size = 66.41 MiB
llm_load_tensors: CUDA0 buffer size = 5438.05 MiB
...................................................................................................
46/49 LAYERS
llm_load_tensors: ggml ctx size = 0.39 MiB
llm_load_tensors: offloading 46 repeating layers to GPU
llm_load_tensors: offloaded 46/49 layers to GPU
llm_load_tensors: CPU buffer size = 5103.59 MiB
llm_load_tensors: CUDA0 buffer size = 5113.19 MiB
...................................................................................................
Hey
@Lewdiculous
I've been trying to test this model, but it's hard to figure out how and why this happens. Just 3 layers less than max takes the CPU buffer from 66MB to 5.1GB, while the GPU buffer only drops by about 300MB.
Why does this happen? Where does this 5GB of extra buffer come from? It's like the model duplicates itself into both CPU and GPU instead of offloading as much as expected.
Additionally, I can't make it run in kobold 1.62 at all. It gets stuck on 'Generating' forever while my PC gets laggy from the VRAM being full, but it won't output anything. I tested it afterwards in LM Studio and the model flies at 46.72 tok/s.
I'm having a hard time understanding why. Is it the XS quant? The latest LostRuins Kobold supports those, I thought.
Meanwhile, your Infinitely-Laydiculus-9b-Q4_K_M-imat offloads entirely onto my GPU at a similar size (33 layers for 5.1GB on GPU with a tiny CPU buffer, or something like that) and absolutely BLASTS through the 1.5-2.5k token cards that I made for myself for testing and fetish limit benchmarking (lmao). I'm trying to TRULY understand what really causes this difference in performance.
In LM Studio, the samplers and instructions are different. In ST, I almost always use ChatML because it surprisingly does amazingly well with any model in my tests. Well, in LM Studio I'm also using ChatML, so I don't get why this WestLake quant gets so stuck in Kobold, or why the offloading is so weird, with this 5GB popping out of nowhere (seemingly).
Something about the Infinitely-Laydiculus-9b-Q4_K_M-imat quant is really, really good. Using it in LM Studio, I get 52.34 tok/s. It's almost as fast in ST with the additional sampling settings and UI extras like the blurry background.
But it's not just that; your Laydi-9B CHEWS through BLAS processing, absolutely destroying a ~2.6k token card on the first prompt after launch. It takes like 2-3 seconds to start blasting output, whereas many models actually hang for a bit on a big first prompt, and then it keeps going at ~50 t/s. This is why I love it: the output is pretty good, never really garbage, AND the speed is fantastic for my 3070.
I think some of it may be due to weirdness in quants and how Kobold runs them, or maybe because Q4_K_M is especially fast and better than IQ4_XS in terms of potential speed? Damn, I still don't understand it well enough to know what's going on, haha.
EDIT: I did try the Q4_K_M of this model and it ran in Kobold, so there must be something related to the XS quant. However, it is not nearly as fast at BLAS processing and output as the 9B infi-laydi, even though I supposedly have 1-2GB of VRAM headroom. Then again, I never really get that true headroom anyhow and need a new card ^^ (waiting on the 5000 series now to see the specs).
The folks at the Chaotic Neutrals might understand what happens better.
KCPP's latest official release is 1.61.2, I think you meant that.
I can't fit WestLake-10.7B-v2 Q4_K_M entirely in VRAM with only 8GB, so there's that; to leave some headroom I can only offload around 35 layers of it, and if your buffer overflows it will take ages to generate anything. How does your VRAM usage look with each:
WestLake-10.7B-v2 Q4_K_M
49/49 LAYERS | Dedicated VRAM usage:
46/49 LAYERS | Dedicated VRAM usage:
IQ quants might require a bit more processing but shouldn't really be much slower on CUDA/NVIDIA... Hard to say, I don't understand the underlying arch that well.
Honestly, this model might just be weird. The 7Bs and 9Bs should give you much better performance and I'd recommend you stick to them.
Thanks
@Lewdiculous
, yeah I meant 1.61.2*
What metric do you go by to decide how many layers you can offload?
llm_load_tensors: offloading 35 repeating layers to GPU
llm_load_tensors: offloaded 35/49 layers to GPU
llm_load_tensors: CPU buffer size = 5678.68 MiB
llm_load_tensors: CUDA0 buffer size = 4343.59 MiB
llama_kv_cache_init: CUDA_Host KV buffer size = 416.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 1120.00 MiB
llama_new_context_with_model: KV self size = 1536.00 MiB, K (f16): 768.00 MiB, V (f16): 768.00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 25.07 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 560.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 552.00 MiB
llama_new_context_with_model: graph splits (measure): 5
With 35 layers here, the CUDA buffer size is 4.3GB, but shouldn't we be able to offload up to at least 7GB, leaving 1GB of headroom?
I did try offloading 35 layers and got 15 t/s. Then I tried offloading 40 layers and got 22 t/s, so I guess I was probably overflowing at 49 layers. Even now I have 0.9GB of shared memory going into VRAM and the speed is pretty good. It's still a mystery to me how this happens: why offloading 100% of the layers leaves the CPU buffer almost empty while removing just 3 layers makes it 5GB, and why using shared memory sometimes slows things down but not other times. Plus, disabling the sysmem fallback policy seems to make things worse for me, with models spilling outside the GPU with far more slowdown. I guess it's the inconsistency that makes it hard to understand.
Oh, and I did play around with the model after all. My only feedback is that it seems far more likely to refuse lewdness, in cards where it is effortless to achieve with a model like infi-lay9b, although this one seems a bit better-spoken overall.
Strange, I always run with sysmem fallback disabled and typically am running fine.
llama.cpp gets confused when writing the metadata for frankenmerges, so it outputs it as a 34B model.
Strange, I always run with sysmem fallback disabled and typically am running fine.
I've been doing a lot of tests and overall, the impression I have is that disabling sysmem fallback is meant for models that fit entirely in the GPU, and it makes things worse if the model is bigger. But my experiences are far too inconsistent. Also, this fallback cannot really be disabled; when it says "prefer to", it means it, and it will still fall back no matter what, just less.
What metric do you go by to decide how many layers you can offload?
Your Dedicated VRAM Utilization. I like to stay at 7.4GB/8GB.
Lets me play some light games like Minecraft and League at the same time, or use image editing software, or watch high resolution video in MPV, just to make sure it's not going to crap the buffer.
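As a rough rule of thumb you can also do a back-of-envelope estimate from the 35-layer log above; this is only a sketch with approximate numbers, since the KV cache and compute buffers grow with context and with how much you offload:

# ~124 MiB of weights per offloaded layer (4343 MiB / 35 layers, from the log above)
per_layer_mib=$(( 4343 / 35 ))
# budget = target dedicated VRAM minus full KV cache and compute buffer (all approximate)
budget_mib=$(( 7400 - 1536 - 560 ))
echo "layers that should roughly fit: $(( budget_mib / per_layer_mib ))"  # ~42, then confirm against actual Dedicated VRAM usage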
I run everything fully offloaded to my GPU; I don't run above 11Bs for that reason.
But my mixtrals 😭
I run everything fully offloaded to my GPU; I don't run above 11Bs for that reason.
Me and my 8GB VRAM will be over here crying in the corner.
You and me, @jeiku. Stay strong. :cope:
Me coping with 12GB.
Point me to the documentation and requests and I'll set the R730 w/ the 2x P40 to work on it. We all could use better lewd.
Yes, do volunteer to donate your precious compute.
Yes, do volunteer to donate your precious compute.
In what sense?
I have some free compute that has decent RAM, but not the skills.
People in this thread know how to make better models.
I propose an even exchange. :) (As in, I'm sure how to do merges, imatrix quants, etc. is written up somewhere; I just need the guides, and once I have the knack of it, I don't mind filling requests.)
Maybe I should do some tutorials on all of this when I'm not being lazy. That way more people can do imatrix, GPTQ, EXL2 quants, and baseline merges (SLERP, DARE-TIES, linear and passthrough).
To make the Imatrix quants locally I already shared my script here...
https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script
This is set up for a Windows machine with 8GB of VRAM, assuming use with an NVIDIA GPU. If you want to change the -ngl (number of GPU layers) amount, you can do so at line 120. This is only relevant during the imatrix data generation.
In your case you'd increase the layers to use more of your big VRAM pool.
Adjust quantization_options at line 133.
That script is fine, but I am assuming @DatToad is on Linux because of the P40s.
I can try to get it to work on Linux, but it will take a while since I am not a coder.
That server is running ESXi and I can easily switch back and forth between Windows Server and Linux; the GPUs are passed through to whichever OS I want to use.
I'm assuming llama.cpp quantize will run across the full VRAM pool?
For quantization with ./quantize you don't need a GPU.
A GPU is only necessary when computing imatrix.dat with ./imatrix.
I don't have multiple GPUs so I am not sure, but these are the relevant flags:
-ngl N, --n-gpu-layers N
number of layers to store in VRAM
-sm SPLIT_MODE, --split-mode SPLIT_MODE
how to split the model across multiple GPUs, one of:
- none: use one GPU only
- layer (default): split layers and KV across GPUs
- row: split rows across GPUs
-ts SPLIT, --tensor-split SPLIT
fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
-mg i, --main-gpu i the GPU to use for the model (with split-mode = none),
or for intermediate results and KV (with split-mode = row) (default: 0)
For P40s you want to set --split-mode row.
Ideally you would have llama.cpp built with LLAMA_CUDA_FORCE_MMQ=1, but Windows builds are annoying, so just use the ones from releases.
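Putting those flags together, a dual-P40 imatrix run might look something like the sketch below; the model and calibration file names are placeholders, and I haven't verified row split on that exact setup, so treat it as a starting point:

# compute the imatrix with all layers offloaded, rows split across the two P40s
./imatrix -m model-f16.gguf -f groups_merged.txt -o imatrix.dat -ngl 99 --split-mode row --tensor-split 1,1
# the actual ./quantize step afterwards runs on CPU, no GPU needed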
Not having prebuilt binaries for Linux will make it a bit slower (but nothing unbearable) since you'd need to compile, but as a Windows slave I can't really say much.
The reason for this, which didn't cross my mind, is that different distributions have different GCC versions.
Not to mention the other dependencies.
They would have to compile so many packages: RPM, DEB, tar.
The only way it would work is releasing some self-contained AppImage-like wheel, similar to the way koboldcpp does it.
I wish they could at least compile for the Debian or Ubuntu base, which is more popular, especially on servers.
They have instructions.
I don't really care that much since I like to look at the PRs and build from source.
The automated builds are frequent enough that I don't think recency is an issue. Besides, taking 4 seconds to grab the binaries instead of minutes to compile would be nice for all OSes, I'd like to believe, but alas, it's fine.
I have just discovered those imatrix quants of my froggeric/WestLake-10.7B-v2, and did not realise there was so much interest in it! For what it's worth, if you can run a higher quant without an importance matrix, I think it is better both in terms of speed and quality. I provided a few of the most important ones at WestLake-10.7B-v2-GGUF. I honestly do not think it is worth using anything smaller than Q4_K_S. Q6 and above definitely do not benefit from using an imatrix. The reason I am advising against using an imatrix is that I am doing some research into importance matrices, and what I have observed, along with others, is that imatrices based on an English-only dataset degrade the model's multilingual capabilities. This is easy to observe when it comes to languages, but the suspicion is the same happens with other capabilities (coding, RP, writing, etc.), only it is more difficult to observe. I also provide a repo with some information about importance matrices, and multiple datasets available for download: froggeric/imatrix. Ideally the importance matrix should be based on the dataset used to fine-tune the model.
@Lewdiculous what dataset did you use for the importance matrix?
In terms of this model, I have plans to develop it further. First, I would love for it to have a small fine-tune done with the original WestLake dataset, to realign the layers. But since the dataset is not public, I hope @senseable will be able to help. Second, I am currently doing some research on layer attenuation when merging, and on duplicating layers using attenuation. So far, a few promising results, but also lots of duds. We are getting closer to a better understanding. Progress is a bit slow, as it is both time consuming and we are doing this on large models (120b). Once finished, I will revisit WestLake and try to apply what we have learned to it.
Is it possible to insert many more texts, such as Titus Livy, Socrates, Polybius, Pliny the Elder, etc.? It would be great ❤️
@ClaudioItaly It is possible, but it just takes longer to compute when doing quants, so unless there's a noticeable quality improvement it needs to be evaluated whether it's worth it.
@froggeric
Data:
https://huggingface.co/Lewdiculous/WestLake-10.7B-v2-GGUF-IQ-Imatrix/blob/main/imatrix-with-rp-data.txt
Kalomaze's groups_merged.txt with some English roleplay chat examples in the usual roleplay format.
@Virt-io can probably say more about Imatrix data and multilingual capabilities. I take the stance that these concerns are similar to the overfitting fears, which don't manifest outside of extreme scenarios if your data is diverse enough, and that English is still the main target for roleplay, as that's the safer language to assume usage with. If anyone verifies issues with multilingual capabilities I will add a [!NOTE] regarding it, but I've had feedback on other quants with similar data where users said they handled other languages very well.
About quants, generally I feel like from IQ4_XS and up they all perform well. You are correct that the importance matrix mainly benefits quants up to Q5, which makes all those Q4/IQ4 quants especially interesting in my opinion.