wonderboy's activity
It starts true: imatrix runs the model against a corpus of text and tracks activations to determine which weights are most important.
However, what the quantization then does with that information is where I was wrong.
I think I accidentally conflated imatrix with ExLlamaV2's measurement pass, where ExLlamaV2 decides how many bits to assign to which weight depending on the target BPW.
Instead, llama.cpp with imatrix tries to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e. it minimizes the dequantization error weighted by the importance of the activations.
The mildly surprising part is that it just does a relatively brute-force search: it picks a bunch of candidate scales, tries each one, and keeps whichever gives the minimum error for the weights deemed important in the group.
But yeah, it turns out the quantization scheme is always the same; the scaling just has a bit more logic to it when you use imatrix.
Huge shoutout to @compilade for helping me wrap my head around it. Feel free to add or correct anything if I've messed something up.
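To make the scale search concrete, here is a minimal Python sketch of an importance-weighted brute-force search over a single block. It is not llama.cpp's actual code: the function names `weighted_error` and `quantize_block`, the candidate factors, the symmetric 4-bit grid, and the sample numbers are all assumptions made for illustration.

```python
def weighted_error(weights, q, scale, importance):
    """Importance-weighted squared dequantization error for one block."""
    return sum(imp * (w - scale * qi) ** 2
               for w, qi, imp in zip(weights, q, importance))


def quantize_block(weights, importance, bits=4,
                   factors=(0.80, 0.85, 0.90, 0.95, 1.00, 1.05, 1.10)):
    """Try several candidate scales and keep the one with the lowest weighted error.

    The candidate factors and the symmetric integer grid are hypothetical choices.
    """
    qmax = 2 ** (bits - 1) - 1   # 7 for signed 4-bit
    qmin = -qmax - 1             # -8 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return 0.0, [0] * len(weights)

    naive_scale = max_abs / qmax  # the scale you would pick without importance data
    best = None
    for factor in factors:
        scale = naive_scale * factor
        # Round every weight to the integer grid for this candidate scale.
        q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
        err = weighted_error(weights, q, scale, importance)
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]


weights = [0.10, -0.52, 0.91, 0.33, -0.07, 0.48, -0.95, 0.26]
importance = [0.1, 4.0, 0.2, 3.0, 0.1, 2.5, 0.2, 0.1]  # made-up activation statistics
scale, q = quantize_block(weights, importance)
print(scale, q)
```

The bit width and the integer grid never change between candidates; only the scale does, which is the sense in which the quantization scheme stays the same.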
I saw the imatrix dataset, which is one whole text file. I'm trying to recreate your wizardry in ONNX lol, and I wonder how you make sense of the whole text: how do you chunk it, etc.? Help appreciated. I'm glad you started posting; I just found out about this new feature last week. Take care. You're doing god's work, and your quants are the best. GGUF quants have come such a long way: I see smaller files and faster outputs. Even so, ONNX is beating GGUF in my tests; it just takes a more refined approach.
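On the chunking question: to my understanding, llama.cpp's imatrix tool does not try to make sense of the text at all. It tokenizes the whole file and cuts the token stream into consecutive fixed-size context windows. A minimal sketch of that idea follows, assuming a whitespace split as a stand-in for a real tokenizer; `chunk_tokens` is a hypothetical helper, and dropping the leftover tail is an assumption here.

```python
def chunk_tokens(tokens, chunk_size=512):
    """Split a flat token list into consecutive full-size chunks.

    Any tail shorter than chunk_size is dropped (assumed behaviour).
    """
    n_chunks = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]


# Whitespace split stands in for a real tokenizer in this sketch.
text = "the quick brown fox jumps over the lazy dog again and again"
tokens = text.split()
for chunk in chunk_tokens(tokens, chunk_size=4):
    print(chunk)
```

Each chunk would then be fed through the model on its own, regardless of where questions or answers begin and end.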
After examining it, the most I could take away was that it contains questions + answers + random text.
I wrote a Python script:
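# Split the calibration file into question/answer pairs (file assumed to be UTF-8).
with open("calibration_datav3.txt", "rt", encoding="utf-8") as file:
    data = file.read()

# Everything before the first "Q:" marker is discarded.
data_blocks = data.split("Q:\n\n")[1:]
for i, block in enumerate(data_blocks, 1):
    block = block.split("A:\n\n")
    if len(block) < 2:
        continue  # no answer marker in this block
    question = block[0].strip()
    # Only the first paragraph of the answer is kept, so longer answers are truncated.
    answer = block[1].strip().split("\n\n")[0].strip()
    print(f"### QUESTION:\n{question}\n")
    print(f"### ANSWER:\n{answer}")
    if i != len(data_blocks):
        print("\n---\n")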
and it gives me some structured data, although some of the answers are truncated 😅. Example:
### QUESTION:
как передать json на сервер
Здравствуйте, у меня есть 2 json объекта, находящиеся в javascript. Каким образом мне хранить их на сервере, файлами или в запросе передавать? Пожалуйста, с примерами кода.
Бэкэнд на ASP.NET 4.5
### ANSWER:
На клиенте конвертировать его в string:
myStringObj = JSON.stringify(myObj);
---
...
---
### QUESTION:
Show that $S_5$ does not have a quotient group isomorphic to $S_4$
Show that $S_5$ does not have a quotient group isomorphic to $S_4$.
If we to assume that $H$ is such a group, than $H$ must be normal in $S_5$ and $|H|=|S_5|/|S_4|=5$. So $H$ must be isomorphic to $\mathbb{Z}/5\Bbb Z$.
That's as far as my logic goes. I couldn't arrive at a contradiction.
Any ideas?
### ANSWER:
The possible candidates for such an $H$ are the subgroups of $S_5$ that are cyclic of order 5. All elements of $S_5$ of order 5 are given by $5$-cycles. However, the subgroup generated by a 5-cycle is not normal, so no $H$ can exist, as desired.
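The truncation comes from keeping only the first paragraph of each answer. Here is a minimal sketch of a variant that keeps the whole answer, assuming the same `Q:` and `A:` markers as the script above. `parse_qa` is a hypothetical helper name, and because the file also contains free text, whatever follows an answer before the next `Q:` marker ends up inside that answer.

```python
def parse_qa(data):
    """Return (question, answer) pairs, keeping each answer in full."""
    pairs = []
    for block in data.split("Q:\n\n")[1:]:
        question, marker, answer = block.partition("A:\n\n")
        if not marker:
            continue  # block has a question but no answer marker
        pairs.append((question.strip(), answer.strip()))
    return pairs


sample = "intro\nQ:\n\nWhat?\n\nA:\n\nLine one.\n\nLine two.\n\nQ:\n\nNo answer here"
for question, answer in parse_qa(sample):
    print(f"### QUESTION:\n{question}\n")
    print(f"### ANSWER:\n{answer}")
```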
Thank you for the fast reply and improved code.
Dumb questions:
- This doesn't improve already fine-tuned models; rather, I have to run this code and then run the training, correct?
- Also, when I save the model and load it later in another instance, the improvement isn't saved with it, so I need to load this code every time, correct?
Could you help us with an example? Maybe I'm on the money here, maybe I'm not lol, but I made this:
import torch
from torch import nn
from transformers import AutoConfig, AutoTokenizer, BertModel


# Define the fractal functions
def f1(x):
    return x**2 + 0.1


def f2(x):
    return 1 - (2 * x - 1)**4


# Custom P-FAF Embedding Layer
class PFAFEmbedding(nn.Module):
    def __init__(self, embed_size, num_fractals):
        super().__init__()
        self.p = nn.Parameter(torch.rand(num_fractals))  # Probabilistic weights
        self.d = nn.Parameter(torch.rand(num_fractals) * 1.5 + 0.5)  # Fractional dimensions
        self.embed_size = embed_size
        self.num_fractals = num_fractals
        self.fractals = [f1, f2]  # List of fractal functions (length must match num_fractals)

    def forward(self, x):
        # x: [batch_size, seq_length, embed_size]
        x_expanded = x.unsqueeze(1).expand(-1, self.num_fractals, -1, -1)  # [batch_size, num_fractals, seq_length, embed_size]
        exponent = 1 / self.d.view(1, -1, 1, 1)
        # Sign-preserving fractional power: torch.pow returns NaN for negative
        # inputs with a non-integer exponent, and embedding values are often negative.
        x_dim = torch.sign(x_expanded) * torch.pow(x_expanded.abs(), exponent)
        t = torch.stack([p * f(xd) for p, f, xd in zip(self.p, self.fractals, torch.unbind(x_dim, dim=1))], dim=1)
        return torch.sum(t, dim=1)  # Sum over fractals


# Custom BERT Model with P-FAF Embedding
# Subclasses BertModel rather than an Auto* class (which cannot be instantiated
# directly), because forward() relies on BertModel's embeddings, encoder and pooler.
class AutoModelWithPFAF(BertModel):
    def __init__(self, config):
        super().__init__(config)
        self.pfaf_embedding = PFAFEmbedding(config.hidden_size, 2)  # Using 2 fractal functions for demonstration

    def forward(self, input_ids, attention_mask=None):
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        # Word embeddings only: position/token-type embeddings and LayerNorm are bypassed here
        inputs_embeds = self.embeddings.word_embeddings(input_ids)
        inputs_embeds = self.pfaf_embedding(inputs_embeds)  # Apply P-FAF transformation
        # Rest of the BERT model
        extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_ids.shape)
        head_mask = self.get_head_mask(None, self.config.num_hidden_layers)
        encoder_outputs = self.encoder(
            inputs_embeds,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
        outputs = (sequence_output, pooled_output) + encoder_outputs[1:]
        return outputs  # Return the base BERT outputs for compatibility


# Load pre-trained BERT and modify it
config = AutoConfig.from_pretrained("google-bert/bert-base-cased")
# The P-FAF parameters are not in the checkpoint, so they start untrained
model = AutoModelWithPFAF.from_pretrained("google-bert/bert-base-cased", config=config)
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")
I feel you lol. I hate the super long times, but also the short ones, because I play a movie for like 5-10 minutes before I'm interrupted again, so it feels like a game of cat and mouse haha. But patience, hopefully we reap some rewards XD Stay strong.