
Open Code

#3 by Deniaud - opened

Can we hope that you will post your LASER implementation using Random Matrix Theory?

Cognitive Computations org

Yes, you can :)
Keep following our next steps <3

Do y'all have a quick/brief overview of how the pruning rates are selected? Is it some variation of selecting a singular-value cutoff from the Marchenko–Pastur distribution (say, retain just 10% for layer N, 20% for layer N-10, etc.) and then normalizing that based on the norm of the matrix?
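For concreteness, here is a minimal sketch of one way a Marchenko–Pastur cutoff could be applied to a weight matrix. The noise scale `sigma` is assumed to be known or estimated separately, and `mp_threshold` / `mp_truncate` are illustrative names I made up, not the authors' actual code:

```python
import numpy as np

def mp_threshold(W: np.ndarray, sigma: float) -> float:
    # Bulk edge of the Marchenko-Pastur law for singular values of an
    # m x n matrix whose entries are i.i.d. noise with std `sigma`:
    # singular values below sigma * (sqrt(m) + sqrt(n)) are
    # statistically indistinguishable from noise.
    m, n = W.shape
    return sigma * (np.sqrt(m) + np.sqrt(n))

def mp_truncate(W: np.ndarray, sigma: float) -> np.ndarray:
    # Low-rank reconstruction of W that keeps only the singular values
    # above the Marchenko-Pastur noise edge.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > mp_threshold(W, sigma)
    return (U[:, keep] * s[keep]) @ Vt[keep, :]
```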

I believe the paper did a first-order grid search starting from (highest layer, lowest pruning rate) across each weight. This encodes the intuition in their paper that you shouldn't prune the early layers, but pruning the later layers is extra beneficial. Is that strategy (favoring later layers) also applicable here?

Additional question: would we see a significant inference performance boost by preprocessing the model? I thought a big part of the speedup from using SVD is that low-rank matmuls are much faster, but the resulting LASER-ed weights are still dense, with the same dimensions as before. That would suggest no inference speedup until the engine can take advantage of the knowledge that some of the weights are low-rank, right?
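To make that point concrete, here is a toy comparison (the dimensions and rank are arbitrary, made up for illustration): multiplying by a dense d_in x d_out matrix costs on the order of d_in * d_out multiply-accumulates per row, while multiplying by stored rank-r factors costs r * (d_in + d_out):

```python
import torch

d_in, d_out, r, batch = 4096, 4096, 256, 8

W_dense = torch.randn(d_in, d_out)  # LASER-ed weight, still stored dense
A = torch.randn(d_in, r)            # hypothetical stored factors, W ~= A @ B
B = torch.randn(r, d_out)
x = torch.randn(batch, d_in)

y_dense = x @ W_dense     # ~ batch * d_in * d_out MACs: no speedup
y_factored = (x @ A) @ B  # ~ batch * r * (d_in + d_out) MACs: ~8x fewer here
```

So unless the inference engine actually stores and multiplies the factors, the low-rank structure buys quality, not speed.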

Cognitive Computations org

We use Marchenko–Pastur to define the threshold cut for singular values.
And yes, we start the search from the top layers downwards, which is optimal.

Stay tuned because we will release the code in the next few days.
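Putting the two replies together, the top-down search might look something like this greedy sketch. The `get_w`, `set_w`, and `evaluate` callables are hypothetical placeholders supplied by the caller, `mp_truncate` is the helper sketched earlier in this thread, and none of this is the authors' released code:

```python
def laser_scan(model, layer_ids, weight_names, get_w, set_w, evaluate, sigma):
    # Greedy top-down scan: try a Marchenko-Pastur truncation on each
    # candidate weight, keep it only if the held-out metric does not
    # drop, otherwise revert. Later layers are visited first, matching
    # the intuition that they tolerate (or benefit from) pruning most.
    best = evaluate(model)
    for layer in sorted(layer_ids, reverse=True):  # top layers first
        for name in weight_names:
            W = get_w(model, layer, name)
            set_w(model, layer, name, mp_truncate(W, sigma))
            score = evaluate(model)
            if score >= best:
                best = score                  # keep the truncated weight
            else:
                set_w(model, layer, name, W)  # revert to the original
    return model
```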

@fernandofernandes Just wanted to follow up on this. Have you had time to release the code yet, and if so, could you share a link to it? Thanks for all of your hard work!
