Here is how we can calculate the memory size of any LLM:
Each parameter in an LLM is typically stored as a floating-point number, and the size of each parameter in bytes depends on the precision.
32-bit precision (FP32): each parameter takes 4 bytes. 16-bit precision (FP16): each parameter takes 2 bytes.
To calculate the total memory usage of the model: Memory usage (in bytes) = No. of Parameters × Size of Each Parameter
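As a quick sketch of this formula in Python (the function names estimate_model_memory_bytes and bytes_to_gb are my own choices for illustration, not from any particular library, and the conversion assumes 1 GB = 1024³ bytes, matching the example below):

```python
def estimate_model_memory_bytes(num_parameters: int, bytes_per_parameter: int) -> int:
    """Raw weight memory: number of parameters × bytes per parameter."""
    return num_parameters * bytes_per_parameter


def bytes_to_gb(num_bytes: int) -> float:
    """Convert bytes to gigabytes, assuming 1 GB = 1024**3 bytes."""
    return num_bytes / (1024 ** 3)
```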
For example, in 32-bit floating-point precision (FP32), each parameter takes 4 bytes: 1,000,000,000 parameters × 4 bytes = 4,000,000,000 bytes, which is ≈ 3.73 GB (dividing by 1024³ bytes per GB).
In 16-bit floating-point precision (FP16), each parameter takes 2 bytes: 1,000,000,000 parameters × 2 bytes = 2,000,000,000 bytes, which is ≈ 1.86 GB.
So, depending on whether you use 32-bit or 16-bit precision, a model with 1 billion parameters would use approximately 3.73 GB or 1.86 GB of memory, respectively.
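To double-check the arithmetic above, here is a minimal self-contained snippet that mirrors the sketch earlier (again assuming 1 GB = 1024³ bytes):

```python
# Verify the 1-billion-parameter example at FP32 and FP16 precision.
params = 1_000_000_000

fp32_gb = params * 4 / (1024 ** 3)  # 4,000,000,000 bytes ≈ 3.73 GB
fp16_gb = params * 2 / (1024 ** 3)  # 2,000,000,000 bytes ≈ 1.86 GB

print(f"FP32: {fp32_gb:.2f} GB, FP16: {fp16_gb:.2f} GB")
```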
Reacted to bartowski's post with ❤️ about 1 month ago:
In regards to the latest mistral model and GGUFs for it:
Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window
Yes, I got excited when a conversion worked and released them ASAP
That said, generation seems to work right now and appears to mimic the output from Spaces that are running the original model
I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel misled and that it's not the right thing to do, please post (civilly) below your thoughts. I will strongly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all!