language:
- en
author: >-
froggeric
(https://huggingface.co/datasets/froggeric/imatrix/edit/main/README.md)
All credit for this wonderful Repo Card goes to froggeric. It explains the similarities and differences between computed imatrices, and the differences, similarities, and significance of the training datasets and their intended purposes for particular large language models.
Input files for generating the Importance Matrix
Note: All imatrices uploaded to this repo are pre-computed and ready to be used in llama.cpp's quantization process.
llama.cpp % ./quantize --imatrix path_to_imatrix path_to_model_files model_name-QuantType.gguf QuantType
Which file to use for generating the importance matrix
Not all importance matrices are equal. The best results are obtained when using a source file similar to the training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs), the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed; for example:
cat technical.txt multilingual.txt wiki.txt >custom.matrix
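The same idea can be wrapped in a small shell function. This is only a sketch, assuming the input files are plain text; the name `combine_inputs` is hypothetical. Unlike plain `cat`, it adds a newline after each file, so the last line of one file cannot run into the first line of the next.

```shell
# Hypothetical helper: concatenate calibration files into one input file.
# Usage: combine_inputs <output_file> <input_file>...
combine_inputs() {
    out="$1"
    shift
    : > "$out"                      # create or truncate the output file
    for f in "$@"; do
        cat "$f" >> "$out"
        printf '\n' >> "$out"       # separator in case $f has no trailing newline
    done
}
```

For example, `combine_inputs custom.matrix technical.txt multilingual.txt wiki.txt` produces the same kind of combined file as the `cat` command above.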
Note on context size when generating the matrix: in general, a small context size such as 512 is recommended, and community tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the source data you are using: with random tokens or short text a small context makes sense, but when using larger texts, a larger context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly translates to number of words, not characters.
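To get a feel for how much text a given context size needs, the word count can stand in for the token count, following the rough equivalence above. The sketch below is an approximation under that assumption: the real token count depends on the model's tokenizer and is typically somewhat higher than the word count. The name `estimate_chunks` is hypothetical.

```shell
# Hypothetical helper: rough number of context-sized chunks in a text file,
# using the word count as a stand-in for the token count.
# Usage: estimate_chunks <text_file> <context_size>
estimate_chunks() {
    words=$(wc -w < "$1")
    echo $(( words / $2 ))          # integer division: only full chunks count
}
```

For example, `estimate_chunks custom.matrix 512` gives a rough idea of whether the file is long enough for the number of chunks you intend to process.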
You will find below descriptions for the various input files provided, to help you choose the correct one.
Community provided files
groups_merged
"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
I get lower KL div than wikitext for the same length and the outputs seem qualitatively better." (kalomaze)
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384
group_10_merged
(superseded by groups_merged)
"This is about ~50k pseudo-random tokens.
I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b" (kalomaze)
https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233
20k_random_data
(superseded by group_10_merged)
https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190
8k_random_data
(superseded by 20k_random_data)
https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829
badwords
402 English words that can be considered dirty, naughty, obscene, or otherwise bad words.
This could be useful to remove guard rails.
Compiled from the Shutterstock GitHub repo
badwords_multilingual
2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages.
This could be useful to remove guard rails.
Compiled from the Shutterstock GitHub repo
ptb.train
Penn Treebank (PTB) is a widely used, preprocessed large dataset designed for language training. Casing,
punctuation and numbers have been removed from the training data. Recently it has largely been superseded
by WikiText, which does not have these removals and features a larger vocabulary and full articles (better
suited for models that can take advantage of long-term dependencies). However, for importance matrix training,
PTB is still a valid dataset, which has the advantage of being manually curated, and similar to WikiText,
without being WikiText; this can help against bias.
WikiText
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of
verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and
WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models
that can take advantage of long term dependencies.
https://huggingface.co/datasets/wikitext
WikiText_FR
70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles"
or "good articles".
https://huggingface.co/datasets/asi/wikitext_fr
c4
The C4 dataset is a collection of text sourced from the public Common Crawl web scrape.
It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish)
in addition to extensive deduplication. C4 dataset was explicitly designed to be English only:
any page that was not given a probability of at least 99% of being English by langdetect was discarded.
code (exllamav2)
Programming
multilingual (exllamav2)
English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew,
Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish.
technical (exllamav2)
Technical writing.
tiny
Very short stories. Be mindful of the prevalence of "Once upon a time" and "<|endoftext|>".
Extract from TinyStories dataset
wiki (exllamav2)
Small Wikipedia dump. Unclean, contains many unwanted tags.
exllamav2 calibration data taken from:
https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data
How to quantize using an imatrix, with llama.cpp
- Get one of the input files collected here, or elsewhere.
- Convert or download the model you want to quantise, in fp16 GGUF format.
- Generate an imatrix file specific to the model you want to quantise
cd <llama.cpp directory>
./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512
# -ngl 144 : layers offloaded to gpu (recommended to use number of layers the model contains)
# -t 12 : number of threads (should probably match the number of CPU cores)
# -c 512 : context size, testing seems to show 512 is recommended (default=512, 0=loaded from model)
# -b 512 : batch size (default=512)
# --chunks 100 (recommended)
# --mlock : keep model in ram (only use if you have sufficient RAM for the whole fp16)
- Use the generated matrix file to quantise the model
./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, eg:IQ4_XS>
Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is better for higher quantisation.
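The steps above can be tied together in a small dry-run helper that only prints the two commands, so they can be reviewed before running them from the llama.cpp directory. This is a sketch, not a definitive script: the function name `print_imatrix_commands` and the derived file names are hypothetical, the `-t` and `-ngl` values are copied from the example above and should be adjusted to your machine and model, and the quantize command uses the `--imatrix` form shown at the top of this page.

```shell
# Hypothetical helper: print (not run) the imatrix and quantize commands.
# Usage: print_imatrix_commands <f16_model.gguf> <calibration_text> <quant_type>
print_imatrix_commands() {
    model="$1"
    input="$2"
    quant="$3"
    matrix="${model%.gguf}.imatrix"          # assumed name for the imatrix file
    output="${model%.gguf}-${quant}.gguf"    # assumed name for the quantised model
    echo "./imatrix -m $model -f $input -o $matrix -t 12 -ngl 144 --chunks 100 -b 512 -c 512"
    echo "./quantize --imatrix $matrix $model $output $quant"
}
```

For example, `print_imatrix_commands ggml-model-f16.gguf groups_merged.txt IQ4_XS` prints one `./imatrix` line and one `./quantize` line that can be copied into the terminal.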