	---
	language:
	- en
	author: Joseph717171 & froggeric (https://huggingface.co/datasets/froggeric/imatrix/edit/main/README.md)
	---
	# All credit for this wonderful Repo Card, which explains the similarities and differences between computed imatrices and between training datasets, and the purposes those datasets serve for particular large language models, goes to [froggeric](https://huggingface.co/datasets/froggeric/imatrix).

	# Note: All imatrices uploaded to this repo are pre-computed and therefore ready to be used in llama.cpp's quantization process.

	# Note: Imatrices uploaded to this repo use the following naming convention: model-name_training-dataset.imatrix (the hyphens are used in this example purely to enhance readability)

	# Instructions: Download the imatrix for your chosen LLM (Large Language Model), and quantize to your preferred QuantType. (Note: the following example assumes you have already converted your model to GGUF.)

	```
	llama.cpp % ./quantize --imatrix path_to_imatrix path_to_model/ggml-model-f16.gguf model_name-QuantType.gguf QuantType
	```
	# Note: If you need detailed steps to convert your Large Language Model to GGUF, please scroll to the bottom of this page and check out the section: How to convert Supported LLMs (Large Language Models) to GGUF format
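
If you want several quant types from one downloaded imatrix, the quantize command above can be wrapped in a small loop. The following is a minimal dry-run sketch, not part of llama.cpp: the function name `print_quantize_cmds` and the example file names are hypothetical, and it only prints the commands so you can inspect them before running anything.

```shell
# Dry run: print one ./quantize command per requested quant type.
# print_quantize_cmds is a hypothetical helper, not a llama.cpp tool.
# Usage: print_quantize_cmds <imatrix> <f16_gguf> <model_name> <QuantType>...
print_quantize_cmds() {
    imatrix=$1
    f16_model=$2
    model_name=$3
    shift 3
    for quant_type in "$@"; do
        # Same argument order as the command above:
        # --imatrix <imatrix> <input.gguf> <output.gguf> <QuantType>
        printf './quantize --imatrix %s %s %s-%s.gguf %s\n' \
            "$imatrix" "$f16_model" "$model_name" "$quant_type" "$quant_type"
    done
}

# Example with placeholder file names (nothing is executed, only printed).
print_quantize_cmds mymodel_groupsmerged.imatrix models/ggml-model-f16.gguf mymodel IQ3_XS IQ4_XS Q6_K
```

Once the printed commands look right, they can be run from the llama.cpp directory, for example by piping the output to `sh`.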

	# Supplementary Learning: Training Datasets, Their Similarities and Differences, and How to Determine Which one will Be Right for Computing your Imatrix

	# Input files for generating the Importance Matrix

	## Which file to use for generating the importance matrix

	Not all importance matrices are equal. The best results are obtained when using a source file similar to the
	training data. Size also matters: the bigger the model (e.g. 70b vs 13b) and the higher the quant (e.g. q6_k vs iq3_xs),
	the bigger the source file needs to be to make an impact. Multiple input files can be combined if needed;
	for example:
	```
	cat technical.txt multilingual.txt wiki.txt >custom.matrix
	```
	Note on context size when generating the matrix: in general, a small context size such as 512 is recommended, and community
	tests have shown it usually performs better than a larger one such as 4096. However, I would argue this is highly dependent on the
	source data you are using: with random tokens or short text a small context makes sense, but with longer texts a larger
	context matching the size of the texts might be a better choice. Remember that the size is in tokens, which roughly corresponds
	to the number of words, not characters.
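
To get a rough feel for how many context-sized chunks an input file can supply, the word count can stand in for the token count, following the rule of thumb above. This is only a back-of-the-envelope sketch: real token counts depend on the model's tokenizer, and the function name `estimate_chunks` and the generated sample file are made up for illustration.

```shell
# Estimate how many chunks of <context_size> tokens a text file can fill,
# treating one whitespace-separated word as roughly one token (a crude proxy).
# estimate_chunks is a hypothetical helper, not a llama.cpp tool.
# Usage: estimate_chunks <text_file> [context_size, default 512]
estimate_chunks() {
    file=$1
    context_size=${2:-512}
    # tr strips the padding some wc implementations add around the count
    words=$(wc -w < "$file" | tr -d ' ')
    echo $(( words / context_size ))
}

# Example: a generated 2000-word sample file
sample=$(mktemp)
awk 'BEGIN { for (i = 0; i < 2000; i++) print "word" }' > "$sample"
estimate_chunks "$sample" 512    # prints 3
estimate_chunks "$sample" 4096   # prints 0: too short to fill even one chunk
rm -f "$sample"
```

As a point of comparison, the `--chunks 100 -c 512` settings shown later on this page correspond to about 51,200 tokens, so a file estimated at well under 100 chunks is probably on the small side for those settings.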

	You will find below descriptions for the various input files provided, to help you choose the correct one.

	## Community provided files

	groups_merged\
	_"Here is a decent general purpose imatrix calibration dataset. It should be more diverse than wikitext at ~30k tokens, as it is excerpts of a larger dataset which includes coding examples (which seems quite important!)
	This means it's generally higher entropy data compared to wikitext, and it's real data rather than pseudo-randomly generated data.
	I get lower KL div than wikitext for the same length and the outputs seem qualitatively better."_ (kalomaze)\
	https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8395384

	group_10_merged\
	(superseded by groups_merged)\
	_"This is about ~50k pseudo-random tokens.
	I am getting the best balance between the maximum divergence and the other divergence statistics using this file when quantizing 7b"_ (kalomaze)\
	https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-8349233

	20k_random_data\
	(superseded by group_10_merged)\
	https://github.com/ggerganov/llama.cpp/discussions/5006#discussioncomment-8163190

	8k_random_data\
	(superseded by 20k_random_data)\
	https://github.com/ggerganov/llama.cpp/discussions/5006#discussion-6087829

	badwords\
	402 English words that can be considered dirty, naughty, obscene, or otherwise bad words.
	This could be useful to remove guard rails.
	Compiled from [Shutterstock GitHub repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

	badwords_multilingual\
	2580 words that can be considered dirty, naughty, obscene, or otherwise bad words. Includes 26 languages.
	This could be useful to remove guard rails.
	Compiled from [Shutterstock GitHub repo](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/tree/master)

	ptb.train\
	Penn Treebank (PTB) is a widely used, preprocessed, large dataset designed for language training. Casing,
	punctuation and numbers have been removed from the training data. More recently it has been largely superseded
	by WikiText, which does not have these removals and features a larger vocabulary and full articles (better
	suited for models that can take advantage of long term dependencies). However, for importance matrix training,
	PTB is still a valid dataset, which has the advantage of being manually curated, and similar to WikiText,
	without being WikiText; this can help against bias.

	WikiText\
	The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of
	verified Good and Featured articles on Wikipedia. Compared to PTB, WikiText-2 is over 2 times larger and
	WikiText-103 is over 110 times larger. As it is composed of full articles, the dataset is well suited for models
	that can take advantage of long term dependencies.\
	https://huggingface.co/datasets/wikitext

	WikiText_FR\
	70 million tokens extracted from the set of French Wikipedia articles that are classified as "quality articles"
	or "good articles".\
	https://huggingface.co/datasets/asi/wikitext_fr

	c4\
	The C4 dataset is a collection of text sourced from the public Common Crawl web scrape.
	It includes heuristics to extract only natural language (as opposed to boilerplate and other gibberish)
	in addition to extensive deduplication. The C4 dataset was explicitly designed to be English only:
	any page that was not given a probability of at least 99% of being English by langdetect was discarded.

	code (exllamav2)\
	Programming

	multilingual (exllamav2)\
	English, Arabic, Chinese, French, German, Japanese, Polish, Russian, Spanish, Swedish, Turkish, Hebrew,
	Macedonian, Norwegian, Lithuanian, Greek, Italian, Afrikaans, Dutch, Danish.

	technical (exllamav2)\
	Technical writing.

	tiny\
	Very short stories. Be mindful of the prevalence of _"Once upon a time"_ and _"<\|endoftext\|>"_.
	Extract from [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories)

	wiki (exllamav2)\
	Small Wikipedia dump. Unclean, contains many unwanted tags.

	exllamav2 calibration data taken from:\
	https://github.com/turboderp/exllamav2/tree/master/conversion/standard_cal_data

	# How to Convert Supported LLMs (Large Language Models) to GGUF Format:
	```
	llama.cpp % python convert.py path_to_model --outtype f16
	```

	## How to quantize using an imatrix, with llama.cpp

	1. Get one of the input files collected here, or elsewhere.
	2. Convert or download the model you want to quantise, in fp16 GGUF format.
	3. Generate an imatrix file specific to the model you want to quantise
	```
	cd <llama.cpp directory>
	./imatrix -m <model_path>/ggml-model-f16.gguf -f <plain_text_matrix_file> -o <output.matrix> -t 12 -ngl 144 --chunks 100 -b 512 -c 512

	# -ngl 144 : layers offloaded to GPU (recommended: the number of layers the model contains)
	# -t 12 : number of threads (should probably match the number of CPU cores)
	# -c 512 : context size, testing seems to show 512 is recommended (default=512, 0=loaded from model)
	# -b 512 : batch size (default=512)
	# --chunks 100 : maximum number of chunks to process (100 recommended)
	# --mlock : keep model in RAM (only use if you have sufficient RAM for the whole fp16 model)
	```
	4. Use the generated matrix file to quantise the model
	```
	./quantize --imatrix <output.matrix> <model_path>/ggml-model-f16.gguf <quantisation_level, eg:IQ4_XS>
	```
	Note: normal quantisation also benefits from using a matrix file. It also seems that a bigger input matrix is
	better for higher quantisation.
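
The steps above can be strung together in a small script. The sketch below only prints the commands (a dry run) and reuses the flags shown on this page. The function name `print_imatrix_pipeline`, the output file names and the placeholder paths are assumptions for illustration, and thread and GPU options are left out because they depend on your hardware.

```shell
# Dry run of steps 2-4: convert to f16 GGUF, compute the imatrix, quantize.
# print_imatrix_pipeline is a hypothetical helper; it prints commands and runs nothing.
# Usage: print_imatrix_pipeline <model_dir> <calibration_txt> <QuantType>
print_imatrix_pipeline() {
    model_dir=$1
    calibration=$2
    quant_type=$3
    f16_model="$model_dir/ggml-model-f16.gguf"
    imatrix_file="$model_dir/output.matrix"                  # assumed output name
    quantized_model="$model_dir/ggml-model-$quant_type.gguf" # assumed output name

    # Step 2: convert the original model to fp16 GGUF
    printf 'python convert.py %s --outtype f16\n' "$model_dir"
    # Step 3: compute the importance matrix from the calibration text
    printf './imatrix -m %s -f %s -o %s --chunks 100 -b 512 -c 512\n' \
        "$f16_model" "$calibration" "$imatrix_file"
    # Step 4: quantize using the generated matrix
    printf './quantize --imatrix %s %s %s %s\n' \
        "$imatrix_file" "$f16_model" "$quantized_model" "$quant_type"
}

# Example with placeholder paths
print_imatrix_pipeline models/mymodel groups_merged.txt IQ4_XS
```

Printing first makes it easy to check paths and the quant type before committing to a long imatrix run.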