mirekphd/gte-Qwen2-7B-instruct-Q8_0

This version

This model was converted to an 8-bit GGUF format (q8_0) from Alibaba-NLP/gte-Qwen2-7B-instruct using llama-quantize built from llama.cpp.

Custom conversion script settings:

"gte-Qwen2-7B-instruct": {
    "model_name": "gte-Qwen2-7B-instruct", 
    "hq_quant_type": "f32",
    "final_quant_type": "q8_0",
    "produce_final_quant": true,
    "parts_num": 2,
    "max_shard_size_gb": 4,
    "numexpr_max_thread": 8
}
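
The conversion script itself is not included in this card, so the following is only a sketch of how settings like these could map onto the standard llama.cpp tools. It assumes the script first exports a high-precision GGUF with `convert_hf_to_gguf.py` (using `hq_quant_type`) and then runs `llama-quantize` to produce the final file (using `final_quant_type`). The function name `build_commands` and all paths are hypothetical, and the sharding settings (`parts_num`, `max_shard_size_gb`) are deliberately left out because their exact meaning is not documented here.

```python
from pathlib import Path

# Settings copied from the block above (sharding-related keys omitted).
SETTINGS = {
    "model_name": "gte-Qwen2-7B-instruct",
    "hq_quant_type": "f32",
    "final_quant_type": "q8_0",
    "produce_final_quant": True,
}


def build_commands(settings, hf_dir, out_dir):
    """Return command lines for a two-step convert + quantize run.

    Assumption: the real script may work differently; this only mirrors
    the usual llama.cpp workflow. Nothing is executed here.
    """
    name = settings["model_name"]
    hq = settings["hq_quant_type"]
    hq_file = Path(out_dir) / f"{name}-{hq.upper()}.gguf"

    # Step 1: export the Hugging Face checkpoint to a high-precision GGUF.
    commands = [[
        "python", "convert_hf_to_gguf.py", str(hf_dir),
        "--outtype", hq,
        "--outfile", str(hq_file),
    ]]

    # Step 2 (optional): quantize the high-precision file to the final type.
    if settings["produce_final_quant"]:
        final = settings["final_quant_type"].upper()
        final_file = Path(out_dir) / f"{name}-{final}.gguf"
        commands.append(["llama-quantize", str(hq_file), str(final_file), final])
    return commands


# Hypothetical input and output directories, for illustration only.
for cmd in build_commands(SETTINGS, "models/gte-Qwen2-7B-instruct", "out"):
    print(" ".join(cmd))
```

Building the command lists separately from running them makes it easy to inspect what would be executed before committing to a long conversion.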

Please refer to the original model card for more details on the unquantized model, including its metrics, which may be different (typically slightly worse) for this quantized version.

gte-Qwen2-7B-instruct

gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) model family. It ranks No. 1 in both English and Chinese evaluations on the Massive Text Embedding Benchmark (MTEB) as of June 16, 2024.

Recently, the Qwen team released the Qwen2 series of models, and we have trained gte-Qwen2-7B-instruct on top of the Qwen2-7B LLM. Compared to gte-Qwen1.5-7B-instruct, gte-Qwen2-7B-instruct uses the same training data and training strategies during the finetuning stage; the only difference is the base model, which was upgraded to Qwen2-7B. Given the improvements of the Qwen2 series over the Qwen1.5 series, we also expect consistent performance gains in the embedding models.

The model incorporates several key advancements:

Integration of bidirectional attention mechanisms, enriching its contextual understanding.
Instruction tuning, applied solely on the query side for streamlined efficiency.
Comprehensive training across a vast, multilingual text corpus spanning diverse domains and scenarios. This training leverages both weakly supervised and supervised data, ensuring the model's applicability across numerous languages and a wide array of downstream tasks.
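
Because the instruction is applied only on the query side, queries and documents are prepared differently before embedding. The sketch below assumes the one-line prompt template from the original model card (`Instruct: ...` followed by `Query: ...`). The helper names and the example task description are illustrative, not part of any API.

```python
def format_query(task_description: str, query: str) -> str:
    """Prefix a query with a task instruction (assumed upstream template)."""
    return f"Instruct: {task_description}\nQuery: {query}"


def format_document(document: str) -> str:
    """Documents are embedded as-is, without any instruction prefix."""
    return document


# Illustrative task description; adapt it to the retrieval task at hand.
task = "Given a web search query, retrieve relevant passages that answer the query"

print(format_query(task, "how much protein should a female eat"))
print(format_document("Protein requirements vary with age and activity level."))
```

Keeping the instruction out of the document side means a document index can be built once and reused across tasks that differ only in their query instruction.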

Model Information

Overview

Model Type: GTE (General Text Embeddings)
Model Size: 7B
Embedding Dimension: 3584
Context Window: 131072

Supported languages

North America: English
Western Europe: German, French, Spanish, Portuguese, Italian, Dutch
Eastern & Central Europe: Russian, Czech, Polish
Middle East: Arabic, Persian, Hebrew, Turkish
Eastern Asia: Chinese, Japanese, Korean
South-Eastern Asia: Vietnamese, Thai, Indonesian, Malay, Lao, Burmese, Cebuano, Khmer, Tagalog
Southern Asia: Hindi, Bengali, Urdu
[source]

Details

llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = gte-Qwen2-7B-instruct
llama_model_loader: - kv   3:                           general.finetune str              = instruct
llama_model_loader: - kv   4:                           general.basename str              = gte-Qwen2
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,5]       = ["mteb", "sentence-transformers", "tr...
llama_model_loader: - kv   8:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   9:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  10:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv  11:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  12:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  13:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                          general.file_type u32              = 7
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,151646]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  20:                  tokenizer.ggml.token_type arr[i32,151646]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  21:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  22:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - kv  28:                                   split.no u16              = 0
llama_model_loader: - kv  29:                                split.count u16              = 8
llama_model_loader: - kv  30:                        split.tensors.count i32              = 339
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:  198 tensors
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.9308 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151646
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 7.61 B
llm_load_print_meta: model size       = 7.53 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = gte-Qwen2-7B-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:   CPU_Mapped model buffer size =  1008.21 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   959.63 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   974.51 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   983.77 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   944.73 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   944.76 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   944.74 MiB
llm_load_tensors:   CPU_Mapped model buffer size =   954.29 MiB
........................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 131072
llama_new_context_with_model: n_ctx_per_seq = 131072
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_kv_cache_init:        CPU KV buffer size =  7168.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.01 MiB
llama_new_context_with_model:        CPU compute buffer size =  7452.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
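
The KV cache figures in the log can be reproduced from the model hyperparameters, which is useful when choosing a smaller `--ctx-size` to save memory. The sketch below assumes the cache stores one f16 K vector and one f16 V vector of width `n_embd_k_gqa` / `n_embd_v_gqa` per layer and per context position. The function name is hypothetical; the numbers are taken from the log above.

```python
MIB = 1024 * 1024

# Hyperparameters reported in the log.
N_EMBD = 3584
N_HEAD = 28
N_HEAD_KV = 4
N_LAYER = 28

# Derived quantities, matching n_embd_head_k, n_gqa and n_embd_k_gqa in the log.
head_dim = N_EMBD // N_HEAD          # 128
n_gqa = N_HEAD // N_HEAD_KV          # 7 query heads share each KV head
n_embd_kv_gqa = head_dim * N_HEAD_KV  # 512


def kv_cache_mib(n_ctx: int, bytes_per_elem: int = 2):
    """Return (K, V) cache sizes in MiB for a given context size (f16 = 2 bytes)."""
    per_tensor = N_LAYER * n_ctx * n_embd_kv_gqa * bytes_per_elem
    return per_tensor / MIB, per_tensor / MIB


for n_ctx in (131072, 8192):
    k, v = kv_cache_mib(n_ctx)
    print(f"n_ctx={n_ctx}: K={k:.2f} MiB, V={v:.2f} MiB, total={k + v:.2f} MiB")
# n_ctx=131072: K=3584.00 MiB, V=3584.00 MiB, total=7168.00 MiB
# n_ctx=8192: K=224.00 MiB, V=224.00 MiB, total=448.00 MiB
```

The cache grows linearly with the context size, so running with a 8192-token context needs 1/16 of the KV memory of the full 131072-token window. The compute buffer reported in the log is separate and is not modeled here.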

Usage

Sentence Transformers

Transformers

Inference

Using `llama.cpp` to get embeddings on CPU and/or GPU

First, build or install the llama-server binary from llama.cpp, preferably with GPU support.

CLI

Server

# using remote HF repo address (with model file(s) to be downloaded and cached locally)
$ llama-server --hf-repo mirekphd/gte-Qwen2-7B-instruct-Q8_0 --hf-file gte-Qwen2-7B-instruct-Q8_0-00001-of-00008.gguf --n-gpu-layers 0 --ctx-size 131072 --embeddings

# using previously downloaded local model file(s)
$ llama-server --model <path-to-hf-models>/mirekphd/gte-Qwen2-7B-instruct-Q8_0/gte-Qwen2-7B-instruct-Q8_0-00001-of-00008.gguf --n-gpu-layers 0 --ctx-size 131072 --embeddings
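
Once the server is running, embeddings can be requested over HTTP. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on its default address (`http://127.0.0.1:8080`) and exposes the OpenAI-compatible `/v1/embeddings` route. The load log reports pooling type 0, so depending on the llama.cpp build a pooling type may need to be set on the server (llama.cpp provides a `--pooling` option) before this route returns one vector per input; treat that as something to verify locally. All helper names are hypothetical.

```python
import json
import math
import urllib.request

SERVER_URL = "http://127.0.0.1:8080"  # assumption: llama-server default host and port


def build_request(texts, server_url=SERVER_URL):
    """Build a POST request for the OpenAI-compatible embeddings route."""
    body = json.dumps({"input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{server_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(response):
    """Extract embedding vectors from the response, ordered by input index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts, server_url=SERVER_URL):
    """Send texts to a running server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts, server_url)) as resp:
        return parse_embeddings(json.load(resp))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the server up, `vectors = embed(["first text", "second text"])` followed by `cosine(vectors[0], vectors[1])` gives a similarity score between the two inputs. Nothing in the block contacts the server until `embed` is called.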

Evaluation

MTEB & C-MTEB

Cloud API Services

Citation

If you find our paper or models helpful, please consider citing:

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}