main: build = 3004 (bb9c3618)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1716729441
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = deepseek2
llama_model_loader: - kv  1: general.name str = DeepSeek-V2-Lite
llama_model_loader: - kv  2: deepseek2.block_count u32 = 27
llama_model_loader: - kv  3: deepseek2.context_length u32 = 163840
llama_model_loader: - kv  4: deepseek2.embedding_length u32 = 2048
llama_model_loader: - kv  5: deepseek2.feed_forward_length u32 = 10944
llama_model_loader: - kv  6: deepseek2.attention.head_count u32 = 16
llama_model_loader: - kv  7: deepseek2.attention.head_count_kv u32 = 16
llama_model_loader: - kv  8: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv  9: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: deepseek2.expert_used_count u32 = 6
llama_model_loader: - kv 11: general.file_type u32 = 0
llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 102400
llama_model_loader: - kv 14: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 15: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 16: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 17: deepseek2.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 18: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 19: deepseek2.expert_shared_count u32 = 2
llama_model_loader: - kv 20: deepseek2.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 21: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 23: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 24: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 25: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.070700
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = deepseek-llm
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 100000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 100001
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 100001
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - type f32: 377 tensors
llm_load_vocab: special tokens definition check successful ( 2400/102400 ).
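The metadata dump above can be reproduced without loading any tensors. A minimal sketch, assuming the gguf Python package that ships with llama.cpp is installed (pip install gguf), which provides the gguf-dump script:

    # Sketch only: print the same GGUF key-value metadata as the loader dump above.
    # Assumes `pip install gguf` put the gguf-dump console script on PATH.
    gguf-dump DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf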
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 102400
llm_load_print_meta: n_merges = 99757
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 27
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10944
llm_load_print_meta: n_expert = 64
llm_load_print_meta: n_expert_used = 6
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 16B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 15.71 B
llm_load_print_meta: model size = 58.51 GiB (32.00 BPW)
llm_load_print_meta: general.name = DeepSeek-V2-Lite
llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 126 'Ä'
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
llm_load_tensors: ggml ctx size = 0.18 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/28 layers to GPU
llm_load_tensors: CPU buffer size = 59915.48 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
ggml_cuda_host_malloc: failed to allocate 135.00 MiB of pinned memory: no CUDA-capable device is detected
llama_kv_cache_init: CPU KV buffer size = 135.00 MiB
llama_new_context_with_model: KV self size = 135.00 MiB, K (f16): 81.00 MiB, V (f16): 54.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.39 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model: CPU output buffer size = 0.39 MiB
ggml_cuda_host_malloc: failed to allocate 367.76 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model: CUDA_Host compute buffer size = 367.76 MiB
llama_new_context_with_model: graph nodes = 1924
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 25 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 316.687 ms
compute_imatrix: computing over 214 chunks with batch_size 512
ggml_cuda_host_malloc: failed to allocate 200.00 MiB of pinned memory: no CUDA-capable device is detected
compute_imatrix: 3.94 seconds per pass - ETA 14.05 minutes
[1]6.2555,[2]4.2175,[3]4.2354,[4]4.7779,[5]4.5992,[6]4.3746,[7]4.6177,[8]4.7474,[9]5.2937,
save_imatrix: stored collected data after 10 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[10]5.4986,[11]5.6408,[12]5.9810,[13]5.6804,[14]5.9820,[15]6.0953,[16]6.3420,[17]6.5456,[18]6.7452,[19]6.8095,
save_imatrix: stored collected data after 20 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[20]7.0144,[21]6.7376,[22]6.7956,[23]6.3795,[24]6.0759,[25]5.8707,[26]5.6799,[27]5.8932,[28]5.8383,[29]6.0069,
save_imatrix: stored collected data after 30 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[30]5.9428,[31]6.0374,[32]5.7749,[33]5.6207,[34]5.5905,[35]5.6057,[36]5.5774,[37]5.5393,[38]5.5691,[39]5.7045,
save_imatrix: stored collected data after 40 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[40]5.7997,[41]5.9024,[42]5.9141,[43]6.0895,[44]6.2345,[45]6.3836,[46]6.4676,[47]6.5514,[48]6.5311,[49]6.5449,
save_imatrix: stored collected data after 50 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[50]6.3847,[51]6.4601,[52]6.5167,[53]6.5824,[54]6.6533,[55]6.6995,[56]6.7367,[57]6.7719,[58]6.7740,[59]6.7665,
save_imatrix: stored collected data after 60 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[60]6.7819,[61]6.7772,[62]6.8427,[63]6.8977,[64]6.9321,[65]6.9631,[66]6.8989,[67]6.8333,[68]6.8088,[69]6.7940,
save_imatrix: stored collected data after 70 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[70]6.7560,[71]6.6992,[72]6.6262,[73]6.5936,[74]6.5718,[75]6.5353,[76]6.5785,[77]6.6122,[78]6.6220,[79]6.5713,
save_imatrix: stored collected data after 80 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[80]6.5914,[81]6.5084,[82]6.4933,[83]6.4302,[84]6.4087,[85]6.3842,[86]6.3523,[87]6.3194,[88]6.3067,[89]6.2823,
save_imatrix: stored collected data after 90 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[90]6.2839,[91]6.3025,[92]6.3198,[93]6.3372,[94]6.2966,[95]6.2789,[96]6.2877,[97]6.3021,[98]6.2908,[99]6.3007,
save_imatrix: stored collected data after 100 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[100]6.3198,[101]6.3022,[102]6.3086,[103]6.3196,[104]6.3205,[105]6.3059,[106]6.2882,[107]6.2989,[108]6.2865,[109]6.2806,
save_imatrix: stored collected data after 110 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[110]6.2540,[111]6.2815,[112]6.3146,[113]6.3094,[114]6.3091,[115]6.2984,[116]6.3304,[117]6.2803,[118]6.2706,[119]6.2413,
save_imatrix: stored collected data after 120 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[120]6.1952,[121]6.1954,[122]6.1407,[123]6.0890,[124]6.0344,[125]5.9815,[126]5.9314,[127]5.8839,[128]5.8350,[129]5.7905,
save_imatrix: stored collected data after 130 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[130]5.7808,[131]5.7651,[132]5.7437,[133]5.7147,[134]5.6963,[135]5.6677,[136]5.6463,[137]5.6236,[138]5.6059,[139]5.5843,
save_imatrix: stored collected data after 140 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[140]5.5758,[141]5.5508,[142]5.5417,[143]5.5143,[144]5.5232,[145]5.5644,[146]5.6307,[147]5.6856,[148]5.7236,[149]5.7360,
save_imatrix: stored collected data after 150 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[150]5.7317,[151]5.7352,[152]5.7437,[153]5.7248,[154]5.7356,[155]5.7334,[156]5.7457,[157]5.7393,[158]5.7350,[159]5.7339,
save_imatrix: stored collected data after 160 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[160]5.7356,[161]5.7286,[162]5.7220,[163]5.7214,[164]5.7003,[165]5.7289,[166]5.7428,[167]5.7618,[168]5.7580,[169]5.7791,
save_imatrix: stored collected data after 170 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[170]5.7874,[171]5.7781,[172]5.7794,[173]5.7839,[174]5.7875,[175]5.7931,[176]5.8008,[177]5.7806,[178]5.8077,[179]5.8437,
save_imatrix: stored collected data after 180 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[180]5.8786,[181]5.9323,[182]5.9803,[183]5.9945,[184]6.0066,[185]5.9892,[186]6.0005,[187]6.0207,[188]6.0299,[189]6.0291,
save_imatrix: stored collected data after 190 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[190]6.0346,[191]6.0436,[192]6.0529,[193]6.0541,[194]6.0535,[195]6.0681,[196]6.0828,[197]6.1170,[198]6.1112,[199]6.1171,
save_imatrix: stored collected data after 200 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[200]6.1118,[201]6.1221,[202]6.1193,[203]6.1350,[204]6.1196,[205]6.1238,[206]6.1132,[207]6.1488,[208]6.1835,[209]6.2240,
save_imatrix: stored collected data after 210 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[210]6.2520,[211]6.2770,[212]6.2461,[213]6.2213,[214]6.1906,
save_imatrix: stored collected data after 214 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat

llama_print_timings: load time = 6013.60 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 829906.84 ms / 109568 tokens ( 7.57 ms per token, 132.02 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 833391.48 ms / 109569 tokens

Final estimate: PPL = 6.1906 +/- 0.06217
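For context, a minimal sketch of the kind of invocation that produces a log like this, using the llama.cpp imatrix and quantize tools from around this build (3004). The calibration file name is an assumption; the actual input data is not shown in the log:

    # Compute the importance matrix over calibration text (CPU-only here, since
    # no CUDA device was detected above). Checkpoints are written every 10 chunks,
    # matching the save_imatrix lines in the log. `calibration.txt` is hypothetical.
    ./imatrix -m DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf \
              -f calibration.txt \
              -o DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat

    # The collected imatrix.dat is then typically fed to quantize so that
    # importance-weighted quant types can use it (Q4_K_M chosen as an example):
    ./quantize --imatrix DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat \
               DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf \
               DeepSeek-V2-Lite.Q4_K_M.gguf Q4_K_M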