main: build = 3004 (bb9c3618)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1716729441
llama_model_loader: loaded meta data with 38 key-value pairs and 377 tensors from DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv  0: general.architecture str = deepseek2
llama_model_loader: - kv  1: general.name str = DeepSeek-V2-Lite
llama_model_loader: - kv  2: deepseek2.block_count u32 = 27
llama_model_loader: - kv  3: deepseek2.context_length u32 = 163840
llama_model_loader: - kv  4: deepseek2.embedding_length u32 = 2048
llama_model_loader: - kv  5: deepseek2.feed_forward_length u32 = 10944
llama_model_loader: - kv  6: deepseek2.attention.head_count u32 = 16
llama_model_loader: - kv  7: deepseek2.attention.head_count_kv u32 = 16
llama_model_loader: - kv  8: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv  9: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: deepseek2.expert_used_count u32 = 6
llama_model_loader: - kv 11: general.file_type u32 = 0
llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 1
llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 102400
llama_model_loader: - kv 14: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 15: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 16: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 17: deepseek2.expert_feed_forward_length u32 = 1408
llama_model_loader: - kv 18: deepseek2.expert_count u32 = 64
llama_model_loader: - kv 19: deepseek2.expert_shared_count u32 = 2
llama_model_loader: - kv 20: deepseek2.expert_weights_scale f32 = 1.000000
llama_model_loader: - kv 21: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 22: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 23: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 24: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 25: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.070700
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = deepseek-llm
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,102400] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,102400] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,99757] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 100000
llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 100001
llama_model_loader: - kv 33: tokenizer.ggml.padding_token_id u32 = 100001
llama_model_loader: - kv 34: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 35: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 36: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 37: general.quantization_version u32 = 2
llama_model_loader: - type f32: 377 tensors
llm_load_vocab: special tokens definition check successful ( 2400/102400 ).
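The metadata dump above can be reproduced without loading any tensors. A minimal sketch, assuming the gguf Python package that ships with llama.cpp is installed (pip install gguf), which provides the gguf-dump script:

    # Sketch only: print the same GGUF key-value metadata as the loader dump above.
    # Assumes `pip install gguf` put the gguf-dump console script on PATH.
    gguf-dump DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf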
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 102400
llm_load_print_meta: n_merges = 99757
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 27
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10944
llm_load_print_meta: n_expert = 64
llm_load_print_meta: n_expert_used = 6
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 16B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 15.71 B
llm_load_print_meta: model size = 58.51 GiB (32.00 BPW)
llm_load_print_meta: general.name = DeepSeek-V2-Lite
llm_load_print_meta: BOS token = 100000 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 100001 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 126 'Ä'
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
llm_load_tensors: ggml ctx size = 0.18 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/28 layers to GPU
llm_load_tensors: CPU buffer size = 59915.48 MiB
.......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
ggml_cuda_host_malloc: failed to allocate 135.00 MiB of pinned memory: no CUDA-capable device is detected
llama_kv_cache_init: CPU KV buffer size = 135.00 MiB
llama_new_context_with_model: KV self size = 135.00 MiB, K (f16): 81.00 MiB, V (f16): 54.00 MiB
ggml_cuda_host_malloc: failed to allocate 0.39 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model: CPU output buffer size = 0.39 MiB
ggml_cuda_host_malloc: failed to allocate 367.76 MiB of pinned memory: no CUDA-capable device is detected
llama_new_context_with_model: CUDA_Host compute buffer size = 367.76 MiB
llama_new_context_with_model: graph nodes = 1924
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 25 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 316.687 ms
compute_imatrix: computing over 214 chunks with batch_size 512
ggml_cuda_host_malloc: failed to allocate 200.00 MiB of pinned memory: no CUDA-capable device is detected
compute_imatrix: 3.94 seconds per pass - ETA 14.05 minutes
[1]6.2555,[2]4.2175,[3]4.2354,[4]4.7779,[5]4.5992,[6]4.3746,[7]4.6177,[8]4.7474,[9]5.2937,
save_imatrix: stored collected data after 10 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[10]5.4986,[11]5.6408,[12]5.9810,[13]5.6804,[14]5.9820,[15]6.0953,[16]6.3420,[17]6.5456,[18]6.7452,[19]6.8095,
save_imatrix: stored collected data after 20 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[20]7.0144,[21]6.7376,[22]6.7956,[23]6.3795,[24]6.0759,[25]5.8707,[26]5.6799,[27]5.8932,[28]5.8383,[29]6.0069,
save_imatrix: stored collected data after 30 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[30]5.9428,[31]6.0374,[32]5.7749,[33]5.6207,[34]5.5905,[35]5.6057,[36]5.5774,[37]5.5393,[38]5.5691,[39]5.7045,
save_imatrix: stored collected data after 40 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[40]5.7997,[41]5.9024,[42]5.9141,[43]6.0895,[44]6.2345,[45]6.3836,[46]6.4676,[47]6.5514,[48]6.5311,[49]6.5449,
save_imatrix: stored collected data after 50 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[50]6.3847,[51]6.4601,[52]6.5167,[53]6.5824,[54]6.6533,[55]6.6995,[56]6.7367,[57]6.7719,[58]6.7740,[59]6.7665,
save_imatrix: stored collected data after 60 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[60]6.7819,[61]6.7772,[62]6.8427,[63]6.8977,[64]6.9321,[65]6.9631,[66]6.8989,[67]6.8333,[68]6.8088,[69]6.7940,
save_imatrix: stored collected data after 70 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[70]6.7560,[71]6.6992,[72]6.6262,[73]6.5936,[74]6.5718,[75]6.5353,[76]6.5785,[77]6.6122,[78]6.6220,[79]6.5713,
save_imatrix: stored collected data after 80 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[80]6.5914,[81]6.5084,[82]6.4933,[83]6.4302,[84]6.4087,[85]6.3842,[86]6.3523,[87]6.3194,[88]6.3067,[89]6.2823,
save_imatrix: stored collected data after 90 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[90]6.2839,[91]6.3025,[92]6.3198,[93]6.3372,[94]6.2966,[95]6.2789,[96]6.2877,[97]6.3021,[98]6.2908,[99]6.3007,
save_imatrix: stored collected data after 100 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[100]6.3198,[101]6.3022,[102]6.3086,[103]6.3196,[104]6.3205,[105]6.3059,[106]6.2882,[107]6.2989,[108]6.2865,[109]6.2806,
save_imatrix: stored collected data after 110 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[110]6.2540,[111]6.2815,[112]6.3146,[113]6.3094,[114]6.3091,[115]6.2984,[116]6.3304,[117]6.2803,[118]6.2706,[119]6.2413,
save_imatrix: stored collected data after 120 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[120]6.1952,[121]6.1954,[122]6.1407,[123]6.0890,[124]6.0344,[125]5.9815,[126]5.9314,[127]5.8839,[128]5.8350,[129]5.7905,
save_imatrix: stored collected data after 130 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[130]5.7808,[131]5.7651,[132]5.7437,[133]5.7147,[134]5.6963,[135]5.6677,[136]5.6463,[137]5.6236,[138]5.6059,[139]5.5843,
save_imatrix: stored collected data after 140 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[140]5.5758,[141]5.5508,[142]5.5417,[143]5.5143,[144]5.5232,[145]5.5644,[146]5.6307,[147]5.6856,[148]5.7236,[149]5.7360,
save_imatrix: stored collected data after 150 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[150]5.7317,[151]5.7352,[152]5.7437,[153]5.7248,[154]5.7356,[155]5.7334,[156]5.7457,[157]5.7393,[158]5.7350,[159]5.7339,
save_imatrix: stored collected data after 160 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[160]5.7356,[161]5.7286,[162]5.7220,[163]5.7214,[164]5.7003,[165]5.7289,[166]5.7428,[167]5.7618,[168]5.7580,[169]5.7791,
save_imatrix: stored collected data after 170 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[170]5.7874,[171]5.7781,[172]5.7794,[173]5.7839,[174]5.7875,[175]5.7931,[176]5.8008,[177]5.7806,[178]5.8077,[179]5.8437,
save_imatrix: stored collected data after 180 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[180]5.8786,[181]5.9323,[182]5.9803,[183]5.9945,[184]6.0066,[185]5.9892,[186]6.0005,[187]6.0207,[188]6.0299,[189]6.0291,
save_imatrix: stored collected data after 190 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[190]6.0346,[191]6.0436,[192]6.0529,[193]6.0541,[194]6.0535,[195]6.0681,[196]6.0828,[197]6.1170,[198]6.1112,[199]6.1171,
save_imatrix: stored collected data after 200 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[200]6.1118,[201]6.1221,[202]6.1193,[203]6.1350,[204]6.1196,[205]6.1238,[206]6.1132,[207]6.1488,[208]6.1835,[209]6.2240,
save_imatrix: stored collected data after 210 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat
[210]6.2520,[211]6.2770,[212]6.2461,[213]6.2213,[214]6.1906,
save_imatrix: stored collected data after 214 chunks in DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat

llama_print_timings: load time = 6013.60 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 829906.84 ms / 109568 tokens ( 7.57 ms per token, 132.02 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 833391.48 ms / 109569 tokens

Final estimate: PPL = 6.1906 +/- 0.06217
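For context, a minimal sketch of the kind of invocation that produces a log like this, using the llama.cpp imatrix and quantize tools from around this build (3004). The calibration file name is an assumption; the actual input data is not shown in the log:

    # Compute the importance matrix over calibration text (CPU-only here, since
    # no CUDA device was detected above). Checkpoints are written every 10 chunks,
    # matching the save_imatrix lines in the log. `calibration.txt` is hypothetical.
    ./imatrix -m DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf \
              -f calibration.txt \
              -o DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat

    # The collected imatrix.dat is then typically fed to quantize so that
    # importance-weighted quant types can use it (Q4_K_M chosen as an example):
    ./quantize --imatrix DeepSeek-V2-Lite-IMat-GGUF/imatrix.dat \
               DeepSeek-V2-Lite-IMat-GGUF/DeepSeek-V2-Lite.gguf \
               DeepSeek-V2-Lite.Q4_K_M.gguf Q4_K_M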