main: build = 3003 (d298382a)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1716752123
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from Phi-3-mini-128k-instruct-IMat-GGUF/Phi-3-mini-128k-instruct.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.name str = Phi3
llama_model_loader: - kv 2: phi3.context_length u32 = 131072
llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072
llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192
llama_model_loader: - kv 6: phi3.block_count u32 = 32
llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32
llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96
llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 12: general.file_type u32 = 0
llama_model_loader: - kv 13: phi3.rope.scaling.attn_factor f32 = 1.190238
llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32000
llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - type f32: 197 tensors
llm_load_vocab: special tokens definition check successful ( 323/32064 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32064
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 3072
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 96
llm_load_print_meta: n_embd_head_k = 96
llm_load_print_meta: n_embd_head_v = 96
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 3072
llm_load_print_meta: n_embd_v_gqa = 3072
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 8192
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = all F32
llm_load_print_meta: model params = 3.82 B
llm_load_print_meta: model size = 14.23 GiB (32.00 BPW)
llm_load_print_meta: general.name = Phi3
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: EOT token = 32007 '<|end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 375.75 MiB
llm_load_tensors: CUDA0 buffer size = 14200.53 MiB
....................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 192.00 MiB
llama_new_context_with_model: KV self size = 192.00 MiB, K (f16): 96.00 MiB, V (f16): 96.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 83.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 7.01 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 25 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 133.64 ms
compute_imatrix: computing over 234 chunks with batch_size 512
compute_imatrix: 0.32 seconds per pass - ETA 1.23 minutes
[1]6.0727,[2]4.4610,[3]4.4629,[4]4.9370,[5]5.3244,[6]5.4170,[7]4.8496,[8]5.2827,[9]5.5966,
save_imatrix: stored collected data after 10 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[10]5.9047,[11]5.8828,[12]5.4226,[13]5.5632,[14]5.4485,[15]5.8942,[16]5.9986,[17]6.2966,[18]6.4616,[19]6.6562,
save_imatrix: stored collected data after 20 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[20]6.8072,[21]6.8799,[22]7.1044,[23]6.8300,[24]6.6506,[25]6.6546,[26]6.2990,[27]6.0381,[28]5.7430,[29]5.7002,
save_imatrix: stored collected data after 30 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[30]5.8055,[31]5.8773,[32]5.9258,[33]5.9168,[34]5.9704,[35]5.9718,[36]5.7531,[37]5.6171,[38]5.5492,[39]5.5208,
save_imatrix: stored collected data after 40 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[40]5.4923,[41]5.4249,[42]5.4651,[43]5.5062,[44]5.5516,[45]5.6231,[46]5.7071,[47]5.7971,[48]5.9323,[49]6.0424,
save_imatrix: stored collected data after 50 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[50]6.1599,[51]6.2644,[52]6.3634,[53]6.3316,[54]6.2462,[55]6.1791,[56]6.2703,[57]6.3211,[58]6.3341,[59]6.3940,
save_imatrix: stored collected data after 60 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[60]6.4760,[61]6.5040,[62]6.5788,[63]6.6285,[64]6.7074,[65]6.7470,[66]6.7897,[67]6.8378,[68]6.8799,[69]6.9477,
save_imatrix: stored collected data after 70 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[70]6.9901,[71]7.0393,[72]7.0741,[73]7.0324,[74]6.9788,[75]6.9180,[76]6.8588,[77]6.8484,[78]6.7958,[79]6.7419,
save_imatrix: stored collected data after 80 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[80]6.6785,[81]6.6562,[82]6.6094,[83]6.5719,[84]6.5868,[85]6.6086,[86]6.6214,[87]6.6579,[88]6.6725,[89]6.6531,
save_imatrix: stored collected data after 90 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[90]6.6234,[91]6.6468,[92]6.6571,[93]6.6760,[94]6.6899,[95]6.7034,[96]6.7345,[97]6.7548,[98]6.7283,[99]6.6884,
save_imatrix: stored collected data after 100 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[100]6.7032,[101]6.7244,[102]6.7139,[103]6.6792,[104]6.6236,[105]6.6084,[106]6.6134,[107]6.6201,[108]6.5983,[109]6.5869,
save_imatrix: stored collected data after 110 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[110]6.5668,[111]6.5736,[112]6.5833,[113]6.5814,[114]6.5906,[115]6.5858,[116]6.5842,[117]6.5771,[118]6.5829,[119]6.5621,
save_imatrix: stored collected data after 120 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[120]6.5647,[121]6.5506,[122]6.5262,[123]6.5435,[124]6.5348,[125]6.5383,[126]6.5255,[127]6.5257,[128]6.5349,[129]6.5169,
save_imatrix: stored collected data after 130 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[130]6.4927,[131]6.4831,[132]6.4802,[133]6.4305,[134]6.4382,[135]6.4156,[136]6.3964,[137]6.3724,[138]6.3466,[139]6.3174,
save_imatrix: stored collected data after 140 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[140]6.2951,[141]6.2763,[142]6.2547,[143]6.2554,[144]6.2527,[145]6.2348,[146]6.2123,[147]6.2092,[148]6.1993,[149]6.1886,
save_imatrix: stored collected data after 150 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[150]6.1821,[151]6.1677,[152]6.1637,[153]6.1538,[154]6.1412,[155]6.1651,[156]6.1401,[157]6.1342,[158]6.1511,[159]6.1461,
save_imatrix: stored collected data after 160 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[160]6.1502,[161]6.1632,[162]6.1662,[163]6.1865,[164]6.1987,[165]6.2188,[166]6.2278,[167]6.2257,[168]6.2270,[169]6.2340,
save_imatrix: stored collected data after 170 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[170]6.2467,[171]6.2367,[172]6.2370,[173]6.2540,[174]6.2566,[175]6.2740,[176]6.2834,[177]6.2939,[178]6.3006,[179]6.3317,
save_imatrix: stored collected data after 180 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[180]6.3424,[181]6.3923,[182]6.4110,[183]6.4381,[184]6.4432,[185]6.4486,[186]6.4541,[187]6.4579,[188]6.4483,[189]6.4521,
save_imatrix: stored collected data after 190 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[190]6.4586,[191]6.4720,[192]6.4769,[193]6.5062,[194]6.4950,[195]6.4664,[196]6.5076,[197]6.5460,[198]6.5764,[199]6.6271,
save_imatrix: stored collected data after 200 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[200]6.6740,[201]6.6813,[202]6.6861,[203]6.6440,[204]6.6408,[205]6.6470,[206]6.6681,[207]6.6649,[208]6.6679,[209]6.6688,
save_imatrix: stored collected data after 210 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[210]6.6772,[211]6.6909,[212]6.6907,[213]6.6877,[214]6.6944,[215]6.7130,[216]6.7305,[217]6.7338,[218]6.7346,[219]6.7292,
save_imatrix: stored collected data after 220 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[220]6.7205,[221]6.7193,[222]6.7178,[223]6.7324,[224]6.7153,[225]6.7209,[226]6.7047,[227]6.7403,[228]6.7806,[229]6.8252,
save_imatrix: stored collected data after 230 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
[230]6.8657,[231]6.8871,[232]6.8663,[233]6.8447,[234]6.8193,
save_imatrix: stored collected data after 234 chunks in Phi-3-mini-128k-instruct-IMat-GGUF/imatrix.dat
llama_print_timings: load time = 2127.07 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 52676.50 ms / 119808 tokens ( 0.44 ms per token, 2274.41 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 55109.70 ms / 119809 tokens
Final estimate: PPL = 6.8193 +/- 0.07007