llama_model_loader: loaded meta data with 29 key-value pairs and 254 tensors from RoGemma-7b-Instruct-IMat-GGUF/RoGemma-7b-Instruct.Q8_0.gguf.hardlink.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.name str              = RoGemma-7b-Instruct
llama_model_loader: - kv   2:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   3:                     gemma.embedding_length u32              = 3072
llama_model_loader: - kv   4:                          gemma.block_count u32              = 28
llama_model_loader: - kv   5:                  gemma.feed_forward_length u32              = 24576
llama_model_loader: - kv   6:                 gemma.attention.head_count u32              = 16
llama_model_loader: - kv   7:              gemma.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  10:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  11:                          general.file_type u32              = 7
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,256000]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {{ '<bos>' }}{% if messages[0]['role'...
llama_model_loader: - kv  24:             tokenizer.ggml.prefix_token_id u32              = 67
llama_model_loader: - kv  25:             tokenizer.ggml.suffix_token_id u32              = 69
llama_model_loader: - kv  26:             tokenizer.ggml.middle_token_id u32              = 68
llama_model_loader: - kv  27:                tokenizer.ggml.eot_token_id u32              = 107
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   57 tensors
llama_model_loader: - type q8_0:  197 tensors
llm_load_vocab: special tokens cache size = 260
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 192
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 24576
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.54 B
llm_load_print_meta: model size       = 8.45 GiB (8.50 BPW)
llm_load_print_meta: general.name     = RoGemma-7b-Instruct
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: PRE token        = 67 '<unused60>'
llm_load_print_meta: SUF token        = 69 '<unused62>'
llm_load_print_meta: MID token        = 68 '<unused61>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 93
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.24 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   796.88 MiB
llm_load_tensors:      CUDA0 buffer size =  8651.54 MiB
......................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   506.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     7.01 MiB
llama_new_context_with_model: graph nodes  = 931
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 25 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 122.982 ms
compute_imatrix: computing over 128 chunks with batch_size 512
compute_imatrix: 0.72 seconds per pass - ETA 1.53 minutes
[1]6.7268,[2]4.7956,[3]4.3365,[4]5.4634,[5]5.5761,[6]4.7844,[7]5.2281,[8]5.4576,[9]5.6880,
save_imatrix: stored collected data after 10 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[10]5.1084,[11]5.2387,[12]5.6425,[13]6.0984,[14]6.4094,[15]6.7641,[16]7.0334,[17]7.1327,[18]7.4186,[19]7.1324,
save_imatrix: stored collected data after 20 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[20]7.2420,[21]7.3956,[22]7.3754,[23]7.4916,[24]7.5334,[25]7.6872,[26]7.4682,[27]7.7364,[28]7.9965,[29]7.9577,
save_imatrix: stored collected data after 30 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[30]7.9155,[31]7.4608,[32]7.2029,[33]7.1080,[34]6.9745,[35]6.9006,[36]7.1893,[37]7.2251,[38]7.2493,[39]7.3769,
save_imatrix: stored collected data after 40 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[40]7.4966,[41]7.6572,[42]7.9743,[43]8.2865,[44]8.5892,[45]8.7742,[46]8.6579,[47]8.6807,[48]8.8624,[49]8.9978,
save_imatrix: stored collected data after 50 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[50]8.8560,[51]8.8432,[52]8.8797,[53]8.9995,[54]9.1008,[55]9.2532,[56]9.3016,[57]9.3064,[58]9.3182,[59]9.1306,
save_imatrix: stored collected data after 60 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[60]9.0207,[61]8.8842,[62]8.8355,[63]8.8759,[64]8.8721,[65]8.8531,[66]8.8684,[67]8.8022,[68]8.7356,[69]8.7641,
save_imatrix: stored collected data after 70 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[70]8.7371,[71]8.7328,[72]8.7399,[73]8.7070,[74]8.6636,[75]8.6279,[76]8.6341,[77]8.6556,[78]8.6528,[79]8.6093,
save_imatrix: stored collected data after 80 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[80]8.6513,[81]8.6853,[82]8.6525,[83]8.6525,[84]8.6889,[85]8.5525,[86]8.5107,[87]8.4442,[88]8.4460,[89]8.4832,
save_imatrix: stored collected data after 90 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[90]8.4874,[91]8.4152,[92]8.3457,[93]8.2611,[94]8.1827,[95]8.1159,[96]8.0418,[97]7.9787,[98]7.9182,[99]7.9335,
save_imatrix: stored collected data after 100 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[100]7.9383,[101]8.0119,[102]8.0867,[103]8.1573,[104]8.2901,[105]8.3851,[106]8.4142,[107]8.4310,[108]8.4466,[109]8.4205,
save_imatrix: stored collected data after 110 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[110]8.4063,[111]8.3474,[112]8.2734,[113]8.3102,[114]8.3227,[115]8.3146,[116]8.3066,[117]8.3351,[118]8.3537,[119]8.3591,
save_imatrix: stored collected data after 120 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat
[120]8.3618,[121]8.3700,[122]8.3300,[123]8.3984,[124]8.4737,[125]8.5299,[126]8.6142,[127]8.6792,[128]8.7451,
save_imatrix: stored collected data after 128 chunks in RoGemma-7b-Instruct-IMat-GGUF/imatrix.dat

llama_print_timings:        load time =    6235.43 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   76873.60 ms / 65536 tokens (    1.17 ms per token,   852.52 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   83878.25 ms / 65537 tokens

Final estimate: PPL = 8.7451 +/- 0.12878