[
{
"path": "table_paper/2407.00088v1.json",
"table_id": "2",
"section": "5.1",
"all_context": [
"As shown in Table 2 , we evaluate T-MAC across four distinct edge devices.",
"These devices range from high-performance ones like M2-Ultra to less powerful ones like Raspberry Pi.",
"The CPUs tested encompass Intel Core, Apple Silicon, and Cortex series.",
"The operating systems include OSX, Linux, and Windows.",
"This evaluation guarantees T-MAC s cross-platform compatibility and consistent performance across different instruction sets and various edge deployment scenarios.",
"To evaluate the performance of T-MAC, we conduct extensive benchmarks using real-word low-bit LLMs and scenarios.",
"For the kernel performance benchmark, we select matrix shapes derived from the Llama-2-7B and Llama-2-13B models, ensuring our evaluation reflects the practical demands.",
"To conduct an end-to-end throughput test, we employed actual quantized models to demonstrate the practical efficacy of T-MAC across different bit-width configurations.",
"Specifically, we employ 4-bit,3-bit,2-bit and 1-bit quantized Llama models, and also 1-bit and 1.58bit BitNet models that are trained from scratch.",
"The 4-bit Llama models are from GPTQ (frantar2022gptq, ).",
"The 3-bit and 2-bit Llama models are from BitDistiller (du2024bitdistiller, ).",
"The 1-bit Llama models are from OneBit (xu2024onebit, ).",
"We compared the performance of T-MAC with llama.cpp, a state-of-the-art implementation for LLM deployment on edge devices.",
"We chose llama.cpp as the baseline for several compelling reasons.",
"Firstly, llama.cpp represents the cutting-edge in LLM deployment on edge devices, featuring highly optimized kernel implementations tailored to each hardware platform.",
"Its versatility and robust performance make it an ideal benchmark for assessing the efficacy of new methodologies.",
"Additionally, llama.cpp is implemented in plain C/C++ without any dependencies, ensuring maximum compatibility and efficiency across diverse hardware configurations.",
"For kernel performance benchmarks, we utilized the optimized kernels provided by llama.cpp as the baselines on the respective hardware devices.",
"In our end-to-end throughput evaluations, we integrate the LUT-based kernels from T-MAC to llama.cpp and compare it with original llama.cpp.",
"We perform both kernel-level and model-level measurement.",
"To obtain precise and consistent kernel-level latency on CPU, we first perform a warmup of 10 iterations, followed by 100 runs to calculate an average.",
"The warmup on M2-Ultra differs slightly from the others, requiring at least 1 second to maximize performance.",
"To perform model-level latency, we integrate T-MAC into llama.cpp.",
"We repeatedly generate 64 tokens for 20 iterations to evaluate token generation throughput.",
""
],
"target_context_ids": [
0,
1,
2,
3,
4
],
"selected_paragraphs": [
"[paragraph id = 0] As shown in Table 2 , we evaluate T-MAC across four distinct edge devices.",
"[paragraph id = 1] These devices range from high-performance ones like M2-Ultra to less powerful ones like Raspberry Pi.",
"[paragraph id = 2] The CPUs tested encompass Intel Core, Apple Silicon, and Cortex series.",
"[paragraph id = 3] The operating systems include OSX, Linux, and Windows.",
"[paragraph id = 4] This evaluation guarantees T-MAC s cross-platform compatibility and consistent performance across different instruction sets and various edge deployment scenarios."
],
"table_html": "<figure class=\"ltx_table\" id=\"S5.T2\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S5.T2.2\" style=\"width:433.6pt;height:129.4pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(35.8pt,-10.7pt) scale(1.19793527173856,1.19793527173856) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S5.T2.2.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S5.T2.2.1.1.1\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T2.2.1.1.1.1\" rowspan=\"2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T2.2.1.1.1.1.1\">Device</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T2.2.1.1.1.2\" rowspan=\"2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T2.2.1.1.1.2.1\">Processor</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T2.2.1.1.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T2.2.1.1.1.3.1\">Performance</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T2.2.1.1.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T2.2.1.1.1.4.1\">Max. 
Memory</span></th>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T2.2.1.2.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T2.2.1.2.2.1\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T2.2.1.2.2.1.1\">Cores</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T2.2.1.2.2.2\">\n<span class=\"ltx_text ltx_font_bold\" id=\"S5.T2.2.1.2.2.2.1\">Bandwidth</span> (GB/s)</th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S5.T2.2.1.3.1\">\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T2.2.1.3.1.1\">M2-Ultra</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T2.2.1.3.1.2\">Apple M2-Ultra</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T2.2.1.3.1.3\">16</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T2.2.1.3.1.4\">819.2</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T2.2.1.4.2\">\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.4.2.1\">Raspberry Pi 5</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.4.2.2\">ARM Cortex-A76</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.4.2.3\">4</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.4.2.4\">17.1</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T2.2.1.5.3\">\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.5.3.1\">Jetson AGX Orin</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.5.3.2\">ARM Cortex-A78AE</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.5.3.3\">12</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T2.2.1.5.3.4\">204.8</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T2.2.1.6.4\">\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T2.2.1.6.4.1\">Surface Book 3</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T2.2.1.6.4.2\">Intel Core i5-1035G7</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T2.2.1.6.4.3\">4</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" 
id=\"S5.T2.2.1.6.4.4\">58.2</td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\"><span class=\"ltx_text\" id=\"S5.T2.3.1.1\" style=\"font-size:90%;\">Table 2</span>. </span><span class=\"ltx_text\" id=\"S5.T2.4.2\" style=\"font-size:90%;\">Hardware device specification.</span></figcaption>\n</figure>",
"perturb_sentence_id": [
0,
1
],
"output": {
"perturbed_statement": "[paragraph id = 0] As shown in Table 2, we evaluate T-MAC across three distinct edge devices. These devices range from high-performance ones like M2-Ultra to less powerful ones like Raspberry Pi.",
"perturbed_explanation": "1. The original explanation states that T-MAC is evaluated across four distinct edge devices, ensuring a comprehensive assessment of its performance. \n2. The statement is incorrect because it mentions that T-MAC is evaluated across three distinct edge devices, reducing the scope of the evaluation inaccurately. The context implies a wider range of devices is used in the assessment."
}
},
{
"path": "table_paper/2407.00088v1.json",
"table_id": "4",
"section": "5.6",
"all_context": [
"There are two sources of error compared to conventional mpGEMM implementation: (a) table quantization, which is an algorithmic approximation included in our method, and (b) fast aggregation, whose error is introduced during the instruction execution within the fixed CPU architecture.",
"We evaluate the impact of these two error sources at both kernel-level and model-level.",
"We use the unquantized GEMV as the benchmark.",
"The weights and activation of the GEMV are randomly generated FP16 values following a Gaussian Distribution, which are then quantized to 4-bit for execution by llama.cpp and T-MAC.",
"The Normalized Mean Squared Error (NMSE) is then computed between the ground truth and the mpGEMV outputs.",
"As shown in Table.",
"3 , the NMSE difference between llama.cpp and T-MAC is negligible, indicating that the table quantization error is minimal.",
"However, after applying fast aggregation, the NMSE increases to 2.5.",
"To examine the impact of these errors on real-world models, we chose Llama-2-7B for testing.",
"The models are the GGUF model converted from official Llama-2-7B weights for the un-quantized ground truth and the original llama-2-7b.Q4_0.gguf model (gguf-models, ) released with llama.cpp for mpGEMM.",
"After integrating T-MAC into llama.cpp, we conduct the evaluation through the perplexity (llamacpp-perplexity, ) tool provided by llama.cpp.",
"The evaluation is performed on three different tasks: WikiText-2 (merity2016pointer, ) and lambada_openai (paperno-etal-2016-lambada, ; radford2019language, ) for perplexity (the lower the better), and WinoGrande (ai2:winogrande, ) for question answering accuracy (the higher the better.",
"As shown in Table 4 , on all of the three tasks, T-MAC delivers the same results compared to llama.cpp, suggesting that the error introduced by T-MAC is negligible for real-world models.",
"After toggling on the fast aggregation, the perplexity increases by 0.4 and 1.0 respectively and the accuracy drops by 0.3%.",
"In summary, T-MAC introduces negligible error to model inference while offering significant speedup.",
"The fast aggregation can further enhance performance, but at the cost of model quality.",
"We offer this as an option for users in scenarios that prioritize real-time performance and are less sensitive to accuracy.",
"Without fast aggregation, T-MAC can still achieve substantial gain according to Figure 10 .",
"In the future, we anticipate the error introduced by fast aggregation can be mitigated with straightforward optimizations of the CPU micro-architecture.",
""
],
"target_context_ids": [
11,
12,
13,
14,
15
],
"selected_paragraphs": [
"[paragraph id = 11] The evaluation is performed on three different tasks: WikiText-2 (merity2016pointer, ) and lambada_openai (paperno-etal-2016-lambada, ; radford2019language, ) for perplexity (the lower the better), and WinoGrande (ai2:winogrande, ) for question answering accuracy (the higher the better.",
"[paragraph id = 12] As shown in Table 4 , on all of the three tasks, T-MAC delivers the same results compared to llama.cpp, suggesting that the error introduced by T-MAC is negligible for real-world models.",
"[paragraph id = 13] After toggling on the fast aggregation, the perplexity increases by 0.4 and 1.0 respectively and the accuracy drops by 0.3%.",
"[paragraph id = 14] In summary, T-MAC introduces negligible error to model inference while offering significant speedup.",
"[paragraph id = 15] The fast aggregation can further enhance performance, but at the cost of model quality."
],
"table_html": "<figure class=\"ltx_table\" id=\"S5.T4\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S5.T4.4\" style=\"width:433.6pt;height:128.2pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(34.1pt,-10.1pt) scale(1.18686896846672,1.18686896846672) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S5.T4.4.4\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S5.T4.4.4.5.1\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_column ltx_th_row ltx_border_tt\" id=\"S5.T4.4.4.5.1.1\" rowspan=\"2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.4.4.5.1.1.1\">Framework</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T4.4.4.5.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.4.4.5.1.2.1\">Throughput</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T4.4.4.5.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.4.4.5.1.3.1\">WikiText2</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T4.4.4.5.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.4.4.5.1.4.1\">lambada_openai</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T4.4.4.5.1.5\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T4.4.4.5.1.5.1\">WinoGrande</span></th>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T4.4.4.4\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T4.1.1.1.1\">Tokens/sec \n</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T4.2.2.2.2\">PPL \n</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T4.3.3.3.3\">PPL \n</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T4.4.4.4.4\">Acc. 
\n</th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S5.T4.4.4.6.1\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_t\" id=\"S5.T4.4.4.6.1.1\">Un-quantized</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T4.4.4.6.1.2\">3.79</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T4.4.4.6.1.3\">5.80</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T4.4.4.6.1.4\">12.65</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T4.4.4.6.1.5\">71.0</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T4.4.4.7.2\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S5.T4.4.4.7.2.1\">llama.cpp</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.7.2.2\">5.65</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.7.2.3\">5.96</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.7.2.4\">12.95</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.7.2.5\">70.8</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T4.4.4.8.3\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row\" id=\"S5.T4.4.4.8.3.1\">T-MAC</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.8.3.2\">7.34</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.8.3.3\">5.96</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.8.3.4\">12.95</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T4.4.4.8.3.5\">70.8</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T4.4.4.9.4\">\n<th class=\"ltx_td ltx_align_left ltx_th ltx_th_row ltx_border_bb\" id=\"S5.T4.4.4.9.4.1\">T-MAC (+FA)</th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T4.4.4.9.4.2\">8.97</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T4.4.4.9.4.3\">6.38</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T4.4.4.9.4.4\">13.99</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T4.4.4.9.4.5\">67.8</td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption 
ltx_centering\"><span class=\"ltx_tag ltx_tag_table\"><span class=\"ltx_text\" id=\"S5.T4.10.3.1\" style=\"font-size:90%;\">Table 4</span>. </span><span class=\"ltx_text\" id=\"S5.T4.8.2\" style=\"font-size:90%;\">End-to-end throughput and model quality of Llama-2-7B-4bit on M2-Ultra with a single thread. T-MAC improves throughput by 1.3× compared to llama.cpp with the same model quality. Fast Aggregation (FA) can further improve the throughput gain to 1.6×, but the model quality will drop because of the numerical error of current CPU instructions.</span></figcaption>\n</figure>",
"perturb_sentence_id": [
11,
12
],
"output": {
"perturbed_statement": "[paragraph id = 11] The evaluation is performed on three different tasks: WikiText-2 (merity2016pointer, ) and lambada_openai (paperno-etal-2016-lambada, ; radford2019language, ) for perplexity (the higher the better), and WinoGrande (ai2:winogrande, ) for question answering accuracy (the higher the better. As shown in Table 4 , on all of the three tasks, T-MAC delivers significantly better results compared to llama.cpp, suggesting that the error introduced by T-MAC is negligible for real-world models.",
"perturbed_explanation": "1. The Original Explanation discusses the performance of T-MAC compared to llama.cpp across several tasks. It is mentioned that the evaluation is performed on three tasks, with the goal for perplexity being lower (indicating better performance) and for question-answering accuracy being higher. T-MAC reportedly achieves results consistent with llama.cpp, demonstrating its negligible error impact.\n2. The statement incorrectly suggests that higher perplexity is better, whereas lower perplexity is preferable as it indicates better model performance. Furthermore, it inaccurately claims that T-MAC delivers significantly better results than llama.cpp across all tasks, contradicting the context's conclusion of T-MAC delivering the same results as llama.cpp."
}
},
{
"path": "table_paper/2407.00088v1.json",
"table_id": "5",
"section": "5.7",
"all_context": [
"GPUs are widely used in LLM deployments.",
"We compare T-MAC on CPU with llama.cpp on GPU to illustrate the efficiency of T-MAC.",
"llama.cpp is the state-of-the-art LLM inference framework on both CPU and GPU.",
"Figure 11 shows the mpGEMV kernel performance comparsion of T-MAC (CPU) and llama.cpp (GPU) on NVIDIA Jetson AGX Orin, a platform with ARM CPU and NVIDIA CUDA GPU.",
"The kernel configurations are all from Llama-2-7B.",
"T-MAC significantly outperforms GPU on W1A16 on all cases, while achieves comparable performance on W2A16 and W3A16.",
"Although GPU performs better on higher bits and larger shape due to its powerful parallel computing capacity, this evaluation still shows huge potential of CPU-based LLM deployments on edge devices.",
"Table 5 shows the end-to-end comparison of the Llama-2-7B-2bit model on NVIDIA Jetson AGX Orin.",
"Without T-MAC, CPU only performs better than GPU in power, however, the energy consumption is still worse than GPU due to lower throughput.",
"Compared to llama.cpp on CPU, T-MAC not only improves the throughput to 2.2, but also reduces the power to 69, resulting in 3.2 energy efficiency.",
"Compared to llama.cpp on GPU, although T-MAC only achieves 78 throughput, T-MAC only needs 34 power, resulting in 2.3 energy efficiency.",
"Note that Figure 11 shows T-MAC outperforms the GPU on the mpGEMV kernels.",
"The reason why the throughput of T-MAC is still lower than that of GPU is due to the performance of kernels except mpGEMVs in llama.cpp on CPU.",
""
],
"target_context_ids": [
7,
8,
9,
10
],
"selected_paragraphs": [
"[paragraph id = 7] Table 5 shows the end-to-end comparison of the Llama-2-7B-2bit model on NVIDIA Jetson AGX Orin.",
"[paragraph id = 8] Without T-MAC, CPU only performs better than GPU in power, however, the energy consumption is still worse than GPU due to lower throughput.",
"[paragraph id = 9] Compared to llama.cpp on CPU, T-MAC not only improves the throughput to 2.2, but also reduces the power to 69, resulting in 3.2 energy efficiency.",
"[paragraph id = 10] Compared to llama.cpp on GPU, although T-MAC only achieves 78 throughput, T-MAC only needs 34 power, resulting in 2.3 energy efficiency."
],
"table_html": "<figure class=\"ltx_table\" id=\"S5.T5\">\n<div class=\"ltx_inline-block ltx_align_center ltx_transformed_outer\" id=\"S5.T5.2\" style=\"width:355.6pt;height:131.4pt;vertical-align:-0.0pt;\"><span class=\"ltx_transformed_inner\" style=\"transform:translate(56.0pt,-20.7pt) scale(1.46000059019698,1.46000059019698) ;\">\n<table class=\"ltx_tabular ltx_guessed_headers ltx_align_middle\" id=\"S5.T5.2.1\">\n<thead class=\"ltx_thead\">\n<tr class=\"ltx_tr\" id=\"S5.T5.2.1.1.1\">\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_column ltx_th_row ltx_border_tt\" id=\"S5.T5.2.1.1.1.1\" rowspan=\"2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.1.1.1.1\">Framework</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T5.2.1.1.1.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.1.1.2.1\">Throughput</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T5.2.1.1.1.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.1.1.3.1\">Power</span></th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column ltx_border_tt\" id=\"S5.T5.2.1.1.1.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.1.1.4.1\">Energy</span></th>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T5.2.1.2.2\">\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T5.2.1.2.2.1\">Tokens/sec</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T5.2.1.2.2.2\">W</th>\n<th class=\"ltx_td ltx_align_center ltx_th ltx_th_column\" id=\"S5.T5.2.1.2.2.3\">J/token</th>\n</tr>\n</thead>\n<tbody class=\"ltx_tbody\">\n<tr class=\"ltx_tr\" id=\"S5.T5.2.1.3.1\">\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_t\" id=\"S5.T5.2.1.3.1.1\">llama.cpp (CPU)</th>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T5.2.1.3.1.2\">7.08</td>\n<td class=\"ltx_td ltx_align_center ltx_border_t\" id=\"S5.T5.2.1.3.1.3\">15.0</td>\n<td class=\"ltx_td ltx_align_center 
ltx_border_t\" id=\"S5.T5.2.1.3.1.4\">2.12</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T5.2.1.4.2\">\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row\" id=\"S5.T5.2.1.4.2.1\">llama.cpp (GPU)</th>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T5.2.1.4.2.2\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.4.2.2.1\">20.03</span></td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T5.2.1.4.2.3\">30.8</td>\n<td class=\"ltx_td ltx_align_center\" id=\"S5.T5.2.1.4.2.4\">1.54</td>\n</tr>\n<tr class=\"ltx_tr\" id=\"S5.T5.2.1.5.3\">\n<th class=\"ltx_td ltx_align_right ltx_th ltx_th_row ltx_border_bb\" id=\"S5.T5.2.1.5.3.1\">T-MAC (CPU)</th>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T5.2.1.5.3.2\">15.62</td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T5.2.1.5.3.3\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.5.3.3.1\">10.4</span></td>\n<td class=\"ltx_td ltx_align_center ltx_border_bb\" id=\"S5.T5.2.1.5.3.4\"><span class=\"ltx_text ltx_font_bold\" id=\"S5.T5.2.1.5.3.4.1\">0.66</span></td>\n</tr>\n</tbody>\n</table>\n</span></div>\n<figcaption class=\"ltx_caption ltx_centering\"><span class=\"ltx_tag ltx_tag_table\"><span class=\"ltx_text\" id=\"S5.T5.3.1.1\" style=\"font-size:90%;\">Table 5</span>. </span><span class=\"ltx_text\" id=\"S5.T5.4.2\" style=\"font-size:90%;\">Llama-2-7B-2bit end-to-end inference throughput, power and energy comparisons on NVIDIA Jetson AGX Orin.</span></figcaption>\n</figure>",
"perturb_sentence_id": [
10,
11
],
"output": {
"perturbed_statement": "[paragraph id = 10] Compared to llama.cpp on GPU, T-MAC only achieves 78 throughput and requires 34 power, resulting in 3.2 energy efficiency. Note that Figure 11 shows T-MAC outperforms the GPU on the mpGEMV kernels.",
"perturbed_explanation": "1. T-MAC improves the throughput to 2.2, not 78. 2. The power consumption with T-MAC is reduced to 69, not 34. 3. The energy efficiency with T-MAC in comparison to llama.cpp on CPU is 3.2. However, the statement should not specify that this value results from comparing with the GPU, as this value was specifically mentioned relative to CPU. Therefore, attributing 3.2 energy efficiency to the GPU comparison is incorrect."
}
}
]