Lekr0 committed 11 days ago

Commit 0146652 · verified · 1 Parent(s): 32a1ae1

Add files using upload-large-folder tool


This view is limited to 50 files because it contains too many changes.

Files changed (50)

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/__grp__triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.cubin +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.llir +667 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ptx +1534 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.source +299 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttgir +232 -0
SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttir +231 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/__grp__triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.llir +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ptx +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.source +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ttgir +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ttir +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/__grp__triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.cubin +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.llir +266 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ptx +640 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.source +379 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ttgir +270 -0
SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ttir +246 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/__grp__triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.llir +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ptx +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.source +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttgir +841 -0
SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttir +799 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/__grp__triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.cubin +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.llir +934 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ptx +921 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.source +449 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttgir +226 -0
SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttir +233 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/__grp__triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.cubin +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.llir +318 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ptx +736 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.source +418 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ttgir +280 -0
SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ttir +283 -0
SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/__grp__triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json +1 -0
SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin +0 -0
SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json +1 -0

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/__grp__triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.source", "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttir", "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttgir", "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.llir", "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ptx", "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.cubin", "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.json"}}
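The group file above maps each artifact name to the absolute path of one compilation output (.source, .ttir, .ttgir, .llir, .ptx, .cubin, .json). As a minimal sketch of how such a file could be inspected, the Python below indexes the entries by suffix and checks that they all sit in one directory. The JSON literal is a hypothetical, shortened stand-in (the `kernel_0` names and `/cache/triton/3/HASHDIR` path are assumptions, not the real entries), and `artifacts_by_suffix` is an illustrative helper, not part of Triton.

```python
import json
from pathlib import PurePosixPath

# Hypothetical, shortened stand-in for a __grp__*.json file.
# The real entries use much longer kernel names and absolute paths.
GROUP_JSON = """{"child_paths": {
  "kernel_0.ttir": "/cache/triton/3/HASHDIR/kernel_0.ttir",
  "kernel_0.ptx": "/cache/triton/3/HASHDIR/kernel_0.ptx",
  "kernel_0.cubin": "/cache/triton/3/HASHDIR/kernel_0.cubin"}}"""


def artifacts_by_suffix(text):
    """Map each artifact suffix (e.g. '.ptx') to the path recorded for it."""
    child_paths = json.loads(text)["child_paths"]
    return {
        PurePosixPath(name).suffix: PurePosixPath(path)
        for name, path in child_paths.items()
    }


artifacts = artifacts_by_suffix(GROUP_JSON)
# All artifacts of one kernel are expected to share a single hash directory.
parent_dirs = {path.parent for path in artifacts.values()}
print(sorted(artifacts))
print(len(parent_dirs) == 1)
```

Because the recorded paths are absolute, a check like this may be worth running before reusing the cache from a different checkout location.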

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.cubin ADDED

Binary file (76 kB).

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.json ADDED

	@@ -0,0 +1 @@

+ {"hash": "d4e9ec88be03aebb42041f53f2cbb9bb30afcd0e87a7a62c3ff7f5fe862eba44", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 4, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 0, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0"}
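The metadata record above carries the launch configuration for this kernel. The following Python is a rough sketch of pulling out a few of those fields and deriving the threads per CTA as `num_warps * warp_size`. The JSON literal is abbreviated from the record above (most fields are dropped), and `summarize_kernel_metadata` is a hypothetical helper written for illustration only.

```python
import json

# Abbreviated copy of the metadata record; only a handful of fields are kept.
METADATA = """{"target": {"backend": "cuda", "arch": 90, "warp_size": 32},
 "num_warps": 4, "num_ctas": 1, "num_stages": 1, "shared": 0,
 "triton_version": "3.5.1",
 "name": "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0"}"""


def summarize_kernel_metadata(text):
    """Return a small summary of the launch configuration in the record."""
    meta = json.loads(text)
    target = meta["target"]
    return {
        "name": meta["name"],
        "backend": target["backend"],
        "arch": target["arch"],
        # Threads per CTA follow from the warp count and the warp size.
        "threads_per_cta": meta["num_warps"] * target["warp_size"],
        "shared": meta["shared"],
        "triton_version": meta["triton_version"],
    }


summary = summarize_kernel_metadata(METADATA)
print(summary)
```

With 4 warps of 32 threads, the derived value is 128 threads per CTA, which lines up with the LLVM IR for this kernel: each thread's index is shifted left by 3 (8 elements per thread) and each CTA offset is shifted left by 10 (1024 elements per CTA).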

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.llir ADDED

	@@ -0,0 +1,667 @@

+; ModuleID = 'LLVMDialectModule'
+source_filename = "LLVMDialectModule"
+target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
+@assertFunc_1 = internal constant [8 x i8] c"unknown\00"
+@assertFile_1 = internal constant [114 x i8] c"/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py\00"
+@assertMessage_1 = internal constant [38 x i8] c"index out of bounds: 0 <= tmp25 < ks4\00"
+@assertFunc_0 = internal constant [8 x i8] c"unknown\00"
+@assertFile_0 = internal constant [114 x i8] c"/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py\00"
+@assertMessage_0 = internal constant [37 x i8] c"index out of bounds: 0 <= tmp5 < ks2\00"
+; Function Attrs: noreturn
+declare !dbg !5 void @__assertfail(ptr, ptr, i32, ptr, i64) local_unnamed_addr #0
+define ptx_kernel void @triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, ptr addrspace(1) %3, ptr addrspace(1) %4, i64 %5, i64 %6, i64 %7, i64 %8, i64 %9, i32 %10, ptr addrspace(1) readnone captures(none) %11, ptr addrspace(1) readnone captures(none) %12) local_unnamed_addr #1 !dbg !9 {
+  %14 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !10
+  %15 = shl i32 %14, 10, !dbg !11
+  %16 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !12
+  %17 = shl nuw nsw i32 %16, 3, !dbg !12
+  %18 = and i32 %17, 1016, !dbg !12
+  %19 = or disjoint i32 %18, %15, !dbg !13
+  %20 = or disjoint i32 %19, 1, !dbg !13
+  %21 = or disjoint i32 %19, 2, !dbg !13
+  %22 = or disjoint i32 %19, 3, !dbg !13
+  %23 = or disjoint i32 %19, 4, !dbg !13
+  %24 = or disjoint i32 %19, 5, !dbg !13
+  %25 = or disjoint i32 %19, 6, !dbg !13
+  %26 = or disjoint i32 %19, 7, !dbg !13
+  %27 = insertelement <8 x i32> poison, i32 %26, i64 0, !dbg !14
+  %28 = insertelement <8 x i32> %27, i32 %25, i64 1, !dbg !14
+  %29 = insertelement <8 x i32> %28, i32 %24, i64 2, !dbg !14
+  %30 = insertelement <8 x i32> %29, i32 %23, i64 3, !dbg !14
+  %31 = insertelement <8 x i32> %30, i32 %22, i64 4, !dbg !14
+  %32 = insertelement <8 x i32> %31, i32 %21, i64 5, !dbg !14
+  %33 = insertelement <8 x i32> %32, i32 %20, i64 6, !dbg !14
+  %34 = insertelement <8 x i32> %33, i32 %19, i64 7, !dbg !14
+  %35 = sext <8 x i32> %34 to <8 x i64>, !dbg !14
+  %36 = extractelement <8 x i64> %35, i64 7, !dbg !15
+  %37 = sdiv i64 %36, %5, !dbg !14
+  %38 = extractelement <8 x i64> %35, i64 6, !dbg !15
+  %39 = sdiv i64 %38, %5, !dbg !14
+  %40 = extractelement <8 x i64> %35, i64 5, !dbg !15
+  %41 = sdiv i64 %40, %5, !dbg !14
+  %42 = extractelement <8 x i64> %35, i64 4, !dbg !15
+  %43 = sdiv i64 %42, %5, !dbg !14
+  %44 = extractelement <8 x i64> %35, i64 3, !dbg !15
+  %45 = sdiv i64 %44, %5, !dbg !14
+  %46 = extractelement <8 x i64> %35, i64 2, !dbg !15
+  %47 = sdiv i64 %46, %5, !dbg !14
+  %48 = extractelement <8 x i64> %35, i64 1, !dbg !15
+  %49 = sdiv i64 %48, %5, !dbg !14
+  %50 = extractelement <8 x i64> %35, i64 0, !dbg !15
+  %51 = sdiv i64 %50, %5, !dbg !14
+  %52 = srem i64 %37, %6, !dbg !16
+  %53 = srem i64 %39, %6, !dbg !16
+  %54 = srem i64 %41, %6, !dbg !16
+  %55 = srem i64 %43, %6, !dbg !16
+  %56 = srem i64 %45, %6, !dbg !16
+  %57 = srem i64 %47, %6, !dbg !16
+  %58 = srem i64 %49, %6, !dbg !16
+  %59 = srem i64 %51, %6, !dbg !16
+  %60 = getelementptr bfloat, ptr addrspace(1) %0, i64 %36, !dbg !15
+  %61 = getelementptr bfloat, ptr addrspace(1) %0, i64 %38, !dbg !15
+  %62 = getelementptr bfloat, ptr addrspace(1) %0, i64 %40, !dbg !15
+  %63 = getelementptr bfloat, ptr addrspace(1) %0, i64 %42, !dbg !15
+  %64 = getelementptr bfloat, ptr addrspace(1) %0, i64 %44, !dbg !15
+  %65 = getelementptr bfloat, ptr addrspace(1) %0, i64 %46, !dbg !15
+  %66 = getelementptr bfloat, ptr addrspace(1) %0, i64 %48, !dbg !15
+  %67 = getelementptr bfloat, ptr addrspace(1) %0, i64 %50, !dbg !15
+  %68 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %69 = getelementptr i64, ptr addrspace(1) %1, i64 %52, !dbg !18
+  %70 = getelementptr i64, ptr addrspace(1) %1, i64 %53, !dbg !18
+  %71 = getelementptr i64, ptr addrspace(1) %1, i64 %54, !dbg !18
+  %72 = getelementptr i64, ptr addrspace(1) %1, i64 %55, !dbg !18
+  %73 = getelementptr i64, ptr addrspace(1) %1, i64 %56, !dbg !18
+  %74 = getelementptr i64, ptr addrspace(1) %1, i64 %57, !dbg !18
+  %75 = getelementptr i64, ptr addrspace(1) %1, i64 %58, !dbg !18
+  %76 = getelementptr i64, ptr addrspace(1) %1, i64 %59, !dbg !18
+  %77 = insertelement <8 x i32> poison, i32 %19, i64 0, !dbg !19
+  %78 = insertelement <8 x i32> %77, i32 %20, i64 1, !dbg !19
+  %79 = insertelement <8 x i32> %78, i32 %21, i64 2, !dbg !19
+  %80 = insertelement <8 x i32> %79, i32 %22, i64 3, !dbg !19
+  %81 = insertelement <8 x i32> %80, i32 %23, i64 4, !dbg !19
+  %82 = insertelement <8 x i32> %81, i32 %24, i64 5, !dbg !19
+  %83 = insertelement <8 x i32> %82, i32 %25, i64 6, !dbg !19
+  %84 = insertelement <8 x i32> %83, i32 %26, i64 7, !dbg !19
+  %85 = insertelement <8 x i32> poison, i32 %10, i64 0, !dbg !19
+  %86 = shufflevector <8 x i32> %85, <8 x i32> poison, <8 x i32> zeroinitializer, !dbg !19
+  %87 = icmp slt <8 x i32> %84, %86, !dbg !19
+  %88 = extractelement <8 x i1> %87, i64 0, !dbg !17
+  %89 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %60, i64 %68, i1 %88) #4, !dbg !17
+  %90 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %91 = extractelement <8 x i1> %87, i64 1, !dbg !17
+  %92 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %61, i64 %90, i1 %91) #4, !dbg !17
+  %93 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %94 = extractelement <8 x i1> %87, i64 2, !dbg !17
+  %95 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %62, i64 %93, i1 %94) #4, !dbg !17
+  %96 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %97 = extractelement <8 x i1> %87, i64 3, !dbg !17
+  %98 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %63, i64 %96, i1 %97) #4, !dbg !17
+  %99 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %100 = extractelement <8 x i1> %87, i64 4, !dbg !17
+  %101 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %64, i64 %99, i1 %100) #4, !dbg !17
+  %102 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %103 = extractelement <8 x i1> %87, i64 5, !dbg !17
+  %104 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %65, i64 %102, i1 %103) #4, !dbg !17
+  %105 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %106 = extractelement <8 x i1> %87, i64 6, !dbg !17
+  %107 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %66, i64 %105, i1 %106) #4, !dbg !17
+  %108 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !17
+  %109 = extractelement <8 x i1> %87, i64 7, !dbg !17
+  %110 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %67, i64 %108, i1 %109) #4, !dbg !17
+  %111 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %112 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %69, i64 %111, i1 %88) #4, !dbg !20
+  %113 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %114 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %70, i64 %113, i1 %91) #4, !dbg !20
+  %115 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %116 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %71, i64 %115, i1 %94) #4, !dbg !20
+  %117 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %118 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %72, i64 %117, i1 %97) #4, !dbg !20
+  %119 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %120 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %73, i64 %119, i1 %100) #4, !dbg !20
+  %121 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %122 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %74, i64 %121, i1 %103) #4, !dbg !20
+  %123 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %124 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %75, i64 %123, i1 %106) #4, !dbg !20
+  %125 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !20
+  %126 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %76, i64 %125, i1 %109) #4, !dbg !20
+  %127 = insertelement <8 x i64> poison, i64 %112, i64 0, !dbg !21
+  %128 = insertelement <8 x i64> %127, i64 %114, i64 1, !dbg !21
+  %129 = insertelement <8 x i64> %128, i64 %116, i64 2, !dbg !21
+  %130 = insertelement <8 x i64> %129, i64 %118, i64 3, !dbg !21
+  %131 = insertelement <8 x i64> %130, i64 %120, i64 4, !dbg !21
+  %132 = insertelement <8 x i64> %131, i64 %122, i64 5, !dbg !21
+  %133 = insertelement <8 x i64> %132, i64 %124, i64 6, !dbg !21
+  %134 = insertelement <8 x i64> %133, i64 %126, i64 7, !dbg !21
+  %135 = icmp slt <8 x i64> %134, zeroinitializer, !dbg !21
+  %136 = insertelement <8 x i64> poison, i64 %7, i64 0, !dbg !22
+  %137 = shufflevector <8 x i64> %136, <8 x i64> poison, <8 x i32> zeroinitializer, !dbg !22
+  %138 = select <8 x i1> %135, <8 x i64> %137, <8 x i64> zeroinitializer, !dbg !22
+  %139 = add <8 x i64> %138, %134, !dbg !22
+  %140 = icmp slt <8 x i64> %139, zeroinitializer, !dbg !23
+  %141 = icmp sge <8 x i64> %139, %137, !dbg !24
+  %142 = or <8 x i1> %140, %141, !dbg !25
+  %143 = and <8 x i1> %87, %142, !dbg !26
+  %144 = bitcast <8 x i1> %143 to i8, !dbg !27
+  %.not = icmp eq i8 %144, 0, !dbg !27
+  br i1 %.not, label %146, label %145, !dbg !27
+145:                                              ; preds = %13
+  tail call void @__assertfail(ptr nonnull @assertMessage_0, ptr nonnull @assertFile_0, i32 32, ptr nonnull @assertFunc_0, i64 1), !dbg !27
+  unreachable, !dbg !27
+146:                                              ; preds = %13
+  %147 = insertelement <8 x i64> poison, i64 %8, i64 0, !dbg !28
+  %148 = shufflevector <8 x i64> %147, <8 x i64> poison, <8 x i32> zeroinitializer, !dbg !28
+  %149 = srem <8 x i64> %35, %148, !dbg !28
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !27
+  %150 = extractelement <8 x i64> %139, i64 0, !dbg !29
+  %151 = mul i64 %150, %8, !dbg !29
+  %152 = extractelement <8 x i64> %139, i64 1, !dbg !29
+  %153 = mul i64 %152, %8, !dbg !29
+  %154 = extractelement <8 x i64> %139, i64 2, !dbg !29
+  %155 = mul i64 %154, %8, !dbg !29
+  %156 = extractelement <8 x i64> %139, i64 3, !dbg !29
+  %157 = mul i64 %156, %8, !dbg !29
+  %158 = extractelement <8 x i64> %139, i64 4, !dbg !29
+  %159 = mul i64 %158, %8, !dbg !29
+  %160 = extractelement <8 x i64> %139, i64 5, !dbg !29
+  %161 = mul i64 %160, %8, !dbg !29
+  %162 = extractelement <8 x i64> %139, i64 6, !dbg !29
+  %163 = mul i64 %162, %8, !dbg !29
+  %164 = extractelement <8 x i64> %139, i64 7, !dbg !29
+  %165 = mul i64 %164, %8, !dbg !29
+  %166 = extractelement <8 x i64> %149, i64 7, !dbg !30
+  %167 = getelementptr bfloat, ptr addrspace(1) %2, i64 %166, !dbg !31
+  %168 = getelementptr bfloat, ptr addrspace(1) %167, i64 %151, !dbg !31
+  %169 = extractelement <8 x i64> %149, i64 6, !dbg !30
+  %170 = getelementptr bfloat, ptr addrspace(1) %2, i64 %169, !dbg !31
+  %171 = getelementptr bfloat, ptr addrspace(1) %170, i64 %153, !dbg !31
+  %172 = extractelement <8 x i64> %149, i64 5, !dbg !30
+  %173 = getelementptr bfloat, ptr addrspace(1) %2, i64 %172, !dbg !31
+  %174 = getelementptr bfloat, ptr addrspace(1) %173, i64 %155, !dbg !31
+  %175 = extractelement <8 x i64> %149, i64 4, !dbg !30
+  %176 = getelementptr bfloat, ptr addrspace(1) %2, i64 %175, !dbg !31
+  %177 = getelementptr bfloat, ptr addrspace(1) %176, i64 %157, !dbg !31
+  %178 = extractelement <8 x i64> %149, i64 3, !dbg !30
+  %179 = getelementptr bfloat, ptr addrspace(1) %2, i64 %178, !dbg !31
+  %180 = getelementptr bfloat, ptr addrspace(1) %179, i64 %159, !dbg !31
+  %181 = extractelement <8 x i64> %149, i64 2, !dbg !30
+  %182 = getelementptr bfloat, ptr addrspace(1) %2, i64 %181, !dbg !31
+  %183 = getelementptr bfloat, ptr addrspace(1) %182, i64 %161, !dbg !31
+  %184 = extractelement <8 x i64> %149, i64 1, !dbg !30
+  %185 = getelementptr bfloat, ptr addrspace(1) %2, i64 %184, !dbg !31
+  %186 = getelementptr bfloat, ptr addrspace(1) %185, i64 %163, !dbg !31
+  %187 = extractelement <8 x i64> %149, i64 0, !dbg !30
+  %188 = getelementptr bfloat, ptr addrspace(1) %2, i64 %187, !dbg !31
+  %189 = getelementptr bfloat, ptr addrspace(1) %188, i64 %165, !dbg !31
+  %190 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %191 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %168, i64 %190, i1 %88) #4, !dbg !32
+  %192 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %193 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %171, i64 %192, i1 %91) #4, !dbg !32
+  %194 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %195 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %174, i64 %194, i1 %94) #4, !dbg !32
+  %196 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %197 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %177, i64 %196, i1 %97) #4, !dbg !32
+  %198 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %199 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %180, i64 %198, i1 %100) #4, !dbg !32
+  %200 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %201 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %183, i64 %200, i1 %103) #4, !dbg !32
+  %202 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %203 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %186, i64 %202, i1 %106) #4, !dbg !32
+  %204 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !32
+  %205 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %189, i64 %204, i1 %109) #4, !dbg !32
+  %206 = sdiv i64 %8, 2, !dbg !33
+  %207 = sub i64 %8, %206, !dbg !34
+  %208 = icmp slt i64 %166, %207, !dbg !35
+  %209 = icmp slt i64 %169, %207, !dbg !35
+  %210 = icmp slt i64 %172, %207, !dbg !35
+  %211 = icmp slt i64 %175, %207, !dbg !35
+  %212 = icmp slt i64 %178, %207, !dbg !35
+  %213 = icmp slt i64 %181, %207, !dbg !35
+  %214 = icmp slt i64 %184, %207, !dbg !35
+  %215 = icmp slt i64 %187, %207, !dbg !35
+  %216 = sub nsw i64 %36, %166, !dbg !30
+  %217 = sub nsw i64 %38, %169, !dbg !30
+  %218 = sub nsw i64 %40, %172, !dbg !30
+  %219 = sub nsw i64 %42, %175, !dbg !30
+  %220 = sub nsw i64 %44, %178, !dbg !30
+  %221 = sub nsw i64 %46, %181, !dbg !30
+  %222 = sub nsw i64 %48, %184, !dbg !30
+  %223 = sub nsw i64 %50, %187, !dbg !30
+  %224 = getelementptr bfloat, ptr addrspace(1) %0, i64 %216, !dbg !36
+  %225 = getelementptr bfloat, ptr addrspace(1) %224, i64 %206, !dbg !36
+  %226 = getelementptr bfloat, ptr addrspace(1) %225, i64 %166, !dbg !36
+  %227 = getelementptr bfloat, ptr addrspace(1) %0, i64 %217, !dbg !36
+  %228 = getelementptr bfloat, ptr addrspace(1) %227, i64 %206, !dbg !36
+  %229 = getelementptr bfloat, ptr addrspace(1) %228, i64 %169, !dbg !36
+  %230 = getelementptr bfloat, ptr addrspace(1) %0, i64 %218, !dbg !36
+  %231 = getelementptr bfloat, ptr addrspace(1) %230, i64 %206, !dbg !36
+  %232 = getelementptr bfloat, ptr addrspace(1) %231, i64 %172, !dbg !36
+  %233 = getelementptr bfloat, ptr addrspace(1) %0, i64 %219, !dbg !36
+  %234 = getelementptr bfloat, ptr addrspace(1) %233, i64 %206, !dbg !36
+  %235 = getelementptr bfloat, ptr addrspace(1) %234, i64 %175, !dbg !36
+  %236 = getelementptr bfloat, ptr addrspace(1) %0, i64 %220, !dbg !36
+  %237 = getelementptr bfloat, ptr addrspace(1) %236, i64 %206, !dbg !36
+  %238 = getelementptr bfloat, ptr addrspace(1) %237, i64 %178, !dbg !36
+  %239 = getelementptr bfloat, ptr addrspace(1) %0, i64 %221, !dbg !36
+  %240 = getelementptr bfloat, ptr addrspace(1) %239, i64 %206, !dbg !36
+  %241 = getelementptr bfloat, ptr addrspace(1) %240, i64 %181, !dbg !36
+  %242 = getelementptr bfloat, ptr addrspace(1) %0, i64 %222, !dbg !36
+  %243 = getelementptr bfloat, ptr addrspace(1) %242, i64 %206, !dbg !36
+  %244 = getelementptr bfloat, ptr addrspace(1) %243, i64 %184, !dbg !36
+  %245 = getelementptr bfloat, ptr addrspace(1) %0, i64 %223, !dbg !36
+  %246 = getelementptr bfloat, ptr addrspace(1) %245, i64 %206, !dbg !36
+  %247 = getelementptr bfloat, ptr addrspace(1) %246, i64 %187, !dbg !36
+  %248 = and i1 %88, %208, !dbg !37
+  %249 = and i1 %91, %209, !dbg !37
+  %250 = and i1 %94, %210, !dbg !37
+  %251 = and i1 %97, %211, !dbg !37
+  %252 = and i1 %100, %212, !dbg !37
+  %253 = and i1 %103, %213, !dbg !37
+  %254 = and i1 %106, %214, !dbg !37
+  %255 = and i1 %109, %215, !dbg !37
+  %256 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %257 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %226, i64 %256, i1 %248) #4, !dbg !38
+  %258 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %259 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %229, i64 %258, i1 %249) #4, !dbg !38
+  %260 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %261 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %232, i64 %260, i1 %250) #4, !dbg !38
+  %262 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %263 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %235, i64 %262, i1 %251) #4, !dbg !38
+  %264 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %265 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %238, i64 %264, i1 %252) #4, !dbg !38
+  %266 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %267 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %241, i64 %266, i1 %253) #4, !dbg !38
+  %268 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %269 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %244, i64 %268, i1 %254) #4, !dbg !38
+  %270 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !38
+  %271 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %247, i64 %270, i1 %255) #4, !dbg !38
+  %272 = insertelement <8 x i64> poison, i64 %207, i64 0, !dbg !39
+  %273 = shufflevector <8 x i64> %272, <8 x i64> poison, <8 x i32> zeroinitializer, !dbg !39
+  %274 = icmp sge <8 x i64> %149, %273, !dbg !39
+  %275 = sub i64 %166, %8, !dbg !40
+  %276 = sub i64 %169, %8, !dbg !40
+  %277 = sub i64 %172, %8, !dbg !40
+  %278 = sub i64 %175, %8, !dbg !40
+  %279 = sub i64 %178, %8, !dbg !40
+  %280 = sub i64 %181, %8, !dbg !40
+  %281 = sub i64 %184, %8, !dbg !40
+  %282 = sub i64 %187, %8, !dbg !40
+  %283 = getelementptr bfloat, ptr addrspace(1) %224, i64 %275, !dbg !41
+  %284 = getelementptr bfloat, ptr addrspace(1) %283, i64 %206, !dbg !41
+  %285 = getelementptr bfloat, ptr addrspace(1) %227, i64 %276, !dbg !41
+  %286 = getelementptr bfloat, ptr addrspace(1) %285, i64 %206, !dbg !41
+  %287 = getelementptr bfloat, ptr addrspace(1) %230, i64 %277, !dbg !41
+  %288 = getelementptr bfloat, ptr addrspace(1) %287, i64 %206, !dbg !41
+  %289 = getelementptr bfloat, ptr addrspace(1) %233, i64 %278, !dbg !41
+  %290 = getelementptr bfloat, ptr addrspace(1) %289, i64 %206, !dbg !41
+  %291 = getelementptr bfloat, ptr addrspace(1) %236, i64 %279, !dbg !41
+  %292 = getelementptr bfloat, ptr addrspace(1) %291, i64 %206, !dbg !41
+  %293 = getelementptr bfloat, ptr addrspace(1) %239, i64 %280, !dbg !41
+  %294 = getelementptr bfloat, ptr addrspace(1) %293, i64 %206, !dbg !41
+  %295 = getelementptr bfloat, ptr addrspace(1) %242, i64 %281, !dbg !41
+  %296 = getelementptr bfloat, ptr addrspace(1) %295, i64 %206, !dbg !41
+  %297 = getelementptr bfloat, ptr addrspace(1) %245, i64 %282, !dbg !41
+  %298 = getelementptr bfloat, ptr addrspace(1) %297, i64 %206, !dbg !41
+  %299 = extractelement <8 x i1> %274, i64 7, !dbg !42
+  %300 = and i1 %88, %299, !dbg !42
+  %301 = extractelement <8 x i1> %274, i64 6, !dbg !42
+  %302 = and i1 %91, %301, !dbg !42
+  %303 = extractelement <8 x i1> %274, i64 5, !dbg !42
+  %304 = and i1 %94, %303, !dbg !42
+  %305 = extractelement <8 x i1> %274, i64 4, !dbg !42
+  %306 = and i1 %97, %305, !dbg !42
+  %307 = extractelement <8 x i1> %274, i64 3, !dbg !42
+  %308 = and i1 %100, %307, !dbg !42
+  %309 = extractelement <8 x i1> %274, i64 2, !dbg !42
+  %310 = and i1 %103, %309, !dbg !42
+  %311 = extractelement <8 x i1> %274, i64 1, !dbg !42
+  %312 = and i1 %106, %311, !dbg !42
+  %313 = extractelement <8 x i1> %274, i64 0, !dbg !42
+  %314 = and i1 %109, %313, !dbg !42
+  %315 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %316 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %284, i64 %315, i1 %300) #4, !dbg !43
+  %317 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %318 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %286, i64 %317, i1 %302) #4, !dbg !43
+  %319 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %320 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %288, i64 %319, i1 %304) #4, !dbg !43
+  %321 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %322 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %290, i64 %321, i1 %306) #4, !dbg !43
+  %323 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %324 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %292, i64 %323, i1 %308) #4, !dbg !43
+  %325 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %326 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %294, i64 %325, i1 %310) #4, !dbg !43
+  %327 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %328 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %296, i64 %327, i1 %312) #4, !dbg !43
+  %329 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !43
+  %330 = tail call i16 asm sideeffect "mov.u16 $0, $1;\0A\09@$4 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $2 + 0 ], $3;", "=c,c,l,l,b"(i16 0, ptr addrspace(1) %298, i64 %329, i1 %314) #4, !dbg !43
+  %331 = insertelement <8 x i64> poison, i64 %9, i64 0, !dbg !44
+  %332 = shufflevector <8 x i64> %331, <8 x i64> poison, <8 x i32> zeroinitializer, !dbg !44
+  %333 = select <8 x i1> %135, <8 x i64> %332, <8 x i64> zeroinitializer, !dbg !44
+  %334 = add <8 x i64> %333, %134, !dbg !44
+  %335 = icmp slt <8 x i64> %334, zeroinitializer, !dbg !45
+  %336 = icmp sge <8 x i64> %334, %332, !dbg !46
+  %337 = or <8 x i1> %335, %336, !dbg !47
+  %338 = and <8 x i1> %87, %337, !dbg !48
+  %339 = bitcast <8 x i1> %338 to i8, !dbg !49
+  %.not87 = icmp eq i8 %339, 0, !dbg !49
+  br i1 %.not87, label %341, label %340, !dbg !49
+340:                                              ; preds = %146
+  tail call void @__assertfail(ptr nonnull @assertMessage_1, ptr nonnull @assertFile_1, i32 52, ptr nonnull @assertFunc_1, i64 1), !dbg !49
+  unreachable, !dbg !49
+341:                                              ; preds = %146
+  %342 = bitcast i16 %271 to bfloat, !dbg !38
+  %343 = fpext bfloat %342 to float, !dbg !50
+  %344 = fsub float 0.000000e+00, %343, !dbg !51
+  %345 = bitcast i16 %330 to bfloat, !dbg !43
+  %346 = fpext bfloat %345 to float, !dbg !52
+  %347 = select i1 %215, float %344, float %346, !dbg !53
+  %348 = bitcast i16 %269 to bfloat, !dbg !38
+  %349 = fpext bfloat %348 to float, !dbg !50
+  %350 = fsub float 0.000000e+00, %349, !dbg !51
+  %351 = bitcast i16 %328 to bfloat, !dbg !43
+  %352 = fpext bfloat %351 to float, !dbg !52
+  %353 = select i1 %214, float %350, float %352, !dbg !53
+  %354 = bitcast i16 %267 to bfloat, !dbg !38
+  %355 = fpext bfloat %354 to float, !dbg !50
+  %356 = fsub float 0.000000e+00, %355, !dbg !51
+  %357 = bitcast i16 %326 to bfloat, !dbg !43
+  %358 = fpext bfloat %357 to float, !dbg !52
+  %359 = select i1 %213, float %356, float %358, !dbg !53
+  %360 = bitcast i16 %265 to bfloat, !dbg !38
+  %361 = fpext bfloat %360 to float, !dbg !50
+  %362 = fsub float 0.000000e+00, %361, !dbg !51
+  %363 = bitcast i16 %324 to bfloat, !dbg !43
+  %364 = fpext bfloat %363 to float, !dbg !52
+  %365 = select i1 %212, float %362, float %364, !dbg !53
+  %366 = bitcast i16 %263 to bfloat, !dbg !38
+  %367 = fpext bfloat %366 to float, !dbg !50
+  %368 = fsub float 0.000000e+00, %367, !dbg !51
+  %369 = bitcast i16 %322 to bfloat, !dbg !43
+  %370 = fpext bfloat %369 to float, !dbg !52
+  %371 = select i1 %211, float %368, float %370, !dbg !53
+  %372 = bitcast i16 %261 to bfloat, !dbg !38
+  %373 = fpext bfloat %372 to float, !dbg !50
+  %374 = fsub float 0.000000e+00, %373, !dbg !51
+  %375 = bitcast i16 %320 to bfloat, !dbg !43
+  %376 = fpext bfloat %375 to float, !dbg !52
+  %377 = select i1 %210, float %374, float %376, !dbg !53
+  %378 = bitcast i16 %259 to bfloat, !dbg !38
+  %379 = fpext bfloat %378 to float, !dbg !50
+  %380 = fsub float 0.000000e+00, %379, !dbg !51
+  %381 = bitcast i16 %318 to bfloat, !dbg !43
+  %382 = fpext bfloat %381 to float, !dbg !52
+  %383 = select i1 %209, float %380, float %382, !dbg !53
+  %384 = bitcast i16 %257 to bfloat, !dbg !38
+  %385 = fpext bfloat %384 to float, !dbg !50
+  %386 = fsub float 0.000000e+00, %385, !dbg !51
+  %387 = bitcast i16 %316 to bfloat, !dbg !43
+  %388 = fpext bfloat %387 to float, !dbg !52
+  %389 = select i1 %208, float %386, float %388, !dbg !53
+  %390 = bitcast i16 %110 to bfloat, !dbg !17
+  %391 = fpext bfloat %390 to float, !dbg !54
+  %392 = bitcast i16 %107 to bfloat, !dbg !17
+  %393 = fpext bfloat %392 to float, !dbg !54
+  %394 = bitcast i16 %104 to bfloat, !dbg !17
+  %395 = fpext bfloat %394 to float, !dbg !54
+  %396 = bitcast i16 %101 to bfloat, !dbg !17
+  %397 = fpext bfloat %396 to float, !dbg !54
+  %398 = bitcast i16 %98 to bfloat, !dbg !17
+  %399 = fpext bfloat %398 to float, !dbg !54
+  %400 = bitcast i16 %95 to bfloat, !dbg !17
+  %401 = fpext bfloat %400 to float, !dbg !54
+  %402 = bitcast i16 %92 to bfloat, !dbg !17
+  %403 = fpext bfloat %402 to float, !dbg !54
+  %404 = bitcast i16 %89 to bfloat, !dbg !17
+  %405 = fpext bfloat %404 to float, !dbg !54
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !49
+  %406 = extractelement <8 x i64> %334, i64 0, !dbg !55
+  %407 = mul i64 %406, %8, !dbg !55
+  %408 = extractelement <8 x i64> %334, i64 1, !dbg !55
+  %409 = mul i64 %408, %8, !dbg !55
+  %410 = extractelement <8 x i64> %334, i64 2, !dbg !55
+  %411 = mul i64 %410, %8, !dbg !55
+  %412 = extractelement <8 x i64> %334, i64 3, !dbg !55
+  %413 = mul i64 %412, %8, !dbg !55
+  %414 = extractelement <8 x i64> %334, i64 4, !dbg !55
+  %415 = mul i64 %414, %8, !dbg !55
+  %416 = extractelement <8 x i64> %334, i64 5, !dbg !55
+  %417 = mul i64 %416, %8, !dbg !55
+  %418 = extractelement <8 x i64> %334, i64 6, !dbg !55
+  %419 = mul i64 %418, %8, !dbg !55
+  %420 = extractelement <8 x i64> %334, i64 7, !dbg !55
+  %421 = mul i64 %420, %8, !dbg !55
+  %422 = getelementptr bfloat, ptr addrspace(1) %3, i64 %166, !dbg !56
+  %423 = getelementptr bfloat, ptr addrspace(1) %422, i64 %407, !dbg !56
+  %424 = getelementptr bfloat, ptr addrspace(1) %3, i64 %169, !dbg !56
+  %425 = getelementptr bfloat, ptr addrspace(1) %424, i64 %409, !dbg !56
+  %426 = getelementptr bfloat, ptr addrspace(1) %3, i64 %172, !dbg !56
+  %427 = getelementptr bfloat, ptr addrspace(1) %426, i64 %411, !dbg !56
+  %428 = getelementptr bfloat, ptr addrspace(1) %3, i64 %175, !dbg !56
+  %429 = getelementptr bfloat, ptr addrspace(1) %428, i64 %413, !dbg !56
+  %430 = getelementptr bfloat, ptr addrspace(1) %3, i64 %178, !dbg !56
+  %431 = getelementptr bfloat, ptr addrspace(1) %430, i64 %415, !dbg !56
+  %432 = getelementptr bfloat, ptr addrspace(1) %3, i64 %181, !dbg !56
+  %433 = getelementptr bfloat, ptr addrspace(1) %432, i64 %417, !dbg !56
+  %434 = getelementptr bfloat, ptr addrspace(1) %3, i64 %184, !dbg !56
+  %435 = getelementptr bfloat, ptr addrspace(1) %434, i64 %419, !dbg !56
+  %436 = getelementptr bfloat, ptr addrspace(1) %3, i64 %187, !dbg !56
+  %437 = getelementptr bfloat, ptr addrspace(1) %436, i64 %421, !dbg !56
+  %438 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %439 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %423, i64 %438, i1 %88) #4, !dbg !57
+  %440 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %441 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %425, i64 %440, i1 %91) #4, !dbg !57
+  %442 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %443 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %427, i64 %442, i1 %94) #4, !dbg !57
+  %444 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %445 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %429, i64 %444, i1 %97) #4, !dbg !57
+  %446 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %447 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %431, i64 %446, i1 %100) #4, !dbg !57
+  %448 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %449 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %433, i64 %448, i1 %103) #4, !dbg !57
+  %450 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %451 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %435, i64 %450, i1 %106) #4, !dbg !57
+  %452 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #4, !dbg !57
+  %453 = tail call i16 asm sideeffect "mov.u16 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b16 { $0 }, [ $1 + 0 ], $2;", "=c,l,l,b"(ptr addrspace(1) %437, i64 %452, i1 %109) #4, !dbg !57
+  %454 = insertelement <2 x i16> poison, i16 %191, i64 0, !dbg !32
+  %455 = insertelement <2 x i16> %454, i16 %439, i64 1, !dbg !32
+  %456 = bitcast <2 x i16> %455 to <2 x bfloat>, !dbg !32
+  %457 = fpext <2 x bfloat> %456 to <2 x float>, !dbg !58
+  %458 = insertelement <2 x float> poison, float %405, i64 0, !dbg !59
+  %459 = insertelement <2 x float> %458, float %389, i64 1, !dbg !59
+  %460 = fmul <2 x float> %459, %457, !dbg !59
+  %461 = insertelement <2 x i16> poison, i16 %193, i64 0, !dbg !32
+  %462 = insertelement <2 x i16> %461, i16 %441, i64 1, !dbg !32
+  %463 = bitcast <2 x i16> %462 to <2 x bfloat>, !dbg !32
+  %464 = fpext <2 x bfloat> %463 to <2 x float>, !dbg !58
+  %465 = insertelement <2 x float> poison, float %403, i64 0, !dbg !59
+  %466 = insertelement <2 x float> %465, float %383, i64 1, !dbg !59
+  %467 = fmul <2 x float> %466, %464, !dbg !59
+  %468 = insertelement <2 x i16> poison, i16 %195, i64 0, !dbg !32
+  %469 = insertelement <2 x i16> %468, i16 %443, i64 1, !dbg !32
+  %470 = bitcast <2 x i16> %469 to <2 x bfloat>, !dbg !32
+  %471 = fpext <2 x bfloat> %470 to <2 x float>, !dbg !58
+  %472 = insertelement <2 x float> poison, float %401, i64 0, !dbg !59
+  %473 = insertelement <2 x float> %472, float %377, i64 1, !dbg !59
+  %474 = fmul <2 x float> %473, %471, !dbg !59
+  %475 = insertelement <2 x i16> poison, i16 %197, i64 0, !dbg !32
+  %476 = insertelement <2 x i16> %475, i16 %445, i64 1, !dbg !32
+  %477 = bitcast <2 x i16> %476 to <2 x bfloat>, !dbg !32
+  %478 = fpext <2 x bfloat> %477 to <2 x float>, !dbg !58
+  %479 = insertelement <2 x float> poison, float %399, i64 0, !dbg !59
+  %480 = insertelement <2 x float> %479, float %371, i64 1, !dbg !59
+  %481 = fmul <2 x float> %480, %478, !dbg !59
+  %482 = insertelement <2 x i16> poison, i16 %199, i64 0, !dbg !32
+  %483 = insertelement <2 x i16> %482, i16 %447, i64 1, !dbg !32
+  %484 = bitcast <2 x i16> %483 to <2 x bfloat>, !dbg !32
+  %485 = fpext <2 x bfloat> %484 to <2 x float>, !dbg !58
+  %486 = insertelement <2 x float> poison, float %397, i64 0, !dbg !59
+  %487 = insertelement <2 x float> %486, float %365, i64 1, !dbg !59
+  %488 = fmul <2 x float> %487, %485, !dbg !59
+  %489 = insertelement <2 x i16> poison, i16 %201, i64 0, !dbg !32
+  %490 = insertelement <2 x i16> %489, i16 %449, i64 1, !dbg !32
+  %491 = bitcast <2 x i16> %490 to <2 x bfloat>, !dbg !32
+  %492 = fpext <2 x bfloat> %491 to <2 x float>, !dbg !58
+  %493 = insertelement <2 x float> poison, float %395, i64 0, !dbg !59
+  %494 = insertelement <2 x float> %493, float %359, i64 1, !dbg !59
+  %495 = fmul <2 x float> %494, %492, !dbg !59
+  %496 = insertelement <2 x i16> poison, i16 %203, i64 0, !dbg !32
+  %497 = insertelement <2 x i16> %496, i16 %451, i64 1, !dbg !32
+  %498 = bitcast <2 x i16> %497 to <2 x bfloat>, !dbg !32
+  %499 = fpext <2 x bfloat> %498 to <2 x float>, !dbg !58
+  %500 = insertelement <2 x float> poison, float %393, i64 0, !dbg !59
+  %501 = insertelement <2 x float> %500, float %353, i64 1, !dbg !59
+  %502 = fmul <2 x float> %501, %499, !dbg !59
+  %503 = insertelement <2 x i16> poison, i16 %205, i64 0, !dbg !32
+  %504 = insertelement <2 x i16> %503, i16 %453, i64 1, !dbg !32
+  %505 = bitcast <2 x i16> %504 to <2 x bfloat>, !dbg !32
+  %506 = fpext <2 x bfloat> %505 to <2 x float>, !dbg !58
+  %507 = insertelement <2 x float> poison, float %391, i64 0, !dbg !59
+  %508 = insertelement <2 x float> %507, float %347, i64 1, !dbg !59
+  %509 = fmul <2 x float> %508, %506, !dbg !59
+  %shift = shufflevector <2 x float> %460, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop = fadd <2 x float> %460, %shift, !dbg !60
+  %510 = extractelement <2 x float> %foldExtExtBinop, i64 0, !dbg !60
+  %shift66 = shufflevector <2 x float> %467, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop67 = fadd <2 x float> %467, %shift66, !dbg !60
+  %511 = extractelement <2 x float> %foldExtExtBinop67, i64 0, !dbg !60
+  %shift69 = shufflevector <2 x float> %474, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop70 = fadd <2 x float> %474, %shift69, !dbg !60
+  %512 = extractelement <2 x float> %foldExtExtBinop70, i64 0, !dbg !60
+  %shift72 = shufflevector <2 x float> %481, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop73 = fadd <2 x float> %481, %shift72, !dbg !60
+  %513 = extractelement <2 x float> %foldExtExtBinop73, i64 0, !dbg !60
+  %shift75 = shufflevector <2 x float> %488, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop76 = fadd <2 x float> %488, %shift75, !dbg !60
+  %514 = extractelement <2 x float> %foldExtExtBinop76, i64 0, !dbg !60
+  %shift78 = shufflevector <2 x float> %495, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop79 = fadd <2 x float> %495, %shift78, !dbg !60
+  %515 = extractelement <2 x float> %foldExtExtBinop79, i64 0, !dbg !60
+  %shift81 = shufflevector <2 x float> %502, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop82 = fadd <2 x float> %502, %shift81, !dbg !60
+  %516 = extractelement <2 x float> %foldExtExtBinop82, i64 0, !dbg !60
+  %shift84 = shufflevector <2 x float> %509, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !60
+  %foldExtExtBinop85 = fadd <2 x float> %509, %shift84, !dbg !60
+  %517 = extractelement <2 x float> %foldExtExtBinop85, i64 0, !dbg !60
+  %518 = getelementptr bfloat, ptr addrspace(1) %4, i64 %36, !dbg !61
+  %519 = getelementptr bfloat, ptr addrspace(1) %4, i64 %38, !dbg !61
+  %520 = getelementptr bfloat, ptr addrspace(1) %4, i64 %40, !dbg !61
+  %521 = getelementptr bfloat, ptr addrspace(1) %4, i64 %42, !dbg !61
+  %522 = getelementptr bfloat, ptr addrspace(1) %4, i64 %44, !dbg !61
+  %523 = getelementptr bfloat, ptr addrspace(1) %4, i64 %46, !dbg !61
+  %524 = getelementptr bfloat, ptr addrspace(1) %4, i64 %48, !dbg !61
+  %525 = getelementptr bfloat, ptr addrspace(1) %4, i64 %50, !dbg !61
+  %526 = fptrunc float %510 to bfloat, !dbg !62
+  %527 = fptrunc float %511 to bfloat, !dbg !62
+  %528 = fptrunc float %512 to bfloat, !dbg !62
+  %529 = fptrunc float %513 to bfloat, !dbg !62
+  %530 = fptrunc float %514 to bfloat, !dbg !62
+  %531 = fptrunc float %515 to bfloat, !dbg !62
+  %532 = fptrunc float %516 to bfloat, !dbg !62
+  %533 = fptrunc float %517 to bfloat, !dbg !62
+  %534 = bitcast bfloat %526 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %534, ptr addrspace(1) %518, i1 %88) #4, !dbg !62
+  %535 = bitcast bfloat %527 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %535, ptr addrspace(1) %519, i1 %91) #4, !dbg !62
+  %536 = bitcast bfloat %528 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %536, ptr addrspace(1) %520, i1 %94) #4, !dbg !62
+  %537 = bitcast bfloat %529 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %537, ptr addrspace(1) %521, i1 %97) #4, !dbg !62
+  %538 = bitcast bfloat %530 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %538, ptr addrspace(1) %522, i1 %100) #4, !dbg !62
+  %539 = bitcast bfloat %531 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %539, ptr addrspace(1) %523, i1 %103) #4, !dbg !62
+  %540 = bitcast bfloat %532 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %540, ptr addrspace(1) %524, i1 %106) #4, !dbg !62
+  %541 = bitcast bfloat %533 to i16, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b16 [ $1 + 0 ], { $0 };", "c,l,b"(i16 %541, ptr addrspace(1) %525, i1 %109) #4, !dbg !62
+  ret void, !dbg !63
+}
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 2147483647) i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #2
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 1024) i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2
+; Function Attrs: convergent nocallback nounwind
+declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32) #3
+attributes #0 = { noreturn }
+attributes #1 = { "nvvm.reqntid"="128" }
+attributes #2 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+attributes #3 = { convergent nocallback nounwind }
+attributes #4 = { nounwind }
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!2, !3}
+!llvm.ident = !{!4}
+!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "triton", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly)
+!1 = !DIFile(filename: "cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py", directory: "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy")
+!2 = !{i32 2, !"Debug Info Version", i32 3}
+!3 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
+!4 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
+!5 = !DISubprogram(name: "__assertfail", linkageName: "__assertfail", scope: !6, file: !6, type: !7, spFlags: DISPFlagOptimized)
+!6 = !DIFile(filename: "<unknown>", directory: "")
+!7 = !DISubroutineType(cc: DW_CC_normal, types: !8)
+!8 = !{}
+!9 = distinct !DISubprogram(name: "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0", linkageName: "triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0", scope: !1, file: !1, line: 18, type: !7, scopeLine: 18, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
+!10 = !DILocation(line: 19, column: 28, scope: !9)
+!11 = !DILocation(line: 19, column: 33, scope: !9)
+!12 = !DILocation(line: 20, column: 36, scope: !9)
+!13 = !DILocation(line: 20, column: 23, scope: !9)
+!14 = !DILocation(line: 23, column: 21, scope: !9)
+!15 = !DILocation(line: 26, column: 30, scope: !9)
+!16 = !DILocation(line: 23, column: 28, scope: !9)
+!17 = !DILocation(line: 26, column: 35, scope: !9)
+!18 = !DILocation(line: 27, column: 30, scope: !9)
+!19 = !DILocation(line: 21, column: 21, scope: !9)
+!20 = !DILocation(line: 27, column: 35, scope: !9)
+!21 = !DILocation(line: 30, column: 18, scope: !9)
+!22 = !DILocation(line: 31, column: 32, scope: !9)
+!23 = !DILocation(line: 32, column: 28, scope: !9)
+!24 = !DILocation(line: 32, column: 44, scope: !9)
+!25 = !DILocation(line: 32, column: 37, scope: !9)
+!26 = !DILocation(line: 32, column: 52, scope: !9)
+!27 = !DILocation(line: 32, column: 62, scope: !9)
+!28 = !DILocation(line: 24, column: 19, scope: !9)
+!29 = !DILocation(line: 33, column: 39, scope: !9)
+!30 = !DILocation(line: 40, column: 35, scope: !9)
+!31 = !DILocation(line: 33, column: 30, scope: !9)
+!32 = !DILocation(line: 33, column: 46, scope: !9)
+!33 = !DILocation(line: 38, column: 31, scope: !9)
+!34 = !DILocation(line: 38, column: 18, scope: !9)
+!35 = !DILocation(line: 39, column: 19, scope: !9)
+!36 = !DILocation(line: 40, column: 31, scope: !9)
+!37 = !DILocation(line: 40, column: 68, scope: !9)
+!38 = !DILocation(line: 40, column: 60, scope: !9)
+!39 = !DILocation(line: 44, column: 20, scope: !9)
+!40 = !DILocation(line: 47, column: 47, scope: !9)
+!41 = !DILocation(line: 47, column: 31, scope: !9)
+!42 = !DILocation(line: 47, column: 81, scope: !9)
+!43 = !DILocation(line: 47, column: 73, scope: !9)
+!44 = !DILocation(line: 51, column: 34, scope: !9)
+!45 = !DILocation(line: 52, column: 28, scope: !9)
+!46 = !DILocation(line: 52, column: 46, scope: !9)
+!47 = !DILocation(line: 52, column: 38, scope: !9)
+!48 = !DILocation(line: 52, column: 54, scope: !9)
+!49 = !DILocation(line: 52, column: 64, scope: !9)
+!50 = !DILocation(line: 40, column: 119, scope: !9)
+!51 = !DILocation(line: 41, column: 13, scope: !9)
+!52 = !DILocation(line: 47, column: 132, scope: !9)
+!53 = !DILocation(line: 0, scope: !9)
+!54 = !DILocation(line: 26, column: 75, scope: !9)
+!55 = !DILocation(line: 53, column: 40, scope: !9)
+!56 = !DILocation(line: 53, column: 31, scope: !9)
+!57 = !DILocation(line: 53, column: 48, scope: !9)
+!58 = !DILocation(line: 33, column: 86, scope: !9)
+!59 = !DILocation(line: 34, column: 18, scope: !9)
+!60 = !DILocation(line: 55, column: 19, scope: !9)
+!61 = !DILocation(line: 56, column: 25, scope: !9)
+!62 = !DILocation(line: 56, column: 37, scope: !9)
+!63 = !DILocation(line: 56, column: 4, scope: !9)

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ptx ADDED

	@@ -0,0 +1,1534 @@

+//
+// Generated by LLVM NVPTX Back-End
+//
+.version 8.7
+.target sm_90a
+.address_size 64
+	// .globl	triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0 // -- Begin function triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0
+.extern .func __assertfail
+(
+	.param .b64 __assertfail_param_0,
+	.param .b64 __assertfail_param_1,
+	.param .b32 __assertfail_param_2,
+	.param .b64 __assertfail_param_3,
+	.param .b64 __assertfail_param_4
+)
+.noreturn;
+.global .align 1 .b8 assertFunc_1[8] = {117, 110, 107, 110, 111, 119, 110};
+.global .align 1 .b8 assertFile_1[114] = {47, 119, 111, 114, 107, 115, 112, 97, 99, 101, 47, 104, 97, 110, 114, 117, 105, 47, 83, 112, 101, 99, 70, 111, 114, 103, 101, 45, 101, 120, 116, 47, 99, 97, 99, 104, 101, 47, 99, 111, 109, 112, 105, 108, 101, 100, 95, 107, 101, 114, 110, 101, 108, 115, 47, 118, 121, 47, 99, 118, 121, 111, 113, 103, 55, 106, 122, 101, 97, 100, 97, 114, 114, 103, 103, 103, 120, 104, 115, 50, 100, 106, 109, 120, 117, 103, 120, 121, 118, 106, 104, 105, 109, 55, 50, 118, 102, 115, 116, 113, 99, 52, 105, 111, 52, 113, 115, 101, 99, 106, 46, 112, 121};
+.global .align 1 .b8 assertMessage_1[38] = {105, 110, 100, 101, 120, 32, 111, 117, 116, 32, 111, 102, 32, 98, 111, 117, 110, 100, 115, 58, 32, 48, 32, 60, 61, 32, 116, 109, 112, 50, 53, 32, 60, 32, 107, 115, 52};
+.global .align 1 .b8 assertFunc_0[8] = {117, 110, 107, 110, 111, 119, 110};
+.global .align 1 .b8 assertFile_0[114] = {47, 119, 111, 114, 107, 115, 112, 97, 99, 101, 47, 104, 97, 110, 114, 117, 105, 47, 83, 112, 101, 99, 70, 111, 114, 103, 101, 45, 101, 120, 116, 47, 99, 97, 99, 104, 101, 47, 99, 111, 109, 112, 105, 108, 101, 100, 95, 107, 101, 114, 110, 101, 108, 115, 47, 118, 121, 47, 99, 118, 121, 111, 113, 103, 55, 106, 122, 101, 97, 100, 97, 114, 114, 103, 103, 103, 120, 104, 115, 50, 100, 106, 109, 120, 117, 103, 120, 121, 118, 106, 104, 105, 109, 55, 50, 118, 102, 115, 116, 113, 99, 52, 105, 111, 52, 113, 115, 101, 99, 106, 46, 112, 121};
+.global .align 1 .b8 assertMessage_0[37] = {105, 110, 100, 101, 120, 32, 111, 117, 116, 32, 111, 102, 32, 98, 111, 117, 110, 100, 115, 58, 32, 48, 32, 60, 61, 32, 116, 109, 112, 53, 32, 60, 32, 107, 115, 50};
+                                        // @triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0
+.visible .entry triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0(
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_0,
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_1,
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_2,
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_3,
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_4,
+	.param .u64 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_5,
+	.param .u64 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_6,
+	.param .u64 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_7,
+	.param .u64 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_8,
+	.param .u64 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_9,
+	.param .u32 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_10,
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_11,
+	.param .u64 .ptr .global .align 1 triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_12
+)
+.reqntid 128
+{
+	.reg .pred 	%p<179>;
+	.reg .b16 	%rs<165>;
+	.reg .b32 	%r<160>;
+	.reg .b64 	%rd<500>;
+	.loc	1 18 0                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:18:0
+$L__func_begin0:
+	.loc	1 18 0                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:18:0
+// %bb.0:
+	ld.param.b64 	%rd103, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_5];
+$L__tmp0:
+	.loc	1 19 28                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:19:28
+	mov.u32 	%r26, %ctaid.x;
+	.loc	1 19 33                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:19:33
+	shl.b32 	%r27, %r26, 10;
+	.loc	1 20 36                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:20:36
+	mov.u32 	%r28, %tid.x;
+	shl.b32 	%r29, %r28, 3;
+	and.b32 	%r30, %r29, 1016;
+	.loc	1 20 23                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:20:23
+	or.b32 	%r1, %r30, %r27;
+	or.b32 	%r2, %r1, 1;
+	or.b32 	%r3, %r1, 2;
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	cvt.s64.s32 	%rd7, %r2;
+	cvt.s64.s32 	%rd8, %r1;
+	or.b64 	%rd108, %rd8, %rd103;
+	and.b64 	%rd109, %rd108, -4294967296;
+	setp.ne.b64 	%p9, %rd109, 0;
+	@%p9 bra 	$L__BB0_2;
+	bra.uni 	$L__BB0_1;
+$L__BB0_2:
+	div.s64 	%rd484, %rd8, %rd103;
+	bra.uni 	$L__BB0_3;
+$L__BB0_1:
+	cvt.u32.u64 	%r31, %rd103;
+	cvt.u32.u64 	%r32, %rd8;
+	div.u32 	%r33, %r32, %r31;
+	cvt.u64.u32 	%rd484, %r33;
+$L__BB0_3:
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	or.b32 	%r4, %r1, 3;
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	cvt.s64.s32 	%rd6, %r3;
+	or.b64 	%rd110, %rd7, %rd103;
+	and.b64 	%rd111, %rd110, -4294967296;
+	setp.ne.b64 	%p10, %rd111, 0;
+	@%p10 bra 	$L__BB0_5;
+	bra.uni 	$L__BB0_4;
+$L__BB0_5:
+	div.s64 	%rd485, %rd7, %rd103;
+	bra.uni 	$L__BB0_6;
+$L__BB0_4:
+	cvt.u32.u64 	%r34, %rd103;
+	cvt.u32.u64 	%r35, %rd7;
+	div.u32 	%r36, %r35, %r34;
+	cvt.u64.u32 	%rd485, %r36;
+$L__BB0_6:
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	or.b32 	%r5, %r1, 4;
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	cvt.s64.s32 	%rd5, %r4;
+	or.b64 	%rd112, %rd6, %rd103;
+	and.b64 	%rd113, %rd112, -4294967296;
+	setp.ne.b64 	%p11, %rd113, 0;
+	@%p11 bra 	$L__BB0_8;
+	bra.uni 	$L__BB0_7;
+$L__BB0_8:
+	div.s64 	%rd486, %rd6, %rd103;
+	bra.uni 	$L__BB0_9;
+$L__BB0_7:
+	cvt.u32.u64 	%r37, %rd103;
+	cvt.u32.u64 	%r38, %rd6;
+	div.u32 	%r39, %r38, %r37;
+	cvt.u64.u32 	%rd486, %r39;
+$L__BB0_9:
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	or.b32 	%r6, %r1, 5;
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	cvt.s64.s32 	%rd4, %r5;
+	or.b64 	%rd114, %rd5, %rd103;
+	and.b64 	%rd115, %rd114, -4294967296;
+	setp.ne.b64 	%p12, %rd115, 0;
+	@%p12 bra 	$L__BB0_11;
+	bra.uni 	$L__BB0_10;
+$L__BB0_11:
+	div.s64 	%rd487, %rd5, %rd103;
+	bra.uni 	$L__BB0_12;
+$L__BB0_10:
+	cvt.u32.u64 	%r40, %rd103;
+	cvt.u32.u64 	%r41, %rd5;
+	div.u32 	%r42, %r41, %r40;
+	cvt.u64.u32 	%rd487, %r42;
+$L__BB0_12:
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	or.b32 	%r7, %r1, 6;
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	cvt.s64.s32 	%rd3, %r6;
+	or.b64 	%rd116, %rd4, %rd103;
+	and.b64 	%rd117, %rd116, -4294967296;
+	setp.ne.b64 	%p13, %rd117, 0;
+	@%p13 bra 	$L__BB0_14;
+	bra.uni 	$L__BB0_13;
+$L__BB0_14:
+	div.s64 	%rd488, %rd4, %rd103;
+	bra.uni 	$L__BB0_15;
+$L__BB0_13:
+	cvt.u32.u64 	%r43, %rd103;
+	cvt.u32.u64 	%r44, %rd4;
+	div.u32 	%r45, %r44, %r43;
+	cvt.u64.u32 	%rd488, %r45;
+$L__BB0_15:
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	or.b32 	%r8, %r1, 7;
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	cvt.s64.s32 	%rd2, %r7;
+	or.b64 	%rd118, %rd3, %rd103;
+	and.b64 	%rd119, %rd118, -4294967296;
+	setp.ne.b64 	%p14, %rd119, 0;
+	@%p14 bra 	$L__BB0_17;
+	bra.uni 	$L__BB0_16;
+$L__BB0_17:
+	div.s64 	%rd489, %rd3, %rd103;
+	bra.uni 	$L__BB0_18;
+$L__BB0_16:
+	cvt.u32.u64 	%r46, %rd103;
+	cvt.u32.u64 	%r47, %rd3;
+	div.u32 	%r48, %r47, %r46;
+	cvt.u64.u32 	%rd489, %r48;
+$L__BB0_18:
+	cvt.s64.s32 	%rd1, %r8;
+	or.b64 	%rd120, %rd2, %rd103;
+	and.b64 	%rd121, %rd120, -4294967296;
+	setp.ne.b64 	%p15, %rd121, 0;
+	@%p15 bra 	$L__BB0_20;
+	bra.uni 	$L__BB0_19;
+$L__BB0_20:
+	div.s64 	%rd490, %rd2, %rd103;
+	bra.uni 	$L__BB0_21;
+$L__BB0_19:
+	cvt.u32.u64 	%r49, %rd103;
+	cvt.u32.u64 	%r50, %rd2;
+	div.u32 	%r51, %r50, %r49;
+	cvt.u64.u32 	%rd490, %r51;
+$L__BB0_21:
+	.loc	1 0 21                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0:21
+	ld.param.b64 	%rd104, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_6];
+	.loc	1 23 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:21
+	or.b64 	%rd122, %rd1, %rd103;
+	and.b64 	%rd123, %rd122, -4294967296;
+	setp.ne.b64 	%p16, %rd123, 0;
+	@%p16 bra 	$L__BB0_23;
+	bra.uni 	$L__BB0_22;
+$L__BB0_23:
+	div.s64 	%rd491, %rd1, %rd103;
+	bra.uni 	$L__BB0_24;
+$L__BB0_22:
+	cvt.u32.u64 	%r52, %rd103;
+	cvt.u32.u64 	%r53, %rd1;
+	div.u32 	%r54, %r53, %r52;
+	cvt.u64.u32 	%rd491, %r54;
+$L__BB0_24:
+	.loc	1 23 28                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:28
+	or.b64 	%rd124, %rd484, %rd104;
+	and.b64 	%rd125, %rd124, -4294967296;
+	setp.ne.b64 	%p17, %rd125, 0;
+	@%p17 bra 	$L__BB0_26;
+	bra.uni 	$L__BB0_25;
+$L__BB0_26:
+	rem.s64 	%rd492, %rd484, %rd104;
+	bra.uni 	$L__BB0_27;
+$L__BB0_25:
+	cvt.u32.u64 	%r55, %rd104;
+	cvt.u32.u64 	%r56, %rd484;
+	rem.u32 	%r57, %r56, %r55;
+	cvt.u64.u32 	%rd492, %r57;
+$L__BB0_27:
+	or.b64 	%rd126, %rd485, %rd104;
+	and.b64 	%rd127, %rd126, -4294967296;
+	setp.ne.b64 	%p18, %rd127, 0;
+	@%p18 bra 	$L__BB0_29;
+	bra.uni 	$L__BB0_28;
+$L__BB0_29:
+	rem.s64 	%rd493, %rd485, %rd104;
+	bra.uni 	$L__BB0_30;
+$L__BB0_28:
+	cvt.u32.u64 	%r58, %rd104;
+	cvt.u32.u64 	%r59, %rd485;
+	rem.u32 	%r60, %r59, %r58;
+	cvt.u64.u32 	%rd493, %r60;
+$L__BB0_30:
+	or.b64 	%rd128, %rd486, %rd104;
+	and.b64 	%rd129, %rd128, -4294967296;
+	setp.ne.b64 	%p19, %rd129, 0;
+	@%p19 bra 	$L__BB0_32;
+	bra.uni 	$L__BB0_31;
+$L__BB0_32:
+	rem.s64 	%rd494, %rd486, %rd104;
+	bra.uni 	$L__BB0_33;
+$L__BB0_31:
+	cvt.u32.u64 	%r61, %rd104;
+	cvt.u32.u64 	%r62, %rd486;
+	rem.u32 	%r63, %r62, %r61;
+	cvt.u64.u32 	%rd494, %r63;
+$L__BB0_33:
+	or.b64 	%rd130, %rd487, %rd104;
+	and.b64 	%rd131, %rd130, -4294967296;
+	setp.ne.b64 	%p20, %rd131, 0;
+	@%p20 bra 	$L__BB0_35;
+	bra.uni 	$L__BB0_34;
+$L__BB0_35:
+	rem.s64 	%rd495, %rd487, %rd104;
+	bra.uni 	$L__BB0_36;
+$L__BB0_34:
+	cvt.u32.u64 	%r64, %rd104;
+	cvt.u32.u64 	%r65, %rd487;
+	rem.u32 	%r66, %r65, %r64;
+	cvt.u64.u32 	%rd495, %r66;
+$L__BB0_36:
+	or.b64 	%rd132, %rd488, %rd104;
+	and.b64 	%rd133, %rd132, -4294967296;
+	setp.ne.b64 	%p21, %rd133, 0;
+	@%p21 bra 	$L__BB0_38;
+	bra.uni 	$L__BB0_37;
+$L__BB0_38:
+	rem.s64 	%rd496, %rd488, %rd104;
+	bra.uni 	$L__BB0_39;
+$L__BB0_37:
+	cvt.u32.u64 	%r67, %rd104;
+	cvt.u32.u64 	%r68, %rd488;
+	rem.u32 	%r69, %r68, %r67;
+	cvt.u64.u32 	%rd496, %r69;
+$L__BB0_39:
+	or.b64 	%rd134, %rd489, %rd104;
+	and.b64 	%rd135, %rd134, -4294967296;
+	setp.ne.b64 	%p22, %rd135, 0;
+	@%p22 bra 	$L__BB0_41;
+	bra.uni 	$L__BB0_40;
+$L__BB0_41:
+	rem.s64 	%rd497, %rd489, %rd104;
+	bra.uni 	$L__BB0_42;
+$L__BB0_40:
+	cvt.u32.u64 	%r70, %rd104;
+	cvt.u32.u64 	%r71, %rd489;
+	rem.u32 	%r72, %r71, %r70;
+	cvt.u64.u32 	%rd497, %r72;
+$L__BB0_42:
+	or.b64 	%rd136, %rd490, %rd104;
+	and.b64 	%rd137, %rd136, -4294967296;
+	setp.ne.b64 	%p23, %rd137, 0;
+	@%p23 bra 	$L__BB0_44;
+	bra.uni 	$L__BB0_43;
+$L__BB0_44:
+	rem.s64 	%rd498, %rd490, %rd104;
+	bra.uni 	$L__BB0_45;
+$L__BB0_43:
+	cvt.u32.u64 	%r73, %rd104;
+	cvt.u32.u64 	%r74, %rd490;
+	rem.u32 	%r75, %r74, %r73;
+	cvt.u64.u32 	%rd498, %r75;
+$L__BB0_45:
+	.loc	1 0 28                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0:28
+	ld.param.b32 	%r25, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_10];
+	ld.param.b64 	%rd105, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_7];
+	ld.param.b64 	%rd99, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_1];
+	ld.param.b64 	%rd98, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_0];
+	.loc	1 23 28                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:23:28
+	or.b64 	%rd138, %rd491, %rd104;
+	and.b64 	%rd139, %rd138, -4294967296;
+	setp.ne.b64 	%p24, %rd139, 0;
+	@%p24 bra 	$L__BB0_47;
+	bra.uni 	$L__BB0_46;
+$L__BB0_47:
+	rem.s64 	%rd499, %rd491, %rd104;
+	bra.uni 	$L__BB0_48;
+$L__BB0_46:
+	cvt.u32.u64 	%r76, %rd104;
+	cvt.u32.u64 	%r77, %rd491;
+	rem.u32 	%r78, %r77, %r76;
+	cvt.u64.u32 	%rd499, %r78;
+$L__BB0_48:
+	.loc	1 26 30                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:26:30
+	shl.b64 	%rd196, %rd8, 1;
+	add.s64 	%rd141, %rd98, %rd196;
+	shl.b64 	%rd197, %rd7, 1;
+	add.s64 	%rd144, %rd98, %rd197;
+	shl.b64 	%rd198, %rd6, 1;
+	add.s64 	%rd147, %rd98, %rd198;
+	shl.b64 	%rd199, %rd5, 1;
+	add.s64 	%rd150, %rd98, %rd199;
+	shl.b64 	%rd200, %rd4, 1;
+	add.s64 	%rd153, %rd98, %rd200;
+	shl.b64 	%rd201, %rd3, 1;
+	add.s64 	%rd156, %rd98, %rd201;
+	shl.b64 	%rd202, %rd2, 1;
+	add.s64 	%rd159, %rd98, %rd202;
+	shl.b64 	%rd203, %rd1, 1;
+	add.s64 	%rd162, %rd98, %rd203;
+	.loc	1 26 35                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:26:35
+	// begin inline asm
+	mov.u64 %rd140, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd140, 1.0;
+	// end inline asm
+	.loc	1 27 30                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:27:30
+	shl.b64 	%rd204, %rd492, 3;
+	add.s64 	%rd166, %rd99, %rd204;
+	shl.b64 	%rd205, %rd493, 3;
+	add.s64 	%rd170, %rd99, %rd205;
+	shl.b64 	%rd206, %rd494, 3;
+	add.s64 	%rd174, %rd99, %rd206;
+	shl.b64 	%rd207, %rd495, 3;
+	add.s64 	%rd178, %rd99, %rd207;
+	shl.b64 	%rd208, %rd496, 3;
+	add.s64 	%rd182, %rd99, %rd208;
+	shl.b64 	%rd209, %rd497, 3;
+	add.s64 	%rd186, %rd99, %rd209;
+	shl.b64 	%rd210, %rd498, 3;
+	add.s64 	%rd190, %rd99, %rd210;
+	shl.b64 	%rd211, %rd499, 3;
+	add.s64 	%rd194, %rd99, %rd211;
+	.loc	1 21 21                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:21:21
+	setp.lt.s32 	%p8, %r8, %r25;
+	setp.lt.s32 	%p7, %r7, %r25;
+	setp.lt.s32 	%p6, %r6, %r25;
+	setp.lt.s32 	%p5, %r5, %r25;
+	setp.lt.s32 	%p4, %r4, %r25;
+	setp.lt.s32 	%p3, %r3, %r25;
+	setp.lt.s32 	%p2, %r2, %r25;
+	setp.lt.s32 	%p1, %r1, %r25;
+	.loc	1 26 35                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:26:35
+	// begin inline asm
+	mov.u16 %rs33, 0x0;
+	@%p1 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs33 }, [ %rd141 + 0 ], %rd140;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd143, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd143, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs34, 0x0;
+	@%p2 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs34 }, [ %rd144 + 0 ], %rd143;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd146, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd146, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs35, 0x0;
+	@%p3 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs35 }, [ %rd147 + 0 ], %rd146;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd149, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd149, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs36, 0x0;
+	@%p4 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs36 }, [ %rd150 + 0 ], %rd149;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd152, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd152, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs37, 0x0;
+	@%p5 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs37 }, [ %rd153 + 0 ], %rd152;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd155, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd155, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs38, 0x0;
+	@%p6 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs38 }, [ %rd156 + 0 ], %rd155;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd158, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd158, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs39, 0x0;
+	@%p7 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs39 }, [ %rd159 + 0 ], %rd158;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd161, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd161, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs40, 0x0;
+	@%p8 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs40 }, [ %rd162 + 0 ], %rd161;
+	// end inline asm
+	.loc	1 27 35                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:27:35
+	// begin inline asm
+	mov.u64 %rd164, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd164, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd165, 0x0;
+	@%p1 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd165 }, [ %rd166 + 0 ], %rd164;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd168, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd168, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd169, 0x0;
+	@%p2 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd169 }, [ %rd170 + 0 ], %rd168;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd172, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd172, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd173, 0x0;
+	@%p3 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd173 }, [ %rd174 + 0 ], %rd172;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd176, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd176, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd177, 0x0;
+	@%p4 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd177 }, [ %rd178 + 0 ], %rd176;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd180, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd180, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd181, 0x0;
+	@%p5 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd181 }, [ %rd182 + 0 ], %rd180;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd184, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd184, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd185, 0x0;
+	@%p6 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd185 }, [ %rd186 + 0 ], %rd184;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd188, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd188, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd189, 0x0;
+	@%p7 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd189 }, [ %rd190 + 0 ], %rd188;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd192, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd192, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd193, 0x0;
+	@%p8 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd193 }, [ %rd194 + 0 ], %rd192;
+	// end inline asm
+	.loc	1 31 32                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:31:32
+	shr.s64 	%rd212, %rd173, 63;
+	and.b64 	%rd213, %rd212, %rd105;
+	shr.s64 	%rd214, %rd177, 63;
+	and.b64 	%rd215, %rd214, %rd105;
+	shr.s64 	%rd216, %rd165, 63;
+	and.b64 	%rd217, %rd216, %rd105;
+	shr.s64 	%rd218, %rd169, 63;
+	and.b64 	%rd219, %rd218, %rd105;
+	shr.s64 	%rd220, %rd189, 63;
+	and.b64 	%rd221, %rd220, %rd105;
+	shr.s64 	%rd222, %rd193, 63;
+	and.b64 	%rd223, %rd222, %rd105;
+	shr.s64 	%rd224, %rd181, 63;
+	and.b64 	%rd225, %rd224, %rd105;
+	shr.s64 	%rd226, %rd185, 63;
+	and.b64 	%rd227, %rd226, %rd105;
+	add.s64 	%rd78, %rd227, %rd185;
+	add.s64 	%rd77, %rd225, %rd181;
+	add.s64 	%rd80, %rd223, %rd193;
+	add.s64 	%rd79, %rd221, %rd189;
+	add.s64 	%rd74, %rd219, %rd169;
+	add.s64 	%rd73, %rd217, %rd165;
+	add.s64 	%rd76, %rd215, %rd177;
+	add.s64 	%rd75, %rd213, %rd173;
+	.loc	1 32 28                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:28
+	setp.lt.s64 	%p41, %rd75, 0;
+	setp.lt.s64 	%p42, %rd76, 0;
+	setp.lt.s64 	%p43, %rd73, 0;
+	setp.lt.s64 	%p44, %rd74, 0;
+	setp.lt.s64 	%p45, %rd79, 0;
+	setp.lt.s64 	%p46, %rd80, 0;
+	setp.lt.s64 	%p47, %rd77, 0;
+	setp.lt.s64 	%p48, %rd78, 0;
+	.loc	1 32 44                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:44
+	setp.ge.s64 	%p49, %rd75, %rd105;
+	setp.ge.s64 	%p50, %rd76, %rd105;
+	setp.ge.s64 	%p51, %rd73, %rd105;
+	setp.ge.s64 	%p52, %rd74, %rd105;
+	setp.ge.s64 	%p53, %rd79, %rd105;
+	setp.ge.s64 	%p54, %rd80, %rd105;
+	setp.ge.s64 	%p55, %rd77, %rd105;
+	setp.ge.s64 	%p56, %rd78, %rd105;
+	.loc	1 32 37                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:37
+	or.pred 	%p57, %p48, %p56;
+	or.pred 	%p58, %p47, %p55;
+	or.pred 	%p59, %p46, %p54;
+	or.pred 	%p60, %p45, %p53;
+	or.pred 	%p61, %p44, %p52;
+	or.pred 	%p62, %p43, %p51;
+	or.pred 	%p63, %p42, %p50;
+	or.pred 	%p64, %p41, %p49;
+	.loc	1 32 52                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:52
+	and.pred 	%p65, %p3, %p64;
+	selp.b16 	%rs41, 1, 0, %p65;
+	shl.b16 	%rs42, %rs41, 2;
+	and.pred 	%p66, %p4, %p63;
+	selp.b16 	%rs43, -1, 0, %p66;
+	shl.b16 	%rs44, %rs43, 3;
+	or.b16 	%rs45, %rs44, %rs42;
+	and.pred 	%p67, %p1, %p62;
+	selp.b16 	%rs46, 1, 0, %p67;
+	and.pred 	%p68, %p2, %p61;
+	selp.b16 	%rs47, -1, 0, %p68;
+	shl.b16 	%rs48, %rs47, 1;
+	or.b16 	%rs49, %rs46, %rs48;
+	and.b16 	%rs50, %rs49, 3;
+	or.b16 	%rs51, %rs50, %rs45;
+	and.b16 	%rs52, %rs51, 15;
+	and.pred 	%p69, %p7, %p60;
+	selp.b16 	%rs53, 1, 0, %p69;
+	shl.b16 	%rs54, %rs53, 2;
+	and.pred 	%p70, %p8, %p59;
+	selp.b16 	%rs55, -1, 0, %p70;
+	shl.b16 	%rs56, %rs55, 3;
+	or.b16 	%rs57, %rs56, %rs54;
+	and.pred 	%p71, %p5, %p58;
+	selp.b16 	%rs58, 1, 0, %p71;
+	and.pred 	%p72, %p6, %p57;
+	selp.b16 	%rs59, -1, 0, %p72;
+	shl.b16 	%rs60, %rs59, 1;
+	or.b16 	%rs61, %rs58, %rs60;
+	and.b16 	%rs62, %rs61, 3;
+	or.b16 	%rs63, %rs62, %rs57;
+	shl.b16 	%rs64, %rs63, 4;
+	or.b16 	%rs65, %rs52, %rs64;
+	.loc	1 32 62                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:62
+	and.b16 	%rs66, %rs65, 255;
+	setp.eq.b16 	%p73, %rs66, 0;
+	@%p73 bra 	$L__BB0_50;
+	bra.uni 	$L__BB0_49;
+$L__BB0_50:
+	.loc	1 0 62                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0:62
+	ld.param.b64 	%rd107, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_9];
+	ld.param.b64 	%rd106, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_8];
+	ld.param.b64 	%rd100, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_2];
+	.loc	1 24 19                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:24:19
+	rem.s64 	%rd88, %rd1, %rd106;
+	rem.s64 	%rd87, %rd2, %rd106;
+	rem.s64 	%rd86, %rd3, %rd106;
+	rem.s64 	%rd85, %rd4, %rd106;
+	rem.s64 	%rd84, %rd5, %rd106;
+	rem.s64 	%rd83, %rd6, %rd106;
+	rem.s64 	%rd82, %rd7, %rd106;
+	rem.s64 	%rd81, %rd8, %rd106;
+	.loc	1 32 62                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:62
+	bar.sync 	0;
+	.loc	1 33 39                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:39
+	mul.lo.s64 	%rd306, %rd73, %rd106;
+	mul.lo.s64 	%rd307, %rd74, %rd106;
+	mul.lo.s64 	%rd308, %rd75, %rd106;
+	mul.lo.s64 	%rd309, %rd76, %rd106;
+	mul.lo.s64 	%rd310, %rd77, %rd106;
+	mul.lo.s64 	%rd311, %rd78, %rd106;
+	mul.lo.s64 	%rd312, %rd79, %rd106;
+	mul.lo.s64 	%rd313, %rd80, %rd106;
+	.loc	1 33 30                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:30
+	shl.b64 	%rd314, %rd81, 1;
+	add.s64 	%rd315, %rd100, %rd314;
+	shl.b64 	%rd316, %rd306, 1;
+	add.s64 	%rd235, %rd315, %rd316;
+	shl.b64 	%rd317, %rd82, 1;
+	add.s64 	%rd318, %rd100, %rd317;
+	shl.b64 	%rd319, %rd307, 1;
+	add.s64 	%rd238, %rd318, %rd319;
+	shl.b64 	%rd320, %rd83, 1;
+	add.s64 	%rd321, %rd100, %rd320;
+	shl.b64 	%rd322, %rd308, 1;
+	add.s64 	%rd241, %rd321, %rd322;
+	shl.b64 	%rd323, %rd84, 1;
+	add.s64 	%rd324, %rd100, %rd323;
+	shl.b64 	%rd325, %rd309, 1;
+	add.s64 	%rd244, %rd324, %rd325;
+	shl.b64 	%rd326, %rd85, 1;
+	add.s64 	%rd327, %rd100, %rd326;
+	shl.b64 	%rd328, %rd310, 1;
+	add.s64 	%rd247, %rd327, %rd328;
+	shl.b64 	%rd329, %rd86, 1;
+	add.s64 	%rd330, %rd100, %rd329;
+	shl.b64 	%rd331, %rd311, 1;
+	add.s64 	%rd250, %rd330, %rd331;
+	shl.b64 	%rd332, %rd87, 1;
+	add.s64 	%rd333, %rd100, %rd332;
+	shl.b64 	%rd334, %rd312, 1;
+	add.s64 	%rd253, %rd333, %rd334;
+	shl.b64 	%rd335, %rd88, 1;
+	add.s64 	%rd336, %rd100, %rd335;
+	shl.b64 	%rd337, %rd313, 1;
+	add.s64 	%rd256, %rd336, %rd337;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	// begin inline asm
+	mov.u64 %rd234, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd234, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs67, 0x0;
+	@%p1 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs67 }, [ %rd235 + 0 ], %rd234;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd237, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd237, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs68, 0x0;
+	@%p2 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs68 }, [ %rd238 + 0 ], %rd237;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd240, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd240, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs69, 0x0;
+	@%p3 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs69 }, [ %rd241 + 0 ], %rd240;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd243, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd243, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs70, 0x0;
+	@%p4 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs70 }, [ %rd244 + 0 ], %rd243;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd246, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd246, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs71, 0x0;
+	@%p5 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs71 }, [ %rd247 + 0 ], %rd246;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd249, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd249, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs72, 0x0;
+	@%p6 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs72 }, [ %rd250 + 0 ], %rd249;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd252, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd252, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs73, 0x0;
+	@%p7 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs73 }, [ %rd253 + 0 ], %rd252;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd255, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd255, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs74, 0x0;
+	@%p8 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs74 }, [ %rd256 + 0 ], %rd255;
+	// end inline asm
+	.loc	1 38 31                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:38:31
+	shr.u64 	%rd338, %rd106, 63;
+	add.s64 	%rd339, %rd106, %rd338;
+	shr.s64 	%rd340, %rd339, 1;
+	.loc	1 38 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:38:18
+	sub.s64 	%rd89, %rd106, %rd340;
+	.loc	1 39 19                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:39:19
+	setp.lt.s64 	%p106, %rd81, %rd89;
+	setp.lt.s64 	%p107, %rd82, %rd89;
+	setp.lt.s64 	%p108, %rd83, %rd89;
+	setp.lt.s64 	%p109, %rd84, %rd89;
+	setp.lt.s64 	%p110, %rd85, %rd89;
+	setp.lt.s64 	%p111, %rd86, %rd89;
+	setp.lt.s64 	%p112, %rd87, %rd89;
+	setp.lt.s64 	%p113, %rd88, %rd89;
+	.loc	1 40 35                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:35
+	sub.s64 	%rd341, %rd8, %rd81;
+	sub.s64 	%rd342, %rd7, %rd82;
+	sub.s64 	%rd343, %rd6, %rd83;
+	sub.s64 	%rd344, %rd5, %rd84;
+	sub.s64 	%rd345, %rd4, %rd85;
+	sub.s64 	%rd346, %rd3, %rd86;
+	sub.s64 	%rd347, %rd2, %rd87;
+	sub.s64 	%rd348, %rd1, %rd88;
+	.loc	1 40 31                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:31
+	shl.b64 	%rd349, %rd341, 1;
+	add.s64 	%rd350, %rd98, %rd349;
+	and.b64 	%rd351, %rd339, -2;
+	add.s64 	%rd352, %rd350, %rd351;
+	add.s64 	%rd259, %rd352, %rd314;
+	shl.b64 	%rd353, %rd342, 1;
+	add.s64 	%rd354, %rd98, %rd353;
+	add.s64 	%rd355, %rd354, %rd351;
+	add.s64 	%rd262, %rd355, %rd317;
+	shl.b64 	%rd356, %rd343, 1;
+	add.s64 	%rd357, %rd98, %rd356;
+	add.s64 	%rd358, %rd357, %rd351;
+	add.s64 	%rd265, %rd358, %rd320;
+	shl.b64 	%rd359, %rd344, 1;
+	add.s64 	%rd360, %rd98, %rd359;
+	add.s64 	%rd361, %rd360, %rd351;
+	add.s64 	%rd268, %rd361, %rd323;
+	shl.b64 	%rd362, %rd345, 1;
+	add.s64 	%rd363, %rd98, %rd362;
+	add.s64 	%rd364, %rd363, %rd351;
+	add.s64 	%rd271, %rd364, %rd326;
+	shl.b64 	%rd365, %rd346, 1;
+	add.s64 	%rd366, %rd98, %rd365;
+	add.s64 	%rd367, %rd366, %rd351;
+	add.s64 	%rd274, %rd367, %rd329;
+	shl.b64 	%rd368, %rd347, 1;
+	add.s64 	%rd369, %rd98, %rd368;
+	add.s64 	%rd370, %rd369, %rd351;
+	add.s64 	%rd277, %rd370, %rd332;
+	shl.b64 	%rd371, %rd348, 1;
+	add.s64 	%rd372, %rd98, %rd371;
+	add.s64 	%rd373, %rd372, %rd351;
+	add.s64 	%rd280, %rd373, %rd335;
+	.loc	1 40 68                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:68
+	and.pred 	%p82, %p1, %p106;
+	and.pred 	%p83, %p2, %p107;
+	and.pred 	%p84, %p3, %p108;
+	and.pred 	%p85, %p4, %p109;
+	and.pred 	%p86, %p5, %p110;
+	and.pred 	%p87, %p6, %p111;
+	and.pred 	%p88, %p7, %p112;
+	and.pred 	%p89, %p8, %p113;
+	.loc	1 40 60                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:60
+	// begin inline asm
+	mov.u64 %rd258, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd258, 1.0;
+	// end inline asm
+	mov.b16 	%rs76, 0;
+	// begin inline asm
+	mov.u16 %rs75, %rs76;
+	@%p82 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs75 }, [ %rd259 + 0 ], %rd258;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd261, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd261, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs77, %rs76;
+	@%p83 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs77 }, [ %rd262 + 0 ], %rd261;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd264, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd264, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs79, %rs76;
+	@%p84 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs79 }, [ %rd265 + 0 ], %rd264;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd267, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd267, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs81, %rs76;
+	@%p85 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs81 }, [ %rd268 + 0 ], %rd267;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd270, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd270, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs83, %rs76;
+	@%p86 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs83 }, [ %rd271 + 0 ], %rd270;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd273, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd273, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs85, %rs76;
+	@%p87 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs85 }, [ %rd274 + 0 ], %rd273;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd276, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd276, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs87, %rs76;
+	@%p88 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs87 }, [ %rd277 + 0 ], %rd276;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd279, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd279, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs89, %rs76;
+	@%p89 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs89 }, [ %rd280 + 0 ], %rd279;
+	// end inline asm
+	.loc	1 44 20                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:44:20
+	setp.ge.s64 	%p114, %rd88, %rd89;
+	setp.ge.s64 	%p115, %rd87, %rd89;
+	setp.ge.s64 	%p116, %rd86, %rd89;
+	setp.ge.s64 	%p117, %rd85, %rd89;
+	setp.ge.s64 	%p118, %rd84, %rd89;
+	setp.ge.s64 	%p119, %rd83, %rd89;
+	setp.ge.s64 	%p120, %rd82, %rd89;
+	setp.ge.s64 	%p121, %rd81, %rd89;
+	.loc	1 47 47                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:47
+	sub.s64 	%rd374, %rd81, %rd106;
+	sub.s64 	%rd375, %rd82, %rd106;
+	sub.s64 	%rd376, %rd83, %rd106;
+	sub.s64 	%rd377, %rd84, %rd106;
+	sub.s64 	%rd378, %rd85, %rd106;
+	sub.s64 	%rd379, %rd86, %rd106;
+	sub.s64 	%rd380, %rd87, %rd106;
+	sub.s64 	%rd381, %rd88, %rd106;
+	.loc	1 47 31                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:31
+	shl.b64 	%rd382, %rd374, 1;
+	add.s64 	%rd283, %rd352, %rd382;
+	shl.b64 	%rd383, %rd375, 1;
+	add.s64 	%rd286, %rd355, %rd383;
+	shl.b64 	%rd384, %rd376, 1;
+	add.s64 	%rd289, %rd358, %rd384;
+	shl.b64 	%rd385, %rd377, 1;
+	add.s64 	%rd292, %rd361, %rd385;
+	shl.b64 	%rd386, %rd378, 1;
+	add.s64 	%rd295, %rd364, %rd386;
+	shl.b64 	%rd387, %rd379, 1;
+	add.s64 	%rd298, %rd367, %rd387;
+	shl.b64 	%rd388, %rd380, 1;
+	add.s64 	%rd301, %rd370, %rd388;
+	shl.b64 	%rd389, %rd381, 1;
+	add.s64 	%rd304, %rd373, %rd389;
+	.loc	1 47 81                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:81
+	and.pred 	%p90, %p1, %p121;
+	and.pred 	%p91, %p2, %p120;
+	and.pred 	%p92, %p3, %p119;
+	and.pred 	%p93, %p4, %p118;
+	and.pred 	%p94, %p5, %p117;
+	and.pred 	%p95, %p6, %p116;
+	and.pred 	%p96, %p7, %p115;
+	and.pred 	%p97, %p8, %p114;
+	.loc	1 47 73                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:73
+	// begin inline asm
+	mov.u64 %rd282, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd282, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs91, %rs76;
+	@%p90 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs91 }, [ %rd283 + 0 ], %rd282;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd285, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd285, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs93, %rs76;
+	@%p91 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs93 }, [ %rd286 + 0 ], %rd285;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd288, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd288, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs95, %rs76;
+	@%p92 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs95 }, [ %rd289 + 0 ], %rd288;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd291, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd291, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs97, %rs76;
+	@%p93 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs97 }, [ %rd292 + 0 ], %rd291;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd294, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd294, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs99, %rs76;
+	@%p94 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs99 }, [ %rd295 + 0 ], %rd294;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd297, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd297, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs101, %rs76;
+	@%p95 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs101 }, [ %rd298 + 0 ], %rd297;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd300, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd300, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs103, %rs76;
+	@%p96 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs103 }, [ %rd301 + 0 ], %rd300;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd303, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd303, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs105, %rs76;
+	@%p97 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs105 }, [ %rd304 + 0 ], %rd303;
+	// end inline asm
+	.loc	1 51 34                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:51:34
+	and.b64 	%rd391, %rd212, %rd107;
+	and.b64 	%rd393, %rd214, %rd107;
+	and.b64 	%rd395, %rd216, %rd107;
+	and.b64 	%rd397, %rd218, %rd107;
+	and.b64 	%rd399, %rd220, %rd107;
+	and.b64 	%rd401, %rd222, %rd107;
+	and.b64 	%rd403, %rd224, %rd107;
+	and.b64 	%rd405, %rd226, %rd107;
+	add.s64 	%rd95, %rd405, %rd185;
+	add.s64 	%rd94, %rd403, %rd181;
+	add.s64 	%rd97, %rd401, %rd193;
+	add.s64 	%rd96, %rd399, %rd189;
+	add.s64 	%rd91, %rd397, %rd169;
+	add.s64 	%rd90, %rd395, %rd165;
+	add.s64 	%rd93, %rd393, %rd177;
+	add.s64 	%rd92, %rd391, %rd173;
+	.loc	1 52 28                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:28
+	setp.lt.s64 	%p122, %rd92, 0;
+	setp.lt.s64 	%p123, %rd93, 0;
+	setp.lt.s64 	%p124, %rd90, 0;
+	setp.lt.s64 	%p125, %rd91, 0;
+	setp.lt.s64 	%p126, %rd96, 0;
+	setp.lt.s64 	%p127, %rd97, 0;
+	setp.lt.s64 	%p128, %rd94, 0;
+	setp.lt.s64 	%p129, %rd95, 0;
+	.loc	1 52 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:46
+	setp.ge.s64 	%p130, %rd92, %rd107;
+	setp.ge.s64 	%p131, %rd93, %rd107;
+	setp.ge.s64 	%p132, %rd90, %rd107;
+	setp.ge.s64 	%p133, %rd91, %rd107;
+	setp.ge.s64 	%p134, %rd96, %rd107;
+	setp.ge.s64 	%p135, %rd97, %rd107;
+	setp.ge.s64 	%p136, %rd94, %rd107;
+	setp.ge.s64 	%p137, %rd95, %rd107;
+	.loc	1 52 38                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:38
+	or.pred 	%p138, %p129, %p137;
+	or.pred 	%p139, %p128, %p136;
+	or.pred 	%p140, %p127, %p135;
+	or.pred 	%p141, %p126, %p134;
+	or.pred 	%p142, %p125, %p133;
+	or.pred 	%p143, %p124, %p132;
+	or.pred 	%p144, %p123, %p131;
+	or.pred 	%p145, %p122, %p130;
+	.loc	1 52 54                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:54
+	and.pred 	%p146, %p3, %p145;
+	selp.b16 	%rs107, 1, 0, %p146;
+	shl.b16 	%rs108, %rs107, 2;
+	and.pred 	%p147, %p4, %p144;
+	selp.b16 	%rs109, -1, 0, %p147;
+	shl.b16 	%rs110, %rs109, 3;
+	or.b16 	%rs111, %rs110, %rs108;
+	and.pred 	%p148, %p1, %p143;
+	selp.b16 	%rs112, 1, 0, %p148;
+	and.pred 	%p149, %p2, %p142;
+	selp.b16 	%rs113, -1, 0, %p149;
+	shl.b16 	%rs114, %rs113, 1;
+	or.b16 	%rs115, %rs112, %rs114;
+	and.b16 	%rs116, %rs115, 3;
+	or.b16 	%rs117, %rs116, %rs111;
+	and.b16 	%rs118, %rs117, 15;
+	and.pred 	%p150, %p7, %p141;
+	selp.b16 	%rs119, 1, 0, %p150;
+	shl.b16 	%rs120, %rs119, 2;
+	and.pred 	%p151, %p8, %p140;
+	selp.b16 	%rs121, -1, 0, %p151;
+	shl.b16 	%rs122, %rs121, 3;
+	or.b16 	%rs123, %rs122, %rs120;
+	and.pred 	%p152, %p5, %p139;
+	selp.b16 	%rs124, 1, 0, %p152;
+	and.pred 	%p153, %p6, %p138;
+	selp.b16 	%rs125, -1, 0, %p153;
+	shl.b16 	%rs126, %rs125, 1;
+	or.b16 	%rs127, %rs124, %rs126;
+	and.b16 	%rs128, %rs127, 3;
+	or.b16 	%rs129, %rs128, %rs123;
+	shl.b16 	%rs130, %rs129, 4;
+	or.b16 	%rs131, %rs118, %rs130;
+	.loc	1 52 64                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:64
+	and.b16 	%rs132, %rs131, 255;
+	setp.eq.b16 	%p154, %rs132, 0;
+	@%p154 bra 	$L__BB0_52;
+	bra.uni 	$L__BB0_51;
+$L__BB0_52:
+	.loc	1 0 64                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0:64
+	ld.param.b64 	%rd102, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_4];
+	ld.param.b64 	%rd101, [triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0_param_3];
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r79, %rs89;
+	mov.b32 	%r80, 0f00000000;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r81, %r80, %r79;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r82, %rs105;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r83, %r81, %r82, %p113;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r84, %rs87;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r85, %r80, %r84;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r86, %rs103;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r87, %r85, %r86, %p112;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r88, %rs85;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r89, %r80, %r88;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r90, %rs101;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r91, %r89, %r90, %p111;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r92, %rs83;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r93, %r80, %r92;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r94, %rs99;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r95, %r93, %r94, %p110;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r96, %rs81;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r97, %r80, %r96;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r98, %rs97;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r99, %r97, %r98, %p109;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r100, %rs79;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r101, %r80, %r100;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r102, %rs95;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r103, %r101, %r102, %p108;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r104, %rs77;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r105, %r80, %r104;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r106, %rs93;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r107, %r105, %r106, %p107;
+	.loc	1 40 119                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:40:119
+	cvt.f32.bf16 	%r108, %rs75;
+	.loc	1 41 13                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:41:13
+	sub.f32 	%r109, %r80, %r108;
+	.loc	1 47 132                        // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:47:132
+	cvt.f32.bf16 	%r110, %rs91;
+	.loc	1 0 0                           // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:0
+	selp.f32 	%r111, %r109, %r110, %p106;
+	.loc	1 26 75                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:26:75
+	cvt.f32.bf16 	%r112, %rs40;
+	cvt.f32.bf16 	%r113, %rs39;
+	cvt.f32.bf16 	%r114, %rs38;
+	cvt.f32.bf16 	%r115, %rs37;
+	cvt.f32.bf16 	%r116, %rs36;
+	cvt.f32.bf16 	%r117, %rs35;
+	cvt.f32.bf16 	%r118, %rs34;
+	cvt.f32.bf16 	%r119, %rs33;
+	.loc	1 52 64                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:64
+	bar.sync 	0;
+	.loc	1 53 40                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:53:40
+	mul.lo.s64 	%rd444, %rd90, %rd106;
+	mul.lo.s64 	%rd445, %rd91, %rd106;
+	mul.lo.s64 	%rd446, %rd92, %rd106;
+	mul.lo.s64 	%rd447, %rd93, %rd106;
+	mul.lo.s64 	%rd448, %rd94, %rd106;
+	mul.lo.s64 	%rd449, %rd95, %rd106;
+	mul.lo.s64 	%rd450, %rd96, %rd106;
+	mul.lo.s64 	%rd451, %rd97, %rd106;
+	.loc	1 53 31                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:53:31
+	add.s64 	%rd453, %rd101, %rd314;
+	shl.b64 	%rd454, %rd444, 1;
+	add.s64 	%rd413, %rd453, %rd454;
+	add.s64 	%rd456, %rd101, %rd317;
+	shl.b64 	%rd457, %rd445, 1;
+	add.s64 	%rd416, %rd456, %rd457;
+	add.s64 	%rd459, %rd101, %rd320;
+	shl.b64 	%rd460, %rd446, 1;
+	add.s64 	%rd419, %rd459, %rd460;
+	add.s64 	%rd462, %rd101, %rd323;
+	shl.b64 	%rd463, %rd447, 1;
+	add.s64 	%rd422, %rd462, %rd463;
+	add.s64 	%rd465, %rd101, %rd326;
+	shl.b64 	%rd466, %rd448, 1;
+	add.s64 	%rd425, %rd465, %rd466;
+	add.s64 	%rd468, %rd101, %rd329;
+	shl.b64 	%rd469, %rd449, 1;
+	add.s64 	%rd428, %rd468, %rd469;
+	add.s64 	%rd471, %rd101, %rd332;
+	shl.b64 	%rd472, %rd450, 1;
+	add.s64 	%rd431, %rd471, %rd472;
+	add.s64 	%rd474, %rd101, %rd335;
+	shl.b64 	%rd475, %rd451, 1;
+	add.s64 	%rd434, %rd474, %rd475;
+	.loc	1 53 48                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:53:48
+	// begin inline asm
+	mov.u64 %rd414, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd414, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs133, 0x0;
+	@%p1 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs133 }, [ %rd413 + 0 ], %rd414;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd417, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd417, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs134, 0x0;
+	@%p2 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs134 }, [ %rd416 + 0 ], %rd417;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd420, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd420, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs135, 0x0;
+	@%p3 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs135 }, [ %rd419 + 0 ], %rd420;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd423, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd423, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs136, 0x0;
+	@%p4 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs136 }, [ %rd422 + 0 ], %rd423;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd426, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd426, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs137, 0x0;
+	@%p5 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs137 }, [ %rd425 + 0 ], %rd426;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd429, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd429, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs138, 0x0;
+	@%p6 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs138 }, [ %rd428 + 0 ], %rd429;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd432, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd432, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs139, 0x0;
+	@%p7 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs139 }, [ %rd431 + 0 ], %rd432;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd435, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd435, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u16 %rs140, 0x0;
+	@%p8 ld.global.L1::evict_last.L2::cache_hint.b16 { %rs140 }, [ %rd434 + 0 ], %rd435;
+	// end inline asm
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r120, {%rs67, %rs133};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs149, %rs150}, %r120;
+	cvt.f32.bf16 	%r121, %rs149;
+	cvt.f32.bf16 	%r122, %rs150;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r123, %r111, %r122;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r124, {%rs68, %rs134};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs151, %rs152}, %r124;
+	cvt.f32.bf16 	%r125, %rs151;
+	cvt.f32.bf16 	%r126, %rs152;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r127, %r107, %r126;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r128, {%rs69, %rs135};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs153, %rs154}, %r128;
+	cvt.f32.bf16 	%r129, %rs153;
+	cvt.f32.bf16 	%r130, %rs154;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r131, %r103, %r130;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r132, {%rs70, %rs136};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs155, %rs156}, %r132;
+	cvt.f32.bf16 	%r133, %rs155;
+	cvt.f32.bf16 	%r134, %rs156;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r135, %r99, %r134;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r136, {%rs71, %rs137};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs157, %rs158}, %r136;
+	cvt.f32.bf16 	%r137, %rs157;
+	cvt.f32.bf16 	%r138, %rs158;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r139, %r95, %r138;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r140, {%rs72, %rs138};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs159, %rs160}, %r140;
+	cvt.f32.bf16 	%r141, %rs159;
+	cvt.f32.bf16 	%r142, %rs160;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r143, %r91, %r142;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r144, {%rs73, %rs139};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs161, %rs162}, %r144;
+	cvt.f32.bf16 	%r145, %rs161;
+	cvt.f32.bf16 	%r146, %rs162;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r147, %r87, %r146;
+	.loc	1 33 46                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:46
+	mov.b32 	%r148, {%rs74, %rs140};
+	.loc	1 33 86                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:33:86
+	mov.b32 	{%rs163, %rs164}, %r148;
+	cvt.f32.bf16 	%r149, %rs163;
+	cvt.f32.bf16 	%r150, %rs164;
+	.loc	1 34 18                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:34:18
+	mul.f32 	%r151, %r83, %r150;
+	.loc	1 55 19                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:55:19
+	fma.rn.f32 	%r152, %r119, %r121, %r123;
+	fma.rn.f32 	%r153, %r118, %r125, %r127;
+	fma.rn.f32 	%r154, %r117, %r129, %r131;
+	fma.rn.f32 	%r155, %r116, %r133, %r135;
+	fma.rn.f32 	%r156, %r115, %r137, %r139;
+	fma.rn.f32 	%r157, %r114, %r141, %r143;
+	fma.rn.f32 	%r158, %r113, %r145, %r147;
+	fma.rn.f32 	%r159, %r112, %r149, %r151;
+	.loc	1 56 25                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:56:25
+	add.s64 	%rd436, %rd102, %rd196;
+	add.s64 	%rd437, %rd102, %rd197;
+	add.s64 	%rd438, %rd102, %rd198;
+	add.s64 	%rd439, %rd102, %rd199;
+	add.s64 	%rd440, %rd102, %rd200;
+	add.s64 	%rd441, %rd102, %rd201;
+	add.s64 	%rd442, %rd102, %rd202;
+	add.s64 	%rd443, %rd102, %rd203;
+	.loc	1 56 37                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:56:37
+	cvt.rn.bf16.f32 	%rs141, %r152;
+	cvt.rn.bf16.f32 	%rs142, %r153;
+	cvt.rn.bf16.f32 	%rs143, %r154;
+	cvt.rn.bf16.f32 	%rs144, %r155;
+	cvt.rn.bf16.f32 	%rs145, %r156;
+	cvt.rn.bf16.f32 	%rs146, %r157;
+	cvt.rn.bf16.f32 	%rs147, %r158;
+	cvt.rn.bf16.f32 	%rs148, %r159;
+	// begin inline asm
+	@%p1 st.global.b16 [ %rd436 + 0 ], { %rs141 };
+	// end inline asm
+	// begin inline asm
+	@%p2 st.global.b16 [ %rd437 + 0 ], { %rs142 };
+	// end inline asm
+	// begin inline asm
+	@%p3 st.global.b16 [ %rd438 + 0 ], { %rs143 };
+	// end inline asm
+	// begin inline asm
+	@%p4 st.global.b16 [ %rd439 + 0 ], { %rs144 };
+	// end inline asm
+	// begin inline asm
+	@%p5 st.global.b16 [ %rd440 + 0 ], { %rs145 };
+	// end inline asm
+	// begin inline asm
+	@%p6 st.global.b16 [ %rd441 + 0 ], { %rs146 };
+	// end inline asm
+	// begin inline asm
+	@%p7 st.global.b16 [ %rd442 + 0 ], { %rs147 };
+	// end inline asm
+	// begin inline asm
+	@%p8 st.global.b16 [ %rd443 + 0 ], { %rs148 };
+	// end inline asm
+	.loc	1 56 4                          // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:56:4
+	ret;
+$L__BB0_49:
+	.loc	1 32 62                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:32:62
+	{ // callseq 0, 0
+	.param .b64 	param0;
+	.param .b64 	param1;
+	.param .b32 	param2;
+	.param .b64 	param3;
+	.param .b64 	param4;
+	mov.b64 	%rd228, assertFunc_0;
+	cvta.global.u64 	%rd229, %rd228;
+	st.param.b64 	[param3], %rd229;
+	mov.b64 	%rd230, assertFile_0;
+	cvta.global.u64 	%rd231, %rd230;
+	st.param.b64 	[param1], %rd231;
+	mov.b64 	%rd232, assertMessage_0;
+	cvta.global.u64 	%rd233, %rd232;
+	st.param.b64 	[param0], %rd233;
+	st.param.b64 	[param4], 1;
+	st.param.b32 	[param2], 32;
+	call.uni __assertfail, (param0, param1, param2, param3, param4);
+	} // callseq 0
+	trap;
+$L__BB0_51:
+	.loc	1 52 64                         // cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py:52:64
+	{ // callseq 1, 0
+	.param .b64 	param0;
+	.param .b64 	param1;
+	.param .b32 	param2;
+	.param .b64 	param3;
+	.param .b64 	param4;
+	mov.b64 	%rd406, assertFunc_1;
+	cvta.global.u64 	%rd407, %rd406;
+	st.param.b64 	[param3], %rd407;
+	mov.b64 	%rd408, assertFile_1;
+	cvta.global.u64 	%rd409, %rd408;
+	st.param.b64 	[param1], %rd409;
+	mov.b64 	%rd410, assertMessage_1;
+	cvta.global.u64 	%rd411, %rd410;
+	st.param.b64 	[param0], %rd411;
+	st.param.b64 	[param4], 1;
+	st.param.b32 	[param2], 52;
+	call.uni __assertfail, (param0, param1, param2, param3, param4);
+	} // callseq 1
+	trap;
+$L__tmp1:
+$L__func_end0:
+                                        // -- End function
+}
+	.file	1 "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py"
+	.section	.debug_abbrev
+	{
+.b8 1                                   // Abbreviation Code
+.b8 17                                  // DW_TAG_compile_unit
+.b8 0                                   // DW_CHILDREN_no
+.b8 37                                  // DW_AT_producer
+.b8 8                                   // DW_FORM_string
+.b8 19                                  // DW_AT_language
+.b8 5                                   // DW_FORM_data2
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 16                                  // DW_AT_stmt_list
+.b8 6                                   // DW_FORM_data4
+.b8 27                                  // DW_AT_comp_dir
+.b8 8                                   // DW_FORM_string
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 0                                   // EOM(3)
+	}
+	.section	.debug_info
+	{
+.b32 135                                // Length of Unit
+.b8 2                                   // DWARF version number
+.b8 0
+.b32 .debug_abbrev                      // Offset Into Abbrev. Section
+.b8 8                                   // Address Size (in bytes)
+.b8 1                                   // Abbrev [1] 0xb:0x80 DW_TAG_compile_unit
+.b8 116                                 // DW_AT_producer
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 0
+.b8 2                                   // DW_AT_language
+.b8 0
+.b8 99                                  // DW_AT_name
+.b8 118
+.b8 121
+.b8 111
+.b8 113
+.b8 103
+.b8 55
+.b8 106
+.b8 122
+.b8 101
+.b8 97
+.b8 100
+.b8 97
+.b8 114
+.b8 114
+.b8 103
+.b8 103
+.b8 103
+.b8 120
+.b8 104
+.b8 115
+.b8 50
+.b8 100
+.b8 106
+.b8 109
+.b8 120
+.b8 117
+.b8 103
+.b8 120
+.b8 121
+.b8 118
+.b8 106
+.b8 104
+.b8 105
+.b8 109
+.b8 55
+.b8 50
+.b8 118
+.b8 102
+.b8 115
+.b8 116
+.b8 113
+.b8 99
+.b8 52
+.b8 105
+.b8 111
+.b8 52
+.b8 113
+.b8 115
+.b8 101
+.b8 99
+.b8 106
+.b8 46
+.b8 112
+.b8 121
+.b8 0
+.b32 .debug_line                        // DW_AT_stmt_list
+.b8 47                                  // DW_AT_comp_dir
+.b8 119
+.b8 111
+.b8 114
+.b8 107
+.b8 115
+.b8 112
+.b8 97
+.b8 99
+.b8 101
+.b8 47
+.b8 104
+.b8 97
+.b8 110
+.b8 114
+.b8 117
+.b8 105
+.b8 47
+.b8 83
+.b8 112
+.b8 101
+.b8 99
+.b8 70
+.b8 111
+.b8 114
+.b8 103
+.b8 101
+.b8 45
+.b8 101
+.b8 120
+.b8 116
+.b8 47
+.b8 99
+.b8 97
+.b8 99
+.b8 104
+.b8 101
+.b8 47
+.b8 99
+.b8 111
+.b8 109
+.b8 112
+.b8 105
+.b8 108
+.b8 101
+.b8 100
+.b8 95
+.b8 107
+.b8 101
+.b8 114
+.b8 110
+.b8 101
+.b8 108
+.b8 115
+.b8 47
+.b8 118
+.b8 121
+.b8 0
+	}
+	.section	.debug_macinfo	{	}

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.source ADDED

	@@ -0,0 +1,299 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":18:0)
+#loc78 = loc("in_ptr0"(#loc))
+#loc79 = loc("in_ptr1"(#loc))
+#loc80 = loc("in_ptr2"(#loc))
+#loc81 = loc("in_ptr3"(#loc))
+#loc82 = loc("out_ptr0"(#loc))
+#loc83 = loc("ks0"(#loc))
+#loc84 = loc("ks1"(#loc))
+#loc85 = loc("ks2"(#loc))
+#loc86 = loc("ks3"(#loc))
+#loc87 = loc("ks4"(#loc))
+#loc88 = loc("xnumel"(#loc))
+module {
+  tt.func public @triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0(%in_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %in_ptr1: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr1"(#loc)), %in_ptr2: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr2"(#loc)), %in_ptr3: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr3"(#loc)), %out_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("out_ptr0"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %ks2: i64 loc("ks2"(#loc)), %ks3: i64 loc("ks3"(#loc)), %ks4: i64 loc("ks4"(#loc)), %xnumel: i32 loc("xnumel"(#loc))) attributes {noinline = false} {
+    %xoffset = tt.get_program_id x : i32 loc(#loc89)
+    %xoffset_0 = arith.constant 1024 : i32 loc(#loc90)
+    %xoffset_1 = arith.constant 1024 : i32 loc(#loc90)
+    %xoffset_2 = arith.muli %xoffset, %xoffset_1 : i32 loc(#loc90)
+    %xindex = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> loc(#loc91)
+    %xindex_3 = tt.splat %xoffset_2 : i32 -> tensor<1024xi32> loc(#loc92)
+    %xindex_4 = arith.addi %xindex_3, %xindex : tensor<1024xi32> loc(#loc92)
+    %xmask = tt.splat %xnumel : i32 -> tensor<1024xi32> loc(#loc93)
+    %xmask_5 = arith.cmpi slt, %xindex_4, %xmask : tensor<1024xi32> loc(#loc93)
+    %x2 = arith.extsi %xindex_4 : tensor<1024xi32> to tensor<1024xi64> loc(#loc94)
+    %x2_6 = tt.splat %ks0 : i64 -> tensor<1024xi64> loc(#loc94)
+    %x2_7 = arith.divsi %x2, %x2_6 : tensor<1024xi64> loc(#loc94)
+    %x2_8 = tt.splat %ks1 : i64 -> tensor<1024xi64> loc(#loc95)
+    %x2_9 = arith.remsi %x2_7, %x2_8 : tensor<1024xi64> loc(#loc95)
+    %x0 = arith.extsi %xindex_4 : tensor<1024xi32> to tensor<1024xi64> loc(#loc96)
+    %x0_10 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc96)
+    %x0_11 = arith.remsi %x0, %x0_10 : tensor<1024xi64> loc(#loc96)
+    %x5 = arith.extsi %xindex_4 : tensor<1024xi32> to tensor<1024xi64> loc(#loc97)
+    %x5_12 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc97)
+    %x5_13 = arith.divsi %x5, %x5_12 : tensor<1024xi64> loc(#loc97)
+    %tmp0 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc98)
+    %tmp0_14 = tt.addptr %tmp0, %xindex_4 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi32> loc(#loc98)
+    %tmp0_15 = tt.load %tmp0_14, %xmask_5 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc99)
+    %tmp0_16 = arith.extf %tmp0_15 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc100)
+    %tmp1 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1024x!tt.ptr<i64>> loc(#loc101)
+    %tmp1_17 = tt.addptr %tmp1, %x2_9 : tensor<1024x!tt.ptr<i64>>, tensor<1024xi64> loc(#loc101)
+    %tmp1_18 = tt.load %tmp1_17, %xmask_5 evictionPolicy = evict_last : tensor<1024x!tt.ptr<i64>> loc(#loc102)
+    %tmp3 = tt.splat %ks2 : i64 -> tensor<1024xi64> loc(#loc103)
+    %tmp3_19 = arith.addi %tmp1_18, %tmp3 : tensor<1024xi64> loc(#loc103)
+    %tmp4 = arith.constant 0 : i32 loc(#loc104)
+    %tmp4_20 = arith.extsi %tmp4 : i32 to i64 loc(#loc104)
+    %tmp4_21 = tt.splat %tmp4_20 : i64 -> tensor<1024xi64> loc(#loc104)
+    %tmp4_22 = arith.cmpi slt, %tmp1_18, %tmp4_21 : tensor<1024xi64> loc(#loc104)
+    %tmp5 = arith.select %tmp4_22, %tmp3_19, %tmp1_18 : tensor<1024xi1>, tensor<1024xi64> loc(#loc105)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc18)
+    %0 = arith.extsi %c0_i32 : i32 to i64 loc(#loc18)
+    %1 = tt.splat %0 : i64 -> tensor<1024xi64> loc(#loc18)
+    %2 = arith.cmpi sle, %1, %tmp5 : tensor<1024xi64> loc(#loc18)
+    %3 = tt.splat %ks2 : i64 -> tensor<1024xi64> loc(#loc19)
+    %4 = arith.cmpi slt, %tmp5, %3 : tensor<1024xi64> loc(#loc19)
+    %5 = arith.andi %2, %4 : tensor<1024xi1> loc(#loc20)
+    %true = arith.constant true loc(#loc21)
+    %cst = arith.constant dense<true> : tensor<1024xi1> loc(#loc21)
+    %6 = arith.xori %xmask_5, %cst : tensor<1024xi1> loc(#loc21)
+    %7 = arith.ori %5, %6 : tensor<1024xi1> loc(#loc22)
+    tt.assert %7, "index out of bounds: 0 <= tmp5 < ks2" : tensor<1024xi1> loc(#loc23)
+    %tmp7 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc106)
+    %tmp7_23 = arith.muli %tmp7, %tmp5 : tensor<1024xi64> loc(#loc106)
+    %tmp7_24 = arith.addi %x0_11, %tmp7_23 : tensor<1024xi64> loc(#loc107)
+    %tmp7_25 = tt.splat %in_ptr2 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc108)
+    %tmp7_26 = tt.addptr %tmp7_25, %tmp7_24 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc108)
+    %tmp7_27 = tt.load %tmp7_26, %xmask_5 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc109)
+    %tmp7_28 = arith.extf %tmp7_27 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc110)
+    %tmp8 = arith.mulf %tmp0_16, %tmp7_28 : tensor<1024xf32> loc(#loc111)
+    %tmp10 = arith.constant 0 : i64 loc(#loc112)
+    %tmp10_29 = arith.constant dense<0> : tensor<1xi64> loc(#loc112)
+    %tmp11 = arith.constant dense<0> : tensor<1024xi64> loc(#loc113)
+    %tmp11_30 = arith.cmpi sge, %x0_11, %tmp11 : tensor<1024xi64> loc(#loc113)
+    %tmp12 = arith.constant 2 : i32 loc(#loc114)
+    %tmp12_31 = arith.constant 2 : i64 loc(#loc114)
+    %tmp12_32 = arith.divsi %ks3, %tmp12_31 : i64 loc(#loc114)
+    %tmp12_33 = arith.constant -1 : i32 loc(#loc115)
+    %tmp12_34 = arith.constant -1 : i64 loc(#loc115)
+    %tmp12_35 = arith.muli %tmp12_34, %tmp12_32 : i64 loc(#loc115)
+    %tmp12_36 = arith.addi %ks3, %tmp12_35 : i64 loc(#loc116)
+    %tmp13 = tt.splat %tmp12_36 : i64 -> tensor<1024xi64> loc(#loc117)
+    %tmp13_37 = arith.cmpi slt, %x0_11, %tmp13 : tensor<1024xi64> loc(#loc117)
+    %tmp14 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc118)
+    %tmp14_38 = arith.muli %tmp14, %x5_13 : tensor<1024xi64> loc(#loc118)
+    %tmp14_39 = arith.constant 2 : i32 loc(#loc119)
+    %tmp14_40 = arith.constant 2 : i64 loc(#loc119)
+    %tmp14_41 = arith.divsi %ks3, %tmp14_40 : i64 loc(#loc119)
+    %tmp14_42 = tt.splat %tmp14_41 : i64 -> tensor<1024xi64> loc(#loc120)
+    %tmp14_43 = arith.addi %tmp14_38, %tmp14_42 : tensor<1024xi64> loc(#loc120)
+    %tmp14_44 = arith.addi %tmp14_43, %x0_11 : tensor<1024xi64> loc(#loc121)
+    %tmp14_45 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc122)
+    %tmp14_46 = tt.addptr %tmp14_45, %tmp14_44 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc122)
+    %tmp14_47 = arith.andi %tmp13_37, %xmask_5 : tensor<1024xi1> loc(#loc123)
+    %tmp14_48 = arith.constant 0.000000e+00 : f32 loc(#loc124)
+    %tmp14_49 = arith.constant dense<0.000000e+00> : tensor<1024xf32> loc(#loc124)
+    %tmp14_50 = arith.truncf %tmp14_49 : tensor<1024xf32> to tensor<1024xbf16> loc(#loc124)
+    %tmp14_51 = tt.load %tmp14_46, %tmp14_47, %tmp14_50 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc124)
+    %tmp14_52 = arith.extf %tmp14_51 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc125)
+    %tmp15 = arith.constant 0.000000e+00 : f32 loc(#loc126)
+    %tmp15_53 = arith.constant dense<0.000000e+00> : tensor<1024xf32> loc(#loc126)
+    %tmp15_54 = arith.subf %tmp15_53, %tmp14_52 : tensor<1024xf32> loc(#loc126)
+    %tmp16 = arith.constant 0.000000e+00 : f32 loc(#loc127)
+    %tmp16_55 = arith.constant dense<0.000000e+00> : tensor<1024xf32> loc(#loc127)
+    %tmp17 = arith.select %tmp13_37, %tmp15_54, %tmp16_55 : tensor<1024xi1>, tensor<1024xf32> loc(#loc128)
+    %tmp18 = tt.splat %tmp12_36 : i64 -> tensor<1024xi64> loc(#loc129)
+    %tmp18_56 = arith.cmpi sge, %x0_11, %tmp18 : tensor<1024xi64> loc(#loc129)
+    %tmp20 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc130)
+    %tmp20_57 = arith.cmpi slt, %x0_11, %tmp20 : tensor<1024xi64> loc(#loc130)
+    %tmp21 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc131)
+    %tmp21_58 = arith.muli %tmp21, %x5_13 : tensor<1024xi64> loc(#loc131)
+    %tmp21_59 = arith.constant -1 : i32 loc(#loc132)
+    %tmp21_60 = arith.constant -1 : i64 loc(#loc132)
+    %tmp21_61 = arith.muli %tmp21_60, %ks3 : i64 loc(#loc132)
+    %tmp21_62 = tt.splat %tmp21_61 : i64 -> tensor<1024xi64> loc(#loc133)
+    %tmp21_63 = arith.addi %x0_11, %tmp21_62 : tensor<1024xi64> loc(#loc133)
+    %tmp21_64 = arith.constant 2 : i32 loc(#loc134)
+    %tmp21_65 = arith.constant 2 : i64 loc(#loc134)
+    %tmp21_66 = arith.divsi %ks3, %tmp21_65 : i64 loc(#loc134)
+    %tmp21_67 = tt.splat %tmp21_66 : i64 -> tensor<1024xi64> loc(#loc135)
+    %tmp21_68 = arith.addi %tmp21_63, %tmp21_67 : tensor<1024xi64> loc(#loc135)
+    %tmp21_69 = arith.addi %tmp21_58, %tmp21_68 : tensor<1024xi64> loc(#loc136)
+    %tmp21_70 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc137)
+    %tmp21_71 = tt.addptr %tmp21_70, %tmp21_69 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc137)
+    %tmp21_72 = arith.andi %tmp18_56, %xmask_5 : tensor<1024xi1> loc(#loc138)
+    %tmp21_73 = arith.constant 0.000000e+00 : f32 loc(#loc139)
+    %tmp21_74 = arith.constant dense<0.000000e+00> : tensor<1024xf32> loc(#loc139)
+    %tmp21_75 = arith.truncf %tmp21_74 : tensor<1024xf32> to tensor<1024xbf16> loc(#loc139)
+    %tmp21_76 = tt.load %tmp21_71, %tmp21_72, %tmp21_75 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc139)
+    %tmp21_77 = arith.extf %tmp21_76 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc140)
+    %tmp22 = arith.select %tmp13_37, %tmp17, %tmp21_77 : tensor<1024xi1>, tensor<1024xf32> loc(#loc141)
+    %tmp24 = tt.splat %ks4 : i64 -> tensor<1024xi64> loc(#loc142)
+    %tmp24_78 = arith.addi %tmp1_18, %tmp24 : tensor<1024xi64> loc(#loc142)
+    %tmp25 = arith.select %tmp4_22, %tmp24_78, %tmp1_18 : tensor<1024xi1>, tensor<1024xi64> loc(#loc143)
+    %c0_i32_79 = arith.constant 0 : i32 loc(#loc62)
+    %8 = arith.extsi %c0_i32_79 : i32 to i64 loc(#loc62)
+    %9 = tt.splat %8 : i64 -> tensor<1024xi64> loc(#loc62)
+    %10 = arith.cmpi sle, %9, %tmp25 : tensor<1024xi64> loc(#loc62)
+    %11 = tt.splat %ks4 : i64 -> tensor<1024xi64> loc(#loc63)
+    %12 = arith.cmpi slt, %tmp25, %11 : tensor<1024xi64> loc(#loc63)
+    %13 = arith.andi %10, %12 : tensor<1024xi1> loc(#loc64)
+    %true_80 = arith.constant true loc(#loc65)
+    %cst_81 = arith.constant dense<true> : tensor<1024xi1> loc(#loc65)
+    %14 = arith.xori %xmask_5, %cst_81 : tensor<1024xi1> loc(#loc65)
+    %15 = arith.ori %13, %14 : tensor<1024xi1> loc(#loc66)
+    tt.assert %15, "index out of bounds: 0 <= tmp25 < ks4" : tensor<1024xi1> loc(#loc67)
+    %tmp27 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc144)
+    %tmp27_82 = arith.muli %tmp27, %tmp25 : tensor<1024xi64> loc(#loc144)
+    %tmp27_83 = arith.addi %x0_11, %tmp27_82 : tensor<1024xi64> loc(#loc145)
+    %tmp27_84 = tt.splat %in_ptr3 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc146)
+    %tmp27_85 = tt.addptr %tmp27_84, %tmp27_83 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc146)
+    %tmp27_86 = tt.load %tmp27_85, %xmask_5 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc147)
+    %tmp27_87 = arith.extf %tmp27_86 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc148)
+    %tmp28 = arith.mulf %tmp22, %tmp27_87 : tensor<1024xf32> loc(#loc149)
+    %tmp29 = arith.addf %tmp8, %tmp28 : tensor<1024xf32> loc(#loc150)
+    %16 = tt.splat %out_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc75)
+    %17 = tt.addptr %16, %xindex_4 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi32> loc(#loc75)
+    %18 = arith.truncf %tmp29 : tensor<1024xf32> to tensor<1024xbf16> loc(#loc76)
+    tt.store %17, %18, %xmask_5 : tensor<1024x!tt.ptr<bf16>> loc(#loc76)
+    tt.return loc(#loc77)
+  } loc(#loc)
+} loc(#loc)
+#loc1 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":19:28)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":19:33)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":20:36)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":20:23)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":21:21)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":23:21)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":23:28)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":24:19)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":25:19)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:30)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:35)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:75)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":27:30)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":27:35)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":29:18)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":30:18)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":31:32)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:28)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:44)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:37)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:54)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:52)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:62)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:39)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:35)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:30)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:46)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:86)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":34:18)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":36:28)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":37:20)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:31)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:24)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:18)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":39:19)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:35)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:48)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:41)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:54)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:31)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:68)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:60)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:119)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":41:13)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":42:38)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":43:35)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":44:20)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":46:19)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:35)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:52)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:47)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:67)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:60)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:41)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:31)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:81)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:73)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:132)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":48:35)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":50:19)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":51:34)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:28)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:46)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:38)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:56)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:54)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:64)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:40)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:36)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:31)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:48)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:88)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":54:20)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":55:19)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:25)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:37)
+#loc77 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:4)
+#loc89 = loc("xoffset"(#loc1))
+#loc90 = loc("xoffset"(#loc2))
+#loc91 = loc("xindex"(#loc3))
+#loc92 = loc("xindex"(#loc4))
+#loc93 = loc("xmask"(#loc5))
+#loc94 = loc("x2"(#loc6))
+#loc95 = loc("x2"(#loc7))
+#loc96 = loc("x0"(#loc8))
+#loc97 = loc("x5"(#loc9))
+#loc98 = loc("tmp0"(#loc10))
+#loc99 = loc("tmp0"(#loc11))
+#loc100 = loc("tmp0"(#loc12))
+#loc101 = loc("tmp1"(#loc13))
+#loc102 = loc("tmp1"(#loc14))
+#loc103 = loc("tmp3"(#loc15))
+#loc104 = loc("tmp4"(#loc16))
+#loc105 = loc("tmp5"(#loc17))
+#loc106 = loc("tmp7"(#loc24))
+#loc107 = loc("tmp7"(#loc25))
+#loc108 = loc("tmp7"(#loc26))
+#loc109 = loc("tmp7"(#loc27))
+#loc110 = loc("tmp7"(#loc28))
+#loc111 = loc("tmp8"(#loc29))
+#loc112 = loc("tmp10"(#loc30))
+#loc113 = loc("tmp11"(#loc31))
+#loc114 = loc("tmp12"(#loc32))
+#loc115 = loc("tmp12"(#loc33))
+#loc116 = loc("tmp12"(#loc34))
+#loc117 = loc("tmp13"(#loc35))
+#loc118 = loc("tmp14"(#loc36))
+#loc119 = loc("tmp14"(#loc37))
+#loc120 = loc("tmp14"(#loc38))
+#loc121 = loc("tmp14"(#loc39))
+#loc122 = loc("tmp14"(#loc40))
+#loc123 = loc("tmp14"(#loc41))
+#loc124 = loc("tmp14"(#loc42))
+#loc125 = loc("tmp14"(#loc43))
+#loc126 = loc("tmp15"(#loc44))
+#loc127 = loc("tmp16"(#loc45))
+#loc128 = loc("tmp17"(#loc46))
+#loc129 = loc("tmp18"(#loc47))
+#loc130 = loc("tmp20"(#loc48))
+#loc131 = loc("tmp21"(#loc49))
+#loc132 = loc("tmp21"(#loc50))
+#loc133 = loc("tmp21"(#loc51))
+#loc134 = loc("tmp21"(#loc52))
+#loc135 = loc("tmp21"(#loc53))
+#loc136 = loc("tmp21"(#loc54))
+#loc137 = loc("tmp21"(#loc55))
+#loc138 = loc("tmp21"(#loc56))
+#loc139 = loc("tmp21"(#loc57))
+#loc140 = loc("tmp21"(#loc58))
+#loc141 = loc("tmp22"(#loc59))
+#loc142 = loc("tmp24"(#loc60))
+#loc143 = loc("tmp25"(#loc61))
+#loc144 = loc("tmp27"(#loc68))
+#loc145 = loc("tmp27"(#loc69))
+#loc146 = loc("tmp27"(#loc70))
+#loc147 = loc("tmp27"(#loc71))
+#loc148 = loc("tmp27"(#loc72))
+#loc149 = loc("tmp28"(#loc73))
+#loc150 = loc("tmp29"(#loc74))

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttgir ADDED

	@@ -0,0 +1,232 @@

+#blocked = #ttg.blocked<{sizePerThread = [8], threadsPerWarp = [32], warpsPerCTA = [4], order = [0]}>
+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":18:0)
+#loc70 = loc("in_ptr0"(#loc))
+#loc71 = loc("in_ptr1"(#loc))
+#loc72 = loc("in_ptr2"(#loc))
+#loc73 = loc("in_ptr3"(#loc))
+#loc74 = loc("out_ptr0"(#loc))
+#loc75 = loc("ks0"(#loc))
+#loc76 = loc("ks1"(#loc))
+#loc77 = loc("ks2"(#loc))
+#loc78 = loc("ks3"(#loc))
+#loc79 = loc("ks4"(#loc))
+#loc80 = loc("xnumel"(#loc))
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "cuda:90", "ttg.threads-per-warp" = 32 : i32} {
+  tt.func public @triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0(%in_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %in_ptr1: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr1"(#loc)), %in_ptr2: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr2"(#loc)), %in_ptr3: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr3"(#loc)), %out_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("out_ptr0"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %ks2: i64 loc("ks2"(#loc)), %ks3: i64 loc("ks3"(#loc)), %ks4: i64 loc("ks4"(#loc)), %xnumel: i32 loc("xnumel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<true> : tensor<1024xi1, #blocked> loc(#loc1)
+    %c1024_i32 = arith.constant 1024 : i32 loc(#loc1)
+    %cst_0 = arith.constant dense<0.000000e+00> : tensor<1024xbf16, #blocked> loc(#loc1)
+    %c2_i64 = arith.constant 2 : i64 loc(#loc1)
+    %c-1_i64 = arith.constant -1 : i64 loc(#loc1)
+    %cst_1 = arith.constant dense<0> : tensor<1024xi64, #blocked> loc(#loc1)
+    %cst_2 = arith.constant dense<0.000000e+00> : tensor<1024xf32, #blocked> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc81)
+    %xoffset_3 = arith.muli %xoffset, %c1024_i32 : i32 loc(#loc82)
+    %xindex = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #blocked> loc(#loc83)
+    %xindex_4 = tt.splat %xoffset_3 : i32 -> tensor<1024xi32, #blocked> loc(#loc84)
+    %xindex_5 = arith.addi %xindex_4, %xindex : tensor<1024xi32, #blocked> loc(#loc84)
+    %xmask = tt.splat %xnumel : i32 -> tensor<1024xi32, #blocked> loc(#loc85)
+    %xmask_6 = arith.cmpi slt, %xindex_5, %xmask : tensor<1024xi32, #blocked> loc(#loc85)
+    %x2 = arith.extsi %xindex_5 : tensor<1024xi32, #blocked> to tensor<1024xi64, #blocked> loc(#loc86)
+    %x2_7 = tt.splat %ks0 : i64 -> tensor<1024xi64, #blocked> loc(#loc86)
+    %x2_8 = arith.divsi %x2, %x2_7 : tensor<1024xi64, #blocked> loc(#loc86)
+    %x2_9 = tt.splat %ks1 : i64 -> tensor<1024xi64, #blocked> loc(#loc87)
+    %x2_10 = arith.remsi %x2_8, %x2_9 : tensor<1024xi64, #blocked> loc(#loc87)
+    %x0 = tt.splat %ks3 : i64 -> tensor<1024xi64, #blocked> loc(#loc88)
+    %x0_11 = arith.remsi %x2, %x0 : tensor<1024xi64, #blocked> loc(#loc88)
+    %x5 = arith.divsi %x2, %x0 : tensor<1024xi64, #blocked> loc(#loc89)
+    %tmp0 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc90)
+    %tmp0_12 = tt.addptr %tmp0, %xindex_5 : tensor<1024x!tt.ptr<bf16>, #blocked>, tensor<1024xi32, #blocked> loc(#loc90)
+    %tmp0_13 = tt.load %tmp0_12, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc91)
+    %tmp0_14 = arith.extf %tmp0_13 : tensor<1024xbf16, #blocked> to tensor<1024xf32, #blocked> loc(#loc92)
+    %tmp1 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1024x!tt.ptr<i64>, #blocked> loc(#loc93)
+    %tmp1_15 = tt.addptr %tmp1, %x2_10 : tensor<1024x!tt.ptr<i64>, #blocked>, tensor<1024xi64, #blocked> loc(#loc93)
+    %tmp1_16 = tt.load %tmp1_15, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<i64>, #blocked> loc(#loc94)
+    %tmp3 = tt.splat %ks2 : i64 -> tensor<1024xi64, #blocked> loc(#loc95)
+    %tmp3_17 = arith.addi %tmp1_16, %tmp3 : tensor<1024xi64, #blocked> loc(#loc95)
+    %tmp4 = arith.cmpi slt, %tmp1_16, %cst_1 : tensor<1024xi64, #blocked> loc(#loc96)
+    %tmp5 = arith.select %tmp4, %tmp3_17, %tmp1_16 : tensor<1024xi1, #blocked>, tensor<1024xi64, #blocked> loc(#loc97)
+    %0 = arith.cmpi sge, %tmp5, %cst_1 : tensor<1024xi64, #blocked> loc(#loc19)
+    %1 = arith.cmpi slt, %tmp5, %tmp3 : tensor<1024xi64, #blocked> loc(#loc20)
+    %2 = arith.andi %0, %1 : tensor<1024xi1, #blocked> loc(#loc21)
+    %3 = arith.xori %xmask_6, %cst : tensor<1024xi1, #blocked> loc(#loc22)
+    %4 = arith.ori %2, %3 : tensor<1024xi1, #blocked> loc(#loc23)
+    tt.assert %4, "index out of bounds: 0 <= tmp5 < ks2" : tensor<1024xi1, #blocked> loc(#loc24)
+    %tmp7 = arith.muli %x0, %tmp5 : tensor<1024xi64, #blocked> loc(#loc98)
+    %tmp7_18 = arith.addi %x0_11, %tmp7 : tensor<1024xi64, #blocked> loc(#loc99)
+    %tmp7_19 = tt.splat %in_ptr2 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc100)
+    %tmp7_20 = tt.addptr %tmp7_19, %tmp7_18 : tensor<1024x!tt.ptr<bf16>, #blocked>, tensor<1024xi64, #blocked> loc(#loc100)
+    %tmp7_21 = tt.load %tmp7_20, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc101)
+    %tmp7_22 = arith.extf %tmp7_21 : tensor<1024xbf16, #blocked> to tensor<1024xf32, #blocked> loc(#loc102)
+    %tmp8 = arith.mulf %tmp0_14, %tmp7_22 : tensor<1024xf32, #blocked> loc(#loc103)
+    %tmp12 = arith.divsi %ks3, %c2_i64 : i64 loc(#loc104)
+    %tmp12_23 = arith.subi %ks3, %tmp12 : i64 loc(#loc105)
+    %tmp13 = tt.splat %tmp12_23 : i64 -> tensor<1024xi64, #blocked> loc(#loc106)
+    %tmp13_24 = arith.cmpi slt, %x0_11, %tmp13 : tensor<1024xi64, #blocked> loc(#loc106)
+    %tmp14 = arith.muli %x0, %x5 : tensor<1024xi64, #blocked> loc(#loc107)
+    %tmp14_25 = tt.splat %tmp12 : i64 -> tensor<1024xi64, #blocked> loc(#loc108)
+    %tmp14_26 = arith.addi %tmp14, %tmp14_25 : tensor<1024xi64, #blocked> loc(#loc108)
+    %tmp14_27 = arith.addi %tmp14_26, %x0_11 : tensor<1024xi64, #blocked> loc(#loc109)
+    %tmp14_28 = tt.addptr %tmp0, %tmp14_27 : tensor<1024x!tt.ptr<bf16>, #blocked>, tensor<1024xi64, #blocked> loc(#loc110)
+    %tmp14_29 = arith.andi %tmp13_24, %xmask_6 : tensor<1024xi1, #blocked> loc(#loc111)
+    %tmp14_30 = tt.load %tmp14_28, %tmp14_29, %cst_0 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc112)
+    %tmp14_31 = arith.extf %tmp14_30 : tensor<1024xbf16, #blocked> to tensor<1024xf32, #blocked> loc(#loc113)
+    %tmp15 = arith.subf %cst_2, %tmp14_31 : tensor<1024xf32, #blocked> loc(#loc114)
+    %tmp18 = arith.cmpi sge, %x0_11, %tmp13 : tensor<1024xi64, #blocked> loc(#loc115)
+    %tmp21 = arith.muli %ks3, %c-1_i64 : i64 loc(#loc116)
+    %tmp21_32 = tt.splat %tmp21 : i64 -> tensor<1024xi64, #blocked> loc(#loc117)
+    %tmp21_33 = arith.addi %x0_11, %tmp21_32 : tensor<1024xi64, #blocked> loc(#loc117)
+    %tmp21_34 = arith.addi %tmp21_33, %tmp14_25 : tensor<1024xi64, #blocked> loc(#loc118)
+    %tmp21_35 = arith.addi %tmp14, %tmp21_34 : tensor<1024xi64, #blocked> loc(#loc119)
+    %tmp21_36 = tt.addptr %tmp0, %tmp21_35 : tensor<1024x!tt.ptr<bf16>, #blocked>, tensor<1024xi64, #blocked> loc(#loc120)
+    %tmp21_37 = arith.andi %tmp18, %xmask_6 : tensor<1024xi1, #blocked> loc(#loc121)
+    %tmp21_38 = tt.load %tmp21_36, %tmp21_37, %cst_0 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc122)
+    %tmp21_39 = arith.extf %tmp21_38 : tensor<1024xbf16, #blocked> to tensor<1024xf32, #blocked> loc(#loc123)
+    %tmp22 = arith.select %tmp13_24, %tmp15, %tmp21_39 : tensor<1024xi1, #blocked>, tensor<1024xf32, #blocked> loc(#loc135)
+    %tmp24 = tt.splat %ks4 : i64 -> tensor<1024xi64, #blocked> loc(#loc126)
+    %tmp24_40 = arith.addi %tmp1_16, %tmp24 : tensor<1024xi64, #blocked> loc(#loc126)
+    %tmp25 = arith.select %tmp4, %tmp24_40, %tmp1_16 : tensor<1024xi1, #blocked>, tensor<1024xi64, #blocked> loc(#loc127)
+    %5 = arith.cmpi sge, %tmp25, %cst_1 : tensor<1024xi64, #blocked> loc(#loc55)
+    %6 = arith.cmpi slt, %tmp25, %tmp24 : tensor<1024xi64, #blocked> loc(#loc56)
+    %7 = arith.andi %5, %6 : tensor<1024xi1, #blocked> loc(#loc57)
+    %8 = arith.ori %7, %3 : tensor<1024xi1, #blocked> loc(#loc58)
+    tt.assert %8, "index out of bounds: 0 <= tmp25 < ks4" : tensor<1024xi1, #blocked> loc(#loc59)
+    %tmp27 = arith.muli %x0, %tmp25 : tensor<1024xi64, #blocked> loc(#loc128)
+    %tmp27_41 = arith.addi %x0_11, %tmp27 : tensor<1024xi64, #blocked> loc(#loc129)
+    %tmp27_42 = tt.splat %in_ptr3 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc130)
+    %tmp27_43 = tt.addptr %tmp27_42, %tmp27_41 : tensor<1024x!tt.ptr<bf16>, #blocked>, tensor<1024xi64, #blocked> loc(#loc130)
+    %tmp27_44 = tt.load %tmp27_43, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc131)
+    %tmp27_45 = arith.extf %tmp27_44 : tensor<1024xbf16, #blocked> to tensor<1024xf32, #blocked> loc(#loc132)
+    %tmp28 = arith.mulf %tmp22, %tmp27_45 : tensor<1024xf32, #blocked> loc(#loc133)
+    %tmp29 = arith.addf %tmp8, %tmp28 : tensor<1024xf32, #blocked> loc(#loc134)
+    %9 = tt.splat %out_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc67)
+    %10 = tt.addptr %9, %xindex_5 : tensor<1024x!tt.ptr<bf16>, #blocked>, tensor<1024xi32, #blocked> loc(#loc67)
+    %11 = arith.truncf %tmp29 : tensor<1024xf32, #blocked> to tensor<1024xbf16, #blocked> loc(#loc68)
+    tt.store %10, %11, %xmask_6 : tensor<1024x!tt.ptr<bf16>, #blocked> loc(#loc68)
+    tt.return loc(#loc69)
+  } loc(#loc)
+} loc(#loc)
+#loc1 = loc(unknown)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":19:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":19:33)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":20:36)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":20:23)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":21:21)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":23:21)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":23:28)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":24:19)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":25:19)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:30)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:35)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:75)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":27:30)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":27:35)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":29:18)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":30:18)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":31:32)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:28)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:44)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:37)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:54)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:52)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:62)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:39)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:35)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:30)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:46)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:86)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":34:18)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:31)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:18)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":39:19)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:35)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:41)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:54)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:31)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:68)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:60)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:119)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":41:13)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":44:20)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:52)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:47)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:60)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:41)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:31)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:81)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:73)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:132)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":48:35)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":43:35)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":50:19)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":51:34)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:28)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:46)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:38)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:54)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:64)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:40)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:36)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:31)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:48)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:88)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":54:20)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":55:19)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:25)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:37)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:4)
+#loc81 = loc("xoffset"(#loc2))
+#loc82 = loc("xoffset"(#loc3))
+#loc83 = loc("xindex"(#loc4))
+#loc84 = loc("xindex"(#loc5))
+#loc85 = loc("xmask"(#loc6))
+#loc86 = loc("x2"(#loc7))
+#loc87 = loc("x2"(#loc8))
+#loc88 = loc("x0"(#loc9))
+#loc89 = loc("x5"(#loc10))
+#loc90 = loc("tmp0"(#loc11))
+#loc91 = loc("tmp0"(#loc12))
+#loc92 = loc("tmp0"(#loc13))
+#loc93 = loc("tmp1"(#loc14))
+#loc94 = loc("tmp1"(#loc15))
+#loc95 = loc("tmp3"(#loc16))
+#loc96 = loc("tmp4"(#loc17))
+#loc97 = loc("tmp5"(#loc18))
+#loc98 = loc("tmp7"(#loc25))
+#loc99 = loc("tmp7"(#loc26))
+#loc100 = loc("tmp7"(#loc27))
+#loc101 = loc("tmp7"(#loc28))
+#loc102 = loc("tmp7"(#loc29))
+#loc103 = loc("tmp8"(#loc30))
+#loc104 = loc("tmp12"(#loc31))
+#loc105 = loc("tmp12"(#loc32))
+#loc106 = loc("tmp13"(#loc33))
+#loc107 = loc("tmp14"(#loc34))
+#loc108 = loc("tmp14"(#loc35))
+#loc109 = loc("tmp14"(#loc36))
+#loc110 = loc("tmp14"(#loc37))
+#loc111 = loc("tmp14"(#loc38))
+#loc112 = loc("tmp14"(#loc39))
+#loc113 = loc("tmp14"(#loc40))
+#loc114 = loc("tmp15"(#loc41))
+#loc115 = loc("tmp18"(#loc42))
+#loc116 = loc("tmp21"(#loc43))
+#loc117 = loc("tmp21"(#loc44))
+#loc118 = loc("tmp21"(#loc45))
+#loc119 = loc("tmp21"(#loc46))
+#loc120 = loc("tmp21"(#loc47))
+#loc121 = loc("tmp21"(#loc48))
+#loc122 = loc("tmp21"(#loc49))
+#loc123 = loc("tmp21"(#loc50))
+#loc124 = loc("tmp22"(#loc51))
+#loc125 = loc("tmp17"(#loc52))
+#loc126 = loc("tmp24"(#loc53))
+#loc127 = loc("tmp25"(#loc54))
+#loc128 = loc("tmp27"(#loc60))
+#loc129 = loc("tmp27"(#loc61))
+#loc130 = loc("tmp27"(#loc62))
+#loc131 = loc("tmp27"(#loc63))
+#loc132 = loc("tmp27"(#loc64))
+#loc133 = loc("tmp28"(#loc65))
+#loc134 = loc("tmp29"(#loc66))
+#loc135 = loc(fused[#loc124, #loc125])

SpecForge-ext/cache/compiled_kernels/triton/3/2TU6ZCF6AOXLWQQED5J7FS5ZXMYK7TIOQ6T2MLB767275BROXJCA/triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0.ttir ADDED

	@@ -0,0 +1,231 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":18:0)
+#loc70 = loc("in_ptr0"(#loc))
+#loc71 = loc("in_ptr1"(#loc))
+#loc72 = loc("in_ptr2"(#loc))
+#loc73 = loc("in_ptr3"(#loc))
+#loc74 = loc("out_ptr0"(#loc))
+#loc75 = loc("ks0"(#loc))
+#loc76 = loc("ks1"(#loc))
+#loc77 = loc("ks2"(#loc))
+#loc78 = loc("ks3"(#loc))
+#loc79 = loc("ks4"(#loc))
+#loc80 = loc("xnumel"(#loc))
+module {
+  tt.func public @triton_poi_fused_add_cat_index_mul_neg_slice_squeeze_unsqueeze_0(%in_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %in_ptr1: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr1"(#loc)), %in_ptr2: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr2"(#loc)), %in_ptr3: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr3"(#loc)), %out_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("out_ptr0"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %ks2: i64 loc("ks2"(#loc)), %ks3: i64 loc("ks3"(#loc)), %ks4: i64 loc("ks4"(#loc)), %xnumel: i32 loc("xnumel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<0.000000e+00> : tensor<1024xbf16> loc(#loc1)
+    %cst_0 = arith.constant dense<0> : tensor<1024xi64> loc(#loc1)
+    %cst_1 = arith.constant dense<0.000000e+00> : tensor<1024xf32> loc(#loc1)
+    %c-1_i64 = arith.constant -1 : i64 loc(#loc1)
+    %c2_i64 = arith.constant 2 : i64 loc(#loc1)
+    %cst_2 = arith.constant dense<true> : tensor<1024xi1> loc(#loc1)
+    %c1024_i32 = arith.constant 1024 : i32 loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc81)
+    %xoffset_3 = arith.muli %xoffset, %c1024_i32 : i32 loc(#loc82)
+    %xindex = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> loc(#loc83)
+    %xindex_4 = tt.splat %xoffset_3 : i32 -> tensor<1024xi32> loc(#loc84)
+    %xindex_5 = arith.addi %xindex_4, %xindex : tensor<1024xi32> loc(#loc84)
+    %xmask = tt.splat %xnumel : i32 -> tensor<1024xi32> loc(#loc85)
+    %xmask_6 = arith.cmpi slt, %xindex_5, %xmask : tensor<1024xi32> loc(#loc85)
+    %x2 = arith.extsi %xindex_5 : tensor<1024xi32> to tensor<1024xi64> loc(#loc86)
+    %x2_7 = tt.splat %ks0 : i64 -> tensor<1024xi64> loc(#loc86)
+    %x2_8 = arith.divsi %x2, %x2_7 : tensor<1024xi64> loc(#loc86)
+    %x2_9 = tt.splat %ks1 : i64 -> tensor<1024xi64> loc(#loc87)
+    %x2_10 = arith.remsi %x2_8, %x2_9 : tensor<1024xi64> loc(#loc87)
+    %x0 = tt.splat %ks3 : i64 -> tensor<1024xi64> loc(#loc88)
+    %x0_11 = arith.remsi %x2, %x0 : tensor<1024xi64> loc(#loc88)
+    %x5 = arith.divsi %x2, %x0 : tensor<1024xi64> loc(#loc89)
+    %tmp0 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc90)
+    %tmp0_12 = tt.addptr %tmp0, %xindex_5 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi32> loc(#loc90)
+    %tmp0_13 = tt.load %tmp0_12, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc91)
+    %tmp0_14 = arith.extf %tmp0_13 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc92)
+    %tmp1 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1024x!tt.ptr<i64>> loc(#loc93)
+    %tmp1_15 = tt.addptr %tmp1, %x2_10 : tensor<1024x!tt.ptr<i64>>, tensor<1024xi64> loc(#loc93)
+    %tmp1_16 = tt.load %tmp1_15, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<i64>> loc(#loc94)
+    %tmp3 = tt.splat %ks2 : i64 -> tensor<1024xi64> loc(#loc95)
+    %tmp3_17 = arith.addi %tmp1_16, %tmp3 : tensor<1024xi64> loc(#loc95)
+    %tmp4 = arith.cmpi slt, %tmp1_16, %cst_0 : tensor<1024xi64> loc(#loc96)
+    %tmp5 = arith.select %tmp4, %tmp3_17, %tmp1_16 : tensor<1024xi1>, tensor<1024xi64> loc(#loc97)
+    %0 = arith.cmpi sge, %tmp5, %cst_0 : tensor<1024xi64> loc(#loc19)
+    %1 = arith.cmpi slt, %tmp5, %tmp3 : tensor<1024xi64> loc(#loc20)
+    %2 = arith.andi %0, %1 : tensor<1024xi1> loc(#loc21)
+    %3 = arith.xori %xmask_6, %cst_2 : tensor<1024xi1> loc(#loc22)
+    %4 = arith.ori %2, %3 : tensor<1024xi1> loc(#loc23)
+    tt.assert %4, "index out of bounds: 0 <= tmp5 < ks2" : tensor<1024xi1> loc(#loc24)
+    %tmp7 = arith.muli %x0, %tmp5 : tensor<1024xi64> loc(#loc98)
+    %tmp7_18 = arith.addi %x0_11, %tmp7 : tensor<1024xi64> loc(#loc99)
+    %tmp7_19 = tt.splat %in_ptr2 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc100)
+    %tmp7_20 = tt.addptr %tmp7_19, %tmp7_18 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc100)
+    %tmp7_21 = tt.load %tmp7_20, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc101)
+    %tmp7_22 = arith.extf %tmp7_21 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc102)
+    %tmp8 = arith.mulf %tmp0_14, %tmp7_22 : tensor<1024xf32> loc(#loc103)
+    %tmp12 = arith.divsi %ks3, %c2_i64 : i64 loc(#loc104)
+    %tmp12_23 = arith.subi %ks3, %tmp12 : i64 loc(#loc105)
+    %tmp13 = tt.splat %tmp12_23 : i64 -> tensor<1024xi64> loc(#loc106)
+    %tmp13_24 = arith.cmpi slt, %x0_11, %tmp13 : tensor<1024xi64> loc(#loc106)
+    %tmp14 = arith.muli %x0, %x5 : tensor<1024xi64> loc(#loc107)
+    %tmp14_25 = tt.splat %tmp12 : i64 -> tensor<1024xi64> loc(#loc108)
+    %tmp14_26 = arith.addi %tmp14, %tmp14_25 : tensor<1024xi64> loc(#loc108)
+    %tmp14_27 = arith.addi %tmp14_26, %x0_11 : tensor<1024xi64> loc(#loc109)
+    %tmp14_28 = tt.addptr %tmp0, %tmp14_27 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc110)
+    %tmp14_29 = arith.andi %tmp13_24, %xmask_6 : tensor<1024xi1> loc(#loc111)
+    %tmp14_30 = tt.load %tmp14_28, %tmp14_29, %cst evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc112)
+    %tmp14_31 = arith.extf %tmp14_30 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc113)
+    %tmp15 = arith.subf %cst_1, %tmp14_31 : tensor<1024xf32> loc(#loc114)
+    %tmp18 = arith.cmpi sge, %x0_11, %tmp13 : tensor<1024xi64> loc(#loc115)
+    %tmp21 = arith.muli %ks3, %c-1_i64 : i64 loc(#loc116)
+    %tmp21_32 = tt.splat %tmp21 : i64 -> tensor<1024xi64> loc(#loc117)
+    %tmp21_33 = arith.addi %x0_11, %tmp21_32 : tensor<1024xi64> loc(#loc117)
+    %tmp21_34 = arith.addi %tmp21_33, %tmp14_25 : tensor<1024xi64> loc(#loc118)
+    %tmp21_35 = arith.addi %tmp14, %tmp21_34 : tensor<1024xi64> loc(#loc119)
+    %tmp21_36 = tt.addptr %tmp0, %tmp21_35 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc120)
+    %tmp21_37 = arith.andi %tmp18, %xmask_6 : tensor<1024xi1> loc(#loc121)
+    %tmp21_38 = tt.load %tmp21_36, %tmp21_37, %cst evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc122)
+    %tmp21_39 = arith.extf %tmp21_38 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc123)
+    %tmp22 = arith.select %tmp13_24, %tmp15, %tmp21_39 : tensor<1024xi1>, tensor<1024xf32> loc(#loc135)
+    %tmp24 = tt.splat %ks4 : i64 -> tensor<1024xi64> loc(#loc126)
+    %tmp24_40 = arith.addi %tmp1_16, %tmp24 : tensor<1024xi64> loc(#loc126)
+    %tmp25 = arith.select %tmp4, %tmp24_40, %tmp1_16 : tensor<1024xi1>, tensor<1024xi64> loc(#loc127)
+    %5 = arith.cmpi sge, %tmp25, %cst_0 : tensor<1024xi64> loc(#loc55)
+    %6 = arith.cmpi slt, %tmp25, %tmp24 : tensor<1024xi64> loc(#loc56)
+    %7 = arith.andi %5, %6 : tensor<1024xi1> loc(#loc57)
+    %8 = arith.ori %7, %3 : tensor<1024xi1> loc(#loc58)
+    tt.assert %8, "index out of bounds: 0 <= tmp25 < ks4" : tensor<1024xi1> loc(#loc59)
+    %tmp27 = arith.muli %x0, %tmp25 : tensor<1024xi64> loc(#loc128)
+    %tmp27_41 = arith.addi %x0_11, %tmp27 : tensor<1024xi64> loc(#loc129)
+    %tmp27_42 = tt.splat %in_ptr3 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc130)
+    %tmp27_43 = tt.addptr %tmp27_42, %tmp27_41 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi64> loc(#loc130)
+    %tmp27_44 = tt.load %tmp27_43, %xmask_6 evictionPolicy = evict_last : tensor<1024x!tt.ptr<bf16>> loc(#loc131)
+    %tmp27_45 = arith.extf %tmp27_44 : tensor<1024xbf16> to tensor<1024xf32> loc(#loc132)
+    %tmp28 = arith.mulf %tmp22, %tmp27_45 : tensor<1024xf32> loc(#loc133)
+    %tmp29 = arith.addf %tmp8, %tmp28 : tensor<1024xf32> loc(#loc134)
+    %9 = tt.splat %out_ptr0 : !tt.ptr<bf16> -> tensor<1024x!tt.ptr<bf16>> loc(#loc67)
+    %10 = tt.addptr %9, %xindex_5 : tensor<1024x!tt.ptr<bf16>>, tensor<1024xi32> loc(#loc67)
+    %11 = arith.truncf %tmp29 : tensor<1024xf32> to tensor<1024xbf16> loc(#loc68)
+    tt.store %10, %11, %xmask_6 : tensor<1024x!tt.ptr<bf16>> loc(#loc68)
+    tt.return loc(#loc69)
+  } loc(#loc)
+} loc(#loc)
+#loc1 = loc(unknown)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":19:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":19:33)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":20:36)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":20:23)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":21:21)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":23:21)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":23:28)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":24:19)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":25:19)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:30)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:35)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":26:75)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":27:30)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":27:35)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":29:18)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":30:18)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":31:32)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:28)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:44)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:37)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:54)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:52)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":32:62)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:39)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:35)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:30)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:46)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":33:86)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":34:18)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:31)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":38:18)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":39:19)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:35)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:41)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:54)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:31)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:68)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:60)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":40:119)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":41:13)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":44:20)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:52)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:47)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:60)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:41)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:31)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:81)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:73)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":47:132)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":48:35)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":43:35)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":50:19)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":51:34)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:28)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:46)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:38)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:54)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":52:64)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:40)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:36)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:31)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:48)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":53:88)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":54:20)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":55:19)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:25)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:37)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vy/cvyoqg7jzeadarrgggxhs2djmxugxyvjhim72vfstqc4io4qsecj.py":56:4)
+#loc81 = loc("xoffset"(#loc2))
+#loc82 = loc("xoffset"(#loc3))
+#loc83 = loc("xindex"(#loc4))
+#loc84 = loc("xindex"(#loc5))
+#loc85 = loc("xmask"(#loc6))
+#loc86 = loc("x2"(#loc7))
+#loc87 = loc("x2"(#loc8))
+#loc88 = loc("x0"(#loc9))
+#loc89 = loc("x5"(#loc10))
+#loc90 = loc("tmp0"(#loc11))
+#loc91 = loc("tmp0"(#loc12))
+#loc92 = loc("tmp0"(#loc13))
+#loc93 = loc("tmp1"(#loc14))
+#loc94 = loc("tmp1"(#loc15))
+#loc95 = loc("tmp3"(#loc16))
+#loc96 = loc("tmp4"(#loc17))
+#loc97 = loc("tmp5"(#loc18))
+#loc98 = loc("tmp7"(#loc25))
+#loc99 = loc("tmp7"(#loc26))
+#loc100 = loc("tmp7"(#loc27))
+#loc101 = loc("tmp7"(#loc28))
+#loc102 = loc("tmp7"(#loc29))
+#loc103 = loc("tmp8"(#loc30))
+#loc104 = loc("tmp12"(#loc31))
+#loc105 = loc("tmp12"(#loc32))
+#loc106 = loc("tmp13"(#loc33))
+#loc107 = loc("tmp14"(#loc34))
+#loc108 = loc("tmp14"(#loc35))
+#loc109 = loc("tmp14"(#loc36))
+#loc110 = loc("tmp14"(#loc37))
+#loc111 = loc("tmp14"(#loc38))
+#loc112 = loc("tmp14"(#loc39))
+#loc113 = loc("tmp14"(#loc40))
+#loc114 = loc("tmp15"(#loc41))
+#loc115 = loc("tmp18"(#loc42))
+#loc116 = loc("tmp21"(#loc43))
+#loc117 = loc("tmp21"(#loc44))
+#loc118 = loc("tmp21"(#loc45))
+#loc119 = loc("tmp21"(#loc46))
+#loc120 = loc("tmp21"(#loc47))
+#loc121 = loc("tmp21"(#loc48))
+#loc122 = loc("tmp21"(#loc49))
+#loc123 = loc("tmp21"(#loc50))
+#loc124 = loc("tmp22"(#loc51))
+#loc125 = loc("tmp17"(#loc52))
+#loc126 = loc("tmp24"(#loc53))
+#loc127 = loc("tmp25"(#loc54))
+#loc128 = loc("tmp27"(#loc60))
+#loc129 = loc("tmp27"(#loc61))
+#loc130 = loc("tmp27"(#loc62))
+#loc131 = loc("tmp27"(#loc63))
+#loc132 = loc("tmp27"(#loc64))
+#loc133 = loc("tmp28"(#loc65))
+#loc134 = loc("tmp29"(#loc66))
+#loc135 = loc(fused[#loc124, #loc125])

SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/__grp__triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.source", "triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ttir", "triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ttgir", "triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.llir", "triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.ptx", 
"triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.cubin", "triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.json"}}

SpecForge-ext/cache/compiled_kernels/triton/3/4BXPLEVNIV4ISF7IZIVK7CAM4LM5YGYX34PZFJY2Q7MUVWT7ZGUA/triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2.json ADDED

	@@ -0,0 +1 @@

+ {"hash": "e06ef592ad45788917e8ca2aaf880ce2d9dc1b17df1f92a71a87d94ada7fc9a8", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 8, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 16384, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_per_fused__to_copy_arange_bitwise_and_eq_gt_index_put_lt_new_zeros_scalar_tensor_sort_sum_unsqueeze_view_where_2"}

The diff for this file is too large to render. See raw diff

SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/__grp__triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.source", "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ttir", "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ttgir", "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.llir", "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ptx", "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.cubin", 
"triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.json"}}

SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.cubin ADDED

Binary file (22.3 kB).

SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.json ADDED

	@@ -0,0 +1 @@

+ {"hash": "fe376ba41d2f05ff2bb7c13c37b09b85f38cb341d26f28c38925cd0f032bd098", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 2, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 0, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2"}

SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.llir ADDED

	@@ -0,0 +1,266 @@

+; ModuleID = 'LLVMDialectModule'
+source_filename = "LLVMDialectModule"
+target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
+@assertFunc_0 = internal constant [8 x i8] c"unknown\00"
+@assertFile_0 = internal constant [114 x i8] c"/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py\00"
+@assertMessage_0 = internal constant [90 x i8] c"index out of bounds: 0 <= tmp15 < 1 + (triton_helpers.div_floor_integer(127 + ks1,  128))\00"
+; Function Attrs: noreturn
+declare !dbg !5 void @__assertfail(ptr, ptr, i32, ptr, i64) local_unnamed_addr #0
+define ptx_kernel void @triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, ptr addrspace(1) %3, ptr addrspace(1) %4, i64 %5, i64 %6, i32 %7, i32 %8, ptr addrspace(1) readnone captures(none) %9, ptr addrspace(1) readnone captures(none) %10) local_unnamed_addr #1 !dbg !9 {
+  %12 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !10
+  %13 = icmp samesign ult i32 %12, 32, !dbg !11
+  %14 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !12
+  %15 = and i32 %14, 31, !dbg !12
+  %16 = zext nneg i32 %12 to i64, !dbg !13
+  %17 = mul i64 %5, %16, !dbg !13
+  %18 = icmp sgt i32 %8, 0, !dbg !14
+  br i1 %18, label %.lr.ph, label %._crit_edge, !dbg !14
+.lr.ph:                                           ; preds = %11
+  %19 = getelementptr i32, ptr addrspace(1) %0, i64 %17
+  br i1 %13, label %.lr.ph.split, label %.lr.ph.split.us
+.lr.ph.split.us:                                  ; preds = %.lr.ph, %.lr.ph.split.us
+  %20 = phi i32 [ %26, %.lr.ph.split.us ], [ 0, %.lr.ph ]
+  %21 = or disjoint i32 %20, %15, !dbg !15
+  %22 = sext i32 %21 to i64, !dbg !16
+  %23 = getelementptr i32, ptr addrspace(1) %19, i64 %22, !dbg !17
+  %24 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_first.b64 $0, 1.0;", "=l"() #5, !dbg !18
+  %25 = tail call i32 asm sideeffect "mov.u32 $0, 0x0;\0A\09@$3 ld.global.L1::evict_first.L2::cache_hint.b32 { $0 }, [ $1 + 0 ], $2;", "=r,l,l,b"(ptr addrspace(1) %23, i64 %24, i1 false) #5, !dbg !18
+  %26 = add i32 %20, 32, !dbg !14
+  %27 = icmp slt i32 %26, %8, !dbg !14
+  br i1 %27, label %.lr.ph.split.us, label %._crit_edge, !dbg !14
+.lr.ph.split:                                     ; preds = %.lr.ph, %.lr.ph.split
+  %28 = phi i64 [ %36, %.lr.ph.split ], [ 0, %.lr.ph ]
+  %29 = phi i32 [ %37, %.lr.ph.split ], [ 0, %.lr.ph ]
+  %30 = or disjoint i32 %29, %15, !dbg !15
+  %31 = icmp slt i32 %30, %8, !dbg !19
+  %32 = sext i32 %30 to i64, !dbg !16
+  %33 = getelementptr i32, ptr addrspace(1) %19, i64 %32, !dbg !17
+  %34 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_first.b64 $0, 1.0;", "=l"() #5, !dbg !18
+  %35 = tail call i32 asm sideeffect "mov.u32 $0, 0x0;\0A\09@$3 ld.global.L1::evict_first.L2::cache_hint.b32 { $0 }, [ $1 + 0 ], $2;", "=r,l,l,b"(ptr addrspace(1) %33, i64 %34, i1 %31) #5, !dbg !18
+  %narrow16 = select i1 %31, i32 %35, i32 0, !dbg !20
+  %spec.select = sext i32 %narrow16 to i64, !dbg !20
+  %36 = add i64 %28, %spec.select, !dbg !20
+  %37 = add i32 %29, 32, !dbg !14
+  %38 = icmp slt i32 %37, %8, !dbg !14
+  br i1 %38, label %.lr.ph.split, label %._crit_edge, !dbg !14
+._crit_edge:                                      ; preds = %.lr.ph.split.us, %.lr.ph.split, %11
+  %.lcssa = phi i64 [ 0, %11 ], [ %36, %.lr.ph.split ], [ 0, %.lr.ph.split.us ], !dbg !21
+  %extelt.offset = lshr i64 %.lcssa, 32, !dbg !22
+  %39 = trunc nuw i64 %extelt.offset to i32, !dbg !22
+  %40 = trunc i64 %.lcssa to i32, !dbg !22
+  %41 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %40, i32 16, i32 31), !dbg !22
+  %42 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %39, i32 16, i32 31), !dbg !22
+  %43 = insertelement <2 x i32> poison, i32 %41, i64 0, !dbg !22
+  %44 = insertelement <2 x i32> %43, i32 %42, i64 1, !dbg !22
+  %45 = bitcast <2 x i32> %44 to i64, !dbg !22
+  %46 = add i64 %.lcssa, %45, !dbg !26
+  %extelt.offset3 = lshr i64 %46, 32, !dbg !22
+  %47 = trunc nuw i64 %extelt.offset3 to i32, !dbg !22
+  %48 = trunc i64 %46 to i32, !dbg !22
+  %49 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %48, i32 8, i32 31), !dbg !22
+  %50 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %47, i32 8, i32 31), !dbg !22
+  %51 = insertelement <2 x i32> poison, i32 %49, i64 0, !dbg !22
+  %52 = insertelement <2 x i32> %51, i32 %50, i64 1, !dbg !22
+  %53 = bitcast <2 x i32> %52 to i64, !dbg !22
+  %54 = add i64 %46, %53, !dbg !26
+  %extelt.offset4 = lshr i64 %54, 32, !dbg !22
+  %55 = trunc nuw i64 %extelt.offset4 to i32, !dbg !22
+  %56 = trunc i64 %54 to i32, !dbg !22
+  %57 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %56, i32 4, i32 31), !dbg !22
+  %58 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %55, i32 4, i32 31), !dbg !22
+  %59 = insertelement <2 x i32> poison, i32 %57, i64 0, !dbg !22
+  %60 = insertelement <2 x i32> %59, i32 %58, i64 1, !dbg !22
+  %61 = bitcast <2 x i32> %60 to i64, !dbg !22
+  %62 = add i64 %54, %61, !dbg !26
+  %extelt.offset5 = lshr i64 %62, 32, !dbg !22
+  %63 = trunc nuw i64 %extelt.offset5 to i32, !dbg !22
+  %64 = trunc i64 %62 to i32, !dbg !22
+  %65 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %64, i32 2, i32 31), !dbg !22
+  %66 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %63, i32 2, i32 31), !dbg !22
+  %67 = insertelement <2 x i32> poison, i32 %65, i64 0, !dbg !22
+  %68 = insertelement <2 x i32> %67, i32 %66, i64 1, !dbg !22
+  %69 = bitcast <2 x i32> %68 to i64, !dbg !22
+  %70 = add i64 %62, %69, !dbg !26
+  %extelt.offset6 = lshr i64 %70, 32, !dbg !22
+  %71 = trunc nuw i64 %extelt.offset6 to i32, !dbg !22
+  %72 = trunc i64 %70 to i32, !dbg !22
+  %73 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %72, i32 1, i32 31), !dbg !22
+  %74 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %71, i32 1, i32 31), !dbg !22
+  %75 = insertelement <2 x i32> poison, i32 %73, i64 0, !dbg !22
+  %76 = insertelement <2 x i32> %75, i32 %74, i64 1, !dbg !22
+  %77 = bitcast <2 x i32> %76 to i64, !dbg !22
+  %78 = add i64 %70, %77, !dbg !26
+  %79 = trunc i64 %78 to i32, !dbg !27
+  %80 = getelementptr i32, ptr addrspace(1) %2, i64 %16, !dbg !28
+  %81 = and i32 %14, 32, !dbg !29
+  %82 = icmp eq i32 %81, 0, !dbg !29
+  %83 = and i32 %14, 63, !dbg !29
+  %84 = icmp eq i32 %83, 0, !dbg !29
+  %85 = and i1 %13, %84, !dbg !29
+  tail call void asm sideeffect "@$2 st.global.b32 [ $1 + 0 ], { $0 };", "r,l,b"(i32 %79, ptr addrspace(1) %80, i1 %85) #5, !dbg !29
+  %86 = icmp slt i64 %5, 2, !dbg !30
+  %87 = icmp sgt i64 %5, 1, !dbg !31
+  %88 = select i1 %87, i64 %5, i64 0, !dbg !32
+  %89 = zext i1 %86 to i64, !dbg !33
+  %90 = add i64 %88, %89, !dbg !34
+  %91 = mul i64 %90, %16, !dbg !35
+  %92 = add i64 %5, 1, !dbg !36
+  %93 = add i64 %6, 127, !dbg !37
+  %94 = sdiv i64 %93, 128, !dbg !38
+  %95 = and i64 %93, 127, !dbg !42
+  %.not = icmp ne i64 %95, 0, !dbg !42
+  %96 = icmp slt i64 %93, 0, !dbg !43
+  %narrow = and i1 %96, %.not, !dbg !44
+  %97 = sext i1 %narrow to i64, !dbg !44
+  %98 = add nsw i64 %94, %97, !dbg !44
+  br i1 %18, label %.lr.ph14, label %._crit_edge15, !dbg !45
+.lr.ph14:                                         ; preds = %._crit_edge, %119
+  %99 = phi i32 [ %131, %119 ], [ 0, %._crit_edge ]
+  %100 = or disjoint i32 %99, %15, !dbg !46
+  %101 = icmp slt i32 %100, %8, !dbg !47
+  %102 = sext i32 %100 to i64, !dbg !48
+  %103 = add i64 %91, %102, !dbg !48
+  %104 = getelementptr i64, ptr addrspace(1) %1, i64 %103, !dbg !49
+  %105 = and i1 %13, %101, !dbg !50
+  %106 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_first.b64 $0, 1.0;", "=l"() #5, !dbg !51
+  %107 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_first.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %104, i64 %106, i1 %105) #5, !dbg !51
+  %108 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_first.b64 $0, 1.0;", "=l"() #5, !dbg !51
+  %109 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_first.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %104, i64 %108, i1 %105) #5, !dbg !51
+  %110 = icmp slt i32 %100, %79, !dbg !52
+  %sext7 = shl i64 %109, 32, !dbg !53
+  %111 = ashr exact i64 %sext7, 32, !dbg !53
+  %112 = select i1 %110, i64 %111, i64 %5, !dbg !53
+  %113 = icmp slt i64 %112, 0, !dbg !54
+  %114 = select i1 %113, i64 %92, i64 0, !dbg !55
+  %115 = add i64 %114, %112, !dbg !55
+  %116 = icmp slt i64 %115, 0, !dbg !56
+  %117 = icmp sgt i64 %115, %98, !dbg !57
+  %.not12 = or i1 %116, %117, !dbg !58
+  %.not9 = and i1 %105, %.not12, !dbg !59
+  br i1 %.not9, label %118, label %119, !dbg !59
+118:                                              ; preds = %.lr.ph14
+  tail call void @__assertfail(ptr nonnull @assertMessage_0, ptr nonnull @assertFile_0, i32 59, ptr nonnull @assertFunc_0, i64 1), !dbg !59
+  unreachable, !dbg !59
+119:                                              ; preds = %.lr.ph14
+  %sext = shl i64 %107, 32, !dbg !53
+  %120 = ashr exact i64 %sext, 32, !dbg !53
+  %121 = select i1 %110, i64 %120, i64 %5, !dbg !53
+  %122 = icmp slt i64 %121, 0, !dbg !54
+  %123 = select i1 %122, i64 %92, i64 0, !dbg !55
+  %124 = trunc i64 %109 to i32, !dbg !60
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !59
+  %125 = getelementptr i32, ptr addrspace(1) %3, i64 %103, !dbg !61
+  %126 = and i1 %82, %105, !dbg !62
+  tail call void asm sideeffect "@$2 st.global.b32 [ $1 + 0 ], { $0 };", "r,l,b"(i32 %124, ptr addrspace(1) %125, i1 %126) #5, !dbg !62
+  %127 = getelementptr i32, ptr addrspace(1) %4, i64 %121, !dbg !63
+  %128 = getelementptr i32, ptr addrspace(1) %127, i64 %123, !dbg !63
+  %129 = getelementptr i32, ptr addrspace(1) %128, i64 %16, !dbg !63
+  %130 = getelementptr i32, ptr addrspace(1) %129, i64 %17, !dbg !63
+  tail call void asm sideeffect "@$2 st.global.b32 [ $1 + 0 ], { $0 };", "r,l,b"(i32 1, ptr addrspace(1) %130, i1 %126) #5, !dbg !64
+  %131 = add i32 %99, 32, !dbg !45
+  %132 = icmp slt i32 %131, %8, !dbg !45
+  br i1 %132, label %.lr.ph14, label %._crit_edge15, !dbg !45
+._crit_edge15:                                    ; preds = %119, %._crit_edge
+  ret void, !dbg !65
+}
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 2147483647) i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #2
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 1024) i32 @llvm.nvvm.read.ptx.sreg.tid.x() #2
+; Function Attrs: convergent nocallback nounwind memory(inaccessiblemem: readwrite)
+declare i32 @llvm.nvvm.shfl.sync.bfly.i32(i32, i32, i32, i32) #3
+; Function Attrs: convergent nocallback nounwind
+declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32) #4
+attributes #0 = { noreturn }
+attributes #1 = { "nvvm.reqntid"="64" }
+attributes #2 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+attributes #3 = { convergent nocallback nounwind memory(inaccessiblemem: readwrite) }
+attributes #4 = { convergent nocallback nounwind }
+attributes #5 = { nounwind }
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!2, !3}
+!llvm.ident = !{!4}
+!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "triton", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly)
+!1 = !DIFile(filename: "cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py", directory: "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr")
+!2 = !{i32 2, !"Debug Info Version", i32 3}
+!3 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
+!4 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
+!5 = !DISubprogram(name: "__assertfail", linkageName: "__assertfail", scope: !6, file: !6, type: !7, spFlags: DISPFlagOptimized)
+!6 = !DIFile(filename: "<unknown>", directory: "")
+!7 = !DISubroutineType(cc: DW_CC_normal, types: !8)
+!8 = !{}
+!9 = distinct !DISubprogram(name: "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2", linkageName: "triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2", scope: !1, file: !1, line: 18, type: !7, scopeLine: 18, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
+!10 = !DILocation(line: 22, column: 28, scope: !9)
+!11 = !DILocation(line: 24, column: 21, scope: !9)
+!12 = !DILocation(line: 25, column: 37, scope: !9)
+!13 = !DILocation(line: 35, column: 45, scope: !9)
+!14 = !DILocation(line: 29, column: 40, scope: !9)
+!15 = !DILocation(line: 30, column: 31, scope: !9)
+!16 = !DILocation(line: 35, column: 41, scope: !9)
+!17 = !DILocation(line: 35, column: 34, scope: !9)
+!18 = !DILocation(line: 35, column: 50, scope: !9)
+!19 = !DILocation(line: 31, column: 29, scope: !9)
+!20 = !DILocation(line: 39, column: 48, scope: !9)
+!21 = !DILocation(line: 28, column: 43, scope: !9)
+!22 = !DILocation(line: 291, column: 36, scope: !23, inlinedAt: !25)
+!23 = distinct !DILexicalBlockFile(scope: !9, file: !24, discriminator: 0)
+!24 = !DIFile(filename: "standard.py", directory: "/workspace/specforge/lib/python3.11/site-packages/triton/language")
+!25 = !DILocation(line: 40, column: 25, scope: !9)
+!26 = !DILocation(line: 261, column: 15, scope: !23, inlinedAt: !25)
+!27 = !DILocation(line: 41, column: 19, scope: !9)
+!28 = !DILocation(line: 42, column: 25, scope: !9)
+!29 = !DILocation(line: 42, column: 36, scope: !9)
+!30 = !DILocation(line: 49, column: 60, scope: !9)
+!31 = !DILocation(line: 49, column: 86, scope: !9)
+!32 = !DILocation(line: 49, column: 77, scope: !9)
+!33 = !DILocation(line: 49, scope: !9)
+!34 = !DILocation(line: 49, column: 68, scope: !9)
+!35 = !DILocation(line: 49, column: 45, scope: !9)
+!36 = !DILocation(line: 55, column: 20, scope: !9)
+!37 = !DILocation(line: 59, column: 94, scope: !9)
+!38 = !DILocation(line: 72, column: 16, scope: !39, inlinedAt: !41)
+!39 = distinct !DILexicalBlockFile(scope: !9, file: !40, discriminator: 0)
+!40 = !DIFile(filename: "triton_helpers.py", directory: "/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime")
+!41 = !DILocation(line: 59, column: 100, scope: !9)
+!42 = !DILocation(line: 74, column: 34, scope: !39, inlinedAt: !41)
+!43 = !DILocation(line: 75, column: 25, scope: !39, inlinedAt: !41)
+!44 = !DILocation(line: 75, column: 47, scope: !39, inlinedAt: !41)
+!45 = !DILocation(line: 43, column: 40, scope: !9)
+!46 = !DILocation(line: 44, column: 31, scope: !9)
+!47 = !DILocation(line: 45, column: 29, scope: !9)
+!48 = !DILocation(line: 49, column: 41, scope: !9)
+!49 = !DILocation(line: 49, column: 34, scope: !9)
+!50 = !DILocation(line: 49, column: 103, scope: !9)
+!51 = !DILocation(line: 49, column: 93, scope: !9)
+!52 = !DILocation(line: 52, column: 22, scope: !9)
+!53 = !DILocation(line: 54, column: 37, scope: !9)
+!54 = !DILocation(line: 57, column: 24, scope: !9)
+!55 = !DILocation(line: 58, column: 39, scope: !9)
+!56 = !DILocation(line: 59, column: 32, scope: !9)
+!57 = !DILocation(line: 59, column: 50, scope: !9)
+!58 = !DILocation(line: 59, column: 112, scope: !9)
+!59 = !DILocation(line: 59, column: 130, scope: !9)
+!60 = !DILocation(line: 50, column: 23, scope: !9)
+!61 = !DILocation(line: 61, column: 29, scope: !9)
+!62 = !DILocation(line: 61, column: 94, scope: !9)
+!63 = !DILocation(line: 62, column: 29, scope: !9)
+!64 = !DILocation(line: 62, column: 95, scope: !9)
+!65 = !DILocation(line: 43, column: 4, scope: !9)

SpecForge-ext/cache/compiled_kernels/triton/3/7Y3WXJA5F4C76K5XYE6DPME3QXZYZM2B2JXSRQ4JEXGQ6AZL2CMA/triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2.ptx ADDED

	@@ -0,0 +1,640 @@

+//
+// Generated by LLVM NVPTX Back-End
+//
+.version 8.7
+.target sm_90a
+.address_size 64
+	// .globl	triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2 // -- Begin function triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2
+.extern .func __assertfail
+(
+	.param .b64 __assertfail_param_0,
+	.param .b64 __assertfail_param_1,
+	.param .b32 __assertfail_param_2,
+	.param .b64 __assertfail_param_3,
+	.param .b64 __assertfail_param_4
+)
+.noreturn;
+.global .align 1 .b8 assertFunc_0[8] = {117, 110, 107, 110, 111, 119, 110};
+.global .align 1 .b8 assertFile_0[114] = {47, 119, 111, 114, 107, 115, 112, 97, 99, 101, 47, 104, 97, 110, 114, 117, 105, 47, 83, 112, 101, 99, 70, 111, 114, 103, 101, 45, 101, 120, 116, 47, 99, 97, 99, 104, 101, 47, 99, 111, 109, 112, 105, 108, 101, 100, 95, 107, 101, 114, 110, 101, 108, 115, 47, 118, 114, 47, 99, 118, 114, 104, 110, 114, 109, 112, 103, 121, 120, 119, 117, 51, 52, 120, 108, 101, 99, 108, 101, 101, 51, 116, 116, 52, 107, 101, 109, 111, 108, 100, 107, 106, 55, 105, 97, 109, 52, 117, 99, 105, 97, 116, 104, 111, 109, 105, 114, 118, 108, 99, 46, 112, 121};
+.global .align 1 .b8 assertMessage_0[90] = {105, 110, 100, 101, 120, 32, 111, 117, 116, 32, 111, 102, 32, 98, 111, 117, 110, 100, 115, 58, 32, 48, 32, 60, 61, 32, 116, 109, 112, 49, 53, 32, 60, 32, 49, 32, 43, 32, 40, 116, 114, 105, 116, 111, 110, 95, 104, 101, 108, 112, 101, 114, 115, 46, 100, 105, 118, 95, 102, 108, 111, 111, 114, 95, 105, 110, 116, 101, 103, 101, 114, 40, 49, 50, 55, 32, 43, 32, 107, 115, 49, 44, 32, 32, 49, 50, 56, 41, 41};
+                                        // @triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2
+.visible .entry triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2(
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_0,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_1,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_2,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_3,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_4,
+	.param .u64 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_5,
+	.param .u64 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_6,
+	.param .u32 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_7,
+	.param .u32 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_8,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_9,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_10
+)
+.reqntid 64
+{
+	.reg .pred 	%p<32>;
+	.reg .b32 	%r<53>;
+	.reg .b64 	%rd<103>;
+	.loc	1 18 0                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:18:0
+$L__func_begin0:
+	.loc	1 18 0                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:18:0
+// %bb.0:
+	ld.param.b32 	%r12, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_8];
+	ld.param.b64 	%rd18, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_5];
+	ld.param.b64 	%rd15, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_2];
+$L__tmp0:
+	.loc	1 22 28                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:22:28
+	mov.u32 	%r13, %ctaid.x;
+	.loc	1 25 37                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:25:37
+	mov.u32 	%r1, %tid.x;
+	and.b32 	%r2, %r1, 31;
+	.loc	1 35 45                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:35:45
+	cvt.u64.u32 	%rd1, %r13;
+	mul.lo.s64 	%rd2, %rd18, %rd1;
+	.loc	1 29 40                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:29:40
+	setp.lt.s32 	%p2, %r12, 1;
+	mov.b64 	%rd102, 0;
+	cvt.u32.u64 	%r49, %rd1;
+	shl.b64 	%rd100, %rd2, 2;
+	@%p2 bra 	$L__BB0_6;
+// %bb.1:                               // %.lr.ph
+	.loc	1 0 40                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:0:40
+	ld.param.b64 	%rd13, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_0];
+	.loc	1 24 21                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:24:21
+	setp.lt.u32 	%p3, %r49, 32;
+	add.s64 	%rd3, %rd13, %rd100;
+	@%p3 bra 	$L__BB0_4;
+	bra.uni 	$L__BB0_2;
+$L__BB0_4:                              // %.lr.ph.split.preheader
+	.loc	1 0 21                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:0:21
+	mov.b32 	%r51, 0;
+	mov.b64 	%rd102, 0;
+$L__BB0_5:                              // %.lr.ph.split
+                                        // =>This Inner Loop Header: Depth=1
+	.loc	1 31 29                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:31:29
+	add.s32 	%r20, %r2, %r51;
+	setp.lt.s32 	%p6, %r20, %r12;
+	.loc	1 35 34                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:35:34
+	mad.wide.s32 	%rd28, %r20, 4, %rd3;
+	.loc	1 35 50                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:35:50
+	// begin inline asm
+	mov.u64 %rd27, 0x0;
+	createpolicy.fractional.L2::evict_first.b64 %rd27, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u32 %r19, 0x0;
+	@%p6 ld.global.L1::evict_first.L2::cache_hint.b32 { %r19 }, [ %rd28 + 0 ], %rd27;
+	// end inline asm
+	.loc	1 39 48                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:39:48
+	selp.b32 	%r21, %r19, 0, %p6;
+	cvt.s64.s32 	%rd30, %r21;
+	add.s64 	%rd102, %rd102, %rd30;
+	.loc	1 29 40                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:29:40
+	add.s32 	%r51, %r51, 32;
+	setp.lt.s32 	%p7, %r51, %r12;
+	@%p7 bra 	$L__BB0_5;
+	bra.uni 	$L__BB0_6;
+$L__BB0_2:                              // %.lr.ph.split.us.preheader
+	.loc	1 0 40                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:0:40
+	mov.b32 	%r50, 0;
+$L__BB0_3:                              // %.lr.ph.split.us
+                                        // =>This Inner Loop Header: Depth=1
+	.loc	1 35 41                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:35:41
+	add.s32 	%r17, %r2, %r50;
+	.loc	1 35 34                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:35:34
+	mad.wide.s32 	%rd23, %r17, 4, %rd3;
+	.loc	1 35 50                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:35:50
+	// begin inline asm
+	mov.u64 %rd22, 0x0;
+	createpolicy.fractional.L2::evict_first.b64 %rd22, 1.0;
+	// end inline asm
+	mov.pred 	%p4, 0;
+	// begin inline asm
+	mov.u32 %r16, 0x0;
+	@%p4 ld.global.L1::evict_first.L2::cache_hint.b32 { %r16 }, [ %rd23 + 0 ], %rd22;
+	// end inline asm
+	.loc	1 29 40                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:29:40
+	add.s32 	%r50, %r50, 32;
+	setp.lt.s32 	%p5, %r50, %r12;
+	@%p5 bra 	$L__BB0_3;
+$L__BB0_6:                              // %._crit_edge
+	.loc	1 24 21                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:24:21
+	setp.lt.u32 	%p10, %r49, 32;
+$L__tmp1:
+	.loc	2 291 36                        // standard.py:291:36 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	mov.b64 	{_, %r24}, %rd102;
+	cvt.u32.u64 	%r25, %rd102;
+	shfl.sync.bfly.b32 	%r26, %r25, 16, 31, -1;
+	shfl.sync.bfly.b32 	%r27, %r24, 16, 31, -1;
+	cvt.u64.u32 	%rd32, %r26;
+	cvt.u64.u32 	%rd33, %r27;
+	shl.b64 	%rd34, %rd33, 32;
+	or.b64 	%rd35, %rd32, %rd34;
+	.loc	2 261 15                        // standard.py:261:15 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	add.s64 	%rd36, %rd102, %rd35;
+	.loc	2 291 36                        // standard.py:291:36 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	mov.b64 	{_, %r28}, %rd36;
+	cvt.u32.u64 	%r29, %rd36;
+	shfl.sync.bfly.b32 	%r30, %r29, 8, 31, -1;
+	shfl.sync.bfly.b32 	%r31, %r28, 8, 31, -1;
+	cvt.u64.u32 	%rd37, %r30;
+	cvt.u64.u32 	%rd38, %r31;
+	shl.b64 	%rd39, %rd38, 32;
+	or.b64 	%rd40, %rd37, %rd39;
+	.loc	2 261 15                        // standard.py:261:15 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	add.s64 	%rd41, %rd36, %rd40;
+	.loc	2 291 36                        // standard.py:291:36 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	mov.b64 	{_, %r32}, %rd41;
+	cvt.u32.u64 	%r33, %rd41;
+	shfl.sync.bfly.b32 	%r34, %r33, 4, 31, -1;
+	shfl.sync.bfly.b32 	%r35, %r32, 4, 31, -1;
+	cvt.u64.u32 	%rd42, %r34;
+	cvt.u64.u32 	%rd43, %r35;
+	shl.b64 	%rd44, %rd43, 32;
+	or.b64 	%rd45, %rd42, %rd44;
+	.loc	2 261 15                        // standard.py:261:15 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	add.s64 	%rd46, %rd41, %rd45;
+	.loc	2 291 36                        // standard.py:291:36 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	mov.b64 	{_, %r36}, %rd46;
+	cvt.u32.u64 	%r37, %rd46;
+	shfl.sync.bfly.b32 	%r38, %r37, 2, 31, -1;
+	shfl.sync.bfly.b32 	%r39, %r36, 2, 31, -1;
+	cvt.u64.u32 	%rd47, %r38;
+	cvt.u64.u32 	%rd48, %r39;
+	shl.b64 	%rd49, %rd48, 32;
+	or.b64 	%rd50, %rd47, %rd49;
+	.loc	2 261 15                        // standard.py:261:15 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	add.s64 	%rd51, %rd46, %rd50;
+	.loc	2 291 36                        // standard.py:291:36 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	mov.b64 	{_, %r40}, %rd51;
+	cvt.u32.u64 	%r41, %rd51;
+	shfl.sync.bfly.b32 	%r42, %r41, 1, 31, -1;
+	shfl.sync.bfly.b32 	%r43, %r40, 1, 31, -1;
+	cvt.u64.u32 	%rd52, %r42;
+	.loc	2 261 15                        // standard.py:261:15 @[ cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:40:25 ]
+	add.s64 	%rd53, %rd51, %rd52;
+$L__tmp2:
+	.loc	1 41 19                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:41:19
+	cvt.u32.u64 	%r22, %rd53;
+	.loc	1 42 25                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:42:25
+	shl.b64 	%rd54, %rd1, 2;
+	add.s64 	%rd31, %rd15, %rd54;
+	.loc	1 42 36                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:42:36
+	and.b32 	%r44, %r1, 63;
+	setp.eq.b32 	%p11, %r44, 0;
+	and.pred 	%p8, %p10, %p11;
+	// begin inline asm
+	@%p8 st.global.b32 [ %rd31 + 0 ], { %r22 };
+	// end inline asm
+	.loc	1 43 40                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:43:40
+	@%p2 bra 	$L__BB0_11;
+// %bb.7:                               // %.lr.ph14.preheader
+	.loc	1 0 40                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:0:40
+	ld.param.b64 	%rd19, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_6];
+	ld.param.b64 	%rd17, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_4];
+	ld.param.b64 	%rd16, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_3];
+	ld.param.b64 	%rd14, [triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2_param_1];
+	and.b32 	%r8, %r1, 32;
+	setp.lt.s64 	%p12, %rd18, 2;
+	setp.gt.s64 	%p13, %rd18, 1;
+	selp.b64 	%rd55, %rd18, 0, %p13;
+	selp.b64 	%rd56, 1, 0, %p12;
+	add.s64 	%rd57, %rd55, %rd56;
+	mul.lo.s64 	%rd7, %rd57, %rd1;
+	add.s64 	%rd8, %rd18, 1;
+	add.s64 	%rd58, %rd19, 127;
+	shr.s64 	%rd59, %rd58, 63;
+	shr.u64 	%rd60, %rd59, 57;
+	add.s64 	%rd61, %rd58, %rd60;
+	shr.s64 	%rd62, %rd61, 7;
+	and.b64 	%rd63, %rd58, 127;
+	setp.ne.b64 	%p14, %rd63, 0;
+	setp.lt.s64 	%p15, %rd58, 0;
+	and.pred 	%p16, %p15, %p14;
+	selp.b64 	%rd64, -1, 0, %p16;
+	add.s64 	%rd9, %rd62, %rd64;
+	mov.b32 	%r52, 0;
+$L__BB0_8:                              // %.lr.ph14
+                                        // =>This Inner Loop Header: Depth=1
+	.loc	1 45 29                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:45:29
+	add.s32 	%r10, %r2, %r52;
+	setp.lt.s32 	%p20, %r10, %r12;
+	.loc	1 49 41                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:49:41
+	cvt.s64.s32 	%rd73, %r10;
+	add.s64 	%rd10, %rd7, %rd73;
+	.loc	1 49 34                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:49:34
+	shl.b64 	%rd74, %rd10, 3;
+	add.s64 	%rd67, %rd14, %rd74;
+	.loc	1 49 103                        // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:49:103
+	and.pred 	%p18, %p10, %p20;
+	.loc	1 49 93                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:49:93
+	// begin inline asm
+	mov.u64 %rd65, 0x0;
+	createpolicy.fractional.L2::evict_first.b64 %rd65, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd66, 0x0;
+	@%p18 ld.global.L1::evict_first.L2::cache_hint.b64 { %rd66 }, [ %rd67 + 0 ], %rd65;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd69, 0x0;
+	createpolicy.fractional.L2::evict_first.b64 %rd69, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd70, 0x0;
+	@%p18 ld.global.L1::evict_first.L2::cache_hint.b64 { %rd70 }, [ %rd67 + 0 ], %rd69;
+	// end inline asm
+	.loc	1 52 22                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:52:22
+	setp.lt.s32 	%p21, %r10, %r22;
+	.loc	1 54 37                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:54:37
+	cvt.s64.s32 	%rd75, %rd70;
+	selp.b64 	%rd76, %rd75, %rd18, %p21;
+	.loc	1 58 39                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:58:39
+	shr.s64 	%rd77, %rd76, 63;
+	and.b64 	%rd78, %rd77, %rd8;
+	add.s64 	%rd79, %rd78, %rd76;
+	.loc	1 59 32                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:59:32
+	setp.lt.s64 	%p22, %rd79, 0;
+	.loc	1 59 50                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:59:50
+	setp.gt.s64 	%p23, %rd79, %rd9;
+	.loc	1 59 112                        // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:59:112
+	or.pred 	%p24, %p22, %p23;
+	.loc	1 59 130                        // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:59:130
+	and.pred 	%p25, %p18, %p24;
+	not.pred 	%p26, %p25;
+	@%p26 bra 	$L__BB0_10;
+	bra.uni 	$L__BB0_9;
+$L__BB0_10:                             //   in Loop: Header=BB0_8 Depth=1
+	.loc	1 42 36                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:42:36
+	setp.eq.b32 	%p30, %r8, 0;
+	.loc	1 54 37                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:54:37
+	cvt.s64.s32 	%rd82, %rd66;
+	selp.b64 	%rd83, %rd82, %rd18, %p21;
+	.loc	1 58 39                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:58:39
+	shr.s64 	%rd84, %rd83, 63;
+	and.b64 	%rd85, %rd84, %rd8;
+	.loc	1 50 23                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:50:23
+	cvt.u32.u64 	%r47, %rd70;
+	.loc	1 59 130                        // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:59:130
+	bar.sync 	0;
+	.loc	1 61 29                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:61:29
+	shl.b64 	%rd86, %rd10, 2;
+	add.s64 	%rd80, %rd16, %rd86;
+	.loc	1 61 94                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:61:94
+	and.pred 	%p27, %p30, %p18;
+	// begin inline asm
+	@%p27 st.global.b32 [ %rd80 + 0 ], { %r47 };
+	// end inline asm
+	.loc	1 62 29                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:62:29
+	shl.b64 	%rd87, %rd83, 2;
+	add.s64 	%rd88, %rd17, %rd87;
+	shl.b64 	%rd89, %rd85, 2;
+	add.s64 	%rd90, %rd88, %rd89;
+	add.s64 	%rd92, %rd90, %rd54;
+	add.s64 	%rd81, %rd92, %rd100;
+	mov.b32 	%r48, 1;
+	.loc	1 62 95                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:62:95
+	// begin inline asm
+	@%p27 st.global.b32 [ %rd81 + 0 ], { %r48 };
+	// end inline asm
+	.loc	1 43 40                         // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:43:40
+	add.s32 	%r52, %r52, 32;
+	setp.lt.s32 	%p31, %r52, %r12;
+	@%p31 bra 	$L__BB0_8;
+$L__BB0_11:                             // %._crit_edge15
+	.loc	1 43 4                          // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:43:4
+	ret;
+$L__BB0_9:
+	.loc	1 59 130                        // cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py:59:130
+	{ // callseq 0, 0
+	.param .b64 	param0;
+	.param .b64 	param1;
+	.param .b32 	param2;
+	.param .b64 	param3;
+	.param .b64 	param4;
+	mov.b64 	%rd94, assertFunc_0;
+	cvta.global.u64 	%rd95, %rd94;
+	st.param.b64 	[param3], %rd95;
+	mov.b64 	%rd96, assertFile_0;
+	cvta.global.u64 	%rd97, %rd96;
+	st.param.b64 	[param1], %rd97;
+	mov.b64 	%rd98, assertMessage_0;
+	cvta.global.u64 	%rd99, %rd98;
+	st.param.b64 	[param0], %rd99;
+	st.param.b64 	[param4], 1;
+	st.param.b32 	[param2], 59;
+	call.uni __assertfail, (param0, param1, param2, param3, param4);
+	} // callseq 0
+	trap;
+$L__tmp3:
+$L__func_end0:
+                                        // -- End function
+}
+	.file	1 "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py"
+	.file	2 "/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py"
+	.section	.debug_abbrev
+	{
+.b8 1                                   // Abbreviation Code
+.b8 17                                  // DW_TAG_compile_unit
+.b8 1                                   // DW_CHILDREN_yes
+.b8 37                                  // DW_AT_producer
+.b8 8                                   // DW_FORM_string
+.b8 19                                  // DW_AT_language
+.b8 5                                   // DW_FORM_data2
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 16                                  // DW_AT_stmt_list
+.b8 6                                   // DW_FORM_data4
+.b8 27                                  // DW_AT_comp_dir
+.b8 8                                   // DW_FORM_string
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 2                                   // Abbreviation Code
+.b8 46                                  // DW_TAG_subprogram
+.b8 0                                   // DW_CHILDREN_no
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 32                                  // DW_AT_inline
+.b8 11                                  // DW_FORM_data1
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 3                                   // Abbreviation Code
+.b8 46                                  // DW_TAG_subprogram
+.b8 1                                   // DW_CHILDREN_yes
+.b8 17                                  // DW_AT_low_pc
+.b8 1                                   // DW_FORM_addr
+.b8 18                                  // DW_AT_high_pc
+.b8 1                                   // DW_FORM_addr
+.b8 49                                  // DW_AT_abstract_origin
+.b8 19                                  // DW_FORM_ref4
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 4                                   // Abbreviation Code
+.b8 29                                  // DW_TAG_inlined_subroutine
+.b8 0                                   // DW_CHILDREN_no
+.b8 49                                  // DW_AT_abstract_origin
+.b8 19                                  // DW_FORM_ref4
+.b8 17                                  // DW_AT_low_pc
+.b8 1                                   // DW_FORM_addr
+.b8 18                                  // DW_AT_high_pc
+.b8 1                                   // DW_FORM_addr
+.b8 88                                  // DW_AT_call_file
+.b8 11                                  // DW_FORM_data1
+.b8 89                                  // DW_AT_call_line
+.b8 11                                  // DW_FORM_data1
+.b8 87                                  // DW_AT_call_column
+.b8 11                                  // DW_FORM_data1
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 0                                   // EOM(3)
+	}
+	.section	.debug_info
+	{
+.b32 281                                // Length of Unit
+.b8 2                                   // DWARF version number
+.b8 0
+.b32 .debug_abbrev                      // Offset Into Abbrev. Section
+.b8 8                                   // Address Size (in bytes)
+.b8 1                                   // Abbrev [1] 0xb:0x112 DW_TAG_compile_unit
+.b8 116                                 // DW_AT_producer
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 0
+.b8 2                                   // DW_AT_language
+.b8 0
+.b8 99                                  // DW_AT_name
+.b8 118
+.b8 114
+.b8 104
+.b8 110
+.b8 114
+.b8 109
+.b8 112
+.b8 103
+.b8 121
+.b8 120
+.b8 119
+.b8 117
+.b8 51
+.b8 52
+.b8 120
+.b8 108
+.b8 101
+.b8 99
+.b8 108
+.b8 101
+.b8 101
+.b8 51
+.b8 116
+.b8 116
+.b8 52
+.b8 107
+.b8 101
+.b8 109
+.b8 111
+.b8 108
+.b8 100
+.b8 107
+.b8 106
+.b8 55
+.b8 105
+.b8 97
+.b8 109
+.b8 52
+.b8 117
+.b8 99
+.b8 105
+.b8 97
+.b8 116
+.b8 104
+.b8 111
+.b8 109
+.b8 105
+.b8 114
+.b8 118
+.b8 108
+.b8 99
+.b8 46
+.b8 112
+.b8 121
+.b8 0
+.b32 .debug_line                        // DW_AT_stmt_list
+.b8 47                                  // DW_AT_comp_dir
+.b8 119
+.b8 111
+.b8 114
+.b8 107
+.b8 115
+.b8 112
+.b8 97
+.b8 99
+.b8 101
+.b8 47
+.b8 104
+.b8 97
+.b8 110
+.b8 114
+.b8 117
+.b8 105
+.b8 47
+.b8 83
+.b8 112
+.b8 101
+.b8 99
+.b8 70
+.b8 111
+.b8 114
+.b8 103
+.b8 101
+.b8 45
+.b8 101
+.b8 120
+.b8 116
+.b8 47
+.b8 99
+.b8 97
+.b8 99
+.b8 104
+.b8 101
+.b8 47
+.b8 99
+.b8 111
+.b8 109
+.b8 112
+.b8 105
+.b8 108
+.b8 101
+.b8 100
+.b8 95
+.b8 107
+.b8 101
+.b8 114
+.b8 110
+.b8 101
+.b8 108
+.b8 115
+.b8 47
+.b8 118
+.b8 114
+.b8 0
+.b8 2                                   // Abbrev [2] 0x8b:0x63 DW_TAG_subprogram
+.b8 116                                 // DW_AT_name
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 95
+.b8 114
+.b8 101
+.b8 100
+.b8 95
+.b8 102
+.b8 117
+.b8 115
+.b8 101
+.b8 100
+.b8 95
+.b8 95
+.b8 116
+.b8 111
+.b8 95
+.b8 99
+.b8 111
+.b8 112
+.b8 121
+.b8 95
+.b8 97
+.b8 114
+.b8 97
+.b8 110
+.b8 103
+.b8 101
+.b8 95
+.b8 105
+.b8 110
+.b8 100
+.b8 101
+.b8 120
+.b8 95
+.b8 112
+.b8 117
+.b8 116
+.b8 95
+.b8 108
+.b8 116
+.b8 95
+.b8 110
+.b8 101
+.b8 119
+.b8 95
+.b8 122
+.b8 101
+.b8 114
+.b8 111
+.b8 115
+.b8 95
+.b8 115
+.b8 99
+.b8 97
+.b8 108
+.b8 97
+.b8 114
+.b8 95
+.b8 116
+.b8 101
+.b8 110
+.b8 115
+.b8 111
+.b8 114
+.b8 95
+.b8 115
+.b8 117
+.b8 109
+.b8 95
+.b8 117
+.b8 110
+.b8 115
+.b8 113
+.b8 117
+.b8 101
+.b8 101
+.b8 122
+.b8 101
+.b8 95
+.b8 118
+.b8 105
+.b8 101
+.b8 119
+.b8 95
+.b8 119
+.b8 104
+.b8 101
+.b8 114
+.b8 101
+.b8 95
+.b8 50
+.b8 0
+.b8 1                                   // DW_AT_inline
+.b8 3                                   // Abbrev [3] 0xee:0x2e DW_TAG_subprogram
+.b64 $L__func_begin0                    // DW_AT_low_pc
+.b64 $L__func_end0                      // DW_AT_high_pc
+.b32 139                                // DW_AT_abstract_origin
+.b8 4                                   // Abbrev [4] 0x103:0x18 DW_TAG_inlined_subroutine
+.b32 139                                // DW_AT_abstract_origin
+.b64 $L__tmp1                           // DW_AT_low_pc
+.b64 $L__tmp2                           // DW_AT_high_pc
+.b8 1                                   // DW_AT_call_file
+.b8 40                                  // DW_AT_call_line
+.b8 25                                  // DW_AT_call_column
+.b8 0                                   // End Of Children Mark
+.b8 0                                   // End Of Children Mark
+	}
+	.section	.debug_macinfo	{	}

	@@ -0,0 +1,379 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":18:0)
+#loc77 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":285:0)
+#loc79 = loc(unknown)
+#loc82 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":260:0)
+#loc86 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":69:0)
+#loc97 = loc("in_ptr0"(#loc))
+#loc98 = loc("in_ptr1"(#loc))
+#loc99 = loc("out_ptr1"(#loc))
+#loc100 = loc("out_ptr2"(#loc))
+#loc101 = loc("out_ptr3"(#loc))
+#loc102 = loc("ks0"(#loc))
+#loc103 = loc("ks1"(#loc))
+#loc104 = loc("xnumel"(#loc))
+#loc105 = loc("r0_numel"(#loc))
+#loc151 = loc("input"(#loc77))
+#loc152 = loc("a"(#loc82))
+#loc153 = loc("b"(#loc82))
+#loc154 = loc("a"(#loc86))
+module {
+  tt.func public @triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2(%in_ptr0: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %in_ptr1: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr1"(#loc)), %out_ptr1: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr1"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %out_ptr3: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr3"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %xnumel_0 = arith.constant 32 : i32 loc(#loc106)
+    %xoffset = tt.get_program_id x : i32 loc(#loc107)
+    %xoffset_1 = arith.constant 1 : i32 loc(#loc108)
+    %xoffset_2 = arith.constant 1 : i32 loc(#loc108)
+    %xoffset_3 = arith.muli %xoffset, %xoffset_2 : i32 loc(#loc108)
+    %xindex = tt.make_range {end = 1 : i32, start = 0 : i32} : tensor<1xi32> loc(#loc109)
+    %xindex_4 = tt.expand_dims %xindex {axis = 1 : i32} : tensor<1xi32> -> tensor<1x1xi32> loc(#loc110)
+    %xindex_5 = tt.splat %xoffset_3 : i32 -> tensor<1x1xi32> loc(#loc111)
+    %xindex_6 = arith.addi %xindex_5, %xindex_4 : tensor<1x1xi32> loc(#loc111)
+    %xmask = arith.constant dense<32> : tensor<1x1xi32> loc(#loc112)
+    %xmask_7 = arith.cmpi slt, %xindex_6, %xmask : tensor<1x1xi32> loc(#loc112)
+    %r0_base = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> loc(#loc113)
+    %r0_base_8 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> loc(#loc114)
+    %_tmp3 = arith.constant 0 : i64 loc(#loc115)
+    %_tmp3_9 = arith.constant dense<0> : tensor<1x32xi64> loc(#loc115)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc11)
+    %c32_i32 = arith.constant 32 : i32 loc(#loc11)
+    %0 = arith.bitcast %c0_i32 : i32 to i32 loc(#loc11)
+    %1 = arith.bitcast %r0_numel : i32 to i32 loc(#loc11)
+    %2 = arith.bitcast %c32_i32 : i32 to i32 loc(#loc11)
+    %3 = ub.poison : i32 loc(#loc11)
+    %_tmp3_10 = scf.for %r0_offset = %0 to %1 step %2 iter_args(%_tmp3_14 = %_tmp3_9) -> (tensor<1x32xi64>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x32xi32> loc(#loc117)
+      %r0_index_15 = arith.addi %r0_index, %r0_base_8 : tensor<1x32xi32> loc(#loc117)
+      %r0_mask = tt.splat %r0_numel : i32 -> tensor<1x32xi32> loc(#loc118)
+      %r0_mask_16 = arith.cmpi slt, %r0_index_15, %r0_mask : tensor<1x32xi32> loc(#loc118)
+      %tmp0 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc119)
+      %tmp0_17 = tt.splat %ks0 : i64 -> tensor<1x1xi64> loc(#loc119)
+      %tmp0_18 = arith.muli %tmp0_17, %tmp0 : tensor<1x1xi64> loc(#loc119)
+      %tmp0_19 = arith.extsi %r0_index_15 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc120)
+      %tmp0_20 = tt.broadcast %tmp0_18 : tensor<1x1xi64> -> tensor<1x32xi64> loc(#loc120)
+      %tmp0_21 = arith.addi %tmp0_19, %tmp0_20 : tensor<1x32xi64> loc(#loc120)
+      %tmp0_22 = tt.splat %in_ptr0 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>> loc(#loc121)
+      %tmp0_23 = tt.addptr %tmp0_22, %tmp0_21 : tensor<1x32x!tt.ptr<i32>>, tensor<1x32xi64> loc(#loc121)
+      %tmp0_24 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x32xi1> loc(#loc122)
+      %tmp0_25 = arith.andi %r0_mask_16, %tmp0_24 : tensor<1x32xi1> loc(#loc122)
+      %tmp0_26 = arith.constant 0.000000e+00 : f32 loc(#loc123)
+      %tmp0_27 = arith.constant dense<0.000000e+00> : tensor<1x32xf32> loc(#loc123)
+      %tmp0_28 = arith.fptosi %tmp0_27 : tensor<1x32xf32> to tensor<1x32xi32> loc(#loc123)
+      %tmp0_29 = tt.load %tmp0_23, %tmp0_25, %tmp0_28 evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i32>> loc(#loc123)
+      %tmp1 = arith.extsi %tmp0_29 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc124)
+      %tmp4 = arith.addi %_tmp3_14, %tmp1 : tensor<1x32xi64> loc(#loc125)
+      %_tmp3_30 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x32xi1> loc(#loc126)
+      %_tmp3_31 = arith.andi %r0_mask_16, %_tmp3_30 : tensor<1x32xi1> loc(#loc126)
+      %_tmp3_32 = arith.select %_tmp3_31, %tmp4, %_tmp3_14 : tensor<1x32xi1>, tensor<1x32xi64> loc(#loc127)
+      scf.yield %_tmp3_32 : tensor<1x32xi64> loc(#loc23)
+    } loc(#loc116)
+    %tmp3 = tt.call @"triton.language.standard.sum__i64S1_32S__(1,)cconstexpr_1__(2,)cconstexpr_False__(3,)cNone"(%_tmp3_10) : (tensor<1x32xi64>) -> tensor<1xi64> loc(#loc128)
+    %tmp3_11 = tt.expand_dims %tmp3 {axis = 1 : i32} : tensor<1xi64> -> tensor<1x1xi64> loc(#loc129)
+    %tmp5 = arith.trunci %tmp3_11 : tensor<1x1xi64> to tensor<1x1xi32> loc(#loc130)
+    %4 = tt.splat %out_ptr1 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>> loc(#loc27)
+    %5 = tt.addptr %4, %xindex_6 : tensor<1x1x!tt.ptr<i32>>, tensor<1x1xi32> loc(#loc27)
+    tt.store %5, %tmp5, %xmask_7 : tensor<1x1x!tt.ptr<i32>> loc(#loc28)
+    %c0_i32_12 = arith.constant 0 : i32 loc(#loc29)
+    %c32_i32_13 = arith.constant 32 : i32 loc(#loc29)
+    %6 = arith.bitcast %c0_i32_12 : i32 to i32 loc(#loc29)
+    %7 = arith.bitcast %r0_numel : i32 to i32 loc(#loc29)
+    %8 = arith.bitcast %c32_i32_13 : i32 to i32 loc(#loc29)
+    %9 = ub.poison : i32 loc(#loc29)
+    scf.for %r0_offset = %6 to %7 step %8  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x32xi32> loc(#loc131)
+      %r0_index_14 = arith.addi %r0_index, %r0_base_8 : tensor<1x32xi32> loc(#loc131)
+      %r0_mask = tt.splat %r0_numel : i32 -> tensor<1x32xi32> loc(#loc132)
+      %r0_mask_15 = arith.cmpi slt, %r0_index_14, %r0_mask : tensor<1x32xi32> loc(#loc132)
+      %tmp6 = arith.constant 1 : i32 loc(#loc133)
+      %tmp6_16 = arith.extsi %tmp6 : i32 to i64 loc(#loc133)
+      %tmp6_17 = arith.cmpi sge, %tmp6_16, %ks0 : i64 loc(#loc133)
+      %tmp6_18 = arith.constant 1 : i32 loc(#loc134)
+      %tmp6_19 = arith.constant 1 : i32 loc(#loc134)
+      %tmp6_20 = arith.extui %tmp6_17 : i1 to i32 loc(#loc134)
+      %tmp6_21 = arith.muli %tmp6_19, %tmp6_20 : i32 loc(#loc134)
+      %tmp6_22 = arith.constant 1 : i32 loc(#loc135)
+      %tmp6_23 = arith.extsi %tmp6_22 : i32 to i64 loc(#loc135)
+      %tmp6_24 = arith.cmpi sgt, %ks0, %tmp6_23 : i64 loc(#loc135)
+      %tmp6_25 = arith.extui %tmp6_24 : i1 to i64 loc(#loc136)
+      %tmp6_26 = arith.muli %ks0, %tmp6_25 : i64 loc(#loc136)
+      %tmp6_27 = arith.extsi %tmp6_21 : i32 to i64 loc(#loc137)
+      %tmp6_28 = arith.addi %tmp6_27, %tmp6_26 : i64 loc(#loc137)
+      %tmp6_29 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc138)
+      %tmp6_30 = tt.splat %tmp6_28 : i64 -> tensor<1x1xi64> loc(#loc138)
+      %tmp6_31 = arith.muli %tmp6_29, %tmp6_30 : tensor<1x1xi64> loc(#loc138)
+      %tmp6_32 = arith.extsi %r0_index_14 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc139)
+      %tmp6_33 = tt.broadcast %tmp6_31 : tensor<1x1xi64> -> tensor<1x32xi64> loc(#loc139)
+      %tmp6_34 = arith.addi %tmp6_32, %tmp6_33 : tensor<1x32xi64> loc(#loc139)
+      %tmp6_35 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1x32x!tt.ptr<i64>> loc(#loc140)
+      %tmp6_36 = tt.addptr %tmp6_35, %tmp6_34 : tensor<1x32x!tt.ptr<i64>>, tensor<1x32xi64> loc(#loc140)
+      %tmp6_37 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x32xi1> loc(#loc141)
+      %tmp6_38 = arith.andi %r0_mask_15, %tmp6_37 : tensor<1x32xi1> loc(#loc141)
+      %tmp6_39 = arith.constant 0.000000e+00 : f32 loc(#loc142)
+      %tmp6_40 = arith.constant dense<0.000000e+00> : tensor<1x32xf32> loc(#loc142)
+      %tmp6_41 = arith.fptosi %tmp6_40 : tensor<1x32xf32> to tensor<1x32xi64> loc(#loc142)
+      %tmp6_42 = tt.load %tmp6_36, %tmp6_38, %tmp6_41 evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i64>> loc(#loc142)
+      %tmp7 = arith.trunci %tmp6_42 : tensor<1x32xi64> to tensor<1x32xi32> loc(#loc143)
+      %tmp9 = tt.broadcast %tmp5 : tensor<1x1xi32> -> tensor<1x32xi32> loc(#loc144)
+      %tmp9_43 = arith.cmpi slt, %r0_index_14, %tmp9 : tensor<1x32xi32> loc(#loc144)
+      %tmp11 = arith.extsi %tmp7 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc145)
+      %tmp11_44 = tt.splat %ks0 : i64 -> tensor<1x32xi64> loc(#loc145)
+      %tmp11_45 = arith.select %tmp9_43, %tmp11, %tmp11_44 : tensor<1x32xi1>, tensor<1x32xi64> loc(#loc145)
+      %tmp12 = arith.constant 1 : i32 loc(#loc146)
+      %tmp12_46 = arith.constant 1 : i64 loc(#loc146)
+      %tmp12_47 = arith.addi %tmp12_46, %ks0 : i64 loc(#loc146)
+      %tmp13 = tt.splat %tmp12_47 : i64 -> tensor<1x32xi64> loc(#loc147)
+      %tmp13_48 = arith.addi %tmp11_45, %tmp13 : tensor<1x32xi64> loc(#loc147)
+      %tmp14 = arith.constant 0 : i32 loc(#loc148)
+      %tmp14_49 = arith.extsi %tmp14 : i32 to i64 loc(#loc148)
+      %tmp14_50 = tt.splat %tmp14_49 : i64 -> tensor<1x32xi64> loc(#loc148)
+      %tmp14_51 = arith.cmpi slt, %tmp11_45, %tmp14_50 : tensor<1x32xi64> loc(#loc148)
+      %tmp15 = arith.select %tmp14_51, %tmp13_48, %tmp11_45 : tensor<1x32xi1>, tensor<1x32xi64> loc(#loc149)
+      %c0_i32_52 = arith.constant 0 : i32 loc(#loc49)
+      %10 = arith.extsi %c0_i32_52 : i32 to i64 loc(#loc49)
+      %11 = tt.splat %10 : i64 -> tensor<1x32xi64> loc(#loc49)
+      %12 = arith.cmpi sle, %11, %tmp15 : tensor<1x32xi64> loc(#loc49)
+      %c127_i32 = arith.constant 127 : i32 loc(#loc50)
+      %c127_i64 = arith.constant 127 : i64 loc(#loc50)
+      %13 = arith.addi %c127_i64, %ks1 : i64 loc(#loc50)
+      %14 = tt.call @"torch._inductor.runtime.triton_helpers.div_floor_integer__i64__(1,)cconstexpr_128_"(%13) : (i64) -> i64 loc(#loc51)
+      %c1_i32 = arith.constant 1 : i32 loc(#loc52)
+      %c1_i64 = arith.constant 1 : i64 loc(#loc52)
+      %15 = arith.addi %c1_i64, %14 : i64 loc(#loc52)
+      %16 = tt.splat %15 : i64 -> tensor<1x32xi64> loc(#loc53)
+      %17 = arith.cmpi slt, %tmp15, %16 : tensor<1x32xi64> loc(#loc53)
+      %18 = arith.andi %12, %17 : tensor<1x32xi1> loc(#loc54)
+      %19 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x32xi1> loc(#loc55)
+      %20 = arith.andi %r0_mask_15, %19 : tensor<1x32xi1> loc(#loc55)
+      %true = arith.constant true loc(#loc56)
+      %cst = arith.constant dense<true> : tensor<1x32xi1> loc(#loc56)
+      %21 = arith.xori %20, %cst : tensor<1x32xi1> loc(#loc56)
+      %22 = arith.ori %18, %21 : tensor<1x32xi1> loc(#loc57)
+      tt.assert %22, "index out of bounds: 0 <= tmp15 < 1 + (triton_helpers.div_floor_integer(127 + ks1,  128))" : tensor<1x32xi1> loc(#loc58)
+      %tmp17 = arith.constant 1 : i32 loc(#loc150)
+      %tmp17_53 = arith.constant dense<1> : tensor<1x1xi32> loc(#loc150)
+      %c1_i32_54 = arith.constant 1 : i32 loc(#loc60)
+      %23 = arith.extsi %c1_i32_54 : i32 to i64 loc(#loc60)
+      %24 = arith.cmpi sge, %23, %ks0 : i64 loc(#loc60)
+      %c1_i32_55 = arith.constant 1 : i32 loc(#loc61)
+      %c1_i32_56 = arith.constant 1 : i32 loc(#loc61)
+      %25 = arith.extui %24 : i1 to i32 loc(#loc61)
+      %26 = arith.muli %c1_i32_56, %25 : i32 loc(#loc61)
+      %c1_i32_57 = arith.constant 1 : i32 loc(#loc62)
+      %27 = arith.extsi %c1_i32_57 : i32 to i64 loc(#loc62)
+      %28 = arith.cmpi sgt, %ks0, %27 : i64 loc(#loc62)
+      %29 = arith.extui %28 : i1 to i64 loc(#loc63)
+      %30 = arith.muli %ks0, %29 : i64 loc(#loc63)
+      %31 = arith.extsi %26 : i32 to i64 loc(#loc64)
+      %32 = arith.addi %31, %30 : i64 loc(#loc64)
+      %33 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc65)
+      %34 = tt.splat %32 : i64 -> tensor<1x1xi64> loc(#loc65)
+      %35 = arith.muli %33, %34 : tensor<1x1xi64> loc(#loc65)
+      %36 = arith.extsi %r0_index_14 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc66)
+      %37 = tt.broadcast %35 : tensor<1x1xi64> -> tensor<1x32xi64> loc(#loc66)
+      %38 = arith.addi %36, %37 : tensor<1x32xi64> loc(#loc66)
+      %39 = tt.splat %out_ptr2 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>> loc(#loc67)
+      %40 = tt.addptr %39, %38 : tensor<1x32x!tt.ptr<i32>>, tensor<1x32xi64> loc(#loc67)
+      %41 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x32xi1> loc(#loc68)
+      %42 = arith.andi %r0_mask_15, %41 : tensor<1x32xi1> loc(#loc68)
+      tt.store %40, %tmp7, %42 : tensor<1x32x!tt.ptr<i32>> loc(#loc69)
+      %43 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc70)
+      %44 = tt.broadcast %43 : tensor<1x1xi64> -> tensor<1x32xi64> loc(#loc70)
+      %45 = arith.addi %tmp15, %44 : tensor<1x32xi64> loc(#loc70)
+      %46 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc71)
+      %47 = tt.splat %ks0 : i64 -> tensor<1x1xi64> loc(#loc71)
+      %48 = arith.muli %47, %46 : tensor<1x1xi64> loc(#loc71)
+      %49 = tt.broadcast %48 : tensor<1x1xi64> -> tensor<1x32xi64> loc(#loc72)
+      %50 = arith.addi %45, %49 : tensor<1x32xi64> loc(#loc72)
+      %51 = tt.splat %out_ptr3 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>> loc(#loc73)
+      %52 = tt.addptr %51, %50 : tensor<1x32x!tt.ptr<i32>>, tensor<1x32xi64> loc(#loc73)
+      %53 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x32xi1> loc(#loc74)
+      %54 = arith.andi %r0_mask_15, %53 : tensor<1x32xi1> loc(#loc74)
+      %cst_58 = arith.constant dense<1> : tensor<1x32xi32> loc(#loc75)
+      tt.store %52, %cst_58, %54 : tensor<1x32x!tt.ptr<i32>> loc(#loc75)
+    } loc(#loc29)
+    tt.return loc(#loc76)
+  } loc(#loc)
+  tt.func private @"triton.language.standard.sum__i64S1_32S__(1,)cconstexpr_1__(2,)cconstexpr_False__(3,)cNone"(%input: tensor<1x32xi64> loc("input"(#loc77))) -> tensor<1xi64> attributes {noinline = false} {
+    %0 = "tt.reduce"(%input) <{axis = 1 : i32}> ({
+    ^bb0(%arg1: i64 loc(unknown), %arg2: i64 loc(unknown)):
+      %2 = tt.call @triton.language.standard._sum_combine__i64_i64__(%arg1, %arg2) : (i64, i64) -> i64 loc(#loc78)
+      tt.reduce.return %2 : i64 loc(#loc78)
+    }) : (tensor<1x32xi64>) -> tensor<1xi64> loc(#loc78)
+    tt.return %0 : tensor<1xi64> loc(#loc80)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : tensor<1xi64> loc(#loc81)
+    tt.return %1 : tensor<1xi64> loc(#loc81)
+  } loc(#loc77)
+  tt.func private @triton.language.standard._sum_combine__i64_i64__(%a: i64 loc("a"(#loc82)), %b: i64 loc("b"(#loc82))) -> i64 attributes {noinline = false} {
+    %0 = arith.addi %a, %b : i64 loc(#loc83)
+    tt.return %0 : i64 loc(#loc84)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : i64 loc(#loc85)
+    tt.return %1 : i64 loc(#loc85)
+  } loc(#loc82)
+  tt.func private @"torch._inductor.runtime.triton_helpers.div_floor_integer__i64__(1,)cconstexpr_128_"(%a: i64 loc("a"(#loc86))) -> i64 attributes {noinline = false} {
+    %quot = arith.constant 128 : i32 loc(#loc155)
+    %quot_0 = arith.constant 128 : i64 loc(#loc155)
+    %quot_1 = arith.divsi %a, %quot_0 : i64 loc(#loc155)
+    %remainder = arith.constant 128 : i32 loc(#loc156)
+    %remainder_2 = arith.constant 128 : i64 loc(#loc156)
+    %remainder_3 = arith.remsi %a, %remainder_2 : i64 loc(#loc156)
+    %fixed = arith.constant 0 : i32 loc(#loc157)
+    %fixed_4 = arith.extsi %fixed : i32 to i64 loc(#loc157)
+    %fixed_5 = arith.cmpi ne, %remainder_3, %fixed_4 : i64 loc(#loc157)
+    %fixed_6 = arith.constant 1 : i32 loc(#loc158)
+    %fixed_7 = arith.constant 1 : i64 loc(#loc158)
+    %fixed_8 = arith.subi %quot_1, %fixed_7 : i64 loc(#loc158)
+    %fixed_9 = arith.select %fixed_5, %fixed_8, %quot_1 : i64 loc(#loc159)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc92)
+    %0 = arith.extsi %c0_i32 : i32 to i64 loc(#loc92)
+    %1 = arith.cmpi slt, %a, %0 : i64 loc(#loc92)
+    %false = arith.constant false loc(#loc93)
+    %2 = arith.cmpi ne, %1, %false : i1 loc(#loc93)
+    %3 = arith.select %2, %fixed_9, %quot_1 : i64 loc(#loc94)
+    tt.return %3 : i64 loc(#loc95)
+  ^bb1:  // no predecessors
+    %4 = ub.poison : i64 loc(#loc96)
+    tt.return %4 : i64 loc(#loc96)
+  } loc(#loc86)
+} loc(#loc)
+#loc1 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":19:13)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":22:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":22:33)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":23:36)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":23:44)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":23:23)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":24:21)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":25:27)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":25:37)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":28:43)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":29:40)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":30:31)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":31:29)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:45)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:41)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:34)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:60)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:50)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":36:23)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":38:23)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:35)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:48)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:8)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":40:25)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":40:28)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":41:19)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":42:25)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":42:36)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":43:40)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":44:31)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":45:29)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:60)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:52)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:86)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:77)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:68)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:45)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:41)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:34)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:103)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:93)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":50:23)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":52:22)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":54:37)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":55:20)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":56:24)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":57:24)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":58:39)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:32)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:94)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:100)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:55)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:50)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:42)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:122)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:112)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:110)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:130)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":60:35)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:55)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:47)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:81)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:72)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:63)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:40)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:36)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:29)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:104)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:94)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:53)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:62)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:58)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:29)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:105)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:95)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":43:4)
+#loc78 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc80 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:11)
+#loc81 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:4)
+#loc83 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc84 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:11)
+#loc85 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:4)
+#loc87 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":72:16)
+#loc88 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":73:20)
+#loc89 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:34)
+#loc90 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:44)
+#loc91 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:47)
+#loc92 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:25)
+#loc93 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:32)
+#loc94 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:47)
+#loc95 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:11)
+#loc96 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:4)
+#loc106 = loc("xnumel"(#loc1))
+#loc107 = loc("xoffset"(#loc2))
+#loc108 = loc("xoffset"(#loc3))
+#loc109 = loc("xindex"(#loc4))
+#loc110 = loc("xindex"(#loc5))
+#loc111 = loc("xindex"(#loc6))
+#loc112 = loc("xmask"(#loc7))
+#loc113 = loc("r0_base"(#loc8))
+#loc114 = loc("r0_base"(#loc9))
+#loc115 = loc("_tmp3"(#loc10))
+#loc116 = loc("_tmp3"(#loc11))
+#loc117 = loc("r0_index"(#loc12))
+#loc118 = loc("r0_mask"(#loc13))
+#loc119 = loc("tmp0"(#loc14))
+#loc120 = loc("tmp0"(#loc15))
+#loc121 = loc("tmp0"(#loc16))
+#loc122 = loc("tmp0"(#loc17))
+#loc123 = loc("tmp0"(#loc18))
+#loc124 = loc("tmp1"(#loc19))
+#loc125 = loc("tmp4"(#loc20))
+#loc126 = loc("_tmp3"(#loc21))
+#loc127 = loc("_tmp3"(#loc22))
+#loc128 = loc("tmp3"(#loc24))
+#loc129 = loc("tmp3"(#loc25))
+#loc130 = loc("tmp5"(#loc26))
+#loc131 = loc("r0_index"(#loc30))
+#loc132 = loc("r0_mask"(#loc31))
+#loc133 = loc("tmp6"(#loc32))
+#loc134 = loc("tmp6"(#loc33))
+#loc135 = loc("tmp6"(#loc34))
+#loc136 = loc("tmp6"(#loc35))
+#loc137 = loc("tmp6"(#loc36))
+#loc138 = loc("tmp6"(#loc37))
+#loc139 = loc("tmp6"(#loc38))
+#loc140 = loc("tmp6"(#loc39))
+#loc141 = loc("tmp6"(#loc40))
+#loc142 = loc("tmp6"(#loc41))
+#loc143 = loc("tmp7"(#loc42))
+#loc144 = loc("tmp9"(#loc43))
+#loc145 = loc("tmp11"(#loc44))
+#loc146 = loc("tmp12"(#loc45))
+#loc147 = loc("tmp13"(#loc46))
+#loc148 = loc("tmp14"(#loc47))
+#loc149 = loc("tmp15"(#loc48))
+#loc150 = loc("tmp17"(#loc59))
+#loc155 = loc("quot"(#loc87))
+#loc156 = loc("remainder"(#loc88))
+#loc157 = loc("fixed"(#loc89))
+#loc158 = loc("fixed"(#loc90))
+#loc159 = loc("fixed"(#loc91))

	@@ -0,0 +1,270 @@

+#blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 32], warpsPerCTA = [1, 2], order = [0, 1]}>
+#blocked1 = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 32], warpsPerCTA = [2, 1], order = [1, 0]}>
+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":18:0)
+#loc1 = loc(unknown)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":40:25)
+#loc68 = loc("in_ptr0"(#loc))
+#loc69 = loc("in_ptr1"(#loc))
+#loc70 = loc("out_ptr1"(#loc))
+#loc71 = loc("out_ptr2"(#loc))
+#loc72 = loc("out_ptr3"(#loc))
+#loc73 = loc("ks0"(#loc))
+#loc74 = loc("ks1"(#loc))
+#loc75 = loc("xnumel"(#loc))
+#loc76 = loc("r0_numel"(#loc))
+#loc91 = loc("tmp3"(#loc18))
+#loc124 = loc(callsite(#loc1 at #loc91))
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 2 : i32, ttg.target = "cuda:90", "ttg.threads-per-warp" = 32 : i32} {
+  tt.func public @triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2(%in_ptr0: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %in_ptr1: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr1"(#loc)), %out_ptr1: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr1"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %out_ptr3: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr3"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<0> : tensor<1x32xi64, #blocked> loc(#loc1)
+    %cst_0 = arith.constant dense<0> : tensor<1x32xi64, #blocked1> loc(#loc1)
+    %c1_i64 = arith.constant 1 : i64 loc(#loc1)
+    %c127_i64 = arith.constant 127 : i64 loc(#loc1)
+    %cst_1 = arith.constant dense<true> : tensor<1x32xi1, #blocked1> loc(#loc1)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc1)
+    %c32_i32 = arith.constant 32 : i32 loc(#loc1)
+    %cst_2 = arith.constant dense<0> : tensor<1x32xi32, #blocked1> loc(#loc1)
+    %c0_i64 = arith.constant 0 : i64 loc(#loc1)
+    %c128_i64 = arith.constant 128 : i64 loc(#loc1)
+    %cst_3 = arith.constant dense<1> : tensor<1x32xi32, #blocked> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc77)
+    %xmask = arith.cmpi slt, %xoffset, %c32_i32 : i32 loc(#loc78)
+    %r0_base = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked}>> loc(#loc79)
+    %r0_base_4 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> loc(#loc79)
+    %r0_base_5 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x32xi32, #blocked> loc(#loc79)
+    %r0_base_6 = tt.expand_dims %r0_base_4 {axis = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x32xi32, #blocked1> loc(#loc79)
+    %r0_mask = tt.splat %r0_numel : i32 -> tensor<1x32xi32, #blocked1> loc(#loc80)
+    %tmp0 = arith.extsi %xoffset : i32 to i64 loc(#loc81)
+    %tmp0_7 = arith.muli %ks0, %tmp0 : i64 loc(#loc81)
+    %tmp0_8 = tt.splat %tmp0_7 : i64 -> tensor<1x32xi64, #blocked1> loc(#loc121)
+    %tmp0_9 = tt.splat %in_ptr0 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>, #blocked1> loc(#loc83)
+    %tmp0_10 = tt.splat %xmask : i1 -> tensor<1x32xi1, #blocked1> loc(#loc122)
+    %_tmp3 = scf.for %r0_offset = %c0_i32 to %r0_numel step %c32_i32 iter_args(%_tmp3_31 = %cst_0) -> (tensor<1x32xi64, #blocked1>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x32xi32, #blocked1> loc(#loc86)
+      %r0_index_32 = arith.addi %r0_index, %r0_base_6 : tensor<1x32xi32, #blocked1> loc(#loc86)
+      %r0_mask_33 = arith.cmpi slt, %r0_index_32, %r0_mask : tensor<1x32xi32, #blocked1> loc(#loc80)
+      %tmp0_34 = arith.extsi %r0_index_32 : tensor<1x32xi32, #blocked1> to tensor<1x32xi64, #blocked1> loc(#loc82)
+      %tmp0_35 = arith.addi %tmp0_34, %tmp0_8 : tensor<1x32xi64, #blocked1> loc(#loc82)
+      %tmp0_36 = tt.addptr %tmp0_9, %tmp0_35 : tensor<1x32x!tt.ptr<i32>, #blocked1>, tensor<1x32xi64, #blocked1> loc(#loc83)
+      %tmp0_37 = arith.andi %r0_mask_33, %tmp0_10 : tensor<1x32xi1, #blocked1> loc(#loc84)
+      %tmp0_38 = tt.load %tmp0_36, %tmp0_37, %cst_2 evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i32>, #blocked1> loc(#loc87)
+      %tmp1 = arith.extsi %tmp0_38 : tensor<1x32xi32, #blocked1> to tensor<1x32xi64, #blocked1> loc(#loc88)
+      %tmp4 = arith.addi %_tmp3_31, %tmp1 : tensor<1x32xi64, #blocked1> loc(#loc89)
+      %_tmp3_39 = arith.select %tmp0_37, %tmp4, %_tmp3_31 : tensor<1x32xi1, #blocked1>, tensor<1x32xi64, #blocked1> loc(#loc90)
+      scf.yield %_tmp3_39 : tensor<1x32xi64, #blocked1> loc(#loc16)
+    } loc(#loc85)
+    %tmp3 = "tt.reduce"(%_tmp3) <{axis = 1 : i32}> ({
+    ^bb0(%tmp3_31: i64 loc(callsite(#loc1 at #loc91)), %tmp3_32: i64 loc(callsite(#loc1 at #loc91))):
+      %tmp3_33 = arith.addi %tmp3_31, %tmp3_32 : i64 loc(#loc133)
+      tt.reduce.return %tmp3_33 : i64 loc(#loc123)
+    }) : (tensor<1x32xi64, #blocked1>) -> tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> loc(#loc123)
+    %0 = ttg.convert_layout %tmp3 : tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> loc(#loc20)
+    %tmp3_11 = tt.expand_dims %0 {axis = 1 : i32} : tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1xi64, #blocked> loc(#loc92)
+    %tmp3_12 = tt.expand_dims %tmp3 {axis = 1 : i32} : tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<1x1xi64, #blocked1> loc(#loc92)
+    %tmp5 = arith.trunci %tmp3_11 : tensor<1x1xi64, #blocked> to tensor<1x1xi32, #blocked> loc(#loc93)
+    %tmp5_13 = arith.trunci %tmp3_12 : tensor<1x1xi64, #blocked1> to tensor<1x1xi32, #blocked1> loc(#loc93)
+    %1 = tt.addptr %out_ptr1, %xoffset : !tt.ptr<i32>, i32 loc(#loc23)
+    %2 = tt.splat %1 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>, #blocked> loc(#loc24)
+    %3 = tt.splat %xmask : i1 -> tensor<1x1xi1, #blocked> loc(#loc24)
+    tt.store %2, %tmp5, %3 : tensor<1x1x!tt.ptr<i32>, #blocked> loc(#loc24)
+    %r0_mask_14 = tt.splat %r0_numel : i32 -> tensor<1x32xi32, #blocked> loc(#loc94)
+    %tmp6 = arith.cmpi sle, %ks0, %c1_i64 : i64 loc(#loc95)
+    %tmp6_15 = arith.cmpi sgt, %ks0, %c1_i64 : i64 loc(#loc96)
+    %tmp6_16 = arith.extui %tmp6_15 : i1 to i64 loc(#loc97)
+    %tmp6_17 = arith.muli %ks0, %tmp6_16 : i64 loc(#loc97)
+    %tmp6_18 = arith.extui %tmp6 : i1 to i64 loc(#loc125)
+    %tmp6_19 = arith.addi %tmp6_18, %tmp6_17 : i64 loc(#loc98)
+    %tmp6_20 = arith.muli %tmp0, %tmp6_19 : i64 loc(#loc100)
+    %tmp6_21 = tt.splat %tmp6_20 : i64 -> tensor<1x32xi64, #blocked> loc(#loc126)
+    %tmp6_22 = tt.splat %tmp6_20 : i64 -> tensor<1x32xi64, #blocked1> loc(#loc126)
+    %tmp6_23 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1x32x!tt.ptr<i64>, #blocked> loc(#loc102)
+    %tmp6_24 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1x32x!tt.ptr<i64>, #blocked1> loc(#loc102)
+    %tmp6_25 = tt.splat %xmask : i1 -> tensor<1x32xi1, #blocked> loc(#loc127)
+    %tmp9 = tt.broadcast %tmp5 : tensor<1x1xi32, #blocked> -> tensor<1x32xi32, #blocked> loc(#loc104)
+    %tmp9_26 = tt.broadcast %tmp5_13 : tensor<1x1xi32, #blocked1> -> tensor<1x32xi32, #blocked1> loc(#loc104)
+    %tmp11 = tt.splat %ks0 : i64 -> tensor<1x32xi64, #blocked> loc(#loc105)
+    %tmp11_27 = tt.splat %ks0 : i64 -> tensor<1x32xi64, #blocked1> loc(#loc105)
+    %tmp12 = arith.addi %ks0, %c1_i64 : i64 loc(#loc106)
+    %tmp13 = tt.splat %tmp12 : i64 -> tensor<1x32xi64, #blocked> loc(#loc107)
+    %tmp13_28 = tt.splat %tmp12 : i64 -> tensor<1x32xi64, #blocked1> loc(#loc107)
+    %4 = arith.addi %ks1, %c127_i64 : i64 loc(#loc39)
+    %quot = arith.divsi %4, %c128_i64 : i64 loc(#loc128)
+    %remainder = arith.remsi %4, %c128_i64 : i64 loc(#loc129)
+    %fixed = arith.cmpi ne, %remainder, %c0_i64 : i64 loc(#loc130)
+    %fixed_29 = arith.subi %quot, %c1_i64 : i64 loc(#loc131)
+    %fixed_30 = arith.select %fixed, %fixed_29, %quot : i64 loc(#loc132)
+    %5 = arith.cmpi slt, %4, %c0_i64 : i64 loc(#loc113)
+    %6 = arith.select %5, %fixed_30, %quot : i64 loc(#loc114)
+    %7 = arith.addi %6, %c1_i64 : i64 loc(#loc48)
+    %8 = tt.splat %7 : i64 -> tensor<1x32xi64, #blocked1> loc(#loc49)
+    %9 = tt.splat %out_ptr2 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>, #blocked1> loc(#loc50)
+    %10 = tt.splat %tmp0 : i64 -> tensor<1x32xi64, #blocked> loc(#loc51)
+    %11 = tt.splat %tmp0_7 : i64 -> tensor<1x32xi64, #blocked> loc(#loc115)
+    %12 = tt.splat %out_ptr3 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>, #blocked> loc(#loc54)
+    scf.for %r0_offset = %c0_i32 to %r0_numel step %c32_i32  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x32xi32, #blocked> loc(#loc116)
+      %r0_index_31 = tt.splat %r0_offset : i32 -> tensor<1x32xi32, #blocked1> loc(#loc116)
+      %r0_index_32 = arith.addi %r0_index, %r0_base_5 : tensor<1x32xi32, #blocked> loc(#loc116)
+      %r0_index_33 = arith.addi %r0_index_31, %r0_base_6 : tensor<1x32xi32, #blocked1> loc(#loc116)
+      %r0_mask_34 = arith.cmpi slt, %r0_index_32, %r0_mask_14 : tensor<1x32xi32, #blocked> loc(#loc94)
+      %r0_mask_35 = arith.cmpi slt, %r0_index_33, %r0_mask : tensor<1x32xi32, #blocked1> loc(#loc94)
+      %tmp6_36 = arith.extsi %r0_index_32 : tensor<1x32xi32, #blocked> to tensor<1x32xi64, #blocked> loc(#loc101)
+      %tmp6_37 = arith.extsi %r0_index_33 : tensor<1x32xi32, #blocked1> to tensor<1x32xi64, #blocked1> loc(#loc101)
+      %tmp6_38 = arith.addi %tmp6_36, %tmp6_21 : tensor<1x32xi64, #blocked> loc(#loc101)
+      %tmp6_39 = arith.addi %tmp6_37, %tmp6_22 : tensor<1x32xi64, #blocked1> loc(#loc101)
+      %tmp6_40 = tt.addptr %tmp6_23, %tmp6_38 : tensor<1x32x!tt.ptr<i64>, #blocked>, tensor<1x32xi64, #blocked> loc(#loc102)
+      %tmp6_41 = tt.addptr %tmp6_24, %tmp6_39 : tensor<1x32x!tt.ptr<i64>, #blocked1>, tensor<1x32xi64, #blocked1> loc(#loc102)
+      %tmp6_42 = arith.andi %r0_mask_34, %tmp6_25 : tensor<1x32xi1, #blocked> loc(#loc103)
+      %tmp6_43 = arith.andi %r0_mask_35, %tmp0_10 : tensor<1x32xi1, #blocked1> loc(#loc103)
+      %tmp6_44 = tt.load %tmp6_40, %tmp6_42, %cst evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i64>, #blocked> loc(#loc117)
+      %tmp6_45 = tt.load %tmp6_41, %tmp6_43, %cst_0 evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i64>, #blocked1> loc(#loc117)
+      %tmp7 = arith.trunci %tmp6_44 : tensor<1x32xi64, #blocked> to tensor<1x32xi32, #blocked> loc(#loc118)
+      %tmp7_46 = arith.trunci %tmp6_45 : tensor<1x32xi64, #blocked1> to tensor<1x32xi32, #blocked1> loc(#loc118)
+      %tmp9_47 = arith.cmpi slt, %r0_index_32, %tmp9 : tensor<1x32xi32, #blocked> loc(#loc104)
+      %tmp9_48 = arith.cmpi slt, %r0_index_33, %tmp9_26 : tensor<1x32xi32, #blocked1> loc(#loc104)
+      %tmp11_49 = arith.extsi %tmp7 : tensor<1x32xi32, #blocked> to tensor<1x32xi64, #blocked> loc(#loc105)
+      %tmp11_50 = arith.extsi %tmp7_46 : tensor<1x32xi32, #blocked1> to tensor<1x32xi64, #blocked1> loc(#loc105)
+      %tmp11_51 = arith.select %tmp9_47, %tmp11_49, %tmp11 : tensor<1x32xi1, #blocked>, tensor<1x32xi64, #blocked> loc(#loc105)
+      %tmp11_52 = arith.select %tmp9_48, %tmp11_50, %tmp11_27 : tensor<1x32xi1, #blocked1>, tensor<1x32xi64, #blocked1> loc(#loc105)
+      %tmp13_53 = arith.addi %tmp11_51, %tmp13 : tensor<1x32xi64, #blocked> loc(#loc107)
+      %tmp13_54 = arith.addi %tmp11_52, %tmp13_28 : tensor<1x32xi64, #blocked1> loc(#loc107)
+      %tmp14 = arith.cmpi slt, %tmp11_51, %cst : tensor<1x32xi64, #blocked> loc(#loc119)
+      %tmp14_55 = arith.cmpi slt, %tmp11_52, %cst_0 : tensor<1x32xi64, #blocked1> loc(#loc119)
+      %tmp15 = arith.select %tmp14, %tmp13_53, %tmp11_51 : tensor<1x32xi1, #blocked>, tensor<1x32xi64, #blocked> loc(#loc120)
+      %tmp15_56 = arith.select %tmp14_55, %tmp13_54, %tmp11_52 : tensor<1x32xi1, #blocked1>, tensor<1x32xi64, #blocked1> loc(#loc120)
+      %13 = arith.cmpi sge, %tmp15_56, %cst_0 : tensor<1x32xi64, #blocked1> loc(#loc61)
+      %14 = arith.cmpi slt, %tmp15_56, %8 : tensor<1x32xi64, #blocked1> loc(#loc49)
+      %15 = arith.andi %13, %14 : tensor<1x32xi1, #blocked1> loc(#loc62)
+      %16 = arith.xori %tmp6_43, %cst_1 : tensor<1x32xi1, #blocked1> loc(#loc63)
+      %17 = arith.ori %15, %16 : tensor<1x32xi1, #blocked1> loc(#loc64)
+      tt.assert %17, "index out of bounds: 0 <= tmp15 < 1 + (triton_helpers.div_floor_integer(127 + ks1,  128))" : tensor<1x32xi1, #blocked1> loc(#loc65)
+      %18 = tt.addptr %9, %tmp6_39 : tensor<1x32x!tt.ptr<i32>, #blocked1>, tensor<1x32xi64, #blocked1> loc(#loc50)
+      tt.store %18, %tmp7_46, %tmp6_43 : tensor<1x32x!tt.ptr<i32>, #blocked1> loc(#loc66)
+      %19 = arith.addi %tmp15, %10 : tensor<1x32xi64, #blocked> loc(#loc51)
+      %20 = arith.addi %19, %11 : tensor<1x32xi64, #blocked> loc(#loc52)
+      %21 = tt.addptr %12, %20 : tensor<1x32x!tt.ptr<i32>, #blocked>, tensor<1x32xi64, #blocked> loc(#loc54)
+      tt.store %21, %cst_3, %tmp6_42 : tensor<1x32x!tt.ptr<i32>, #blocked> loc(#loc20)
+    } loc(#loc55)
+    tt.return loc(#loc67)
+  } loc(#loc)
+} loc(#loc)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":22:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":24:21)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":25:37)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":31:29)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:45)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:41)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:34)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:60)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":29:40)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":30:31)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:50)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":36:23)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":38:23)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:48)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:8)
+#loc17 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc19 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:95)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":40:28)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":41:19)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":42:25)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":42:36)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":45:29)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:60)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:86)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:77)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:68)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:52)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:45)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:41)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:34)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:103)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":52:22)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":54:37)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":55:20)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":56:24)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:94)
+#loc40 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":72:16)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:100)
+#loc42 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":73:20)
+#loc43 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:34)
+#loc44 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:44)
+#loc45 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:47)
+#loc46 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:25)
+#loc47 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:47)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:55)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:50)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:29)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:53)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:58)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:62)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:29)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":43:40)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":44:31)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:93)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":50:23)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":57:24)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":58:39)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:32)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:42)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:112)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:110)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:130)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:94)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":43:4)
+#loc77 = loc("xoffset"(#loc2))
+#loc78 = loc("xmask"(#loc3))
+#loc79 = loc("r0_base"(#loc4))
+#loc80 = loc("r0_mask"(#loc5))
+#loc81 = loc("tmp0"(#loc6))
+#loc82 = loc("tmp0"(#loc7))
+#loc83 = loc("tmp0"(#loc8))
+#loc84 = loc("tmp0"(#loc9))
+#loc85 = loc("_tmp3"(#loc10))
+#loc86 = loc("r0_index"(#loc11))
+#loc87 = loc("tmp0"(#loc12))
+#loc88 = loc("tmp1"(#loc13))
+#loc89 = loc("tmp4"(#loc14))
+#loc90 = loc("_tmp3"(#loc15))
+#loc92 = loc("tmp3"(#loc21))
+#loc93 = loc("tmp5"(#loc22))
+#loc94 = loc("r0_mask"(#loc25))
+#loc95 = loc("tmp6"(#loc26))
+#loc96 = loc("tmp6"(#loc27))
+#loc97 = loc("tmp6"(#loc28))
+#loc98 = loc("tmp6"(#loc29))
+#loc99 = loc("tmp6"(#loc30))
+#loc100 = loc("tmp6"(#loc31))
+#loc101 = loc("tmp6"(#loc32))
+#loc102 = loc("tmp6"(#loc33))
+#loc103 = loc("tmp6"(#loc34))
+#loc104 = loc("tmp9"(#loc35))
+#loc105 = loc("tmp11"(#loc36))
+#loc106 = loc("tmp12"(#loc37))
+#loc107 = loc("tmp13"(#loc38))
+#loc108 = loc("quot"(#loc40))
+#loc109 = loc("remainder"(#loc42))
+#loc110 = loc("fixed"(#loc43))
+#loc111 = loc("fixed"(#loc44))
+#loc112 = loc("fixed"(#loc45))
+#loc113 = loc(callsite(#loc46 at #loc41))
+#loc114 = loc(callsite(#loc47 at #loc41))
+#loc115 = loc(fused[#loc52, #loc53])
+#loc116 = loc("r0_index"(#loc56))
+#loc117 = loc("tmp6"(#loc57))
+#loc118 = loc("tmp7"(#loc58))
+#loc119 = loc("tmp14"(#loc59))
+#loc120 = loc("tmp15"(#loc60))
+#loc121 = loc(fused[#loc82, #loc81])
+#loc122 = loc(fused[#loc84, #loc78])
+#loc123 = loc(callsite(#loc17 at #loc91))
+#loc125 = loc(fused[#loc98, #loc99])
+#loc126 = loc(fused[#loc101, #loc100])
+#loc127 = loc(fused[#loc103, #loc78])
+#loc128 = loc(callsite(#loc108 at #loc41))
+#loc129 = loc(callsite(#loc109 at #loc41))
+#loc130 = loc(callsite(#loc110 at #loc41))
+#loc131 = loc(callsite(#loc111 at #loc41))
+#loc132 = loc(callsite(#loc112 at #loc41))
+#loc133 = loc(callsite(#loc19 at #loc123))

	@@ -0,0 +1,246 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":18:0)
+#loc1 = loc(unknown)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":40:25)
+#loc69 = loc("in_ptr0"(#loc))
+#loc70 = loc("in_ptr1"(#loc))
+#loc71 = loc("out_ptr1"(#loc))
+#loc72 = loc("out_ptr2"(#loc))
+#loc73 = loc("out_ptr3"(#loc))
+#loc74 = loc("ks0"(#loc))
+#loc75 = loc("ks1"(#loc))
+#loc76 = loc("xnumel"(#loc))
+#loc77 = loc("r0_numel"(#loc))
+#loc93 = loc("tmp3"(#loc19))
+#loc126 = loc(callsite(#loc1 at #loc93))
+module {
+  tt.func public @triton_red_fused__to_copy_arange_index_put_lt_new_zeros_scalar_tensor_sum_unsqueeze_view_where_2(%in_ptr0: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %in_ptr1: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr1"(#loc)), %out_ptr1: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr1"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %out_ptr3: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr3"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %c128_i64 = arith.constant 128 : i64 loc(#loc1)
+    %c0_i64 = arith.constant 0 : i64 loc(#loc1)
+    %cst = arith.constant dense<0> : tensor<1x32xi32> loc(#loc1)
+    %c32_i32 = arith.constant 32 : i32 loc(#loc1)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc1)
+    %cst_0 = arith.constant dense<1> : tensor<1x32xi32> loc(#loc1)
+    %cst_1 = arith.constant dense<true> : tensor<1x32xi1> loc(#loc1)
+    %c127_i64 = arith.constant 127 : i64 loc(#loc1)
+    %c1_i64 = arith.constant 1 : i64 loc(#loc1)
+    %cst_2 = arith.constant dense<0> : tensor<1x32xi64> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc78)
+    %xmask = arith.cmpi slt, %xoffset, %c32_i32 : i32 loc(#loc79)
+    %xmask_3 = tt.splat %xmask : i1 -> tensor<1x1xi1> loc(#loc79)
+    %r0_base = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> loc(#loc80)
+    %r0_base_4 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<32xi32> -> tensor<1x32xi32> loc(#loc81)
+    %_tmp3 = scf.for %r0_offset = %c0_i32 to %r0_numel step %c32_i32 iter_args(%_tmp3_6 = %cst_2) -> (tensor<1x32xi64>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x32xi32> loc(#loc83)
+      %r0_index_7 = arith.addi %r0_index, %r0_base_4 : tensor<1x32xi32> loc(#loc83)
+      %r0_mask = tt.splat %r0_numel : i32 -> tensor<1x32xi32> loc(#loc84)
+      %r0_mask_8 = arith.cmpi slt, %r0_index_7, %r0_mask : tensor<1x32xi32> loc(#loc84)
+      %tmp0 = arith.extsi %xoffset : i32 to i64 loc(#loc85)
+      %tmp0_9 = arith.muli %ks0, %tmp0 : i64 loc(#loc85)
+      %tmp0_10 = arith.extsi %r0_index_7 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc86)
+      %tmp0_11 = tt.splat %tmp0_9 : i64 -> tensor<1x32xi64> loc(#loc123)
+      %tmp0_12 = arith.addi %tmp0_10, %tmp0_11 : tensor<1x32xi64> loc(#loc86)
+      %tmp0_13 = tt.splat %in_ptr0 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>> loc(#loc87)
+      %tmp0_14 = tt.addptr %tmp0_13, %tmp0_12 : tensor<1x32x!tt.ptr<i32>>, tensor<1x32xi64> loc(#loc87)
+      %tmp0_15 = tt.splat %xmask : i1 -> tensor<1x32xi1> loc(#loc124)
+      %tmp0_16 = arith.andi %r0_mask_8, %tmp0_15 : tensor<1x32xi1> loc(#loc88)
+      %tmp0_17 = tt.load %tmp0_14, %tmp0_16, %cst evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i32>> loc(#loc89)
+      %tmp1 = arith.extsi %tmp0_17 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc90)
+      %tmp4 = arith.addi %_tmp3_6, %tmp1 : tensor<1x32xi64> loc(#loc91)
+      %_tmp3_18 = arith.select %tmp0_16, %tmp4, %_tmp3_6 : tensor<1x32xi1>, tensor<1x32xi64> loc(#loc92)
+      scf.yield %_tmp3_18 : tensor<1x32xi64> loc(#loc17)
+    } loc(#loc82)
+    %tmp3 = "tt.reduce"(%_tmp3) <{axis = 1 : i32}> ({
+    ^bb0(%tmp3_6: i64 loc(callsite(#loc1 at #loc93)), %tmp3_7: i64 loc(callsite(#loc1 at #loc93))):
+      %tmp3_8 = arith.addi %tmp3_6, %tmp3_7 : i64 loc(#loc135)
+      tt.reduce.return %tmp3_8 : i64 loc(#loc125)
+    }) : (tensor<1x32xi64>) -> tensor<1xi64> loc(#loc125)
+    %tmp3_5 = tt.expand_dims %tmp3 {axis = 1 : i32} : tensor<1xi64> -> tensor<1x1xi64> loc(#loc94)
+    %tmp5 = arith.trunci %tmp3_5 : tensor<1x1xi64> to tensor<1x1xi32> loc(#loc95)
+    %0 = tt.addptr %out_ptr1, %xoffset : !tt.ptr<i32>, i32 loc(#loc23)
+    %1 = tt.splat %0 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>> loc(#loc23)
+    tt.store %1, %tmp5, %xmask_3 : tensor<1x1x!tt.ptr<i32>> loc(#loc24)
+    scf.for %r0_offset = %c0_i32 to %r0_numel step %c32_i32  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x32xi32> loc(#loc96)
+      %r0_index_6 = arith.addi %r0_index, %r0_base_4 : tensor<1x32xi32> loc(#loc96)
+      %r0_mask = tt.splat %r0_numel : i32 -> tensor<1x32xi32> loc(#loc97)
+      %r0_mask_7 = arith.cmpi slt, %r0_index_6, %r0_mask : tensor<1x32xi32> loc(#loc97)
+      %tmp6 = arith.cmpi sle, %ks0, %c1_i64 : i64 loc(#loc98)
+      %tmp6_8 = arith.cmpi sgt, %ks0, %c1_i64 : i64 loc(#loc99)
+      %tmp6_9 = arith.extui %tmp6_8 : i1 to i64 loc(#loc100)
+      %tmp6_10 = arith.muli %ks0, %tmp6_9 : i64 loc(#loc100)
+      %tmp6_11 = arith.extui %tmp6 : i1 to i64 loc(#loc127)
+      %tmp6_12 = arith.addi %tmp6_11, %tmp6_10 : i64 loc(#loc101)
+      %tmp6_13 = arith.extsi %xoffset : i32 to i64 loc(#loc103)
+      %tmp6_14 = arith.muli %tmp6_13, %tmp6_12 : i64 loc(#loc103)
+      %tmp6_15 = arith.extsi %r0_index_6 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc104)
+      %tmp6_16 = tt.splat %tmp6_14 : i64 -> tensor<1x32xi64> loc(#loc128)
+      %tmp6_17 = arith.addi %tmp6_15, %tmp6_16 : tensor<1x32xi64> loc(#loc104)
+      %tmp6_18 = tt.splat %in_ptr1 : !tt.ptr<i64> -> tensor<1x32x!tt.ptr<i64>> loc(#loc105)
+      %tmp6_19 = tt.addptr %tmp6_18, %tmp6_17 : tensor<1x32x!tt.ptr<i64>>, tensor<1x32xi64> loc(#loc105)
+      %tmp6_20 = tt.splat %xmask : i1 -> tensor<1x32xi1> loc(#loc129)
+      %tmp6_21 = arith.andi %r0_mask_7, %tmp6_20 : tensor<1x32xi1> loc(#loc106)
+      %tmp6_22 = tt.load %tmp6_19, %tmp6_21, %cst_2 evictionPolicy = evict_first : tensor<1x32x!tt.ptr<i64>> loc(#loc107)
+      %tmp7 = arith.trunci %tmp6_22 : tensor<1x32xi64> to tensor<1x32xi32> loc(#loc108)
+      %tmp9 = tt.broadcast %tmp5 : tensor<1x1xi32> -> tensor<1x32xi32> loc(#loc109)
+      %tmp9_23 = arith.cmpi slt, %r0_index_6, %tmp9 : tensor<1x32xi32> loc(#loc109)
+      %tmp11 = arith.extsi %tmp7 : tensor<1x32xi32> to tensor<1x32xi64> loc(#loc110)
+      %tmp11_24 = tt.splat %ks0 : i64 -> tensor<1x32xi64> loc(#loc110)
+      %tmp11_25 = arith.select %tmp9_23, %tmp11, %tmp11_24 : tensor<1x32xi1>, tensor<1x32xi64> loc(#loc110)
+      %tmp12 = arith.addi %ks0, %c1_i64 : i64 loc(#loc111)
+      %tmp13 = tt.splat %tmp12 : i64 -> tensor<1x32xi64> loc(#loc112)
+      %tmp13_26 = arith.addi %tmp11_25, %tmp13 : tensor<1x32xi64> loc(#loc112)
+      %tmp14 = arith.cmpi slt, %tmp11_25, %cst_2 : tensor<1x32xi64> loc(#loc113)
+      %tmp15 = arith.select %tmp14, %tmp13_26, %tmp11_25 : tensor<1x32xi1>, tensor<1x32xi64> loc(#loc114)
+      %2 = arith.cmpi sge, %tmp15, %cst_2 : tensor<1x32xi64> loc(#loc45)
+      %3 = arith.addi %ks1, %c127_i64 : i64 loc(#loc46)
+      %quot = arith.divsi %3, %c128_i64 : i64 loc(#loc130)
+      %remainder = arith.remsi %3, %c128_i64 : i64 loc(#loc131)
+      %fixed = arith.cmpi ne, %remainder, %c0_i64 : i64 loc(#loc132)
+      %fixed_27 = arith.subi %quot, %c1_i64 : i64 loc(#loc133)
+      %fixed_28 = arith.select %fixed, %fixed_27, %quot : i64 loc(#loc134)
+      %4 = arith.cmpi slt, %3, %c0_i64 : i64 loc(#loc120)
+      %5 = arith.select %4, %fixed_28, %quot : i64 loc(#loc121)
+      %6 = arith.addi %5, %c1_i64 : i64 loc(#loc55)
+      %7 = tt.splat %6 : i64 -> tensor<1x32xi64> loc(#loc56)
+      %8 = arith.cmpi slt, %tmp15, %7 : tensor<1x32xi64> loc(#loc56)
+      %9 = arith.andi %2, %8 : tensor<1x32xi1> loc(#loc57)
+      %10 = arith.xori %tmp6_21, %cst_1 : tensor<1x32xi1> loc(#loc58)
+      %11 = arith.ori %9, %10 : tensor<1x32xi1> loc(#loc59)
+      tt.assert %11, "index out of bounds: 0 <= tmp15 < 1 + (triton_helpers.div_floor_integer(127 + ks1,  128))" : tensor<1x32xi1> loc(#loc60)
+      %12 = tt.splat %out_ptr2 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>> loc(#loc61)
+      %13 = tt.addptr %12, %tmp6_17 : tensor<1x32x!tt.ptr<i32>>, tensor<1x32xi64> loc(#loc61)
+      tt.store %13, %tmp7, %tmp6_21 : tensor<1x32x!tt.ptr<i32>> loc(#loc62)
+      %14 = tt.splat %tmp6_13 : i64 -> tensor<1x32xi64> loc(#loc63)
+      %15 = arith.addi %tmp15, %14 : tensor<1x32xi64> loc(#loc63)
+      %16 = arith.muli %ks0, %tmp6_13 : i64 loc(#loc64)
+      %17 = tt.splat %16 : i64 -> tensor<1x32xi64> loc(#loc122)
+      %18 = arith.addi %15, %17 : tensor<1x32xi64> loc(#loc65)
+      %19 = tt.splat %out_ptr3 : !tt.ptr<i32> -> tensor<1x32x!tt.ptr<i32>> loc(#loc66)
+      %20 = tt.addptr %19, %18 : tensor<1x32x!tt.ptr<i32>>, tensor<1x32xi64> loc(#loc66)
+      tt.store %20, %cst_0, %tmp6_21 : tensor<1x32x!tt.ptr<i32>> loc(#loc67)
+    } loc(#loc25)
+    tt.return loc(#loc68)
+  } loc(#loc)
+} loc(#loc)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":22:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":24:21)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":25:27)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":25:37)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":29:40)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":30:31)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":31:29)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:45)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:41)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:34)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:60)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":35:50)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":36:23)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":38:23)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:48)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":39:8)
+#loc18 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc20 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":40:28)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":41:19)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":42:25)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":42:36)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":43:40)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":44:31)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":45:29)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:60)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:86)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:77)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:68)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:52)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:45)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:41)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:34)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:103)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":49:93)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":50:23)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":52:22)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":54:37)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":55:20)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":56:24)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":57:24)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":58:39)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:32)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:94)
+#loc47 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":72:16)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:100)
+#loc49 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":73:20)
+#loc50 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:34)
+#loc51 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:44)
+#loc52 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":74:47)
+#loc53 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:25)
+#loc54 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":75:47)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:55)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:50)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:42)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:112)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:110)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":59:130)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:29)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":61:94)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:53)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:62)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:58)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:29)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":62:95)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vr/cvrhnrmpgyxwu34xleclee3tt4kemoldkj7iam4uciathomirvlc.py":43:4)
+#loc78 = loc("xoffset"(#loc2))
+#loc79 = loc("xmask"(#loc3))
+#loc80 = loc("r0_base"(#loc4))
+#loc81 = loc("r0_base"(#loc5))
+#loc82 = loc("_tmp3"(#loc6))
+#loc83 = loc("r0_index"(#loc7))
+#loc84 = loc("r0_mask"(#loc8))
+#loc85 = loc("tmp0"(#loc9))
+#loc86 = loc("tmp0"(#loc10))
+#loc87 = loc("tmp0"(#loc11))
+#loc88 = loc("tmp0"(#loc12))
+#loc89 = loc("tmp0"(#loc13))
+#loc90 = loc("tmp1"(#loc14))
+#loc91 = loc("tmp4"(#loc15))
+#loc92 = loc("_tmp3"(#loc16))
+#loc94 = loc("tmp3"(#loc21))
+#loc95 = loc("tmp5"(#loc22))
+#loc96 = loc("r0_index"(#loc26))
+#loc97 = loc("r0_mask"(#loc27))
+#loc98 = loc("tmp6"(#loc28))
+#loc99 = loc("tmp6"(#loc29))
+#loc100 = loc("tmp6"(#loc30))
+#loc101 = loc("tmp6"(#loc31))
+#loc102 = loc("tmp6"(#loc32))
+#loc103 = loc("tmp6"(#loc33))
+#loc104 = loc("tmp6"(#loc34))
+#loc105 = loc("tmp6"(#loc35))
+#loc106 = loc("tmp6"(#loc36))
+#loc107 = loc("tmp6"(#loc37))
+#loc108 = loc("tmp7"(#loc38))
+#loc109 = loc("tmp9"(#loc39))
+#loc110 = loc("tmp11"(#loc40))
+#loc111 = loc("tmp12"(#loc41))
+#loc112 = loc("tmp13"(#loc42))
+#loc113 = loc("tmp14"(#loc43))
+#loc114 = loc("tmp15"(#loc44))
+#loc115 = loc("quot"(#loc47))
+#loc116 = loc("remainder"(#loc49))
+#loc117 = loc("fixed"(#loc50))
+#loc118 = loc("fixed"(#loc51))
+#loc119 = loc("fixed"(#loc52))
+#loc120 = loc(callsite(#loc53 at #loc48))
+#loc121 = loc(callsite(#loc54 at #loc48))
+#loc122 = loc(fused[#loc65, #loc64])
+#loc123 = loc(fused[#loc86, #loc85])
+#loc124 = loc(fused[#loc88, #loc79])
+#loc125 = loc(callsite(#loc18 at #loc93))
+#loc127 = loc(fused[#loc101, #loc102])
+#loc128 = loc(fused[#loc104, #loc103])
+#loc129 = loc(fused[#loc106, #loc79])
+#loc130 = loc(callsite(#loc115 at #loc48))
+#loc131 = loc(callsite(#loc116 at #loc48))
+#loc132 = loc(callsite(#loc117 at #loc48))
+#loc133 = loc(callsite(#loc118 at #loc48))
+#loc134 = loc(callsite(#loc119 at #loc48))
+#loc135 = loc(callsite(#loc20 at #loc125))

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/__grp__triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.source", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttir", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttgir", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.llir", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ptx", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json"}}

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin ADDED

Binary file (86.4 kB).

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json ADDED

	@@ -0,0 +1 @@

+ {"hash": "07c7815d2ce5fa33e16044674f04a1dcbb415776e0b5f0da0149af801b6db42c", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 4, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 2048, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3"}

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.llir ADDED

The diff for this file is too large to render.

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ptx ADDED

The diff for this file is too large to render.

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.source ADDED

The diff for this file is too large to render.

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttgir ADDED

	@@ -0,0 +1,841 @@

+#blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [32, 1], warpsPerCTA = [1, 4], order = [0, 1]}>
+#blocked1 = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [8, 4], warpsPerCTA = [4, 1], order = [1, 0]}>
+#linear = #ttg.linear<{register = [[0, 4], [0, 8]], lane = [[1, 0], [2, 0], [4, 0], [8, 0], [16, 0]], warp = [[0, 1], [0, 2]], block = []}>
+#linear1 = #ttg.linear<{register = [[2, 0, 0], [4, 0, 0]], lane = [[8, 0, 0], [16, 0, 0], [32, 0, 0], [64, 0, 0], [128, 0, 0]], warp = [[0, 1, 0], [1, 0, 0]], block = []}>
+#linear2 = #ttg.linear<{register = [[1, 0, 0], [2, 0, 0]], lane = [[4, 0, 0], [8, 0, 0], [16, 0, 0], [32, 0, 0], [64, 0, 0]], warp = [[0, 0, 1], [0, 1, 0]], block = []}>
+#linear3 = #ttg.linear<{register = [[0, 1, 0], [1, 0, 0]], lane = [[2, 0, 0], [4, 0, 0], [8, 0, 0], [16, 0, 0], [32, 0, 0]], warp = [[0, 0, 1], [0, 0, 2]], block = []}>
+#linear4 = #ttg.linear<{register = [[0, 0, 4], [0, 1, 0]], lane = [[1, 0, 0], [2, 0, 0], [4, 0, 0], [8, 0, 0], [16, 0, 0]], warp = [[0, 0, 1], [0, 0, 2]], block = []}>
+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":18:0)
+#loc1 = loc(unknown)
+#loc19 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":662:12)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":41:67)
+#loc24 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":634:73)
+#loc28 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:51)
+#loc33 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:53)
+#loc42 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:50)
+#loc47 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:51)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":45:26)
+#loc77 = loc("in_ptr0"(#loc))
+#loc78 = loc("out_ptr2"(#loc))
+#loc79 = loc("out_ptr3"(#loc))
+#loc80 = loc("xnumel"(#loc))
+#loc81 = loc("r0_numel"(#loc))
+#loc99 = loc(callsite(#loc19 at #loc20))
+#loc105 = loc("ileft"(#loc28))
+#loc109 = loc("iright"(#loc33))
+#loc118 = loc("left_idx"(#loc42))
+#loc123 = loc("right_idx"(#loc47))
+#loc143 = loc("tmp11"(#loc67))
+#loc149 = loc(callsite(#loc24 at #loc99))
+#loc153 = loc(callsite(#loc1 at #loc143))
+#loc157 = loc(callsite(#loc105 at #loc149))
+#loc161 = loc(callsite(#loc109 at #loc149))
+#loc169 = loc(callsite(#loc118 at #loc149))
+#loc174 = loc(callsite(#loc123 at #loc149))
+#loc194 = loc(callsite(#loc1 at #loc157))
+#loc196 = loc(callsite(#loc1 at #loc161))
+#loc199 = loc(callsite(#loc1 at #loc169))
+#loc202 = loc(callsite(#loc1 at #loc174))
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 4 : i32, ttg.target = "cuda:90", "ttg.threads-per-warp" = 32 : i32} {
+  tt.func public @triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3(%in_ptr0: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %out_ptr3: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr3"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<0> : tensor<32x16xi32, #linear> loc(#loc1)
+    %cst_0 = arith.constant dense<0> : tensor<32x16xi64, #blocked> loc(#loc1)
+    %c32_i32 = arith.constant 32 : i32 loc(#loc1)
+    %cst_1 = arith.constant dense<32> : tensor<32x1xi32, #blocked> loc(#loc1)
+    %cst_2 = arith.constant dense<32> : tensor<32x1xi32, #blocked1> loc(#loc1)
+    %cst_3 = arith.constant dense<16> : tensor<32x1xi32, #blocked> loc(#loc1)
+    %cst_4 = arith.constant dense<16> : tensor<32x1xi32, #blocked1> loc(#loc1)
+    %cst_5 = arith.constant dense<17> : tensor<1x16xi32, #blocked> loc(#loc1)
+    %cst_6 = arith.constant dense<272> : tensor<32x1xi32, #blocked> loc(#loc1)
+    %cst_7 = arith.constant dense<1> : tensor<1x2x1xi32, #linear1> loc(#loc1)
+    %cst_8 = arith.constant dense<1> : tensor<1x2x1xi32, #linear2> loc(#loc1)
+    %cst_9 = arith.constant dense<1> : tensor<1x2x1xi32, #linear3> loc(#loc1)
+    %cst_10 = arith.constant dense<1> : tensor<1x2x1xi32, #linear4> loc(#loc1)
+    %cst_11 = arith.constant dense<0> : tensor<32x16xi32, #blocked> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc82)
+    %xoffset_12 = arith.muli %xoffset, %c32_i32 : i32 loc(#loc83)
+    %xindex = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> loc(#loc84)
+    %xindex_13 = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> loc(#loc84)
+    %xindex_14 = tt.expand_dims %xindex {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<32x1xi32, #blocked> loc(#loc84)
+    %xindex_15 = tt.expand_dims %xindex_13 {axis = 1 : i32} : tensor<32xi32, #ttg.slice<{dim = 1, parent = #blocked1}>> -> tensor<32x1xi32, #blocked1> loc(#loc84)
+    %xindex_16 = tt.splat %xoffset_12 : i32 -> tensor<32x1xi32, #blocked> loc(#loc85)
+    %xindex_17 = tt.splat %xoffset_12 : i32 -> tensor<32x1xi32, #blocked1> loc(#loc85)
+    %xindex_18 = arith.addi %xindex_16, %xindex_14 : tensor<32x1xi32, #blocked> loc(#loc85)
+    %xindex_19 = arith.addi %xindex_17, %xindex_15 : tensor<32x1xi32, #blocked1> loc(#loc85)
+    %xmask = arith.cmpi slt, %xindex_18, %cst_1 : tensor<32x1xi32, #blocked> loc(#loc86)
+    %xmask_20 = arith.cmpi slt, %xindex_19, %cst_2 : tensor<32x1xi32, #blocked1> loc(#loc86)
+    %r0_index = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked}>> loc(#loc87)
+    %r0_index_21 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #linear}>> loc(#loc87)
+    %r0_index_22 = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> loc(#loc87)
+    %r0_index_23 = tt.expand_dims %r0_index {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x16xi32, #blocked> loc(#loc87)
+    %r0_index_24 = tt.expand_dims %r0_index_21 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #linear}>> -> tensor<1x16xi32, #linear> loc(#loc87)
+    %r0_index_25 = tt.expand_dims %r0_index_22 {axis = 0 : i32} : tensor<16xi32, #ttg.slice<{dim = 0, parent = #blocked1}>> -> tensor<1x16xi32, #blocked1> loc(#loc87)
+    %x0 = arith.remsi %xindex_18, %cst_3 : tensor<32x1xi32, #blocked> loc(#loc88)
+    %x1 = arith.divsi %xindex_18, %cst_3 : tensor<32x1xi32, #blocked> loc(#loc89)
+    %tmp0 = arith.muli %r0_index_23, %cst_5 : tensor<1x16xi32, #blocked> loc(#loc90)
+    %tmp0_26 = tt.broadcast %x0 : tensor<32x1xi32, #blocked> -> tensor<32x16xi32, #blocked> loc(#loc91)
+    %tmp0_27 = tt.broadcast %tmp0 : tensor<1x16xi32, #blocked> -> tensor<32x16xi32, #blocked> loc(#loc91)
+    %tmp0_28 = arith.addi %tmp0_26, %tmp0_27 : tensor<32x16xi32, #blocked> loc(#loc91)
+    %tmp0_29 = arith.muli %x1, %cst_6 : tensor<32x1xi32, #blocked> loc(#loc92)
+    %tmp0_30 = tt.broadcast %tmp0_29 : tensor<32x1xi32, #blocked> -> tensor<32x16xi32, #blocked> loc(#loc93)
+    %tmp0_31 = arith.addi %tmp0_28, %tmp0_30 : tensor<32x16xi32, #blocked> loc(#loc93)
+    %tmp0_32 = tt.splat %in_ptr0 : !tt.ptr<i32> -> tensor<32x16x!tt.ptr<i32>, #blocked> loc(#loc94)
+    %tmp0_33 = tt.addptr %tmp0_32, %tmp0_31 : tensor<32x16x!tt.ptr<i32>, #blocked>, tensor<32x16xi32, #blocked> loc(#loc94)
+    %tmp0_34 = tt.broadcast %xmask : tensor<32x1xi1, #blocked> -> tensor<32x16xi1, #blocked> loc(#loc95)
+    %tmp0_35 = tt.broadcast %xmask_20 : tensor<32x1xi1, #blocked1> -> tensor<32x16xi1, #blocked1> loc(#loc95)
+    %tmp0_36 = tt.load %tmp0_33, %tmp0_34, %cst_11 : tensor<32x16x!tt.ptr<i32>, #blocked> loc(#loc95)
+    %tmp2 = arith.trunci %r0_index_24 : tensor<1x16xi32, #linear> to tensor<1x16xi16, #linear> loc(#loc96)
+    %tmp4 = tt.broadcast %tmp2 : tensor<1x16xi16, #linear> -> tensor<32x16xi16, #linear> loc(#loc97)
+    %flip = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear2}>}>> loc(#loc146)
+    %flip_37 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear1}>}>> loc(#loc146)
+    %flip_38 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear3}>}>> loc(#loc146)
+    %flip_39 = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear4}>}>> loc(#loc146)
+    %flip_40 = tt.expand_dims %flip {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear2}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear2}>> loc(#loc146)
+    %flip_41 = tt.expand_dims %flip_37 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear1}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear1}>> loc(#loc146)
+    %flip_42 = tt.expand_dims %flip_38 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear3}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear3}>> loc(#loc146)
+    %flip_43 = tt.expand_dims %flip_39 {axis = 0 : i32} : tensor<2xi32, #ttg.slice<{dim = 0, parent = #ttg.slice<{dim = 2, parent = #linear4}>}>> -> tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear4}>> loc(#loc146)
+    %flip_44 = tt.expand_dims %flip_40 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear2}>> -> tensor<1x2x1xi32, #linear2> loc(#loc146)
+    %flip_45 = tt.expand_dims %flip_41 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear1}>> -> tensor<1x2x1xi32, #linear1> loc(#loc146)
+    %flip_46 = tt.expand_dims %flip_42 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear3}>> -> tensor<1x2x1xi32, #linear3> loc(#loc146)
+    %flip_47 = tt.expand_dims %flip_43 {axis = 2 : i32} : tensor<1x2xi32, #ttg.slice<{dim = 2, parent = #linear4}>> -> tensor<1x2x1xi32, #linear4> loc(#loc146)
+    %flip_48 = tt.broadcast %flip_44 : tensor<1x2x1xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc147)
+    %flip_49 = tt.reshape %flip_48 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #blocked> loc(#loc148)
+    %flip_50 = tt.reshape %flip_48 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc148)
+    %y = tt.reshape %tmp0_36 : tensor<32x16xi32, #blocked> -> tensor<256x2x1xi32, #linear1> loc(#loc154)
+    %left_mask = arith.subi %cst_7, %flip_45 : tensor<1x2x1xi32, #linear1> loc(#loc155)
+    %left_mask_51 = arith.subi %cst_8, %flip_44 : tensor<1x2x1xi32, #linear2> loc(#loc155)
+    %left_mask_52 = arith.subi %cst_9, %flip_46 : tensor<1x2x1xi32, #linear3> loc(#loc155)
+    %left_mask_53 = arith.subi %cst_10, %flip_47 : tensor<1x2x1xi32, #linear4> loc(#loc155)
+    %ileft = tt.broadcast %left_mask : tensor<1x2x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc156)
+    %ileft_54 = arith.muli %y, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc156)
+    %ileft_55 = "tt.reduce"(%ileft_54) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc193)
+    %ileft_56 = tt.expand_dims %ileft_55 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc158)
+    %ileft_57 = tt.broadcast %ileft_56 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc159)
+    %iright = tt.broadcast %flip_45 : tensor<1x2x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc160)
+    %iright_58 = arith.muli %y, %iright : tensor<256x2x1xi32, #linear1> loc(#loc160)
+    %iright_59 = "tt.reduce"(%iright_58) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc195)
+    %iright_60 = tt.expand_dims %iright_59 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc162)
+    %iright_61 = tt.broadcast %iright_60 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc163)
+    %ileft_62 = tt.reshape %ileft_57 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #blocked> loc(#loc164)
+    %ileft_63 = tt.reshape %ileft_57 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_64 = tt.reshape %iright_61 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #blocked> loc(#loc165)
+    %iright_65 = tt.reshape %iright_61 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx = tt.reshape %tmp4 : tensor<32x16xi16, #linear> -> tensor<256x2x1xi16, #linear1> loc(#loc166)
+    %left_idx = arith.trunci %left_mask : tensor<1x2x1xi32, #linear1> to tensor<1x2x1xi16, #linear1> loc(#loc167)
+    %left_idx_66 = tt.broadcast %left_idx : tensor<1x2x1xi16, #linear1> -> tensor<256x2x1xi16, #linear1> loc(#loc168)
+    %left_idx_67 = arith.muli %y_idx, %left_idx_66 : tensor<256x2x1xi16, #linear1> loc(#loc168)
+    %input = arith.extsi %left_idx_67 : tensor<256x2x1xi16, #linear1> to tensor<256x2x1xi32, #linear1> loc(#loc197)
+    %left_idx_68 = "tt.reduce"(%input) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc198)
+    %left_idx_69 = tt.expand_dims %left_idx_68 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc170)
+    %left_idx_70 = tt.broadcast %left_idx_69 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc171)
+    %right_idx = arith.trunci %flip_45 : tensor<1x2x1xi32, #linear1> to tensor<1x2x1xi16, #linear1> loc(#loc172)
+    %right_idx_71 = tt.broadcast %right_idx : tensor<1x2x1xi16, #linear1> -> tensor<256x2x1xi16, #linear1> loc(#loc173)
+    %right_idx_72 = arith.muli %y_idx, %right_idx_71 : tensor<256x2x1xi16, #linear1> loc(#loc173)
+    %input_73 = arith.extsi %right_idx_72 : tensor<256x2x1xi16, #linear1> to tensor<256x2x1xi32, #linear1> loc(#loc200)
+    %right_idx_74 = "tt.reduce"(%input_73) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc201)
+    %right_idx_75 = tt.expand_dims %right_idx_74 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc175)
+    %right_idx_76 = tt.broadcast %right_idx_75 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc176)
+    %left_idx_77 = tt.reshape %left_idx_70 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #blocked> loc(#loc177)
+    %left_idx_78 = tt.reshape %left_idx_70 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_79 = tt.reshape %right_idx_76 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #blocked> loc(#loc178)
+    %right_idx_80 = tt.reshape %right_idx_76 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond = arith.cmpi slt, %ileft_62, %iright_64 : tensor<32x16xi32, #blocked> loc(#loc179)
+    %cond_81 = arith.cmpi slt, %ileft_63, %iright_65 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq = arith.cmpi eq, %ileft_62, %iright_64 : tensor<32x16xi32, #blocked> loc(#loc180)
+    %eq_82 = arith.cmpi eq, %ileft_63, %iright_65 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_83 = arith.cmpi sgt, %left_idx_77, %right_idx_79 : tensor<32x16xi32, #blocked> loc(#loc181)
+    %cond_84 = arith.cmpi sgt, %left_idx_78, %right_idx_80 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_85 = arith.andi %eq, %cond_83 : tensor<32x16xi1, #blocked> loc(#loc182)
+    %cond_86 = arith.andi %eq_82, %cond_84 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_87 = arith.ori %cond, %cond_85 : tensor<32x16xi1, #blocked> loc(#loc183)
+    %cond_88 = arith.ori %cond_81, %cond_86 : tensor<32x16xi1, #linear> loc(#loc183)
+    %cond_89 = arith.extui %cond_87 : tensor<32x16xi1, #blocked> to tensor<32x16xi32, #blocked> loc(#loc184)
+    %cond_90 = arith.extui %cond_88 : tensor<32x16xi1, #linear> to tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_91 = arith.xori %cond_89, %flip_49 : tensor<32x16xi32, #blocked> loc(#loc184)
+    %cond_92 = arith.xori %cond_90, %flip_50 : tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_93 = arith.cmpi ne, %cond_91, %cst_11 : tensor<32x16xi32, #blocked> loc(#loc185)
+    %cond_94 = arith.cmpi ne, %cond_92, %cst : tensor<32x16xi32, #linear> loc(#loc185)
+    %ret = arith.xori %ileft_62, %iright_64 : tensor<32x16xi32, #blocked> loc(#loc186)
+    %ret_95 = arith.select %cond_93, %ret, %cst_11 : tensor<32x16xi1, #blocked>, tensor<32x16xi32, #blocked> loc(#loc187)
+    %ret_96 = arith.xori %tmp0_36, %ret_95 : tensor<32x16xi32, #blocked> loc(#loc188)
+    %ret_97 = ttg.convert_layout %ret_96 : tensor<32x16xi32, #blocked> -> tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs = arith.xori %left_idx_78, %right_idx_80 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_98 = arith.select %cond_94, %new_idxs, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_99 = arith.extsi %tmp2 : tensor<1x16xi16, #linear> to tensor<1x16xi32, #linear> loc(#loc191)
+    %new_idxs_100 = tt.broadcast %new_idxs_99 : tensor<1x16xi32, #linear> -> tensor<32x16xi32, #linear> loc(#loc191)
+    %new_idxs_101 = arith.xori %new_idxs_100, %new_idxs_98 : tensor<32x16xi32, #linear> loc(#loc191)
+    %flip_102 = tt.broadcast %flip_46 : tensor<1x2x1xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc147)
+    %flip_103 = tt.reshape %flip_102 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc148)
+    %y_104 = tt.reshape %ret_96 : tensor<32x16xi32, #blocked> -> tensor<128x2x2xi32, #linear2> loc(#loc154)
+    %ileft_105 = tt.broadcast %left_mask_51 : tensor<1x2x1xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc156)
+    %ileft_106 = arith.muli %y_104, %ileft_105 : tensor<128x2x2xi32, #linear2> loc(#loc156)
+    %ileft_107 = "tt.reduce"(%ileft_106) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc193)
+    %ileft_108 = tt.expand_dims %ileft_107 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc158)
+    %ileft_109 = tt.broadcast %ileft_108 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc159)
+    %iright_110 = arith.muli %y_104, %flip_48 : tensor<128x2x2xi32, #linear2> loc(#loc160)
+    %iright_111 = "tt.reduce"(%iright_110) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc195)
+    %iright_112 = tt.expand_dims %iright_111 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc162)
+    %iright_113 = tt.broadcast %iright_112 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc163)
+    %ileft_114 = tt.reshape %ileft_109 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_115 = tt.reshape %iright_113 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_116 = tt.reshape %new_idxs_101 : tensor<32x16xi32, #linear> -> tensor<128x2x2xi32, #linear2> loc(#loc166)
+    %left_idx_117 = arith.muli %y_idx_116, %ileft_105 : tensor<128x2x2xi32, #linear2> loc(#loc168)
+    %left_idx_118 = "tt.reduce"(%left_idx_117) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc198)
+    %left_idx_119 = tt.expand_dims %left_idx_118 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc170)
+    %left_idx_120 = tt.broadcast %left_idx_119 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc171)
+    %right_idx_121 = arith.muli %y_idx_116, %flip_48 : tensor<128x2x2xi32, #linear2> loc(#loc173)
+    %right_idx_122 = "tt.reduce"(%right_idx_121) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc201)
+    %right_idx_123 = tt.expand_dims %right_idx_122 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc175)
+    %right_idx_124 = tt.broadcast %right_idx_123 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc176)
+    %left_idx_125 = tt.reshape %left_idx_120 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_126 = tt.reshape %right_idx_124 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_127 = arith.cmpi slt, %ileft_114, %iright_115 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_128 = arith.cmpi eq, %ileft_114, %iright_115 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_129 = arith.cmpi sgt, %left_idx_125, %right_idx_126 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_130 = arith.andi %eq_128, %cond_129 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_131 = arith.ori %cond_127, %cond_130 : tensor<32x16xi1, #linear> loc(#loc183)
+    %cond_132 = arith.extui %cond_131 : tensor<32x16xi1, #linear> to tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_133 = arith.xori %cond_132, %flip_103 : tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_134 = arith.cmpi ne, %cond_133, %cst : tensor<32x16xi32, #linear> loc(#loc185)
+    %ret_135 = arith.xori %ileft_114, %iright_115 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_136 = arith.select %cond_134, %ret_135, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_137 = arith.xori %ret_97, %ret_136 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_138 = arith.xori %left_idx_125, %right_idx_126 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_139 = arith.select %cond_134, %new_idxs_138, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_140 = arith.xori %new_idxs_101, %new_idxs_139 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_141 = tt.reshape %ret_137 : tensor<32x16xi32, #linear> -> tensor<256x2x1xi32, #linear1> loc(#loc154)
+    %ileft_142 = arith.muli %y_141, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc156)
+    %ileft_143 = "tt.reduce"(%ileft_142) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc193)
+    %ileft_144 = tt.expand_dims %ileft_143 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc158)
+    %ileft_145 = tt.broadcast %ileft_144 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc159)
+    %iright_146 = arith.muli %y_141, %iright : tensor<256x2x1xi32, #linear1> loc(#loc160)
+    %iright_147 = "tt.reduce"(%iright_146) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc195)
+    %iright_148 = tt.expand_dims %iright_147 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc162)
+    %iright_149 = tt.broadcast %iright_148 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc163)
+    %ileft_150 = tt.reshape %ileft_145 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_151 = tt.reshape %iright_149 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_152 = tt.reshape %new_idxs_140 : tensor<32x16xi32, #linear> -> tensor<256x2x1xi32, #linear1> loc(#loc166)
+    %left_idx_153 = arith.muli %y_idx_152, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc168)
+    %left_idx_154 = "tt.reduce"(%left_idx_153) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc198)
+    %left_idx_155 = tt.expand_dims %left_idx_154 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc170)
+    %left_idx_156 = tt.broadcast %left_idx_155 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc171)
+    %right_idx_157 = arith.muli %y_idx_152, %iright : tensor<256x2x1xi32, #linear1> loc(#loc173)
+    %right_idx_158 = "tt.reduce"(%right_idx_157) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc201)
+    %right_idx_159 = tt.expand_dims %right_idx_158 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc175)
+    %right_idx_160 = tt.broadcast %right_idx_159 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc176)
+    %left_idx_161 = tt.reshape %left_idx_156 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_162 = tt.reshape %right_idx_160 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_163 = arith.cmpi slt, %ileft_150, %iright_151 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_164 = arith.cmpi eq, %ileft_150, %iright_151 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_165 = arith.cmpi sgt, %left_idx_161, %right_idx_162 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_166 = arith.andi %eq_164, %cond_165 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_167 = arith.ori %cond_163, %cond_166 : tensor<32x16xi1, #linear> loc(#loc183)
+    %cond_168 = arith.extui %cond_167 : tensor<32x16xi1, #linear> to tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_169 = arith.xori %cond_168, %flip_103 : tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_170 = arith.cmpi ne, %cond_169, %cst : tensor<32x16xi32, #linear> loc(#loc185)
+    %ret_171 = arith.xori %ileft_150, %iright_151 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_172 = arith.select %cond_170, %ret_171, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_173 = arith.xori %ret_137, %ret_172 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_174 = arith.xori %left_idx_161, %right_idx_162 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_175 = arith.select %cond_170, %new_idxs_174, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_176 = arith.xori %new_idxs_140, %new_idxs_175 : tensor<32x16xi32, #linear> loc(#loc191)
+    %flip_177 = tt.broadcast %flip_47 : tensor<1x2x1xi32, #linear4> -> tensor<32x2x8xi32, #linear4> loc(#loc147)
+    %flip_178 = tt.reshape %flip_177 : tensor<32x2x8xi32, #linear4> -> tensor<32x16xi32, #linear> loc(#loc148)
+    %y_179 = tt.reshape %ret_173 : tensor<32x16xi32, #linear> -> tensor<64x2x4xi32, #linear3> loc(#loc154)
+    %ileft_180 = tt.broadcast %left_mask_52 : tensor<1x2x1xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc156)
+    %ileft_181 = arith.muli %y_179, %ileft_180 : tensor<64x2x4xi32, #linear3> loc(#loc156)
+    %ileft_182 = "tt.reduce"(%ileft_181) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc193)
+    %ileft_183 = tt.expand_dims %ileft_182 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc158)
+    %ileft_184 = tt.broadcast %ileft_183 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc159)
+    %iright_185 = arith.muli %y_179, %flip_102 : tensor<64x2x4xi32, #linear3> loc(#loc160)
+    %iright_186 = "tt.reduce"(%iright_185) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc195)
+    %iright_187 = tt.expand_dims %iright_186 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc162)
+    %iright_188 = tt.broadcast %iright_187 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc163)
+    %ileft_189 = tt.reshape %ileft_184 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_190 = tt.reshape %iright_188 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_191 = tt.reshape %new_idxs_176 : tensor<32x16xi32, #linear> -> tensor<64x2x4xi32, #linear3> loc(#loc166)
+    %left_idx_192 = arith.muli %y_idx_191, %ileft_180 : tensor<64x2x4xi32, #linear3> loc(#loc168)
+    %left_idx_193 = "tt.reduce"(%left_idx_192) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc198)
+    %left_idx_194 = tt.expand_dims %left_idx_193 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc170)
+    %left_idx_195 = tt.broadcast %left_idx_194 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc171)
+    %right_idx_196 = arith.muli %y_idx_191, %flip_102 : tensor<64x2x4xi32, #linear3> loc(#loc173)
+    %right_idx_197 = "tt.reduce"(%right_idx_196) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc201)
+    %right_idx_198 = tt.expand_dims %right_idx_197 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc175)
+    %right_idx_199 = tt.broadcast %right_idx_198 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc176)
+    %left_idx_200 = tt.reshape %left_idx_195 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_201 = tt.reshape %right_idx_199 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_202 = arith.cmpi slt, %ileft_189, %iright_190 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_203 = arith.cmpi eq, %ileft_189, %iright_190 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_204 = arith.cmpi sgt, %left_idx_200, %right_idx_201 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_205 = arith.andi %eq_203, %cond_204 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_206 = arith.ori %cond_202, %cond_205 : tensor<32x16xi1, #linear> loc(#loc183)
+    %cond_207 = arith.extui %cond_206 : tensor<32x16xi1, #linear> to tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_208 = arith.xori %cond_207, %flip_178 : tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_209 = arith.cmpi ne, %cond_208, %cst : tensor<32x16xi32, #linear> loc(#loc185)
+    %ret_210 = arith.xori %ileft_189, %iright_190 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_211 = arith.select %cond_209, %ret_210, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_212 = arith.xori %ret_173, %ret_211 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_213 = arith.xori %left_idx_200, %right_idx_201 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_214 = arith.select %cond_209, %new_idxs_213, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_215 = arith.xori %new_idxs_176, %new_idxs_214 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_216 = tt.reshape %ret_212 : tensor<32x16xi32, #linear> -> tensor<128x2x2xi32, #linear2> loc(#loc154)
+    %ileft_217 = arith.muli %y_216, %ileft_105 : tensor<128x2x2xi32, #linear2> loc(#loc156)
+    %ileft_218 = "tt.reduce"(%ileft_217) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc193)
+    %ileft_219 = tt.expand_dims %ileft_218 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc158)
+    %ileft_220 = tt.broadcast %ileft_219 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc159)
+    %iright_221 = arith.muli %y_216, %flip_48 : tensor<128x2x2xi32, #linear2> loc(#loc160)
+    %iright_222 = "tt.reduce"(%iright_221) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc195)
+    %iright_223 = tt.expand_dims %iright_222 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc162)
+    %iright_224 = tt.broadcast %iright_223 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc163)
+    %ileft_225 = tt.reshape %ileft_220 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_226 = tt.reshape %iright_224 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_227 = tt.reshape %new_idxs_215 : tensor<32x16xi32, #linear> -> tensor<128x2x2xi32, #linear2> loc(#loc166)
+    %left_idx_228 = arith.muli %y_idx_227, %ileft_105 : tensor<128x2x2xi32, #linear2> loc(#loc168)
+    %left_idx_229 = "tt.reduce"(%left_idx_228) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc198)
+    %left_idx_230 = tt.expand_dims %left_idx_229 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc170)
+    %left_idx_231 = tt.broadcast %left_idx_230 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc171)
+    %right_idx_232 = arith.muli %y_idx_227, %flip_48 : tensor<128x2x2xi32, #linear2> loc(#loc173)
+    %right_idx_233 = "tt.reduce"(%right_idx_232) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc201)
+    %right_idx_234 = tt.expand_dims %right_idx_233 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc175)
+    %right_idx_235 = tt.broadcast %right_idx_234 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc176)
+    %left_idx_236 = tt.reshape %left_idx_231 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_237 = tt.reshape %right_idx_235 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_238 = arith.cmpi slt, %ileft_225, %iright_226 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_239 = arith.cmpi eq, %ileft_225, %iright_226 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_240 = arith.cmpi sgt, %left_idx_236, %right_idx_237 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_241 = arith.andi %eq_239, %cond_240 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_242 = arith.ori %cond_238, %cond_241 : tensor<32x16xi1, #linear> loc(#loc183)
+    %cond_243 = arith.extui %cond_242 : tensor<32x16xi1, #linear> to tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_244 = arith.xori %cond_243, %flip_178 : tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_245 = arith.cmpi ne, %cond_244, %cst : tensor<32x16xi32, #linear> loc(#loc185)
+    %ret_246 = arith.xori %ileft_225, %iright_226 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_247 = arith.select %cond_245, %ret_246, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_248 = arith.xori %ret_212, %ret_247 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_249 = arith.xori %left_idx_236, %right_idx_237 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_250 = arith.select %cond_245, %new_idxs_249, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_251 = arith.xori %new_idxs_215, %new_idxs_250 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_252 = tt.reshape %ret_248 : tensor<32x16xi32, #linear> -> tensor<256x2x1xi32, #linear1> loc(#loc154)
+    %ileft_253 = arith.muli %y_252, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc156)
+    %ileft_254 = "tt.reduce"(%ileft_253) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc193)
+    %ileft_255 = tt.expand_dims %ileft_254 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc158)
+    %ileft_256 = tt.broadcast %ileft_255 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc159)
+    %iright_257 = arith.muli %y_252, %iright : tensor<256x2x1xi32, #linear1> loc(#loc160)
+    %iright_258 = "tt.reduce"(%iright_257) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc195)
+    %iright_259 = tt.expand_dims %iright_258 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc162)
+    %iright_260 = tt.broadcast %iright_259 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc163)
+    %ileft_261 = tt.reshape %ileft_256 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_262 = tt.reshape %iright_260 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_263 = tt.reshape %new_idxs_251 : tensor<32x16xi32, #linear> -> tensor<256x2x1xi32, #linear1> loc(#loc166)
+    %left_idx_264 = arith.muli %y_idx_263, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc168)
+    %left_idx_265 = "tt.reduce"(%left_idx_264) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc198)
+    %left_idx_266 = tt.expand_dims %left_idx_265 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc170)
+    %left_idx_267 = tt.broadcast %left_idx_266 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc171)
+    %right_idx_268 = arith.muli %y_idx_263, %iright : tensor<256x2x1xi32, #linear1> loc(#loc173)
+    %right_idx_269 = "tt.reduce"(%right_idx_268) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc201)
+    %right_idx_270 = tt.expand_dims %right_idx_269 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc175)
+    %right_idx_271 = tt.broadcast %right_idx_270 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc176)
+    %left_idx_272 = tt.reshape %left_idx_267 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_273 = tt.reshape %right_idx_271 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_274 = arith.cmpi slt, %ileft_261, %iright_262 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_275 = arith.cmpi eq, %ileft_261, %iright_262 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_276 = arith.cmpi sgt, %left_idx_272, %right_idx_273 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_277 = arith.andi %eq_275, %cond_276 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_278 = arith.ori %cond_274, %cond_277 : tensor<32x16xi1, #linear> loc(#loc183)
+    %cond_279 = arith.extui %cond_278 : tensor<32x16xi1, #linear> to tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_280 = arith.xori %cond_279, %flip_178 : tensor<32x16xi32, #linear> loc(#loc184)
+    %cond_281 = arith.cmpi ne, %cond_280, %cst : tensor<32x16xi32, #linear> loc(#loc185)
+    %ret_282 = arith.xori %ileft_261, %iright_262 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_283 = arith.select %cond_281, %ret_282, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_284 = arith.xori %ret_248, %ret_283 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_285 = arith.xori %left_idx_272, %right_idx_273 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_286 = arith.select %cond_281, %new_idxs_285, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_287 = arith.xori %new_idxs_251, %new_idxs_286 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_288 = tt.reshape %ret_284 : tensor<32x16xi32, #linear> -> tensor<32x2x8xi32, #linear4> loc(#loc154)
+    %ileft_289 = tt.broadcast %left_mask_53 : tensor<1x2x1xi32, #linear4> -> tensor<32x2x8xi32, #linear4> loc(#loc156)
+    %ileft_290 = arith.muli %y_288, %ileft_289 : tensor<32x2x8xi32, #linear4> loc(#loc156)
+    %ileft_291 = "tt.reduce"(%ileft_290) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<32x2x8xi32, #linear4>) -> tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> loc(#loc193)
+    %ileft_292 = tt.expand_dims %ileft_291 {axis = 1 : i32} : tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> -> tensor<32x1x8xi32, #linear4> loc(#loc158)
+    %ileft_293 = tt.broadcast %ileft_292 : tensor<32x1x8xi32, #linear4> -> tensor<32x2x8xi32, #linear4> loc(#loc159)
+    %iright_294 = arith.muli %y_288, %flip_177 : tensor<32x2x8xi32, #linear4> loc(#loc160)
+    %iright_295 = "tt.reduce"(%iright_294) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<32x2x8xi32, #linear4>) -> tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> loc(#loc195)
+    %iright_296 = tt.expand_dims %iright_295 {axis = 1 : i32} : tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> -> tensor<32x1x8xi32, #linear4> loc(#loc162)
+    %iright_297 = tt.broadcast %iright_296 : tensor<32x1x8xi32, #linear4> -> tensor<32x2x8xi32, #linear4> loc(#loc163)
+    %ileft_298 = tt.reshape %ileft_293 : tensor<32x2x8xi32, #linear4> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_299 = tt.reshape %iright_297 : tensor<32x2x8xi32, #linear4> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_300 = tt.reshape %new_idxs_287 : tensor<32x16xi32, #linear> -> tensor<32x2x8xi32, #linear4> loc(#loc166)
+    %left_idx_301 = arith.muli %y_idx_300, %ileft_289 : tensor<32x2x8xi32, #linear4> loc(#loc168)
+    %left_idx_302 = "tt.reduce"(%left_idx_301) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<32x2x8xi32, #linear4>) -> tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> loc(#loc198)
+    %left_idx_303 = tt.expand_dims %left_idx_302 {axis = 1 : i32} : tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> -> tensor<32x1x8xi32, #linear4> loc(#loc170)
+    %left_idx_304 = tt.broadcast %left_idx_303 : tensor<32x1x8xi32, #linear4> -> tensor<32x2x8xi32, #linear4> loc(#loc171)
+    %right_idx_305 = arith.muli %y_idx_300, %flip_177 : tensor<32x2x8xi32, #linear4> loc(#loc173)
+    %right_idx_306 = "tt.reduce"(%right_idx_305) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<32x2x8xi32, #linear4>) -> tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> loc(#loc201)
+    %right_idx_307 = tt.expand_dims %right_idx_306 {axis = 1 : i32} : tensor<32x8xi32, #ttg.slice<{dim = 1, parent = #linear4}>> -> tensor<32x1x8xi32, #linear4> loc(#loc175)
+    %right_idx_308 = tt.broadcast %right_idx_307 : tensor<32x1x8xi32, #linear4> -> tensor<32x2x8xi32, #linear4> loc(#loc176)
+    %left_idx_309 = tt.reshape %left_idx_304 : tensor<32x2x8xi32, #linear4> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_310 = tt.reshape %right_idx_308 : tensor<32x2x8xi32, #linear4> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_311 = arith.cmpi slt, %ileft_298, %iright_299 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_312 = arith.cmpi eq, %ileft_298, %iright_299 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_313 = arith.cmpi sgt, %left_idx_309, %right_idx_310 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_314 = arith.andi %eq_312, %cond_313 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_315 = arith.ori %cond_311, %cond_314 : tensor<32x16xi1, #linear> loc(#loc183)
+    %ret_316 = arith.xori %ileft_298, %iright_299 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_317 = arith.select %cond_315, %ret_316, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_318 = arith.xori %ret_284, %ret_317 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_319 = arith.xori %left_idx_309, %right_idx_310 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_320 = arith.select %cond_315, %new_idxs_319, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_321 = arith.xori %new_idxs_287, %new_idxs_320 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_322 = tt.reshape %ret_318 : tensor<32x16xi32, #linear> -> tensor<64x2x4xi32, #linear3> loc(#loc154)
+    %ileft_323 = arith.muli %y_322, %ileft_180 : tensor<64x2x4xi32, #linear3> loc(#loc156)
+    %ileft_324 = "tt.reduce"(%ileft_323) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc193)
+    %ileft_325 = tt.expand_dims %ileft_324 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc158)
+    %ileft_326 = tt.broadcast %ileft_325 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc159)
+    %iright_327 = arith.muli %y_322, %flip_102 : tensor<64x2x4xi32, #linear3> loc(#loc160)
+    %iright_328 = "tt.reduce"(%iright_327) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc195)
+    %iright_329 = tt.expand_dims %iright_328 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc162)
+    %iright_330 = tt.broadcast %iright_329 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc163)
+    %ileft_331 = tt.reshape %ileft_326 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_332 = tt.reshape %iright_330 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_333 = tt.reshape %new_idxs_321 : tensor<32x16xi32, #linear> -> tensor<64x2x4xi32, #linear3> loc(#loc166)
+    %left_idx_334 = arith.muli %y_idx_333, %ileft_180 : tensor<64x2x4xi32, #linear3> loc(#loc168)
+    %left_idx_335 = "tt.reduce"(%left_idx_334) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc198)
+    %left_idx_336 = tt.expand_dims %left_idx_335 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc170)
+    %left_idx_337 = tt.broadcast %left_idx_336 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc171)
+    %right_idx_338 = arith.muli %y_idx_333, %flip_102 : tensor<64x2x4xi32, #linear3> loc(#loc173)
+    %right_idx_339 = "tt.reduce"(%right_idx_338) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<64x2x4xi32, #linear3>) -> tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> loc(#loc201)
+    %right_idx_340 = tt.expand_dims %right_idx_339 {axis = 1 : i32} : tensor<64x4xi32, #ttg.slice<{dim = 1, parent = #linear3}>> -> tensor<64x1x4xi32, #linear3> loc(#loc175)
+    %right_idx_341 = tt.broadcast %right_idx_340 : tensor<64x1x4xi32, #linear3> -> tensor<64x2x4xi32, #linear3> loc(#loc176)
+    %left_idx_342 = tt.reshape %left_idx_337 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_343 = tt.reshape %right_idx_341 : tensor<64x2x4xi32, #linear3> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_344 = arith.cmpi slt, %ileft_331, %iright_332 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_345 = arith.cmpi eq, %ileft_331, %iright_332 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_346 = arith.cmpi sgt, %left_idx_342, %right_idx_343 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_347 = arith.andi %eq_345, %cond_346 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_348 = arith.ori %cond_344, %cond_347 : tensor<32x16xi1, #linear> loc(#loc183)
+    %ret_349 = arith.xori %ileft_331, %iright_332 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_350 = arith.select %cond_348, %ret_349, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_351 = arith.xori %ret_318, %ret_350 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_352 = arith.xori %left_idx_342, %right_idx_343 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_353 = arith.select %cond_348, %new_idxs_352, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_354 = arith.xori %new_idxs_321, %new_idxs_353 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_355 = tt.reshape %ret_351 : tensor<32x16xi32, #linear> -> tensor<128x2x2xi32, #linear2> loc(#loc154)
+    %ileft_356 = arith.muli %y_355, %ileft_105 : tensor<128x2x2xi32, #linear2> loc(#loc156)
+    %ileft_357 = "tt.reduce"(%ileft_356) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc193)
+    %ileft_358 = tt.expand_dims %ileft_357 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc158)
+    %ileft_359 = tt.broadcast %ileft_358 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc159)
+    %iright_360 = arith.muli %y_355, %flip_48 : tensor<128x2x2xi32, #linear2> loc(#loc160)
+    %iright_361 = "tt.reduce"(%iright_360) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc195)
+    %iright_362 = tt.expand_dims %iright_361 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc162)
+    %iright_363 = tt.broadcast %iright_362 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc163)
+    %ileft_364 = tt.reshape %ileft_359 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_365 = tt.reshape %iright_363 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_366 = tt.reshape %new_idxs_354 : tensor<32x16xi32, #linear> -> tensor<128x2x2xi32, #linear2> loc(#loc166)
+    %left_idx_367 = arith.muli %y_idx_366, %ileft_105 : tensor<128x2x2xi32, #linear2> loc(#loc168)
+    %left_idx_368 = "tt.reduce"(%left_idx_367) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc198)
+    %left_idx_369 = tt.expand_dims %left_idx_368 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc170)
+    %left_idx_370 = tt.broadcast %left_idx_369 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc171)
+    %right_idx_371 = arith.muli %y_idx_366, %flip_48 : tensor<128x2x2xi32, #linear2> loc(#loc173)
+    %right_idx_372 = "tt.reduce"(%right_idx_371) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<128x2x2xi32, #linear2>) -> tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> loc(#loc201)
+    %right_idx_373 = tt.expand_dims %right_idx_372 {axis = 1 : i32} : tensor<128x2xi32, #ttg.slice<{dim = 1, parent = #linear2}>> -> tensor<128x1x2xi32, #linear2> loc(#loc175)
+    %right_idx_374 = tt.broadcast %right_idx_373 : tensor<128x1x2xi32, #linear2> -> tensor<128x2x2xi32, #linear2> loc(#loc176)
+    %left_idx_375 = tt.reshape %left_idx_370 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_376 = tt.reshape %right_idx_374 : tensor<128x2x2xi32, #linear2> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_377 = arith.cmpi slt, %ileft_364, %iright_365 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_378 = arith.cmpi eq, %ileft_364, %iright_365 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_379 = arith.cmpi sgt, %left_idx_375, %right_idx_376 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_380 = arith.andi %eq_378, %cond_379 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_381 = arith.ori %cond_377, %cond_380 : tensor<32x16xi1, #linear> loc(#loc183)
+    %ret_382 = arith.xori %ileft_364, %iright_365 : tensor<32x16xi32, #linear> loc(#loc186)
+    %ret_383 = arith.select %cond_381, %ret_382, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc187)
+    %ret_384 = arith.xori %ret_351, %ret_383 : tensor<32x16xi32, #linear> loc(#loc188)
+    %new_idxs_385 = arith.xori %left_idx_375, %right_idx_376 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_386 = arith.select %cond_381, %new_idxs_385, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_387 = arith.xori %new_idxs_354, %new_idxs_386 : tensor<32x16xi32, #linear> loc(#loc191)
+    %y_388 = tt.reshape %ret_384 : tensor<32x16xi32, #linear> -> tensor<256x2x1xi32, #linear1> loc(#loc154)
+    %ileft_389 = arith.muli %y_388, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc156)
+    %ileft_390 = "tt.reduce"(%ileft_389) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_419: i32 loc(callsite(#loc1 at #loc157)), %ileft_420: i32 loc(callsite(#loc1 at #loc157))):
+      %ileft_421 = arith.addi %ileft_419, %ileft_420 : i32 loc(#loc203)
+      tt.reduce.return %ileft_421 : i32 loc(#loc193)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc193)
+    %ileft_391 = tt.expand_dims %ileft_390 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc158)
+    %ileft_392 = tt.broadcast %ileft_391 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc159)
+    %iright_393 = arith.muli %y_388, %iright : tensor<256x2x1xi32, #linear1> loc(#loc160)
+    %iright_394 = "tt.reduce"(%iright_393) <{axis = 1 : i32}> ({
+    ^bb0(%iright_419: i32 loc(callsite(#loc1 at #loc161)), %iright_420: i32 loc(callsite(#loc1 at #loc161))):
+      %iright_421 = arith.addi %iright_419, %iright_420 : i32 loc(#loc204)
+      tt.reduce.return %iright_421 : i32 loc(#loc195)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc195)
+    %iright_395 = tt.expand_dims %iright_394 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc162)
+    %iright_396 = tt.broadcast %iright_395 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc163)
+    %ileft_397 = tt.reshape %ileft_392 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc164)
+    %iright_398 = tt.reshape %iright_396 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc165)
+    %y_idx_399 = tt.reshape %new_idxs_387 : tensor<32x16xi32, #linear> -> tensor<256x2x1xi32, #linear1> loc(#loc166)
+    %left_idx_400 = arith.muli %y_idx_399, %ileft : tensor<256x2x1xi32, #linear1> loc(#loc168)
+    %left_idx_401 = "tt.reduce"(%left_idx_400) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_419: i32 loc(callsite(#loc1 at #loc169)), %left_idx_420: i32 loc(callsite(#loc1 at #loc169))):
+      %left_idx_421 = arith.addi %left_idx_419, %left_idx_420 : i32 loc(#loc205)
+      tt.reduce.return %left_idx_421 : i32 loc(#loc198)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc198)
+    %left_idx_402 = tt.expand_dims %left_idx_401 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc170)
+    %left_idx_403 = tt.broadcast %left_idx_402 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc171)
+    %right_idx_404 = arith.muli %y_idx_399, %iright : tensor<256x2x1xi32, #linear1> loc(#loc173)
+    %right_idx_405 = "tt.reduce"(%right_idx_404) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_419: i32 loc(callsite(#loc1 at #loc174)), %right_idx_420: i32 loc(callsite(#loc1 at #loc174))):
+      %right_idx_421 = arith.addi %right_idx_419, %right_idx_420 : i32 loc(#loc206)
+      tt.reduce.return %right_idx_421 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32, #linear1>) -> tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> loc(#loc201)
+    %right_idx_406 = tt.expand_dims %right_idx_405 {axis = 1 : i32} : tensor<256x1xi32, #ttg.slice<{dim = 1, parent = #linear1}>> -> tensor<256x1x1xi32, #linear1> loc(#loc175)
+    %right_idx_407 = tt.broadcast %right_idx_406 : tensor<256x1x1xi32, #linear1> -> tensor<256x2x1xi32, #linear1> loc(#loc176)
+    %left_idx_408 = tt.reshape %left_idx_403 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc177)
+    %right_idx_409 = tt.reshape %right_idx_407 : tensor<256x2x1xi32, #linear1> -> tensor<32x16xi32, #linear> loc(#loc178)
+    %cond_410 = arith.cmpi slt, %ileft_397, %iright_398 : tensor<32x16xi32, #linear> loc(#loc179)
+    %eq_411 = arith.cmpi eq, %ileft_397, %iright_398 : tensor<32x16xi32, #linear> loc(#loc180)
+    %cond_412 = arith.cmpi sgt, %left_idx_408, %right_idx_409 : tensor<32x16xi32, #linear> loc(#loc181)
+    %cond_413 = arith.andi %eq_411, %cond_412 : tensor<32x16xi1, #linear> loc(#loc182)
+    %cond_414 = arith.ori %cond_410, %cond_413 : tensor<32x16xi1, #linear> loc(#loc183)
+    %new_idxs_415 = arith.xori %left_idx_408, %right_idx_409 : tensor<32x16xi32, #linear> loc(#loc189)
+    %new_idxs_416 = arith.select %cond_414, %new_idxs_415, %cst : tensor<32x16xi1, #linear>, tensor<32x16xi32, #linear> loc(#loc190)
+    %new_idxs_417 = arith.xori %new_idxs_387, %new_idxs_416 : tensor<32x16xi32, #linear> loc(#loc191)
+    %tmp7 = arith.extsi %tmp0_36 : tensor<32x16xi32, #blocked> to tensor<32x16xi64, #blocked> loc(#loc141)
+    %tmp10 = arith.select %tmp0_34, %tmp7, %cst_0 : tensor<32x16xi1, #blocked>, tensor<32x16xi64, #blocked> loc(#loc142)
+    %tmp11 = "tt.reduce"(%tmp10) <{axis = 1 : i32}> ({
+    ^bb0(%tmp11_419: i64 loc(callsite(#loc1 at #loc143)), %tmp11_420: i64 loc(callsite(#loc1 at #loc143))):
+      %tmp11_421 = arith.addi %tmp11_419, %tmp11_420 : i64 loc(#loc192)
+      tt.reduce.return %tmp11_421 : i64 loc(#loc152)
+    }) : (tensor<32x16xi64, #blocked>) -> tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> loc(#loc152)
+    %tmp11_418 = tt.expand_dims %tmp11 {axis = 1 : i32} : tensor<32xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<32x1xi64, #blocked> loc(#loc144)
+    %tmp14 = arith.trunci %tmp11_418 : tensor<32x1xi64, #blocked> to tensor<32x1xi32, #blocked> loc(#loc145)
+    %0 = arith.muli %xindex_19, %cst_4 : tensor<32x1xi32, #blocked1> loc(#loc70)
+    %1 = tt.broadcast %r0_index_25 : tensor<1x16xi32, #blocked1> -> tensor<32x16xi32, #blocked1> loc(#loc71)
+    %2 = tt.broadcast %0 : tensor<32x1xi32, #blocked1> -> tensor<32x16xi32, #blocked1> loc(#loc71)
+    %3 = arith.addi %1, %2 : tensor<32x16xi32, #blocked1> loc(#loc71)
+    %4 = tt.splat %out_ptr2 : !tt.ptr<i32> -> tensor<32x16x!tt.ptr<i32>, #blocked1> loc(#loc72)
+    %5 = tt.addptr %4, %3 : tensor<32x16x!tt.ptr<i32>, #blocked1>, tensor<32x16xi32, #blocked1> loc(#loc72)
+    %6 = ttg.convert_layout %new_idxs_417 : tensor<32x16xi32, #linear> -> tensor<32x16xi32, #blocked1> loc(#loc73)
+    tt.store %5, %6, %tmp0_35 : tensor<32x16x!tt.ptr<i32>, #blocked1> loc(#loc73)
+    %7 = tt.splat %out_ptr3 : !tt.ptr<i32> -> tensor<32x1x!tt.ptr<i32>, #blocked> loc(#loc74)
+    %8 = tt.addptr %7, %xindex_18 : tensor<32x1x!tt.ptr<i32>, #blocked>, tensor<32x1xi32, #blocked> loc(#loc74)
+    tt.store %8, %tmp14, %xmask : tensor<32x1x!tt.ptr<i32>, #blocked> loc(#loc75)
+    tt.return loc(#loc76)
+  } loc(#loc)
+} loc(#loc)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":24:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":24:33)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":25:44)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":25:23)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":26:21)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":27:38)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":33:19)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":34:19)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:38)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:35)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:49)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:45)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:30)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:54)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":38:19)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":40:33)
+#loc18 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:44)
+#loc21 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:60)
+#loc22 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:68)
+#loc23 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":533:22)
+#loc25 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":537:21)
+#loc26 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:40)
+#loc27 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc29 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc30 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:65)
+#loc31 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:78)
+#loc32 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:41)
+#loc34 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:67)
+#loc35 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:80)
+#loc36 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":540:30)
+#loc37 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":541:32)
+#loc38 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":546:29)
+#loc39 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:36)
+#loc40 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:23)
+#loc41 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":290:25)
+#loc43 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:53)
+#loc44 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:66)
+#loc45 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:37)
+#loc46 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:23)
+#loc48 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:54)
+#loc49 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:67)
+#loc50 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":553:36)
+#loc51 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":554:38)
+#loc52 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":574:22)
+#loc53 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":591:21)
+#loc54 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":594:40)
+#loc55 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":594:29)
+#loc56 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":594:23)
+#loc57 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":599:19)
+#loc58 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":599:28)
+#loc59 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":600:38)
+#loc60 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":600:46)
+#loc61 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":600:15)
+#loc62 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":601:48)
+#loc63 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":601:59)
+#loc64 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":601:22)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":42:19)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":44:34)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":45:29)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":48:21)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:35)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:32)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:25)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:47)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":50:25)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":50:37)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":50:4)
+#loc82 = loc("xoffset"(#loc2))
+#loc83 = loc("xoffset"(#loc3))
+#loc84 = loc("xindex"(#loc4))
+#loc85 = loc("xindex"(#loc5))
+#loc86 = loc("xmask"(#loc6))
+#loc87 = loc("r0_index"(#loc7))
+#loc88 = loc("x0"(#loc8))
+#loc89 = loc("x1"(#loc9))
+#loc90 = loc("tmp0"(#loc10))
+#loc91 = loc("tmp0"(#loc11))
+#loc92 = loc("tmp0"(#loc12))
+#loc93 = loc("tmp0"(#loc13))
+#loc94 = loc("tmp0"(#loc14))
+#loc95 = loc("tmp0"(#loc15))
+#loc96 = loc("tmp2"(#loc16))
+#loc97 = loc("tmp4"(#loc17))
+#loc98 = loc("flip"(#loc18))
+#loc100 = loc("flip"(#loc21))
+#loc101 = loc("flip"(#loc22))
+#loc102 = loc("y"(#loc23))
+#loc103 = loc("left_mask"(#loc25))
+#loc104 = loc("ileft"(#loc26))
+#loc106 = loc("ileft"(#loc30))
+#loc107 = loc("ileft"(#loc31))
+#loc108 = loc("iright"(#loc32))
+#loc110 = loc("iright"(#loc34))
+#loc111 = loc("iright"(#loc35))
+#loc112 = loc("ileft"(#loc36))
+#loc113 = loc("iright"(#loc37))
+#loc114 = loc("y_idx"(#loc38))
+#loc115 = loc("left_idx"(#loc39))
+#loc116 = loc("left_idx"(#loc40))
+#loc117 = loc("input"(#loc41))
+#loc119 = loc("left_idx"(#loc43))
+#loc120 = loc("left_idx"(#loc44))
+#loc121 = loc("right_idx"(#loc45))
+#loc122 = loc("right_idx"(#loc46))
+#loc124 = loc("right_idx"(#loc48))
+#loc125 = loc("right_idx"(#loc49))
+#loc126 = loc("left_idx"(#loc50))
+#loc127 = loc("right_idx"(#loc51))
+#loc128 = loc("cond"(#loc52))
+#loc129 = loc("eq"(#loc53))
+#loc130 = loc("cond"(#loc54))
+#loc131 = loc("cond"(#loc55))
+#loc132 = loc("cond"(#loc56))
+#loc133 = loc("cond"(#loc57))
+#loc134 = loc("cond"(#loc58))
+#loc135 = loc("ret"(#loc59))
+#loc136 = loc("ret"(#loc60))
+#loc137 = loc("ret"(#loc61))
+#loc138 = loc("new_idxs"(#loc62))
+#loc139 = loc("new_idxs"(#loc63))
+#loc140 = loc("new_idxs"(#loc64))
+#loc141 = loc("tmp7"(#loc65))
+#loc142 = loc("tmp10"(#loc66))
+#loc144 = loc("tmp11"(#loc68))
+#loc145 = loc("tmp14"(#loc69))
+#loc146 = loc(callsite(#loc98 at #loc99))
+#loc147 = loc(callsite(#loc100 at #loc99))
+#loc148 = loc(callsite(#loc101 at #loc99))
+#loc150 = loc("cond"(#loc128))
+#loc151 = loc("eq"(#loc129))
+#loc152 = loc(callsite(#loc27 at #loc143))
+#loc154 = loc(callsite(#loc102 at #loc149))
+#loc155 = loc(callsite(#loc103 at #loc149))
+#loc156 = loc(callsite(#loc104 at #loc149))
+#loc158 = loc(callsite(#loc106 at #loc149))
+#loc159 = loc(callsite(#loc107 at #loc149))
+#loc160 = loc(callsite(#loc108 at #loc149))
+#loc162 = loc(callsite(#loc110 at #loc149))
+#loc163 = loc(callsite(#loc111 at #loc149))
+#loc164 = loc(callsite(#loc112 at #loc149))
+#loc165 = loc(callsite(#loc113 at #loc149))
+#loc166 = loc(callsite(#loc114 at #loc149))
+#loc167 = loc(callsite(#loc115 at #loc149))
+#loc168 = loc(callsite(#loc116 at #loc149))
+#loc170 = loc(callsite(#loc119 at #loc149))
+#loc171 = loc(callsite(#loc120 at #loc149))
+#loc172 = loc(callsite(#loc121 at #loc149))
+#loc173 = loc(callsite(#loc122 at #loc149))
+#loc175 = loc(callsite(#loc124 at #loc149))
+#loc176 = loc(callsite(#loc125 at #loc149))
+#loc177 = loc(callsite(#loc126 at #loc149))
+#loc178 = loc(callsite(#loc127 at #loc149))
+#loc179 = loc(callsite(#loc150 at #loc149))
+#loc180 = loc(callsite(#loc151 at #loc149))
+#loc181 = loc(callsite(#loc130 at #loc149))
+#loc182 = loc(callsite(#loc131 at #loc149))
+#loc183 = loc(callsite(#loc132 at #loc149))
+#loc184 = loc(callsite(#loc133 at #loc149))
+#loc185 = loc(callsite(#loc134 at #loc149))
+#loc186 = loc(callsite(#loc135 at #loc149))
+#loc187 = loc(callsite(#loc136 at #loc149))
+#loc188 = loc(callsite(#loc137 at #loc149))
+#loc189 = loc(callsite(#loc138 at #loc149))
+#loc190 = loc(callsite(#loc139 at #loc149))
+#loc191 = loc(callsite(#loc140 at #loc149))
+#loc192 = loc(callsite(#loc29 at #loc152))
+#loc193 = loc(callsite(#loc27 at #loc157))
+#loc195 = loc(callsite(#loc27 at #loc161))
+#loc197 = loc(callsite(#loc117 at #loc169))
+#loc198 = loc(callsite(#loc27 at #loc169))
+#loc200 = loc(callsite(#loc117 at #loc174))
+#loc201 = loc(callsite(#loc27 at #loc174))
+#loc203 = loc(callsite(#loc29 at #loc193))
+#loc204 = loc(callsite(#loc29 at #loc195))
+#loc205 = loc(callsite(#loc29 at #loc198))
+#loc206 = loc(callsite(#loc29 at #loc201))

SpecForge-ext/cache/compiled_kernels/triton/3/A7DYCXJM4X5DHYLAIRTU6BFB3S5UCV3W4C27BWQBJGXYAG3NWQWA/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttir

	@@ -0,0 +1,799 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":18:0)
+#loc1 = loc(unknown)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":41:67)
+#loc23 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":662:12)
+#loc28 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":634:73)
+#loc32 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:51)
+#loc37 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:53)
+#loc46 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:50)
+#loc51 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:51)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":45:26)
+#loc80 = loc("in_ptr0"(#loc))
+#loc81 = loc("out_ptr2"(#loc))
+#loc82 = loc("out_ptr3"(#loc))
+#loc83 = loc("xnumel"(#loc))
+#loc84 = loc("r0_numel"(#loc))
+#loc106 = loc(callsite(#loc23 at #loc2))
+#loc113 = loc("ileft"(#loc32))
+#loc117 = loc("iright"(#loc37))
+#loc126 = loc("left_idx"(#loc46))
+#loc131 = loc("right_idx"(#loc51))
+#loc150 = loc("tmp11"(#loc70))
+#loc157 = loc(callsite(#loc28 at #loc106))
+#loc161 = loc(callsite(#loc1 at #loc150))
+#loc165 = loc(callsite(#loc113 at #loc157))
+#loc169 = loc(callsite(#loc117 at #loc157))
+#loc177 = loc(callsite(#loc126 at #loc157))
+#loc182 = loc(callsite(#loc131 at #loc157))
+#loc202 = loc(callsite(#loc1 at #loc165))
+#loc204 = loc(callsite(#loc1 at #loc169))
+#loc207 = loc(callsite(#loc1 at #loc177))
+#loc210 = loc(callsite(#loc1 at #loc182))
+module {
+  tt.func public @triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3(%in_ptr0: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %out_ptr3: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr3"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<1> : tensor<1x2x1xi32> loc(#loc85)
+    %cst_0 = arith.constant dense<0> : tensor<32x16xi32> loc(#loc1)
+    %tmp10 = arith.constant dense<0> : tensor<32x16xi64> loc(#loc86)
+    %tmp0 = arith.constant dense<272> : tensor<32x1xi32> loc(#loc87)
+    %tmp0_1 = arith.constant dense<17> : tensor<1x16xi32> loc(#loc88)
+    %cst_2 = arith.constant dense<16> : tensor<32x1xi32> loc(#loc1)
+    %xmask = arith.constant dense<32> : tensor<32x1xi32> loc(#loc89)
+    %c32_i32 = arith.constant 32 : i32 loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc90)
+    %xoffset_3 = arith.muli %xoffset, %c32_i32 : i32 loc(#loc91)
+    %xindex = tt.make_range {end = 32 : i32, start = 0 : i32} : tensor<32xi32> loc(#loc92)
+    %xindex_4 = tt.expand_dims %xindex {axis = 1 : i32} : tensor<32xi32> -> tensor<32x1xi32> loc(#loc93)
+    %xindex_5 = tt.splat %xoffset_3 : i32 -> tensor<32x1xi32> loc(#loc94)
+    %xindex_6 = arith.addi %xindex_5, %xindex_4 : tensor<32x1xi32> loc(#loc94)
+    %xmask_7 = arith.cmpi slt, %xindex_6, %xmask : tensor<32x1xi32> loc(#loc89)
+    %r0_index = tt.make_range {end = 16 : i32, start = 0 : i32} : tensor<16xi32> loc(#loc95)
+    %r0_index_8 = tt.expand_dims %r0_index {axis = 0 : i32} : tensor<16xi32> -> tensor<1x16xi32> loc(#loc96)
+    %x0 = arith.remsi %xindex_6, %cst_2 : tensor<32x1xi32> loc(#loc97)
+    %x1 = arith.divsi %xindex_6, %cst_2 : tensor<32x1xi32> loc(#loc98)
+    %tmp0_9 = arith.muli %r0_index_8, %tmp0_1 : tensor<1x16xi32> loc(#loc88)
+    %tmp0_10 = tt.broadcast %x0 : tensor<32x1xi32> -> tensor<32x16xi32> loc(#loc99)
+    %tmp0_11 = tt.broadcast %tmp0_9 : tensor<1x16xi32> -> tensor<32x16xi32> loc(#loc99)
+    %tmp0_12 = arith.addi %tmp0_10, %tmp0_11 : tensor<32x16xi32> loc(#loc99)
+    %tmp0_13 = arith.muli %x1, %tmp0 : tensor<32x1xi32> loc(#loc87)
+    %tmp0_14 = tt.broadcast %tmp0_13 : tensor<32x1xi32> -> tensor<32x16xi32> loc(#loc100)
+    %tmp0_15 = arith.addi %tmp0_12, %tmp0_14 : tensor<32x16xi32> loc(#loc100)
+    %tmp0_16 = tt.splat %in_ptr0 : !tt.ptr<i32> -> tensor<32x16x!tt.ptr<i32>> loc(#loc101)
+    %tmp0_17 = tt.addptr %tmp0_16, %tmp0_15 : tensor<32x16x!tt.ptr<i32>>, tensor<32x16xi32> loc(#loc101)
+    %tmp0_18 = tt.broadcast %xmask_7 : tensor<32x1xi1> -> tensor<32x16xi1> loc(#loc102)
+    %tmp0_19 = tt.load %tmp0_17, %tmp0_18, %cst_0 : tensor<32x16x!tt.ptr<i32>> loc(#loc102)
+    %tmp2 = arith.trunci %r0_index_8 : tensor<1x16xi32> to tensor<1x16xi16> loc(#loc103)
+    %tmp4 = tt.broadcast %tmp2 : tensor<1x16xi16> -> tensor<32x16xi16> loc(#loc104)
+    %flip = tt.make_range {end = 2 : i32, start = 0 : i32} : tensor<2xi32> loc(#loc153)
+    %flip_20 = tt.expand_dims %flip {axis = 0 : i32} : tensor<2xi32> -> tensor<1x2xi32> loc(#loc154)
+    %flip_21 = tt.expand_dims %flip_20 {axis = 2 : i32} : tensor<1x2xi32> -> tensor<1x2x1xi32> loc(#loc154)
+    %flip_22 = tt.broadcast %flip_21 : tensor<1x2x1xi32> -> tensor<128x2x2xi32> loc(#loc155)
+    %flip_23 = tt.reshape %flip_22 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc156)
+    %y = tt.reshape %tmp0_19 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc162)
+    %left_mask = arith.subi %cst, %flip_21 : tensor<1x2x1xi32> loc(#loc163)
+    %ileft = tt.broadcast %left_mask : tensor<1x2x1xi32> -> tensor<256x2x1xi32> loc(#loc164)
+    %ileft_24 = arith.muli %y, %ileft : tensor<256x2x1xi32> loc(#loc164)
+    %ileft_25 = "tt.reduce"(%ileft_24) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc201)
+    %ileft_26 = tt.expand_dims %ileft_25 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc166)
+    %ileft_27 = tt.broadcast %ileft_26 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc167)
+    %iright = tt.broadcast %flip_21 : tensor<1x2x1xi32> -> tensor<256x2x1xi32> loc(#loc168)
+    %iright_28 = arith.muli %y, %iright : tensor<256x2x1xi32> loc(#loc168)
+    %iright_29 = "tt.reduce"(%iright_28) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc203)
+    %iright_30 = tt.expand_dims %iright_29 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc170)
+    %iright_31 = tt.broadcast %iright_30 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc171)
+    %ileft_32 = tt.reshape %ileft_27 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_33 = tt.reshape %iright_31 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx = tt.reshape %tmp4 : tensor<32x16xi16> -> tensor<256x2x1xi16> loc(#loc174)
+    %left_idx = arith.trunci %left_mask : tensor<1x2x1xi32> to tensor<1x2x1xi16> loc(#loc175)
+    %left_idx_34 = tt.broadcast %left_idx : tensor<1x2x1xi16> -> tensor<256x2x1xi16> loc(#loc176)
+    %left_idx_35 = arith.muli %y_idx, %left_idx_34 : tensor<256x2x1xi16> loc(#loc176)
+    %input = arith.extsi %left_idx_35 : tensor<256x2x1xi16> to tensor<256x2x1xi32> loc(#loc205)
+    %left_idx_36 = "tt.reduce"(%input) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc206)
+    %left_idx_37 = tt.expand_dims %left_idx_36 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc178)
+    %left_idx_38 = tt.broadcast %left_idx_37 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc179)
+    %right_idx = arith.trunci %flip_21 : tensor<1x2x1xi32> to tensor<1x2x1xi16> loc(#loc180)
+    %right_idx_39 = tt.broadcast %right_idx : tensor<1x2x1xi16> -> tensor<256x2x1xi16> loc(#loc181)
+    %right_idx_40 = arith.muli %y_idx, %right_idx_39 : tensor<256x2x1xi16> loc(#loc181)
+    %input_41 = arith.extsi %right_idx_40 : tensor<256x2x1xi16> to tensor<256x2x1xi32> loc(#loc208)
+    %right_idx_42 = "tt.reduce"(%input_41) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc209)
+    %right_idx_43 = tt.expand_dims %right_idx_42 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc183)
+    %right_idx_44 = tt.broadcast %right_idx_43 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc184)
+    %left_idx_45 = tt.reshape %left_idx_38 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_46 = tt.reshape %right_idx_44 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond = arith.cmpi slt, %ileft_32, %iright_33 : tensor<32x16xi32> loc(#loc187)
+    %eq = arith.cmpi eq, %ileft_32, %iright_33 : tensor<32x16xi32> loc(#loc188)
+    %cond_47 = arith.cmpi sgt, %left_idx_45, %right_idx_46 : tensor<32x16xi32> loc(#loc189)
+    %cond_48 = arith.andi %eq, %cond_47 : tensor<32x16xi1> loc(#loc190)
+    %cond_49 = arith.ori %cond, %cond_48 : tensor<32x16xi1> loc(#loc191)
+    %cond_50 = arith.extui %cond_49 : tensor<32x16xi1> to tensor<32x16xi32> loc(#loc192)
+    %cond_51 = arith.xori %cond_50, %flip_23 : tensor<32x16xi32> loc(#loc192)
+    %cond_52 = arith.cmpi ne, %cond_51, %cst_0 : tensor<32x16xi32> loc(#loc193)
+    %ret = arith.xori %ileft_32, %iright_33 : tensor<32x16xi32> loc(#loc194)
+    %ret_53 = arith.select %cond_52, %ret, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_54 = arith.xori %tmp0_19, %ret_53 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs = arith.xori %left_idx_45, %right_idx_46 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_55 = arith.select %cond_52, %new_idxs, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_56 = arith.extsi %tmp2 : tensor<1x16xi16> to tensor<1x16xi32> loc(#loc199)
+    %new_idxs_57 = tt.broadcast %new_idxs_56 : tensor<1x16xi32> -> tensor<32x16xi32> loc(#loc199)
+    %new_idxs_58 = arith.xori %new_idxs_57, %new_idxs_55 : tensor<32x16xi32> loc(#loc199)
+    %flip_59 = tt.broadcast %flip_21 : tensor<1x2x1xi32> -> tensor<64x2x4xi32> loc(#loc155)
+    %flip_60 = tt.reshape %flip_59 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc156)
+    %y_61 = tt.reshape %ret_54 : tensor<32x16xi32> -> tensor<128x2x2xi32> loc(#loc162)
+    %ileft_62 = tt.broadcast %left_mask : tensor<1x2x1xi32> -> tensor<128x2x2xi32> loc(#loc164)
+    %ileft_63 = arith.muli %y_61, %ileft_62 : tensor<128x2x2xi32> loc(#loc164)
+    %ileft_64 = "tt.reduce"(%ileft_63) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc201)
+    %ileft_65 = tt.expand_dims %ileft_64 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc166)
+    %ileft_66 = tt.broadcast %ileft_65 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc167)
+    %iright_67 = arith.muli %y_61, %flip_22 : tensor<128x2x2xi32> loc(#loc168)
+    %iright_68 = "tt.reduce"(%iright_67) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc203)
+    %iright_69 = tt.expand_dims %iright_68 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc170)
+    %iright_70 = tt.broadcast %iright_69 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc171)
+    %ileft_71 = tt.reshape %ileft_66 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_72 = tt.reshape %iright_70 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_73 = tt.reshape %new_idxs_58 : tensor<32x16xi32> -> tensor<128x2x2xi32> loc(#loc174)
+    %left_idx_74 = arith.muli %y_idx_73, %ileft_62 : tensor<128x2x2xi32> loc(#loc176)
+    %left_idx_75 = "tt.reduce"(%left_idx_74) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc206)
+    %left_idx_76 = tt.expand_dims %left_idx_75 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc178)
+    %left_idx_77 = tt.broadcast %left_idx_76 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc179)
+    %right_idx_78 = arith.muli %y_idx_73, %flip_22 : tensor<128x2x2xi32> loc(#loc181)
+    %right_idx_79 = "tt.reduce"(%right_idx_78) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc209)
+    %right_idx_80 = tt.expand_dims %right_idx_79 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc183)
+    %right_idx_81 = tt.broadcast %right_idx_80 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc184)
+    %left_idx_82 = tt.reshape %left_idx_77 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_83 = tt.reshape %right_idx_81 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_84 = arith.cmpi slt, %ileft_71, %iright_72 : tensor<32x16xi32> loc(#loc187)
+    %eq_85 = arith.cmpi eq, %ileft_71, %iright_72 : tensor<32x16xi32> loc(#loc188)
+    %cond_86 = arith.cmpi sgt, %left_idx_82, %right_idx_83 : tensor<32x16xi32> loc(#loc189)
+    %cond_87 = arith.andi %eq_85, %cond_86 : tensor<32x16xi1> loc(#loc190)
+    %cond_88 = arith.ori %cond_84, %cond_87 : tensor<32x16xi1> loc(#loc191)
+    %cond_89 = arith.extui %cond_88 : tensor<32x16xi1> to tensor<32x16xi32> loc(#loc192)
+    %cond_90 = arith.xori %cond_89, %flip_60 : tensor<32x16xi32> loc(#loc192)
+    %cond_91 = arith.cmpi ne, %cond_90, %cst_0 : tensor<32x16xi32> loc(#loc193)
+    %ret_92 = arith.xori %ileft_71, %iright_72 : tensor<32x16xi32> loc(#loc194)
+    %ret_93 = arith.select %cond_91, %ret_92, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_94 = arith.xori %ret_54, %ret_93 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_95 = arith.xori %left_idx_82, %right_idx_83 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_96 = arith.select %cond_91, %new_idxs_95, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_97 = arith.xori %new_idxs_58, %new_idxs_96 : tensor<32x16xi32> loc(#loc199)
+    %y_98 = tt.reshape %ret_94 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc162)
+    %ileft_99 = arith.muli %y_98, %ileft : tensor<256x2x1xi32> loc(#loc164)
+    %ileft_100 = "tt.reduce"(%ileft_99) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc201)
+    %ileft_101 = tt.expand_dims %ileft_100 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc166)
+    %ileft_102 = tt.broadcast %ileft_101 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc167)
+    %iright_103 = arith.muli %y_98, %iright : tensor<256x2x1xi32> loc(#loc168)
+    %iright_104 = "tt.reduce"(%iright_103) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc203)
+    %iright_105 = tt.expand_dims %iright_104 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc170)
+    %iright_106 = tt.broadcast %iright_105 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc171)
+    %ileft_107 = tt.reshape %ileft_102 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_108 = tt.reshape %iright_106 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_109 = tt.reshape %new_idxs_97 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc174)
+    %left_idx_110 = arith.muli %y_idx_109, %ileft : tensor<256x2x1xi32> loc(#loc176)
+    %left_idx_111 = "tt.reduce"(%left_idx_110) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc206)
+    %left_idx_112 = tt.expand_dims %left_idx_111 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc178)
+    %left_idx_113 = tt.broadcast %left_idx_112 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc179)
+    %right_idx_114 = arith.muli %y_idx_109, %iright : tensor<256x2x1xi32> loc(#loc181)
+    %right_idx_115 = "tt.reduce"(%right_idx_114) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc209)
+    %right_idx_116 = tt.expand_dims %right_idx_115 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc183)
+    %right_idx_117 = tt.broadcast %right_idx_116 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc184)
+    %left_idx_118 = tt.reshape %left_idx_113 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_119 = tt.reshape %right_idx_117 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_120 = arith.cmpi slt, %ileft_107, %iright_108 : tensor<32x16xi32> loc(#loc187)
+    %eq_121 = arith.cmpi eq, %ileft_107, %iright_108 : tensor<32x16xi32> loc(#loc188)
+    %cond_122 = arith.cmpi sgt, %left_idx_118, %right_idx_119 : tensor<32x16xi32> loc(#loc189)
+    %cond_123 = arith.andi %eq_121, %cond_122 : tensor<32x16xi1> loc(#loc190)
+    %cond_124 = arith.ori %cond_120, %cond_123 : tensor<32x16xi1> loc(#loc191)
+    %cond_125 = arith.extui %cond_124 : tensor<32x16xi1> to tensor<32x16xi32> loc(#loc192)
+    %cond_126 = arith.xori %cond_125, %flip_60 : tensor<32x16xi32> loc(#loc192)
+    %cond_127 = arith.cmpi ne, %cond_126, %cst_0 : tensor<32x16xi32> loc(#loc193)
+    %ret_128 = arith.xori %ileft_107, %iright_108 : tensor<32x16xi32> loc(#loc194)
+    %ret_129 = arith.select %cond_127, %ret_128, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_130 = arith.xori %ret_94, %ret_129 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_131 = arith.xori %left_idx_118, %right_idx_119 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_132 = arith.select %cond_127, %new_idxs_131, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_133 = arith.xori %new_idxs_97, %new_idxs_132 : tensor<32x16xi32> loc(#loc199)
+    %flip_134 = tt.broadcast %flip_21 : tensor<1x2x1xi32> -> tensor<32x2x8xi32> loc(#loc155)
+    %flip_135 = tt.reshape %flip_134 : tensor<32x2x8xi32> -> tensor<32x16xi32> loc(#loc156)
+    %y_136 = tt.reshape %ret_130 : tensor<32x16xi32> -> tensor<64x2x4xi32> loc(#loc162)
+    %ileft_137 = tt.broadcast %left_mask : tensor<1x2x1xi32> -> tensor<64x2x4xi32> loc(#loc164)
+    %ileft_138 = arith.muli %y_136, %ileft_137 : tensor<64x2x4xi32> loc(#loc164)
+    %ileft_139 = "tt.reduce"(%ileft_138) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc201)
+    %ileft_140 = tt.expand_dims %ileft_139 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc166)
+    %ileft_141 = tt.broadcast %ileft_140 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc167)
+    %iright_142 = arith.muli %y_136, %flip_59 : tensor<64x2x4xi32> loc(#loc168)
+    %iright_143 = "tt.reduce"(%iright_142) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc203)
+    %iright_144 = tt.expand_dims %iright_143 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc170)
+    %iright_145 = tt.broadcast %iright_144 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc171)
+    %ileft_146 = tt.reshape %ileft_141 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_147 = tt.reshape %iright_145 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_148 = tt.reshape %new_idxs_133 : tensor<32x16xi32> -> tensor<64x2x4xi32> loc(#loc174)
+    %left_idx_149 = arith.muli %y_idx_148, %ileft_137 : tensor<64x2x4xi32> loc(#loc176)
+    %left_idx_150 = "tt.reduce"(%left_idx_149) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc206)
+    %left_idx_151 = tt.expand_dims %left_idx_150 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc178)
+    %left_idx_152 = tt.broadcast %left_idx_151 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc179)
+    %right_idx_153 = arith.muli %y_idx_148, %flip_59 : tensor<64x2x4xi32> loc(#loc181)
+    %right_idx_154 = "tt.reduce"(%right_idx_153) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc209)
+    %right_idx_155 = tt.expand_dims %right_idx_154 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc183)
+    %right_idx_156 = tt.broadcast %right_idx_155 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc184)
+    %left_idx_157 = tt.reshape %left_idx_152 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_158 = tt.reshape %right_idx_156 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_159 = arith.cmpi slt, %ileft_146, %iright_147 : tensor<32x16xi32> loc(#loc187)
+    %eq_160 = arith.cmpi eq, %ileft_146, %iright_147 : tensor<32x16xi32> loc(#loc188)
+    %cond_161 = arith.cmpi sgt, %left_idx_157, %right_idx_158 : tensor<32x16xi32> loc(#loc189)
+    %cond_162 = arith.andi %eq_160, %cond_161 : tensor<32x16xi1> loc(#loc190)
+    %cond_163 = arith.ori %cond_159, %cond_162 : tensor<32x16xi1> loc(#loc191)
+    %cond_164 = arith.extui %cond_163 : tensor<32x16xi1> to tensor<32x16xi32> loc(#loc192)
+    %cond_165 = arith.xori %cond_164, %flip_135 : tensor<32x16xi32> loc(#loc192)
+    %cond_166 = arith.cmpi ne, %cond_165, %cst_0 : tensor<32x16xi32> loc(#loc193)
+    %ret_167 = arith.xori %ileft_146, %iright_147 : tensor<32x16xi32> loc(#loc194)
+    %ret_168 = arith.select %cond_166, %ret_167, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_169 = arith.xori %ret_130, %ret_168 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_170 = arith.xori %left_idx_157, %right_idx_158 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_171 = arith.select %cond_166, %new_idxs_170, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_172 = arith.xori %new_idxs_133, %new_idxs_171 : tensor<32x16xi32> loc(#loc199)
+    %y_173 = tt.reshape %ret_169 : tensor<32x16xi32> -> tensor<128x2x2xi32> loc(#loc162)
+    %ileft_174 = arith.muli %y_173, %ileft_62 : tensor<128x2x2xi32> loc(#loc164)
+    %ileft_175 = "tt.reduce"(%ileft_174) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc201)
+    %ileft_176 = tt.expand_dims %ileft_175 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc166)
+    %ileft_177 = tt.broadcast %ileft_176 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc167)
+    %iright_178 = arith.muli %y_173, %flip_22 : tensor<128x2x2xi32> loc(#loc168)
+    %iright_179 = "tt.reduce"(%iright_178) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc203)
+    %iright_180 = tt.expand_dims %iright_179 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc170)
+    %iright_181 = tt.broadcast %iright_180 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc171)
+    %ileft_182 = tt.reshape %ileft_177 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_183 = tt.reshape %iright_181 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_184 = tt.reshape %new_idxs_172 : tensor<32x16xi32> -> tensor<128x2x2xi32> loc(#loc174)
+    %left_idx_185 = arith.muli %y_idx_184, %ileft_62 : tensor<128x2x2xi32> loc(#loc176)
+    %left_idx_186 = "tt.reduce"(%left_idx_185) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc206)
+    %left_idx_187 = tt.expand_dims %left_idx_186 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc178)
+    %left_idx_188 = tt.broadcast %left_idx_187 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc179)
+    %right_idx_189 = arith.muli %y_idx_184, %flip_22 : tensor<128x2x2xi32> loc(#loc181)
+    %right_idx_190 = "tt.reduce"(%right_idx_189) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc209)
+    %right_idx_191 = tt.expand_dims %right_idx_190 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc183)
+    %right_idx_192 = tt.broadcast %right_idx_191 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc184)
+    %left_idx_193 = tt.reshape %left_idx_188 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_194 = tt.reshape %right_idx_192 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_195 = arith.cmpi slt, %ileft_182, %iright_183 : tensor<32x16xi32> loc(#loc187)
+    %eq_196 = arith.cmpi eq, %ileft_182, %iright_183 : tensor<32x16xi32> loc(#loc188)
+    %cond_197 = arith.cmpi sgt, %left_idx_193, %right_idx_194 : tensor<32x16xi32> loc(#loc189)
+    %cond_198 = arith.andi %eq_196, %cond_197 : tensor<32x16xi1> loc(#loc190)
+    %cond_199 = arith.ori %cond_195, %cond_198 : tensor<32x16xi1> loc(#loc191)
+    %cond_200 = arith.extui %cond_199 : tensor<32x16xi1> to tensor<32x16xi32> loc(#loc192)
+    %cond_201 = arith.xori %cond_200, %flip_135 : tensor<32x16xi32> loc(#loc192)
+    %cond_202 = arith.cmpi ne, %cond_201, %cst_0 : tensor<32x16xi32> loc(#loc193)
+    %ret_203 = arith.xori %ileft_182, %iright_183 : tensor<32x16xi32> loc(#loc194)
+    %ret_204 = arith.select %cond_202, %ret_203, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_205 = arith.xori %ret_169, %ret_204 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_206 = arith.xori %left_idx_193, %right_idx_194 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_207 = arith.select %cond_202, %new_idxs_206, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_208 = arith.xori %new_idxs_172, %new_idxs_207 : tensor<32x16xi32> loc(#loc199)
+    %y_209 = tt.reshape %ret_205 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc162)
+    %ileft_210 = arith.muli %y_209, %ileft : tensor<256x2x1xi32> loc(#loc164)
+    %ileft_211 = "tt.reduce"(%ileft_210) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc201)
+    %ileft_212 = tt.expand_dims %ileft_211 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc166)
+    %ileft_213 = tt.broadcast %ileft_212 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc167)
+    %iright_214 = arith.muli %y_209, %iright : tensor<256x2x1xi32> loc(#loc168)
+    %iright_215 = "tt.reduce"(%iright_214) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc203)
+    %iright_216 = tt.expand_dims %iright_215 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc170)
+    %iright_217 = tt.broadcast %iright_216 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc171)
+    %ileft_218 = tt.reshape %ileft_213 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_219 = tt.reshape %iright_217 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_220 = tt.reshape %new_idxs_208 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc174)
+    %left_idx_221 = arith.muli %y_idx_220, %ileft : tensor<256x2x1xi32> loc(#loc176)
+    %left_idx_222 = "tt.reduce"(%left_idx_221) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc206)
+    %left_idx_223 = tt.expand_dims %left_idx_222 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc178)
+    %left_idx_224 = tt.broadcast %left_idx_223 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc179)
+    %right_idx_225 = arith.muli %y_idx_220, %iright : tensor<256x2x1xi32> loc(#loc181)
+    %right_idx_226 = "tt.reduce"(%right_idx_225) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc209)
+    %right_idx_227 = tt.expand_dims %right_idx_226 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc183)
+    %right_idx_228 = tt.broadcast %right_idx_227 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc184)
+    %left_idx_229 = tt.reshape %left_idx_224 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_230 = tt.reshape %right_idx_228 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_231 = arith.cmpi slt, %ileft_218, %iright_219 : tensor<32x16xi32> loc(#loc187)
+    %eq_232 = arith.cmpi eq, %ileft_218, %iright_219 : tensor<32x16xi32> loc(#loc188)
+    %cond_233 = arith.cmpi sgt, %left_idx_229, %right_idx_230 : tensor<32x16xi32> loc(#loc189)
+    %cond_234 = arith.andi %eq_232, %cond_233 : tensor<32x16xi1> loc(#loc190)
+    %cond_235 = arith.ori %cond_231, %cond_234 : tensor<32x16xi1> loc(#loc191)
+    %cond_236 = arith.extui %cond_235 : tensor<32x16xi1> to tensor<32x16xi32> loc(#loc192)
+    %cond_237 = arith.xori %cond_236, %flip_135 : tensor<32x16xi32> loc(#loc192)
+    %cond_238 = arith.cmpi ne, %cond_237, %cst_0 : tensor<32x16xi32> loc(#loc193)
+    %ret_239 = arith.xori %ileft_218, %iright_219 : tensor<32x16xi32> loc(#loc194)
+    %ret_240 = arith.select %cond_238, %ret_239, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_241 = arith.xori %ret_205, %ret_240 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_242 = arith.xori %left_idx_229, %right_idx_230 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_243 = arith.select %cond_238, %new_idxs_242, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_244 = arith.xori %new_idxs_208, %new_idxs_243 : tensor<32x16xi32> loc(#loc199)
+    %y_245 = tt.reshape %ret_241 : tensor<32x16xi32> -> tensor<32x2x8xi32> loc(#loc162)
+    %ileft_246 = tt.broadcast %left_mask : tensor<1x2x1xi32> -> tensor<32x2x8xi32> loc(#loc164)
+    %ileft_247 = arith.muli %y_245, %ileft_246 : tensor<32x2x8xi32> loc(#loc164)
+    %ileft_248 = "tt.reduce"(%ileft_247) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<32x2x8xi32>) -> tensor<32x8xi32> loc(#loc201)
+    %ileft_249 = tt.expand_dims %ileft_248 {axis = 1 : i32} : tensor<32x8xi32> -> tensor<32x1x8xi32> loc(#loc166)
+    %ileft_250 = tt.broadcast %ileft_249 : tensor<32x1x8xi32> -> tensor<32x2x8xi32> loc(#loc167)
+    %iright_251 = arith.muli %y_245, %flip_134 : tensor<32x2x8xi32> loc(#loc168)
+    %iright_252 = "tt.reduce"(%iright_251) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<32x2x8xi32>) -> tensor<32x8xi32> loc(#loc203)
+    %iright_253 = tt.expand_dims %iright_252 {axis = 1 : i32} : tensor<32x8xi32> -> tensor<32x1x8xi32> loc(#loc170)
+    %iright_254 = tt.broadcast %iright_253 : tensor<32x1x8xi32> -> tensor<32x2x8xi32> loc(#loc171)
+    %ileft_255 = tt.reshape %ileft_250 : tensor<32x2x8xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_256 = tt.reshape %iright_254 : tensor<32x2x8xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_257 = tt.reshape %new_idxs_244 : tensor<32x16xi32> -> tensor<32x2x8xi32> loc(#loc174)
+    %left_idx_258 = arith.muli %y_idx_257, %ileft_246 : tensor<32x2x8xi32> loc(#loc176)
+    %left_idx_259 = "tt.reduce"(%left_idx_258) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<32x2x8xi32>) -> tensor<32x8xi32> loc(#loc206)
+    %left_idx_260 = tt.expand_dims %left_idx_259 {axis = 1 : i32} : tensor<32x8xi32> -> tensor<32x1x8xi32> loc(#loc178)
+    %left_idx_261 = tt.broadcast %left_idx_260 : tensor<32x1x8xi32> -> tensor<32x2x8xi32> loc(#loc179)
+    %right_idx_262 = arith.muli %y_idx_257, %flip_134 : tensor<32x2x8xi32> loc(#loc181)
+    %right_idx_263 = "tt.reduce"(%right_idx_262) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<32x2x8xi32>) -> tensor<32x8xi32> loc(#loc209)
+    %right_idx_264 = tt.expand_dims %right_idx_263 {axis = 1 : i32} : tensor<32x8xi32> -> tensor<32x1x8xi32> loc(#loc183)
+    %right_idx_265 = tt.broadcast %right_idx_264 : tensor<32x1x8xi32> -> tensor<32x2x8xi32> loc(#loc184)
+    %left_idx_266 = tt.reshape %left_idx_261 : tensor<32x2x8xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_267 = tt.reshape %right_idx_265 : tensor<32x2x8xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_268 = arith.cmpi slt, %ileft_255, %iright_256 : tensor<32x16xi32> loc(#loc187)
+    %eq_269 = arith.cmpi eq, %ileft_255, %iright_256 : tensor<32x16xi32> loc(#loc188)
+    %cond_270 = arith.cmpi sgt, %left_idx_266, %right_idx_267 : tensor<32x16xi32> loc(#loc189)
+    %cond_271 = arith.andi %eq_269, %cond_270 : tensor<32x16xi1> loc(#loc190)
+    %cond_272 = arith.ori %cond_268, %cond_271 : tensor<32x16xi1> loc(#loc191)
+    %ret_273 = arith.xori %ileft_255, %iright_256 : tensor<32x16xi32> loc(#loc194)
+    %ret_274 = arith.select %cond_272, %ret_273, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_275 = arith.xori %ret_241, %ret_274 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_276 = arith.xori %left_idx_266, %right_idx_267 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_277 = arith.select %cond_272, %new_idxs_276, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_278 = arith.xori %new_idxs_244, %new_idxs_277 : tensor<32x16xi32> loc(#loc199)
+    %y_279 = tt.reshape %ret_275 : tensor<32x16xi32> -> tensor<64x2x4xi32> loc(#loc162)
+    %ileft_280 = arith.muli %y_279, %ileft_137 : tensor<64x2x4xi32> loc(#loc164)
+    %ileft_281 = "tt.reduce"(%ileft_280) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc201)
+    %ileft_282 = tt.expand_dims %ileft_281 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc166)
+    %ileft_283 = tt.broadcast %ileft_282 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc167)
+    %iright_284 = arith.muli %y_279, %flip_59 : tensor<64x2x4xi32> loc(#loc168)
+    %iright_285 = "tt.reduce"(%iright_284) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc203)
+    %iright_286 = tt.expand_dims %iright_285 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc170)
+    %iright_287 = tt.broadcast %iright_286 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc171)
+    %ileft_288 = tt.reshape %ileft_283 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_289 = tt.reshape %iright_287 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_290 = tt.reshape %new_idxs_278 : tensor<32x16xi32> -> tensor<64x2x4xi32> loc(#loc174)
+    %left_idx_291 = arith.muli %y_idx_290, %ileft_137 : tensor<64x2x4xi32> loc(#loc176)
+    %left_idx_292 = "tt.reduce"(%left_idx_291) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc206)
+    %left_idx_293 = tt.expand_dims %left_idx_292 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc178)
+    %left_idx_294 = tt.broadcast %left_idx_293 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc179)
+    %right_idx_295 = arith.muli %y_idx_290, %flip_59 : tensor<64x2x4xi32> loc(#loc181)
+    %right_idx_296 = "tt.reduce"(%right_idx_295) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<64x2x4xi32>) -> tensor<64x4xi32> loc(#loc209)
+    %right_idx_297 = tt.expand_dims %right_idx_296 {axis = 1 : i32} : tensor<64x4xi32> -> tensor<64x1x4xi32> loc(#loc183)
+    %right_idx_298 = tt.broadcast %right_idx_297 : tensor<64x1x4xi32> -> tensor<64x2x4xi32> loc(#loc184)
+    %left_idx_299 = tt.reshape %left_idx_294 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_300 = tt.reshape %right_idx_298 : tensor<64x2x4xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_301 = arith.cmpi slt, %ileft_288, %iright_289 : tensor<32x16xi32> loc(#loc187)
+    %eq_302 = arith.cmpi eq, %ileft_288, %iright_289 : tensor<32x16xi32> loc(#loc188)
+    %cond_303 = arith.cmpi sgt, %left_idx_299, %right_idx_300 : tensor<32x16xi32> loc(#loc189)
+    %cond_304 = arith.andi %eq_302, %cond_303 : tensor<32x16xi1> loc(#loc190)
+    %cond_305 = arith.ori %cond_301, %cond_304 : tensor<32x16xi1> loc(#loc191)
+    %ret_306 = arith.xori %ileft_288, %iright_289 : tensor<32x16xi32> loc(#loc194)
+    %ret_307 = arith.select %cond_305, %ret_306, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_308 = arith.xori %ret_275, %ret_307 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_309 = arith.xori %left_idx_299, %right_idx_300 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_310 = arith.select %cond_305, %new_idxs_309, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_311 = arith.xori %new_idxs_278, %new_idxs_310 : tensor<32x16xi32> loc(#loc199)
+    %y_312 = tt.reshape %ret_308 : tensor<32x16xi32> -> tensor<128x2x2xi32> loc(#loc162)
+    %ileft_313 = arith.muli %y_312, %ileft_62 : tensor<128x2x2xi32> loc(#loc164)
+    %ileft_314 = "tt.reduce"(%ileft_313) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc201)
+    %ileft_315 = tt.expand_dims %ileft_314 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc166)
+    %ileft_316 = tt.broadcast %ileft_315 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc167)
+    %iright_317 = arith.muli %y_312, %flip_22 : tensor<128x2x2xi32> loc(#loc168)
+    %iright_318 = "tt.reduce"(%iright_317) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc203)
+    %iright_319 = tt.expand_dims %iright_318 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc170)
+    %iright_320 = tt.broadcast %iright_319 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc171)
+    %ileft_321 = tt.reshape %ileft_316 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_322 = tt.reshape %iright_320 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_323 = tt.reshape %new_idxs_311 : tensor<32x16xi32> -> tensor<128x2x2xi32> loc(#loc174)
+    %left_idx_324 = arith.muli %y_idx_323, %ileft_62 : tensor<128x2x2xi32> loc(#loc176)
+    %left_idx_325 = "tt.reduce"(%left_idx_324) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc206)
+    %left_idx_326 = tt.expand_dims %left_idx_325 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc178)
+    %left_idx_327 = tt.broadcast %left_idx_326 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc179)
+    %right_idx_328 = arith.muli %y_idx_323, %flip_22 : tensor<128x2x2xi32> loc(#loc181)
+    %right_idx_329 = "tt.reduce"(%right_idx_328) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<128x2x2xi32>) -> tensor<128x2xi32> loc(#loc209)
+    %right_idx_330 = tt.expand_dims %right_idx_329 {axis = 1 : i32} : tensor<128x2xi32> -> tensor<128x1x2xi32> loc(#loc183)
+    %right_idx_331 = tt.broadcast %right_idx_330 : tensor<128x1x2xi32> -> tensor<128x2x2xi32> loc(#loc184)
+    %left_idx_332 = tt.reshape %left_idx_327 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_333 = tt.reshape %right_idx_331 : tensor<128x2x2xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_334 = arith.cmpi slt, %ileft_321, %iright_322 : tensor<32x16xi32> loc(#loc187)
+    %eq_335 = arith.cmpi eq, %ileft_321, %iright_322 : tensor<32x16xi32> loc(#loc188)
+    %cond_336 = arith.cmpi sgt, %left_idx_332, %right_idx_333 : tensor<32x16xi32> loc(#loc189)
+    %cond_337 = arith.andi %eq_335, %cond_336 : tensor<32x16xi1> loc(#loc190)
+    %cond_338 = arith.ori %cond_334, %cond_337 : tensor<32x16xi1> loc(#loc191)
+    %ret_339 = arith.xori %ileft_321, %iright_322 : tensor<32x16xi32> loc(#loc194)
+    %ret_340 = arith.select %cond_338, %ret_339, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc195)
+    %ret_341 = arith.xori %ret_308, %ret_340 : tensor<32x16xi32> loc(#loc196)
+    %new_idxs_342 = arith.xori %left_idx_332, %right_idx_333 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_343 = arith.select %cond_338, %new_idxs_342, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_344 = arith.xori %new_idxs_311, %new_idxs_343 : tensor<32x16xi32> loc(#loc199)
+    %y_345 = tt.reshape %ret_341 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc162)
+    %ileft_346 = arith.muli %y_345, %ileft : tensor<256x2x1xi32> loc(#loc164)
+    %ileft_347 = "tt.reduce"(%ileft_346) <{axis = 1 : i32}> ({
+    ^bb0(%ileft_377: i32 loc(callsite(#loc1 at #loc165)), %ileft_378: i32 loc(callsite(#loc1 at #loc165))):
+      %ileft_379 = arith.addi %ileft_377, %ileft_378 : i32 loc(#loc211)
+      tt.reduce.return %ileft_379 : i32 loc(#loc201)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc201)
+    %ileft_348 = tt.expand_dims %ileft_347 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc166)
+    %ileft_349 = tt.broadcast %ileft_348 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc167)
+    %iright_350 = arith.muli %y_345, %iright : tensor<256x2x1xi32> loc(#loc168)
+    %iright_351 = "tt.reduce"(%iright_350) <{axis = 1 : i32}> ({
+    ^bb0(%iright_377: i32 loc(callsite(#loc1 at #loc169)), %iright_378: i32 loc(callsite(#loc1 at #loc169))):
+      %iright_379 = arith.addi %iright_377, %iright_378 : i32 loc(#loc212)
+      tt.reduce.return %iright_379 : i32 loc(#loc203)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc203)
+    %iright_352 = tt.expand_dims %iright_351 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc170)
+    %iright_353 = tt.broadcast %iright_352 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc171)
+    %ileft_354 = tt.reshape %ileft_349 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc172)
+    %iright_355 = tt.reshape %iright_353 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc173)
+    %y_idx_356 = tt.reshape %new_idxs_344 : tensor<32x16xi32> -> tensor<256x2x1xi32> loc(#loc174)
+    %left_idx_357 = arith.muli %y_idx_356, %ileft : tensor<256x2x1xi32> loc(#loc176)
+    %left_idx_358 = "tt.reduce"(%left_idx_357) <{axis = 1 : i32}> ({
+    ^bb0(%left_idx_377: i32 loc(callsite(#loc1 at #loc177)), %left_idx_378: i32 loc(callsite(#loc1 at #loc177))):
+      %left_idx_379 = arith.addi %left_idx_377, %left_idx_378 : i32 loc(#loc213)
+      tt.reduce.return %left_idx_379 : i32 loc(#loc206)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc206)
+    %left_idx_359 = tt.expand_dims %left_idx_358 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc178)
+    %left_idx_360 = tt.broadcast %left_idx_359 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc179)
+    %right_idx_361 = arith.muli %y_idx_356, %iright : tensor<256x2x1xi32> loc(#loc181)
+    %right_idx_362 = "tt.reduce"(%right_idx_361) <{axis = 1 : i32}> ({
+    ^bb0(%right_idx_377: i32 loc(callsite(#loc1 at #loc182)), %right_idx_378: i32 loc(callsite(#loc1 at #loc182))):
+      %right_idx_379 = arith.addi %right_idx_377, %right_idx_378 : i32 loc(#loc214)
+      tt.reduce.return %right_idx_379 : i32 loc(#loc209)
+    }) : (tensor<256x2x1xi32>) -> tensor<256x1xi32> loc(#loc209)
+    %right_idx_363 = tt.expand_dims %right_idx_362 {axis = 1 : i32} : tensor<256x1xi32> -> tensor<256x1x1xi32> loc(#loc183)
+    %right_idx_364 = tt.broadcast %right_idx_363 : tensor<256x1x1xi32> -> tensor<256x2x1xi32> loc(#loc184)
+    %left_idx_365 = tt.reshape %left_idx_360 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc185)
+    %right_idx_366 = tt.reshape %right_idx_364 : tensor<256x2x1xi32> -> tensor<32x16xi32> loc(#loc186)
+    %cond_367 = arith.cmpi slt, %ileft_354, %iright_355 : tensor<32x16xi32> loc(#loc187)
+    %eq_368 = arith.cmpi eq, %ileft_354, %iright_355 : tensor<32x16xi32> loc(#loc188)
+    %cond_369 = arith.cmpi sgt, %left_idx_365, %right_idx_366 : tensor<32x16xi32> loc(#loc189)
+    %cond_370 = arith.andi %eq_368, %cond_369 : tensor<32x16xi1> loc(#loc190)
+    %cond_371 = arith.ori %cond_367, %cond_370 : tensor<32x16xi1> loc(#loc191)
+    %new_idxs_372 = arith.xori %left_idx_365, %right_idx_366 : tensor<32x16xi32> loc(#loc197)
+    %new_idxs_373 = arith.select %cond_371, %new_idxs_372, %cst_0 : tensor<32x16xi1>, tensor<32x16xi32> loc(#loc198)
+    %new_idxs_374 = arith.xori %new_idxs_344, %new_idxs_373 : tensor<32x16xi32> loc(#loc199)
+    %tmp7 = arith.extsi %tmp0_19 : tensor<32x16xi32> to tensor<32x16xi64> loc(#loc149)
+    %tmp10_375 = arith.select %tmp0_18, %tmp7, %tmp10 : tensor<32x16xi1>, tensor<32x16xi64> loc(#loc86)
+    %tmp11 = "tt.reduce"(%tmp10_375) <{axis = 1 : i32}> ({
+    ^bb0(%tmp11_377: i64 loc(callsite(#loc1 at #loc150)), %tmp11_378: i64 loc(callsite(#loc1 at #loc150))):
+      %tmp11_379 = arith.addi %tmp11_377, %tmp11_378 : i64 loc(#loc200)
+      tt.reduce.return %tmp11_379 : i64 loc(#loc160)
+    }) : (tensor<32x16xi64>) -> tensor<32xi64> loc(#loc160)
+    %tmp11_376 = tt.expand_dims %tmp11 {axis = 1 : i32} : tensor<32xi64> -> tensor<32x1xi64> loc(#loc151)
+    %tmp14 = arith.trunci %tmp11_376 : tensor<32x1xi64> to tensor<32x1xi32> loc(#loc152)
+    %0 = arith.muli %xindex_6, %cst_2 : tensor<32x1xi32> loc(#loc73)
+    %1 = tt.broadcast %r0_index_8 : tensor<1x16xi32> -> tensor<32x16xi32> loc(#loc74)
+    %2 = tt.broadcast %0 : tensor<32x1xi32> -> tensor<32x16xi32> loc(#loc74)
+    %3 = arith.addi %1, %2 : tensor<32x16xi32> loc(#loc74)
+    %4 = tt.splat %out_ptr2 : !tt.ptr<i32> -> tensor<32x16x!tt.ptr<i32>> loc(#loc75)
+    %5 = tt.addptr %4, %3 : tensor<32x16x!tt.ptr<i32>>, tensor<32x16xi32> loc(#loc75)
+    tt.store %5, %new_idxs_374, %tmp0_18 : tensor<32x16x!tt.ptr<i32>> loc(#loc76)
+    %6 = tt.splat %out_ptr3 : !tt.ptr<i32> -> tensor<32x1x!tt.ptr<i32>> loc(#loc77)
+    %7 = tt.addptr %6, %xindex_6 : tensor<32x1x!tt.ptr<i32>>, tensor<32x1xi32> loc(#loc77)
+    tt.store %7, %tmp14, %xmask_7 : tensor<32x1x!tt.ptr<i32>> loc(#loc78)
+    tt.return loc(#loc79)
+  } loc(#loc)
+} loc(#loc)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":44:34)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:49)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:38)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":26:21)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":24:28)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":24:33)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":25:36)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":25:44)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":25:23)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":27:28)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":27:38)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":33:19)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":34:19)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:35)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:45)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:30)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":36:54)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":38:19)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":40:33)
+#loc22 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:41)
+#loc24 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:44)
+#loc25 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:60)
+#loc26 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":627:68)
+#loc27 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":533:22)
+#loc29 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":537:21)
+#loc30 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:40)
+#loc31 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc33 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc34 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:65)
+#loc35 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":538:78)
+#loc36 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:41)
+#loc38 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:67)
+#loc39 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":539:80)
+#loc40 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":540:30)
+#loc41 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":541:32)
+#loc42 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":546:29)
+#loc43 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:36)
+#loc44 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:23)
+#loc45 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":290:25)
+#loc47 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:53)
+#loc48 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":548:66)
+#loc49 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:37)
+#loc50 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:23)
+#loc52 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:54)
+#loc53 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":551:67)
+#loc54 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":553:36)
+#loc55 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":554:38)
+#loc56 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":574:22)
+#loc57 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":591:21)
+#loc58 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":594:40)
+#loc59 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":594:29)
+#loc60 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":594:23)
+#loc61 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":599:19)
+#loc62 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":599:28)
+#loc63 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":600:38)
+#loc64 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":600:46)
+#loc65 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":600:15)
+#loc66 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":601:48)
+#loc67 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":601:59)
+#loc68 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":601:22)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":42:19)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":45:29)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":48:21)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:35)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:32)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:25)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":49:47)
+#loc77 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":50:25)
+#loc78 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":50:37)
+#loc79 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/hx/chxnygpvpmvr2mx2e6mwgdeojthrirnog7nmq6mcsi3wvegvi2so.py":50:4)
+#loc85 = loc(callsite(#loc1 at #loc2))
+#loc86 = loc("tmp10"(#loc3))
+#loc87 = loc("tmp0"(#loc4))
+#loc88 = loc("tmp0"(#loc5))
+#loc89 = loc("xmask"(#loc6))
+#loc90 = loc("xoffset"(#loc7))
+#loc91 = loc("xoffset"(#loc8))
+#loc92 = loc("xindex"(#loc9))
+#loc93 = loc("xindex"(#loc10))
+#loc94 = loc("xindex"(#loc11))
+#loc95 = loc("r0_index"(#loc12))
+#loc96 = loc("r0_index"(#loc13))
+#loc97 = loc("x0"(#loc14))
+#loc98 = loc("x1"(#loc15))
+#loc99 = loc("tmp0"(#loc16))
+#loc100 = loc("tmp0"(#loc17))
+#loc101 = loc("tmp0"(#loc18))
+#loc102 = loc("tmp0"(#loc19))
+#loc103 = loc("tmp2"(#loc20))
+#loc104 = loc("tmp4"(#loc21))
+#loc105 = loc("flip"(#loc22))
+#loc107 = loc("flip"(#loc24))
+#loc108 = loc("flip"(#loc25))
+#loc109 = loc("flip"(#loc26))
+#loc110 = loc("y"(#loc27))
+#loc111 = loc("left_mask"(#loc29))
+#loc112 = loc("ileft"(#loc30))
+#loc114 = loc("ileft"(#loc34))
+#loc115 = loc("ileft"(#loc35))
+#loc116 = loc("iright"(#loc36))
+#loc118 = loc("iright"(#loc38))
+#loc119 = loc("iright"(#loc39))
+#loc120 = loc("ileft"(#loc40))
+#loc121 = loc("iright"(#loc41))
+#loc122 = loc("y_idx"(#loc42))
+#loc123 = loc("left_idx"(#loc43))
+#loc124 = loc("left_idx"(#loc44))
+#loc125 = loc("input"(#loc45))
+#loc127 = loc("left_idx"(#loc47))
+#loc128 = loc("left_idx"(#loc48))
+#loc129 = loc("right_idx"(#loc49))
+#loc130 = loc("right_idx"(#loc50))
+#loc132 = loc("right_idx"(#loc52))
+#loc133 = loc("right_idx"(#loc53))
+#loc134 = loc("left_idx"(#loc54))
+#loc135 = loc("right_idx"(#loc55))
+#loc136 = loc("cond"(#loc56))
+#loc137 = loc("eq"(#loc57))
+#loc138 = loc("cond"(#loc58))
+#loc139 = loc("cond"(#loc59))
+#loc140 = loc("cond"(#loc60))
+#loc141 = loc("cond"(#loc61))
+#loc142 = loc("cond"(#loc62))
+#loc143 = loc("ret"(#loc63))
+#loc144 = loc("ret"(#loc64))
+#loc145 = loc("ret"(#loc65))
+#loc146 = loc("new_idxs"(#loc66))
+#loc147 = loc("new_idxs"(#loc67))
+#loc148 = loc("new_idxs"(#loc68))
+#loc149 = loc("tmp7"(#loc69))
+#loc151 = loc("tmp11"(#loc71))
+#loc152 = loc("tmp14"(#loc72))
+#loc153 = loc(callsite(#loc105 at #loc106))
+#loc154 = loc(callsite(#loc107 at #loc106))
+#loc155 = loc(callsite(#loc108 at #loc106))
+#loc156 = loc(callsite(#loc109 at #loc106))
+#loc158 = loc("cond"(#loc136))
+#loc159 = loc("eq"(#loc137))
+#loc160 = loc(callsite(#loc31 at #loc150))
+#loc162 = loc(callsite(#loc110 at #loc157))
+#loc163 = loc(callsite(#loc111 at #loc157))
+#loc164 = loc(callsite(#loc112 at #loc157))
+#loc166 = loc(callsite(#loc114 at #loc157))
+#loc167 = loc(callsite(#loc115 at #loc157))
+#loc168 = loc(callsite(#loc116 at #loc157))
+#loc170 = loc(callsite(#loc118 at #loc157))
+#loc171 = loc(callsite(#loc119 at #loc157))
+#loc172 = loc(callsite(#loc120 at #loc157))
+#loc173 = loc(callsite(#loc121 at #loc157))
+#loc174 = loc(callsite(#loc122 at #loc157))
+#loc175 = loc(callsite(#loc123 at #loc157))
+#loc176 = loc(callsite(#loc124 at #loc157))
+#loc178 = loc(callsite(#loc127 at #loc157))
+#loc179 = loc(callsite(#loc128 at #loc157))
+#loc180 = loc(callsite(#loc129 at #loc157))
+#loc181 = loc(callsite(#loc130 at #loc157))
+#loc183 = loc(callsite(#loc132 at #loc157))
+#loc184 = loc(callsite(#loc133 at #loc157))
+#loc185 = loc(callsite(#loc134 at #loc157))
+#loc186 = loc(callsite(#loc135 at #loc157))
+#loc187 = loc(callsite(#loc158 at #loc157))
+#loc188 = loc(callsite(#loc159 at #loc157))
+#loc189 = loc(callsite(#loc138 at #loc157))
+#loc190 = loc(callsite(#loc139 at #loc157))
+#loc191 = loc(callsite(#loc140 at #loc157))
+#loc192 = loc(callsite(#loc141 at #loc157))
+#loc193 = loc(callsite(#loc142 at #loc157))
+#loc194 = loc(callsite(#loc143 at #loc157))
+#loc195 = loc(callsite(#loc144 at #loc157))
+#loc196 = loc(callsite(#loc145 at #loc157))
+#loc197 = loc(callsite(#loc146 at #loc157))
+#loc198 = loc(callsite(#loc147 at #loc157))
+#loc199 = loc(callsite(#loc148 at #loc157))
+#loc200 = loc(callsite(#loc33 at #loc160))
+#loc201 = loc(callsite(#loc31 at #loc165))
+#loc203 = loc(callsite(#loc31 at #loc169))
+#loc205 = loc(callsite(#loc125 at #loc177))
+#loc206 = loc(callsite(#loc31 at #loc177))
+#loc208 = loc(callsite(#loc125 at #loc182))
+#loc209 = loc(callsite(#loc31 at #loc182))
+#loc211 = loc(callsite(#loc33 at #loc201))
+#loc212 = loc(callsite(#loc33 at #loc203))
+#loc213 = loc(callsite(#loc33 at #loc206))
+#loc214 = loc(callsite(#loc33 at #loc209))

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/__grp__triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.source", "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttir", "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttgir", "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.llir", "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ptx", "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.cubin", "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.json"}}

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.cubin ADDED

Binary file (31.5 kB).

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.json ADDED

	@@ -0,0 +1 @@

+ {"hash": "16ca2c8864609722b876bac87eb220846cb74e2fe4810a46a027c2f3e653aead", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 16, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 64, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0"}

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.llir ADDED

	@@ -0,0 +1,934 @@

+; ModuleID = 'LLVMDialectModule'
+source_filename = "LLVMDialectModule"
+target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
+@global_smem = external addrspace(3) global [0 x i8], align 16
+@.str = private unnamed_addr constant [11 x i8] c"__CUDA_FTZ\00", align 1
+; Function Attrs: nounwind
+define ptx_kernel void @triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0(ptr addrspace(1) %0, ptr addrspace(1) %1, i32 %2, i32 %3, ptr addrspace(1) readnone captures(none) %4, ptr addrspace(1) readnone captures(none) %5) local_unnamed_addr #0 !dbg !5 {
+  %7 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !8
+  %8 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !9
+  %9 = shl nuw nsw i32 %8, 2, !dbg !9
+  %10 = and i32 %9, 2044, !dbg !9
+  %11 = mul i32 %7, 32000, !dbg !10
+  %12 = zext nneg i32 %10 to i64, !dbg !11
+  br label %13, !dbg !11
+13:                                               ; preds = %6, %13
+  %indvars.iv = phi i64 [ 0, %6 ], [ %indvars.iv.next, %13 ]
+  %14 = phi <2 x float> [ zeroinitializer, %6 ], [ %271, %13 ]
+  %15 = phi <2 x float> [ splat (float 0xFFF0000000000000), %6 ], [ %269, %13 ]
+  %16 = phi <2 x float> [ zeroinitializer, %6 ], [ %270, %13 ]
+  %17 = phi <2 x float> [ splat (float 0xFFF0000000000000), %6 ], [ %268, %13 ]
+  %18 = or disjoint i64 %indvars.iv, %12, !dbg !12
+  %19 = icmp samesign ult i64 %18, 32000, !dbg !13
+  %20 = trunc nuw nsw i64 %18 to i32, !dbg !14
+  %21 = add i32 %11, %20, !dbg !14
+  %22 = sext i32 %21 to i64, !dbg !15
+  %23 = getelementptr bfloat, ptr addrspace(1) %0, i64 %22, !dbg !15
+  %24 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #6, !dbg !16
+  %25 = tail call { i32, i32 } asm sideeffect "mov.u32 $0, $2;\0A\09mov.u32 $1, $3;\0A\09@$6 ld.global.L1::evict_last.L2::cache_hint.v2.b32 { $0, $1 }, [ $4 + 0 ], $5;", "=r,=r,r,r,l,l,b"(i32 0, i32 0, ptr addrspace(1) %23, i64 %24, i1 %19) #6, !dbg !16
+  %26 = extractvalue { i32, i32 } %25, 0, !dbg !16
+  %27 = bitcast i32 %26 to <2 x bfloat>, !dbg !16
+  %28 = extractvalue { i32, i32 } %25, 1, !dbg !16
+  %29 = bitcast i32 %28 to <2 x bfloat>, !dbg !16
+  %30 = fcmp uno <2 x float> %17, zeroinitializer, !dbg !17
+  %31 = fcmp uno <2 x float> %15, zeroinitializer, !dbg !17
+  %32 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i72 = icmp eq i32 %32, 0, !dbg !21
+  %33 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i74 = icmp eq i32 %33, 0, !dbg !21
+  %34 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %35 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i78 = icmp eq i32 %35, 0, !dbg !21
+  %36 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i80 = icmp eq i32 %36, 0, !dbg !21
+  %37 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i82 = icmp eq i32 %37, 0, !dbg !21
+  %38 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i84 = icmp eq i32 %38, 0, !dbg !21
+  %39 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %40 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i88 = icmp eq i32 %40, 0, !dbg !21
+  %41 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i90 = icmp eq i32 %41, 0, !dbg !21
+  %42 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i92 = icmp eq i32 %42, 0, !dbg !21
+  %43 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i94 = icmp eq i32 %43, 0, !dbg !21
+  %44 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %45 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i98 = icmp eq i32 %45, 0, !dbg !21
+  %46 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i100 = icmp eq i32 %46, 0, !dbg !21
+  %47 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i102 = icmp eq i32 %47, 0, !dbg !21
+  %48 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i104 = icmp eq i32 %48, 0, !dbg !21
+  %49 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %50 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i108 = icmp eq i32 %50, 0, !dbg !21
+  %51 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i110 = icmp eq i32 %51, 0, !dbg !21
+  %52 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i112 = icmp eq i32 %52, 0, !dbg !21
+  %53 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i114 = icmp eq i32 %53, 0, !dbg !21
+  %54 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %55 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i118 = icmp eq i32 %55, 0, !dbg !21
+  %56 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i120 = icmp eq i32 %56, 0, !dbg !21
+  %57 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i122 = icmp eq i32 %57, 0, !dbg !21
+  %58 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i124 = icmp eq i32 %58, 0, !dbg !21
+  %59 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %60 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i128 = icmp eq i32 %60, 0, !dbg !21
+  %61 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i130 = icmp eq i32 %61, 0, !dbg !21
+  %62 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i132 = icmp eq i32 %62, 0, !dbg !21
+  %63 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i134 = icmp eq i32 %63, 0, !dbg !21
+  %64 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %65 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i138 = icmp eq i32 %65, 0, !dbg !21
+  %66 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i140 = icmp eq i32 %66, 0, !dbg !21
+  %67 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not.i142 = icmp eq i32 %67, 0, !dbg !21
+  %68 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not1.i144 = icmp eq i32 %68, 0, !dbg !21
+  %69 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %70 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not3.i148 = icmp eq i32 %70, 0, !dbg !21
+  %71 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !21
+  %.not4.i150 = icmp eq i32 %71, 0, !dbg !21
+  %72 = fpext <2 x bfloat> %27 to <2 x float>, !dbg !22
+  %73 = fcmp ogt <2 x float> %17, %72, !dbg !23
+  %74 = or <2 x i1> %30, %73, !dbg !24
+  %75 = select <2 x i1> %74, <2 x float> %17, <2 x float> %72, !dbg !25
+  %76 = fcmp oeq <2 x float> %75, splat (float 0xFFF0000000000000), !dbg !26
+  %foldExtExtBinop = fsub <2 x float> %17, %75, !dbg !27
+  %77 = extractelement <2 x float> %foldExtExtBinop, i64 0, !dbg !27
+  %foldExtExtBinop177 = fsub <2 x float> %17, %75, !dbg !27
+  %78 = extractelement <2 x float> %foldExtExtBinop177, i64 1, !dbg !27
+  %79 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %77, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %80 = tail call float @llvm.nvvm.fma.rn.f(float %77, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i73 = select i1 %.not.i72, float %80, float %79, !dbg !21
+  %81 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i73) #6, !dbg !21
+  %82 = tail call float @llvm.nvvm.saturate.f(float %.02.i73) #6, !dbg !21
+  %.03.i75 = select i1 %.not1.i74, float %82, float %81, !dbg !21
+  %83 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i75, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %84 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i75, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %85 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %78, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %86 = tail call float @llvm.nvvm.fma.rn.f(float %78, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i83 = select i1 %.not.i82, float %86, float %85, !dbg !21
+  %87 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i83) #6, !dbg !21
+  %88 = tail call float @llvm.nvvm.saturate.f(float %.02.i83) #6, !dbg !21
+  %.03.i85 = select i1 %.not1.i84, float %88, float %87, !dbg !21
+  %89 = insertelement <2 x i32> poison, i32 %34, i64 0, !dbg !21
+  %90 = insertelement <2 x i32> %89, i32 %39, i64 1, !dbg !21
+  %91 = icmp eq <2 x i32> %90, zeroinitializer, !dbg !21
+  %92 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i85, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %93 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i85, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %94 = insertelement <2 x float> poison, float %84, i64 0, !dbg !21
+  %95 = insertelement <2 x float> %94, float %93, i64 1, !dbg !21
+  %96 = insertelement <2 x float> poison, float %83, i64 0, !dbg !21
+  %97 = insertelement <2 x float> %96, float %92, i64 1, !dbg !21
+  %98 = select <2 x i1> %91, <2 x float> %95, <2 x float> %97, !dbg !21
+  %99 = extractelement <2 x float> %98, i64 0, !dbg !21
+  %100 = fadd float %99, 0xC168000FE0000000, !dbg !21
+  %101 = fneg float %100, !dbg !21
+  %102 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %77, float 0x3FF7154760000000, float %101) #6, !dbg !21
+  %103 = tail call float @llvm.nvvm.fma.rn.f(float %77, float 0x3FF7154760000000, float %101) #6, !dbg !21
+  %.0.i79 = select i1 %.not3.i78, float %103, float %102, !dbg !21
+  %104 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %77, float 0x3E54AE0C00000000, float %.0.i79) #6, !dbg !21
+  %105 = tail call float @llvm.nvvm.fma.rn.f(float %77, float 0x3E54AE0C00000000, float %.0.i79) #6, !dbg !21
+  %.01.i81 = select i1 %.not4.i80, float %105, float %104, !dbg !21
+  %106 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i81) #6, !dbg !21
+  %107 = extractelement <2 x float> %98, i64 1, !dbg !21
+  %108 = fadd float %107, 0xC168000FE0000000, !dbg !21
+  %109 = fneg float %108, !dbg !21
+  %110 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %78, float 0x3FF7154760000000, float %109) #6, !dbg !21
+  %111 = tail call float @llvm.nvvm.fma.rn.f(float %78, float 0x3FF7154760000000, float %109) #6, !dbg !21
+  %.0.i89 = select i1 %.not3.i88, float %111, float %110, !dbg !21
+  %112 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %78, float 0x3E54AE0C00000000, float %.0.i89) #6, !dbg !21
+  %113 = tail call float @llvm.nvvm.fma.rn.f(float %78, float 0x3E54AE0C00000000, float %.0.i89) #6, !dbg !21
+  %.01.i91 = select i1 %.not4.i90, float %113, float %112, !dbg !21
+  %114 = bitcast <2 x float> %98 to <2 x i32>, !dbg !21
+  %115 = shl <2 x i32> %114, splat (i32 23), !dbg !21
+  %116 = bitcast <2 x i32> %115 to <2 x float>, !dbg !21
+  %117 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i91) #6, !dbg !21
+  %118 = insertelement <2 x float> poison, float %106, i64 0, !dbg !21
+  %119 = insertelement <2 x float> %118, float %117, i64 1, !dbg !21
+  %120 = fmul <2 x float> %119, %116, !dbg !21
+  %121 = select <2 x i1> %76, <2 x float> splat (float 1.000000e+00), <2 x float> %120, !dbg !28
+  %foldExtExtBinop179 = fsub <2 x float> %72, %75, !dbg !29
+  %122 = extractelement <2 x float> %foldExtExtBinop179, i64 0, !dbg !29
+  %foldExtExtBinop181 = fsub <2 x float> %72, %75, !dbg !29
+  %123 = extractelement <2 x float> %foldExtExtBinop181, i64 1, !dbg !29
+  %124 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %122, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %125 = tail call float @llvm.nvvm.fma.rn.f(float %122, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i113 = select i1 %.not.i112, float %125, float %124, !dbg !21
+  %126 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i113) #6, !dbg !21
+  %127 = tail call float @llvm.nvvm.saturate.f(float %.02.i113) #6, !dbg !21
+  %.03.i115 = select i1 %.not1.i114, float %127, float %126, !dbg !21
+  %128 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i115, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %129 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i115, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %130 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %123, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %131 = tail call float @llvm.nvvm.fma.rn.f(float %123, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i123 = select i1 %.not.i122, float %131, float %130, !dbg !21
+  %132 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i123) #6, !dbg !21
+  %133 = tail call float @llvm.nvvm.saturate.f(float %.02.i123) #6, !dbg !21
+  %.03.i125 = select i1 %.not1.i124, float %133, float %132, !dbg !21
+  %134 = insertelement <2 x i32> poison, i32 %54, i64 0, !dbg !21
+  %135 = insertelement <2 x i32> %134, i32 %59, i64 1, !dbg !21
+  %136 = icmp eq <2 x i32> %135, zeroinitializer, !dbg !21
+  %137 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i125, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %138 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i125, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %139 = insertelement <2 x float> poison, float %129, i64 0, !dbg !21
+  %140 = insertelement <2 x float> %139, float %138, i64 1, !dbg !21
+  %141 = insertelement <2 x float> poison, float %128, i64 0, !dbg !21
+  %142 = insertelement <2 x float> %141, float %137, i64 1, !dbg !21
+  %143 = select <2 x i1> %136, <2 x float> %140, <2 x float> %142, !dbg !21
+  %144 = extractelement <2 x float> %143, i64 0, !dbg !21
+  %145 = fadd float %144, 0xC168000FE0000000, !dbg !21
+  %146 = fneg float %145, !dbg !21
+  %147 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %122, float 0x3FF7154760000000, float %146) #6, !dbg !21
+  %148 = tail call float @llvm.nvvm.fma.rn.f(float %122, float 0x3FF7154760000000, float %146) #6, !dbg !21
+  %.0.i119 = select i1 %.not3.i118, float %148, float %147, !dbg !21
+  %149 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %122, float 0x3E54AE0C00000000, float %.0.i119) #6, !dbg !21
+  %150 = tail call float @llvm.nvvm.fma.rn.f(float %122, float 0x3E54AE0C00000000, float %.0.i119) #6, !dbg !21
+  %.01.i121 = select i1 %.not4.i120, float %150, float %149, !dbg !21
+  %151 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i121) #6, !dbg !21
+  %152 = extractelement <2 x float> %143, i64 1, !dbg !21
+  %153 = fadd float %152, 0xC168000FE0000000, !dbg !21
+  %154 = fneg float %153, !dbg !21
+  %155 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %123, float 0x3FF7154760000000, float %154) #6, !dbg !21
+  %156 = tail call float @llvm.nvvm.fma.rn.f(float %123, float 0x3FF7154760000000, float %154) #6, !dbg !21
+  %.0.i129 = select i1 %.not3.i128, float %156, float %155, !dbg !21
+  %157 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %123, float 0x3E54AE0C00000000, float %.0.i129) #6, !dbg !21
+  %158 = tail call float @llvm.nvvm.fma.rn.f(float %123, float 0x3E54AE0C00000000, float %.0.i129) #6, !dbg !21
+  %.01.i131 = select i1 %.not4.i130, float %158, float %157, !dbg !21
+  %159 = bitcast <2 x float> %143 to <2 x i32>, !dbg !21
+  %160 = shl <2 x i32> %159, splat (i32 23), !dbg !21
+  %161 = bitcast <2 x i32> %160 to <2 x float>, !dbg !21
+  %162 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i131) #6, !dbg !21
+  %163 = insertelement <2 x float> poison, float %151, i64 0, !dbg !21
+  %164 = insertelement <2 x float> %163, float %162, i64 1, !dbg !21
+  %165 = fmul <2 x float> %164, %161, !dbg !21
+  %166 = select <2 x i1> %76, <2 x float> splat (float 1.000000e+00), <2 x float> %165, !dbg !30
+  %167 = fmul <2 x float> %16, %121, !dbg !31
+  %168 = fadd <2 x float> %167, %166, !dbg !32
+  %169 = fpext <2 x bfloat> %29 to <2 x float>, !dbg !22
+  %170 = fcmp ogt <2 x float> %15, %169, !dbg !23
+  %171 = or <2 x i1> %31, %170, !dbg !24
+  %172 = select <2 x i1> %171, <2 x float> %15, <2 x float> %169, !dbg !25
+  %173 = fcmp oeq <2 x float> %172, splat (float 0xFFF0000000000000), !dbg !26
+  %foldExtExtBinop183 = fsub <2 x float> %15, %172, !dbg !27
+  %174 = extractelement <2 x float> %foldExtExtBinop183, i64 0, !dbg !27
+  %foldExtExtBinop185 = fsub <2 x float> %15, %172, !dbg !27
+  %175 = extractelement <2 x float> %foldExtExtBinop185, i64 1, !dbg !27
+  %176 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %174, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %177 = tail call float @llvm.nvvm.fma.rn.f(float %174, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i93 = select i1 %.not.i92, float %177, float %176, !dbg !21
+  %178 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i93) #6, !dbg !21
+  %179 = tail call float @llvm.nvvm.saturate.f(float %.02.i93) #6, !dbg !21
+  %.03.i95 = select i1 %.not1.i94, float %179, float %178, !dbg !21
+  %180 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i95, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %181 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i95, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %182 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %175, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %183 = tail call float @llvm.nvvm.fma.rn.f(float %175, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i103 = select i1 %.not.i102, float %183, float %182, !dbg !21
+  %184 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i103) #6, !dbg !21
+  %185 = tail call float @llvm.nvvm.saturate.f(float %.02.i103) #6, !dbg !21
+  %.03.i105 = select i1 %.not1.i104, float %185, float %184, !dbg !21
+  %186 = insertelement <2 x i32> poison, i32 %44, i64 0, !dbg !21
+  %187 = insertelement <2 x i32> %186, i32 %49, i64 1, !dbg !21
+  %188 = icmp eq <2 x i32> %187, zeroinitializer, !dbg !21
+  %189 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i105, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %190 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i105, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %191 = insertelement <2 x float> poison, float %181, i64 0, !dbg !21
+  %192 = insertelement <2 x float> %191, float %190, i64 1, !dbg !21
+  %193 = insertelement <2 x float> poison, float %180, i64 0, !dbg !21
+  %194 = insertelement <2 x float> %193, float %189, i64 1, !dbg !21
+  %195 = select <2 x i1> %188, <2 x float> %192, <2 x float> %194, !dbg !21
+  %196 = extractelement <2 x float> %195, i64 0, !dbg !21
+  %197 = fadd float %196, 0xC168000FE0000000, !dbg !21
+  %198 = fneg float %197, !dbg !21
+  %199 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %174, float 0x3FF7154760000000, float %198) #6, !dbg !21
+  %200 = tail call float @llvm.nvvm.fma.rn.f(float %174, float 0x3FF7154760000000, float %198) #6, !dbg !21
+  %.0.i99 = select i1 %.not3.i98, float %200, float %199, !dbg !21
+  %201 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %174, float 0x3E54AE0C00000000, float %.0.i99) #6, !dbg !21
+  %202 = tail call float @llvm.nvvm.fma.rn.f(float %174, float 0x3E54AE0C00000000, float %.0.i99) #6, !dbg !21
+  %.01.i101 = select i1 %.not4.i100, float %202, float %201, !dbg !21
+  %203 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i101) #6, !dbg !21
+  %204 = extractelement <2 x float> %195, i64 1, !dbg !21
+  %205 = fadd float %204, 0xC168000FE0000000, !dbg !21
+  %206 = fneg float %205, !dbg !21
+  %207 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %175, float 0x3FF7154760000000, float %206) #6, !dbg !21
+  %208 = tail call float @llvm.nvvm.fma.rn.f(float %175, float 0x3FF7154760000000, float %206) #6, !dbg !21
+  %.0.i109 = select i1 %.not3.i108, float %208, float %207, !dbg !21
+  %209 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %175, float 0x3E54AE0C00000000, float %.0.i109) #6, !dbg !21
+  %210 = tail call float @llvm.nvvm.fma.rn.f(float %175, float 0x3E54AE0C00000000, float %.0.i109) #6, !dbg !21
+  %.01.i111 = select i1 %.not4.i110, float %210, float %209, !dbg !21
+  %211 = bitcast <2 x float> %195 to <2 x i32>, !dbg !21
+  %212 = shl <2 x i32> %211, splat (i32 23), !dbg !21
+  %213 = bitcast <2 x i32> %212 to <2 x float>, !dbg !21
+  %214 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i111) #6, !dbg !21
+  %215 = insertelement <2 x float> poison, float %203, i64 0, !dbg !21
+  %216 = insertelement <2 x float> %215, float %214, i64 1, !dbg !21
+  %217 = fmul <2 x float> %216, %213, !dbg !21
+  %218 = select <2 x i1> %173, <2 x float> splat (float 1.000000e+00), <2 x float> %217, !dbg !28
+  %foldExtExtBinop187 = fsub <2 x float> %169, %172, !dbg !29
+  %219 = extractelement <2 x float> %foldExtExtBinop187, i64 0, !dbg !29
+  %foldExtExtBinop189 = fsub <2 x float> %169, %172, !dbg !29
+  %220 = extractelement <2 x float> %foldExtExtBinop189, i64 1, !dbg !29
+  %221 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %219, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %222 = tail call float @llvm.nvvm.fma.rn.f(float %219, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i133 = select i1 %.not.i132, float %222, float %221, !dbg !21
+  %223 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i133) #6, !dbg !21
+  %224 = tail call float @llvm.nvvm.saturate.f(float %.02.i133) #6, !dbg !21
+  %.03.i135 = select i1 %.not1.i134, float %224, float %223, !dbg !21
+  %225 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i135, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %226 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i135, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %227 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %220, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %228 = tail call float @llvm.nvvm.fma.rn.f(float %220, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !21
+  %.02.i143 = select i1 %.not.i142, float %228, float %227, !dbg !21
+  %229 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i143) #6, !dbg !21
+  %230 = tail call float @llvm.nvvm.saturate.f(float %.02.i143) #6, !dbg !21
+  %.03.i145 = select i1 %.not1.i144, float %230, float %229, !dbg !21
+  %231 = insertelement <2 x i32> poison, i32 %64, i64 0, !dbg !21
+  %232 = insertelement <2 x i32> %231, i32 %69, i64 1, !dbg !21
+  %233 = icmp eq <2 x i32> %232, zeroinitializer, !dbg !21
+  %234 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i145, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %235 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i145, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !21
+  %236 = insertelement <2 x float> poison, float %226, i64 0, !dbg !21
+  %237 = insertelement <2 x float> %236, float %235, i64 1, !dbg !21
+  %238 = insertelement <2 x float> poison, float %225, i64 0, !dbg !21
+  %239 = insertelement <2 x float> %238, float %234, i64 1, !dbg !21
+  %240 = select <2 x i1> %233, <2 x float> %237, <2 x float> %239, !dbg !21
+  %241 = extractelement <2 x float> %240, i64 0, !dbg !21
+  %242 = fadd float %241, 0xC168000FE0000000, !dbg !21
+  %243 = fneg float %242, !dbg !21
+  %244 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %219, float 0x3FF7154760000000, float %243) #6, !dbg !21
+  %245 = tail call float @llvm.nvvm.fma.rn.f(float %219, float 0x3FF7154760000000, float %243) #6, !dbg !21
+  %.0.i139 = select i1 %.not3.i138, float %245, float %244, !dbg !21
+  %246 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %219, float 0x3E54AE0C00000000, float %.0.i139) #6, !dbg !21
+  %247 = tail call float @llvm.nvvm.fma.rn.f(float %219, float 0x3E54AE0C00000000, float %.0.i139) #6, !dbg !21
+  %.01.i141 = select i1 %.not4.i140, float %247, float %246, !dbg !21
+  %248 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i141) #6, !dbg !21
+  %249 = extractelement <2 x float> %240, i64 1, !dbg !21
+  %250 = fadd float %249, 0xC168000FE0000000, !dbg !21
+  %251 = fneg float %250, !dbg !21
+  %252 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %220, float 0x3FF7154760000000, float %251) #6, !dbg !21
+  %253 = tail call float @llvm.nvvm.fma.rn.f(float %220, float 0x3FF7154760000000, float %251) #6, !dbg !21
+  %.0.i149 = select i1 %.not3.i148, float %253, float %252, !dbg !21
+  %254 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %220, float 0x3E54AE0C00000000, float %.0.i149) #6, !dbg !21
+  %255 = tail call float @llvm.nvvm.fma.rn.f(float %220, float 0x3E54AE0C00000000, float %.0.i149) #6, !dbg !21
+  %.01.i151 = select i1 %.not4.i150, float %255, float %254, !dbg !21
+  %256 = bitcast <2 x float> %240 to <2 x i32>, !dbg !21
+  %257 = shl <2 x i32> %256, splat (i32 23), !dbg !21
+  %258 = bitcast <2 x i32> %257 to <2 x float>, !dbg !21
+  %259 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i151) #6, !dbg !21
+  %260 = insertelement <2 x float> poison, float %248, i64 0, !dbg !21
+  %261 = insertelement <2 x float> %260, float %259, i64 1, !dbg !21
+  %262 = fmul <2 x float> %261, %258, !dbg !21
+  %263 = select <2 x i1> %173, <2 x float> splat (float 1.000000e+00), <2 x float> %262, !dbg !30
+  %264 = fmul <2 x float> %14, %218, !dbg !31
+  %265 = fadd <2 x float> %264, %263, !dbg !32
+  %266 = insertelement <2 x i1> poison, i1 %19, i64 0, !dbg !33
+  %267 = shufflevector <2 x i1> %266, <2 x i1> poison, <2 x i32> zeroinitializer, !dbg !33
+  %268 = select <2 x i1> %267, <2 x float> %75, <2 x float> %17, !dbg !33
+  %269 = select <2 x i1> %267, <2 x float> %172, <2 x float> %15, !dbg !33
+  %270 = select <2 x i1> %267, <2 x float> %168, <2 x float> %16, !dbg !34
+  %271 = select <2 x i1> %267, <2 x float> %265, <2 x float> %14, !dbg !34
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 2048, !dbg !11
+  %272 = icmp samesign ult i64 %indvars.iv, 29952, !dbg !11
+  br i1 %272, label %13, label %273, !dbg !11
+273:                                              ; preds = %13
+  %274 = and i32 %8, 31, !dbg !9
+  %275 = lshr i32 %8, 5, !dbg !9
+  %276 = extractelement <2 x float> %268, i64 0, !dbg !35
+  %277 = extractelement <2 x float> %268, i64 1, !dbg !35
+  %278 = fcmp ogt float %276, %277, !dbg !35
+  %279 = fcmp uno float %276, 0.000000e+00, !dbg !37
+  %280 = or i1 %278, %279, !dbg !38
+  %281 = select i1 %280, float %276, float %277, !dbg !39
+  %282 = extractelement <2 x float> %269, i64 0, !dbg !35
+  %283 = fcmp ogt float %281, %282, !dbg !35
+  %284 = fcmp uno float %281, 0.000000e+00, !dbg !37
+  %285 = or i1 %283, %284, !dbg !38
+  %286 = select i1 %285, float %281, float %282, !dbg !39
+  %287 = extractelement <2 x float> %269, i64 1, !dbg !35
+  %288 = fcmp ogt float %286, %287, !dbg !35
+  %289 = fcmp uno float %286, 0.000000e+00, !dbg !37
+  %290 = or i1 %288, %289, !dbg !38
+  %291 = select i1 %290, float %286, float %287, !dbg !39
+  %292 = bitcast float %291 to i32, !dbg !40
+  %293 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %292, i32 16, i32 31), !dbg !40
+  %294 = bitcast i32 %293 to float, !dbg !40
+  %295 = fcmp ogt float %291, %294, !dbg !35
+  %296 = fcmp uno float %291, 0.000000e+00, !dbg !37
+  %297 = or i1 %296, %295, !dbg !38
+  %298 = select i1 %297, float %291, float %294, !dbg !39
+  %299 = bitcast float %298 to i32, !dbg !40
+  %300 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %299, i32 8, i32 31), !dbg !40
+  %301 = bitcast i32 %300 to float, !dbg !40
+  %302 = fcmp ogt float %298, %301, !dbg !35
+  %303 = fcmp uno float %298, 0.000000e+00, !dbg !37
+  %304 = or i1 %302, %303, !dbg !38
+  %305 = select i1 %304, float %298, float %301, !dbg !39
+  %306 = bitcast float %305 to i32, !dbg !40
+  %307 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %306, i32 4, i32 31), !dbg !40
+  %308 = bitcast i32 %307 to float, !dbg !40
+  %309 = fcmp ogt float %305, %308, !dbg !35
+  %310 = fcmp uno float %305, 0.000000e+00, !dbg !37
+  %311 = or i1 %309, %310, !dbg !38
+  %312 = select i1 %311, float %305, float %308, !dbg !39
+  %313 = bitcast float %312 to i32, !dbg !40
+  %314 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %313, i32 2, i32 31), !dbg !40
+  %315 = bitcast i32 %314 to float, !dbg !40
+  %316 = fcmp ogt float %312, %315, !dbg !35
+  %317 = fcmp uno float %312, 0.000000e+00, !dbg !37
+  %318 = or i1 %316, %317, !dbg !38
+  %319 = select i1 %318, float %312, float %315, !dbg !39
+  %320 = bitcast float %319 to i32, !dbg !40
+  %321 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %320, i32 1, i32 31), !dbg !40
+  %322 = bitcast i32 %321 to float, !dbg !40
+  %323 = fcmp ogt float %319, %322, !dbg !35
+  %324 = fcmp uno float %319, 0.000000e+00, !dbg !37
+  %325 = or i1 %323, %324, !dbg !38
+  %326 = and i32 %275, 15, !dbg !40
+  %327 = icmp eq i32 %274, 0, !dbg !40
+  %328 = getelementptr float, ptr addrspace(3) @global_smem, i32 %326, !dbg !40
+  %329 = select i1 %325, i32 %320, i32 %321, !dbg !39
+  %330 = insertelement <1 x i32> poison, i32 %329, i64 0, !dbg !40
+  tail call void asm sideeffect "@$2 st.shared.b32 [ $0 + 0 ], $1;", "r,r,b"(ptr addrspace(3) %328, <1 x i32> %330, i1 %327) #6, !dbg !40
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !40
+  %331 = icmp samesign ult i32 %8, 16, !dbg !40
+  %332 = getelementptr float, ptr addrspace(3) @global_smem, i32 %8, !dbg !40
+  %333 = tail call i32 asm sideeffect "@$2 ld.shared.b32 $0, [ $1 + 0 ];", "=r,r,b"(ptr addrspace(3) %332, i1 %331) #6, !dbg !40
+  %334 = bitcast i32 %333 to float, !dbg !40
+  %335 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %333, i32 8, i32 31), !dbg !40
+  %336 = bitcast i32 %335 to float, !dbg !40
+  %337 = fcmp ogt float %334, %336, !dbg !35
+  %338 = fcmp uno float %334, 0.000000e+00, !dbg !37
+  %339 = or i1 %338, %337, !dbg !38
+  %340 = select i1 %339, float %334, float %336, !dbg !39
+  %341 = bitcast float %340 to i32, !dbg !40
+  %342 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %341, i32 4, i32 31), !dbg !40
+  %343 = bitcast i32 %342 to float, !dbg !40
+  %344 = fcmp ogt float %340, %343, !dbg !35
+  %345 = fcmp uno float %340, 0.000000e+00, !dbg !37
+  %346 = or i1 %344, %345, !dbg !38
+  %347 = select i1 %346, float %340, float %343, !dbg !39
+  %348 = bitcast float %347 to i32, !dbg !40
+  %349 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %348, i32 2, i32 31), !dbg !40
+  %350 = bitcast i32 %349 to float, !dbg !40
+  %351 = fcmp ogt float %347, %350, !dbg !35
+  %352 = fcmp uno float %347, 0.000000e+00, !dbg !37
+  %353 = or i1 %351, %352, !dbg !38
+  %354 = select i1 %353, float %347, float %350, !dbg !39
+  %355 = bitcast float %354 to i32, !dbg !40
+  %356 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %355, i32 1, i32 31), !dbg !40
+  %357 = bitcast i32 %356 to float, !dbg !40
+  %358 = fcmp ogt float %354, %357, !dbg !35
+  %359 = fcmp uno float %354, 0.000000e+00, !dbg !37
+  %360 = or i1 %358, %359, !dbg !38
+  %361 = icmp eq i32 %8, 0, !dbg !40
+  %362 = select i1 %360, i32 %355, i32 %356, !dbg !39
+  %363 = insertelement <1 x i32> poison, i32 %362, i64 0, !dbg !40
+  tail call void asm sideeffect "@$2 st.shared.b32 [ $0 + 0 ], $1;", "r,r,b"(ptr addrspace(3) %332, <1 x i32> %363, i1 %361) #6, !dbg !40
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !40
+  %364 = load float, ptr addrspace(3) @global_smem, align 16, !dbg !40
+  %365 = fcmp oeq float %364, 0xFFF0000000000000, !dbg !41
+  %366 = fsub float %276, %364, !dbg !42
+  %367 = fsub float %277, %364, !dbg !42
+  %368 = fsub float %282, %364, !dbg !42
+  %369 = fsub float %287, %364, !dbg !42
+  %370 = select i1 %365, float 0.000000e+00, float %366, !dbg !43
+  %371 = select i1 %365, float 0.000000e+00, float %367, !dbg !43
+  %372 = select i1 %365, float 0.000000e+00, float %368, !dbg !43
+  %373 = select i1 %365, float 0.000000e+00, float %369, !dbg !43
+  %374 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not.i = icmp eq i32 %374, 0, !dbg !44
+  %375 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %370, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %376 = tail call float @llvm.nvvm.fma.rn.f(float %370, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %.02.i = select i1 %.not.i, float %376, float %375, !dbg !44
+  %377 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not1.i = icmp eq i32 %377, 0, !dbg !44
+  %378 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i) #6, !dbg !44
+  %379 = tail call float @llvm.nvvm.saturate.f(float %.02.i) #6, !dbg !44
+  %.03.i = select i1 %.not1.i, float %379, float %378, !dbg !44
+  %380 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %381 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %382 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %383 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not3.i = icmp eq i32 %383, 0, !dbg !44
+  %384 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not4.i = icmp eq i32 %384, 0, !dbg !44
+  %385 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not.i2 = icmp eq i32 %385, 0, !dbg !44
+  %386 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %371, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %387 = tail call float @llvm.nvvm.fma.rn.f(float %371, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %.02.i3 = select i1 %.not.i2, float %387, float %386, !dbg !44
+  %388 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not1.i4 = icmp eq i32 %388, 0, !dbg !44
+  %389 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i3) #6, !dbg !44
+  %390 = tail call float @llvm.nvvm.saturate.f(float %.02.i3) #6, !dbg !44
+  %.03.i5 = select i1 %.not1.i4, float %390, float %389, !dbg !44
+  %391 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %392 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i5, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %393 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i5, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %394 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not3.i8 = icmp eq i32 %394, 0, !dbg !44
+  %395 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not4.i10 = icmp eq i32 %395, 0, !dbg !44
+  %396 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not.i12 = icmp eq i32 %396, 0, !dbg !44
+  %397 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %372, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %398 = tail call float @llvm.nvvm.fma.rn.f(float %372, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %.02.i13 = select i1 %.not.i12, float %398, float %397, !dbg !44
+  %399 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not1.i14 = icmp eq i32 %399, 0, !dbg !44
+  %400 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i13) #6, !dbg !44
+  %401 = tail call float @llvm.nvvm.saturate.f(float %.02.i13) #6, !dbg !44
+  %.03.i15 = select i1 %.not1.i14, float %401, float %400, !dbg !44
+  %402 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %403 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i15, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %404 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i15, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %405 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not3.i18 = icmp eq i32 %405, 0, !dbg !44
+  %406 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not4.i20 = icmp eq i32 %406, 0, !dbg !44
+  %407 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not.i22 = icmp eq i32 %407, 0, !dbg !44
+  %408 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %373, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %409 = tail call float @llvm.nvvm.fma.rn.f(float %373, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !44
+  %.02.i23 = select i1 %.not.i22, float %409, float %408, !dbg !44
+  %410 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not1.i24 = icmp eq i32 %410, 0, !dbg !44
+  %411 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i23) #6, !dbg !44
+  %412 = tail call float @llvm.nvvm.saturate.f(float %.02.i23) #6, !dbg !44
+  %.03.i25 = select i1 %.not1.i24, float %412, float %411, !dbg !44
+  %413 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %414 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i25, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %415 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i25, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !44
+  %416 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not3.i28 = icmp eq i32 %416, 0, !dbg !44
+  %417 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !44
+  %.not4.i30 = icmp eq i32 %417, 0, !dbg !44
+  %418 = insertelement <2 x i32> poison, i32 %380, i64 0, !dbg !44
+  %419 = insertelement <2 x i32> %418, i32 %391, i64 1, !dbg !44
+  %420 = icmp eq <2 x i32> %419, zeroinitializer, !dbg !44
+  %421 = insertelement <2 x float> poison, float %382, i64 0, !dbg !44
+  %422 = insertelement <2 x float> %421, float %393, i64 1, !dbg !44
+  %423 = insertelement <2 x float> poison, float %381, i64 0, !dbg !44
+  %424 = insertelement <2 x float> %423, float %392, i64 1, !dbg !44
+  %425 = select <2 x i1> %420, <2 x float> %422, <2 x float> %424, !dbg !44
+  %426 = extractelement <2 x float> %425, i64 0, !dbg !44
+  %427 = fadd float %426, 0xC168000FE0000000, !dbg !44
+  %428 = fneg float %427, !dbg !44
+  %429 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %370, float 0x3FF7154760000000, float %428) #6, !dbg !44
+  %430 = tail call float @llvm.nvvm.fma.rn.f(float %370, float 0x3FF7154760000000, float %428) #6, !dbg !44
+  %.0.i = select i1 %.not3.i, float %430, float %429, !dbg !44
+  %431 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %370, float 0x3E54AE0C00000000, float %.0.i) #6, !dbg !44
+  %432 = tail call float @llvm.nvvm.fma.rn.f(float %370, float 0x3E54AE0C00000000, float %.0.i) #6, !dbg !44
+  %.01.i = select i1 %.not4.i, float %432, float %431, !dbg !44
+  %433 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i) #6, !dbg !44
+  %434 = extractelement <2 x float> %425, i64 1, !dbg !44
+  %435 = fadd float %434, 0xC168000FE0000000, !dbg !44
+  %436 = fneg float %435, !dbg !44
+  %437 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %371, float 0x3FF7154760000000, float %436) #6, !dbg !44
+  %438 = tail call float @llvm.nvvm.fma.rn.f(float %371, float 0x3FF7154760000000, float %436) #6, !dbg !44
+  %.0.i9 = select i1 %.not3.i8, float %438, float %437, !dbg !44
+  %439 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %371, float 0x3E54AE0C00000000, float %.0.i9) #6, !dbg !44
+  %440 = tail call float @llvm.nvvm.fma.rn.f(float %371, float 0x3E54AE0C00000000, float %.0.i9) #6, !dbg !44
+  %.01.i11 = select i1 %.not4.i10, float %440, float %439, !dbg !44
+  %441 = bitcast <2 x float> %425 to <2 x i32>, !dbg !44
+  %442 = shl <2 x i32> %441, splat (i32 23), !dbg !44
+  %443 = bitcast <2 x i32> %442 to <2 x float>, !dbg !44
+  %444 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i11) #6, !dbg !44
+  %445 = insertelement <2 x float> poison, float %433, i64 0, !dbg !44
+  %446 = insertelement <2 x float> %445, float %444, i64 1, !dbg !44
+  %447 = fmul <2 x float> %446, %443, !dbg !44
+  %448 = fmul <2 x float> %270, %447, !dbg !45
+  %449 = insertelement <2 x i32> poison, i32 %402, i64 0, !dbg !44
+  %450 = insertelement <2 x i32> %449, i32 %413, i64 1, !dbg !44
+  %451 = icmp eq <2 x i32> %450, zeroinitializer, !dbg !44
+  %452 = insertelement <2 x float> poison, float %404, i64 0, !dbg !44
+  %453 = insertelement <2 x float> %452, float %415, i64 1, !dbg !44
+  %454 = insertelement <2 x float> poison, float %403, i64 0, !dbg !44
+  %455 = insertelement <2 x float> %454, float %414, i64 1, !dbg !44
+  %456 = select <2 x i1> %451, <2 x float> %453, <2 x float> %455, !dbg !44
+  %457 = extractelement <2 x float> %456, i64 0, !dbg !44
+  %458 = fadd float %457, 0xC168000FE0000000, !dbg !44
+  %459 = fneg float %458, !dbg !44
+  %460 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %372, float 0x3FF7154760000000, float %459) #6, !dbg !44
+  %461 = tail call float @llvm.nvvm.fma.rn.f(float %372, float 0x3FF7154760000000, float %459) #6, !dbg !44
+  %.0.i19 = select i1 %.not3.i18, float %461, float %460, !dbg !44
+  %462 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %372, float 0x3E54AE0C00000000, float %.0.i19) #6, !dbg !44
+  %463 = tail call float @llvm.nvvm.fma.rn.f(float %372, float 0x3E54AE0C00000000, float %.0.i19) #6, !dbg !44
+  %.01.i21 = select i1 %.not4.i20, float %463, float %462, !dbg !44
+  %464 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i21) #6, !dbg !44
+  %465 = extractelement <2 x float> %456, i64 1, !dbg !44
+  %466 = fadd float %465, 0xC168000FE0000000, !dbg !44
+  %467 = fneg float %466, !dbg !44
+  %468 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %373, float 0x3FF7154760000000, float %467) #6, !dbg !44
+  %469 = tail call float @llvm.nvvm.fma.rn.f(float %373, float 0x3FF7154760000000, float %467) #6, !dbg !44
+  %.0.i29 = select i1 %.not3.i28, float %469, float %468, !dbg !44
+  %470 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %373, float 0x3E54AE0C00000000, float %.0.i29) #6, !dbg !44
+  %471 = tail call float @llvm.nvvm.fma.rn.f(float %373, float 0x3E54AE0C00000000, float %.0.i29) #6, !dbg !44
+  %.01.i31 = select i1 %.not4.i30, float %471, float %470, !dbg !44
+  %472 = bitcast <2 x float> %456 to <2 x i32>, !dbg !44
+  %473 = shl <2 x i32> %472, splat (i32 23), !dbg !44
+  %474 = bitcast <2 x i32> %473 to <2 x float>, !dbg !44
+  %475 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i31) #6, !dbg !44
+  %476 = insertelement <2 x float> poison, float %464, i64 0, !dbg !44
+  %477 = insertelement <2 x float> %476, float %475, i64 1, !dbg !44
+  %478 = fmul <2 x float> %477, %474, !dbg !44
+  %479 = fmul <2 x float> %271, %478, !dbg !45
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !46
+  %shift = shufflevector <2 x float> %448, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !49
+  %foldExtExtBinop191 = fadd <2 x float> %448, %shift, !dbg !49
+  %foldExtExtBinop193 = fadd <2 x float> %foldExtExtBinop191, %479, !dbg !49
+  %shift195 = shufflevector <2 x float> %479, <2 x float> poison, <2 x i32> <i32 1, i32 poison>, !dbg !49
+  %foldExtExtBinop196 = fadd <2 x float> %foldExtExtBinop193, %shift195, !dbg !49
+  %480 = extractelement <2 x float> %foldExtExtBinop196, i64 0, !dbg !49
+  %481 = bitcast float %480 to i32, !dbg !46
+  %482 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %481, i32 16, i32 31), !dbg !46
+  %483 = bitcast i32 %482 to float, !dbg !46
+  %484 = fadd float %480, %483, !dbg !49
+  %485 = bitcast float %484 to i32, !dbg !46
+  %486 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %485, i32 8, i32 31), !dbg !46
+  %487 = bitcast i32 %486 to float, !dbg !46
+  %488 = fadd float %484, %487, !dbg !49
+  %489 = bitcast float %488 to i32, !dbg !46
+  %490 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %489, i32 4, i32 31), !dbg !46
+  %491 = bitcast i32 %490 to float, !dbg !46
+  %492 = fadd float %488, %491, !dbg !49
+  %493 = bitcast float %492 to i32, !dbg !46
+  %494 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %493, i32 2, i32 31), !dbg !46
+  %495 = bitcast i32 %494 to float, !dbg !46
+  %496 = fadd float %492, %495, !dbg !49
+  %497 = bitcast float %496 to i32, !dbg !46
+  %498 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %497, i32 1, i32 31), !dbg !46
+  %499 = bitcast i32 %498 to float, !dbg !46
+  %500 = fadd float %496, %499, !dbg !49
+  %501 = bitcast float %500 to <1 x i32>, !dbg !46
+  tail call void asm sideeffect "@$2 st.shared.b32 [ $0 + 0 ], $1;", "r,r,b"(ptr addrspace(3) %328, <1 x i32> %501, i1 %327) #6, !dbg !46
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !46
+  %502 = tail call i32 asm sideeffect "@$2 ld.shared.b32 $0, [ $1 + 0 ];", "=r,r,b"(ptr addrspace(3) %332, i1 %331) #6, !dbg !46
+  %503 = bitcast i32 %502 to float, !dbg !46
+  %504 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %502, i32 8, i32 31), !dbg !46
+  %505 = bitcast i32 %504 to float, !dbg !46
+  %506 = fadd float %503, %505, !dbg !49
+  %507 = bitcast float %506 to i32, !dbg !46
+  %508 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %507, i32 4, i32 31), !dbg !46
+  %509 = bitcast i32 %508 to float, !dbg !46
+  %510 = fadd float %506, %509, !dbg !49
+  %511 = bitcast float %510 to i32, !dbg !46
+  %512 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %511, i32 2, i32 31), !dbg !46
+  %513 = bitcast i32 %512 to float, !dbg !46
+  %514 = fadd float %510, %513, !dbg !49
+  %515 = bitcast float %514 to i32, !dbg !46
+  %516 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %515, i32 1, i32 31), !dbg !46
+  %517 = bitcast i32 %516 to float, !dbg !46
+  %518 = fadd float %514, %517, !dbg !49
+  %519 = bitcast float %518 to <1 x i32>, !dbg !46
+  tail call void asm sideeffect "@$2 st.shared.b32 [ $0 + 0 ], $1;", "r,r,b"(ptr addrspace(3) %332, <1 x i32> %519, i1 %361) #6, !dbg !46
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !46
+  %520 = load float, ptr addrspace(3) @global_smem, align 16, !dbg !46
+  br label %521, !dbg !50
+521:                                              ; preds = %273, %521
+  %indvars.iv160 = phi i64 [ 0, %273 ], [ %indvars.iv.next161, %521 ]
+  %522 = or disjoint i64 %indvars.iv160, %12, !dbg !51
+  %523 = icmp samesign ult i64 %522, 32000, !dbg !52
+  %524 = trunc nuw nsw i64 %522 to i32, !dbg !53
+  %525 = add i32 %11, %524, !dbg !53
+  %526 = sext i32 %525 to i64, !dbg !54
+  %527 = getelementptr bfloat, ptr addrspace(1) %0, i64 %526, !dbg !54
+  %528 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_first.b64 $0, 1.0;", "=l"() #6, !dbg !55
+  %529 = tail call { i32, i32 } asm sideeffect "mov.u32 $0, $2;\0A\09mov.u32 $1, $3;\0A\09@$6 ld.global.L1::evict_first.L2::cache_hint.v2.b32 { $0, $1 }, [ $4 + 0 ], $5;", "=r,=r,r,r,l,l,b"(i32 0, i32 0, ptr addrspace(1) %527, i64 %528, i1 %523) #6, !dbg !55
+  %530 = extractvalue { i32, i32 } %529, 0, !dbg !55
+  %531 = bitcast i32 %530 to <2 x bfloat>, !dbg !55
+  %532 = extractvalue { i32, i32 } %529, 1, !dbg !55
+  %533 = bitcast i32 %532 to <2 x bfloat>, !dbg !55
+  %534 = extractelement <2 x bfloat> %531, i64 0, !dbg !55
+  %535 = extractelement <2 x bfloat> %531, i64 1, !dbg !55
+  %536 = extractelement <2 x bfloat> %533, i64 0, !dbg !55
+  %537 = extractelement <2 x bfloat> %533, i64 1, !dbg !55
+  %538 = fpext bfloat %534 to float, !dbg !56
+  %539 = fpext bfloat %535 to float, !dbg !56
+  %540 = fpext bfloat %536 to float, !dbg !56
+  %541 = fpext bfloat %537 to float, !dbg !56
+  %542 = fsub float %538, %364, !dbg !57
+  %543 = fsub float %539, %364, !dbg !57
+  %544 = fsub float %540, %364, !dbg !57
+  %545 = fsub float %541, %364, !dbg !57
+  %546 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not.i32 = icmp eq i32 %546, 0, !dbg !58
+  %547 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %542, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %548 = tail call float @llvm.nvvm.fma.rn.f(float %542, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %.02.i33 = select i1 %.not.i32, float %548, float %547, !dbg !58
+  %549 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not1.i34 = icmp eq i32 %549, 0, !dbg !58
+  %550 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i33) #6, !dbg !58
+  %551 = tail call float @llvm.nvvm.saturate.f(float %.02.i33) #6, !dbg !58
+  %.03.i35 = select i1 %.not1.i34, float %551, float %550, !dbg !58
+  %552 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not2.i36 = icmp eq i32 %552, 0, !dbg !58
+  %553 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i35, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %554 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i35, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %.04.i37 = select i1 %.not2.i36, float %554, float %553, !dbg !58
+  %555 = fadd float %.04.i37, 0xC168000FE0000000, !dbg !58
+  %556 = fneg float %555, !dbg !58
+  %557 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not3.i38 = icmp eq i32 %557, 0, !dbg !58
+  %558 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %542, float 0x3FF7154760000000, float %556) #6, !dbg !58
+  %559 = tail call float @llvm.nvvm.fma.rn.f(float %542, float 0x3FF7154760000000, float %556) #6, !dbg !58
+  %.0.i39 = select i1 %.not3.i38, float %559, float %558, !dbg !58
+  %560 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not4.i40 = icmp eq i32 %560, 0, !dbg !58
+  %561 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %542, float 0x3E54AE0C00000000, float %.0.i39) #6, !dbg !58
+  %562 = tail call float @llvm.nvvm.fma.rn.f(float %542, float 0x3E54AE0C00000000, float %.0.i39) #6, !dbg !58
+  %.01.i41 = select i1 %.not4.i40, float %562, float %561, !dbg !58
+  %563 = bitcast float %.04.i37 to i32, !dbg !58
+  %564 = shl i32 %563, 23, !dbg !58
+  %565 = bitcast i32 %564 to float, !dbg !58
+  %566 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i41) #6, !dbg !58
+  %567 = fmul float %566, %565, !dbg !58
+  %568 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not.i42 = icmp eq i32 %568, 0, !dbg !58
+  %569 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %543, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %570 = tail call float @llvm.nvvm.fma.rn.f(float %543, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %.02.i43 = select i1 %.not.i42, float %570, float %569, !dbg !58
+  %571 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not1.i44 = icmp eq i32 %571, 0, !dbg !58
+  %572 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i43) #6, !dbg !58
+  %573 = tail call float @llvm.nvvm.saturate.f(float %.02.i43) #6, !dbg !58
+  %.03.i45 = select i1 %.not1.i44, float %573, float %572, !dbg !58
+  %574 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not2.i46 = icmp eq i32 %574, 0, !dbg !58
+  %575 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i45, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %576 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i45, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %.04.i47 = select i1 %.not2.i46, float %576, float %575, !dbg !58
+  %577 = fadd float %.04.i47, 0xC168000FE0000000, !dbg !58
+  %578 = fneg float %577, !dbg !58
+  %579 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not3.i48 = icmp eq i32 %579, 0, !dbg !58
+  %580 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %543, float 0x3FF7154760000000, float %578) #6, !dbg !58
+  %581 = tail call float @llvm.nvvm.fma.rn.f(float %543, float 0x3FF7154760000000, float %578) #6, !dbg !58
+  %.0.i49 = select i1 %.not3.i48, float %581, float %580, !dbg !58
+  %582 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not4.i50 = icmp eq i32 %582, 0, !dbg !58
+  %583 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %543, float 0x3E54AE0C00000000, float %.0.i49) #6, !dbg !58
+  %584 = tail call float @llvm.nvvm.fma.rn.f(float %543, float 0x3E54AE0C00000000, float %.0.i49) #6, !dbg !58
+  %.01.i51 = select i1 %.not4.i50, float %584, float %583, !dbg !58
+  %585 = bitcast float %.04.i47 to i32, !dbg !58
+  %586 = shl i32 %585, 23, !dbg !58
+  %587 = bitcast i32 %586 to float, !dbg !58
+  %588 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i51) #6, !dbg !58
+  %589 = fmul float %588, %587, !dbg !58
+  %590 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not.i52 = icmp eq i32 %590, 0, !dbg !58
+  %591 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %544, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %592 = tail call float @llvm.nvvm.fma.rn.f(float %544, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %.02.i53 = select i1 %.not.i52, float %592, float %591, !dbg !58
+  %593 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not1.i54 = icmp eq i32 %593, 0, !dbg !58
+  %594 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i53) #6, !dbg !58
+  %595 = tail call float @llvm.nvvm.saturate.f(float %.02.i53) #6, !dbg !58
+  %.03.i55 = select i1 %.not1.i54, float %595, float %594, !dbg !58
+  %596 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not2.i56 = icmp eq i32 %596, 0, !dbg !58
+  %597 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i55, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %598 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i55, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %.04.i57 = select i1 %.not2.i56, float %598, float %597, !dbg !58
+  %599 = fadd float %.04.i57, 0xC168000FE0000000, !dbg !58
+  %600 = fneg float %599, !dbg !58
+  %601 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not3.i58 = icmp eq i32 %601, 0, !dbg !58
+  %602 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %544, float 0x3FF7154760000000, float %600) #6, !dbg !58
+  %603 = tail call float @llvm.nvvm.fma.rn.f(float %544, float 0x3FF7154760000000, float %600) #6, !dbg !58
+  %.0.i59 = select i1 %.not3.i58, float %603, float %602, !dbg !58
+  %604 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not4.i60 = icmp eq i32 %604, 0, !dbg !58
+  %605 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %544, float 0x3E54AE0C00000000, float %.0.i59) #6, !dbg !58
+  %606 = tail call float @llvm.nvvm.fma.rn.f(float %544, float 0x3E54AE0C00000000, float %.0.i59) #6, !dbg !58
+  %.01.i61 = select i1 %.not4.i60, float %606, float %605, !dbg !58
+  %607 = bitcast float %.04.i57 to i32, !dbg !58
+  %608 = shl i32 %607, 23, !dbg !58
+  %609 = bitcast i32 %608 to float, !dbg !58
+  %610 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i61) #6, !dbg !58
+  %611 = fmul float %610, %609, !dbg !58
+  %612 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not.i62 = icmp eq i32 %612, 0, !dbg !58
+  %613 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %545, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %614 = tail call float @llvm.nvvm.fma.rn.f(float %545, float 0x3F777313A0000000, float 5.000000e-01) #6, !dbg !58
+  %.02.i63 = select i1 %.not.i62, float %614, float %613, !dbg !58
+  %615 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not1.i64 = icmp eq i32 %615, 0, !dbg !58
+  %616 = tail call float @llvm.nvvm.saturate.ftz.f(float %.02.i63) #6, !dbg !58
+  %617 = tail call float @llvm.nvvm.saturate.f(float %.02.i63) #6, !dbg !58
+  %.03.i65 = select i1 %.not1.i64, float %617, float %616, !dbg !58
+  %618 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not2.i66 = icmp eq i32 %618, 0, !dbg !58
+  %619 = tail call float @llvm.nvvm.fma.rm.ftz.f(float %.03.i65, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %620 = tail call float @llvm.nvvm.fma.rm.f(float %.03.i65, float 2.520000e+02, float 0x4168000020000000) #6, !dbg !58
+  %.04.i67 = select i1 %.not2.i66, float %620, float %619, !dbg !58
+  %621 = fadd float %.04.i67, 0xC168000FE0000000, !dbg !58
+  %622 = fneg float %621, !dbg !58
+  %623 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not3.i68 = icmp eq i32 %623, 0, !dbg !58
+  %624 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %545, float 0x3FF7154760000000, float %622) #6, !dbg !58
+  %625 = tail call float @llvm.nvvm.fma.rn.f(float %545, float 0x3FF7154760000000, float %622) #6, !dbg !58
+  %.0.i69 = select i1 %.not3.i68, float %625, float %624, !dbg !58
+  %626 = tail call i32 @__nvvm_reflect(ptr nonnull @.str) #6, !dbg !58
+  %.not4.i70 = icmp eq i32 %626, 0, !dbg !58
+  %627 = tail call float @llvm.nvvm.fma.rn.ftz.f(float %545, float 0x3E54AE0C00000000, float %.0.i69) #6, !dbg !58
+  %628 = tail call float @llvm.nvvm.fma.rn.f(float %545, float 0x3E54AE0C00000000, float %.0.i69) #6, !dbg !58
+  %.01.i71 = select i1 %.not4.i70, float %628, float %627, !dbg !58
+  %629 = bitcast float %.04.i67 to i32, !dbg !58
+  %630 = shl i32 %629, 23, !dbg !58
+  %631 = bitcast i32 %630 to float, !dbg !58
+  %632 = tail call float @llvm.nvvm.ex2.approx.ftz.f(float %.01.i71) #6, !dbg !58
+  %633 = fmul float %632, %631, !dbg !58
+  %634 = tail call float @llvm.nvvm.div.full(float %567, float %520), !dbg !59
+  %635 = tail call float @llvm.nvvm.div.full(float %589, float %520), !dbg !59
+  %636 = tail call float @llvm.nvvm.div.full(float %611, float %520), !dbg !59
+  %637 = tail call float @llvm.nvvm.div.full(float %633, float %520), !dbg !59
+  %638 = getelementptr float, ptr addrspace(1) %1, i64 %526, !dbg !60
+  %639 = bitcast float %634 to i32, !dbg !61
+  %640 = bitcast float %635 to i32, !dbg !61
+  %641 = bitcast float %636 to i32, !dbg !61
+  %642 = bitcast float %637 to i32, !dbg !61
+  tail call void asm sideeffect "@$5 st.global.v4.b32 [ $4 + 0 ], { $0, $1, $2, $3 };", "r,r,r,r,l,b"(i32 %639, i32 %640, i32 %641, i32 %642, ptr addrspace(1) %638, i1 %523) #6, !dbg !61
+  %indvars.iv.next161 = add nuw nsw i64 %indvars.iv160, 2048, !dbg !50
+  %643 = icmp samesign ult i64 %indvars.iv160, 29952, !dbg !50
+  br i1 %643, label %521, label %644, !dbg !50
+644:                                              ; preds = %521
+  ret void, !dbg !62
+}
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 2147483647) i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 1024) i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1
+; Function Attrs: convergent nocallback nounwind memory(inaccessiblemem: readwrite)
+declare i32 @llvm.nvvm.shfl.sync.bfly.i32(i32, i32, i32, i32) #2
+; Function Attrs: convergent nocallback nounwind
+declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32) #3
+; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none)
+declare float @llvm.nvvm.div.full(float, float) #4
+declare i32 @__nvvm_reflect(ptr) local_unnamed_addr #5
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare float @llvm.nvvm.fma.rn.ftz.f(float, float, float) #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare float @llvm.nvvm.fma.rn.f(float, float, float) #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare float @llvm.nvvm.saturate.ftz.f(float) #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare float @llvm.nvvm.saturate.f(float) #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare float @llvm.nvvm.fma.rm.ftz.f(float, float, float) #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare float @llvm.nvvm.fma.rm.f(float, float, float) #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn memory(none)
+declare float @llvm.nvvm.ex2.approx.ftz.f(float) #4
+attributes #0 = { nounwind "nvvm.reqntid"="512" }
+attributes #1 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+attributes #2 = { convergent nocallback nounwind memory(inaccessiblemem: readwrite) }
+attributes #3 = { convergent nocallback nounwind }
+attributes #4 = { mustprogress nocallback nofree nosync nounwind willreturn memory(none) }
+attributes #5 = { "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
+attributes #6 = { nounwind }
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!2, !3}
+!llvm.ident = !{!4}
+!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "triton", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly)
+!1 = !DIFile(filename: "cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py", directory: "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx")
+!2 = !{i32 2, !"Debug Info Version", i32 3}
+!3 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
+!4 = !{!"clang version 3.8.0 (tags/RELEASE_380/final)"}
+!5 = distinct !DISubprogram(name: "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0", linkageName: "triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0", scope: !1, file: !1, line: 18, type: !6, scopeLine: 18, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
+!6 = !DISubroutineType(cc: DW_CC_normal, types: !7)
+!7 = !{}
+!8 = !DILocation(line: 23, column: 28, scope: !5)
+!9 = !DILocation(line: 26, column: 37, scope: !5)
+!10 = !DILocation(line: 37, column: 47, scope: !5)
+!11 = !DILocation(line: 31, column: 40, scope: !5)
+!12 = !DILocation(line: 32, column: 31, scope: !5)
+!13 = !DILocation(line: 33, column: 29, scope: !5)
+!14 = !DILocation(line: 37, column: 41, scope: !5)
+!15 = !DILocation(line: 37, column: 34, scope: !5)
+!16 = !DILocation(line: 37, column: 52, scope: !5)
+!17 = !DILocation(line: 112, column: 21, scope: !18, inlinedAt: !20)
+!18 = distinct !DILexicalBlockFile(scope: !5, file: !19, discriminator: 0)
+!19 = !DIFile(filename: "triton_helpers.py", directory: "/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime")
+!20 = !DILocation(line: 42, column: 40, scope: !5)
+!21 = !DILocation(line: 173, column: 29, scope: !18, inlinedAt: !20)
+!22 = !DILocation(line: 37, column: 105, scope: !5)
+!23 = !DILocation(line: 110, column: 15, scope: !18, inlinedAt: !20)
+!24 = !DILocation(line: 112, column: 16, scope: !18, inlinedAt: !20)
+!25 = !DILocation(line: 113, column: 29, scope: !18, inlinedAt: !20)
+!26 = !DILocation(line: 196, column: 19, scope: !18, inlinedAt: !20)
+!27 = !DILocation(line: 196, column: 53, scope: !18, inlinedAt: !20)
+!28 = !DILocation(line: 196, column: 39, scope: !18, inlinedAt: !20)
+!29 = !DILocation(line: 199, column: 53, scope: !18, inlinedAt: !20)
+!30 = !DILocation(line: 199, column: 39, scope: !18, inlinedAt: !20)
+!31 = !DILocation(line: 205, column: 24, scope: !18, inlinedAt: !20)
+!32 = !DILocation(line: 205, column: 36, scope: !18, inlinedAt: !20)
+!33 = !DILocation(line: 45, column: 54, scope: !5)
+!34 = !DILocation(line: 46, column: 54, scope: !5)
+!35 = !DILocation(line: 110, column: 15, scope: !18, inlinedAt: !36)
+!36 = !DILocation(line: 49, column: 33, scope: !5)
+!37 = !DILocation(line: 112, column: 21, scope: !18, inlinedAt: !36)
+!38 = !DILocation(line: 112, column: 16, scope: !18, inlinedAt: !36)
+!39 = !DILocation(line: 113, column: 29, scope: !18, inlinedAt: !36)
+!40 = !DILocation(line: 123, column: 29, scope: !18, inlinedAt: !36)
+!41 = !DILocation(line: 180, column: 40, scope: !18, inlinedAt: !36)
+!42 = !DILocation(line: 180, column: 68, scope: !18, inlinedAt: !36)
+!43 = !DILocation(line: 180, column: 58, scope: !18, inlinedAt: !36)
+!44 = !DILocation(line: 173, column: 29, scope: !18, inlinedAt: !36)
+!45 = !DILocation(line: 181, column: 31, scope: !18, inlinedAt: !36)
+!46 = !DILocation(line: 291, column: 36, scope: !47, inlinedAt: !36)
+!47 = distinct !DILexicalBlockFile(scope: !5, file: !48, discriminator: 0)
+!48 = !DIFile(filename: "standard.py", directory: "/workspace/specforge/lib/python3.11/site-packages/triton/language")
+!49 = !DILocation(line: 261, column: 15, scope: !47, inlinedAt: !36)
+!50 = !DILocation(line: 52, column: 40, scope: !5)
+!51 = !DILocation(line: 53, column: 31, scope: !5)
+!52 = !DILocation(line: 54, column: 29, scope: !5)
+!53 = !DILocation(line: 58, column: 41, scope: !5)
+!54 = !DILocation(line: 58, column: 34, scope: !5)
+!55 = !DILocation(line: 58, column: 52, scope: !5)
+!56 = !DILocation(line: 58, column: 106, scope: !5)
+!57 = !DILocation(line: 60, column: 22, scope: !5)
+!58 = !DILocation(line: 61, column: 29, scope: !5)
+!59 = !DILocation(line: 62, column: 23, scope: !5)
+!60 = !DILocation(line: 63, column: 29, scope: !5)
+!61 = !DILocation(line: 63, column: 53, scope: !5)
+!62 = !DILocation(line: 52, column: 4, scope: !5)

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ptx ADDED

	@@ -0,0 +1,921 @@

+//
+// Generated by LLVM NVPTX Back-End
+//
+.version 8.7
+.target sm_90a
+.address_size 64
+	// .globl	triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0 // -- Begin function triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0
+.extern .shared .align 16 .b8 global_smem[];
+.global .align 1 .b8 _$_str[11] = {95, 95, 67, 85, 68, 65, 95, 70, 84, 90};
+                                        // @triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0
+.visible .entry triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0(
+	.param .u64 .ptr .global .align 1 triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_0,
+	.param .u64 .ptr .global .align 1 triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_1,
+	.param .u32 triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_2,
+	.param .u32 triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_3,
+	.param .u64 .ptr .global .align 1 triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_4,
+	.param .u64 .ptr .global .align 1 triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_5
+)
+.reqntid 512
+{
+	.reg .pred 	%p<49>;
+	.reg .b16 	%rs<9>;
+	.reg .b32 	%r<354>;
+	.reg .b64 	%rd<68>;
+	.loc	1 18 0                          // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:18:0
+$L__func_begin0:
+	.loc	1 18 0                          // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:18:0
+// %bb.0:
+	ld.param.b64 	%rd16, [triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_1];
+	ld.param.b64 	%rd15, [triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0_param_0];
+$L__tmp0:
+	.loc	1 23 28                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:23:28
+	mov.u32 	%r4, %ctaid.x;
+	.loc	1 26 37                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:26:37
+	mov.u32 	%r1, %tid.x;
+	shl.b32 	%r5, %r1, 2;
+	and.b32 	%r6, %r5, 2044;
+	.loc	1 31 40                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:31:40
+	cvt.u64.u32 	%rd1, %r6;
+	mad.lo.s32 	%r7, %r4, 32000, %r6;
+	cvt.u64.u32 	%rd2, %r7;
+	mov.b32 	%r8, 0fFF800000;
+	mov.b64 	%rd64, {%r8, %r8};
+	mov.b32 	%r9, 0f00000000;
+	mov.b64 	%rd63, {%r9, %r9};
+	mov.b64 	%rd62, 0;
+	mov.b64 	%rd65, %rd63;
+	mov.b64 	%rd66, %rd64;
+$L__BB0_1:                              // =>This Inner Loop Header: Depth=1
+	.loc	1 33 29                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:33:29
+	add.s64 	%rd23, %rd1, %rd62;
+	setp.lt.u64 	%p1, %rd23, 32000;
+	.loc	1 37 34                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:37:34
+	add.s64 	%rd24, %rd2, %rd62;
+	cvt.u32.u64 	%r14, %rd24;
+	mad.wide.s32 	%rd21, %r14, 2, %rd15;
+	.loc	1 37 52                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:37:52
+	// begin inline asm
+	mov.u64 %rd20, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd20, 1.0;
+	// end inline asm
+	mov.b32 	%r12, 0;
+	// begin inline asm
+	mov.u32 %r10, %r12;
+	mov.u32 %r11, %r12;
+	@%p1 ld.global.L1::evict_last.L2::cache_hint.v2.b32 { %r10, %r11 }, [ %rd21 + 0 ], %rd20;
+	// end inline asm
+$L__tmp1:
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	mov.b64 	{%r15, %r16}, %rd66;
+	setp.nan.f32 	%p2, %r15, %r15;
+	setp.nan.f32 	%p3, %r16, %r16;
+	mov.b64 	{%r17, %r18}, %rd64;
+	setp.nan.f32 	%p4, %r17, %r17;
+	setp.nan.f32 	%p5, %r18, %r18;
+$L__tmp2:
+	.loc	1 37 105                        // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:37:105
+	mov.b32 	{%rs1, %rs2}, %r10;
+	cvt.f32.bf16 	%r19, %rs2;
+	cvt.f32.bf16 	%r20, %rs1;
+$L__tmp3:
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	setp.gt.f32 	%p6, %r15, %r20;
+	setp.gt.f32 	%p7, %r16, %r19;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	selp.f32 	%r21, %r16, %r19, %p7;
+	selp.f32 	%r22, %r16, %r21, %p3;
+	selp.f32 	%r23, %r15, %r20, %p6;
+	selp.f32 	%r24, %r15, %r23, %p2;
+	.loc	2 196 19                        // triton_helpers.py:196:19 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	setp.eq.f32 	%p8, %r24, 0fFF800000;
+	setp.eq.f32 	%p9, %r22, 0fFF800000;
+	.loc	2 196 53                        // triton_helpers.py:196:53 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	sub.f32 	%r25, %r16, %r22;
+	sub.f32 	%r26, %r15, %r24;
+	mov.b32 	%r27, 0f3F000000;
+	mov.b32 	%r28, 0f3BBB989D;
+	.loc	2 173 29                        // triton_helpers.py:173:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	fma.rn.ftz.f32 	%r29, %r26, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r30, %r29;
+	mov.b32 	%r31, 0f4B400001;
+	mov.b32 	%r32, 0f437C0000;
+	fma.rm.ftz.f32 	%r33, %r30, %r32, %r31;
+	fma.rn.ftz.f32 	%r34, %r25, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r35, %r34;
+	fma.rm.ftz.f32 	%r36, %r35, %r32, %r31;
+	mov.b64 	%rd25, {%r33, %r36};
+	cvt.u32.u64 	%r37, %rd25;
+	add.f32 	%r38, %r33, 0fCB40007F;
+	neg.f32 	%r39, %r38;
+	mov.b32 	%r40, 0f3FB8AA3B;
+	fma.rn.ftz.f32 	%r41, %r26, %r40, %r39;
+	mov.b32 	%r42, 0f32A57060;
+	fma.rn.ftz.f32 	%r43, %r26, %r42, %r41;
+	ex2.approx.ftz.f32 	%r44, %r43;
+	add.f32 	%r45, %r36, 0fCB40007F;
+	neg.f32 	%r46, %r45;
+	fma.rn.ftz.f32 	%r47, %r25, %r40, %r46;
+	fma.rn.ftz.f32 	%r48, %r25, %r42, %r47;
+	shl.b64 	%rd26, %rd25, 23;
+	and.b64 	%rd27, %rd26, -36028797018963968;
+	shl.b32 	%r49, %r37, 23;
+	cvt.u64.u32 	%rd28, %r49;
+	or.b64 	%rd29, %rd28, %rd27;
+	ex2.approx.ftz.f32 	%r50, %r48;
+	mov.b64 	{%r51, %r52}, %rd29;
+	mul.f32 	%r53, %r44, %r51;
+	mul.f32 	%r54, %r50, %r52;
+	.loc	2 196 39                        // triton_helpers.py:196:39 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	selp.f32 	%r55, 0f3F800000, %r54, %p9;
+	selp.f32 	%r56, 0f3F800000, %r53, %p8;
+	.loc	2 199 53                        // triton_helpers.py:199:53 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	sub.f32 	%r57, %r19, %r22;
+	sub.f32 	%r58, %r20, %r24;
+	.loc	2 173 29                        // triton_helpers.py:173:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	fma.rn.ftz.f32 	%r59, %r58, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r60, %r59;
+	fma.rm.ftz.f32 	%r61, %r60, %r32, %r31;
+	fma.rn.ftz.f32 	%r62, %r57, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r63, %r62;
+	fma.rm.ftz.f32 	%r64, %r63, %r32, %r31;
+	mov.b64 	%rd30, {%r61, %r64};
+	cvt.u32.u64 	%r65, %rd30;
+	add.f32 	%r66, %r61, 0fCB40007F;
+	neg.f32 	%r67, %r66;
+	fma.rn.ftz.f32 	%r68, %r58, %r40, %r67;
+	fma.rn.ftz.f32 	%r69, %r58, %r42, %r68;
+	ex2.approx.ftz.f32 	%r70, %r69;
+	add.f32 	%r71, %r64, 0fCB40007F;
+	neg.f32 	%r72, %r71;
+	fma.rn.ftz.f32 	%r73, %r57, %r40, %r72;
+	fma.rn.ftz.f32 	%r74, %r57, %r42, %r73;
+	shl.b64 	%rd31, %rd30, 23;
+	and.b64 	%rd32, %rd31, -36028797018963968;
+	shl.b32 	%r75, %r65, 23;
+	cvt.u64.u32 	%rd33, %r75;
+	or.b64 	%rd34, %rd33, %rd32;
+	ex2.approx.ftz.f32 	%r76, %r74;
+	mov.b64 	{%r77, %r78}, %rd34;
+	mul.f32 	%r79, %r70, %r77;
+	mul.f32 	%r80, %r76, %r78;
+	.loc	2 199 39                        // triton_helpers.py:199:39 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	selp.f32 	%r81, 0f3F800000, %r80, %p9;
+	selp.f32 	%r82, 0f3F800000, %r79, %p8;
+	.loc	2 205 36                        // triton_helpers.py:205:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	mov.b64 	{%r83, %r84}, %rd65;
+	fma.rn.f32 	%r85, %r83, %r56, %r82;
+	fma.rn.f32 	%r86, %r84, %r55, %r81;
+$L__tmp4:
+	.loc	1 37 105                        // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:37:105
+	mov.b32 	{%rs3, %rs4}, %r11;
+	cvt.f32.bf16 	%r87, %rs4;
+	cvt.f32.bf16 	%r88, %rs3;
+$L__tmp5:
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	setp.gt.f32 	%p10, %r17, %r88;
+	setp.gt.f32 	%p11, %r18, %r87;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	selp.f32 	%r89, %r18, %r87, %p11;
+	selp.f32 	%r90, %r18, %r89, %p5;
+	selp.f32 	%r91, %r17, %r88, %p10;
+	selp.f32 	%r92, %r17, %r91, %p4;
+	.loc	2 196 19                        // triton_helpers.py:196:19 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	setp.eq.f32 	%p12, %r92, 0fFF800000;
+	setp.eq.f32 	%p13, %r90, 0fFF800000;
+	.loc	2 196 53                        // triton_helpers.py:196:53 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	sub.f32 	%r93, %r18, %r90;
+	sub.f32 	%r94, %r17, %r92;
+	.loc	2 173 29                        // triton_helpers.py:173:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	fma.rn.ftz.f32 	%r95, %r94, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r96, %r95;
+	fma.rm.ftz.f32 	%r97, %r96, %r32, %r31;
+	fma.rn.ftz.f32 	%r98, %r93, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r99, %r98;
+	fma.rm.ftz.f32 	%r100, %r99, %r32, %r31;
+	mov.b64 	%rd35, {%r97, %r100};
+	cvt.u32.u64 	%r101, %rd35;
+	add.f32 	%r102, %r97, 0fCB40007F;
+	neg.f32 	%r103, %r102;
+	fma.rn.ftz.f32 	%r104, %r94, %r40, %r103;
+	fma.rn.ftz.f32 	%r105, %r94, %r42, %r104;
+	ex2.approx.ftz.f32 	%r106, %r105;
+	add.f32 	%r107, %r100, 0fCB40007F;
+	neg.f32 	%r108, %r107;
+	fma.rn.ftz.f32 	%r109, %r93, %r40, %r108;
+	fma.rn.ftz.f32 	%r110, %r93, %r42, %r109;
+	shl.b64 	%rd36, %rd35, 23;
+	and.b64 	%rd37, %rd36, -36028797018963968;
+	shl.b32 	%r111, %r101, 23;
+	cvt.u64.u32 	%rd38, %r111;
+	or.b64 	%rd39, %rd38, %rd37;
+	ex2.approx.ftz.f32 	%r112, %r110;
+	mov.b64 	{%r113, %r114}, %rd39;
+	mul.f32 	%r115, %r106, %r113;
+	mul.f32 	%r116, %r112, %r114;
+	.loc	2 196 39                        // triton_helpers.py:196:39 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	selp.f32 	%r117, 0f3F800000, %r116, %p13;
+	selp.f32 	%r118, 0f3F800000, %r115, %p12;
+	.loc	2 199 53                        // triton_helpers.py:199:53 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	sub.f32 	%r119, %r87, %r90;
+	sub.f32 	%r120, %r88, %r92;
+	.loc	2 173 29                        // triton_helpers.py:173:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	fma.rn.ftz.f32 	%r121, %r120, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r122, %r121;
+	fma.rm.ftz.f32 	%r123, %r122, %r32, %r31;
+	fma.rn.ftz.f32 	%r124, %r119, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r125, %r124;
+	fma.rm.ftz.f32 	%r126, %r125, %r32, %r31;
+	mov.b64 	%rd40, {%r123, %r126};
+	cvt.u32.u64 	%r127, %rd40;
+	add.f32 	%r128, %r123, 0fCB40007F;
+	neg.f32 	%r129, %r128;
+	fma.rn.ftz.f32 	%r130, %r120, %r40, %r129;
+	fma.rn.ftz.f32 	%r131, %r120, %r42, %r130;
+	ex2.approx.ftz.f32 	%r132, %r131;
+	add.f32 	%r133, %r126, 0fCB40007F;
+	neg.f32 	%r134, %r133;
+	fma.rn.ftz.f32 	%r135, %r119, %r40, %r134;
+	fma.rn.ftz.f32 	%r136, %r119, %r42, %r135;
+	shl.b64 	%rd41, %rd40, 23;
+	and.b64 	%rd42, %rd41, -36028797018963968;
+	shl.b32 	%r137, %r127, 23;
+	cvt.u64.u32 	%rd43, %r137;
+	or.b64 	%rd44, %rd43, %rd42;
+	ex2.approx.ftz.f32 	%r138, %r136;
+	mov.b64 	{%r139, %r140}, %rd44;
+	mul.f32 	%r141, %r132, %r139;
+	mul.f32 	%r142, %r138, %r140;
+	.loc	2 199 39                        // triton_helpers.py:199:39 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	selp.f32 	%r143, 0f3F800000, %r142, %p13;
+	selp.f32 	%r144, 0f3F800000, %r141, %p12;
+	.loc	2 205 36                        // triton_helpers.py:205:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:42:40 ]
+	mov.b64 	{%r145, %r146}, %rd63;
+	fma.rn.f32 	%r147, %r145, %r118, %r144;
+	fma.rn.f32 	%r148, %r146, %r117, %r143;
+$L__tmp6:
+	.loc	1 45 54                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:45:54
+	selp.f32 	%r149, %r22, %r16, %p1;
+	selp.f32 	%r150, %r24, %r15, %p1;
+	mov.b64 	%rd66, {%r150, %r149};
+	selp.f32 	%r151, %r90, %r18, %p1;
+	selp.f32 	%r152, %r92, %r17, %p1;
+	mov.b64 	%rd64, {%r152, %r151};
+	.loc	1 46 54                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:46:54
+	selp.f32 	%r153, %r86, %r84, %p1;
+	selp.f32 	%r154, %r85, %r83, %p1;
+	mov.b64 	%rd65, {%r154, %r153};
+	selp.f32 	%r155, %r148, %r146, %p1;
+	selp.f32 	%r156, %r147, %r145, %p1;
+	mov.b64 	%rd63, {%r156, %r155};
+	.loc	1 31 40                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:31:40
+	add.s64 	%rd12, %rd62, 2048;
+	setp.lt.u64 	%p14, %rd62, 29952;
+	mov.b64 	%rd62, %rd12;
+	@%p14 bra 	$L__BB0_1;
+// %bb.2:
+	.loc	1 26 37                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:26:37
+	and.b32 	%r169, %r1, 31;
+$L__tmp7:
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	mov.b64 	{%r170, %r171}, %rd66;
+	setp.gt.f32 	%p21, %r170, %r171;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p22, %r170, %r170;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r172, %r170, %r171, %p22;
+	selp.f32 	%r173, %r170, %r172, %p21;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	mov.b64 	{%r174, %r175}, %rd64;
+	setp.gt.f32 	%p23, %r173, %r174;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p24, %r173, %r173;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r176, %r173, %r174, %p24;
+	selp.f32 	%r177, %r173, %r176, %p23;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p25, %r177, %r175;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p26, %r177, %r177;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r178, %r177, %r175, %p26;
+	selp.f32 	%r179, %r177, %r178, %p25;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r180, %r179, 16, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p27, %r179, %r180;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p28, %r179, %r179;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r181, %r179, %r180, %p27;
+	selp.f32 	%r182, %r179, %r181, %p28;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r183, %r182, 8, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p29, %r182, %r183;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p30, %r182, %r182;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r184, %r182, %r183, %p30;
+	selp.f32 	%r185, %r182, %r184, %p29;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r186, %r185, 4, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p31, %r185, %r186;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p32, %r185, %r185;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r187, %r185, %r186, %p32;
+	selp.f32 	%r188, %r185, %r187, %p31;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r189, %r188, 2, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p33, %r188, %r189;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p34, %r188, %r188;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r190, %r188, %r189, %p34;
+	selp.f32 	%r191, %r188, %r190, %p33;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r192, %r191, 1, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p35, %r191, %r192;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p36, %r191, %r191;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.eq.b32 	%p15, %r169, 0;
+	shr.u32 	%r193, %r1, 3;
+	and.b32 	%r194, %r193, 60;
+	mov.b32 	%r195, global_smem;
+	add.s32 	%r157, %r195, %r194;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.b32 	%r196, %r191, %r192, %p36;
+	selp.b32 	%r158, %r191, %r196, %p35;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	// begin inline asm
+	@%p15 st.shared.b32 [ %r157 + 0 ], %r158;
+	// end inline asm
+	bar.sync 	0;
+	setp.lt.u32 	%p16, %r1, 16;
+	add.s32 	%r160, %r195, %r5;
+	// begin inline asm
+	@%p16 ld.shared.b32 %r159, [ %r160 + 0 ];
+	// end inline asm
+	shfl.sync.bfly.b32 	%r198, %r159, 8, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p37, %r159, %r198;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p38, %r159, %r159;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r199, %r159, %r198, %p37;
+	selp.f32 	%r200, %r159, %r199, %p38;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r201, %r200, 4, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p39, %r200, %r201;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p40, %r200, %r200;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r202, %r200, %r201, %p40;
+	selp.f32 	%r203, %r200, %r202, %p39;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r204, %r203, 2, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p41, %r203, %r204;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p42, %r203, %r203;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r205, %r203, %r204, %p42;
+	selp.f32 	%r206, %r203, %r205, %p41;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r207, %r206, 1, 31, -1;
+	.loc	2 110 15                        // triton_helpers.py:110:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.gt.f32 	%p43, %r206, %r207;
+	.loc	2 112 21                        // triton_helpers.py:112:21 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.nan.f32 	%p44, %r206, %r206;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.eq.b32 	%p17, %r1, 0;
+	.loc	2 113 29                        // triton_helpers.py:113:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.b32 	%r208, %r206, %r207, %p44;
+	selp.b32 	%r162, %r206, %r208, %p43;
+	.loc	2 123 29                        // triton_helpers.py:123:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	// begin inline asm
+	@%p17 st.shared.b32 [ %r160 + 0 ], %r162;
+	// end inline asm
+	bar.sync 	0;
+	ld.shared.b32 	%r2, [global_smem];
+	.loc	2 180 40                        // triton_helpers.py:180:40 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	setp.eq.f32 	%p45, %r2, 0fFF800000;
+	.loc	2 180 68                        // triton_helpers.py:180:68 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	sub.f32 	%r209, %r170, %r2;
+	sub.f32 	%r210, %r171, %r2;
+	sub.f32 	%r211, %r174, %r2;
+	sub.f32 	%r212, %r175, %r2;
+	.loc	2 180 58                        // triton_helpers.py:180:58 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	selp.f32 	%r213, 0f00000000, %r209, %p45;
+	selp.f32 	%r214, 0f00000000, %r210, %p45;
+	selp.f32 	%r215, 0f00000000, %r211, %p45;
+	selp.f32 	%r216, 0f00000000, %r212, %p45;
+	.loc	2 173 29                        // triton_helpers.py:173:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	fma.rn.ftz.f32 	%r219, %r213, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r220, %r219;
+	fma.rm.ftz.f32 	%r223, %r220, %r32, %r31;
+	fma.rn.ftz.f32 	%r224, %r214, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r225, %r224;
+	fma.rm.ftz.f32 	%r226, %r225, %r32, %r31;
+	fma.rn.ftz.f32 	%r227, %r215, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r228, %r227;
+	fma.rm.ftz.f32 	%r229, %r228, %r32, %r31;
+	fma.rn.ftz.f32 	%r230, %r216, %r28, %r27;
+	cvt.ftz.sat.f32.f32 	%r231, %r230;
+	fma.rm.ftz.f32 	%r232, %r231, %r32, %r31;
+	mov.b64 	%rd46, {%r223, %r226};
+	cvt.u32.u64 	%r233, %rd46;
+	add.f32 	%r234, %r223, 0fCB40007F;
+	neg.f32 	%r235, %r234;
+	fma.rn.ftz.f32 	%r237, %r213, %r40, %r235;
+	fma.rn.ftz.f32 	%r239, %r213, %r42, %r237;
+	ex2.approx.ftz.f32 	%r240, %r239;
+	add.f32 	%r241, %r226, 0fCB40007F;
+	neg.f32 	%r242, %r241;
+	fma.rn.ftz.f32 	%r243, %r214, %r40, %r242;
+	fma.rn.ftz.f32 	%r244, %r214, %r42, %r243;
+	shl.b64 	%rd47, %rd46, 23;
+	and.b64 	%rd48, %rd47, -36028797018963968;
+	shl.b32 	%r245, %r233, 23;
+	cvt.u64.u32 	%rd49, %r245;
+	or.b64 	%rd50, %rd49, %rd48;
+	ex2.approx.ftz.f32 	%r246, %r244;
+	mov.b64 	{%r247, %r248}, %rd50;
+	mul.f32 	%r249, %r240, %r247;
+	mul.f32 	%r250, %r246, %r248;
+	.loc	2 181 31                        // triton_helpers.py:181:31 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	mov.b64 	{%r251, %r252}, %rd65;
+	mul.f32 	%r253, %r252, %r250;
+	.loc	2 173 29                        // triton_helpers.py:173:29 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	mov.b64 	%rd51, {%r229, %r232};
+	cvt.u32.u64 	%r254, %rd51;
+	add.f32 	%r255, %r229, 0fCB40007F;
+	neg.f32 	%r256, %r255;
+	fma.rn.ftz.f32 	%r257, %r215, %r40, %r256;
+	fma.rn.ftz.f32 	%r258, %r215, %r42, %r257;
+	ex2.approx.ftz.f32 	%r259, %r258;
+	add.f32 	%r260, %r232, 0fCB40007F;
+	neg.f32 	%r261, %r260;
+	fma.rn.ftz.f32 	%r262, %r216, %r40, %r261;
+	fma.rn.ftz.f32 	%r263, %r216, %r42, %r262;
+	shl.b64 	%rd52, %rd51, 23;
+	and.b64 	%rd53, %rd52, -36028797018963968;
+	shl.b32 	%r264, %r254, 23;
+	cvt.u64.u32 	%rd54, %r264;
+	or.b64 	%rd55, %rd54, %rd53;
+	ex2.approx.ftz.f32 	%r265, %r263;
+	mov.b64 	{%r266, %r267}, %rd55;
+	mul.f32 	%r268, %r265, %r267;
+	mul.f32 	%r269, %r259, %r266;
+	.loc	2 181 31                        // triton_helpers.py:181:31 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	mov.b64 	{%r270, %r271}, %rd63;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	bar.sync 	0;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	fma.rn.f32 	%r272, %r251, %r249, %r253;
+	fma.rn.f32 	%r273, %r270, %r269, %r272;
+	fma.rn.f32 	%r274, %r271, %r268, %r273;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r275, %r274, 16, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r276, %r274, %r275;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r277, %r276, 8, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r278, %r276, %r277;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r279, %r278, 4, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r280, %r278, %r279;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r281, %r280, 2, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r282, %r280, %r281;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r283, %r282, 1, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r164, %r282, %r283;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	// begin inline asm
+	@%p15 st.shared.b32 [ %r157 + 0 ], %r164;
+	// end inline asm
+	bar.sync 	0;
+	// begin inline asm
+	@%p16 ld.shared.b32 %r165, [ %r160 + 0 ];
+	// end inline asm
+	shfl.sync.bfly.b32 	%r284, %r165, 8, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r285, %r165, %r284;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r286, %r285, 4, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r287, %r285, %r286;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r288, %r287, 2, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r289, %r287, %r288;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	shfl.sync.bfly.b32 	%r290, %r289, 1, 31, -1;
+	.loc	3 261 15                        // standard.py:261:15 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	add.f32 	%r168, %r289, %r290;
+	.loc	3 291 36                        // standard.py:291:36 @[ cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:49:33 ]
+	// begin inline asm
+	@%p17 st.shared.b32 [ %r160 + 0 ], %r168;
+	// end inline asm
+	bar.sync 	0;
+	mov.b64 	%rd67, 0;
+	ld.shared.b32 	%r3, [global_smem];
+$L__tmp8:
+$L__BB0_3:                              // =>This Inner Loop Header: Depth=1
+	.loc	1 54 29                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:54:29
+	add.s64 	%rd60, %rd1, %rd67;
+	setp.lt.u64 	%p46, %rd60, 32000;
+	.loc	1 58 34                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:58:34
+	add.s64 	%rd61, %rd2, %rd67;
+	cvt.u32.u64 	%r299, %rd61;
+	mad.wide.s32 	%rd57, %r299, 2, %rd15;
+	.loc	1 58 52                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:58:52
+	// begin inline asm
+	mov.u64 %rd56, 0x0;
+	createpolicy.fractional.L2::evict_first.b64 %rd56, 1.0;
+	// end inline asm
+	mov.b32 	%r293, 0;
+	// begin inline asm
+	mov.u32 %r291, %r293;
+	mov.u32 %r292, %r293;
+	@%p46 ld.global.L1::evict_first.L2::cache_hint.v2.b32 { %r291, %r292 }, [ %rd57 + 0 ], %rd56;
+	// end inline asm
+	mov.b32 	{%rs5, %rs6}, %r291;
+	mov.b32 	{%rs7, %rs8}, %r292;
+	.loc	1 58 106                        // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:58:106
+	cvt.f32.bf16 	%r300, %rs5;
+	cvt.f32.bf16 	%r301, %rs6;
+	cvt.f32.bf16 	%r302, %rs7;
+	cvt.f32.bf16 	%r303, %rs8;
+	.loc	1 60 22                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:60:22
+	sub.f32 	%r304, %r300, %r2;
+	sub.f32 	%r305, %r301, %r2;
+	sub.f32 	%r306, %r302, %r2;
+	sub.f32 	%r307, %r303, %r2;
+	mov.b32 	%r308, 0f3F000000;
+	mov.b32 	%r309, 0f3BBB989D;
+	.loc	1 61 29                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:61:29
+	fma.rn.ftz.f32 	%r310, %r304, %r309, %r308;
+	cvt.ftz.sat.f32.f32 	%r311, %r310;
+	mov.b32 	%r312, 0f4B400001;
+	mov.b32 	%r313, 0f437C0000;
+	fma.rm.ftz.f32 	%r314, %r311, %r313, %r312;
+	add.f32 	%r315, %r314, 0fCB40007F;
+	neg.f32 	%r316, %r315;
+	mov.b32 	%r317, 0f3FB8AA3B;
+	fma.rn.ftz.f32 	%r318, %r304, %r317, %r316;
+	mov.b32 	%r319, 0f32A57060;
+	fma.rn.ftz.f32 	%r320, %r304, %r319, %r318;
+	shl.b32 	%r321, %r314, 23;
+	ex2.approx.ftz.f32 	%r322, %r320;
+	mul.f32 	%r323, %r322, %r321;
+	fma.rn.ftz.f32 	%r324, %r305, %r309, %r308;
+	cvt.ftz.sat.f32.f32 	%r325, %r324;
+	fma.rm.ftz.f32 	%r326, %r325, %r313, %r312;
+	add.f32 	%r327, %r326, 0fCB40007F;
+	neg.f32 	%r328, %r327;
+	fma.rn.ftz.f32 	%r329, %r305, %r317, %r328;
+	fma.rn.ftz.f32 	%r330, %r305, %r319, %r329;
+	shl.b32 	%r331, %r326, 23;
+	ex2.approx.ftz.f32 	%r332, %r330;
+	mul.f32 	%r333, %r332, %r331;
+	fma.rn.ftz.f32 	%r334, %r306, %r309, %r308;
+	cvt.ftz.sat.f32.f32 	%r335, %r334;
+	fma.rm.ftz.f32 	%r336, %r335, %r313, %r312;
+	add.f32 	%r337, %r336, 0fCB40007F;
+	neg.f32 	%r338, %r337;
+	fma.rn.ftz.f32 	%r339, %r306, %r317, %r338;
+	fma.rn.ftz.f32 	%r340, %r306, %r319, %r339;
+	shl.b32 	%r341, %r336, 23;
+	ex2.approx.ftz.f32 	%r342, %r340;
+	mul.f32 	%r343, %r342, %r341;
+	fma.rn.ftz.f32 	%r344, %r307, %r309, %r308;
+	cvt.ftz.sat.f32.f32 	%r345, %r344;
+	fma.rm.ftz.f32 	%r346, %r345, %r313, %r312;
+	add.f32 	%r347, %r346, 0fCB40007F;
+	neg.f32 	%r348, %r347;
+	fma.rn.ftz.f32 	%r349, %r307, %r317, %r348;
+	fma.rn.ftz.f32 	%r350, %r307, %r319, %r349;
+	shl.b32 	%r351, %r346, 23;
+	ex2.approx.ftz.f32 	%r352, %r350;
+	mul.f32 	%r353, %r352, %r351;
+	.loc	1 62 23                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:62:23
+	div.full.f32 	%r295, %r323, %r3;
+	div.full.f32 	%r296, %r333, %r3;
+	div.full.f32 	%r297, %r343, %r3;
+	div.full.f32 	%r298, %r353, %r3;
+	.loc	1 63 29                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:63:29
+	mad.wide.s32 	%rd59, %r299, 4, %rd16;
+	.loc	1 63 53                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:63:53
+	// begin inline asm
+	@%p46 st.global.v4.b32 [ %rd59 + 0 ], { %r295, %r296, %r297, %r298 };
+	// end inline asm
+	.loc	1 52 40                         // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:52:40
+	add.s64 	%rd14, %rd67, 2048;
+	setp.lt.u64 	%p48, %rd67, 29952;
+	mov.b64 	%rd67, %rd14;
+	@%p48 bra 	$L__BB0_3;
+// %bb.4:
+	.loc	1 52 4                          // cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py:52:4
+	ret;
+$L__tmp9:
+$L__func_end0:
+                                        // -- End function
+}
+	.file	1 "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py"
+	.file	2 "/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py"
+	.file	3 "/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py"
+	.section	.debug_abbrev
+	{
+.b8 1                                   // Abbreviation Code
+.b8 17                                  // DW_TAG_compile_unit
+.b8 1                                   // DW_CHILDREN_yes
+.b8 37                                  // DW_AT_producer
+.b8 8                                   // DW_FORM_string
+.b8 19                                  // DW_AT_language
+.b8 5                                   // DW_FORM_data2
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 16                                  // DW_AT_stmt_list
+.b8 6                                   // DW_FORM_data4
+.b8 27                                  // DW_AT_comp_dir
+.b8 8                                   // DW_FORM_string
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 2                                   // Abbreviation Code
+.b8 46                                  // DW_TAG_subprogram
+.b8 0                                   // DW_CHILDREN_no
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 32                                  // DW_AT_inline
+.b8 11                                  // DW_FORM_data1
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 3                                   // Abbreviation Code
+.b8 46                                  // DW_TAG_subprogram
+.b8 1                                   // DW_CHILDREN_yes
+.b8 17                                  // DW_AT_low_pc
+.b8 1                                   // DW_FORM_addr
+.b8 18                                  // DW_AT_high_pc
+.b8 1                                   // DW_FORM_addr
+.b8 49                                  // DW_AT_abstract_origin
+.b8 19                                  // DW_FORM_ref4
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 4                                   // Abbreviation Code
+.b8 29                                  // DW_TAG_inlined_subroutine
+.b8 0                                   // DW_CHILDREN_no
+.b8 49                                  // DW_AT_abstract_origin
+.b8 19                                  // DW_FORM_ref4
+.b8 17                                  // DW_AT_low_pc
+.b8 1                                   // DW_FORM_addr
+.b8 18                                  // DW_AT_high_pc
+.b8 1                                   // DW_FORM_addr
+.b8 88                                  // DW_AT_call_file
+.b8 11                                  // DW_FORM_data1
+.b8 89                                  // DW_AT_call_line
+.b8 11                                  // DW_FORM_data1
+.b8 87                                  // DW_AT_call_column
+.b8 11                                  // DW_FORM_data1
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 0                                   // EOM(3)
+	}
+	.section	.debug_info
+	{
+.b32 276                                // Length of Unit
+.b8 2                                   // DWARF version number
+.b8 0
+.b32 .debug_abbrev                      // Offset Into Abbrev. Section
+.b8 8                                   // Address Size (in bytes)
+.b8 1                                   // Abbrev [1] 0xb:0x10d DW_TAG_compile_unit
+.b8 116                                 // DW_AT_producer
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 0
+.b8 2                                   // DW_AT_language
+.b8 0
+.b8 99                                  // DW_AT_name
+.b8 118
+.b8 120
+.b8 116
+.b8 98
+.b8 120
+.b8 105
+.b8 99
+.b8 105
+.b8 99
+.b8 52
+.b8 111
+.b8 109
+.b8 98
+.b8 109
+.b8 50
+.b8 100
+.b8 50
+.b8 102
+.b8 108
+.b8 99
+.b8 122
+.b8 120
+.b8 121
+.b8 100
+.b8 107
+.b8 111
+.b8 108
+.b8 50
+.b8 108
+.b8 50
+.b8 102
+.b8 102
+.b8 114
+.b8 52
+.b8 113
+.b8 116
+.b8 111
+.b8 116
+.b8 51
+.b8 114
+.b8 114
+.b8 122
+.b8 110
+.b8 106
+.b8 122
+.b8 55
+.b8 98
+.b8 104
+.b8 110
+.b8 103
+.b8 102
+.b8 46
+.b8 112
+.b8 121
+.b8 0
+.b32 .debug_line                        // DW_AT_stmt_list
+.b8 47                                  // DW_AT_comp_dir
+.b8 119
+.b8 111
+.b8 114
+.b8 107
+.b8 115
+.b8 112
+.b8 97
+.b8 99
+.b8 101
+.b8 47
+.b8 104
+.b8 97
+.b8 110
+.b8 114
+.b8 117
+.b8 105
+.b8 47
+.b8 83
+.b8 112
+.b8 101
+.b8 99
+.b8 70
+.b8 111
+.b8 114
+.b8 103
+.b8 101
+.b8 45
+.b8 101
+.b8 120
+.b8 116
+.b8 47
+.b8 99
+.b8 97
+.b8 99
+.b8 104
+.b8 101
+.b8 47
+.b8 99
+.b8 111
+.b8 109
+.b8 112
+.b8 105
+.b8 108
+.b8 101
+.b8 100
+.b8 95
+.b8 107
+.b8 101
+.b8 114
+.b8 110
+.b8 101
+.b8 108
+.b8 115
+.b8 47
+.b8 118
+.b8 120
+.b8 0
+.b8 2                                   // Abbrev [2] 0x8b:0x46 DW_TAG_subprogram
+.b8 116                                 // DW_AT_name
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 95
+.b8 114
+.b8 101
+.b8 100
+.b8 95
+.b8 102
+.b8 117
+.b8 115
+.b8 101
+.b8 100
+.b8 95
+.b8 95
+.b8 115
+.b8 111
+.b8 102
+.b8 116
+.b8 109
+.b8 97
+.b8 120
+.b8 95
+.b8 95
+.b8 116
+.b8 111
+.b8 95
+.b8 99
+.b8 111
+.b8 112
+.b8 121
+.b8 95
+.b8 101
+.b8 120
+.b8 112
+.b8 95
+.b8 112
+.b8 114
+.b8 101
+.b8 112
+.b8 97
+.b8 114
+.b8 101
+.b8 95
+.b8 115
+.b8 111
+.b8 102
+.b8 116
+.b8 109
+.b8 97
+.b8 120
+.b8 95
+.b8 111
+.b8 110
+.b8 108
+.b8 105
+.b8 110
+.b8 101
+.b8 95
+.b8 115
+.b8 117
+.b8 98
+.b8 95
+.b8 48
+.b8 0
+.b8 1                                   // DW_AT_inline
+.b8 3                                   // Abbrev [3] 0xd1:0x46 DW_TAG_subprogram
+.b64 $L__func_begin0                    // DW_AT_low_pc
+.b64 $L__func_end0                      // DW_AT_high_pc
+.b32 139                                // DW_AT_abstract_origin
+.b8 4                                   // Abbrev [4] 0xe6:0x18 DW_TAG_inlined_subroutine
+.b32 139                                // DW_AT_abstract_origin
+.b64 $L__tmp1                           // DW_AT_low_pc
+.b64 $L__tmp6                           // DW_AT_high_pc
+.b8 1                                   // DW_AT_call_file
+.b8 42                                  // DW_AT_call_line
+.b8 40                                  // DW_AT_call_column
+.b8 4                                   // Abbrev [4] 0xfe:0x18 DW_TAG_inlined_subroutine
+.b32 139                                // DW_AT_abstract_origin
+.b64 $L__tmp7                           // DW_AT_low_pc
+.b64 $L__tmp8                           // DW_AT_high_pc
+.b8 1                                   // DW_AT_call_file
+.b8 49                                  // DW_AT_call_line
+.b8 33                                  // DW_AT_call_column
+.b8 0                                   // End Of Children Mark
+.b8 0                                   // End Of Children Mark
+	}
+	.section	.debug_macinfo	{	}
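
Note on the kernel above: judging by its name (`triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0`) and its two loops over 32000 elements in steps of 2048, the PTX appears to compute a row-wise softmax in the "online" style. A first pass keeps a running maximum and a rescaled running sum, and a second pass writes `exp(x - max) / sum`. The plain-Python sketch below illustrates that technique only. The function names and the `chunk` parameter are hypothetical, inputs are assumed finite, and it is not the generated kernel or the Inductor helper code.

```python
import math


def combine_partial_states(max_a, sum_a, max_b, sum_b):
    """Merge two (running max, running sum) partial softmax states.

    Each sum is expressed relative to its own max, so both are rescaled
    to the merged max before being added.
    """
    new_max = max(max_a, max_b)
    if new_max == -math.inf:
        # Both states are empty; avoid computing (-inf) - (-inf).
        return new_max, 0.0
    scale_a = math.exp(max_a - new_max)  # exp(-inf) evaluates to 0.0
    scale_b = math.exp(max_b - new_max)
    return new_max, sum_a * scale_a + sum_b * scale_b


def online_softmax(row, chunk=4):
    """Softmax of one row; the normalizer is built chunk by chunk."""
    running_max, running_sum = -math.inf, 0.0
    # Pass 1: fold each chunk into the running (max, sum) state.
    for start in range(0, len(row), chunk):
        block = row[start:start + chunk]
        block_max = max(block)
        block_sum = sum(math.exp(v - block_max) for v in block)
        running_max, running_sum = combine_partial_states(
            running_max, running_sum, block_max, block_sum
        )
    # Pass 2: normalize every element with the final max and sum.
    return [math.exp(v - running_max) / running_sum for v in row]


def reference_softmax(row):
    """Conventional two-step softmax, used only for comparison."""
    row_max = max(row)
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    return [e / total for e in exps]
```

Calling `online_softmax(row, chunk=2)` on a short list should agree with `reference_softmax(row)` up to floating-point rounding, whatever chunk size is chosen.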

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.source ADDED

	@@ -0,0 +1,449 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":18:0)
+#loc48 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":186:0)
+#loc62 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":109:0)
+#loc68 = loc(unknown)
+#loc72 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":86:0)
+#loc76 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":63:0)
+#loc81 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":169:0)
+#loc85 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":177:0)
+#loc96 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":122:0)
+#loc100 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":285:0)
+#loc104 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":260:0)
+#loc108 = loc("in_ptr0"(#loc))
+#loc109 = loc("out_ptr2"(#loc))
+#loc110 = loc("xnumel"(#loc))
+#loc111 = loc("r0_numel"(#loc))
+#loc146 = loc("lhs_max"(#loc48))
+#loc147 = loc("lhs_sum"(#loc48))
+#loc148 = loc("rhs_max"(#loc48))
+#loc160 = loc("a"(#loc62))
+#loc161 = loc("b"(#loc62))
+#loc165 = loc("x"(#loc72))
+#loc166 = loc("x"(#loc76))
+#loc167 = loc("x"(#loc81))
+#loc168 = loc("lhs_max"(#loc85))
+#loc169 = loc("lhs_sum"(#loc85))
+#loc178 = loc("a"(#loc96))
+#loc179 = loc("input"(#loc100))
+#loc180 = loc("a"(#loc104))
+#loc181 = loc("b"(#loc104))
+module {
+  tt.func public @triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0(%in_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr2: !tt.ptr<f32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %xnumel_0 = arith.constant 4096 : i32 loc(#loc112)
+    %r0_numel_1 = arith.constant 32000 : i32 loc(#loc113)
+    %xoffset = tt.get_program_id x : i32 loc(#loc114)
+    %xoffset_2 = arith.constant 1 : i32 loc(#loc115)
+    %xoffset_3 = arith.constant 1 : i32 loc(#loc115)
+    %xoffset_4 = arith.muli %xoffset, %xoffset_3 : i32 loc(#loc115)
+    %xindex = tt.make_range {end = 1 : i32, start = 0 : i32} : tensor<1xi32> loc(#loc116)
+    %xindex_5 = tt.expand_dims %xindex {axis = 1 : i32} : tensor<1xi32> -> tensor<1x1xi32> loc(#loc117)
+    %xindex_6 = tt.splat %xoffset_4 : i32 -> tensor<1x1xi32> loc(#loc118)
+    %xindex_7 = arith.addi %xindex_6, %xindex_5 : tensor<1x1xi32> loc(#loc118)
+    %xmask = arith.constant true loc(#loc119)
+    %xmask_8 = arith.constant dense<true> : tensor<1x2048xi1> loc(#loc119)
+    %r0_base = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> loc(#loc120)
+    %r0_base_9 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32> loc(#loc121)
+    %_tmp3_max = arith.constant 0xFF800000 : f32 loc(#loc122)
+    %_tmp3_max_10 = arith.constant dense<0xFF800000> : tensor<1x2048xf32> loc(#loc122)
+    %_tmp3_sum = tt.call @"triton.language.standard.zeros____(0, 0)cconstexpr_1__(0, 1)cconstexpr_2048__(1,)cconstexpr_fp32_"() : () -> tensor<1x2048xf32> loc(#loc123)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc13)
+    %c2048_i32 = arith.constant 2048 : i32 loc(#loc13)
+    %0 = arith.bitcast %c0_i32 : i32 to i32 loc(#loc13)
+    %1 = arith.bitcast %r0_numel_1 : i32 to i32 loc(#loc13)
+    %2 = arith.bitcast %c2048_i32 : i32 to i32 loc(#loc13)
+    %3 = ub.poison : i32 loc(#loc13)
+    %_tmp3_sum_11:2 = scf.for %r0_offset = %0 to %1 step %2 iter_args(%_tmp3_max_14 = %_tmp3_max_10, %_tmp3_sum_15 = %_tmp3_sum) -> (tensor<1x2048xf32>, tensor<1x2048xf32>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x2048xi32> loc(#loc125)
+      %r0_index_16 = arith.addi %r0_index, %r0_base_9 : tensor<1x2048xi32> loc(#loc125)
+      %r0_mask = arith.constant dense<32000> : tensor<1x2048xi32> loc(#loc126)
+      %r0_mask_17 = arith.cmpi slt, %r0_index_16, %r0_mask : tensor<1x2048xi32> loc(#loc126)
+      %tmp0 = arith.constant 32000 : i32 loc(#loc127)
+      %tmp0_18 = arith.constant 32000 : i32 loc(#loc127)
+      %tmp0_19 = arith.constant dense<32000> : tensor<1x1xi32> loc(#loc127)
+      %tmp0_20 = arith.muli %tmp0_19, %xindex_7 : tensor<1x1xi32> loc(#loc127)
+      %tmp0_21 = tt.broadcast %tmp0_20 : tensor<1x1xi32> -> tensor<1x2048xi32> loc(#loc128)
+      %tmp0_22 = arith.addi %r0_index_16, %tmp0_21 : tensor<1x2048xi32> loc(#loc128)
+      %tmp0_23 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1x2048x!tt.ptr<bf16>> loc(#loc129)
+      %tmp0_24 = tt.addptr %tmp0_23, %tmp0_22 : tensor<1x2048x!tt.ptr<bf16>>, tensor<1x2048xi32> loc(#loc129)
+      %tmp0_25 = arith.constant 0.000000e+00 : f32 loc(#loc130)
+      %tmp0_26 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32> loc(#loc130)
+      %tmp0_27 = arith.truncf %tmp0_26 : tensor<1x2048xf32> to tensor<1x2048xbf16> loc(#loc130)
+      %tmp0_28 = tt.load %tmp0_24, %r0_mask_17, %tmp0_27 evictionPolicy = evict_last : tensor<1x2048x!tt.ptr<bf16>> loc(#loc130)
+      %tmp0_29 = arith.extf %tmp0_28 : tensor<1x2048xbf16> to tensor<1x2048xf32> loc(#loc131)
+      %9:2 = tt.call @"torch._inductor.runtime.triton_helpers.online_softmax_combine__fp32S1_2048S_fp32S1_2048S_fp32S1_2048S__(3,)cconstexpr_False_"(%_tmp3_max_14, %_tmp3_sum_15, %tmp0_29) : (tensor<1x2048xf32>, tensor<1x2048xf32>, tensor<1x2048xf32>) -> (tensor<1x2048xf32>, tensor<1x2048xf32>) loc(#loc21)
+      %_tmp3_max_30 = arith.select %r0_mask_17, %9#0, %_tmp3_max_14 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc132)
+      %_tmp3_sum_31 = arith.select %r0_mask_17, %9#1, %_tmp3_sum_15 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc133)
+      scf.yield %_tmp3_max_30, %_tmp3_sum_31 : tensor<1x2048xf32>, tensor<1x2048xf32> loc(#loc24)
+    } loc(#loc182)
+    %4:2 = tt.call @"torch._inductor.runtime.triton_helpers.online_softmax_reduce__fp32S1_2048S_fp32S1_2048S__(2,)cconstexpr_1__(3,)cconstexpr_False_"(%_tmp3_sum_11#0, %_tmp3_sum_11#1) : (tensor<1x2048xf32>, tensor<1x2048xf32>) -> (tensor<1xf32>, tensor<1xf32>) loc(#loc25)
+    %tmp3 = tt.expand_dims %4#0 {axis = 1 : i32} : tensor<1xf32> -> tensor<1x1xf32> loc(#loc134)
+    %tmp4 = tt.expand_dims %4#1 {axis = 1 : i32} : tensor<1xf32> -> tensor<1x1xf32> loc(#loc135)
+    %c0_i32_12 = arith.constant 0 : i32 loc(#loc28)
+    %c2048_i32_13 = arith.constant 2048 : i32 loc(#loc28)
+    %5 = arith.bitcast %c0_i32_12 : i32 to i32 loc(#loc28)
+    %6 = arith.bitcast %r0_numel_1 : i32 to i32 loc(#loc28)
+    %7 = arith.bitcast %c2048_i32_13 : i32 to i32 loc(#loc28)
+    %8 = ub.poison : i32 loc(#loc28)
+    scf.for %r0_offset = %5 to %6 step %7  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x2048xi32> loc(#loc136)
+      %r0_index_14 = arith.addi %r0_index, %r0_base_9 : tensor<1x2048xi32> loc(#loc136)
+      %r0_mask = arith.constant dense<32000> : tensor<1x2048xi32> loc(#loc137)
+      %r0_mask_15 = arith.cmpi slt, %r0_index_14, %r0_mask : tensor<1x2048xi32> loc(#loc137)
+      %tmp5 = arith.constant 32000 : i32 loc(#loc138)
+      %tmp5_16 = arith.constant 32000 : i32 loc(#loc138)
+      %tmp5_17 = arith.constant dense<32000> : tensor<1x1xi32> loc(#loc138)
+      %tmp5_18 = arith.muli %tmp5_17, %xindex_7 : tensor<1x1xi32> loc(#loc138)
+      %tmp5_19 = tt.broadcast %tmp5_18 : tensor<1x1xi32> -> tensor<1x2048xi32> loc(#loc139)
+      %tmp5_20 = arith.addi %r0_index_14, %tmp5_19 : tensor<1x2048xi32> loc(#loc139)
+      %tmp5_21 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1x2048x!tt.ptr<bf16>> loc(#loc140)
+      %tmp5_22 = tt.addptr %tmp5_21, %tmp5_20 : tensor<1x2048x!tt.ptr<bf16>>, tensor<1x2048xi32> loc(#loc140)
+      %tmp5_23 = arith.constant 0.000000e+00 : f32 loc(#loc141)
+      %tmp5_24 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32> loc(#loc141)
+      %tmp5_25 = arith.truncf %tmp5_24 : tensor<1x2048xf32> to tensor<1x2048xbf16> loc(#loc141)
+      %tmp5_26 = tt.load %tmp5_22, %r0_mask_15, %tmp5_25 evictionPolicy = evict_first : tensor<1x2048x!tt.ptr<bf16>> loc(#loc141)
+      %tmp5_27 = arith.extf %tmp5_26 : tensor<1x2048xbf16> to tensor<1x2048xf32> loc(#loc142)
+      %tmp7 = tt.broadcast %tmp3 : tensor<1x1xf32> -> tensor<1x2048xf32> loc(#loc143)
+      %tmp7_28 = arith.subf %tmp5_27, %tmp7 : tensor<1x2048xf32> loc(#loc143)
+      %tmp8 = tt.extern_elementwise %tmp7_28 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc144)
+      %tmp9 = tt.broadcast %tmp4 : tensor<1x1xf32> -> tensor<1x2048xf32> loc(#loc145)
+      %tmp9_29 = arith.divf %tmp8, %tmp9 : tensor<1x2048xf32> loc(#loc145)
+      %c32000_i32 = arith.constant 32000 : i32 loc(#loc39)
+      %c32000_i32_30 = arith.constant 32000 : i32 loc(#loc39)
+      %cst = arith.constant dense<32000> : tensor<1x1xi32> loc(#loc39)
+      %9 = arith.muli %cst, %xindex_7 : tensor<1x1xi32> loc(#loc39)
+      %10 = tt.broadcast %9 : tensor<1x1xi32> -> tensor<1x2048xi32> loc(#loc40)
+      %11 = arith.addi %r0_index_14, %10 : tensor<1x2048xi32> loc(#loc40)
+      %12 = tt.splat %out_ptr2 : !tt.ptr<f32> -> tensor<1x2048x!tt.ptr<f32>> loc(#loc41)
+      %13 = tt.addptr %12, %11 : tensor<1x2048x!tt.ptr<f32>>, tensor<1x2048xi32> loc(#loc41)
+      tt.store %13, %tmp9_29, %r0_mask_15 : tensor<1x2048x!tt.ptr<f32>> loc(#loc42)
+    } loc(#loc28)
+    tt.return loc(#loc43)
+  } loc(#loc)
+  tt.func private @"triton.language.standard.zeros____(0, 0)cconstexpr_1__(0, 1)cconstexpr_2048__(1,)cconstexpr_fp32_"() -> tensor<1x2048xf32> attributes {noinline = false} {
+    %cst = arith.constant 0.000000e+00 : f32 loc(#loc45)
+    %cst_0 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32> loc(#loc45)
+    tt.return %cst_0 : tensor<1x2048xf32> loc(#loc46)
+  ^bb1:  // no predecessors
+    %0 = ub.poison : tensor<1x2048xf32> loc(#loc47)
+    tt.return %0 : tensor<1x2048xf32> loc(#loc47)
+  } loc(#loc44)
+  tt.func private @"torch._inductor.runtime.triton_helpers.online_softmax_combine__fp32S1_2048S_fp32S1_2048S_fp32S1_2048S__(3,)cconstexpr_False_"(%lhs_max: tensor<1x2048xf32> loc("lhs_max"(#loc48)), %lhs_sum: tensor<1x2048xf32> loc("lhs_sum"(#loc48)), %rhs_max: tensor<1x2048xf32> loc("rhs_max"(#loc48))) -> (tensor<1x2048xf32>, tensor<1x2048xf32>) attributes {noinline = false} {
+    %out_max = tt.call @torch._inductor.runtime.triton_helpers.maximum__fp32S1_2048S_fp32S1_2048S__(%lhs_max, %rhs_max) : (tensor<1x2048xf32>, tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc149)
+    %lhs_scale = arith.constant 0xFF800000 : f32 loc(#loc150)
+    %lhs_scale_0 = arith.constant dense<0xFF800000> : tensor<1x2048xf32> loc(#loc150)
+    %lhs_scale_1 = arith.cmpf oeq, %out_max, %lhs_scale_0 : tensor<1x2048xf32> loc(#loc150)
+    %lhs_scale_2 = arith.subf %lhs_max, %out_max : tensor<1x2048xf32> loc(#loc151)
+    %lhs_scale_3 = tt.call @"torch._inductor.runtime.triton_helpers.exp__fp32S1_2048S__(1,)cconstexpr_False_"(%lhs_scale_2) : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc152)
+    %lhs_scale_4 = arith.constant 1.000000e+00 : f32 loc(#loc153)
+    %lhs_scale_5 = arith.constant 1.000000e+00 : f32 loc(#loc153)
+    %lhs_scale_6 = arith.constant dense<1.000000e+00> : tensor<1x2048xf32> loc(#loc153)
+    %lhs_scale_7 = arith.select %lhs_scale_1, %lhs_scale_6, %lhs_scale_3 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc153)
+    %rhs_scale = arith.constant 0xFF800000 : f32 loc(#loc154)
+    %rhs_scale_8 = arith.constant dense<0xFF800000> : tensor<1x2048xf32> loc(#loc154)
+    %rhs_scale_9 = arith.cmpf oeq, %out_max, %rhs_scale_8 : tensor<1x2048xf32> loc(#loc154)
+    %rhs_scale_10 = arith.subf %rhs_max, %out_max : tensor<1x2048xf32> loc(#loc155)
+    %rhs_scale_11 = tt.call @"torch._inductor.runtime.triton_helpers.exp__fp32S1_2048S__(1,)cconstexpr_False_"(%rhs_scale_10) : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc156)
+    %rhs_scale_12 = arith.constant 1.000000e+00 : f32 loc(#loc157)
+    %rhs_scale_13 = arith.constant 1.000000e+00 : f32 loc(#loc157)
+    %rhs_scale_14 = arith.constant dense<1.000000e+00> : tensor<1x2048xf32> loc(#loc157)
+    %rhs_scale_15 = arith.select %rhs_scale_9, %rhs_scale_14, %rhs_scale_11 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc157)
+    %out_sum = arith.mulf %lhs_sum, %lhs_scale_7 : tensor<1x2048xf32> loc(#loc158)
+    %out_sum_16 = arith.addf %out_sum, %rhs_scale_15 : tensor<1x2048xf32> loc(#loc159)
+    tt.return %out_max, %out_sum_16 : tensor<1x2048xf32>, tensor<1x2048xf32> loc(#loc60)
+  ^bb1:  // no predecessors
+    %0 = ub.poison : tensor<1x2048xf32> loc(#loc61)
+    %1 = ub.poison : tensor<1x2048xf32> loc(#loc61)
+    tt.return %0, %1 : tensor<1x2048xf32>, tensor<1x2048xf32> loc(#loc61)
+  } loc(#loc48)
+  tt.func private @torch._inductor.runtime.triton_helpers.maximum__fp32S1_2048S_fp32S1_2048S__(%a: tensor<1x2048xf32> loc("a"(#loc62)), %b: tensor<1x2048xf32> loc("b"(#loc62))) -> tensor<1x2048xf32> attributes {noinline = false} {
+    %mask = arith.cmpf ogt, %a, %b : tensor<1x2048xf32> loc(#loc183)
+    %0 = tt.call @torch._inductor.runtime.triton_helpers.is_floating__fp32S1_2048S__(%a) : (tensor<1x2048xf32>) -> i1 loc(#loc64)
+    %1 = scf.if %0 -> (tensor<1x2048xi1>) {
+      %mask_0 = arith.cmpf une, %a, %a : tensor<1x2048xf32> loc(#loc163)
+      %mask_1 = arith.ori %mask, %mask_0 : tensor<1x2048xi1> loc(#loc184)
+      scf.yield %mask_1 : tensor<1x2048xi1> loc(#loc184)
+    } else {
+      scf.yield %mask : tensor<1x2048xi1> loc(#loc68)
+    } loc(#loc65)
+    %2 = arith.select %1, %a, %b : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc69)
+    tt.return %2 : tensor<1x2048xf32> loc(#loc70)
+  ^bb1:  // no predecessors
+    %3 = ub.poison : tensor<1x2048xf32> loc(#loc71)
+    tt.return %3 : tensor<1x2048xf32> loc(#loc71)
+  } loc(#loc62)
+  tt.func private @torch._inductor.runtime.triton_helpers.is_floating__fp32S1_2048S__(%x: tensor<1x2048xf32> loc("x"(#loc72))) -> i1 attributes {noinline = false} {
+    %0 = tt.call @torch._inductor.runtime.triton_helpers.promote_to_tensor__fp32S1_2048S__(%x) : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc73)
+    %true = arith.constant true loc(#loc74)
+    tt.return %true : i1 loc(#loc74)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : i1 loc(#loc75)
+    tt.return %1 : i1 loc(#loc75)
+  } loc(#loc72)
+  tt.func private @torch._inductor.runtime.triton_helpers.promote_to_tensor__fp32S1_2048S__(%x: tensor<1x2048xf32> loc("x"(#loc76))) -> tensor<1x2048xf32> attributes {noinline = false} {
+    %0 = tt.call @"triton.language.standard.zeros____(0, 0)cconstexpr_1__(1,)cconstexpr_int1_"() : () -> tensor<1xi1> loc(#loc77)
+    %1 = arith.uitofp %0 : tensor<1xi1> to tensor<1xf32> loc(#loc78)
+    %2 = tt.expand_dims %1 {axis = 0 : i32} : tensor<1xf32> -> tensor<1x1xf32> loc(#loc78)
+    %3 = tt.broadcast %2 : tensor<1x1xf32> -> tensor<1x2048xf32> loc(#loc78)
+    %4 = arith.addf %x, %3 : tensor<1x2048xf32> loc(#loc78)
+    tt.return %4 : tensor<1x2048xf32> loc(#loc79)
+  ^bb1:  // no predecessors
+    %5 = ub.poison : tensor<1x2048xf32> loc(#loc80)
+    tt.return %5 : tensor<1x2048xf32> loc(#loc80)
+  } loc(#loc76)
+  tt.func private @"triton.language.standard.zeros____(0, 0)cconstexpr_1__(1,)cconstexpr_int1_"() -> tensor<1xi1> attributes {noinline = false} {
+    %false = arith.constant false loc(#loc45)
+    %cst = arith.constant dense<false> : tensor<1xi1> loc(#loc45)
+    tt.return %cst : tensor<1xi1> loc(#loc46)
+  ^bb1:  // no predecessors
+    %0 = ub.poison : tensor<1xi1> loc(#loc47)
+    tt.return %0 : tensor<1xi1> loc(#loc47)
+  } loc(#loc44)
+  tt.func private @"torch._inductor.runtime.triton_helpers.exp__fp32S1_2048S__(1,)cconstexpr_False_"(%x: tensor<1x2048xf32> loc("x"(#loc81))) -> tensor<1x2048xf32> attributes {noinline = false} {
+    %0 = tt.extern_elementwise %x {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc82)
+    tt.return %0 : tensor<1x2048xf32> loc(#loc83)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : tensor<1x2048xf32> loc(#loc84)
+    tt.return %1 : tensor<1x2048xf32> loc(#loc84)
+  } loc(#loc81)
+  tt.func private @"torch._inductor.runtime.triton_helpers.online_softmax_reduce__fp32S1_2048S_fp32S1_2048S__(2,)cconstexpr_1__(3,)cconstexpr_False_"(%lhs_max: tensor<1x2048xf32> loc("lhs_max"(#loc85)), %lhs_sum: tensor<1x2048xf32> loc("lhs_sum"(#loc85))) -> (tensor<1xf32>, tensor<1xf32>) attributes {noinline = false} {
+    %out_max = tt.call @"torch._inductor.runtime.triton_helpers.max2__fp32S1_2048S__(1,)cconstexpr_1_"(%lhs_max) : (tensor<1x2048xf32>) -> tensor<1xf32> loc(#loc170)
+    %out_max_keepdim = tt.expand_dims %out_max {axis = 1 : i32} : tensor<1xf32> -> tensor<1x1xf32> loc(#loc171)
+    %delta = arith.constant 0xFF800000 : f32 loc(#loc172)
+    %delta_0 = arith.constant dense<0xFF800000> : tensor<1x1xf32> loc(#loc172)
+    %delta_1 = arith.cmpf oeq, %out_max_keepdim, %delta_0 : tensor<1x1xf32> loc(#loc172)
+    %delta_2 = tt.broadcast %out_max_keepdim : tensor<1x1xf32> -> tensor<1x2048xf32> loc(#loc173)
+    %delta_3 = arith.subf %lhs_max, %delta_2 : tensor<1x2048xf32> loc(#loc173)
+    %delta_4 = arith.constant 0 : i32 loc(#loc174)
+    %delta_5 = arith.constant 0.000000e+00 : f32 loc(#loc174)
+    %delta_6 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32> loc(#loc174)
+    %delta_7 = tt.broadcast %delta_1 : tensor<1x1xi1> -> tensor<1x2048xi1> loc(#loc174)
+    %delta_8 = arith.select %delta_7, %delta_6, %delta_3 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc174)
+    %out_sum = tt.call @"torch._inductor.runtime.triton_helpers.exp__fp32S1_2048S__(1,)cconstexpr_False_"(%delta_8) : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc175)
+    %out_sum_9 = arith.mulf %lhs_sum, %out_sum : tensor<1x2048xf32> loc(#loc176)
+    %out_sum_10 = tt.call @"triton.language.standard.sum__fp32S1_2048S__(1,)cconstexpr_1__(2,)cconstexpr_False__(3,)cNone"(%out_sum_9) : (tensor<1x2048xf32>) -> tensor<1xf32> loc(#loc177)
+    tt.return %out_max, %out_sum_10 : tensor<1xf32>, tensor<1xf32> loc(#loc94)
+  ^bb1:  // no predecessors
+    %0 = ub.poison : tensor<1xf32> loc(#loc95)
+    %1 = ub.poison : tensor<1xf32> loc(#loc95)
+    tt.return %0, %1 : tensor<1xf32>, tensor<1xf32> loc(#loc95)
+  } loc(#loc85)
+  tt.func private @"torch._inductor.runtime.triton_helpers.max2__fp32S1_2048S__(1,)cconstexpr_1_"(%a: tensor<1x2048xf32> loc("a"(#loc96))) -> tensor<1xf32> attributes {noinline = false} {
+    %0 = "tt.reduce"(%a) <{axis = 1 : i32}> ({
+    ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)):
+      %2 = tt.call @torch._inductor.runtime.triton_helpers.maximum__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc97)
+      tt.reduce.return %2 : f32 loc(#loc97)
+    }) : (tensor<1x2048xf32>) -> tensor<1xf32> loc(#loc97)
+    tt.return %0 : tensor<1xf32> loc(#loc98)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : tensor<1xf32> loc(#loc99)
+    tt.return %1 : tensor<1xf32> loc(#loc99)
+  } loc(#loc96)
+  tt.func private @torch._inductor.runtime.triton_helpers.maximum__fp32_fp32__(%a: f32 loc("a"(#loc62)), %b: f32 loc("b"(#loc62))) -> f32 attributes {noinline = false} {
+    %mask = arith.cmpf ogt, %a, %b : f32 loc(#loc183)
+    %0 = tt.call @torch._inductor.runtime.triton_helpers.is_floating__fp32__(%a) : (f32) -> i1 loc(#loc64)
+    %1 = scf.if %0 -> (i1) {
+      %mask_0 = arith.cmpf une, %a, %a : f32 loc(#loc163)
+      %mask_1 = arith.ori %mask, %mask_0 : i1 loc(#loc184)
+      scf.yield %mask_1 : i1 loc(#loc184)
+    } else {
+      scf.yield %mask : i1 loc(#loc68)
+    } loc(#loc65)
+    %2 = arith.select %1, %a, %b : f32 loc(#loc69)
+    tt.return %2 : f32 loc(#loc70)
+  ^bb1:  // no predecessors
+    %3 = ub.poison : f32 loc(#loc71)
+    tt.return %3 : f32 loc(#loc71)
+  } loc(#loc62)
+  tt.func private @torch._inductor.runtime.triton_helpers.is_floating__fp32__(%x: f32 loc("x"(#loc72))) -> i1 attributes {noinline = false} {
+    %0 = tt.call @torch._inductor.runtime.triton_helpers.promote_to_tensor__fp32__(%x) : (f32) -> tensor<1xf32> loc(#loc73)
+    %true = arith.constant true loc(#loc74)
+    tt.return %true : i1 loc(#loc74)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : i1 loc(#loc75)
+    tt.return %1 : i1 loc(#loc75)
+  } loc(#loc72)
+  tt.func private @torch._inductor.runtime.triton_helpers.promote_to_tensor__fp32__(%x: f32 loc("x"(#loc76))) -> tensor<1xf32> attributes {noinline = false} {
+    %0 = tt.call @"triton.language.standard.zeros____(0, 0)cconstexpr_1__(1,)cconstexpr_int1_"() : () -> tensor<1xi1> loc(#loc77)
+    %1 = arith.uitofp %0 : tensor<1xi1> to tensor<1xf32> loc(#loc78)
+    %2 = tt.splat %x : f32 -> tensor<1xf32> loc(#loc78)
+    %3 = arith.addf %2, %1 : tensor<1xf32> loc(#loc78)
+    tt.return %3 : tensor<1xf32> loc(#loc79)
+  ^bb1:  // no predecessors
+    %4 = ub.poison : tensor<1xf32> loc(#loc80)
+    tt.return %4 : tensor<1xf32> loc(#loc80)
+  } loc(#loc76)
+  tt.func private @"triton.language.standard.sum__fp32S1_2048S__(1,)cconstexpr_1__(2,)cconstexpr_False__(3,)cNone"(%input: tensor<1x2048xf32> loc("input"(#loc100))) -> tensor<1xf32> attributes {noinline = false} {
+    %0 = "tt.reduce"(%input) <{axis = 1 : i32}> ({
+    ^bb0(%arg1: f32 loc(unknown), %arg2: f32 loc(unknown)):
+      %2 = tt.call @triton.language.standard._sum_combine__fp32_fp32__(%arg1, %arg2) : (f32, f32) -> f32 loc(#loc101)
+      tt.reduce.return %2 : f32 loc(#loc101)
+    }) : (tensor<1x2048xf32>) -> tensor<1xf32> loc(#loc101)
+    tt.return %0 : tensor<1xf32> loc(#loc102)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : tensor<1xf32> loc(#loc103)
+    tt.return %1 : tensor<1xf32> loc(#loc103)
+  } loc(#loc100)
+  tt.func private @triton.language.standard._sum_combine__fp32_fp32__(%a: f32 loc("a"(#loc104)), %b: f32 loc("b"(#loc104))) -> f32 attributes {noinline = false} {
+    %0 = arith.addf %a, %b : f32 loc(#loc105)
+    tt.return %0 : f32 loc(#loc106)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : f32 loc(#loc107)
+    tt.return %1 : f32 loc(#loc107)
+  } loc(#loc104)
+} loc(#loc)
+#loc1 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":19:13)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":20:15)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":23:28)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":23:33)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":24:36)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":24:44)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":24:23)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":25:46)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":26:27)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":26:37)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":29:59)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":30:45)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":31:40)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":32:31)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":33:29)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:47)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:41)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:34)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:52)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:105)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":42:40)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":45:54)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":46:54)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":46:8)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":49:33)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":50:16)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":51:16)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":52:40)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":53:31)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":54:29)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:47)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:41)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:34)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:52)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:106)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":60:22)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":61:29)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":62:23)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:42)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:36)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:29)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:53)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":52:4)
+#loc44 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":118:0)
+#loc45 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":127:31)
+#loc46 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":127:11)
+#loc47 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":127:4)
+#loc49 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":193:31)
+#loc50 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:19)
+#loc51 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:53)
+#loc52 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:62)
+#loc53 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:39)
+#loc54 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:19)
+#loc55 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:53)
+#loc56 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:62)
+#loc57 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:39)
+#loc58 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":205:24)
+#loc59 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":205:36)
+#loc60 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":206:11)
+#loc61 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":206:4)
+#loc63 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":110:15)
+#loc64 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":111:19)
+#loc65 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":111:7)
+#loc66 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":112:21)
+#loc67 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":112:16)
+#loc69 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":113:29)
+#loc70 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":113:11)
+#loc71 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":113:4)
+#loc73 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":87:29)
+#loc74 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":87:11)
+#loc75 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":87:4)
+#loc77 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":65:30)
+#loc78 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":65:15)
+#loc79 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":65:11)
+#loc80 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":65:4)
+#loc82 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":173:29)
+#loc83 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":173:15)
+#loc84 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":170:4)
+#loc86 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":178:28)
+#loc87 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":179:46)
+#loc88 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:40)
+#loc89 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:68)
+#loc90 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:58)
+#loc91 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:42)
+#loc92 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:31)
+#loc93 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:58)
+#loc94 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":182:11)
+#loc95 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":182:4)
+#loc97 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":123:29)
+#loc98 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":123:11)
+#loc99 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":123:4)
+#loc101 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc102 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:11)
+#loc103 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:4)
+#loc105 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc106 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:11)
+#loc107 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:4)
+#loc112 = loc("xnumel"(#loc1))
+#loc113 = loc("r0_numel"(#loc2))
+#loc114 = loc("xoffset"(#loc3))
+#loc115 = loc("xoffset"(#loc4))
+#loc116 = loc("xindex"(#loc5))
+#loc117 = loc("xindex"(#loc6))
+#loc118 = loc("xindex"(#loc7))
+#loc119 = loc("xmask"(#loc8))
+#loc120 = loc("r0_base"(#loc9))
+#loc121 = loc("r0_base"(#loc10))
+#loc122 = loc("_tmp3_max"(#loc11))
+#loc123 = loc("_tmp3_sum"(#loc12))
+#loc124 = loc("_tmp3_max"(#loc13))
+#loc125 = loc("r0_index"(#loc14))
+#loc126 = loc("r0_mask"(#loc15))
+#loc127 = loc("tmp0"(#loc16))
+#loc128 = loc("tmp0"(#loc17))
+#loc129 = loc("tmp0"(#loc18))
+#loc130 = loc("tmp0"(#loc19))
+#loc131 = loc("tmp0"(#loc20))
+#loc132 = loc("_tmp3_max"(#loc22))
+#loc133 = loc("_tmp3_sum"(#loc23))
+#loc134 = loc("tmp3"(#loc26))
+#loc135 = loc("tmp4"(#loc27))
+#loc136 = loc("r0_index"(#loc29))
+#loc137 = loc("r0_mask"(#loc30))
+#loc138 = loc("tmp5"(#loc31))
+#loc139 = loc("tmp5"(#loc32))
+#loc140 = loc("tmp5"(#loc33))
+#loc141 = loc("tmp5"(#loc34))
+#loc142 = loc("tmp5"(#loc35))
+#loc143 = loc("tmp7"(#loc36))
+#loc144 = loc("tmp8"(#loc37))
+#loc145 = loc("tmp9"(#loc38))
+#loc149 = loc("out_max"(#loc49))
+#loc150 = loc("lhs_scale"(#loc50))
+#loc151 = loc("lhs_scale"(#loc51))
+#loc152 = loc("lhs_scale"(#loc52))
+#loc153 = loc("lhs_scale"(#loc53))
+#loc154 = loc("rhs_scale"(#loc54))
+#loc155 = loc("rhs_scale"(#loc55))
+#loc156 = loc("rhs_scale"(#loc56))
+#loc157 = loc("rhs_scale"(#loc57))
+#loc158 = loc("out_sum"(#loc58))
+#loc159 = loc("out_sum"(#loc59))
+#loc162 = loc("mask"(#loc63))
+#loc163 = loc("mask"(#loc66))
+#loc164 = loc("mask"(#loc67))
+#loc170 = loc("out_max"(#loc86))
+#loc171 = loc("out_max_keepdim"(#loc87))
+#loc172 = loc("delta"(#loc88))
+#loc173 = loc("delta"(#loc89))
+#loc174 = loc("delta"(#loc90))
+#loc175 = loc("out_sum"(#loc91))
+#loc176 = loc("out_sum"(#loc92))
+#loc177 = loc("out_sum"(#loc93))
+#loc182 = loc("_tmp3_sum"(#loc124))
+#loc183 = loc("mask"(#loc162))
+#loc184 = loc("mask"(#loc164))

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttgir ADDED

	@@ -0,0 +1,226 @@

+#blocked = #ttg.blocked<{sizePerThread = [1, 4], threadsPerWarp = [1, 32], warpsPerCTA = [1, 16], order = [1, 0]}>
+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":18:0)
+#loc1 = loc(unknown)
+#loc32 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":178:28)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":49:33)
+#loc41 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:58)
+#loc57 = loc("in_ptr0"(#loc))
+#loc58 = loc("out_ptr2"(#loc))
+#loc59 = loc("xnumel"(#loc))
+#loc60 = loc("r0_numel"(#loc))
+#loc86 = loc("out_max"(#loc32))
+#loc93 = loc("out_sum"(#loc41))
+#loc118 = loc(callsite(#loc86 at #loc33))
+#loc125 = loc(callsite(#loc93 at #loc33))
+#loc133 = loc(callsite(#loc1 at #loc118))
+#loc136 = loc(callsite(#loc1 at #loc125))
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "cuda:90", "ttg.threads-per-warp" = 32 : i32} {
+  tt.func public @triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0(%in_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr2: !tt.ptr<f32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<32000> : tensor<1x2048xi32, #blocked> loc(#loc1)
+    %cst_0 = arith.constant dense<0.000000e+00> : tensor<1x2048xbf16, #blocked> loc(#loc1)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc1)
+    %c32000_i32 = arith.constant 32000 : i32 loc(#loc1)
+    %c2048_i32 = arith.constant 2048 : i32 loc(#loc1)
+    %cst_1 = arith.constant dense<0xFF800000> : tensor<1x1xf32, #blocked> loc(#loc1)
+    %cst_2 = arith.constant dense<1.000000e+00> : tensor<1x2048xf32, #blocked> loc(#loc1)
+    %cst_3 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32, #blocked> loc(#loc1)
+    %cst_4 = arith.constant dense<0xFF800000> : tensor<1x2048xf32, #blocked> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc61)
+    %r0_base = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32, #ttg.slice<{dim = 0, parent = #blocked}>> loc(#loc62)
+    %r0_base_5 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<2048xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x2048xi32, #blocked> loc(#loc62)
+    %tmp0 = arith.muli %xoffset, %c32000_i32 : i32 loc(#loc63)
+    %tmp0_6 = tt.splat %tmp0 : i32 -> tensor<1x2048xi32, #blocked> loc(#loc104)
+    %tmp0_7 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1x2048x!tt.ptr<bf16>, #blocked> loc(#loc65)
+    %_tmp3_sum:2 = scf.for %_tmp3_sum_14 = %c0_i32 to %c32000_i32 step %c2048_i32 iter_args(%arg5 = %cst_4, %arg6 = %cst_3) -> (tensor<1x2048xf32, #blocked>, tensor<1x2048xf32, #blocked>)  : i32 {
+      %r0_index = tt.splat %_tmp3_sum_14 : i32 -> tensor<1x2048xi32, #blocked> loc(#loc67)
+      %r0_index_15 = arith.addi %r0_index, %r0_base_5 : tensor<1x2048xi32, #blocked> loc(#loc67)
+      %r0_mask = arith.cmpi slt, %r0_index_15, %cst : tensor<1x2048xi32, #blocked> loc(#loc68)
+      %tmp0_16 = arith.addi %r0_index_15, %tmp0_6 : tensor<1x2048xi32, #blocked> loc(#loc64)
+      %tmp0_17 = tt.addptr %tmp0_7, %tmp0_16 : tensor<1x2048x!tt.ptr<bf16>, #blocked>, tensor<1x2048xi32, #blocked> loc(#loc65)
+      %tmp0_18 = tt.load %tmp0_17, %r0_mask, %cst_0 evictionPolicy = evict_last : tensor<1x2048x!tt.ptr<bf16>, #blocked> loc(#loc69)
+      %tmp0_19 = arith.extf %tmp0_18 : tensor<1x2048xbf16, #blocked> to tensor<1x2048xf32, #blocked> loc(#loc70)
+      %mask = arith.cmpf ogt, %arg5, %tmp0_19 : tensor<1x2048xf32, #blocked> loc(#loc126)
+      %mask_20 = arith.cmpf une, %arg5, %arg5 : tensor<1x2048xf32, #blocked> loc(#loc127)
+      %mask_21 = arith.ori %mask, %mask_20 : tensor<1x2048xi1, #blocked> loc(#loc128)
+      %out_max_22 = arith.select %mask_21, %arg5, %tmp0_19 : tensor<1x2048xi1, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc129)
+      %lhs_scale = arith.cmpf oeq, %out_max_22, %cst_4 : tensor<1x2048xf32, #blocked> loc(#loc109)
+      %lhs_scale_23 = arith.subf %arg5, %out_max_22 : tensor<1x2048xf32, #blocked> loc(#loc110)
+      %lhs_scale_24 = tt.extern_elementwise %lhs_scale_23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32, #blocked>) -> tensor<1x2048xf32, #blocked> loc(#loc130)
+      %lhs_scale_25 = arith.select %lhs_scale, %cst_2, %lhs_scale_24 : tensor<1x2048xi1, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc112)
+      %rhs_scale = arith.subf %tmp0_19, %out_max_22 : tensor<1x2048xf32, #blocked> loc(#loc113)
+      %rhs_scale_26 = tt.extern_elementwise %rhs_scale {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32, #blocked>) -> tensor<1x2048xf32, #blocked> loc(#loc131)
+      %rhs_scale_27 = arith.select %lhs_scale, %cst_2, %rhs_scale_26 : tensor<1x2048xi1, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc115)
+      %out_sum_28 = arith.mulf %arg6, %lhs_scale_25 : tensor<1x2048xf32, #blocked> loc(#loc116)
+      %out_sum_29 = arith.addf %out_sum_28, %rhs_scale_27 : tensor<1x2048xf32, #blocked> loc(#loc117)
+      %_tmp3_max = arith.select %r0_mask, %out_max_22, %arg5 : tensor<1x2048xi1, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc84)
+      %_tmp3_sum_30 = arith.select %r0_mask, %out_sum_29, %arg6 : tensor<1x2048xi1, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc85)
+      scf.yield %_tmp3_max, %_tmp3_sum_30 : tensor<1x2048xf32, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc30)
+    } loc(#loc105)
+    %out_max = "tt.reduce"(%_tmp3_sum#0) <{axis = 1 : i32}> ({
+    ^bb0(%out_max_14: f32 loc(callsite(#loc1 at #loc118)), %out_max_15: f32 loc(callsite(#loc1 at #loc118))):
+      %mask = arith.cmpf ogt, %out_max_14, %out_max_15 : f32 loc(#loc137)
+      %mask_16 = arith.cmpf une, %out_max_14, %out_max_14 : f32 loc(#loc138)
+      %mask_17 = arith.ori %mask, %mask_16 : i1 loc(#loc139)
+      %out_max_18 = arith.select %mask_17, %out_max_14, %out_max_15 : f32 loc(#loc140)
+      tt.reduce.return %out_max_18 : f32 loc(#loc132)
+    }) : (tensor<1x2048xf32, #blocked>) -> tensor<1xf32, #ttg.slice<{dim = 1, parent = #blocked}>> loc(#loc132)
+    %out_max_keepdim = tt.expand_dims %out_max {axis = 1 : i32} : tensor<1xf32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1xf32, #blocked> loc(#loc119)
+    %delta = arith.cmpf oeq, %out_max_keepdim, %cst_1 : tensor<1x1xf32, #blocked> loc(#loc120)
+    %delta_8 = tt.broadcast %out_max_keepdim : tensor<1x1xf32, #blocked> -> tensor<1x2048xf32, #blocked> loc(#loc121)
+    %delta_9 = arith.subf %_tmp3_sum#0, %delta_8 : tensor<1x2048xf32, #blocked> loc(#loc121)
+    %delta_10 = tt.broadcast %delta : tensor<1x1xi1, #blocked> -> tensor<1x2048xi1, #blocked> loc(#loc122)
+    %delta_11 = arith.select %delta_10, %cst_3, %delta_9 : tensor<1x2048xi1, #blocked>, tensor<1x2048xf32, #blocked> loc(#loc122)
+    %out_sum = tt.extern_elementwise %delta_11 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32, #blocked>) -> tensor<1x2048xf32, #blocked> loc(#loc134)
+    %out_sum_12 = arith.mulf %_tmp3_sum#1, %out_sum : tensor<1x2048xf32, #blocked> loc(#loc124)
+    %out_sum_13 = "tt.reduce"(%out_sum_12) <{axis = 1 : i32}> ({
+    ^bb0(%out_sum_14: f32 loc(callsite(#loc1 at #loc125)), %out_sum_15: f32 loc(callsite(#loc1 at #loc125))):
+      %out_sum_16 = arith.addf %out_sum_14, %out_sum_15 : f32 loc(#loc141)
+      tt.reduce.return %out_sum_16 : f32 loc(#loc135)
+    }) : (tensor<1x2048xf32, #blocked>) -> tensor<1xf32, #ttg.slice<{dim = 1, parent = #blocked}>> loc(#loc135)
+    %tmp4 = tt.expand_dims %out_sum_13 {axis = 1 : i32} : tensor<1xf32, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1xf32, #blocked> loc(#loc94)
+    %tmp9 = tt.broadcast %tmp4 : tensor<1x1xf32, #blocked> -> tensor<1x2048xf32, #blocked> loc(#loc95)
+    %0 = tt.splat %out_ptr2 : !tt.ptr<f32> -> tensor<1x2048x!tt.ptr<f32>, #blocked> loc(#loc45)
+    scf.for %r0_offset = %c0_i32 to %c32000_i32 step %c2048_i32  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x2048xi32, #blocked> loc(#loc96)
+      %r0_index_14 = arith.addi %r0_index, %r0_base_5 : tensor<1x2048xi32, #blocked> loc(#loc96)
+      %r0_mask = arith.cmpi slt, %r0_index_14, %cst : tensor<1x2048xi32, #blocked> loc(#loc97)
+      %tmp5 = arith.addi %r0_index_14, %tmp0_6 : tensor<1x2048xi32, #blocked> loc(#loc98)
+      %tmp5_15 = tt.addptr %tmp0_7, %tmp5 : tensor<1x2048x!tt.ptr<bf16>, #blocked>, tensor<1x2048xi32, #blocked> loc(#loc99)
+      %tmp5_16 = tt.load %tmp5_15, %r0_mask, %cst_0 evictionPolicy = evict_first : tensor<1x2048x!tt.ptr<bf16>, #blocked> loc(#loc100)
+      %tmp5_17 = arith.extf %tmp5_16 : tensor<1x2048xbf16, #blocked> to tensor<1x2048xf32, #blocked> loc(#loc101)
+      %tmp7 = arith.subf %tmp5_17, %delta_8 : tensor<1x2048xf32, #blocked> loc(#loc102)
+      %tmp8 = tt.extern_elementwise %tmp7 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32, #blocked>) -> tensor<1x2048xf32, #blocked> loc(#loc103)
+      %tmp9_18 = arith.divf %tmp8, %tmp9 : tensor<1x2048xf32, #blocked> loc(#loc95)
+      %1 = tt.addptr %0, %tmp5 : tensor<1x2048x!tt.ptr<f32>, #blocked>, tensor<1x2048xi32, #blocked> loc(#loc45)
+      tt.store %1, %tmp9_18, %r0_mask : tensor<1x2048x!tt.ptr<f32>, #blocked> loc(#loc55)
+    } loc(#loc46)
+    tt.return loc(#loc56)
+  } loc(#loc)
+} loc(#loc)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":23:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":26:37)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:47)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:41)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:34)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":31:40)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":32:31)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":33:29)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:52)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:105)
+#loc12 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":110:15)
+#loc13 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":193:31)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":42:40)
+#loc15 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":112:21)
+#loc16 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":112:16)
+#loc17 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":113:29)
+#loc18 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:19)
+#loc19 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:53)
+#loc20 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":173:29)
+#loc21 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:62)
+#loc22 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:39)
+#loc23 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:53)
+#loc24 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:62)
+#loc25 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:39)
+#loc26 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":205:24)
+#loc27 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":205:36)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":45:54)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":46:54)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":46:8)
+#loc31 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":123:29)
+#loc34 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":179:46)
+#loc35 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:40)
+#loc36 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:68)
+#loc37 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:58)
+#loc38 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:42)
+#loc39 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:31)
+#loc40 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc42 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":51:16)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":62:23)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:29)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":52:40)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":53:31)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":54:29)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:41)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:34)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:52)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:106)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":60:22)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":61:29)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:53)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":52:4)
+#loc61 = loc("xoffset"(#loc2))
+#loc62 = loc("r0_base"(#loc3))
+#loc63 = loc("tmp0"(#loc4))
+#loc64 = loc("tmp0"(#loc5))
+#loc65 = loc("tmp0"(#loc6))
+#loc66 = loc("_tmp3_max"(#loc7))
+#loc67 = loc("r0_index"(#loc8))
+#loc68 = loc("r0_mask"(#loc9))
+#loc69 = loc("tmp0"(#loc10))
+#loc70 = loc("tmp0"(#loc11))
+#loc71 = loc("mask"(#loc12))
+#loc72 = loc("out_max"(#loc13))
+#loc73 = loc("mask"(#loc15))
+#loc74 = loc("mask"(#loc16))
+#loc75 = loc("lhs_scale"(#loc18))
+#loc76 = loc("lhs_scale"(#loc19))
+#loc77 = loc("lhs_scale"(#loc21))
+#loc78 = loc("lhs_scale"(#loc22))
+#loc79 = loc("rhs_scale"(#loc23))
+#loc80 = loc("rhs_scale"(#loc24))
+#loc81 = loc("rhs_scale"(#loc25))
+#loc82 = loc("out_sum"(#loc26))
+#loc83 = loc("out_sum"(#loc27))
+#loc84 = loc("_tmp3_max"(#loc28))
+#loc85 = loc("_tmp3_sum"(#loc29))
+#loc87 = loc("out_max_keepdim"(#loc34))
+#loc88 = loc("delta"(#loc35))
+#loc89 = loc("delta"(#loc36))
+#loc90 = loc("delta"(#loc37))
+#loc91 = loc("out_sum"(#loc38))
+#loc92 = loc("out_sum"(#loc39))
+#loc94 = loc("tmp4"(#loc43))
+#loc95 = loc("tmp9"(#loc44))
+#loc96 = loc("r0_index"(#loc47))
+#loc97 = loc("r0_mask"(#loc48))
+#loc98 = loc("tmp5"(#loc49))
+#loc99 = loc("tmp5"(#loc50))
+#loc100 = loc("tmp5"(#loc51))
+#loc101 = loc("tmp5"(#loc52))
+#loc102 = loc("tmp7"(#loc53))
+#loc103 = loc("tmp8"(#loc54))
+#loc104 = loc(fused[#loc64, #loc63])
+#loc105 = loc("_tmp3_sum"(#loc66))
+#loc106 = loc("mask"(#loc71))
+#loc107 = loc(callsite(#loc72 at #loc14))
+#loc108 = loc("mask"(#loc74))
+#loc109 = loc(callsite(#loc75 at #loc14))
+#loc110 = loc(callsite(#loc76 at #loc14))
+#loc111 = loc(callsite(#loc77 at #loc14))
+#loc112 = loc(callsite(#loc78 at #loc14))
+#loc113 = loc(callsite(#loc79 at #loc14))
+#loc114 = loc(callsite(#loc80 at #loc14))
+#loc115 = loc(callsite(#loc81 at #loc14))
+#loc116 = loc(callsite(#loc82 at #loc14))
+#loc117 = loc(callsite(#loc83 at #loc14))
+#loc119 = loc(callsite(#loc87 at #loc33))
+#loc120 = loc(callsite(#loc88 at #loc33))
+#loc121 = loc(callsite(#loc89 at #loc33))
+#loc122 = loc(callsite(#loc90 at #loc33))
+#loc123 = loc(callsite(#loc91 at #loc33))
+#loc124 = loc(callsite(#loc92 at #loc33))
+#loc126 = loc(callsite(#loc106 at #loc107))
+#loc127 = loc(callsite(#loc73 at #loc107))
+#loc128 = loc(callsite(#loc108 at #loc107))
+#loc129 = loc(callsite(#loc17 at #loc107))
+#loc130 = loc(callsite(#loc20 at #loc111))
+#loc131 = loc(callsite(#loc20 at #loc114))
+#loc132 = loc(callsite(#loc31 at #loc118))
+#loc134 = loc(callsite(#loc20 at #loc123))
+#loc135 = loc(callsite(#loc40 at #loc125))
+#loc137 = loc(callsite(#loc106 at #loc132))
+#loc138 = loc(callsite(#loc73 at #loc132))
+#loc139 = loc(callsite(#loc108 at #loc132))
+#loc140 = loc(callsite(#loc17 at #loc132))
+#loc141 = loc(callsite(#loc42 at #loc135))

SpecForge-ext/cache/compiled_kernels/triton/3/C3FCZCDEMCLSFODWXLEH5MRAQRWLOTRP4SAQURVAE7BPHZSTV2WQ/triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0.ttir ADDED

	@@ -0,0 +1,233 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":18:0)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":49:33)
+#loc3 = loc(unknown)
+#loc35 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":178:28)
+#loc42 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:58)
+#loc59 = loc("in_ptr0"(#loc))
+#loc60 = loc("out_ptr2"(#loc))
+#loc61 = loc("xnumel"(#loc))
+#loc62 = loc("r0_numel"(#loc))
+#loc90 = loc("out_max"(#loc35))
+#loc96 = loc("out_sum"(#loc42))
+#loc123 = loc(callsite(#loc90 at #loc2))
+#loc129 = loc(callsite(#loc96 at #loc2))
+#loc138 = loc(callsite(#loc3 at #loc123))
+#loc141 = loc(callsite(#loc3 at #loc129))
+module {
+  tt.func public @triton_red_fused__softmax__to_copy_exp_prepare_softmax_online_sub_0(%in_ptr0: !tt.ptr<bf16> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr2: !tt.ptr<f32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %xnumel: i32 {tt.divisibility = 16 : i32} loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %delta = arith.constant dense<0xFF800000> : tensor<1x1xf32> loc(#loc108)
+    %cst = arith.constant dense<1.000000e+00> : tensor<1x2048xf32> loc(#loc3)
+    %cst_0 = arith.constant dense<0.000000e+00> : tensor<1x2048xf32> loc(#loc3)
+    %cst_1 = arith.constant dense<0.000000e+00> : tensor<1x2048xbf16> loc(#loc3)
+    %c2048_i32 = arith.constant 2048 : i32 loc(#loc3)
+    %c32000_i32 = arith.constant 32000 : i32 loc(#loc3)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc3)
+    %cst_2 = arith.constant dense<32000> : tensor<1x2048xi32> loc(#loc3)
+    %cst_3 = arith.constant dense<0xFF800000> : tensor<1x2048xf32> loc(#loc3)
+    %xoffset = tt.get_program_id x : i32 loc(#loc64)
+    %r0_base = tt.make_range {end = 2048 : i32, start = 0 : i32} : tensor<2048xi32> loc(#loc65)
+    %r0_base_4 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<2048xi32> -> tensor<1x2048xi32> loc(#loc66)
+    %_tmp3_sum:2 = scf.for %r0_offset = %c0_i32 to %c32000_i32 step %c2048_i32 iter_args(%_tmp3_max = %cst_3, %_tmp3_sum_12 = %cst_0) -> (tensor<1x2048xf32>, tensor<1x2048xf32>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x2048xi32> loc(#loc68)
+      %r0_index_13 = arith.addi %r0_index, %r0_base_4 : tensor<1x2048xi32> loc(#loc68)
+      %r0_mask = arith.cmpi slt, %r0_index_13, %cst_2 : tensor<1x2048xi32> loc(#loc69)
+      %tmp0 = arith.muli %xoffset, %c32000_i32 : i32 loc(#loc70)
+      %tmp0_14 = tt.splat %tmp0 : i32 -> tensor<1x2048xi32> loc(#loc110)
+      %tmp0_15 = arith.addi %r0_index_13, %tmp0_14 : tensor<1x2048xi32> loc(#loc71)
+      %tmp0_16 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1x2048x!tt.ptr<bf16>> loc(#loc72)
+      %tmp0_17 = tt.addptr %tmp0_16, %tmp0_15 : tensor<1x2048x!tt.ptr<bf16>>, tensor<1x2048xi32> loc(#loc72)
+      %tmp0_18 = tt.load %tmp0_17, %r0_mask, %cst_1 evictionPolicy = evict_last : tensor<1x2048x!tt.ptr<bf16>> loc(#loc73)
+      %tmp0_19 = arith.extf %tmp0_18 : tensor<1x2048xbf16> to tensor<1x2048xf32> loc(#loc74)
+      %mask = arith.cmpf ogt, %_tmp3_max, %tmp0_19 : tensor<1x2048xf32> loc(#loc131)
+      %mask_20 = arith.cmpf une, %_tmp3_max, %_tmp3_max : tensor<1x2048xf32> loc(#loc132)
+      %mask_21 = arith.ori %mask, %mask_20 : tensor<1x2048xi1> loc(#loc133)
+      %out_max_22 = arith.select %mask_21, %_tmp3_max, %tmp0_19 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc134)
+      %lhs_scale = arith.cmpf oeq, %out_max_22, %cst_3 : tensor<1x2048xf32> loc(#loc114)
+      %lhs_scale_23 = arith.subf %_tmp3_max, %out_max_22 : tensor<1x2048xf32> loc(#loc115)
+      %lhs_scale_24 = tt.extern_elementwise %lhs_scale_23 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc135)
+      %lhs_scale_25 = arith.select %lhs_scale, %cst, %lhs_scale_24 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc117)
+      %rhs_scale = arith.subf %tmp0_19, %out_max_22 : tensor<1x2048xf32> loc(#loc118)
+      %rhs_scale_26 = tt.extern_elementwise %rhs_scale {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc136)
+      %rhs_scale_27 = arith.select %lhs_scale, %cst, %rhs_scale_26 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc120)
+      %out_sum_28 = arith.mulf %_tmp3_sum_12, %lhs_scale_25 : tensor<1x2048xf32> loc(#loc121)
+      %out_sum_29 = arith.addf %out_sum_28, %rhs_scale_27 : tensor<1x2048xf32> loc(#loc122)
+      %_tmp3_max_30 = arith.select %r0_mask, %out_max_22, %_tmp3_max : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc88)
+      %_tmp3_sum_31 = arith.select %r0_mask, %out_sum_29, %_tmp3_sum_12 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc89)
+      scf.yield %_tmp3_max_30, %_tmp3_sum_31 : tensor<1x2048xf32>, tensor<1x2048xf32> loc(#loc33)
+    } loc(#loc109)
+    %out_max = "tt.reduce"(%_tmp3_sum#0) <{axis = 1 : i32}> ({
+    ^bb0(%out_max_12: f32 loc(callsite(#loc3 at #loc123)), %out_max_13: f32 loc(callsite(#loc3 at #loc123))):
+      %mask = arith.cmpf ogt, %out_max_12, %out_max_13 : f32 loc(#loc142)
+      %mask_14 = arith.cmpf une, %out_max_12, %out_max_12 : f32 loc(#loc143)
+      %mask_15 = arith.ori %mask, %mask_14 : i1 loc(#loc144)
+      %out_max_16 = arith.select %mask_15, %out_max_12, %out_max_13 : f32 loc(#loc145)
+      tt.reduce.return %out_max_16 : f32 loc(#loc137)
+    }) : (tensor<1x2048xf32>) -> tensor<1xf32> loc(#loc137)
+    %out_max_keepdim = tt.expand_dims %out_max {axis = 1 : i32} : tensor<1xf32> -> tensor<1x1xf32> loc(#loc124)
+    %delta_5 = arith.cmpf oeq, %out_max_keepdim, %delta : tensor<1x1xf32> loc(#loc108)
+    %delta_6 = tt.broadcast %out_max_keepdim : tensor<1x1xf32> -> tensor<1x2048xf32> loc(#loc125)
+    %delta_7 = arith.subf %_tmp3_sum#0, %delta_6 : tensor<1x2048xf32> loc(#loc125)
+    %delta_8 = tt.broadcast %delta_5 : tensor<1x1xi1> -> tensor<1x2048xi1> loc(#loc126)
+    %delta_9 = arith.select %delta_8, %cst_0, %delta_7 : tensor<1x2048xi1>, tensor<1x2048xf32> loc(#loc126)
+    %out_sum = tt.extern_elementwise %delta_9 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc139)
+    %out_sum_10 = arith.mulf %_tmp3_sum#1, %out_sum : tensor<1x2048xf32> loc(#loc128)
+    %out_sum_11 = "tt.reduce"(%out_sum_10) <{axis = 1 : i32}> ({
+    ^bb0(%out_sum_12: f32 loc(callsite(#loc3 at #loc129)), %out_sum_13: f32 loc(callsite(#loc3 at #loc129))):
+      %out_sum_14 = arith.addf %out_sum_12, %out_sum_13 : f32 loc(#loc146)
+      tt.reduce.return %out_sum_14 : f32 loc(#loc140)
+    }) : (tensor<1x2048xf32>) -> tensor<1xf32> loc(#loc140)
+    %tmp4 = tt.expand_dims %out_sum_11 {axis = 1 : i32} : tensor<1xf32> -> tensor<1x1xf32> loc(#loc97)
+    scf.for %r0_offset = %c0_i32 to %c32000_i32 step %c2048_i32  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x2048xi32> loc(#loc98)
+      %r0_index_12 = arith.addi %r0_index, %r0_base_4 : tensor<1x2048xi32> loc(#loc98)
+      %r0_mask = arith.cmpi slt, %r0_index_12, %cst_2 : tensor<1x2048xi32> loc(#loc99)
+      %tmp5 = arith.muli %xoffset, %c32000_i32 : i32 loc(#loc100)
+      %tmp5_13 = tt.splat %tmp5 : i32 -> tensor<1x2048xi32> loc(#loc130)
+      %tmp5_14 = arith.addi %r0_index_12, %tmp5_13 : tensor<1x2048xi32> loc(#loc101)
+      %tmp5_15 = tt.splat %in_ptr0 : !tt.ptr<bf16> -> tensor<1x2048x!tt.ptr<bf16>> loc(#loc102)
+      %tmp5_16 = tt.addptr %tmp5_15, %tmp5_14 : tensor<1x2048x!tt.ptr<bf16>>, tensor<1x2048xi32> loc(#loc102)
+      %tmp5_17 = tt.load %tmp5_16, %r0_mask, %cst_1 evictionPolicy = evict_first : tensor<1x2048x!tt.ptr<bf16>> loc(#loc103)
+      %tmp5_18 = arith.extf %tmp5_17 : tensor<1x2048xbf16> to tensor<1x2048xf32> loc(#loc104)
+      %tmp7 = arith.subf %tmp5_18, %delta_6 : tensor<1x2048xf32> loc(#loc105)
+      %tmp8 = tt.extern_elementwise %tmp7 {libname = "", libpath = "", pure = true, symbol = "__nv_expf"} : (tensor<1x2048xf32>) -> tensor<1x2048xf32> loc(#loc106)
+      %tmp9 = tt.broadcast %tmp4 : tensor<1x1xf32> -> tensor<1x2048xf32> loc(#loc107)
+      %tmp9_19 = arith.divf %tmp8, %tmp9 : tensor<1x2048xf32> loc(#loc107)
+      %0 = tt.splat %out_ptr2 : !tt.ptr<f32> -> tensor<1x2048x!tt.ptr<f32>> loc(#loc56)
+      %1 = tt.addptr %0, %tmp5_14 : tensor<1x2048x!tt.ptr<f32>>, tensor<1x2048xi32> loc(#loc56)
+      tt.store %1, %tmp9_19, %r0_mask : tensor<1x2048x!tt.ptr<f32>> loc(#loc57)
+    } loc(#loc45)
+    tt.return loc(#loc58)
+  } loc(#loc)
+} loc(#loc)
+#loc1 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:40)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":23:28)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":26:27)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":26:37)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":31:40)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":32:31)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":33:29)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:47)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:41)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:34)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:52)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":37:105)
+#loc15 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":110:15)
+#loc16 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":193:31)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":42:40)
+#loc18 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":112:21)
+#loc19 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":112:16)
+#loc20 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":113:29)
+#loc21 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:19)
+#loc22 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:53)
+#loc23 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":173:29)
+#loc24 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:62)
+#loc25 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":196:39)
+#loc26 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:53)
+#loc27 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:62)
+#loc28 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":199:39)
+#loc29 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":205:24)
+#loc30 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":205:36)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":45:54)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":46:54)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":46:8)
+#loc34 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":123:29)
+#loc36 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":179:46)
+#loc37 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:68)
+#loc38 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":180:58)
+#loc39 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:42)
+#loc40 = loc("/workspace/specforge/lib/python3.11/site-packages/torch/_inductor/runtime/triton_helpers.py":181:31)
+#loc41 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc43 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":51:16)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":52:40)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":53:31)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":54:29)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:47)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:41)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:34)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:52)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":58:106)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":60:22)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":61:29)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":62:23)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:29)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":63:53)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/vx/cvxtbxicic4ombm2d2flczxydkol2l2ffr4qtot3rrznjz7bhngf.py":52:4)
+#loc63 = loc("delta"(#loc1))
+#loc64 = loc("xoffset"(#loc4))
+#loc65 = loc("r0_base"(#loc5))
+#loc66 = loc("r0_base"(#loc6))
+#loc67 = loc("_tmp3_max"(#loc7))
+#loc68 = loc("r0_index"(#loc8))
+#loc69 = loc("r0_mask"(#loc9))
+#loc70 = loc("tmp0"(#loc10))
+#loc71 = loc("tmp0"(#loc11))
+#loc72 = loc("tmp0"(#loc12))
+#loc73 = loc("tmp0"(#loc13))
+#loc74 = loc("tmp0"(#loc14))
+#loc75 = loc("mask"(#loc15))
+#loc76 = loc("out_max"(#loc16))
+#loc77 = loc("mask"(#loc18))
+#loc78 = loc("mask"(#loc19))
+#loc79 = loc("lhs_scale"(#loc21))
+#loc80 = loc("lhs_scale"(#loc22))
+#loc81 = loc("lhs_scale"(#loc24))
+#loc82 = loc("lhs_scale"(#loc25))
+#loc83 = loc("rhs_scale"(#loc26))
+#loc84 = loc("rhs_scale"(#loc27))
+#loc85 = loc("rhs_scale"(#loc28))
+#loc86 = loc("out_sum"(#loc29))
+#loc87 = loc("out_sum"(#loc30))
+#loc88 = loc("_tmp3_max"(#loc31))
+#loc89 = loc("_tmp3_sum"(#loc32))
+#loc91 = loc("out_max_keepdim"(#loc36))
+#loc92 = loc("delta"(#loc37))
+#loc93 = loc("delta"(#loc38))
+#loc94 = loc("out_sum"(#loc39))
+#loc95 = loc("out_sum"(#loc40))
+#loc97 = loc("tmp4"(#loc44))
+#loc98 = loc("r0_index"(#loc46))
+#loc99 = loc("r0_mask"(#loc47))
+#loc100 = loc("tmp5"(#loc48))
+#loc101 = loc("tmp5"(#loc49))
+#loc102 = loc("tmp5"(#loc50))
+#loc103 = loc("tmp5"(#loc51))
+#loc104 = loc("tmp5"(#loc52))
+#loc105 = loc("tmp7"(#loc53))
+#loc106 = loc("tmp8"(#loc54))
+#loc107 = loc("tmp9"(#loc55))
+#loc108 = loc(callsite(#loc63 at #loc2))
+#loc109 = loc("_tmp3_sum"(#loc67))
+#loc110 = loc(fused[#loc71, #loc70])
+#loc111 = loc("mask"(#loc75))
+#loc112 = loc(callsite(#loc76 at #loc17))
+#loc113 = loc("mask"(#loc78))
+#loc114 = loc(callsite(#loc79 at #loc17))
+#loc115 = loc(callsite(#loc80 at #loc17))
+#loc116 = loc(callsite(#loc81 at #loc17))
+#loc117 = loc(callsite(#loc82 at #loc17))
+#loc118 = loc(callsite(#loc83 at #loc17))
+#loc119 = loc(callsite(#loc84 at #loc17))
+#loc120 = loc(callsite(#loc85 at #loc17))
+#loc121 = loc(callsite(#loc86 at #loc17))
+#loc122 = loc(callsite(#loc87 at #loc17))
+#loc124 = loc(callsite(#loc91 at #loc2))
+#loc125 = loc(callsite(#loc92 at #loc2))
+#loc126 = loc(callsite(#loc93 at #loc2))
+#loc127 = loc(callsite(#loc94 at #loc2))
+#loc128 = loc(callsite(#loc95 at #loc2))
+#loc130 = loc(fused[#loc101, #loc100])
+#loc131 = loc(callsite(#loc111 at #loc112))
+#loc132 = loc(callsite(#loc77 at #loc112))
+#loc133 = loc(callsite(#loc113 at #loc112))
+#loc134 = loc(callsite(#loc20 at #loc112))
+#loc135 = loc(callsite(#loc23 at #loc116))
+#loc136 = loc(callsite(#loc23 at #loc119))
+#loc137 = loc(callsite(#loc34 at #loc123))
+#loc139 = loc(callsite(#loc23 at #loc127))
+#loc140 = loc(callsite(#loc41 at #loc129))
+#loc142 = loc(callsite(#loc111 at #loc137))
+#loc143 = loc(callsite(#loc77 at #loc137))
+#loc144 = loc(callsite(#loc113 at #loc137))
+#loc145 = loc(callsite(#loc20 at #loc137))
+#loc146 = loc(callsite(#loc43 at #loc140))

SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/__grp__triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.source", "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ttir", "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ttgir", "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.llir", "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.ptx", 
"triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.cubin", "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.json"}}

SpecForge-ext/cache/compiled_kernels/triton/3/DE6XSSYLS7BWGGS4UO3WTFWZCN6OVYXIHMGZ5KR7P3YWZXLVATDQ/triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1.cubin ADDED

Binary file (28.9 kB).

	@@ -0,0 +1 @@

+ {"hash": "193d794b0b97c3631a5ca3b76996d9137ceae2e83b0d9eaa3f7ef16cdd7504c7", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 16, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 128, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1"}

	@@ -0,0 +1,318 @@

+; ModuleID = 'LLVMDialectModule'
+source_filename = "LLVMDialectModule"
+target datalayout = "e-p3:32:32-p4:32:32-p5:32:32-p6:32:32-p7:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64"
+@global_smem = external addrspace(3) global [0 x i8], align 16
+; Function Attrs: nounwind
+define ptx_kernel void @triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i64 %3, i64 %4, i64 %5, i64 %6, i64 %7, i64 %8, i32 %9, i32 %10, ptr addrspace(1) readnone captures(none) %11, ptr addrspace(1) readnone captures(none) %12) local_unnamed_addr #0 !dbg !4 {
+  %14 = tail call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x(), !dbg !7
+  %15 = icmp slt i32 %14, %9, !dbg !8
+  %16 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x(), !dbg !9
+  %17 = and i32 %16, 384, !dbg !9
+  %18 = zext nneg i32 %14 to i64, !dbg !10
+  %.frozen = freeze i64 %3, !dbg !10
+  %19 = sdiv i64 %18, %.frozen, !dbg !10
+  %20 = srem i64 %19, %4, !dbg !11
+  %21 = mul i64 %19, %.frozen, !dbg !12
+  %.decomposed = sub i64 %18, %21, !dbg !12
+  %22 = sdiv i64 %18, %7, !dbg !13
+  %23 = shl nsw i64 %20, 7, !dbg !14
+  %24 = shl nuw nsw i64 %.decomposed, 7, !dbg !15
+  %25 = getelementptr i64, ptr addrspace(1) %0, i64 %22, !dbg !16
+  %26 = and i32 %16, 127
+  %27 = zext nneg i32 %26 to i64
+  %28 = or disjoint i64 %24, %27
+  %29 = icmp slt i64 %28, %6
+  %30 = icmp sge i64 %28, %8
+  %31 = tail call i64 @llvm.smin.i64(i64 %8, i64 0)
+  %32 = sub nsw i64 %.decomposed, %20
+  %33 = shl nsw i64 %32, 7
+  %34 = zext nneg i32 %17 to i64, !dbg !17
+  %35 = zext nneg i32 %26 to i64, !dbg !17
+  %36 = zext nneg i32 %16 to i64, !dbg !17
+  %37 = insertelement <2 x i1> poison, i1 %15, i64 0, !dbg !18
+  %38 = shufflevector <2 x i1> %37, <2 x i1> poison, <2 x i32> zeroinitializer, !dbg !18
+  %39 = insertelement <2 x i1> poison, i1 %29, i64 0, !dbg !19
+  %40 = shufflevector <2 x i1> %39, <2 x i1> poison, <2 x i32> zeroinitializer, !dbg !19
+  %41 = insertelement <2 x i64> poison, i64 %23, i64 0, !dbg !20
+  %42 = shufflevector <2 x i64> %41, <2 x i64> poison, <2 x i32> zeroinitializer, !dbg !20
+  %43 = insertelement <2 x i64> poison, i64 %5, i64 0, !dbg !21
+  %44 = shufflevector <2 x i64> %43, <2 x i64> poison, <2 x i32> zeroinitializer, !dbg !21
+  %45 = insertelement <2 x i64> poison, i64 %28, i64 0, !dbg !22
+  %46 = shufflevector <2 x i64> %45, <2 x i64> poison, <2 x i32> zeroinitializer, !dbg !22
+  %47 = insertelement <2 x i1> poison, i1 %30, i64 0, !dbg !23
+  %48 = shufflevector <2 x i1> %47, <2 x i1> poison, <2 x i32> zeroinitializer, !dbg !23
+  %49 = insertelement <2 x i64> poison, i64 %33, i64 0, !dbg !24
+  %50 = shufflevector <2 x i64> %49, <2 x i64> poison, <2 x i32> zeroinitializer, !dbg !24
+  %51 = insertelement <2 x i64> poison, i64 %8, i64 0, !dbg !25
+  %52 = shufflevector <2 x i64> %51, <2 x i64> poison, <2 x i32> zeroinitializer, !dbg !25
+  br label %53, !dbg !17
+53:                                               ; preds = %13, %53
+  %indvars.iv = phi i64 [ 0, %13 ], [ %indvars.iv.next, %53 ]
+  %54 = phi <2 x i64> [ zeroinitializer, %13 ], [ %113, %53 ]
+  %55 = or disjoint i64 %indvars.iv, %34, !dbg !26
+  %56 = or disjoint i64 %indvars.iv, %36, !dbg !26
+  %57 = lshr exact i64 %55, 7, !dbg !27
+  %58 = lshr i64 %56, 7, !dbg !27
+  %59 = trunc nuw nsw i64 %58 to i32, !dbg !27
+  %60 = or i32 %59, 4, !dbg !27
+  %61 = zext nneg i32 %60 to i64, !dbg !20
+  %62 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #5, !dbg !28
+  %63 = sub nsw i64 %35, %57, !dbg !29
+  %64 = sub nsw i32 %26, %60, !dbg !29
+  %65 = sext i32 %64 to i64, !dbg !30
+  %66 = insertelement <2 x i64> poison, i64 %57, i64 0, !dbg !20
+  %67 = insertelement <2 x i64> %66, i64 %61, i64 1, !dbg !20
+  %68 = or disjoint <2 x i64> %42, %67, !dbg !20
+  %69 = icmp slt <2 x i64> %68, %44, !dbg !21
+  %70 = and <2 x i1> %40, %69, !dbg !19
+  %71 = icmp sge <2 x i64> %68, %46, !dbg !22
+  %72 = extractelement <2 x i1> %70, i64 0, !dbg !31
+  %73 = and i1 %15, %72, !dbg !31
+  %74 = extractelement <2 x i1> %70, i64 1, !dbg !31
+  %75 = and i1 %15, %74, !dbg !31
+  %76 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %25, i64 %62, i1 %73) #5, !dbg !28
+  %77 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09createpolicy.fractional.L2::evict_last.b64 $0, 1.0;", "=l"() #5, !dbg !28
+  %78 = tail call i64 asm sideeffect "mov.u64 $0, 0x0;\0A\09@$3 ld.global.L1::evict_last.L2::cache_hint.b64 { $0 }, [ $1 + 0 ], $2;", "=l,l,l,b"(ptr addrspace(1) %25, i64 %77, i1 %75) #5, !dbg !28
+  %79 = insertelement <2 x i64> poison, i64 %76, i64 0, !dbg !32
+  %80 = insertelement <2 x i64> %79, i64 %78, i64 1, !dbg !32
+  %81 = icmp slt <2 x i64> %46, %80, !dbg !32
+  %82 = icmp slt <2 x i64> %68, %80, !dbg !33
+  %83 = and <2 x i1> %81, %82, !dbg !34
+  %84 = and <2 x i1> %71, %83, !dbg !35
+  %85 = srem i64 %28, %8, !dbg !36
+  %.not = icmp eq i64 %85, 0, !dbg !37
+  %86 = select i1 %.not, i64 0, i64 %31, !dbg !38
+  %87 = add nsw i64 %86, %85, !dbg !38
+  %88 = insertelement <2 x i64> poison, i64 %87, i64 0, !dbg !39
+  %89 = shufflevector <2 x i64> %88, <2 x i64> poison, <2 x i32> zeroinitializer, !dbg !39
+  %90 = icmp slt <2 x i64> %89, %80, !dbg !39
+  %91 = insertelement <2 x i64> poison, i64 %63, i64 0, !dbg !24
+  %92 = insertelement <2 x i64> %91, i64 %65, i64 1, !dbg !24
+  %93 = add nsw <2 x i64> %50, %92, !dbg !24
+  %94 = srem <2 x i64> %93, %52, !dbg !25
+  %95 = icmp ne <2 x i64> %94, zeroinitializer, !dbg !40
+  %96 = extractelement <2 x i64> %94, i64 0, !dbg !41
+  %97 = xor i64 %96, %8, !dbg !41
+  %98 = extractelement <2 x i64> %94, i64 1, !dbg !41
+  %99 = xor i64 %98, %8, !dbg !41
+  %100 = insertelement <2 x i64> poison, i64 %97, i64 0, !dbg !41
+  %101 = insertelement <2 x i64> %100, i64 %99, i64 1, !dbg !41
+  %102 = icmp slt <2 x i64> %101, zeroinitializer, !dbg !41
+  %103 = and <2 x i1> %95, %102, !dbg !42
+  %104 = select <2 x i1> %103, <2 x i64> %52, <2 x i64> zeroinitializer, !dbg !43
+  %105 = sub <2 x i64> zeroinitializer, %104, !dbg !44
+  %106 = icmp eq <2 x i64> %94, %105, !dbg !44
+  %107 = and <2 x i1> %90, %106, !dbg !23
+  %108 = and <2 x i1> %48, %107, !dbg !23
+  %109 = or <2 x i1> %84, %108, !dbg !45
+  %110 = select <2 x i1> %38, <2 x i1> %70, <2 x i1> zeroinitializer, !dbg !18
+  %111 = select <2 x i1> %110, <2 x i1> %109, <2 x i1> zeroinitializer, !dbg !18
+  %112 = zext <2 x i1> %111 to <2 x i64>, !dbg !18
+  %113 = add <2 x i64> %54, %112, !dbg !18
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1024, !dbg !17
+  %114 = icmp samesign ult i64 %indvars.iv, 15360, !dbg !17
+  br i1 %114, label %53, label %115, !dbg !17
+115:                                              ; preds = %53
+  %116 = and i32 %16, 31, !dbg !9
+  %117 = lshr i32 %16, 5, !dbg !9
+  %shift = shufflevector <2 x i64> %113, <2 x i64> poison, <2 x i32> <i32 1, i32 poison>, !dbg !46
+  %foldExtExtBinop = add <2 x i64> %113, %shift, !dbg !46
+  %118 = extractelement <2 x i64> %foldExtExtBinop, i64 0, !dbg !46
+  %119 = bitcast <2 x i64> %foldExtExtBinop to <4 x i32>, !dbg !50
+  %120 = extractelement <4 x i32> %119, i64 1, !dbg !50
+  %121 = trunc i64 %118 to i32, !dbg !50
+  %122 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %121, i32 16, i32 31), !dbg !50
+  %123 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %120, i32 16, i32 31), !dbg !50
+  %124 = insertelement <2 x i32> poison, i32 %122, i64 0, !dbg !50
+  %125 = insertelement <2 x i32> %124, i32 %123, i64 1, !dbg !50
+  %126 = bitcast <2 x i32> %125 to i64, !dbg !50
+  %127 = add i64 %118, %126, !dbg !46
+  %extelt.offset1 = lshr i64 %127, 32, !dbg !50
+  %128 = trunc nuw i64 %extelt.offset1 to i32, !dbg !50
+  %129 = trunc i64 %127 to i32, !dbg !50
+  %130 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %129, i32 8, i32 31), !dbg !50
+  %131 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %128, i32 8, i32 31), !dbg !50
+  %132 = insertelement <2 x i32> poison, i32 %130, i64 0, !dbg !50
+  %133 = insertelement <2 x i32> %132, i32 %131, i64 1, !dbg !50
+  %134 = bitcast <2 x i32> %133 to i64, !dbg !50
+  %135 = add i64 %127, %134, !dbg !46
+  %extelt.offset2 = lshr i64 %135, 32, !dbg !50
+  %136 = trunc nuw i64 %extelt.offset2 to i32, !dbg !50
+  %137 = trunc i64 %135 to i32, !dbg !50
+  %138 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %137, i32 4, i32 31), !dbg !50
+  %139 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %136, i32 4, i32 31), !dbg !50
+  %140 = insertelement <2 x i32> poison, i32 %138, i64 0, !dbg !50
+  %141 = insertelement <2 x i32> %140, i32 %139, i64 1, !dbg !50
+  %142 = bitcast <2 x i32> %141 to i64, !dbg !50
+  %143 = add i64 %135, %142, !dbg !46
+  %extelt.offset3 = lshr i64 %143, 32, !dbg !50
+  %144 = trunc nuw i64 %extelt.offset3 to i32, !dbg !50
+  %145 = trunc i64 %143 to i32, !dbg !50
+  %146 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %145, i32 2, i32 31), !dbg !50
+  %147 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %144, i32 2, i32 31), !dbg !50
+  %148 = insertelement <2 x i32> poison, i32 %146, i64 0, !dbg !50
+  %149 = insertelement <2 x i32> %148, i32 %147, i64 1, !dbg !50
+  %150 = bitcast <2 x i32> %149 to i64, !dbg !50
+  %151 = add i64 %143, %150, !dbg !46
+  %extelt.offset4 = lshr i64 %151, 32, !dbg !50
+  %152 = trunc nuw i64 %extelt.offset4 to i32, !dbg !50
+  %153 = trunc i64 %151 to i32, !dbg !50
+  %154 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %153, i32 1, i32 31), !dbg !50
+  %155 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %152, i32 1, i32 31), !dbg !50
+  %156 = insertelement <2 x i32> poison, i32 %154, i64 0, !dbg !50
+  %157 = insertelement <2 x i32> %156, i32 %155, i64 1, !dbg !50
+  %158 = bitcast <2 x i32> %157 to i64, !dbg !50
+  %159 = add i64 %151, %158, !dbg !46
+  %160 = and i32 %117, 15, !dbg !50
+  %161 = icmp eq i32 %116, 0, !dbg !50
+  %162 = getelementptr i64, ptr addrspace(3) @global_smem, i32 %160, !dbg !50
+  %163 = insertelement <1 x i64> poison, i64 %159, i64 0, !dbg !50
+  tail call void asm sideeffect "@$2 st.shared.b64 [ $0 + 0 ], $1;", "r,l,b"(ptr addrspace(3) %162, <1 x i64> %163, i1 %161) #5, !dbg !50
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !50
+  %164 = icmp samesign ult i32 %16, 16, !dbg !50
+  %165 = getelementptr i64, ptr addrspace(3) @global_smem, i32 %16, !dbg !50
+  %166 = tail call i64 asm sideeffect "@$2 ld.shared.b64 $0, [ $1 + 0 ];", "=l,r,b"(ptr addrspace(3) %165, i1 %164) #5, !dbg !50
+  %extelt.offset5 = lshr i64 %166, 32, !dbg !50
+  %167 = trunc nuw i64 %extelt.offset5 to i32, !dbg !50
+  %168 = trunc i64 %166 to i32, !dbg !50
+  %169 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %168, i32 8, i32 31), !dbg !50
+  %170 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %167, i32 8, i32 31), !dbg !50
+  %171 = insertelement <2 x i32> poison, i32 %169, i64 0, !dbg !50
+  %172 = insertelement <2 x i32> %171, i32 %170, i64 1, !dbg !50
+  %173 = bitcast <2 x i32> %172 to i64, !dbg !50
+  %174 = add i64 %166, %173, !dbg !46
+  %extelt.offset6 = lshr i64 %174, 32, !dbg !50
+  %175 = trunc nuw i64 %extelt.offset6 to i32, !dbg !50
+  %176 = trunc i64 %174 to i32, !dbg !50
+  %177 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %176, i32 4, i32 31), !dbg !50
+  %178 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %175, i32 4, i32 31), !dbg !50
+  %179 = insertelement <2 x i32> poison, i32 %177, i64 0, !dbg !50
+  %180 = insertelement <2 x i32> %179, i32 %178, i64 1, !dbg !50
+  %181 = bitcast <2 x i32> %180 to i64, !dbg !50
+  %182 = add i64 %174, %181, !dbg !46
+  %extelt.offset7 = lshr i64 %182, 32, !dbg !50
+  %183 = trunc nuw i64 %extelt.offset7 to i32, !dbg !50
+  %184 = trunc i64 %182 to i32, !dbg !50
+  %185 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %184, i32 2, i32 31), !dbg !50
+  %186 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %183, i32 2, i32 31), !dbg !50
+  %187 = insertelement <2 x i32> poison, i32 %185, i64 0, !dbg !50
+  %188 = insertelement <2 x i32> %187, i32 %186, i64 1, !dbg !50
+  %189 = bitcast <2 x i32> %188 to i64, !dbg !50
+  %190 = add i64 %182, %189, !dbg !46
+  %extelt.offset8 = lshr i64 %190, 32, !dbg !50
+  %191 = trunc nuw i64 %extelt.offset8 to i32, !dbg !50
+  %192 = trunc i64 %190 to i32, !dbg !50
+  %193 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %192, i32 1, i32 31), !dbg !50
+  %194 = tail call i32 @llvm.nvvm.shfl.sync.bfly.i32(i32 -1, i32 %191, i32 1, i32 31), !dbg !50
+  %195 = insertelement <2 x i32> poison, i32 %193, i64 0, !dbg !50
+  %196 = insertelement <2 x i32> %195, i32 %194, i64 1, !dbg !50
+  %197 = bitcast <2 x i32> %196 to i64, !dbg !50
+  %198 = add i64 %190, %197, !dbg !46
+  %199 = icmp eq i32 %16, 0, !dbg !50
+  %200 = insertelement <1 x i64> poison, i64 %198, i64 0, !dbg !50
+  tail call void asm sideeffect "@$2 st.shared.b64 [ $0 + 0 ], $1;", "r,l,b"(ptr addrspace(3) %165, <1 x i64> %200, i1 %199) #5, !dbg !50
+  tail call void @llvm.nvvm.barrier.cta.sync.aligned.all(i32 0), !dbg !50
+  %201 = load i64, ptr addrspace(3) @global_smem, align 16, !dbg !50
+  %202 = add i64 %201, -1, !dbg !51
+  %203 = icmp ult i64 %202, 16383, !dbg !51
+  %204 = zext i1 %203 to i32, !dbg !52
+  %205 = icmp eq i64 %201, 16384, !dbg !53
+  %206 = zext i1 %205 to i32, !dbg !52
+  %207 = getelementptr i32, ptr addrspace(1) %1, i64 %18, !dbg !54
+  %208 = and i32 %16, 511, !dbg !55
+  %209 = icmp eq i32 %208, 0, !dbg !55
+  %210 = and i1 %209, %15, !dbg !55
+  tail call void asm sideeffect "@$2 st.global.b32 [ $1 + 0 ], { $0 };", "r,l,b"(i32 %204, ptr addrspace(1) %207, i1 %210) #5, !dbg !55
+  %211 = getelementptr i32, ptr addrspace(1) %2, i64 %18, !dbg !56
+  tail call void asm sideeffect "@$2 st.global.b32 [ $1 + 0 ], { $0 };", "r,l,b"(i32 %206, ptr addrspace(1) %211, i1 %210) #5, !dbg !57
+  ret void, !dbg !58
+}
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 2147483647) i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #1
+; Function Attrs: mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare noundef range(i32 0, 1024) i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1
+; Function Attrs: convergent nocallback nounwind memory(inaccessiblemem: readwrite)
+declare i32 @llvm.nvvm.shfl.sync.bfly.i32(i32, i32, i32, i32) #2
+; Function Attrs: convergent nocallback nounwind
+declare void @llvm.nvvm.barrier.cta.sync.aligned.all(i32) #3
+; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
+declare i64 @llvm.smin.i64(i64, i64) #4
+attributes #0 = { nounwind "nvvm.reqntid"="512" }
+attributes #1 = { mustprogress nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+attributes #2 = { convergent nocallback nounwind memory(inaccessiblemem: readwrite) }
+attributes #3 = { convergent nocallback nounwind }
+attributes #4 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
+attributes #5 = { nounwind }
+!llvm.dbg.cu = !{!0}
+!llvm.module.flags = !{!2, !3}
+!0 = distinct !DICompileUnit(language: DW_LANG_C, file: !1, producer: "triton", isOptimized: true, runtimeVersion: 0, emissionKind: LineTablesOnly)
+!1 = !DIFile(filename: "cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py", directory: "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av")
+!2 = !{i32 2, !"Debug Info Version", i32 3}
+!3 = !{i32 4, !"nvvm-reflect-ftz", i32 1}
+!4 = distinct !DISubprogram(name: "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1", linkageName: "triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1", scope: !1, file: !1, line: 18, type: !5, scopeLine: 18, spFlags: DISPFlagDefinition | DISPFlagOptimized, unit: !0)
+!5 = !DISubroutineType(cc: DW_CC_normal, types: !6)
+!6 = !{}
+!7 = !DILocation(line: 22, column: 28, scope: !4)
+!8 = !DILocation(line: 24, column: 21, scope: !4)
+!9 = !DILocation(line: 25, column: 37, scope: !4)
+!10 = !DILocation(line: 27, column: 21, scope: !4)
+!11 = !DILocation(line: 27, column: 28, scope: !4)
+!12 = !DILocation(line: 28, column: 19, scope: !4)
+!13 = !DILocation(line: 29, column: 19, scope: !4)
+!14 = !DILocation(line: 39, column: 26, scope: !4)
+!15 = !DILocation(line: 42, column: 26, scope: !4)
+!16 = !DILocation(line: 49, column: 35, scope: !4)
+!17 = !DILocation(line: 32, column: 40, scope: !4)
+!18 = !DILocation(line: 86, column: 50, scope: !4)
+!19 = !DILocation(line: 45, column: 22, scope: !4)
+!20 = !DILocation(line: 39, column: 22, scope: !4)
+!21 = !DILocation(line: 41, column: 22, scope: !4)
+!22 = !DILocation(line: 48, column: 23, scope: !4)
+!23 = !DILocation(line: 79, column: 24, scope: !4)
+!24 = !DILocation(line: 69, column: 51, scope: !4)
+!25 = !DILocation(line: 70, column: 25, scope: !4)
+!26 = !DILocation(line: 33, column: 31, scope: !4)
+!27 = !DILocation(line: 37, column: 27, scope: !4)
+!28 = !DILocation(line: 49, column: 77, scope: !4)
+!29 = !DILocation(line: 69, column: 24, scope: !4)
+!30 = !DILocation(line: 69, column: 38, scope: !4)
+!31 = !DILocation(line: 49, column: 94, scope: !4)
+!32 = !DILocation(line: 50, column: 23, scope: !4)
+!33 = !DILocation(line: 51, column: 23, scope: !4)
+!34 = !DILocation(line: 52, column: 24, scope: !4)
+!35 = !DILocation(line: 53, column: 23, scope: !4)
+!36 = !DILocation(line: 58, column: 24, scope: !4)
+!37 = !DILocation(line: 60, column: 25, scope: !4)
+!38 = !DILocation(line: 66, column: 39, scope: !4)
+!39 = !DILocation(line: 67, column: 24, scope: !4)
+!40 = !DILocation(line: 71, column: 25, scope: !4)
+!41 = !DILocation(line: 73, column: 25, scope: !4)
+!42 = !DILocation(line: 74, column: 24, scope: !4)
+!43 = !DILocation(line: 76, column: 39, scope: !4)
+!44 = !DILocation(line: 78, column: 25, scope: !4)
+!45 = !DILocation(line: 80, column: 24, scope: !4)
+!46 = !DILocation(line: 261, column: 15, scope: !47, inlinedAt: !49)
+!47 = distinct !DILexicalBlockFile(scope: !4, file: !48, discriminator: 0)
+!48 = !DIFile(filename: "standard.py", directory: "/workspace/specforge/lib/python3.11/site-packages/triton/language")
+!49 = !DILocation(line: 87, column: 27, scope: !4)
+!50 = !DILocation(line: 291, column: 36, scope: !47, inlinedAt: !49)
+!51 = !DILocation(line: 92, column: 20, scope: !4)
+!52 = !DILocation(line: 0, scope: !4)
+!53 = !DILocation(line: 95, column: 21, scope: !4)
+!54 = !DILocation(line: 98, column: 25, scope: !4)
+!55 = !DILocation(line: 98, column: 37, scope: !4)
+!56 = !DILocation(line: 99, column: 25, scope: !4)
+!57 = !DILocation(line: 99, column: 37, scope: !4)
+!58 = !DILocation(line: 99, column: 4, scope: !4)

	@@ -0,0 +1,736 @@

+//
+// Generated by LLVM NVPTX Back-End
+//
+.version 8.7
+.target sm_90a
+.address_size 64
+	// .globl	triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1 // -- Begin function triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1
+.extern .shared .align 16 .b8 global_smem[];
+                                        // @triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1
+.visible .entry triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1(
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_0,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_1,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_2,
+	.param .u64 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_3,
+	.param .u64 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_4,
+	.param .u64 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_5,
+	.param .u64 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_6,
+	.param .u64 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_7,
+	.param .u64 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_8,
+	.param .u32 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_9,
+	.param .u32 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_10,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_11,
+	.param .u64 .ptr .global .align 1 triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_12
+)
+.reqntid 512
+{
+	.reg .pred 	%p<53>;
+	.reg .b32 	%r<76>;
+	.reg .b64 	%rd<162>;
+	.loc	1 18 0                          // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:18:0
+$L__func_begin0:
+	.loc	1 18 0                          // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:18:0
+// %bb.0:
+	ld.param.b64 	%rd47, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_4];
+$L__tmp0:
+	.loc	1 22 28                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:22:28
+	mov.u32 	%r7, %ctaid.x;
+	.loc	1 27 21                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:27:21
+	cvt.u64.u32 	%rd1, %r7;
+	ld.param.b64 	%rd52, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_3];
+	and.b64 	%rd53, %rd52, -4294967296;
+	setp.ne.b64 	%p11, %rd53, 0;
+	cvt.u32.u64 	%r74, %rd1;
+	@%p11 bra 	$L__BB0_2;
+	bra.uni 	$L__BB0_1;
+$L__BB0_2:
+	div.s64 	%rd153, %rd1, %rd52;
+	bra.uni 	$L__BB0_3;
+$L__BB0_1:
+	cvt.u32.u64 	%r8, %rd52;
+	div.u32 	%r10, %r74, %r8;
+	cvt.u64.u32 	%rd153, %r10;
+$L__BB0_3:
+	.loc	1 0 21                          // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:0:21
+	ld.param.b64 	%rd50, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_7];
+	.loc	1 27 28                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:27:28
+	or.b64 	%rd54, %rd153, %rd47;
+	and.b64 	%rd55, %rd54, -4294967296;
+	setp.ne.b64 	%p12, %rd55, 0;
+	@%p12 bra 	$L__BB0_5;
+	bra.uni 	$L__BB0_4;
+$L__BB0_5:
+	rem.s64 	%rd154, %rd153, %rd47;
+	bra.uni 	$L__BB0_6;
+$L__BB0_4:
+	cvt.u32.u64 	%r11, %rd47;
+	cvt.u32.u64 	%r12, %rd153;
+	rem.u32 	%r13, %r12, %r11;
+	cvt.u64.u32 	%rd154, %r13;
+$L__BB0_6:
+	.loc	1 0 28                          // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:0:28
+	ld.param.b32 	%r6, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_9];
+	ld.param.b64 	%rd51, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_8];
+	ld.param.b64 	%rd49, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_6];
+	ld.param.b64 	%rd44, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_0];
+	mov.u32 	%r1, %tid.x;
+	.loc	1 28 19                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:28:19
+	mul.lo.s64 	%rd56, %rd153, %rd52;
+	sub.s64 	%rd9, %rd1, %rd56;
+	.loc	1 29 19                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:29:19
+	and.b64 	%rd57, %rd50, -4294967296;
+	setp.ne.b64 	%p13, %rd57, 0;
+	@%p13 bra 	$L__BB0_8;
+	bra.uni 	$L__BB0_7;
+$L__BB0_8:
+	div.s64 	%rd155, %rd1, %rd50;
+	bra.uni 	$L__BB0_9;
+$L__BB0_7:
+	cvt.u32.u64 	%r14, %rd50;
+	div.u32 	%r16, %r74, %r14;
+	cvt.u64.u32 	%rd155, %r16;
+$L__BB0_9:
+	.loc	1 0 19                          // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:0:19
+	ld.param.b64 	%rd48, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_5];
+	ld.param.b64 	%rd46, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_2];
+	ld.param.b64 	%rd45, [triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1_param_1];
+	.loc	1 24 21                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:24:21
+	setp.lt.s32 	%p1, %r74, %r6;
+	.loc	1 39 26                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:39:26
+	shl.b64 	%rd16, %rd154, 7;
+	.loc	1 42 26                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:42:26
+	shl.b64 	%rd61, %rd9, 7;
+	.loc	1 49 35                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:49:35
+	shl.b64 	%rd62, %rd155, 3;
+	add.s64 	%rd70, %rd44, %rd62;
+	and.b32 	%r2, %r1, 127;
+	cvt.u64.u32 	%rd63, %r2;
+	or.b64 	%rd20, %rd61, %rd63;
+	setp.lt.s64 	%p3, %rd20, %rd49;
+	setp.ge.s64 	%p5, %rd20, %rd51;
+	min.s64 	%rd15, %rd51, 0;
+	sub.s64 	%rd64, %rd9, %rd154;
+	shl.b64 	%rd22, %rd64, 7;
+	.loc	1 32 40                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:32:40
+	cvt.u64.u32 	%rd65, %r1;
+	shr.u64 	%rd66, %rd65, 7;
+	cvt.u32.u64 	%r75, %rd66;
+	shr.u32 	%r18, %r1, 7;
+	cvt.u64.u32 	%rd67, %r18;
+	and.b64 	%rd157, %rd67, 3;
+	sub.s64 	%rd156, %rd63, %rd157;
+	mov.b64 	%rd159, 0;
+	mov.b64 	%rd158, -1024;
+	mov.b64 	%rd160, %rd159;
+	bra.uni 	$L__BB0_10;
+$L__BB0_12:                             //   in Loop: Header=BB0_10 Depth=1
+	.loc	1 58 24                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:58:24
+	rem.s64 	%rd161, %rd20, %rd51;
+$L__BB0_13:                             //   in Loop: Header=BB0_10 Depth=1
+	.loc	1 0 0                           // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:0
+	sub.s32 	%r21, %r2, %r20;
+	cvt.s64.s32 	%rd33, %r21;
+	.loc	1 60 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:60:25
+	setp.eq.b64 	%p24, %rd161, 0;
+	.loc	1 66 39                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:66:39
+	selp.b64 	%rd83, 0, %rd15, %p24;
+	add.s64 	%rd84, %rd83, %rd161;
+	.loc	1 67 24                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:67:24
+	setp.lt.s64 	%p25, %rd84, %rd69;
+	setp.lt.s64 	%p26, %rd84, %rd73;
+	.loc	1 69 51                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:69:51
+	add.s64 	%rd85, %rd22, %rd33;
+	add.s64 	%rd86, %rd22, %rd156;
+	.loc	1 70 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:70:25
+	rem.s64 	%rd87, %rd86, %rd51;
+	rem.s64 	%rd88, %rd85, %rd51;
+	.loc	1 71 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:71:25
+	setp.ne.b64 	%p27, %rd88, 0;
+	setp.ne.b64 	%p28, %rd87, 0;
+	.loc	1 73 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:73:25
+	xor.b64 	%rd89, %rd87, %rd51;
+	xor.b64 	%rd90, %rd88, %rd51;
+	.loc	1 76 39                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:76:39
+	shr.s64 	%rd91, %rd89, 63;
+	and.b64 	%rd92, %rd91, %rd51;
+	selp.b64 	%rd93, %rd92, 0, %p28;
+	shr.s64 	%rd94, %rd90, 63;
+	and.b64 	%rd95, %rd94, %rd51;
+	selp.b64 	%rd96, %rd95, 0, %p27;
+	.loc	1 78 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:78:25
+	neg.s64 	%rd97, %rd96;
+	neg.s64 	%rd98, %rd93;
+	setp.eq.b64 	%p29, %rd87, %rd98;
+	setp.eq.b64 	%p30, %rd88, %rd97;
+	.loc	1 79 24                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:79:24
+	and.pred 	%p31, %p26, %p30;
+	and.pred 	%p33, %p25, %p29;
+	and.pred 	%p35, %p5, %p33;
+	and.pred 	%p36, %p5, %p31;
+	.loc	1 80 24                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:80:24
+	or.pred 	%p37, %p10, %p36;
+	or.pred 	%p38, %p9, %p35;
+	.loc	1 86 50                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:86:50
+	and.pred 	%p41, %p14, %p38;
+	and.pred 	%p42, %p15, %p37;
+	selp.b64 	%rd99, 1, 0, %p42;
+	selp.b64 	%rd100, 1, 0, %p41;
+	add.s64 	%rd159, %rd159, %rd100;
+	add.s64 	%rd160, %rd160, %rd99;
+	.loc	1 32 40                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:32:40
+	add.s64 	%rd158, %rd158, 1024;
+	add.s32 	%r75, %r75, 8;
+	add.s64 	%rd157, %rd157, 8;
+	add.s64 	%rd156, %rd156, -8;
+	setp.lt.u64 	%p43, %rd158, 15360;
+	@%p43 bra 	$L__BB0_10;
+	bra.uni 	$L__BB0_14;
+$L__BB0_10:                             // =>This Inner Loop Header: Depth=1
+	.loc	1 37 27                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:37:27
+	or.b32 	%r20, %r75, 4;
+	.loc	1 39 22                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:39:22
+	cvt.u64.u32 	%rd76, %r20;
+	.loc	1 49 77                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:49:77
+	// begin inline asm
+	mov.u64 %rd68, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd68, 1.0;
+	// end inline asm
+	.loc	1 39 22                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:39:22
+	or.b64 	%rd77, %rd16, %rd76;
+	or.b64 	%rd78, %rd16, %rd157;
+	.loc	1 41 22                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:41:22
+	setp.lt.s64 	%p17, %rd78, %rd48;
+	setp.lt.s64 	%p18, %rd77, %rd48;
+	.loc	1 45 22                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:45:22
+	and.pred 	%p8, %p3, %p18;
+	and.pred 	%p7, %p3, %p17;
+	.loc	1 48 23                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:48:23
+	setp.ge.s64 	%p19, %rd77, %rd20;
+	setp.ge.s64 	%p20, %rd78, %rd20;
+	.loc	1 49 94                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:49:94
+	and.pred 	%p14, %p1, %p7;
+	and.pred 	%p15, %p1, %p8;
+	.loc	1 49 77                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:49:77
+	// begin inline asm
+	mov.u64 %rd69, 0x0;
+	@%p14 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd69 }, [ %rd70 + 0 ], %rd68;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd72, 0x0;
+	createpolicy.fractional.L2::evict_last.b64 %rd72, 1.0;
+	// end inline asm
+	// begin inline asm
+	mov.u64 %rd73, 0x0;
+	@%p15 ld.global.L1::evict_last.L2::cache_hint.b64 { %rd73 }, [ %rd70 + 0 ], %rd72;
+	// end inline asm
+	.loc	1 52 24                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:52:24
+	max.s64 	%rd79, %rd20, %rd77;
+	setp.lt.s64 	%p21, %rd79, %rd73;
+	max.s64 	%rd80, %rd20, %rd78;
+	setp.lt.s64 	%p22, %rd80, %rd69;
+	.loc	1 53 23                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:53:23
+	and.pred 	%p9, %p20, %p22;
+	and.pred 	%p10, %p19, %p21;
+	.loc	1 58 24                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:58:24
+	or.b64 	%rd81, %rd20, %rd51;
+	and.b64 	%rd82, %rd81, -4294967296;
+	setp.ne.b64 	%p23, %rd82, 0;
+	@%p23 bra 	$L__BB0_12;
+// %bb.11:                              //   in Loop: Header=BB0_10 Depth=1
+	cvt.u32.u64 	%r22, %rd51;
+	cvt.u32.u64 	%r23, %rd20;
+	rem.u32 	%r24, %r23, %r22;
+	cvt.u64.u32 	%rd161, %r24;
+	bra.uni 	$L__BB0_13;
+$L__BB0_14:
+	.loc	1 25 37                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:25:37
+	and.b32 	%r31, %r1, 31;
+$L__tmp1:
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd106, %rd159, %rd160;
+	mov.b64 	{_, %r32}, %rd106;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	cvt.u32.u64 	%r33, %rd106;
+	shfl.sync.bfly.b32 	%r34, %r33, 16, 31, -1;
+	shfl.sync.bfly.b32 	%r35, %r32, 16, 31, -1;
+	cvt.u64.u32 	%rd107, %r34;
+	cvt.u64.u32 	%rd108, %r35;
+	shl.b64 	%rd109, %rd108, 32;
+	or.b64 	%rd110, %rd107, %rd109;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd111, %rd106, %rd110;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r36}, %rd111;
+	cvt.u32.u64 	%r37, %rd111;
+	shfl.sync.bfly.b32 	%r38, %r37, 8, 31, -1;
+	shfl.sync.bfly.b32 	%r39, %r36, 8, 31, -1;
+	cvt.u64.u32 	%rd112, %r38;
+	cvt.u64.u32 	%rd113, %r39;
+	shl.b64 	%rd114, %rd113, 32;
+	or.b64 	%rd115, %rd112, %rd114;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd116, %rd111, %rd115;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r40}, %rd116;
+	cvt.u32.u64 	%r41, %rd116;
+	shfl.sync.bfly.b32 	%r42, %r41, 4, 31, -1;
+	shfl.sync.bfly.b32 	%r43, %r40, 4, 31, -1;
+	cvt.u64.u32 	%rd117, %r42;
+	cvt.u64.u32 	%rd118, %r43;
+	shl.b64 	%rd119, %rd118, 32;
+	or.b64 	%rd120, %rd117, %rd119;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd121, %rd116, %rd120;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r44}, %rd121;
+	cvt.u32.u64 	%r45, %rd121;
+	shfl.sync.bfly.b32 	%r46, %r45, 2, 31, -1;
+	shfl.sync.bfly.b32 	%r47, %r44, 2, 31, -1;
+	cvt.u64.u32 	%rd122, %r46;
+	cvt.u64.u32 	%rd123, %r47;
+	shl.b64 	%rd124, %rd123, 32;
+	or.b64 	%rd125, %rd122, %rd124;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd126, %rd121, %rd125;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r48}, %rd126;
+	cvt.u32.u64 	%r49, %rd126;
+	shfl.sync.bfly.b32 	%r50, %r49, 1, 31, -1;
+	shfl.sync.bfly.b32 	%r51, %r48, 1, 31, -1;
+	cvt.u64.u32 	%rd127, %r50;
+	cvt.u64.u32 	%rd128, %r51;
+	shl.b64 	%rd129, %rd128, 32;
+	or.b64 	%rd130, %rd127, %rd129;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd101, %rd126, %rd130;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	setp.eq.b32 	%p44, %r31, 0;
+	shr.u32 	%r52, %r1, 2;
+	and.b32 	%r53, %r52, 120;
+	mov.b32 	%r54, global_smem;
+	add.s32 	%r25, %r54, %r53;
+	// begin inline asm
+	@%p44 st.shared.b64 [ %r25 + 0 ], %rd101;
+	// end inline asm
+	bar.sync 	0;
+	setp.lt.u32 	%p45, %r1, 16;
+	shl.b32 	%r55, %r1, 3;
+	add.s32 	%r26, %r54, %r55;
+	// begin inline asm
+	@%p45 ld.shared.b64 %rd102, [ %r26 + 0 ];
+	// end inline asm
+	mov.b64 	{_, %r56}, %rd102;
+	cvt.u32.u64 	%r57, %rd102;
+	shfl.sync.bfly.b32 	%r58, %r57, 8, 31, -1;
+	shfl.sync.bfly.b32 	%r59, %r56, 8, 31, -1;
+	cvt.u64.u32 	%rd131, %r58;
+	cvt.u64.u32 	%rd132, %r59;
+	shl.b64 	%rd133, %rd132, 32;
+	or.b64 	%rd134, %rd131, %rd133;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd135, %rd102, %rd134;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r60}, %rd135;
+	cvt.u32.u64 	%r61, %rd135;
+	shfl.sync.bfly.b32 	%r62, %r61, 4, 31, -1;
+	shfl.sync.bfly.b32 	%r63, %r60, 4, 31, -1;
+	cvt.u64.u32 	%rd136, %r62;
+	cvt.u64.u32 	%rd137, %r63;
+	shl.b64 	%rd138, %rd137, 32;
+	or.b64 	%rd139, %rd136, %rd138;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd140, %rd135, %rd139;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r64}, %rd140;
+	cvt.u32.u64 	%r65, %rd140;
+	shfl.sync.bfly.b32 	%r66, %r65, 2, 31, -1;
+	shfl.sync.bfly.b32 	%r67, %r64, 2, 31, -1;
+	cvt.u64.u32 	%rd141, %r66;
+	cvt.u64.u32 	%rd142, %r67;
+	shl.b64 	%rd143, %rd142, 32;
+	or.b64 	%rd144, %rd141, %rd143;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd145, %rd140, %rd144;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	mov.b64 	{_, %r68}, %rd145;
+	cvt.u32.u64 	%r69, %rd145;
+	shfl.sync.bfly.b32 	%r70, %r69, 1, 31, -1;
+	shfl.sync.bfly.b32 	%r71, %r68, 1, 31, -1;
+	cvt.u64.u32 	%rd146, %r70;
+	cvt.u64.u32 	%rd147, %r71;
+	shl.b64 	%rd148, %rd147, 32;
+	or.b64 	%rd149, %rd146, %rd148;
+	.loc	2 261 15                        // standard.py:261:15 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	add.s64 	%rd103, %rd145, %rd149;
+	.loc	2 291 36                        // standard.py:291:36 @[ cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:87:27 ]
+	setp.eq.b32 	%p46, %r1, 0;
+	// begin inline asm
+	@%p46 st.shared.b64 [ %r26 + 0 ], %rd103;
+	// end inline asm
+	bar.sync 	0;
+	ld.shared.b64 	%rd150, [global_smem];
+$L__tmp2:
+	.loc	1 92 20                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:92:20
+	add.s64 	%rd151, %rd150, -1;
+	setp.lt.u64 	%p50, %rd151, 16383;
+	.loc	1 0 0                           // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:0
+	selp.b32 	%r28, 1, 0, %p50;
+	.loc	1 95 21                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:95:21
+	setp.eq.b64 	%p51, %rd150, 16384;
+	.loc	1 0 0                           // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:0
+	selp.b32 	%r29, 1, 0, %p51;
+	.loc	1 98 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:98:25
+	shl.b64 	%rd152, %rd1, 2;
+	add.s64 	%rd104, %rd45, %rd152;
+	.loc	1 98 37                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:98:37
+	and.b32 	%r72, %r1, 511;
+	setp.eq.b32 	%p52, %r72, 0;
+	and.pred 	%p47, %p52, %p1;
+	// begin inline asm
+	@%p47 st.global.b32 [ %rd104 + 0 ], { %r28 };
+	// end inline asm
+	.loc	1 99 25                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:99:25
+	add.s64 	%rd105, %rd46, %rd152;
+	.loc	1 99 37                         // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:99:37
+	// begin inline asm
+	@%p47 st.global.b32 [ %rd105 + 0 ], { %r29 };
+	// end inline asm
+	.loc	1 99 4                          // cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py:99:4
+	ret;
+$L__tmp3:
+$L__func_end0:
+                                        // -- End function
+}
+	.file	1 "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py"
+	.file	2 "/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py"
+	.section	.debug_abbrev
+	{
+.b8 1                                   // Abbreviation Code
+.b8 17                                  // DW_TAG_compile_unit
+.b8 1                                   // DW_CHILDREN_yes
+.b8 37                                  // DW_AT_producer
+.b8 8                                   // DW_FORM_string
+.b8 19                                  // DW_AT_language
+.b8 5                                   // DW_FORM_data2
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 16                                  // DW_AT_stmt_list
+.b8 6                                   // DW_FORM_data4
+.b8 27                                  // DW_AT_comp_dir
+.b8 8                                   // DW_FORM_string
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 2                                   // Abbreviation Code
+.b8 46                                  // DW_TAG_subprogram
+.b8 0                                   // DW_CHILDREN_no
+.b8 3                                   // DW_AT_name
+.b8 8                                   // DW_FORM_string
+.b8 32                                  // DW_AT_inline
+.b8 11                                  // DW_FORM_data1
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 3                                   // Abbreviation Code
+.b8 46                                  // DW_TAG_subprogram
+.b8 1                                   // DW_CHILDREN_yes
+.b8 17                                  // DW_AT_low_pc
+.b8 1                                   // DW_FORM_addr
+.b8 18                                  // DW_AT_high_pc
+.b8 1                                   // DW_FORM_addr
+.b8 49                                  // DW_AT_abstract_origin
+.b8 19                                  // DW_FORM_ref4
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 4                                   // Abbreviation Code
+.b8 29                                  // DW_TAG_inlined_subroutine
+.b8 0                                   // DW_CHILDREN_no
+.b8 49                                  // DW_AT_abstract_origin
+.b8 19                                  // DW_FORM_ref4
+.b8 17                                  // DW_AT_low_pc
+.b8 1                                   // DW_FORM_addr
+.b8 18                                  // DW_AT_high_pc
+.b8 1                                   // DW_FORM_addr
+.b8 88                                  // DW_AT_call_file
+.b8 11                                  // DW_FORM_data1
+.b8 89                                  // DW_AT_call_line
+.b8 11                                  // DW_FORM_data1
+.b8 87                                  // DW_AT_call_column
+.b8 11                                  // DW_FORM_data1
+.b8 0                                   // EOM(1)
+.b8 0                                   // EOM(2)
+.b8 0                                   // EOM(3)
+	}
+	.section	.debug_info
+	{
+.b32 307                                // Length of Unit
+.b8 2                                   // DWARF version number
+.b8 0
+.b32 .debug_abbrev                      // Offset Into Abbrev. Section
+.b8 8                                   // Address Size (in bytes)
+.b8 1                                   // Abbrev [1] 0xb:0x12c DW_TAG_compile_unit
+.b8 116                                 // DW_AT_producer
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 0
+.b8 2                                   // DW_AT_language
+.b8 0
+.b8 99                                  // DW_AT_name
+.b8 97
+.b8 118
+.b8 112
+.b8 55
+.b8 120
+.b8 97
+.b8 110
+.b8 55
+.b8 55
+.b8 116
+.b8 102
+.b8 114
+.b8 55
+.b8 113
+.b8 121
+.b8 116
+.b8 102
+.b8 107
+.b8 112
+.b8 54
+.b8 115
+.b8 106
+.b8 114
+.b8 103
+.b8 107
+.b8 100
+.b8 54
+.b8 104
+.b8 118
+.b8 114
+.b8 117
+.b8 105
+.b8 97
+.b8 113
+.b8 102
+.b8 122
+.b8 107
+.b8 101
+.b8 105
+.b8 98
+.b8 116
+.b8 108
+.b8 53
+.b8 114
+.b8 116
+.b8 97
+.b8 103
+.b8 115
+.b8 99
+.b8 110
+.b8 103
+.b8 46
+.b8 112
+.b8 121
+.b8 0
+.b32 .debug_line                        // DW_AT_stmt_list
+.b8 47                                  // DW_AT_comp_dir
+.b8 119
+.b8 111
+.b8 114
+.b8 107
+.b8 115
+.b8 112
+.b8 97
+.b8 99
+.b8 101
+.b8 47
+.b8 104
+.b8 97
+.b8 110
+.b8 114
+.b8 117
+.b8 105
+.b8 47
+.b8 83
+.b8 112
+.b8 101
+.b8 99
+.b8 70
+.b8 111
+.b8 114
+.b8 103
+.b8 101
+.b8 45
+.b8 101
+.b8 120
+.b8 116
+.b8 47
+.b8 99
+.b8 97
+.b8 99
+.b8 104
+.b8 101
+.b8 47
+.b8 99
+.b8 111
+.b8 109
+.b8 112
+.b8 105
+.b8 108
+.b8 101
+.b8 100
+.b8 95
+.b8 107
+.b8 101
+.b8 114
+.b8 110
+.b8 101
+.b8 108
+.b8 115
+.b8 47
+.b8 97
+.b8 118
+.b8 0
+.b8 2                                   // Abbrev [2] 0x8b:0x7d DW_TAG_subprogram
+.b8 116                                 // DW_AT_name
+.b8 114
+.b8 105
+.b8 116
+.b8 111
+.b8 110
+.b8 95
+.b8 114
+.b8 101
+.b8 100
+.b8 95
+.b8 102
+.b8 117
+.b8 115
+.b8 101
+.b8 100
+.b8 95
+.b8 95
+.b8 116
+.b8 111
+.b8 95
+.b8 99
+.b8 111
+.b8 112
+.b8 121
+.b8 95
+.b8 97
+.b8 114
+.b8 97
+.b8 110
+.b8 103
+.b8 101
+.b8 95
+.b8 98
+.b8 105
+.b8 116
+.b8 119
+.b8 105
+.b8 115
+.b8 101
+.b8 95
+.b8 97
+.b8 110
+.b8 100
+.b8 95
+.b8 98
+.b8 105
+.b8 116
+.b8 119
+.b8 105
+.b8 115
+.b8 101
+.b8 95
+.b8 111
+.b8 114
+.b8 95
+.b8 99
+.b8 111
+.b8 110
+.b8 115
+.b8 116
+.b8 97
+.b8 110
+.b8 116
+.b8 95
+.b8 112
+.b8 97
+.b8 100
+.b8 95
+.b8 110
+.b8 100
+.b8 95
+.b8 101
+.b8 113
+.b8 95
+.b8 103
+.b8 101
+.b8 95
+.b8 103
+.b8 116
+.b8 95
+.b8 105
+.b8 110
+.b8 100
+.b8 101
+.b8 120
+.b8 95
+.b8 108
+.b8 116
+.b8 95
+.b8 112
+.b8 101
+.b8 114
+.b8 109
+.b8 117
+.b8 116
+.b8 101
+.b8 95
+.b8 114
+.b8 101
+.b8 109
+.b8 97
+.b8 105
+.b8 110
+.b8 100
+.b8 101
+.b8 114
+.b8 95
+.b8 115
+.b8 117
+.b8 98
+.b8 95
+.b8 115
+.b8 117
+.b8 109
+.b8 95
+.b8 118
+.b8 105
+.b8 101
+.b8 119
+.b8 95
+.b8 49
+.b8 0
+.b8 1                                   // DW_AT_inline
+.b8 3                                   // Abbrev [3] 0x108:0x2e DW_TAG_subprogram
+.b64 $L__func_begin0                    // DW_AT_low_pc
+.b64 $L__func_end0                      // DW_AT_high_pc
+.b32 139                                // DW_AT_abstract_origin
+.b8 4                                   // Abbrev [4] 0x11d:0x18 DW_TAG_inlined_subroutine
+.b32 139                                // DW_AT_abstract_origin
+.b64 $L__tmp1                           // DW_AT_low_pc
+.b64 $L__tmp2                           // DW_AT_high_pc
+.b8 1                                   // DW_AT_call_file
+.b8 87                                  // DW_AT_call_line
+.b8 27                                  // DW_AT_call_column
+.b8 0                                   // End Of Children Mark
+.b8 0                                   // End Of Children Mark
+	}
+	.section	.debug_macinfo	{	}

	@@ -0,0 +1,418 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":18:0)
+#loc97 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":285:0)
+#loc99 = loc(unknown)
+#loc102 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":260:0)
+#loc106 = loc("in_ptr0"(#loc))
+#loc107 = loc("out_ptr1"(#loc))
+#loc108 = loc("out_ptr2"(#loc))
+#loc109 = loc("ks0"(#loc))
+#loc110 = loc("ks1"(#loc))
+#loc111 = loc("ks2"(#loc))
+#loc112 = loc("ks3"(#loc))
+#loc113 = loc("ks4"(#loc))
+#loc114 = loc("ks5"(#loc))
+#loc115 = loc("xnumel"(#loc))
+#loc116 = loc("r0_numel"(#loc))
+#loc207 = loc("input"(#loc97))
+#loc208 = loc("a"(#loc102))
+#loc209 = loc("b"(#loc102))
+module {
+  tt.func public @triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1(%in_ptr0: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr1: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr1"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %ks2: i64 loc("ks2"(#loc)), %ks3: i64 loc("ks3"(#loc)), %ks4: i64 loc("ks4"(#loc)), %ks5: i64 loc("ks5"(#loc)), %xnumel: i32 loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %r0_numel_0 = arith.constant 16384 : i32 loc(#loc117)
+    %xoffset = tt.get_program_id x : i32 loc(#loc118)
+    %xoffset_1 = arith.constant 1 : i32 loc(#loc119)
+    %xoffset_2 = arith.constant 1 : i32 loc(#loc119)
+    %xoffset_3 = arith.muli %xoffset, %xoffset_2 : i32 loc(#loc119)
+    %xindex = tt.make_range {end = 1 : i32, start = 0 : i32} : tensor<1xi32> loc(#loc120)
+    %xindex_4 = tt.expand_dims %xindex {axis = 1 : i32} : tensor<1xi32> -> tensor<1x1xi32> loc(#loc121)
+    %xindex_5 = tt.splat %xoffset_3 : i32 -> tensor<1x1xi32> loc(#loc122)
+    %xindex_6 = arith.addi %xindex_5, %xindex_4 : tensor<1x1xi32> loc(#loc122)
+    %xmask = tt.splat %xnumel : i32 -> tensor<1x1xi32> loc(#loc123)
+    %xmask_7 = arith.cmpi slt, %xindex_6, %xmask : tensor<1x1xi32> loc(#loc123)
+    %r0_base = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> loc(#loc124)
+    %r0_base_8 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> loc(#loc125)
+    %x1 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc126)
+    %x1_9 = tt.splat %ks0 : i64 -> tensor<1x1xi64> loc(#loc126)
+    %x1_10 = arith.divsi %x1, %x1_9 : tensor<1x1xi64> loc(#loc126)
+    %x1_11 = tt.splat %ks1 : i64 -> tensor<1x1xi64> loc(#loc127)
+    %x1_12 = arith.remsi %x1_10, %x1_11 : tensor<1x1xi64> loc(#loc127)
+    %x0 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc128)
+    %x0_13 = tt.splat %ks0 : i64 -> tensor<1x1xi64> loc(#loc128)
+    %x0_14 = arith.remsi %x0, %x0_13 : tensor<1x1xi64> loc(#loc128)
+    %x2 = arith.extsi %xindex_6 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc129)
+    %x2_15 = tt.splat %ks4 : i64 -> tensor<1x1xi64> loc(#loc129)
+    %x2_16 = arith.divsi %x2, %x2_15 : tensor<1x1xi64> loc(#loc129)
+    %_tmp46 = arith.constant 0 : i64 loc(#loc130)
+    %_tmp46_17 = arith.constant dense<0> : tensor<1x1024xi64> loc(#loc130)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc15)
+    %c1024_i32 = arith.constant 1024 : i32 loc(#loc15)
+    %0 = arith.bitcast %c0_i32 : i32 to i32 loc(#loc15)
+    %1 = arith.bitcast %r0_numel_0 : i32 to i32 loc(#loc15)
+    %2 = arith.bitcast %c1024_i32 : i32 to i32 loc(#loc15)
+    %3 = ub.poison : i32 loc(#loc15)
+    %_tmp46_18 = scf.for %r0_offset = %0 to %1 step %2 iter_args(%_tmp46_22 = %_tmp46_17) -> (tensor<1x1024xi64>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x1024xi32> loc(#loc132)
+      %r0_index_23 = arith.addi %r0_index, %r0_base_8 : tensor<1x1024xi32> loc(#loc132)
+      %r0_mask = arith.constant dense<16384> : tensor<1x1024xi32> loc(#loc133)
+      %r0_mask_24 = arith.cmpi slt, %r0_index_23, %r0_mask : tensor<1x1024xi32> loc(#loc133)
+      %r0_4 = arith.constant 128 : i32 loc(#loc134)
+      %r0_4_25 = arith.constant 128 : i32 loc(#loc134)
+      %r0_4_26 = arith.constant dense<128> : tensor<1x1024xi32> loc(#loc134)
+      %r0_4_27 = arith.divsi %r0_index_23, %r0_4_26 : tensor<1x1024xi32> loc(#loc134)
+      %r0_3 = arith.constant 128 : i32 loc(#loc135)
+      %r0_3_28 = arith.constant 128 : i32 loc(#loc135)
+      %r0_3_29 = arith.constant dense<128> : tensor<1x1024xi32> loc(#loc135)
+      %r0_3_30 = arith.remsi %r0_index_23, %r0_3_29 : tensor<1x1024xi32> loc(#loc135)
+      %tmp0 = arith.constant 128 : i32 loc(#loc136)
+      %tmp0_31 = arith.constant 128 : i64 loc(#loc136)
+      %tmp0_32 = arith.constant dense<128> : tensor<1x1xi64> loc(#loc136)
+      %tmp0_33 = arith.muli %tmp0_32, %x1_12 : tensor<1x1xi64> loc(#loc136)
+      %tmp0_34 = arith.extsi %r0_4_27 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc137)
+      %tmp0_35 = tt.broadcast %tmp0_33 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc137)
+      %tmp0_36 = arith.addi %tmp0_34, %tmp0_35 : tensor<1x1024xi64> loc(#loc137)
+      %tmp2 = tt.splat %ks2 : i64 -> tensor<1x1024xi64> loc(#loc138)
+      %tmp2_37 = arith.cmpi slt, %tmp0_36, %tmp2 : tensor<1x1024xi64> loc(#loc138)
+      %tmp3 = arith.constant 128 : i32 loc(#loc139)
+      %tmp3_38 = arith.constant 128 : i64 loc(#loc139)
+      %tmp3_39 = arith.constant dense<128> : tensor<1x1xi64> loc(#loc139)
+      %tmp3_40 = arith.muli %tmp3_39, %x0_14 : tensor<1x1xi64> loc(#loc139)
+      %tmp3_41 = arith.extsi %r0_3_30 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc140)
+      %tmp3_42 = tt.broadcast %tmp3_40 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc140)
+      %tmp3_43 = arith.addi %tmp3_41, %tmp3_42 : tensor<1x1024xi64> loc(#loc140)
+      %tmp5 = tt.splat %ks3 : i64 -> tensor<1x1024xi64> loc(#loc141)
+      %tmp5_44 = arith.cmpi slt, %tmp3_43, %tmp5 : tensor<1x1024xi64> loc(#loc141)
+      %tmp6 = arith.andi %tmp2_37, %tmp5_44 : tensor<1x1024xi1> loc(#loc142)
+      %tmp7 = arith.constant 128 : i32 loc(#loc143)
+      %tmp7_45 = arith.constant 128 : i64 loc(#loc143)
+      %tmp7_46 = arith.constant dense<128> : tensor<1x1xi64> loc(#loc143)
+      %tmp7_47 = arith.muli %tmp7_46, %x1_12 : tensor<1x1xi64> loc(#loc143)
+      %tmp7_48 = arith.extsi %r0_4_27 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc144)
+      %tmp7_49 = tt.broadcast %tmp7_47 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc144)
+      %tmp7_50 = arith.addi %tmp7_48, %tmp7_49 : tensor<1x1024xi64> loc(#loc144)
+      %tmp8 = arith.constant 128 : i32 loc(#loc145)
+      %tmp8_51 = arith.constant 128 : i64 loc(#loc145)
+      %tmp8_52 = arith.constant dense<128> : tensor<1x1xi64> loc(#loc145)
+      %tmp8_53 = arith.muli %tmp8_52, %x0_14 : tensor<1x1xi64> loc(#loc145)
+      %tmp8_54 = arith.extsi %r0_3_30 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc146)
+      %tmp8_55 = tt.broadcast %tmp8_53 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc146)
+      %tmp8_56 = arith.addi %tmp8_54, %tmp8_55 : tensor<1x1024xi64> loc(#loc146)
+      %tmp9 = arith.cmpi sge, %tmp7_50, %tmp8_56 : tensor<1x1024xi64> loc(#loc147)
+      %tmp10 = tt.broadcast %x2_16 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc148)
+      %tmp10_57 = tt.splat %in_ptr0 : !tt.ptr<i64> -> tensor<1x1024x!tt.ptr<i64>> loc(#loc149)
+      %tmp10_58 = tt.addptr %tmp10_57, %tmp10 : tensor<1x1024x!tt.ptr<i64>>, tensor<1x1024xi64> loc(#loc149)
+      %tmp10_59 = arith.andi %r0_mask_24, %tmp6 : tensor<1x1024xi1> loc(#loc150)
+      %tmp10_60 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x1024xi1> loc(#loc151)
+      %tmp10_61 = arith.andi %tmp10_59, %tmp10_60 : tensor<1x1024xi1> loc(#loc151)
+      %tmp10_62 = arith.constant 0.000000e+00 : f32 loc(#loc152)
+      %tmp10_63 = arith.constant dense<0.000000e+00> : tensor<1x1024xf32> loc(#loc152)
+      %tmp10_64 = arith.fptosi %tmp10_63 : tensor<1x1024xf32> to tensor<1x1024xi64> loc(#loc152)
+      %tmp10_65 = tt.load %tmp10_58, %tmp10_61, %tmp10_64 evictionPolicy = evict_last : tensor<1x1024x!tt.ptr<i64>> loc(#loc152)
+      %tmp11 = arith.cmpi slt, %tmp8_56, %tmp10_65 : tensor<1x1024xi64> loc(#loc153)
+      %tmp12 = arith.cmpi slt, %tmp7_50, %tmp10_65 : tensor<1x1024xi64> loc(#loc154)
+      %tmp13 = arith.andi %tmp11, %tmp12 : tensor<1x1024xi1> loc(#loc155)
+      %tmp14 = arith.andi %tmp9, %tmp13 : tensor<1x1024xi1> loc(#loc156)
+      %tmp15 = arith.constant false loc(#loc157)
+      %tmp15_66 = arith.constant dense<false> : tensor<1x1xi1> loc(#loc157)
+      %tmp16 = arith.constant dense<false> : tensor<1x1024xi1> loc(#loc158)
+      %tmp16_67 = arith.ori %tmp16, %tmp14 : tensor<1x1024xi1> loc(#loc158)
+      %tmp17 = tt.splat %ks5 : i64 -> tensor<1x1024xi64> loc(#loc159)
+      %tmp18 = arith.cmpi sge, %tmp8_56, %tmp17 : tensor<1x1024xi64> loc(#loc160)
+      %tmp19 = arith.remsi %tmp8_56, %tmp17 : tensor<1x1024xi64> loc(#loc161)
+      %tmp20 = arith.constant 0 : i32 loc(#loc162)
+      %tmp20_68 = arith.constant dense<0> : tensor<1x1xi32> loc(#loc162)
+      %tmp21 = arith.extsi %tmp20_68 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc163)
+      %tmp21_69 = tt.broadcast %tmp21 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc163)
+      %tmp21_70 = arith.cmpi ne, %tmp19, %tmp21_69 : tensor<1x1024xi64> loc(#loc163)
+      %tmp22 = arith.constant 0 : i32 loc(#loc164)
+      %tmp22_71 = arith.extsi %tmp22 : i32 to i64 loc(#loc164)
+      %tmp22_72 = tt.splat %tmp22_71 : i64 -> tensor<1x1024xi64> loc(#loc164)
+      %tmp22_73 = arith.cmpi slt, %tmp19, %tmp22_72 : tensor<1x1024xi64> loc(#loc164)
+      %tmp23 = arith.constant 0 : i32 loc(#loc165)
+      %tmp23_74 = arith.extsi %tmp23 : i32 to i64 loc(#loc165)
+      %tmp23_75 = tt.splat %tmp23_74 : i64 -> tensor<1x1024xi64> loc(#loc165)
+      %tmp23_76 = arith.cmpi slt, %tmp17, %tmp23_75 : tensor<1x1024xi64> loc(#loc165)
+      %tmp24 = arith.cmpi ne, %tmp22_73, %tmp23_76 : tensor<1x1024xi1> loc(#loc166)
+      %tmp25 = arith.andi %tmp21_70, %tmp24 : tensor<1x1024xi1> loc(#loc167)
+      %tmp26 = arith.addi %tmp19, %tmp17 : tensor<1x1024xi64> loc(#loc168)
+      %tmp27 = arith.select %tmp25, %tmp26, %tmp19 : tensor<1x1024xi1>, tensor<1x1024xi64> loc(#loc169)
+      %tmp28 = arith.cmpi slt, %tmp27, %tmp10_65 : tensor<1x1024xi64> loc(#loc170)
+      %tmp29 = arith.andi %tmp18, %tmp28 : tensor<1x1024xi1> loc(#loc171)
+      %tmp30 = arith.constant -1 : i32 loc(#loc172)
+      %tmp30_77 = arith.constant -1 : i32 loc(#loc172)
+      %tmp30_78 = arith.constant dense<-1> : tensor<1x1024xi32> loc(#loc172)
+      %tmp30_79 = arith.muli %tmp30_78, %r0_4_27 : tensor<1x1024xi32> loc(#loc172)
+      %tmp30_80 = arith.addi %r0_3_30, %tmp30_79 : tensor<1x1024xi32> loc(#loc173)
+      %tmp30_81 = arith.constant -128 : i32 loc(#loc174)
+      %tmp30_82 = arith.constant -128 : i64 loc(#loc174)
+      %tmp30_83 = arith.constant dense<-128> : tensor<1x1xi64> loc(#loc174)
+      %tmp30_84 = arith.muli %tmp30_83, %x1_12 : tensor<1x1xi64> loc(#loc174)
+      %tmp30_85 = arith.extsi %tmp30_80 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc175)
+      %tmp30_86 = tt.broadcast %tmp30_84 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc175)
+      %tmp30_87 = arith.addi %tmp30_85, %tmp30_86 : tensor<1x1024xi64> loc(#loc175)
+      %tmp30_88 = arith.constant 128 : i32 loc(#loc176)
+      %tmp30_89 = arith.constant 128 : i64 loc(#loc176)
+      %tmp30_90 = arith.constant dense<128> : tensor<1x1xi64> loc(#loc176)
+      %tmp30_91 = arith.muli %tmp30_90, %x0_14 : tensor<1x1xi64> loc(#loc176)
+      %tmp30_92 = tt.broadcast %tmp30_91 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc177)
+      %tmp30_93 = arith.addi %tmp30_87, %tmp30_92 : tensor<1x1024xi64> loc(#loc177)
+      %tmp31 = arith.remsi %tmp30_93, %tmp17 : tensor<1x1024xi64> loc(#loc178)
+      %tmp32 = arith.extsi %tmp20_68 : tensor<1x1xi32> to tensor<1x1xi64> loc(#loc179)
+      %tmp32_94 = tt.broadcast %tmp32 : tensor<1x1xi64> -> tensor<1x1024xi64> loc(#loc179)
+      %tmp32_95 = arith.cmpi ne, %tmp31, %tmp32_94 : tensor<1x1024xi64> loc(#loc179)
+      %tmp33 = arith.constant 0 : i32 loc(#loc180)
+      %tmp33_96 = arith.extsi %tmp33 : i32 to i64 loc(#loc180)
+      %tmp33_97 = tt.splat %tmp33_96 : i64 -> tensor<1x1024xi64> loc(#loc180)
+      %tmp33_98 = arith.cmpi slt, %tmp31, %tmp33_97 : tensor<1x1024xi64> loc(#loc180)
+      %tmp34 = arith.cmpi ne, %tmp33_98, %tmp23_76 : tensor<1x1024xi1> loc(#loc181)
+      %tmp35 = arith.andi %tmp32_95, %tmp34 : tensor<1x1024xi1> loc(#loc182)
+      %tmp36 = arith.addi %tmp31, %tmp17 : tensor<1x1024xi64> loc(#loc183)
+      %tmp37 = arith.select %tmp35, %tmp36, %tmp31 : tensor<1x1024xi1>, tensor<1x1024xi64> loc(#loc184)
+      %tmp38 = arith.constant 0 : i64 loc(#loc185)
+      %tmp38_99 = arith.constant dense<0> : tensor<1x1xi64> loc(#loc185)
+      %tmp39 = arith.constant dense<0> : tensor<1x1024xi64> loc(#loc186)
+      %tmp39_100 = arith.cmpi eq, %tmp37, %tmp39 : tensor<1x1024xi64> loc(#loc186)
+      %tmp40 = arith.andi %tmp29, %tmp39_100 : tensor<1x1024xi1> loc(#loc187)
+      %tmp41 = arith.ori %tmp16_67, %tmp40 : tensor<1x1024xi1> loc(#loc188)
+      %tmp42 = arith.constant false loc(#loc189)
+      %tmp42_101 = arith.constant dense<false> : tensor<1x1024xi1> loc(#loc189)
+      %tmp43 = arith.select %tmp6, %tmp41, %tmp42_101 : tensor<1x1024xi1>, tensor<1x1024xi1> loc(#loc190)
+      %tmp44 = arith.extui %tmp43 : tensor<1x1024xi1> to tensor<1x1024xi64> loc(#loc191)
+      %tmp47 = arith.addi %_tmp46_22, %tmp44 : tensor<1x1024xi64> loc(#loc192)
+      %_tmp46_102 = tt.broadcast %xmask_7 : tensor<1x1xi1> -> tensor<1x1024xi1> loc(#loc193)
+      %_tmp46_103 = arith.andi %r0_mask_24, %_tmp46_102 : tensor<1x1024xi1> loc(#loc193)
+      %_tmp46_104 = arith.select %_tmp46_103, %tmp47, %_tmp46_22 : tensor<1x1024xi1>, tensor<1x1024xi64> loc(#loc194)
+      scf.yield %_tmp46_104 : tensor<1x1024xi64> loc(#loc79)
+    } loc(#loc131)
+    %tmp46 = tt.call @"triton.language.standard.sum__i64S1_1024S__(1,)cconstexpr_1__(2,)cconstexpr_False__(3,)cNone"(%_tmp46_18) : (tensor<1x1024xi64>) -> tensor<1xi64> loc(#loc195)
+    %tmp46_19 = tt.expand_dims %tmp46 {axis = 1 : i32} : tensor<1xi64> -> tensor<1x1xi64> loc(#loc196)
+    %tmp48 = arith.constant 0 : i64 loc(#loc197)
+    %tmp48_20 = arith.constant dense<0> : tensor<1x1xi64> loc(#loc197)
+    %tmp49 = arith.cmpi sgt, %tmp46_19, %tmp48_20 : tensor<1x1xi64> loc(#loc198)
+    %tmp50 = arith.constant 16384 : i64 loc(#loc199)
+    %tmp50_21 = arith.constant dense<16384> : tensor<1x1xi64> loc(#loc199)
+    %tmp51 = arith.cmpi slt, %tmp46_19, %tmp50_21 : tensor<1x1xi64> loc(#loc200)
+    %tmp52 = arith.andi %tmp49, %tmp51 : tensor<1x1xi1> loc(#loc201)
+    %tmp53 = arith.extui %tmp52 : tensor<1x1xi1> to tensor<1x1xi8> loc(#loc202)
+    %tmp54 = arith.extsi %tmp53 : tensor<1x1xi8> to tensor<1x1xi32> loc(#loc203)
+    %tmp55 = arith.cmpi eq, %tmp46_19, %tmp50_21 : tensor<1x1xi64> loc(#loc204)
+    %tmp56 = arith.extui %tmp55 : tensor<1x1xi1> to tensor<1x1xi8> loc(#loc205)
+    %tmp57 = arith.extsi %tmp56 : tensor<1x1xi8> to tensor<1x1xi32> loc(#loc206)
+    %4 = tt.splat %out_ptr1 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>> loc(#loc92)
+    %5 = tt.addptr %4, %xindex_6 : tensor<1x1x!tt.ptr<i32>>, tensor<1x1xi32> loc(#loc92)
+    tt.store %5, %tmp54, %xmask_7 : tensor<1x1x!tt.ptr<i32>> loc(#loc93)
+    %6 = tt.splat %out_ptr2 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>> loc(#loc94)
+    %7 = tt.addptr %6, %xindex_6 : tensor<1x1x!tt.ptr<i32>>, tensor<1x1xi32> loc(#loc94)
+    tt.store %7, %tmp57, %xmask_7 : tensor<1x1x!tt.ptr<i32>> loc(#loc95)
+    tt.return loc(#loc96)
+  } loc(#loc)
+  tt.func private @"triton.language.standard.sum__i64S1_1024S__(1,)cconstexpr_1__(2,)cconstexpr_False__(3,)cNone"(%input: tensor<1x1024xi64> loc("input"(#loc97))) -> tensor<1xi64> attributes {noinline = false} {
+    %0 = "tt.reduce"(%input) <{axis = 1 : i32}> ({
+    ^bb0(%arg1: i64 loc(unknown), %arg2: i64 loc(unknown)):
+      %2 = tt.call @triton.language.standard._sum_combine__i64_i64__(%arg1, %arg2) : (i64, i64) -> i64 loc(#loc98)
+      tt.reduce.return %2 : i64 loc(#loc98)
+    }) : (tensor<1x1024xi64>) -> tensor<1xi64> loc(#loc98)
+    tt.return %0 : tensor<1xi64> loc(#loc100)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : tensor<1xi64> loc(#loc101)
+    tt.return %1 : tensor<1xi64> loc(#loc101)
+  } loc(#loc97)
+  tt.func private @triton.language.standard._sum_combine__i64_i64__(%a: i64 loc("a"(#loc102)), %b: i64 loc("b"(#loc102))) -> i64 attributes {noinline = false} {
+    %0 = arith.addi %a, %b : i64 loc(#loc103)
+    tt.return %0 : i64 loc(#loc104)
+  ^bb1:  // no predecessors
+    %1 = ub.poison : i64 loc(#loc105)
+    tt.return %1 : i64 loc(#loc105)
+  } loc(#loc102)
+} loc(#loc)
+#loc1 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":19:15)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":22:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":22:33)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":23:36)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":23:44)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":23:23)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":24:21)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":25:27)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":25:37)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":27:21)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":27:28)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":28:19)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":29:19)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":30:44)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":32:40)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":33:31)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":34:29)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":37:27)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":38:27)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":39:26)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":39:22)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":41:22)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":42:26)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":42:22)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":44:22)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":45:22)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":46:26)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":46:22)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":47:26)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":47:22)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":48:23)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:55)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:35)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:87)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:94)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:77)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":50:23)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":51:23)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":52:24)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":53:23)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":54:39)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":55:24)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":56:37)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":57:24)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":58:24)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":59:35)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":60:25)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":61:92)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":62:92)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":63:25)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":64:24)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":65:24)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":66:39)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":67:24)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":68:24)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:29)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:24)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:45)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:38)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:55)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:51)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":70:25)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":71:25)
+#loc64 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":72:92)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":73:25)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":74:24)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":75:24)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":76:39)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":77:35)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":78:25)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":79:24)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":80:24)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":81:44)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":82:38)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":83:25)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":85:25)
+#loc77 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:36)
+#loc78 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:50)
+#loc79 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:8)
+#loc80 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":87:27)
+#loc81 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":87:30)
+#loc82 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":88:31)
+#loc83 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":89:20)
+#loc84 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":90:35)
+#loc85 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":91:20)
+#loc86 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":92:20)
+#loc87 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":93:21)
+#loc88 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":94:21)
+#loc89 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":95:21)
+#loc90 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":96:21)
+#loc91 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":97:21)
+#loc92 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":98:25)
+#loc93 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":98:37)
+#loc94 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:25)
+#loc95 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:37)
+#loc96 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:4)
+#loc98 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc100 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:11)
+#loc101 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:4)
+#loc103 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc104 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:11)
+#loc105 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:4)
+#loc117 = loc("r0_numel"(#loc1))
+#loc118 = loc("xoffset"(#loc2))
+#loc119 = loc("xoffset"(#loc3))
+#loc120 = loc("xindex"(#loc4))
+#loc121 = loc("xindex"(#loc5))
+#loc122 = loc("xindex"(#loc6))
+#loc123 = loc("xmask"(#loc7))
+#loc124 = loc("r0_base"(#loc8))
+#loc125 = loc("r0_base"(#loc9))
+#loc126 = loc("x1"(#loc10))
+#loc127 = loc("x1"(#loc11))
+#loc128 = loc("x0"(#loc12))
+#loc129 = loc("x2"(#loc13))
+#loc130 = loc("_tmp46"(#loc14))
+#loc131 = loc("_tmp46"(#loc15))
+#loc132 = loc("r0_index"(#loc16))
+#loc133 = loc("r0_mask"(#loc17))
+#loc134 = loc("r0_4"(#loc18))
+#loc135 = loc("r0_3"(#loc19))
+#loc136 = loc("tmp0"(#loc20))
+#loc137 = loc("tmp0"(#loc21))
+#loc138 = loc("tmp2"(#loc22))
+#loc139 = loc("tmp3"(#loc23))
+#loc140 = loc("tmp3"(#loc24))
+#loc141 = loc("tmp5"(#loc25))
+#loc142 = loc("tmp6"(#loc26))
+#loc143 = loc("tmp7"(#loc27))
+#loc144 = loc("tmp7"(#loc28))
+#loc145 = loc("tmp8"(#loc29))
+#loc146 = loc("tmp8"(#loc30))
+#loc147 = loc("tmp9"(#loc31))
+#loc148 = loc("tmp10"(#loc32))
+#loc149 = loc("tmp10"(#loc33))
+#loc150 = loc("tmp10"(#loc34))
+#loc151 = loc("tmp10"(#loc35))
+#loc152 = loc("tmp10"(#loc36))
+#loc153 = loc("tmp11"(#loc37))
+#loc154 = loc("tmp12"(#loc38))
+#loc155 = loc("tmp13"(#loc39))
+#loc156 = loc("tmp14"(#loc40))
+#loc157 = loc("tmp15"(#loc41))
+#loc158 = loc("tmp16"(#loc42))
+#loc159 = loc("tmp17"(#loc43))
+#loc160 = loc("tmp18"(#loc44))
+#loc161 = loc("tmp19"(#loc45))
+#loc162 = loc("tmp20"(#loc46))
+#loc163 = loc("tmp21"(#loc47))
+#loc164 = loc("tmp22"(#loc48))
+#loc165 = loc("tmp23"(#loc49))
+#loc166 = loc("tmp24"(#loc50))
+#loc167 = loc("tmp25"(#loc51))
+#loc168 = loc("tmp26"(#loc52))
+#loc169 = loc("tmp27"(#loc53))
+#loc170 = loc("tmp28"(#loc54))
+#loc171 = loc("tmp29"(#loc55))
+#loc172 = loc("tmp30"(#loc56))
+#loc173 = loc("tmp30"(#loc57))
+#loc174 = loc("tmp30"(#loc58))
+#loc175 = loc("tmp30"(#loc59))
+#loc176 = loc("tmp30"(#loc60))
+#loc177 = loc("tmp30"(#loc61))
+#loc178 = loc("tmp31"(#loc62))
+#loc179 = loc("tmp32"(#loc63))
+#loc180 = loc("tmp33"(#loc64))
+#loc181 = loc("tmp34"(#loc65))
+#loc182 = loc("tmp35"(#loc66))
+#loc183 = loc("tmp36"(#loc67))
+#loc184 = loc("tmp37"(#loc68))
+#loc185 = loc("tmp38"(#loc69))
+#loc186 = loc("tmp39"(#loc70))
+#loc187 = loc("tmp40"(#loc71))
+#loc188 = loc("tmp41"(#loc72))
+#loc189 = loc("tmp42"(#loc73))
+#loc190 = loc("tmp43"(#loc74))
+#loc191 = loc("tmp44"(#loc75))
+#loc192 = loc("tmp47"(#loc76))
+#loc193 = loc("_tmp46"(#loc77))
+#loc194 = loc("_tmp46"(#loc78))
+#loc195 = loc("tmp46"(#loc80))
+#loc196 = loc("tmp46"(#loc81))
+#loc197 = loc("tmp48"(#loc82))
+#loc198 = loc("tmp49"(#loc83))
+#loc199 = loc("tmp50"(#loc84))
+#loc200 = loc("tmp51"(#loc85))
+#loc201 = loc("tmp52"(#loc86))
+#loc202 = loc("tmp53"(#loc87))
+#loc203 = loc("tmp54"(#loc88))
+#loc204 = loc("tmp55"(#loc89))
+#loc205 = loc("tmp56"(#loc90))
+#loc206 = loc("tmp57"(#loc91))

	@@ -0,0 +1,280 @@

+#blocked = #ttg.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 32], warpsPerCTA = [1, 16], order = [0, 1]}>
+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":18:0)
+#loc1 = loc(unknown)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":87:27)
+#loc79 = loc("in_ptr0"(#loc))
+#loc80 = loc("out_ptr1"(#loc))
+#loc81 = loc("out_ptr2"(#loc))
+#loc82 = loc("ks0"(#loc))
+#loc83 = loc("ks1"(#loc))
+#loc84 = loc("ks2"(#loc))
+#loc85 = loc("ks3"(#loc))
+#loc86 = loc("ks4"(#loc))
+#loc87 = loc("ks5"(#loc))
+#loc88 = loc("xnumel"(#loc))
+#loc89 = loc("r0_numel"(#loc))
+#loc149 = loc("tmp46"(#loc63))
+#loc164 = loc(callsite(#loc1 at #loc149))
+module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 16 : i32, ttg.target = "cuda:90", "ttg.threads-per-warp" = 32 : i32} {
+  tt.func public @triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1(%in_ptr0: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr1: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr1"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %ks2: i64 loc("ks2"(#loc)), %ks3: i64 loc("ks3"(#loc)), %ks4: i64 loc("ks4"(#loc)), %ks5: i64 loc("ks5"(#loc)), %xnumel: i32 loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %cst = arith.constant dense<128> : tensor<1x1024xi32, #blocked> loc(#loc1)
+    %cst_0 = arith.constant dense<16384> : tensor<1x1024xi32, #blocked> loc(#loc1)
+    %c-128_i64 = arith.constant -128 : i64 loc(#loc1)
+    %c0_i64 = arith.constant 0 : i64 loc(#loc1)
+    %c128_i64 = arith.constant 128 : i64 loc(#loc1)
+    %c1024_i32 = arith.constant 1024 : i32 loc(#loc1)
+    %c16384_i32 = arith.constant 16384 : i32 loc(#loc1)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc1)
+    %cst_1 = arith.constant dense<16384> : tensor<1x1xi64, #blocked> loc(#loc1)
+    %cst_2 = arith.constant dense<0> : tensor<1x1xi64, #blocked> loc(#loc1)
+    %cst_3 = arith.constant dense<false> : tensor<1x1024xi1, #blocked> loc(#loc1)
+    %cst_4 = arith.constant dense<0> : tensor<1x1024xi64, #blocked> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc90)
+    %xmask = arith.cmpi slt, %xoffset, %xnumel : i32 loc(#loc91)
+    %r0_base = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked}>> loc(#loc92)
+    %r0_base_5 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<1024xi32, #ttg.slice<{dim = 0, parent = #blocked}>> -> tensor<1x1024xi32, #blocked> loc(#loc92)
+    %x1 = arith.extsi %xoffset : i32 to i64 loc(#loc93)
+    %x1_6 = arith.divsi %x1, %ks0 : i64 loc(#loc93)
+    %x1_7 = arith.remsi %x1_6, %ks1 : i64 loc(#loc94)
+    %x0 = arith.remsi %x1, %ks0 : i64 loc(#loc95)
+    %x2 = arith.divsi %x1, %ks4 : i64 loc(#loc96)
+    %tmp0 = arith.muli %x1_7, %c128_i64 : i64 loc(#loc97)
+    %tmp0_8 = tt.splat %tmp0 : i64 -> tensor<1x1024xi64, #blocked> loc(#loc159)
+    %tmp2 = tt.splat %ks2 : i64 -> tensor<1x1024xi64, #blocked> loc(#loc99)
+    %tmp3 = arith.muli %x0, %c128_i64 : i64 loc(#loc100)
+    %tmp3_9 = tt.splat %tmp3 : i64 -> tensor<1x1024xi64, #blocked> loc(#loc160)
+    %tmp5 = tt.splat %ks3 : i64 -> tensor<1x1024xi64, #blocked> loc(#loc102)
+    %tmp10 = tt.addptr %in_ptr0, %x2 : !tt.ptr<i64>, i64 loc(#loc103)
+    %tmp10_10 = tt.splat %xmask : i1 -> tensor<1x1024xi1, #blocked> loc(#loc161)
+    %tmp10_11 = tt.splat %tmp10 : !tt.ptr<i64> -> tensor<1x1024x!tt.ptr<i64>, #blocked> loc(#loc105)
+    %tmp17 = tt.splat %ks5 : i64 -> tensor<1x1024xi64, #blocked> loc(#loc106)
+    %tmp23 = arith.cmpi slt, %ks5, %c0_i64 : i64 loc(#loc107)
+    %tmp23_12 = tt.splat %tmp23 : i1 -> tensor<1x1024xi1, #blocked> loc(#loc107)
+    %tmp30 = arith.muli %x1_7, %c-128_i64 : i64 loc(#loc108)
+    %tmp30_13 = tt.splat %tmp30 : i64 -> tensor<1x1024xi64, #blocked> loc(#loc162)
+    %_tmp46 = scf.for %_tmp46_15 = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%arg12 = %cst_4) -> (tensor<1x1024xi64, #blocked>)  : i32 {
+      %r0_index = tt.splat %_tmp46_15 : i32 -> tensor<1x1024xi32, #blocked> loc(#loc111)
+      %r0_index_16 = arith.addi %r0_index, %r0_base_5 : tensor<1x1024xi32, #blocked> loc(#loc111)
+      %r0_mask = arith.cmpi slt, %r0_index_16, %cst_0 : tensor<1x1024xi32, #blocked> loc(#loc112)
+      %r0_4 = arith.divsi %r0_index_16, %cst : tensor<1x1024xi32, #blocked> loc(#loc113)
+      %r0_3 = arith.remsi %r0_index_16, %cst : tensor<1x1024xi32, #blocked> loc(#loc114)
+      %tmp0_17 = arith.extsi %r0_4 : tensor<1x1024xi32, #blocked> to tensor<1x1024xi64, #blocked> loc(#loc98)
+      %tmp0_18 = arith.addi %tmp0_17, %tmp0_8 : tensor<1x1024xi64, #blocked> loc(#loc98)
+      %tmp2_19 = arith.cmpi slt, %tmp0_18, %tmp2 : tensor<1x1024xi64, #blocked> loc(#loc99)
+      %tmp3_20 = arith.extsi %r0_3 : tensor<1x1024xi32, #blocked> to tensor<1x1024xi64, #blocked> loc(#loc101)
+      %tmp3_21 = arith.addi %tmp3_20, %tmp3_9 : tensor<1x1024xi64, #blocked> loc(#loc101)
+      %tmp5_22 = arith.cmpi slt, %tmp3_21, %tmp5 : tensor<1x1024xi64, #blocked> loc(#loc102)
+      %tmp6 = arith.andi %tmp2_19, %tmp5_22 : tensor<1x1024xi1, #blocked> loc(#loc115)
+      %tmp9 = arith.cmpi sge, %tmp0_18, %tmp3_21 : tensor<1x1024xi64, #blocked> loc(#loc116)
+      %tmp10_23 = arith.andi %r0_mask, %tmp6 : tensor<1x1024xi1, #blocked> loc(#loc117)
+      %tmp10_24 = arith.andi %tmp10_23, %tmp10_10 : tensor<1x1024xi1, #blocked> loc(#loc104)
+      %tmp10_25 = tt.load %tmp10_11, %tmp10_24, %cst_4 evictionPolicy = evict_last : tensor<1x1024x!tt.ptr<i64>, #blocked> loc(#loc105)
+      %tmp11 = arith.cmpi slt, %tmp3_21, %tmp10_25 : tensor<1x1024xi64, #blocked> loc(#loc118)
+      %tmp12 = arith.cmpi slt, %tmp0_18, %tmp10_25 : tensor<1x1024xi64, #blocked> loc(#loc119)
+      %tmp13 = arith.andi %tmp11, %tmp12 : tensor<1x1024xi1, #blocked> loc(#loc120)
+      %tmp14 = arith.andi %tmp9, %tmp13 : tensor<1x1024xi1, #blocked> loc(#loc121)
+      %tmp18 = arith.cmpi sge, %tmp3_21, %tmp17 : tensor<1x1024xi64, #blocked> loc(#loc122)
+      %tmp19 = arith.remsi %tmp3_21, %tmp17 : tensor<1x1024xi64, #blocked> loc(#loc123)
+      %tmp21 = arith.cmpi ne, %tmp19, %cst_4 : tensor<1x1024xi64, #blocked> loc(#loc124)
+      %tmp22 = arith.cmpi slt, %tmp19, %cst_4 : tensor<1x1024xi64, #blocked> loc(#loc125)
+      %tmp24 = arith.cmpi ne, %tmp22, %tmp23_12 : tensor<1x1024xi1, #blocked> loc(#loc126)
+      %tmp25 = arith.andi %tmp21, %tmp24 : tensor<1x1024xi1, #blocked> loc(#loc127)
+      %tmp26 = arith.addi %tmp19, %tmp17 : tensor<1x1024xi64, #blocked> loc(#loc128)
+      %tmp27 = arith.select %tmp25, %tmp26, %tmp19 : tensor<1x1024xi1, #blocked>, tensor<1x1024xi64, #blocked> loc(#loc129)
+      %tmp28 = arith.cmpi slt, %tmp27, %tmp10_25 : tensor<1x1024xi64, #blocked> loc(#loc130)
+      %tmp29 = arith.andi %tmp18, %tmp28 : tensor<1x1024xi1, #blocked> loc(#loc131)
+      %tmp30_26 = arith.subi %r0_3, %r0_4 : tensor<1x1024xi32, #blocked> loc(#loc132)
+      %tmp30_27 = arith.extsi %tmp30_26 : tensor<1x1024xi32, #blocked> to tensor<1x1024xi64, #blocked> loc(#loc109)
+      %tmp30_28 = arith.addi %tmp30_27, %tmp30_13 : tensor<1x1024xi64, #blocked> loc(#loc109)
+      %tmp30_29 = arith.addi %tmp30_28, %tmp3_9 : tensor<1x1024xi64, #blocked> loc(#loc133)
+      %tmp31 = arith.remsi %tmp30_29, %tmp17 : tensor<1x1024xi64, #blocked> loc(#loc134)
+      %tmp32 = arith.cmpi ne, %tmp31, %cst_4 : tensor<1x1024xi64, #blocked> loc(#loc135)
+      %tmp33 = arith.cmpi slt, %tmp31, %cst_4 : tensor<1x1024xi64, #blocked> loc(#loc136)
+      %tmp34 = arith.cmpi ne, %tmp33, %tmp23_12 : tensor<1x1024xi1, #blocked> loc(#loc137)
+      %tmp35 = arith.andi %tmp32, %tmp34 : tensor<1x1024xi1, #blocked> loc(#loc138)
+      %tmp36 = arith.addi %tmp31, %tmp17 : tensor<1x1024xi64, #blocked> loc(#loc139)
+      %tmp37 = arith.select %tmp35, %tmp36, %tmp31 : tensor<1x1024xi1, #blocked>, tensor<1x1024xi64, #blocked> loc(#loc140)
+      %tmp39 = arith.cmpi eq, %tmp37, %cst_4 : tensor<1x1024xi64, #blocked> loc(#loc141)
+      %tmp40 = arith.andi %tmp29, %tmp39 : tensor<1x1024xi1, #blocked> loc(#loc142)
+      %tmp41 = arith.ori %tmp14, %tmp40 : tensor<1x1024xi1, #blocked> loc(#loc143)
+      %tmp43 = arith.select %tmp6, %tmp41, %cst_3 : tensor<1x1024xi1, #blocked>, tensor<1x1024xi1, #blocked> loc(#loc144)
+      %tmp44 = arith.extui %tmp43 : tensor<1x1024xi1, #blocked> to tensor<1x1024xi64, #blocked> loc(#loc145)
+      %tmp47 = arith.addi %arg12, %tmp44 : tensor<1x1024xi64, #blocked> loc(#loc146)
+      %_tmp46_30 = arith.andi %r0_mask, %tmp10_10 : tensor<1x1024xi1, #blocked> loc(#loc147)
+      %_tmp46_31 = arith.select %_tmp46_30, %tmp47, %arg12 : tensor<1x1024xi1, #blocked>, tensor<1x1024xi64, #blocked> loc(#loc148)
+      scf.yield %_tmp46_31 : tensor<1x1024xi64, #blocked> loc(#loc61)
+    } loc(#loc110)
+    %tmp46 = "tt.reduce"(%_tmp46) <{axis = 1 : i32}> ({
+    ^bb0(%tmp46_15: i64 loc(callsite(#loc1 at #loc149)), %tmp46_16: i64 loc(callsite(#loc1 at #loc149))):
+      %tmp46_17 = arith.addi %tmp46_15, %tmp46_16 : i64 loc(#loc167)
+      tt.reduce.return %tmp46_17 : i64 loc(#loc163)
+    }) : (tensor<1x1024xi64, #blocked>) -> tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> loc(#loc163)
+    %tmp46_14 = tt.expand_dims %tmp46 {axis = 1 : i32} : tensor<1xi64, #ttg.slice<{dim = 1, parent = #blocked}>> -> tensor<1x1xi64, #blocked> loc(#loc150)
+    %tmp49 = arith.cmpi sgt, %tmp46_14, %cst_2 : tensor<1x1xi64, #blocked> loc(#loc151)
+    %tmp51 = arith.cmpi slt, %tmp46_14, %cst_1 : tensor<1x1xi64, #blocked> loc(#loc152)
+    %tmp52 = arith.andi %tmp49, %tmp51 : tensor<1x1xi1, #blocked> loc(#loc153)
+    %tmp54 = arith.extui %tmp52 : tensor<1x1xi1, #blocked> to tensor<1x1xi32, #blocked> loc(#loc165)
+    %tmp55 = arith.cmpi eq, %tmp46_14, %cst_1 : tensor<1x1xi64, #blocked> loc(#loc156)
+    %tmp57 = arith.extui %tmp55 : tensor<1x1xi1, #blocked> to tensor<1x1xi32, #blocked> loc(#loc166)
+    %0 = tt.addptr %out_ptr1, %xoffset : !tt.ptr<i32>, i32 loc(#loc74)
+    %1 = tt.splat %0 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>, #blocked> loc(#loc75)
+    %2 = tt.splat %xmask : i1 -> tensor<1x1xi1, #blocked> loc(#loc75)
+    tt.store %1, %tmp54, %2 : tensor<1x1x!tt.ptr<i32>, #blocked> loc(#loc75)
+    %3 = tt.addptr %out_ptr2, %xoffset : !tt.ptr<i32>, i32 loc(#loc76)
+    %4 = tt.splat %3 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>, #blocked> loc(#loc77)
+    tt.store %4, %tmp57, %2 : tensor<1x1x!tt.ptr<i32>, #blocked> loc(#loc77)
+    tt.return loc(#loc78)
+  } loc(#loc)
+} loc(#loc)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":22:28)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":24:21)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":25:37)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":27:21)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":27:28)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":28:19)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":29:19)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":39:26)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":39:22)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":41:22)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":42:26)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":42:22)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":44:22)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:35)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:94)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:77)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":56:37)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":62:92)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:45)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:38)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":32:40)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":33:31)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":34:29)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":37:27)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":38:27)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":45:22)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":48:23)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:87)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":50:23)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":51:23)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":52:24)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":53:23)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":57:24)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":58:24)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":60:25)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":61:92)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":63:25)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":64:24)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":65:24)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":66:39)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":67:24)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":68:24)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:24)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:51)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":70:25)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":71:25)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":72:92)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":73:25)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":74:24)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":75:24)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":76:39)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":78:25)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":79:24)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":80:24)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":82:38)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":83:25)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":85:25)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:36)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:50)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:8)
+#loc62 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc64 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":87:30)
+#loc66 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":89:20)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":91:20)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":92:20)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":94:21)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":93:21)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":95:21)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":97:21)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":96:21)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":98:25)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":98:37)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:25)
+#loc77 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:37)
+#loc78 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:4)
+#loc90 = loc("xoffset"(#loc2))
+#loc91 = loc("xmask"(#loc3))
+#loc92 = loc("r0_base"(#loc4))
+#loc93 = loc("x1"(#loc5))
+#loc94 = loc("x1"(#loc6))
+#loc95 = loc("x0"(#loc7))
+#loc96 = loc("x2"(#loc8))
+#loc97 = loc("tmp0"(#loc9))
+#loc98 = loc("tmp0"(#loc10))
+#loc99 = loc("tmp2"(#loc11))
+#loc100 = loc("tmp3"(#loc12))
+#loc101 = loc("tmp3"(#loc13))
+#loc102 = loc("tmp5"(#loc14))
+#loc103 = loc("tmp10"(#loc15))
+#loc104 = loc("tmp10"(#loc16))
+#loc105 = loc("tmp10"(#loc17))
+#loc106 = loc("tmp17"(#loc18))
+#loc107 = loc("tmp23"(#loc19))
+#loc108 = loc("tmp30"(#loc20))
+#loc109 = loc("tmp30"(#loc21))
+#loc110 = loc("_tmp46"(#loc22))
+#loc111 = loc("r0_index"(#loc23))
+#loc112 = loc("r0_mask"(#loc24))
+#loc113 = loc("r0_4"(#loc25))
+#loc114 = loc("r0_3"(#loc26))
+#loc115 = loc("tmp6"(#loc27))
+#loc116 = loc("tmp9"(#loc28))
+#loc117 = loc("tmp10"(#loc29))
+#loc118 = loc("tmp11"(#loc30))
+#loc119 = loc("tmp12"(#loc31))
+#loc120 = loc("tmp13"(#loc32))
+#loc121 = loc("tmp14"(#loc33))
+#loc122 = loc("tmp18"(#loc34))
+#loc123 = loc("tmp19"(#loc35))
+#loc124 = loc("tmp21"(#loc36))
+#loc125 = loc("tmp22"(#loc37))
+#loc126 = loc("tmp24"(#loc38))
+#loc127 = loc("tmp25"(#loc39))
+#loc128 = loc("tmp26"(#loc40))
+#loc129 = loc("tmp27"(#loc41))
+#loc130 = loc("tmp28"(#loc42))
+#loc131 = loc("tmp29"(#loc43))
+#loc132 = loc("tmp30"(#loc44))
+#loc133 = loc("tmp30"(#loc45))
+#loc134 = loc("tmp31"(#loc46))
+#loc135 = loc("tmp32"(#loc47))
+#loc136 = loc("tmp33"(#loc48))
+#loc137 = loc("tmp34"(#loc49))
+#loc138 = loc("tmp35"(#loc50))
+#loc139 = loc("tmp36"(#loc51))
+#loc140 = loc("tmp37"(#loc52))
+#loc141 = loc("tmp39"(#loc53))
+#loc142 = loc("tmp40"(#loc54))
+#loc143 = loc("tmp41"(#loc55))
+#loc144 = loc("tmp43"(#loc56))
+#loc145 = loc("tmp44"(#loc57))
+#loc146 = loc("tmp47"(#loc58))
+#loc147 = loc("_tmp46"(#loc59))
+#loc148 = loc("_tmp46"(#loc60))
+#loc150 = loc("tmp46"(#loc65))
+#loc151 = loc("tmp49"(#loc66))
+#loc152 = loc("tmp51"(#loc67))
+#loc153 = loc("tmp52"(#loc68))
+#loc154 = loc("tmp54"(#loc69))
+#loc155 = loc("tmp53"(#loc70))
+#loc156 = loc("tmp55"(#loc71))
+#loc157 = loc("tmp57"(#loc72))
+#loc158 = loc("tmp56"(#loc73))
+#loc159 = loc(fused[#loc98, #loc97])
+#loc160 = loc(fused[#loc101, #loc100])
+#loc161 = loc(fused[#loc104, #loc91])
+#loc162 = loc(fused[#loc109, #loc108])
+#loc163 = loc(callsite(#loc62 at #loc149))
+#loc165 = loc(fused[#loc154, #loc155])
+#loc166 = loc(fused[#loc157, #loc158])
+#loc167 = loc(callsite(#loc64 at #loc163))

	@@ -0,0 +1,283 @@

+#loc = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":18:0)
+#loc1 = loc(unknown)
+#loc65 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":87:27)
+#loc81 = loc("in_ptr0"(#loc))
+#loc82 = loc("out_ptr1"(#loc))
+#loc83 = loc("out_ptr2"(#loc))
+#loc84 = loc("ks0"(#loc))
+#loc85 = loc("ks1"(#loc))
+#loc86 = loc("ks2"(#loc))
+#loc87 = loc("ks3"(#loc))
+#loc88 = loc("ks4"(#loc))
+#loc89 = loc("ks5"(#loc))
+#loc90 = loc("xnumel"(#loc))
+#loc91 = loc("r0_numel"(#loc))
+#loc153 = loc("tmp46"(#loc65))
+#loc168 = loc(callsite(#loc1 at #loc153))
+module {
+  tt.func public @triton_red_fused__to_copy_arange_bitwise_and_bitwise_or_constant_pad_nd_eq_ge_gt_index_lt_permute_remainder_sub_sum_view_1(%in_ptr0: !tt.ptr<i64> {tt.divisibility = 16 : i32} loc("in_ptr0"(#loc)), %out_ptr1: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr1"(#loc)), %out_ptr2: !tt.ptr<i32> {tt.divisibility = 16 : i32} loc("out_ptr2"(#loc)), %ks0: i64 loc("ks0"(#loc)), %ks1: i64 loc("ks1"(#loc)), %ks2: i64 loc("ks2"(#loc)), %ks3: i64 loc("ks3"(#loc)), %ks4: i64 loc("ks4"(#loc)), %ks5: i64 loc("ks5"(#loc)), %xnumel: i32 loc("xnumel"(#loc)), %r0_numel: i32 {tt.divisibility = 16 : i32} loc("r0_numel"(#loc))) attributes {noinline = false} {
+    %c-128_i64 = arith.constant -128 : i64 loc(#loc1)
+    %c0_i64 = arith.constant 0 : i64 loc(#loc1)
+    %c128_i64 = arith.constant 128 : i64 loc(#loc1)
+    %c1024_i32 = arith.constant 1024 : i32 loc(#loc2)
+    %c16384_i32 = arith.constant 16384 : i32 loc(#loc2)
+    %c0_i32 = arith.constant 0 : i32 loc(#loc2)
+    %tmp50 = arith.constant dense<16384> : tensor<1x1xi64> loc(#loc92)
+    %cst = arith.constant dense<0> : tensor<1x1xi64> loc(#loc1)
+    %cst_0 = arith.constant dense<false> : tensor<1x1024xi1> loc(#loc1)
+    %cst_1 = arith.constant dense<128> : tensor<1x1024xi32> loc(#loc1)
+    %cst_2 = arith.constant dense<16384> : tensor<1x1024xi32> loc(#loc1)
+    %cst_3 = arith.constant dense<0> : tensor<1x1024xi64> loc(#loc1)
+    %xoffset = tt.get_program_id x : i32 loc(#loc93)
+    %xmask = arith.cmpi slt, %xoffset, %xnumel : i32 loc(#loc94)
+    %xmask_4 = tt.splat %xmask : i1 -> tensor<1x1xi1> loc(#loc94)
+    %r0_base = tt.make_range {end = 1024 : i32, start = 0 : i32} : tensor<1024xi32> loc(#loc95)
+    %r0_base_5 = tt.expand_dims %r0_base {axis = 0 : i32} : tensor<1024xi32> -> tensor<1x1024xi32> loc(#loc96)
+    %x1 = arith.extsi %xoffset : i32 to i64 loc(#loc97)
+    %x1_6 = arith.divsi %x1, %ks0 : i64 loc(#loc97)
+    %x1_7 = arith.remsi %x1_6, %ks1 : i64 loc(#loc98)
+    %x0 = arith.remsi %x1, %ks0 : i64 loc(#loc99)
+    %x2 = arith.divsi %x1, %ks4 : i64 loc(#loc100)
+    %_tmp46 = scf.for %r0_offset = %c0_i32 to %c16384_i32 step %c1024_i32 iter_args(%_tmp46_9 = %cst_3) -> (tensor<1x1024xi64>)  : i32 {
+      %r0_index = tt.splat %r0_offset : i32 -> tensor<1x1024xi32> loc(#loc102)
+      %r0_index_10 = arith.addi %r0_index, %r0_base_5 : tensor<1x1024xi32> loc(#loc102)
+      %r0_mask = arith.cmpi slt, %r0_index_10, %cst_2 : tensor<1x1024xi32> loc(#loc103)
+      %r0_4 = arith.divsi %r0_index_10, %cst_1 : tensor<1x1024xi32> loc(#loc104)
+      %r0_3 = arith.remsi %r0_index_10, %cst_1 : tensor<1x1024xi32> loc(#loc105)
+      %tmp0 = arith.muli %x1_7, %c128_i64 : i64 loc(#loc106)
+      %tmp0_11 = arith.extsi %r0_4 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc107)
+      %tmp0_12 = tt.splat %tmp0 : i64 -> tensor<1x1024xi64> loc(#loc163)
+      %tmp0_13 = arith.addi %tmp0_11, %tmp0_12 : tensor<1x1024xi64> loc(#loc107)
+      %tmp2 = tt.splat %ks2 : i64 -> tensor<1x1024xi64> loc(#loc108)
+      %tmp2_14 = arith.cmpi slt, %tmp0_13, %tmp2 : tensor<1x1024xi64> loc(#loc108)
+      %tmp3 = arith.muli %x0, %c128_i64 : i64 loc(#loc109)
+      %tmp3_15 = arith.extsi %r0_3 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc110)
+      %tmp3_16 = tt.splat %tmp3 : i64 -> tensor<1x1024xi64> loc(#loc164)
+      %tmp3_17 = arith.addi %tmp3_15, %tmp3_16 : tensor<1x1024xi64> loc(#loc110)
+      %tmp5 = tt.splat %ks3 : i64 -> tensor<1x1024xi64> loc(#loc111)
+      %tmp5_18 = arith.cmpi slt, %tmp3_17, %tmp5 : tensor<1x1024xi64> loc(#loc111)
+      %tmp6 = arith.andi %tmp2_14, %tmp5_18 : tensor<1x1024xi1> loc(#loc112)
+      %tmp9 = arith.cmpi sge, %tmp0_13, %tmp3_17 : tensor<1x1024xi64> loc(#loc113)
+      %tmp10 = tt.addptr %in_ptr0, %x2 : !tt.ptr<i64>, i64 loc(#loc114)
+      %tmp10_19 = tt.splat %tmp10 : !tt.ptr<i64> -> tensor<1x1024x!tt.ptr<i64>> loc(#loc114)
+      %tmp10_20 = arith.andi %r0_mask, %tmp6 : tensor<1x1024xi1> loc(#loc115)
+      %tmp10_21 = tt.splat %xmask : i1 -> tensor<1x1024xi1> loc(#loc165)
+      %tmp10_22 = arith.andi %tmp10_20, %tmp10_21 : tensor<1x1024xi1> loc(#loc116)
+      %tmp10_23 = tt.load %tmp10_19, %tmp10_22, %cst_3 evictionPolicy = evict_last : tensor<1x1024x!tt.ptr<i64>> loc(#loc117)
+      %tmp11 = arith.cmpi slt, %tmp3_17, %tmp10_23 : tensor<1x1024xi64> loc(#loc118)
+      %tmp12 = arith.cmpi slt, %tmp0_13, %tmp10_23 : tensor<1x1024xi64> loc(#loc119)
+      %tmp13 = arith.andi %tmp11, %tmp12 : tensor<1x1024xi1> loc(#loc120)
+      %tmp14 = arith.andi %tmp9, %tmp13 : tensor<1x1024xi1> loc(#loc121)
+      %tmp17 = tt.splat %ks5 : i64 -> tensor<1x1024xi64> loc(#loc122)
+      %tmp18 = arith.cmpi sge, %tmp3_17, %tmp17 : tensor<1x1024xi64> loc(#loc123)
+      %tmp19 = arith.remsi %tmp3_17, %tmp17 : tensor<1x1024xi64> loc(#loc124)
+      %tmp21 = arith.cmpi ne, %tmp19, %cst_3 : tensor<1x1024xi64> loc(#loc125)
+      %tmp22 = arith.cmpi slt, %tmp19, %cst_3 : tensor<1x1024xi64> loc(#loc126)
+      %tmp23 = arith.cmpi slt, %ks5, %c0_i64 : i64 loc(#loc127)
+      %tmp23_24 = tt.splat %tmp23 : i1 -> tensor<1x1024xi1> loc(#loc127)
+      %tmp24 = arith.cmpi ne, %tmp22, %tmp23_24 : tensor<1x1024xi1> loc(#loc128)
+      %tmp25 = arith.andi %tmp21, %tmp24 : tensor<1x1024xi1> loc(#loc129)
+      %tmp26 = arith.addi %tmp19, %tmp17 : tensor<1x1024xi64> loc(#loc130)
+      %tmp27 = arith.select %tmp25, %tmp26, %tmp19 : tensor<1x1024xi1>, tensor<1x1024xi64> loc(#loc131)
+      %tmp28 = arith.cmpi slt, %tmp27, %tmp10_23 : tensor<1x1024xi64> loc(#loc132)
+      %tmp29 = arith.andi %tmp18, %tmp28 : tensor<1x1024xi1> loc(#loc133)
+      %tmp30 = arith.subi %r0_3, %r0_4 : tensor<1x1024xi32> loc(#loc134)
+      %tmp30_25 = arith.muli %x1_7, %c-128_i64 : i64 loc(#loc135)
+      %tmp30_26 = arith.extsi %tmp30 : tensor<1x1024xi32> to tensor<1x1024xi64> loc(#loc136)
+      %tmp30_27 = tt.splat %tmp30_25 : i64 -> tensor<1x1024xi64> loc(#loc166)
+      %tmp30_28 = arith.addi %tmp30_26, %tmp30_27 : tensor<1x1024xi64> loc(#loc136)
+      %tmp30_29 = arith.addi %tmp30_28, %tmp3_16 : tensor<1x1024xi64> loc(#loc137)
+      %tmp31 = arith.remsi %tmp30_29, %tmp17 : tensor<1x1024xi64> loc(#loc138)
+      %tmp32 = arith.cmpi ne, %tmp31, %cst_3 : tensor<1x1024xi64> loc(#loc139)
+      %tmp33 = arith.cmpi slt, %tmp31, %cst_3 : tensor<1x1024xi64> loc(#loc140)
+      %tmp34 = arith.cmpi ne, %tmp33, %tmp23_24 : tensor<1x1024xi1> loc(#loc141)
+      %tmp35 = arith.andi %tmp32, %tmp34 : tensor<1x1024xi1> loc(#loc142)
+      %tmp36 = arith.addi %tmp31, %tmp17 : tensor<1x1024xi64> loc(#loc143)
+      %tmp37 = arith.select %tmp35, %tmp36, %tmp31 : tensor<1x1024xi1>, tensor<1x1024xi64> loc(#loc144)
+      %tmp39 = arith.cmpi eq, %tmp37, %cst_3 : tensor<1x1024xi64> loc(#loc145)
+      %tmp40 = arith.andi %tmp29, %tmp39 : tensor<1x1024xi1> loc(#loc146)
+      %tmp41 = arith.ori %tmp14, %tmp40 : tensor<1x1024xi1> loc(#loc147)
+      %tmp43 = arith.select %tmp6, %tmp41, %cst_0 : tensor<1x1024xi1>, tensor<1x1024xi1> loc(#loc148)
+      %tmp44 = arith.extui %tmp43 : tensor<1x1024xi1> to tensor<1x1024xi64> loc(#loc149)
+      %tmp47 = arith.addi %_tmp46_9, %tmp44 : tensor<1x1024xi64> loc(#loc150)
+      %_tmp46_30 = arith.andi %r0_mask, %tmp10_21 : tensor<1x1024xi1> loc(#loc151)
+      %_tmp46_31 = arith.select %_tmp46_30, %tmp47, %_tmp46_9 : tensor<1x1024xi1>, tensor<1x1024xi64> loc(#loc152)
+      scf.yield %_tmp46_31 : tensor<1x1024xi64> loc(#loc63)
+    } loc(#loc101)
+    %tmp46 = "tt.reduce"(%_tmp46) <{axis = 1 : i32}> ({
+    ^bb0(%tmp46_9: i64 loc(callsite(#loc1 at #loc153)), %tmp46_10: i64 loc(callsite(#loc1 at #loc153))):
+      %tmp46_11 = arith.addi %tmp46_9, %tmp46_10 : i64 loc(#loc171)
+      tt.reduce.return %tmp46_11 : i64 loc(#loc167)
+    }) : (tensor<1x1024xi64>) -> tensor<1xi64> loc(#loc167)
+    %tmp46_8 = tt.expand_dims %tmp46 {axis = 1 : i32} : tensor<1xi64> -> tensor<1x1xi64> loc(#loc154)
+    %tmp49 = arith.cmpi sgt, %tmp46_8, %cst : tensor<1x1xi64> loc(#loc155)
+    %tmp51 = arith.cmpi slt, %tmp46_8, %tmp50 : tensor<1x1xi64> loc(#loc156)
+    %tmp52 = arith.andi %tmp49, %tmp51 : tensor<1x1xi1> loc(#loc157)
+    %tmp54 = arith.extui %tmp52 : tensor<1x1xi1> to tensor<1x1xi32> loc(#loc169)
+    %tmp55 = arith.cmpi eq, %tmp46_8, %tmp50 : tensor<1x1xi64> loc(#loc160)
+    %tmp57 = arith.extui %tmp55 : tensor<1x1xi1> to tensor<1x1xi32> loc(#loc170)
+    %0 = tt.addptr %out_ptr1, %xoffset : !tt.ptr<i32>, i32 loc(#loc76)
+    %1 = tt.splat %0 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>> loc(#loc76)
+    tt.store %1, %tmp54, %xmask_4 : tensor<1x1x!tt.ptr<i32>> loc(#loc77)
+    %2 = tt.addptr %out_ptr2, %xoffset : !tt.ptr<i32>, i32 loc(#loc78)
+    %3 = tt.splat %2 : !tt.ptr<i32> -> tensor<1x1x!tt.ptr<i32>> loc(#loc78)
+    tt.store %3, %tmp57, %xmask_4 : tensor<1x1x!tt.ptr<i32>> loc(#loc79)
+    tt.return loc(#loc80)
+  } loc(#loc)
+} loc(#loc)
+#loc2 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":32:40)
+#loc3 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":90:35)
+#loc4 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":22:28)
+#loc5 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":24:21)
+#loc6 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":25:27)
+#loc7 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":25:37)
+#loc8 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":27:21)
+#loc9 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":27:28)
+#loc10 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":28:19)
+#loc11 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":29:19)
+#loc12 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":33:31)
+#loc13 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":34:29)
+#loc14 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":37:27)
+#loc15 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":38:27)
+#loc16 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":39:26)
+#loc17 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":39:22)
+#loc18 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":41:22)
+#loc19 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":42:26)
+#loc20 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":42:22)
+#loc21 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":44:22)
+#loc22 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":45:22)
+#loc23 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":48:23)
+#loc24 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:35)
+#loc25 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:87)
+#loc26 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:94)
+#loc27 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":49:77)
+#loc28 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":50:23)
+#loc29 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":51:23)
+#loc30 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":52:24)
+#loc31 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":53:23)
+#loc32 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":56:37)
+#loc33 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":57:24)
+#loc34 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":58:24)
+#loc35 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":60:25)
+#loc36 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":61:92)
+#loc37 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":62:92)
+#loc38 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":63:25)
+#loc39 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":64:24)
+#loc40 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":65:24)
+#loc41 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":66:39)
+#loc42 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":67:24)
+#loc43 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":68:24)
+#loc44 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:24)
+#loc45 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:45)
+#loc46 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:38)
+#loc47 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":69:51)
+#loc48 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":70:25)
+#loc49 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":71:25)
+#loc50 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":72:92)
+#loc51 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":73:25)
+#loc52 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":74:24)
+#loc53 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":75:24)
+#loc54 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":76:39)
+#loc55 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":78:25)
+#loc56 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":79:24)
+#loc57 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":80:24)
+#loc58 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":82:38)
+#loc59 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":83:25)
+#loc60 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":85:25)
+#loc61 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:36)
+#loc62 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:50)
+#loc63 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":86:8)
+#loc64 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":291:36)
+#loc66 = loc("/workspace/specforge/lib/python3.11/site-packages/triton/language/standard.py":261:15)
+#loc67 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":87:30)
+#loc68 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":89:20)
+#loc69 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":91:20)
+#loc70 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":92:20)
+#loc71 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":94:21)
+#loc72 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":93:21)
+#loc73 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":95:21)
+#loc74 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":97:21)
+#loc75 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":96:21)
+#loc76 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":98:25)
+#loc77 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":98:37)
+#loc78 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:25)
+#loc79 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:37)
+#loc80 = loc("/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/av/cavp7xan77tfr7qytfkp6sjrgkd6hvruiaqfzkeibtl5rtagscng.py":99:4)
+#loc92 = loc("tmp50"(#loc3))
+#loc93 = loc("xoffset"(#loc4))
+#loc94 = loc("xmask"(#loc5))
+#loc95 = loc("r0_base"(#loc6))
+#loc96 = loc("r0_base"(#loc7))
+#loc97 = loc("x1"(#loc8))
+#loc98 = loc("x1"(#loc9))
+#loc99 = loc("x0"(#loc10))
+#loc100 = loc("x2"(#loc11))
+#loc101 = loc("_tmp46"(#loc2))
+#loc102 = loc("r0_index"(#loc12))
+#loc103 = loc("r0_mask"(#loc13))
+#loc104 = loc("r0_4"(#loc14))
+#loc105 = loc("r0_3"(#loc15))
+#loc106 = loc("tmp0"(#loc16))
+#loc107 = loc("tmp0"(#loc17))
+#loc108 = loc("tmp2"(#loc18))
+#loc109 = loc("tmp3"(#loc19))
+#loc110 = loc("tmp3"(#loc20))
+#loc111 = loc("tmp5"(#loc21))
+#loc112 = loc("tmp6"(#loc22))
+#loc113 = loc("tmp9"(#loc23))
+#loc114 = loc("tmp10"(#loc24))
+#loc115 = loc("tmp10"(#loc25))
+#loc116 = loc("tmp10"(#loc26))
+#loc117 = loc("tmp10"(#loc27))
+#loc118 = loc("tmp11"(#loc28))
+#loc119 = loc("tmp12"(#loc29))
+#loc120 = loc("tmp13"(#loc30))
+#loc121 = loc("tmp14"(#loc31))
+#loc122 = loc("tmp17"(#loc32))
+#loc123 = loc("tmp18"(#loc33))
+#loc124 = loc("tmp19"(#loc34))
+#loc125 = loc("tmp21"(#loc35))
+#loc126 = loc("tmp22"(#loc36))
+#loc127 = loc("tmp23"(#loc37))
+#loc128 = loc("tmp24"(#loc38))
+#loc129 = loc("tmp25"(#loc39))
+#loc130 = loc("tmp26"(#loc40))
+#loc131 = loc("tmp27"(#loc41))
+#loc132 = loc("tmp28"(#loc42))
+#loc133 = loc("tmp29"(#loc43))
+#loc134 = loc("tmp30"(#loc44))
+#loc135 = loc("tmp30"(#loc45))
+#loc136 = loc("tmp30"(#loc46))
+#loc137 = loc("tmp30"(#loc47))
+#loc138 = loc("tmp31"(#loc48))
+#loc139 = loc("tmp32"(#loc49))
+#loc140 = loc("tmp33"(#loc50))
+#loc141 = loc("tmp34"(#loc51))
+#loc142 = loc("tmp35"(#loc52))
+#loc143 = loc("tmp36"(#loc53))
+#loc144 = loc("tmp37"(#loc54))
+#loc145 = loc("tmp39"(#loc55))
+#loc146 = loc("tmp40"(#loc56))
+#loc147 = loc("tmp41"(#loc57))
+#loc148 = loc("tmp43"(#loc58))
+#loc149 = loc("tmp44"(#loc59))
+#loc150 = loc("tmp47"(#loc60))
+#loc151 = loc("_tmp46"(#loc61))
+#loc152 = loc("_tmp46"(#loc62))
+#loc154 = loc("tmp46"(#loc67))
+#loc155 = loc("tmp49"(#loc68))
+#loc156 = loc("tmp51"(#loc69))
+#loc157 = loc("tmp52"(#loc70))
+#loc158 = loc("tmp54"(#loc71))
+#loc159 = loc("tmp53"(#loc72))
+#loc160 = loc("tmp55"(#loc73))
+#loc161 = loc("tmp57"(#loc74))
+#loc162 = loc("tmp56"(#loc75))
+#loc163 = loc(fused[#loc107, #loc106])
+#loc164 = loc(fused[#loc110, #loc109])
+#loc165 = loc(fused[#loc116, #loc94])
+#loc166 = loc(fused[#loc136, #loc135])
+#loc167 = loc(callsite(#loc64 at #loc153))
+#loc169 = loc(fused[#loc158, #loc159])
+#loc170 = loc(fused[#loc161, #loc162])
+#loc171 = loc(callsite(#loc66 at #loc167))

SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/__grp__triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json ADDED

	@@ -0,0 +1 @@

+ {"child_paths": {"triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.source": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.source", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttir", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttgir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ttgir", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.llir": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.llir", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ptx": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.ptx", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin", "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json": "/workspace/hanrui/SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json"}}

SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.cubin ADDED

Binary file (23.1 kB).

SpecForge-ext/cache/compiled_kernels/triton/3/EB4J5U2HKNQBLXRWK6B5L6ATOH55AWD3MB7P63KH5AKRGRDZER7A/triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3.json ADDED

	@@ -0,0 +1 @@

+ {"hash": "20789ed347536015de365783d5f81371fbd0587b607eff6d47e815134479247e", "target": {"backend": "cuda", "arch": 90, "warp_size": 32}, "num_warps": 2, "num_ctas": 1, "num_stages": 1, "warp_size": 32, "maxnreg": null, "cluster_dims": [1, 1, 1], "ptx_version": null, "ptx_options": null, "ir_override": null, "enable_fp_fusion": true, "launch_cooperative_grid": false, "launch_pdl": false, "supported_fp8_dtypes": ["fp8e4b15", "fp8e4nv", "fp8e5"], "deprecated_fp8_dot_operand_dtypes": ["fp8e4b15"], "default_dot_input_precision": "tf32", "allowed_dot_input_precisions": ["tf32", "tf32x3", "ieee"], "max_num_imprecise_acc_default": 1073741824, "extern_libs": [["libdevice", "/workspace/specforge/lib/python3.11/site-packages/triton/backends/nvidia/lib/libdevice.10.bc"]], "debug": true, "backend_name": "cuda", "sanitize_overflow": false, "arch": "sm90", "instrumentation_mode": "", "triton_version": "3.5.1", "tensordesc_meta": [], "shared": 0, "tmem_size": 0, "global_scratch_size": 0, "global_scratch_align": 1, "profile_scratch_size": 0, "profile_scratch_align": 1, "name": "triton_per_fused__to_copy_clone_slice_sort_sum_transpose_3"}