
HF Kernels - SwiGLU Activation


GPU Info

Cell: nv | 0.21s

import subprocess

# Print the nvidia-smi report for the machine the benchmark runs on
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

Wed Oct 29 00:36:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:4D:00.0 Off |                    0 |
| N/A   29C    P0             77W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

SwiGLU Benchmark

Cell: benchmark | 4.27s

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "numpy",
#     "torch==2.8.0",
#     "kernels-benchmark-tools",
#     "kernels",
# ]
#
# [tool.uv.sources]
# kernels-benchmark-tools = { path = "../../../../../tools", editable = true }
# ///
import torch
import sys
from kernels_benchmark_tools import KernelTypeEnum, run_benchmark
from kernels import get_kernel

# Load the activation kernel
activation = get_kernel("kernels-community/activation")


def hf_kernels_swiglu(input_tensor):
    # The output keeps the leading dimensions and halves the last one
    hidden_dim = input_tensor.shape[-1] // 2
    out_shape = input_tensor.shape[:-1] + (hidden_dim,)
    # Preallocate the output buffer that is handed to the kernel
    out = torch.empty(out_shape, dtype=input_tensor.dtype, device=input_tensor.device)
    return activation.silu_and_mul(out, input_tensor)


# Register the implementation and run it over the activation workloads
run_benchmark(
    kernel_type=KernelTypeEnum.ACTIVATION,
    impl_name="hf_kernels_swiglu",
    impl_tags={"family": "hf-kernels", "backend": "cuda"},
    impl_func=hf_kernels_swiglu,
)

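For readers who want to see what the benchmarked function is expected to compute, here is a minimal pure-Python sketch of SwiGLU for a single row. It assumes the kernel follows the vLLM convention suggested by the `vllm::act_and_mul_kernel` name in the traces: the first half of the last dimension goes through SiLU and is multiplied elementwise by the second half. The names `silu` and `swiglu_reference` are hypothetical helpers for illustration, not part of the `kernels` library.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish): x * sigmoid(x), written to avoid overflow in exp."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def swiglu_reference(row: list[float]) -> list[float]:
    """Assumed semantics: silu(first half) * second half, elementwise."""
    if len(row) % 2 != 0:
        raise ValueError("last dimension must be even")
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    return [silu(g) * u for g, u in zip(gate, up)]


# One row with last dimension 6, so the output has 3 elements
row = [0.0, 1.0, -2.0, 3.0, 0.5, 2.0]
out = swiglu_reference(row)
print(out)
```

A reference like this is only useful for checking small inputs by hand; the benchmark itself runs the CUDA kernel on bfloat16 tensors, so results should be compared with a tolerance rather than exactly.
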
Running activation benchmark on cuda with 9 workloads.
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D768
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      74.624us      1850.79%      74.624us      74.624us             1  
+                                      hf_kernels_swiglu        11.04%     191.977us        99.56%       1.732ms       1.732ms       0.000us         0.00%       5.440us       5.440us             1  
+                      _activation_beeaae6::silu_and_mul         1.14%      19.900us        85.86%       1.493ms     497.784us       4.032us       100.00%       5.440us       1.813us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.032us       100.00%       4.032us       1.344us             3  
+                                Activity Buffer Request        82.36%       1.432ms        82.36%       1.432ms       1.432ms       1.408us        34.92%       1.408us       1.408us             1  
+                                            aten::empty         2.66%      46.201us         2.66%      46.201us      15.400us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         2.36%      41.042us         2.36%      41.042us      13.681us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.44%       7.690us         0.44%       7.690us       7.690us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.739ms
+Self CUDA time total: 4.032us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D1024
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      58.016us      1462.10%      58.016us      58.016us             1  
+                                      hf_kernels_swiglu         6.64%     105.933us        99.68%       1.591ms       1.591ms       0.000us         0.00%       5.280us       5.280us             1  
+                      _activation_beeaae6::silu_and_mul         1.34%      21.350us        91.75%       1.465ms     488.260us       3.968us       100.00%       5.280us       1.760us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       3.968us       100.00%       3.968us       1.323us             3  
+                                Activity Buffer Request        88.86%       1.419ms        88.86%       1.419ms       1.419ms       1.312us        33.06%       1.312us       1.312us             1  
+                                            aten::empty         1.30%      20.712us         1.30%      20.712us       6.904us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         1.56%      24.841us         1.56%      24.841us       8.280us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.32%       5.080us         0.32%       5.080us       5.080us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.597ms
+Self CUDA time total: 3.968us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T128_D2048
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      67.167us      1371.87%      67.167us      67.167us             1  
+                                      hf_kernels_swiglu         6.20%     101.314us        99.65%       1.628ms       1.628ms       0.000us         0.00%       6.560us       6.560us             1  
+                      _activation_beeaae6::silu_and_mul         1.28%      20.850us        92.18%       1.506ms     501.997us       4.896us       100.00%       6.560us       2.187us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.896us       100.00%       4.896us       1.632us             3  
+                                Activity Buffer Request        89.24%       1.458ms        89.24%       1.458ms       1.458ms       1.664us        33.99%       1.664us       1.664us             1  
+                                            aten::empty         1.26%      20.660us         1.26%      20.660us       6.887us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         1.67%      27.252us         1.67%      27.252us       9.084us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.35%       5.710us         0.35%       5.710us       5.710us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.634ms
+Self CUDA time total: 4.896us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D768
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      69.055us      1610.42%      69.055us      69.055us             1  
+                                      hf_kernels_swiglu         5.98%     106.323us        99.73%       1.773ms       1.773ms       0.000us         0.00%       5.728us       5.728us             1  
+                      _activation_beeaae6::silu_and_mul         1.23%      21.902us        92.63%       1.646ms     548.829us       4.288us       100.00%       5.728us       1.909us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       4.288us       100.00%       4.288us       1.429us             3  
+                                Activity Buffer Request        80.11%       1.424ms        80.11%       1.424ms       1.424ms       1.440us        33.58%       1.440us       1.440us             1  
+                                            aten::empty         1.11%      19.750us         1.11%      19.750us       6.583us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        11.30%     200.767us        11.30%     200.767us      66.922us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.27%       4.870us         0.27%       4.870us       4.870us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.777ms
+Self CUDA time total: 4.288us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D1024
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      61.438us      1043.62%      61.438us      61.438us             1  
+                                      hf_kernels_swiglu        19.33%      85.364us        98.97%     437.156us     437.156us       0.000us         0.00%       7.871us       7.871us             1  
+                      _activation_beeaae6::silu_and_mul         4.88%      21.551us        75.28%     332.532us     110.844us       5.887us       100.00%       7.871us       2.624us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       5.887us       100.00%       5.887us       1.962us             3  
+                                Activity Buffer Request        35.23%     155.635us        35.23%     155.635us     155.635us       1.984us        33.70%       1.984us       1.984us             1  
+                                            aten::empty         4.36%      19.260us         4.36%      19.260us       6.420us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        35.17%     155.346us        35.17%     155.346us      51.782us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.03%       4.560us         1.03%       4.560us       4.560us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 441.716us
+Self CUDA time total: 5.887us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T256_D2048
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      64.160us       828.30%      64.160us      64.160us             1  
+                                      hf_kernels_swiglu         7.42%     129.826us        99.74%       1.746ms       1.746ms       0.000us         0.00%      10.339us      10.339us             1  
+                      _activation_beeaae6::silu_and_mul         1.16%      20.220us        91.25%       1.597ms     532.391us       7.746us       100.00%      10.339us       3.446us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       7.746us       100.00%       7.746us       2.582us             3  
+                                Activity Buffer Request        81.29%       1.423ms        81.29%       1.423ms       1.423ms       2.593us        33.48%       2.593us       2.593us             1  
+                                            aten::empty         1.08%      18.840us         1.08%      18.840us       6.280us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         8.81%     154.125us         8.81%     154.125us      51.375us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.26%       4.481us         0.26%       4.481us       4.481us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.750ms
+Self CUDA time total: 7.746us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D768
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      70.847us      1069.55%      70.847us      70.847us             1  
+                                      hf_kernels_swiglu         6.38%     111.683us        99.73%       1.745ms       1.745ms       0.000us         0.00%       8.832us       8.832us             1  
+                      _activation_beeaae6::silu_and_mul         1.20%      21.011us        92.19%       1.613ms     537.758us       6.624us       100.00%       8.832us       2.944us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       6.624us       100.00%       6.624us       2.208us             3  
+                                Activity Buffer Request        82.19%       1.438ms        82.19%       1.438ms       1.438ms       2.208us        33.33%       2.208us       2.208us             1  
+                                            aten::empty         1.16%      20.281us         1.16%      20.281us       6.760us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel         8.80%     153.915us         8.80%     153.915us      51.305us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.27%       4.700us         0.27%       4.700us       4.700us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 1.750ms
+Self CUDA time total: 6.624us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D1024
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.070us       668.11%      63.070us      63.070us             1  
+                                      hf_kernels_swiglu        18.75%      87.072us        98.86%     459.026us     459.026us       0.000us         0.00%      12.608us      12.608us             1  
+                      _activation_beeaae6::silu_and_mul         4.59%      21.321us        76.16%     353.653us     117.884us       9.440us       100.00%      12.608us       4.203us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us       9.440us       100.00%       9.440us       3.147us             3  
+                                Activity Buffer Request        38.99%     181.046us        38.99%     181.046us     181.046us       3.168us        33.56%       3.168us       3.168us             1  
+                                            aten::empty         3.94%      18.301us         3.94%      18.301us       6.100us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        32.58%     151.286us        32.58%     151.286us      50.429us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         1.14%       5.310us         1.14%       5.310us       5.310us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 464.336us
+Self CUDA time total: 9.440us
+
+
+
+======================================================================
+PROFILE TRACE: hf_kernels_swiglu | cuda_T512_D2048
+======================================================================
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+                                      hf_kernels_swiglu         0.00%       0.000us         0.00%       0.000us       0.000us      63.326us       483.85%      63.326us      63.326us             1  
+                                      hf_kernels_swiglu        16.17%     100.313us        99.24%     615.771us     615.771us       0.000us         0.00%      17.472us      17.472us             1  
+                      _activation_beeaae6::silu_and_mul         3.48%      21.570us        80.17%     497.486us     165.829us      13.088us       100.00%      17.472us       5.824us             3  
+void vllm::act_and_mul_kernel<c10::BFloat16, &(c10::...         0.00%       0.000us         0.00%       0.000us       0.000us      13.088us       100.00%      13.088us       4.363us             3  
+                                Activity Buffer Request        52.45%     325.441us        52.45%     325.441us     325.441us       4.384us        33.50%       4.384us       4.384us             1  
+                                            aten::empty         2.90%      17.972us         2.90%      17.972us       5.991us       0.000us         0.00%       0.000us       0.000us             3  
+                                       cudaLaunchKernel        24.25%     150.475us        24.25%     150.475us      50.158us       0.000us         0.00%       0.000us       0.000us             3  
+                                  cudaDeviceSynchronize         0.76%       4.730us         0.76%       4.730us       4.730us       0.000us         0.00%       0.000us       0.000us             1  
+-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
+Self CPU time total: 620.501us
+Self CUDA time total: 13.088us

impl                     wl                  p50(ms)  ok
hf_kernels_swiglu        cuda_T128_D1024        0.03  True
hf_kernels_swiglu        cuda_T128_D2048        0.03  True
hf_kernels_swiglu        cuda_T128_D768         0.02  True
hf_kernels_swiglu        cuda_T256_D1024        0.03  True
hf_kernels_swiglu        cuda_T256_D2048        0.03  True
hf_kernels_swiglu        cuda_T256_D768         0.03  True
hf_kernels_swiglu        cuda_T512_D1024        0.03  True
hf_kernels_swiglu        cuda_T512_D2048        0.03  True
hf_kernels_swiglu        cuda_T512_D768         0.03  True

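The percentages above 100% in the "Self CUDA %" column can look like an error. A short check on the cuda_T128_D768 trace suggests they are computed against "Self CUDA time total", which here counts only the kernel time (4.032us), so the wrapping `hf_kernels_swiglu` range (74.624us) comes out far larger than the total. The sketch below only redoes that arithmetic with numbers copied from the trace; the interpretation of the rows is an assumption, and `pct` is a hypothetical helper.

```python
# Values copied from the cuda_T128_D768 profile trace
kernel_self_cuda_us = 4.032     # "Self CUDA time total" (kernel row, 3 calls)
wrapper_range_cuda_us = 74.624  # first hf_kernels_swiglu row, "Self CUDA"
buffer_request_cuda_us = 1.408  # "Activity Buffer Request", "Self CUDA"
num_calls = 3


def pct(part_us: float, total_us: float) -> float:
    """Share of total_us taken by part_us, as a percentage with 2 decimals."""
    return round(100.0 * part_us / total_us, 2)


wrapper_pct = pct(wrapper_range_cuda_us, kernel_self_cuda_us)
buffer_pct = pct(buffer_request_cuda_us, kernel_self_cuda_us)
avg_kernel_us = round(kernel_self_cuda_us / num_calls, 3)

print(wrapper_pct)    # 1850.79, matches the "Self CUDA %" of the wrapper row
print(buffer_pct)     # 34.92, matches the "Activity Buffer Request" row
print(avg_kernel_us)  # 1.344, matches "CUDA time avg" of the kernel row
```

Read this way, the per-call kernel time is the "CUDA time avg" of the `act_and_mul_kernel` row (1.344us to 4.363us across the nine workloads), while the p50 values in the summary table (0.02 to 0.03 ms) presumably also include launch and synchronization overhead.
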
UV Install Logs

Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 17.32it/s]

Artifacts:

activation.jsonl