Easily Build and Share ROCm Kernels with Hugging Face
Custom kernels are essential for optimizing deep learning performance, but compiling and integrating them can be complex and error-prone. Hugging Face’s kernel-builder and kernels libraries simplify this process, offering support for multiple GPU backends, including ROCm, and enabling seamless integration with PyTorch.
Building kernels often involves intricate build configurations and potential ABI issues, leading to significant developer overhead and hindering reproducibility. This complexity can stall projects and limit the widespread adoption of specialized GPU optimizations.
Key Insights
- RadeonFlow GEMM Kernel Award: The RadeonFlow GEMM kernel received the Grand Prize at the AMD Developer Challenge 2025 in June 2025.
- Nix for Reproducibility: The kernel-builder leverages Nix to ensure perfectly reproducible build environments, addressing a common pain point in GPU kernel development.
- Kernels Library Integration: The kernels library loads kernels directly from the Hugging Face Hub, eliminating complex installation procedures.
Working Example
```toml
# build.toml: kernel-builder configuration for the ROCm GEMM kernel
[general]
name = "gemm"
universal = false

[torch]
src = [
  "torch-ext/torch_binding.cpp",
  "torch-ext/torch_binding.h",
]

[kernel.gemm]
backend = "rocm"
rocm-archs = [
  "gfx942",
]
depends = ["torch"]
src = [
  "include/clangd_workaround.h",
  "include/gpu_libs.h",
  "include/gpu_types.h",
  "include/timer.h",
  "gemm/gemm_kernel.h",
  "gemm/gemm_kernel_legacy.h",
  "gemm/gemm_launcher.hip",
  "gemm/transpose_kernel.h",
  "src/utils/arithmetic.h",
  "src/utils/timer.hip",
  "tests/checker/metrics.h",
]
include = ["include"]
```
```python
import torch
from kernels import get_kernel

# Load the compiled kernel from the Hugging Face Hub
gemm = get_kernel("kernels-community/gemm")

M, N, K = 1024, 1536, 7168
# ROCm builds of PyTorch expose AMD GPUs through the "cuda" device name
A_fp8 = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fnuz)
B_fp8 = torch.randn(K, N, device="cuda").to(torch.float8_e4m3fnuz)
# Uniform (all-ones) scale factors, shaped in blocks of 128
A_scale = torch.ones(K // 128, M, device="cuda", dtype=torch.float32)
B_scale = torch.ones(K // 128, N // 128, device="cuda", dtype=torch.float32)
C = torch.zeros(M, N, device="cuda", dtype=torch.bfloat16)

result = gemm.gemm(A_fp8, B_fp8, A_scale, B_scale, C)
```
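The two all-ones tensors passed to `gemm.gemm` have shapes derived from the matrix dimensions and a block size of 128. The sketch below works through that arithmetic in plain Python. The helper `block_scale_shapes` is hypothetical, and both the divisibility check and the reading of the shapes in the comments are assumptions inferred from the example, not documented kernel requirements.

```python
def block_scale_shapes(m: int, n: int, k: int, block: int = 128):
    """Hypothetical helper: return the two scale-tensor shapes used in the example call."""
    # Assumption: K and N should divide evenly by the block size; the example
    # itself only uses floor division and does not state this requirement.
    if k % block or n % block:
        raise ValueError("K and N must be multiples of the block size")
    a_scale = (k // block, m)           # assumed: one scale per K-block for each of the M rows
    b_scale = (k // block, n // block)  # assumed: one scale per (K-block, N-block) tile
    return a_scale, b_scale


M, N, K = 1024, 1536, 7168
print(block_scale_shapes(M, N, K))  # ((56, 1024), (56, 12))
```

For the dimensions in the example, 7168 // 128 = 56 and 1536 // 128 = 12, which gives a (56, 1024) scale tensor for A and a (56, 12) scale tensor for B.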
Practical Applications
- RadeonFlow: The RadeonFlow GEMM kernel demonstrates the potential for highly optimized matrix multiplication on AMD Instinct MI300X GPUs.
- Pitfall: Failing to use a reproducible build environment (like Nix) can lead to inconsistencies and difficulties when deploying kernels across different systems.