NVIDIA Releases cuda-oxide: A Native Rust-to-PTX Compiler for SIMT GPU Kernels

These articles are AI-generated summaries. Please check the original sources for full details.

NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend that Compiles SIMT GPU Kernels Directly to PTX

NVIDIA AI has launched cuda-oxide, an experimental compiler backend that allows developers to write CUDA SIMT kernels in standard Rust code. The project compiles Rust directly to PTX assembly without requiring C++ code or FFI bindings.

Why This Matters

GPU kernels have traditionally been written either in C++ or through high-level Python abstractions such as Triton, and the latter often obscure hardware-level control. cuda-oxide bridges this gap by bringing the CUDA SIMT model natively to Rust, allowing hardware-specific optimizations while retaining Rust’s memory safety guarantees. The reported results are substantial: GEMM implementations reach 868 TFLOPS on B200 hardware, which is 58% of cuBLAS speed-of-light performance.
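As a quick sanity check on those two figures, the reference throughput they imply can be back-computed. The sketch below is plain arithmetic on the numbers quoted above; the function name is hypothetical, and the summary does not state the data type or the exact cuBLAS reference value, so the result is only an implied figure.

```rust
/// Back-computes the reference throughput implied by an achieved figure and
/// the fraction of the reference it represents.
/// (Hypothetical helper for illustration; not part of cuda-oxide.)
fn implied_reference_tflops(achieved_tflops: f64, fraction_of_reference: f64) -> f64 {
    achieved_tflops / fraction_of_reference
}
```

Calling `implied_reference_tflops(868.0, 0.58)` gives roughly 1,497 TFLOPS, so the quoted numbers correspond to a reference of about 1.5 PFLOPS.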

Key Insights

  • The backend uses rustc_public (Stable MIR), a versioned, stable API over compiler internals, so that nightly Rust updates are less likely to break the build.
  • The middle stages use Pliron, a Rust-native MLIR-like IR framework, enabling the entire compiler to build with cargo without a C++ toolchain or CMake.
  • Hardware-specific barrier semantics are protected by disabling JumpThreading optimizations and marking synchronization primitives as convergent in LLVM IR.
  • Native support for modern NVIDIA architectures includes intrinsics for Tensor Memory Accelerator (TMA), Thread Block Clusters, and scoped atomics.
  • Device code from library dependencies is compiled lazily by reading Stable MIR from .rlib metadata on demand during the kernel compilation phase.
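The on-demand compilation described in the last point can be pictured as a worklist that starts at the kernel entry point and compiles each reachable function once. The following is a toy model under loud assumptions: the call graph is a plain map of function names, `compile_on_demand` is a hypothetical name, and nothing here reads Stable MIR or .rlib metadata the way the real backend does.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for function bodies: each function maps to the names it calls.
/// (Assumption for illustration; the real backend reads Stable MIR instead.)
type CallGraph = HashMap<&'static str, Vec<&'static str>>;

/// Walks outward from a kernel entry point and "compiles" each reachable
/// function exactly once, in the order it is first demanded.
/// Functions that the kernel never reaches are never touched.
fn compile_on_demand(graph: &CallGraph, entry: &'static str) -> Vec<&'static str> {
    let mut compiled = Vec::new();
    let mut seen = HashSet::new();
    let mut worklist = vec![entry];

    while let Some(name) = worklist.pop() {
        if !seen.insert(name) {
            continue; // already compiled, skip the repeat request
        }
        compiled.push(name);
        if let Some(callees) = graph.get(name) {
            // Push in reverse so callees are popped in their listed order.
            for callee in callees.iter().rev() {
                worklist.push(*callee);
            }
        }
    }
    compiled
}
```

In this model, a helper that exists in a dependency but is not reachable from the kernel never appears in the output, which is the property that makes the lazy approach cheaper than compiling every library function up front.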

Working Examples

To observe the full compilation trace, from Rust MIR through the Pliron dialects to PTX:

cargo oxide pipeline vecadd

To build the host binary and the PTX file together and then run the application:

cargo oxide run vecadd

Practical Applications

  • High-performance linear algebra: Implementing GEMM pipelines that use tensor-core features such as WGMMA (Hopper) and tcgen05 (Blackwell, including B200) for near-native performance. Pitfall: index_2d(stride) is currently documented as unsound in the 0.x release and should be avoided.
  • Safe GPU kernel authoring: Using the DisjointSlice and ThreadIndex abstractions to create kernels that are race-free by construction. Pitfall: manually duplicating bar.sync instructions across branches violates SIMT convergence and breaks hardware barrier semantics.
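The idea of being race-free by construction can be illustrated on the CPU with ordinary safe Rust. The sketch below is an analogy only: `ToyThreadIndex`, `launch`, and `vecadd` are hypothetical names, the "threads" run sequentially, and none of this is the cuda-oxide API. It shows the underlying principle, which is that each logical thread receives a unique index together with exclusive access to exactly one output element.

```rust
/// Hypothetical index type. The field is private, so callers outside this
/// module can only obtain one from `launch`, one per logical thread.
pub struct ToyThreadIndex(usize);

impl ToyThreadIndex {
    /// Returns the position this logical thread is responsible for.
    pub fn get(&self) -> usize {
        self.0
    }
}

/// CPU stand-in for a kernel launch: every logical thread gets its own index
/// and a mutable reference to a single, distinct output element.
/// The borrow checker rules out two threads sharing an output slot.
pub fn launch<T>(out: &mut [T], kernel: impl Fn(&ToyThreadIndex, &mut T)) {
    for (i, slot) in out.iter_mut().enumerate() {
        kernel(&ToyThreadIndex(i), slot);
    }
}

/// Element-wise vector add written against the toy launch API.
pub fn vecadd(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    launch(out, |idx, slot| {
        let i = idx.get();
        *slot = a[i] + b[i]; // inputs are shared read-only, output is exclusive
    });
}
```

Because the kernel closure never sees the whole output slice, it has no way to write to another thread's element, which is the same guarantee the abstractions above are described as providing on the GPU.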
