IBM Advances Open-Source AI with vLLM, torch.compile, and Spyre Accelerator Integration
These articles are AI-generated summaries. Please check the original sources for full details.
Summary of IBM’s Contributions to Open-Source AI
IBM is actively expanding its contributions to the open-source AI community, particularly within the PyTorch and vLLM ecosystems. At PyTorch 2025, the company showcased advancements in model training, inference, and accelerator integration, demonstrating a commitment to efficient, transparent, and community-driven AI infrastructure. These efforts involve developing hardware-agnostic kernels for vLLM, achieving significant cost and efficiency improvements in LLM training with torchtitan, and integrating its Spyre AI accelerator for enhanced inference capabilities.
Improving vLLM with Hardware-Agnostic Kernels
Overview: IBM Research has developed Triton kernels to enhance the performance of vLLM, a popular open-source LLM inference framework. This initiative aims to provide a fully contained attention backend that works efficiently on multiple GPU platforms, including AMD, without relying on proprietary libraries like FlashAttention.
Key Details:
- Problem: The previous version of vLLM (V1) relied on external FlashAttention libraries, limiting hardware support to NVIDIA GPUs.
- Solution: IBM researchers developed Triton kernels, a platform-independent framework for writing GPU kernels, to enable vLLM to run efficiently on AMD GPUs.
- Performance Improvement: Through extensive kernel-level optimizations, IBM achieved a more than 10% improvement in token throughput for vLLM V1 on AMD hardware compared to vLLM V0.
- Technical Details: The team addressed performance bottlenecks arising from the scheduler’s interleaving of prefill and decode requests, leading to significant optimizations for models using grouped query attention, such as IBM’s Granite 3.0 family.
- Community Benefit: The newly developed attention backend is self-contained, doesn’t depend on external libraries, and is maintained by the vLLM team, allowing for flexibility in adapting to new models.
- Future Work: IBM is working on reducing the launch overhead of these Triton kernels and expanding optimizations to other model architectures, including IBM’s Granite 4, which utilizes Mamba layers. Meta’s PyTorch Team is also announcing a public beta for Helion, a domain-specific language for kernel development, which compiles down to Triton.
A New Model Training Milestone with torchtitan
Overview: IBM Research has successfully trained a new branch of the Llama 3 70B model using torchtitan, a new PyTorch-native training framework. This achievement demonstrates the framework’s capability to achieve production-grade efficiency with significantly reduced resources.
Key Details:
- Switch to torchtitan: IBM’s research scientist Linsong Chu’s team transitioned from a proprietary software stack to torchtitan to actively participate in the open-source community and streamline training workflows.
- Resource Efficiency: The team achieved the same model quality with only one-third of the training budget compared to the original Llama 3 training, attributed to careful data curation and training in FP8 precision.
- FP8 Precision: Utilizing FP8 precision resulted in a 1.5 times faster training speed and halved the required token count.
- torchtitan Features: The project leveraged features contributed by IBM, including a high-throughput data loader that enables efficient checkpoint saving, workload distribution, and mid-job reconfiguration.
- Community Contribution: This accomplishment showcases PyTorch’s potential for training large-scale models using open-source frameworks, with IBM actively contributing kernels and new models to the community.
Integration of Spyre AI Accelerator with vLLM and torch.compile
Overview: IBM is integrating its Spyre AI accelerator with the PyTorch ecosystem, specifically with vLLM and torch.compile, to enhance inference performance and expand the reach of its hardware.
Key Details:
- Spyre AI Accelerator: IBM’s Spyre is designed for IBM Z and Power Systems, offering a heterogeneous compute platform.
- vLLM Integration: IBM developed a new compiler that seamlessly interfaces with torch.compile, enabling vLLM to utilize the Spyre accelerator with minimal code changes.
- Paged Attention Plugin: A key component of the integration is a vLLM plugin that implements paged attention, a technique that improves memory efficiency and scalability for long LLM outputs.
- Paged Attention Mechanism: This technique divides the model’s KV cache into smaller, addressable memory blocks (“pages”) fetched on demand, reducing memory bottlenecks.
- Ease of Use: The plugin allows developers to transparently substitute vLLM endpoints with Spyre, requiring no changes to existing frameworks like BeeAI and LangChain.
- Future Plans: IBM plans to deepen the integration in 2026, focusing on tighter compiler integration, unified runtime interfaces, and expanded multi-accelerator scheduling.
Conclusion
IBM’s multifaceted contributions to the open-source AI landscape, highlighted at PyTorch 2025, underscore its commitment to fostering a collaborative and innovative ecosystem. By advancing vLLM with hardware-agnostic kernels, achieving efficient training with torchtitan, and integrating its Spyre accelerator, IBM is paving the way for more powerful, accessible, and scalable AI solutions.
Reference Link: https://research.ibm.com/blog/pytorch-expanding-ai-model-training-and-inference-for-the-open-source-community?utm_medium=rss&utm_source=rss
Continue reading
Next article
Mauro Martino's AI-Powered Sculpture: Exploring the Intersection of Art and Artificial Intelligence
Related Content
LightSeek Foundation Releases TokenSpeed: An Open-Source Inference Engine for Agentic AI
LightSeek Foundation's TokenSpeed is an open-source LLM inference engine that outperforms TensorRT-LLM by 11% in throughput on NVIDIA B200 GPUs for agentic coding workloads.
Making open infrastructure for AI a reality, together
IBM Research and partners are transforming the AI age with complete solutions in an open ecosystem, launching the Spyre AI accelerator.
Sakana AI and NVIDIA Introduce TwELL: 20.5% Faster LLM Inference via Unstructured Sparsity
Sakana AI and NVIDIA introduced TwELL and custom CUDA kernels, achieving 20.5% inference and 21.9% training speedups in LLMs by exploiting activation sparsity.