Iris#
First-Class Multi-GPU Programming Experience in Triton
What is Iris?#
Iris is a Triton-based framework for Remote Memory Access (RMA) operations. Iris provides SHMEM-like APIs within Triton for Multi-GPU programming. Iris’ goal is to make Multi-GPU programming a first-class citizen in Triton while retaining Triton’s programmability and performance.
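The core idea is that a remote write looks just like a local Triton store, with the target rank and the symmetric-heap bases passed alongside. The following minimal sketch illustrates that flavour of the device-side API (it mirrors the iris.store call used in the Quick Start below; the kernel and parameter names here are illustrative):

import triton
import triton.language as tl
import iris

@triton.jit
def fill_remote(buffer, n_elements: tl.constexpr, BLOCK: tl.constexpr, heap_bases_ptr):
    # Indexing looks exactly like a single-GPU Triton kernel
    offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    # A local write would be: tl.store(buffer + offsets, 1, mask=mask)
    # The remote write adds the source rank, target rank, and heap bases:
    iris.store(buffer + offsets, 1, 0, 1, heap_bases_ptr, mask=mask)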
Key Features#
SHMEM-like RMA: Iris provides SHMEM-like RMA support in Triton
Simple and Intuitive API: Iris’ RMA APIs make writing multi-GPU programs as easy as writing single-GPU programs
Triton-based: Iris is built on top of Triton and inherits Triton’s performance and capabilities
Experimental Gluon Backend: Optional Gluon-based API using @aggregate and @gluon.jit for improved ergonomics (requires ROCm 7.0 and Triton commit aafec417bded34db6308f5b3d6023daefae43905 or later)
Quick Start#
Quick Installation#
Requirements: Python 3.10+, PyTorch 2.0+ (ROCm version), ROCm 6.3.1+ HIP runtime, and Triton
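To sanity-check your environment before installing, you can run a quick, optional snippet like the one below (the exact version strings will vary with your setup):

import sys
import torch
import triton

print("Python:", sys.version.split()[0])        # expect 3.10 or newer
print("PyTorch:", torch.__version__)            # expect 2.0+ built for ROCm
print("HIP runtime:", torch.version.hip)        # None on non-ROCm builds of PyTorch
print("Triton:", triton.__version__)
print("GPUs visible:", torch.cuda.device_count())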
For a quick installation directly from the repository:
pip install git+https://github.com/ROCm/iris.git

Docker Compose (Recommended for Development)#
The recommended way to get started is using Docker Compose:
# Clone the repository
git clone https://github.com/ROCm/iris.git
cd iris

# Start the development container
docker compose up --build -d

# Attach to the running container
docker attach iris-dev

# Install Iris in development mode
cd iris && pip install -e .

Run Your First Example#
Here’s a simple example showing how to perform remote memory operations between GPUs using Iris:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import triton
import triton.language as tl
import iris


# Device-side APIs
@triton.jit
def kernel(buffer, buffer_size: tl.constexpr, block_size: tl.constexpr, heap_bases_ptr):
    # Compute start index of this block
    pid = tl.program_id(0)
    block_start = pid * block_size
    offsets = block_start + tl.arange(0, block_size)

    # Guard for out-of-bounds accesses
    mask = offsets < buffer_size

    # Store 1 in the target buffer at each offset
    source_rank = 0
    target_rank = 1
    iris.store(buffer + offsets, 1, source_rank, target_rank, heap_bases_ptr, mask=mask)


def _worker(rank, world_size):
    # Torch distributed initialization
    device_id = rank % torch.cuda.device_count()
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
        device_id=torch.device(f"cuda:{device_id}"),
    )

    # Iris initialization
    heap_size = 2**30  # 1GiB symmetric heap for inter-GPU communication
    iris_ctx = iris.iris(heap_size)
    cur_rank = iris_ctx.get_rank()

    # Iris tensor allocation
    buffer_size = 4096  # 4K elements buffer
    buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

    # Launch the kernel on rank 0
    block_size = 1024
    grid = lambda meta: (triton.cdiv(buffer_size, meta["block_size"]),)
    source_rank = 0
    if cur_rank == source_rank:
        kernel[grid](
            buffer,
            buffer_size,
            block_size,
            iris_ctx.get_heap_bases(),
        )

    # Synchronize all ranks
    iris_ctx.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # Using two ranks
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)
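Because mp.spawn creates both ranks itself, the script runs as a plain Python program. Assuming it is saved as store_example.py (an illustrative filename) on a machine with at least two GPUs, launch it with:

python store_example.py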
Gluon-style API (Experimental)#
Iris also provides a cleaner, experimental API built on Triton’s Gluon, using the @gluon.jit decorator:
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from triton.experimental import gluon
from triton.experimental.gluon import language as gl
import iris.experimental.iris_gluon as iris_gl


# Device-side APIs - context encapsulates heap_bases
@gluon.jit
def kernel(IrisDeviceCtx: gl.constexpr, context_tensor, buffer, buffer_size: gl.constexpr, block_size: gl.constexpr):
    # Initialize device context from tensor
    ctx = IrisDeviceCtx.initialize(context_tensor)

    pid = gl.program_id(0)
    block_start = pid * block_size
    layout: gl.constexpr = gl.BlockedLayout([1], [64], [1], [0])
    offsets = block_start + gl.arange(0, block_size, layout=layout)
    mask = offsets < buffer_size

    # Store 1 in the target buffer - no need to pass heap_bases separately!
    target_rank = 1
    ctx.store(buffer + offsets, 1, target_rank, mask=mask)


def _worker(rank, world_size):
    # Torch distributed initialization
    device_id = rank % torch.cuda.device_count()
    dist.init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",
        device_id=torch.device(f"cuda:{device_id}"),
    )

    # Iris initialization
    heap_size = 2**30  # 1GiB symmetric heap
    iris_ctx = iris_gl.iris(heap_size)
    context_tensor = iris_ctx.get_device_context()  # Get encoded context
    cur_rank = iris_ctx.get_rank()

    # Iris tensor allocation
    buffer_size = 4096  # 4K elements buffer
    buffer = iris_ctx.zeros(buffer_size, device="cuda", dtype=torch.float32)

    # Launch the kernel on rank 0
    block_size = 1024
    grid = (buffer_size + block_size - 1) // block_size
    source_rank = 0
    if cur_rank == source_rank:
        kernel[(grid,)](
            iris_gl.IrisDeviceCtx,
            context_tensor,
            buffer,
            buffer_size,
            block_size,
            num_warps=1,
        )

    # Synchronize all ranks
    iris_ctx.barrier()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # Using two ranks
    mp.spawn(_worker, args=(world_size,), nprocs=world_size, join=True)

For more examples, see the Examples page with ready-to-run scripts and usage patterns.
For other setup methods, see the Installation Guide.
Documentation Structure#
📚 Getting Started#
Installation: Set up Iris on your system
Examples: Working code examples
Contributing: How to contribute
🧠 Conceptual#
Programming Model: How Iris works
Fine-grained Overlap: GEMM & communication overlap
Taxonomy: Multi-GPU programming patterns
📖 Reference#
API Reference: Structured API documentation
Triton APIs: Standard Iris APIs with Triton
Gluon APIs (Experimental): Cleaner API with Gluon decorators
Supported GPUs#
Iris currently supports:
MI300X, MI350X & MI355X
Note: Iris may work on other ROCm-compatible AMD GPUs.
Roadmap#
We plan to extend Iris with the following features:
Extended GPU Support: Testing and optimization for other AMD GPUs
RDMA Support: Multi-node support using Remote Direct Memory Access (RDMA) for distributed computing across multiple machines
End-to-End Integration: Comprehensive examples covering various use cases and end-to-end patterns
Community & Support#
GitHub Discussions#
Join the GitHub Discussions to ask questions, share ideas, and connect with the Iris community.
GitHub Issues#
Found a bug or have a feature request? Report it on GitHub Issues.
Contributing#
Want to contribute to Iris? Check out the Contributing Guide to learn how you can help make Iris better for everyone.
Ready to start your multi-GPU journey? Begin with the Installation Guide!