Paul Krill
Editor at Large

PyTorch team unveils framework for programming clusters

news
Oct 22, 2025 | 2 mins

Monarch framework, currently experimental, allows Python programmers to program distributed systems as if they were just one machine.


The PyTorch team at Meta, stewards of the PyTorch open source machine learning framework, has unveiled Monarch, a distributed programming framework intended to bring the simplicity of PyTorch to entire clusters. Monarch pairs a Python-based front end, which supports integration with existing code and libraries such as PyTorch, with a Rust-based back end that provides performance, scalability, and robustness, the team said.

Announced October 22, Monarch is a framework based on scalable actor messaging that lets users program distributed systems the way they would program a single machine, hiding the complexity of distributed computing, the PyTorch team said. Monarch is currently experimental; installation instructions can be found at meta-pytorch.org.
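The announcement describes this actor model only in prose, but a minimal hypothetical sketch of the style might look like the following. The module path and names used here (monarch.actor, Actor, endpoint, this_host, spawn_procs) are assumptions for illustration and may not match the experimental API.

# Hypothetical sketch of Monarch-style actor messaging; the imports and
# method names below are assumptions, not confirmed API. The idea: define an
# actor once, spawn it across a mesh of processes, and message the whole mesh
# as if it were a single object.
from monarch.actor import Actor, endpoint, this_host  # assumed module path

class Counter(Actor):
    def __init__(self) -> None:
        self.value = 0

    @endpoint  # assumed decorator marking a remotely callable method
    def increment(self) -> None:
        self.value += 1

    @endpoint
    def read(self) -> int:
        return self.value

# Spawn one process per GPU on this host (assumed helper), then one Counter
# actor per process; from then on, "counters" is addressed like one object.
procs = this_host().spawn_procs(per_host={"gpus": 8})
counters = procs.spawn("counters", Counter)

counters.increment.call().get()    # broadcast the call to every actor
print(counters.read.call().get())  # collect one result per actor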

Monarch organizes processes, actors, and hosts into a scalable multidimensional array, or mesh, that can be manipulated directly. Users can operate on entire meshes, or slices of them, through simple APIs, with Monarch handling distribution and vectorization automatically. Developers can write code as if nothing fails, according to the PyTorch team. When something does fail, Monarch fails fast by stopping the whole program; users can later add fine-grained fault handling where needed, catching and recovering from failures.
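Continuing the hypothetical sketch above, slicing and fault handling might be expressed along these lines; the slice() method and the broad exception handler are illustrative assumptions rather than documented Monarch behavior.

# Hypothetical continuation of the earlier sketch: operate on a slice of the
# actor mesh, and opt into fine-grained fault handling only where needed.
front_half = counters.slice(gpus=slice(0, 4))  # assumed slicing API
front_half.increment.call().get()              # only four actors receive this

# Default behavior described by the team: an unhandled failure stops the whole
# program ("fail fast"). Where recovery matters, catch the failure explicitly.
try:
    counters.read.call().get()
except Exception as err:  # a real program would catch Monarch's specific actor-failure error
    print(f"actor call failed, recovering: {err}")
    # e.g., respawn the failed actors or retry on a healthy slice of the mesh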

Monarch splits control plane messaging from data plane transfers, enabling direct GPU-to-GPU memory transfers across a cluster: commands are sent through one path while data moves through another. Monarch also integrates with PyTorch to provide tensors that are sharded across clusters of GPUs. Tensor operations look local but are executed across large distributed clusters, with Monarch handling the complexity of coordination across thousands of GPUs, the PyTorch team said.
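The tensor integration could be pictured roughly as follows; this is a hypothetical sketch that reuses the procs mesh from the earlier example, and the activate() context manager is an assumption for illustration, not a documented Monarch API.

# Hypothetical sketch: tensor code that reads like ordinary local PyTorch but
# is executed across a mesh of GPUs, with Monarch coordinating the shards and
# moving data GPU-to-GPU over the data plane. activate() is an assumed idiom.
import torch

with procs.activate():                                 # run the ops on the mesh
    weights = torch.rand(8192, 8192, device="cuda")    # sharded across GPUs
    grads = torch.rand(8192, 8192, device="cuda")
    weights -= 0.01 * grads                            # looks local; runs per shard

    # Fetching a result back to the controller would travel over the data
    # plane, separate from the control messages that scheduled the ops above.
    total = weights.norm()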

The PyTorch team warned that in Monarch’s current stage of development, users should expect bugs, incomplete features, and APIs that may change in future versions.

Paul Krill

Paul Krill is editor at large at InfoWorld. Paul has been covering computer technology as a news and feature reporter for more than 35 years, including 30 years at InfoWorld. He has specialized in coverage of software development tools and technologies since the 1990s, and he continues to lead InfoWorld’s news coverage of software development platforms including Java and .NET and programming languages including JavaScript, TypeScript, PHP, Python, Ruby, Rust, and Go. Long trusted as a reporter who prioritizes accuracy, integrity, and the best interests of readers, Paul is sought out by technology companies and industry organizations who want to reach InfoWorld’s audience of software developers and other information technology professionals. Paul has won a “Best Technology News Coverage” award from IDG.
