This document helps you optimize goodput, the rate of useful data transferred, for your workloads. To that end, we have curated reproducible goodput recipes that use common machine learning (ML) frameworks and models. To review these recipes, see the AI Hypercomputer GitHub organization. The recipes were tested on clusters that were created by using Cluster Toolkit.
To help ensure optimal workload reliability and maximize your goodput, you can also proactively identify nodes in a Google Kubernetes Engine (GKE) cluster that are likely to degrade in the next five hours. This early warning helps you avoid scheduling new workloads on at-risk VMs, thereby reducing the risk of interruptions to your jobs. For more information, see Enable node health prediction.
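As a rough illustration of how such a prediction might be used, the following Python sketch filters a list of nodes and keeps only those that are not flagged as at risk. It is a minimal sketch under stated assumptions: the label key `example.com/predicted-healthy`, its `"true"`/`"false"` values, and the `schedulable_nodes` helper are hypothetical and are not part of the GKE API. For the actual signal that GKE exposes, see Enable node health prediction.

```python
# Hypothetical label key and values, used only for illustration.
# The real signal exposed by node health prediction may differ.
HEALTH_LABEL = "example.com/predicted-healthy"


def schedulable_nodes(nodes):
    """Return the names of nodes that are not flagged as at risk.

    Each node is a dict with a "name" and an optional "labels" dict.
    A node without the label is treated as healthy (an assumption).
    """
    return [
        node["name"]
        for node in nodes
        if node.get("labels", {}).get(HEALTH_LABEL, "true") != "false"
    ]


# Sample data standing in for a node listing.
nodes = [
    {"name": "node-a", "labels": {HEALTH_LABEL: "true"}},
    {"name": "node-b", "labels": {HEALTH_LABEL: "false"}},
    {"name": "node-c", "labels": {}},
]

print(schedulable_nodes(nodes))  # prints ['node-a', 'node-c']
```

In a real cluster, the same intent is usually expressed declaratively, for example through node affinity rules on the workload, rather than by filtering nodes in client code.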
Before you begin
Before you use the goodput recipes in this document, complete the following steps if you haven't already:
Recipes
The following reproducible goodput recipes are available for pre-training on GKE clusters:
| Recipe name | Accelerator | Model | Framework | Workload type |
|---|---|---|---|---|
| Llama3.1 70B - A3 Mega | A3 Mega | Llama3.1 70B | NeMo | Pre-training on GKE |
What's next
Learn how to optimize cluster networking by using NCCL/gIB.
Learn how to test clusters.