Goodput optimization recipes

This document helps you optimize goodput, the rate at which useful data is transferred, for your workloads. To support this, we have curated reproducible goodput recipes that use common machine learning (ML) frameworks and models. To review these recipes, see the AI Hypercomputer GitHub organization. The goodput recipes were tested on clusters created by using Cluster Toolkit.
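
As a rough illustration of the definition above, the following sketch computes goodput as useful units delivered per second, alongside the fraction of all transferred units that were useful. The function names, the choice of units, and the sample numbers are assumptions made for illustration only. They are not part of the recipes or of any Google Cloud API.

```python
def goodput(useful_units: float, elapsed_seconds: float) -> float:
    """Return useful units delivered per second of wall-clock time.

    Hypothetical helper: "units" can be bytes, tokens, or training steps,
    as long as the same unit is used consistently.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return useful_units / elapsed_seconds


def goodput_ratio(useful_units: float, total_units: float) -> float:
    """Return the fraction of all transferred units that were useful."""
    if total_units <= 0:
        raise ValueError("total_units must be positive")
    return useful_units / total_units


if __name__ == "__main__":
    # Assumed sample numbers: 100 units moved in 50 seconds, of which
    # 20 units were repeated work after an interruption.
    total_units = 100.0
    wasted_units = 20.0
    useful_units = total_units - wasted_units

    print("throughput:", total_units / 50.0)
    print("goodput:", goodput(useful_units, 50.0))
    print("goodput ratio:", goodput_ratio(useful_units, total_units))
```

In this sketch, interruptions lower goodput without changing raw throughput, which is why reducing repeated work is the focus of the recipes.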

To help ensure optimal workload reliability and maximize your goodput, you can also proactively identify nodes in a Google Kubernetes Engine (GKE) cluster that are likely to degrade in the next five hours. This early warning helps you avoid scheduling new workloads on at-risk VMs, thereby reducing the risk of interruptions to your jobs. For more information, see Enable node health prediction.

Before you begin

Before you use the goodput recipes in this document, complete the following steps if you haven't already:

Recipes

The following reproducible goodput recipes are available for pre-training on GKE clusters:

Recipe name	Accelerator	Model	Framework	Workload type
Llama3.1 70B - A3 Mega	A3 Mega	Llama3.1 70B	NeMo	Pre-training on GKE

What's next

Learn how to optimize cluster networking by using NCCL/gIB.
Learn how to test clusters.
