Goodput optimization recipes

This document helps you optimize goodput, the rate of useful data transferred, for your workloads. It points to curated, reproducible goodput recipes that use common machine learning (ML) frameworks and models. To review these recipes, see the AI Hypercomputer GitHub organization. The recipes were tested on clusters created by using Cluster Toolkit.
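
As a rough illustration of the definition above, the following sketch contrasts raw throughput with goodput by subtracting data that did not contribute to the job, such as data that is transferred again after an interruption. The function names, the notion of "wasted" bytes, and the sample numbers are hypothetical and are not taken from the recipes, which may measure goodput differently.

```python
def throughput_bps(total_bytes: int, elapsed_s: float) -> float:
    """Raw rate: every byte moved, whether or not it was useful."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return total_bytes / elapsed_s


def goodput_bps(total_bytes: int, wasted_bytes: int, elapsed_s: float) -> float:
    """Useful rate: bytes moved minus bytes that did not contribute.

    What counts as "wasted" is an assumption of this sketch
    (for example, data transferred again after a job restart).
    """
    if not 0 <= wasted_bytes <= total_bytes:
        raise ValueError("wasted_bytes must be between 0 and total_bytes")
    return throughput_bps(total_bytes - wasted_bytes, elapsed_s)


if __name__ == "__main__":
    # Hypothetical numbers: 1000 bytes moved in 10 s, 200 of them redundant.
    print(throughput_bps(1000, 10.0))    # 100.0
    print(goodput_bps(1000, 200, 10.0))  # 80.0
```

In this example the gap between the two values is the share of transfer capacity spent on work that had to be repeated, which is the quantity the recipes aim to reduce.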

To help ensure optimal workload reliability and maximize your goodput, you can also proactively identify nodes in a Google Kubernetes Engine (GKE) cluster that are likely to degrade in the next five hours. This early warning helps you avoid scheduling new workloads on at-risk VMs, thereby reducing the risk of interruptions to your jobs. For more information, see Enable node health prediction.
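
The idea of steering new workloads away from at-risk nodes can be sketched as a filter over the parsed output of `kubectl get nodes -o json`. This is a minimal sketch, not the documented procedure: the label key and its "true" value below are assumptions made for illustration, and the helper name is hypothetical. Confirm the actual label in Enable node health prediction before relying on it.

```python
# Assumption: this label key and its "true" value are illustrative only.
# Check "Enable node health prediction" for the label GKE actually sets.
HEALTH_LABEL = "gke.io/recommended-to-run-large-training-workload"


def split_nodes_by_prediction(node_list: dict, label_key: str = HEALTH_LABEL):
    """Return (recommended, at_risk) lists of node names.

    node_list is the parsed JSON from `kubectl get nodes -o json`.
    Nodes that lack the label are treated as at-risk, which is a
    deliberately conservative choice made by this sketch.
    """
    recommended, at_risk = [], []
    for item in node_list.get("items", []):
        metadata = item.get("metadata", {})
        labels = metadata.get("labels") or {}
        name = metadata.get("name")
        if labels.get(label_key) == "true":
            recommended.append(name)
        else:
            at_risk.append(name)
    return recommended, at_risk


if __name__ == "__main__":
    # Hypothetical node list with the same shape as kubectl's JSON output.
    sample = {
        "items": [
            {"metadata": {"name": "node-a", "labels": {HEALTH_LABEL: "true"}}},
            {"metadata": {"name": "node-b", "labels": {HEALTH_LABEL: "false"}}},
            {"metadata": {"name": "node-c", "labels": {}}},
        ]
    }
    ok, risky = split_nodes_by_prediction(sample)
    print("schedule on:", ok)
    print("avoid:", risky)
```

In practice the same selection is usually expressed in the workload manifest itself, for example through a node selector or node affinity on the prediction label, so that the scheduler enforces it.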

Before you begin

Before you use the goodput recipes in this document, complete the following steps if you haven't already:

  1. Choose an accelerator that best suits your workload.

  2. Choose a consumption method based on the accelerator that you chose.

  3. Create your cluster.

Recipes

The following reproducible goodput recipes are available for pre-training on GKE clusters:

Recipe name             Accelerator  Model         Framework  Workload type
----------------------  -----------  ------------  ---------  -------------------
Llama3.1 70B - A3 Mega  A3 Mega      Llama3.1 70B  NeMo       Pre-training on GKE

What's next