Paper (arXiv) • Installation • Rules • Contributing • License
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the competition rules and the benchmark code to run it. For a detailed description of the benchmark design, see our paper.
Table of Contents
- AlgoPerf Benchmark Workloads
- Installation
- Getting Started
- Rules
- Contributing
- Citing AlgoPerf Benchmark
You can install this package and its dependencies in a Python virtual environment or use a Docker container (recommended).
TL;DR to install the JAX version for GPU run:

```bash
pip3 install -e '.[pytorch_cpu]'
pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
pip3 install -e '.[full]'
```

TL;DR to install the PyTorch version for GPU run:

```bash
pip3 install -e '.[jax_cpu]'
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
```

Note: Python minimum requirement >= 3.8
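If you are unsure which Python version your environment provides, you can check it with:

```bash
python3 --version   # should report 3.8 or newer
```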
To set up a virtual environment and install this repository:

- Create a new environment, e.g. via `conda` or `virtualenv`:

  ```bash
  sudo apt-get install python3-venv
  python3 -m venv env
  source env/bin/activate
  ```
- Clone this repository:

  ```bash
  git clone https://github.com/mlcommons/algorithmic-efficiency.git
  cd algorithmic-efficiency
  ```

- Run the `pip3 install` commands above to install `algorithmic_efficiency`.
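Optionally, as a quick sanity check (an illustrative command, not part of the official instructions), you can verify that the package imports cleanly:

```bash
python3 -c "import algorithmic_efficiency"   # exits silently if the install worked
```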
Additional Details
You can also install the requirements for individual workloads, e.g. via `pip3 install -e '.[librispeech]'`, or for all workloads at once via `pip3 install -e '.[full]'`.

We recommend using a Docker container to ensure a similar environment to our scoring and testing environments.
Prerequisites for NVIDIA GPU setup: you may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs. See instructions here.
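On a typical Ubuntu host with the NVIDIA package repository already configured, the installation roughly boils down to the following sketch (follow the linked instructions for your distribution; exact steps may differ):

```bash
# Install the toolkit, register it as a Docker runtime, and restart Docker.
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```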
- Clone this repository:

  ```bash
  cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
  ```
- Build the Docker image:

  ```bash
  cd algorithmic-efficiency/docker
  docker build -t <docker_image_name> . --build-arg framework=<framework>
  ```

  The `framework` flag can be either `pytorch`, `jax` or `both`. The `docker_image_name` is arbitrary. (A concrete example is shown after these setup steps.)
- Run a detached Docker container. This will print out a container ID.

  ```bash
  docker run -t -d \
    -v $HOME/data/:/data/ \
    -v $HOME/experiment_runs/:/experiment_runs \
    -v $HOME/experiment_runs/logs:/logs \
    -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
    --gpus all \
    --ipc=host \
    <docker_image_name>
  ```
- Open a bash terminal:

  ```bash
  docker exec -it <container_id> /bin/bash
  ```
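As a concrete illustration of the steps above (the image name `algoperf_jax` is only an example; the volume mounts are omitted for brevity):

```bash
# Build a JAX-only image, start a detached container, and look up its ID.
cd algorithmic-efficiency/docker
docker build -t algoperf_jax . --build-arg framework=jax
docker run -t -d --gpus all --ipc=host algoperf_jax
docker ps   # lists running containers together with their IDs
```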
For instructions on developing and scoring your own algorithm in the benchmark, and on running a submission end-to-end in a Docker container, see the Getting Started Document.

Alternatively, from your virtual environment or an interactively running Docker container, run `submission_runner.py` directly:
JAX

```bash
python3 submission_runner.py \
    --framework=jax \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=my_first_experiment \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_jax/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

PyTorch

```bash
python3 submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=my_first_experiment \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

Using PyTorch DDP (Recommended)

When using multiple GPUs on a single node it is recommended to use PyTorch's distributed data parallel. To do so, simply replace `python3` by

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

where `N_GPUS` is the number of available GPUs on the node. To only see output from the first process, you can run the following to redirect the output from processes 1-7 to a log file:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8
```

So the complete command is, for example:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
    submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=/home/znado \
    --experiment_name=baseline \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate rules document. Suggestions, clarifications and questions can be raised via pull requests.
If you are interested in contributing to the work of the working group, feel free to join the weekly meetings and open issues. See our CONTRIBUTING.md for MLCommons contributing guidelines as well as setup and workflow instructions.
The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch versions of these workloads incur an additional overhead.
Since we use PyTorch's DistributedDataParallel implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See this PR thread for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with rank == 0), and broadcast the batches to all other devices. This introduces an additional communication overhead for each batch. See the implementation for the WMT workload as an example.
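To make the strategy concrete, here is a minimal, hypothetical sketch of the rank-0 broadcast pattern (illustrative only, not the benchmark's actual implementation; see the WMT workload for that). It assumes the default process group is already initialized with one process per device and that every rank knows the batch shape and dtype:

```python
import torch
import torch.distributed as dist

def broadcast_batch_from_rank0(batch_iterator, batch_shape, device, dtype=torch.float32):
    """Load a batch on rank 0 only and broadcast it to all other ranks."""
    if dist.get_rank() == 0:
        # The TensorFlow input pipeline only runs in this process.
        batch = torch.as_tensor(next(batch_iterator), dtype=dtype, device=device)
    else:
        # Other ranks allocate an empty buffer that receives the broadcast.
        batch = torch.empty(batch_shape, dtype=dtype, device=device)
    dist.broadcast(batch, src=0)  # the per-batch communication overhead mentioned above
    return batch
```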
