Paper (arXiv) • Installation • Rules • Contributing • License
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the competition rules and the benchmark code to run it. For a detailed description of the benchmark design, see our paper.
Table of Contents
- AlgoPerf Benchmark Workloads
- Installation
- Getting Started
- Rules
- Contributing
- Citing AlgoPerf Benchmark
You can install this package and its dependencies in a Python virtual environment or use a Docker container (recommended).
TL;DR to install the JAX version for GPU run:

```bash
pip3 install -e '.[pytorch_cpu]'
pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
pip3 install -e '.[full]'
```

TL;DR to install the PyTorch version for GPU run:

```bash
pip3 install -e '.[jax_cpu]'
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
```

Note: Python minimum requirement >= 3.8
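If you are unsure which Python version your environment provides, you can check it with:

```bash
python3 --version   # should report 3.8 or newer
```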
To set up a virtual environment and install this repository:

- Create a new environment, e.g. via `conda` or `virtualenv`:

  ```bash
  sudo apt-get install python3-venv
  python3 -m venv env
  source env/bin/activate
  ```
- Clone this repository:

  ```bash
  git clone https://github.com/mlcommons/algorithmic-efficiency.git
  cd algorithmic-efficiency
  ```

- Run the `pip3 install` commands above to install `algorithmic_efficiency`.
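Optionally, as a quick sanity check (an illustrative command, not part of the official instructions), you can verify that the package imports cleanly:

```bash
python3 -c "import algorithmic_efficiency"   # exits silently if the install worked
```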
Additional Details
You can also install the requirements for individual workloads, e.g. via `pip3 install -e '.[librispeech]'`, or for all workloads at once via `pip3 install -e '.[full]'`.

We recommend using a Docker container to ensure a similar environment to our scoring and testing environments.
Prerequisites for NVIDIA GPU setup: you may have to install the NVIDIA Container Toolkit so that the containers can locate the NVIDIA drivers and GPUs. See instructions here.
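On a typical Ubuntu host with the NVIDIA package repository already configured, the installation roughly boils down to the following sketch (follow the linked instructions for your distribution; exact steps may differ):

```bash
# Install the toolkit, register it as a Docker runtime, and restart Docker.
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```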
- Clone this repository:

  ```bash
  cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
  ```
- Build the Docker image:

  ```bash
  cd algorithmic-efficiency/docker
  docker build -t <docker_image_name> . --build-arg framework=<framework>
  ```

  The `framework` flag can be either `pytorch`, `jax` or `both`. The `docker_image_name` is arbitrary. (A concrete example is shown after these setup steps.)
- Run a detached Docker container. This will print out a container ID.

  ```bash
  docker run -t -d \
    -v $HOME/data/:/data/ \
    -v $HOME/experiment_runs/:/experiment_runs \
    -v $HOME/experiment_runs/logs:/logs \
    -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
    --gpus all \
    --ipc=host \
    <docker_image_name>
  ```
- Open a bash terminal:

  ```bash
  docker exec -it <container_id> /bin/bash
  ```
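As a concrete illustration of the steps above (the image name `algoperf_jax` is only an example; the volume mounts are omitted for brevity):

```bash
# Build a JAX-only image, start a detached container, and look up its ID.
cd algorithmic-efficiency/docker
docker build -t algoperf_jax . --build-arg framework=jax
docker run -t -d --gpus all --ipc=host algoperf_jax
docker ps   # lists running containers together with their IDs
```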
For instructions on developing and scoring your own algorithm in the benchmark, and on running a submission end-to-end in a Docker container, see the Getting Started Document.

Alternatively, from your virtual environment or an interactively running Docker container, run `submission_runner.py` directly:
JAX

```bash
python3 submission_runner.py \
    --framework=jax \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=my_first_experiment \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_jax/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

PyTorch

```bash
python3 submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=$HOME/experiments \
    --experiment_name=my_first_experiment \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

Using PyTorch DDP (Recommended)

When using multiple GPUs on a single node it is recommended to use PyTorch's distributed data parallel. To do so, simply replace `python3` by

```bash
torchrun --standalone --nnodes=1 --nproc_per_node=N_GPUS
```

where `N_GPUS` is the number of available GPUs on the node. To only see output from the first process, you can run the following to redirect the output from processes 1-7 to a log file:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8
```

So the complete command is, for example:

```bash
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
    submission_runner.py \
    --framework=pytorch \
    --workload=mnist \
    --experiment_dir=/home/znado \
    --experiment_name=baseline \
    --submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
    --tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
```

The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate rules document. Suggestions, clarifications and questions can be raised via pull requests.
If you are interested in contributing to the work of the working group, feel free to join the weekly meetings and open issues. See our CONTRIBUTING.md for MLCommons contributing guidelines as well as setup and workflow instructions.
The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch versions of these workloads incur an additional overhead.
Since we use PyTorch's DistributedDataParallel implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See this PR thread for more details. While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with rank == 0), and broadcast the batches to all other devices. This introduces an additional communication overhead for each batch. See the implementation for the WMT workload as an example.
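To make the strategy concrete, here is a minimal, hypothetical sketch of the rank-0 broadcast pattern (illustrative only, not the benchmark's actual implementation; see the WMT workload for that). It assumes the default process group is already initialized with one process per device and that every rank knows the batch shape and dtype:

```python
import torch
import torch.distributed as dist

def broadcast_batch_from_rank0(batch_iterator, batch_shape, device, dtype=torch.float32):
    """Load a batch on rank 0 only and broadcast it to all other ranks."""
    if dist.get_rank() == 0:
        # The TensorFlow input pipeline only runs in this process.
        batch = torch.as_tensor(next(batch_iterator), dtype=dtype, device=device)
    else:
        # Other ranks allocate an empty buffer that receives the broadcast.
        batch = torch.empty(batch_shape, dtype=dtype, device=device)
    dist.broadcast(batch, src=0)  # the per-batch communication overhead mentioned above
    return batch
```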
