# Training on Intel CPU
## How it works for training optimization on CPU
Accelerate has full support for Intel CPU; all you need to do is enable it through the config.
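No CPU-specific changes are needed in the training script itself; CPU execution and bf16 mixed precision are selected entirely through the config and the launcher. As a rough illustration, here is a minimal sketch of an Accelerate training loop; the toy model, random data, and hyperparameters are placeholders for illustration only, not the contents of the repo's examples.

```python
# Minimal sketch of an Accelerate training loop. When started with `accelerate launch`,
# device placement (CPU here) and mixed precision follow the saved config.
# The tiny model and random data below are illustrative placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

# prepare() places everything on the configured device (here, CPU) and wraps the dataloader
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```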
## Scenario 1: Acceleration of non-distributed CPU training
Run accelerate config on your machine:
```bash
$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
```

This will generate a config file that will be used automatically to properly set the default options when doing

```bash
accelerate launch my_script.py --args_to_my_script
```
For instance, here is how you would run the NLP example `examples/nlp_example.py` (from the root of the repo) with the `default_config.yaml` generated by `accelerate config`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
```

```bash
accelerate launch examples/nlp_example.py
```
`accelerator.prepare` can currently only handle simultaneously preparing multiple models (and no optimizer) OR a single model-optimizer pair for training. Other attempts (e.g., two model-optimizer pairs) will raise a verbose error. To work around this limitation, consider separately using `accelerator.prepare` for each model-optimizer pair.
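For example, a minimal sketch of that workaround, with two hypothetical model/optimizer pairs standing in for your own, could look like this:

```python
# Sketch of the workaround described above: call prepare() once per
# model-optimizer pair instead of passing both pairs to a single call.
# The two linear models and their optimizers are hypothetical placeholders.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model_a = torch.nn.Linear(16, 4)
optimizer_a = torch.optim.SGD(model_a.parameters(), lr=1e-2)
model_b = torch.nn.Linear(4, 2)
optimizer_b = torch.optim.SGD(model_b.parameters(), lr=1e-2)

# One prepare() call per pair avoids the error raised when several
# model-optimizer pairs are passed to prepare() at the same time.
model_a, optimizer_a = accelerator.prepare(model_a, optimizer_a)
model_b, optimizer_b = accelerator.prepare(model_b, optimizer_b)
```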
## Scenario 2: Acceleration of distributed CPU training

We use Intel oneCCL for communication, combined with the Intel® MPI library, to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. Refer to the oneCCL bindings for PyTorch installation guide for setup instructions.
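After installation, a quick sanity check confirms the bindings are available; this uses the same import that the launch snippets below rely on to locate `setvars.sh`.

```python
# Quick sanity check that oneccl_bindings_for_pytorch is installed.
# This is the same import the launch snippets below use to find env/setvars.sh.
from oneccl_bindings_for_pytorch import cwd

print(cwd)  # installation directory containing env/setvars.sh
```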
Run `accelerate config` on your machine (node0):
```bash
$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-CPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
-----------------------------------------------------------------------------------------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 36.112.23.24
What is the port you will use to communicate with the main process? 29500
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you want accelerate to launch mpirun? [yes/NO]: yes
Please enter the path to the hostfile to use with mpirun [~/hostfile]: ~/hostfile
Enter the number of oneCCL worker threads [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
How many processes should be used for distributed training? [1]:16
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
```

For instance, here is how you would run the NLP example `examples/nlp_example.py` (from the root of the repo) with IPEX enabled for distributed CPU training.
The `default_config.yaml` generated by `accelerate config`:
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 36.112.23.24
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /home/user/hostfile
num_machines: 4
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
```

Set the following environment variables and use Intel MPI to launch the training.
On node0, you need to create a configuration file that contains the IP addresses of each node (for example, hostfile) and pass that configuration file path as an argument.
If you selected to let Accelerate launch mpirun, ensure that the location of your hostfile matches the path in the config.
```bash
$ cat hostfile
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
xxx.xxx.xxx.xxx #node2 ip
xxx.xxx.xxx.xxx #node3 ip
```

Before executing the `accelerate launch` command, you need to source the oneCCL bindings `setvars.sh` to set up your Intel MPI environment properly. Note that both the python script and environment need to be available on all of the machines being used for multi-CPU training.
```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh

accelerate launch examples/nlp_example.py
```

You can also launch distributed training directly with the `mpirun` command. In that case, run the following command on node0; DDP with 16 processes will be enabled across node0, node1, node2, and node3 with BF16 mixed precision. When using this method, the python script, python environment, and accelerate config file need to be available on all of the machines used for multi-CPU training.
```bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh

export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
export CCL_ATL_TRANSPORT=ofi
mpirun -f hostfile -n 16 -ppn 4 accelerate launch examples/nlp_example.py
```
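To confirm from inside a training script that the distributed CPU setup was picked up as intended, a small standalone sketch like the following (not part of `examples/nlp_example.py`, just an illustrative check) can be launched the same way as the NLP example above:

```python
# Standalone sketch that prints the distributed setup Accelerate sees on each rank.
# Launch it with `accelerate launch` or mpirun exactly like the NLP example above.
from accelerate import Accelerator

accelerator = Accelerator()
print(
    f"process {accelerator.process_index}/{accelerator.num_processes} "
    f"on {accelerator.device}, distributed_type={accelerator.distributed_type}, "
    f"mixed_precision={accelerator.mixed_precision}"
)
accelerator.wait_for_everyone()
if accelerator.is_main_process:
    print("all ranks initialized")
```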