Status: Open
Labels: bug (Something isn't working)
Description
Related to LanguageModeling/BERT/PyTorch
Describe the bug
When I tried to train a BERT model with PyTorch using a slightly larger vocabulary than the default, the following error occurred. According to the error message, the code needs to pass a pickle_protocol argument to torch.save().
```
Traceback (most recent call last):
  File "/path/to/workdir//run_pretraining.py", line 678, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/path/to/workdir//run_pretraining.py", line 647, in main
    torch.save({'model': model_to_save.state_dict(),
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 372, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 476, in _save
    pickler.dump(obj)
OverflowError: serializing a string larger than 4 GiB requires pickle protocol 4 or higher
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 303, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 294, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/path/to/workdir//run_pretraining.py', '--local_rank=7', '--input_dir=/path/to/dataset//phase1/', '--output_dir=/results//checkpoints', '--config_file=bert_config.json', '--bert_model=bert-large-uncased', '--train_batch_size=8192', '--max_seq_length=128', '--max_predictions_per_seq=20', '--max_steps=7038', '--warmup_proportion=0.2843', '--num_steps_per_checkpoint=200', '--learning_rate=6e-3', '--seed=12439', '--fp16', '--gradient_accumulation_steps=128', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--do_train', '--json-summary', '/results//dllogger.json']' returned non-zero exit status 1.
```
Applying the following change avoids the error (cf. https://stackoverflow.com/questions/29704139/pickle-in-python3-doesnt-work-for-large-data-saving):
```diff
--- ./run_pretraining.py	2021-03-27 05:22:26.332698069 +0000
+++ ./run_pretraining.py	2021-04-01 08:49:37.610154874 +0000
@@ -26,6 +26,7 @@
 import time
 import argparse
 import random
+import pickle
 import h5py
 from tqdm import tqdm, trange
 import os
@@ -649,7 +650,9 @@
                 'master params': list(amp.master_params(optimizer)),
                 'files': [f_id] + files,
                 'epoch': epoch,
-                'data_loader': None if global_step >= args.max_steps else train_dataloader}, output_save_file)
+                'data_loader': None if global_step >= args.max_steps else train_dataloader},
+               output_save_file,
+               pickle_protocol=pickle.HIGHEST_PROTOCOL)
             most_recent_ckpts_paths.append(output_save_file)
             if len(most_recent_ckpts_paths) > 3:
```
To Reproduce
Steps to reproduce the behavior:
- Prepare a dataset with a 32768-token vocabulary (I think this issue can be reproduced with any corpus),
- Change `vocab_size` in `bert_config.json`, e.g. `"vocab_size": 30522` -> `"vocab_size": 32768`,
- Launch the NGC container `nvcr.io/nvidia/pytorch:20.12-py3`,
- Run `bash ./scripts/run_pretraining.sh`.
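For context on why the patch works: pickle protocol 2 stores string/bytes lengths in 4-byte fields, so any single serialized object over 4 GiB (such as a state dict for an enlarged vocabulary) raises the OverflowError above, while protocol 4 (PEP 3154, available since Python 3.4) uses 8-byte lengths. `torch.save()` passes `pickle_protocol` through to the pickler, and to my understanding defaults to a lower protocol than the Python interpreter's. A minimal sketch, independent of the BERT code, showing that the protocol requested this way is actually recorded in the stream:

```python
import pickle

# Serialize a small placeholder object with the highest available protocol,
# the same value the patch passes to torch.save() via pickle_protocol=.
blob = pickle.dumps({"model": "placeholder state dict"},
                    protocol=pickle.HIGHEST_PROTOCOL)

# Every pickle stream from protocol 2 onward starts with the PROTO
# opcode (0x80) followed by the protocol number actually used.
assert blob[0] == 0x80
assert blob[1] == pickle.HIGHEST_PROTOCOL

# Protocol 4 or higher is what the OverflowError asks for; it is
# guaranteed on any Python >= 3.4, including the 3.8 in this container.
assert pickle.HIGHEST_PROTOCOL >= 4
```

Using `pickle.HIGHEST_PROTOCOL` rather than a hard-coded `4` picks up newer protocols automatically, at the cost of checkpoints not being readable by much older Python versions.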
Expected behavior
The checkpoint file should be saved without any error.
Environment
Please provide at least:
- Container version (e.g. pytorch:19.05-py3): nvcr.io/nvidia/pytorch:20.12-py3
- GPUs in the system (e.g. 8x Tesla V100-SXM2-16GB): Tesla V100-SXM2-32GB
- CUDA driver version (e.g. 418.67): 440.64.00