[LanguageModeling/BERT/PyTorch] Failed to save very large model file due to older pickle protocol version #897

@lazykyama

Description

Related to LanguageModeling/BERT/PyTorch

Describe the bug
When I tried to train a BERT model with PyTorch using a slightly larger vocabulary than the default setting, the following error occurred. According to the error message, the code needs to pass a `pickle_protocol` parameter to `torch.save()`.

```
Traceback (most recent call last):
  File "/path/to/workdir//run_pretraining.py", line 678, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/path/to/workdir//run_pretraining.py", line 647, in main
    torch.save({'model': model_to_save.state_dict(),
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 372, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 476, in _save
    pickler.dump(obj)
OverflowError: serializing a string larger than 4 GiB requires pickle protocol 4 or higher
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 303, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 294, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/path/to/workdir//run_pretraining.py', '--local_rank=7', '--input_dir=/path/to/dataset//phase1/', '--output_dir=/results//checkpoints', '--config_file=bert_config.json', '--bert_model=bert-large-uncased', '--train_batch_size=8192', '--max_seq_length=128', '--max_predictions_per_seq=20', '--max_steps=7038', '--warmup_proportion=0.2843', '--num_steps_per_checkpoint=200', '--learning_rate=6e-3', '--seed=12439', '--fp16', '--gradient_accumulation_steps=128', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--do_train', '--json-summary', '/results//dllogger.json']' returned non-zero exit status 1.
```

Applying the following change resolves the error (cf. https://stackoverflow.com/questions/29704139/pickle-in-python3-doesnt-work-for-large-data-saving):

```diff
--- ./run_pretraining.py	2021-03-27 05:22:26.332698069 +0000
+++ ./run_pretraining.py	2021-04-01 08:49:37.610154874 +0000
@@ -26,6 +26,7 @@
 import time
 import argparse
 import random
+import pickle
 import h5py
 from tqdm import tqdm, trange
 import os
@@ -649,7 +650,9 @@
                     'master params': list(amp.master_params(optimizer)),
                     'files': [f_id] + files,
                     'epoch': epoch,
-                    'data_loader': None if global_step >= args.max_steps else train_dataloader}, output_save_file)
+                    'data_loader': None if global_step >= args.max_steps else train_dataloader},
+                    output_save_file,
+                    pickle_protocol=pickle.HIGHEST_PROTOCOL)
                 most_recent_ckpts_paths.append(output_save_file)
                 if len(most_recent_ckpts_paths) > 3:
```
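As background on why bumping the protocol fixes this: pickle protocols below 4 use 4-byte length fields and therefore cannot serialize a single object larger than 4 GiB, while protocol 4 (Python >= 3.4) introduced 8-byte framing. A minimal sketch with plain `pickle` (a toy dict standing in for the checkpoint, no `torch` needed) showing how the chosen protocol is recorded in the stream:

```python
import pickle

# Toy payload standing in for the BERT checkpoint dict.
checkpoint = {"model": {"embedding": [0.0] * 8}, "epoch": 3}

# Older protocol 2 fails with OverflowError once a single serialized
# object exceeds 4 GiB (as in the traceback above).
blob_old = pickle.dumps(checkpoint, protocol=2)

# Protocol 4+ uses 8-byte frames, lifting the 4 GiB per-object cap;
# pickle.HIGHEST_PROTOCOL is what the patch passes to torch.save().
blob_new = pickle.dumps(checkpoint, protocol=pickle.HIGHEST_PROTOCOL)

# The second byte of a pickle stream records its protocol version.
print(blob_old[1], blob_new[1])
```

`torch.save()` simply forwards `pickle_protocol` to the pickler, so the same protocol rules apply to checkpoint files.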

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a dataset with a 32768-word vocabulary (I think this issue can be reproduced with any corpus),
  2. Change vocab_size in bert_config.json:
    • "vocab_size": 30522 -> "vocab_size": 32768
  3. Launch the NGC container nvcr.io/nvidia/pytorch:20.12-py3,
  4. Run bash ./scripts/run_pretraining.sh.
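Step 2 above can also be scripted. A minimal sketch, writing a trimmed stand-in config to a temporary file so the example is self-contained (the real bert_config.json has more fields):

```python
import json
import os
import tempfile

# Trimmed stand-in for bert_config.json; only vocab_size matters here.
cfg = {"vocab_size": 30522, "hidden_size": 1024}
cfg["vocab_size"] = 32768  # step 2: enlarge the vocabulary

# Write the modified config to a temporary copy.
path = os.path.join(tempfile.mkdtemp(), "bert_config.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```

In the real setup you would edit the bert_config.json referenced by `--config_file` in place instead of a temporary copy.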

Expected behavior
Saving the checkpoint file should complete without any error.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): nvcr.io/nvidia/pytorch:20.12-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): Tesla V100-SXM2-32GB
  • CUDA driver version (e.g. 418.67): 440.64.00

Labels

bug (Something isn't working)