[LanguageModeling/BERT/PyTorch] Failed to save very large model file due to older pickle protocol version #897

@lazykyama

Description

Related to LanguageModeling/BERT/PyTorch

Describe the bug
When I tried to train a BERT model with PyTorch using a slightly larger vocabulary than the default setting, the following error occurred. According to the error message, the code needs to pass a `pickle_protocol` parameter to `torch.save()`.

```
Traceback (most recent call last):
  File "/path/to/workdir//run_pretraining.py", line 678, in <module>
    args, final_loss, train_time_raw, global_step = main()
  File "/path/to/workdir//run_pretraining.py", line 647, in main
    torch.save({'model': model_to_save.state_dict(),
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 372, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 476, in _save
    pickler.dump(obj)
OverflowError: serializing a string larger than 4 GiB requires pickle protocol 4 or higher
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 303, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 294, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3', '-u', '/path/to/workdir//run_pretraining.py', '--local_rank=7', '--input_dir=/path/to/dataset//phase1/', '--output_dir=/results//checkpoints', '--config_file=bert_config.json', '--bert_model=bert-large-uncased', '--train_batch_size=8192', '--max_seq_length=128', '--max_predictions_per_seq=20', '--max_steps=7038', '--warmup_proportion=0.2843', '--num_steps_per_checkpoint=200', '--learning_rate=6e-3', '--seed=12439', '--fp16', '--gradient_accumulation_steps=128', '--allreduce_post_accumulation', '--allreduce_post_accumulation_fp16', '--do_train', '--json-summary', '/results//dllogger.json']' returned non-zero exit status 1.
```

Applying the following change resolves the error (cf. https://stackoverflow.com/questions/29704139/pickle-in-python3-doesnt-work-for-large-data-saving):

```diff
--- ./run_pretraining.py	2021-03-27 05:22:26.332698069 +0000
+++ ./run_pretraining.py	2021-04-01 08:49:37.610154874 +0000
@@ -26,6 +26,7 @@
 import time
 import argparse
 import random
+import pickle
 import h5py
 from tqdm import tqdm, trange
 import os
@@ -649,7 +650,9 @@
                     'master params': list(amp.master_params(optimizer)),
                     'files': [f_id] + files,
                     'epoch': epoch,
-                    'data_loader': None if global_step >= args.max_steps else train_dataloader}, output_save_file)
+                    'data_loader': None if global_step >= args.max_steps else train_dataloader},
+                    output_save_file,
+                    pickle_protocol=pickle.HIGHEST_PROTOCOL)
                 most_recent_ckpts_paths.append(output_save_file)
                 if len(most_recent_ckpts_paths) > 3:
```
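As background on why bumping the protocol fixes this: pickle protocols below 4 use 4-byte length fields and therefore cannot serialize a single object larger than 4 GiB, while protocol 4 (Python >= 3.4) introduced 8-byte framing. A minimal sketch with plain `pickle` (a toy dict standing in for the checkpoint, no `torch` needed) showing how the chosen protocol is recorded in the stream:

```python
import pickle

# Toy payload standing in for the BERT checkpoint dict.
checkpoint = {"model": {"embedding": [0.0] * 8}, "epoch": 3}

# Older protocol 2 fails with OverflowError once a single serialized
# object exceeds 4 GiB (as in the traceback above).
blob_old = pickle.dumps(checkpoint, protocol=2)

# Protocol 4+ uses 8-byte frames, lifting the 4 GiB per-object cap;
# pickle.HIGHEST_PROTOCOL is what the patch passes to torch.save().
blob_new = pickle.dumps(checkpoint, protocol=pickle.HIGHEST_PROTOCOL)

# The second byte of a pickle stream records its protocol version.
print(blob_old[1], blob_new[1])
```

`torch.save()` simply forwards `pickle_protocol` to the pickler, so the same protocol rules apply to checkpoint files.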

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a dataset with a 32768-word vocabulary (I think this issue can be reproduced with any corpus),
  2. Change vocab_size in bert_config.json:
    • "vocab_size": 30522 -> "vocab_size": 32768
  3. Launch the NGC container nvcr.io/nvidia/pytorch:20.12-py3,
  4. Run bash ./scripts/run_pretraining.sh.
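Step 2 above can also be scripted. A minimal sketch, writing a trimmed stand-in config to a temporary file so the example is self-contained (the real bert_config.json has more fields):

```python
import json
import os
import tempfile

# Trimmed stand-in for bert_config.json; only vocab_size matters here.
cfg = {"vocab_size": 30522, "hidden_size": 1024}
cfg["vocab_size"] = 32768  # step 2: enlarge the vocabulary

# Write the modified config to a temporary copy.
path = os.path.join(tempfile.mkdtemp(), "bert_config.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```

In the real setup you would edit the bert_config.json referenced by `--config_file` in place instead of a temporary copy.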

Expected behavior
Saving the checkpoint file should complete without any error.

Environment
Please provide at least:

  • Container version (e.g. pytorch:19.05-py3): nvcr.io/nvidia/pytorch:20.12-py3
  • GPUs in the system: (e.g. 8x Tesla V100-SXM2-16GB): Tesla V100-SXM2-32GB
  • CUDA driver version (e.g. 418.67): 440.64.00

Labels

bug (Something isn't working)