The pre-training code of CPT is based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

For the **Setup** and **Data Processing** steps of CPT, you can refer to the [README](README_megatron.md) of Megatron-LM. In addition, the package [jieba_fast](https://github.com/deepcs233/jieba_fast) is required for Whole Word Masking pre-training.
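
For the jieba_fast requirement, a minimal sketch of the install step, assuming the package is available from PyPI under the same name (otherwise install it from the linked repository):

```bash
# Install jieba_fast, used for Chinese word segmentation in Whole Word Masking pre-training.
pip install jieba_fast
```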

## Training
First, prepare the files in the following folders (a layout sketch follows the list):
- `dataset/`: Place the `.bin` and `.idx` files preprocessed from raw text.
- `vocab/`: Place the vocab files and model config file.
- `roberta_zh/`: Place the checkpoint of Chinese RoBERTa, since CPT initializes its encoder from this checkpoint.
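
For illustration, a rough sketch of preparing this layout from the `pretrain/` directory; the example file names in the comments are hypothetical placeholders, only the folder names come from the list above:

```bash
# Create the expected folders (skip any that already exist).
mkdir -p dataset vocab roberta_zh

# Check that each folder holds what the list above describes.
ls dataset     # e.g. my_corpus_text_sentence.bin, my_corpus_text_sentence.idx
ls vocab       # vocab files and the model config file
ls roberta_zh  # the Chinese RoBERTa checkpoint used to initialize the CPT encoder
```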
Then, use the scripts `run_pretrain_bart.sh` and `run_pretrain_cpt.sh` to train Chinese BART and CPT, respectively.
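
A minimal sketch of launching the two runs from this directory, assuming the scripts are executed as-is with `bash`:

```bash
# Pre-train Chinese BART.
bash run_pretrain_bart.sh

# Pre-train CPT.
bash run_pretrain_cpt.sh
```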

*NOTE: the training scripts are distributed examples for 8 GPUs. You may alter the number of GPUs and change the number of training steps to meet your needs.*

## Main Changes
- Add `bart_model` and `cpt_model` for Megatron under `megatron/model`, so that Megatron can train on BART and CPT.
- Add `_HfBertTokenizer` in `megatron/tokenizer/tokenizer.py` so that Megatron can use tokenizers from Huggingface-Transformers.
- Add `bart_dataset` and `cpt_dataset` under `megatron/data` to produce data for Whole Word Masking (WWM) and Denoising Auto-Encoder (DAE) pre-training.
- Add `tools/convert_ckpt.py` to convert Megatron checkpoints to Huggingface-Transformers format.
- Add `tools/preprocess_data.py` to preprocess and chunk large amounts of text data into the binary format used by Megatron (a hedged example invocation is sketched below).
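
As a rough guide to producing the `.bin`/`.idx` pair expected in `dataset/`, the sketch below is modeled on upstream Megatron-LM's preprocessing example; the flags and file names are assumptions and may differ in this modified script, so check its own help output first:

```bash
# List the options actually supported by the modified script.
python tools/preprocess_data.py --help

# Illustrative invocation, following upstream Megatron-LM (flags, tokenizer settings,
# and file names are placeholders; adapt them to this repository's script):
python tools/preprocess_data.py \
    --input my_corpus.json \
    --output-prefix dataset/my_corpus \
    --vocab vocab/vocab.txt \
    --dataset-impl mmap \
    --split-sentences \
    --workers 8
```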