
Conversation


@kuke kuke commented Sep 30, 2017

Resolve #268

@kuke kuke requested review from jacquesqiao and reyoung September 30, 2017 07:11

kuke commented Sep 30, 2017

The bug is related to the usage of numpy.absolute. Simply separating the absolute-value and square operations lets training proceed normally in the latest nvidia-docker. But I don't know how the bug arises, or why such a tiny change works.
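To illustrate the kind of change described above (the actual diff is not shown in this thread, so the array and variable names below are hypothetical), here is a minimal sketch of splitting a fused `np.absolute(x) ** 2` expression into two separate steps:

```python
import numpy as np

# Hypothetical FFT output of one audio frame; the real array in the
# featurizer is not shown in this thread.
fft = np.array([3.0 + 4.0j, 1.0 - 2.0j, -0.5 + 0.5j])

# Fused form (the expression style associated with the bug):
power_fused = np.absolute(fft) ** 2

# Workaround: compute the absolute value first, then square it
# in a separate operation.
magnitude = np.absolute(fft)
power = magnitude * magnitude

# The two forms are numerically equivalent; only the fused
# expression misbehaved in the affected environment.
print(np.allclose(power_fused, power))  # → True
```

Both forms compute the same power spectrum, so the workaround does not change the model's input features.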

Here is the complete log of training on the tiny dataset, produced by running examples/tiny/run_train.sh:

```
----------- Configuration Arguments -----------
augment_conf_path: conf/augmentation.config
batch_size: 16
dev_manifest: data/tiny/manifest.tiny
init_model_path: None
is_local: 1
learning_rate: 1e-05
max_duration: 27.0
mean_std_path: data/tiny/mean_std.npz
min_duration: 0.0
num_conv_layers: 2
num_iter_print: 100
num_passes: 20
num_proc_data: 1
num_rnn_layers: 3
output_model_dir: ./checkpoints/tiny
rnn_layer_size: 2048
share_rnn_weights: 1
shuffle_method: batch_shuffle_clipped
specgram_type: linear
test_off: 0
train_manifest: data/tiny/manifest.tiny
trainer_count: 4
use_gpu: 1
use_gru: 0
use_sortagrad: 1
vocab_path: data/tiny/vocab.txt
------------------------------------------------
I0930 06:56:07.120182 21465 Util.cpp:166] commandline: --use_gpu=1 --trainer_count=4 --log_clipping=True
[INFO 2017-09-30 06:56:09,917 layers.py:2554] output for __conv_0__: c = 32, h = 81, w = 54, size = 139968
[INFO 2017-09-30 06:56:09,919 layers.py:3077] output for __batch_norm_0__: c = 32, h = 81, w = 54, size = 139968
[INFO 2017-09-30 06:56:09,920 layers.py:2554] output for __conv_1__: c = 32, h = 41, w = 54, size = 70848
[INFO 2017-09-30 06:56:09,921 layers.py:3077] output for __batch_norm_1__: c = 32, h = 41, w = 54, size = 70848
I0930 06:56:09.948861 21465 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0930 06:56:10.058734 21465 GradientMachine.cpp:85] Initing parameters..
I0930 06:56:13.628923 21465 GradientMachine.cpp:92] Init parameters done.
....
------- Time: 26 sec, Pass: 0, ValidationCost: 982.797485352
...
------- Time: 21 sec, Pass: 1, ValidationCost: 888.178131104
...
------- Time: 17 sec, Pass: 2, ValidationCost: 685.327529907
...
------- Time: 20 sec, Pass: 3, ValidationCost: 563.537101746
...
------- Time: 18 sec, Pass: 4, ValidationCost: 483.789489746
...
------- Time: 17 sec, Pass: 5, ValidationCost: 356.234912872
...
------- Time: 17 sec, Pass: 6, ValidationCost: 270.886962891
...
------- Time: 18 sec, Pass: 7, ValidationCost: 259.592220306
...
------- Time: 19 sec, Pass: 8, ValidationCost: 263.276557922
...
------- Time: 18 sec, Pass: 9, ValidationCost: 259.665016174
...
------- Time: 18 sec, Pass: 10, ValidationCost: 251.80607605
...
------- Time: 17 sec, Pass: 11, ValidationCost: 247.049232483
...
------- Time: 19 sec, Pass: 12, ValidationCost: 244.757091522
...
------- Time: 19 sec, Pass: 13, ValidationCost: 244.193054199
...
------- Time: 18 sec, Pass: 14, ValidationCost: 244.47366333
...
------- Time: 19 sec, Pass: 15, ValidationCost: 244.523860931
...
------- Time: 19 sec, Pass: 16, ValidationCost: 243.082180023
...
------- Time: 16 sec, Pass: 17, ValidationCost: 236.684200287
...
------- Time: 18 sec, Pass: 18, ValidationCost: 232.584197998
...
------- Time: 18 sec, Pass: 19, ValidationCost: 230.130191803
```
Member

@jacquesqiao jacquesqiao left a comment

BMJL!

@kuke kuke merged commit 0173cc5 into PaddlePaddle:develop Sep 30, 2017
