
Conversation


@kuke kuke commented Sep 30, 2017

Resolve #268

@kuke kuke requested review from jacquesqiao and reyoung September 30, 2017 07:11

kuke commented Sep 30, 2017

The bug is related to the usage of numpy.absolute. Simply separating the absolute-value and square operations lets training proceed normally in the latest nvidia-docker. But I don't know how the bug arises, or why such a tiny change works.
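To illustrate the kind of change described above (the actual diff is not shown in this thread, so the array and variable names below are hypothetical), here is a minimal sketch of splitting a fused `np.absolute(x) ** 2` expression into two separate steps:

```python
import numpy as np

# Hypothetical FFT output of one audio frame; the real array in the
# featurizer is not shown in this thread.
fft = np.array([3.0 + 4.0j, 1.0 - 2.0j, -0.5 + 0.5j])

# Fused form (the expression style associated with the bug):
power_fused = np.absolute(fft) ** 2

# Workaround: compute the absolute value first, then square it
# in a separate operation.
magnitude = np.absolute(fft)
power = magnitude * magnitude

# The two forms are numerically equivalent; only the fused
# expression misbehaved in the affected environment.
print(np.allclose(power_fused, power))  # → True
```

Both forms compute the same power spectrum, so the workaround does not change the model's input features.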

Here is the complete log of training on the tiny dataset, produced by running examples/tiny/run_train.sh:

```
----------- Configuration Arguments -----------
augment_conf_path: conf/augmentation.config
batch_size: 16
dev_manifest: data/tiny/manifest.tiny
init_model_path: None
is_local: 1
learning_rate: 1e-05
max_duration: 27.0
mean_std_path: data/tiny/mean_std.npz
min_duration: 0.0
num_conv_layers: 2
num_iter_print: 100
num_passes: 20
num_proc_data: 1
num_rnn_layers: 3
output_model_dir: ./checkpoints/tiny
rnn_layer_size: 2048
share_rnn_weights: 1
shuffle_method: batch_shuffle_clipped
specgram_type: linear
test_off: 0
train_manifest: data/tiny/manifest.tiny
trainer_count: 4
use_gpu: 1
use_gru: 0
use_sortagrad: 1
vocab_path: data/tiny/vocab.txt
------------------------------------------------
I0930 06:56:07.120182 21465 Util.cpp:166] commandline: --use_gpu=1 --trainer_count=4 --log_clipping=True
[INFO 2017-09-30 06:56:09,917 layers.py:2554] output for __conv_0__: c = 32, h = 81, w = 54, size = 139968
[INFO 2017-09-30 06:56:09,919 layers.py:3077] output for __batch_norm_0__: c = 32, h = 81, w = 54, size = 139968
[INFO 2017-09-30 06:56:09,920 layers.py:2554] output for __conv_1__: c = 32, h = 41, w = 54, size = 70848
[INFO 2017-09-30 06:56:09,921 layers.py:3077] output for __batch_norm_1__: c = 32, h = 41, w = 54, size = 70848
I0930 06:56:09.948861 21465 MultiGradientMachine.cpp:99] numLogicalDevices=1 numThreads=4 numDevices=4
I0930 06:56:10.058734 21465 GradientMachine.cpp:85] Initing parameters..
I0930 06:56:13.628923 21465 GradientMachine.cpp:92] Init parameters done.
....
------- Time: 26 sec, Pass: 0, ValidationCost: 982.797485352
...
------- Time: 21 sec, Pass: 1, ValidationCost: 888.178131104
...
------- Time: 17 sec, Pass: 2, ValidationCost: 685.327529907
...
------- Time: 20 sec, Pass: 3, ValidationCost: 563.537101746
...
------- Time: 18 sec, Pass: 4, ValidationCost: 483.789489746
...
------- Time: 17 sec, Pass: 5, ValidationCost: 356.234912872
...
------- Time: 17 sec, Pass: 6, ValidationCost: 270.886962891
...
------- Time: 18 sec, Pass: 7, ValidationCost: 259.592220306
...
------- Time: 19 sec, Pass: 8, ValidationCost: 263.276557922
...
------- Time: 18 sec, Pass: 9, ValidationCost: 259.665016174
...
------- Time: 18 sec, Pass: 10, ValidationCost: 251.80607605
...
------- Time: 17 sec, Pass: 11, ValidationCost: 247.049232483
...
------- Time: 19 sec, Pass: 12, ValidationCost: 244.757091522
...
------- Time: 19 sec, Pass: 13, ValidationCost: 244.193054199
...
------- Time: 18 sec, Pass: 14, ValidationCost: 244.47366333
...
------- Time: 19 sec, Pass: 15, ValidationCost: 244.523860931
...
------- Time: 19 sec, Pass: 16, ValidationCost: 243.082180023
...
------- Time: 16 sec, Pass: 17, ValidationCost: 236.684200287
...
------- Time: 18 sec, Pass: 18, ValidationCost: 232.584197998
...
------- Time: 18 sec, Pass: 19, ValidationCost: 230.130191803
```
Member

@jacquesqiao jacquesqiao left a comment

BMJL!

@kuke kuke merged commit 0173cc5 into PaddlePaddle:develop Sep 30, 2017
