Description
Testing the resnet model on P40: it fails to run on a single machine with 8 GPUs. (It fails with 5, 6, 7, or 8 GPUs.)
run model resnet50
cudaid
1,2,3,4,5,6,7
CUDA_VISIBLE_DEVICES
1,2,3,4,5,6,7
----------- Configuration Arguments -----------
batch_size: 64
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False
*** Aborted at 1528264637 (unix time) try "date -d @1528264637" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x50) received by PID 46576 (TID 0x7f13dd421700) from PID 80; stack trace: ***
@ 0x7f13dcc447e0 (unknown)
@ 0x7f131320b7d3 commFree()
@ 0x7f131320f82d ncclCommInitAll
@ 0x7f13ba1ef075 paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7f13ba1eae01 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7f13ba186195 ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS2_13function_callEE1_clES23
@ 0x7f13ba1862ee ZZN8pybind1112cpp_function10initializeIZNS_6detail4initIIRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS8_8CPUPlaceENS8_15CUDAPinnedPlaceENS5_6detail7variant5void_ESE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_SE_EESaISF_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEESS_RKNS7_9framework11ProgramDescERKSsPNST_5ScopeERS4_IS10_SaIS10_EERKNST_7details17ExecutionStrategyERKNS14_13BuildStrategyEmmEE7executeINS_6class_INST_16ParallelExecutorEIEEEIELi0EEEvRT_DpRKT0_EUlPS1E_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmE_vIS1M_SJ_SS_SS_SW_SY_S10_S13_S17_S1A_mmEINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES23
@ 0x7f13ba149b74 pybind11::cpp_function::dispatcher()
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcec026d function_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceaaf6f instancemethod_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dceee7fe slot_tp_init
@ 0x7f13dceed468 type_call
@ 0x7f13dce980e3 PyObject_Call
@ 0x7f13dcf31877 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf32491 PyEval_EvalFrameEx
@ 0x7f13dcf34120 PyEval_EvalCodeEx
@ 0x7f13dcf34232 PyEval_EvalCode
@ 0x7f13dcf4e61c run_mod
@ 0x7f13dcf4e6f0 PyRun_FileExFlags
@ 0x7f13dcf4fbfc PyRun_SimpleFileExFlags
@ 0x7f13dcf614bc Py_Main
./run.xsh: line 22: 46576 Segmentation fault
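One detail in the failing log above: CUDA_VISIBLE_DEVICES lists seven devices (1,2,3,4,5,6,7), while the printed gpu_id argument lists four (4,5,6,7). Whether this mismatch is related to the crash is not established here. The following is only a minimal sketch for comparing the two settings before launching a run; `parse_device_list` is a hypothetical helper, not part of the benchmark script.

```python
def parse_device_list(value):
    """Split a comma-separated device string such as "4,5,6,7" into ints."""
    return [int(tok) for tok in value.split(",") if tok.strip()]


# Values copied from the failing run's log output.
visible = parse_device_list("1,2,3,4,5,6,7")  # CUDA_VISIBLE_DEVICES
gpu_id = parse_device_list("4,5,6,7")         # gpu_id configuration argument

# Report how many devices each setting names and whether they agree.
print("visible devices:", len(visible))
print("gpu_id devices:", len(gpu_id))
print("settings match:", visible == gpu_id)
```

In a real launch script the first value would typically come from `os.environ["CUDA_VISIBLE_DEVICES"]` rather than a literal string.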
Model code:
PaddlePaddle/paddle-ce-latest-kpis@a2d1273
nccl:
continuous_evaluation]# ls -ltr /chaorong/lib/libnccl*
-rwxrwxrwx 1 root root 232842694 Feb 22 22:00 /chaorong/lib/libnccl_static.a
-rwxrwxrwx 1 root root 227911007 Feb 22 22:00 /chaorong/lib/libnccl.so.2.1.15
lrwxrwxrwx 1 root root 17 Feb 22 22:00 /chaorong/lib/libnccl.so.2 -> libnccl.so.2.1.15
lrwxrwxrwx 1 root root 12 Feb 22 22:00 /chaorong/lib/libnccl.so -> libnccl.so.2
It runs with 4 GPUs:
CUDA_VISIBLE_DEVICES
4,5,6,7
----------- Configuration Arguments -----------
batch_size: 128
data_format: NCHW
data_set: flowers
device: GPU
gpu_id: 4,5,6,7
infer_only: False
iterations: 80
log_dir: ./
model: resnet_imagenet
pass_num: 5
skip_batch_num: 5
use_cprof: False
use_fake_data: True
use_nvprof: False
Pass: 0, Iter: 0, loss: 6.1183624, acc: 0.0
Pass: 0, Iter: 1, loss: 5.5844965, acc: 0.0