@cyanguwa
While profiling the PyTorch Detection/SSD code with a resnet50 backbone, we noticed a large number of NCHW-NHWC transpose kernels. This pull request switches the input data format from NCHW to NHWC, the memory layout that Tensor Cores handle more efficiently. With this change we observed a ~50% throughput improvement on an A100 80G card, measured with the following benchmark run:
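For readers unfamiliar with the layout switch, here is a minimal sketch of how NHWC is usually requested in PyTorch through `torch.channels_last`. It is not the diff from this pull request: the small `nn.Sequential` model and the random input are hypothetical stand-ins for the SSD model and its data loader, and the sketch runs on CPU only to show the API.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in network; the PR itself targets SSD with a resnet50 backbone.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Re-stride the 4D parameters (conv weights) into channels_last (NHWC) order.
model = model.to(memory_format=torch.channels_last)

# A batch in the default contiguous (NCHW) layout.
images = torch.randn(2, 3, 32, 32)

# The logical shape stays (N, C, H, W); only the strides in memory change.
images = images.contiguous(memory_format=torch.channels_last)

with torch.no_grad():
    out = model(images)

print(images.shape)
print(images.is_contiguous(memory_format=torch.channels_last))
```

Because the tensor shape is unchanged, the rest of the training loop does not need to be rewritten for the new layout. Whether a given operator keeps the NHWC layout on its output depends on the operator and the backend, so profiling again after the change is the way to confirm that the transpose kernels are gone.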

python ./main.py --backbone resnet50 --warmup 300 --bs 256 --amp --data --epochs 1 --mode benchmark-training --benchmark-iterations 100 --num-workers 1

Before: Total images: 25600 total time: 60.946 Average images/sec: 420.047 Median images/sec: 420.351
After: Total images: 25600 total time: 40.827 Average images/sec: 627.030 Median images/sec: 627.356
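
As a quick sanity check on the ~50% figure, the reported numbers can be compared directly. This is only arithmetic on the values printed above; the variable names are illustrative.

```python
# Values copied from the benchmark output above.
before_avg = 420.047   # images/sec, NCHW
after_avg = 627.030    # images/sec, NHWC
before_time = 60.946   # seconds for 25600 images
after_time = 40.827    # seconds for 25600 images

# Relative throughput gain, in percent (roughly 49%).
improvement_pct = (after_avg / before_avg - 1.0) * 100.0

# Relative reduction in total time, in percent (roughly 33%).
time_reduction_pct = (1.0 - after_time / before_time) * 100.0

print(f"throughput improvement: {improvement_pct:.1f}%")
print(f"total time reduction:   {time_reduction_pct:.1f}%")
```

The two percentages differ because throughput is the inverse of time: cutting the run time by about a third corresponds to about one and a half times the throughput.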
