PyTorch Developer Mailing List

performance

Topic		Replies	Views	Activity
About the performance category		0	771	January 22, 2021
Performance of "activation" sparsity		1	212	April 21, 2025
The "Ideal" PyTorch FLOP Counter (with __torch_dispatch__)		20	16680	August 22, 2024
Fast combined C++/Python/TorchScript/Inductor tracebacks		4	3106	August 17, 2023
Estimate theoretical FLOPs of backward pass of a DNN		1	1123	April 16, 2023
Performance gains w/ nanoGPT using SDPA Custom Kernel		0	4724	January 30, 2023
Making Transformer inference faster on GPUs		1	3993	December 16, 2022
Investigation report: what would it cost to optimize c10::intrusive_ptr destruction for refcount == 1? (A: too much)		0	718	May 3, 2022
Working With `c10::IValue` Efficiently		0	2719	April 18, 2022
Unionizing for Profit: How to Exploit the Power of Unions in C++		2	3360	January 7, 2022
CUDA loops case study: code generation vs templates		4	2584	December 12, 2021
Multiple workers for single batch		2	1175	July 14, 2021
Optimizing contiguous() for the case where the Tensor is_contiguous()?		6	1755	May 24, 2021
Converting weights flat buffer		0	902	March 25, 2021
Why `torch::jit::pop` (and sometimes push) is worse for performance than direct `std::vector` access		0	1122	March 17, 2021
PyTorch Benchmarks issues with general usability and issues with individual benchmarks		2	1334	March 12, 2021
GPU Overheads and Fused Strassen		0	2255	February 13, 2021
Comparing the performance of 0.4.1 and master		0	2364	February 9, 2021
Overhead in `nn.Module` causing massive slowdowns compared to raw cuBLAS or TorchScript		0	1694	January 28, 2021
Dispatcher Performance and Inlining: a Report on Two Days Spent on Dispatcher Performance		1	1223	January 28, 2021
We shouldn't feel bad about passing `Tensor` by reference		0	2368	January 25, 2021