Topic | | Replies | Views | Activity |
About the performance category | | 0 | 771 | January 22, 2021 |
Performance of "activation" sparsity | | 1 | 212 | April 21, 2025 |
The "Ideal" PyTorch FLOP Counter (with __torch_dispatch__) | | 20 | 16680 | August 22, 2024 |
Fast combined C++/Python/TorchScript/Inductor tracebacks | | 4 | 3106 | August 17, 2023 |
Estimate theoretical FLOPs of backward pass of a DNN | | 1 | 1123 | April 16, 2023 |
Performance gains w/ nanoGPT using SDPA Custom Kernel | | 0 | 4724 | January 30, 2023 |
Making Transformer inference faster on GPUs | | 1 | 3993 | December 16, 2022 |
Investigation report: what would it cost to optimize c10::intrusive_ptr destruction for refcount == 1? (A: too much) | | 0 | 718 | May 3, 2022 |
Working With `c10::IValue` Efficiently | | 0 | 2719 | April 18, 2022 |
Unionizing for Profit: How to Exploit the Power of Unions in C++ | | 2 | 3360 | January 7, 2022 |
CUDA loops case study: code generation vs templates | | 4 | 2584 | December 12, 2021 |
Multiple workers for single batch | | 2 | 1175 | July 14, 2021 |
Optimizing contiguous() for the case where the Tensor is_contiguous()? | | 6 | 1755 | May 24, 2021 |
Converting weights flat buffer | | 0 | 902 | March 25, 2021 |
Why `torch::jit::pop` (and sometimes push) is worse for performance than direct `std::vector` access | | 0 | 1122 | March 17, 2021 |
PyTorch Benchmarks issues with general usability and issues with individual benchmarks | | 2 | 1334 | March 12, 2021 |
GPU Overheads and Fused Strassen | | 0 | 2255 | February 13, 2021 |
Comparing the performance of 0.4.1 and master | | 0 | 2364 | February 9, 2021 |
Overhead in `nn.Module` causing massive slowdowns compared to raw CuBLAS or Torchscript | | 0 | 1694 | January 28, 2021 |
Dispatcher Performance and Inlining: a Report on Two Days Spent on Dispatcher Performance | | 1 | 1223 | January 28, 2021 |
We shouldn't feel bad about passing `Tensor` by reference | | 0 | 2368 | January 25, 2021 |