There is a recent PhD thesis, Practical fast matrix multiplication algorithms, on fast matrix multiplication (FMM) algorithms such as Strassen's:

Matrix multiplication is a core building block for numerous scientific computing and, more recently, machine learning applications. Strassen's algorithm, the original Fast Matrix Multiplication (FMM) algorithm, has long fascinated computer scientists due to its startling property of reducing the number of computations required for multiplying $n \times n$ matrices from $\mathcal{O}(n^3)$ to $\mathcal{O}(n^{2.807})$. Over the last half century, this has fueled many theoretical improvements, such as other variations of Strassen-like FMM algorithms. Previous implementations of these FMM algorithms led to the "street wisdom" that they are only practical for large, relatively square matrices, that they require considerable workspace, and that thread-level parallelism is difficult to achieve with them. The thesis of this work dispels these notions by demonstrating significant benefits for small and non-square matrices, requiring no workspace beyond what is already incorporated in high-performance implementations of matrix multiplication, and achieving performance benefits on multi-core, many-core, and distributed-memory architectures.
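
For readers unfamiliar with where the $\mathcal{O}(n^{2.807})$ exponent comes from: Strassen replaces the eight recursive block multiplications of the classical algorithm with seven, so the cost recurrence becomes $T(n) = 7\,T(n/2) + \mathcal{O}(n^2)$, which solves to $T(n) = \mathcal{O}(n^{\log_2 7}) \approx \mathcal{O}(n^{2.807})$. Below is a minimal, textbook-level sketch in Python/NumPy of that one-level recursion. It assumes square, power-of-two dimensions and uses a hypothetical `cutoff` parameter of my own choosing; it illustrates only the arithmetic trick, and is emphatically not the workspace-free, high-performance implementation developed in the thesis.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices A and B (n a power of two) with Strassen's
    recursion, falling back to an ordinary multiply below `cutoff`."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # classical O(n^3) multiply on small blocks

    m = n // 2
    # Partition each operand into four m x m quadrants.
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]

    # Strassen's seven block products (instead of the classical eight).
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    # Reassemble the quadrants of C from the seven products.
    C = np.empty_like(A)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C

# Sanity check against the classical product.
rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 256))
assert np.allclose(strassen(A, B), A @ B)
```

The cutoff matters in practice: recursing all the way down makes the extra matrix additions and the temporaries overwhelm the saved multiplications, which is precisely the constant-factor regime that the implementations in the thesis and papers below are designed to handle.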

This work includes several publications:

  1. Strassen's Algorithm Reloaded. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, UT, November 2016.

  2. Generating Families of Practical Fast Matrix Multiplication Algorithms. In 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017.

  3. Strassen's Algorithm for Tensor Contraction. In SIAM Journal on Scientific Computing (SISC), 40(3):C305-C326, 2018.

  4. Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs. FLAME Working Note #88, The University of Texas at Austin, Department of Computer Science. Technical Report TR-18-08. August 23, 2018.

The open-source code repositories are here:

https://github.com/flame/fmm-gen

https://github.com/flame/tblis-strassen

Other than that, you might also be interested in this paper:

A framework for practical parallel fast matrix multiplication. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2015).
