
Conversation

@emrys53 (Contributor) commented Jul 1, 2025

While using XSIMD in our library, we realized that a dedicated reduce_mul function was missing. This PR introduces a reduce_mul implementation for all architectures and supported types, with architecture-specific optimizations for SSE, AVX, and AVX512 instruction sets. Relevant unit tests are included.

I was unable to test the AVX512-specific implementation on my machine, which only supports AVX2. For AVX512, I used the _mm512_reduce_mul_* intrinsics. According to Intel's documentation, these are sequential rather than parallel. Since I couldn't benchmark or compare alternative implementations, I left it as is. I'm open to suggestions for improvement.
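For reference, a minimal sketch of what that AVX512 path amounts to for floats (the intrinsic is from Intel's intrinsics guide; the wrapper name is mine, not the PR's):

```cpp
#include <immintrin.h>

// _mm512_reduce_mul_ps is documented by Intel as a sequence of
// operations (extract, multiply, repeat), not a single instruction.
inline float reduce_mul_f32_avx512(__m512 v) noexcept
{
    return _mm512_reduce_mul_ps(v);
}
```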
Motivation

Although XSIMD provides a general reduce function, using it with std::complex<float> or std::complex<double> fails, as as_unsigned_integer_t is not specialized for complex types. I'm not sure whether this is intended behavior, but I wanted to mention it for context. Providing a reduce_mul directly avoids this issue (see the usage sketch below).
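To illustrate the intended usage (the reduce_mul free function is the API proposed in this PR; the batch width depends on the target architecture):

```cpp
#include <complex>
#include <xsimd/xsimd.hpp>

int main()
{
    // The generic reduce fails to compile for complex batches because
    // as_unsigned_integer_t is not specialized for std::complex;
    // a dedicated reduce_mul avoids that machinery entirely.
    xsimd::batch<std::complex<float>> b(std::complex<float>(2.0f, 0.0f));
    std::complex<float> p = xsimd::reduce_mul(b); // product of all lanes
    return p.real() > 0.f ? 0 : 1;
}
```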

This is my first contribution to XSIMD, and I’ve only been using the library for about a week. Feedback is very welcome, and I’m happy to adjust or improve the code based on your review.


// hmul
template <class A, class T, class /*=typename std::enable_if<std::is_integral<T>::value, void>::type*/>
XSIMD_INLINE T hmul(batch<T, A> const& self, requires_arch<common>) noexcept
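For context, a common-architecture hmul like the one declared above typically spills to a scalar buffer and folds it with multiplication; a minimal sketch in that spirit (my own function name, not the PR's exact code):

```cpp
// Sketch of a common-architecture fallback: store the batch to a
// scalar buffer and multiply element-wise. Portable but slow.
template <class A, class T>
XSIMD_INLINE T hmul_fallback(batch<T, A> const& self) noexcept
{
    alignas(A::alignment()) T buffer[batch<T, A>::size];
    self.store_aligned(buffer);
    T acc = buffer[0];
    for (std::size_t i = 1; i < batch<T, A>::size; ++i)
        acc *= buffer[i];
    return acc;
}
```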
Contributor

Shouldn't this just be the generic implementation of reduce_mul? And if so, why not use the current generic reduction implementation?

Contributor Author

This function is only used for integral types with sizeof(T) < 4. Here is why I implemented it this way: take reduce_mul for uint8_t on an AVX machine as an example. The batch has 32 elements, so the general reduce function multiplies and swizzles the batch 5 times (32->16->8->4->2->1) and at the end calls self.get(0), which is itself quite inefficient (issue #1133). It seemed more efficient to do it differently: for uint8_t, my implementation first splits the batch into 2 equal sse4_2 batches, multiplies them together, and then calls hmul (see the sketch below).
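A rough sketch of that first split-and-multiply step for uint8_t, assuming AVX2 intrinsics and using helper names of my own (in the PR itself, xsimd's split and hmul primitives do this work):

```cpp
#include <immintrin.h>
#include <cstdint>

// No 8-bit multiply in SSE: widen to 16 bits, multiply, repack low bytes.
static inline __m128i mul_u8(__m128i a, __m128i b) noexcept
{
    const __m128i zero = _mm_setzero_si128();
    __m128i lo = _mm_mullo_epi16(_mm_unpacklo_epi8(a, zero),
                                 _mm_unpacklo_epi8(b, zero));
    __m128i hi = _mm_mullo_epi16(_mm_unpackhi_epi8(a, zero),
                                 _mm_unpackhi_epi8(b, zero));
    const __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_packus_epi16(_mm_and_si128(lo, mask), _mm_and_si128(hi, mask));
}

// One 256->128 split and a lane-wise multiply replace the first of the
// five swizzle rounds; a 128-bit hmul then finishes the reduction.
static inline __m128i split_mul_u8_avx(__m256i v) noexcept
{
    __m128i lo = _mm256_castsi256_si128(v);      // lanes 0..15
    __m128i hi = _mm256_extracti128_si256(v, 1); // lanes 16..31
    return mul_u8(lo, hi);                       // 16 partial products
}
```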

@emrys53 (Contributor Author) Jul 1, 2025

Also, the generic swizzle function for uint8_t just loads the batch into a buffer and stores the result back (roughly the pattern sketched below). Using that 5 times (on AVX) is far less efficient than the split_avx function, which splits the batch with SIMD intrinsics.
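For contrast, the memory round-trip that the generic swizzle amounts to looks roughly like this (a sketch of the pattern, not the actual xsimd source):

```cpp
#include <cstddef>
#include <cstdint>

// Spill the batch, permute through a scalar loop, reload. Doing this
// 5 times per reduction is what makes the generic path slow for uint8_t.
template <std::size_t N>
inline void swizzle_via_buffer(const uint8_t (&in)[N],
                               const uint8_t (&idx)[N],
                               uint8_t (&out)[N]) noexcept
{
    for (std::size_t i = 0; i < N; ++i)
        out[i] = in[idx[i]];
}
```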

Contributor Author

@serge-sans-paille you haven't responded yet. If you are open to the idea of a reduce_mul implementation, I can fix the code so that it passes all the tests. If you have feedback or recommendations on the code I have written, you can post it here and I will address it after we discuss.

@DiamonDinoia (Contributor)

I'm also interested in this. @emrys53 could you rebase onto master?
@serge-sans-paille what do you think needs to be changed to integrate it?

serge-sans-paille added commits that referenced this pull request (Aug 20–23, 2025):
It changes nothing for clang, which has a nice shuffle optimizer, but it does help gcc a lot. Somehow related to #1132
This is a generalization of #1132 by @emrys53. Part of the Intel code is strongly inspired by the work from #1132, with some minor nits.
@serge-sans-paille (Contributor)

@emrys53 I've submitted #1163 as a port of this patch for other targets, as well as some support for other architectures. I was strongly motivated and inspired by your work, so please don't take it as an offense, rather as a tribute!

@JohanMabille (Member)

Closing this in favor of #1163, thanks a lot @emrys53 for initiating the work on this!
