
Commit 896cf9d

Jiong Gong authored and pytorchmergebot committed
[inductor][cpp] vectorization support for int32/int64 (pytorch#119001)
This pull request completes most of the support for vectorizing the int32 and int64 data types, except for indirect indexing and masks. Basic data type support for uint32 and uint64 is also added, but without vectorization. More vectorized conversion functions between integer and float are added. To support int64 vectors, a new VectorizedN class is introduced to handle vectors of arbitrary length. Details:

1. Complete most of the int32 and int64 vectorization support, including load, store, reduction, constant and conversion. Indirect indexing and masks will be addressed in follow-up PRs, after which the legality-checking logic in `CppVecKernelChecker` can be further simplified.
2. Add util functions for conversion between integer and float vectors (in cpp_prefix.h and ATen vec). Ideally, they should be moved from cpp_prefix.h into ATen vec to simplify cpp_prefix.h; this will be addressed in follow-up PRs.
3. Introduce a new template class VectorizedN, designed to handle vectors of arbitrary length by encapsulating multiple `Vectorized<T>` instances. It supports most of the operations of `Vectorized<T>` and makes int64 vectorization simpler. It will also be applied to bf16/fp16/int8 in follow-up PRs for better efficiency: for example, bf16 currently uses only half of the vector lanes; with `VectorizedN`, all lanes can be used by mapping a bf16 vector to `VectorizedN<float, 2>` on conversion.
4. Add basic data type support for uint32 and uint64 (in graph.py). Vectorization support will be added later; it is not a high priority due to fewer usages.

Next steps:
- [ ] Refactor the vector mask handling to support data types other than float. Currently vector masks are implemented with float vectors.
- [ ] Fully utilize vector lanes for bfloat16/float16/int8.
- [ ] Support indirect indexing with vectorized index via scalarization.
- [ ] Clean up `CppVecKernelChecker`.
- [ ] Simplify `cpp_prefix.h`, including refactoring the vector conversion logic.

Pull Request resolved: pytorch#119001
Approved by: https://github.com/peterbell10, https://github.com/jansel
1 parent 8182fce commit 896cf9d

8 files changed: 813 additions, 134 deletions

aten/src/ATen/cpu/vec/vec256/vec256.h

Lines changed: 18 additions & 0 deletions

```diff
@@ -143,6 +143,24 @@ inline convert_to_int_of_same_size<float>(const Vectorized<float> &src) {
   return _mm256_cvttps_epi32(src);
 }
 
+// Only works for inputs in the range: [-2^51, 2^51]
+// From: https://stackoverflow.com/a/41148578
+template<>
+Vectorized<double>
+inline convert_to_fp_of_same_size<double>(const Vectorized<int64_t> &src) {
+  auto x = _mm256_add_epi64(src, _mm256_castpd_si256(_mm256_set1_pd(0x0018000000000000)));
+  return _mm256_sub_pd(
+    _mm256_castsi256_pd(x),
+    _mm256_set1_pd(0x0018000000000000)
+  );
+}
+
+template<>
+Vectorized<float>
+inline convert_to_fp_of_same_size<float>(const Vectorized<int32_t> &src) {
+  return _mm256_cvtepi32_ps(src);
+}
+
 // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ INTERLEAVE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 template <>
```

aten/src/ATen/cpu/vec/vec512/vec512.h

Lines changed: 12 additions & 0 deletions

```diff
@@ -127,6 +127,18 @@ inline convert_to_int_of_same_size<float>(const Vectorized<float> &src) {
   return _mm512_cvttps_epi32(src);
 }
 
+template<>
+Vectorized<double>
+inline convert_to_fp_of_same_size<double>(const Vectorized<int64_t> &src) {
+  return _mm512_cvtepi64_pd(src);
+}
+
+template<>
+Vectorized<float>
+inline convert_to_fp_of_same_size<float>(const Vectorized<int32_t> &src) {
+  return _mm512_cvtepi32_ps(src);
+}
+
 // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ INTERLEAVE ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 template <>
```

aten/src/ATen/cpu/vec/vec_base.h

Lines changed: 19 additions & 0 deletions

```diff
@@ -622,6 +622,12 @@ template <class T> Vectorized<T> inline operator/(const Vectorized<T> &a, const
   return c;
 }
 
+template <class T,
+          typename std::enable_if<!is_floating_point_v<T>, int>::type = 0>
+Vectorized<T> inline operator%(const Vectorized<T> &a, const Vectorized<T> &b) __ubsan_ignore_float_divide_by_zero__ {
+  return a - a / b * b;
+}
+
 template <class T> Vectorized<T> inline operator||(
     const Vectorized<T> &a, const Vectorized<T> &b) {
   Vectorized<T> c;
```
```diff
@@ -989,6 +995,19 @@ inline Vectorized<IntType> convert_to_int_of_same_size(const Vectorized<T>& src)
   return Vectorized<IntType>::loadu(static_cast<const void*>(buffer.data()));
 }
 
+template <typename T, typename IntType = int_same_size_t<T>>
+inline Vectorized<T> convert_to_fp_of_same_size(const Vectorized<IntType>& src) {
+  static_assert(sizeof(T) == sizeof(IntType));
+  static constexpr int size = Vectorized<T>::size();
+
+  std::array<IntType, size> src_arr;
+  src.store(static_cast<void*>(src_arr.data()));
+  std::array<T, size> buffer;
+  std::transform(src_arr.cbegin(), src_arr.cend(), buffer.begin(),
+                 [](const IntType& x) { return static_cast<T>(x); });
+  return Vectorized<T>::loadu(static_cast<const void*>(buffer.data()));
+}
+
 // Example inputs for AVX512:
 // a Vectorized<float> = {a0, b0, a1, b1, a2, b2, a3, b3, a4, b4, a5, b5, a6, b6, a7, b7}
 // b Vectorized<float> = {a8, b8, a9, b9, a10, b10, a11, b11, a12, b12, a13, b13, a14, b14, a15, b15}
```
