SimpleArray between Python and C++

SimpleArray between Python and C++ Yung-Yu Chen (yyc@sciwork.dev) PyCon Taiwan 2023.9.2 1

Speed is the king リニア中央新幹線 (Chuo Shinkansen) https://linear-chuo-shinkansen.jr-central.co.jp/about/design/ 天下武功無堅不破唯快不破 500km/h high speed rail by 2038 2

Array: HPC fundamental ... ... ... regularly allocated bu ff er regularly allocated bu ff er regularly allocated bu ff er regularly accessing algorithm randomly accessing algorithm quoted from SemiAnalysis https://www.semianalysis.com/p/apple-m2-die-shot-and-architecture die shots of popular chips The data structure for cache optimization, data parallelism, SIMD, GPU, all the techniques of high-performance computing (HPC) 3

Dimensionality one-dimensional array two-dimensional array three-dimensional array arrays of higher dimension ... data is always laid out linearly in memory, regardless of dimension 4

Hand-craft array library and don't forget that there will be some 5

Make your own array • Make SimpleArray • Vehicle to carry ad hoc optimization in addition to general array operations buffer management data access void calc_distance( size_t const n , double const * x , double const * y , double * r) { for (size_t i = 0 ; i < n ; ++i) { r[i] = std::sqrt(x[i]*x[i] + y[i]*y[i]); } } vmovupd ymm0, ymmword [rsi + r9*8] vmulpd ymm0, ymm0, ymm0 vmovupd ymm1, ymmword [rdx + r9*8] vmulpd ymm1, ymm1, ymm1 vaddpd ymm0, ymm0, ymm1 vsqrtpd ymm0, ymm0 movupd xmm0, xmmword [rsi + r8*8] mulpd xmm0, xmm0 movupd xmm1, xmmword [rdx + r8*8] mulpd xmm1, xmm1 addpd xmm1, xmm0 sqrtpd xmm0, xmm1 AVX: 256-bit-wide vectorization SSE: 128-bit-wide vectorization C++ code 6 Key components:

Multi-dimensional array • SimpleArray is a class template • Holds a contiguous memory bu ff er • Provides multi-dimensional accessors to its elements • Designed for fundamental types; might be used for POD types (untested) ... ... ... ... ... array m × n 0,0 0,1 1,0 1,1 m,0 m,1 0,n 1,n m,n 0 1 2 3 4 5 6 7 8 ... 1D array 7

SimpleArray is not C++ std::vector SimpleArray std::vector SimpleArray is fi xed size • Only allocate memory on construction std::vector is variable size • Bu ff er may be invalidated • Implicit memory allocation (reallocation) Multi-dimensional access: operator() One-dimensional access: operator[] 8

Features • Header-only C++ code • Simple memory allocation by using fi xed-size • Separate bu ff er ownership and meta data • Meta data for multi-dimensional access: • Type • Shape and stride • Custom for application speci fi c behavior: • Ghost index (in the fi rst dimension) 9

Buffer management • Only constructor allocates memory; only destructor deallocates memory • Allocation and deallocation only happen when necessary • Management is implemented in the class ConcreteBuffer • ConcreteBuffer is a member datum of SimpleArray template <typename T> class SimpleArray { using buffer_type = ConcreteBuffer; std::shared_ptr<buffer_type> m_buffer; }; 10

ConcreteBuffer class ConcreteBuffer : public std::enable_shared_from_this<ConcreteBuffer> { private: // Protect the constructors struct ctor_passkey {}; public: static std::shared_ptr<ConcreteBuffer> construct(size_t nbytes) { return std::make_shared<ConcreteBuffer>(nbytes, ctor_passkey()); } ... }; // Turn off move ConcreteBuffer(ConcreteBuffer &&) = delete; ConcreteBuffer & operator=(ConcreteBuffer &&) = delete; // Use a deleter to customize memory management with unique_ptr using data_deleter_type = detail::ConcreteBufferDataDeleter; using unique_ptr_type = std::unique_ptr<int8_t, data_deleter_type>; size_t m_nbytes; // Remember the amount of bytes unique_ptr_type m_data; // Point to the allocated momery ConcreteBuffer(size_t nbytes, const ctor_passkey &) : m_nbytes(nbytes) , m_data(allocate(nbytes)) {} // Allocate memory static unique_ptr_type allocate(size_t nbytes) { unique_ptr_type ret(nullptr, data_deleter_type()); if (0 != nbytes) { ret = unique_ptr_type(new int8_t[nbytes], data_deleter_type()); } return ret; } 1. Exclusively manage it using std::shared_ptr 1.1. Turn o ff move to prevent the pointer from going away 3. Use a deleter to customize deallocation from outside 2. Allocate memory and keep it in a std::unique_ptr 11

Design a custom management interface using data_deleter_type = detail::ConcreteBufferDataDeleter; using unique_ptr_type = std::unique_ptr<int8_t, data_deleter_type>; struct ConcreteBufferDataDeleter { using remover_type = ConcreteBufferRemover; explicit ConcreteBufferDataDeleter( std::unique_ptr<remover_type> && remover_in) : remover(std::move(remover_in)) {} void operator()(int8_t * p) const { if (!remover) { // Array deletion if no remover is available. delete[] p; } else { (*remover)(p); } } std::unique_ptr<remover_type> remover{nullptr}; }; struct ConcreteBufferRemover { virtual ~ConcreteBufferRemover() = default; virtual void operator()(int8_t * p) const { delete[] p; } }; struct ConcreteBufferNoRemove : public ConcreteBufferRemover { void operator()(int8_t *) const override {} }; unique_ptr deleter is a proxy to ConcreteBuffer remover Remover is a polymorphic functor type Allow change to work with a foreign sub- system for memory management 12

Data access template <typename T> class SimpleArray { public: // Type information. using value_type = T; static constexpr size_t ITEMSIZE = sizeof(value_type); using buffer_type = ConcreteBuffer; /// Contiguous data buffer for the array. std::shared_ptr<buffer_type> m_buffer; using shape_type = small_vector<size_t>; /// Each element in this vector is the number of element in the /// corresponding dimension. shape_type m_shape; /// Each element in this vector is the number of elements (not number of /// bytes) to skip for advancing an index in the corresponding dimension. shape_type m_stride; }; Type information is made available during compile time, so compiler can take the type information to generate fast instructions 13

Shape and stride explicit SimpleArray(size_t length) : m_buffer(buffer_type::construct(length * ITEMSIZE)) , m_shape{length} , m_stride{1} , m_body(m_buffer->data<T>()) {} explicit SimpleArray(shape_type const & shape) : m_shape(shape) , m_stride(calc_stride(m_shape)) { if (!m_shape.empty()) { m_buffer = buffer_type::construct( m_shape[0] * m_stride[0] * ITEMSIZE); m_body = m_buffer->data<T>(); } } static shape_type calc_stride(shape_type const & shape) { shape_type stride(shape.size()); if (!shape.empty()) { stride[shape.size() - 1] = 1; for (size_t it = shape.size() - 1; it > 0; --it) { stride[it - 1] = stride[it] * shape[it]; } } return stride; } One-dimensional shape Multi-dimensional shape Stride is the number of elements to be skipped when the index is changed by 1 in the corresponding dimension: ... ... ... ... ... array m × n 0,0 0,1 1,0 1,1 m,0 m,1 0,n 1,n m,n 14

Multi-dimensional accessor template <typename T> class SimpleArray { ... template <typename... Args> value_type const & operator()(Args... args) const { return *vptr(args...); } template <typename... Args> value_type & operator()(Args... args) { return *vptr(args...); } template <typename... Args> value_type const * vptr(Args... args) const { return m_body + buffer_offset(m_stride, args...); } template <typename... Args> value_type * vptr(Args... args) { return m_body + buffer_offset(m_stride, args...); } ... }; namespace detail { // Use recursion in templates. template <size_t D, typename S> size_t buffer_offset_impl(S const &) { return 0; } template <size_t D, typename S, typename Arg, typename... Args> size_t buffer_offset_impl (S const & strides, Arg arg, Args... args) { return arg * strides[D] + buffer_offset_impl<D + 1>(strides, args...); } } /* end namespace detail */ template <typename S, typename... Args> size_t buffer_offset(S const & strides, Args... args) { return detail::buffer_offset_impl<0>(strides, args...); } Element and pointer accessor by using variadic template Recursive implementation (compile-time) to save conditional branching 15

Ghost (negative) index • Ghost: elements indexed by negative integer • Similar to the negative index for the POD array • No overhead for normal arrays that start with 0 index // C-style POD array. int32_t data[100]; // Make a pointer to the head address of the array. int32_t * pdata = data; // Make another pointer to the 50-th element from the head of the array. int32_t * odata = pdata + 50; odata[-10] 0 1 2 3 4 5 6 7 8 9 40 41 42 43 44 45 46 47 48 49 data[40] data ... ... 50 *(data+40) *(odata-10) pdata odata 16

Get ghost information on construction template <typename T> class SimpleArray { private: // Number of ghost elements size_t m_nghost = 0; // Starting address of non-ghost value_type * m_body = nullptr; }; static T * calc_body(T * data, shape_type const & stride, size_t nghost) { if (nullptr == data || stride.empty() || 0 == nghost) { // Do nothing. } else { shape_type shape(stride.size(), 0); shape[0] = nghost; data += buffer_offset(stride, shape); } return data; } SimpleArray(SimpleArray const & other) : m_buffer(other.m_buffer->clone()) , m_shape(other.m_shape) , m_stride(other.m_stride) , m_nghost(other.m_nghost) , m_body(calc_body(m_buffer->data<T>(), m_stride, other.m_nghost)) {} SimpleArray(SimpleArray && other) noexcept : m_buffer(std::move(other.m_buffer)) , m_shape(std::move(other.m_shape)) , m_stride(std::move(other.m_stride)) , m_nghost(other.m_nghost) , m_body(other.m_body) {} Keep ghost number and body address Assign ghost and body information in constructors Calculate body address 17 No additional overhead during element access

Make your own small vector template <typename T, size_t N = 3> class small_vector { public: T const * data() const { return m_head; } T * data() { return m_head; } private: T * m_head = nullptr; unsigned int m_size = 0; unsigned int m_capacity = N; std::array<T, N> m_data; }; explicit small_vector(size_t size) : m_size(static_cast<unsigned int>(size)) { if (m_size > N) { m_capacity = m_size; // Allocate only when the size is too large. m_head = new T[m_capacity]; } else { m_capacity = N; m_head = m_data.data(); } } small_vector() { m_head = m_data.data(); } ~small_vector() { if (m_head != m_data.data() && m_head != nullptr) { delete[] m_head; m_head = nullptr; } } Spend a small size of memory to reduce dynamic memory allocation Allocate memory only when size is too large No deallocation is needed if no allocation *data val val initial after push-back (empty) (dynamically allocated bu ff er) std::vector *data val val initial and after fi rst couple of push-back (preallocated bu ff er) (our) small_vector 18

Small vector, cont. Copy constructor may also allocate memory Move constructor should not allocate memory small_vector(small_vector const & other) : m_size(other.m_size) { if (other.m_head == other.m_data.data()) { m_capacity = N; m_head = m_data.data(); } else { m_capacity = m_size; m_head = new T[m_capacity]; } std::copy_n(other.m_head, m_size, m_head); } small_vector(small_vector && other) noexcept : m_size(other.m_size) { if (other.m_head == other.m_data.data()) { m_capacity = N; std::copy_n(other.m_data.begin(), m_size, m_data.begin()); m_head = m_data.data(); } else { m_capacity = m_size; m_head = other.m_head; other.m_size = 0; other.m_capacity = N; other.m_head = other.m_data.data(); } } 19 Constructors are the key of making C++ work

Numpy for dense array • The numpy library provides everything we need for arrays in Python • Convention of using numpy: • The created array copies the Python data: >>> import numpy as np >>> lst = [1, 1, 2, 3, 5] >>> print('A list:', lst) A list: [1, 1, 2, 3, 5] >>> array = np.array(lst) >>> print('An array:', np.array(array)) An array: [1 1 2 3 5] Create a list: Create an array: >>> array[2] = 8 >>> print(array) # The array changes. [1 1 8 3 5] >>> print(lst) # The input sequence is not changed. [1, 1, 2, 3, 5] 20

Python is slow u = 0 u = 0 u = 0 u(y) = sin(πy) i, j i, j + 1 i, j − 1 i + 1,j i − 1,j i + 1,j + 1 i + 1,j − 1 i − 1,j + 1 i − 1,j − 1 grid points computing domain (and solution) ∂2 u ∂x2 + ∂2 u ∂y2 = 0 (0 < x < 1; 0 < y < 1) u(0,y) = 0, u(1,y) = sin(πy) (0 ≤ y ≤ 1) u(x,0) = 0, u(x,1) = 0 (0 ≤ x ≤ 1) Equation for domain interior Boundary conditions un+1 (xi, yi) = un (xi+1, yj) + un (xi−1, yj) + un (xi, yj+1) + un (xi, yj−1) 4 Use point-Jacobi method to march Let us show how slow Python is by solving the Laplace equation, a simple PDE 21

Python vs numpy def solve_array(): u = uoriginal.copy() # Input from outer scope un = u.copy() # Create the buffer for the next time step converged = False step = 0 while not converged: step += 1 un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] + u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4 norm = np.abs(un-u).max() u[...] = un[...] converged = True if norm < 1.e-5 else False return u, step, norm def solve_python_loop(): u = uoriginal.copy() # Input from outer scope un = u.copy() # Create the buffer for the next time step converged = False step = 0 # Outer loop. while not converged: step += 1 # Inner loops. One for x and the other for y. for it in range(1, nx-1): for jt in range(1, nx-1): un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4 norm = np.abs(un-u).max() u[...] = un[...] converged = True if norm < 1.e-5 else False return u, step, norm >>> with Timer(): >>> u, step, norm = solve_array() Wall time: 0.0552309 s >>> with Timer(): >>> u, step, norm = solve_python_loop() Wall time: 4.79688 s numpy array for point-Jacobi python loop for point-Jacobi The array version is 87x faster than the loop version 22

Speed up by C++ std::tuple<modmesh::SimpleArray<double>, size_t, double> solve1(modmesh::SimpleArray<double> u) { const size_t nx = u.shape(0); modmesh::SimpleArray<double> un = u; bool converged = false; size_t step = 0; double norm; while (!converged) { norm = 0.0; ++step; for (size_t it=1; it<nx-1; ++it) { for (size_t jt=1; jt<nx-1; ++jt) { un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4; double const v = std::abs(un(it,jt) - u(it,jt)); if (v > norm) { norm = v; } } } // additional loop slows down: norm = calc_norm_amax(u, un); if (norm < 1.e-5) { converged = true; } u = un; } return std::make_tuple(u, step, norm); } // Expose the C++ helper to Python PYBIND11_MODULE(solve_cpp, m) { // Boilerplate for using the SimpleArray with Python { modmesh::python::import_numpy(); modmesh::python::wrap_ConcreteBuffer(m); modmesh::python::wrap_SimpleArray(m); } m.def ( "solve_cpp" , [](pybind11::array_t<double> & uin) { auto ret = solve1(modmesh::python::makeSimpleArray(uin)); return std::make_tuple( modmesh::python::to_ndarray(std::get<0>(ret)), std::get<1>(ret), std::get<2>(ret)); } ); } >>> with Timer(): >>> u, step, norm = solve_cpp.solve_cpp(uoriginal) Wall time: 0.0251369 s c++ loop for point-Jacobi expose to Python 23

Runtime and maintainability 24 for it in range(1, nx-1): for jt in range(1, nx-1): un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4 un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] + u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4 for (size_t it=1; it<nx-1; ++it) { for (size_t jt=1; jt<nx-1; ++jt) { un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4; } } Point-Jacobi method Python nested loop Numpy array C++ nested loop 4.797s (1x) 0.055s (87x) 0.025s (192x) un+1 (xi, yi) = un (xi+1, yj) + un (xi−1, yj) + un (xi, yj+1) + un (xi, yj−1) 4

Conversion between Python containers and C++ • Not really "conversion" or "translation", but data copy val val val val val C++ std::vector<double> ref ref ref ref ref Python list val val val val val python data are randomly allocation Reconstruct from Python to C++ val val val val val C++ std::vector<double> ref ref ref ref ref Python list val val val val val sequential construction makes more regular memory layout Reconstruct from C++ to Python 25

Zero-copy to speed up Python app C++ app a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn share memory bu ff er across language Ndarray C++ container Python app C++ app C++ container Ndarray manage access a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn memory bu ff er shared across language 👍 bottom (C++) - up (Python) design Python app C++ app C++ container Ndarray manage access a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn memory bu ff er shared across language top (Python) - down (C++) design 👎 Python-controlling lifecycle is bad for long-running process C++ allows fi ne-grained control of all resources 26

Buffer interface and numpy ndarray template <typename S> std::enable_if_t<is_simple_array_v<S>, pybind11::array> to_ndarray(S && sarr) { namespace py = pybind11; using T = typename std::remove_reference_t<S>::value_type; std::vector<size_t> const shape(sarr.shape().begin(), sarr.shape().end()); std::vector<size_t> stride(sarr.stride().begin(), sarr.stride().end()); for (size_t & v : stride) { v *= sarr.itemsize(); } return py::array( /* Numpy dtype */ py::detail::npy_format_descriptor<T>::dtype(), /* Buffer dimensions */ shape, /* Strides (in bytes) for each index */ stride, /* Pointer to buffer */ sarr.data(), /* Owning Python object */ py::cast(sarr.buffer().shared_from_this())); } template <typename T> static SimpleArray<T> makeSimpleArray( pybind11::array_t<T> & ndarr) { typename SimpleArray<T>::shape_type shape; for (ssize_t i = 0; i < ndarr.ndim(); ++i) { shape.push_back(ndarr.shape(i)); } std::shared_ptr<ConcreteBuffer> const buffer = ConcreteBuffer::construct( ndarr.nbytes(), ndarr.mutable_data(), std::make_unique<ConcreteBufferNdarrayRemover>( ndarr)); return SimpleArray<T>(shape, buffer); } take SimpleArray bu ff er to make ndarray take ndarray bu ff er to make SimpleArray 27

Conclusion It's important to always keep Python: • Think in arrays for speed • Speed is the king • Don't be afraid of C++ • Always keep Python -- it's your UI 28

SimpleArray between Python and C++

More Related Content

Similar to SimpleArray between Python and C++

More from Yung-Yu Chen

Recently uploaded

SimpleArray between Python and C++