qCUDA-ARM: Virtualization for Embedded GPU Architectures Jordan@NTHU 2019/07/25
Outline ● Introduction ● Background ● Design and Implementation ● Experiments ● Conclusion 2
● Introduction ● Background ● Design and Implementation ● Experiments ● Conclusion 3
Emergence of the Internet of Things (IoT) ● The number of IoT devices is rising dramatically. ● Examples: smart homes, Industry 4.0, and autonomous cars. ● A new computing paradigm: edge computing. 4
Edge Computing ● Concept: deploy computational devices near the data source. ● Challenge: diversity of application scenarios ○ The design of edge node architectures is still an open research problem. ○ Possible solutions are discussed next. 5
Research Trends in System Design ● Virtualization ○ Adds a dimension of flexibility that eases edge node architectural design. ○ Customizes specific hardware resources and creates an isolated environment for each application. ● Heterogeneous computing ○ A heterogeneous system contains more than one type of processor, so that different applications can find the proper resources to use, e.g., an embedded GPU. 6
About This Research ● Investigate the fusion of two important trends in IoT, heterogeneity and virtualization, and present qCUDA-ARM, a virtualization solution for embedded GPU architectures. ● Evaluate qCUDA-ARM's performance on an NVIDIA TX2 development board using three benchmarks and two applications, comparing the results with rCUDA and native CUDA. 7
● Introduction ● Background ● Design and Implementation ● Experiments ● Conclusion 8
Virtualization ● Software-simulated platform/hardware. ● Multiple virtual machines on a single physical machine. ● Benefits: ○ Improved hardware utilization ○ Isolation ○ Fault tolerance 9 http://techxposer.com/2018/01/30/understanding-virtualization-hxypervisors/
GPGPU 10 ● GPGPU ○ General-purpose computation on GPUs beyond computer graphics. ○ Acceleration through massively parallel execution; applications include scientific computing, digital currency, and machine learning. ● Requirements: ○ Needs CPU assistance. ○ A way to program the GPU.
GPGPU Working Model 11 ● The GPU works together with the CPU. ● 4 steps: copy input data to GPU memory, launch the kernel, execute on the GPU, and copy the results back to CPU memory.
GPGPU Programming Model 12 ● Numerous solutions, including CUDA, OpenCL, OpenVIDIA, and MATLAB. ● CUDA is one of the most famous solutions ○ Developed by NVIDIA. ○ Provides high-level language syntax for software developers to access the computation elements of the GPU (see the sketch below).
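As a concrete illustration of the working and programming model on the two slides above, here is a minimal, self-contained CUDA sketch of the classic 4-step flow (a generic example, not code from qCUDA):

```cuda
// Generic CUDA example of the 4-step GPGPU working model.
#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;                        // one element per thread
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaMalloc(&d, bytes);                               // allocate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // step 1: copy input H2D
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);         // step 2: launch kernel
    cudaDeviceSynchronize();                             // step 3: GPU executes
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     // step 4: copy result D2H

    cudaFree(d); free(h);
    return 0;
}
```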
GPU Virtualization 13 ● Provides GPU functionality to virtual machines. ● GPU virtualization methods ○ Hardware-assisted virtualization ■ Direct pass-through ■ Mediated pass-through ○ Software virtualization ■ Full virtualization ■ Para-virtualization ■ API-Remoting
Direct Pass-through ● The guest OS directly accesses the GPU through hardware extension features provided by either the motherboard chipset or the GPU manufacturer. ● Limitation ○ A single GPU cannot be shared among multiple VMs. ● Ex: ○ Intel VT-d [Abramson et al. 2006] ○ AMD-Vi [Van Doorn 2006] 14 https://blogs.vmware.com/apps/2018/05/machine-learning-using-virtualized-gpus-on-vmware-vsphere.html
Mediated Passthrough (1/2) ● Multiple virtual GPUs (vGPUs). ● Each VM owns a full-featured vGPU. ● VMs can directly access performance-critical resources, without hypervisor intervention in most cases. ● Privileged operations from the guest are trapped, providing secure isolation. ● Ex: ○ gVirt [Tian et al. 2014] ○ KVMGT [Song et al. 2014] ○ gScale [Xue et al. 2016] 15 https://projectacrn.github.io/latest/developer-guides/hld/hld-APL_GVT-g.html
Full-Virtualization ● Software simulation. ● The native driver can run in the guest without any modification of GPU libraries or drivers. ● Performance overhead, since software traps and emulates the hardware interface of the physical GPU. ● Ex: ○ GPUvm [Suzuki et al. 2014] ○ G-KVM [Hong et al. 2016] 16
Para-Virtualization ● A slightly modified custom driver in the guest delivers sensitive operations directly to the host driver, avoiding hypervisor intervention and hence improving performance. ● Ex: ○ LoGV [Gottschlag et al. 2013] 17
API-Remoting (1/3) 18 ● Wraps the GPU APIs as a "front-end" in the guest. ● Mediates all GPU accesses through a "back-end" in the host, connected via a transport layer. ● Challenge: minimizing the communication overhead between the front-end and back-end. ● Ex: ○ GVirtuS [Giunta et al. 2010] ○ vCUDA [Shi et al. 2009] ○ rCUDA [Duato et al. 2010b] ○ qCUDA [Yu-Shiang et al. 2017]
API-Remoting (2/3) ● rCUDA [Duato et al. 2010b] ○ Offloads CUDA computation to remote GPUs. ○ Targets HPC scenarios. ○ Client/server architecture. ○ Designed for InfiniBand, over which it achieves near-native performance. ○ Otherwise communicates over TCP/IP, which can be the performance bottleneck. 19 https://www.researchgate.net/figure/Overview-of-the-general-architecture-of-the-rCUDA-virtualization-solution_fig1_267514566
API-Remoting (3/3) ● qCUDA ○ Based on KVM and virtio. ○ Avoids memory copies between the guest and host machine. ○ The work that qCUDA-ARM is based on. 20
● Introduction ● Background ● Design and Implementation ○ qCUDA System Architecture ○ Memory Management of qCUDA ○ Memory Management of qCUDA-ARM ○ Memory Bandwidth Improvement ● Experiments ● Conclusion 21
● Introduction ● Background ● Design and Implementation ○ qCUDA System Architecture ○ Memory Management of qCUDA ○ Memory Management of qCUDA-ARM ○ Memory Bandwidth Improvement ● Experiments ● Conclusion 22
qCUDA Overview 23 (Figure: a CUDA program in the guest, running on QEMU-KVM on the host, above the GPU driver and NVIDIA GPU)
qCUDA Architecture Overview 24 (Figure: qcu-library and qcu-driver inside the guest, qcu-device in QEMU-KVM, above the host GPU driver and NVIDIA GPU)
qCUDA Components 25 ● qcu-library ○ The interposer library in the guest system (see the sketch below). ○ Provides the native CUDA API plus wrapper functions for dynamic memory allocation. ● qcu-driver ○ Memory management. ○ Analyzes and forwards CUDA commands from qcu-library to qcu-device. ● qcu-device ○ GPU binary registration. ○ GPU context management. ○ Address space translation.
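To make the qcu-library role concrete, below is a hedged sketch of the interposer idea: a library that exports a CUDA runtime symbol and forwards the request to qcu-driver through its device node. QCU_IOC_MALLOC, qcu_fd, and the argument struct are hypothetical names for illustration, not the actual qCUDA symbols:

```c
// Sketch of an interposer-style wrapper; names below are illustrative.
#include <cuda_runtime.h>
#include <sys/ioctl.h>

struct qcu_malloc_args { void *dev_ptr; size_t size; };  // hypothetical layout

extern int qcu_fd;                    // qcu-driver device node, opened elsewhere
#define QCU_IOC_MALLOC 0x1001         // hypothetical ioctl command

cudaError_t cudaMalloc(void **devPtr, size_t size) {
    struct qcu_malloc_args args = { NULL, size };
    if (ioctl(qcu_fd, QCU_IOC_MALLOC, &args) < 0)   // hand the request to qcu-driver
        return cudaErrorMemoryAllocation;
    *devPtr = args.dev_ptr;           // device pointer produced by qcu-device
    return cudaSuccess;
}
```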
● Introduction ● Background ● Design and Implementation ○ qCUDA System Architecture ○ Memory Management of qCUDA ○ Memory Management of qCUDA-ARM ○ Memory Bandwidth Improvement ● Experiments ● Conclusion 26
Data Movement in GPU Virtualization 27 ● In the GPU working model, data must be transferred between CPU and GPU memory. ● A guest CUDA program needs to transfer its data to the host GPU. ● In GPU virtualization, this data passes through the hypervisor and the host. ● The data path is critical to virtualization performance. ● Prior works copy data from guest to host; the extra data copy increases latency.
Data Movement in qCUDA 28 ● Eliminate the extra data copy between guest and host (zero-copy). ○ Guest and host are on the same machine. ○ Memory allocated by the guest should be visible to the host. ● Address space conversion does the job.
Memory Management in qCUDA 29 Two goals: zero copy and pinned host memory
Memory Management in qCUDA 30 1. The guest CUDA program allocates/frees memory by calling the hooked allocation/free functions in qcu-library. 2. qcu-library changes the behavior of memory allocation/release: 3. memory is allocated/released in 4 MB chunks. 4. Address space conversion (GVA→GPA). 5. Bookkeeping information is recorded/removed. 6. qcu-driver communicates with qcu-device. 7. Address space conversion (GPA→HVA; see the sketch below).
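Step 4 happens in qcu-driver on the guest kernel side (e.g., by pinning the user pages and reading their physical addresses), while step 7 happens in qcu-device. Since QEMU maps guest RAM as a small number of contiguous host mappings, the GPA→HVA conversion reduces to base arithmetic. A minimal sketch, with illustrative (non-QEMU) type and field names:

```c
// Hedged sketch of step 7: locate the host mapping backing a guest-physical
// address. The struct is illustrative, not QEMU's actual RAMBlock API.
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t gpa_base;    // guest-physical base of this RAM block
    uint64_t size;        // length of the block
    uint8_t *host_base;   // where QEMU mapped it in the host address space
} ram_block_t;

static void *gpa_to_hva(const ram_block_t *rb, uint64_t gpa) {
    if (gpa < rb->gpa_base || gpa >= rb->gpa_base + rb->size)
        return NULL;                          // GPA not backed by this block
    return rb->host_base + (gpa - rb->gpa_base);
}
```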
● Introduction ● Background ● Design and Implementation ○ qCUDA System Architecture ○ Memory Management of qCUDA ○ Memory Management of qCUDA-ARM ○ Memory Bandwidth Improvement ● Experiments ● Conclusion 31
Pinned Host Memory 32 ● In CUDA there are 2 kinds of host memory: pageable and page-locked. ○ Pageable: can be paged out by the OS. ○ Page-locked: will not be paged out. ● A page-locked memory region can be copied directly by DMA, which accelerates data copies. ● However, host memory is pageable by default. ● Pageable memory must first be copied to a temporarily allocated pinned region before being copied to the GPU.
Allocate Pinned Host Memory 33 ● Pinned-memory-related functions (usage sketch below): ○ cudaHostRegister(): page-locks a pre-allocated memory region. ○ cudaHostAlloc(): allocates a page-locked memory region.
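For reference, a short usage sketch of the two standard CUDA runtime calls named above (error handling omitted):

```c
// Standard CUDA runtime pinned-memory calls; error handling omitted.
#include <cuda_runtime.h>
#include <stdlib.h>

void pinned_examples(void) {
    size_t size = 4 << 20;                    // 4 MB, matching qCUDA's chunk size

    void *buf = malloc(size);                 // pageable by default
    cudaHostRegister(buf, size, cudaHostRegisterDefault);  // page-lock it

    void *pinned = NULL;
    cudaHostAlloc(&pinned, size, cudaHostAllocDefault);    // born page-locked

    // ... both regions can now be copied by DMA via cudaMemcpy() ...

    cudaHostUnregister(buf);
    free(buf);
    cudaFreeHost(pinned);
}
```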
Allocate Pinned Host Memory in qCUDA 34 ● After step 7, the guest's CUDA-allocated memory can be accessed from the host. ● cudaHostRegister() is called on this region. ● Since the guest and host views map to the same physical frames, ● the guest's CUDA-allocated memory becomes page-locked as well.
Malfunction of cudaHostRegister() on the ARM Architecture 35 ● cudaHostRegister() is not supported on the ARM architecture. ● On ARM, pinned host memory can only be allocated with cudaHostAlloc(). ● However, qCUDA's original cudaHostAlloc() implementation also stops working, since it is based on cudaHostRegister(). ● qCUDA-ARM therefore implements it in a different way.
qCUDA cudaHostAlloc() Internal 36
qCUDA-ARM Host Pinned Allocation Mechanism 37
qCUDA-ARM Host Pinned Allocation Mechanism 38 ● Each memory allocation creates a unique device node under /dev. ● The guest memory region in the host is mapped to one of these device nodes; let's call it GMR_fd.
qCUDA-ARM Host Pinned Allocation Mechanism 39 ● Allocate another same-sized page-locked memory region and map it to GMR_fd (sketch below). ● The content is then identical in both regions and can be transparently transferred from the guest to the host GPU, or vice versa.
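A conceptual user-level sketch of this allocation path. The device-node path is hypothetical, and the actual binding of GMR_fd to the page-locked region is performed inside qcu-device, not by this code:

```c
// Conceptual sketch only; "/dev/qcu_gmr0" is an illustrative node name.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>

void *pin_guest_region(size_t size) {
    int gmr_fd = open("/dev/qcu_gmr0", O_RDWR);  // node for one guest memory region

    void *pinned = NULL;
    cudaHostAlloc(&pinned, size, cudaHostAllocDefault);  // same-sized pinned region

    // The kernel-side handler behind GMR_fd is assumed to back this mapping
    // with the pinned pages, so the guest region and the page-locked region
    // end up referring to the same physical frames.
    void *view = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, gmr_fd, 0);
    (void)view;
    return pinned;
}
```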
● Introduction ● Background ● Design and Implementation ○ qCUDA System Architecture ○ Memory Management of qCUDA ○ Memory Management of qCUDA-ARM ○ Memory Bandwidth Improvement ● Experiments ● Conclusion 41
Pinned Memory Bandwidth Improvement 42 ● Initial experiments showed low memory bandwidth when copying data with cudaMemcpy() / cudaMemcpyAsync(). ● Only about 1/10 of native bandwidth. ● Goal: improve it.
cudaMemcpy() / cudaMemcpyAsync() 43 ● cudaMemcpy(dst, src, count, kind) ○ dst - destination memory address ○ src - source memory address ○ count - size in bytes to copy ○ kind - direction of the copy (e.g., cudaMemcpyHostToDevice)
qCUDA Pinned Memory Copy Mechanism 44
qCUDA-ARM Pinned Memory Copy Mechanism 45
qCUDA-ARM Pinned Memory Copy Mechanism 46 ● The offset within the region is (src − hva_start_addr). ● GMR_ptr can then be obtained as cudaHostAllocPtr + (src − hva_start_addr). ● Pinned host memory is copied with only one call of cudaMemcpy() (sketch below).
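Putting the two formulas together, a hedged sketch of the optimized host-to-device copy path (variable names follow the slide; the real qcu-device code differs):

```c
// Locate the page-locked alias of the guest buffer, then issue one copy.
#include <cuda_runtime.h>
#include <stdint.h>

cudaError_t pinned_h2d(void *dst, const void *src, size_t count,
                       void *cudaHostAllocPtr, void *hva_start_addr) {
    size_t off = (uintptr_t)src - (uintptr_t)hva_start_addr;   // offset in region
    void *GMR_ptr = (char *)cudaHostAllocPtr + off;            // pinned alias of src
    return cudaMemcpy(dst, GMR_ptr, count, cudaMemcpyHostToDevice);  // one call
}
```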
Enable KVM on TX2 47 ● KVM is a fundamental building block of qCUDA, but it is not supported out of the box on the TX2 development board. ● To enable KVM on the ARM architecture, the CPU must be booted into the CPU mode for running hypervisors, called HYP mode.
Enable KVM on TX2 48 ● The device tree of NVIDIA's released Linux kernel is missing the necessary register settings for the Virtual Generic Interrupt Controller (VGIC). ● Rebuilding the Linux kernel with this information provided solves the problem.
● Introduction ● Background ● Design and Implementation ● Experiments ● Conclusion 49
Experiment Environment (1/2) 50 ● NVIDIA TX2 development board ○ 4 ARM Cortex-A57 cores, 2 Denver 64-bit cores, 8 GB DDR4 RAM ○ NVIDIA Pascal architecture GPU with compute capability 6.2 ● Host software stack ○ Ubuntu 16.04 LTS with patched Linux kernel 4.4.38 ○ CUDA toolkit 8.0 ● Guest machine ○ 4 cores, 4 GB RAM, 16 GB disk image ○ Ubuntu 14.04.5 LTS ○ CUDA toolkit 8.0
Experiment Environment (2/2) 51 ● Although rCUDA is not designed for embedded GPUs, it is the only GPU virtualization package that works on embedded systems. ● rCUDA only supports an ARM front-end and needs an x86 machine as the back-end. ● DELL XPS 9550 laptop ○ Intel Core i7-6700HQ, 8 GB DDR4 RAM, 256 GB SSD ○ NVIDIA GeForce GTX 960M GPU with compute capability 5.0 ● Host software stack ○ Ubuntu 16.04 LTS with Linux kernel 4.15.0-50-generic ○ CUDA toolkit 8.0
Benchmarks 52 ● Three benchmarks and two applications are chosen. ● Goal: compare qCUDA-ARM's performance with rCUDA and native CUDA. ● Since rCUDA computes wrong results or fails outright on all benchmarks except bandwidthTest, on the others we can only compare against native CUDA.
Memory Bandwidth 53 ● Measures the transfer bandwidth between CPU memory and GPU memory. ● Uses the bandwidthTest benchmark. ● Transfer directions ○ Host to device (H2D) ○ Device to host (D2H) ● Data sizes range from 1 KB to 1 GB.
Pageable Memory Bandwidth 54
Pinned Memory Bandwidth (original) 55
Pinned Memory Bandwidth (Optimized) 56
Pinned Memory Bandwidth Comparison (Original vs. Optimized) 57
Pinned Memory Bandwidth Compare with rCUDA and Native 58
Matrix Multiplication 59 ● Measures qCUDA-ARM's performance on tasks with heavy computation and small data transfers. ● 3 steps: ○ Copy two matrices A, B from guest memory to GPU memory. ○ Calculate matrix C (C = A×B) on the GPU (kernel sketch below). ○ Copy matrix C from GPU memory back to guest memory. ● Matrix dimensions range from 32 to 4096.
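For reference, a naive matrix-multiplication kernel of the kind step 2 launches (illustrative; the benchmark's actual kernel may differ):

```cuda
// Naive n x n matrix multiplication: one thread per output element.
__global__ void matMul(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[row * n + k] * B[k * n + col];   // dot product of row and column
        C[row * n + col] = sum;                       // C = A x B
    }
}
```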
Matrix Multiplication Performance 60
Matrix Multiplication Performance Profile 61
Matrix Multiplication CUDA APIs Time 62 Time of cudaEventSynchronize() increases as matrix dimension increases.
Vector Addition 63 ● Compared to matrixMul, this is an I/O-bound benchmark. ● Measures qCUDA-ARM's performance on tasks with small computation and large data transfers. ● Contains 3 steps: ○ Copy two vectors A, B from guest memory to GPU memory. ○ Calculate vector C (C = A + B) on the GPU (kernel sketch below). ○ Copy vector C from GPU memory back to guest memory. ● Vector lengths range from 100M to 25600M.
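The corresponding kernel for step 2 is one line of work per thread (illustrative sketch; the benchmark's actual kernel may differ):

```cuda
// Element-wise vector addition: one thread per element.
__global__ void vecAdd(const float *A, const float *B, float *C, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}
```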
Vector Addition Performance 64
Vector Addition Performance Profile 65
Vector Addition CUDA APIs Time 66 The time of both H2D and D2H copies gradually dominates the total execution time.
Edge Detection 67 ● This benchmark is a GPGPU implementation of the Sobel operator. ● The Sobel edge detector uses a pair of 3 × 3 convolution masks, one estimating the gradient in the x-direction and the other in the y-direction (sketch below). ● 3 input images: ○ 800 × 532 (1.3 MB) ○ 1920 × 1080 (6 MB) ○ 4928 × 3624 (47 MB) ● Contains 3 steps: ○ Copy the input image from guest memory to GPU memory. ○ Run edge detection on the GPU. ○ Copy the processed image from GPU memory back to guest memory.
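The two 3 × 3 masks and the per-pixel gradient combination look like this (a standard Sobel formulation; the benchmark's exact kernel may differ):

```cuda
// Standard Sobel masks and per-pixel gradient magnitude.
__constant__ int Gx[3][3] = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };  // x gradient
__constant__ int Gy[3][3] = { { -1, -2, -1 }, { 0, 0, 0 }, { 1, 2, 1 } };  // y gradient

__device__ unsigned char sobel_pixel(const unsigned char *img, int w, int x, int y) {
    int sx = 0, sy = 0;
    for (int j = -1; j <= 1; ++j)
        for (int i = -1; i <= 1; ++i) {
            int p = img[(y + j) * w + (x + i)];   // caller guarantees interior pixel
            sx += Gx[j + 1][i + 1] * p;
            sy += Gy[j + 1][i + 1] * p;
        }
    int mag = abs(sx) + abs(sy);                  // |G| ~ |Gx| + |Gy| approximation
    return (unsigned char)(mag > 255 ? 255 : mag);
}
```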
Edge Detection Performance 68
Edge Detection Performance Profile 69
Edge Detection CUDA APIs Time 70 The time of both H2D and D2H copies gradually dominates the total execution time.
Edge Detection CUDA APIs Efficiency 71
GPGPU Accelerated Cryptocurrency Miner 72 ● Measures the performance of qCUDA-ARM on a task with large computation and small data transmission. ● This benchmark simulates the cryptocurrency mining process, which takes: ○ an input string, whose hash is the mining target; ○ an ACCEPTED_ZEROS parameter to control the mining difficulty. ● Contains 4 steps: ○ The CPU reads the input string and preprocesses it. ○ GPU memory is allocated and the input string is copied into it. ○ The GPU launches multiple threads to mine the hash (sketch below). ○ The result is copied from GPU memory back to host memory. ● ACCEPTED_ZEROS ranges from 25 to 35 in steps of 2.
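A hedged sketch of the mining kernel in step 3. sha256_device() is a hypothetical device-side hash routine and the nonce layout is assumed, since the slides do not show the benchmark's hashing code:

```cuda
// Each thread tries one nonce and reports it if the hash has enough
// leading zero bits. Assumes len <= 56 so the nonce fits in buf.
#include <stdint.h>

__device__ void sha256_device(const uint8_t *msg, size_t len, uint8_t out[32]);  // hypothetical

__device__ int leading_zero_bits(const uint8_t h[32]) {
    int bits = 0;
    for (int i = 0; i < 32; ++i) {
        if (h[i] == 0) { bits += 8; continue; }
        bits += __clz((int)h[i]) - 24;   // zero bits in the first nonzero byte
        break;
    }
    return bits;
}

__global__ void mine(const uint8_t *msg, size_t len, int accepted_zeros,
                     unsigned long long *found_nonce) {
    unsigned long long nonce =
        blockIdx.x * (unsigned long long)blockDim.x + threadIdx.x;
    uint8_t buf[64], hash[32];
    for (size_t i = 0; i < len; ++i) buf[i] = msg[i];      // copy input string
    for (int b = 0; b < 8; ++b)                            // append nonce bytes
        buf[len + b] = (uint8_t)(nonce >> (8 * b));
    sha256_device(buf, len + 8, hash);
    if (leading_zero_bits(hash) >= accepted_zeros)
        atomicExch(found_nonce, nonce);                    // report satisfying nonce
}
```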
GPGPU Accelerated Cryptocurrency Miner Performance 73
GPGPU Accelerated Cryptocurrency Miner Performance Profile 74
GPGPU Accelerated Cryptocurrency Miner CUDA APIs Time 75 The D2H copy time gradually dominates the total execution time.
Pinned Memory Bandwidth Performance of Multiple VMs 76
● Introduction ● Background ● Design and Implementation ● Experiments ● Conclusion 77
Conclusion 78 ● In this thesis, we present qCUDA-ARM, a GPGPU virtualization solution for embedded GPU architectures. ● Due to architectural differences, qCUDA-ARM's memory management system was redesigned to fit the platform. ● The pinned memory copy bandwidth was improved 9x. ● Experiments on the NVIDIA TX2 board achieve up to 90% of native bandwidth.
Future Work 79 ● Implement more CUDA API functions in order to support more real-world applications. ● Fine-tune the pinned memory allocation mechanism to improve memory utilization.
Thank you for your time and attention. 80
Appendix 81
Comparison 82

             Full Virtualization | Para Virtualization | Direct passthrough | Mediated passthrough | API Remoting
Performance  Low                 | High                | High               | High                 | Depends
GPU sharing  Yes                 | Yes                 | No                 | Yes                  | Yes
Flexibility  High                | Low                 | High               | High                 | Low
qcu-device (2/2) 83 ● Some CUDA runtime APIs cannot be executed directly on the host, because of the implicit handling of the CUDA context and kernel launches. ● qcu-device implements these runtime APIs with the low-level driver API for context and kernel-function management (sketch below).
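A sketch of this runtime-to-driver lowering, using real CUDA driver-API calls (the module and kernel names are placeholders):

```c
// Explicit context creation and kernel launch via the CUDA driver API,
// i.e., the work the runtime API performs implicitly.
#include <cuda.h>

void launch_via_driver_api(void) {
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);            // context the runtime would create implicitly
    CUmodule mod;   cuModuleLoad(&mod, "kernels.cubin");  // registered GPU binary (placeholder file)
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "vecAdd");

    int n = 1024;
    CUdeviceptr dA, dB, dC;
    cuMemAlloc(&dA, n * sizeof(float));
    cuMemAlloc(&dB, n * sizeof(float));
    cuMemAlloc(&dC, n * sizeof(float));

    void *args[] = { &dA, &dB, &dC, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,   // grid
                   256, 1, 1,                   // block
                   0, NULL, args, NULL);        // smem, stream, params, extra
    cuCtxSynchronize();
    cuCtxDestroy(ctx);
}
```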
Full-Virtualization ● Software simulation. ● The native driver can run in the guest without any modification of GPU libraries or drivers. ● Performance overhead, since software traps and emulates the hardware interface of the physical GPU. ● Ex: ○ GPUvm [Suzuki et al. 2014] ○ G-KVM [Hong et al. 2016] 84
API-Remoting (2/4) ● GVirtuS [Giunta et al. 2010] ○ Virtualizes CUDA. ○ Supports Xen, KVM, and VMware. ○ Tries to minimize communication overhead by leveraging high-performance communication channels provided by the hypervisors. 85 ● vCUDA [Shi et al. 2009] ○ Virtualizes CUDA. ○ Supports Xen and KVM. ○ One vGPU per application. ○ Avoids frequent context switching between the guest OS and the hypervisor, improving communication performance. ○ Batches CUDA calls with Lazy RPC, which batches specific CUDA calls that can be delayed.
Observation 86 ● The guest's CUDA-allocated memory can be seen from the host after step 3. ● Another way is needed to keep these regions page-locked.
