How to use Apache TVM to optimize your ML models: Faster inference in the cloud and at the edge. Sameer Farooqui, Product Marketing Manager, OctoML
2 Faster Artificial Intelligence Everywhere
3 Optimizing Deep Learning Compiler
siliconANGLE ● “...cross-platform model compilers [...] are harbingers of the new age in which it won’t matter what front-end tool you used to build your AI algorithms and what back-end clouds, platforms or chipsets are used to execute them.” ● “Cross-platform AI compilers will become standard components of every AI development environment, enabling developers to access every deep learning framework and target platform without having to know the technical particulars of each environment.” ● “...within the next two to three years, the AI industry will converge around one open-source cross-compilation supported by all front-end and back-end environments” 4 Read the article April 2018 Quotes from article:
Venture Beat “With PyTorch and TensorFlow, you’ve seen the frameworks sort of converge. The reason quantization comes up, and a bunch of other lower-level efficiencies come up, is because the next war is compilers for the frameworks — XLA, TVM, PyTorch has Glow, a lot of innovation is waiting to happen,” he said. “For the next few years, you’re going to see … how to quantize smarter, how to fuse better, how to use GPUs more efficiently, [and] how to automatically compile for new hardware.” 5 Read the article Quote from Soumith Chintala: (co-creator of PyTorch and distinguished engineer at Facebook AI) Jan 2020
This Talk 6 ● What is an ML Compiler? ● How TVM works ● TVM use cases ● OctoML Product Demo
7 Classical Compiler: Source code → Frontend → Optimizer → Backend → Machine code
8 Classical Compiler: C / Fortran / Ada source code → C Frontend / Fortran Frontend / Ada Frontend → Common Optimizer → X86 Backend / PowerPC Backend / Arm Backend → x86 / PowerPC / Arm machine code. Source: The Architecture of Open Source Applications
9 Deep Learning Compiler: Neural networks from PyTorch / TensorFlow / ONNX → Optimizing Compiler → CPU optimized runtime (CPUs), GPU optimized runtime (GPUs), Accelerator optimized runtime (Accelerators)
10 Deep Learning Compiler: Neural networks from PyTorch / TensorFlow / ONNX → CPU optimized runtime (CPUs), GPU optimized runtime (GPUs), Accelerator optimized runtime (Accelerators)
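To make this concrete, here is a minimal sketch of the front half of that pipeline, assuming an ONNX model file named resnet50.onnx with a single input called "data" (both are placeholder assumptions); the same flow applies to PyTorch or TensorFlow models via their respective relay.frontend importers.

    import onnx
    import tvm
    from tvm import relay

    # Import the framework model into TVM's high-level IR (Relay).
    onnx_model = onnx.load("resnet50.onnx")                  # placeholder model file
    shape_dict = {"data": (1, 3, 224, 224)}                  # input name/shape are model-specific
    mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

    # Pick a backend target: "llvm" for CPU, "cuda" for NVIDIA GPUs, etc.
    target = "llvm"
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)

    # Export the compiled artifact for the target-optimized TVM runtime to load.
    lib.export_library("compiled_model.so")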
TVM: 11 An Automated End-to-End Optimizing Compiler for Deep Learning ● “There is an increasing need to bring machine learning to a wide diversity of hardware devices” ● TVM is “a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back- ends” ● “Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand- tuned libraries for low-power CPU, mobile GPU, and server-class GPUs” Read the paper Feb 2018
Relay: 12 A High-level Compiler for Deep Learning ● Relay is “a high-level IR that enables end-to- end optimization of deep learning models for a variety of devices” ● “Relay's functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of- the-art models” ● “With its extensible design and expressive language, Relay serves as a foundation for future work in applying compiler techniques to the domain of deep learning systems” Read the paper April 2019
Ansor: 13 Generating High-Performance Tensor Programs for Deep Learning ● “...obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging” ● Ansor is “a tensor program generation framework for deep learning applications” ● “Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches” ● “We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to 3.8×, 2.6×, and 1.7×, respectively” Read the paper Nov 2020
14 Thank you Apache TVM contributors! 500+!
Who is using TVM? 15 Every Alexa wake-up today across all devices uses a TVM-optimized model. “At Facebook, we've been contributing to TVM for the past year and a half or so, and it's been a really awesome experience. We're really excited about the performance of TVM.” - Andrew Tulloch, AI Researcher. Bing query understanding: 3x faster on CPU. QnA bot: 2.6x faster on CPU, 1.8x faster on GPU.
Who attended TVM Conf 2020? 16 950+ attendees
17 Deep Learning Systems Landscape (open source): Orchestrators, Frameworks, Vendor Libraries (NVIDIA cuDNN, Intel oneDNN, Arm Compute Library), Hardware (CPUs, GPUs, Accelerators)
18 How does TVM work? Graph Level Optimizations: rewrites dataflow graphs (nodes and edges) to simplify the graph and reduce device peak memory usage. Operator Level Optimizations: hardware target-specific low-level optimizations for individual operators/nodes in the graph. Efficient Runtime: TVM-optimized models run in the lightweight TVM Runtime System, providing a minimal API for loading and executing the model from Python, C++, Rust, Go, Java or JavaScript.
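To illustrate the runtime piece, the sketch below loads the module exported in the earlier import example (compiled_model.so, with one input named "data", both assumptions) and runs a single inference through the TVM graph executor.

    import numpy as np
    import tvm
    from tvm.contrib import graph_executor

    dev = tvm.cpu(0)                                   # or tvm.cuda(0) for a GPU build
    lib = tvm.runtime.load_module("compiled_model.so")
    module = graph_executor.GraphModule(lib["default"](dev))

    # Set the input, execute the model, and read back the result.
    module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
    module.run()
    out = module.get_output(0).numpy()
    print(out.shape)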
Deep Learning Operators 19 ● Deep Neural Networks look like Directed Acyclic Graphs (DAGs) ● Operators are the building blocks (nodes) of neural network models ● Network edges represent data flowing between operators Convolution Broadcast Add Matrix Multiplication Pooling Batch Normalization ArgMin/ArgMax Dropout DynamicQuantizeLinear Gemm LSTM LeakyRelu Softmax OneHotEncoder RNN Sigmoid
20 TVM Internals: (1) PyTorch / TensorFlow / ONNX → (2) Relay → (3) TE + Computation → (4) AutoTVM / Auto-scheduler → (5) TE + Schedule → (6) TIR → (7) Hardware Specific Compiler
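As a small sketch of the TE, schedule, and TIR stages above, the snippet below declares a matrix multiplication in the tensor expression language, attaches a default schedule, and lowers it to TIR; the sizes are arbitrary, and in practice AutoTVM or the auto-scheduler would search for a much better schedule than the default one.

    import tvm
    from tvm import te

    n = 1024
    A = te.placeholder((n, n), name="A")
    B = te.placeholder((n, n), name="B")
    k = te.reduce_axis((0, n), name="k")

    # "TE + Computation": a declarative description of what to compute.
    C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    # "TE + Schedule": how to compute it (tiling, reordering, vectorization, ...).
    s = te.create_schedule(C.op)

    # Lowering produces the TIR that a hardware-specific compiler consumes.
    print(tvm.lower(s, [A, B, C], simple_mode=True))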
21 Relay ● Relay has a functional, statically typed intermediate representation (IR)
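For illustration, this sketch hand-builds a tiny Relay module with the Python API, a conv2d followed by bias add and ReLU, echoing the fused blocks on the next slides; all shapes here are arbitrary.

    import tvm
    from tvm import relay

    data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
    weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
    bias = relay.var("bias", shape=(16,), dtype="float32")

    conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), channels=16, padding=(1, 1))
    out = relay.nn.relu(relay.nn.bias_add(conv, bias))

    func = relay.Function([data, weight, bias], out)
    mod = tvm.IRModule.from_expr(func)
    print(mod)   # prints the Relay IR for inspection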
22 Auto-scheduler (a.k.a. Ansor) ● Auto-scheduler (2nd gen) replaces AutoTVM ● Auto-scheduler/Ansor aims to be a fully automated scheduler for generating high-performance code for tensor computations, without manual templates ● Auto-scheduler can achieve better performance with faster search time in a more automated way because of innovations in search space construction and the search algorithm ● Goal: automatically turn tensor operations (like matmul or conv2d) into efficient code implementations ● AutoTVM (1st gen): template-based search algorithm to find efficient implementations for tensor operations ○ required domain experts to write a manual template for every operator on every platform, more than 15k lines of code in TVM Collaborators:
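A hedged sketch of driving the auto-scheduler over a whole Relay model, following the pattern from the TVM auto-scheduler tutorials; mod and params are assumed to come from an earlier import step, and the trial budget and log file name are placeholders.

    import tvm
    from tvm import auto_scheduler, relay

    target = tvm.target.Target("llvm")

    # Extract tunable tensor programs (tasks) from the model.
    tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

    # Search for high-performance schedules; results are logged to a file.
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_options = auto_scheduler.TuningOptions(
        num_measure_trials=2000,                                  # placeholder budget
        measure_callbacks=[auto_scheduler.RecordToFile("tuning.json")],
    )
    tuner.tune(tune_options)

    # Re-compile the model using the best schedules found during the search.
    with auto_scheduler.ApplyHistoryBest("tuning.json"):
        with tvm.transform.PassContext(
            opt_level=3, config={"relay.backend.use_auto_scheduler": True}
        ):
            lib = relay.build(mod, target=target, params=params)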
23 AutoTVM vs Auto-scheduler Source: Apache TVM Blog: Introducing Auto-scheduler
24 Auto-scheduler's Search Process Source: Apache TVM Blog: Introducing Auto-scheduler
25 Benchmarks: AutoTVM vs Auto-scheduler Source: Apache TVM Blog: Introducing Auto-scheduler Code Performance Comparison (higher is better) Search Time Comparison (lower is better)
26 Auto-scheduling on Apple M1 Source: OctoML Blog: Beating Apple's Core ML 4 (lower is better) ● 22% faster on CPU ● 49% faster on GPU How? - Effective Auto-scheduler search - Fusing qualified subgraphs
Relay 27 Conv2d bias + relu ... Conv2d bias + relu
Conv2d bias + relu ... Conv2d bias + relu Relay: Fusion 28 Combine into a single fused operation which can then be optimized specifically for your target.
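As a sketch, Relay's fusion pass can be applied directly to a module such as the conv2d + bias + relu example built earlier; after FuseOps the three operators live inside a single primitive function that can be code-generated and optimized as one unit for the target.

    import tvm
    from tvm import relay

    # `mod` is assumed to be the conv2d + bias_add + relu module from the Relay example.
    seq = tvm.transform.Sequential([
        relay.transform.InferType(),
        relay.transform.FuseOps(fuse_opt_level=2),
    ])
    with tvm.transform.PassContext(opt_level=3):
        fused_mod = seq(mod)

    print(fused_mod)   # conv2d, bias_add and relu now appear in one fused function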
Conv2d bias + relu ... Conv2d bias + relu Relay: Device Placement 30 Partition your network to run on multiple devices. CPU GPU
Conv2d bias + relu ... Conv2d bias + relu Relay: Layout Transformation 31 Generate efficient code for different data layouts. NCHW NCHW
Conv2d bias + relu ... Conv2d bias + relu Relay: Layout Transformation 32 Generate efficient code for different data layouts. NHWC NHWC
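A minimal sketch of requesting such a layout conversion on a Relay module (again assuming the mod built earlier): here conv2d layers are converted from the default NCHW data layout to NHWC, with the kernel layout left to TVM's default choice.

    import tvm
    from tvm import relay

    # Desired layouts per operator: [data layout, kernel layout]; "default" lets TVM pick.
    desired_layouts = {"nn.conv2d": ["NHWC", "default"]}

    seq = tvm.transform.Sequential([
        relay.transform.RemoveUnusedFunctions(),
        relay.transform.ConvertLayout(desired_layouts),
    ])
    with tvm.transform.PassContext(opt_level=3):
        mod_nhwc = seq(mod)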
TIR Script 33 ● TIR provides more flexibility than high-level tensor expressions. ● Not everything is expressible in TE, and auto-scheduling is not always perfect. ○ AutoScheduling 3.0 (code-named AutoTIR) is coming later this year ○ We can also write TIR directly using TIR Script:

    @tvm.script.tir
    def fuse_add_exp(a: ty.handle, c: ty.handle) -> None:
        A = tir.match_buffer(a, (64,))
        C = tir.match_buffer(c, (64,))
        B = tir.alloc_buffer((64,))
        with tir.block([64], "B") as [vi]:
            B[vi] = A[vi] + 1
        with tir.block([64], "C") as [vi]:
            C[vi] = exp(B[vi])
Select Performance Results 34
Faster Kernels for Dense-Sparse Multiplication ● Performance comparison on PruneBERT ● 3-10x faster than cuBLAS and cuSPARSE ● 1 engineer writing TensorIR kernels 35
36 Performance at OctoML in 2020: over 60 model x hardware benchmarking studies. Each study compared TVM against the best* baseline on the target, sorted by ascending log2 gain over baseline. [Chart: model x hardware comparison points vs. TVM log2-fold improvement over baseline]
37 Across a broad variety of models and platforms: 2.5x average performance improvement on non-public models (2.1x across all). [Chart: model x hardware comparison points vs. TVM log2-fold improvement over baseline]
38 Across a broad variety of models and platforms: 34x for Yolo-V3 on a MIPS-based camera platform; 5.3x for a video analysis model on Nvidia T4 against TensorRT; 4x for a random forest on Nvidia 1070 against XGBoost; 2.5x for MobilenetV3 on an ARM A72 CPU. [Chart: model x hardware comparison points vs. TVM log2-fold improvement over baseline]
Case Study: 90% cloud inference cost reduction. Background ● Top 10 tech company running multiple variations of customized CV models ● Models run in batch processing / offline mode using standard HW targets of a major public cloud ● Billions of inferences per month ● Benchmarking on CPU and GPU. Results ● 3.8x - TensorRT 8-bit to TVM 8-bit ● 10x - TensorRT 8-bit to TVM 4-bit ● Potential to reduce hourly costs by 90% ● Up to 10x increase in inferences per dollar 41 *V100 at an hourly price of $3.00, T4 at $0.53
See https://github.com/tlc-pack/tlcbench for benchmark scripts 42 Results: TVM on CPU and GPU ● Intel x86 - 2-5x performance (20-core Intel Platinum 8269CY, fp32, normalized performance) ● NVIDIA GPU - 20-50% versus TensorRT (V100, fp32, normalized performance)
Why use the Octomizer vs “just” TVM OSS? 43 Octomizer Compile Optimize Benchmark Model x HW analytics data ML Performance Model ● Access to OctoML’s “cost models” ○ We aggregate Models x HW data ○ Continuous improvement ● No need to install any SW, latest TVM ● No need to set up benchmarking HW ● “Outer loop” automation ○ optimize/package multiple models against many HW targets in one go ● Access to comprehensive benchmarking data ○ E.g., for procurement, for HW vendor competitive analysis ● Access to OctoML support
44 Octomizer Live Demo API access Waitlist! octoml.ai
45 The Octonauts! You? View career opportunities at octoml.ai/careers
Thank you! How to use Apache TVM to optimize your ML models By Sameer Farooqui