|
3 | 3 | """ |
4 | 4 | .. meta:: |
5 | 5 | :description: An end-to-end example of how to use AOTInductor for Python runtime. |
6 | | - :keywords: torch.export, AOTInductor, torch._inductor.aot_compile, torch._export.aot_load |
| 6 | + :keywords: torch.export, AOTInductor, torch._inductor.aoti_compile_and_package, aot_compile, torch._inductor.aoti_load_package |
7 | 7 |
|
8 | 8 | ``torch.export`` AOTInductor Tutorial for Python runtime (Beta) |
9 | 9 | =============================================================== |
|
14 | 14 | # |
15 | 15 | # .. warning:: |
16 | 16 | # |
17 | | -# ``torch._inductor.aot_compile`` and ``torch._export.aot_load`` are in Beta status and are subject to backwards compatibility |
18 | | -# breaking changes. This tutorial provides an example of how to use these APIs for model deployment using Python runtime. |
| 17 | +# ``torch._inductor.aoti_compile_and_package`` and |
| 18 | +# ``torch._inductor.aoti_load_package`` are in Beta status and are subject |
| 19 | +# to backwards compatibility breaking changes. This tutorial provides an |
| 20 | +# example of how to use these APIs for model deployment using Python |
| 21 | +# runtime. |
19 | 22 | # |
20 | | -# It has been shown `previously <https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html#>`__ how AOTInductor can be used |
21 | | -# to do Ahead-of-Time compilation of PyTorch exported models by creating |
22 | | -# a shared library that can be run in a non-Python environment. |
23 | | -# |
24 | | -# |
25 | | -# In this tutorial, you will learn an end-to-end example of how to use AOTInductor for Python runtime. |
26 | | -# We will look at how to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a |
27 | | -# shared library. Additionally, we will examine how to execute the shared library in Python runtime using :func:`torch._export.aot_load`. |
28 | | -# You will learn about the speed up seen in the first inference time using AOTInductor, especially when using |
29 | | -# ``max-autotune`` mode which can take some time to execute. |
| 23 | +# It has been shown `previously |
| 24 | +# <https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html#>`__ how |
| 25 | +# AOTInductor can be used to do Ahead-of-Time compilation of PyTorch exported |
| 26 | +# models by creating an artifact that can be run in a non-Python environment. |
| 27 | +# In this tutorial, you will work through an end-to-end example of how to use |
| 28 | +# AOTInductor for Python runtime. |
30 | 29 | # |
31 | 30 | # **Contents** |
32 | 31 | # |
|
36 | 35 | ###################################################################### |
37 | 36 | # Prerequisites |
38 | 37 | # ------------- |
39 | | -# * PyTorch 2.4 or later |
| 38 | +# * PyTorch 2.6 or later |
40 | 39 | # * Basic understanding of ``torch.export`` and AOTInductor |
41 | 40 | # * Complete the `AOTInductor: Ahead-Of-Time Compilation for Torch.Export-ed Models <https://pytorch.org/docs/stable/torch.compiler_aot_inductor.html#>`_ tutorial |
42 | 41 |
|
43 | 42 | ###################################################################### |
44 | 43 | # What you will learn |
45 | 44 | # ---------------------- |
46 | | -# * How to use AOTInductor for python runtime. |
47 | | -# * How to use :func:`torch._inductor.aot_compile` along with :func:`torch.export.export` to generate a shared library |
48 | | -# * How to run a shared library in Python runtime using :func:`torch._export.aot_load`. |
49 | | -# * When do you use AOTInductor for python runtime |
| 45 | +# * How to use AOTInductor for Python runtime. |
| 46 | +# * How to use :func:`torch._inductor.aoti_compile_and_package` along with :func:`torch.export.export` to generate a compiled artifact |
| 47 | +# * How to load and run the artifact in a Python runtime using :func:`torch._inductor.aoti_load_package`. |
| 48 | +# * When to use AOTInductor with a Python runtime |
50 | 49 |
|
51 | 50 | ###################################################################### |
52 | 51 | # Model Compilation |
53 | 52 | # ----------------- |
54 | 53 | # |
55 | | -# We will use the TorchVision pretrained `ResNet18` model and TorchInductor on the |
56 | | -# exported PyTorch program using :func:`torch._inductor.aot_compile`. |
| 54 | +# We will use the TorchVision pretrained ``ResNet18`` model as an example. |
57 | 55 | # |
58 | | -# .. note:: |
| 56 | +# The first step is to export the model to a graph representation using |
| 57 | +# :func:`torch.export.export`. To learn more about using this function, you can |
| 58 | +# check out the `docs <https://pytorch.org/docs/main/export.html>`_ or the |
| 59 | +# `tutorial <https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html>`_. |
59 | 60 | # |
60 | | -# This API also supports :func:`torch.compile` options like ``mode`` |
61 | | -# This means that if used on a CUDA enabled device, you can, for example, set ``"max_autotune": True`` |
62 | | -# which leverages Triton based matrix multiplications & convolutions, and enables CUDA graphs by default. |
| 61 | +# Once we have exported the PyTorch model and obtained an ``ExportedProgram``, |
| 62 | +# we can use :func:`torch._inductor.aoti_compile_and_package` to compile the |
| 63 | +# program with AOTInductor for a specified device, and save the generated |
| 64 | +# contents into a ``.pt2`` artifact. |
63 | 65 | # |
64 | | -# We also specify ``dynamic_shapes`` for the batch dimension. In this example, ``min=2`` is not a bug and is |
65 | | -# explained in `The 0/1 Specialization Problem <https://docs.google.com/document/d/16VPOa3d-Liikf48teAOmxLc92rgvJdfosIy-yoT38Io/edit?fbclid=IwAR3HNwmmexcitV0pbZm_x1a4ykdXZ9th_eJWK-3hBtVgKnrkmemz6Pm5jRQ#heading=h.ez923tomjvyk>`__ |
66 | | - |
| 66 | +# .. note:: |
| 67 | +# |
| 68 | +# This API supports the same available options that :func:`torch.compile` |
| 69 | +# has, such as ``mode`` and ``max_autotune`` (for those who want to enable |
| 70 | +# CUDA graphs and leverage Triton-based matrix multiplications and |
| 71 | +# convolutions). |
67 | 72 |
|
68 | 73 | import os |
69 | 74 | import torch |
| 75 | +import torch._inductor |
70 | 76 | from torchvision.models import ResNet18_Weights, resnet18 |
71 | 77 |
|
72 | 78 | model = resnet18(weights=ResNet18_Weights.DEFAULT) |
73 | 79 | model.eval() |
74 | 80 |
|
75 | 81 | with torch.inference_mode(): |
| 82 | + inductor_configs = {} |
76 | 83 |
|
77 | | - # Specify the generated shared library path |
78 | | - aot_compile_options = { |
79 | | - "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"), |
80 | | - } |
81 | 84 | if torch.cuda.is_available(): |
82 | 85 | device = "cuda" |
83 | | - aot_compile_options.update({"max_autotune": True}) |
| 86 | + inductor_configs["max_autotune"] = True |
84 | 87 | else: |
85 | 88 | device = "cpu" |
86 | 89 |
|
87 | 90 | model = model.to(device=device) |
88 | 91 | example_inputs = (torch.randn(2, 3, 224, 224, device=device),) |
89 | 92 |
|
90 | | - # min=2 is not a bug and is explained in the 0/1 Specialization Problem |
91 | | - batch_dim = torch.export.Dim("batch", min=2, max=32) |
92 | 93 | exported_program = torch.export.export( |
93 | 94 | model, |
94 | 95 | example_inputs, |
95 | | - # Specify the first dimension of the input x as dynamic |
96 | | - dynamic_shapes={"x": {0: batch_dim}}, |
97 | 96 | ) |
98 | | - so_path = torch._inductor.aot_compile( |
99 | | - exported_program.module(), |
100 | | - example_inputs, |
101 | | - # Specify the generated shared library path |
102 | | - options=aot_compile_options |
| 97 | + path = torch._inductor.aoti_compile_and_package( |
| 98 | + exported_program, |
| 99 | + package_path=os.path.join(os.getcwd(), "resnet18.pt2"), |
| 100 | + inductor_configs=inductor_configs |
103 | 101 | ) |
104 | 102 |
|
| 103 | +###################################################################### |
| 104 | +# The result of :func:`aoti_compile_and_package` is an artifact ``resnet18.pt2`` |
| 105 | +# which can be loaded and executed in Python and C++. |
| 106 | +# |
| 107 | +# The artifact itself contains the AOTInductor-generated code, such as |
| 108 | +# a generated C++ runner file, a shared library compiled from that C++ file, and |
| 109 | +# CUDA binary files, also known as cubin files, if compiling for CUDA. |
| 110 | +# |
| 111 | +# On disk, the artifact is a ``.zip`` file with the following |
| 112 | +# structure: |
| 113 | +# |
| 114 | +# .. code:: |
| 115 | +# . |
| 116 | +# ├── archive_format |
| 117 | +# ├── version |
| 118 | +# ├── data |
| 119 | +# │ ├── aotinductor |
| 120 | +# │ │ └── model |
| 121 | +# │ │ ├── xxx.cpp # AOTInductor generated cpp file |
| 122 | +# │ │ ├── xxx.so # AOTInductor generated shared library |
| 123 | +# │ │ ├── xxx.cubin # Cubin files (if running on CUDA) |
| 124 | +# │ │ └── xxx_metadata.json # Additional metadata to save |
| 125 | +# │ ├── weights |
| 126 | +# │ │ └── TBD |
| 127 | +# │ └── constants |
| 128 | +# │ └── TBD |
| 129 | +# └── extra |
| 130 | +# └── metadata.json |
| 131 | +# |
| 132 | +# We can use the following command to inspect the artifact contents: |
| 133 | +# |
| 134 | +# .. code:: bash |
| 135 | +# |
| 136 | +# $ unzip -l resnet18.pt2 |
| 137 | +# |
| 138 | +# .. code:: |
| 139 | +# |
| 140 | +# Archive: resnet18.pt2 |
| 141 | +# Length Date Time Name |
| 142 | +# --------- ---------- ----- ---- |
| 143 | +# 1 01-08-2025 16:40 version |
| 144 | +# 3 01-08-2025 16:40 archive_format |
| 145 | +# 10088 01-08-2025 16:40 data/aotinductor/model/cagzt6akdaczvxwtbvqe34otfe5jlorktbqlojbzqjqvbfsjlge4.cubin |
| 146 | +# 17160 01-08-2025 16:40 data/aotinductor/model/c6oytfjmt5w4c7onvtm6fray7clirxt7q5xjbwx3hdydclmwoujz.cubin |
| 147 | +# 16616 01-08-2025 16:40 data/aotinductor/model/c7ydp7nocyz323hij4tmlf2kcedmwlyg6r57gaqzcsy3huneamu6.cubin |
| 148 | +# 17776 01-08-2025 16:40 data/aotinductor/model/cyqdf46ordevqhiddvpdpp3uzwatfbzdpl3auj2nx23uxvplnne2.cubin |
| 149 | +# 10856 01-08-2025 16:40 data/aotinductor/model/cpzfebfgrusqslui7fxsuoo4tvwulmrxirc5tmrpa4mvrbdno7kn.cubin |
| 150 | +# 14608 01-08-2025 16:40 data/aotinductor/model/c5ukeoz5wmaszd7vczdz2qhtt6n7tdbl3b6wuy4rb2se24fjwfoy.cubin |
| 151 | +# 11376 01-08-2025 16:40 data/aotinductor/model/csu3nstcp56tsjfycygaqsewpu64l5s6zavvz7537cm4s4cv2k3r.cubin |
| 152 | +# 10984 01-08-2025 16:40 data/aotinductor/model/cp76lez4glmgq7gedf2u25zvvv6rksv5lav4q22dibd2zicbgwj3.cubin |
| 153 | +# 14736 01-08-2025 16:40 data/aotinductor/model/c2bb5p6tnwz4elgujqelsrp3unvkgsyiv7xqxmpvuxcm4jfl7pc2.cubin |
| 154 | +# 11376 01-08-2025 16:40 data/aotinductor/model/c6eopmb2b4ngodwsayae4r5q6ni3jlfogfbdk3ypg56tgpzhubfy.cubin |
| 155 | +# 11624 01-08-2025 16:40 data/aotinductor/model/chmwe6lvoekzfowdbiizitm3haiiuad5kdm6sd2m6mv6dkn2zk32.cubin |
| 156 | +# 15632 01-08-2025 16:40 data/aotinductor/model/c3jop5g344hj3ztsu4qm6ibxyaaerlhkzh2e6emak23rxfje6jam.cubin |
| 157 | +# 25472 01-08-2025 16:40 data/aotinductor/model/chaiixybeiuuitm2nmqnxzijzwgnn2n7uuss4qmsupgblfh3h5hk.cubin |
| 158 | +# 139389 01-08-2025 16:40 data/aotinductor/model/cvk6qzuybruhwxtfblzxiov3rlrziv5fkqc4mdhbmantfu3lmd6t.cpp |
| 159 | +# 27 01-08-2025 16:40 data/aotinductor/model/cvk6qzuybruhwxtfblzxiov3rlrziv5fkqc4mdhbmantfu3lmd6t_metadata.json |
| 160 | +# 47195424 01-08-2025 16:40 data/aotinductor/model/cvk6qzuybruhwxtfblzxiov3rlrziv5fkqc4mdhbmantfu3lmd6t.so |
| 161 | +# --------- ------- |
| 162 | +# 47523148 18 files |
| 163 | + |
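| | +###################################################################### |
| | +# Since the ``.pt2`` artifact is a regular zip archive, we can also list its |
| | +# contents programmatically. The snippet below is a minimal sketch that uses |
| | +# Python's standard-library ``zipfile`` module; ``path`` is the package path |
| | +# returned by :func:`aoti_compile_and_package` above. |
| | + |
| | +import zipfile |
| | + |
| | +# Print the size and name of every file stored inside the .pt2 archive |
| | +with zipfile.ZipFile(path) as archive: |
| | +    for info in archive.infolist(): |
| | +        print(f"{info.file_size:>12} {info.filename}") |
| | + |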
105 | 164 |
|
106 | 165 | ###################################################################### |
107 | 166 | # Model Inference in Python |
108 | 167 | # ------------------------- |
109 | 168 | # |
110 | | -# Typically, the shared object generated above is used in a non-Python environment. In PyTorch 2.3, |
111 | | -# we added a new API called :func:`torch._export.aot_load` to load the shared library in the Python runtime. |
112 | | -# The API follows a structure similar to the :func:`torch.jit.load` API . You need to specify the path |
113 | | -# of the shared library and the device where it should be loaded. |
| 169 | +# To load and run the artifact in Python, we can use :func:`torch._inductor.aoti_load_package`. |
114 | 170 | # |
115 | | -# .. note:: |
116 | | -# In the example above, we specified ``batch_size=1`` for inference and it still functions correctly even though we specified ``min=2`` in |
117 | | -# :func:`torch.export.export`. |
118 | | - |
119 | 171 |
|
120 | 172 | import os |
121 | 173 | import torch |
| 174 | +import torch._inductor |
122 | 175 |
|
123 | | -device = "cuda" if torch.cuda.is_available() else "cpu" |
124 | | -model_so_path = os.path.join(os.getcwd(), "resnet18_pt2.so") |
| 176 | +model_path = os.path.join(os.getcwd(), "resnet18.pt2") |
125 | 177 |
|
126 | | -model = torch._export.aot_load(model_so_path, device) |
127 | | -example_inputs = (torch.randn(1, 3, 224, 224, device=device),) |
| 178 | +compiled_model = torch._inductor.aoti_load_package(model_path) |
| 179 | +example_inputs = (torch.randn(2, 3, 224, 224, device=device),) |
128 | 180 |
|
129 | 181 | with torch.inference_mode(): |
130 | | - output = model(example_inputs) |
| 182 | + output = compiled_model(example_inputs) |
| 183 | + |
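| | +###################################################################### |
| | +# As a quick sanity check, we can compare the packaged model's output against |
| | +# eager mode. This is a minimal sketch added for illustration; it assumes the |
| | +# eager ``model`` from the compilation step above is still in scope, and small |
| | +# numerical differences are expected when Triton kernels or TF32 are involved. |
| | + |
| | +with torch.inference_mode(): |
| | +    eager_output = model(*example_inputs) |
| | + |
| | +# Report how far the compiled output is from the eager reference |
| | +print("max abs difference vs eager:", (output - eager_output).abs().max().item()) |
| | + |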
131 | 184 |
|
132 | 185 | ###################################################################### |
133 | | -# When to use AOTInductor for Python Runtime |
134 | | -# ------------------------------------------ |
| 186 | +# When to use AOTInductor with a Python Runtime |
| 187 | +# --------------------------------------------- |
135 | 188 | # |
136 | | -# One of the requirements for using AOTInductor is that the model shouldn't have any graph breaks. |
137 | | -# Once this requirement is met, the primary use case for using AOTInductor Python Runtime is for |
138 | | -# model deployment using Python. |
139 | | -# There are mainly two reasons why you would use AOTInductor Python Runtime: |
| 189 | +# There are two main reasons why one would use AOTInductor with a Python runtime: |
140 | 190 | # |
141 | | -# - ``torch._inductor.aot_compile`` generates a shared library. This is useful for model |
142 | | -# versioning for deployments and tracking model performance over time. |
| 191 | +# - ``torch._inductor.aoti_compile_and_package`` generates a single |
| 192 | +# serialized artifact. This is useful for model versioning for deployments |
| 193 | +# and tracking model performance over time. |
143 | 194 | # - With :func:`torch.compile` being a JIT compiler, there is a warmup |
144 | | -# cost associated with the first compilation. Your deployment needs to account for the |
145 | | -# compilation time taken for the first inference. With AOTInductor, the compilation is |
146 | | -# done offline using ``torch.export.export`` & ``torch._indutor.aot_compile``. The deployment |
147 | | -# would only load the shared library using ``torch._export.aot_load`` and run inference. |
| 195 | +# cost associated with the first compilation. Your deployment needs to |
| 196 | +# account for the compilation time taken for the first inference. With |
| 197 | +# AOTInductor, the compilation is done ahead of time using |
| 198 | +# ``torch.export.export`` and ``torch._inductor.aoti_compile_and_package``. |
| 199 | +# At deployment time, after loading the model, running inference does not |
| 200 | +# have any additional cost. |
148 | 201 | # |
149 | 202 | # |
150 | 203 | # The section below shows the speedup achieved with AOTInductor for first inference |
@@ -185,7 +238,7 @@ def timed(fn): |
185 | 238 |
|
186 | 239 | torch._dynamo.reset() |
187 | 240 |
|
188 | | -model = torch._export.aot_load(model_so_path, device) |
| 241 | +model = torch._inductor.aoti_load_package(model_path) |
189 | 242 | example_inputs = (torch.randn(1, 3, 224, 224, device=device),) |
190 | 243 |
|
191 | 244 | with torch.inference_mode(): |
@@ -217,8 +270,7 @@ def timed(fn): |
217 | 270 | # ---------- |
218 | 271 | # |
219 | 272 | # In this recipe, we have learned how to effectively use AOTInductor for Python runtime by |
220 | | -# compiling and loading a pretrained ``ResNet18`` model using the ``torch._inductor.aot_compile`` |
221 | | -# and ``torch._export.aot_load`` APIs. This process demonstrates the practical application of |
222 | | -# generating a shared library and running it within a Python environment, even with dynamic shape |
223 | | -# considerations and device-specific optimizations. We also looked at the advantage of using |
| 273 | +# compiling and loading a pretrained ``ResNet18`` model. This process |
| 274 | +# demonstrates the practical application of generating a compiled artifact and |
| 275 | +# running it within a Python environment. We also looked at the advantage of using |
224 | 276 | # AOTInductor in model deployments, with regard to the speedup in first inference time. |