Dynamically Allocated Engines

TL;DR
Some engines require far more GPU memory than their original PyTorch modules. This means that workflows which ran in PyTorch will fail in TensorRT even though the engines compile successfully.
If users were able to provide the memory to run an engine dynamically and then reclaim it afterwards, this issue would be addressed.
Goal(s)
Do not reserve GPU memory until execution: as part of the execute-engine process, the runtime will allocate the memory the engine needs and then release it once execution has completed.
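As a rough sketch of the intended allocation pattern (using the TensorRT Python API directly, assuming TensorRT >= 8.5; `set_io` and the stream handling below are illustrative placeholders, not part of the proposal):

```python
import tensorrt as trt
import torch

def run_with_transient_workspace(engine: trt.ICudaEngine, set_io, stream: torch.cuda.Stream):
    # Create an execution context that does NOT reserve the engine's activation memory up front
    ctx = engine.create_execution_context_without_device_memory()

    # Allocate the workspace only for the duration of this call
    workspace = torch.empty(engine.device_memory_size, dtype=torch.uint8, device="cuda")
    ctx.device_memory = workspace.data_ptr()

    set_io(ctx)                               # bind input/output tensor addresses (placeholder)
    ctx.execute_async_v3(stream.cuda_stream)  # enqueue inference
    stream.synchronize()

    # Dropping the references returns the workspace to PyTorch's caching allocator,
    # so the next engine or PyTorch module can reuse that memory
    del ctx, workspace
```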
Usecases
Running a diffusers pipeline composed of multiple sub-modules: if a TRT engine is larger than the original PyTorch module, these workflows will fail unless engine memory is dynamically released.
Proposed APIs / UX
As with other runtime settings, there will be two sets of APIs.
```python
with torch_tensorrt.runtime.resource_allocation_strategy("dynamic"):
    # Iterates through sub-modules and flips the setting to re-initialize the engines
    ...
```
Example Workflow
```python
with torch_tensorrt.runtime.resource_allocation_strategy("dynamic"):
    ...  # run the TRT-compiled sub-modules here
```
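A fuller sketch of how this might look for a multi-module pipeline (the context manager is the API proposed in this RFC and does not exist yet; the toy two-stage model stands in for something like a diffusers pipeline, and the compile settings are illustrative only):

```python
import torch
import torch_tensorrt

stage1 = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.ReLU()).half().cuda().eval()
stage2 = torch.nn.Sequential(torch.nn.Linear(4096, 1024), torch.nn.ReLU()).half().cuda().eval()

x = torch.randn(8, 1024, dtype=torch.half, device="cuda")
trt_stage1 = torch_tensorrt.compile(stage1, ir="dynamo", inputs=[x], enabled_precisions={torch.half})
trt_stage2 = torch_tensorrt.compile(
    stage2,
    ir="dynamo",
    inputs=[torch.randn(8, 4096, dtype=torch.half, device="cuda")],
    enabled_precisions={torch.half},
)

# Under the proposed strategy each engine grabs its activation memory right before it runs
# and releases it afterwards, so the two engines never need to be resident at the same time.
with torch_tensorrt.runtime.resource_allocation_strategy("dynamic"):
    out = trt_stage2(trt_stage1(x))
```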
Limitations
Does not reduce the memory utilization of an individual engine; it only helps multi-module pipelines by allowing a TRT engine to vacate GPU memory before the next engine or PyTorch module runs.
Internal Implementation
Design
```cpp
TRTEngine::set_resource_allocation_strategy(TRTEngine::ResourceAllocationStrategy new_strategy) {
  // If the allocation strategy is changed, destroy the current execution context and recreate it
  if (new_strategy != this->resource_allocation_strategy) {
    this->resource_allocation_strategy = new_strategy;
    if (this->resource_allocation_strategy == ResourceAllocationStrategy::kDynamic) {
      this->exec_ctx = this->engine.create_execution_context_without_device_memory();
    } else {
      this->exec_ctx = this->engine.create_execution_context();
    }
  }
}

TRTEngine::get_resource_allocation_strategy() { ... }

TRTEngine::TRTEngine(..., ResourceAllocationStrategy resource_allocation_strategy) {
  ...
  if (this->resource_allocation_strategy == ResourceAllocationStrategy::kDynamic) {
    this->exec_ctx = this->engine.create_execution_context_without_device_memory();
  } else {
    this->exec_ctx = this->engine.create_execution_context();
  }
  ...
}

execute_engine(...) {
  ...
  torch::Tensor dynamic_workspace;
  if (engine.resource_allocation_strategy == ResourceAllocationStrategy::kDynamic) {
    // Allocate the engine's workspace only for this invocation and hand it to the execution context
    dynamic_workspace = torch::empty(engine.device_memory_size, torch::TensorOptions().dtype(torch::kUInt8).device(torch::kCUDA));
    exec_ctx.device_memory = dynamic_workspace.data_ptr();
  }
  ...
}
```

Plus the associated torchbind lifting and exposure through TorchTensorRTModule.
Extensions Required to Core API implementations
We need an additional runtime mode added to both the C++ and Python runtimes.
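On the Python side, a minimal sketch of the equivalent mode switch might look like the following (class and attribute names are illustrative, not the actual PythonTorchTensorRTModule implementation):

```python
import tensorrt as trt

class PythonRuntimeEngine:
    def __init__(self, engine: trt.ICudaEngine, resource_allocation_strategy: str = "static"):
        self.engine = engine
        self.resource_allocation_strategy = None
        self.context = None
        self.set_resource_allocation_strategy(resource_allocation_strategy)

    def set_resource_allocation_strategy(self, new_strategy: str) -> None:
        # Mirrors TRTEngine::set_resource_allocation_strategy above: recreate the execution
        # context whenever the strategy flips between static and dynamic allocation
        if new_strategy != self.resource_allocation_strategy:
            self.resource_allocation_strategy = new_strategy
            if new_strategy == "dynamic":
                self.context = self.engine.create_execution_context_without_device_memory()
            else:
                self.context = self.engine.create_execution_context()
```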
Data Structures
New enum to describe the runtime mode.
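For illustration, the Python-side mirror of this enum could be as simple as the following (naming follows the ResourceAllocationStrategy::kDynamic value used in the C++ design above, with a static default; the actual spelling may differ):

```python
from enum import Enum, auto

class ResourceAllocationStrategy(Enum):
    STATIC = auto()   # default: the execution context owns its device memory for its lifetime
    DYNAMIC = auto()  # workspace is allocated per call in execute_engine and released afterwards
```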
Implementation Phases
Prototype - S
Support in C++ runtime
MVP (2.9.0) - S
All of the above supported in C++ and Python