Azure Execution Provider (Preview)
The Azure Execution Provider enables ONNX Runtime to invoke a remote Azure endpoint for inference; the endpoint must be deployed or otherwise available beforehand.
Since 1.16, pluggable Azure operators, such as OpenAIAudioToText and AzureTritonInvoker, are available from onnxruntime-extensions.
With these operators, the Azure Execution Provider supports two modes of usage: running an edge model and an Azure model side by side, or merging them into a single hybrid model (see Usage below).
The Azure Execution Provider is in the preview stage; all APIs and usage are subject to change.
Contents
Install
Since 1.16, the Azure Execution Provider ships by default in both the Python and NuGet packages.
Requirements
Since 1.16, all Azure Execution Provider operators ship with the onnxruntime-extensions (>=v0.9.0) Python and NuGet packages. Please ensure the correct onnxruntime-extensions package is installed before using the Azure Execution Provider.
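As a quick sanity check, the snippet below (a minimal sketch, not part of the official samples) verifies that the installed packages meet the version requirements and that the Azure Execution Provider is available:

```python
import onnxruntime
import onnxruntime_extensions

# Azure EP requires onnxruntime >= 1.16 and onnxruntime-extensions >= 0.9.0
print("onnxruntime:", onnxruntime.__version__)
print("onnxruntime-extensions:", onnxruntime_extensions.__version__)

# AzureExecutionProvider should be listed among the available providers
print("AzureExecutionProvider" in onnxruntime.get_available_providers())
```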
Build
For build instructions, please see the BUILD page.
Usage
Edge and Azure side by side
In this mode, two models run simultaneously: a local edge model and a remote Azure model. The Azure model runs asynchronously via the RunAsync API, which is also available in Python and C#.
```python
import os
import threading

import numpy as np
import onnx
from onnx import helper, TensorProto
from onnxruntime import SessionOptions, InferenceSession
from onnxruntime_extensions import get_library_path


# Generate the local model by:
# https://github.com/microsoft/onnxruntime-extensions/blob/main/tutorials/whisper_e2e.py
def get_whisper_tiny():
    return '/onnxruntime-extensions/tutorials/whisper_onnx_tiny_en_fp32_e2e.onnx'


# Generate the azure model
def get_openai_audio_azure_model():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [1])
    model = helper.make_tensor_value_info('model_name', TensorProto.STRING, [1])
    response_format = helper.make_tensor_value_info('response_format', TensorProto.STRING, [-1])
    file = helper.make_tensor_value_info('file', TensorProto.UINT8, [-1])
    transcriptions = helper.make_tensor_value_info('transcriptions', TensorProto.STRING, [-1])

    invoker = helper.make_node('OpenAIAudioToText',
                               ['auth_token', 'model_name', 'response_format', 'file'],
                               ['transcriptions'],
                               domain='com.microsoft.extensions',
                               name='audio_invoker',
                               model_uri='https://api.openai.com/v1/audio/transcriptions',
                               audio_format='wav',
                               verbose=False)

    graph = helper.make_graph([invoker], 'graph',
                              [auth_token, model, response_format, file],
                              [transcriptions])
    model = helper.make_model(graph, ir_version=8,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])
    model_name = 'openai_whisper_azure.onnx'
    onnx.save(model, model_name)
    return model_name


# Helper that collects the outputs of an asynchronous run
class RunAsyncState:
    def __init__(self):
        self.__event = threading.Event()
        self.__outputs = None
        self.__err = ''

    def fill_outputs(self, outputs, err):
        self.__outputs = outputs
        self.__err = err
        self.__event.set()

    def get_outputs(self):
        if self.__err != '':
            raise Exception(self.__err)
        return self.__outputs

    def wait(self, sec):
        self.__event.wait(sec)


# Callback invoked when the asynchronous azure run completes
def azureRunCallback(outputs: np.ndarray, state: RunAsyncState, err: str) -> None:
    state.fill_outputs(outputs, err)


if __name__ == '__main__':
    sess_opt = SessionOptions()
    sess_opt.register_custom_ops_library(get_library_path())

    azure_model_path = get_openai_audio_azure_model()
    # load AzureEP
    azure_model_sess = InferenceSession(azure_model_path, sess_opt,
                                        providers=['CPUExecutionProvider', 'AzureExecutionProvider'])

    # read raw audio data from a local wav file
    with open('test16.wav', "rb") as _f:
        audio_stream = np.asarray(list(_f.read()), dtype=np.uint8)

    azure_model_inputs = {
        "auth_token": np.array([os.getenv('AUDIO', '')]),  # read auth from env variable
        "model_name": np.array(['whisper-1']),
        "response_format": np.array(['text']),
        "file": audio_stream
    }

    run_async_state = RunAsyncState()
    # infer azure model asynchronously
    azure_model_sess.run_async(None, azure_model_inputs, azureRunCallback, run_async_state)

    # at the same time, run the edge model
    edge_model_path = get_whisper_tiny()
    edge_model_sess = InferenceSession(edge_model_path, sess_opt, providers=['CPUExecutionProvider'])
    edge_model_outputs = edge_model_sess.run(None, {
        'audio_stream': np.expand_dims(audio_stream, 0),
        'max_length': np.asarray([200], dtype=np.int32),
        'min_length': np.asarray([0], dtype=np.int32),
        'num_beams': np.asarray([2], dtype=np.int32),
        'num_return_sequences': np.asarray([1], dtype=np.int32),
        'length_penalty': np.asarray([1.0], dtype=np.float32),
        'repetition_penalty': np.asarray([1.0], dtype=np.float32)
    })
    print("\noutput from whisper tiny: ", edge_model_outputs)

    run_async_state.wait(10)
    print("\nresponse from openAI: ", run_async_state.get_outputs())
    # compare results and pick the better
```
Merge and run the hybrid
Alternatively, the local and Azure models can be merged into a single hybrid model, which is then run like an ordinary ONNX model, as sketched below. Sample merge scripts can be found here.
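Once a hybrid model has been produced, it can be loaded and run with both the CPU and Azure execution providers registered. The snippet below is a minimal sketch; the file name 'hybrid_whisper.onnx' and the input names and values are placeholders for illustration, not outputs of the official merge scripts:

```python
import numpy as np
from onnxruntime import SessionOptions, InferenceSession
from onnxruntime_extensions import get_library_path

# register the custom Azure operators from onnxruntime-extensions
sess_opt = SessionOptions()
sess_opt.register_custom_ops_library(get_library_path())

# 'hybrid_whisper.onnx' is a placeholder for a model produced by the merge scripts
sess = InferenceSession('hybrid_whisper.onnx', sess_opt,
                        providers=['CPUExecutionProvider', 'AzureExecutionProvider'])

# input names and shapes depend on how the edge and azure models were merged
outputs = sess.run(None, {'auth_token': np.array(['<api key>']),
                          'audio_stream': np.zeros([1, 16000], dtype=np.uint8)})
print(outputs)
```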
Current Limitations
- Only builds and runs on Windows, Linux, and Android.
- For Android, AzureTritonInvoker is not supported.