Azure Execution Provider (Preview)
The Azure Execution Provider enables ONNX Runtime to invoke a remote Azure endpoint for inference; the endpoint must be deployed or otherwise available beforehand.
Since 1.16, pluggable Azure operators, such as OpenAIAudioToText and AzureTritonInvoker, are available from onnxruntime-extensions.
With these operators, the Azure Execution Provider supports two modes of usage: running an edge model and an Azure model side by side, or merging them into a single hybrid model (see Usage below).
The Azure Execution Provider is in the preview stage; all APIs and usage are subject to change.
Contents
Install
Since 1.16, the Azure Execution Provider ships by default in both the Python and NuGet packages.
Requirements
Since 1.16, all Azure Execution Provider operators ship with the onnxruntime-extensions (>=v0.9.0) Python and NuGet packages. Please ensure the correct onnxruntime-extensions package is installed before using the Azure Execution Provider.
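As a quick sanity check, the snippet below (a minimal sketch, not part of the official samples) verifies that the installed packages meet the version requirements and that the Azure Execution Provider is available:

```python
import onnxruntime
import onnxruntime_extensions

# Azure EP requires onnxruntime >= 1.16 and onnxruntime-extensions >= 0.9.0
print("onnxruntime:", onnxruntime.__version__)
print("onnxruntime-extensions:", onnxruntime_extensions.__version__)

# AzureExecutionProvider should be listed among the available providers
print("AzureExecutionProvider" in onnxruntime.get_available_providers())
```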
Build
For build instructions, please see the BUILD page.
Usage
Edge and Azure side by side
In this mode, two models run simultaneously: a local edge model and a remote Azure model. The Azure model runs asynchronously via the RunAsync API, which is also available in Python and C#.
```python
import os
import threading

import numpy as np
import onnx
from onnx import helper, TensorProto
from onnxruntime import SessionOptions, InferenceSession
from onnxruntime_extensions import get_library_path


# Generate the local model by:
# https://github.com/microsoft/onnxruntime-extensions/blob/main/tutorials/whisper_e2e.py
def get_whisper_tiny():
    return '/onnxruntime-extensions/tutorials/whisper_onnx_tiny_en_fp32_e2e.onnx'


# Generate the azure model
def get_openai_audio_azure_model():
    auth_token = helper.make_tensor_value_info('auth_token', TensorProto.STRING, [1])
    model = helper.make_tensor_value_info('model_name', TensorProto.STRING, [1])
    response_format = helper.make_tensor_value_info('response_format', TensorProto.STRING, [-1])
    file = helper.make_tensor_value_info('file', TensorProto.UINT8, [-1])
    transcriptions = helper.make_tensor_value_info('transcriptions', TensorProto.STRING, [-1])

    invoker = helper.make_node('OpenAIAudioToText',
                               ['auth_token', 'model_name', 'response_format', 'file'],
                               ['transcriptions'],
                               domain='com.microsoft.extensions',
                               name='audio_invoker',
                               model_uri='https://api.openai.com/v1/audio/transcriptions',
                               audio_format='wav',
                               verbose=False)

    graph = helper.make_graph([invoker], 'graph',
                              [auth_token, model, response_format, file],
                              [transcriptions])
    model = helper.make_model(graph, ir_version=8,
                              opset_imports=[helper.make_operatorsetid('com.microsoft.extensions', 1)])
    model_name = 'openai_whisper_azure.onnx'
    onnx.save(model, model_name)
    return model_name


# Helper that collects the outputs of an asynchronous run
class RunAsyncState:
    def __init__(self):
        self.__event = threading.Event()
        self.__outputs = None
        self.__err = ''

    def fill_outputs(self, outputs, err):
        self.__outputs = outputs
        self.__err = err
        self.__event.set()

    def get_outputs(self):
        if self.__err != '':
            raise Exception(self.__err)
        return self.__outputs

    def wait(self, sec):
        self.__event.wait(sec)


# Callback invoked when the asynchronous azure run completes
def azureRunCallback(outputs: np.ndarray, state: RunAsyncState, err: str) -> None:
    state.fill_outputs(outputs, err)


if __name__ == '__main__':
    sess_opt = SessionOptions()
    sess_opt.register_custom_ops_library(get_library_path())

    azure_model_path = get_openai_audio_azure_model()
    # load AzureEP
    azure_model_sess = InferenceSession(azure_model_path, sess_opt,
                                        providers=['CPUExecutionProvider', 'AzureExecutionProvider'])

    # read raw audio data from a local wav file
    with open('test16.wav', "rb") as _f:
        audio_stream = np.asarray(list(_f.read()), dtype=np.uint8)

    azure_model_inputs = {
        "auth_token": np.array([os.getenv('AUDIO', '')]),  # read auth from env variable
        "model_name": np.array(['whisper-1']),
        "response_format": np.array(['text']),
        "file": audio_stream
    }

    run_async_state = RunAsyncState()
    # infer azure model asynchronously
    azure_model_sess.run_async(None, azure_model_inputs, azureRunCallback, run_async_state)

    # at the same time, run the edge model
    edge_model_path = get_whisper_tiny()
    edge_model_sess = InferenceSession(edge_model_path, sess_opt, providers=['CPUExecutionProvider'])
    edge_model_outputs = edge_model_sess.run(None, {
        'audio_stream': np.expand_dims(audio_stream, 0),
        'max_length': np.asarray([200], dtype=np.int32),
        'min_length': np.asarray([0], dtype=np.int32),
        'num_beams': np.asarray([2], dtype=np.int32),
        'num_return_sequences': np.asarray([1], dtype=np.int32),
        'length_penalty': np.asarray([1.0], dtype=np.float32),
        'repetition_penalty': np.asarray([1.0], dtype=np.float32)
    })
    print("\noutput from whisper tiny: ", edge_model_outputs)

    run_async_state.wait(10)
    print("\nresponse from openAI: ", run_async_state.get_outputs())
    # compare results and pick the better
```
Merge and run the hybrid
Alternatively, the local and Azure models can be merged into a single hybrid model, which is then run like an ordinary ONNX model, as sketched below. Sample merge scripts can be found here.
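Once a hybrid model has been produced, it can be loaded and run with both the CPU and Azure execution providers registered. The snippet below is a minimal sketch; the file name 'hybrid_whisper.onnx' and the input names and values are placeholders for illustration, not outputs of the official merge scripts:

```python
import numpy as np
from onnxruntime import SessionOptions, InferenceSession
from onnxruntime_extensions import get_library_path

# register the custom Azure operators from onnxruntime-extensions
sess_opt = SessionOptions()
sess_opt.register_custom_ops_library(get_library_path())

# 'hybrid_whisper.onnx' is a placeholder for a model produced by the merge scripts
sess = InferenceSession('hybrid_whisper.onnx', sess_opt,
                        providers=['CPUExecutionProvider', 'AzureExecutionProvider'])

# input names and shapes depend on how the edge and azure models were merged
outputs = sess.run(None, {'auth_token': np.array(['<api key>']),
                          'audio_stream': np.zeros([1, 16000], dtype=np.uint8)})
print(outputs)
```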
Current Limitations
- Only builds and runs on Windows, Linux, and Android.
- For Android, AzureTritonInvoker is not supported.