17 June 2021
SCALING AI IN PRODUCTION USING PYTORCH
Geeta Chauhan, PyTorch Partner Engineering, Facebook AI (@chauhang)
MLOps World 2021
AGENDA
01 Challenges with ML in Production
02 TorchServe Overview
03 Best Practices for Production Deployment
PYTORCH COMMUNITY GROWTH
Source: https://paperswithcode.com/trends
CHALLENGES WITH ML IN DEPLOYMENT
(Diagram: serving pipeline, cloud or on-prem, with preprocessing, application logic, and postprocessing stages.)
Key concerns: performance, ease of use, cost efficiency, deployment at scale
INFERENCE AT SCALE
Deploying and managing models in production is difficult. Some of the pain points include:
• Loading and managing multiple models, on multiple servers or end devices
• Running pre-processing and post-processing code on prediction requests
• How to log, monitor and secure predictions
• What happens when you hit scale?
TORCHSERVE
Easily deploy PyTorch models in production at scale
• Default handlers for common tasks
• Low latency model serving
• Works with any ML environment
TORCHSERVE
• Default handlers for common use cases (e.g., image segmentation, text classification), custom handler support for other use cases, and a Model Zoo
• Multi-model serving, model versioning and ability to roll back to an earlier version
• Automatic batching of individual inferences across HTTP requests
• Logging including common metrics, and the ability to incorporate custom metrics
• Robust HTTP APIs: Management and Inference
(Architecture diagram: torch-model-archiver packages model .pth files into .mar archives in <path>/model_store; torchserve --start serves them, exposing the Inference API on http://localhost:8080/..., the Management API on http://localhost:8081/..., and a Metrics API, with logging, metrics, and multiple models served concurrently.)
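Once the server is running and a model is registered, clients talk to it over plain HTTP. A minimal sketch with Python's requests library, assuming TorchServe is running locally on the default ports; the model name "model1" and input file "kitten.jpg" are placeholders:

import requests

# Liveness check against the Inference API (default port 8080)
print(requests.get("http://localhost:8080/ping").json())        # {"status": "Healthy"}

# List registered models via the Management API (default port 8081)
print(requests.get("http://localhost:8081/models").json())

# Send a prediction request; the request body format depends on the model's handler
with open("kitten.jpg", "rb") as f:                              # placeholder input file
    resp = requests.post("http://localhost:8080/predictions/model1", data=f)
print(resp.json())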
TORCHSERVE DETAIL: MODEL HANDLERS
TorchServe has default model handlers that perform boilerplate data transforms for common cases:
• Image Classification
• Image Segmentation
• Object Detection
• Text Classification
You can also create custom model handlers for any model and inference task, for example:

import torch

class MyModelHandler(object):
    def __init__(self):
        self.initialized = False

    def initialize(self, context):
        # get GPU status & device handle
        # load model & supporting files (vocabularies etc.)
        self.initialized = True

    def preprocess(self, data):
        # put incoming data into a tensor
        # transform as needed for your model
        return data

    def inference(self, data):
        # run predictions with the loaded model
        return data

    def postprocess(self, output):
        # process inference output, e.g. extract top K
        # package output for web delivery
        return output

_service = MyModelHandler()

# Module-level entry point called by TorchServe for each request batch
def handle(data, context):
    if not _service.initialized:
        _service.initialize(context)
    if data is None:
        return None
    data = _service.preprocess(data)
    data = _service.inference(data)
    data = _service.postprocess(data)
    return data
MODEL ARCHIVE
torch-model-archiver: CLI tool for packaging all model artifacts into a single deployment unit
• Model checkpoints, or a model definition file with a state_dict
• TorchScript and eager mode support
• Extra files like vocab, config, index_to_name mapping

torch-model-archiver \
    --model-name BERTSeqClassification_Torchscript \
    --version 1.0 \
    --serialized-file Transformer_model/traced_model.pt \
    --handler ./Transformer_handler_generalized.py \
    --extra-files "./setup_config.json,./Seq_classification_artifacts/index_to_name.json"

setup_config.json:
{
    "model_name": "bert-base-uncased",
    "mode": "sequence_classification",
    "do_lower_case": "True",
    "num_labels": "2",
    "save_mode": "torchscript",
    "max_length": "150"
}

torchserve --start \
    --model-store model_store \
    --models <path-to model-file/s3-url/azure-blob-url>

https://github.com/pytorch/serve/tree/master/model-archiver#creating-a-model-archive
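As a rough illustration of the flow after archiving, the .mar can be registered through the Management API and queried through the Inference API. This sketch assumes the archive sits in the model store TorchServe was started with, and that the generalized Transformer handler accepts raw text in the request body; the input sentence is just an example:

import requests

# Register the archive and spin up one worker (Management API, port 8081)
requests.post(
    "http://localhost:8081/models",
    params={"url": "BERTSeqClassification_Torchscript.mar", "initial_workers": 1},
)

# Run a sequence-classification request (Inference API, port 8080)
resp = requests.post(
    "http://localhost:8080/predictions/BERTSeqClassification_Torchscript",
    data="Bloomberg has reported on the economy",    # example input text
)
print(resp.json())    # predicted label, mapped via index_to_name.json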
DYNAMIC BATCHING
Via custom handlers
• Model-configuration based
• batch_size: maximum batch size
• max_batch_delay: the maximum time TorchServe waits to receive batch_size requests before running inference
• (Coming soon) Batching support in default handlers

curl localhost:8081/models/resnet-152
{
  "modelName": "resnet-152",
  "modelUrl": "https://s3.amazonaws.com/model-server/model_archive_1.0/examples/resnet-152-batching/resnet-152.ma",
  "runtime": "python",
  "minWorkers": 1,
  "maxWorkers": 1,
  "batchSize": 8,
  "maxBatchDelay": 10,
  "workers": [
    {
      "id": "9008",
      "startTime": "2019-02-19T23:56:33.907Z",
      "status": "READY",
      "gpu": false,
      "memoryUsage": 607715328
    }
  ]
}
https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
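Batching parameters are passed when the model is registered. A small sketch of setting them through the Management API, assuming a resnet-152.mar archive is already in the model store:

import requests

# batch_size and max_batch_delay (ms) are per-model settings on registration
requests.post(
    "http://localhost:8081/models",
    params={
        "url": "resnet-152.mar",      # assumed to be present in the model store
        "batch_size": 8,
        "max_batch_delay": 10,        # wait up to 10 ms to fill a batch of 8
        "initial_workers": 1,
    },
)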
METRICS
Out-of-box metrics with the ability to extend
• CPU, disk, memory utilization
• Request type counts
• ts.metrics class for extension
• Types supported: size, percentage, counter, general metric
• Prometheus metrics support available

# Access context metrics as follows
metrics = context.metrics

# Create Dimension objects; dimensions are name/value pairs
from ts.metrics.dimension import Dimension
dim1 = Dimension(name, value)
...
dimN = Dimension(name_n, value_n)

# Add distance as a general metric
# dimensions = [dim1, dim2, dim3, ..., dimN]
metrics.add_metric('DistanceInKM', distance, 'km', dimensions=dimensions)

# Add image size as a size metric
metrics.add_size('SizeOfImage', img_size, None, 'MB', dimensions)

# Add MemoryUtilization as a percentage metric
metrics.add_percent('MemoryUtilization', utilization_percent, None, dimensions)

# Create a counter with name 'LoopCount' and dimensions
metrics.add_counter('LoopCount', 1, None, dimensions)

# Log custom metrics
for metric in metrics.store:
    logger.info("[METRICS]%s", str(metric))

https://github.com/pytorch/serve/blob/master/docs/metrics.md
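For the Prometheus integration, TorchServe also exposes a separate Metrics API (port 8082 by default) in Prometheus text format. A quick sketch of reading it directly, assuming default ports:

import requests

metrics_text = requests.get("http://localhost:8082/metrics").text
for line in metrics_text.splitlines():
    if not line.startswith("#"):      # skip HELP/TYPE comment lines
        print(line)                   # e.g. ts_inference_requests_total{...}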
RECENT FEATURES
+ Ensemble model support, Captum model interpretability
+ Kubeflow Pipelines / KFServing integration with auto-scaling and canary rollout on any cloud or on-prem
+ GCP Vertex AI serverless pipelines
+ MLflow integration
+ Prometheus integration with Grafana
+ Multiple nodes on EC2, autoscaling on SageMaker/EKS, AWS Inferentia support
+ New MMF, NMT and DeepLabV3 examples
BEST PRACTICES FOR PRODUCTION DEPLOYMENTS
• Deployment models: standalone, primary/backup, orchestration, cloud vs. on-premises
• Optimizations: performance vs. latency, TorchScript, profiling, offline vs. real-time, cost
• Resilience: robust endpoint, auto-scaling, canary deployments, A/B testing
• Measurement: metrics, model performance, interpretability, feedback loop
• Responsible AI: fairness, human-centered design
RESPONSIBLE AI
Fairness by design
• Measure skewness of data, model bias, data bias; identify relevant metrics
• Transparency, explainable AI, inclusive design
Human-centered design
• Consider AI-driven decisions and their impact on people at the time of model design
• Provide the ability for human recourse rather than full automation; for example, a mortgage-application AI should not be able to reject applicants of a certain category or race with no path to human review
• For computer vision models, measure results across demographics; for example, include support for different skin tones and age groups
OPTIMIZATIONS
• Build with performance vs. latency goals in mind
• Reduce the size of the model: quantization, pruning, mixed-precision training
• Reduce latency: TorchScript model; use the SnakeViz profiler
• Evaluate GPU vs. CPU for low latency
• Evaluate REST vs. gRPC for your prediction service
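A sketch of two of these optimizations applied before packaging the model: post-training dynamic quantization of the linear layers and TorchScript conversion (the resulting .pt file is the kind of serialized file torch-model-archiver expects). Assumes torchvision is installed; the file name is arbitrary:

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()

# Post-training dynamic quantization: store Linear weights as int8
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# TorchScript via tracing; removes Python overhead at serving time
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(quantized, example)
traced.save("resnet50_quantized_traced.pt")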
QUANTIZATION
Model | fp32 accuracy | int8 accuracy (change) | Technique | CPU inference speedup
ResNet50 | 76.1 Top-1, ImageNet | 75.9 (-0.2) | Post-training | 2x (214ms → 102ms, Intel Skylake-DE)
MobileNetV2 | 71.9 Top-1, ImageNet | 71.6 (-0.3) | Quantization-aware training | 4x (75ms → 18ms, OnePlus 5, Snapdragon 835)
Translate / FairSeq | 32.78 BLEU, IWSLT 2014 de-en | 32.78 (0.0) | Dynamic (weights only) | 4x for encoder (Intel Skylake-SE)
These models and more are available on TorchHub: https://pytorch.org/hub/
BERT MODEL PROFILING
(SnakeViz profiles of BERT inference in eager mode vs. TorchScript mode.) TorchScript mode gives a 4x speedup over eager mode.
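A rough sketch of how such profiles can be generated: run the model under cProfile in eager mode and again after TorchScript tracing, then open the dumps with snakeviz. Assumes the Hugging Face transformers package; the model name and input text are just examples:

import cProfile
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# torchscript=True makes the model return tuples so it can be traced
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", torchscript=True
).eval()
inputs = tokenizer("TorchServe makes deployment easier", return_tensors="pt")

with torch.no_grad():
    # Eager-mode profile; inspect with: snakeviz eager.prof
    profiler = cProfile.Profile()
    profiler.runcall(model, **inputs)
    profiler.dump_stats("eager.prof")

    # TorchScript (traced) profile; inspect with: snakeviz torchscript.prof
    traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
    profiler = cProfile.Profile()
    profiler.runcall(traced, inputs["input_ids"], inputs["attention_mask"])
    profiler.dump_stats("torchscript.prof")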
OPTIMIZATIONS (CONTD.)
Offline vs. real-time predictions
• Offline: dynamic batching
• Online: async processing (push/poll)
• Pre-computed predictions for certain elements
Cost optimizations
• Spot instances for offline workloads
• Autoscaling based on metrics; on-demand clusters
• Evaluate supported AI accelerators like AWS Inferentia for a lower cost point
DEPLOYING MODELS IN PRODUCTION
(Matrix of deployment options across stages — develop/test, staging/experiments, production, large-scale production — and environments: on-prem, cloud, hybrid cloud, cloud managed.) Options include: install from source, standalone Docker, Minikube, self-managed Docker, AWS CloudFormation, cloud VMs/containers, microservices behind an API gateway, AWS SageMaker endpoints (BYOC), EKS/AKS/GKE, AWS SageMaker / GCP AI Platform, serverless functions, GCP Vertex AI, Kubernetes with Kubeflow/KFServing, MLflow and Kubeflow, Databricks Managed MLflow, primary/backup, ML microservices, autoscaling, and canary rollouts.
RESILIENCE
• Create a robust endpoint for serving, for example a SageMaker endpoint
• Auto-scaling with orchestrated deployments, multi-node on EC2, and other scenarios
• Canary deployments: test a new version of a model on a small subset of traffic before making it the default
• Shadow inference: deploy a new version of the model in parallel
• A/B testing of different versions of the model
MEASUREMENT
• Define model performance metrics, such as accuracy, while designing the AI service; these are use-case specific
• Add custom metrics as appropriate
• Use CloudWatch or Prometheus dashboards for monitoring model performance
• Model interpretability analysis via Captum
• Deploy with a feedback loop: if model accuracy drops over time or with a new version, analyze issues such as concept drift and stale data
FAIRNESS BY DESIGN
• Understand: how might the product's goals, its policy, and its implementation affect users from different subgroups? Identify contextual definitions of fairness
• Align: stakeholder conversations to find consensus and outline measurement and mitigation plans
• Measure: analyze model performance, label bias, outcomes, and other relevant signals
• Mitigate: address observed issues in datasets, models, policies, etc.
• Monitor: track the effect of mitigations on subgroups, and ensure the fairness analysis holds as the product adapts
CAPTUM
Model interpretability library for PyTorch: https://captum.ai/
(Example attribution visualization: text contributions 7.54, image contributions 11.19, total contributions 18.73.)
Support for attribution algorithms to interpret:
• Output predictions with respect to inputs
• Output predictions with respect to layers
• Neurons with respect to inputs
• Currently provides gradient- and perturbation-based approaches (e.g., Integrated Gradients)
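A small sketch of the Integrated Gradients workflow; the model and inputs here are made up for illustration:

import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Toy two-class classifier standing in for a real model
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
inputs = torch.rand(1, 4, requires_grad=True)

ig = IntegratedGradients(model)
attributions, delta = ig.attribute(inputs, target=1, return_convergence_delta=True)
print(attributions)   # per-feature contribution to the class-1 score
print(delta)          # convergence delta (approximation error)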
DYNABOARD & FLORES 101 WMT COMPETITION
http://www.statmt.org/wmt21/large-scale-multilingual-translation-task.html
https://github.com/facebookresearch/dynalab
https://dynabench.org/tasks/3#overall
COMMUNITY PROJECTS
https://github.com/cceyda/torchserve-dashboard
https://github.com/Unity-Technologies/SynthDet
https://medium.com/pytorch/how-wadhwani-ai-uses-pytorch-to-empower-cotton-farmers-14397f4c9f2b
FUTURE RELEASES
+ Improved memory and resource usage for better scalability
+ C++ backend for lower latency
+ Enhanced profiling tools
REFERENCES
• TorchServe: https://github.com/pytorch/serve
• Management API: https://github.com/pytorch/serve/blob/master/docs/management_api.md
• Inference API: https://github.com/pytorch/serve/blob/master/docs/inference_api.md
• Language translation ensemble example: https://github.com/pytorch/serve/tree/master/examples/Workflows/nmt_tranformers_pipeline
• BERT model example: https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers
• Model Zoo: https://github.com/pytorch/serve/blob/master/docs/model_zoo.md
• SnakeViz visualizations: https://github.com/pytorch/serve/tree/master/benchmarks#visualize-snakeviz-results
• Logging: https://github.com/pytorch/serve/blob/master/docs/logging.md
• Metrics: https://github.com/pytorch/serve/blob/master/docs/metrics.md
• Prometheus metrics: https://github.com/pytorch/serve/blob/master/docs/metrics_api.md
• Batch inference: https://github.com/pytorch/serve/blob/master/docs/batch_inference_with_ts.md
• Kubeflow Pipelines: https://github.com/kubeflow/pipelines/tree/master/components/PyTorch/pytorch-kfp-components
• Kubernetes support: https://github.com/pytorch/serve/blob/master/kubernetes/README.md
• TorchServe Dashboard (community): https://cceyda.github.io/blog/torchserve/streamlit/dashboard/2020/10/15/torchserve.html
• Custom handler community blog: https://towardsdatascience.com/deploy-models-and-create-custom-handlers-in-torchserve-fc2d048fbe91
• Captum interpretability for BERT models: https://github.com/pytorch/serve/blob/master/captum/Captum_visualization_for_bert.ipynb
• Operationalize, Scale and Infuse Trust in AI using KFServing: https://blog.kubeflow.org/release/official/2021/03/08/kfserving-0.5.html
QUESTIONS? Contact: Email: gchauhan@fb.com Linkedin: https://www.linkedin.com/in/geetachauhan/
