No improvements from TensorRT on NVIDIA-AI-IOT/tf_trt_models

dariusz.filipski · February 18, 2019, 1:25pm

I can’t get any improvements from TensorRT on Drive PX 2 AutoChauffeur (P2379, the one without dGPU). I simply cloned your Jetson example from https://github.com/NVIDIA-AI-IOT/tf_trt_models and created a benchmark.py script, which is little more than a copy-paste of https://github.com/NVIDIA-AI-IOT/tf_trt_models/blob/master/examples/detection/detection.ipynb. Since the Jetson TX2 has specs similar to one node of my Drive PX 2, I expected values and improvements similar to those shown in the models table (https://github.com/NVIDIA-AI-IOT/tf_trt_models#models-1).
Unfortunately, in my case I see no difference in inference speed between the original models and TensorRT ones (I could even argue there’s a slight drop in performance). Here’s what I see (full logs below):

ssd_mobilenet_v1_coco: Original - 0.051792s, TRT - 0.053618s
ssd_mobilenet_v2_coco: Original - 0.084560s, TRT - 0.093455s
ssd_inception_v2_coco: Original - 0.100977s, TRT - 0.106853s
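To put a number on the "slight drop", here is a minimal sketch that turns the timings above into a relative slowdown. The helper name `slowdown_percent` is just something I made up for illustration; the figures are the ones reported above.

```python
# Average per-image runtimes in seconds (original, TRT), copied from the
# measurements reported in this post.
timings = {
    "ssd_mobilenet_v1_coco": (0.051792, 0.053618),
    "ssd_mobilenet_v2_coco": (0.084560, 0.093455),
    "ssd_inception_v2_coco": (0.100977, 0.106853),
}


def slowdown_percent(original_s, trt_s):
    """Hypothetical helper: positive means TRT is slower, negative means faster."""
    return (trt_s / original_s - 1.0) * 100.0


slowdowns = {name: slowdown_percent(o, t) for name, (o, t) in timings.items()}

for name, pct in slowdowns.items():
    print("{}: TRT runtime is {:+.1f}% relative to the original".format(name, pct))
```

All three values come out positive, i.e. the TRT graph is slower in every case, with ssd_mobilenet_v2_coco showing the largest gap.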

Taking a closer look, it seems that the TRT conversion slims the graph down by ~1000 nodes but fails to place any of them in a TRTEngineOp:

ssd_mobilenet_v1_coco: Original - 7571 nodes, TRT - 6518 nodes out of which 0 are TRTEngineOp
ssd_mobilenet_v2_coco: Original - 8062 nodes, TRT - 6865 nodes out of which 0 are TRTEngineOp
ssd_inception_v2_coco: Original - 8278 nodes, TRT - 7015 nodes out of which 0 are TRTEngineOp
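A per-op-type histogram gives a bit more detail than the two totals. The sketch below is only an illustration: `op_histogram` is a hypothetical helper, and it runs on stand-in objects so it does not need TensorFlow. With a real graph I would pass `trt_graph.node`, whose entries expose the same `.op` attribute used in benchmark.py.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(nodes):
    """Count nodes per op type; `nodes` is any iterable of objects with an `.op` attribute."""
    return Counter(str(n.op) for n in nodes)


# Stand-in nodes (not real graph data) so the sketch runs without TensorFlow.
fake_nodes = [SimpleNamespace(op=o) for o in ("Conv2D", "Conv2D", "FusedBatchNorm", "Relu6")]

hist = op_histogram(fake_nodes)
print(hist.most_common(3))
# A Counter returns 0 for missing keys, so this is safe even when no engine was built.
print("TRTEngineOp nodes:", hist["TRTEngineOp"])
```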

I see errors like the following one in the logs as well:

Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
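For anyone who wants to pull these failures out of a long log automatically, here is a rough sketch. The regular expression is modeled on the single warning quoted above, so it is an assumption that other TF-TRT versions word the message the same way; `find_engine_failures` is a hypothetical helper name.

```python
import re

# Pattern modeled on the one warning quoted in this post; the wording may
# differ in other TensorFlow/TensorRT versions.
ENGINE_FAILURE = re.compile(
    r"Engine (?P<engine>\S+) creation for segment (?P<segment>\d+), "
    r"composed of (?P<nodes>\d+) nodes failed: (?P<reason>.*?), "
    r"at: (?P<node>\S+?)\. Skipping"
)


def find_engine_failures(log_text):
    """Return one dict per failed engine found in `log_text`."""
    return [m.groupdict() for m in ENGINE_FAILURE.finditer(log_text)]


sample = (
    "Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: "
    "Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: "
    "FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping..."
)

failures = find_engine_failures(sample)
for f in failures:
    print("{engine}: segment {segment}, {nodes} nodes, failing node {node}".format(**f))
```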

What’s wrong? How can I make TensorRT work?

My configuration

TensorFlow 1.12.0 built from sources with TRT support.
Protobuf updated according to https://devtalk.nvidia.com/default/topic/1046492/tensorrt/extremely-long-time-to-load-trt-optimized-frozen-tf-graphs/post/5315675/#5315675

$ protoc --version
libprotoc 3.6.1

TensorRT config:

$ dpkg -l | grep nvinfer
ii  libnvinfer-dev      4.1.1-1+cuda9.2  arm64  TensorRT development libraries and headers
ii  libnvinfer-samples  4.1.1-1+cuda9.2  arm64  TensorRT samples and documentation
ii  libnvinfer4         4.1.1-1+cuda9.2  arm64  TensorRT runtime libraries

GPU data:

$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version 9.2 / 9.2
  CUDA Capability Major/Minor version number: 6.2
  Total amount of global memory: 6402 MBytes (6712545280 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
  GPU Max Clock rate: 1275 MHz (1.27 GHz)
  Memory Clock rate: 1600 Mhz
  Memory Bus Width: 128-bit
  L2 Cache Size: 524288 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: Yes
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS

benchmark.py:

import argparse
from PIL import Image
import sys
import os
import urllib
import tensorflow.contrib.tensorrt as trt
#import matplotlib
#matplotlib.use('Agg')
#import matplotlib.pyplot as plt
#import matplotlib.patches as patches
import tensorflow as tf
import numpy as np
import time
from tf_trt_models.detection import download_detection_model, build_detection_graph

MODEL = 'ssd_inception_v2_coco'
DATA_DIR = './data/'
IMAGE_PATH = './examples/detection/data/huskies.jpg'


def parse_args():
    """Parse input arguments."""
    desc = ('TRT benchmark')
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--model', dest='model',
                        help='name of the object detecion model [{}]'.format(MODEL),
                        default=MODEL, type=str)
    parser.add_argument('--trt', dest='use_trt',
                        help='build and test TensorRT model',
                        action='store_true')
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    print('Called with args: {}'.format(args))

    CONFIG_FILE = args.model + '.config'  # ./data/ssd_inception_v2_coco.config
    CHECKPOINT_FILE = 'model.ckpt'  # ./data/ssd_inception_v2_coco/model.ckpt

    config_path, checkpoint_path = download_detection_model(args.model, 'data')

    frozen_graph, input_names, output_names = build_detection_graph(
        config=config_path,
        checkpoint=checkpoint_path,
        score_threshold=0.3,
        batch_size=1
    )

    print('Model: {}'.format(args.model))
    print(output_names)
    print('Total nodes in the original graph: {}'.format(len([1 for n in frozen_graph.node])))

    if args.use_trt:
        trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=1,
            max_workspace_size_bytes=1 << 25,
            precision_mode='FP16',
            minimum_segment_size=50
        )
        all_nodes = len([1 for n in trt_graph.node])
        trt_engine_nodes = len([1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp'])
        print('Total nodes in the optimized graph: {} out of which {} are TRTEngineOp'.format(all_nodes, trt_engine_nodes))

    print('Creating the session')
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True
    tf_sess = tf.Session(config=tf_config)

    if args.use_trt:
        print('Running with TRT model')
        tf.import_graph_def(trt_graph, name='')
    else:
        print('Running with ORIGINAL model')
        tf.import_graph_def(frozen_graph, name='')

    tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
    tf_scores = tf_sess.graph.get_tensor_by_name('detection_scores:0')
    tf_boxes = tf_sess.graph.get_tensor_by_name('detection_boxes:0')
    tf_classes = tf_sess.graph.get_tensor_by_name('detection_classes:0')
    tf_num_detections = tf_sess.graph.get_tensor_by_name('num_detections:0')

    image = Image.open(IMAGE_PATH)
    image_resized = np.array(image.resize((300, 300)))
    image = np.array(image)

    print('Running the inference on a single image to warm up the net')
    t0 = time.time()
    scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
        tf_input: image_resized[None, ...]
    })
    t1 = time.time()
    print('Runtime: {:.2f} seconds'.format(t1 - t0))

    boxes = boxes[0]  # index by 0 to remove batch dimension
    scores = scores[0]
    classes = classes[0]
    num_detections = num_detections[0]

    print('Running the benchmark')
    num_samples = 50
    t0 = time.time()
    for i in range(num_samples):
        scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
            tf_input: image_resized[None, ...]
        })
    t1 = time.time()
    print('Average runtime: %f seconds' % (float(t1 - t0) / num_samples))

    tf_sess.close()


if __name__ == '__main__':
    main()
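The script above reports only the mean over 50 runs, which makes it hard to tell a few-percent difference from noise. A possible refinement is to keep the per-iteration durations and look at the median and spread as well. This is only a sketch: `benchmark` and `summarize` are hypothetical helpers, and the lambda is a stand-in workload in place of the `tf_sess.run(...)` call.

```python
import statistics
import time


def benchmark(fn, warmup=1, iterations=50):
    """Call `fn()` `warmup` times untimed, then return per-call durations in seconds."""
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Collapse a list of durations into a few summary statistics."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


# Stand-in workload; in benchmark.py this would wrap the tf_sess.run(...) call.
durations = benchmark(lambda: sum(range(10000)), warmup=2, iterations=20)
stats = summarize(durations)
print("mean {mean:.6f}s, median {median:.6f}s, min {min:.6f}s, max {max:.6f}s".format(**stats))
```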

FULL LOGS

original ssd_mobilenet_v1_coco:

$ python3 benchmark.py --model ssd_mobilenet_v1_coco
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=False)
--2019-02-18 04:37:35-- http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:4005:80a::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76541073 (73M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’

data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz 100%[============================================================================================================================================================================>] 73.00M 10.9MB/s in 6.8s

2019-02-18 04:37:42 (10.7 MB/s) - ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’ saved [76541073/76541073]

2019-02-18 04:37:44.136212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:37:44.136411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.42GiB
2019-02-18 04:37:44.136531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:37:46.326887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:37:46.327035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:37:46.327095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:37:46.328349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:38:20.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:20.291834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:20.291934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:20.291984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:20.292106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:30.064116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:30.064307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:30.064367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:30.064408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:30.064522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:38:33.110725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:33.110866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:33.110909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:33.110947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:33.111053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
Creating the session
2019-02-18 04:38:40.770776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:38:40.770951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:38:40.771024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:38:40.771075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:38:40.771218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:38:59.904909: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.059551: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.245903: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.595751: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 15.87 seconds
Running the benchmark
Average runtime: 0.051792 seconds

ssd_mobilenet_v1_coco with TRT:

$ python3 benchmark.py --model ssd_mobilenet_v1_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=True)
2019-02-18 04:45:21.845684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:45:21.845872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.14GiB
2019-02-18 04:45:21.845993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:23.202972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:23.203113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:45:23.203161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:45:23.203376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:45:57.904372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:45:57.904526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:45:57.904569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:45:57.904611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:45:57.904770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:07.725793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:07.725949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:07.725994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:07.726033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:07.726171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:10.766851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:10.766994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:10.767038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:10.767077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:10.767207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
2019-02-18 04:46:26.024094: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:46:26.030507: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:46:26.037184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:26.037433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:26.037485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:26.037524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:26.037659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:32.355050: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:32.355353: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:32.494402: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:46:32.494563: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:46:36.095038: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:36.095299: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:36.370087: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:46:37.200207: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.444711: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.532859: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:46:37.533062: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6503 nodes (-1068), 8572 edges (-1676), time = 2288.70093ms.
2019-02-18 04:46:37.533107: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 6518 nodes (15), 8598 edges (26), time = 761.058ms.
2019-02-18 04:46:37.533146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3022.36499ms.
2019-02-18 04:46:37.533186: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6518 nodes (0), 8598 edges (0), time = 802.444ms.
2019-02-18 04:46:37.533224: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3086.44897ms.
2019-02-18 04:46:37.533343: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:46:37.533435: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 267.055ms.
2019-02-18 04:46:37.533475: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 435 nodes (0), 503 edges (0), time = 155.322ms.
2019-02-18 04:46:37.533512: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.109ms.
2019-02-18 04:46:37.533549: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 217.434ms.
2019-02-18 04:46:37.533584: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.192ms.
Total nodes in the optimized graph: 6518 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:46:38.540298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:38.540442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:38.540527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:38.540569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:38.540681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:46:55.954166: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.109469: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.295034: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.642952: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 13.62 seconds
Running the benchmark
Average runtime: 0.053618 seconds
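The Grappler lines in this log already show which pass changed the node count. A small sketch to extract them is below; the pattern is an assumption based on the "Graph size after" lines in this particular log, and `summarize_passes` is a hypothetical helper name.

```python
import re

# Pattern modeled on the meta_optimizer.cc "Graph size after" lines in the log
# above; other TensorFlow versions may format them differently.
PASS_LINE = re.compile(
    r"\] (?P<name>[\w ]+): Graph size after: (?P<nodes>\d+) nodes \((?P<delta>-?\d+)\)"
)


def summarize_passes(log_text):
    """Return (pass name, node count after, node delta) for each optimizer line."""
    return [
        (m.group("name"), int(m.group("nodes")), int(m.group("delta")))
        for m in PASS_LINE.finditer(log_text)
    ]


# Three lines copied from the ssd_mobilenet_v1_coco TRT log.
sample = "\n".join([
    "2019-02-18 04:46:37.533062: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] "
    "constant folding: Graph size after: 6503 nodes (-1068), 8572 edges (-1676), time = 2288.70093ms.",
    "2019-02-18 04:46:37.533107: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] "
    "layout: Graph size after: 6518 nodes (15), 8598 edges (26), time = 761.058ms.",
    "2019-02-18 04:46:37.533146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] "
    "TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3022.36499ms.",
])

passes = summarize_passes(sample)
for name, nodes, delta in passes:
    print("{:<20} {:>6} nodes ({:+d})".format(name, nodes, delta))
```

On these lines the whole reduction (7571 to 6503 nodes) comes from constant folding, while the TensorRTOptimizer pass reports a delta of 0, which matches the 0 TRTEngineOp count printed by the script.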

original ssd_mobilenet_v2_coco:

$ python3 benchmark.py --model ssd_mobilenet_v2_coco
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=False)
--2019-02-18 04:47:46-- http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz
Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:400f:806::2010
Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 187925923 (179M) [application/x-tar]
Saving to: ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’

data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz 100%[============================================================================================================================================================================>] 179.22M 10.3MB/s in 18s

2019-02-18 04:48:04 (10.1 MB/s) - ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’ saved [187925923/187925923]

2019-02-18 04:48:08.325486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:48:08.325631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 3.74GiB
2019-02-18 04:48:08.325694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:09.640993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:09.641157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:48:09.641197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:48:09.641522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:48:48.706244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:48:48.706398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:48:48.706442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:48:48.706478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:48:48.706590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:00.113150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:00.113377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:00.113423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:49:00.113471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:49:00.113610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:49:04.441489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:04.441627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:04.441673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:49:04.441713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:49:04.441841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
Creating the session
2019-02-18 04:49:13.197948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:49:13.198120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:49:13.198175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:49:13.198224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:49:13.198413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:49:38.590984: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:49:38.607152: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 20.24 seconds
Running the benchmark
Average runtime: 0.084560 seconds

ssd_mobilenet_v2_coco with TRT:

$ python3 benchmark.py --model ssd_mobilenet_v2_coco --trt
Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=True)
2019-02-18 04:50:30.934503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:50:30.934779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.41GiB
2019-02-18 04:50:30.934939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:50:32.242912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:50:32.243053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:50:32.243093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:50:32.243487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:51:11.865247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:11.865473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:11.865536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:11.865573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:11.865702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:23.567995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:23.568248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:23.568312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:23.568349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:23.568464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:27.799623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:27.799797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:27.799840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:27.799877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:27.799983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
2019-02-18 04:51:45.670572: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:51:45.676597: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:51:45.680622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:45.680794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:45.680839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:45.680884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:45.681008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:53.469662: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:53.470062: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true 2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping... 2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph 2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op! 2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists. 2019-02-18 04:52:00.692498: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects. 2019-02-18 04:52:01.438146: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects. 2019-02-18 04:52:01.613665: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph 2019-02-18 04:52:01.613843: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6850 nodes (-1212), 8953 edges (-1820), time = 2828.16ms. 2019-02-18 04:52:01.613888: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 6865 nodes (15), 8979 edges (26), time = 835.079ms. 
2019-02-18 04:52:01.613927: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 3860.146ms. 2019-02-18 04:52:01.613970: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6865 nodes (0), 8979 edges (0), time = 994.512ms. 2019-02-18 04:52:01.614012: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 4454.34082ms. 2019-02-18 04:52:01.614078: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment 2019-02-18 04:52:01.614118: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 624.612ms. 2019-02-18 04:52:01.614207: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 781 nodes (0), 883 edges (0), time = 396.25ms. 2019-02-18 04:52:01.614248: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 49.671ms. 2019-02-18 04:52:01.614286: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 695.233ms. 2019-02-18 04:52:01.614322: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 53.073ms. 
Total nodes in the optimized graph: 6865 out of which 0 are TRTEngineOp Creating the session 2019-02-18 04:52:03.094518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:52:03.094648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:52:03.094692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:52:03.094730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:52:03.094885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Running with TRT model Running the inference on a single image to warm up the net 2019-02-18 04:52:38.719800: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. Runtime: 31.43 seconds Running the benchmark Average runtime: 0.093455 seconds

original ssd_inception_v2_coco:

$ python3 benchmark.py --model ssd_inception_v2_coco Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=False) 2019-02-18 04:10:13.974149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero 2019-02-18 04:10:13.974348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275 pciBusID: 0000:00:00.0 totalMemory: 6.25GiB freeMemory: 4.29GiB 2019-02-18 04:10:13.974412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:10:15.398904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:10:15.399058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:10:15.399103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:10:15.399360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. 
Instructions for updating: Please switch to tf.train.get_or_create_global_step 2019-02-18 04:10:58.050991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:10:58.051141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:10:58.051184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:10:58.051231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:10:58.051349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:11:10.699534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:11:10.699689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:11:10.699743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:11:10.699792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:11:10.699910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:11:15.841888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:11:15.842028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:11:15.842070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:11:15.842106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:11:15.842217: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Model: ssd_inception_v2_coco ['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections'] Total nodes in the original graph: 8278 Creating the session 2019-02-18 04:11:26.078042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:11:26.078225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:11:26.078275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:11:26.078319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:11:26.078547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Running with ORIGINAL model Running the inference on a single image to warm up the net 2019-02-18 04:11:57.791649: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. Runtime: 26.46 seconds Running the benchmark Average runtime: 0.100977 seconds

ssd_inception_v2_coco with TRT:

nvidia@dpx2tegraa-lund:~/dariusz/projects/nvidia/tf_trt_models$ python3 benchmark.py --model ssd_inception_v2_coco --trt Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True) 2019-02-18 04:18:05.555255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero 2019-02-18 04:18:05.555595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275 pciBusID: 0000:00:00.0 totalMemory: 6.25GiB freeMemory: 4.54GiB 2019-02-18 04:18:05.555680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:18:07.756886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:18:07.757042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:18:07.757086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:18:07.757460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. 
Instructions for updating: Please switch to tf.train.get_or_create_global_step 2019-02-18 04:18:50.782544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:18:50.782700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:18:50.782752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:18:50.782788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:18:50.782916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:19:03.670306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:19:03.670446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:19:03.670493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:19:03.670534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:19:03.670669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:19:08.911273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:19:08.911411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:19:08.911473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:19:08.911513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:19:08.911773: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Model: ssd_inception_v2_coco ['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections'] Total nodes in the original graph: 8278 2019-02-18 04:19:29.284995: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0 2019-02-18 04:19:29.290733: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session 2019-02-18 04:19:29.295736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:19:29.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:19:29.295956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:19:29.295996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:19:29.296253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:19:38.501962: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph 2019-02-18 04:19:38.502486: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op! 
2019-02-18 04:19:38.903763: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true 2019-02-18 04:19:38.903994: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping... 2019-02-18 04:19:45.094916: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph 2019-02-18 04:19:45.095409: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op! 2019-02-18 04:19:46.228436: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true 2019-02-18 04:19:46.228742: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping... 2019-02-18 04:19:48.180974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects. 2019-02-18 04:19:48.992334: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects. 
2019-02-18 04:19:49.170470: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph 2019-02-18 04:19:49.170642: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 3424.68ms. 2019-02-18 04:19:49.170691: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 979.995ms. 2019-02-18 04:19:49.170731: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 7025 nodes (0), 9207 edges (0), time = 4509.50488ms. 2019-02-18 04:19:49.170900: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 7015 nodes (-10), 9207 edges (0), time = 1884.54199ms. 2019-02-18 04:19:49.170941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 7015 nodes (0), 9207 edges (0), time = 5634.18701ms. 2019-02-18 04:19:49.170979: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment 2019-02-18 04:19:49.171017: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 784.723ms. 2019-02-18 04:19:49.171070: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Invalid argument: The graph is already optimized by layout optimizer. 2019-02-18 04:19:49.171108: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 53.572ms. 2019-02-18 04:19:49.171146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 757.245ms. 
2019-02-18 04:19:49.171181: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 54.12ms. Total nodes in the optimized graph: 7015 out of which 0 are TRTEngineOp Creating the session 2019-02-18 04:19:51.273352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:19:51.273486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:19:51.273532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:19:51.273572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:19:51.273691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Running with TRT model Running the inference on a single image to warm up the net 2019-02-18 04:20:35.683734: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. Runtime: 38.11 seconds Running the benchmark Average runtime: 0.106853 seconds
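
For reference, the "Total nodes ... out of which 0 are TRTEngineOp" line can be produced by tallying nodes by op type. The following is only a minimal sketch, not the actual benchmark.py code: `count_ops` and the stand-in `Node` tuple are hypothetical names, and the sketch assumes each node exposes an `op` attribute, as the entries of a TensorFlow `GraphDef.node` list do.

```python
from collections import Counter, namedtuple

# Stand-in for a GraphDef node (assumption: real nodes also expose `.op`).
Node = namedtuple("Node", ["name", "op"])


def count_ops(nodes):
    """Return a Counter mapping op type -> number of nodes with that op."""
    return Counter(node.op for node in nodes)


# Tiny illustrative graph; with a real graph, pass graph_def.node instead.
nodes = [
    Node("conv", "Conv2D"),
    Node("bn", "FusedBatchNorm"),
    Node("engine", "TRTEngineOp"),
]
counts = count_ops(nodes)
print("Total nodes: {} out of which {} are TRTEngineOp".format(
    len(nodes), counts["TRTEngineOp"]))
```

A count of zero after conversion, as in the logs above, means every candidate segment fell back to native TensorFlow ops.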

dariusz.filipski · February 18, 2019, 2:48pm

Some additional information: I enabled Python logging by adding the following code to the very beginning of main() in benchmark.py:

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s.%(msecs)03d %(levelname)-8s %(threadName)-10s %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
    handlers=[
        logging.FileHandler('benchmark.log', 'w'),  # mode 'w' for overwrite, 'a' for append
        logging.StreamHandler(sys.stdout)
    ])
logger = logging.getLogger(__name__)
# Ask tensorflow logger not to propagate logs to parent (which causes
# duplicated logging)
logging.getLogger('tensorflow').propagate = False

I see that TensorFlow claims it runs against TensorRT version 4.0.0, even though I have version 4.1.1 installed (see above for environment details). TensorFlow was built on the very same machine with no changes to TensorRT whatsoever.

Total nodes in the original graph: 8062
INFO:tensorflow:Running against TensorRT version 4.0.0
2019-02-18 06:38:11.224120: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 06:38:11.233967: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 06:38:11.234015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 06:38:11.234055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 06:38:11.234180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3849 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)

As you can see, it also claims that the number of eligible GPUs is zero, yet it still creates a TensorFlow device.
I tried the same with

export TF_MIN_GPU_MULTIPROCESSOR_COUNT=2

but there was no difference.

Do the TensorRT version mismatch and the "no eligible GPU" message matter in this case?

dariusz.filipski · February 19, 2019, 1:50pm

Another question on this: looking closely at the logs, one can see errors that cause the creation of my_trt_op_0 to be skipped:

2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.

Why does this happen, and how can I avoid it?
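
When comparing long logs like these, it can help to pull out only the engine-creation results. The following is a minimal sketch under stated assumptions: `engine_results` is a hypothetical helper, the regular expression is modelled only on the two message shapes that appear in this thread ("... nodes failed: ..." and "... nodes succeeded."), and the sample text combines one line from the failing run and one from a succeeding run, with timestamps dropped.

```python
import re

# Pattern based on the convert_graph.cc messages quoted in this thread.
ENGINE_RE = re.compile(
    r"Engine (?P<engine>\S+) creation for segment (?P<segment>\d+), "
    r"composed of (?P<nodes>\d+) nodes (?P<status>failed|succeeded)"
)


def engine_results(log_text):
    """Return (engine, segment, node_count, status) tuples found in a log."""
    return [
        (m.group("engine"), int(m.group("segment")),
         int(m.group("nodes")), m.group("status"))
        for m in ENGINE_RE.finditer(log_text)
    ]


sample = (
    "W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 "
    "creation for segment 0, composed of 780 nodes failed: Internal: "
    "TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: "
    "FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...\n"
    "I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 "
    "creation for segment 0, composed of 931 nodes succeeded.\n"
)
results = engine_results(sample)
for row in results:
    print(row)
```

In practice the function would be fed the contents of the saved log file (for example, the benchmark.log written by the logging setup above).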

NVES · February 21, 2019, 9:46pm

Hello,

Using benchmark.py on a TX2 with JetPack 3.3, I’m seeing performance improvements with TRT compared with plain TF:

Model	TF	TRT
ssd_mobilenet_v1_coco	0.049736	0.036951
ssd_mobilenet_v2_coco	0.102131	0.042651
ssd_inception_v2_coco	0.1101	0.040059
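
To put these figures in perspective, the speedup of TRT over TF can be computed as the ratio of the two runtimes. This is a small illustrative sketch: the numbers are copied from the results above and are assumed to be average runtimes in seconds (the unit benchmark.py prints); the `speedup` helper is a hypothetical name.

```python
# (TF runtime, TRT runtime) per model, copied from the results above.
# Assumption: values are average runtimes in seconds.
runtimes = {
    "ssd_mobilenet_v1_coco": (0.049736, 0.036951),
    "ssd_mobilenet_v2_coco": (0.102131, 0.042651),
    "ssd_inception_v2_coco": (0.1101, 0.040059),
}


def speedup(tf_time, trt_time):
    """Return how many times faster the TRT run is than the TF run."""
    return tf_time / trt_time


speedups = {model: speedup(*times) for model, times in runtimes.items()}
for model in sorted(speedups):
    print("{}: {:.2f}x".format(model, speedups[model]))
```

A ratio close to 1.0, or below it, would correspond to the "no improvement" situation described in the original question.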

MY GPU

./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version 9.0 / 9.0
  CUDA Capability Major/Minor version number: 6.2
  Total amount of global memory: 7846 MBytes (8227401728 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
  GPU Max Clock rate: 1301 MHz (1.30 GHz)
  Memory Clock rate: 1600 Mhz
  Memory Bus Width: 128-bit
  L2 Cache Size: 524288 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: Yes
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

TRT config

nvidia@tegra-ubuntu:/usr/local/cuda/samples/1_Utilities/deviceQuery$ dpkg -l | grep nvinfer
ii libnvinfer-dev 4.1.3-1+cuda9.0 arm64 TensorRT development libraries and headers
ii libnvinfer-samples 4.1.3-1+cuda9.0 arm64 TensorRT samples and documentation
ii libnvinfer4 4.1.3-1+cuda9.0 arm64 TensorRT runtime libraries
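
To compare installed TensorRT packages between machines, the `dpkg -l | grep nvinfer` output above can be reduced to a name-to-version mapping. This is a minimal sketch, assuming the usual `dpkg -l` column order (status, name, version, architecture, description); `parse_dpkg_versions` is a hypothetical helper, and the sample text is copied from the listing above.

```python
def parse_dpkg_versions(dpkg_output):
    """Map package name -> version for installed ('ii') lines of `dpkg -l`."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Expect at least: status, package name, version.
        if len(fields) >= 3 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


sample = """\
ii libnvinfer-dev 4.1.3-1+cuda9.0 arm64 TensorRT development libraries and headers
ii libnvinfer-samples 4.1.3-1+cuda9.0 arm64 TensorRT samples and documentation
ii libnvinfer4 4.1.3-1+cuda9.0 arm64 TensorRT runtime libraries
"""
versions = parse_dpkg_versions(sample)
print(versions["libnvinfer4"])
```

The resulting version string can then be compared by eye with the "Running against TensorRT version ..." line that TensorFlow logs.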

TensorFlow config

root@tegra-ubuntu:/home/scratch.zhenyih_sw/jetson/tf_trt_models# python
Python 2.7.12 (default, Nov 12 2018, 14:36:49)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>> tensorflow.__version__
'1.11.0'

The "no eligible GPU" message is expected. I’m not seeing the op-skipping messages that you are seeing.
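
A quick way to tell from a log whether the TensorRTOptimizer pass actually replaced anything is to look at the node delta it reports in the grappler summary. The following is only a sketch: `pass_deltas` is a hypothetical helper, the regular expression is modelled on the meta_optimizer.cc summary lines seen in this thread, and the two sample lines are taken from the failing run and from my run, with timestamps dropped.

```python
import re

# Matches summary lines such as
# "... meta_optimizer.cc:406] TensorRTOptimizer: Graph size after: 6095 nodes (-930), ..."
PASS_RE = re.compile(
    r"\] (?P<name>[\w ]+): Graph size after: "
    r"(?P<nodes>\d+) nodes \((?P<delta>-?\d+)\)"
)


def pass_deltas(log_text):
    """Return (pass_name, nodes_after, node_delta) for each summary line."""
    return [
        (m.group("name"), int(m.group("nodes")), int(m.group("delta")))
        for m in PASS_RE.finditer(log_text)
    ]


sample = (
    "I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: "
    "Graph size after: 6865 nodes (0), 8979 edges (0), time = 3860.146ms.\n"
    "I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] TensorRTOptimizer: "
    "Graph size after: 6095 nodes (-930), 8096 edges (-1111), time = 111755.695ms.\n"
)
deltas = pass_deltas(sample)
for name, nodes, delta in deltas:
    print("{}: {} nodes ({:+d})".format(name, nodes, delta))
```

A delta of 0 for TensorRTOptimizer matches the case where no TRTEngineOp was created, while a large negative delta indicates that a segment was folded into an engine.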

root@tegra-ubuntu:/home/scratch.zhenyih_sw/jetson/tf_trt_models# python benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-21 21:28:16.221037: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-21 21:28:16.221172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.66GiB freeMemory: 2.57GiB
2019-02-21 21:28:16.221230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:28:17.428181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:28:17.428280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-02-21 21:28:17.428309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-02-21 21:28:17.428595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python2.7/site-packages/object_detection-0.1-py2.7.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-21 21:29:11.710866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:29:11.711111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:29:11.711154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-02-21 21:29:11.711184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-02-21 21:29:11.711289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:30:03.883130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:30:03.883377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:30:03.883438: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-02-21 21:30:03.883470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-02-21 21:30:03.883586: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:30:18.984016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:30:18.984116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:30:18.984149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-02-21 21:30:18.984172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-02-21 21:30:18.984279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-21 21:31:22.059748: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-21 21:31:22.060275: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-21 21:31:22.060798: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:31:22.060938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:31:22.060997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-02-21 21:31:22.061025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-02-21 21:31:22.061138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-21 21:31:34.270468: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2952] Segment @scope '', converted to graph
2019-02-21 21:31:34.270783: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-21 21:33:22.447185: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:952] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes succeeded.
2019-02-21 21:33:27.242045: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-21 21:33:27.841326: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-21 21:33:28.027736: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:404] Optimization results for grappler item: tf_graph
2019-02-21 21:33:28.027989: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 2841.95898ms.
2019-02-21 21:33:28.028026: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 754.749ms.
2019-02-21 21:33:28.028052: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] TensorRTOptimizer: Graph size after: 6095 nodes (-930), 8096 edges (-1111), time = 111755.695ms.
2019-02-21 21:33:28.028081: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] constant folding: Graph size after: 6085 nodes (-10), 8096 edges (0), time = 1223.552ms.
2019-02-21 21:33:28.028106: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] TensorRTOptimizer: Graph size after: 6085 nodes (0), 8096 edges (0), time = 2281.38696ms.
2019-02-21 21:33:28.028129: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:404] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-21 21:33:28.028154: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 566.961ms.
2019-02-21 21:33:28.028183: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-21 21:33:28.028215: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 57.022ms.
2019-02-21 21:33:28.028240: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 541.715ms.
2019-02-21 21:33:28.028264: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:406] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 55.732ms.
Total nodes in the optimized graph: 6085 out of which 1 are TRTEngineOp
Creating the session
2019-02-21 21:35:02.447477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-02-21 21:35:02.447599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-21 21:35:02.447636: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-02-21 21:35:02.447668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-02-21 21:35:02.447782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2020 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
Runtime: 11.35 seconds
Running the benchmark
Average runtime: 0.040059 seconds
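To compare runs like the one above without reading the whole log, the engine-creation results and the summary lines can be pulled out with a few regular expressions. This is only a minimal sketch and not part of tf_trt_models: `summarize_log` is a hypothetical helper, and the patterns assume the message formats seen in the logs in this thread. The "Total nodes" and "Average runtime" lines are printed by benchmark.py, so they would need adjusting for a different script.

```python
import re

# Patterns assume the message formats seen in this thread's logs.
ENGINE_RE = re.compile(
    r"Engine (\S+) creation for segment (\d+), "
    r"composed of (\d+) nodes (succeeded|failed)"
)
TOTAL_RE = re.compile(
    r"Total nodes in the optimized graph: (\d+) out of which (\d+) are TRTEngineOp"
)
RUNTIME_RE = re.compile(r"Average runtime: ([0-9.]+) seconds")


def summarize_log(text):
    """Hypothetical helper: extract TF-TRT conversion results from a log string."""
    engines = [
        {
            "name": m.group(1),
            "segment": int(m.group(2)),
            "nodes": int(m.group(3)),
            "ok": m.group(4) == "succeeded",
        }
        for m in ENGINE_RE.finditer(text)
    ]
    total = TOTAL_RE.search(text)
    runtime = RUNTIME_RE.search(text)
    return {
        "engines": engines,
        "total_nodes": int(total.group(1)) if total else None,
        "trt_engine_ops": int(total.group(2)) if total else None,
        "avg_runtime_s": float(runtime.group(1)) if runtime else None,
    }
```

Running it over the output of a working setup and a failing one makes the difference easy to spot: a failed conversion shows engines with `ok` set to `False` and a `trt_engine_ops` count of 0.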

I’d recommend reflashing your Xavier to resolve any mixed TensorRT version issues, then following the instructions at https://github.com/NVIDIA-AI-IOT/tf_trt_models.

