I can’t get any improvement from TensorRT on a Drive PX 2 AutoChauffeur (P2379, the one without a dGPU). I simply cloned your Jetson example from https://github.com/NVIDIA-AI-IOT/tf_trt_models and created a benchmark.py script, which is little more than a copy-paste of https://github.com/NVIDIA-AI-IOT/tf_trt_models/blob/master/examples/detection/detection.ipynb. Since the Jetson TX2 has specs similar to one node of my Drive PX 2, I expected timings and improvements similar to those shown in the table at https://github.com/NVIDIA-AI-IOT/tf_trt_models#models-1
Unfortunately, in my case I see no improvement in inference speed between the original models and the TensorRT ones; if anything, the TensorRT graphs are slightly slower (by about 3.5% to 10.5% in the numbers below). Here’s what I see (full logs below):
ssd_mobilenet_v1_coco: Original - 0.051792s, TRT - 0.053618s
ssd_mobilenet_v2_coco: Original - 0.084560s, TRT - 0.093455s
ssd_inception_v2_coco: Original - 0.100977s, TRT - 0.106853s

Taking a closer look, it seems that TRT slims down the graph by ~1000 nodes but fails to place anything into a TRTEngineOp:
ssd_mobilenet_v1_coco: Original - 7571 nodes, TRT - 6518 nodes out of which 0 are TRTEngineOp
ssd_mobilenet_v2_coco: Original - 8062 nodes, TRT - 6865 nodes out of which 0 are TRTEngineOp
ssd_inception_v2_coco: Original - 8278 nodes, TRT - 7015 nodes out of which 0 are TRTEngineOp

I see errors like the following one in the logs as well:
Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...

What’s wrong? How can I make TensorRT work?
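One possible way to dig further into the node counts above is to tally the op types in each graph, which shows whether any TRTEngineOp nodes exist and which ops make up the rest. This is only a sketch: `op_histogram` and `summarize` are hypothetical helper names (not part of tf_trt_models or TensorFlow), and the stand-in nodes in the demo are made up for illustration. The only assumption about real graphs is that each entry of `graph_def.node` has an `.op` string, as the benchmark script already relies on.

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram(nodes):
    """Count how many nodes of each op type appear in an iterable of nodes.

    Each node only needs an `.op` attribute, which is what the entries of
    a TensorFlow `graph_def.node` list expose.
    """
    return Counter(str(node.op) for node in nodes)


def summarize(nodes, engine_op='TRTEngineOp', top=5):
    """Return (total node count, engine-op count, most common op types)."""
    hist = op_histogram(nodes)
    return sum(hist.values()), hist[engine_op], hist.most_common(top)


if __name__ == '__main__':
    # Stand-in nodes for illustration only; with the script from this post
    # one would pass `trt_graph.node` or `frozen_graph.node` instead.
    fake_nodes = [SimpleNamespace(op=name) for name in
                  ['Const', 'Const', 'Conv2D', 'FusedBatchNorm', 'Relu6']]
    total, engines, common = summarize(fake_nodes)
    print('{} nodes, {} TRTEngineOp'.format(total, engines))
    for op, count in common:
        print('  {:>4}  {}'.format(count, op))
```

Comparing the histogram of the original graph with that of the converted one would show which op types disappeared during conversion, and confirm directly whether anything was replaced by a TRTEngineOp.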
My configuration
TensorFlow 1.12.0 built from sources with TRT support.
Protobuf updated according to https://devtalk.nvidia.com/default/topic/1046492/tensorrt/extremely-long-time-to-load-trt-optimized-frozen-tf-graphs/post/5315675/#5315675
$ protoc --version
libprotoc 3.6.1

TensorRT config:
$ dpkg -l | grep nvinfer
ii libnvinfer-dev 4.1.1-1+cuda9.2 arm64 TensorRT development libraries and headers
ii libnvinfer-samples 4.1.1-1+cuda9.2 arm64 TensorRT samples and documentation
ii libnvinfer4 4.1.1-1+cuda9.2 arm64 TensorRT runtime libraries

GPU data:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version 9.2 / 9.2
  CUDA Capability Major/Minor version number: 6.2
  Total amount of global memory: 6402 MBytes (6712545280 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP: 256 CUDA Cores
  GPU Max Clock rate: 1275 MHz (1.27 GHz)
  Memory Clock rate: 1600 Mhz
  Memory Bus Width: 128-bit
  L2 Cache Size: 524288 bytes
  Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 49152 bytes
  Total number of registers available per block: 32768
  Warp size: 32
  Maximum number of threads per multiprocessor: 2048
  Maximum number of threads per block: 1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch: 2147483647 bytes
  Texture alignment: 512 bytes
  Concurrent copy and kernel execution: Yes with 1 copy engine(s)
  Run time limit on kernels: No
  Integrated GPU sharing Host Memory: Yes
  Support host page-locked memory mapping: Yes
  Alignment requirement for Surfaces: Yes
  Device has ECC support: Disabled
  Device supports Unified Addressing (UVA): Yes
  Device supports Compute Preemption: Yes
  Supports Cooperative Kernel Launch: Yes
  Supports MultiDevice Co-op Kernel Launch: Yes
  Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.2, CUDA Runtime Version = 9.2, NumDevs = 1
Result = PASS
benchmark.py:
import argparse
from PIL import Image
import sys
import os
import urllib
import tensorflow.contrib.tensorrt as trt
#import matplotlib
#matplotlib.use('Agg')
#import matplotlib.pyplot as plt
#import matplotlib.patches as patches
import tensorflow as tf
import numpy as np
import time
from tf_trt_models.detection import download_detection_model, build_detection_graph

MODEL = 'ssd_inception_v2_coco'
DATA_DIR = './data/'
IMAGE_PATH = './examples/detection/data/huskies.jpg'


def parse_args():
    """Parse input arguments."""
    desc = ('TRT benchmark')
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument('--model', dest='model',
                        help='name of the object detection model [{}]'.format(MODEL),
                        default=MODEL, type=str)
    parser.add_argument('--trt', dest='use_trt',
                        help='build and test TensorRT model',
                        action='store_true')
    args = parser.parse_args()
    return args


def main():
    args = parse_args()
    print('Called with args: {}'.format(args))

    CONFIG_FILE = args.model + '.config'  # ./data/ssd_inception_v2_coco.config
    CHECKPOINT_FILE = 'model.ckpt'  # ./data/ssd_inception_v2_coco/model.ckpt

    config_path, checkpoint_path = download_detection_model(args.model, 'data')
    frozen_graph, input_names, output_names = build_detection_graph(
        config=config_path,
        checkpoint=checkpoint_path,
        score_threshold=0.3,
        batch_size=1
    )
    print('Model: {}'.format(args.model))
    print(output_names)
    print('Total nodes in the original graph: {}'.format(len([1 for n in frozen_graph.node])))

    if args.use_trt:
        trt_graph = trt.create_inference_graph(
            input_graph_def=frozen_graph,
            outputs=output_names,
            max_batch_size=1,
            max_workspace_size_bytes=1 << 25,
            precision_mode='FP16',
            minimum_segment_size=50
        )
        all_nodes = len([1 for n in trt_graph.node])
        trt_engine_nodes = len([1 for n in trt_graph.node if str(n.op) == 'TRTEngineOp'])
        print('Total nodes in the optimized graph: {} out of which {} are TRTEngineOp'.format(all_nodes, trt_engine_nodes))

    print('Creating the session')
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True
    tf_sess = tf.Session(config=tf_config)

    if args.use_trt:
        print('Running with TRT model')
        tf.import_graph_def(trt_graph, name='')
    else:
        print('Running with ORIGINAL model')
        tf.import_graph_def(frozen_graph, name='')

    tf_input = tf_sess.graph.get_tensor_by_name(input_names[0] + ':0')
    tf_scores = tf_sess.graph.get_tensor_by_name('detection_scores:0')
    tf_boxes = tf_sess.graph.get_tensor_by_name('detection_boxes:0')
    tf_classes = tf_sess.graph.get_tensor_by_name('detection_classes:0')
    tf_num_detections = tf_sess.graph.get_tensor_by_name('num_detections:0')

    image = Image.open(IMAGE_PATH)
    image_resized = np.array(image.resize((300, 300)))
    image = np.array(image)

    print('Running the inference on a single image to warm up the net')
    t0 = time.time()
    scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
        tf_input: image_resized[None, ...]
    })
    t1 = time.time()
    print('Runtime: {:.2f} seconds'.format(t1 - t0))

    boxes = boxes[0]  # index by 0 to remove batch dimension
    scores = scores[0]
    classes = classes[0]
    num_detections = num_detections[0]

    print('Running the benchmark')
    num_samples = 50
    t0 = time.time()
    for i in range(num_samples):
        scores, boxes, classes, num_detections = tf_sess.run([tf_scores, tf_boxes, tf_classes, tf_num_detections], feed_dict={
            tf_input: image_resized[None, ...]
        })
    t1 = time.time()
    print('Average runtime: %f seconds' % (float(t1 - t0) / num_samples))

    tf_sess.close()


if __name__ == '__main__':
    main()

FULL LOGS
original ssd_mobilenet_v1_coco:
$ python3 benchmark.py --model ssd_mobilenet_v1_coco Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=False) --2019-02-18 04:37:35-- http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2018_01_28.tar.gz Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:4005:80a::2010 Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 76541073 (73M) [application/x-tar] Saving to: ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’ data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz 100%[============================================================================================================================================================================>] 73.00M 10.9MB/s in 6.8s 2019-02-18 04:37:42 (10.7 MB/s) - ‘data/ssd_mobilenet_v1_coco_2018_01_28.tar.gz’ saved [76541073/76541073] 2019-02-18 04:37:44.136212: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero 2019-02-18 04:37:44.136411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275 pciBusID: 0000:00:00.0 totalMemory: 6.25GiB freeMemory: 4.42GiB 2019-02-18 04:37:44.136531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:37:46.326887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:37:46.327035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:37:46.327095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:37:46.328349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA 
Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step 2019-02-18 04:38:20.291676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:38:20.291834: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:38:20.291934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:38:20.291984: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:38:20.292106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:38:30.064116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:38:30.064307: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:38:30.064367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:38:30.064408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:38:30.064522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:38:33.110725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:38:33.110866: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:38:33.110909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:38:33.110947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:38:33.111053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Model: ssd_mobilenet_v1_coco ['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections'] Total nodes in the original graph: 7571 Creating the session 2019-02-18 04:38:40.770776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:38:40.770951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:38:40.771024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:38:40.771075: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:38:40.771218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3753 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Running with ORIGINAL model Running the inference on a single image to warm up the net 2019-02-18 04:38:59.904909: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-02-18 04:39:00.059551: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. 
The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.245903: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:39:00.595751: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 15.87 seconds
Running the benchmark
Average runtime: 0.051792 seconds

ssd_mobilenet_v1_coco with TRT:
$ python3 benchmark.py --model ssd_mobilenet_v1_coco --trt Called with args: Namespace(model='ssd_mobilenet_v1_coco', use_trt=True) 2019-02-18 04:45:21.845684: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero 2019-02-18 04:45:21.845872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275 pciBusID: 0000:00:00.0 totalMemory: 6.25GiB freeMemory: 4.14GiB 2019-02-18 04:45:21.845993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:45:23.202972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:45:23.203113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:45:23.203161: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:45:23.203376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. 
Instructions for updating: Please switch to tf.train.get_or_create_global_step 2019-02-18 04:45:57.904372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:45:57.904526: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:45:57.904569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:45:57.904611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:45:57.904770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:46:07.725793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:46:07.725949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:46:07.725994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:46:07.726033: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:46:07.726171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:46:10.766851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:46:10.766994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:46:10.767038: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:46:10.767077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:46:10.767207: I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v1_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 7571
2019-02-18 04:46:26.024094: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:46:26.030507: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:46:26.037184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:26.037433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:26.037485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:26.037524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:26.037659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:46:32.355050: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:32.355353: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:32.494402: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:46:32.494563: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 434 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:46:36.095038: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:46:36.095299: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:46:36.370087: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:46:37.200207: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.444711: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:46:37.532859: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:46:37.533062: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6503 nodes (-1068), 8572 edges (-1676), time = 2288.70093ms.
2019-02-18 04:46:37.533107: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 6518 nodes (15), 8598 edges (26), time = 761.058ms.
2019-02-18 04:46:37.533146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3022.36499ms.
2019-02-18 04:46:37.533186: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6518 nodes (0), 8598 edges (0), time = 802.444ms.
2019-02-18 04:46:37.533224: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6518 nodes (0), 8598 edges (0), time = 3086.44897ms.
2019-02-18 04:46:37.533343: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:46:37.533435: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 267.055ms.
2019-02-18 04:46:37.533475: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 435 nodes (0), 503 edges (0), time = 155.322ms.
2019-02-18 04:46:37.533512: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.109ms.
2019-02-18 04:46:37.533549: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 435 nodes (0), 503 edges (0), time = 217.434ms.
2019-02-18 04:46:37.533584: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 435 nodes (0), 503 edges (0), time = 26.192ms.
Total nodes in the optimized graph: 6518 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:46:38.540298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:46:38.540442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:46:38.540527: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:46:38.540569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:46:38.540681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3639 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:46:55.954166: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.109469: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.295034: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-02-18 04:46:56.642952: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 13.62 seconds
Running the benchmark
Average runtime: 0.053618 seconds

original ssd_mobilenet_v2_coco:
$ python3 benchmark.py --model ssd_mobilenet_v2_coco Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=False) --2019-02-18 04:47:46-- http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v2_coco_2018_03_29.tar.gz Resolving download.tensorflow.org (download.tensorflow.org)... 172.217.20.48, 2a00:1450:400f:806::2010 Connecting to download.tensorflow.org (download.tensorflow.org)|172.217.20.48|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 187925923 (179M) [application/x-tar] Saving to: ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’ data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz 100%[============================================================================================================================================================================>] 179.22M 10.3MB/s in 18s 2019-02-18 04:48:04 (10.1 MB/s) - ‘data/ssd_mobilenet_v2_coco_2018_03_29.tar.gz’ saved [187925923/187925923] 2019-02-18 04:48:08.325486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero 2019-02-18 04:48:08.325631: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275 pciBusID: 0000:00:00.0 totalMemory: 6.25GiB freeMemory: 3.74GiB 2019-02-18 04:48:08.325694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:48:09.640993: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:48:09.641157: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:48:09.641197: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:48:09.641522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: 
NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step 2019-02-18 04:48:48.706244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:48:48.706398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:48:48.706442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:48:48.706478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:48:48.706590: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:49:00.113150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:49:00.113377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:49:00.113423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:49:00.113471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:49:00.113610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) 2019-02-18 04:49:04.441489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 
04:49:04.441627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:49:04.441673: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:49:04.441713: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:49:04.441841: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Model: ssd_mobilenet_v2_coco ['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections'] Total nodes in the original graph: 8062 Creating the session 2019-02-18 04:49:13.197948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:49:13.198120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:49:13.198175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:49:13.198224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:49:13.198413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3233 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Running with ORIGINAL model Running the inference on a single image to warm up the net 2019-02-18 04:49:38.590984: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.53GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 
2019-02-18 04:49:38.607152: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 20.24 seconds
Running the benchmark
Average runtime: 0.084560 seconds

ssd_mobilenet_v2_coco with TRT:
$ python3 benchmark.py --model ssd_mobilenet_v2_coco --trt Called with args: Namespace(model='ssd_mobilenet_v2_coco', use_trt=True) 2019-02-18 04:50:30.934503: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero 2019-02-18 04:50:30.934779: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275 pciBusID: 0000:00:00.0 totalMemory: 6.25GiB freeMemory: 4.41GiB 2019-02-18 04:50:30.934939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0 2019-02-18 04:50:32.242912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-02-18 04:50:32.243053: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 2019-02-18 04:50:32.243093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N 2019-02-18 04:50:32.243487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. 
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:51:11.865247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:11.865473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:11.865536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:11.865573: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:11.865702: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:23.567995: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:23.568248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:23.568312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:23.568349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:23.568464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:27.799623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:27.799797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:27.799840: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:27.799877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:27.799983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_mobilenet_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8062
2019-02-18 04:51:45.670572: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:51:45.676597: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:51:45.680622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:51:45.680794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:51:45.680839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:51:45.680884: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:51:45.681008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:51:53.469662: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:53.470062: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:53.727136: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:51:53.727286: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 780 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:51:58.248570: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:51:58.249029: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:51:59.108786: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:891] Failed to register segment graphdef as a function 0: Invalid argument: Cannot add function 'my_trt_op_0_native_segment' because a different function with the same name already exists.
2019-02-18 04:52:00.692498: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.438146: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:52:01.613665: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:52:01.613843: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6850 nodes (-1212), 8953 edges (-1820), time = 2828.16ms.
2019-02-18 04:52:01.613888: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 6865 nodes (15), 8979 edges (26), time = 835.079ms.
2019-02-18 04:52:01.613927: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 3860.146ms.
2019-02-18 04:52:01.613970: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 6865 nodes (0), 8979 edges (0), time = 994.512ms.
2019-02-18 04:52:01.614012: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 6865 nodes (0), 8979 edges (0), time = 4454.34082ms.
2019-02-18 04:52:01.614078: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:52:01.614118: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 624.612ms.
2019-02-18 04:52:01.614207: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 781 nodes (0), 883 edges (0), time = 396.25ms.
2019-02-18 04:52:01.614248: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 49.671ms.
2019-02-18 04:52:01.614286: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 781 nodes (0), 883 edges (0), time = 695.233ms.
2019-02-18 04:52:01.614322: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 781 nodes (0), 883 edges (0), time = 53.073ms.
Total nodes in the optimized graph: 6865 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:52:03.094518: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:52:03.094648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:52:03.094692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:52:03.094730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:52:03.094885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3909 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:52:38.719800: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 31.43 seconds
Running the benchmark
Average runtime: 0.093455 seconds

original ssd_inception_v2_coco:
$ python3 benchmark.py --model ssd_inception_v2_coco
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=False)
2019-02-18 04:10:13.974149: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:10:13.974348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.29GiB
2019-02-18 04:10:13.974412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:15.398904: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:15.399058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:10:15.399103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:10:15.399360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:10:58.050991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:10:58.051141: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:10:58.051184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:10:58.051231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:10:58.051349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:10.699534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:10.699689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:10.699743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:11:10.699792: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:11:10.699910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:11:15.841888: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:15.842028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:15.842070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:11:15.842106: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:11:15.842217: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
Creating the session
2019-02-18 04:11:26.078042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:11:26.078225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:11:26.078275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:11:26.078319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:11:26.078547: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3793 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with ORIGINAL model
Running the inference on a single image to warm up the net
2019-02-18 04:11:57.791649: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 26.46 seconds
Running the benchmark
Average runtime: 0.100977 seconds

ssd_inception_v2_coco with TRT:
nvidia@dpx2tegraa-lund:~/dariusz/projects/nvidia/tf_trt_models$ python3 benchmark.py --model ssd_inception_v2_coco --trt
Called with args: Namespace(model='ssd_inception_v2_coco', use_trt=True)
2019-02-18 04:18:05.555255: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:931] ARM64 does not support NUMA - returning NUMA node zero
2019-02-18 04:18:05.555595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.275
pciBusID: 0000:00:00.0
totalMemory: 6.25GiB freeMemory: 4.54GiB
2019-02-18 04:18:05.555680: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:07.756886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:07.757042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:18:07.757086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:18:07.757460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
WARNING:tensorflow:From /home/nvidia/.local/lib/python3.5/site-packages/object_detection-0.1-py3.5.egg/object_detection/exporter.py:356: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
2019-02-18 04:18:50.782544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:18:50.782700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:18:50.782752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:18:50.782788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:18:50.782916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:03.670306: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:03.670446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:03.670493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:03.670534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:03.670669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:08.911273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:08.911411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:08.911473: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:08.911513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:08.911773: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Model: ssd_inception_v2_coco
['detection_boxes', 'detection_classes', 'detection_scores', 'num_detections']
Total nodes in the original graph: 8278
2019-02-18 04:19:29.284995: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2019-02-18 04:19:29.290733: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-02-18 04:19:29.295736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:29.295909: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:29.295956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:29.295996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:29.296253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2019-02-18 04:19:38.501962: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:38.502486: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:38.903763: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:38.903994: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:45.094916: I tensorflow/contrib/tensorrt/convert/convert_nodes.cc:2957] Segment @scope '', converted to graph
2019-02-18 04:19:45.095409: E tensorflow/contrib/tensorrt/convert/convert_graph.cc:418] Can't find a device placement for the op!
2019-02-18 04:19:46.228436: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addScale::120, condition: hasChannelDimension == true
2019-02-18 04:19:46.228742: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:958] Engine my_trt_op_0 creation for segment 0, composed of 931 nodes failed: Internal: TFTRT::ConvertFusedBatchNormfailed to add TRT layer, at: FeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/BatchNorm/FusedBatchNorm. Skipping...
2019-02-18 04:19:48.180974: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:48.992334: W tensorflow/contrib/tensorrt/convert/trt_optimization_pass.cc:185] TensorRTOptimizer is probably called on funcdef! This optimizer must *NOT* be called on function objects.
2019-02-18 04:19:49.170470: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: tf_graph
2019-02-18 04:19:49.170642: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 7000 nodes (-1278), 9181 edges (-1886), time = 3424.68ms.
2019-02-18 04:19:49.170691: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Graph size after: 7025 nodes (25), 9207 edges (26), time = 979.995ms.
2019-02-18 04:19:49.170731: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 7025 nodes (0), 9207 edges (0), time = 4509.50488ms.
2019-02-18 04:19:49.170900: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 7015 nodes (-10), 9207 edges (0), time = 1884.54199ms.
2019-02-18 04:19:49.170941: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 7015 nodes (0), 9207 edges (0), time = 5634.18701ms.
2019-02-18 04:19:49.170979: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:501] Optimization results for grappler item: my_trt_op_0_native_segment
2019-02-18 04:19:49.171017: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 784.723ms.
2019-02-18 04:19:49.171070: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] layout: Invalid argument: The graph is already optimized by layout optimizer.
2019-02-18 04:19:49.171108: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 53.572ms.
2019-02-18 04:19:49.171146: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] constant folding: Graph size after: 932 nodes (0), 1112 edges (0), time = 757.245ms.
2019-02-18 04:19:49.171181: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:503] TensorRTOptimizer: Graph size after: 932 nodes (0), 1112 edges (0), time = 54.12ms.
Total nodes in the optimized graph: 7015 out of which 0 are TRTEngineOp
Creating the session
2019-02-18 04:19:51.273352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-18 04:19:51.273486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-18 04:19:51.273532: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-18 04:19:51.273572: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-18 04:19:51.273691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3872 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
Running with TRT model
Running the inference on a single image to warm up the net
2019-02-18 04:20:35.683734: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Runtime: 38.11 seconds
Running the benchmark
Average runtime: 0.106853 seconds
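
In case it helps anyone compare runs, the figures quoted at the top (average runtime, node counts, failed engines) can be pulled out of logs like these with a few regular expressions. This is only a rough sketch: `summarize_log` and the pattern names are hypothetical helpers, not part of tf_trt_models or TensorFlow, and the patterns assume the exact wording of the benchmark.py output and the convert_graph.cc warnings shown in the logs.

```python
import re

# Patterns are assumptions based on the log lines in this post.
AVG_RE = re.compile(r"Average runtime: ([0-9.]+) seconds")
NODES_RE = re.compile(
    r"Total nodes in the optimized graph: (\d+) out of which (\d+) are TRTEngineOp"
)
FAIL_RE = re.compile(
    r"Engine (\S+) creation for segment (\d+), composed of (\d+) nodes failed: "
    r"(.*?) Skipping\.\.\."
)


def summarize_log(text):
    """Extract runtime, node counts and failed TRT engines from one log."""
    summary = {
        "average_runtime_s": None,
        "optimized_nodes": None,
        "trt_engine_ops": None,
        "failed_engines": [],
    }

    match = AVG_RE.search(text)
    if match:
        summary["average_runtime_s"] = float(match.group(1))

    match = NODES_RE.search(text)
    if match:
        summary["optimized_nodes"] = int(match.group(1))
        summary["trt_engine_ops"] = int(match.group(2))

    # A log may report the same engine failing more than once.
    for match in FAIL_RE.finditer(text):
        summary["failed_engines"].append(
            {
                "engine": match.group(1),
                "segment": int(match.group(2)),
                "nodes": int(match.group(3)),
                "reason": match.group(4).strip(),
            }
        )
    return summary


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python3 summarize_log.py trt_run.log
    with open(sys.argv[1]) as handle:
        print(summarize_log(handle.read()))
```

A run where `trt_engine_ops` is 0 and `failed_engines` is non-empty matches what is seen here: the segment was found, but building the engine for it failed.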