TensorFlow object detection and image classification accelerated for NVIDIA Jetson

We’re happy to share the following project on GitHub, which demonstrates object detection and image classification workflows using TensorRT integration in TensorFlow (for details on TF-TRT integration, see this blog post). With this project you can easily accelerate popular models like SSD Inception V2 for use on Jetson.

The project is hosted at the following URL:

https://github.com/NVIDIA-Jetson/tf_trt_models

By following the steps outlined in this project, you will:

  1. Download pretrained object detection and image classification models sourced from the TensorFlow models repository
  2. Run scripts to preprocess the TensorFlow graphs for best utilization of TensorRT and Jetson
  3. Accelerate models using TensorRT integration in TensorFlow
  4. Execute models with the TensorFlow Python API

The models are sourced from the TensorFlow models repository, so it is possible to train the models for custom tasks using the steps detailed there. Provided you use one of the listed model architectures, you can follow the steps above to easily accelerate the model for ideal performance on Jetson.
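
For reference, the end-to-end workflow condenses to something like the following sketch (adapted from the project README; exact signatures and the TF-TRT parameters may differ slightly across versions):

import tensorflow as tf
import tensorflow.contrib.tensorrt as trt
from tf_trt_models.detection import download_detection_model, build_detection_graph

# Steps 1-2: download a pretrained model and build a frozen graph
# preprocessed for TF-TRT and Jetson
config_path, checkpoint_path = download_detection_model('ssd_inception_v2_coco')
frozen_graph, input_names, output_names = build_detection_graph(
    config=config_path,
    checkpoint=checkpoint_path
)

# Step 3: let TensorRT optimize the eligible subgraphs (FP16 suits the TX2)
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=output_names,
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode='FP16',
    minimum_segment_size=50
)

# Step 4: execute with the TensorFlow Python API
with tf.Graph().as_default():
    tf.import_graph_def(trt_graph, name='')
    # ...look up the input/output tensors by name and call sess.run(...)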

Enjoy!

Hi jaybdub,

It’s easy to use, thank you!
I tried it on a PC, but I think I made a labeling mistake:
my webcam shows a car, but the label says bicycle.
I tried ssd_inception_v2_coco_2017_11_17 and mscoco_label_map.pbtxt.

In mscoco_label_map.pbtxt:

item {
  name: "/m/0199g"
  id: 2
  display_name: "bicycle"
}
item {
  name: "/m/0k4j"
  id: 3
  display_name: "car"
}

So car is id 3, but bicycle is id 2.

I also looked at tf_trt_models/detection.ipynb (NVIDIA-AI-IOT/tf_trt_models on GitHub),
where the dog’s id seems to be 17.

But dog is id 18 and cat is id 17.
In mscoco_label_map.pbtxt:

item {
  name: "/m/01yrx"
  id: 17
  display_name: "cat"
}
item {
  name: "/m/0bt9lr"
  id: 18
  display_name: "dog"
}

Does TensorRT require the labels to be adjusted?

My repository is here: https://github.com/naisy/realtime_object_detection

Hi naisy,

It shouldn’t be related to TensorRT; in this case it seems the neural network output is 0-indexed, while the label map is 1-indexed. You should be able to add 1 to each class index output by the network before associating it with the label map to get the correct label.
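
For example, a minimal sketch (here classes comes from your sess.run(...) call, and category_index is a hypothetical dict you would parse from mscoco_label_map.pbtxt):

import numpy as np

# category_index is assumed to map the 1-based ids in mscoco_label_map.pbtxt
# to display names, e.g. {1: 'person', 2: 'bicycle', 3: 'car', ...}
classes = np.add(classes, 1)  # shift 0-based network outputs to 1-based label-map ids
names = [category_index[int(c)] for c in classes[0]]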

Hope this helps!

Hi jaybdub,

Thank you! It works well!

classes = np.add(classes, 1) 

Thank you for the clear explanation and the benchmarking on this website, and for testing out different models; it is really appreciated!

According to your execution-time table, I should get 54.4 ms when running ssd_inception_v2_coco on the TX2. Over 200 runs, after the network is ‘warmed up’, I get 69.63 ms. This seems like a significant difference to me. Looking at tegrastats, it seems the GPU is not utilized very efficiently (even though it varies over time, it is rarely even close to 90%):

RAM 4167/7854MB (lfb 84x4MB) CPU [49%@2035,0%@2035,0%@2035,44%@2035,47%@2032,38%@2035] EMC_FREQ 7%@1866 GR3D_FREQ 18%@1300 APE 150 MTS fg 0% bg 0% BCPU@48C MCPU@48C GPU@47.5C PLL@48C Tboard@41C Tdiode@46.25C PMIC@100C thermal@48.3C VDD_IN 7862/4839 VDD_CPU 1763/820 VDD_GPU 2531/947 VDD_SOC 997/929 VDD_WIFI 0/33 VDD_DDR 1626/1271 

I just followed all the steps in the GitHub README and the notebook, so any idea what could be the cause of this? I use JetPack 3.3 and TensorFlow 1.10.

Thanks for the feedback and for raising this issue!

We collected the benchmark timings under the following configuration:

(1) JetPack 3.2
(2) TensorFlow 1.8
(3) MAXN power mode (sudo nvpmodel -m 0)
(4) Jetson clocks enabled (sudo ~/jetson_clocks.sh)
(5) Runtime averaged over 50 calls to sess.run(…) on a static image. This excludes reading from disk and JPEG decoding.

First, if (3)-(5) differed from our configuration when you profiled, that would cause a difference in the timing.

If they are consistent with our profiling, then perhaps it is a performance regression from JetPack 3.2 → 3.3, or TensorFlow 1.8 → 1.10, which we would want to investigate.

Thanks for the response! I have indeed run the nvpmodel -m 0 command and jetson_clocks.sh, so (3) and (4) are the same. And just so there is no doubt about it, here is the code I used to make sure (5) is comparable:

scores, boxes, classes = tf_sess.run([tf_scores, tf_boxes, tf_classes], feed_dict={tf_input: image_resized[None, ...]})
times = []
for i in range(200):
    t0 = time()
    scores, boxes, classes = tf_sess.run([tf_scores, tf_boxes, tf_classes], feed_dict={tf_input: image_resized[None, ...]})
    times.append(time() - t0)
print(np.mean(times))

So I would say my setup is comparable. Two other things I noticed:

  • Running graphdef.ParseFromString() on the frozen graph (generated with build_detection_graph) takes 4.7 seconds. Loading the trt_graph generated by trt.create_inference_graph takes 9 minutes and 26 seconds (!). Same with running tf.import_graph_def(graphdef, name='') on both files: 12.9 seconds for the frozen graph, 41.8 seconds for the trt_graph (timed roughly as in the sketch after this list). Is this anywhere near the expected times? It seems ridiculously long to me and could be indicative of something not working right with these versions of JetPack and TensorFlow.
  • tegrastats reports near-constant 90-100% GPU usage when running the frozen graph (which runs at a speed comparable to what you reported: 139 ms vs your reported 132 ms for ssd_inception_v2_coco at 300x300).
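
For completeness, this is roughly how I measured those load times (a sketch; 'trt_graph.pb' stands in for whichever file the serialized graph was written to):

from time import time
import tensorflow as tf

graph_def = tf.GraphDef()

t0 = time()
with open('trt_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())  # deserializing alone took ~9.5 minutes for the trt_graph
print('ParseFromString: %.1f s' % (time() - t0))

t0 = time()
tf.import_graph_def(graph_def, name='')
print('import_graph_def: %.1f s' % (time() - t0))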

I’ll see if I can get JetPack 3.2 with TensorFlow 1.8 installed and try to reproduce your speeds that way, to make sure nothing else is going wrong.

JetPack 3.2 with TensorFlow 1.8 is a little faster, but still not as fast as reported (note that this is a different TX2 module). With the same setup as before, I now get an average runtime of 64.72 ms: 5 ms quicker than with TensorFlow 1.10 and JetPack 3.3, but still 10 ms short of your measured time. Is there something I’m still missing here?

Running ParseFromString to load the trt_graph now only takes 4.59 seconds, so that bug is gone at least.

The GPU usage still seems to be suboptimal, but maybe that is inherent in the model / in working with a batch size of 1. This is what tegrastats reports with --interval 100:

RAM 2880/7854MB (lfb 719x4MB) CPU [60%@2035,0%@2034,0%@2034,33%@2029,30%@2034,66%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 66%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.9C VDD_IN 8092/6901 VDD_CPU 1686/1360 VDD_GPU 2606/1968 VDD_SOC 996/956 VDD_WIFI 0/20 VDD_DDR 1640/1498
RAM 2880/7854MB (lfb 719x4MB) CPU [25%@2034,0%@2033,0%@2035,30%@2032,50%@2033,60%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47.75C PMIC@100C thermal@46.9C VDD_IN 8015/6905 VDD_CPU 1686/1361 VDD_GPU 2530/1970 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1640/1498
RAM 2880/7854MB (lfb 719x4MB) CPU [50%@2031,0%@2035,0%@2035,55%@2031,60%@2038,58%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 2%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47.75C PMIC@100C thermal@46.9C VDD_IN 8015/6909 VDD_CPU 1763/1363 VDD_GPU 2530/1972 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1499
RAM 2880/7854MB (lfb 719x4MB) CPU [30%@2031,0%@2035,0%@2035,45%@2033,27%@2034,55%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 91%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.9C VDD_IN 8130/6913 VDD_CPU 1686/1364 VDD_GPU 2683/1974 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1640/1499
RAM 2880/7854MB (lfb 719x4MB) CPU [66%@2035,0%@2035,0%@2033,50%@2034,44%@2033,55%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.9C VDD_IN 7900/6916 VDD_CPU 1686/1365 VDD_GPU 2454/1976 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1500
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2035,0%@2034,0%@2035,54%@2035,36%@2034,60%@2032] EMC_FREQ 8%@1866 GR3D_FREQ 99%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47.5C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.9C VDD_IN 8053/6920 VDD_CPU 1686/1366 VDD_GPU 2606/1978 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1640/1500
RAM 2880/7854MB (lfb 719x4MB) CPU [58%@2035,0%@2035,0%@2035,44%@2035,55%@2034,30%@2035] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@46.5C PMIC@100C thermal@46.9C VDD_IN 7938/6924 VDD_CPU 1686/1367 VDD_GPU 2454/1980 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1500
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2032,0%@2035,0%@2036,55%@2032,33%@2033,54%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47.5C PLL@47C Tboard@42C Tdiode@46.5C PMIC@100C thermal@46.9C VDD_IN 7977/6928 VDD_CPU 1686/1368 VDD_GPU 2531/1982 VDD_SOC 996/957 VDD_WIFI 0/20 VDD_DDR 1621/1501
RAM 2880/7854MB (lfb 719x4MB) CPU [33%@2033,0%@2035,0%@2035,33%@2033,63%@2033,45%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 80%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@47.2C VDD_IN 8168/6932 VDD_CPU 1686/1369 VDD_GPU 2683/1984 VDD_SOC 996/958 VDD_WIFI 0/20 VDD_DDR 1659/1501
RAM 2880/7854MB (lfb 719x4MB) CPU [70%@2033,0%@2035,0%@2035,54%@2032,40%@2034,54%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47.5C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@47.2C VDD_IN 7900/6935 VDD_CPU 1686/1371 VDD_GPU 2454/1986 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1621/1502
RAM 2880/7854MB (lfb 719x4MB) CPU [36%@2033,0%@2035,0%@2033,63%@2031,22%@2032,40%@2031] EMC_FREQ 8%@1866 GR3D_FREQ 69%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@47.2C VDD_IN 8053/6939 VDD_CPU 1686/1372 VDD_GPU 2606/1988 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1640/1502
RAM 2880/7854MB (lfb 719x4MB) CPU [70%@2032,0%@2035,0%@2035,40%@2026,40%@2032,54%@2032] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@47.2C VDD_IN 7823/6942 VDD_CPU 1686/1373 VDD_GPU 2377/1989 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1602/1503
RAM 2880/7854MB (lfb 719x4MB) CPU [30%@2031,0%@2034,0%@2036,50%@2030,50%@2033,33%@2031] EMC_FREQ 8%@1866 GR3D_FREQ 71%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@47.2C VDD_IN 7977/6945 VDD_CPU 1686/1374 VDD_GPU 2531/1991 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1621/1503
RAM 2880/7854MB (lfb 719x4MB) CPU [55%@2029,0%@2035,0%@2035,44%@2034,40%@2031,54%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 19%@1300 APE 150 MTS fg 0% bg 0% BCPU@47.5C MCPU@47.5C GPU@46C PLL@47.5C Tboard@42C Tdiode@47.75C PMIC@100C thermal@47.2C VDD_IN 8092/6949 VDD_CPU 1686/1375 VDD_GPU 2606/1993 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1640/1504
RAM 2880/7854MB (lfb 719x4MB) CPU [55%@2033,0%@2034,0%@2034,60%@2034,55%@2034,50%@2032] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47.75C PMIC@100C thermal@47.2C VDD_IN 7977/6953 VDD_CPU 1763/1376 VDD_GPU 2454/1995 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1621/1504
RAM 2880/7854MB (lfb 719x4MB) CPU [50%@2035,0%@2036,0%@2036,50%@2031,40%@2034,45%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 14%@1300 APE 150 MTS fg 0% bg 0% BCPU@47.5C MCPU@47.5C GPU@47C PLL@47.5C Tboard@42C Tdiode@48C PMIC@100C thermal@47.2C VDD_IN 8092/6957 VDD_CPU 1686/1377 VDD_GPU 2606/1997 VDD_SOC 996/958 VDD_WIFI 0/19 VDD_DDR 1640/1504
RAM 2880/7854MB (lfb 719x4MB) CPU [58%@2033,0%@2036,0%@2034,37%@2036,44%@2037,44%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 7%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@48C PMIC@100C thermal@47.2C VDD_IN 8015/6960 VDD_CPU 1763/1378 VDD_GPU 2530/1999 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1505
RAM 2880/7854MB (lfb 719x4MB) CPU [33%@2033,0%@2035,0%@2035,50%@2032,40%@2035,62%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@48C PMIC@100C thermal@47.2C VDD_IN 7977/6964 VDD_CPU 1686/1380 VDD_GPU 2531/2000 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1505
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2025,0%@2035,0%@2035,40%@2034,22%@2034,45%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 2%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.8C VDD_IN 8092/6967 VDD_CPU 1686/1381 VDD_GPU 2606/2002 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1506
RAM 2880/7854MB (lfb 719x4MB) CPU [66%@2032,0%@2034,0%@2035,66%@2033,44%@2033,40%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47C PMIC@100C thermal@46.8C VDD_IN 7938/6971 VDD_CPU 1686/1382 VDD_GPU 2454/2004 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1602/1506
RAM 2880/7854MB (lfb 719x4MB) CPU [20%@2032,0%@2035,0%@2035,50%@2032,36%@2032,44%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 98%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@46.8C VDD_IN 8053/6974 VDD_CPU 1686/1383 VDD_GPU 2606/2006 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1640/1507
RAM 2880/7854MB (lfb 719x4MB) CPU [80%@2033,0%@2035,0%@2035,54%@2033,55%@2034,36%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@46.8C VDD_IN 7823/6977 VDD_CPU 1686/1384 VDD_GPU 2377/2007 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1602/1507
RAM 2880/7854MB (lfb 719x4MB) CPU [50%@2031,0%@2035,0%@2035,30%@2035,60%@2033,36%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 99%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@46.75C PMIC@100C thermal@46.8C VDD_IN 8053/6981 VDD_CPU 1686/1385 VDD_GPU 2531/2009 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1621/1507
RAM 2880/7854MB (lfb 719x4MB) CPU [33%@2035,0%@2035,0%@2035,70%@2034,44%@2034,40%@2033] EMC_FREQ 8%@1866 GR3D_FREQ 19%@1300 APE 150 MTS fg 0% bg 0% BCPU@47.5C MCPU@47.5C GPU@46.5C PLL@47.5C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 7977/6984 VDD_CPU 1686/1386 VDD_GPU 2454/2010 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1621/1508
RAM 2880/7854MB (lfb 719x4MB) CPU [36%@2033,0%@2034,0%@2034,45%@2032,60%@2034,54%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@47C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 7977/6987 VDD_CPU 1686/1387 VDD_GPU 2454/2012 VDD_SOC 996/959 VDD_WIFI 0/19 VDD_DDR 1621/1508
RAM 2880/7854MB (lfb 719x4MB) CPU [66%@2028,0%@2034,0%@2035,40%@2034,36%@2035,44%@2034] EMC_FREQ 8%@1866 GR3D_FREQ 25%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46.5C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 8092/6991 VDD_CPU 1686/1387 VDD_GPU 2606/2014 VDD_SOC 996/960 VDD_WIFI 0/18 VDD_DDR 1640/1508
RAM 2880/7854MB (lfb 719x4MB) CPU [40%@2034,0%@2036,0%@2036,66%@2033,37%@2034,44%@2036] EMC_FREQ 8%@1866 GR3D_FREQ 0%@1300 APE 150 MTS fg 0% bg 0% BCPU@47C MCPU@47C GPU@46C PLL@47C Tboard@42C Tdiode@47.5C PMIC@100C thermal@46.8C VDD_IN 7938/6994 VDD_CPU 1686/1388 VDD_GPU 2454/2015 VDD_SOC 996/960 VDD_WIFI 0/18 VDD_DDR 1621/1509

Hi,

We have released an official TensorFlow package.
Could you repeat your experiment with our official package and share the results with us?
[url]https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/[/url]

Thanks.

Sure! On the JetPack 3.2 setup, now with the official TensorFlow 1.9, I get about the same running time I got earlier with the JetPack 3.3 and TensorFlow 1.10 setup: an average of 69.65 ms per image for ssd_inception_v2. I realize I should maybe have mentioned this last time, but here is the log for the creation of the inference graph (with the official TF 1.9):

>>> trt_graph = trt.create_inference_graph(
...     input_graph_def=frozen_graph,
...     outputs=output_names,
...     max_batch_size=1,
...     max_workspace_size_bytes=1 << 25,
...     precision_mode='FP16',
...     minimum_segment_size=50
... )
2018-09-10 09:01:07.543057: I tensorflow/core/grappler/devices.cc:51] Number of eligible GPUs (core count >= 8): 0
2018-09-10 09:01:24.474770: I tensorflow/contrib/tensorrt/convert/convert_graph.cc:438] MULTIPLE tensorrt candidate conversion: 7
2018-09-10 09:01:25.042883: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.043030: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:0 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)
2018-09-10 09:01:25.048693: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.048845: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:1 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 812 nodes)
2018-09-10 09:01:25.883711: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:2 due to: "Invalid argument: Output node 'FeatureExtractor/InceptionV2/InceptionV2/Mixed_3b/concat-4-LayoutOptimizer' is weights not tensor" SKIPPING......( 844 nodes)
2018-09-10 09:01:25.890138: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:3 due to: "Unimplemented: Operation: GatherV2 does not support tensor input as indices, at: Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/FilterGreaterThan_83/Gather/GatherV2" SKIPPING......( 91 nodes)
2018-09-10 09:01:25.894630: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.894759: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:4 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 180 nodes)
2018-09-10 09:01:25.898671: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.898789: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:5 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 93 nodes)
2018-09-10 09:01:25.902308: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
2018-09-10 09:01:25.902452: W tensorflow/contrib/tensorrt/convert/convert_graph.cc:515] subgraph conversion error for subgraph_index:6 due to: "Invalid argument: Failed to create Input layer" SKIPPING......( 91 nodes)

Do you also want me to check the log or performance with JetPack 3.3, or with TF 1.10/1.8?

Hi,

Some operations run slowly in TensorFlow due to data transfers between CPU and GPU.

Could you test it with nvprof and share the profiling data with us?
This will help us find the bottleneck.

Thanks.

I’m seeing the same error as reported by frederiki3k63 when trying to optimize object detection models with trt.create_inference_graph(). I’m also using JetPack-3.3 with the latest official TensorFlow (1.9.0) wheel for TX2, as specified in https://devtalk.nvidia.com/default/topic/1038957/jetson-tx2/tensorflow-for-jetson-tx2-/

https://developer.download.nvidia.com/compute/redist/jp33/tensorflow-gpu/tensorflow_gpu-1.9.0+nv18.8-cp35-cp35m-linux_aarch64.whl

Due to this error, I think the object detection model just runs at unoptimized speed.

......
2018-09-13 16:00:44.289961: E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger Parameter check failed at: ../builder/Network.cpp::addInput::364, condition: isValidDims(dims)
......

@AastaLLL, could you check and advise? Thanks.

Right. Here are the files and here’s the log:

nvidia@tegra-ubuntu:~/tf_trt_models$ nvprof python3 test_inception.py Starting session.. ==1986== NVPROF is profiling process 1986, command: python3 test_inception.py ==1986== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements 2018-09-13 13:31:04.443654: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:864] ARM64 does not support NUMA - returning NUMA node zero 2018-09-13 13:31:04.443902: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: name: NVIDIA Tegra X2 major: 6 minor: 2 memoryClockRate(GHz): 1.3005 pciBusID: 0000:00:00.0 totalMemory: 7.67GiB freeMemory: 1.32GiB 2018-09-13 13:31:04.443957: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1471] Adding visible gpu devices: 0 2018-09-13 13:31:07.226353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:952] Device interconnect StreamExecutor with strength 1 edge matrix: 2018-09-13 13:31:07.226440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:958] 0 2018-09-13 13:31:07.226469: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: N 2018-09-13 13:31:07.226696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1084] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 557 MB memory) -> physical GPU (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2) Loading graph.. Running network one time to warm up.. Starting test.. Average run time: 0.08413341760635376 ==1986== Profiling application: python3 test_inception.py ==1986== Profiling result: Type Time(%) Time Calls Avg Min Max Name GPU activities: 23.06% 935.20ms 3015 310.18us 70.400us 1.4286ms trtwell_fp16x2_hcudnn_winograd_fp16x2_128x128_ldg1_ldg4_relu_tile148m_nt 12.01% 487.14ms 1809 269.29us 95.584us 1.2186ms trt_maxwell_fp16x2_hcudnn_fp16x2_128x64_relu_small_nn_v1 7.24% 293.60ms 15477 18.969us 1.7600us 274.34us void cuScale::scale<__half, __half, cuScale::Mode, bool=0, int=4, cuScale::FusedActivationType>(__half const *, cuScale::scale<__half, __half, cuScale::Mode, bool=0, int=4, cuScale::FusedActivationType>*, cuScale::KernelParameters<cuScale::scale<__half, __half, cuScale::Mode, bool=0, int=4, cuScale::FusedActivationType>>, nvinfer1::cudnn::reduced_divisor, nvinfer1::cudnn, nvinfer1::cudnn) 7.00% 283.71ms 2211 128.32us 50.816us 232.64us trt_maxwell_fp16x2_hcudnn_fp16x2_128x64_relu_interior_nn_v1 6.32% 256.13ms 1608 159.29us 68.320us 239.46us trtwell_fp16x2_hcudnn_fp16x2_128x64_relu_interior_nn 5.54% 224.52ms 15477 14.506us 3.3600us 166.24us void cudnn::detail::activation_fw_4d_kernel<__half, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>(cudnnTensorStruct, __half const *, cudnn::detail::activation_fw_4d_kernel<__half, float, int=128, int=1, int=4, cudnn::detail::relu_func<float, cudnnNanPropagation_t=1, bool=0>>, cudnnTensorStruct*, float, cudnnTensorStruct*, int, cudnnTensorStruct*) 5.10% 206.76ms 15276 13.535us 1.1840us 203.90us void cuEltwise::eltwise<cuEltwise::SimpleAlgo<__half, long>, cuEltwise::Compute<nvinfer1::ElementWiseOperation>>(cuEltwise::LaunchParams) 3.89% 157.86ms 603 261.79us 256.10us 270.88us trtwell_scudnn_128x32_relu_interior_nn 2.82% 114.21ms 603 189.40us 93.280us 277.66us trtwell_fp16x2_hcudnn_fp16x2_128x128_relu_interior_nn 2.65% 107.27ms 603 177.90us 90.144us 402.24us trtwell_fp16x2_hcudnn_fp16x2_128x128_relu_small_nn 2.51% 101.93ms 
804 126.78us 101.44us 144.26us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=1, int=4>, fused::KpqkPtrWriter<__half, int=2, int=2>, __half2, __half, int=2, int=5, int=2, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=4, int=2Type>, float) 2.48% 100.66ms 402 250.40us 193.76us 310.82us trtwell_fp16x2_hcudnn_fp16x2_128x64_relu_small_nn 1.97% 79.866ms 1206 66.223us 2.1440us 246.46us void cuPad::pad<__half, __half2, int=128, bool=1>(__half2*, int, cuPad::pad<__half, __half2, int=128, bool=1> const *, int, int, int, int, int, int, int, int, int, nvinfer1::cudnn::reduced_divisor, nvinfer1::cudnn, nvinfer1::cudnn, float const *, float const ) 1.93% 78.395ms 1407 55.717us 20.736us 103.30us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=1, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=3, int=7, int=8, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float) 1.67% 67.523ms 402 167.97us 165.66us 172.16us void nvinfer1::tiled_pooling::poolCHW_RS3_UV2_PQT_kernel<int=4, int=4, int=32, int=2, nvinfer1::ITiledPooling::PoolingMode>(nvinfer1::TiledPoolingParams, int) 1.36% 54.956ms 804 68.353us 64.800us 72.159us void nvinfer1::tiled_pooling::poolCHW_RS3_UV1_PQT_kernel<int=4, int=4, int=32, int=2, nvinfer1::ITiledPooling::PoolingMode>(nvinfer1::TiledPoolingParams, int) 1.26% 50.947ms 201 253.47us 243.94us 279.30us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=1, int=1>, fused::KpqkPtrWriter<__half, int=1, int=1>, __half2, __half, int=5, int=7, int=4, int=3, int=3, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=1, int=2Type>, float) 1.25% 50.630ms 603 83.964us 55.520us 111.84us void nvinfer1::tiled_pooling::poolCHW_RS3_UV1_PQT_kernel<int=4, int=8, int=16, int=2, nvinfer1::ITiledPooling::PoolingMode>(nvinfer1::TiledPoolingParams, int) 0.97% 39.172ms 201 194.89us 193.60us 197.06us void tensorflow::_GLOBAL__N__60_tmpxft_00001137_00000000_6_resize_bilinear_op_gpu_cu_cpp1_ii_c6ae9512::ResizeBilinearKernel<float>(int, float const *, float, float, int, int, int, int, int, int, float*) 0.97% 39.139ms 201 194.72us 194.02us 195.90us void cuInt8::nchwToNchhw2<float>(float const *, __half*, int, int, int, int, cuInt8::ReducedDivisorParameters) 0.86% 34.927ms 201 173.77us 171.90us 176.58us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=1024, int=1024, int=2, bool=0>*) 0.86% 34.674ms 852 40.697us 160ns 11.018ms [CUDA memcpy HtoD] 0.68% 27.699ms 402 68.902us 66.560us 72.064us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=2, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=3, int=10, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float) 0.66% 26.603ms 201 132.35us 129.82us 135.36us trt_maxwell_fp16x2_hcudnn_fp16x2_128x64_relu_large_nn_v1 0.51% 20.512ms 201 102.05us 99.136us 106.56us trt_maxwell_fp16x2_hcudnn_fp16x2_128x128_relu_small_nn_v1 0.47% 18.992ms 201 94.489us 93.824us 95.680us void cuPad::pad<float, float, int=128, bool=1>(float*, int, cuPad::pad<float, float, int=128, bool=1> const *, int, int, int, int, int, int, int, int, int, nvinfer1::cudnn::reduced_divisor, 
nvinfer1::cudnn, nvinfer1::cudnn, float const *, float const ) 0.42% 17.199ms 2412 7.1300us 960ns 36.640us void cuInt8::nchhw2ToNchw<float>(__half const *, float*, int, int, int, int, cuInt8::ReducedDivisorParameters) 0.39% 15.651ms 201 77.868us 75.744us 81.120us void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=2, int=2, int=4, int=28, int=513, int=6, int=2, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams) 0.33% 13.584ms 1005 13.516us 2.5600us 34.400us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=32, int=32, bool=0>*) 0.31% 12.657ms 3216 3.9350us 800ns 27.264us [CUDA memcpy DtoD] 0.29% 11.641ms 201 57.914us 56.096us 60.064us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_difference_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1) 0.28% 11.268ms 201 56.060us 54.240us 58.144us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_left<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1) 0.25% 10.095ms 201 50.224us 49.344us 51.520us void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=2, int=2, int=7, int=7, int=225, int=6, int=2, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams) 0.25% 10.076ms 201 50.130us 47.776us 52.640us trt_maxwell_fp16x2_hcudnn_fp16x2_128x32_relu_interior_nn_v1 0.23% 9.3126ms 201 46.331us 43.584us 49.120us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=3, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<int, int=3> const , Eigen::DSizes<int, int=3> const , Eigen::TensorMap<Eigen::Tensor<float const , int=3, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=3) 0.19% 7.8614ms 1005 7.8220us 224ns 37.376us [CUDA memcpy DtoH] 0.18% 7.4801ms 201 37.214us 35.520us 39.264us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_sigmoid_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1) 0.17% 6.7136ms 201 33.401us 32.800us 34.080us void nvinfer1::tiled_pooling::poolCHW_PQT<int=3, int=3, int=1, int=1, int=14, int=14, int=256, int=6, int=2, bool=1, nvinfer1::ITiledPooling::PoolingMode, bool=0>(nvinfer1::TiledPoolingParams) 0.10% 3.9531ms 201 19.667us 18.816us 20.576us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=2, int=2>, fused::KpqkPtrWriter<__half, 
int=2, int=1>, __half2, __half, int=8, int=2, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float) 0.07% 2.7159ms 201 13.511us 13.119us 14.336us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=4, int=1>, fused::KpqkPtrWriter<__half, int=1, int=1>, __half2, __half, int=2, int=7, int=4, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=1, int=2Type>, float) 0.06% 2.6248ms 201 13.058us 12.544us 13.600us void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<__half, int=2, int=2, int=2>, fused::KpqkPtrWriter<__half, int=2, int=1>, __half2, __half, int=4, int=4, int=8, int=1, int=1, int=1, int=1>(fused::ConvolutionParams<__halfSrcType, int=2, int=2Type>, float) 0.06% 2.2885ms 201 11.385us 11.104us 12.320us void cuEltwise::eltwise<cuEltwise::StripMineAlgo<__half, int>, cuEltwise::Compute<nvinfer1::ElementWiseOperation>>(cuEltwise::LaunchParams) 0.05% 2.1641ms 201 10.766us 10.400us 11.360us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=128, int=12, int=128, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=128, int=12, int=128, bool=0>*) 0.05% 2.0480ms 201 10.189us 7.5200us 16.160us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=4, int=512, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=4, int=512, bool=0>*) 0.04% 1.7993ms 201 8.9510us 8.5440us 9.4400us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=8, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=8, bool=0>*) 0.04% 1.6919ms 201 8.4170us 7.9040us 8.9600us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=512, int=4, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=512, int=512, int=4, bool=0>*) 0.04% 1.6261ms 804 2.0220us 1.5040us 3.2000us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_product_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1) 0.04% 1.6132ms 201 8.0260us 7.7440us 8.7040us void tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=4, bool=0>(unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3UsingTiles<unsigned int, int=256, int=256, int=4, bool=0>*) 0.04% 1.4325ms 804 1.7810us 1.1840us 3.4240us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_sum_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , 
Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1) 0.03% 1.1131ms 402 2.7680us 1.4400us 4.4160us void tensorflow::functor::SwapDimension1And2InTensor3Simple<unsigned int, bool=0>(int, unsigned int const *, tensorflow::functor::Dimension<int=3>, tensorflow::functor::SwapDimension1And2InTensor3Simple<unsigned int, bool=0>*) 0.03% 1.1125ms 804 1.3830us 960ns 2.4000us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=2, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorSlicingOp<Eigen::DSizes<long, int=2> const , Eigen::DSizes<long, int=2> const , Eigen::TensorMap<Eigen::Tensor<float const , int=2, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=2) 0.02% 836.58us 402 2.0810us 1.3440us 3.1360us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_exp_op<float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1) 0.02% 727.14us 402 1.8080us 1.1840us 2.8800us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseUnaryOp<Eigen::internal::scalar_right<float, float, Eigen::internal::scalar_product_op<float, float>>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, int>(float, int=1) 0.02% 700.10us 402 1.7410us 1.1840us 2.5600us void Eigen::internal::EigenMetaKernel<Eigen::TensorEvaluator<Eigen::TensorAssignOp<Eigen::TensorMap<Eigen::Tensor<float, int=1, int=1, int>, int=16, Eigen::MakePointer>, Eigen::TensorCwiseBinaryOp<Eigen::internal::scalar_difference_op<float, float>, Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, int>, int=16, Eigen::MakePointer> const , Eigen::TensorMap<Eigen::Tensor<float const , int=1, int=1, long>, int=16, Eigen::MakePointer> const > const > const , Eigen::GpuDevice>, long>(float, int=1) 0.00% 55.712us 19 2.9320us 960ns 9.6640us dit::computeOffsetsKernel(dit::ComputeOffsetsParams) 0.00% 5.6640us 9 629ns 160ns 1.5360us [CUDA memset] API calls: 35.55% 3.88662s 19 204.56ms 1.4720us 2.78165s cudaFree 35.48% 3.87971s 16 242.48ms 3.3920us 3.87934s cudaStreamCreateWithFlags 16.83% 1.84008s 63535 28.961us 22.048us 735.01us cudaLaunch 3.16% 345.50ms 11457 30.156us 22.272us 505.57us cudaLaunchKernel 2.22% 242.86ms 1 242.86ms 242.86ms 242.86ms cuDevicePrimaryCtxRetain 1.56% 170.91ms 813 210.22us 24.096us 650.46us cuMemcpyHtoDAsync 1.45% 158.48ms 3251 48.748us 28.672us 12.520ms cudaMemcpyAsync 1.13% 123.66ms 304735 405ns 287ns 569.50us cudaSetupArgument 0.53% 57.682ms 1005 57.395us 21.504us 300.80us cuMemcpyDtoHAsync 0.49% 53.481ms 29707 1.8000us 1.0560us 603.74us cuEventQuery 0.48% 53.030ms 3636 14.584us 1.5680us 215.94us cuEventRecord 0.35% 38.568ms 63535 607ns 384ns 521.41us cudaConfigureCall 0.30% 32.768ms 67555 485ns 288ns 487.07us cudaGetLastError 0.23% 24.655ms 54 456.57us 15.424us 14.183ms cudaMalloc 0.06% 6.7991ms 5 1.3598ms 171.52us 5.3080ms cuMemAlloc 0.06% 6.0980ms 1818 3.3540us 1.6960us 285.41us cuStreamWaitEvent 0.04% 3.9055ms 201 19.430us 
16.800us 46.559us cuCtxSynchronize 0.02% 1.8859ms 201 9.3820us 7.9040us 24.831us cudaEventRecord 0.01% 1.4191ms 2 709.55us 379.39us 1.0397ms cuMemHostAlloc 0.01% 1.3228ms 424 3.1190us 1.7280us 29.280us cudaEventCreateWithFlags 0.01% 549.86us 34 16.172us 8.0320us 28.672us cudaStreamSynchronize 0.00% 510.59us 4 127.65us 51.200us 200.03us cudaMemcpy 0.00% 460.51us 1 460.51us 460.51us 460.51us cudaFreeHost 0.00% 445.25us 8 55.655us 14.144us 173.22us cudaMemsetAsync 0.00% 397.18us 288 1.3790us 320ns 56.864us cuDeviceGetAttribute 0.00% 376.32us 2 188.16us 164.96us 211.36us cudaHostAlloc 0.00% 371.46us 15 24.763us 10.559us 76.288us cudaGetDeviceProperties 0.00% 343.10us 17 20.182us 11.551us 43.712us cudaCreateTextureObject 0.00% 221.82us 4 55.455us 30.720us 124.13us cuStreamCreate 0.00% 199.94us 8 24.992us 3.8080us 52.192us cudaStreamCreateWithPriority 0.00% 168.74us 1 168.74us 168.74us 168.74us cuMemsetD32 0.00% 132.29us 44 3.0060us 1.8880us 23.552us cudaEventDestroy 0.00% 97.184us 76 1.2780us 768ns 3.5520us cudaDeviceGetAttribute 0.00% 68.223us 12 5.6850us 4.2240us 14.400us cudaStreamDestroy 0.00% 65.088us 4 16.272us 5.4080us 31.712us cuDeviceTotalMem 0.00% 60.448us 20 3.0220us 896ns 8.2240us cudaGetDevice 0.00% 51.264us 2 25.632us 23.360us 27.904us cuMemGetInfo 0.00% 41.984us 9 4.6640us 1.9520us 14.368us cuCtxSetCurrent 0.00% 41.536us 1 41.536us 41.536us 41.536us cuDeviceGetProperties 0.00% 37.056us 2 18.528us 15.168us 21.888us cudaDeviceSynchronize 0.00% 34.111us 4 8.5270us 6.2400us 11.008us cudaThreadSynchronize 0.00% 16.448us 6 2.7410us 1.3440us 5.4720us cuEventCreate 0.00% 13.408us 17 788ns 576ns 1.8560us cudaCreateChannelDesc 0.00% 13.376us 3 4.4580us 896ns 10.176us cudaGetDeviceCount 0.00% 13.216us 2 6.6080us 6.5920us 6.6240us cudaHostGetDevicePointer 0.00% 11.904us 2 5.9520us 5.0240us 6.8800us cudaSetDevice 0.00% 10.848us 9 1.2050us 576ns 2.3360us cuDeviceGetCount 0.00% 10.304us 3 3.4340us 2.5600us 4.1920us cuInit 0.00% 6.6880us 4 1.6720us 1.2800us 2.1120us cuDeviceGetName 0.00% 6.6560us 4 1.6640us 1.2160us 2.3680us cuDriverGetVersion 0.00% 6.3680us 5 1.2730us 640ns 2.3040us cuDeviceGet 0.00% 5.9520us 2 2.9760us 2.6560us 3.2960us cudaDeviceGetStreamPriorityRange 0.00% 4.4160us 1 4.4160us 4.4160us 4.4160us cuDeviceGetPCIBusId 0.00% 2.6560us 1 2.6560us 2.6560us 2.6560us cuDeviceComputeCapability 0.00% 1.1520us 1 1.1520us 1.1520us 1.1520us cuDevicePrimaryCtxGetState 0.00% 448ns 1 448ns 448ns 448ns cuCtxGetCurrent 

I wrote a blog post about my experience using the NVIDIA-Jetson/tf_trt_models code. I also shared a script showing how to do real-time object detection with various cameras or file inputs. Feel free to check it out, and do let me know if you have suggestions about the code. I’ll update my blog post and my GitHub repo as needed.

[url]https://jkjung-avt.github.io/tf-trt-models/[/url]
[url]https://github.com/jkjung-avt/tf_trt_models[/url]

Awesome! Thanks for sharing this jkjung13.

As for the performance discrepancy / low GPU utilization: this may have to do with how the object detection post-processing pipeline is configured.

It seems that the default box score threshold for the non-maximum suppression stage is 1e-8, which essentially considers every box a detection. This may result in unnecessary box-to-box comparisons and a heavier CPU load. This parameter may be found here:

models/ssd_mobilenet_v1_coco.config at 17fa52864bfc7a7444a8b921d8a8eb1669e14ebd · tensorflow/models · GitHub

I believe the benchmarks in tf_trt_models were collected using a threshold of 0.3. Are your models using a very low threshold? If so, could you try raising it to something larger (say, above 0.1) and report the performance?
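
For reference, the relevant block of the pipeline config looks like this (the surrounding values are the stock settings from the sample config linked above; only score_threshold is changed from its 1e-8 default):

post_processing {
  batch_non_max_suppression {
    score_threshold: 0.3        # raised from 1e-8
    iou_threshold: 0.6
    max_detections_per_class: 100
    max_total_detections: 100
  }
  score_converter: SIGMOID
}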

Thanks!

Wow, that matters a lot! With a threshold of 0.3, I get a running time of 41.3 ms using TensorFlow 1.10 and TensorRT 4 for the ssd_inception_v2 model, which is a lot faster than your reported time (maybe because I use a different image, so the NMS has even fewer boxes to compare?). Anyway, thanks, I consider this solved :)

With the official TensorFlow 1.9 I get 113 ms now; I don’t really know what’s wrong, but it seems the graph optimization doesn’t work at all anymore. It doesn’t really matter, probably just some conflicting versions of TensorRT and TensorFlow on my side…

@frederiki3k63, are you using JetPack-3.3 (with TensorRT 4.0 GA) on Jetson TX2? And which tensorflow-1.10 wheel did you use?

I’m stuck with the error: “E tensorflow/contrib/tensorrt/log/trt_logger.cc:38] DefaultLogger cudnnFusedConvActLayer.cpp (64) - Cuda Error in createFilterTextureFused: 11” when I test with JetPack-3.3 and the “TF-1.10.1 for JetPack3.3” wheel ([url]https://nvidia.app.box.com/v/TF1101-Py35-wTRT[/url]) linked from here: [url]https://devtalk.nvidia.com/default/topic/1031300/jetson-tx2/tensorflow-1-8-wheel-with-jetpack-3-2-/[/url]

I used that wheel too. I use JetPack 3.2.1 now, but according to the JetPack website, JetPack 3.2.1 is the same as JetPack 3.3 apart from newer TensorRT and cuDNN versions, and I installed the JetPack 3.3 versions of those (tensorrt_4.0.2.0-1+cuda9.0_arm64.deb and libcudnn7-dev_7.1.5.14-1+cuda9.0_arm64.deb, which I think is TensorRT 4.0 GA).

Weird that it doesn’t work for you. The installation of the TensorFlow 1.10 wheel was a bit of a hassle for me; I couldn’t seem to compile h5py (a dependency of keras, which is a dependency of tensorflow), so I skipped it using pip --no-deps. And as reported earlier (and same as you report on your blog), parsing a network from string takes ~10 minutes with this version.

Thanks for your blog, by the way. I enjoy your clearly written articles; they have helped me a lot in the past!

@frederiki3k63, thanks for your kind words.

Today I fell back to JetPack-3.2.1 (TensorRT 3.0 GA) and tested my scripts against the tensorflow 1.8.0 wheel (Box) as specified in tf_trt_models/README.md at master · NVIDIA-AI-IOT/tf_trt_models · GitHub. It indeed worked better! After setting score_threshold to 0.3, I was able to get ssd_mobilenet_v1_coco to do real-time object detection at ~20 fps, just as advertised by NVIDIA. In addition, the trt optimization process ran much faster (it only took 1-2 minutes) under this configuration.

I’m going to experiment more and try finding a way to make it work equally well on JetPack-3.3.

Otherwise, it’d be ideal if NVIDIA folks could rebuild the tensorflow wheels and verify the tf_trt_models code against JetPack-3.3.

I confirmed that the slowness on Jetson TX2 (loading SSD models, optimizing the model with TensorRT, loading the optimized graph, etc.) has a lot to do with the version of tensorflow. My guess is that some recent changes in tensorflow do not work that well on the aarch64 architecture.

Based on my testing, TF-TRT works great with tensorflow 1.8.0. I’ve tested it on both JetPack-3.2.1 and JetPack-3.3.

For more details, please read my blog post and the README.md in my GitHub repo.

[url]https://jkjung-avt.github.io/tf-trt-models/[/url]
[url]https://github.com/jkjung-avt/tf_trt_models[/url]