Hello.
I am using TensorRT 2.1 and want to implement a simple custom layer. To practice, I wanted to make an “Inc” layer (just adding 1.0 to an input tensor's values and keeping the dimensions the same).
I kept everything almost the same as the “class Reshape : public IPlugin” in sampleFasterRCNN.cpp, except “getOutputDimensions()”, which keeps the same dimensions. (this seems fine.)
Where should I implement the “adding 1.0” part? I guess it should be in “enqueue()”. So, I tried
int enqueue(int batchSize, const void*const *inputs, void** outputs, void*, cudaStream_t stream) override { # the below is from the Reshape class. seems to copy from input to output CHECK(cudaMemcpyAsync(outputs[0], inputs[0], mCopySize * batchSize, cudaMemcpyDeviceToDevice, stream)); # add 1.0 to first ten values float* foutputs = (float*) outputs[0]; int i; for (i = 0; i < 10; i++) foutputs[i] += 1.0; return 0; } However, this part results in a “segmentation fault” error.
My questions are:
1) where and how can I implement some calculation between input and output (addition in this case)?
2) Can you provide a simple example?
** Just in case, I post the full code of this example API here. (almost the same as the Reshape class)
class Inc : public IPlugin { public: Inc() {} Inc(const void* buffer, size_t size) { assert(size == sizeof(mCopySize)); mCopySize = *reinterpret_cast<const size_t*>(buffer); } int getNbOutputs() const override { return 1; } Dims getOutputDimensions(int index, const Dims* inputs, int nbInputDims) override { assert(nbInputDims == 1); assert(index == 0); assert(inputs[index].nbDims == 3); return DimsCHW(inputs[0].d[0], inputs[0].d[1], inputs[0].d[2]); // same dimension } int initialize() override { return 0; } void terminate() override { } size_t getWorkspaceSize(int) const override { return 0; } // currently it is not possible for a plugin to execute "in place". Therefore we memcpy the data from the input to the output buffer int enqueue(int batchSize, const void*const *inputs, void** outputs, void*, cudaStream_t stream) override { CHECK(cudaMemcpyAsync(outputs[0], inputs[0], mCopySize * batchSize, cudaMemcpyDeviceToDevice, stream)); float* foutputs = (float*) outputs[0]; int i; for (i = 0; i < 10; i++) foutputs[i] += 1.0; return 0; } size_t getSerializationSize() override { return sizeof(mCopySize); } void serialize(void* buffer) override { *reinterpret_cast<size_t*>(buffer) = mCopySize; } void configure(const Dims*inputs, int nbInputs, const Dims* outputs, int nbOutputs, int) override { mCopySize = inputs[0].d[0] * inputs[0].d[1] * inputs[0].d[2] * sizeof(float); } protected: size_t mCopySize; };