How to capture and decode an MJPEG live stream "directly" on an RTX 3080 Ti GPU instead of the CPU?

Hello,

So here is my PC:
Windows 10.
GPU: RTX 3080 TI
CUDA 12.6
cuDNN 9.6
NVIDIA Video Codec SDK 12.2.72
Programming language: C++17, MSVC (Visual Studio 17 2022)

Before I ask my questions: please don't give the easy answer of "just use OpenCV." My image-processing requirements are simple and can be handled with proper coding; latency also matters here.

Anyhow, I have an MJPEG live stream from a webcam.
It works perfectly when I test it with ffplay/ffmpeg from the CLI.

Here is the live input MJPEG webcam stream details:
Stream #0:0: Video: mjpeg (Baseline) (MJPG / 0x47504A4D), yuvj422p(pc, bt470bg/bt709/unknown)

My question is: how can I decode this stream directly on the GPU rather than on the CPU? The reason is obvious: reduce latency and avoid unnecessary CPU utilization.

I assume the CPU would still need to capture the compressed frames and pass them to the GPU? Is that possible?
Then the GPU decodes them and passes the result back to the CPU to store to a file? Is that possible too?
I know the GPU can also do DMA, but that seems more involved, so I'll venture into it later once I've figured out the fundamentals.

I want to offload the CPU as much as possible. I'll have about a dozen cameras all running in parallel and need real-time decoding.

I also checked the NVIDIA Video Codec SDK 12.2.72 sample (NvDecoder.cpp and its header), and neither says anything useful about MJPEG; the MJPEG entries are simply listed with empty parameter values, which makes no sense to me.
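For reference, cuviddec.h does at least list a cudaVideoCodec_JPEG enum, so the NVDEC path can in principle be queried before going further. Below is a rough, untested sketch of such a capability check, assuming the Video Codec SDK headers and linking against cuda.lib / nvcuvid.lib; the field names are taken from CUVIDDECODECAPS as I understand them:

```cpp
// Query whether this GPU's NVDEC engine can decode (M)JPEG at all.
// Assumes CUDA driver API + Video Codec SDK headers (cuda.h, nvcuvid.h).
#include <cuda.h>
#include <nvcuvid.h>
#include <cstdio>

int main() {
    cuInit(0);
    CUdevice dev = 0;
    CUcontext ctx = nullptr;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);          // cuvidGetDecoderCaps needs a current context

    CUVIDDECODECAPS caps{};
    caps.eCodecType      = cudaVideoCodec_JPEG;        // MJPEG is per-frame JPEG
    caps.eChromaFormat   = cudaVideoChromaFormat_422;  // matches the yuvj422p input
    caps.nBitDepthMinus8 = 0;

    if (cuvidGetDecoderCaps(&caps) == CUDA_SUCCESS) {
        std::printf("JPEG 4:2:2 supported: %d, max %ux%u\n",
                    caps.bIsSupported, caps.nMaxWidth, caps.nMaxHeight);
    }

    cuCtxDestroy(ctx);
    return 0;
}
```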

Any advice would be much appreciated. Thanks.

Edit:

Alright, after doing some thorough research on this rarely covered and complex topic of 100% GPU JPEG decoding, I've answered my own question.

It seems nvJPEG decodes MJPEG in a hybrid fashion (CPU + GPU), not entirely on the GPU. But why?

After more research: the reason is the Huffman entropy coding used in JPEG. It is inherently serial (you can't know where one symbol ends until you've decoded the previous one), so it doesn't map well to GPU computation, and that stage is handled on the CPU because it's computationally more efficient there.
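In practice the hybrid path still keeps nearly all of the heavy work (IDCT, chroma upsampling, color conversion) on the GPU; the CPU only touches the entropy-coded bits. Here is roughly what a per-frame decode looks like with nvJPEG's GPU_HYBRID backend, as a minimal sketch with error handling stripped; the output format choice and buffer handling are my own assumptions:

```cpp
// Minimal nvJPEG decode of a single MJPEG frame (one JPEG bitstream).
// Assumes the CUDA Toolkit with nvjpeg.h, linked against nvjpeg and cudart.
#include <nvjpeg.h>
#include <cuda_runtime.h>
#include <vector>

bool decodeFrame(const unsigned char* jpegData, size_t jpegSize,
                 std::vector<unsigned char>& hostRgb, int& width, int& height)
{
    nvjpegHandle_t handle;
    nvjpegJpegState_t state;
    // GPU_HYBRID: Huffman entropy decode on the CPU, IDCT + upsample + color on the GPU.
    nvjpegCreateEx(NVJPEG_BACKEND_GPU_HYBRID, nullptr, nullptr, 0, &handle);
    nvjpegJpegStateCreate(handle, &state);

    // Parse the header on the host to learn the geometry (no pixel decode yet).
    int nComp = 0, widths[NVJPEG_MAX_COMPONENT], heights[NVJPEG_MAX_COMPONENT];
    nvjpegChromaSubsampling_t subsampling;
    nvjpegGetImageInfo(handle, jpegData, jpegSize, &nComp, &subsampling, widths, heights);
    width = widths[0];
    height = heights[0];

    // One interleaved RGB surface in device memory.
    nvjpegImage_t out{};
    out.pitch[0] = static_cast<size_t>(width) * 3;
    cudaMalloc(reinterpret_cast<void**>(&out.channel[0]), out.pitch[0] * height);

    // Decode: the bitstream goes host -> device internally, pixels land in device memory.
    nvjpegStatus_t st = nvjpegDecode(handle, state, jpegData, jpegSize,
                                     NVJPEG_OUTPUT_RGBI, &out, /*stream=*/0);

    // Copy back only because I want to write the frame to a file; skip this
    // if the frame stays on the GPU for further processing.
    hostRgb.resize(out.pitch[0] * height);
    cudaMemcpy(hostRgb.data(), out.channel[0], hostRgb.size(), cudaMemcpyDeviceToHost);

    cudaFree(out.channel[0]);
    nvjpegJpegStateDestroy(state);
    nvjpegDestroy(handle);
    return st == NVJPEG_STATUS_SUCCESS;
}
```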

However, there does seem to be a paper showing that Huffman decoding can be done in CUDA:

Anyhow, I have no choice but to use what's currently available, namely the nvJPEG hybrid path. Someday I'll learn the Huffman algorithm, read the paper, and build my own 100% CUDA JPEG decoder with the help of advanced AI.
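One more note for the "dozen cameras in parallel" part of my original question: nvJPEG also exposes a batched API, so instead of a dozen independent decode calls you can hand it one frame per camera and let it spread the CPU Huffman stage across threads while the GPU stages run back to back. A rough sketch under the same caveats as above; the batch size, CPU thread count, and buffer setup are assumptions:

```cpp
// Batched nvJPEG decode: one JPEG frame per camera, decoded in a single call.
// Assumes handle/state were created with nvjpegCreateEx as in the sketch above
// and that each dst[i] already points at a device RGB buffer sized for camera i.
#include <nvjpeg.h>
#include <cuda_runtime.h>
#include <vector>

void decodeAllCameras(nvjpegHandle_t handle, nvjpegJpegState_t state,
                      const std::vector<const unsigned char*>& bitstreams,
                      const std::vector<size_t>& lengths,
                      std::vector<nvjpegImage_t>& dst,
                      cudaStream_t stream)
{
    const int batchSize  = static_cast<int>(bitstreams.size()); // e.g. 12 cameras
    const int cpuThreads = 4;  // threads for the host-side Huffman stage (tune this)

    nvjpegDecodeBatchedInitialize(handle, state, batchSize, cpuThreads,
                                  NVJPEG_OUTPUT_RGBI);
    nvjpegDecodeBatched(handle, state, bitstreams.data(), lengths.data(),
                        dst.data(), stream);
    // Decoded frames are now in device memory; synchronize the stream
    // before reading them back on the host.
}
```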