Batch normalization fusion for PyTorch. This is an archived repository and is no longer maintained. (Python · updated Apr 6, 2020)
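The arithmetic behind batch-norm fusion can be sketched in NumPy: a BatchNorm layer's per-channel scale and shift fold exactly into the preceding linear (or conv) layer's weights and bias at inference time. This is a generic illustration of the technique, not code from the repository above.

```python
import numpy as np

def fuse_linear_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding linear layer.
    W: (out, in) weight, b: (out,) bias; gamma/beta/mean/var: (out,)."""
    scale = gamma / np.sqrt(var + eps)      # per-channel BN scale
    W_fused = W * scale[:, None]            # scale each output row of W
    b_fused = (b - mean) * scale + beta     # fold the BN shift into the bias
    return W_fused, b_fused

# Sanity check: the fused layer must reproduce linear -> BN exactly.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.1
x = rng.standard_normal(3)

bn_out = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
Wf, bf = fuse_linear_bn(W, b, gamma, beta, mean, var)
assert np.allclose(bn_out, Wf @ x + bf)
```

PyTorch ships a built-in version of this fold for conv layers (`torch.nn.utils.fusion.fuse_conv_bn_eval`); the sketch above shows why the fusion is lossless in eval mode.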
Optimize the layer structure of a Keras model to reduce computation time
A set of tools to make working with TensorRT and ONNX Runtime easier. This repo is designed for YOLOv3.
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
The blog, read report and code example for AGI/LLM related knowledge.
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
Faster inference YOLOv8: Optimize and export YOLOv8 models for faster inference using OpenVINO and Numpy 🔢
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
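The general idea of score-based sparse attention masking can be shown in a toy NumPy form: keep only the top-k keys per query and renormalize. DAM's actual per-layer, per-head adaptive mask construction is more involved; this is only an illustrative sketch.

```python
import numpy as np

def topk_attention_mask(scores, k):
    """Boolean mask keeping only the k highest-scoring keys per query
    (toy illustration of score-based sparse attention)."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def sparse_softmax(scores, mask):
    """Softmax over unmasked positions only; masked entries get prob 0."""
    s = np.where(mask, scores, -np.inf)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
scores = rng.standard_normal((2, 8, 8))   # (heads, queries, keys)
mask = topk_attention_mask(scores, k=3)
probs = sparse_softmax(scores, mask)
```

Because only k of the key columns survive per query, the value aggregation that follows touches a fixed number of positions regardless of context length, which is where the compute and memory savings come from.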
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
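The core loop of greedy speculative decoding can be sketched in pure Python with stand-in models (`draft` and `target` here are hypothetical functions mapping a token sequence to its next token, not any repository's API). A cheap draft model proposes k tokens; the target model verifies them and accepts every token it agrees with, so its output is identical to plain greedy decoding.

```python
def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding sketch. In a real system the k
    verification steps are one batched target forward pass, which is
    where the speedup comes from; output matches plain greedy decoding."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_new:
        proposal = []
        for _ in range(k):                    # cheap autoregressive drafting
            proposal.append(draft(seq + proposal))
        for t in proposal:
            expected = target(seq)            # verify one position
            seq.append(expected)              # emitted token always matches target
            if t != expected or len(seq) - len(prompt) >= n_new:
                break                         # stop at the first draft mistake
    return seq[len(prompt):]

# Toy models: target counts up mod 10; the draft goes wrong after a 4.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 0 if s[-1] == 4 else (s[-1] + 1) % 10
out = speculative_decode(target, draft, [0], n_new=8)
assert out == [1, 2, 3, 4, 5, 6, 7, 8]  # identical to plain greedy decoding
```

Sampling-based speculative decoding adds an acceptance test on the draft and target probabilities rather than exact token agreement, but the propose-then-verify structure is the same.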
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
LLM-Rank: A graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the related paper.
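Weighted PageRank centrality itself is a short power iteration; a pruning method in this family scores units by their centrality in a weight graph and removes the lowest-scoring ones. The sketch below uses |weight| magnitudes as edge weights, which is an illustrative choice, not necessarily the paper's exact construction.

```python
import numpy as np

def weighted_pagerank(A, d=0.85, iters=100):
    """Centrality scores via power iteration on a nonnegative weighted
    adjacency matrix A, where A[i, j] is the edge weight from i to j."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    # Row-normalize; dangling nodes (no outgoing weight) spread uniformly.
    P = np.where(out > 0, A / np.maximum(out, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)     # standard PageRank update
    return r

# Units attracting the most incoming weight score highest; prune the rest.
A = np.array([[0.0, 1.0, 2.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 0.0]])
scores = weighted_pagerank(A)
prune_order = np.argsort(scores)            # weakest units first
```

Here node 2 receives the most incoming weight and ranks highest, so under this criterion it would be pruned last.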
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
Batch Partitioning for Multi-PE Inference with TVM (2020)
MLP-Rank: A graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the related thesis.
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.
Interface for TensorRT engine inference, along with an example using a YOLOv4 engine.
MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize per-image results.
Leveraging torch.compile to accelerate cross-encoder inference
This repo integrates DyCoke's token compression method with VLMs such as Gemma3 and InternVL3.