Batch normalization fusion for PyTorch. This is an archived repository and is no longer maintained. (Python · updated Apr 6, 2020)
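The arithmetic behind batch-norm fusion can be sketched in NumPy: a BatchNorm layer's per-channel scale and shift fold exactly into the preceding linear (or conv) layer's weights and bias at inference time. This is a generic illustration of the technique, not code from the repository above.

```python
import numpy as np

def fuse_linear_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm statistics into the preceding linear layer.
    W: (out, in) weight, b: (out,) bias; gamma/beta/mean/var: (out,)."""
    scale = gamma / np.sqrt(var + eps)      # per-channel BN scale
    W_fused = W * scale[:, None]            # scale each output row of W
    b_fused = (b - mean) * scale + beta     # fold the BN shift into the bias
    return W_fused, b_fused

# Sanity check: the fused layer must reproduce linear -> BN exactly.
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mean, var = rng.standard_normal(4), rng.random(4) + 0.1
x = rng.standard_normal(3)

bn_out = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
Wf, bf = fuse_linear_bn(W, b, gamma, beta, mean, var)
assert np.allclose(bn_out, Wf @ x + bf)
```

PyTorch ships a built-in version of this fold for conv layers (`torch.nn.utils.fusion.fuse_conv_bn_eval`); the sketch above shows why the fusion is lossless in eval mode.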
Optimize the layer structure of a Keras model to reduce computation time
A set of tools to make working with TensorRT and ONNX Runtime easier. This repo is designed for YOLOv3.
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
The blog, read report and code example for AGI/LLM related knowledge.
Optimizing Monocular Depth Estimation with TensorRT: Model Conversion, Inference Acceleration, and 3D Reconstruction
Faster inference YOLOv8: Optimize and export YOLOv8 models for faster inference using OpenVINO and Numpy 🔢
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
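The general idea of score-based sparse attention masking can be shown in a toy NumPy form: keep only the top-k keys per query and renormalize. DAM's actual per-layer, per-head adaptive mask construction is more involved; this is only an illustrative sketch.

```python
import numpy as np

def topk_attention_mask(scores, k):
    """Boolean mask keeping only the k highest-scoring keys per query
    (toy illustration of score-based sparse attention)."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def sparse_softmax(scores, mask):
    """Softmax over unmasked positions only; masked entries get prob 0."""
    s = np.where(mask, scores, -np.inf)
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
scores = rng.standard_normal((2, 8, 8))   # (heads, queries, keys)
mask = topk_attention_mask(scores, k=3)
probs = sparse_softmax(scores, mask)
```

Because only k of the key columns survive per query, the value aggregation that follows touches a fixed number of positions regardless of context length, which is where the compute and memory savings come from.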
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
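The core loop of greedy speculative decoding can be sketched in pure Python with stand-in models (`draft` and `target` here are hypothetical functions mapping a token sequence to its next token, not any repository's API). A cheap draft model proposes k tokens; the target model verifies them and accepts every token it agrees with, so its output is identical to plain greedy decoding.

```python
def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding sketch. In a real system the k
    verification steps are one batched target forward pass, which is
    where the speedup comes from; output matches plain greedy decoding."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_new:
        proposal = []
        for _ in range(k):                    # cheap autoregressive drafting
            proposal.append(draft(seq + proposal))
        for t in proposal:
            expected = target(seq)            # verify one position
            seq.append(expected)              # emitted token always matches target
            if t != expected or len(seq) - len(prompt) >= n_new:
                break                         # stop at the first draft mistake
    return seq[len(prompt):]

# Toy models: target counts up mod 10; the draft goes wrong after a 4.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 0 if s[-1] == 4 else (s[-1] + 1) % 10
out = speculative_decode(target, draft, [0], n_new=8)
assert out == [1, 2, 3, 4, 5, 6, 7, 8]  # identical to plain greedy decoding
```

Sampling-based speculative decoding adds an acceptance test on the draft and target probabilities rather than exact token agreement, but the propose-then-verify structure is the same.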
Official implementation of "SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching" (COLM 2025). A novel KV cache compression method that organizes cache at sentence level using semantic similarity.
LLM-Rank: A graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the related paper.
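Weighted PageRank centrality itself is a short power iteration; a pruning method in this family scores units by their centrality in a weight graph and removes the lowest-scoring ones. The sketch below uses |weight| magnitudes as edge weights, which is an illustrative choice, not necessarily the paper's exact construction.

```python
import numpy as np

def weighted_pagerank(A, d=0.85, iters=100):
    """Centrality scores via power iteration on a nonnegative weighted
    adjacency matrix A, where A[i, j] is the edge weight from i to j."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    # Row-normalize; dangling nodes (no outgoing weight) spread uniformly.
    P = np.where(out > 0, A / np.maximum(out, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)     # standard PageRank update
    return r

# Units attracting the most incoming weight score highest; prune the rest.
A = np.array([[0.0, 1.0, 2.0],
              [0.0, 0.0, 3.0],
              [0.0, 0.0, 0.0]])
scores = weighted_pagerank(A)
prune_order = np.argsort(scores)            # weakest units first
```

Here node 2 receives the most incoming weight and ranks highest, so under this criterion it would be pruned last.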
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
Batch Partitioning for Multi-PE Inference with TVM (2020)
MLP-Rank: A graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the related thesis.
Multimodal-OCR3 is an advanced Optical Character Recognition (OCR) application that leverages multiple state-of-the-art multimodal models to extract text from images.
Interface for TensorRT engine inference, along with an example using a YOLOv4 engine.
MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize per-image results.
Leveraging torch.compile to accelerate cross-encoder inference
This repo integrates DyCoke's token compression method with VLMs such as Gemma3 and InternVL3.