GVProf: A Value Profiler for GPU-based Clusters
The GPU Optimizer for ML Models improves GPU performance for machine learning workloads. It offers advanced scheduling, real-time monitoring, and efficient resource management through a user-friendly web interface and a robust API, and integrates big data technologies for data processing and model optimization. @NVIDIA
🤖 Ollama Consumer - A Python-based interactive chat interface for Ollama models with advanced model management, comprehensive benchmarking, vision support, and automatic error recovery. Features dynamic model switching, GPU optimization, and intelligent service monitoring for seamless AI model interactions.
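As a rough illustration of the kind of interactive loop a consumer like the one above wraps, here is a minimal sketch using the official `ollama` Python client; the model name, exit commands, and error handling are illustrative assumptions, not taken from that repository.

```python
# Minimal interactive chat loop against a local Ollama server.
# Assumes the `ollama` Python client (pip install ollama) and that a model
# such as "llama3" has already been pulled; names here are illustrative.
import ollama

def chat_loop(model: str = "llama3") -> None:
    history = []  # running conversation for multi-turn context
    while True:
        user_text = input("you> ").strip()
        if user_text in {"exit", "quit"}:
            break
        history.append({"role": "user", "content": user_text})
        try:
            reply = ollama.chat(model=model, messages=history)
        except ollama.ResponseError as err:
            # e.g. model not found, or the Ollama service is not running
            print(f"[error] {err}")
            continue
        answer = reply["message"]["content"]
        history.append({"role": "assistant", "content": answer})
        print(f"{model}> {answer}")

if __name__ == "__main__":
    chat_loop()
```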
AI Infrastructure Senior Engineer Learning Track - Advanced ML infrastructure and technical leadership
Optimizing PyTorch Model Training by Wrapping Memory Mapped Tensors on Nvidia GPUs with TensorDict.
Hybrid AI routing: LOCAL Ollama + CLOUD GitHub Copilot
Optimizing PyTorch Model Training by Wrapping Memory Mapped Tensors on an Nvidia GPU with TensorDict.
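For the two TensorDict entries above, a minimal sketch of the memory-mapping pattern they describe: stage a dataset as memory-mapped tensors on disk, then stream slices to the GPU. Shapes, paths, and the batch size are illustrative assumptions, not taken from either repository.

```python
# Stage a dataset as memory-mapped tensors with TensorDict and stream
# mini-batches to the GPU. Sizes and paths are illustrative.
import torch
from tensordict import TensorDict

# Build a TensorDict holding the dataset, then back it with memmap files on
# disk so it no longer has to fit in RAM.
data = TensorDict(
    {
        "images": torch.randn(10_000, 3, 64, 64),
        "labels": torch.randint(0, 10, (10_000,)),
    },
    batch_size=[10_000],
)
data.memmap_("/tmp/dataset_memmap")  # writes memory-mapped files in place

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 256
for start in range(0, data.batch_size[0], batch_size):
    # Indexing a memmapped TensorDict reads only the requested slice from disk;
    # non_blocking=True lets the host-to-device copy overlap with compute.
    batch = data[start : start + batch_size].to(device, non_blocking=True)
    images, labels = batch["images"], batch["labels"]
    # ... forward/backward pass on `images`, `labels` ...
```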
Quantitative dataset of 119 neural architectures (2017-2025) scored on hardware compatibility and ecosystem friction. Validates the Transformer Attractor thesis.
Optimized LSTM-based character-level text generator trained on Shakespeare, achieving a 3.5x training speedup with mixed precision.
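A minimal sketch of the mixed-precision training step such a speedup typically comes from, using PyTorch's `torch.cuda.amp`; the model architecture and hyperparameters are illustrative, not the repository's.

```python
# Mixed-precision training step for a character-level LSTM (illustrative sizes).
import torch
import torch.nn as nn

device = torch.device("cuda")
vocab_size, hidden = 65, 512  # the Shakespeare char vocabulary is ~65 symbols

class CharLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.head(h)

model = CharLSTM().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid fp16 underflow

def train_step(inputs, targets):
    optimizer.zero_grad(set_to_none=True)
    # autocast runs matmuls in fp16 while keeping reductions/loss in fp32
    with torch.cuda.amp.autocast():
        logits = model(inputs)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, vocab_size), targets.reshape(-1)
        )
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```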
LM Multi-Bin Dynamic Scheduler Simulator - Implementation combining Multi-Bin batching with SLA-constrained dynamic batching
GPU-Optimized AI for Geospatial Annotation and Visual Search: accelerating geospatial intelligence through distillation, segmentation, and GPU optimization.
High-performance CUDA implementation of LayerNorm for PyTorch achieving 1.46x speedup through kernel fusion. Optimized for large language models (4K-8K hidden dims) with vectorized memory access, warp-level primitives, and mixed precision support. Drop-in replacement for nn.LayerNorm with 25% memory reduction.
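The project above implements its kernel in CUDA; as a generic illustration of the fusion idea it describes (mean, variance, normalization, and the affine transform computed in a single pass per row), here is a Triton forward kernel. This is a sketch, not the repository's implementation, and assumes a contiguous row-major input.

```python
# Illustrative Triton kernel fusing mean, variance, normalization and the
# affine scale/shift into one pass per row (not the CUDA kernel above).
import torch
import triton
import triton.language as tl

@triton.jit
def _layernorm_fwd(X, W, B, Y, stride, N, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(W + cols, mask=mask, other=1.0).to(tl.float32)
    b = tl.load(B + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(Y + row * stride + cols, (x - mean) * rstd * w + b, mask=mask)

def layernorm(x: torch.Tensor, weight: torch.Tensor, bias: torch.Tensor, eps=1e-5):
    # x: (rows, N) contiguous; one Triton program instance per row
    rows, N = x.shape
    y = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(N)
    _layernorm_fwd[(rows,)](x, weight, bias, y, x.stride(0), N, eps, BLOCK=BLOCK)
    return y
```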
A high-performance kernel implementation of multi-head attention using Triton, focused on minimizing memory overhead and maximizing throughput for large-scale transformer layers. Includes clean tensor layouts, head-grouping optimizations, and ready-to-benchmark code you can plug into custom models.
High-performance CNN for CIFAR-10 classification with GPU optimization, achieving 88.82% accuracy through systematic hyperparameter tuning
LLM-guided CUDA kernel generation framework with correctness validation and roofline analysis
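For the roofline analysis mentioned above, the bound is simply attainable FLOP/s = min(peak compute, arithmetic intensity × memory bandwidth). Below is a toy helper showing that calculation; the peak numbers are rough A100 SXM figures (fp16 tensor-core peak, HBM bandwidth) assumed for illustration.

```python
# Tiny roofline helper: attainable throughput is capped by peak compute and by
# arithmetic intensity times memory bandwidth. Peak values are rough A100 figures.
def roofline_flops(flops: float, bytes_moved: float,
                   peak_flops: float = 312e12, peak_bw: float = 1.555e12) -> dict:
    intensity = flops / bytes_moved            # FLOPs per byte of DRAM traffic
    attainable = min(peak_flops, intensity * peak_bw)
    return {
        "arithmetic_intensity": intensity,
        "attainable_flops": attainable,
        "bound": "compute" if attainable == peak_flops else "memory",
    }

# Example: an fp16 elementwise add of two 1M-element vectors (1 FLOP per element,
# 3 * 2 bytes moved per element) is heavily memory-bound.
print(roofline_flops(flops=1e6, bytes_moved=6e6))
```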
A no-cost infrastructure benchmark measuring the VRAM and throughput impact of NF4 (4-bit) quantization on LLMs
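A sketch of how such a VRAM comparison between bf16 and NF4 loading can be run with `transformers` + `bitsandbytes`; the model id is a placeholder and the measurement covers weight loading only, not inference throughput.

```python
# Compare peak VRAM for bf16 vs. NF4 (4-bit) weight loading. Model id is illustrative.
import gc
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B"  # placeholder; any causal LM on the Hub works

def peak_vram_gb(use_nf4: bool) -> float:
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
    if use_nf4:
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
            bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
        )
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **kwargs)
    peak = torch.cuda.max_memory_allocated() / 1e9
    del model
    return peak

print(f"bf16 weights : {peak_vram_gb(False):.1f} GB")
print(f"nf4 weights  : {peak_vram_gb(True):.1f} GB")
```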
High-performance CUDA implementation of Muon optimizer for LLM training. Features Newton-Schulz polar decomposition, cuBLAS acceleration, and transpose optimization for 8x FLOP savings on transformer FFN layers. Benchmarked on NVIDIA A100 with Llama 3.1 8B architectures (4096×11008 weights).
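To illustrate the polar-decomposition step named above, here is the textbook cubic Newton-Schulz iteration that drives a semi-orthogonal factor out of a weight or gradient matrix. It is a plain PyTorch sketch: production Muon implementations use a tuned higher-order polynomial, and the repository above fuses the GEMMs with cuBLAS; only the FFN weight shape is taken from the entry.

```python
# Textbook cubic Newton-Schulz iteration approximating the semi-orthogonal
# polar factor of a matrix (the orthogonalization step in Muon-style updates).
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 15) -> torch.Tensor:
    """Approximate the semi-orthogonal polar factor of a 2-D matrix g."""
    x = g / (g.norm() + 1e-7)      # Frobenius normalization keeps singular values <= 1
    transpose = x.shape[0] > x.shape[1]
    if transpose:                  # iterate on the wide orientation so x @ x.T stays small
        x = x.T
    eye = torch.eye(x.shape[0], device=x.device, dtype=x.dtype)
    for _ in range(steps):
        # X <- 0.5 * (3I - X X^T) X  pushes every singular value toward 1
        x = 0.5 * (3.0 * eye - x @ x.T) @ x
    return x.T if transpose else x

# Example on a transformer-FFN-shaped weight (4096 x 11008).
w = torch.randn(4096, 11008, device="cuda" if torch.cuda.is_available() else "cpu")
u = newton_schulz_orthogonalize(w)
# Deviation from orthogonality shrinks as `steps` grows.
print((u @ u.T - torch.eye(4096, device=u.device)).abs().max())
```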
🚀 Achieve rapid training of NanoGPT (GPT-2 124M) on a single RTX 4090, targeting a validation loss below 3.28 with FineWeb-Edu data.