GPU Programming Insights

Explore top LinkedIn content from expert professionals.

  • Yangqing Jia

    Co-founder & CEO of Lepton AI (now part of NVIDIA). Hiring top talent.

    9,372 followers

    People often ask how prices like $2.8 per million tokens for Llama 405B, while being super fast, are still profitable at Lepton AI. We've even been asked by a leading GPU provider! So I figured we should share some technical analysis that could benefit the community. We've taken these statistics and analyses for granted, but they might not be obvious to everyone.

    1. Big batches: Each request receives an output of ~30 tokens/second. Batching (grouping multiple requests and processing them simultaneously) significantly improves total throughput, often 10x or more over a single request, because GPUs are more efficient with larger batches.

    2. Dynamic batching: This technique adds a new request to an existing batch immediately instead of making it wait, ensuring the GPU always works at high capacity.

    3. Input tokens: The ~30 tokens/second refers to output tokens. Input tokens are processed much faster (known as "prefilling"). Typically, the input is many times longer than the output (3x to 10x). This increases the total number of tokens processed, which explains why input and output are often billed separately.

    4. Quantization: Using 8-bit integers or 8-bit floats instead of 16-bit floats reduces memory usage and speeds up processing because the GPU accesses less memory. Newer GPUs also have hardware instructions for lower-bit numbers, increasing speed further; for example, the new NVIDIA Blackwell GPU supports 4-bit floats (FP4). Quantization also saves memory, allowing even bigger batches from point 1, making it more economical.

    5. Speculative decoding: This method uses a smaller model to predict the next token. For example, predicting "you" after "it is good to see" doesn't require a large model, and smaller models make such predictions faster. The Medusa algorithm by Tianle Cai is a specific example of this approach. (A minimal sketch follows this post.)

    6. Prompt caching: LLMs often encounter repeated prefixes, such as "you are a smart AI agent" in system prompts. Caching these prefilled prompts avoids recalculating them, speeding up repeated requests.

    7. Optimizing GPU setups: This means using large GPUs for big models, small GPUs for small models, and matching GPUs to specific tasks; some are better for prefilling, others for decoding. There are many optimization opportunities here.

    This is not a complete list. We integrate these methods (and a growing number of others) in our runtime to ensure profitability at reasonable traffic. Lepton was created by experts who have developed key AI software over the past decade (Caffe, ONNX, PyTorch) alongside cloud experts like the creator of etcd and core contributors to Kubernetes. We provide not only LLM APIs but also a full cloud-native experience to help you find, use, and optimize GPUs on our cloud platform. We love the open-source and open-access community. What AI technical explanation would you like to hear next?
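A minimal sketch of the speculative decoding idea from point 5, assuming hypothetical draft_next and target_verify callables standing in for the small and large models (in a real system the large model scores all drafted positions in one batched forward pass):

def speculative_decode(draft_next, target_verify, prompt, k=4, max_new_tokens=64):
    # draft_next(tokens) -> the small model's next-token guess (cheap per call)
    # target_verify(tokens, draft) -> the large model's greedy choice at each of
    #   the len(draft) + 1 positions, computed in one forward pass
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # The small model drafts k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # The large model checks the whole draft at once.
        target_choices = target_verify(tokens, draft)
        # Accept drafted tokens as long as they match the large model's own picks.
        n_accept = 0
        while n_accept < k and draft[n_accept] == target_choices[n_accept]:
            n_accept += 1
        tokens += draft[:n_accept]
        # Always append one target-quality token, so progress is guaranteed.
        tokens.append(target_choices[n_accept])
    return tokens[:len(prompt) + max_new_tokens]

With greedy decoding the output matches what the large model alone would produce; the speedup depends on how often the draft tokens are accepted.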

  • Greg Coquillo

    Product Leader @AWS | Startup Investor | 2x LinkedIn Top Voice for AI, Data Science, Tech, and Innovation | Quantum Computing & Web 3.0 | I build software that scales AI/ML Network infrastructure

    212,984 followers

    A couple of weeks ago, among other things, I called out that DeepSeek AI's FlashMLA introduced a suite of efficiency techniques that improve GPU utilization and speed for AI workloads.

    🔸TLDR: It's fascinating to see such quick innovation in CUDA programming right after DeepSeek, aiming at substantial efficiency gains in variable-length prompt processing and small-batch inference scenarios.

    🔹Stanford researchers have soft-launched ThunderMLA, an optimized GPU decoding mechanism designed to accelerate large language model inference by implementing a fully fused "megakernel" for attention decoding.

    🔹In other words, this megakernel consolidates multiple kernel operations into a single execution unit, reducing the overhead associated with individual kernel launches (setup and teardown time) while mitigating tail effects and improving memory bandwidth utilization.

    🔹By leveraging custom scheduling strategies, including static and makespan-backward schedulers, ThunderMLA optimizes task execution order and resource allocation, achieving a 20-35% speedup over FlashMLA.

    🔹Behind this performance gain is ThunderKittens, an embedded domain-specific language (DSL) developed by the researchers that simplifies writing high-performance AI kernels for GPUs.

    🔹ThunderKittens maintains extensibility and uses fundamental objects that align with tensor cores for optimal utilization, while abstracting complex GPU programming tasks.

    🔹It provides a PyTorch-like API, making it accessible while remaining transparent to developers who need fine-grained hardware control.

    Looking forward to the technical report, as well as extensions of this Multi-Head Latent Attention speedup to other areas. I'll be glad to share it! See more below #genai #technology #artificialintelligence

  • Anand Logani
    5,907 followers

    DeepSeek is sparking major conversation across the AI ecosystem. With claims of matching or exceeding OpenAI's model performance at a fraction of the cost, and being open source, this is a development the industry cannot ignore. At EXL, we see this as an inflection point for businesses adopting AI. Here's my perspective:

    1. What's Happened?

    DeepSeek has introduced key advancements that set a new benchmark for AI:
    - Open-Source Architecture: DeepSeek's open-source model accelerates innovation by providing accessibility and flexibility.
    - Multi-Head Latent Attention (#MLA): This attention mechanism compresses the key-value cache into a compact latent representation, cutting GPU memory needs and lowering costs.
    - Mixture-of-Experts (MoE) Architecture: DeepSeek improves on MoE architectures such as Mixtral, boosting reasoning capabilities and reducing training costs.

    These innovations make DeepSeek's model cheaper and more efficient, opening doors for widespread adoption. Other models, from Meta's Llama to OpenAI, Gemini, and Claude, will likely adopt these mechanisms, achieving similar capabilities at lower cost.

    2. What Does This Mean?

    EXL Client Solutions Will Benefit As Foundational Models Evolve
    - DeepSeek reduces barriers to entry, enabling organizations to scale generative AI solutions. These advancements lower gen AI use-case costs while increasing adoption, positively impacting GPU and cloud growth.

    From General-Purpose to Deep Industry-Specific Use Cases
    - General-purpose LLMs like DeepSeek provide a foundation, but EXL's domain-specific solutions, like EXL's Insurance LLM, unlock their true potential through fine-tuning to deliver transformative outcomes.
    - EXL reduces LLM training costs at the application layer with techniques like latent attention while opening new AI markets. These improvements enable clients to adopt gen AI use cases and automation at significantly lower cost.

    Scarcity-Driven Disruption Is an Opportunity
    - Cost reductions in LLM development expand the total addressable market (TAM) for AI, driving demand for cloud solutions, GPUs, and AI platforms. MLA-driven efficiencies and EXL's expertise in leveraging private data and domain knowledge create impactful, cost-effective AI solutions. This positions EXL to unlock orchestration opportunities and new use cases that were previously too costly to automate.

    EXL thrives in moments of transformation. As a model-agnostic partner, we deliver tailored AI solutions that drive actionable insights and measurable value. #DeepSeek isn't just a technical milestone; it's a call to action for enterprises to embrace AI, scale automation, and lead the next wave of innovation. Rohit Kapoor, Arturo Devesa, Gaurav Iyer, Shekhar Vemuri, Vivek Vinod

  • Hui Fu

    CEO, United Micro

    4,634 followers

    I am deeply intrigued by these slides. Of the 1,000x performance improvement NVIDIA gained over the last 10 years, only 2.5x came from process improvements.

    Moore's Law has long been synonymous with the progress of semiconductor technology, predicting the doubling of transistors on a microchip approximately every two years. However, as we navigate the intricacies of the 21st century, it becomes evident that traditional, process-driven advancement is no longer being sustained. To keep Moore's Law alive, semiconductor manufacturing process improvement is not the answer anymore; the needed improvement will come from design. This is the era of the designer, and domain-specific computing (also called domain-specific architectures) in particular will be the answer.

    Of the 1,000x improvement, ~16x came from number representation! Data representations differ greatly across application domains. No more one-size-fits-all data types: domain-specific data types will drive the new architectures. This holds for AI computing, for wireless computing, for video/crypto computing. Each comes with its own data width, dynamic range, precision requirements, and complex or real data sources. This is a gold mine for our next level of optimization.

    Next come complex instructions, at 12x! This runs a bit against the recent RISC (not RISC-V) movement. It shows up in NVIDIA's GPU design (as quoted in the slide), and also in wireless-specific computing: recall the Qualcomm paper, long ago, on the Hexagon DSP combining ~30 RISC instructions into one to perform the FFT computation that is fundamental to wireless processing. That reduces fetch and decode energy from 30 instructions to one, while execution is also optimized with addressing and compute modes specific to the FFT.

    Domain-specific architecture marks a departure from the one-size-fits-all approach. Instead of merely cramming more transistors onto a chip, designers are now crafting architectures tailored to specific tasks. This approach optimizes efficiency, enabling hardware to excel in particular domains such as artificial intelligence, graphics rendering, or scientific simulations. The trajectory of technological advancement is no longer solely dictated by the shrinking size of transistors; it's about reimagining the very architecture that drives our devices. As we gaze into the future, domain-specific architecture stands as a beacon, guiding us toward a realm where innovation is not confined by the constraints of a standardized approach. Moore's Law is not dead; long live the designers' aspiration and pursuit.
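To make the number-representation point concrete, here is a small illustrative calculation (a sketch, not taken from the slides) of how IEEE-style floating-point formats trade dynamic range against precision as the exponent and mantissa shrink:

def ieee_like_range(exp_bits, mant_bits):
    # Largest normal value and relative step (machine epsilon) for an
    # IEEE-style float with the given exponent and mantissa widths.
    bias = 2 ** (exp_bits - 1) - 1
    max_normal = (2 - 2.0 ** -mant_bits) * 2.0 ** bias
    eps = 2.0 ** -mant_bits
    return max_normal, eps

for name, e, m in [("fp32", 8, 23), ("bf16", 8, 7), ("fp16", 5, 10), ("fp8 e5m2", 5, 2)]:
    max_normal, eps = ieee_like_range(e, m)
    print(f"{name:9s} max ~ {max_normal:.3g}   relative step ~ {eps:.1e}")

# fp8 e4m3 bends the IEEE rules (it gives up infinities) to reach +/-448,
# exactly the kind of domain-specific trade-off described above.

bf16 keeps fp32's range but gives up precision, while fp16 keeps more precision but far less range; different domains pick different points on that curve.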

  • Sharada Yeluri

    Engineering Leader

    19,572 followers

    A lot has changed since my #LLM inference article last January; it's hard to believe a year has passed! The AI industry has pivoted from focusing solely on scaling model sizes to enhancing reasoning abilities during inference. This shift is driven by the recognition that simply increasing model parameters yields diminishing returns and that improving inference capabilities can lead to more efficient and intelligent AI systems.

    OpenAI's o1 and Google's Gemini 2.0 are examples of models that employ #InferenceTimeCompute. Some techniques include best-of-N sampling, which generates multiple outputs and selects the best one; iterative refinement, which allows the model to improve its initial answers; and speculative decoding. Self-verification lets the model check its own output, while adaptive inference-time computation dynamically allocates extra #GPU resources to challenging prompts. These methods represent a significant step toward more reasoning-driven inference.

    Another exciting trend is #AgenticWorkflows, where an AI agent, a software program running on an inference server, breaks the queried task into multiple smaller tasks without requiring complex user prompts (prompt engineering may reach end of life this year!). It then autonomously plans, executes, and monitors these tasks. In this process, it may run inference multiple times on the model while maintaining context across the runs.

    #TestTimeTraining takes things further by adapting models on the fly. This technique fine-tunes the model for new inputs, enhancing its performance.

    These advancements can complement each other. For example, an AI system may use an agentic workflow to break down a task, apply inference-time compute to generate high-quality outputs at each step, and employ test-time training to adapt to unexpected challenges. The result? Systems that are faster, smarter, and more adaptable.

    What does this mean for inference hardware and networking gear? Previously, most open-source models barely needed one GPU server, and inference was often done on front-end networks or by reusing the training networks. However, as the computational complexity of inference increases, more focus will go to building scale-up systems with hundreds of tightly interconnected GPUs or accelerators for inference flows. While NVIDIA GPUs continue to dominate, other accelerators, especially from hyperscalers, will likely gain traction.

    Networking remains a critical piece of the puzzle. Can #Ethernet, with enhancements like compressed headers, link retries, and reduced latencies, rise to meet the demands of these scale-up systems? Or will we see a fragmented ecosystem of switches for non-NVIDIA scale-up systems? My bet is on Ethernet. Its ubiquity makes it a strong contender for the job...

    Reflecting on the past year, it's clear that AI progress isn't just about making things bigger but smarter. The future looks more exciting as we rethink models, hardware, and networking. Here's to what 2025 will bring!
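A minimal sketch of the best-of-N sampling technique mentioned above, assuming hypothetical generate and score callables (in practice a sampled LLM call and a reward or verifier model):

def best_of_n(generate, score, prompt, n=8):
    # generate(prompt) -> one sampled candidate answer (hypothetical model call)
    # score(prompt, answer) -> scalar quality estimate from a verifier or reward model
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

The extra inference-time compute is n generations plus n scoring passes per query, which is exactly the kind of load that pushes inference toward larger scale-up systems.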

  • Dr. Anu Asokan

    Founder @ Stem A Chip | AI & Chip Design STEM Educator | PhD in Chip Design | DFT Expert | Builder of Future Innovators

    25,904 followers

    What if I told you your next GPU might be co-designed by ChatGPT? Sounds wild, right? But that’s exactly where the chip industry is headed.

    Large Language Models aren’t just answering questions anymore — they’re writing code for hardware. Tools like AutoChip and RTLLM are already generating:
    → Verilog
    → Testbenches
    → Timing constraints
    → And even helping with simulation and debug.

    Basically, what used to take hours (and a team of engineers) can now be accelerated with the right AI prompts. This isn’t about replacing engineers. It’s about redefining how they work.

    Imagine:
    → You give AI a high-level spec.
    → It drafts the RTL.
    → You refine and validate.
    → Faster iterations. Fewer errors. Smarter designs.

    Chip design is entering its AI-native era — where humans set the direction, and AI fills in the blueprints. If you’re in hardware, this shift is massive. And if you’re in AI? Chances are, your next model will depend on chips designed by... AI.

    Following the hardware x AI space closely. Let’s talk if you're building here. #stemachip #ChipDesign #AI #LLM #AutoChip #Semiconductors #GenerativeAI #EDA #HardwareInnovation
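As a rough illustration of the spec-to-RTL loop described above (a generic chat-completion sketch, not AutoChip or RTLLM specifically; the model name and prompt are placeholders):

from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment; any chat-style LLM API would do

spec = (
    "Write synthesizable Verilog for an 8-bit up counter with synchronous reset "
    "and an enable input, plus a short self-checking testbench."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are an RTL design assistant."},
        {"role": "user", "content": spec},
    ],
)

draft_rtl = response.choices[0].message.content
print(draft_rtl)  # a draft only: still needs lint, simulation, and human review

The engineer stays in the loop: the generated RTL is a starting point to refine and validate, not a replacement for verification.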

  • Abhinav Kohar

    Artificial Intelligence and Energy | Engineering Leader | CS @ UIUC | Microsoft | IIT | President’s Gold Medal

    16,459 followers

    🔴 FlashAttention-3 is a significant advancement in GPU-based attention algorithms. Here are the key highlights:

    1. 1.5-2.0x speedup over FlashAttention-2 on H100 GPUs
    2. Reaches up to 740 TFLOPs/s (75% of the theoretical max) for FP16
    3. The FP8 version approaches 1.2 PFLOPs/s
    4. Reduces FP8 numerical error by 2.6x compared to standard methods

    The team achieved these improvements through clever technical innovations:

    1. Warp specialization for producer-consumer asynchrony
    2. Overlapping GEMMs and softmax computation
    3. An optimized FP8 implementation with block quantization (sketched after this post)

    This work could have major implications for training and deploying large language models, especially those requiring long-context processing. It's exciting to see continued progress in making these models more efficient! The researchers have open-sourced their code and plan to integrate it with popular deep learning frameworks. Definitely worth checking out if you're working in this space! #MachineLearning #AI #GPUComputing #Transformers
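A toy sketch of the block-quantization idea (illustration only, with int8 standing in because NumPy has no FP8 type): each block gets its own scale, so a single outlier only coarsens the quantization step inside its own block rather than across the whole tensor.

import numpy as np

def quantize_per_block(x, block=128, qmax=127):
    # Symmetric per-block quantization: one scale per block of `block` values.
    flat = x.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero for all-zero blocks
    q = np.clip(np.round(flat / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize_per_block(q, scale, shape):
    return (q.astype(np.float32) * scale).reshape(shape)

x = np.random.randn(4, 256).astype(np.float32)
x[0, 0] = 50.0  # one outlier
q, scale = quantize_per_block(x)
x_hat = dequantize_per_block(q, scale, x.shape)
block_err = np.abs(x_hat - x).reshape(-1, 128).max(axis=1)
print(block_err)  # only the outlier's block sees a coarser step; the others stay precise

With a single per-tensor scale, that one outlier would coarsen every value; containing it per block is the intuition behind the reported FP8 error reduction.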

  • Ryan Peterman

    AI/ML Infra @ Meta | Writing About Software Engineering & Career Growth

    191,234 followers

    GPU matrix multiplication may be the most expensive algorithm that exists. It is the main operation that OpenAI, Anthropic, and Meta spend billions of dollars of compute on. There are only 8 kernel optimizations you need to understand to reach 93.7% of the performance of NVIDIA's state-of-the-art cuBLAS library.

    In this thread, we'll go over kernels from an Anthropic engineer's blog post that get progressively more performant, following the attached diagram.

    Kernel 1: Simply multiplies two matrices. We use CUDA's grid, block, and thread hierarchy to assign each thread a unique entry in the result matrix C. This works, but only gets us 309 GFLOPs/s (1.3% of an A6000 GPU's potential); we can do much better.

    Kernel 2: Enables global memory coalescing by using "warps" (groups of threads). Threads in the same warp can group their memory accesses into one. This dramatically improves memory throughput (110 GB/s vs 15 GB/s). Result: 1986 GFLOPs/s (8.5% of cuBLAS).

    Kernel 3: Utilizes on-chip shared memory (SMEM). SMEM bandwidth is much higher than global memory (12,080 GiB/s vs 750 GiB/s). We load chunks of A and B into SMEM and then perform as much work as possible on them. Result: 2980 GFLOPs/s (12.8% of cuBLAS).

    Kernel 4: Uses 1D blocktiling to calculate multiple results per thread. It works like the last one but adds an inner loop for multiple C entries per thread (does more work per SMEM load) with a 4 KB SMEM cache per block. Result: 8474 GFLOPs/s, ~3x faster than the last (36.5% of cuBLAS).

    Kernel 5: Increases arithmetic intensity via 2D blocktiling. We compute a grid of 8x8 results per thread, leveraging shared memory and local registers to reduce global memory accesses. It offers another ~2x performance boost. Result: 15971 GFLOPs/s (68.7% of cuBLAS).

    Kernel 6: Vectorizes memory accesses. The key is to transpose loads from A, enabling the use of 128-bit load instructions (LDS.128) instead of 32-bit loads. This enables more efficient data movement. Result: 18237 GFLOPs/s (78.4% of cuBLAS).

    Kernel 7: Tunes parameters for how much data we cache in SMEM and registers, which improves performance. We use a bash script to search all valid combinations and find the optimal settings. Result: 19721 GFLOPs/s (84.8% of cuBLAS).

    Kernel 8: Adds "warptiling". This is yet another level of tiling (on top of blocktiling and threadtiling). Warptiling allows different warps to execute in parallel on different warp schedulers, leveraging the hardware for even more parallelism. Result: 21779 GFLOPs/s (93.7% of cuBLAS).

    From reading the original post, I learned that optimizing GPU kernels requires a deep understanding of the hardware and memory access patterns. The basics are simple and get you most of the way there (the author got ~80% of the perf in 2 weekends). It took another 4 weekends to get the last 14% (classic power law). For much more in-depth explanations with helpful diagrams and code snippets, check out the original post here; it's really interesting: https://lnkd.in/gi-y4NFB
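Not CUDA, but a small NumPy sketch (an illustration with an assumed tile size) of the blocking pattern kernels 3-5 exploit: compute one tile of C at a time so each loaded tile of A and B is reused many times, which is what staging tiles into shared memory buys on the GPU.

import numpy as np

def tiled_matmul(A, B, tile=32):
    # Blocked matmul: compute C one tile-by-tile block at a time, streaming the
    # matching tiles of A and B. On the GPU these are the chunks a thread block
    # stages into shared memory so every loaded value is reused ~tile times.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros_like(C[i:i+tile, j:j+tile])
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.randn(128, 96)
B = np.random.randn(96, 64)
assert np.allclose(tiled_matmul(A, B), A @ B)

The real kernels add further tiling levels inside each thread block (per warp and per thread), but the data-reuse logic is the same.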

  • Tatev Aslanyan

    Founder and CEO @ LunarTech | AI Engineer and Data Scientist | Seen on Forbes, Yahoo, Entrepreneur | Empowering Enterprises with Data Science and AI

    25,373 followers

    When your LLM says “just one more parameter…” and your GPU goes 🔥

    Imagine a tiny clown car labelled “24 GB GPU” and an endless parade of language-model weights cramming inside. That’s basically what happens when you try to fine-tune a giant model on consumer hardware. 🚗

    VRAM disappears fast when billions of model weights, extra optimizer states, saved activations, and token-by-token KV caches all pile up, each scaling with model size, batch size, and context length. The higher the precision (FP32 vs FP16 or Int8), the bigger the bite, so smart choices in precision, optimizers, and memory-saving tricks are essential to stay within your GPU’s limits.

    Why you keep hitting “CUDA out of memory”:
    - You’re running a 70B-parameter model on a single card.
    - Batch size set to “YOLO”.
    - You thought quantization was a myth.
    - You forgot activations double during training.
    - You assumed swap-to-CPU would be “fine”. (Spoiler: it isn’t.)

    Quick sanity savers ⛑️
    - Gradient checkpointing → recompute, don’t store.
    - Low-rank adapters / LoRA → fine-tune millions, not billions.
    - Quantize for inference: Int8, 4-bit, even 2-bit if you’re spicy.
    - Pipeline / tensor parallelism → slice the model across multiple GPUs.
    - Know when to downsize — sometimes GPT-J gets the job done.

    Remember: AI isn’t “one-click”; it’s physics, algebra, and PCIe lanes. 😉

    Follow LunarTech here on LinkedIn, YouTube and other platforms.
    🎓 For AI & Data Science training: LunarTech Academy
    🚀 For AI strategy & full-stack dev: LunarTech Enterprises

    #LunarTech #LLM #GPU #AIMemory #DeepLearning #ModelOptimization #DataScience #AIEngineering
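A back-of-the-envelope sketch of where the VRAM goes, using rough rule-of-thumb byte counts (assumed, not exact; real frameworks add fragmentation and workspace overhead on top):

def training_vram_gb(params_billion, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    # Mixed-precision Adam rule of thumb: fp16 weights (2 B/param) + fp16 grads
    # (2 B/param) + Adam moments and fp32 master weights (~12 B/param) ~= 16 B/param.
    # Billions of params times bytes-per-param conveniently comes out in GB.
    return params_billion * (weight_bytes + grad_bytes + optimizer_bytes)

def kv_cache_gb(layers, kv_heads, head_dim, context_len, batch_size, bytes_per_value=2):
    # K and V, per layer, per head, per position, per sequence in the batch.
    return 2 * layers * kv_heads * head_dim * context_len * batch_size * bytes_per_value / 1e9

print(training_vram_gb(7))                # ~112 GB for full fine-tuning of a 7B model, before activations
print(kv_cache_gb(32, 32, 128, 4096, 8))  # ~17 GB of KV cache for a 7B-class model at 4K context, batch 8

Gradient checkpointing attacks the activation term, LoRA shrinks the gradient and optimizer terms to the adapter weights only, and quantization shrinks the weight and KV-cache terms.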

  • Bert Maher

    Member of Technical Staff @ Anthropic

    2,076 followers

    I've been really excited to learn the lowest-level details of GPU matrix multiplication recently, so I wrote a pure-CUDA implementation of CUTLASS's "pingpong" GEMM algorithm (https://lnkd.in/dfGaPHzn) as a way of understanding it at the deepest level. I was inspired by Pranjal Shankhdhar's fantastic blog post "Outperforming cuBLAS on H100", which implements a fast GEMM from first principles in CUDA, and based my code on his kernel (which in CUTLASS parlance would be a "cooperative" warp-specialized GEMM, as opposed to my "pingpong" kernel). I also wrote up a narrative of my perf-tuning experiences in the repo. Since matmul perf is a hot topic in ML recently, I hope this is a fun read for anyone else who loves low-level programming!
