Neural Network Training Methods

Explore top LinkedIn content from expert professionals.

  • Aishwarya Srinivasan
    584,895 followers

    If you’re an AI engineer, understanding how LLMs are trained and aligned is essential for building high-performance, reliable AI systems. Most large language models follow a 3-step training procedure:

    Step 1: Pretraining
    → Goal: Learn general-purpose language representations.
    → Method: Self-supervised learning on massive unlabeled text corpora (e.g., next-token prediction).
    → Output: A pretrained LLM, rich in linguistic and factual knowledge but not grounded in human preferences.
    → Cost: Extremely high (billions of tokens, trillions of FLOPs).
    → Pretraining is still centralized within a few labs due to the scale required (e.g., Meta, Google DeepMind, OpenAI), but open-weight models like LLaMA 4, DeepSeek V3, and Qwen 3 are making this more accessible.

    Step 2: Finetuning (Two Common Approaches)
    → 2a: Full-Parameter Finetuning
    - Updates all weights of the pretrained model.
    - Requires significant GPU memory and compute.
    - Best for scenarios where the model needs deep adaptation to a new domain or task.
    - Used for: instruction-following, multilingual adaptation, industry-specific models.
    - Cons: expensive, storage-heavy.
    → 2b: Parameter-Efficient Finetuning (PEFT)
    - Only a small subset of parameters is added and updated (e.g., via LoRA, Adapters, or IA³).
    - The base model remains frozen.
    - Much cheaper, ideal for rapid iteration and deployment.
    - Multi-LoRA architectures (e.g., used in Fireworks AI, Hugging Face PEFT) allow hosting multiple finetuned adapters on the same base model, drastically reducing cost and latency for serving. (A minimal PEFT configuration sketch follows this post.)

    Step 3: Alignment (Usually via RLHF)
    Pretrained and task-tuned models can still produce unsafe or incoherent outputs. Alignment ensures they follow human intent. Alignment via RLHF (Reinforcement Learning from Human Feedback) involves:
    → Step 1: Supervised Fine-Tuning (SFT)
    - Human labelers craft ideal responses to prompts.
    - The model is fine-tuned on this dataset to mimic helpful behavior.
    - Limitation: costly and not scalable alone.
    → Step 2: Reward Modeling (RM)
    - Humans rank multiple model outputs per prompt.
    - A reward model is trained to predict human preferences.
    - This provides a scalable, learnable signal of what “good” looks like.
    → Step 3: Reinforcement Learning (e.g., PPO, DPO)
    - With Proximal Policy Optimization (PPO), the LLM is trained against the reward model’s feedback to iteratively improve its behavior.
    - The newer Direct Preference Optimization (DPO) is gaining popularity over PPO for being simpler and more stable: it optimizes directly on the ranked preference pairs, without sampled trajectories or a separate reward model. (A sketch of the DPO loss also follows this post.)

    Key Takeaways:
    → Pretraining = general knowledge (expensive)
    → Finetuning = domain or task adaptation (customize cheaply via PEFT)
    → Alignment = make it safe, helpful, and human-aligned (still labor-intensive but improving)

    Save the visual reference, and follow me (Aishwarya Srinivasan) for more no-fluff AI insights ❤️
    PS: Visual inspiration: Sebastian Raschka, PhD
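    A minimal sketch of what Step 2b looks like in code, using the Hugging Face PEFT library mentioned above. The base checkpoint ("gpt2"), the rank, and the target module names are illustrative assumptions chosen so the example stays small, not recommendations from the post:

    ```python
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Small example base model; swap in whatever checkpoint you are adapting.
    base = AutoModelForCausalLM.from_pretrained("gpt2")

    lora_cfg = LoraConfig(
        r=8,                          # rank of the low-rank update matrices
        lora_alpha=16,                # scaling applied to the update
        target_modules=["c_attn"],    # GPT-2's attention projection; this is model-specific
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(base, lora_cfg)  # base weights stay frozen; only adapters train
    model.print_trainable_parameters()      # typically well under 1% of total parameters
    ```

    Because the base stays frozen, several such adapters can be trained and served against one copy of the base weights, which is the multi-LoRA setup the post describes.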
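    And a sketch of the DPO objective referenced in Step 3, in plain PyTorch. The function name, signature, and beta value are illustrative choices; in practice you would more likely use an existing implementation such as TRL's DPOTrainer:

    ```python
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """Each argument is a 1-D tensor of summed token log-probabilities for the
        preferred (chosen) or dispreferred (rejected) completion, under either the
        trainable policy or the frozen reference model."""
        # Implicit reward: how much more likely the policy makes each completion
        # relative to the reference model.
        chosen_rewards = policy_chosen_logps - ref_chosen_logps
        rejected_rewards = policy_rejected_logps - ref_rejected_logps
        # Push the preferred completion's implicit reward above the rejected one's;
        # beta controls how far the policy may drift from the reference.
        return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
    ```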

  • Zain Hasan
    AI builder & teacher | AI/ML @ Together AI | ℕΨ Engineering @ UofT | Lecturer | ex-Vector DBs, Data Scientist, Health Tech Founder
    15,273 followers

    Explanation of Low-Rank Adaptation (LoRA), a method for efficiently fine-tuning pre-trained neural networks.

    The Problem LoRA Solves:
    🔸 In early 2021, Microsoft partnered with OpenAI to explore the commercial viability of GPT-3.
    🔸 They found that prompting was insufficient for production tasks like natural-language-to-code generation.
    🔸 Fine-tuning was necessary but prohibitively expensive due to the large size of model checkpoints.

    How It Works:
    🔸 LoRA generalizes full fine-tuning (updating every single parameter) by asking two questions:
    - Do we need to fine-tune all parameters?
    - For the weight matrices we fine-tune, how expressive should the updates be in terms of matrix rank?
    🔸 These questions define a 2D plane where full fine-tuning is one corner (full rank and full parameter updates) and the origin represents the original model.
    🔸 Any point in this plane is a valid LoRA configuration.
    🔸 The chosen rank of the update matrix controls the expressivity of the fine-tuning process.
    🔸 A d x d matrix can represent any linear transformation in a d-dimensional vector space.
    🔸 By first transforming the input to a lower-dimensional space and then back to the original space, we restrict the kind of linear transformations that can be represented.
    🔸 This reduces the number of parameters that need to be stored from d x d to (d x r + r x d), where r << d.
    🔸 A point near the origin often performs as well as full fine-tuning, because neural networks are often over-parametrized and their weight matrices are full of linearly dependent rows and columns (i.e., they have low intrinsic rank).
    🔸 This suggests that we can start with a low-rank configuration and gradually increase the rank if needed.

    Common practices when using LoRA:
    🔸 How to choose the rank r of the update matrix: Start with a low rank and increase it if needed.
    🔸 When to use full fine-tuning: When fine-tuning on data that is completely new and absent from the pretraining of the base model (for example, if you are tuning an English model on Martian, full fine-tuning may be necessary).
    🔸 Can I use LoRA for any model architecture? As long as the model uses matrix multiplication, LoRA can be applied, so pretty much every model architecture can use LoRA.

    Benefits of LoRA (see the NumPy sketch after this post):
    🔸 Reduced checkpoint sizes: On GPT-3, checkpoint size was reduced from 1 TB to 25 MB.
    🔸 No additional inference latency: LoRA updates can be merged with the original parameters during inference: W_new = W_old + A x B.
    🔸 Ability to quickly switch between tasks: LoRA modules can be loaded and unloaded efficiently, e.g. (A_french x B_french), (A_german x B_german), (A_spanish x B_spanish).

    Some interesting ideas enabled by LoRA:
    🔸 Caching LoRA modules in RAM for faster model switching and routing between different finetunes.
    🔸 Training multiple LoRA modules in parallel on different batches of the training set.
    🔸 Creating a tree of adaptive models where each node is a LoRA module.
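    A tiny NumPy sketch of the update described above: the pretrained weight W stays frozen, only the small matrices A and B are trained, and for serving they can be merged back into W so inference costs nothing extra. The dimensions and initialization scales are arbitrary illustrative choices:

    ```python
    import numpy as np

    d, r = 1024, 8                    # hidden size and LoRA rank, with r << d

    W = np.random.randn(d, d) * 0.02  # frozen pretrained weight: d x d parameters
    A = np.random.randn(d, r) * 0.01  # trainable down-projection: d x r parameters
    B = np.zeros((r, d))              # trainable up-projection: r x d parameters (zero init, so the update starts at 0)

    x = np.random.randn(d)

    # During training: route the input through the low-rank path alongside the frozen W.
    h = x @ W + (x @ A) @ B

    # For serving: merge once (W_new = W_old + A x B), so there is no added latency.
    W_merged = W + A @ B
    assert np.allclose(x @ W_merged, h)

    print(f"full fine-tune parameters: {d * d:,}  vs  LoRA parameters: {d * r + r * d:,}")
    ```

    Switching tasks then just means swapping which (A, B) pair is attached or merged, which is why many adapters can share a single base model.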

  • Justin Hodges, Ph.D
    Senior AI/ML Technical Specialist • Author: Approaching ML in CFD/CAE applications • Servant leader
    84,686 followers

    26 videos have already been uploaded to YouTube: implementing neural networks entirely from scratch, by Raj.

    (1) Coding a single neuron and a layer: https://lnkd.in/gfSRKuxt
    (2) The beauty of numpy and the dot product in coding neurons and layers: https://lnkd.in/grBjwTu4
    (3) Coding multiple neural network layers: https://lnkd.in/gSwmEnZP
    (4) Implementing the Dense Layer class in Python: https://lnkd.in/gSEeZzTZ
    (5) Broadcasting and Array Summation in Python: https://lnkd.in/gt9u5hca
    (6) Coding Neural Network Activation Functions from scratch: https://lnkd.in/gxav-8-2
    (7) Coding one neural network forward pass: https://lnkd.in/geyZAvAn
    (8) Coding the cross entropy loss in Python (from scratch): https://lnkd.in/gbgyQbJi
    (9) Introduction to Optimization in Neural Network training: https://lnkd.in/gU2ZyXNq
    (10) Partial Derivatives and Gradient in Neural Networks: https://lnkd.in/gmb4TgUC
    (11) Understand Chain Rule - The backbone of Neural Networks: https://lnkd.in/gpWqaB2s
    (12) Backpropagation from scratch on a single neuron: https://lnkd.in/gPXNvxwG
    (13) Backpropagation through an entire layer of neurons - from scratch: https://lnkd.in/gpTcyz3G
    (14) Role of matrices in backpropagation: https://lnkd.in/gME4Ey53
    (15) Finding derivatives of inputs in backpropagation and why we need them: https://lnkd.in/gyDsmmkS
    (16) Coding Backpropagation building blocks in Python: https://lnkd.in/gA75tWfz
    (17) Backpropagation on the ReLU activation class: https://lnkd.in/gaTgYZGa
    (18) Implementing backpropagation on the cross entropy loss function: https://lnkd.in/gHdFwJBf
    (19) Combined backpropagation on softmax activation and cross entropy loss: https://lnkd.in/gNJMrCX3
    (20) Build the entire backpropagation pipeline for neural networks | No PyTorch or TensorFlow | Only NumPy: https://lnkd.in/gqqmb8AN
    (21) Coding the entire neural network forward backward pass in Python: https://lnkd.in/grmHYnbn
    (22) Learning Rate Decay in Neural Network Optimization: https://lnkd.in/gCbciAu9
    (23) Momentum in training neural networks: https://lnkd.in/gZwFz46b
    (24) Coding the ADAGRAD optimizer for Neural Network training: https://lnkd.in/gmyarquq
    (25) Coding the RMSProp Optimizer with Neural Network training: https://lnkd.in/gryU7Rsw
    (26) Coding the ADAM optimizer for neural networks: https://lnkd.in/grn2V7Yg

    Raj must have spent a TON of time and effort making these lectures. He shows everything on a whiteboard and then in Python code as well. Nothing is assumed. Everything is spelled out.

    Here is a video of his 100 pages of handwritten notes on the topic "Building Neural Networks from Scratch".

    #ai #machinelearning #datascience #engineering #data #chatgpt
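    For a feel of what the first few videos build up to, here is the kind of NumPy-only forward pass they cover (a single dense layer plus a ReLU). The specific numbers below are arbitrary and not taken from the lectures:

    ```python
    import numpy as np

    # A batch of 3 samples, 4 features each.
    inputs = np.array([[1.0, 2.0, 3.0, 2.5],
                       [2.0, 5.0, -1.0, 2.0],
                       [-1.5, 2.7, 3.3, -0.8]])

    # A dense layer with 3 neurons: one row of weights per neuron, plus a bias each.
    weights = np.array([[0.2, 0.8, -0.5, 1.0],
                        [0.5, -0.91, 0.26, -0.5],
                        [-0.26, -0.27, 0.17, 0.87]])
    biases = np.array([2.0, 3.0, 0.5])

    # Forward pass: the dot product computes every neuron's weighted sum at once,
    # and broadcasting adds the biases across the whole batch.
    layer_output = inputs @ weights.T + biases

    # ReLU activation.
    activated = np.maximum(0.0, layer_output)
    print(activated)
    ```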
