Improving Predictive Accuracy

Explore top LinkedIn content from expert professionals.

  • View profile for Andrew Ng
    Andrew Ng is an Influencer

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Exec Chairman of LandingAI

    2,240,558 followers

    Last week, I described four design patterns for AI agentic workflows that I believe will drive significant progress: Reflection, Tool use, Planning and Multi-agent collaboration. Instead of having an LLM generate its final output directly, an agentic workflow prompts the LLM multiple times, giving it opportunities to build step by step to higher-quality output. Here, I'd like to discuss Reflection. It's relatively quick to implement, and I've seen it lead to surprising performance gains.

    You may have had the experience of prompting ChatGPT/Claude/Gemini, receiving unsatisfactory output, delivering critical feedback to help the LLM improve its response, and then getting a better response. What if you automate the step of delivering critical feedback, so the model automatically criticizes its own output and improves its response? This is the crux of Reflection.

    Take the task of asking an LLM to write code. We can prompt it to generate the desired code directly to carry out some task X. Then, we can prompt it to reflect on its own output, perhaps as follows:

    "Here’s code intended for task X: [previously generated code] Check the code carefully for correctness, style, and efficiency, and give constructive criticism for how to improve it."

    Sometimes this causes the LLM to spot problems and come up with constructive suggestions. Next, we can prompt the LLM with context including (i) the previously generated code and (ii) the constructive feedback, and ask it to use the feedback to rewrite the code. This can lead to a better response. Repeating the criticism/rewrite process might yield further improvements.

    This self-reflection process allows the LLM to spot gaps and improve its output on a variety of tasks including producing code, writing text, and answering questions. And we can go beyond self-reflection by giving the LLM tools that help evaluate its output; for example, running its code through a few unit tests to check whether it generates correct results on test cases or searching the web to double-check text output. Then it can reflect on any errors it found and come up with ideas for improvement.

    Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.

    Reflection is a relatively basic type of agentic workflow, but I've been delighted by how much it improved my applications’ results. If you’re interested in learning more about reflection, I recommend:
    - Self-Refine: Iterative Refinement with Self-Feedback, by Madaan et al. (2023)
    - Reflexion: Language Agents with Verbal Reinforcement Learning, by Shinn et al. (2023)
    - CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing, by Gou et al. (2024)
    [Original text: https://lnkd.in/g4bTuWtU ]
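
    A minimal sketch of the generate → critique → revise loop described above. The `llm()` helper and the prompt wording are placeholders for whatever chat API and phrasing you prefer; they are assumptions, not code from the post.

```python
# Minimal sketch of a Reflection loop: generate, critique, revise.
# `llm(prompt)` is a stand-in for any chat-completion call you already have.

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred chat API here")

def reflect_and_revise(task: str, rounds: int = 2) -> str:
    draft = llm(f"Write code for the following task:\n{task}")
    for _ in range(rounds):
        critique = llm(
            "Here's code intended for the task below.\n"
            f"Task: {task}\n\nCode:\n{draft}\n\n"
            "Check the code carefully for correctness, style, and efficiency, "
            "and give constructive criticism for how to improve it."
        )
        draft = llm(
            f"Task: {task}\n\nPrevious code:\n{draft}\n\n"
            f"Reviewer feedback:\n{critique}\n\n"
            "Rewrite the code, addressing the feedback."
        )
    return draft
```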

  • View profile for Damien Benveniste, PhD
    Damien Benveniste, PhD is an Influencer

    Founder @ TheAiEdge | Follow me to learn about Machine Learning Engineering, Machine Learning System Design, MLOps, and the latest techniques and news about the field.

    172,426 followers

    How do you deal with imbalanced data? If you don't have too much data and the imbalance is not too extreme, the typical way to deal with it is to simply reweight the samples so that the loss function considers the positive and negative samples equally. When you have an overwhelming amount of negative samples, you may want to downsample them to minimize training latency. But not all samples are equal!

    At TikTok, for example, for their recommendation engine, they use a non-uniform negative sampling scheme they developed with the University of Connecticut: https://lnkd.in/gRsFSr2d. They proved that optimal sampling of the negative class gives more weight to samples with a higher probability of being positive (Theorem 3). This means that it is better to keep samples that are confusing for a model. This way, the model focuses on learning how to separate true positive samples from negative samples that look like positive ones.

    Interestingly enough, this theorem also means sampling bias is a good thing! In ML applications, a model shows users some samples they are likely to engage with. When they don't engage with those, they become negative samples for the next training batch. That is sampling bias, because only the samples with a high probability of engagement ever get shown to users; they never get the opportunity to interact with the "lesser" samples, so we never get signals for those.

    By sampling the data, we bias the probability estimates coming out of the model, and they become meaningless. The model is not calibrated anymore. To fix that, they came up with a correction to the likelihood function to generate unbiased estimates of the model parameters and, therefore, the probabilities (see eq. 5).

    Practically, you follow this process to downsample the negative samples (a rough sketch follows below):
    1) Uniformly sample the negative class so that the data becomes balanced.
    2) Train a model with the balanced data. They call it a "pilot" model.
    3) Predict on the full data with that pilot model. You get an estimate of how much the model believes each sample is a positive one.
    4) Normalize that probability p by the average probability w and multiply by the sampling rate r: r * p / w.
    5) For each negative sample, pick a uniform random number u. If u < r * p / w, keep the sample; remove it otherwise. The greater p is, the more likely we are to keep it.
    6) r * p / w is the sampling probability. When training the model or predicting, correct the log odds using that probability.

    Pretty simple process to follow! This is a simplified version of the more optimal approach, but they consider this approach satisfactory.

    -- 👉 Early-bird deal for my ML Fundamentals Bootcamp: https://lnkd.in/gasbhQSk -- #machinelearning #datascience #artificialintelligence
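
    A rough sketch of steps 1–6 above, assuming numpy arrays, a scikit-learn logistic-regression pilot model, and a sampling rate r; this is an illustration, not TikTok's implementation, and the exact log-odds correction (eq. 5 in the paper) is only referenced in a comment.

```python
# Rough sketch of the pilot-model negative downsampling described above.
# Assumptions: X is a numpy feature matrix, y holds binary labels (1 = positive).
import numpy as np
from sklearn.linear_model import LogisticRegression

def downsample_negatives(X, y, r=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]

    # 1) Uniformly sample the negative class so the pilot data is balanced.
    pilot_neg = rng.choice(neg, size=len(pos), replace=False)
    pilot_idx = np.concatenate([pos, pilot_neg])

    # 2) Train a "pilot" model on the balanced data.
    pilot = LogisticRegression(max_iter=1000).fit(X[pilot_idx], y[pilot_idx])

    # 3) Score every negative: p = pilot's belief the sample is positive.
    p = pilot.predict_proba(X[neg])[:, 1]

    # 4) Normalize by the average probability w and scale by the rate r.
    keep_prob = np.clip(r * p / p.mean(), 0.0, 1.0)

    # 5) Keep each negative with probability r * p / w; confusing samples
    #    (high p) are more likely to survive.
    keep = rng.uniform(size=len(neg)) < keep_prob
    kept_idx = np.concatenate([pos, neg[keep]])

    # 6) Return the sampling probabilities too: the paper's eq. 5 uses them
    #    to correct the log odds when training and predicting.
    return kept_idx, keep_prob[keep]
```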

  • View profile for Luke Yun

    AI Researcher @ Harvard Medical School, Oxford | Biomedical Engineering @ UT Austin | X-Pfizer, Merck

    32,663 followers

    Harvard and Roche just developed a foundation AI model that predicts immunotherapy outcomes across cancers and treatments and explains why some patients respond while others don’t. Predicting who will benefit from immune checkpoint inhibitors (ICIs) has been notoriously difficult, as biomarkers like PD-L1 expression and tumor mutational burden often fail across cancer types.

    𝗖𝗢𝗠𝗣𝗔𝗦𝗦 𝗶𝘀 𝘁𝗵𝗲 𝗳𝗶𝗿𝘀𝘁 𝗰𝗹𝗶𝗻𝗶𝗰𝗮𝗹𝗹𝘆 𝗴𝗲𝗻𝗲𝗿𝗮𝗹𝗶𝘇𝗮𝗯𝗹𝗲, 𝗶𝗻𝘁𝗲𝗿𝗽𝗿𝗲𝘁𝗮𝗯𝗹𝗲 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻 𝗔𝗜 𝗺𝗼𝗱𝗲𝗹 𝗳𝗼𝗿 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗻𝗴 𝗶𝗺𝗺𝘂𝗻𝗼𝘁𝗵𝗲𝗿𝗮𝗽𝘆 𝗿𝗲𝘀𝗽𝗼𝗻𝘀𝗲 𝗮𝗰𝗿𝗼𝘀𝘀 𝟯𝟯 𝗰𝗮𝗻𝗰𝗲𝗿 𝘁𝘆𝗽𝗲𝘀.
    1. Trained on 10,184 tumors and fine-tuned on 16 clinical cohorts spanning seven cancers and six ICI therapies, outperforming 22 baseline methods.
    2. Increased precision by 8.5%, MCC by 12.3%, and AUPRC by 15.7% over the best competing models, even in new, unseen cancer types.
    3. Predicted survival outcomes more accurately than PD-L1 expression and TMB, achieving a hazard ratio of 4.7 (p < 0.0001) in a phase II urothelial cancer trial.
    4. Identified distinct resistance mechanisms in immune-inflamed non-responders, including TGF-β signaling, vascular exclusion, CD4+ T cell dysfunction, and B cell deficiency.

    A main focus of this paper is biological interpretability, something I am a huge advocate of in large models. It integrates mechanistic interpretability (a concept bottleneck) with transfer learning to do so! Also, to deal with uncertainty quantification beyond the learned temperature parameter, I think incorporating conformal prediction or Bayesian calibration could strengthen clinical alignment by flagging low-confidence predictions.

    Here's the awesome work: https://lnkd.in/gzXSnBd8 Congrats to Wanxiang Shen, Thinh Nguyen, Michelle L., Yepeng Huang, Intae Moon, Nitya Nair, Daniel Marbach, and Marinka Zitnik!

    I post my takes on the latest developments in health AI – 𝗰𝗼𝗻𝗻𝗲𝗰𝘁 𝘄𝗶𝘁𝗵 𝗺𝗲 𝘁𝗼 𝘀𝘁𝗮𝘆 𝘂𝗽𝗱𝗮𝘁𝗲𝗱! Also, check out my health AI blog here: https://lnkd.in/g3nrQFxW
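
    The conformal-prediction idea mentioned above is easy to prototype. Below is a toy split-conformal sketch for a binary responder/non-responder classifier; it is not part of COMPASS, and the calibration probabilities, labels, and alpha level are assumptions for illustration.

```python
# Toy split-conformal sketch for flagging low-confidence predictions.
# Assumes a fitted classifier's responder probabilities on a held-out
# calibration set (cal_probs, cal_labels); not from the COMPASS paper.
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - np.where(cal_labels == 1, cal_probs, 1.0 - cal_probs)
    n = len(scores)
    # Finite-sample-corrected quantile of the calibration scores.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(prob, q):
    # Keep every label whose nonconformity score is within the threshold.
    # A set with both labels (or neither) flags a low-confidence case.
    labels = []
    if 1.0 - prob <= q:
        labels.append(1)   # "responder" is plausible
    if prob <= q:
        labels.append(0)   # "non-responder" is plausible
    return labels
```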

  • View profile for Sebastian Raschka, PhD
    Sebastian Raschka, PhD is an Influencer

    ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

    195,914 followers

    Training LLMs for spam classification: I added 14 experiments comparing different approaches: https://lnkd.in/gTNVvGcj
    - which token to train
    - which layers to train
    - different model sizes
    - LoRA
    - unmasking
    - and more!
    Any additional experiments you'd like to see? And here are the takeaways for the table shown in the picture:
    1. Training the Last vs. First Output Token (Row 1 vs. 2): Training the last output token results in substantially better performance than the first. This improvement is expected due to the causal self-attention mask (a minimal sketch of this is shown below).
    2. Training the Last Transformer Block vs. Last Layer (Row 1 vs. 3): Training the entire last transformer block also results in substantially better results than training only the last layer.
    3. Training All Layers vs. Last Transformer Block (Row 1 vs. 4): Training all layers shows a modest improvement of ~2% over just training the last transformer block, but it requires almost three times longer in terms of training duration.
    4. Using Larger Pretrained Models (Row 1 vs. 5, and Row 1 vs. 6 and 7): Employing a 3x larger pretrained model leads to worse results. However, using a 5x larger model improves performance compared to the initial model, as was anticipated. Similarly, the 12x larger model improves the predictive performance even further. (The medium model was perhaps not well pretrained, or this particular finetuning configuration does not work as well for this model.)
    5. Using a Model with Random Weights vs. Pretrained Weights (Row 1 vs. 8): Utilizing a model with random weights yields results that are only slightly worse (by 1.3%) compared to using pretrained weights.
    6. Using LoRA (Low-Rank Adaptation) vs. Training All Layers (Row 9 vs. 4): Keeping the model frozen and adding trainable LoRA layers (see Appendix E for details) is a viable alternative to training all model parameters and even improves performance by 1 percentage point. As can be seen from the 1% smaller gap between training and validation accuracy when using LoRA, this is likely due to less overfitting.
    7. Padding Input to Full Context Length vs. Longest Training Example (Row 1 vs. 10): Padding the input to the full supported context length results in significantly worse performance.
    8. Padding vs. no padding (Row 1 vs. 11 and 12): The `--no_padding` option disables padding in the dataset, which requires training the model with a batch size of 1 since the inputs have variable lengths. This results in better test accuracy but takes longer to train. In row 12, we additionally enable gradient accumulation with 8 steps to achieve the same batch size as in the other experiments.
    9. Disabling the causal attention mask (Row 1 vs. 13): Disabling the causal attention mask used in the multi-head attention module means all tokens can attend to all other tokens. The model accuracy is slightly improved compared to the GPT model with the causal mask.
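
    A tiny PyTorch-style sketch of takeaway 1: attach the classification head to the last token's hidden state rather than the first. The `backbone` module and tensor shapes are placeholders, not the code from the linked repository.

```python
# Sketch of takeaway 1: classify from the LAST token's hidden state,
# since with a causal mask only the last position attends to the full input.
# `backbone` stands in for a GPT-style model returning hidden states of
# shape (batch, seq_len, emb_dim); it is an assumption, not the repo's code.
import torch
import torch.nn as nn

class SpamClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, emb_dim: int, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)     # (B, T, D)
        last_token = hidden[:, -1, :]         # (B, D) -- not hidden[:, 0, :]
        return self.head(last_token)          # (B, num_classes)
```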

  • View profile for Kristen Kehrer
    Kristen Kehrer is an Influencer

    Mavens of Data Podcast Host, [in]structor, Co-Author of Machine Learning Upgrade

    101,535 followers

    Modeling something like time series goes beyond just throwing features into a model. In the world of time series data, each observation is associated with a specific time point, and part of our goal is to harness the power of temporal dependencies. Enter autoregression and lagging, concepts that tap into the correlation between current and past observations to make forecasts.

    At its core, autoregression involves modeling a time series as a function of its previous values. The current value relies on its historical counterparts. To dive a bit deeper, we use lagged values as features to predict the next data point. For instance, in a simple autoregressive model of order 1 (AR(1)), we predict the current value based on the previous value multiplied by a coefficient. The coefficient determines the impact of the past value on the present one, looking only one time period back.

    One popular approach that can be used in conjunction with autoregression is the ARIMA (AutoRegressive Integrated Moving Average) model. ARIMA is a powerful time series forecasting method that incorporates autoregression, differencing, and moving average components. It's particularly effective for data with trends and seasonality. ARIMA can be fine-tuned with parameters like the order of autoregression, differencing, and moving average to achieve accurate predictions.

    When I was building ARIMAs for econometric time series forecasting, in addition to autoregression, where you're lagging the whole model, I was also taught to lag the individual economic variables. If I was building a model for energy consumption of residential homes, the number of housing permits each month would be a relevant variable. Although, if there’s a ton of housing permits issued in January, you won’t see the actual effect of that until later, when the houses are built and people are actually consuming energy! That variable needed to be lagged by several months.

    Another innovative strategy to enhance time series forecasting is the use of neural networks, particularly Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks. RNNs and LSTMs are designed to handle sequential data like time series. They can learn complex patterns and long-term dependencies within the data, making them powerful tools for autoregressive forecasting. Neural networks are fed with past time steps as inputs to predict future values effectively.

    In addition to autoregression in neural networks, I used lagging there too! When I built an hourly model to forecast electric energy consumption, I actually built 24 individual models, one for each hour, and each hour lagged on the previous one. The energy consumption and weather of the previous hour were very important in predicting what would happen in the next forecasting period. (This model was actually used for determining where they should shift electricity during peak load times.)

    Happy forecasting!
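
    A small sketch of the two ideas above, an AR(1) fit plus a hand-lagged exogenous variable, using statsmodels. The file name, column names, and the 6-month lag are made up for illustration.

```python
# Sketch: AR(1) modeling plus a manually lagged exogenous variable.
# "monthly_energy.csv", the "energy"/"permits" columns, and the 6-month lag
# are illustrative assumptions, not a real dataset.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("monthly_energy.csv", parse_dates=["month"], index_col="month")

# Lag housing permits: permits issued in January only affect consumption
# months later, once the houses are built and occupied.
df["permits_lag6"] = df["permits"].shift(6)
df = df.dropna()

# ARIMA(1, 0, 0) is an AR(1): current value ~ coefficient * previous value.
model = ARIMA(df["energy"], exog=df[["permits_lag6"]], order=(1, 0, 0))
result = model.fit()
print(result.summary())

# One-step-ahead forecast; the exogenous value must already be lagged.
print(result.forecast(steps=1, exog=df[["permits_lag6"]].iloc[[-1]]))
```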

  • View profile for Timothy Goebel

    AI Solutions Architect | Computer Vision & Edge AI Visionary | Building Next-Gen Tech with GENAI | Strategic Leader | Public Speaker

    17,565 followers

    𝐀𝐫𝐞 𝐲𝐨𝐮𝐫 𝐜𝐨𝐦𝐩𝐮𝐭𝐞𝐫 𝐯𝐢𝐬𝐢𝐨𝐧 𝐦𝐨𝐝𝐞𝐥𝐬 𝐟𝐚𝐥𝐥𝐢𝐧𝐠 𝐬𝐡𝐨𝐫𝐭 𝐝𝐞𝐬𝐩𝐢𝐭𝐞 𝐡𝐢𝐠𝐡 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲? 𝐃𝐢𝐬𝐜𝐨𝐯𝐞𝐫 𝐭𝐡𝐞 𝐡𝐢𝐝𝐝𝐞𝐧 𝐩𝐢𝐭𝐟𝐚𝐥𝐥𝐬 𝐚𝐧𝐝 𝐞𝐟𝐟𝐞𝐜𝐭𝐢𝐯𝐞 𝐬𝐭𝐫𝐚𝐭𝐞𝐠𝐢𝐞𝐬 𝐭𝐨 𝐨𝐯𝐞𝐫𝐜𝐨𝐦𝐞 𝐭𝐡𝐞𝐦. 𝐋𝐞𝐚𝐫𝐧 𝐡𝐨𝐰 𝐭𝐨 𝐭𝐚𝐜𝐤𝐥𝐞 𝐢𝐦𝐛𝐚𝐥𝐚𝐧𝐜𝐞𝐝 𝐝𝐚𝐭𝐚, 𝐦𝐢𝐬𝐥𝐞𝐚𝐝𝐢𝐧𝐠 𝐚𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐦𝐞𝐭𝐫𝐢𝐜𝐬, 𝐚𝐧𝐝 𝐞𝐧𝐡𝐚𝐧𝐜𝐞 𝐦𝐨𝐝𝐞𝐥 𝐩𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐰𝐢𝐭𝐡 𝐚𝐝𝐯𝐚𝐧𝐜𝐞𝐝 𝐭𝐞𝐜𝐡𝐧𝐢𝐪𝐮𝐞𝐬.

    𝐈𝐦𝐛𝐚𝐥𝐚𝐧𝐜𝐞𝐝 𝐓𝐫𝐚𝐢𝐧𝐢𝐧𝐠 𝐃𝐚𝐭𝐚
    → Underrepresented classes compared to others.
    → Leads to biased models favoring the majority class.
    → Common in medical diagnosis, fraud detection, object recognition.
    → Requires resampling, data augmentation, class weight adjustment.
    → Metrics like Precision, Recall, F1-Score needed for evaluation.

    𝐀𝐜𝐜𝐮𝐫𝐚𝐜𝐲 𝐃𝐨𝐞𝐬𝐧'𝐭 𝐀𝐥𝐰𝐚𝐲𝐬 𝐆𝐢𝐯𝐞 𝐭𝐡𝐞 𝐂𝐨𝐫𝐫𝐞𝐜𝐭 𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐬 𝐀𝐛𝐨𝐮𝐭 𝐘𝐨𝐮𝐫 𝐓𝐫𝐚𝐢𝐧𝐞𝐝 𝐌𝐨𝐝𝐞𝐥
    → Misleading with imbalanced datasets.
    → High accuracy may hide poor minority class performance.
    → Use Precision, Recall, F1-Score instead.
    → Confusion matrices provide a detailed performance breakdown.
    → Comprehensive evaluation ensures effectiveness across classes.

    𝐏𝐞𝐫𝐟𝐨𝐫𝐦𝐚𝐧𝐜𝐞 𝐌𝐞𝐭𝐫𝐢𝐜𝐬 𝐀𝐬𝐬𝐨𝐜𝐢𝐚𝐭𝐞𝐝 𝐰𝐢𝐭𝐡 𝐋𝐚𝐛𝐞𝐥 1
    → Precision: True positives out of all positive predictions.
    → Recall: True positives out of all actual positives.
    → F1-Score: Harmonic mean of Precision and Recall.
    → Specificity: True negatives out of all actual negatives.
    → Balanced Accuracy: Average Recall across all classes.

    𝐑𝐞𝐜𝐞𝐢𝐯𝐞𝐫 𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐂𝐡𝐚𝐫𝐚𝐜𝐭𝐞𝐫𝐢𝐬𝐭𝐢𝐜 𝐄𝐱𝐩𝐥𝐚𝐢𝐧𝐞𝐝
    → ROC Curve: True Positive Rate vs. False Positive Rate.
    → AUC-ROC: Area summarizing the model's discriminative ability.
    → Threshold Selection: Impacts True Positive and False Positive Rates.
    → Interpreting the Curve: The closer to the top-left, the better the model.
    → Comparing Models: AUC-ROC allows straightforward performance comparison.

    𝐌𝐮𝐥𝐭𝐢-𝐜𝐥𝐚𝐬𝐬 𝐄𝐱𝐚𝐦𝐩𝐥𝐞
    → One-vs-All Approach: Binary classification for each class.
    → Macro-Averaging: Average metrics treating all classes equally.
    → Micro-Averaging: Aggregate metrics, often favors majority classes.
    → Confusion Matrix: Visualize multi-class misclassifications.
    → Per-Class Metrics: Precision, Recall, F1-Score for each class.

    𝐏𝐨𝐬𝐬𝐢𝐛𝐥𝐞 𝐒𝐨𝐥𝐮𝐭𝐢𝐨𝐧𝐬
    → Data Augmentation: Increase minority class samples through transformations.
    → Resampling Techniques: Balance the dataset by oversampling or undersampling.
    → Class Weights Adjustment: Higher importance to the minority class.
    → Advanced Algorithms: Models for imbalanced data, like Balanced Random Forest.
    → Ensemble Methods: Combine multiple models to improve performance.

    ♻️ Repost it to your network and follow Timothy Goebel for more. #computervision #machinelearning #datascience #modelperformance #aitechniques
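
    A short scikit-learn sketch of the evaluation and class-weighting points above; the synthetic imbalanced dataset and the logistic-regression classifier are stand-ins so the snippet runs end to end.

```python
# Sketch: look past accuracy on an imbalanced problem.
# Synthetic 95%/5% data and logistic regression are placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (balanced_accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Class weights: give the minority class proportionally more importance.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print(confusion_matrix(y_test, y_pred))          # detailed per-class breakdown
print(classification_report(y_test, y_pred))     # precision / recall / F1 per class
print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_prob))
```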

  • View profile for Sahar Mor

    I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

    40,500 followers

    In the last three months alone, over ten papers outlining novel prompting techniques were published, boosting LLMs’ performance by a substantial margin. Two weeks ago, a groundbreaking paper from Microsoft demonstrated how a well-prompted GPT-4 outperforms Google’s Med-PaLM 2, a specialized medical model, solely through sophisticated prompting techniques.

    Yet, while our X and LinkedIn feeds buzz with ‘secret prompting tips’, a definitive, research-backed guide aggregating these advanced prompting strategies is hard to come by. This gap prevents LLM developers and everyday users from harnessing these novel frameworks to enhance performance and achieve more accurate results. https://lnkd.in/g7_6eP6y

    In this AI Tidbits Deep Dive, I outline six of the best and recent prompting methods:
    (1) EmotionPrompt - inspired by human psychology, this method utilizes emotional stimuli in prompts to gain performance enhancements
    (2) Optimization by PROmpting (OPRO) - a DeepMind innovation that refines prompts automatically, surpassing human-crafted ones. This paper discovered the “Take a deep breath” instruction that improved LLMs’ performance by 9%.
    (3) Chain-of-Verification (CoVe) - Meta's novel four-step prompting process that drastically reduces hallucinations and improves factual accuracy
    (4) System 2 Attention (S2A) - also from Meta, a prompting method that filters out irrelevant details prior to querying the LLM
    (5) Step-Back Prompting - encouraging LLMs to abstract queries for enhanced reasoning
    (6) Rephrase and Respond (RaR) - UCLA's method that lets LLMs rephrase queries for better comprehension and response accuracy

    Understanding the spectrum of available prompting strategies and how to apply them in your app can mean the difference between a production-ready app and a nascent project with untapped potential.

    Full blog post https://lnkd.in/g7_6eP6y
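
    To show how these methods translate into code, here is a sketch of a Chain-of-Verification-style loop (method 3): draft, plan verification questions, answer them independently, then revise. The `llm()` stub and prompt wording are paraphrased assumptions, not the exact prompts from the paper.

```python
# Sketch of Chain-of-Verification's four steps: draft, plan checks,
# answer the checks, revise. `llm(prompt)` is a stand-in for your chat API.

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your chat-completion call here")

def chain_of_verification(question: str) -> str:
    draft = llm(f"Answer the question:\n{question}")
    checks = llm(
        "List a few verification questions that would reveal factual errors "
        f"in this answer:\n{draft}"
    )
    answers = llm(f"Answer each verification question independently:\n{checks}")
    return llm(
        f"Question: {question}\nDraft answer: {draft}\n"
        f"Verification Q&A:\n{answers}\n"
        "Write a final answer, correcting anything the verification contradicts."
    )
```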

  • View profile for Aishwarya Srinivasan
    Aishwarya Srinivasan is an Influencer
    584,895 followers

    If you are an AI Engineer building production-grade GenAI systems, RAG should be in your toolkit. LLMs are powerful for information generation, but:
    → They hallucinate
    → They don’t know anything post-training
    → They struggle with out-of-distribution queries
    RAG solves this by injecting external knowledge at inference time. But basic RAG (retrieval + generation) isn’t enough for complex use cases. You need advanced techniques to make it reliable in production. Let’s break it down 👇

    🧠 Basic RAG = Retrieval → Generation
    You ask a question.
    → The retriever fetches top-k documents (via vector search, BM25, etc.)
    → The LLM answers based on the query + retrieved context
    But this naive setup fails quickly in the wild. You need to address two hard problems:
    1. Are we retrieving the right documents?
    2. Is the generator actually using them faithfully?

    ⚙️ Advanced RAG = Engineering Both Ends
    To improve retrieval, we have techniques like:
    → Chunk size tuning (fixed vs. recursive splitting)
    → Sliding window chunking (for dense docs)
    → Structured data retrieval (tables, graphs, SQL)
    → Metadata-aware search (filtering by author/date/type)
    → Mixed retrieval (hybrid keyword + dense)
    → Embedding fine-tuning (aligning to domain-specific semantics)
    → Question rewriting (to improve recall)
    To improve generation, options include:
    → Compressing retrieved docs (summarization, reranking)
    → Generator fine-tuning (rewarding citation usage and reasoning)
    → Re-ranking outputs (scoring factuality or domain accuracy)
    → Plug-and-play adapters (LoRA, QLoRA, etc.)

    🧪 Beyond Modular: Joint Optimization
    Some of the most promising work goes further:
    → Fine-tuning retriever + generator end-to-end
    → Retrieval training via generation loss (REACT, RETRO-style)
    → Generator-enhanced search (LLM reformulates the query for better retrieval)
    This is where RAG starts to feel less like a bolt-on patch and more like a full-stack system.

    📏 How Do You Know It's Working?
    Key metrics to track:
    → Context Relevance (Are the right docs retrieved?)
    → Answer Faithfulness (Did the LLM stay grounded?)
    → Negative Rejection (Does it avoid answering when nothing relevant is retrieved?)
    → Tools: RAGAS, FaithfulQA, nDCG, Recall@k

    🛠️ Arvind and I are kicking off a hands-on workshop on RAG
    This first session is designed for beginner to intermediate practitioners who want to move beyond theory and actually build. Here’s what you’ll learn:
    → How RAG enhances LLMs with real-time, contextual data
    → Core concepts: vector DBs, indexing, reranking, fusion
    → Build a working RAG pipeline using LangChain + Pinecone
    → Explore no-code/low-code setups and real-world use cases
    If you're serious about building with LLMs, this is where you start.

    📅 Save your seat and join us live: https://lnkd.in/gS_B7_7d
    Image source: LlamaIndex
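
    A bare-bones version of the basic retrieval → generation loop described above, using sentence-transformers embeddings and cosine similarity. The workshop builds this with LangChain + Pinecone; the embedding model name and the `llm()` stub here are assumptions for illustration.

```python
# Bare-bones RAG: embed docs, retrieve top-k by cosine similarity, then prompt.
# The embedding model and `llm()` stub are placeholders, not a recommendation.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your chat-completion call here")

def build_index(docs):
    # Normalized embeddings make dot product equal to cosine similarity.
    vecs = embedder.encode(docs, normalize_embeddings=True)
    return np.asarray(vecs)

def retrieve(query, docs, vecs, k=3):
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = vecs @ q
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

def answer(query, docs, vecs):
    context = "\n\n".join(retrieve(query, docs, vecs))
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```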

  • View profile for Pan Wu
    Pan Wu is an Influencer

    Senior Data Science Manager at Meta

    48,405 followers

    Machine learning models are built to learn from customer behavior and make predictions. But when that behavior shifts rapidly, like during the pandemic, even the most accurate models can fall behind. That’s exactly what the Data Science team at Booking.com experienced while working on cancellation prediction. In a recent blog post, they shared how they evolved their approach to stay aligned with changing user behavior.

    Originally, the team used traditional classification models to predict whether a booking would be canceled. These models performed well when patterns were stable, but they struggled in fast-changing environments. One key issue: they relied on historical outcomes that often took time to materialize. Plus, they only answered if a cancellation might happen, not when.

    To address these challenges, the team shifted to survival modeling, which estimates the time until an event occurs. This approach enabled them to generate dynamic, time-sensitive predictions over the course of each booking. With multiple enhancements to their survival modeling pipeline, the team saw improved predictive accuracy, especially in volatile conditions. The shift didn’t just boost performance; it showed how reframing a business problem through a different modeling lens can unlock smarter, more adaptable solutions.

    #MachineLearning #DataScience #SurvivalModeling #Classification #SnacksWeeklyonDataScience

    – – – Check out the "Snacks Weekly on Data Science" podcast and subscribe, where I explain in more detail the concepts discussed in this and future posts:
    -- Spotify: https://lnkd.in/gKgaMvbh
    -- Apple Podcast: https://lnkd.in/gj6aPBBY
    -- Youtube: https://lnkd.in/gcwPeBmR
    https://lnkd.in/gU3TsMQP
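
    The blog post's code isn't reproduced here, but the reframing from "will this booking cancel?" to "when might it cancel?" is easy to sketch with the lifelines library. The column names and features below are invented for illustration and are not Booking.com's pipeline.

```python
# Generic time-to-cancellation sketch (not Booking.com's actual model).
# Invented columns: days_observed = days from booking until cancellation or
# check-in; cancelled = 1 if cancelled, 0 if censored (never cancelled).
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("bookings.csv")

cph = CoxPHFitter()
cph.fit(df[["days_observed", "cancelled", "lead_time", "price", "is_refundable"]],
        duration_col="days_observed", event_col="cancelled")
cph.print_summary()

# Probability a booking "survives" (is not yet cancelled) past 7, 14, 30 days,
# giving a time-sensitive prediction rather than a single yes/no label.
surv = cph.predict_survival_function(df.head(), times=[7, 14, 30])
print(surv)
```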

  • View profile for Juan M. Lavista Ferres

    CVP and Chief Data Scientist at Microsoft

    30,725 followers

    Today, Radiology published our latest study on breast cancer. This work, led by Felipe Oviedo Perhavec from Microsoft’s AI for Good Lab and Savannah Partridge (UW/Fred Hutch) in collaboration with researchers from Fred Hutch, University of Washington, University of Kaiserslautern-Landau, and the Technical University of Berlin, explores how AI can improve the accuracy and trustworthiness of breast cancer screening.

    We focused on a key challenge: MRI is an incredibly sensitive screening tool, especially for high-risk women—but it generates far too many false positives, leading to anxiety, unnecessary procedures, and higher costs.

    Our model, FCDD, takes a different approach. Rather than trying to learn what cancer looks like, it learns what normal looks like and flags what doesn’t. In a dataset of over 9,700 breast MRI exams—including real-world screening scenarios—our model:
    - Doubled the positive predictive value vs. traditional models
    - Reduced false positives by 25%
    - Matched radiologists’ annotations with 92% accuracy
    - Generalized well across multiple institutions without retraining

    What’s more, the model produces visual heatmaps that help radiologists see and understand why something was flagged—supporting trust, transparency, and adoption. We’ve made the code and methodology open to the research community. You can read the full paper in Radiology https://lnkd.in/gc82kXPN

    AI won't replace radiologists—but it can sharpen their tools, reduce false alarms, and help save lives.
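
    To make the "learn what normal looks like, flag what doesn't" framing concrete, here is a toy anomaly-scoring baseline on precomputed feature vectors. It is deliberately not the paper's FCDD model, just the simplest distance-to-normal idea under the assumption that you already have exam-level features.

```python
# Toy illustration of anomaly detection: fit a profile of "normal" exams and
# score new exams by their distance from it. NOT the paper's FCDD model.
import numpy as np

def fit_normal_profile(normal_features):
    # Estimate the distribution of "normal" exams from their feature vectors.
    mu = normal_features.mean(axis=0)
    cov = np.cov(normal_features, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])      # regularize so the inverse exists
    return mu, np.linalg.inv(cov)

def anomaly_score(x, mu, cov_inv):
    # Mahalanobis distance from "normal"; larger values are more suspicious.
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))
```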
