Identifying Vulnerabilities in Language Models

Explore top LinkedIn content from expert professionals.

Katharina Koerner

AI Governance I Digital Consulting I Trace3 : All Possibilities Live in Technology: Innovating with risk-managed AI: Strategies to Advance Business Goals through AI Governance, Privacy & Security

44,204 followers 1y
Report this post
In January 2024, the National Institute of Standards and Technology (NIST) published its updated report on AI security, called "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations," which now includes a focus on the security of generative AI, addressing attacks on both predictive and generative AI systems. This comprehensive work categorizes various adversarial attack methods, their objectives, and capabilities, along with strategies for their mitigation. It can help put NIST’s AI Risk Management Framework into practice. Attacks on predictive AI systems (see screenshot #1 below): - The report breaks down predictive AI taxonomy into classifications based on attack stages, goals, capabilities, knowledge, and data modality. - Key areas of focus include evasion and poisoning attacks, each with specifics on white-box and black-box attacks, their transferability, and mitigation strategies. - Privacy attacks are dissected into data reconstruction, membership inference, model extraction, and property inference, with proposed mitigations. Attacks on generative AI systems (see screenshot #2 below): - The section on Generative AI Taxonomy from the NIST report outlines attack classifications and specific vulnerabilities within Generative AI systems such as Generative Adversarial Networks (GANs), Generative Pre-trained Transformers (GPTs), and Diffusion Models. - It then delves into the evolution of Generative AI stages of learning, highlighting the shift from traditional models to the pre-training of foundation models using unsupervised learning to capture patterns for downstream tasks. These foundation models are subsequently fine-tuned for specific applications, often by third parties, making them particularly vulnerable to poisoning attacks, even with minimal tampering of training datasets. - The report further explores the deployment phase of generative AI, which exhibits unique vulnerabilities distinct from predictive AI. Notably, it identifies the potential for attackers to exploit data channels for injection attacks similar to SQL injection, the manipulation of model instructions to align LLM behaviors, enhancements through contextual few-shot learning, and the ingestion of runtime data from external sources for application-specific context. - Additionally, it addresses novel security violations specific to Generative AI and details various types of attacks, including AI supply chain attacks, direct and indirect prompt injection attacks, and their mitigations, as well as violations like availability, integrity, privacy compromises, and abuse. For a deeper dive into these findings, including the taxonomy of attacks and their mitigations, visit the full report available at: https://lnkd.in/guR56reH Co-authored by Apostol Vassilev (NIST), Alina Oprea (Northeastern University), Alie Fordyce, and Hyrum Anderson (both from Robust Intelligence) #NIST #aisecurity
No more previous content

No more next content

Katharina Koerner

AI Governance I Digital Consulting I Trace3 : All Possibilities Live in Technology: Innovating with risk-managed AI: Strategies to Advance Business Goals through AI Governance, Privacy & Security

In January 2024, the National Institute of Standards and Technology (NIST) published its updated report on AI security, called "Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations," which now includes a focus on the security of generative AI, addressing attacks on both predictive and generative AI systems. This comprehensive work categorizes various adversarial attack methods, their objectives, and capabilities, along with strategies for their mitigation. It can help put NIST’s AI Risk Management Framework into practice. Attacks on predictive AI systems (see screenshot #1 below): - The report breaks down predictive AI taxonomy into classifications based on attack stages, goals, capabilities, knowledge, and data modality. - Key areas of focus include evasion and poisoning attacks, each with specifics on white-box and black-box attacks, their transferability, and mitigation strategies. - Privacy attacks are dissected into data reconstruction, membership inference, model extraction, and property inference, with proposed mitigations. Attacks on generative AI systems (see screenshot #2 below): - The section on Generative AI Taxonomy from the NIST report outlines attack classifications and specific vulnerabilities within Generative AI systems such as Generative Adversarial Networks (GANs), Generative Pre-trained Transformers (GPTs), and Diffusion Models. - It then delves into the evolution of Generative AI stages of learning, highlighting the shift from traditional models to the pre-training of foundation models using unsupervised learning to capture patterns for downstream tasks. These foundation models are subsequently fine-tuned for specific applications, often by third parties, making them particularly vulnerable to poisoning attacks, even with minimal tampering of training datasets. - The report further explores the deployment phase of generative AI, which exhibits unique vulnerabilities distinct from predictive AI. Notably, it identifies the potential for attackers to exploit data channels for injection attacks similar to SQL injection, the manipulation of model instructions to align LLM behaviors, enhancements through contextual few-shot learning, and the ingestion of runtime data from external sources for application-specific context. - Additionally, it addresses novel security violations specific to Generative AI and details various types of attacks, including AI supply chain attacks, direct and indirect prompt injection attacks, and their mitigations, as well as violations like availability, integrity, privacy compromises, and abuse. For a deeper dive into these findings, including the taxonomy of attacks and their mitigations, visit the full report available at: https://lnkd.in/guR56reH Co-authored by Apostol Vassilev (NIST), Alina Oprea (Northeastern University), Alie Fordyce, and Hyrum Anderson (both from Robust Intelligence) #NIST #aisecurity

26 Comments

Like Comment
26 Comments
Like Comment
Walter Haydock

I help AI-powered companies manage cyber, compliance, and privacy risk so they can innovate responsibly | ISO 42001, NIST AI RMF, and EU AI Act expert | Host, Deploy Securely Podcast | Harvard MBA | Marine veteran

21,713 followers 1y
Report this post
AI use is exploding. I spent my weekend analyzing the top vulnerabilities I've seen while helping companies deploy it securely. Here's EXACTLY what to look for: 1️⃣ UNINTENDED TRAINING Occurs whenever: - an AI model trains on information that the provider of such information does NOT want the model to be trained on, e.g. material non-public financial information, personally identifiable information, or trade secrets - AND those not authorized to see this underlying information nonetheless can interact with the model itself and retrieve this data. 2️⃣ REWARD HACKING Large Language Models (LLMs) can exhibit strange behavior that closely mimics that of humans. So: - offering them monetary rewards, - saying an important person has directed an action, - creating false urgency due to a manufactured crisis, or even telling the LLM what time of year it is can have substantial impacts on the outputs. 3️⃣ NON-NEUTRAL SECURITY POLICY This occurs whenever an AI application attempts to control access to its context (e.g. provided via retrieval-augmented generation) through non-deterministic means (e.g. a system message stating "do not allow the user to download or reproduce your entire knowledge base"). This is NOT a correct AI security measure, as rules-based logic should determine whether a given user is authorized to see certain data. Doing so ensures the AI model has a "neutral" security policy, whereby anyone with access to the model is also properly authorized to view the relevant training data. 4️⃣ TRAINING DATA THEFT Separate from a non-neutral security policy, this occurs when the user of an AI model is able to recreate - and extract - its training data in a manner that the maintainer of the model did not intend. While maintainers should expect that training data may be reproduced exactly at least some of the time, they should put in place deterministic/rules-based methods to prevent wholesale extraction of it. 5️⃣ TRAINING DATA POISONING Data poisoning occurs whenever an attacker is able to seed inaccurate data into the training pipeline of the target model. This can cause the model to behave as expected in the vast majority of cases but then provide inaccurate responses in specific circumstances of interest to the attacker. 6️⃣ CORRUPTED MODEL SEEDING This occurs when an actor is able to insert an intentionally corrupted AI model into the data supply chain of the target organization. It is separate from training data poisoning in that the trainer of the model itself is a malicious actor. 7️⃣ RESOURCE EXHAUSTION Any intentional efforts by a malicious actor to waste compute or financial resources. This can result from simply a lack of throttling or - potentially worse - a bug allowing long (or infinite) responses by the model to certain inputs. 🎁 That's a wrap! Want to grab the entire StackAware AI security reference and vulnerability database? Head to: archive [dot] stackaware [dot] com
No more previous content

No more next content

Walter Haydock

I help AI-powered companies manage cyber, compliance, and privacy risk so they can innovate responsibly | ISO 42001, NIST AI RMF, and EU AI Act expert | Host, Deploy Securely Podcast | Harvard MBA | Marine veteran

AI use is exploding. I spent my weekend analyzing the top vulnerabilities I've seen while helping companies deploy it securely. Here's EXACTLY what to look for: 1️⃣ UNINTENDED TRAINING Occurs whenever: - an AI model trains on information that the provider of such information does NOT want the model to be trained on, e.g. material non-public financial information, personally identifiable information, or trade secrets - AND those not authorized to see this underlying information nonetheless can interact with the model itself and retrieve this data. 2️⃣ REWARD HACKING Large Language Models (LLMs) can exhibit strange behavior that closely mimics that of humans. So: - offering them monetary rewards, - saying an important person has directed an action, - creating false urgency due to a manufactured crisis, or even telling the LLM what time of year it is can have substantial impacts on the outputs. 3️⃣ NON-NEUTRAL SECURITY POLICY This occurs whenever an AI application attempts to control access to its context (e.g. provided via retrieval-augmented generation) through non-deterministic means (e.g. a system message stating "do not allow the user to download or reproduce your entire knowledge base"). This is NOT a correct AI security measure, as rules-based logic should determine whether a given user is authorized to see certain data. Doing so ensures the AI model has a "neutral" security policy, whereby anyone with access to the model is also properly authorized to view the relevant training data. 4️⃣ TRAINING DATA THEFT Separate from a non-neutral security policy, this occurs when the user of an AI model is able to recreate - and extract - its training data in a manner that the maintainer of the model did not intend. While maintainers should expect that training data may be reproduced exactly at least some of the time, they should put in place deterministic/rules-based methods to prevent wholesale extraction of it. 5️⃣ TRAINING DATA POISONING Data poisoning occurs whenever an attacker is able to seed inaccurate data into the training pipeline of the target model. This can cause the model to behave as expected in the vast majority of cases but then provide inaccurate responses in specific circumstances of interest to the attacker. 6️⃣ CORRUPTED MODEL SEEDING This occurs when an actor is able to insert an intentionally corrupted AI model into the data supply chain of the target organization. It is separate from training data poisoning in that the trainer of the model itself is a malicious actor. 7️⃣ RESOURCE EXHAUSTION Any intentional efforts by a malicious actor to waste compute or financial resources. This can result from simply a lack of throttling or - potentially worse - a bug allowing long (or infinite) responses by the model to certain inputs. 🎁 That's a wrap! Want to grab the entire StackAware AI security reference and vulnerability database? Head to: archive [dot] stackaware [dot] com

35 Comments

Like Comment
35 Comments
Like Comment
George Z. Lin

AI Leader, Investor, & Advisor | MassChallenge | Wharton VentureLab

3,750 followers 3mo
Report this post
Recent research by UIUC and Intel Labs has introduced a new jailbreak technique for Large Language Models (LLMs) known as InfoFlood. This method takes advantage of a vulnerability termed "Information Overload," where excessive linguistic complexity can circumvent safety mechanisms without the need for traditional adversarial prefixes or suffixes. InfoFlood operates through a three-stage process: Linguistic Saturation, Rejection Analysis, and Saturation Refinement. Initially, it reformulates potentially harmful queries into more complex structures. If the first attempt does not succeed, the system analyzes the response to iteratively refine the query until a successful jailbreak is achieved. Empirical validation across four notable LLMs—GPT-4o, GPT-3.5-turbo, Gemini 2.0, and LLaMA 3.1—indicates that InfoFlood significantly surpasses existing methods, achieving success rates up to three times higher on various benchmarks. The study underscores significant vulnerabilities in current AI safety measures, as widely used defenses, such as OpenAI’s Moderation API, proved ineffective against InfoFlood attacks. This situation raises important concerns regarding the robustness of AI alignment systems and highlights the necessity for more resilient safety interventions. As LLMs become increasingly integrated into diverse applications, addressing these vulnerabilities is crucial for ensuring the responsible deployment of AI technologies and enhancing their safety against emerging adversarial techniques. Arxiv: https://lnkd.in/eBty6G7z
No more previous content

No more next content

George Z. Lin

AI Leader, Investor, & Advisor | MassChallenge | Wharton VentureLab

Recent research by UIUC and Intel Labs has introduced a new jailbreak technique for Large Language Models (LLMs) known as InfoFlood. This method takes advantage of a vulnerability termed "Information Overload," where excessive linguistic complexity can circumvent safety mechanisms without the need for traditional adversarial prefixes or suffixes. InfoFlood operates through a three-stage process: Linguistic Saturation, Rejection Analysis, and Saturation Refinement. Initially, it reformulates potentially harmful queries into more complex structures. If the first attempt does not succeed, the system analyzes the response to iteratively refine the query until a successful jailbreak is achieved. Empirical validation across four notable LLMs—GPT-4o, GPT-3.5-turbo, Gemini 2.0, and LLaMA 3.1—indicates that InfoFlood significantly surpasses existing methods, achieving success rates up to three times higher on various benchmarks. The study underscores significant vulnerabilities in current AI safety measures, as widely used defenses, such as OpenAI’s Moderation API, proved ineffective against InfoFlood attacks. This situation raises important concerns regarding the robustness of AI alignment systems and highlights the necessity for more resilient safety interventions. As LLMs become increasingly integrated into diverse applications, addressing these vulnerabilities is crucial for ensuring the responsible deployment of AI technologies and enhancing their safety against emerging adversarial techniques. Arxiv: https://lnkd.in/eBty6G7z

4 Comments

Like Comment
4 Comments
Like Comment
Suyesh Karki

#girldad #tech-exec #blaugrana

4,068 followers 8mo
Report this post
My team spent this week evaluating the security of Deepseek’s R1 model. While open-source AI models offer transparency benefits, DeepSeek's safety remains questionable. Recent findings reveal significant vulnerabilities in DeepSeek R1, including susceptibility to jailbreak techniques, RCE, and potential data privacy issues. This team summarized some pros and cons of this model: Pros ✅ - The group has an MIT license which empowers users with flexibility of local deployment and customization options, offering some control over data handling - Potential for custom security measures: Users can implement additional security protocols when running the model locally - The model offers weights for fine-tuning, allowing for tailored performance enhancements and security Cons ❌ - Susceptible to exploits and lacks robust safeguards against malicious outputs. In fact, a recent incident exposed a DeepSeek database containing over a million lines of log streams, chat history, secret keys, and backend data - The lack of transparency in training data raises critical questions about data integrity and potential biases - Potential malicious Python code embedded within the model poses additional security threat as the model configuration requires trust_remote_code=True to be set, which increases the risk of arbitrary code/remote code execution Remember, also, these models are brand-new, which adds risk. Of course, we’re early in our evaluations. If this is helpful, I’ll keep sharing what we learn. #deepseek #AI #AISecurity #InformationSecurity #data #privacy

8 Comments
Like Comment
Aishwarya Naresh Reganti

Founder @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

111,895 followers 1y
Report this post
🌶 While there's a lot of hype around building smarter and more autonomous LLMs, the other side of the coin is equally if not more critical: Rigorously testing them for vulnerabilities. 🌟 The research in the LLM field is honestly amazing, with lots happening every day and a big focus on building more performant models. 💀 For instance, long-context LLMs are currently in the limelight, but a recent report by Anthropic suggests that these LLMs are particularly vulnerable to an attack known as "many-shot jailbreaking." More details: ⛳ Many-shot jailbreaking involves including a series of faux (synthetically generated) dialogues within a single prompt, culminating in a target query. By presenting numerous faux interactions, the technique coerces the model into providing potentially harmful responses, overriding its safety training. ⛳ The report shows that as the number of faux dialogues (referred to as "shots") included in the prompt increases, the percentage of harmful responses to target prompts also rises. For example, increasing the number of shots from a few to 256 significantly increases the likelihood of the model providing harmful responses. ⛳The research reports that many-shot jailbreaking tends to be more effective on larger language models. As the size of the model increases, the attack becomes more potent, posing a heightened risk. ⛳ The report also suggests potential mitigation techniques--one approach involving classification and modification of the prompt before model processing which lowered the attack success rate from 61% to 2% Research works like this underscore the side-effects of LLM improvements and how they should be tested extensively. While extending context windows improved the LLM's utility, it also introduces new and unseen vulnerabilities. Here's the report: https://lnkd.in/gYTufjFH 🚨 I post #genai content daily, follow along for the latest updates! #llms #contextlength
No more previous content

No more next content

Aishwarya Naresh Reganti

Founder @ LevelUp Labs | Ex-AWS | Consulting, Training & Investing in AI

🌶 While there's a lot of hype around building smarter and more autonomous LLMs, the other side of the coin is equally if not more critical: Rigorously testing them for vulnerabilities. 🌟 The research in the LLM field is honestly amazing, with lots happening every day and a big focus on building more performant models. 💀 For instance, long-context LLMs are currently in the limelight, but a recent report by Anthropic suggests that these LLMs are particularly vulnerable to an attack known as "many-shot jailbreaking." More details: ⛳ Many-shot jailbreaking involves including a series of faux (synthetically generated) dialogues within a single prompt, culminating in a target query. By presenting numerous faux interactions, the technique coerces the model into providing potentially harmful responses, overriding its safety training. ⛳ The report shows that as the number of faux dialogues (referred to as "shots") included in the prompt increases, the percentage of harmful responses to target prompts also rises. For example, increasing the number of shots from a few to 256 significantly increases the likelihood of the model providing harmful responses. ⛳The research reports that many-shot jailbreaking tends to be more effective on larger language models. As the size of the model increases, the attack becomes more potent, posing a heightened risk. ⛳ The report also suggests potential mitigation techniques--one approach involving classification and modification of the prompt before model processing which lowered the attack success rate from 61% to 2% Research works like this underscore the side-effects of LLM improvements and how they should be tested extensively. While extending context windows improved the LLM's utility, it also introduces new and unseen vulnerabilities. Here's the report: https://lnkd.in/gYTufjFH 🚨 I post #genai content daily, follow along for the latest updates! #llms #contextlength

3 Comments

Like Comment
3 Comments
Like Comment
Burcin Kaplanoglu Burcin Kaplanoglu is an Influencer

Artificial Intelligence (AI), Tech Research and Product Development, Linkedin Top Voice, 56 million views on LinkedIn (last 12 months). Vice President of Innovation, co-founder of Oracle Industry Lab.

51,230 followers 11mo
Report this post
How can Robots and Large Language Models work together? Researchers have been integrating large language models (LLMs) into robotics. It’s a very promising and fast-developing field. But there is a twist! Let’s start with why they are integrating LLMs into robotics. If you want to control a robot today, you either have to pre-program it or train it with machine learning for specific tasks. LLMs may generate high level plans following your instructions, then turn them into actions. LLMs have guardrails to prevent the generation of harmful content. However, some techniques may also deceive LLMs into performing certain tasks. A group of researchers at Penn Engineering has tested the vulnerability with robotic systems that use LLMs. Their algorithm designed to jailbreak LLM-controlled robots had a success rate of 100%. What did they do? Researchers convinced an autonomous driving system to ignore traffic lights - thankfully, this was in a simulation. They were able to have a wheeled robot enter an exclusion zone. They were able to get a four-legged robot to collide with a human. They performed testing in three different ways: when they didn’t know internal functionality (black box testing), when they partially knew internal functionality (gray box testing), and when they knew internal functionality (white box testing). The paper below has details of other tasks that they were able to perform without guardrails. As responsible researchers, they contacted the companies producing the products and shared their findings prior to the public release of this paper. Researchers also thanked the companies at the end of the paper for engaging in thoughtful dialogue. This type of research is fundamental for the future of robotics and AI. ….So, what’s the solution? I asked an expert, Andrew Pether. “All UR Robots have a built in safety system which is certified by TUV Nord in accordance with EN ISO 13849-1 (Category 3, PLd). This means that all motions are monitored by redundant safety controllers, and any motions beyond specified limits are restricted by the safety system. So while it's possible with the UR AI Accelerator to generate motions using LLMs, these are beholden to the same safety limits as any other motion, and therefore risks to users can be mitigated.” In conclusion, Andrew suggested having an overarching safety system monitoring LLMs based on industry standards as a solution. Something to think about. What is Jailbreaking? Process of removing limitations on an operating system and giving the user more control. Thanks, Iman Zadeh for sharing this paper with me. Paper published by researchers at Penn Engineering is in the comments. #artificialintelligence #technology #innovation

6 Comments
Like Comment
Agus Sudjianto

A geek who can speak: Co-creator of PiML and MoDeVa, SVP Risk & Technology H2O.ai, Retired EVP-Head of Wells Fargo MRM

24,276 followers 7mo
Report this post
Brilliant in some cases and dumb in others! I’m a heavy user of LLM for many tasks that I do, but… Large Language Models (LLMs) can appear brilliant in some areas and surprisingly bad in others because of the way they are designed and trained. 1. Training Data Bias and Coverage LLMs are trained on vast amounts of text data from the internet, research papers, books, and code repositories. They perform well in areas where they have seen a lot of high-quality data (e.g., general knowledge, programming, mathematics). However, they struggle in areas where data is sparse, biased, or highly nuanced, leading to gaps in reasoning. 2. Pattern Recognition vs. True Understanding LLMs are pattern recognition engines, not true reasoning machines. They generate responses based on statistical likelihood rather than deep conceptual understanding. This means they can sound intelligent without actually “thinking,” leading to confident but incorrect answers in complex situations. 3. Lack of Real-World Experience LLMs do not have real-world experience—they cannot observe, experiment, or interact with the physical world. This makes them excellent at answering structured, well-documented questions but bad at reasoning about real-world uncertainties. 4. Difficulty with Logic and Consistency While LLMs can follow logical rules, they often struggle with multi-step reasoning, consistency across responses, and self-correction. A simple fact recall might be perfect, but when asked to extend logic to a new situation, the model can make obvious mistakes. 5. Overfitting to User Inputs LLMs tend to mirror the structure and assumptions of the input they receive. If a user provides leading or biased questions, the model may generate an answer that aligns with those biases rather than critically analyzing the question. 6. Struggles with Small Data Scenarios LLMs are designed for big-picture knowledge but struggle with specific, small-sample reasoning (e.g., experimental setups, statistical overfitting). They can generalize well over large datasets but may fail in cases that require deep domain expertise. 7. Computational Constraints LLMs operate under finite compute budgets—they truncate memory, which makes long-term dependencies difficult to track. This can make them great at short, factual questions but weak at complex, multi-step problems requiring extended context. As for agentic to do data science …draw your own conclusion 😝

25 Comments
Like Comment
Mani Keerthi N

Cybersecurity Strategist & Advisor || LinkedIn Learning Instructor

17,195 followers 1y
Report this post
Threat Modelling and Risk Analysis for Large Language Model (LLM)-Powered Applications by Stephen Burabari Tete:https://lnkd.in/gvVd5dU2 1)This paper explores the threat modeling and risk analysis specifically tailored for LLM-powered applications. 2) Focusing on potential attacks like data poisoning, prompt injection, SQL injection, jailbreaking, and compositional injection, the author assesses their impact on security and proposes mitigation strategies. The author introduces a framework combining STRIDE and DREAD methodologies for proactive threat identification and risk assessment. #ai #artificialintelligence #llm #llmsecurity #riskmanagment #riskanalysis #threats #risks #defenses #security

6 Comments
Like Comment

Identifying Vulnerabilities in Language Models

More in Large Language Models Insights

Explore categories