Understanding AI Training's Impact on User Data

Explore top LinkedIn content from expert professionals.

  • View profile for Richard Lawne

    Privacy & AI Lawyer

    2,550 followers

    I'm increasingly convinced that we need to treat "AI privacy" as a distinct field within privacy, separate from but closely related to "data privacy". Just as the digital age required the evolution of data protection laws, AI introduces new risks that challenge existing frameworks, forcing us to rethink how personal data is ingested and embedded into AI systems. Key issues include:

    🔹 Mass-scale ingestion – AI models are often trained on huge datasets scraped from online sources, including publicly available and proprietary information, without individuals' consent.
    🔹 Personal data embedding – Unlike traditional databases, AI models compress, encode, and entrench personal data within their training, blurring the lines between the data and the model.
    🔹 Data exfiltration & exposure – AI models can inadvertently retain and expose sensitive personal data through overfitting, prompt injection attacks, or adversarial exploits (a toy probe of this is sketched after this post).
    🔹 Superinference – AI uncovers hidden patterns and makes powerful predictions about our preferences, behaviours, emotions, and opinions, often revealing insights that we ourselves may not even be aware of.
    🔹 AI impersonation – Deepfake and generative AI technologies enable identity fraud, social engineering attacks, and unauthorized use of biometric data.
    🔹 Autonomy & control – AI may be used to make or influence critical decisions in domains such as hiring, lending, and healthcare, raising fundamental concerns about autonomy and contestability.
    🔹 Bias & fairness – AI can amplify biases present in training data, leading to discriminatory outcomes in areas such as employment, financial services, and law enforcement.

    To date, privacy discussions have focused on data - how it's collected, used, and stored. But AI challenges this paradigm. Data is no longer static. It is abstracted, transformed, and embedded into models in ways that challenge conventional privacy protections. If "AI privacy" is about more than just the data, should privacy rights extend beyond inputs and outputs to the models themselves? If a model learns from us, should we have rights over it?

    #AI #AIPrivacy #Dataprivacy #Dataprotection #AIrights #Digitalrights
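
On the exfiltration point above, the leak shows up as a score gap. Below is a minimal sketch in which a character-bigram model stands in for a real generative model (purely an illustrative assumption): a record duplicated in the training data receives a noticeably lower loss than an unseen record of the same shape, which is exactly the signal membership-inference and extraction audits look for.

```python
# Toy illustration: overfitting to duplicated training data makes membership
# visible as a loss gap. A character-bigram model stands in for a real model.
import math
from collections import defaultdict

def train_bigram(corpus: str):
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts

def avg_nll(counts, text: str) -> float:
    """Average negative log-likelihood of `text` under add-one-smoothed bigram counts."""
    vocab = set(counts) | {c for row in counts.values() for c in row}
    total = 0.0
    for a, b in zip(text, text[1:]):
        row = counts.get(a, {})
        p = (row.get(b, 0) + 1) / (sum(row.values()) + len(vocab) + 1)
        total += -math.log(p)
    return total / max(len(text) - 1, 1)

# A fake personal record ("canary") appears several times in the training corpus.
canary = "alice example ssn 123-45-6789"
corpus = ("generic product review text about shoes and delivery times. " * 50
          + (canary + " ") * 5)

model = train_bigram(corpus)
unseen = "brian sample ssn 987-65-4321"   # same format, never seen in training
print(f"loss on memorized record: {avg_nll(model, canary):.2f}")   # lower
print(f"loss on unseen record:    {avg_nll(model, unseen):.2f}")   # higher
```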

  • View profile for Amaka Ibeji FIP, AIGP, CIPM, CISA, CISM, CISSP, DDN QTE

    Digital Trust Leader | Privacy & AI Governance Expert | Founder of PALS Hub & DPO Africa Network | 100 Brilliant Women in AI Ethics™ 2025 | Bridging Technology & Human Connection | Speaker & Coach | IAPP & DDN Faculty

    14,571 followers

    21/86: 𝗜𝘀 𝗬𝗼𝘂𝗿 𝗔𝗜 𝗠𝗼𝗱𝗲𝗹 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗼𝗻 𝗣𝗲𝗿𝘀𝗼𝗻𝗮𝗹 𝗗𝗮𝘁𝗮? Your AI needs data, but is it using personal data responsibly?

    🛑 Threat Alert: If your AI model trains on data linked to individuals, you risk privacy violations, legal & regulatory consequences, and erosion of digital trust.

    🔍 𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻𝘀 𝘁𝗼 𝗔𝘀𝗸 𝗕𝗲𝗳𝗼𝗿𝗲 𝗨𝘀𝗶𝗻𝗴 𝗗𝗮𝘁𝗮 𝗶𝗻 𝗔𝗜 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴
    📌 Is personal data necessary? If not essential, don't use it.
    📌 Are unique identifiers included? Consider pseudonymization or anonymization.
    📌 Do you have a legal basis? If the model uses PII, document your justification.
    📌 Are privacy risks documented & mitigated? Ensure privacy impact assessments (PIAs) are conducted.

    ✅ What You Should Do
    ➡️ Minimize PII usage – Only use personal data when absolutely necessary.
    ➡️ Apply de-identification techniques – Use pseudonymization, anonymization, or differential privacy where possible (a minimal pseudonymization sketch follows this post).
    ➡️ Document & justify your approach – Keep records of privacy safeguards & compliance measures.
    ➡️ Align with legal & ethical AI principles – Ensure your model respects privacy, fairness, and transparency.

    Privacy is not a luxury, it’s a necessity for AI to be trusted. Protecting personal data strengthens compliance, ethics, and public trust in AI systems.

    💬 How do you ensure AI models respect privacy? Share your thoughts below! 👇
    🔗 Follow PALS Hub and Amaka Ibeji for more AI risk insights!

    #AIonAI #AIPrivacy #DataProtection #ResponsibleAI #DigitalTrust
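
A minimal sketch of the pseudonymization step suggested above, assuming training records arrive as dicts with known identifier fields. The field names, the placeholder key, and the choice of HMAC-SHA256 are illustrative assumptions, not a complete de-identification programme: quasi-identifiers, free text, and re-identification risk still need separate treatment.

```python
import hmac
import hashlib

PSEUDONYM_KEY = b"rotate-me-and-store-in-a-secrets-manager"  # placeholder key
DIRECT_IDENTIFIERS = {"email", "phone", "full_name"}          # dropped outright
LINKABLE_IDENTIFIERS = {"user_id"}                            # replaced by a keyed hash

def pseudonymize(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # minimize: direct identifiers never reach the training set
        if field in LINKABLE_IDENTIFIERS:
            digest = hmac.new(PSEUDONYM_KEY, str(value).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]  # stable pseudonym, not reversible without the key
        else:
            out[field] = value
    return out

record = {"user_id": "u-1029", "email": "pat@example.com",
          "full_name": "Pat Doe", "purchase_total": 42.10}
print(pseudonymize(record))  # only the keyed pseudonym and non-identifying fields remain
```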

  • View profile for Katharina Koerner

    AI Governance I Digital Consulting I Trace3 : All Possibilities Live in Technology: Innovating with risk-managed AI: Strategies to Advance Business Goals through AI Governance, Privacy & Security

    44,187 followers

    Generative AI systems are increasingly evaluated for their social impact, but there's no standardized approach yet. This paper from June 2023 presents a framework for evaluating the social impact of generative AI systems, catering to researchers and developers, third-party auditors and red-teamers, and policymakers.

    Social impact is defined by the authors "as the effect of a system on people and communities along any timeline with a focus on marginalization, and active harm that can be evaluated."

    The framework defines 7 categories of social impact:
    - bias, stereotypes, and representational harms;
    - cultural values and sensitive content;
    - disparate performance;
    - privacy and data protection;
    - financial costs;
    - environmental costs;
    - data and content moderation labor costs.

    E.g., the paper explains that safeguarding personal information and privacy relies on proper handling of training data, methods, and security measures. The paper stresses that there is great potential for more comprehensive privacy evaluations of GenAI systems:
    - Addressing the issue of memorization of training examples (a canary-style probe is sketched after this post).
    - Ensuring that only lawfully obtained data is shared with third parties.
    - Prioritizing individual consent and choices.

    GenAI systems are harder to evaluate without clear documentation, systems for obtaining consent (e.g., opt-out mechanisms), and appropriate technical and process controls. Rules for leveraging end-user data for training purposes are often unclear, and the immense size of training datasets makes scrutiny increasingly difficult. Therefore, privacy risk assessments should go beyond proxies, focusing on memorization, data sharing, and security controls, and require extensive audits of processes and governance.

    5 overarching categories for evaluation in society are suggested:
    - trustworthiness and autonomy;
    - inequality, marginalization, and violence;
    - concentration of authority;
    - labor and creativity;
    - ecosystem and environment.

    Each category includes subcategories and recommendations for mitigating harm. E.g., the category of trustworthiness and autonomy includes "Personal Privacy and Sense of Self". The authors emphasize that the impacts and harms from the violation of privacy are difficult to enumerate and evaluate. Mitigation should first determine who is responsible for an individual's privacy, but it requires both individual and collective action. The paper points out that technical methods to preserve privacy in a GenAI system, as seen in privacy-preserving approaches to language modeling, cannot guarantee full protection. Improving common practices and better global regulation for collecting training data can help.

    By Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, Yacine Jernite, Alexandra Sasha Luccioni, Alberto Lusoli, Margaret Mitchell, Jessica Newman, Marie-Therese Png, Andrew Strait, Apostol Vassilev
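
One concrete way to act on the memorization point is a canary-style probe: plant a unique string in the training data, then check at audit time whether the model reproduces it from its prefix. The sketch below is illustrative only; `fake_complete` is a stand-in for whatever text-generation call your stack exposes (an assumption, not a real API), and a real audit would query the trained model.

```python
from typing import Callable

def canary_exposed(complete: Callable[[str], str], prefix: str, secret_suffix: str) -> bool:
    """Seed `prefix + secret_suffix` into training data beforehand; at audit time,
    prompt with the prefix alone and check whether the secret comes back."""
    continuation = complete(prefix)
    return secret_suffix in continuation

# Stand-in "model" that has memorized its training text verbatim (worst case).
TRAINING_TEXT = "support ticket 4411: customer card number 4929-1111-2222-3333 declined"
def fake_complete(prompt: str) -> str:
    i = TRAINING_TEXT.find(prompt)
    return TRAINING_TEXT[i + len(prompt):] if i >= 0 else ""

print(canary_exposed(fake_complete,
                     prefix="support ticket 4411: customer card number ",
                     secret_suffix="4929-1111-2222-3333"))  # True -> memorization detected
```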

  • View profile for Debbie Reynolds

    The Data Diva | Global Data Advisor | Retain Value. Reduce Risk. Increase Revenue. Powered by Cutting-Edge Data Strategy

    39,585 followers

    🧠 “Data systems are designed to remember data, not to forget data.” – Debbie Reynolds, The Data Diva

    🚨 I just published a new essay in the Data Privacy Advantage newsletter called: 🧬 An AI Data Privacy Cautionary Tale: Court-Ordered Data Retention Meets Privacy 🧬

    🧠 This essay explores the recent court order from the United States District Court for the Southern District of New York in the New York Times v. OpenAI case. The court ordered OpenAI to preserve all user interactions, including chat logs, prompts, API traffic, and generated outputs, with no deletion allowed, not even at the user's request.

    💥 That means:
    💥 “Delete” no longer means delete
    💥 API business users are not exempt
    💥 Personal, confidential, or proprietary data entered into ChatGPT could now be locked in indefinitely
    💥 Even if you never knew your data would be involved in litigation, it may now be preserved beyond your control

    🏛️ This order overrides global privacy laws, such as the GDPR and CCPA, highlighting how litigation can erode deletion rights and intensify the risks associated with using generative AI tools.

    🔍 In the essay, I cover:
    ✅ What the court order says and why it matters
    ✅ Why enterprise API users are directly affected
    ✅ How AI models retain data behind the scenes
    ✅ The conflict between privacy laws and legal hold obligations
    ✅ What businesses should do now to avoid exposure

    💡 My recommendations include:
    • Train employees on what not to submit to AI (a simple input-screening sketch follows this post)
    • Curate all data inputs with legal oversight
    • Review vendor contracts for retention language
    • Establish internal policies for AI usage and audits
    • Require transparency from AI providers

    🏢 If your organization is using generative AI, even in limited ways, now is the time to assess your data discipline. AI inputs are no longer just temporary interactions; they are potentially discoverable records. And now, courts are treating them that way.

    📖 Read the full essay to understand why AI data privacy cannot be an afterthought.

    #Privacy #Cybersecurity #datadiva #DataPrivacy #AI #LegalRisk #LitigationHold #PrivacyByDesign #TheDataDiva #OpenAI #ChatGPT #Governance #Compliance #NYTvOpenAI #GenerativeAI #DataGovernance #PrivacyMatters
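
One way to operationalize "train employees on what not to submit" is to screen prompts for obvious sensitive patterns before they leave the organization. The sketch below is a minimal, assumed example: the pattern list and the block/allow policy are illustrative only, and a real control would be tuned with legal and security oversight rather than relying on regexes alone.

```python
import re

SENSITIVE_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def screen_prompt(prompt: str) -> list[str]:
    """Return the names of sensitive patterns found in the prompt (empty list = OK to send)."""
    return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(prompt)]

prompt = "Summarize this complaint from jane.doe@example.com about SSN 123-45-6789."
findings = screen_prompt(prompt)
if findings:
    print(f"Blocked: prompt contains {', '.join(findings)}")  # do not send; route for review
else:
    print("No obvious sensitive data detected; sending to the AI tool.")
```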

  • View profile for Odia Kagan

    CDPO, CIPP/E/US, CIPM, FIP, GDPRP, PLS, Partner, Chair of Data Privacy Compliance and International Privacy at Fox Rothschild LLP

    24,029 followers

    The UK Information Commissioner's Office has issued for public comment guidance on the lawful basis for scraping data from the web to train generative AI (Chapter 1 of the consultation). Key points:

    🔹 Most developers of generative AI rely on publicly accessible sources for their training data, usually through web scraping.
    🔹 To be fair and lawful, your data collection can't be in breach of any laws - this will not be met if the scraping of personal data infringes other legislation outside of data protection, such as intellectual property or contract law.
    🔹 Legitimate interests can be a valid lawful basis for training generative AI models on web-scraped data, but only when the model's developer can ensure they pass the three-part test.

    Purpose test: is there a valid interest?
    🔹 Despite the many potential downstream uses of a model, you need to frame the interest in a specific, rather than open-ended, way, based on what information you can have access to at the time of collecting the training data.
    🔹 If you don't know what your model is going to be used for, how can you ensure its downstream use will respect data protection and people's rights and freedoms?

    Necessity test: is web scraping necessary given the purpose?
    🔹 The ICO's understanding is that, currently, most generative AI training is only possible using the volume of data obtained through large-scale scraping.

    Balancing test: do individuals' rights override the interest of the generative AI developer?
    🔹 Collecting data through web scraping is an 'invisible processing' activity.
    🔹 Invisible processing and AI-related processing are both seen as high-risk activities that require a DPIA under ICO guidance.

    Risk mitigations to consider in the balancing test

    If you are the developer and rely on the public interest of the wider society for the first part of the test, you should be able to:
    🔹 control and evidence whether the generative AI model is actually used for the stated wider societal benefit;
    🔹 assess risks to individuals (both in advance during generative AI development and as part of ongoing monitoring post-deployment);
    🔹 implement technical and organisational measures to mitigate risks.

    If you deploy a third-party model through an API:
    🔹 The developer should implement TOMs (e.g. output filters) and organisational controls over the deployment, such as limiting queries (preventing those likely to result in risks or harms to individuals) and monitoring the use of the model (a minimal sketch of such controls follows this post).
    🔹 Use contractual restrictions and measures, with the developer legally limiting the ways in which the generative AI model can be used by its customers.

    If you provide a model to a third party:
    🔹 Use contractual controls to mitigate the risks of lack of control over how the model is used - but that might not be effective.
    🔹 You need to evidence that any such controls are being complied with in practice.

    #dataprivacy #dataprotection #privacyFOMO https://lnkd.in/eev_Qhah
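
A minimal sketch of the deployment-side controls mentioned above (limiting risky query types plus an output filter), assuming a text-generation callable. The blocked-query heuristic, the redaction patterns, and the `generate` callable are illustrative assumptions, not the ICO's prescription; a production TOM would add classifiers, logging, and human review, agreed with the model developer.

```python
import re
from typing import Callable

BLOCKED_QUERY = re.compile(r"(home address|phone number|medical record) of\b", re.IGNORECASE)
REDACT = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[redacted email]"),
    (re.compile(r"\b(?:\+?\d[\s-]?){9,12}\d\b"), "[redacted number]"),
]

def guarded_generate(generate: Callable[[str], str], prompt: str) -> str:
    if BLOCKED_QUERY.search(prompt):
        return "This request was declined by the deployment policy."  # query limiting
    output = generate(prompt)
    for pattern, replacement in REDACT:
        output = pattern.sub(replacement, output)                      # output filter
    return output

# Stand-in generator so the sketch runs end to end.
def fake_generate(prompt: str) -> str:
    return "Contact the author at jane@example.com or +44 7700 900123."

print(guarded_generate(fake_generate, "What is the home address of Jane Doe?"))
print(guarded_generate(fake_generate, "Summarize the press release."))
```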

  • View profile for Sam Castic

    Privacy Leader and Lawyer; Partner @ Hintze Law

    3,606 followers

    The Oregon Department of Justice released new guidance on legal requirements when using AI. Here are the key privacy considerations, and four steps for companies to stay in line with Oregon privacy law. ⤵️

    The guidance details the AG's views of how uses of personal data in connection with AI, or to train AI models, trigger obligations under the Oregon Consumer Privacy Act, including:

    🔸 Privacy Notices. Companies must disclose in their privacy notices when personal data is used to train AI systems.
    🔸 Consent. Updated privacy policies disclosing uses of personal data for AI training cannot justify the use of previously collected personal data for AI training; affirmative consent must be obtained.
    🔸 Revoking Consent. Where consent is provided to use personal data for AI training, there must be a way to withdraw consent, and processing of that personal data must end within 15 days.
    🔸 Sensitive Data. Explicit consent must be obtained before sensitive personal data is used to develop or train AI systems.
    🔸 Training Datasets. Developers purchasing or using third-party personal data sets for model training may be personal data controllers, with all the required obligations that data controllers have under the law.
    🔸 Opt-Out Rights. Consumers have the right to opt out of AI uses for certain decisions like housing, education, or lending.
    🔸 Deletion. Consumer #PersonalData deletion rights need to be respected when using AI models.
    🔸 Assessments. Using personal data in connection with AI models, or processing it in connection with AI models that involve profiling or other activities with heightened risk of harm, triggers data protection assessment requirements.

    The guidance also highlights a number of scenarios where sales practices using AI or misrepresentations due to AI use can violate the Unlawful Trade Practices Act.

    Here are a few steps to help stay on top of #privacy requirements under Oregon law and this guidance:
    1️⃣ Confirm whether your organization or its vendors train #ArtificialIntelligence solutions on personal data.
    2️⃣ Validate that your organization's privacy notice discloses AI training practices.
    3️⃣ Make sure organizational individual rights processes are scoped for personal data used in AI training (a consent-gating sketch follows this post).
    4️⃣ Set assessment protocols where required to conduct and document data protection assessments that address the requirements under Oregon and other states' laws, and that are maintained in a format that can be provided to regulators.
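
One way to scope individual-rights processes to AI training (step 3) is to gate training data on affirmative, unwithdrawn consent. The sketch below assumes a simple record layout with illustrative field names, and flags withdrawals that have gone unactioned past the 15-day deadline noted in the guidance.

```python
from datetime import date, timedelta

def select_training_records(records: list[dict], today: date) -> list[dict]:
    usable, overdue = [], []
    for r in records:
        if not r.get("ai_training_consent"):
            continue  # no affirmative consent -> never used for training
        withdrawn = r.get("consent_withdrawn_on")
        if withdrawn is None:
            usable.append(r)
        elif today - withdrawn > timedelta(days=15):
            overdue.append(r["subject_id"])  # should already have been purged
    if overdue:
        print(f"WARNING: withdrawal older than 15 days not yet actioned for: {overdue}")
    return usable

records = [
    {"subject_id": "a1", "ai_training_consent": True, "consent_withdrawn_on": None},
    {"subject_id": "b2", "ai_training_consent": True, "consent_withdrawn_on": date(2025, 1, 2)},
    {"subject_id": "c3", "ai_training_consent": False, "consent_withdrawn_on": None},
]
print([r["subject_id"] for r in select_training_records(records, today=date(2025, 2, 1))])
# ['a1'], plus a warning that b2's withdrawal is overdue
```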

  • View profile for Andrew Clearwater

    Partner @ Dentons | Privacy, Cybersecurity, AI Governance

    5,252 followers

    #EDPB opinion on #AI models and the #GDPR (Opinion 28/2024)

    #Anonymity of AI Models
    The EDPB states that AI models trained on #personaldata cannot, in all cases, be considered anonymous. Anonymity factors:
    * The likelihood of direct extraction of personal data from the model
    * The likelihood of obtaining personal data from queries
    * All means reasonably likely to be used by the controller or others
    The key operational step here is to document your assessment of these factors and the approaches that were taken to limit the risks of personal data extraction (a minimal sketch of such a record follows this post).

    #LegitimateInterest as Legal Basis
    When assessing legitimate interest as a legal basis for AI model development and deployment, the focus remains on the existing three-step test. Further general considerations are outlined in the opinion, where the role of data subjects' reasonable expectations and mitigating measures to limit the impact of the processing are highlighted. A key operational step here is to review and possibly enhance the information provided to data subjects in the context of the processing.

    Consequences of Unlawful Processing
    The Opinion outlines the impact of unlawful processing during AI model development and shares three factors for assessing the impact:
    * Whether development and deployment are separate purposes
    * The controller's due diligence in assessing the model's lawfulness
    * The risks posed by the deployment-phase processing

    What are some of the areas of operational focus?
    * Enhanced documentation requirements for AI model development and deployment
    * Stringent legitimate interest assessments specific to AI contexts
    * Emphasis on transparency and managing data subjects' expectations
    * Thorough risk assessments, particularly for fundamental rights impacts
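
A minimal sketch of the documentation step suggested above: a structured record of the anonymity-factor assessment for a model release. The fields, rating scale, and example values are assumptions chosen for illustration; the opinion does not prescribe a format, only that the assessment and the mitigations be evidenced.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class AnonymityAssessment:
    model_name: str
    direct_extraction_risk: str          # e.g. "low" / "medium" / "high"
    query_disclosure_risk: str
    means_reasonably_likely: list[str]   # attack vectors considered
    mitigations: list[str] = field(default_factory=list)
    residual_risk_accepted_by: str = ""

assessment = AnonymityAssessment(
    model_name="support-assistant-v2",
    direct_extraction_risk="medium",
    query_disclosure_risk="low",
    means_reasonably_likely=["membership inference", "canary extraction", "prompt-based probing"],
    mitigations=["training-data deduplication", "output filtering", "rate limits on the API"],
    residual_risk_accepted_by="DPO, 2025-03-01",
)
print(json.dumps(asdict(assessment), indent=2))  # store alongside the model's release record
```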
