Oguine & Badillo-Urquiola on Inference Is Not Consent: Privacy Risks from Training LLMs on Social Media and Web-Scraped Data

Ozioma C. Oguine (U Notre Dame) and Karla Badillo-Urquiola (U Central Florida) have posted “Inference Is Not Consent: Privacy Risks from Training LLMs on Social Media and Web-Scraped Data” on SSRN. Here is the abstract:

Large Language Models (LLMs) are increasingly embedded in tools used for education, creativity, productivity, and personal inquiry. Trained on vast web-scraped datasets, these models do not merely reproduce public content; they infer connections, identities, and attributes that individuals may never have disclosed. This paper introduces and centers the concept of inferential privacy in LLMs, arguing that privacy harms in the LLM era stem not just from memorization or data leakage, but from the automated synthesis of plausible, sensitive, or stigmatizing information. Drawing on research in data protection, HCI, AI ethics, and law, we examine how these harms disproportionately affect marginalized communities, including youth, activists, LGBTQ+ individuals, and people with disabilities. We critique the inadequacy of current regulatory frameworks such as GDPR and CCPA, which assume static data and explicit collection, and propose an expanded approach that treats inference as a distinct site of harm. We conclude with a roadmap for action, including inferential privacy audits, participatory red-teaming, context-aware model design, and regulatory innovation. This paper advocates for a shift in how we conceptualize privacy, away from control over data points and toward protection against algorithmic misrepresentation.