If Shadow AI was about the tools you don’t see, this is about the data exposure you think you understand, but probably don’t.
Artificial intelligence didn’t just walk into the enterprise; it plugged itself into email, chat, browsers, documents, calendars, and cloud storage in a matter of months. While leaders negotiate strategy and ROI, everyday prompts are quietly shuttling company data into shared, vendor‑controlled infrastructure that sits far outside your traditional perimeter. Behind every “quick question” to an AI tool is a much bigger question you need to be able to answer with confidence: Where did that data go, who can see it, how long does it live there, and what is it being used for?
The uncomfortable truth is that in many organizations, no one can answer that clearly. Not the C‑suite, not IT, not security. And that’s where the real risk begins.
Shadow AI, Meet Shadow Data
In April, we talked about Shadow AI, the unsanctioned tools your workforce is already using to get work done. Employees found value, moved fast, and left governance behind. This next phase is more subtle. Even when you give people “approved” tools, even when you sign enterprise agreements, there is still a second shadow growing alongside them: Shadow Data.
Shadow Data is everything that leaves your environment wrapped in a friendly, conversational interface. It’s the internal memo pasted into a prompt. The customer spreadsheet uploaded for “helpful analysis.” The proprietary roadmap summarized “for a slide deck.” The sensitive email thread someone wants rewritten “more professionally.” On the surface, it feels harmless and contained. In reality, each one of those actions can move regulated, confidential, or mission‑critical data into systems you don’t own and don’t fully control.
This isn’t just a problem with the free tools. It’s also baked into how the major platforms operate by design. To manage AI risk, you first have to understand what you’ve already agreed to.
What the Major AI Tools Really Do With Your Data
Most users never read the Terms of Service. Most leaders skim the executive summary and lean on “enterprise” branding. But buried in the privacy policies and data usage pages of your favorite AI tools are the real answers to what happens after you hit Enter.
On the consumer side of OpenAI’s ChatGPT, prompts and responses are stored on the provider’s infrastructure and, by default, used to improve their models unless you explicitly change your settings. Those chats can persist until you delete them, and even then, operational logs are typically retained for a period of time for safety and abuse monitoring. When you move to their API or enterprise offerings, the story changes: Training on your data is off by default, and retention is usually limited to a short window, with options for stricter “zero‑data‑retention” configurations on certain endpoints. The nuance matters. Disabling “use my content to improve models” is not the same thing as “no logs, no retention.”
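To make that nuance concrete, here is a minimal sketch of an API‑side call using the official OpenAI Python SDK. The model name and memo text are placeholders; the comments carry the point, which is that the prompt still leaves your environment regardless of how the training toggle is set.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Everything in `messages` below is transmitted to vendor-controlled infrastructure.
# API traffic is excluded from model training by default, but that is a policy
# commitment, not a technical barrier: the content can still sit in operational
# logs for the vendor's stated retention window unless a stricter arrangement
# (such as zero-data-retention on eligible endpoints) is in place.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": "Rewrite this internal memo more professionally: ...",
        }
    ],
)
print(response.choices[0].message.content)
```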
Microsoft’s Copilot is built to feel safer because it lives inside your Microsoft 365 tenant and respects existing permissions, sensitivity labels, and encryption. Your data stays within Microsoft’s cloud, governed by the same contractual protections that apply to Exchange, SharePoint, and OneDrive. The company is explicit that commercial Copilot with enterprise data protection does not use your prompts or responses to train the underlying large language models for general use. But that doesn’t mean nothing is stored. Copilot queries and responses can be captured in audit logs, eDiscovery, and compliance archives, effectively turning AI chats into another regulated data source you have to govern. If your retention policies are broad, your AI history may now be, too.
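If you want to treat that history as a data source you actually govern, a first pass can be as simple as counting Copilot interactions per user in an exported audit log. The sketch below is illustrative only: the JSON‑lines format, the field names, and the "CopilotInteraction" record type are assumptions to verify against your own export and Microsoft's current documentation.

```python
import json
from collections import Counter

# Assumed export format: one JSON object per line, with "RecordType" and "UserId"
# fields. Both the field names and the "CopilotInteraction" value are assumptions;
# verify them against your actual audit export before relying on the numbers.
COPILOT_RECORD_TYPES = {"CopilotInteraction"}

def copilot_activity_by_user(audit_export_path: str) -> Counter:
    """Tally exported audit records that look like Copilot interactions, per user."""
    per_user: Counter = Counter()
    with open(audit_export_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("RecordType") in COPILOT_RECORD_TYPES:
                per_user[record.get("UserId", "unknown")] += 1
    return per_user

for user, count in copilot_activity_by_user("audit_export.jsonl").most_common(10):
    print(f"{user:40} {count}")
```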
Google’s Gemini splits into two different worlds: consumer and Workspace. On the consumer side, prompts, uploads, and interactions can be used to improve Google’s AI models by default through what they call activity settings. Leave that setting on, and your chats can be retained until you delete them, with optional auto‑delete after a set number of months; turn it off, and data related to that activity is generally kept for a much shorter window to maintain service quality. There’s another catch: Content that’s selected for human review to evaluate or improve the system can be retained for years, even if you’ve deleted the original chat. In the Workspace version, Google commits to not using your Workspace data to train models for products outside your organization, and admins can heavily control what Gemini can see through DLP and privacy controls. But that only helps if you’ve actually configured those controls to match your risk.
Meta’s AI strategy is more aggressive by design. The company has been clear that it can use public posts, comments, and interactions across its social platforms, as well as chats with its AI, as training data under a “legitimate interests” basis. Employees who casually invoke that AI from personal accounts, dropping fragments of work into a social conversation, are effectively feeding corporate data into a model that is designed to learn from it. Even if content is later deleted, the policies allow derivative data and model training artifacts to remain. That’s a very different exposure profile than an enterprise‑scoped assistant running in your own tenant.
Anthropic’s Claude presents itself as privacy‑focused, and to a point, it is. By default, Claude keeps your conversations in your account but does not use them to train models unless you opt in. Deleted chats are removed from the interface but kept in backend logs for a limited time before full deletion, and content flagged for policy violations can be retained for much longer. More recently, Anthropic introduced an explicit opt‑in program where, if you agree, your de‑identified chat data can be retained and used for model training for several years. This is another example where a single toggle changes your long‑term data exposure from a short operational window to multi‑year training inclusion.
And then there are search‑centric tools like Perplexity, which blend AI chat with live web search. Queries and conversations are logged and stored on the vendor’s infrastructure to improve ranking, answer quality, and safety. You may “own” your prompts on paper, but the service retains broad rights to analyze and reuse them for service improvement. In practice, that means an employee pasting in a sensitive troubleshooting log, internal URL, or customer detail is feeding that data into a system designed to learn from searchable queries.
From the user’s perspective, none of this is obvious. They see a friendly textbox and an impressive answer. You need to see contracts, defaults, retention windows, and training policies.
Why “Enterprise” Doesn’t Mean “Isolated”
It’s tempting to assume that once you move to a paid, enterprise, or pro version of an AI platform, the risk problem is solved. The branding reinforces that assumption. The security slides look great. The privacy FAQ hits all the right notes.
Enterprise tiers do genuinely improve your position. They usually bring clearer data‑processing agreements, audit rights, documented retention policies, and stronger commitments around not using your data to train public models. They add admin controls for access, logging, and policy enforcement. They align better with your compliance needs.
What they do not do is magically move your data into a private, single‑tenant universe. Your content still flows through shared compute clusters, shared GPUs, shared orchestration layers, and shared logging systems, all controlled by the vendor. Isolation is logical, not physical, implemented through identity, encryption, virtualization, and strict internal controls. When those controls are designed and enforced correctly, that’s acceptable risk for most organizations. But it is still risk.
They also do not eliminate operational logging. Even with “no training on customer data” in the contract, vendors retain the right to store logs for abuse detection, debugging, fraud prevention, and security analytics. Those logs may include prompts, outputs, and metadata about who accessed what and when. If your risk analysis assumes “no data lives with the vendor,” it’s already out of date.
And finally, they do not automatically align with how your employees are actually using the tools. If some staff are on enterprise instances while others are still using free accounts from personal email addresses, you now have a split‑risk environment: one half governed, the other half wide open. Your policy may say “use our approved assistant,” but your behavior logs may show “every public model available on the internet.”
Recent AI Security Flaws: Where the Edges Fray
Most AI “incidents” in the headlines are not massive breaches. They are the quiet, unplanned ways complex systems behave in the real world.
Recently, researchers demonstrated that one major AI assistant could be manipulated into leaking information from a user’s calendar, effectively turning a conversational interface into a side‑channel into private scheduling data. No one broke into the calendar servers. The assistant simply had more access than the user realized, and under certain prompts, it revealed more than it should.
Shifts in retention and training policies at leading vendors are another kind of vulnerability, this time in expectations. Many organizations framed certain tools as the “safe” options based on earlier, tighter data‑handling postures. When those stances evolved to allow multi‑year training windows for users who opt in, the organizations that failed to revisit their assumptions quietly expanded their exposure without realizing it.
Social platforms turning public content and AI interactions into training fuel is a similar story on a different surface. Employees are already blending personal and professional lives online. Now, when they experiment with AI on those same channels, work‑adjacent data can end up inside systems that are explicitly designed to learn from it and propagate those learnings.
None of this means you should abandon AI. It means your threat model has to catch up to how AI is being built, deployed, and changed in production, often without fanfare.
A Question for the C‑Suite and IT
Here’s the question that should keep both executives and IT leaders up at night:
If you picked a random AI interaction from your organization (a prompt, an upload, an integration), could you say, with specificity, where that data went, how long it will live there, and what that provider is allowed to do with it?
Executives: When you approve AI initiatives, are you reviewing marketing claims, or are you reading the actual data usage and retention commitments that govern your legal exposure? Do you know which business units are quietly experimenting with tools that have no enterprise contract at all?
IT and Security: Can you map AI traffic back to data classification? Do you know which repositories are exposed to AI copilots, which cloud folders assistants can reach, and which SaaS systems your employees are connecting to external models through plugins and browser extensions? Do you know how many prompts containing customer data, financials, or IP have already left your environment?
If the honest answer is no, then your company is not actually in control of its AI strategy. The tools are in control. The vendors are in control. Your users are in control. You are reacting.
Recommendations: Start With the Data, Not the Model
So where do you go from here? The answer isn’t to ban AI or to blindly trust it. It’s to make data, not tools, the center of your AI risk strategy.
First, you need visibility. You cannot govern what you can’t see. Data Security Posture Management (DSPM) platforms are designed to discover, classify, and protect sensitive data across cloud and hybrid environments at enterprise scale. Tools like Cyera, BigID, and Sentra fall into this category, continuously inventorying where your data lives (databases, data lakes, SaaS applications, object stores) and using AI‑powered classification to understand what that data actually is: personal data, financial records, health information, intellectual property, and more. From there, they surface misconfigurations, over‑privileged access, and exposure paths that turn into real risk the moment you connect AI tools to those systems.
Cyera, for example, positions itself as an AI‑native DSPM platform and has been expanding through partnerships and integrations in cloud and data‑security ecosystems, but it should be evaluated alongside peers like BigID or Varonis as one option in a broader data‑centric strategy, not as a silver bullet. The more important decision is that you adopt some form of DSPM, not which logo ends up on the contract.
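While you evaluate, you can still get a rough first read on where AI traffic is going today. The sketch below tallies outbound requests to well‑known AI endpoints from a proxy‑log export; the column names and the domain list are assumptions you would adapt to your own proxy, DNS, or secure web gateway logs.

```python
import csv
from collections import Counter

# Assumed export format: a CSV with "host" and "user" columns, one row per
# outbound request. Adjust the field names and domains to match your own stack.
AI_DOMAINS = {
    "chat.openai.com", "chatgpt.com", "api.openai.com",
    "gemini.google.com", "copilot.microsoft.com",
    "claude.ai", "api.anthropic.com", "www.perplexity.ai",
}

def ai_traffic_summary(proxy_log_csv: str) -> Counter:
    """Count requests to well-known AI endpoints, grouped by (user, host)."""
    hits: Counter = Counter()
    with open(proxy_log_csv, newline="") as f:
        for row in csv.DictReader(f):
            host = (row.get("host") or "").lower()
            if any(host == d or host.endswith("." + d) for d in AI_DOMAINS):
                hits[(row.get("user", "unknown"), host)] += 1
    return hits

for (user, host), count in ai_traffic_summary("proxy_export.csv").most_common(20):
    print(f"{user:30} {host:30} {count}")
```

Even a crude tally like this tends to surprise people, and it gives you a baseline to measure policy against.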
Once you have visibility, you need control. This is where modern Data Loss Prevention (DLP) earns a second life in the AI era. Integrated into suites like Microsoft 365 and Google Workspace, DLP can inspect content as users type, paste, or share, and enforce policy in real time. That means you can block or warn on attempts to paste sensitive data into AI prompts in the browser. You can prevent generated content containing confidential details from being emailed externally. You can tie enforcement to classification labels so that anything tagged “Highly Confidential” never flows into third‑party AI tools in the first place.
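The core check is conceptually simple, even if production DLP engines are far more sophisticated. As a rough illustration, here is the kind of inspection gate that could sit between a user’s text and an outbound AI prompt; the patterns and label names are placeholders, not an enterprise classifier.

```python
import re

# Illustrative patterns only; a real DLP engine relies on vendor-maintained
# classifiers and your own sensitivity labels, not a handful of regexes.
PATTERNS = {
    "US SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Payment card (rough)": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "Sensitivity label": re.compile(r"highly confidential", re.IGNORECASE),
}

def inspect_prompt(text: str) -> list[str]:
    """Return the names of sensitive-data patterns found in an outbound prompt."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(text)]

prompt = "Rewrite this more professionally: SSN 123-45-6789, marked Highly Confidential."
hits = inspect_prompt(prompt)
if hits:
    print("Blocked: prompt matches " + ", ".join(hits))
else:
    print("Prompt allowed")
```

In practice, you would let your suite’s DLP policies and sensitivity labels do this work, so enforcement follows the data rather than individual scripts.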
With DSPM and DLP working together, your AI story shifts. Instead of relying on users to “be careful” with prompts, you can build an environment where the system understands what’s sensitive, knows where it lives, and actively prevents it from leaving in unsafe ways.
From there, the remaining work is governance and communication. You need clear, realistic guidelines for AI use that match how employees actually work, not how you wish they worked. You need sanctioned AI tools that are good enough that people want to use them, with enterprise contracts that align with your risk tolerance. You need cross‑functional oversight from IT, Security, Legal, HR, and business leaders to track vendor changes, update policies, and continually reassess where your data flows.
Shadow AI was the wake‑up call that your people were already using AI. The next phase is accepting that your data is already part of that story—and deciding whether you’re comfortable not knowing where it goes.
The future of work will be AI‑driven. The question is whether your data strategy is ready for it.
If you asked your organization today, “Do we truly know where our company data goes when we use AI, and how it’s being used once it leaves our walls?”, what answer would you get?



