Insight
Don’t Train on My Data (Please!)
January 13, 2025
Steve Smith
When ChatGPT was first announced to the public, there was a lot of fear that OpenAI would train on anything you put into the model. Many companies forbade their employees from using it in any way, fearing that their corporate secrets would be made available to the world.
I’ve had the opportunity to discuss AI in front of hundreds of organizations over the past year, and the topic of privacy comes up a lot. I want to share how you can use these amazing tools with confidence, knowing that your data is not being trained on, so that you can get the full value out of what they can do.
The most important thing to take away at the start – if you are using free versions of these tools, you should not expect any confidentiality. That’s not to say that any of these companies will train on your data, but they have the option to do so. In each of the following sections I’m going to focus on the paid versions of these tools. I also want to call out that all of these privacy statements are based on US privacy policies.
ChatGPT (OpenAI)
Let’s start with OpenAI and ChatGPT. When you first subscribe to ChatGPT, the default setting is that they can train on your data. You need to select the option to opt out of this. You can find it by doing the following:
Click on your profile icon in the upper right-hand corner of the screen
Select ‘Settings’ from the menu options
Select ‘Data controls’ on the left-hand side of the dialog box
The top option says ‘Improve the model for everyone’ – toggle it so it says ‘Off’
OpenAI has a second option to ensure that your data is not being trained on, and I recommend you do this as well.
Go to this page: https://privacy.openai.com/policies and select the option in the upper right-hand corner to ‘make a privacy request’. Select the next option that says ‘I have a ChatGPT account’ and then in the next window select ‘Do not train on my content’. It will then ask you for the email address associated with your account. Enter your email address, and shortly after that you will get a confirmation from their privacy team that they will not be training on your data.
The first option alone is probably sufficient (and you should definitely do that) – but I’ve been encouraging everyone to also submit the privacy request just to be safe.
Once you’ve done that, any new chats you start will not be trained on by OpenAI.
Let’s talk about OpenAI’s data access and retention policies.
Access Restrictions: OpenAI employs strict access controls to limit who can access user data. However, some authorized personnel may access data for purposes such as system maintenance, security, and compliance.
Data Retention Period: OpenAI retains personal data only as long as necessary to provide services or for other legitimate business purposes, such as resolving disputes or complying with legal obligations. The specific retention period depends on factors like the nature of the data and the purposes for which it was collected.
By default, OpenAI does not use content submitted by customers to its business offerings, such as the API, ChatGPT Team, and ChatGPT Enterprise, to improve model performance, unless you have explicitly opted in to share your data for this purpose.
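For illustration, here is roughly what going through the API route looks like – a minimal sketch, assuming the official openai Python SDK (v1+) is installed and an OPENAI_API_KEY environment variable is set; the model name and prompt are illustrative placeholders, not a recommendation.

# Minimal sketch: sending a prompt through the OpenAI API (a business offering
# that is not used for training by default) instead of the ChatGPT app.
# Assumes the official `openai` Python SDK and an OPENAI_API_KEY env variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "Summarize the key risks in this contract draft."}
    ],
)

print(response.choices[0].message.content)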
I personally am confident enough in these controls that I have no issue putting in selected financial data, medical records, and more (I would not put in things that include my SSN, but pretty much everything else I’m personally comfortable with).
Claude (Anthropic)
Claude (by Anthropic) does not use your inputs (prompts) and outputs (responses) from the free Claude.ai or Claude Pro to train their models by default. If you would like your data to be trained on, that is something you actually have to opt in to. By default, Claude is the most privacy-focused of the large language models that most people will use. You don’t have to do anything to prevent Claude from training on your data.
Exceptions: There are specific scenarios where your data may be utilized for model training:
Explicit Permission: If you provide explicit consent, such as submitting feedback through features like thumbs up/down or directly reaching out with a request, your data may be used.
Trust and Safety Reviews: If your prompts or conversations are flagged for trust and safety concerns, Anthropic may analyze and use this data to enhance their ability to detect and enforce Usage Policy violations, including training trust and safety classifiers to improve service safety.
Data Access and Retention
Access Restrictions: Only a limited number of authorized staff members have access to conversation data, and they access it solely for explicit business purposes.
Data Retention Period: Anthropic retains inputs and outputs on the backend for up to 30 days to provide a consistent product experience. If you delete a conversation, it is immediately removed from your conversation history and automatically deleted from their backend within 30 days.
As with ChatGPT, I personally am confident enough in these controls that I have no issue putting selected financial data, medical records, and more into Claude (I would not put in things that include my SSN, but pretty much everything else I’m personally comfortable with). Anthropic focused on trust, safety, and privacy from the beginning – it’s been a founding principle of their overall business.
Gemini (Google)
Google has a very mixed approach to privacy compared to OpenAI and Anthropic, and you have fewer options to actually opt out of having Google train on your data. There are big differences between their consumer and business versions, and I will try to give you a high-level overview here (and remember – all of this relates to the paid versions of Gemini).
Consumer Gemini Advanced (paid): Google’s privacy policies are at best unclear (to me at least) on whether or not they can and will train on your data if you are using the paid consumer version of Gemini Advanced. The available information does not specify whether individual user data from paid consumer subscribers is directly used in model training. However, it's common practice for AI service providers to use aggregated and anonymized data to improve model performance. The service does not explicitly state that it will not train on your data when you are using the tool. I would personally not put confidential data in here, as the policies are unclear at best.
Google Workspace Gemini Advanced (business paid account): I have a regular Google Workspace account and subscribe to Gemini Advanced as well, and at the bottom of the screen when you are using the tool it explicitly states that ‘chats aren’t used to improve our models’. This message does not show up on the consumer paid accounts, which leads me to believe that they have the option to train on those chats if they would like. I would have no issues putting confidential data in here, as Google explicitly states that they don’t train on your data.
Google AI Studio: there is no paid version of Google AI Studio – it’s a free tool, and it is fantastic. It gives you access to all of their latest models, including the ability to stream real-time with Google’s multimodal models. That being said, as a free tool, Google has the ability to train on the data you put in here. I’ve heard from various sources connected to Google’s legal department that they are not currently training on the data (at least as of about two months ago) – but because this is a free service, that could change at any time. I would operate under the assumption that they *could* train on anything that you put in here.
I will say that I am a huge user of Google AI Studio – I use it extensively and it is a truly fantastic service – I’m just very careful to make sure that I don’t put confidential data in there.
Google Vertex AI Studio: this is part of Google’s enterprise suite of tools, and they explicitly state that they do not train on your data. Google has strong customer data protection in Vertex, as it’s part of the broader Google Cloud tools, and Google Cloud maintains a strong commitment to data privacy and security. Your data, including inputs, outputs, and any custom models developed within Vertex AI Studio, remains under your control. Google does not use your data to train or improve its foundational models without your explicit consent. Whenever I am dealing with confidential data, I always access the Google models through Vertex or the API associated with my paid services.
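For illustration, here is roughly what reaching Gemini through Vertex looks like – a minimal sketch, assuming the google-cloud-aiplatform package is installed, you are authenticated to a Google Cloud project with Vertex AI enabled, and the project ID, region, and model name below are placeholders.

# Minimal sketch: calling Gemini through Vertex AI rather than the consumer app.
# Assumes the `google-cloud-aiplatform` package and Google Cloud authentication
# (e.g. via `gcloud auth application-default login`); project, region, and model
# values are placeholders.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-1.5-pro")  # placeholder model name
response = model.generate_content(
    "Review this internal financial summary and list anything that looks inconsistent."
)

print(response.text)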
Google NotebookLM: this is one of the best tools that Google launched this year, giving you the ability to query massive amounts of data and generate podcasts from your uploads. Here is the current privacy policy on NotebookLM:
We (Google) value your privacy and never use your personal data to train NotebookLM.
If you are logging in with your personal Google account and choose to provide feedback, human reviewers may review your queries, uploads, and the model's responses to troubleshoot, address abuse or make improvements. Keep in mind that it's best to avoid submitting any information you wouldn't feel comfortable sharing.
As a Google Workspace or Google Workspace for Education user, your uploads, queries and the model's responses in NotebookLM will not be reviewed by human reviewers, and will not be used to train AI models.
Wrapping Up
Each of these companies has made it possible for you to securely use their models with confidential data. I review the privacy policies of each of these companies around once a quarter to look for major changes, and this is my best understanding of how those policies are structured today. Because these can change at any time, I would strongly urge you to review each company’s privacy policy, only use paid services, and select any opt-out options that you can to ensure that they don’t train on your data. This is my best understanding of the privacy policies as of December 2024.
I personally have reached a level of comfort with each of these tools that I have no issue putting in financial or medical data and asking endless questions. I hope that this guide has given you the comfort and insight to securely use these tools in a way that benefits you as well.