Protecting Your Content: How to Opt Out of GPTBot Web Crawling

The Mediavine Team

August 9, 2023

Publishers spend time creating great content that keeps readers coming back, and we know that safeguarding that content and maintaining control over how it’s used are critical concerns in today’s digital landscape.

We also know that the advancement of generative AI technologies, like ChatGPT, brings up more questions than the can’t-recommend-anything-past-2021 robot can answer.

While we believe there is value in using generative AI and are sharing ways publishers can use ChatGPT to speed up their workflows, understanding how to protect your content from unwanted scraping and data collection is huge.

OpenAI has announced GPTBot, its web crawler, along with a way to block it so your site isn’t used to train OpenAI’s models, and we’re breaking down how you can do this.

Understanding GPTBot: Enhancing AI Models and Web Crawling

GPTBot is OpenAI’s web crawler, and it’s designed to gather training data for their AI models, which may include future models such as GPT-5. GPTBot crawls publicly accessible web pages for data that can improve the accuracy, capabilities and safety of AI technology.

In non-tech talk, this bot spends its time doom scrolling.

And while Skynet may not be fully self-aware just yet, this doom scrolling has understandably led to feelings of, well, doom for creators.

It seems ChatGPT’s robots — or at least the program’s powers that be — are listening.

You can now opt out of sharing your site’s data with OpenAI, a win — however small — for every publisher concerned about data privacy and transparency.

Your Control: Opting Out of GPTBot Web Crawling

NOTE: You may need to contact your site hosting provider or administrator or web developer for assistance with the following steps. Mediavine Support, as much as they would love to help you crush the robot overlords before they take over the world, aren’t equipped to modify your robots.txt files.

As discussions about copyright and fair and acceptable use take place around the globe, OpenAI’s newest iteration moves toward a model that asks for permission rather than just assuming you’re okay with sharing Grandma Betty’s Pecan Pie recipe with zero attribution or backlinks.

Granted, it’s permission by assumption: GPTBot will crawl your site unless you opt out. But there’s a solution, and that’s the major point here.

Something to consider before you proceed is that opting out of GPTBot web crawling could impact your traffic, particularly from Bing Chat.

To be clear, we don’t know for sure. But we know the impact is possible and, as always, we’ll keep you posted.

That said, we’re still glad publishers have this option.

By following these steps, you can prevent GPTBot from accessing and using your website’s content to train its models:

Modify Your Robots.txt File

The first step in opting out of GPTBot web crawling is to modify your website’s robots.txt file.

The robots.txt file is a set of instructions for web crawlers, including GPTBot, letting them know which parts of your site they are allowed to access and which parts are no-crawl zones.

To fully restrict GPTBot from accessing your site:

User-agent: GPTBot
Disallow: /

To grant partial access to specific directories while restricting others:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

Your directory names, or the areas that are off-limits to GPTBot, may be different from the example above, and you can customize this to your specific needs. For instance, you may want to allow GPTBot to crawl your articles but disallow it from accessing your shop.
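If you’d like to sanity-check rules like these before touching your live site, here is a minimal sketch using Python’s built-in urllib.robotparser module. It parses the partial-access example above and asks whether GPTBot may fetch a page in each directory. The page paths and the "SomeOtherBot" name are made up for illustration, and Python’s parser is only one interpretation of the rules, so treat the result as a rough check rather than a guarantee of how GPTBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The partial-access rules from the example above, one directive per line.
# The directory names are placeholders, just as they are in the article.
RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical page paths, used only to exercise the rules.
articles_ok = parser.can_fetch("GPTBot", "/directory-1/some-post/")
shop_ok = parser.can_fetch("GPTBot", "/directory-2/some-product/")

# A crawler with a different name is not covered by the GPTBot group.
other_bot_ok = parser.can_fetch("SomeOtherBot", "/directory-2/some-product/")

print("GPTBot may crawl /directory-1/:", articles_ok)
print("GPTBot may crawl /directory-2/:", shop_ok)
print("SomeOtherBot may crawl /directory-2/:", other_bot_ok)
```

With these rules, the first and third checks come back True and the second comes back False: GPTBot is kept out of /directory-2/ only, and other crawlers are unaffected because the rules name GPTBot specifically.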

Remember: Your web developer is your friend.

Save and Upload

Once you’ve made the necessary changes to your robots.txt file, save the file and upload it to your website’s root directory, so that it is served at /robots.txt on your domain.

This ensures that GPTBot recognizes your access preferences and adjusts its crawling behavior accordingly.

Verify Your Changes

To confirm that your rules say what you intend, you can use online tools and services that analyze your website’s robots.txt file and report which parts of your site a crawler like GPTBot is allowed to access.

This verification step helps ensure that your content remains off-limits to GPTBot.
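If you or your developer would rather check from the command line, the sketch below does something similar with Python’s built-in urllib.robotparser module: it downloads a site’s published robots.txt and reports whether GPTBot is allowed to fetch a few paths. The function names, the example.com address and the sample paths are all placeholders we made up for illustration. Note that this only confirms what your robots.txt says; it can’t prove how any crawler actually behaves.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def load_live_rules(site_url):
    """Download and parse robots.txt from the root of site_url."""
    parser = RobotFileParser()
    parser.set_url(urljoin(site_url, "/robots.txt"))
    parser.read()  # makes an HTTP request to the site
    return parser


def gptbot_access(parser, site_url, paths, agent="GPTBot"):
    """Return {path: True/False} for whether `agent` may fetch each path."""
    return {
        path: parser.can_fetch(agent, urljoin(site_url, path))
        for path in paths
    }


# Example usage (the site and paths are hypothetical; swap in your own):
#
#   site = "https://www.example.com"
#   rules = load_live_rules(site)
#   for path, allowed in gptbot_access(rules, site, ["/", "/shop/"]).items():
#       print(path, "allowed" if allowed else "blocked")
```

If you opted out completely with Disallow: /, every path you check should come back as blocked for GPTBot.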

Empowering Publishers: Taking Control of Your Content

Will you need technical assistance from your site host or administrator to make these changes? Maybe. Maybe you’re savvy enough to do this yourself!

Again, this is not something our Support team is prepared to assist with.

But this is a proactive measure you can take to protect your content and maintain control over how it’s accessed and used.

AI has been here for a while now. It’s not going away any time soon. As the debate around AI, copyright and data usage unfolds, rest assured that we’re having conversations and joining other industry leaders in support of you, our publishers.

We’re watching for ways to help protect your content, including participating in discussions with Google about the future of responsible AI as part of our Premier GCPP status.

We’re committed to ensuring that your content remains yours, now and into the future.
