According to Exploding Topics, ChatGPT acquired a million users within five days of launching in November 2022. To put this into context, Instagram took over two months, Facebook took ten months, and Netflix took over three years to reach the same milestone. What’s less often discussed is the impact this technology is having on the web itself.
In 2024, Imperva reported that almost 50% of web traffic comes from non-human sources. While not all of that traffic is identifiable as AI bots, it’s evident that they account for a growing share of visits.
With ChatGPT making “AI” mainstream and other competitors, such as Microsoft’s Copilot and Google’s Gemini, also growing in popularity, the number of bots crawling the web is only going to increase. With this in mind, this article focuses on a very particular question: should we block AI robots from crawling (client) sites?

What’s the big deal?
The big deal is that Generative AI tools are trained on vast amounts of data, and a lot of this data is crawled from the web.
Cloudflare’s research shows that AI crawlers have accessed up to 40% of the websites protected by its content delivery network (CDN). If anything, that figure feels surprisingly low given how popular these tools are becoming.

The early pioneers of the internet foresaw this sort of scenario, which is why they created robots.txt, or the “Robots Exclusion Protocol”, back in 1994, three decades ago. The idea was that webmasters would place this file at the root of their domain, and search engines and other robots would read its rules before they started crawling.
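To illustrate, here is a minimal robots.txt sketch. The user agents named are real but chosen purely as examples; assume you want to opt out of OpenAI’s GPTBot and Common Crawl’s CCBot while leaving everything else untouched:

    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: *
    Allow: /

Each record simply pairs a user agent with the paths it may or may not crawl; nothing more.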
Created in a kinder and gentler era of the internet, robots.txt has one fundamental limitation: it has always been essentially a gentleman’s agreement. The standard relies exclusively on voluntary compliance, and the internet’s early adopters were likely to buy into that delicate ecosystem.
However, in this new era, large language models (LLMs) have an insatiable desire to crawl and extract as much publicly accessible information as possible to help train the AI or, in common parlance, feed the machine.
Microsoft’s AI CEO made controversial headlines in 2024 when he claimed, in essence, that “all online content should be available to train AI and LLMs”, and his cavalier attitude to copyright protection pretty much sums up the general disregard AI companies have shown for content on the web.
There is a longstanding ideal of a free and open web. However, there was never any agreement that content could be scraped at scale, opening the floodgates to aggregated content on steroids that competes directly with your own website.
In Machiavellian (or machAIvellian) terms, AI companies essentially believe that “it’s easier to ask for forgiveness than permission”. Thus, they will opt to crawl first and ask questions later – questions such as whether or not they should.
Arguments for blocking
Thankfully, in the era of CDNs and web application firewalls (WAFs), denying robots access to a site is easier than ever, and that technology can step in to block rogue bots when robots.txt is flagrantly ignored.
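As a rough sketch of what server-level enforcement can look like, the snippet below returns a 403 to selected crawlers whether or not they honour robots.txt. It assumes nginx, and the user agents listed are illustrative rather than a recommendation:

    # Inside a server {} block; illustrative only
    if ($http_user_agent ~* "(GPTBot|CCBot|Bytespider)") {
        return 403;
    }

Most CDN and WAF providers also offer managed bot rules, which are typically easier to maintain than hand-rolled user-agent matching.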
Before you advise your clients to drop the ban hammer, you need to understand the potential consequences, and right now they’re not fully clear, which means banning “bad” AI bots involves a leap of faith that there will be no negative impact.
We wouldn’t be doing our jobs if we made a recommendation which could negatively impact revenue in the short and long term, so let’s look at some scenarios where it could make sense:

Arguments for allowing
We’ve looked at some potential reasons for blocking. Now, what about allowing access or, in some cases, simply doing nothing? Most of these reasons rest on the optimistic view that being included in AI answers and searches is a positive trend, and that this outweighs any philosophical concerns.

Big Businesses and the LLMs you probably want to allow (non-exhaustive list)

TL;DR – You probably don’t want to block it, but you should be monitoring it
Without giving a cop-out answer, I would say that the immediate action is clearly to begin tracking bot traffic in general. The boring answer probably lies somewhere between allowing the most common bots and blocking unknown bots or LLMs of questionable value.
Whether that’s referral traffic in analytics platforms or crawler hits in server logs, we collectively need to get better at understanding the volume of visits and their current impact on KPIs. Only once we have access to this data can we decide whether or not to block AI user agents.
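As a starting point, and purely as a sketch (the log path and the list of user agents are assumptions to adapt to your own stack), a few lines of Python can tally how often known AI crawlers show up in an access log:

    from collections import Counter

    # Assumed location of a combined-format access log; adjust for your server
    LOG_PATH = "/var/log/nginx/access.log"

    # Illustrative list of AI-related user agents to watch for
    AI_AGENTS = ["GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider"]

    counts = Counter()
    with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
        for line in log:
            for agent in AI_AGENTS:
                if agent.lower() in line.lower():
                    counts[agent] += 1
                    break

    for agent, hits in counts.most_common():
        print(f"{agent}: {hits} requests")

Even a crude count like this tells you whether AI crawlers are a rounding error or a meaningful share of your traffic before you commit to a policy.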
As usual, and it’s very much a baby-and-the-bathwater scenario, there may be some rogue bots, but in reality most are benign and, if anything, lead to more inclusion in the era of AI-powered search.