- 26 Mar
- Eric Hochberger
What is Robots.txt and How it Can Hurt Your Earnings
Robots.txt, the file at the heart of the Robots Exclusion Standard (also called the Robots Exclusion Protocol), is a text file websites use to tell web crawlers which pages of their sites can be crawled.
Google is actually reading your site via a computer program, or a crawler, which follows each link on a page to figure out which page(s) to crawl next. As it continues to crawl, you can almost visualize how it becomes a web of links. Spiders crawl the web. You get it.
But how is Robots.txt important?
You may have links to pages such as your login page, admin page, or even more private information on your site without realizing it – data Google is crawling and displaying to searchers.
Robots.txt is a boring old plain text file that the crawler downloads before making its way through the rest of your site. After you add the file, Robots.txt basically provides instructions to visiting robots, establishing what can and cannot be crawled.
For example, let’s say you put up a Robots.txt file with the following information:
User-agent: *
Disallow: /wp-admin/
Assuming the crawler downloads your robots.txt and respects its contents, all robots (that’s what the * wildcard is for) won’t crawl anything inside your WordPress admin.
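To see how a rule-following crawler reads that file, here is a minimal sketch using Python's standard-library urllib.robotparser. The example.com URLs and the "ExampleBot" name are placeholders we made up for illustration, and real crawlers may interpret edge cases a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs: one inside the WordPress admin, one regular post.
    for url in (
        "https://example.com/wp-admin/options.php",
        "https://example.com/chocolate-cake-recipe/",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The admin URL comes back as not allowed and the regular post as allowed, which is all the file asks of a crawler that chooses to listen.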
What Robots.txt is NOT designed for
Blocking malicious robots. Sadly, it won’t do it.
Robots.txt is designed to tell positive players such as Google what to crawl. Malicious bots and hackers do NOT have to follow robots.txt. It’s an optional set of instructions for good guys.
However, we’ve seen misinformed tech people try to use robots.txt as a security measure to stop bots. We recently ran across a site where the robots.txt file allowed Googlebot and a few other well-known bots, and blocked the rest.
That robots.txt looked like the following:
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
We’re guessing their theory was that this would allow Googlebot (the name of Google Search’s spider) and block all bad bot traffic. Wishful thinking.
Again, putting this file up won’t stop bad guys. Yes, it will allow Googlebot and block everyone else … who respects the rules of this file.
The bad guys (bots, scrapers, and other malicious entities) will simply ignore this file. In essence, all you’ve done is block the good guys who would’ve read it.
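The difference between the two kinds of bots can be sketched in a few lines of Python. This is only an illustration: "FriendlyBot" and "ScraperBot" are hypothetical names, the example.com URL is a placeholder, and the "rude" crawler is modeled simply as code that never consults the file.

```python
from urllib.robotparser import RobotFileParser

# The "allow Googlebot, block the rest" file from above.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def polite_crawler_fetches(user_agent: str, url: str) -> bool:
    """A rule-following crawler checks robots.txt before every request."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def rude_crawler_fetches(user_agent: str, url: str) -> bool:
    """A malicious crawler never reads robots.txt, so the file cannot stop it."""
    return True


if __name__ == "__main__":
    page = "https://example.com/some-post/"  # placeholder URL
    print("Googlebot:", polite_crawler_fetches("Googlebot", page))
    print("FriendlyBot:", polite_crawler_fetches("FriendlyBot", page))
    print("ScraperBot:", rude_crawler_fetches("ScraperBot", page))
```

Googlebot gets through, the well-behaved FriendlyBot turns itself away, and the scraper carries on regardless, because nothing in robots.txt is enforced on the server side.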
Can Robots.txt cause more harm than good?
Yes! Let’s go back to that previous example. You may be asking what harm could result from this if Google is still given access. After all, even if most bad actors ignore the instructions, maybe some will follow them.
Well, first of all, you’ll be missing out on the smaller search engines besides Google, as well as social media platforms. And, beyond search engines, what we call contextual advertisers often use crawlers to scan websites, parsing them for keywords to target or avoid.
For example, if you post a recipe full of sugar, a diabetes medication brand probably won’t want to advertise on it. If you write about flour, maybe King Arthur Flour will want to advertise on those pages as a result.
If you accidentally block all robots, you’ll be shutting out good guys such as Google AdSense, a major inventory buyer at Mediavine as part of Google AdExchange.
It’s important to note that AdSense uses a different crawler than Google Search. Even if you whitelist Googlebot and blacklist everyone else, you’re likely blocking Google! There are many more contextual advertisers (Media.net, Grapeshot, Peer39, etc.) that you may also be unknowingly blocking.
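A quick way to check for this is to test a robots.txt against a list of crawler names. The sketch below assumes Python's urllib.robotparser as a stand-in for how compliant crawlers behave. "Mediapartners-Google" is the user-agent token Google documents for its AdSense crawler; "ExampleContextualBot" is a made-up name standing in for other contextual advertisers, whose real tokens you would need to look up in their own documentation.

```python
from urllib.robotparser import RobotFileParser

# The "Googlebot only" file discussed above.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# Crawler names to audit. The last one is hypothetical.
AGENTS = ["Googlebot", "Mediapartners-Google", "ExampleContextualBot"]


def blocked_agents(robots_txt: str, agents: list[str], url: str = "/") -> list[str]:
    """Return the agents that a compliant parser would keep away from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(GOOGLEBOT_ONLY, AGENTS))
```

With the file above, the AdSense crawler and the hypothetical contextual bot both land in the blocked list, even though Googlebot itself is allowed.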
The potential impact of Robots.txt on your site
It’s important to remember that this is optional. You can serve a completely blank robots.txt file, or no robots.txt file at all, and be okay.
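If you want to convince yourself that a blank file blocks nothing, a small check along these lines works. It is a sketch using Python's urllib.robotparser, with crawler names and paths chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser


def allows_everything(robots_txt: str, agents: list[str], paths: list[str]) -> bool:
    """Return True if every listed agent may fetch every listed path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(agent, path) for agent in agents for path in paths)


if __name__ == "__main__":
    # An empty string stands in for a completely blank robots.txt.
    print(allows_everything("", ["Googlebot", "Mediapartners-Google"], ["/", "/wp-admin/"]))
```

With no rules in the file, the parser has nothing to apply, so every crawler is allowed on every path.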
For now, if you’re unsure whether you’re having a problem with this, ask your technical support professional to clear out your robots.txt.
It’s always better to have an empty robots.txt file, or one like the sample earlier in this post that blocks only your admin area. Let search engines figure out what they can or can’t crawl if you’re not sure.
Getting this right pays off: we’ve seen CPMs jump by 25-50 percent overnight after fixing this issue.
This little thing can make a big difference, so make sure your site is secure – and you’re not leaving money on the table.
Mediavine will be creating its own meta crawler to detect publishers’ Robots.txt files and will be adding it to our standard health checks over the coming weeks.
For right now, we wanted to make sure our publishers knew about this as quickly as possible. For further reading from the pros, see Google Developers’ take on Robots.txt.