Recently we did a web review for a company and noticed something strange about their site. Their old web pages could be found in Google, but none of their most recent pages were visible. The company in question holds regular public events and their website is one of their main channels for promoting them. Google (and other search engines) simply were not seeing any of their new events information and so couldn’t include it in their search engine results pages.
It’s easy to check what content from your site is in the search engine database or cache. Just type in the command cache:TheWebAddressYouWantToCheck
Google will then display the date that the spider last crawled that page, in our example 4 March 2010 at 18:17.
What our client was seeing instead was an error message saying the page was not in the Google index.
Clearly there is a problem, and so the next place to check is Google’s Webmaster Tools. This will show you when the Googlebot is visiting your site – and how many pages it indexes each time.
Log in to your account and you’ll be able to see the spider indexing your site. It is typical to have a “deep crawl” periodically, with regular “small crawls” on a daily basis.
Our client didn’t have Webmaster Tools, or they probably would have seen a flat line with no indexing taking place.
A little investigation soon revealed why the Googlebot wasn’t crawling our client’s site – an incorrectly configured robots.txt file.
Robots.txt is a file on your server designed for guiding the search engine spiders as they crawl your site. But get it wrong and you could be doing your rankings some serious damage. A basic example looks like this:
User-agent: *
Disallow:
In the first line, User-agent refers to the web robot you are addressing your instructions to, and the * means all robots.
In the second line, Disallow tells the robots which areas of your website they can’t visit. In our example the value is left empty, so no areas of the website are off-limits.
But our client’s robots.txt file looked like this:
User-agent: *
Disallow: /
That second line is the source of all of their problems – their robots.txt file is telling all web robots not to crawl their website. They are disallowing access to everything on their site, effectively putting up a big sign saying “no robots allowed”.
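If you would rather test a robots.txt file than read it by eye, the difference between the two files can be sketched in a few lines of Python using the standard library's urllib.robotparser module. This is only an illustration: the is_crawlable helper, the OPEN and BLOCKED names and the example.com address are our own inventions, and real search engine crawlers may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# An empty Disallow value: nothing is off-limits.
OPEN = "User-agent: *\nDisallow:\n"

# A single slash: the whole site is off-limits to every robot.
BLOCKED = "User-agent: *\nDisallow: /\n"

# Hypothetical page address, used only for illustration.
page = "https://example.com/events/"

print("Open file allows crawl:   ", is_crawlable(OPEN, page))
print("Blocked file allows crawl:", is_crawlable(BLOCKED, page))
```

Running a check like this against your own robots.txt and a handful of your most important pages after any server move is one way to catch a stray slash before the search engines do.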
Their old content had previously been crawled and hence was in the Google index.
But none of their new content was being discovered.
Why had the web designer done this? Because they had moved their website to a new server, and somehow this error crept in.
Of course this also had a knock-on effect on the ranking of their website. Within weeks of correcting their robots.txt file the site shot up to the first page of the rankings for many of their main keyword searches. That’s without any extra Search Engine Optimisation work being done.
This company was still getting traffic from Google, to pages that were indexed before the robots.txt was accidentally changed, and so it took a web review for them to notice the mistake. But if they had used Webmaster Tools they would have spotted the problem immediately. Webmaster Tools is free and available to all site owners, so if you are serious about SEO it’s worth getting an account.
Robots.txt can be really useful, but an incorrect file could have a major impact on the success of your site. Do yourself a favour and use the free tools out there to ensure that your website is in peak condition.