You might be surprised to hear that one small text file, known as robots.txt, could be the downfall of your website. If you get the file wrong you could end up telling search engine robots not to crawl your site, meaning your web pages won’t appear in the search results. Therefore, it’s important that you understand the purpose of a robots.txt file and learn how to check you’re using it correctly.
A robots.txt file gives instructions to web robots about the pages the website owner doesn’t wish to be ‘crawled’. For instance, if you didn’t want your images to be listed by Google and other search engines, you’d block them using your robots.txt file.
You can go to your website and check if you have a robots.txt file by adding /robots.txt immediately after your domain name in the address bar at the top:
The URL you enter should look like this: http://www.examplewebsite.com/robots.txt
(obviously with your domain name instead!)
How does it work?
Before a search engine crawls your site, it looks at your robots.txt file for instructions on which pages it is allowed to crawl (visit) and index (save) in the search engine results.
Robots.txt files are useful:
- If you want search engines to ignore any duplicate pages on your website
- If you don’t want search engines to index your internal search results pages
- If you don’t want search engines to index certain areas of your website or a whole website
- If you don’t want search engines to index certain files on your website (images, PDFs, etc.)
- If you want to tell search engines where your sitemap is located
How to create a robots.txt file
If you’ve found that you don’t currently have a robots.txt file, I’d advise you to create one as soon as possible. You will need to:
- Create a new text file and save it with the name “robots.txt” – you can use the Notepad program on Windows PCs or TextEdit for Macs and then “Save As” a plain text file.
- Upload it to the root directory of your website – this is usually a root level folder called “htdocs” or “www” which makes it appear directly after your domain name.
- If you use subdomains, you’ll need to create a robots.txt file for each subdomain.
What to include in your robots.txt file
There are often disagreements about what should and shouldn’t be put in robots.txt files. Please note that robots.txt isn’t meant to deal with security issues for your website, so I’d recommend that the location of any admin or private pages on your site isn’t included in the robots.txt file. If you want to securely prevent robots from accessing any private content on your website then you need to password protect the area where it is stored. Remember, robots.txt is designed to act as a guide for web robots, and not all of them will abide by your instructions.
Let’s look at different examples of how you may want to use the robots.txt file:
Allow everything and submit the sitemap – This is the best option for most websites. It allows all search engines to fully crawl the website and index all the data, and it shows the search engines where the XML sitemap is located so they can find new pages very quickly:
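A minimal sketch of such a file is given here. The domain and the sitemap filename are assumptions for illustration, so swap in your own:

```text
# Applies to every crawler
User-agent: *
# An empty Disallow value blocks nothing
Disallow:

# Hypothetical sitemap location; use your real sitemap URL
Sitemap: http://www.examplewebsite.com/sitemap.xml
```

Leaving the Disallow value empty is what tells robots that nothing is off limits.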
Allow everything apart from one sub-directory – Sometimes you may have an area on your website that you don’t want search engines to show in the search engine results. This could be a checkout area, image files, an irrelevant part of a forum or an adult section of a website, for example. Any URL that includes a disallowed path will be excluded by the search engines:
# Disallowed Sub-Directories
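As a rough sketch of the kind of rules described above, the following assumes four hypothetical folder names. Your own site will almost certainly use different paths:

```text
User-agent: *
# Hypothetical folder names, for illustration only
Disallow: /checkout/
Disallow: /images/
Disallow: /forum/off-topic/
Disallow: /adult/
```

Each rule is matched against the start of the URL path, so /checkout/ also covers every page and folder beneath it.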
Allow everything apart from certain files – Sometimes you may want to show media on your website or provide documents but don’t want them to appear within image search results, social network previews or document search engine listings. Files you may wish to block include animated GIFs, PDF instruction manuals or development PHP files:
# Disallowed File Types
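A minimal sketch for the three file types mentioned above could look like this. It assumes the crawler understands the * and $ wildcards, which the major search engines do but not every robot does:

```text
User-agent: *
# * matches any run of characters, $ anchors the rule to the end of the URL
Disallow: /*.gif$
Disallow: /*.pdf$
Disallow: /*.php$
```

Without the trailing $, a rule such as /*.php would also catch URLs like /index.php?id=3, which may be more than you intended.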
Allow everything apart from certain webpages – Some webpages on your website may not be suitable to show in search engine results, and you can block individual pages as well using the robots.txt file. Webpages that you may wish to block could be your terms and conditions page, a page which you want to remove quickly for legal reasons or a page with sensitive information that you don’t want to be searchable (remember that people can still read your robots.txt file and the pages will still be seen by some unscrupulous crawler bots):
# Disallowed Web Pages
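Individual pages are blocked by listing their paths. The page names in this sketch are hypothetical and only there to show the shape of the rules:

```text
User-agent: *
# Hypothetical page paths, for illustration only
Disallow: /terms-and-conditions.html
Disallow: /old-press-release.html
```

Because rules match from the start of the path, a shorter rule such as /terms would also block any other URL that begins with those characters.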
Allow everything apart from certain patterns of URLs – Lastly, you may have an awkward pattern of URLs which you may wish to disallow, ones which may be nicely grouped into a certain sub-directory. Examples of URL patterns you may wish to block might be internal search result pages, leftover test pages from development or 2nd, 3rd, 4th etc. pages of an ecommerce category page:
# Disallowed URL Patterns
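The exact patterns depend entirely on how your site builds its URLs. This sketch assumes a ?s= search parameter, test pages whose paths start with /test- and a ?page= pagination parameter, all of which are hypothetical:

```text
User-agent: *
# Internal search results (assumes searches use a ?s= query parameter)
Disallow: /*?s=
# Leftover development pages (assumes their paths start with /test-)
Disallow: /test-
# Paginated category pages (assumes a ?page= query parameter)
Disallow: /*?page=
```

Note that the last rule would block every URL containing ?page=, including a first page if it is linked with that parameter, so check the patterns against real URLs before relying on them.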
Putting it all together
Clearly you may wish to use a combination of these methods to block off different areas of your website. The key things to remember are:
- If you disallow a sub-directory then ANY file, sub-directory or webpage within that URL pattern will be disallowed
- The star symbol (*) substitutes for any character or number of characters
- The dollar symbol ($) signifies the end of the URL; if you leave it out when blocking file extensions, you may block a huge number of URLs by accident
- URLs are matched case-sensitively, so you may have to put in both upper-case and lower-case versions to capture them all
- It can take search engines several days to a few weeks to notice a disallowed URL and remove it from their index
- The “User-agent” setting allows you to block certain crawler bots or treat them differently if needed; a full list of user agent bots can be found here, and any of their names can replace the catch-all star symbol (*)
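To illustrate the last point, here is a minimal sketch that treats one crawler differently from the rest. Googlebot-Image is Google's image crawler; the decision to block it is just an example:

```text
# Example only: keep Google's image crawler out of the whole site
User-agent: Googlebot-Image
Disallow: /

# Every other crawler may crawl everything
User-agent: *
Disallow:
```

A crawler follows the group that names it most specifically, so Googlebot-Image obeys the first group and ignores the catch-all one.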
If you are still puzzled or worried about creating your robots.txt file, Google has a handy testing tool within Webmaster Tools. Just sign into Webmaster Tools and visit this URL: https://www.google.com/webmasters/tools/robots-analysis. Yandex also has a free tool which doesn’t require a login here: http://webmaster.yandex.com/robots.xml
Google have put together a ‘fishy’ looking overview of what’s blocked and what’s not blocked on their in-depth robots.txt file page:
What not to include in your robots.txt file
Occasionally, a website has a robots.txt file which includes the following command:
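Going by the description that follows, the command in question is presumably the blanket disallow rule, sketched here on that assumption:

```text
# Applies to every crawler
User-agent: *
# A single slash matches every URL on the site
Disallow: /
```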
This is telling all bots to ignore THE WHOLE domain, meaning none of that website’s pages or files would be listed at all by the search engines!
The aforementioned example highlights the importance of properly implementing a robots.txt file, so be sure to check yours to ensure you’re not unknowingly restricting your chances of being indexed by search engines.
Testing Your Robots.txt File
You can test your robots.txt file to ensure it works as you expect it to – we’d recommend you do this with your robots.txt file even if you think it’s all correct.
The testing tool was created by Google (and updated in July 2014) to allow webmasters to check their robots.txt file. To test your robots.txt file, you’ll need to have the site to which it is applied registered with Google Webmaster Tools (this is really useful, so you should have this set up already – if not, find out more here). You then simply select the site from the list and Google will return notes for you where it highlights any errors.
What happens if you have no robots.txt file?
Without a robots.txt file, search engines will have a free run to crawl and index anything they find on the website. This is fine for most websites, but it’s really good practice to at least point out where your XML sitemap is so search engines can find new content without having to slowly crawl through all the pages on your website and bump into it days later.
NB – This is an updated version of Ben Wood’s Robots.txt introductory post from 2012.