Is a robots.txt file preventing your site from being crawled properly? Learn how these files work and how to make one yourself so that your site appears in search results.
You might be surprised to learn that one small text file, known as robots.txt, could be the downfall of your website.
If you use it incorrectly, you could end up telling search engine robots not to crawl any of your website, which means it won’t appear in search results. It’s therefore important to understand the purpose of a robots.txt file and how to use it correctly.
What is robots.txt?
A robots.txt file tells web robots which pages the website owner doesn’t want them to ‘crawl’. Robots crawl (visit) your website and then index (save) your web pages before listing them on search engine result pages.
If you don’t want certain pages, such as those in staging, to be listed by Google and other search engines, you need to block them using your robots.txt file.
You can check if your website has a robots.txt file by adding /robots.txt immediately after your domain name in your browser’s address bar.
The URL you enter should look like this: http://www.examplewebsite.com/robots.txt
How does it work?
Before a search engine crawls your website, it looks at your robots.txt file for instructions on which pages it is allowed to crawl and index in search engine results.
Robots.txt files are useful if you want search engines not to index:
- Duplicate pages on your website
- Internal search results pages
- Certain areas of your website or a whole website
- Certain files on your website such as images and PDFs
- Your sitemap
Using robots.txt files allows you to exclude pages which are not suitable for indexing, so search engines focus on crawling the most important pages instead. Search engines have a limited “crawl budget” and can only crawl a certain number of pages per day, so you want to give them the best chance of finding your pages quickly by blocking all irrelevant URLs.
You may also implement a crawl delay, which asks robots to wait a few seconds before crawling certain pages, so as not to overload your server. Be aware that Googlebot ignores this directive, so it will not slow down Google’s crawling; Google’s crawl rate has to be managed separately from robots.txt.
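As a rough sketch, a crawl delay for a crawler that honours the directive could be written as below. The 10-second value is an arbitrary assumption for illustration, and Crawl-delay is a non-standard directive, so support varies from one search engine to another.

```text
# Ask Bing's crawler to wait 10 seconds between requests
User-agent: bingbot
Crawl-delay: 10
```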
How to create a robots.txt file
If you don’t currently have a robots.txt file, it’s advisable to create one as soon as possible. To do so, you need to:
- Create a new text file and name it “robots.txt” – you can use the Notepad program on Windows PCs or TextEdit on Macs and then “Save As” a plain text file, ensuring that the extension of the file is .txt
- Upload it to the root directory of your website – this is usually a root level folder called “htdocs” or “www” which makes it appear directly after your domain name
- If you use subdomains, create a separate robots.txt file for each one
- Test that you can now see the robots.txt file by entering yourdomain.com/robots.txt into the browser address bar
What to include in your robots.txt file
There are often disagreements about what should and shouldn’t be put in robots.txt files.
Robots.txt isn’t meant to hide secure pages on your website, so the location of any admin or private pages on your site shouldn’t be included in the robots.txt file. To securely prevent robots from accessing any private content on your website, you need to password protect the area where it is stored. That’s because robots.txt is designed to act only as a guide for web robots, and not all of them will abide by your instructions.
Let’s look at different examples of how you may want to use the robots.txt file:
Allow everything and submit the sitemap – This is the best option for most websites because it allows all search engines to fully crawl the site and index all its data. It even shows the search engines where the XML sitemap is located so they can find new pages very quickly:
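A minimal sketch of such a file might look like the following. The sitemap address is an assumption built on the example domain used earlier, so substitute the real location of your own XML sitemap.

```text
# Allow all robots to crawl everything
User-agent: *
Disallow:

# Hypothetical sitemap location; replace with your own
Sitemap: http://www.examplewebsite.com/sitemap.xml
```

An empty Disallow line means that nothing is blocked.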
Allow everything apart from one sub-directory – Sometimes you may have an area on your website which you don’t want search engines to show in the search engine results. This could be a checkout area, image files, an irrelevant part of a forum or an adult section of a website, for example, as shown below. Any URL that includes the disallowed path will be excluded by the search engines:
# Disallowed Sub-Directories
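As a rough illustration, rules of this kind could look like the sketch below. The directory names are hypothetical stand-ins for the areas mentioned above, not paths taken from a real site.

```text
User-agent: *
# Each path below is a hypothetical example directory
Disallow: /checkout/
Disallow: /images/
Disallow: /forum/off-topic/
Disallow: /adult/
```

Each rule matches from the start of the URL path, so /checkout/ also covers /checkout/basket.html and anything else beneath it.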
Allow everything apart from certain files – Sometimes you may want to show media on your website or provide documents but don’t want them to appear within image search results, social network previews or document search engine listings. Files you may wish to block could be any animated GIFs, PDF instruction manuals or any development PHP files, for example, as shown below:
# Disallowed File Types
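One way to express this is sketched below. It assumes the crawler supports the * and $ wildcards (the major search engines do, but not every bot does), and the /dev/ folder is a hypothetical location for development files.

```text
User-agent: *
# Block every URL that ends in .gif or .pdf
Disallow: /*.gif$
Disallow: /*.pdf$
# Block PHP files, but only inside a hypothetical /dev/ folder
Disallow: /dev/*.php$
```

Limiting the PHP rule to one folder is a deliberate choice in this sketch: on a site whose normal pages end in .php, a site-wide rule such as /*.php$ would block those pages too.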
Allow everything apart from certain webpages – Some webpages on your website may not be suitable to show in search engine results and you can block these individual pages as well using the robots.txt file. Webpages that you may wish to block could be your terms and conditions page, any page which you want to remove quickly for legal reasons, or a page with sensitive information which you don’t want to be searchable. Remember that people can still read pages that are disallowed by the robots.txt file even if you aren’t directing them there from search engines. Also, the pages will still be seen by some unscrupulous crawler bots:
# Disallowed Web Pages
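A sketch of page-level rules follows. The page addresses are invented for illustration and should be replaced with the real paths on your own site.

```text
User-agent: *
# Hypothetical individual pages to keep out of search results
Disallow: /terms-and-conditions.html
Disallow: /legal/withdrawn-notice.html
Disallow: /account/details.html
```

Because rules match by prefix, a shorter rule such as Disallow: /terms would also block /terms-of-sale.html, so it is safer to write out the full page path.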
Allow everything apart from certain patterns of URLs – Lastly, you may have an awkward pattern of URLs which you wish to disallow, ones which can’t be neatly grouped into a single sub-directory. Examples of URL patterns you may wish to block might be internal search result pages, leftover test pages from development, or pages after the first page of an ecommerce category page:
# Disallowed URL Patterns
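The sketch below shows how such patterns might be written. Every URL shape in it is an assumption: it supposes that internal searches live at /search?, that test pages end in -test.html, and that category pages only carry a page parameter from the second page onwards.

```text
User-agent: *
# Internal search results, e.g. /search?q=shoes
Disallow: /search?
# Leftover test pages, e.g. /home-test.html or /shop/banner-test.html
Disallow: /*-test.html$
# Later pages of a category, e.g. /shoes?page=2
Disallow: /*?page=
```

Patterns like these are easy to make too broad, so it is worth checking a handful of real URLs against each rule before publishing the file.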
Putting it all together
Clearly, you may wish to use a combination of these methods to block different areas of your website. The key things to remember are:
- If you disallow a sub-directory, then ANY file, sub-directory or webpage within that URL pattern will be disallowed
- The star symbol (*) stands for any sequence of characters
- The dollar symbol ($) signifies the end of the URL. If you leave it out when blocking file extensions, you may block a huge number of URLs by accident
- URL matching is case sensitive, so you may have to include both upper-case and lower-case versions to capture them all
- It can take search engines several days to a few weeks to notice a disallowed URL and remove it from their index
- The “User-agent” setting allows you to block certain crawler bots or treat them differently if necessary, by replacing the catch-all star symbol (*) with the name of a specific bot. The major search engines publish the user-agent names of their bots
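To illustrate the last point, the sketch below treats three groups of robots differently. Googlebot-Image is the name of Google’s image crawler; ExampleBot and the folder names are hypothetical.

```text
# Rules for every robot that has no more specific group
User-agent: *
Disallow: /checkout/

# Google's image crawler: keep it out of a hypothetical photo folder
User-agent: Googlebot-Image
Disallow: /photos/

# A hypothetical unwanted bot: block it from the whole site
User-agent: ExampleBot
Disallow: /
```

For Google’s crawlers at least, a bot follows only the most specific group that matches its name, so Googlebot-Image in this sketch would not inherit the /checkout/ rule from the catch-all group.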
If you are still puzzled or worried about creating your robots.txt file, Google has a handy testing tool within Webmaster Tools. Just sign in to Webmaster Tools and visit this URL: https://www.google.com/webmasters/tools/robots-analysis. Yandex also has a free tool which doesn’t require a login: http://webmaster.yandex.com/robots.xml
Google has put together a ‘fishy’ looking overview of what’s blocked and what’s not blocked on their in-depth robots.txt file page.
What not to include in your robots.txt file
Occasionally, a website has a robots.txt file which includes the following command:
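That command typically takes the two-line form below, where the single slash stands for the root of the site:

```text
User-agent: *
Disallow: /
```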
This is telling all bots to ignore THE WHOLE domain, meaning none of that website’s pages or files would be listed at all by the search engines!
The aforementioned example highlights the importance of properly implementing a robots.txt file, so be sure to check yours to ensure you’re not unknowingly restricting your chances of being indexed by search engines.
Testing your robots.txt file
You can test your robots.txt file to ensure it works as you expect it to – it’s a good idea to do this even if you think it’s all correct.
The testing tool was created by Google (and updated in July 2014) to allow webmasters to check their robots.txt file. To test your robots.txt file, the site it is applied to needs to be registered with Google Webmaster Tools (this is really useful, so you should have this set up already). Then simply select the site from the list and Google will return notes for you and highlight any errors.
What happens if you have no robots.txt file?
Without a robots.txt file, search engines will have a free run to crawl and index anything they find on the website. This is fine for most websites, but even then it’s good practice to at least point out where your XML sitemap is so search engines can quickly find new content on your website.