Here's a brief introduction to log file analysis for SEO. It gives you an overview of what log files are, how they can be analysed for SEO, what to look out for and which tools to use.
What is a server log file?
A server log is a log file (or several files) automatically created and maintained by a server consisting of a list of activities it performed.
For SEO purposes, we are concerned with a web server log which contains a history of page requests for a website, from both humans and robots. This is also sometimes referred to as an access log, and the raw data looks something like this:
Yes, the data looks a bit overwhelming and confusing at first, so let’s break it down and look at a “hit” more closely.
An Example Hit
Every server logs hits slightly differently, but they typically give similar information that is organised into fields.
Below is a sample hit to an Apache web server (this is simplified – some of the fields have been taken out):
22.214.171.124 – – [01/March/2018:12:21:17 +0100] “GET” – “/wp-content/themes/esp/help.php” – “404” “-” “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)” – www.example.com –
As you can see, for each hit we are given key information such as the date and time, the response code of the requested URI (in this case, a 404) and the user-agent that the request came from (in this case Googlebot). As you can imagine, log files are made up of thousands of hits each day, as every time a user or bot arrives at your site, many hits are recorded for each page requested – including images, CSS and any other files required to render the page.
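To make the idea of "fields" a bit more concrete, here is a minimal Python sketch that splits one hit into named fields. It assumes the standard Apache "combined" log format, which may not match your server's configuration. The IP address (192.0.2.1), the byte count (512) and the HTTP/1.1 protocol in the sample line are placeholders I've made up for illustration; only the date, URI, status and user-agent come from the example above.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format); adjust if your logs differ:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+)(?: (?P<protocol>[^"]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_hit(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    hit = match.groupdict()
    hit["status"] = int(hit["status"])
    # Apache timestamps look like 01/Mar/2018:12:21:17 +0100
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit

# Placeholder line: IP, byte count and protocol are invented for illustration.
sample = (
    '192.0.2.1 - - [01/Mar/2018:12:21:17 +0100] '
    '"GET /wp-content/themes/esp/help.php HTTP/1.1" 404 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

hit = parse_hit(sample)
print(hit["time"], hit["status"], hit["uri"], hit["user_agent"])
```

Lines that don't match the assumed format come back as `None`, so you can count them and spot a format mismatch early rather than silently losing hits.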
Why Are They Significant?
So you know what a log file is, but why is it worth your time to analyse them?
Well, the fact is that there is only one true record of how search engine crawlers, such as Googlebot, process your website: your server log files.
Search Console, third-party crawlers and search operators won’t give us the whole picture of how Googlebot and other search engine crawlers interact with a website. ONLY the access log files can give us this information.
How Can We Use Log File Analysis for SEO?
Log file analysis gives us a huge amount of useful insight, including enabling us to:
- Validate exactly what can, or can’t, be crawled.
- View responses encountered by the search engines during their crawl e.g. 302s, 404s, soft 404s.
- Identify crawl shortcomings that might have wider site-based implications (such as hierarchy, or internal link structure).
- See which pages the search engines prioritise, and might consider the most important.
- Discover areas of crawl budget waste.
How Do I Get Hold Of Log Files?
For this type of analysis you require the raw access logs from all the web servers for your domain, with no filtering or modifications applied. Ideally you’ll need a large amount of data to make the analysis worthwhile. How many days’ or weeks’ worth this is depends on the size and authority of your site and the amount of traffic it generates. For some sites a week might be enough; for others you might need a month or more of data.
Your web developer should be able to send you these files. Before they send them over, it’s worth asking whether the logs contain requests for more than one domain and protocol, and whether the domain and protocol are recorded in the logs. If they are not, you won’t be able to identify requests correctly: you won’t be able to tell the difference between a request for http://www.example.com/ and one for https://example.com/. In these cases you should ask your developer to update the log configuration to include this information in future.
What Tools Do I Need to Use?
If you’re an Excel whizz, then this guide is really useful for helping you to format and analyse your log files using Excel. Personally, I use the Screaming Frog Log File Analyser (which costs $99 per year), as its user-friendly interface makes it quick and easy to spot any issues (although arguably you won’t get quite the same level of depth or freedom as you would by using Excel).
Some other tools are Splunk and GamutLogViewer.
How To Analyse Log Files for SEO
1. Find Where Crawl Budget is Being Wasted
Firstly, what is crawl budget? Google defines it as:
“Taking crawl rate and crawl demand together we define crawl budget as the number of URLs Googlebot can and wants to crawl.”
Essentially – it’s the number of pages a search engine will crawl each time it visits your site and is linked to the authority of a domain and proportional to the flow of link equity through a website.
Crucially in relation to log file analysis, crawl budget can sometimes be wasted on irrelevant pages. If you have fresh content you want to be indexed but no budget left, then Google won’t index this new content. That’s why you want to monitor where you spend your crawl budget with log analysis.
Optimising your crawl budget will help search engines crawl and index the most important pages on your website. Check out our useful post on the subject.
Factors Affecting Crawl Budget
Having many low-value-add URLs can negatively affect a site’s crawling and indexing. Low-value-add URLs can fall into these categories:
- Faceted navigation and session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Low quality and spam content
Wasting server resources on pages like these will drain crawl activity from pages that do actually have value, which may cause a significant delay in discovering good content on a site.
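If you'd like a rough feel for how much of Googlebot's activity goes on parameterised URLs, a sketch like the one below can help. It assumes you've already extracted (URI, user-agent) pairs from your logs. The parameter names in `FACET_PARAMS` and `SESSION_PARAMS` and the sample URLs are hypothetical, so swap in the ones your site actually uses. Bear in mind that matching on the word "Googlebot" alone trusts the user-agent string, which can be spoofed.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical parameter names; replace with your site's real ones.
FACET_PARAMS = {"colour", "size", "sort"}
SESSION_PARAMS = {"sessionid", "sid"}

def classify(uri):
    """Put a requested URI into a rough low-value / clean bucket."""
    query = urlsplit(uri).query
    params = {name.lower() for name in parse_qs(query, keep_blank_values=True)}
    if params & SESSION_PARAMS:
        return "session identifier"
    if params & FACET_PARAMS:
        return "faceted navigation"
    if params:
        return "other parameters"
    return "clean URL"

def crawl_breakdown(hits):
    """hits: iterable of (uri, user_agent). Returns each bucket's share of Googlebot hits."""
    counts = Counter(classify(uri) for uri, ua in hits if "Googlebot" in ua)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Made-up sample hits for illustration only.
hits = [
    ("/shoes?colour=red&size=9", GOOGLEBOT),
    ("/shoes?sort=price", GOOGLEBOT),
    ("/shoes", GOOGLEBOT),
    ("/basket?sid=abc123", GOOGLEBOT),
    ("/shoes?colour=blue", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
]

breakdown = crawl_breakdown(hits)
for label, share in sorted(breakdown.items(), key=lambda item: -item[1]):
    print(f"{label}: {share:.0%}")
```

A large share for anything other than clean URLs is a hint, not proof, that crawl budget is leaking; you'd still want to look at the actual URLs before deciding what to block.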
2. Answer Technical SEO Questions
By analysing log files, we can answer the following questions with far greater certainty than if we were trying to use other methods/tools:
- How often are certain subdirectories being crawled? E.g. service pages, the blog, or perhaps particular authors.
- Are all of your targeted search engine bots accessing your pages?
- Which pages are not correctly serving? Look for pages with 3xx, 4xx & 5xx HTTP statuses.
And many more!
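Here's a small Python sketch of how those questions might be answered once your hits are parsed. It assumes each hit is a (URI, status, user-agent) tuple. The sample hits are made up, and the bot tokens "Googlebot" and "bingbot" are simply the substrings I'd look for in the user-agent, so add whichever crawlers you're targeting.

```python
from collections import Counter

BOTS = ("Googlebot", "bingbot")  # user-agent substrings to look for

def top_directory(uri):
    """'/blog/post-1?x=1' -> '/blog/'; root-level pages such as '/about' -> '/'."""
    path = uri.split("?", 1)[0]
    segments = path.split("/")  # '/blog/post-1' -> ['', 'blog', 'post-1']
    return f"/{segments[1]}/" if len(segments) > 2 else "/"

def crawl_report(hits, bot="Googlebot"):
    """Count one bot's hits per top-level directory and its non-2xx responses."""
    per_directory = Counter()
    problem_urls = Counter()
    for uri, status, user_agent in hits:
        if bot not in user_agent:
            continue
        per_directory[top_directory(uri)] += 1
        if status >= 300:  # 3xx, 4xx and 5xx
            problem_urls[(status, uri)] += 1
    return per_directory, problem_urls

def hits_per_bot(hits, bots=BOTS):
    """hits must be a list (it is scanned once per bot)."""
    return {bot: sum(1 for _, _, ua in hits if bot in ua) for bot in bots}

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BINGBOT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

# Made-up sample hits for illustration only.
hits = [
    ("/blog/log-file-analysis/", 200, GOOGLEBOT),
    ("/blog/old-post/", 301, GOOGLEBOT),
    ("/services/seo/", 200, GOOGLEBOT),
    ("/wp-content/themes/esp/help.php", 404, GOOGLEBOT),
    ("/blog/log-file-analysis/", 200, BINGBOT),
]

per_directory, problem_urls = crawl_report(hits)
print(per_directory.most_common())
print(problem_urls.most_common())
print(hits_per_bot(hits))
```

Grouping by the first path segment is a deliberately crude choice; if your site nests content more deeply, you may prefer to group on two segments or on a list of patterns you care about.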
3. Find Out If Your Site Has Switched To Google’s Mobile-First Index
You can also use a site’s server logs to find out whether your website is getting increased crawling by Smartphone Googlebot, which indicates it has been switched to the mobile-first index.
Typically a site still on the regular index will have about 80% of Google’s crawling done by the desktop crawler and 20% by the mobile one. If you’ve been switched to mobile-first, those numbers will reverse.
You can find this info by looking at the User Agents tab in the Screaming Frog Log File Analyser. You should see most of the events coming from this user agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
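If you'd rather check the split yourself, the sketch below works out the smartphone versus desktop share from a list of user-agent strings. It assumes that a Googlebot user-agent containing "Mobile" is the smartphone crawler and that any other Googlebot user-agent is the desktop one, which is a simplification (other Google crawlers, such as the image crawler, also include "Googlebot" in their name). The sample counts are invented.

```python
from collections import Counter

def googlebot_type(user_agent):
    """Return 'smartphone', 'desktop', or None for non-Googlebot user agents."""
    if "Googlebot" not in user_agent:
        return None
    # Assumption: the smartphone crawler's user-agent contains "Mobile".
    return "smartphone" if "Mobile" in user_agent else "desktop"

def crawler_split(user_agents):
    """Share of Googlebot hits made by each crawler type."""
    counts = Counter(kind for kind in map(googlebot_type, user_agents) if kind)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()} if total else {}

SMARTPHONE = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile "
    "Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
DESKTOP = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
HUMAN = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Invented sample: 8 smartphone hits, 2 desktop hits, 1 human visitor.
sample = [SMARTPHONE] * 8 + [DESKTOP] * 2 + [HUMAN]

split = crawler_split(sample)
for kind, share in split.items():
    print(f"{kind}: {share:.0%}")
```

Run this over a few weeks of logs rather than a single day, as the split can fluctuate.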
Log File Analysis: A Brief Example
I recently completed a log file analysis using Screaming Frog for one of my clients. Below is the overview, giving you the top-level data for the domain:
I discovered that Google seemed to be crawling some odd pages very frequently, and prioritising them over other important pages. Of course, ideally, your most important pages should be crawled the most e.g. your homepage. However, among the 15 pages with the most hits I found redirects, incorrect 302 (temporary) redirects, pages with no content on them, and some which were 404s and soft 404s. I also discovered that Google was accessing and crawling a huge number of dynamic, faceted URLs.
This meant that I was able to advise the client on several technical fixes, including excluding URLs from being crawled by blocking URLs containing certain patterns with the robots.txt file, updating incorrect redirects and soft 404s, and more. All of these will help to boost their performance in the search engines and improve the site’s accessibility to Google.
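Before rolling out robots.txt changes like these, it can be worth sanity-checking them against URLs taken from your logs. The sketch below uses Python's built-in `urllib.robotparser` to do that. The rules and URLs are hypothetical examples, not the client's actual ones. As far as I know, the standard-library parser matches plain path prefixes and doesn't interpret wildcards the way Google does, so this sketch sticks to prefix rules and isn't a substitute for testing with Google's own tools.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; prefix-based only (no wildcards).
PROPOSED_RULES = """\
User-agent: *
Disallow: /filter/
Disallow: /search
""".splitlines()

def blocked_urls(rules, urls, agent="Googlebot"):
    """Return the URLs that the given robots.txt rules would stop `agent` fetching."""
    parser = RobotFileParser()
    parser.parse(rules)
    return [url for url in urls if not parser.can_fetch(agent, url)]

# Hypothetical URLs, e.g. the most-crawled ones from your log analysis.
urls = [
    "https://www.example.com/filter/colour-red/",
    "https://www.example.com/search?q=boots",
    "https://www.example.com/blog/log-file-analysis/",
]

blocked = blocked_urls(PROPOSED_RULES, urls)
print(blocked)
```

The useful check is in both directions: the faceted URLs you want excluded should appear in the list, and your important pages should not.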
So that’s my brief introduction to log file analysis. There’s a huge amount more you can do, both in Excel and with the tools mentioned above (plus others). More than I can possibly cover here! Below are some resources I’ve found useful:
And there’s loads more out there to read which should satisfy the most curious of technical SEO enthusiasts!