Foreword by Matt Diggity:
In a quick moment I’m going to hand things over to Rowan Collins, the featured guest author of this article.
Rowan is Head of Technical SEO at my agency The Search Initiative. He’s one of our best.
Other than being an overall well-rounded SEO, Rowan is a beast when it comes to the technical side of things… as you’ll soon learn.
Introduction: Rowan Collins
Get site crawl right and you’ll have a site that Google responds to quickly – every small change can lead to big gains in the SERPs. However, if done wrong, you’ll be left waiting weeks for Googlebot to pick up your updates.
I’m often asked how to force Googlebot to crawl specific pages, and plenty of people struggle to get their pages indexed at all.
Well, today’s your lucky day – because that’s all about to change with this article.
I’m going to teach you the four pillars of mastering site crawl, so you can take actionable measures to improve your standings in the SERPs.
Pillar #1: Page Blocking
Google assigns a “crawl budget” to each website. To make sure Google is crawling the pages that you want, don’t waste that budget on pages you don’t want them to crawl.
This is where page blocking comes into play.
When it comes to blocking pages, you’ve got plenty of options, and it’s up to you which ones to use. I’m going to give you the tools, but you’ll need to analyse your own site.
Robots.txt
A simple technique that I like to use is blocking pages with robots.txt.
Originally created after a crawler accidentally DDoS’d a website, this directive has become unofficially recognized across the web.
Whilst there’s no ISO Standard for robots.txt, Googlebot does have its preferences. You can find out more about that here.
But the short version is that you can simply create a plain text file called robots.txt in your site’s root, and give it directives on how robots should behave. You will need to structure it so that each robot knows which rules apply to it.
Here’s an example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://diggitymarketing.com/sitemap.xml
This is a short and sweet robots.txt file, and it’s one that you’ll likely find on your website. Here it is broken down for you:
- User-Agent – this is specifying which robots should adhere to the following rules. Whilst good bots will generally follow directives, bad bots do not need to.
- Disallow – this is telling the bots not to crawl your /wp-admin/ folders, which is where a lot of important documents are kept for WordPress.
- Allow – this is telling the bots that despite being inside the /wp-admin/ folder, you’re still allowed to crawl this file. The admin-ajax.php file is super important, so you should keep this open for the bots.
- Sitemap – one of the most frequently left out lines is the sitemap directive. This helps Googlebot to find your XML sitemap and improve crawlability.
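A common use for this is stopping crawl budget being wasted on URLs you don’t care about – internal search results and parameter URLs are typical culprits. Here’s a rough sketch of how that might look (the disallowed paths below are examples only; check what your own site actually generates before blocking anything):

User-agent: *
Disallow: /wp-admin/
Disallow: /search/
Disallow: /*?s=
Allow: /wp-admin/admin-ajax.php
Sitemap: https://diggitymarketing.com/sitemap.xml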
If you’re using Shopify then you’ll know the hardships of not having control over your robots.txt file.
However, the following strategy – meta robots – can still be applied to Shopify, and should help.
Meta Robots
Still part of the robots directives, the meta robots tags are HTML code that can be used to specify crawl preferences.
By default all your pages will be set to index,follow – even if you don’t specify a preference. Adding this tag won’t help your page get crawled and indexed, because it’s the default.
However, if you’re looking to keep a specific page out of the index, then you will need to specify it with one of the following tags:
<meta name="robots" content="noindex,follow">
<meta name="robots" content="noindex,nofollow">
Whilst the above two tags are technically different from a robots directive perspective, they don’t seem to function differently according to Google.
Previously, you would specify noindex to keep the page out of the index, and you would also choose whether the links on that page should still be followed.
Google recently made a statement that noindexed pages eventually get treated like soft 404s, and that the links on them are treated as nofollow. Therefore, there’s effectively no difference between specifying follow and nofollow.
However, if you don’t trust everything that John Mueller states, you can use the noindex,follow tag to specify your desire to be crawled still.
This is something that Yoast have taken on board, so you’ll notice that in recent versions of the Yoast SEO plugin, the option to noindex pagination has been removed.
This is because if Googlebot treats the noindex tag as a 404, then applying it across your pagination is an awful idea. I would err on the side of caution and only use noindex on pages you’re happy not to be crawled or followed.
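If you do need to add the tag yourself on a WordPress site – say, to keep internal search results out of the index – a minimal sketch looks like this. The function name is my own and the is_search() condition is just an example, so adjust it to whichever pages you actually want excluded (this would go in your theme’s functions.php):

<?php
// Output a meta robots tag on internal search result pages only.
function tsi_noindex_search_results() {
    if ( is_search() ) {
        echo '<meta name="robots" content="noindex,follow">' . "\n";
    }
}
add_action( 'wp_head', 'tsi_noindex_search_results' );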
X-Robots Tags
There’s another robots directive that people rarely use, and it’s powerful: the X-Robots-Tag HTTP header. But not many people understand why it’s so powerful.
With the robots.txt and meta robots directives, it’s up to the robot whether it listens or not. This goes for Googlebot too – it can still ping your pages to find out if they’re present.
With the X-Robots-Tag, the directive is sent by the server itself as part of the HTTP response, before any page content. That also means it works for any file type your server returns – PDFs, docs, images – not just HTML pages where you can add a meta tag.
This can be done either with PHP or with Apache directives, since both are processed server-side – .htaccess being the preferred method for blocking specific file types, and PHP for specific pages.
PHP Code
Here’s an example of the code that you would use for blocking off a page with PHP. It’s simple – the header is added when the page is generated on the server, rather than relying on a tag in the markup.
header("X-Robots-Tag: noindex", true);
Apache Directive
Here’s an example of the code that you could use for blocking off .doc and .pdf files from the SERPs without having to specify every PDF in your robots.txt file.
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
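You can check that the header is actually being sent by requesting one of the files from the command line (the URL here is a placeholder). Note that for the Header directive to work in .htaccess, your Apache setup needs mod_headers enabled.

curl -I https://example.com/sample.pdf

The response headers should include the X-Robots-Tag line you set above.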
Pillar #2: Understanding Crawl Behaviours
Many of the people who follow The Lab will know that there are lots of ways that robots can crawl your website. Here’s the rundown on how it all works:
Crawl Budget
When it comes to crawl budget, this is something that only exists in principle, not in practice. This means that there’s no way to artificially inflate your crawl budget.
For those unfamiliar, this is how much time Google will spend crawling your website. Megastores with thousands of products will be crawled more extensively than a microsite; however, the microsite will have its core pages crawled more often.
If you are having trouble getting Google to crawl your important pages, there’s probably a reason for this. Either it’s been blocked off, or it is low value.
Rather than trying to force crawls on pages, you may need to address the root of the problem.
However, for those that like a rough idea, you can check the average crawl rate of your website in Google Search Console > Crawl Stats.
Depth First Crawling
One way that robots can crawl your website is using the principle of depth-first. This will force crawlers to go as deep as possible before returning up the hierarchy.
This is an effective way of crawling a website if you’re looking to find internal pages with valuable content in as short a time as possible. However, core navigational pages will be pushed down in priority as a result.
Being aware that web crawlers can behave in this way will help when analysing problems with your website.
Breadth First Crawling
This is the opposite of depth first crawling, in that it preserves website structure. It will start by crawling every Level 1 page before crawling every Level 2 page.
The benefit of this type of crawling is that it will likely discover more unique URLs in a shorter period. This is because it travels across multiple categories of your website.
So, rather than digging deep into the rabbit hole, this method seeks to find every rabbit hole before digging deeper into the website.
However, whilst this is good for preserving site architecture, it can be slow if your category pages take a long time to respond and load.
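To make the difference concrete, here’s a toy sketch of the two crawl orders over a tiny, made-up link graph. This is purely illustrative – it isn’t how Googlebot is implemented – but it shows why breadth-first surfaces the categories before the posts, while depth-first chases one branch to the bottom:

<?php
// Toy link graph: homepage links to categories, categories link to posts.
$graph = [
    '/'           => ['/category-a', '/category-b'],
    '/category-a' => ['/post-1', '/post-2'],
    '/category-b' => ['/post-3'],
    '/post-1'     => [],
    '/post-2'     => [],
    '/post-3'     => [],
];

function crawlOrder(array $graph, string $start, bool $breadthFirst): array {
    $frontier = [$start];         // URLs waiting to be crawled
    $seen     = [$start => true]; // URLs already discovered
    $order    = [];
    while ($frontier) {
        // Queue (shift) gives breadth-first; stack (pop) gives depth-first.
        $url = $breadthFirst ? array_shift($frontier) : array_pop($frontier);
        $order[] = $url;
        foreach ($graph[$url] as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $frontier[]  = $link;
            }
        }
    }
    return $order;
}

print_r(crawlOrder($graph, '/', true));  // breadth-first: categories before posts
print_r(crawlOrder($graph, '/', false)); // depth-first: follows one branch to the end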
Efficiency Crawling
There are many different ways of crawling; the two above are the most notable, and the third is efficiency crawling. This is where the crawler doesn’t follow breadth-first or depth-first order, but instead prioritises pages based on response times.
This means that if the crawler has an hour to spend on your website, it will favour the pages with low response times, so that it can get through a larger number of pages in that time. This is where the term ‘crawl budget’ comes from.
Essentially, you’re trying to make your website respond as quickly as possible. You do this so that more pages can be crawled in that allocated time frame.
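If you want a quick idea of which pages respond slowly, a rough sketch like the one below will do the job. The URLs are placeholders, and it uses HEAD requests, so it measures server response time rather than a full page download:

<?php
// Quick-and-dirty response time check for a few URLs (placeholders).
$urls = [
    'https://example.com/',
    'https://example.com/category/widgets/',
    'https://example.com/blog/a-long-post/',
];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request – timing only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't print the response
    curl_exec($ch);
    printf("%-45s %.3fs\n", $url, curl_getinfo($ch, CURLINFO_TOTAL_TIME));
    curl_close($ch);
}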
Server Speed
Many people don’t recognise that the internet is physically connected. There are millions of devices connected across the globe to share and pass files.
However, your website is being hosted on a server somewhere. For Google and your users to open your website, this will require a connection with your server.
The faster your server is, the less time Googlebot has to wait for the important files. If we review the above section about efficiency crawling, it’s clear why this is quite important.
When it comes to SEO, it pays to get good quality hosting in a location near your target audience. This will lower the latency and the wait time for each file. However, if you want to distribute internationally, you may wish to use a CDN.
Content Distribution Networks (CDNs)
Since Googlebot crawls from Google’s servers, these may be physically very far away from your website’s server. This means that Google can see your website as slow, even though your users perceive it as fast.
One way to work around this is by setting up a Content Distribution Network.
There are loads to choose from, but the idea is really straightforward: you are paying for your website’s content to be distributed across a network of servers around the world.
That’s what it does – but why would that help?
If your website is distributed across the internet, the physical distance between your end user and the files can be reduced. This ultimately means that there’s less latency and faster load times for all of your pages.
Pillar #3: Page Funnelling
Once you understand the crawl bot behaviours above, the next question should be: how can I force Google to crawl the pages that I want?
Below you’re going to find some great tips on tying up loose ends on your website, funneling authority and getting core pages recrawled.
Ahrefs Broken Links
At the start of every campaign it’s essential to tie up any loose ends. To do this, we look for any broken links that are picked up in Ahrefs.
Not only will this help to funnel authority through to your website, it will also show which broken links have been picked up, so you can clean up any unintended 404s that are still linked to across the internet.
If you want to clean this up quickly, you can export a list of broken links and then import them all into your favourite redirect plugin. We personally use Redirection and Simple 301 Redirects for our WordPress redirects.
Whilst Redirection includes CSV import/export by default, you will need an additional add-on for Simple 301 Redirects. It’s called bulk update, and it’s also free.
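If you’d rather not rely on a plugin, the same bulk redirects can be handled at the server level with Apache’s mod_alias. A small sketch – the paths and destinations below are placeholders for the URLs in your exported list:

Redirect 301 /old-broken-page/ https://example.com/new-page/
Redirect 301 /another-dead-url/ https://example.com/relevant-category/

These lines go in your .htaccess file, one per redirect.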
Screaming Frog Broken Links
Similar to the above, with Screaming Frog we’re first looking to export all the 404 errors and then add redirects. This should turn all those errors into 301 redirects.
The next step to clean up your website is to fix your internal links.
Whilst a 301 can pass authority and relevance signals, it’s normally faster and more efficient if your server isn’t processing lots of redirects. Get in the habit of cleaning up your internal links, and remember to optimise those anchors!
Search Console Crawl Errors
Another place you can find some errors to funnel is in your Google Search Console. This can be a handy way to find which errors Googlebot has picked up.
Then do as you did above: export them all to CSV and bulk import the redirections. This will fix almost all of your 404 errors in a couple of days. Then Googlebot will spend more time crawling your relevant pages, and less time on your broken pages.
Server Log Analysis
Whilst all of the above tools are useful, they’re not the absolute best way to check for inefficiency. By choosing to view server logs through Screaming Frog Log File Analyser you can find all the errors your server has picked up.
Screaming Frog filters out normal users and focuses primarily on search bots. This seems like it would provide the same results as above, but it’s normally more detailed.
Not only does it include all of the Googlebot URLs, but you can also pick up other search crawlers such as Bing and Yandex. Plus, since it’s every error that your server picked up, you’re not relying on Google Search Console to be accurate.
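If you just want a quick look before firing up the Log File Analyser, a rough sketch of the same idea in PHP is below. The log path and format are assumptions (a standard Apache/Nginx combined log), and matching on the user agent string doesn’t verify that the bot really is Google – the Screaming Frog tool does that properly:

<?php
// Count the 404s hit by requests identifying as Googlebot, straight from an access log.
$logFile = '/var/log/apache2/access.log'; // assumption: adjust to your server
$counts  = [];

foreach (file($logFile) as $line) {
    // Only look at requests whose user agent mentions Googlebot.
    if (stripos($line, 'Googlebot') === false) {
        continue;
    }
    // Combined log format: ... "GET /some/path HTTP/1.1" 404 ...
    if (preg_match('/"(?:GET|HEAD) (\S+) [^"]*" 404 /', $line, $m)) {
        $url = $m[1];
        $counts[$url] = ($counts[$url] ?? 0) + 1;
    }
}

arsort($counts);
print_r($counts); // most-hit broken URLs first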
Internal Linking
One of the ways that you can improve crawl rate of a specific page is by using internal links. It’s a simple one, but you can improve your current approach.
Using the Screaming Frog Log File Analyser from above, you can see which pages are getting the most hits from Googlebot. If a page is being crawled regularly throughout the month, there’s a good chance that you’ve found a candidate for internal linking.
This page can have internal links added towards other core posts, and this is going to help get Googlebot to the right areas of your website.
Matt regularly includes internal links in his posts, and he does it properly. This helps you guys to find more awesome content, and also helps Googlebot to crawl and rank his site.
Pillar #4: Forcing a Crawl
If Googlebot is crawling your site but not finding your core pages, that’s normally a big issue. Likewise, if your website is too big and Google isn’t getting to the pages you want indexed, this can hurt your SEO strategy.
Thankfully, there are ways to force a crawl on your website. First, though, some words of warning about this approach:
If your website is not being crawled regularly by Googlebot, there’s normally a good reason for this. The most likely cause is that Google doesn’t think your website is valuable.
Another common reason for a page not being crawled is website bloat. If you are struggling to get millions of pages indexed, your problem is the millions of pages – not the fact that they aren’t indexed.
At our SEO agency The Search Initiative, we have seen examples of websites that were spared a Panda penalty because their crawlability was so bad that Google couldn’t find the thin content pages. If we had fixed the crawlability issue without first fixing the thin content, those sites would have ended up slapped with a penalty.
It’s important to fix all of your website’s problems if you want to enjoy long lasting rankings.
Sitemap.xml
Seems like a pretty obvious one, but since Google uses XML sitemaps to crawl your website, the first method is to submit a sitemap.
Simply take all the URLs you want indexed and run them through the list mode of Screaming Frog by selecting List from the Mode menu:
Then you can upload your URLs from one of the following options in the dropdown:
- From File
- Enter Manually
- Paste
- Download Sitemap
- Download Sitemap Index
Then once you’ve crawled all the URLs you want indexed, you can just use the Sitemap feature to generate an XML Sitemap.
Upload this to your root directory and then submit it in Google Search Console, so that any duplicate or uncrawled pages get flagged and dealt with quickly.
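For reference, the file Screaming Frog generates is just an XML list of URLs. A minimal example looks like this (the URL and date are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page/</loc>
    <lastmod>2018-09-01</lastmod>
  </url>
</urlset>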
Fetch & Request Indexing
If you only have a small number of pages that you want to index, then using the Fetch and Request Indexing tool is super useful.
It works great when combined with the sitemap submissions to effectively recrawl your site in short periods of time. There’s not much to say, other than you can find it in Google Search Console > Crawl > Fetch as Google.
Link Building
It makes sense that if you are trying to make a page more visible and more likely to be crawled, throwing some links at it will help you out.
Normally 1 – 2 decent links can help put your page on the map. This is because Google will be crawling another page, discover the anchor pointing towards yours, and be left with no choice but to crawl the new page.
Using low quality pillow links can also work, but I would recommend aiming for some high quality links. This will ultimately improve your likelihood of being crawled, since good quality content gets crawled more often.
Indexing Tools
By the time you’ve got to using indexing tools, you’ve probably hit the bottom of the barrel and are running out of ideas.
If your pages are good quality, indexable, in your sitemap, fetched and requested, with some external links – and they’re still not indexed – there’s another trick you can try.
Many people use indexing tools as the shortcut and default straight to it, but in most cases it’s a waste of money. The results are often unreliable, and if you’ve done everything else right then you shouldn’t really have a problem.
However, you can use tools such as Lightspeed Indexer to try and force a crawl on your pages. There are tons of others, and they all have their own benefits.
Most of these tools work by sending Pings to Search Engines, similar to Pingomatic.
Summary
When it comes to site crawlability, there are tons of different ways to solve any problem that you face. The trick for long term success will be figuring out which approach is best for your website’s requirements.
My advice to each individual would be this:
Make an effort to understand the basic construction and interconnectivity of the internet.
Without this foundation, the rest of SEO seems like a series of magic tricks. With it, everything else about SEO becomes demystified.
Try to remember that the algorithm is largely mathematical. Therefore, even your content can be understood as a series of simple equations.
With this in mind, good luck in fixing your site’s crawlability issues and if you’re still having problems, you know where to find us: The Search Initiative.
When it comes to technical analysis and implementation for client websites, Rowan Collins is your guy.