Plugging the link leaks, part 2 - site canonicalisation

May 22, 2013

A couple of months ago we looked at how you can reclaim links that you are simply throwing away. For this second instalment on fixing common link leaks, we’re turning to site canonicalisation, and how easy it is to lose link authority through simple duplicate URLs.

As before, this is a simple way many sites lose valuable links without even realising. By plugging these leaks, we’ve helped clients gain an instant boost in link authority, without irking anyone on Google’s anti-spam team.

The many faces of a homepage

The best place to start when digging for link leaks is actually one of the checks you should carry out when looking at a new or prospective client for the first time – how many ways can you find the site homepage? Type your domain name into your browser, and then do the same for all the possible variations:

  • (or index.html, .aspx, .php or whatever the CMS/server uses)

Does the homepage come up for more than one of these? If so, take a look in Google using a site: search for each to see if they have been indexed. If a variation redirects, does it use a 301 redirect, which indicates the page has permanently moved? You can check with a tool such as the Redirect Path Chrome extension or an HTTP header checker. If the redirect is not a 301 (for example, a 302 has been used), check whether the redirecting URL has been indexed.
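If you would rather script this check than click through each variation, the following is a minimal Python sketch of the idea. The domain `example.com`, the list of index file names and the helper names (`homepage_variants`, `fetch_status`, `classify_response`) are all assumptions for illustration; swap in whatever your CMS or server actually uses. It only reads the first response for each URL and does not follow redirects, so you see the status code a crawler would see.

```python
import http.client
from urllib.parse import urlsplit


def homepage_variants(domain, index_files=("index.html", "index.php")):
    """Build the homepage URLs worth testing for a domain (hypothetical helper)."""
    bare = domain.lower()
    if bare.startswith("www."):
        bare = bare[4:]
    hosts = [bare, "www." + bare]
    paths = ["/"] + ["/" + name for name in index_files]
    return [f"http://{host}{path}" for host in hosts for path in paths]


def fetch_status(url, timeout=10):
    """Return (status, Location header) for a URL without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify_response(status, location=None):
    """Translate a status code into the follow-up action described above."""
    if status == 200:
        return "serves content: check whether this URL is indexed"
    if status in (301, 308):
        return f"permanent redirect to {location}"
    if status in (302, 303, 307):
        return f"temporary redirect to {location}: check whether this URL is indexed"
    return f"unexpected status {status}: investigate"


if __name__ == "__main__":
    # example.com is a placeholder; this part needs network access.
    for url in homepage_variants("example.com"):
        print(url, "->", classify_response(*fetch_status(url)))
```

Ideally only one variation reports that it serves content, and every other one reports a permanent redirect to it. Bear in mind that some servers answer HEAD requests differently from GET requests, so treat the output as a first pass rather than a verdict.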

Why does this matter? Well, for all Google’s power and expertise, in some ways it works in a very simple, black-and-white manner. As each of these different versions of our homepage could, in theory, contain different content, they are regarded as different pages by the search engines.

This means each of these pages can be considered duplicate content, as Google and the other major search engines see two, or more, pages with the same content. Even worse, this is bad for SEO efforts, as it dilutes your inbound links between these different URLs; if someone links to a different variation of your homepage to the one you expect, that link equity simply leaks away. Got links pointing to the wrong version? They don’t count.

An example

But just how bad can this be? Don’t most people simply link to the main version?

Our site's homepage has a large number of links to both the www and non-www versions

Oh. That’s bad. That’s really bad.

Here you can see with a quick search in Open Site Explorer (and it works just as well with MajesticSEO and Ahrefs) a site that is losing a ton of link equity to its homepage by having links to both the www and non-www versions. A look for the brand name in Google and Bing reveals they have both selected the non-www version to show in their results pages.

If we look at the links for the entire subdomain in MajesticSEO and Open Site Explorer it gets worse.

A look in OSE shows there are many links to the www sub-domain

MajesticSEO also shows links to the www sub-domain that are 'leaking' away

This site is leaking links not just to the homepage, but to multiple pages, and losing huge amounts of link equity.

Once you’ve checked your homepage, and looked to see how many link leaks have sprung up as a result, you might want to check whether these variations have occurred on any inner pages. This is less common than on the homepage, but still something we see regularly.

Firstly, to check internally, do a crawl of the site using a tool such as Screaming Frog and navigate to the page titles tab. Here you can filter for duplicate title tags; if you have duplicates, check to see if these are variations of the same URL. If any are, that means that somewhere on the site you have internal links pointing to different URLs for the same resource.
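If you prefer to work from an exported crawl rather than the tool’s own filter, the same check is easy to sketch in Python. This is an illustration only: it assumes you already have the crawl as a list of `(url, title)` pairs, and the function name `duplicate_titles` is our own.

```python
from collections import defaultdict


def duplicate_titles(pages):
    """Group crawled URLs by title and keep only titles used by more than one URL.

    `pages` is an iterable of (url, title) pairs taken from a crawl export.
    """
    urls_by_title = defaultdict(list)
    for url, title in pages:
        urls_by_title[title.strip()].append(url)
    return {title: urls for title, urls in urls_by_title.items() if len(urls) > 1}


# Made-up crawl data for a placeholder domain.
crawl = [
    ("http://example.com/", "Home"),
    ("http://example.com/index.html", "Home"),
    ("http://example.com/about/", "About us"),
]
for title, urls in duplicate_titles(crawl).items():
    print(title, urls)
```

Each group it returns still needs a human eye: two URLs sharing a title may be genuine duplicates of one page, or simply two different pages with a lazy title tag.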

Then, to see if Google is finding duplicate versions, take a look in Google Webmaster Tools. Navigate to the HTML suggestions tool in Diagnostics, and look for duplicate title tags and meta descriptions, again checking to see if there are any pages with multiple URLs for the same content. Wherever you find different URLs that reach the same page, make a note of them for later.

Checking Webmaster Tools for duplicate versions of the same page

Fixing the link leaks

Regardless of where you found your link leaks, you want to plug them. Fortunately, in most cases this is a relatively straightforward job.

The best solution to this kind of link leak is to set up your site’s architecture so that duplicate URLs cannot occur. With many CMSs and site set-ups, however, this can be tricky.

So, the next alternative is to put in place site-wide redirects that automatically redirect to a consistent version of each URL. Canonical issues you should be sure are correct include:

  • Redirecting either www or non-www version (whichever you are not using)
  • Redirecting /index.html, or equivalent, URLs
  • Ensuring all URLs use the same characterisation – normally set as lowercase
  • Use of trailing slash at the end of a URL – do you want to use or not?

Doing site-wide redirects this way allows you to capture common duplicate URLs in one stroke. They are not always possible, however, in which case you’ll have to set up individual redirects instead. If so, make sure you capture every duplicate URL that is linked to internally or that appears in Google’s and Bing’s indices. Site: searches and a check of Google Analytics landing pages can help here, as can a thorough check of backlink targets from your tool of choice.
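To make the four site-wide rules listed above concrete, here is a small Python sketch of a normalisation function that applies them in one place. It is not a server configuration, just a way to see what a consistent rule set does to each variation, or to work out redirect targets for a list of duplicates. The function names, the default preferences (non-www, trailing slash) and the list of index file names are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed index file names; adjust to whatever your CMS/server uses.
INDEX_FILES = {"index.html", "index.htm", "index.php", "default.aspx"}


def canonical_url(url, use_www=False, trailing_slash=True):
    """Return the single preferred form of a URL under the chosen rules."""
    parts = urlsplit(url)

    # Rule 1: one host only, either www or non-www.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    if use_www:
        host = "www." + host

    # Rule 3: consistent case. Only safe if your real paths are all lowercase.
    path = parts.path.lower() or "/"

    directory, _, last_segment = path.rpartition("/")
    if last_segment in INDEX_FILES:
        # Rule 2: /index.html and equivalents collapse to the directory URL.
        path = directory + "/"
    elif "." not in last_segment and path != "/":
        # Rule 4: consistent trailing slash on page URLs (files are left alone).
        path = path.rstrip("/") + ("/" if trailing_slash else "")

    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def redirect_target(url, **preferences):
    """Return where a URL should 301 to, or None if it is already canonical."""
    target = canonical_url(url, **preferences)
    return target if target != url else None


print(redirect_target("http://WWW.Example.com/Index.HTML"))  # http://example.com/
print(redirect_target("http://example.com/about/"))          # None
```

Whichever preferences you pick matter less than applying the same ones everywhere; the point of keeping the rules in one function is that every variation ends up at exactly one URL.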

There is not necessarily a right or wrong URL version to use; the important thing is to consolidate all the options into one URL for each page, and to be consistent. In most cases the most powerful, consistent format will be clear. 301 redirects need to be created so that each page or resource can only be reached by one URL. You want to set it up so that if someone links to, or types in, a different variation of a URL, they are 301 redirected to the version you have chosen. By using 301 redirects, all or nearly all of any link juice that page has accrued will be passed along to the target URL.

In the example above, we added a redirect sending the www versions of each page to the non-www version, and immediately got an extra 120 linking domains for the site’s homepage.

There are, however, times when, sadly, implementing 301 redirects is not possible, either through lack of access to a client’s developer, or CMS limitations. When this occurs, our third choice of solution is rel=canonical tags.

Rel=canonical should only be used if other options are not possible, or in conjunction with 301 redirects if you are worried that there are so many variations of a URL out there that you want a ‘belt and braces’ approach.

The rel=canonical tag gives a strong indicator to the search engines that we want the URL in the tag to be regarded as the primary version of this page, and other variants (marked with the tag) to be regarded as duplicates, not to be indexed. This often works extremely well, but does have a couple of disadvantages.

Firstly, whilst it is a strong hint, and Google will generally honour the canonical request, it does not have to, should it have reason. Google has said that it is better to fix site issues in the first place, or use 301 redirects. Secondly, using this method means that these pages can still be crawled, so you are wasting crawl budget on pages that you don’t even want Google to index. Finally, there is no guarantee that other search engines will listen to your canonical tag, and that might be important in certain countries your site may be targeting.

However, despite this, the rel=canonical tag can be helpful if you genuinely can’t redirect your duplicate URL variations. Not only will Google generally index your preferred URL, they will also try and consolidate all the links pointing to the different URL variations and apply them to the specified canonical address, getting you back your lost link authority. As Google put it themselves in this handy recent post on rel=canonical mistakes:

The rel=canonical link consolidates indexing properties from the duplicates, like their inbound links, as well as specifies which URL you’d like displayed in search results.
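If you do rely on rel=canonical, it is worth confirming that each duplicate variation actually carries the tag you expect. The sketch below uses Python’s standard `html.parser` to pull the canonical URLs out of a page’s HTML; the class and function names are our own, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values and attributes.get("href"):
            self.canonicals.append(attributes["href"])


def find_canonicals(html):
    """Return all canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


# Made-up markup for a placeholder domain.
page = '<html><head><link rel="canonical" href="http://example.com/page/" /></head></html>'
print(find_canonicals(page))  # ['http://example.com/page/']
```

An empty list means the page has no canonical tag at all, and more than one entry means the page is sending conflicting signals; either is worth flagging to whoever maintains the templates.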

Next you want to check your Screaming Frog internal crawl. Go to the Advanced Export option in the main navigation and select ‘Success (200) In Links’. Take this export into Excel, and turn it into a table (Ctrl + T). Filter the Destination column by the duplicate URLs you noted earlier to find all the internal links pointing to each non-canonical variation. You can then pass this on to your client or developer so all the incorrect internal links can be fixed. The least you can do is make sure you are not directing any users, or search engines, to the wrong URLs!
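If Excel is not to hand, the same filter can be sketched in a few lines of Python with the standard `csv` module. Only the Destination column is named above; the `Source` column name, the function name and the sample rows are assumptions, so check them against the header row of your own export before relying on this.

```python
import csv
import io


def links_to_fix(export_file, duplicate_urls, source_col="Source", dest_col="Destination"):
    """Return (source, destination) pairs for internal links to duplicate URLs.

    `export_file` is an open CSV file object; the column names are assumptions.
    """
    duplicates = set(duplicate_urls)
    reader = csv.DictReader(export_file)
    return [
        (row[source_col], row[dest_col])
        for row in reader
        if row[dest_col] in duplicates
    ]


# Made-up export for a placeholder domain.
sample = io.StringIO(
    "Source,Destination\n"
    "http://example.com/about/,http://example.com/index.html\n"
    "http://example.com/about/,http://example.com/\n"
)
for source, destination in links_to_fix(sample, ["http://example.com/index.html"]):
    print(f"{source} links to {destination}")
```

With a real export you would pass `open("inlinks.csv", newline="")` (file name hypothetical) in place of the in-memory sample.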

Finally, if you haven’t already, verify both the www and non-www versions of your site in Google and Bing Webmaster Tools. This will allow you to spot issues the search engines are having with your site. In Google Webmaster Tools you can also explicitly request which URL variation you want to show in the SERPs, but only if you’ve proved you own both versions (another clue, if the above was not enough, that different sub-domains are regarded as separate sites).

Choosing your preferred domain in Google Webmaster Tools

And that’s it. A relatively straightforward process, but as the examples above prove, one that can reveal a rather tasty number of links that are simply leaking away. As with link reclamation, for only a couple of hours’ work you can give your site a boost of link authority that you’ve already earned.

Do you have any examples of helping a site that was leaking links this way? Or advice on how to check, or remedy, multiple URLs for the same resource with a particular CMS?

By Charlie Williams