Fixing Issues With Duplicate Content

Published: 6 September 2012 

Many website owners and web marketers are aware that duplicate content is seen as an issue by search engines such as Google. There are two types of duplicate content: internal and external. Internal duplication is the easiest to prevent or rectify but can be the hardest to find. External duplication can be found through online services such as Copyscape, which also offers an API if you would like to automate the checks. If you find another website that is suspiciously similar to your own, look it up on archive.org to see when the copied content first appeared. Contacting the owner of the website is the first port of call, but if there is no prompt response (in a lot of cases there won't be), file a DMCA request to have the offending pages removed from Google's index. On 10 August this year Google announced that valid copyright removal notices are a new signal in its ranking algorithm, which suggests it isn't taking the issue lightly. The original announcement can be seen here.

The three most popular methods of handling on-site duplication are 301 redirects, rel=canonical and meta noindex. Blocking a duplicated page in your robots.txt file has also been used in the past, but this does not stop search engines from indexing the URL; it only stops their bots from crawling the page and showing its content.

301 redirects are an effective way of solving the issue as they permanently redirect one page to another, passing most of the page's strength along with it. If you have an Apache web server this can be done by adding or editing the .htaccess file over FTP with a program such as FileZilla.

N.B. .htaccess files are hidden by default. To view them in FileZilla, click 'Server' in the menu bar and then 'Force showing hidden files'.

The most common type of on-site duplication is a canonicalisation issue, where one URL needs to be picked as the best representation of a given piece of content. Adding the following code to your .htaccess file will redirect all non-www pages to their www versions:

# Turn on Apache's rewrite engine
RewriteEngine On
# If the requested host does not start with www...
RewriteCond %{HTTP_HOST} !^www\.
# ...301-redirect to the www version, keeping the requested path
RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]
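If you would rather standardise on the non-www version of your domain, the reverse of the rule above works along the same lines. This is a sketch to be used instead of the block above, not alongside it; %1 refers to the host name captured in the RewriteCond:

# Turn on Apache's rewrite engine
RewriteEngine On
# Capture everything after the leading www. in the host name
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# 301-redirect to the bare domain, keeping the requested path
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]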

Another common issue is when the home page can be accessed from the bare domain name as well as its file URL, e.g. www.domain.com.au and www.domain.com.au/index.php.

To eliminate this problem, add the following lines below the code above. The RewriteCond checks the original request from the browser, so the rule only fires when a visitor (or bot) asks for index.php directly and does not loop when Apache serves index.php internally for the root URL:

RewriteCond %{THE_REQUEST} ^[A-Z]+\s/index\.php
RewriteRule ^index\.php$ http://%{HTTP_HOST}/ [R=301,L]

If your home page file is different from index.php, replace the file name in both lines with the appropriate one, e.g. RewriteRule ^home\.html$

The rel=canonical tag is a good substitute for a 301 redirect because it passes a similar amount of authority from one page to another and is usually easier to implement. Add the following code to the head of your duplicate page to pass its strength to the version you prefer:

<link rel="canonical" href="http://www.domain.com.au/page/" />
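The same hint can also be sent as an HTTP header, which is handy for non-HTML files such as PDFs that have no head section. A minimal .htaccess sketch follows, assuming mod_headers is enabled; 'brochure.pdf' and the URL are placeholders for your own file and preferred page:

<Files "brochure.pdf">
# Point the PDF at the preferred HTML version of the content
Header add Link "<http://www.domain.com.au/page/>; rel=\"canonical\""
</Files>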

In the second paragraph I mentioned that adding pages to the robots.txt file is not a good way of addressing duplicate content issues. The way to stop robots from indexing a page is to include a meta noindex tag in the head of your source code, as follows:

<meta name="robots" content="noindex, follow" />
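If you cannot edit the HTML, or the duplicate is not an HTML page at all, the equivalent instruction can be sent as an X-Robots-Tag HTTP header. Another minimal .htaccess sketch, again assuming mod_headers is enabled and with 'printable.pdf' as a placeholder file name:

<Files "printable.pdf">
# Keep this file out of the index but let its links be followed
Header set X-Robots-Tag "noindex, follow"
</Files>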

There are more ways to tackle this, such as configuring URL parameter handling in Google Webmaster Tools, but the basic steps are listed above. Take a look at this post to learn more about SEO and how you can improve your skills on an ongoing basis.

Thanks for reading!

Ben Maden

