July 2006 - Duplicate Content: What You Ought
to Know About
by Oleg Ishenko
Take a look at your website. How much of your content might be considered
duplicate by a search engine algorithm? Even if you never copy anyone,
you can't answer 'none', because someone may be copying you. Duplicate
content is one of the biggest issues both for search engines trying
to keep their results relevant and for webmasters trying to avoid
search engine penalties.
Penalties for having duplicate content can be really harmful. The result
is not just a downgrade in rankings but a move to the supplemental results,
which are hardly visible to most web users. Normally it is
expected that Google would select one URL over another to display in
SERPs, while duplicates could be found in supplemental results. Unfortunately
this is not always so. In the thread "Duplicate content observation" in
the WebmasterWorld.com forum you can read about a case in which an original,
high-quality and authoritative page was removed from Google's index
together with its duplicates. Considering that this can happen even
to the most honest webmaster, one can imagine the amount of attention
this issue gets on any SEO forum.
Types of Duplicate Content
Duplicate content has a wider definition than 'copy-paste' plagiarism;
it is not just content scraped from a competitor's site, a SERP or
an RSS feed. Apart from this there are a few more cases that are generally
referred to as duplicate content.
Circular Navigation
Jake Baille from TrueLocal vaguely defines circular navigation as
having multiple paths across a website. This can be understood as the
same content being accessible via different URLs. An example of
circular navigation could be an article that is retrieved by links like
- mysite.site/articles/1/
- mysite.site/article1/
- mysite.site/articles.php?id=1
Another legitimate use of multiple URLs is forum threads. Each thread
can be accessible by a link like myforum.site/index.php/topic.1201.html,
and each message within the thread has a URL like
myforum.site/index.php/topic.1201.msg.01.html. In the eyes of a search
engine all these links lead to different pages with identical content.
The solution? Think of a consistent way of linking, or apply robots.txt
exclusion rules.
This can also be the case when other people link to you using
different-looking URLs. Since these external links are out of your control,
you should create a 301 redirect from each variant to the canonical URL
you want to be displayed.
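As a rough illustration of the redirect idea, the sketch below maps the three article URL forms listed above onto a single canonical form. The choice of /articles/<id>/ as the canonical URL, the function name canonical_redirect and the 404 fallback are assumptions made for this example only; a real site would usually configure this in its web server or framework.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical canonical form chosen for this sketch: /articles/<id>/
CANONICAL = "/articles/{id}/"


def canonical_redirect(url):
    """Return (status, location) for a requested URL.

    200: the URL is already canonical.
    301: the caller should redirect permanently to `location`.
    404: none of the known article URL forms matched.
    """
    parts = urlsplit(url)
    path = parts.path

    # Already in canonical form, e.g. /articles/1/
    if re.fullmatch(r"/articles/\d+/", path):
        return 200, path

    # Variant form, e.g. /article1/
    match = re.fullmatch(r"/article(\d+)/?", path)
    if match:
        return 301, CANONICAL.format(id=match.group(1))

    # Query-string form, e.g. /articles.php?id=1
    if path == "/articles.php":
        ids = parse_qs(parts.query).get("id", [])
        if ids and ids[0].isdigit():
            return 301, CANONICAL.format(id=ids[0])

    return 404, None
```

With these rules, every variant answers with a permanent redirect to one address, so links pointing at any of the three forms end up crediting the same page.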
Printer-Friendly Versions
Making a printer-friendly version is a common practice, and it adds
value for visitors. But a printer-friendly version is also a prominent
example of duplicate content! Fortunately a simple solution like adding
a 'noindex' meta tag to your print pages solves the issue.
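To make the 'noindex' idea concrete, here is a minimal sketch that shows what such a tag looks like inside a print page and checks for its presence using Python's standard html.parser module. The sample page, the class name RobotsMetaParser and the helper is_noindex are made up for this illustration; how a given search engine actually interprets the tag is up to that engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow"
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def is_noindex(html):
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical printer-friendly page carrying the tag in its <head>
PRINT_PAGE = """<html><head>
<meta name="robots" content="noindex, follow">
<title>Article 1 (print version)</title>
</head><body>...</body></html>"""
```

A check like is_noindex(PRINT_PAGE) could be run over all print templates to confirm that none of them was published without the tag.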
Product-Only Pages
Similar-looking product pages are common among online stores. Typically
they are created using a single template. Often two different product
pages share a description that varies in just a few words or numbers,
which causes them to be filtered out as duplicate content. This issue
has no easy solution. Either you rewrite robots.txt to allow only one
product description to be crawled and lose search engine traffic to the
rest of them, or you roll up your sleeves and add something different to
each product page, like testimonials, which is time-consuming or nearly
impossible depending on the number of product types in your stock.
How Do Duplicate Content Filters Work?
There are several data mining algorithms that aim to detect similar
text passages. The one claimed to be used by search engines is w-shingling.
Each document is represented by its set of shingles: the contiguous
subsequences of w tokens (blocks of text) that occur in it. The ratio of
the size of the intersection to the size of the union of two documents'
shingle sets can be used to measure their resemblance. Another algorithm
that can be used for duplicate detection is the Levenshtein distance, the
minimum number of single-character edits needed to turn one string into
another.
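Both measures can be sketched in a few lines of Python. This is only an illustration of the general techniques, not the filter any search engine actually runs; the shingle width w=3, whitespace tokenization, lowercasing and the function names are all assumptions made for the example.

```python
def shingles(text, w=3):
    """Return the set of w-token shingles of a text (tokens = words)."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Text shorter than one shingle: treat it as a single shingle
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=3):
    """Size of the shingle intersection divided by size of the union."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string s into string t."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

For instance, "the quick brown fox jumps" and "the quick brown fox sleeps" share two of their four distinct 3-word shingles, giving a resemblance of 0.5, while levenshtein("kitten", "sitting") is 3. A filter built on such measures would still have to pick a threshold above which two pages count as duplicates.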
It is natural to expect a duplicate content filter to be able
to discover the original source and rank it higher. The simplest way to
detect the original would be to compare indexing dates, on the assumption
that the original is uploaded and crawled earlier than its copies. But
with the advent of RSS feeds, new content can be distributed
almost instantaneously, and this approach is no longer valid.
As for the original's right to be ranked higher, this is not always
implemented. In this article you
can read about an article distribution experiment. An article
was syndicated twice, producing as many as 19,000 copies. After some time
Google, Yahoo and MSN purged their indices, leaving just a few of
the duplicates. MSN's filter managed not only to discover the original
but also to put it at the top of the search results. Yahoo also discovered
the original, but in the results for a search on the article's title its
position fluctuated, apparently reflecting the way Yahoo weighs
relevancy and authority.
To the tester's amusement, Google's refined index did not include the
original at all! Evidently Google featured only those pages with copies
of the article which it considered relevant and authoritative, with
no regard to the original source of the content! I've already mentioned
a thread where a similar problem is discussed. Both stories took
place in 2005 and early 2006, and so far I have found no evidence that this
issue has been resolved.
Originally published at "Duplicate
Content: What You Ought to Know About".
About the Author
Oleg Ishenko, SEOResearcher.com. For more information on search engine
optimization and marketing, check out our SEO
Training Materials website.
Note: These articles do not represent the advice or opinions of
Apollo Hosting. They represent the thoughts, advice and opinions of
the individual authors.