
Mar 09 2011

Duplicate Content Phantom: Don’t Be Duped, Be Informed

Duplicate content has always been a hot topic among webmasters, mostly because no one really knows what it is, so the rumors persist.

And Google doesn’t help much either. Sometimes I think of it as a hyperactive 3-year-old who is incredibly sharp in some areas, but not so much in others.

So the best way to go is to keep it simple, stay under the radar, and shoot for the middle of the road.

With that said, let’s figure out what duplicate content is, what it isn’t, and what you should do to stay on top of it.

What is duplicate content?

From the horse’s mouth:

“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. Examples of non-malicious duplicate content could include:

  1. Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  2. Store items shown or linked via multiple distinct URLs
  3. Printer-only versions of web pages”

Identical or substantially similar content.

Within your own domain or across others.

Most of it is normal and acceptable.

So far, so good.

Why Is Duplicate Content a Problem?

How would you like to search for the best pecan pie recipe, only to find that every single result on the first page turns out to be the exact same recipe?

Users don’t like seeing the same result repeated, and Google doesn’t like crawling the same content over and over.

For a search engine, it’s also a processing consideration. If there is substantial duplication, the crawl/indexation rates might be dampened. In short, the site can lose some ‘trust’.

Two Types of Duplicate Content

We all have our own ideas of duplicate content, and most of the time they boil down to: “Don’t republish the same article to multiple directories. Instead, spend countless hours spinning that same article to the point where it doesn’t make sense any longer and THEN publish it to a zillion and one directories. That will surely trick all the PhDs working for Google into ranking my site pretty highly.”

Now in the spirit of “being informed”, let’s take a look at the 2 types of duplicate content you see around, shall we?

  1. Cross-domain type: the one most people think of – the same content appearing (often unintentionally) on several external sites.
  2. Within-your-domain type: the one Google is actually most concerned about – the same content appearing (often unintentionally) in several different places within your own site.

Let’s now do a little more exploring into each type and see what Google really thinks about it.

Off-Site Content Syndication

There is absolutely nothing wrong with syndicating your content to different sites per se.

NOTHING WRONG WITH IT!

Here’s what happens when your content gets syndicated: Google will simply go through all the available versions and show the one that they find the most appropriate for a specific search.

Mind you, the most appropriate version might not be the one you’d prefer to have ranked. That’s why it’s very important that each piece of syndicated content includes a link back to your original post – which, I assume, lives on your own site. That way Google can trace the original version and will most likely (but not always) display it in its search results.

Per Matt Cutts:

I would be mindful that taking all your articles and submitting them for syndication all over the place can make it more difficult to determine how much the site wrote its own content vs. just used syndicated content. My advice would be 1) to avoid over-syndicating the articles that you write, and 2) if you do syndicate content, make sure that you include a link to the original content. That will help ensure that the original content has more PageRank, which will aid in picking the best documents in our index. (Source)
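
In practice, much of the syndication (and scraping) of blog content happens through your RSS feed, so that is an easy place to bake the link-back in. Here’s a minimal sketch, assuming a self-hosted WordPress blog; the filters are standard WordPress hooks, but the tgc_ function name and the wording of the attribution line are illustrative only:

```php
<?php
// A minimal sketch: append an attribution link to the feed version of every
// post, so any site republishing your RSS feed carries a link back to the
// original. The tgc_ function name and the wording are illustrative.
function tgc_feed_attribution( $content ) {
	$content .= '<p>Originally published at <a href="' . esc_url( get_permalink() ) . '">'
		. esc_html( get_bloginfo( 'name' ) ) . '</a>.</p>';
	return $content;
}
add_filter( 'the_content_feed', 'tgc_feed_attribution' );
add_filter( 'the_excerpt_rss', 'tgc_feed_attribution' );
```

Scrapers that strip links out won’t be stopped by this, but for legitimate syndication partners it quietly does exactly what Matt suggests above.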

Black Hat Syndication

However, here’s the other side of the content syndication coin: content deliberately duplicated across the web in an attempt to manipulate search engine rankings or to generate more traffic.

This results in repeated content showing up in the SERPs, upsets searchers, and forces Google to clean house.

“In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.” (Resource – Google Webmaster Tools Help)

On-Site Duplicate Content

On-site duplicate content problems are much more common and guess what: they are entirely UNDER YOUR CONTROL, which makes it very easy to fix them.

The first step to identifying the potential weak spots on your blog is learning more about your content management system.

For example, a blog post can show up on the home page of your blog, as well as on its category pages, tag pages, date archives, etc. – THAT’S the true definition of duplicate content.

We, the users, have the common sense to understand that it’s still the same post; we just get to it via different URLs. Search engines, however, see separate pages with exactly the same content = duplicate content.

How to Take Matters into Your Own Hands

Here are some practical “non-techie” steps you can take to minimize the presence of duplicate content on your site:

  1. Take care of your canonicalization issues. In other words, www.trafficgenerationcafe.com, trafficgenerationcafe.com, and trafficgenerationcafe.com/index.html are one and the same site as far as we are concerned, but 3 different sites as far as search engines are concerned. You need to pick your fave and stick with it. If you don’t know how, here are the instructions: WWW vs non-WWW: Why You Should Put All Your Links in One Basket
  2. Be consistent in your internal link building: don’t link to /page/ and /page and /page/index.htm – if links to your pages are split among the various versions, the PageRank gets diluted across them.
  3. Include only your preferred (canonical) URLs in your sitemap.
  4. Use 301 redirects: If you have restructured your site (for instance, changed your permalink structure to a more SEO-friendly one), use 301 redirects (“RedirectPermanent”) in your .htaccess file or, even simpler, use one of the many Redirection plugins available in your WordPress plugin directory.
  5. Use rel=”canonical” to point search engines at the preferred version of a page (a sketch follows this list).
  6. Use the parameter handling tool in Google Webmaster Tools.
  7. Minimize repetition: i.e. don’t post your affiliate disclaimer on every single page; rather, create a separate page for it and link to it when needed.
  8. Manage your archive pages: Avoid duplicate content issues by displaying excerpts on your archive pages instead of full posts. You really want to give your readers just a hint of the content and direct them back to the original posts. To accomplish that, open your theme’s archive.php and replace the_content with the_excerpt (see the sketch after this list). Hint: make sure your category and tag pages also display excerpts only.
  9. Country-specific content: Google is more likely to recognize that a .de domain indicates Germany-focused content, for instance, than a /de subdirectory or a de.example.com subdomain.
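
To make steps 4, 5 and 8 a bit more concrete, here are two rough WordPress-flavored sketches. The template tags and hooks are standard WordPress functions, but the file layout and URLs are hypothetical examples, and a decent SEO or redirection plugin will do most of this for you.

First, step 8: a simplified archive.php loop with the_excerpt() swapped in for the_content(). The markup around the loop is a bare-bones assumption – your theme’s will look different.

```php
<?php
// archive.php (simplified sketch) – step 8 from the list above.
// Showing the_excerpt() instead of the_content() keeps archive, category
// and tag pages from repeating the full text of every post.
if ( have_posts() ) :
	while ( have_posts() ) :
		the_post();
		?>
		<h2><a href="<?php the_permalink(); ?>"><?php the_title(); ?></a></h2>
		<?php the_excerpt(); // was: the_content(); ?>
		<?php
	endwhile;
endif;
```

And steps 4 and 5, roughly, in code – a sketch of what an SEO plugin (for the canonical tag) and a redirection plugin (for the 301) do on your behalf; the old and new URLs are made-up examples.

```php
<?php
// Step 5: rel="canonical". Recent WordPress versions already print this tag
// on single posts via wp_head(); a manual version in header.php could be:
if ( is_singular() ) {
	echo '<link rel="canonical" href="' . esc_url( get_permalink() ) . '" />' . "\n";
}

// Step 4: a 301 ("moved permanently") redirect from an old permalink to the
// new one – roughly what a redirection plugin sets up after you change your
// permalink structure. Both URLs here are hypothetical examples.
function tgc_redirect_old_permalink() {
	if ( false !== strpos( $_SERVER['REQUEST_URI'], '/2010/03/pecan-pie-recipe' ) ) {
		wp_redirect( home_url( '/pecan-pie-recipe/' ), 301 );
		exit;
	}
}
add_action( 'template_redirect', 'tgc_redirect_old_permalink' );
```

If that looks like more trouble than it’s worth, that’s exactly what the plugins are for – the takeaway is simply that every duplicate URL should either declare its preferred version or redirect to it.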

This is a good time to remind myself that I am not writing a novel…

Oh, wait a minute: one more important issue – the robots.txt file.

According to TopRankBlog.com:

Google doesn’t recommend blocking duplicate URLs with robots.txt, because if they can’t crawl a URL they have to assume it’s unique. It’s better to let everything get crawled and to clearly indicate which URLs are duplicates…. Robots.txt controls crawling, not indexing. Google may index something (because of a link to it from an external site) but not crawl it. That can create a duplicate content issue.
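
In other words, something like the following in robots.txt is usually the wrong tool for duplicate archive pages – blocked URLs can’t be crawled, so Google can’t tell they’re duplicates, yet they can still end up indexed through external links (the /tag/ path is just a hypothetical example):

```
# robots.txt – NOT the recommended fix for duplicate archive pages.
# Blocked URLs can still get indexed via external links, just never crawled.
User-agent: *
Disallow: /tag/
```

Letting those pages be crawled and signalling the preferred version instead – with rel=”canonical”, excerpts, or 301s as described above – gives Google the information it needs rather than hiding it.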

Let’s move on, shall we?

Is There a Duplicate Content Penalty?

I’ll have Google answer this daunting question.

Here’s a quote from Susan Moskwa, a Webmaster Trends Analyst at Google:

“A lot of people think that if they have duplicate content that they’ll be penalized. In most cases, Google does not penalize sites for accidental duplication. Many, many, many sites have duplicate content.

Google may penalize sites for deliberate or manipulative duplication. For example: auto generated content, link networks or similar tactics designed to be manipulative.”

Susan further explained when webmasters should not worry about duplicate content:

  1. Common, minimal duplication.
  2. When you think the benefit outweighs potential ranking concerns. Consider your cost of fixing the duplicate content situation vs. the benefit you would receive.
  3. Remember: duplication is common and search engines can handle it.

How exactly does Google handle it?

When pulling up search results, Google will basically collapse the duplicates, leaving only the most relevant page – in their opinion, of course – in the SERPs for that specific query. As I explained before, Google determines the most relevant result based on a myriad of factors, and the only thing you can do on your part is to always link back to your original post.

Scraping Be Gone!

A word on the recent Google algorithm change:

My post mentioned that “we’re evaluating multiple changes that should help drive spam levels even lower, including one change that primarily affects sites that copy others’ content and sites with low levels of original content.” That change was approved at our weekly quality launch meeting last Thursday and launched earlier this week. (Matt Cutts – source)

What does it mean to the average webmaster?

We can all do a little chicken dance, since the probability of scraped (stolen, in other words) content ranking above the original articles we put blood, sweat, and tears into is now minimal.

Google is rightfully going to war against all the autoblogs that don’t have what it takes to produce content of their own; all they do is republish other people’s work in hopes of ranking highly in search engines, bringing traffic to their crappy websites, and making some money off AdSense, paid advertising, and such.

Good riddance!

If you find that another site is duplicating your content by scraping (misappropriating and republishing) it, it’s unlikely that this will negatively impact your site’s ranking in Google search results pages. If you do spot a case that’s particularly frustrating, you are welcome to file a DMCA request to claim ownership of the content and request removal of the other site from Google’s index.

Duplicate Content Marketing Takeaway:

1. Accidental dupe content doesn’t cause your site to be penalized; deliberate, manipulative duplication can.

2. Google is getting better at picking the best version of your content to be displayed in SERPs and ignoring the rest.

3. Almost all dupe content issues are easy to fix and should be fixed.

4. Don’t worry, be happy – don’t be afraid, be informed.

Written By:


Ana Hoffman | Generate Targeted Web Traffic

Traffic, traffic, traffic… Can’t do without it, but don’t know how to get it? Ana does, and she freely shares how to generate targeted web traffic through effective SEO link building, blogging, and social media engagement on her Traffic Generation Cafe blog.

More Posts By Ana Hoffman

  • Wiehan Britz

    Good take on content duplication – as an SEO-noob, I can learn a lot from this – some of the terms do sound Greek to me though 🙂

  • http://twitter.com/tomwsi Thomas W. Petty

    Thank you for this excellent post. I just discussed this in my SEO meetup on Monday, and there were a lot of questions around it. Glad I wasn’t *too* far off in my answers. 🙂

  • I’d like to echo Thomas’ words. This is a clear, well-written explanation of duplicate content issues. I’m trying to fix them within my site too – it’s just a matter of being crystal clear in my own mind before explaining it to the developers, etc.

  • Vernoti

    I’m sorry but duplicate content is not gone and will never be gone. Just do a search for:

    “how to avoid google’s duplicate content filter” on Google and you’ll see 4 of the top 10 results in the SERPs are the exact same content.

    A quick read of Google’s patents details how hard this problem is to tackle. They take the middle of the road when analyzing a document. They compare snippets of a page that match for a particular query. They create a fingerprint for it. If that snippet fingerprint matches, then they “may” filter it.

    The larger the web gets, the harder this becomes. They have to compare more and more snippets. So it becomes a problem of scale. Google will never eradicate duplicate content.

  • How can we find duplicate content? For example, I just posted something on my blog and want to see if a duplicate exists out there – how can I find out?

  • http://twitter.com/Svelmoe Allan Svelmøe Hansen

    Thank you.
    I’m getting tired of explaining it to people who just read some blog post somewhere, written 4 years ago, claiming that duplicate content is oh-so-bad.

  • Guest

    thanks.. excellent post…

  • From experience, I have found that some of the sites we have worked on have duplicate content which is actually bringing more traffic to the client’s site – the pages all rank on Google, and I am certainly not going to advise that they be removed just because that’s what Google wants us to do.

  • http://twitter.com/TrenutakHr Siniša Gavrilović

    Duplicate content can make your site so big that Google can have crawl rate issues with it.
