Googleplex: The Duplicate Content Issue


What Is Duplicate Content?

If you’ve ever posted anything online, chances are that Google, the world’s most efficient search engine, stored your article in its database, along with every phrase in that article and even the individual words used. The process is called indexing. Think of the Google index as an encyclopedia of everything on the internet. To demonstrate just how efficient Google is, searching on the word “a” will produce a listing of all the web pages that include it somewhere in the text, in this case 18 ½ billion results; “the” yields 14 ½ billion. A very inclusive system. When Google sees substantial blocks of content within a site, or between sites, that are either identical or judged by its comparison criteria to be suspiciously similar, the material is regarded as duplicate content.
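The indexing described above is commonly pictured as an inverted index: a map from each word to the documents containing it, so that looking up even a word like “the” is instant. Here is a minimal sketch in Python; the three sample pages are invented for illustration, and Google’s actual implementation is of course far more elaborate:

```python
# Minimal inverted index: map each word to the set of pages containing it.
from collections import defaultdict

documents = {
    "page1": "the quick brown fox",
    "page2": "the lazy dog",
    "page3": "a quick dog",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Searching for a word returns every page that contains it.
print(sorted(index["the"]))    # ['page1', 'page2']
print(sorted(index["quick"]))  # ['page1', 'page3']
```

Comparing the word (or phrase) sets of two pages is also what makes it cheap for a search engine to flag “suspiciously similar” content.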

Why is duplicate content an issue?

Around ten years ago, Google’s founders came up with the idea of organizing the seemingly infinite amount of information on the web. Google soon became the search tool of choice, and “googling” entered the Urban Dictionary as a near synonym for “searching.” The benefit is immediate for the user: with the ability to seek out identical expressions across web pages, research is at the fingertips.

Before the net became a megasaurus, magazines, books, and newspapers were the standard way of placing one’s writing before the public. In my early efforts to reach the Holy Grail of publication, I sent dozens of short stories to various “small and literary” magazines and used Writer’s Market, with its listing of thousands of magazines, as my publication Bible. What is curious is that many magazines stated that the author would be allowed to publish the same material elsewhere, or that they would accept previously published material.

Through a major paradigm shift, we need to think of posting articles on the Internet as the new form of publication. And, as with any new technology, something is lost. The Internet is far less tolerant of the hard-copy replication policy just mentioned. Benevolent in intent, the idea is to prevent sites from simply lifting others’ material and claiming it as their own. What could be wrong with that? It isn’t the intent but the unintentional effects of indexing that are the big problem. On the internet, duplicated content has become public enemy #1.

Good intentions gone south

With minimal knowledge, anyone with an internet service provider can install a WordPress website and post several articles in fewer than ten minutes. Fine, but who will see your new site? No one but Aunt Nellie and a couple of geeky friends, because its ranking in the search process will be in the billions.

Google yields search results according to popularity, or more correctly by authority, in its effort to serve the public’s need for better material. Authority is determined by the power of links to a site, and the authority of those linking sites is, in turn, calculated the same way, by the strength of their own links. Thus, a site promoted by the New York Times, with its own connections from all over the world, would have more authority than one with backlinks from the Mayberry Gazette and the Hayfield Hollow Register, because those two sources have little authority in themselves.
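This recursive notion of authority resembles the PageRank algorithm Google has described publicly: each page’s score is fed by the scores of the pages linking to it, computed iteratively. A toy version follows; the three-site link graph is invented purely for illustration and is not how Google actually weights anything:

```python
# Toy PageRank: a page's authority is the damped sum of the authority of
# the pages linking to it, divided by each linker's outbound link count.
links = {
    "paper_a": ["blog"],   # two papers link to the blog
    "paper_b": ["blog"],
    "blog": ["paper_b"],   # the blog links back to one paper
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start with equal scores
damping = 0.85

for _ in range(50):  # iterate until the scores stabilize
    new_rank = {}
    for p in pages:
        inbound = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * inbound
    rank = new_rank

# The blog, linked from both papers, ends up with the highest score.
best = max(rank, key=rank.get)
```

The point of the iteration is exactly the article’s point: it is not how many links you have, but how authoritative the linkers themselves are.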

Google ranking, meaning higher placement in search results, has become serious business, literally, as more people use the internet for shopping, finance, and, of course, notoriety. The higher the rank, the better the chance of making money, receiving grants, gaining fame, or meeting any other standard of success we can think of. A whole industry, known as search engine optimization (SEO), has sprung up to this end. The object of this high-tech fame game is to be ranked within the first three pages of results, which amounts to the first thirty to forty items. SEO wisdom indicates that people won’t look farther than that third page.

Here enters the monkey wrench in the Google machine: duplicate content. Sites reproducing content, whether article text or even videos, will be penalized. In effect, points aren’t deducted; rather, the duplicated pages are filtered out of the search along with their stash of backlinks. Thus, a site that prides itself on 255 pages of excellent material garnered from other sources will be seen in a Google search as having no content, quite a bad omen for its ranking.

And this is true even for a shopping site quoting descriptions from its products’ manufacturer. A workaround called “canonicalization” does exist, which allows the webmaster to consolidate product information and avoid the penalty, but the process is even more tedious than its pronunciation and is in no way friendly to the casual internet marketer.
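Mechanically, canonicalization works by having each duplicate page declare a single preferred URL in a `<link rel="canonical">` tag in its HTML head, which the major search engines document support for. A small sketch of reading such a tag with Python’s standard library; the page markup and the example.com URL are invented for illustration:

```python
from html.parser import HTMLParser

# A product page that duplicates the manufacturer's description can point
# search engines at one preferred ("canonical") URL instead.
page = """
<html><head>
<link rel="canonical" href="https://example.com/widget">
</head><body>Manufacturer's widget description...</body></html>
"""

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag seen."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical" and self.canonical is None:
            self.canonical = attrs.get("href")

finder = CanonicalFinder()
finder.feed(page)
print(finder.canonical)  # https://example.com/widget
```

The tedium the article complains about comes from having to add, and keep consistent, one such tag on every duplicated page.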

Five major problems

These come to mind as most important, but with a little thought, more difficulties will become apparent.

  • 1 — Material cannot be posted on one’s own site after it is posted elsewhere. A prolific writer naturally wants to promote his work on his own turf. But if he or she has guest posted, that is, had the work published on other websites, the author is not allowed to place it on his own site. Today’s Ernest Hemingway, with his own www.hemingway.com featuring his family photos, travel trophies, and complimentary biographical information, would be prevented from featuring any of his short stories on his domain if they had been posted elsewhere.

The converse happened to me before I began investigating the aberrations of the Google machine. Posted on my own website were several quality, general interest articles. When I offered them to a prominent site, I was summarily refused because they had been indexed.

  • 2 — In the above scenario, I volunteered to remove the articles from my site, but not even this was acceptable because, as the stern, SEO-savvy webmaster informed me, the articles had already been indexed. The only way I’ve found that material can disappear from the index is with the disappearance of the website itself, a rather high price to pay for Internet publication. And I’m not certain how long it takes Google to catch on to this change of status.
  • 3 — And here is the worst state of affairs I can think of. Suppose you’re a submissive type who plays by Google rules, and a hefty, high-authority website has lured you, with your eager compliance, into guest posting a dozen quality articles. Each article, and every phrase within it, has been recorded by the Google engine. Now, for whatever reason, you and your “publisher” have a falling out. This once sympathetic site decides to remove all your work from its website. 1) You still can’t post the material at your own location, because Google would penalize you for the already indexed, and therefore duplicate, material. 2) The indexing prevents you from offering your articles to other sites. The hours of work and thought are wasted; no one will read your material or even be aware of it as it sits in the purgatory of the Google vault. And you have no legal recourse regarding republication, none. In hard-copy publication there are contracts that, no matter how unfair, yield some type of justice. Not so on the internet, at least not in this situation.
  • 4 — The net has no effective legal system. International powers are working on rules, but for now and the foreseeable future, it’s the Wild West. Of course, the only answer is to carry a sidearm and do justice to your own work by posting it on your own site. But this flies in the face of Google indexing etiquette. Anyway, the articles will be dismissed so far as inclusion in site ranking goes, and their very limited visibility to family and friends will yield small comfort.
  • 5 — The search and ranking algorithms are so complex that no source I’ve come across can do more than generalize about them, and even these ideas result from inference, not inspection. The Google mechanism can be compared to a vast government oversight agency with the security of Fort Knox. We don’t vote on or take part in its implementation; we simply abide by the rules or suffer the consequences.

You can’t hide . . . nearly

The Google search engine is monumentally efficient. If your material is online, it’s branded, indexed, period. Right? Well, not necessarily. In hard-copy days, you could send a 250-word-per-page manuscript in a 9”x12” manila envelope for inspection and possible acceptance for publication. The editors could see your work. But how can anyone inspect your material on the Internet without its being regarded as already published? Sending emails to others with attached works is ineffective. Attachments are suspect; many people flatly refuse to open them for fear of viruses. And emails are much more easily dismissed than the old-fashioned, foot-long manila package.

The good news is that, recently, several market-like services have sprung up which allow for article inspection sans indexing. Articles can be uploaded in full and viewed by webmasters, who can offer to publish them. This is a cyber-godsend. One such service is Cathy Stucker’s BloggerLinkUp. Another, the one I use, is Ann Smarty’s MyBlogGuest. I’ve received many offers to post my material and have added others’ works to my own site through this service.

For WordPress users, another option is available which allows for local publication invisible to Google or Yahoo engines. WordPress has a privacy setting in its dashboard which states: “I would like to block search engines, but allow normal visitors.” This means you can post to your heart’s content, send anyone to your website to examine your articles, and these remain untainted by indexing. I’ve set up a location for this purpose which I named, appropriately, “Invisible.” Why would I do this if I can simply list articles with the aforementioned services? I think of it as a showcase. My personal information, purposes in writing, and maybe a few photos are there for any to see. If a site I like picks up an article, I direct the webmaster to my “Invisible” site for other possibilities, kind of like pitching a screenplay with several other available plots.
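Under the hood, a “block search engines” setting like this typically works through the robots exclusion protocol: a `noindex` robots meta tag on each page, or a blanket `Disallow` rule in the site’s robots.txt (I’m describing the general mechanism here; WordPress’s exact output may vary by version). Python’s standard library can show what such a rule means to a crawler; the URL below is, of course, a placeholder:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks all crawlers to stay out of the entire site,
# the kind of rule a "block search engines" setting generates.
rules = """
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A well-behaved crawler is told not to fetch any page; a human visitor
# with a direct link is unaffected, since robots.txt is purely advisory.
allowed = parser.can_fetch("Googlebot", "https://example.com/invisible-article")
print(allowed)  # False
```

That advisory nature is exactly why the “Invisible” site works: Google’s crawler honors the rule and never indexes the articles, while anyone you send the link to can still read them.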

Out of control

Science fiction author William Gibson, the creator of the word “cyberspace,” commented that in technology we are moving as fast as we possibly can with utterly no idea of where we’re headed. The Google Internet dictatorship is a perfect example. In our quest for easy access to information, we’ve sacrificed, in a real sense, much of our freedom of expression. As we press for more security, our options shrink. What can be done? By us, nothing. We only drive along the internet highways; we don’t build them. Not even protests, like this one, can make a difference. The whole situation, including the duplicate content dilemma, is in the hands of Google.

Further information

Much of the above is based on excellent material at www.webconfs.com. Some reading there will give you a much better grasp of the way search engines function and the concept of duplicate content.

Mike is a retired high school English teacher living in Florida. He devotes most of his time to working on his websites and writing for Artist’s Inlet Press. His writing tends toward social criticism and common-sense analysis of website content quality. Mike’s hero is Jack Kerouac.

Website: http://www.artistsinlet.com/wordpress/.

