How does Google distinguish spun content from original content

Many webmasters face the task of generating a lot of content to populate their sites. With limited time and energy, the vast majority of them use scraping software to collect articles and then rewrite them. How does Google tell the spun articles generated by article rewriters apart from the originals? In fact, Google does a better job than Baidu at detecting rewritten content. Let us look at how Google detects spun articles.

1 Content similarity

Content similarity is the main measure a search engine uses to reduce redundancy. The most widely used algorithm is TF-IDF, a relevance-weighting calculation. The core idea of TF-IDF is this: if a word or phrase appears with high frequency in one article but rarely appears in other articles, that word or phrase has good discriminating power and is well suited for classification.

TF (Term Frequency) is the number of times a given word appears in a document.

IDF (Inverse Document Frequency) works the other way: the fewer documents contain a term, the greater its IDF, and the better that term distinguishes one document from another.
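Putting the two together, a common weighting is tf-idf(w, d) = tf(w, d) × log(N / df(w)), where N is the total number of documents and df(w) is the number of documents containing w. For example, a word that appears 5 times in a 100-word article and shows up in 10 of 1,000 documents gets a weight of (5/100) × log(1000/10) ≈ 0.23 using the natural log. This is one common variant of the formula, not necessarily the exact one any search engine uses.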

When an article is run through the TF-IDF calculation, the result is a multi-dimensional vector: the article's content feature vector. When the feature vectors of two articles are nearly the same, we consider their content close; when they are identical, the content counts as duplicated.
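To make this concrete, here is a minimal sketch in Python of TF-IDF feature vectors compared with cosine similarity. The toy corpus, the whitespace tokenization, and the exact weighting variant are illustrative assumptions; Google's production system is of course far more elaborate.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a sparse TF-IDF vector (word -> weight) for each tokenized document."""
    n = len(docs)
    # Document frequency: how many documents contain each word.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Weight = term frequency scaled by inverse document frequency.
        vectors.append({w: (c / total) * math.log(n / df[w]) for w, c in tf.items()})
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 1.0 means identical direction."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the quick brown fox leaps over the lazy dog".split(),  # one word spun
    "search engines index billions of pages every day".split(),
]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # high: near-duplicate
print(cosine_similarity(vecs[0], vecs[2]))  # low: unrelated
```

Note how swapping a single word barely moves the vector, which is exactly why naive spinning fails this test.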

For the details of TF-IDF and the vector approach, see "The Beauty of Mathematics 12: Cosine Theorem and News Classification" on the Google China Blackboard blog.

2 Data fingerprints

Once the search engine has grouped articles by similarity, it can confirm duplicates using data fingerprints. There are many fingerprinting algorithms; a common one extracts an article's punctuation marks and compares them, since it is hard to imagine two different articles having exactly the same punctuation sequence. Other methods include vector comparison, that is, comparing the TF values (keyword density).
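Here is a toy illustration of the punctuation idea in Python. The helper name punctuation_fingerprint and the particular set of marks it extracts are assumptions made for the sketch, not a known Google algorithm.

```python
import hashlib
import re

def punctuation_fingerprint(text):
    """Hash of the punctuation sequence; unchanged when only the words are swapped."""
    marks = "".join(re.findall(r"[,.;:!?'\"()-]", text))
    return hashlib.md5(marks.encode("utf-8")).hexdigest()

print(punctuation_fingerprint("Hello, world. How are you?"))
```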

Now you know that many article rewriter tools just replace keywords. After the keywords are replaced, the punctuation fingerprint, and even the TF vector, is unchanged. Reordering the paragraphs of the article may change the punctuation sequence, but the vector and word-frequency problems remain. So you can imagine the quality of these article spinners (they may work for Baidu). The snippet below illustrates the point.
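Reusing the hypothetical punctuation_fingerprint helper from the sketch above, a quick check shows why synonym replacement changes nothing:

```python
original = "Search engines crawl pages, index them, and rank the results."
spun = "Search tools scan pages, catalog them, and order the results."  # keywords swapped

# The punctuation sequence ", , ." is untouched, so the fingerprints match exactly.
print(punctuation_fingerprint(original) == punctuation_fingerprint(spun))  # True
```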

3 Code noise

The discussion above rests on one condition: the search engine must know which part of a page is the article. Every site's template is different, the code is different, and all kinds of information are mixed together, so extracting the body text is the first thing a search engine deals with.

Generally speaking, Google uses the page layout and the code-to-noise ratio to distinguish navigation from body text, and it can ignore typical boilerplate code. We should keep this in mind when creating templates. There is a trick here: reduce noise across the page as a whole to make it easy for the search engine to identify the body text, but add a suitable amount of noise inside the text area to make it harder for the search engine to recognize duplication. The sketch below shows one rough proxy for this ratio.
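As a rough illustration of measuring code versus text, here is a sketch that computes a text-to-HTML ratio with the Python standard library. The class and function names are assumptions for the sketch; real template and boilerplate detection is far more involved.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def text_to_code_ratio(html):
    """Fraction of the raw HTML that is visible text; higher suggests body content."""
    parser = TextExtractor()
    parser.feed(html)
    text = "".join(parser.parts).strip()
    return len(text) / max(len(html), 1)

page = ("<html><head><script>var x=1;</script></head><body>"
        "<nav><a href='/'>Home</a></nav>"
        "<p>This paragraph is the main body text of the article.</p>"
        "</body></html>")
print(round(text_to_code_ratio(page), 2))
```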
