Text Analysis Info - Plagiarism

Last update: 22. October 2007

Note: this section is based on an article written by André Kramer and published by c't (for the complete reference, see the end of this page).

In an age when information is digitally available, it is very easy to copy text from the internet and publish it as one's own, even though another person wrote it. This is called plagiarism. A well-known case was detected at the beginning of 2003: a student paper published several years earlier was presented as secret information by the British Secret Service, including its typographical errors. Those errors made the plagiarism easy to detect.

Today many universities face the problem of deciding whether student papers are plagiarized. Up to 30 percent of students admit that they copy other people's work without citing it. In a commercial setting it is also important not to plagiarize, because using existing trademarks, for example, can result in compensation claims or other commercial disasters.

Debora Weber-Wulff tested 17 software solutions. The result is that no system does a very good job, and only one, Ephorus, was rated good. The last link shows the results of the test and how the software was evaluated.

Definitions

  • Hapax legomena are words that occur only once in a text. Some programs that generate a word list can select them if you specify a maximum frequency of 1.
  • A tri-gram is a part of the text that consists of exactly three words. These three-word combinations are counted and compared with the counts from other texts.
  • An n-gram is the general form of a tri-gram; you can set n to any number that makes sense. If n=1, a word list is generated. If n is too high, most n-grams occur only once in a text and are therefore not usable; the software will also need some time to generate, e.g., 10-grams.
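
The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the tested programs: the function names, the simple ASCII tokenizer, and the use of Jaccard overlap as the comparison measure are all assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a word is a run of ASCII letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Count every run of n consecutive words; n=1 yields a plain word list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def hapax_legomena(tokens):
    """Return the words that occur exactly once, sorted alphabetically."""
    counts = ngrams(tokens, 1)
    return sorted(gram[0] for gram, freq in counts.items() if freq == 1)


def shared_ngram_ratio(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' n-gram sets.

    Assumption: Jaccard overlap is only one of several possible measures.
    Returns 0.0 for disjoint texts and 1.0 for identical n-gram sets.
    """
    a = set(ngrams(tokenize(text_a), n))
    b = set(ngrams(tokenize(text_b), n))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    suspect = "a quick brown fox jumps over a sleeping dog"
    print(hapax_legomena(tokenize(original)))
    print(shared_ngram_ratio(original, suspect, n=3))
```

A high ratio for n=3 suggests that long stretches of wording are shared between the two texts. With a large n, almost every n-gram occurs only once, so the ratio drops toward zero even for closely related texts.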

Software

Literature


Please send comments and suggestions to