Text Analysis Info

An overview of software that analyses texts and other sources of human communication



Last update: 20 April 2017

This page gives an overview of plagiarism-detection software and does not claim to be complete. Keep in mind that these software solutions require both the original text and the suspected plagiarism to be available in digital form.

Note: this section is based on an article written by André Kramer and published by c't (see the end of this page for the complete reference).

In the information age, texts are available digitally, so it is very easy to copy a text from the internet and publish it as one's own, even though another person wrote it, without quoting the source. This is called plagiarism. A well-known case was detected in early 2003: a student paper published several years earlier was presented by the British Secret Service as intelligence material, including its typographical errors, so it was easy to detect that the paper was plagiarized.

Today many universities face the problem of deciding whether student papers are plagiarized or not. Up to 30 percent of students admit to copying other people's work without quoting it. But plagiarism also matters in a commercial setting: using an already existing trademark, for example, can result in compensation claims or other commercial disasters.

Debora Weber-Wulff tested 17 software solutions. The result was that no system does a very good job; only one, Ephorus, was rated good. The last link shows the results of the test and how the software was evaluated.


  • Hapax legomena are words that occur only once in a text. Some programs that generate a word list can select these if you specify a maximum frequency of 1.
  • A tri-gram is a sequence of exactly three consecutive words in a text. These three-word combinations are counted and compared with those of other texts.
  • An n-gram is the general form of a tri-gram; n can be set to any number that makes sense. If n = 1, a word list is generated. If n is too large, most n-grams occur only once in a text and are therefore not usable; the software will also need considerable time to generate, e.g., 10-grams.
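As an illustration of these terms (a minimal sketch, not the implementation used by any tool listed below), hapax legomena and n-grams can be extracted from a text like this:

```python
from collections import Counter

def ngrams(text, n):
    """Return all overlapping sequences of n consecutive words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def hapax_legomena(text):
    """Return the set of words that occur exactly once in the text."""
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count == 1}

text = "the quick brown fox jumps over the lazy dog"
print(ngrams(text, 3)[:2])    # the first two tri-grams
print(hapax_legomena(text))   # every word except "the", which occurs twice
```

With n = 1 this degenerates into a plain word list, matching the description above; larger n produces rarer, more distinctive sequences.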


  • CopyCatch Gold, developed by David Woolls, compares the hapax legomena of two texts; if the percentage of shared words exceeds 70%, you can assume that the texts have passages in common. It is now distributed by CFL Software.
  • Duplichecker is a free online tool.
  • Plagiarismsoftware.net is a free online tool, and
  • plagiarismsoftware.org appears to be the same service as above.
  • Glatt deletes every fifth word, and the student must fill in the missing words. The plagiarism detection rate is high, but the method is also labor-intensive.
  • Pl@giarism, developed by Georges Span at the University of Maastricht (Netherlands), works with tri-grams.
  • Plagiarism Finder by Mediaphor Software Entertainment extracts phrases from a given document and searches for them using Google as a search engine.
  • PlagScan is a plagiarism detector distributed by PlagScan GmbH in Cologne, Germany.
  • Turnitin by iParadigms compares a text with documents stored from the internet and with roughly 4,500 print media. A so-called fingerprint provides statistical information and is used to decide whether the uploaded document is plagiarized. All uploaded documents are stored for future use.
  • Scriptum by Vancouver Software Labs is also based on a tri-gram algorithm.
  • PlagTracker is a free online tool that checks against already stored texts in several languages.
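The hapax-overlap comparison attributed to CopyCatch above can be sketched as follows. This is an illustrative assumption, not CopyCatch's actual algorithm: the exact definition of the shared percentage is not stated here, so this sketch measures the intersection against the smaller of the two hapax sets.

```python
from collections import Counter

def hapax(text):
    """Words occurring exactly once in the text."""
    counts = Counter(text.lower().split())
    return {word for word, count in counts.items() if count == 1}

def hapax_overlap(text_a, text_b):
    """Fraction of hapax legomena shared by both texts,
    relative to the smaller hapax set (an assumed denominator)."""
    ha, hb = hapax(text_a), hapax(text_b)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / min(len(ha), len(hb))

a = "alpha beta gamma delta epsilon"
b = "alpha beta gamma delta zeta"
if hapax_overlap(a, b) > 0.70:
    print("texts likely share passages")
```

Here four of the five once-only words coincide (an overlap of 0.8), so the 70% threshold described for CopyCatch would flag the pair for closer inspection.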