Text Analysis Info

Overview of software that analyses texts and other sources of human communication

Content - quantitative with category systems

Last update: 4. December 2014

AmCat - Amsterdam Content Analysis Toolkit

authors: members of the section of Communication Science at the Vrije Universiteit Amsterdam
program: AmCAT
documentation: Introduction
download: none, you work online
operating system: irrelevant, you work online
description: AmCAT is an online tool for content analyses, especially relational content analysis.

CoAn 2.08 - Content Analysis (German only)

author: Matthias Romppel
program: CoAn 2.08
documentation: printed manual in German
download: test
operating system: Win 3.x, Win9x, WinNT, does not run on 64-bit systems
description: word list, concordances, frequencies of categories. CoAn is inspired by a former Intext version. It uses dictionaries to code texts; special features are interactive coding and powerful search patterns such as word co-occurrences. It is available in German only.
Personal comment: this site has not been updated since 2006.

Diction 7.0

program: DICTION 7.0
author: Roderick P. Hart

distributor: Digitext Inc., Austin, TX, USA
download: trial version
manual: manual
operating system: MS-Windows, Mac OS-X
description: Diction uses dictionaries (word-lists) to search a text for these qualities:

  • Certainty: Language indicating resoluteness, inflexibility, and completeness and a tendency to speak ex-cathedra.
  • Activity: Language featuring movement, change, the implementation of ideas and the avoidance of inertia.
  • Optimism: Language endorsing some person, group, concept or event, or highlighting their positive entailments.
  • Commonality: Language highlighting the agreed-upon values of a group and rejecting idiosyncratic modes of engagement.
  • Realism: Language describing tangible, immediate, recognizable matters that affect people's everyday lives.

The results can be statistically analysed and are compared with other texts, so that an under- or overrepresentation of categories can be detected.

General Inquirer

program: General Inquirer
author and distributor: Philip J. Stone
download: yes, but only the category systems
operating system: Java, category systems are Excel-files (XLS)
documentation: description of categories
description: The General Inquirer, the grandfather of many content analysis programs, is now available for computers that run Java and can read the category systems (Excel files).

KH Coder

program: KH Coder 2.b. 31d
author and distributor: Koichi Higuchi
download: free download
operating system: MS-Windows, Mac OS-X, Linux
documentation: Tutorial
description: KH Coder is free software for quantitative content analysis or text mining. It supports the analysis of texts in Japanese, English, French, German, Italian, Portuguese and Spanish. KH Coder has the following features:

  • Words: Frequency List, Searching, KWIC Concordance, Collocation Stats, Correspondence Analysis, Multi-Dimensional Scaling, Hierarchical Cluster Analysis, Co-Occurrence Network
  • Categories: Developing Your Own Coding Rules, Frequency List, Cross Tabulation, Correspondence Analysis, Multi-Dimensional Scaling, Co-Occurrence Network, Hierarchical Cluster Analysis
  • Documents: Searching, Clustering, Naive Bayes classifier

KH Coder provides these functions using back-end tools such as Stanford POS Tagger, Snowball stemmer, MySQL and R. Just input raw texts and you can utilize these functionalities.

LIWC 2007 - Linguistic Inquiry and Word Count

program: LIWC 2007 - Linguistic Inquiry and Word Count
authors: James W. Pennebaker, Roger J. Booth, and Martha E. Francis
distributor: Erlbaum Associates
download: with registration only
operating system: MS-Windows, Mac OS-X
documentation: LIWC 2007 manual
description: The program analyses text files on a word-by-word basis, calculating the percentage of words that match each of several language dimensions. The program has 68 pre-set dimensions (output variables), including linguistic dimensions, word categories tapping psychological constructs, and personal concern categories, and can accommodate user-defined dimensions as well.
In the new LIWC 2007 version the dictionary has been extended. In the Mac OS-version there are new features like phrases and parts of words (stems) as search patterns, and also highlighting of the text. A lite version for students is also available.
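
The word-by-word percentage calculation described above can be illustrated with a minimal Python sketch. This is not LIWC's implementation: the category names, the word lists, and the helper functions are hypothetical, and the trailing "*" for stems is simply an assumed convention for this example.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary (NOT the LIWC dictionary): category -> patterns.
# A trailing "*" marks a stem that matches any word beginning with it.
CATEGORIES = {
    "posemo": ["happy", "good", "love*"],
    "negemo": ["sad", "hate*", "bad"],
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def matches(word, pattern):
    """Return True if word equals a literal pattern or starts with a stem pattern."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern

def category_percentages(text, categories=CATEGORIES):
    """Return, for each category, the percentage of tokens that match it."""
    tokens = tokenize(text)
    counts = Counter()
    for word in tokens:
        for name, patterns in categories.items():
            if any(matches(word, p) for p in patterns):
                counts[name] += 1
    total = len(tokens)
    return {name: (100.0 * counts[name] / total if total else 0.0)
            for name in categories}

result = category_percentages("I love this good day but I hated the bad rain")
```

Each token is counted at most once per category, and the percentages are relative to all tokens in the text, so categories may overlap and need not sum to 100.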

MCCA - Minnesota Contextual Content Analysis

program: Dimap 4.0 with MCCA
operating system: Win95
authors: Ken Litkowski, Donald McTavish
distributor: CL Research
download: test
documentation: no, but many white papers on the website
description: DIMAP/MCCA description

personal comment: the web pages are outdated; one was last edited in 2001 (sic!)

PCAD 2000

program: PCAD 2000
author and distributor: GB Software

documentation: manual
download: no
operating system(s): Win9x
description: The primary area of interest is measuring psychobiologically interesting states such as anxiety, hostility, and hope using the Gottschalk-Gleser content analysis scales. These scales have been empirically developed and tested, and have been shown to be reliable and valid in a wide range of studies. Louis A. Gottschalk (M.D. Ph.D.) has been the principal developer of these scales, and has applied them in many areas of medicine and beyond.

Protan - Protocol Analyser

program: PROTAN
author and distributor: Robert Hogenraad

documentation: overview on the analysis modules
download: no
operating system(s): MS-DOS, Mac OS, Unix
description: word list, concordances, frequencies of categories, sequences of categories. Manuals are available in electronic and printed form; the documentation is in French only, whereas the command language for the utilities is English. PROTAN is also a successor of the General Inquirer, with many utilities that perform numerous text analysis tasks. PROTAN is very complex and difficult to handle.
comment: Although the functions of PROTAN are very impressive, it takes some time to learn to use all of them. Many standardised category systems are available, such as all language versions of Colin Martindale's RID (Regressive Imagery Dictionary), the Harvard dictionary, the Whissell dictionary, and many others for different languages.

TEXTPACK 7.0 - TextPackage

program: TEXTPACK 7.0
authors: Peter Ph. Mohler and Cornelia Züll
distributor: GESIS Mannheim (former ZUMA)
documentation: not available online any more
download: free download with registration, see the end of the description
operating systems: MS-Windows, in English or Spanish
description: TEXTPACK features:

  • word frequencies for the entire text or its sub-units, can be filtered by external variables (identifiers) and/or frequency, sorted by alphabet or frequency, sort order tables possible
  • Keyword-in-Context and Keyword-out-of-Context (KWIC/KWOC): single words, word roots (beginnings of strings) or word sequences can be shown in their context.
  • cross-references and concordances
  • word comparison of two texts
  • TEXTPACK categorises/classifies a text according to a user dictionary. It generates files with either category frequencies or category sequences. The validity of the coding can be checked by various options (e.g., the insertion of category numbers or category labels in the continuous text).
  • selection of text units: filtering on the basis of external variables, or using a numeric file to select text units.
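
To make the Keyword-in-Context idea from the feature list concrete, here is a minimal Python sketch. It is not TEXTPACK code: the function name `kwic`, the `width` parameter, and the trailing "*" for word roots are assumptions made for this illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    Matching is case-insensitive. A trailing "*" in keyword is treated as a
    word root, i.e. it matches every word that begins with that string.
    """
    tokens = re.findall(r"\w+", text)
    is_root = keyword.endswith("*")
    target = keyword.rstrip("*").lower()
    lines = []
    for i, token in enumerate(tokens):
        low = token.lower()
        hit = low.startswith(target) if is_root else low == target
        if hit:
            # Take up to `width` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

sample = "The coder reads the text. Coding the text takes time, and coders disagree."
hits = kwic(sample, "cod*", width=2)
for left, word, right in hits:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12} | {word} | {right}")
```

The sketch drops punctuation during tokenisation and measures context in words; a real KWIC listing may instead use a fixed number of characters per line.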

TEXTPACK is available free of charge. The program is no longer being developed. Only scientific use of the data is accepted (not-for-profit research or teaching). Registration enables you to download the demo version of the program. By registering, one agrees that the data will be used for scientific purposes only.

TextQuest 4.2

program: TextQuest 4.2
author and distributor: Social Science Consulting

documentation: manual (PDF-file) is included in the test version
download: test version in English and German
operating system(s): MS-Windows, Mac OS-X
description: TextQuest uses dictionaries to code texts; special features are interactive coding, powerful search patterns such as word co-occurrences, and negation detection for English and German. The text exploration features are word lists supporting sort order tables, exclusion lists (STOP-words), KWIC lines of variable length, and lists of word sequences (phrases) and word permutations. The readability module consists of 70 readability formulas for 8 languages (English, French, German, Spanish, Italian, Dutch, Danish, and Swedish). TextQuest is available with an English or German user interface.

Version 4.2 contains a new word sequence module and new readability formulas: language independent, for English and for Italian. There are also new text statistics in the vocabulary analyses.

Version 4.1 contains a new algorithm that separates sentence marks from words, 2 new readability formulas for English and Spanish, and a new test version (English and German).

Version 4.0 is a complete overhaul and is now available for Windows and Apple's Macintosh. Latin-1 and UTF-8 encoding are now supported, so many languages can be processed properly. Time-restricted licenses are also available at lower prices (e.g. 200 Euro for a 6-month academic single license instead of 600 Euro for a full license).

In version 3.0 there are two new modules: the multiple vocabulary comparison module and the category manager. Version 3.1 does not contain new modules, but some unclear menus and messages were fixed.

Since version 1.8 standard category systems like RID (Regressive Imagery Dictionary) for English and German, and the HKW (Hamburger kommunikationssoziologisches Wörterbuch) are included.

Whissell's Dictionary of Affect in Language

author: Cynthia Whissell

download: not available any more, use the web service

distributor: Perceptmx Collingwood, ON, Canada
program: Dictionary of Affect in Language (DAL)
operating system: MS-Windows, Mac OS, Linux, online
documentation: manual
description: The Dictionary of Affect in Language (DAL) is an attempt to quantify emotion in language. Volunteers viewed many thousands of words and rated them in terms of their Pleasantness, Activation, and Imagery (concreteness). The DAL is embedded in a computer program which is used to score language samples on the basis of these three dimensions. The DAL has been applied to studies of fiction (e.g., Frankenstein, David Copperfield), of poetry (e.g., the work of Frost, Blake), drama (e.g., Shakespeare’s tragedies and comedies), advertisements, group discussions, and lyrics (e.g., the Beatles). It has also been used in the selection of words for memory research.
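
As a rough illustration of scoring a language sample on the three dimensions, the Python sketch below averages per-word ratings over all words found in a dictionary. The averaging rule is an assumption, not the documented DAL procedure, and the ratings shown are invented placeholder values rather than actual DAL entries.

```python
import re

# Hypothetical ratings (pleasantness, activation, imagery).
# These numbers are placeholders, NOT values from the real DAL.
RATINGS = {
    "love": (3.0, 2.5, 1.5),
    "storm": (1.5, 2.8, 3.0),
    "quiet": (2.5, 1.0, 1.5),
}

def score_sample(text, ratings=RATINGS):
    """Average the ratings of all rated words in the sample.

    Returns None if no word of the sample is in the dictionary. "coverage"
    is the share of tokens that had a rating.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    rated = [ratings[t] for t in tokens if t in ratings]
    if not rated:
        return None
    n = len(rated)
    means = [sum(r[d] for r in rated) / n for d in range(3)]
    return {
        "pleasantness": means[0],
        "activation": means[1],
        "imagery": means[2],
        "coverage": n / len(tokens),
    }

scores = score_sample("Love after the storm")
```

Reporting coverage alongside the means helps to judge how much of the sample the dictionary actually rated.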

UIMA

program: UIMA
authors: many
distributor: IBM Research and IBM Software Group and Carnegie Mellon University
documentation: documentation of the SDK
download: www.ibm.com/developerworks/data/downloads/uima/downloads.html
operating systems: Java, SDK independent from operating system
description: UIMA stands for the Unstructured Information Management Architecture.
It is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components.
IBM is making UIMA available as free, open source software to provide a common foundation for industry and academia to collaborate and accelerate the world-wide development of technologies critical for discovering the vital knowledge present in the fastest growing sources of information today.
The UIMA Software Development Kit (SDK) is freely available, as is the source code of the UIMA core Java framework. In particular, the UIMA APIs are available for creating customized solutions in WebSphere Information Integrator OmniFind Edition.

Wordscores

program: Wordscores
authors: Michael Laver, Kenneth Benoit, and John Garry
distributor: Trinity College, University of Dublin, Ireland
documentation: user documentation
download: yes, source for the different modules is provided
operating systems: requires Stata version 7 or better
description: Wordscores is a set of Stata programs for content analysis. The programs wordfreq, phrasefreq, setref, describetext, wordscore and textscore help you explore the text and assign codes.

WordStat 7.1

program: WordStat 7.1
author: Normand Peladeau
distributor: Provalis Research or Social Science Consulting (Europe)
documentation: manual as a PDF-file
download: test version expires after 30 days or self running demo
operating systems: MS-Windows
description: WordStat is an add-on to QDA-Miner or SimStat, a general-purpose statistics program (comparable to SPSS, for example). Both packages are integrated and especially useful for the coding of answers to open-ended questions. It also includes thesauri and spell-checkers for different languages. It comes with Colin Martindale's RID - Regressive Imagery Dictionary (English, French, Portuguese, Swedish, German, Latin) and a few other dictionaries and thesauri (WordNet, Roget's thesaurus). Version 7.1 offers geospatial processing, and WordStat is also available for Stata.

Yoshikoder 0.6.4

program: Yoshikoder 0.6.4
author: William Lowe
distributor: Will Lowe
documentation: user documentation
download: free version
operating systems: MS-Windows, Mac OS-X, Linux, with Java environment
description: Yoshikoder works with text documents, whether in plain ASCII, Unicode (e.g. UTF-8), or a national encoding (e.g. Big5 Chinese). You can construct, view, and save keywords-in-context. Content analysis dictionaries can be constructed using Perl-style regular expressions. Yoshikoder provides summaries of documents, either as word frequency tables or according to a content analysis dictionary. You can also compare documents according to word frequency profile or with respect to a content dictionary. Yoshikoder's native file format is XML, so dictionaries and keyword-in-context files are non-proprietary and human readable. The RID and LIWC are also available.