Text Analysis Info

Overview of software that analyses texts and other sources of human communication


FAQ - Frequently Asked Questions

Last update: 20. April 2017

This section summarises the questions that reach me. I have tried to answer them. If you disagree with an answer or want to contribute yourself, please don't hesitate to get in touch. All contributions will be considered and may be published here.

  • What kind of software do I need to do X, where X is one of:
  • scraping stories from a website
  • You will have to download the pages of the website(s) you wish to analyse. A website is organised in files, many of which are graphics, images, or video that you usually do not want to analyse. Programs that download a whole website or part(s) of it are called offline readers or web spiders. Most of them let you restrict the download to files with certain file extensions (e.g. html) or by file size.
    The second step is to prepare (edit) the files for further processing with text analysis software. TextGrab is a program that downloads all text files of a website and prepares them for seamless processing with TextQuest.
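The two steps above (keep only text-like files, then strip the markup) can be sketched in Python with the standard library alone. This is a minimal illustration, not how TextGrab works: the extension allow-list `TEXT_EXTENSIONS` and the helper names `is_text_url` and `html_to_text` are assumptions made up for this example, and the actual downloading is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath

# Hypothetical allow-list of extensions worth keeping for text analysis.
TEXT_EXTENSIONS = {".html", ".htm", ".txt"}


def is_text_url(url):
    """Return True if the URL path has an allowed extension or none at all."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext == "" or ext in TEXT_EXTENSIONS


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page, one text fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

A spider would call `is_text_url` on every link before downloading it and pass each downloaded page through `html_to_text` before handing the result to the analysis software.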
  • simple term-searching
  • Every editor or word processor offers a simple form of term searching: an already loaded file is searched, and the hits are shown. Most editors search for character strings, which may be whole words or any part of a word.
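The difference between searching for a character string and searching for a whole word can be shown in a few lines of Python. This is only a sketch of the idea; the function names are invented for the example, and real editors implement their search differently.

```python
import re


def find_substring(text, term):
    """Start offsets of every occurrence of term, even inside longer words."""
    return [m.start() for m in re.finditer(re.escape(term), text)]


def find_whole_word(text, term):
    """Start offsets of term only where it stands as a word of its own."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]


sample = "The cat sat; concatenate the category."
print(find_substring(sample, "cat"))   # hits in "cat", "concatenate", "category"
print(find_whole_word(sample, "cat"))  # hit in "cat" only
```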
  • keyword searching
  • Keyword searching is very similar to term searching, but the results may differ. Concordance programs often allow keyword searching, and the hits are displayed in a separate results window rather than in the original file. Sometimes the results are also specially formatted; a very popular format is KWIC (keyword in context).
  • cluster analysis: the programs group texts into clusters. These analyses require a certain minimum amount of text.
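A KWIC listing puts every hit on its own line, with the keyword aligned in one column and a fixed amount of context on either side. The following Python sketch shows the idea under simple assumptions: the context is measured in characters (the `width` parameter is an arbitrary choice), matching ignores case, and the function name `kwic` is made up for this example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per hit: left context, [keyword], right context."""
    # Collapse line breaks and repeated blanks so each hit fits on one line.
    text = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that all keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


for line in kwic("The cat sat on the mat. A second cat ran away.", "cat", width=12):
    print(line)
```

Concordance programs usually add sorting by left or right context on top of this, which is what makes recurring patterns visible.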
  • converting files between formats (the 'Word documents' question)
  • There are many file formats for text files, and most of them are proprietary (specific to one product). Content analysis software often requires the text as a plain text file (often called an ASCII file, which is incorrect in a Windows environment). Nowadays the most popular encoding for texts is UTF-8 (Unicode Transformation Format, 8-bit). It can encode the characters of virtually any language and makes texts interchangeable between operating systems (Windows, Linux, macOS).
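For plain text files that are stored in an older Windows encoding, the conversion to UTF-8 is a short script. The sketch below assumes the source file is in Windows-1252 (`cp1252`), which is only a guess that must be checked for each file; the function name is made up for this example. It does not handle word processor formats such as Word documents, which have to be exported as plain text first.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Read a plain text file in source_encoding and write it as UTF-8."""
    # Decoding with the wrong source encoding produces garbled characters,
    # so source_encoding is an assumption to verify per file.
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
```

Calling `convert_to_utf8("interview.txt", "interview_utf8.txt")` would leave the original untouched and write the converted copy next to it; both file names are placeholders.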
  • converting images/hardcopy to electronic text (OCR - optical character recognition)
  • Content analysis software requires the text to be stored in a file. That means that if you only have printed material, you must digitise it (in other words, make it readable for a computer). You can type the text, you can dictate it using dictation software (the best known are ViaVoice from IBM and Dragon NaturallySpeaking), or you can scan it using a scanner. The scanner stores the text as an image, and the next step is to transform the image into text data. The software for this is called OCR (optical character recognition) software. One might think that 99 % correctly recognised characters is a good value, but on a page of roughly 1,000 to 2,000 characters this still means between 10 and 20 errors per page.
  • some particular CA-related task (like keywords in context, collocation, etc.). Often these analyses help you to detect problems like ambiguity (a word has several meanings) or negation.
  • what kind(s) of statistical tests can I use on CA data?
  • what statistical packages work with CA data? Nearly any package that can read CSV data (comma-separated values).
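As an illustration of what such CSV data can look like, the following Python sketch counts the words of a text and writes the counts as comma-separated values. The column names `word` and `count`, the very simple definition of a word (a run of letters), and the function names are all assumptions made for this example; content analysis programs produce richer output.

```python
import csv
import io
import re
from collections import Counter


def word_frequencies(text):
    """Count lower-cased words, where a word is a run of letters."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def frequencies_to_csv(freqs):
    """Return the counts as CSV text, most frequent word first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word", "count"])
    # Sort by descending count, then alphabetically for a stable order.
    for word, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        writer.writerow([word, count])
    return buffer.getvalue()


print(frequencies_to_csv(word_frequencies("the cat and the dog")))
```

The resulting text can be saved to a file and opened in a spreadsheet or imported into a statistical package.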
  • are there any good books on CA?
  • how do I CA X, where X is one of:
    • web pages: web crawlers
    • media or speech transcripts: dictation software
    • focus group or conversational or interview transcripts: dictation software
    • historical documents: OCR software
    • images