Last update: 20. September 2023
This section summarises questions that reach me. I have tried to answer them; if you do not agree with me, or if you want to contribute yourself, please don't hesitate to get in touch. All contributions will be considered and may be published here.
- What kind of software do I need to do X, where X is one of:
- scraping stories from a website: You will have to download the pages of the website(s) you wish to analyse. A website is organised in files; many of them are graphics, images, or video that you usually do not want to analyse. Programs that download a whole website or part(s) of it are called offline readers or web spiders. Most of them let you restrict the download to files with certain file extensions (e.g. html) or to a certain file size.
- simple term-searching: A simple form of term-searching is offered by every editor or word processor. The file that is already loaded is searched, and the hits are shown. Most editors can search for character strings, which may be whole words or parts of words.
- keyword searching: Keyword searching is very similar to term-searching, but the results may be different. Concordance programs often allow keyword searching, and the results are displayed in a results window, not in the original file. Sometimes the results are also specially formatted; a very popular format is KWIC (key-word-in-context).
- cluster analyses: These programs group texts into clusters. Cluster analyses require a certain minimum amount of text.
- converting files between formats (the 'Word documents' question): There are many file formats for text files, and most of them are proprietary (specific to one product). Content analysis software often requires the text as a plain text file (often called an ASCII file). Nowadays (2023) the most popular encoding standard for texts is UTF-8 (Unicode Transformation Format, 8-bit). It can encode the characters of virtually any language and makes texts interchangeable between operating systems (Windows, Linux, MacOS). If you use Word, save your text as plain text; this removes all formatting information, which you do not need in a content analysis.
- converting images/hardcopy to electronic text (OCR - optical character recognition): Content analysis software requires the text to be stored in a file. This means that if you only have printed material, you must digitise it (in other words, make it readable for a computer). You can type the text, you can dictate it using dictation software (the best known are ViaVoice from IBM and Dragon NaturallySpeaking), or you can scan it using a scanner. The scanner stores the text as an image, and the next step is to transform the image into text data. The software for this is called OCR (optical character recognition) software. One might think that 99 % correctly recognised characters is a good value, but it means that there are between 10 and 15 errors per page (a page has about 1000-1500 characters).
- some particular CA-related task (like keywords in context, collocation, etc.): These analyses often help you to detect problems such as ambiguity (a word has several meanings) or negation.
- what kind(s) of statistical tests can I use on CA data? In most cases you receive frequency counts for each category, and a statistical case is a text unit (a sentence or a section, depending on your hypotheses). So you can use tests suitable for interval-scaled variables.
- what statistical packages work with CA data? Nearly any package that can read CSV data (comma-separated values), such as SPSS or R.
- are there any good books on CA? Klaus Krippendorff's book on content analysis: Krippendorff, K. (2004). Content Analysis: An Introduction to Its Methodology (2nd ed.). Thousand Oaks, CA: Sage Publications.
- how do I CA X, where X is one of:
- web pages: web crawlers
- media or speech transcripts: dictation software
- focus group or conversational or interview transcripts: dictation software
- historical documents: OCR software
- images
Acquiring the text with these tools is the first step. The second step is to prepare (edit) the files for further processing with text analysis software.
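To illustrate the KWIC (key-word-in-context) display mentioned under keyword searching, here is a minimal sketch in Python. It is not taken from any particular concordance program; the function name `kwic`, the `width` parameter and the sample sentence are assumptions made for this illustration. It searches a plain text string for whole-word hits, ignoring case, and prints each hit with a fixed amount of context on both sides, so that the keywords line up in one column.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word hit of `keyword`.

    `width` is the number of context characters kept on each side
    (a hypothetical parameter chosen for this sketch).
    """
    # Collapse line breaks and runs of blanks so the context reads as one line.
    flat = " ".join(text.split())
    # \b restricts hits to whole words; IGNORECASE also finds "Text" and "TEXT".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so that all keywords form a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


sample = (
    "The scanner stores the text as an image. "
    "OCR software turns the image into text data."
)
for line in kwic(sample, "text", width=20):
    print(line)
```

Real concordance programs usually offer more, for example sorting the hits by the word to the left or right of the keyword, but the principle of showing the hits in a separate results list instead of the original file is the same.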