FAQ
Last update: 29. October 2003
- scraping stories from a website You will have to download the pages of the website(s) you wish to analyse. A website is organised in files, many of which are graphics, images, or video that you will mostly not want to analyse. Programs that download a whole website or parts of it are called offline readers or web spiders. Most of them let you restrict the download to files with certain file extensions (e.g. html) or of a certain file size.
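Restricting a download by file extension can be sketched in a few lines of Python. This is only an illustration, not how any particular offline reader works: the names `LinkCollector` and `wanted_links` and the default extension list are assumptions made for the example, and nothing is actually downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class LinkCollector(HTMLParser):
    """Collect the targets of all <a href="..."> links on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def wanted_links(html, base_url, extensions=(".html", ".htm")):
    """Return only the links whose file extension is in `extensions`."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    keep = []
    for link in collector.links:
        path = urlparse(link).path
        ext = posixpath.splitext(path)[1].lower()
        if ext in extensions:
            keep.append(link)
    return keep
```

A real spider would then fetch each kept link and repeat the step on the fetched page. A size limit would be applied at that point, for example by checking the length of the downloaded data.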
- simple term-searching A simple form of term-searching is offered by every text editor and word processor. A file that is already loaded is searched, and the hits are shown. Most editors can search for character strings, which may be whole words or any part of a word.
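The difference between searching for a character string and searching for a whole word can be shown with a small Python sketch. The function names are made up for this example, and the search is case-sensitive to keep it short.

```python
import re


def find_string(text, term):
    """Start offsets of every occurrence of `term`, even inside longer words."""
    return [m.start() for m in re.finditer(re.escape(term), text)]


def find_whole_word(text, term):
    """Start offsets of `term` only where it stands as a word of its own."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return [m.start() for m in re.finditer(pattern, text)]
```

Searching "The cat sat on the catalogue." for "cat" as a string finds it twice (the second hit is inside "catalogue"), whereas the whole-word search finds it once.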
- keyword searching Keyword searching is very similar to term-searching, but the results may be different. Concordance programs often allow keyword searching, and they display the hits in a separate results window rather than in the original file. Sometimes the results are also specially formatted; a very popular format is KWIC (key word in context).
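A KWIC display can be approximated as follows. This is a minimal sketch, not the output format of any particular concordance program: the function name `kwic`, the `width` parameter, and the bracket notation around the keyword are assumptions chosen for the example.

```python
import re


def kwic(text, keyword, width=30):
    """Return one line per hit, with the keyword aligned in a fixed column."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, m.start() - width):m.start()].replace("\n", " ")
        right = text[m.end():m.end() + width].replace("\n", " ")
        # Right-align the left context so all keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines


if __name__ == "__main__":
    sample = "Peace is the aim. War and peace are old themes."
    for line in kwic(sample, "peace", width=10):
        print(line)
```

Each hit appears on its own line with the keyword in the same column, which makes it easy to scan the words that occur to its left and right.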
- clustering analyses
- converting files between formats (the 'Word documents' question) There are many file formats for text, and most of them are proprietary (specific to one product). Content analysis software often requires the text as a plain text file (often called an ASCII file, which is incorrect in a Windows environment, where text files use an 8-bit character set that extends ASCII).
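Why "ASCII file" is a misnomer can be seen as soon as a text contains an accented letter. The sketch below assumes that the source file uses the Windows-1252 code page and that UTF-8 is the wanted target; both are assumptions for the example, and the function names are hypothetical. Check what your own files and your analysis software actually use.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range (0 to 127)."""
    return all(b < 128 for b in data)


def recode(data: bytes, source="cp1252", target="utf-8") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    windows_bytes = "café".encode("cp1252")   # the é is one byte above 127
    print(is_pure_ascii(windows_bytes))
    print(recode(windows_bytes))
```

Converting from a proprietary word processor format is a different matter: that normally means using the program's own "save as text" function, after which a recoding step like this one may still be needed.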
- converting images/hardcopy to electronic text (OCR) Content analysis software requires the text to be stored in a file. That means that if you only have printed material, you must digitise it (in other words, make it readable for a computer). You can type the text, you can dictate it using dictation software (the best known are IBM ViaVoice and Dragon NaturallySpeaking), or you can scan it using a scanner. The scanner stores the text as an image, and the next step is to transform the image into text data. The software for this is called OCR (optical character recognition) software. One might think that 99% correctly recognised characters is a good value, but on a page of 1,000 to 2,000 characters this means that there are between 10 and 20 errors per page.
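The arithmetic behind the error estimate is simple enough to check yourself. The sketch assumes that a page holds roughly 1,000 to 2,000 characters, which is the page size at which 99% accuracy gives 10 to 20 errors; the function name is made up for the example.

```python
def expected_errors(chars_per_page, accuracy):
    """Expected number of wrongly recognised characters on one page."""
    return round(chars_per_page * (1 - accuracy))


if __name__ == "__main__":
    for chars in (1000, 2000):
        print(chars, "characters:", expected_errors(chars, 0.99), "errors")
```

Multiplying the result by the number of pages gives a rough idea of how much proofreading a scanned text will need.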
- some particular CA-related task (like keywords in context, collocation, etc.)
- what kind(s) of statistical tests can I use on CA data?
- what statistical packages work with CA data?
- are there any good books on CA?
- how do I CA X, where X is one of:
- web pages
- media or speech transcripts
- focus group or conversational or interview transcripts
- historical documents
- images
After the download, the second step is to prepare (edit) the files for further processing with text analysis software. TextGrab is a program that downloads all text files of a website and prepares them for seamless processing with TextQuest.
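What preparing a downloaded page can involve is sketched below: the markup is removed and only the readable text is kept. This is an illustration using Python's standard library and is not a description of how TextGrab works; the names `TextExtractor` and `html_to_text` are assumptions for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep the text of a page and drop tags, scripts, and style sheets."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text inside <script> or <style> is not part of the story.
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML page as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    # Note: a tag in the middle of a word would leave a space in that word.
    return " ".join(" ".join(parser.parts).split())
```

Real pages usually also contain navigation menus, advertisements, and footers, so some manual editing or site-specific rules are normally still needed after a step like this.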