Text Analysis Info

Overview on software that analyses texts and other sources of human communication


Explanation of terms

Last update: 7. September 2014

The glossary explains terms used in text analysis, with or without text analysis software. Some terms are ambiguous, and often there is more than one word for the same meaning. One purpose of this page is to standardise terms. If you want to improve this list, just send an e-mail.

ambiguity This problem occurs while defining search patterns for a category system (dictionary). Because search patterns have to be defined uniquely, ambiguity must not occur. Example: pot. This can mean the same as a cup, but it can also mean a certain drug. The search pattern ' pot ' is therefore ambiguous. It makes sense to examine the context by running a concordance of the text unit.

analysis unit see coding unit.

autocoding this term is usually used with qualitative data analysis (QDA) software. It means that the coding is done by the software using search patterns. Each time a search pattern is found in the text, a code is assigned. The dictionary approach in a computer aided content analysis uses the same technique.
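
The dictionary approach described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular QDA program: the category names, the search patterns and the function name autocode are all hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Hypothetical category system: category name -> list of search patterns.
CATEGORY_SYSTEM = {
    "economy": ["tax", "budget"],
    "health": ["hospital", "doctor"],
}


def autocode(text_unit, category_system=CATEGORY_SYSTEM):
    """Count, per category, how often one of its search patterns is found.

    Assumption: patterns match whole words only and ignore case.
    """
    codes = {}
    for category, patterns in category_system.items():
        hits = 0
        for pattern in patterns:
            regex = r"\b" + re.escape(pattern) + r"\b"
            hits += len(re.findall(regex, text_unit, flags=re.IGNORECASE))
        if hits:
            codes[category] = hits
    return codes
```

Calling autocode on a text unit returns a dictionary of the categories that were assigned, together with the number of hits for each.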

blank Another word for space. A word is formed by all characters between two blanks (or other delimiters like start or end of a line).

case folding Enabling case folding means that strings (mostly words) that differ only in lower/upper case letters are treated as the same by some TextQuest modules. Disabling case folding means that all differences matter, including those based on upper/lower case. For example: That and that are treated as one word if case folding is enabled and as two words if case folding is disabled.
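
The effect of case folding on counting can be shown with a small Python sketch. The function name count_strings and its case_folding parameter are hypothetical and only mirror the behaviour described above; they are not TextQuest options.

```python
from collections import Counter


def count_strings(text, case_folding=True):
    """Count blank-delimited strings, optionally ignoring upper/lower case."""
    strings = text.split()
    if case_folding:
        # str.casefold() is Python's aggressive lower-casing for comparisons.
        strings = [s.casefold() for s in strings]
    return Counter(strings)
```

With case folding enabled, "That" and "that" end up under one key; with case folding disabled, they are counted separately.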

category Operationalisation of a theoretical construct with one or more search patterns (see there).

category system a group of several categories. Each category consists of at least one search pattern.

character string all characters between two blanks (see there), usually a word.

coding unit The coding unit is the definition of a case. A new coding unit starts with every new text unit in most text analysis software.

concordance Search patterns in their context. This is an analysis that shows search patterns and their context in one line (similar to KWICs). The search patterns are in the center of a line, the rest consists of the context before and after the search pattern. In TextQuest the length of the line is variable. In most programs KWICs have a fixed length.
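
A concordance can be sketched as follows. This is only an illustration under stated assumptions: the function name concordance is hypothetical, the context is measured in words (the width parameter) rather than in characters, and a hit is a whole word compared without case and without trailing punctuation.

```python
def concordance(text_unit, search_pattern, width=3):
    """Return one (left context, hit, right context) triple per occurrence."""
    words = text_unit.split()
    lines = []
    for i, word in enumerate(words):
        # Compare without surrounding punctuation and without case.
        if word.strip(".,;:!?").lower() == search_pattern.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines
```

Each triple can then be printed with the hit in the center of the line, for example by right-aligning the left context to a common width.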

cross reference A list of all positions at which a string occurs. A cross reference consists of all external variables and the positions of the string within the text unit.
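
As a rough sketch, a cross reference can be built as an index from each string to its occurrences. The function name cross_reference and the input layout (a list of pairs of external variables and text) are assumptions made for this example; positions are counted in words, starting at 1.

```python
from collections import defaultdict


def cross_reference(text_units):
    """Map each string to a list of (external variables, position) pairs.

    text_units: list of (external_variables, text) pairs, where
    external_variables is a tuple such as (author, page).
    """
    index = defaultdict(list)
    for external_variables, text in text_units:
        for position, string in enumerate(text.split(), start=1):
            index[string.lower()].append((external_variables, position))
    return dict(index)
```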

data exploration Data exploration has several goals; the main one is the generation of a word list (see there). The word list of a text enables you to find search patterns for building a category system and to find orthographical errors. Often word lists contain strings that are not needed for further analysis, e.g. function words like pronouns, articles, or conjunctions. These can be removed from the word list; this technique is called working with exclusion lists or STOP-words. Other forms of analysis are cross references (see there), concordances (see there) and lists of uncoded tokens (see there).

data generation The goal of data generation is to digitise a text, i.e. to make it readable by a computer (machine readability). Currently this is possible by typing the text with an editor or text processing software, by scanning printed material and then transforming the image with OCR (optical character recognition) software, or by dictating it.

default Each parameter that can be changed by the user has a value that is used if the user doesn't specify the parameter; this value is called the default. For example, in TextQuest file names have default names derived from the name of the project, whereas other programs require the user to select a file name.

dictionary another term for category system. A dictionary consists of all search patterns that form the categories. Sometimes the term dictionary is also used in the sense of a word list.

exclusion list A word list contains all strings of a text. If a category system is to be constructed, some of these, like articles, pronouns, and prepositions, are not needed for this task. With an exclusion list they do not become part of the word list (see also STOP-word).

external variable also identifier. External variables specify variables of a text unit, e.g. date, page, author etc. External variables are used to select (filter) text units in different kinds of analyses, e.g. a word list, content analysis, or readability analysis. The number of external variables varies, depending on the software: Textpack e.g. allows 3 external variables, other programs up to 50.

digit all strings where the first character is a digit (0-9).

file A way of organising data. A file consists of logical records, each record consists of at least one variable. Logical records of a file of text units (often called a system file) consist of the external variables, the number of words, the number of characters and the text. Each file has its own structure.

floating text Text in floating text format is organised in a file in which each logical record is a text unit. This is the format in which a system file is organised. Another form of organising text is the vertical text format, where a logical record consists of the identifiers and one word.

format Every file has a format that describes what the logical records look like (which variables are found where). Formats differ from program to program, so using the same text in different programs often requires much work to convert the text from one format to another.

foreign word Readability analyses that use the TRI formula (for political comments in German newspapers) take foreign words into account, i.e. words that are used in German but have a different origin (e.g. Greek or Latin). Foreign words are also called special words (see there).

homonym A string that has more than one meaning. In a content analysis homonyms have to be disambiguated (see ambiguity). Example: pot. Meaning: cup or drug.

ID Abbreviation of identification variable (external variable). These variables represent external variables of a text. Identifiers must be specified by the user; up to 50 identifiers are possible in TextQuest, and at least one is required. Other programs have different features.

infix A string (see there) that may occur in any position within a word (see there). If an infix occurs at the beginning of a string, it's called a prefix (see there); if it occurs at the end of a string, it's called a suffix (see there). In a strict sense an infix may not occur at the beginning or end of a string.
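
The three positions can be told apart with a short Python function. The name position_of is hypothetical, and the sketch uses the strict sense of infix (neither at the beginning nor at the end of the word).

```python
def position_of(part, word):
    """Classify where `part` occurs in `word`.

    Returns 'prefix', 'suffix', 'infix' (strict sense), 'whole word',
    or None if `part` is empty or does not occur in `word`.
    """
    if not part or part not in word:
        return None
    if part == word:
        return "whole word"
    if word.startswith(part):
        return "prefix"
    if word.endswith(part):
        return "suffix"
    return "infix"
```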

list of uncoded tokens A word list (see there) of all strings (see there) that are not coded by a content analysis using the search patterns of a category system. It is based on a word list and a category system (see there) that is sorted in ascending alphabetical order.
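
A list of uncoded tokens can be sketched like this. The function name uncoded_tokens is hypothetical, and the sketch assumes that every search pattern is a single whole string and that case is ignored; real category systems may also contain parts of words or sequences of words.

```python
from collections import Counter


def uncoded_tokens(text, search_patterns):
    """Return (string, frequency) pairs for strings no search pattern codes.

    The result is sorted in ascending alphabetical order.
    """
    patterns = {p.lower() for p in search_patterns}
    counts = Counter(s.lower() for s in text.split())
    return sorted((s, n) for s, n in counts.items() if s not in patterns)
```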

numeral a number written as a word (e.g.: one, eleven).

output file Many programs write their results into output files that can be processed by other programs (see also file). Output files can be plain text files, which can be easily copied, or HTML files, which can be viewed with a browser and are easier to print.

prefix A string (see there) that is at the beginning of a word (see there). A prefix is a special form of an infix (see there). In content analysis a prefix can be a single letter (or another character).

raw text Machine-readable form of a text that can be processed by text analysis software without editing or converting, so that a system file (see there) can be generated. The raw text must follow specific formats; see the software manuals for details.

reverse word list word list (see there) where the words are listed in reverse order (the first character becomes the last, the last character becomes the first). Example: small becomes llams.
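
A reverse word list can be sketched in Python as shown here. The function name reverse_word_list is hypothetical; sorting the reversed strings alphabetically is an assumption of this sketch, chosen because it groups words with the same ending.

```python
from collections import Counter


def reverse_word_list(text):
    """Return (reversed string, frequency) pairs, sorted alphabetically."""
    counts = Counter(text.lower().split())
    # word[::-1] reverses the characters: 'small' becomes 'llams'.
    return sorted((word[::-1], n) for word, n in counts.items())
```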

sequence number Raw texts in fixed format (see there) must contain a sequence number because a text unit often does not fit in one input line. This applies to INTEXT, Textpack, and TextQuest.

special characters all strings that start with neither a letter nor a digit. These are e.g. punctuation marks or other characters of the character set (EBCDIC on IBM systems, ASCII or ANSI on PCs).

special word see foreign word.

STOP-word A word list contains all types (see there) of a text. Many of them are not useful for the definition of search patterns. Using a STOP-word file (also called exclusion list, see there), these can be deleted from a word list. Such a file contains articles, pronouns, prepositions and conjunctions.

search pattern an operationalisation of a category (see there); each category has at least one. There are two types of search patterns in TextQuest: strings (words, parts of words or sequences of words) and word root chains (co-occurrences of strings).

string a set of characters that is delimited by a blank in the beginning and the end (or other delimiters).

suffix that part of a string (see there) that forms the end of a string (see there). Search patterns can be defined as suffixes.

system file A file of text units (see there) that is the basic file for all forms of text analyses. Each text unit consists of external variables and the text; the latter is stored with variable length. A system file consists of at least one text unit (see there).

text unit A text unit is the unit of all further analyses; its definition depends on what is to be researched. In readability analysis a text unit must be a sentence; in coding open-ended questions a text unit is one answer to one open-ended question.

token another term for a string (see there) in a text, used in linguistics.

truncate A string can be truncated if it exceeds the maximum length of 80 characters in the following applications: cross references, sorting with IRM (if a sort order table is enabled, the maximum length of a string is 38 characters), and some forms of output of comparisons of word lists.

TTR Type-Token-Ratio. The ratio between the number of different strings (types, see there) and the total number of strings (tokens, see there). The value of the TTR is between 0 and 1; the higher it is, the more heterogeneous is the vocabulary of the text. A value of 0 indicates an empty input file, a value of 1 means that each word occurs only once. The TTR value depends on the length of a text: the longer the text is, the lower the value of the TTR can become.
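
The calculation can be illustrated with a minimal Python sketch. The function name type_token_ratio is hypothetical; splitting on blanks and applying case folding by default are assumptions of this example.

```python
def type_token_ratio(text, case_folding=True):
    """Return the number of types divided by the number of tokens.

    An empty text yields 0.0, matching the convention described above.
    """
    tokens = text.split()
    if case_folding:
        tokens = [t.casefold() for t in tokens]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```

For "the cat saw the dog" there are 4 types and 5 tokens, so the TTR is 4/5 = 0.8.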

type a distinct string (see there) in a text; the number of types is the number of different strings.

variable format one of the many input formats of raw text (see there) that works with control sequences that start with $. It is best used if you have to type in the text yourself.

vertical text The logical record of a text consists of a word together with its external variables. The opposite is called floating text (see there), each logical record consists of a text unit (see there).
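
The conversion from floating text to vertical text can be sketched as follows. The function name to_vertical and the record layout (tuples of external variables followed by one word) are assumptions made for this example, not a file format of any specific program.

```python
def to_vertical(text_units):
    """Turn floating text records into vertical text records.

    text_units: list of (external_variables, text) pairs, one per text unit.
    Returns one record per word: the external variables followed by the word.
    """
    records = []
    for external_variables, text in text_units:
        for word in text.split():
            records.append((*external_variables, word))
    return records
```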

word A word within a text unit consists of all characters between two blanks (or other delimiters like the start or end of a line). The more precise expression is string (see there), although most strings are words.

word list a list of all types (see there) together with their frequency. Sometimes the term frequency table is also used.
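
A word list, optionally filtered by an exclusion list, can be sketched in Python. The function name word_list and its exclusion_list parameter are hypothetical; the sketch splits on blanks and ignores case, which are assumptions.

```python
from collections import Counter


def word_list(text, exclusion_list=()):
    """Return (type, frequency) pairs in ascending alphabetical order.

    Strings found in exclusion_list do not become part of the word list.
    """
    excluded = {w.lower() for w in exclusion_list}
    counts = Counter(s for s in text.lower().split() if s not in excluded)
    return sorted(counts.items())
```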

word length the number of characters in a string.

word root A string (see there) that can be part of another string. Word roots can be in prefix, infix or suffix position (see there).

word root chain several word roots that must occur within one text unit. Up to 6 word roots can be in a word root chain. These can be searched within a text unit in three different modes that vary the order and the distance in which the word roots must occur in the text.
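
The simplest case of a word root chain, where all word roots must occur somewhere in the text unit regardless of order and distance, can be sketched as below. This covers only one assumed mode; the three modes mentioned above are not specified here and are not reproduced. The function name matches_chain is hypothetical.

```python
MAX_ROOTS = 6  # upper limit of word roots per chain, as stated in the glossary


def matches_chain(text_unit, word_roots):
    """Return True if every word root occurs in the text unit.

    Assumption: order and distance are ignored, and case does not matter.
    """
    if not 1 <= len(word_roots) <= MAX_ROOTS:
        raise ValueError("a word root chain needs 1 to 6 word roots")
    unit = text_unit.lower()
    return all(root.lower() in unit for root in word_roots)
```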