Macrostructure and Microstructure in the print dictionary
Abstract:
In this section we present a theory and method of the structural analysis of print dictionaries.
Macrostructure
The macrostructure of a dictionary is, first and foremost, the ordered list of its entries. The entry head (the lemma) is part of the macrostructure as well as of the microstructure and therefore assumes a pivotal role. In most dictionaries of the western languages the lemma is the base form of a lexical item, which represents the full-form paradigm of this lexical item. The process of reducing a word form to its base form is called lemmatisation. In natural language analysis, lemmatisation is an important step if one wants to match a textual form of a lexical item with its dictionary entry form in order to access further information about this lexical item. Another important issue in the generation of a lemma list is homonymy vs. polysemy, which we have treated in an earlier chapter. Homonymous lexical items receive more than one entry in the dictionary, while the different readings of a polysemous lexical item are conventionally grouped in one entry. The grouping policy may, however, differ slightly between dictionaries.
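The role of lemmatisation in dictionary access can be illustrated with a minimal sketch. It assumes a hand-made full-form table and a two-entry dictionary; both tables and the function names are hypothetical and only serve the illustration. The inflected forms are derived from the inflection information given in the example entries of this section.

```python
# Hypothetical full-form table: maps inflected word forms to their lemma.
FULL_FORMS = {
    "Pamphlets": "Pamphlet",
    "Pamphletes": "Pamphlet",
    "Pamphlete": "Pamphlet",
    "Rappen": "Rappe",
}

# Hypothetical mini-dictionary, keyed by lemma.
DICTIONARY = {
    "Pamphlet": "Streit- oder Schmähschrift",
    "Rappe": "schwarzes Pferd",
}


def lemmatise(word_form):
    """Reduce a word form to its base form; unknown forms are returned unchanged."""
    return FULL_FORMS.get(word_form, word_form)


def look_up(word_form):
    """Match a textual form with its dictionary entry via the lemma."""
    return DICTIONARY.get(lemmatise(word_form))


print(look_up("Rappen"))
```

A realistic lemmatiser would use morphological rules or a large full-form lexicon instead of a small table, but the access path (text form, lemma, entry) stays the same.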
There are still some minor problems in finding an adequate lemma for a lexical item:
- If the rule for adjectives is to use the predicative form as the lemma, this rule cannot be applied to adjectives which are not used predicatively, e.g. German *letzt*, which could instead be registered as letzt, letzte (r,s), or letzt-; the same problem arises with nouns which are derived from adjectives (e.g., Langer, Lange).
- Spelling variants may be due to phonological phenomena, e.g. schwa-deletion (duss(e)lig), or to a spelling reform which allows the use of allographs (Potential, Potenzial); in some rare cases, variants result from a competition between a numeral and its written-out form (zehnfach, 10-fach). A manual for lexicographers as well as instructions for users should describe how these cases are handled.
- The exact form of multi-word lexemes, especially idioms, is often difficult to determine, e.g., einen Bärendienst erweisen, jemandem einen Bärendienst erweisen, jemandem einen echten (wirklichen, wahrhaftigen, richtigen...) Bärendienst erweisen?
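One of the cases listed above, the compact notation for spelling variants such as duss(e)lig, lends itself to a small sketch. The function below expands every optional segment in parentheses into all spellings the notation stands for. The function name and the reading of the parentheses as "optional segment" are assumptions made for this illustration; individual dictionaries may use the notation differently.

```python
import re
from itertools import product


def expand_variants(notation):
    """Expand optional segments in parentheses into all spelling variants."""
    # re.split with a capturing group keeps the optional segments:
    # 'duss(e)lig' -> ['duss', 'e', 'lig']
    parts = re.split(r"\(([^()]*)\)", notation)
    # Even positions hold fixed text, odd positions hold optional segments.
    choices = [[part] if i % 2 == 0 else ["", part] for i, part in enumerate(parts)]
    return ["".join(combination) for combination in product(*choices)]


print(expand_variants("duss(e)lig"))
```

With such an expansion, an electronic lemma list can index both spellings and point them to the same entry.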
In most dictionaries of the western languages, the list of headwords is ordered alphabetically, starting with the first letter of the headword. There are some specialised dictionaries in which headwords are ordered alphabetically, but beginning with the last letter of the headword ('backward-sorted dictionary'). The same holds for Arabic-language dictionaries. Dictionaries of syllabic languages follow still other principles for the ordering of entries. And, last but not least, there are dictionaries which order lexical items according to their meaning and not according to their form (wordnets are a well-known example of such resources).
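The difference between the usual alphabetical order and a backward-sorted lemma list can be sketched in a few lines. The headword list is taken from examples used in this section. The sort keys are a simplification: real dictionaries follow language-specific collation rules (e.g. for umlauts), which this sketch does not model.

```python
headwords = ["Rappe", "Pamphlet", "famos", "ziehen", "tragen"]

# Usual order: compare headwords from the first letter onwards.
forward = sorted(headwords, key=str.casefold)

# Backward-sorted dictionary: compare headwords from the last letter onwards,
# which is the same as sorting by the reversed string.
backward = sorted(headwords, key=lambda word: word.casefold()[::-1])

print(forward)
print(backward)
```

In the backward-sorted list, words with the same ending (here tragen and ziehen) end up next to each other, which is the reason such dictionaries are used for studying suffixes and rhymes.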
One obvious disadvantage of alphabetical ordering is that the semantic or conceptual structure of a vocabulary is not reflected at all. For example, lie (the noun) and untruth, which are synonyms, are far apart in the lemma list, while lie (the verb), which has almost nothing to do with lie (the noun), is close by. Alphabetically ordered dictionaries try to remedy this shortcoming through cross-references which connect semantically related words. Better still in this respect are those dictionaries which arrange lexical items according to their meaning.
With electronic dictionaries, which are accessible either on a local PC or over the Internet, the macrostructure plays a far less important role. One typically uses a search engine to find the entry which is needed, so that the order of articles hardly matters for search and access. The potential of electronic dictionaries therefore lies in revealing lexical-semantic structures and matching them with conceptual structures. Only a few electronic dictionaries, however, use this potential (see, for example, the One Look Reverse Dictionary function).
Microstructure
The term microstructure denotes the structure, i.e. the information items and their relations, of a single dictionary entry. A second concept can be derived from it: the general structure of all entries of a particular dictionary. The latter is sometimes called abstract microstructure, while the former is called concrete microstructure.
The term microstructure was first introduced by Rey-Debove (cf. Rey-Debove, Josette, 1971). Rey-Debove used it to characterize her strictly linear analysis of dictionary entries. A sound theoretical account of dictionary entry structures, which accounts not only for linear precedence relations between lexicographic text segments but also for immediate dominance relations, has been developed successively by Wiegand (cf. Hausmann, Franz Josef ; Wiegand, Herbert Ernst in: Hausmann, Franz Josef ; Reichmann, Oskar et al. (Ed.), 1989 and Wiegand, Herbert Ernst in: Hausmann, F. ; Reichmann, O. et al. (Ed.), 1989).
The following example is a dictionary entry for the headword Pamphlet, taken from Drosdowski, Günter (Ed.), 1996:
Example 1
(i) Pamph|let, das: -[e]s, -e [frz. Pamphlet, engl. Pamphlet = Broschüre, H. u.]: Streit- oder Schmähschrift: ein politisches P., ein P. gegen jmdn. schreiben, verfassen
This entry contains the following information: Pamphlet has two syllables, Pamph and let; the stress is on the second syllable, which has a long vowel. Pamphlet is a noun with neuter gender; its genitive singular form is formed by attaching either -es or -s to the stem. The nominative plural is formed by attaching -e to the stem.
The word has come into the German language from the English language via the French language; the origin of the word is not known. It is used mainly by well-educated speakers with a negative connotation and means Streitschrift or Schmähschrift. The entry is completed with some usage examples: ein politisches Pamphlet, ein Pamphlet gegen jemanden schreiben and ein Pamphlet gegen jemanden verfassen.
In the following, we present an overview of the most important information elements in entries of monolingual print dictionaries (cf. Wiegand, Herbert Ernst in: Hausmann, F. ; Reichmann, O. et al. (Ed.), 1989). Since the terminology is originally German, we add in brackets the German term and the abbreviation introduced by Wiegand:
- Each entry consists of two main parts: one part containing information about the form of the lexical item ('Formkommentar', (FK)), the other containing information about the meaning of the lexical item ('semantischer Kommentar', (SK));
- The citation form of the lemma (Lemmazeichengestaltangabe (LZGA)) is part of the form section;
- Information about the spelling and pronunciation of the lexical item ('Ausspracheangabe' (AusA), 'Akzentangabe' (AkzA), 'Vokalqualitätsangabe' (VQA), 'Silbenangabe' (SA), 'Rechtschreibangabe' (RA), 'Worttrennungsangabe') is also part of the form section;
- Information about morphological features of the lexical item ('Flexionsangabe' (FlA), 'Genusangabe' (GA), 'Graduierungsangabe' (GradA), 'Kompositumsangabe' (KompA), 'Wortfamilienangabe', 'Numerusangabe' (PlbA)) is also part of the form section;
- Information about (morpho-)syntactic features of the lexical item ('Wortartenangabe' (WAA), 'Angabe zur syntaktischen Valenz' (VVA), 'Adjektivdistributionsangabe' (attributive, predicative or adverbial)) is also part of the form section;
- Syntacto-semantic information items (SynSem) ('Kollokationsangabe' (KollA), 'Idiomangabe', 'Sprichwortangabe' (SprichwA), 'Kompetenzbeispielangabe' (KBeiA), 'Belegbeispielangabe' for corpus-based citations (BBeiA), including 'Belegstellenangabe' (BStA)) are largely part of the meaning section;
- Semantic information items ('Bedeutungsangabe' (BA), 'Bedeutungsparaphrasenangabe' (BPA), 'Synonymenangabe' (SynA), 'Antonymenangabe' (AntA), 'Polysemieangabe' (PA), 'Illustrationsangabe', 'Wortäquivalenzangabe' (WÄA)) are part of the meaning section. Note that some of these information items are implicit or explicit cross-references (i.e. the ones that point to neighbours in the lexical-semantic field);
- Information items concerning the proper use and usage restrictions of a lexical item ('Pragmatische Angaben' (PragA)): relating to the subject ('Fachgebietsangabe' (FGA)), the stylistic level ('Stilschichtenangaben' (StilA)), the frequency of use ('Häufigkeitsangabe' (HA)), the diachronic level (diachrA), geographic restriction of usage, etc. They are typically subsumed under the meaning section;
- Other: etymology ('Etymologische Angabe' (EtyA)), cross-references ('Verweisangabe' (VerwA)).
To illustrate this list, we will present two example entries:
Example 2
(1) Rappe, der: -en, -en 'schwarzes Pferd'
Figure 1: Entry Rappe from Drosdowski, Günter (Ed.), 1996 (tree structure)
A more complex entry example for the adjective famos, taken from Kempcke, Günter (Ed.), 2000, is presented by Engelberg and Lemnitzer (cf. Engelberg, Stefan ; Lemnitzer, Lothar, 2001, p. 139):
Figure 2: Entry famos from Kempcke, Günter (Ed.), 2000 (tree structure)
The structure tree is represented in the following figure:
Figure 3
The examples we have presented so far are examples of concrete microstructures. If we want to abstract from them to come to a general entry model, we have to take the following into account:
- There are some information items which are obligatory for all entries, e.g. the specification of the lemma ('Lemmazeichengestaltangabe');
- Some information items concern lexical items of a specific part of speech. They are therefore obligatory for entries of lexical items belonging to that class, and not applicable to all other lexical items (e.g. gender for German nouns);
- Some information items are optional for all (classes of) entries, e.g. citations ('Beispielangaben', 'Belegangaben');
- Some information items are optional for some classes of entries, and not applicable to other classes, e.g. gradation (of adjectives and some adverbs).
A complete description of the abstract microstructures of dictionary entries for a particular dictionary must contain:
- a complete list of all information items which are used in this dictionary;
- a list of all entry classes (defined by the lemma type);
- for all information items and all classes of entries, a specification of whether they are obligatory, optional or not applicable to this entry class;
- assignment of an abstract microstructure to each lemma type.
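The components of such a description can be sketched as a small data structure. The sketch below is an assumption-laden illustration, not Wiegand's formalism: it uses only two entry classes and four information items (LZGA, GA, GradA, KBeiA), with the status values taken from the examples given above. The names Status, ABSTRACT_MICROSTRUCTURE and validate are hypothetical.

```python
from enum import Enum


class Status(Enum):
    OBLIGATORY = "obligatory"
    OPTIONAL = "optional"
    NOT_APPLICABLE = "not applicable"


# For every entry class: the status of every information item.
ABSTRACT_MICROSTRUCTURE = {
    "noun": {
        "LZGA": Status.OBLIGATORY,       # lemma: obligatory for all entries
        "GA": Status.OBLIGATORY,         # gender: obligatory for nouns
        "GradA": Status.NOT_APPLICABLE,  # gradation: not applicable to nouns
        "KBeiA": Status.OPTIONAL,        # examples: optional for all entries
    },
    "adjective": {
        "LZGA": Status.OBLIGATORY,
        "GA": Status.NOT_APPLICABLE,
        "GradA": Status.OPTIONAL,
        "KBeiA": Status.OPTIONAL,
    },
}


def validate(entry_class, items):
    """Check a concrete entry (item name -> value) against its abstract microstructure."""
    spec = ABSTRACT_MICROSTRUCTURE[entry_class]
    problems = []
    for name, status in spec.items():
        if status is Status.OBLIGATORY and name not in items:
            problems.append(f"missing obligatory item {name}")
        if status is Status.NOT_APPLICABLE and name in items:
            problems.append(f"item {name} not applicable to {entry_class}")
    for name in items:
        if name not in spec:
            problems.append(f"unknown item {name}")
    return problems


print(validate("noun", {"LZGA": "Rappe", "GA": "der"}))
print(validate("adjective", {"LZGA": "famos", "GA": "der"}))
```

A table of this kind covers the first three components of the description; the assignment of an abstract microstructure to each lemma type would additionally fix the order and nesting of the items.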
A full account of the dictionary entry structures (or (abstract) microstructures) of a dictionary is a necessary prerequisite for the parsing of these entries. Machine-readable versions of print dictionaries have very often been parsed in order to make the information they contain available for natural language processing. We will cover this topic later on.
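To give a first impression of what parsing an entry involves, the following sketch analyses the short entry Rappe from Example 2 into a form comment (FK) and a semantic comment (SK). The regular expression is an assumption which fits only entries of exactly this simple shape (lemma, article, genitive and plural suffix, paraphrase in single quotes); the names ENTRY_PATTERN, GENDER and parse_entry are hypothetical. Real entry parsers need a grammar for each abstract microstructure of the dictionary.

```python
import re

# Pattern for entries of the shape: Lemma, article: -genitive, -plural 'paraphrase'
ENTRY_PATTERN = re.compile(
    r"^(?P<lemma>\w+), (?P<article>der|die|das): "
    r"(?P<genitive>-[\w\[\]]*), (?P<plural>-[\w\[\]]*) "
    r"'(?P<paraphrase>[^']*)'$"
)

GENDER = {"der": "masculine", "die": "feminine", "das": "neuter"}


def parse_entry(text):
    """Split a simple noun entry into form comment (FK) and semantic comment (SK)."""
    match = ENTRY_PATTERN.match(text.strip())
    if match is None:
        return None  # the entry does not follow the assumed structure
    return {
        "FK": {
            "LZGA": match["lemma"],
            "GA": GENDER[match["article"]],
            "FlA": {"genitive": match["genitive"], "plural": match["plural"]},
        },
        "SK": {"BPA": match["paraphrase"]},
    }


parsed = parse_entry("Rappe, der: -en, -en 'schwarzes Pferd'")
print(parsed)
```

Entries which deviate from the assumed structure are rejected rather than guessed at, which is why a complete account of the abstract microstructures is needed before parsing can succeed on a whole dictionary.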
It is also a necessary prerequisite for the media-independent compilation of dictionary entries. In this process it is necessary to clearly separate the function of each entry segment from a specific layout. One must therefore start with a data model which resembles the abstract microstructure(s) of entries. Figure 3 is an example of a dictionary entry in which the information types are marked up using XML. Figure 4 is a representation of that same entry in a web browser. The same entry could easily be prepared for presentation in print or on a mobile device. We will cover this topic in the next chapter.
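The separation of function and layout can be sketched with Python's standard XML library. The element names (entry, form, lemma, article, inflection, sense, paraphrase) are hypothetical and are not the markup used in the figures; the two rendering functions merely stand for two output media.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical markup: each element names the function of a segment, not its layout.
ENTRY_XML = """<entry>
  <form>
    <lemma>Rappe</lemma>
    <article>der</article>
    <inflection>-en, -en</inflection>
  </form>
  <sense>
    <paraphrase>schwarzes Pferd</paraphrase>
  </sense>
</entry>"""


def render_print(entry):
    """Plain-text layout, close to the printed entry."""
    lemma = entry.findtext("form/lemma")
    article = entry.findtext("form/article")
    inflection = entry.findtext("form/inflection")
    paraphrase = entry.findtext("sense/paraphrase")
    return f"{lemma}, {article}: {inflection} '{paraphrase}'"


def render_html(entry):
    """HTML layout for a web browser; same data, different presentation."""
    lemma = html.escape(entry.findtext("form/lemma"))
    article = html.escape(entry.findtext("form/article"))
    inflection = html.escape(entry.findtext("form/inflection"))
    paraphrase = html.escape(entry.findtext("sense/paraphrase"))
    return f"<p><b>{lemma}</b>, {article}: {inflection} <i>{paraphrase}</i></p>"


entry = ET.fromstring(ENTRY_XML)
print(render_print(entry))
print(render_html(entry))
```

Because the layout lives only in the rendering functions, a further medium can be supported by adding a function without touching the entry data.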
Cross-references
Cross-references connect two dictionary entries, or two segments of dictionary entries. Pragmatically, they are used to point users to another part of the dictionary at which they can expect further information on the topic which is treated at the source of the cross-reference. A typographical convention is to represent a cross-reference by an arrow. In an alphabetically ordered word list, cross-references are a handy tool to reconstruct the lexical-semantic relations between the lexical items (synonymy, antonymy, etc.). Another common use, especially in bilingual and learners' dictionaries, is to relate irregularly inflected forms to their base forms (e.g. trug to tragen). Cross-references can also help to avoid redundancies: a multi-word lexeme or collocation is described in the entry of one of its meaning-bearing parts, while the entries of the other meaning-bearing words contain only cross-references to the former entry (e.g. jmdn. über den -> Leisten ziehen under the headword ziehen). But one has to keep in mind that redundancy can only be avoided at the price of the users' convenience.
Not all kinds of cross-references which are used in dictionaries are useful or appreciated by the reader. Cross-references which are circular, i.e. entry B points to entry A and entry A points to entry B, should be avoided. The compilers of dictionaries should also consider the question whether the information found at the target of a cross-reference is really helpful for answering the questions for which the dictionary is consulted.
The overall consistency of dictionary entries is another prerequisite for good cross-referencing. Otherwise it can happen that the target of a cross-reference does not exist at all!
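Both problems, circular cross-references and cross-references without a target, can be detected mechanically once the cross-references are available as data. The following sketch assumes a hypothetical table which maps each headword to the headwords it points to; the table content is made up for the illustration, including a deliberately missing entry for Potential.

```python
# Hypothetical cross-reference table: headword -> headwords it points to.
CROSS_REFERENCES = {
    "trug": ["tragen"],
    "tragen": [],
    "ziehen": ["Leisten"],
    "Leisten": [],
    "lie": ["untruth"],
    "untruth": ["lie"],
    "Potenzial": ["Potential"],  # no entry 'Potential' in this table
}


def dangling_references(xrefs):
    """Return (source, target) pairs whose target is not an entry."""
    return [
        (source, target)
        for source, targets in xrefs.items()
        for target in targets
        if target not in xrefs
    ]


def circular_pairs(xrefs):
    """Return pairs of entries which point to each other."""
    pairs = set()
    for source, targets in xrefs.items():
        for target in targets:
            if source != target and source in xrefs.get(target, []):
                pairs.add(tuple(sorted((source, target))))
    return sorted(pairs)


print(dangling_references(CROSS_REFERENCES))
print(circular_pairs(CROSS_REFERENCES))
```

The check covers only direct circles between two entries, as described above; longer chains which lead back to their starting point would require a graph traversal.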
For cross-media publishing of dictionary data it is useful to distinguish cross-references which model relations in the vocabulary being described from cross-references which reflect the structure of the dictionary text. For example, all cross-references which help to reduce redundancies and to save space in print are superfluous in the electronic version of the dictionary and are an obstacle to its proper use. Cross-references which reflect lexical-semantic relations should, however, be kept. The computer can even provide new and exciting ways to visualize these structures.
For the representation of lexical-semantic structures in electronic dictionaries and in lexical resources for NLP, a structural approach which has been introduced by Gibbon (cf. Gibbon, Dafydd in: Carstensen, Kai-Uwe et al. (Ed.), 2001) and is called Mesostruktur is worth reviewing. It seems to be an adequate approach to modelling lexical resources which make use of object hierarchies and default inheritance mechanisms.
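What default inheritance in an object hierarchy means can be shown with a very small sketch. It is not Gibbon's model: the class Node, the feature names and the choice of default values are assumptions made for this illustration. A node inherits every feature from its parent unless it specifies its own value.

```python
class Node:
    """A node in a lexical hierarchy with default inheritance."""

    def __init__(self, name, parent=None, **features):
        self.name = name
        self.parent = parent
        self.features = features

    def get(self, feature):
        """Look up a feature locally first, then along the chain of parents."""
        node = self
        while node is not None:
            if feature in node.features:
                return node.features[feature]
            node = node.parent
        raise KeyError(feature)


# Illustrative defaults for a noun class; the values are assumptions.
noun = Node("noun", pos="noun", genitive="-[e]s", plural="-e")

# Pamphlet states only its gender and inherits the inflection defaults.
pamphlet = Node("Pamphlet", parent=noun, gender="neuter")

# Rappe overrides the inflection defaults with its own values.
rappe = Node("Rappe", parent=noun, gender="masculine", genitive="-en", plural="-en")

print(pamphlet.get("plural"))
print(rappe.get("plural"))
```

Only the information which deviates from the class default has to be stored with the individual lexical item, which keeps the resource compact and makes regularities explicit.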