Controlled Languages: An Introduction

The Role of Linguistic Knowledge in Controlled Languages

Morphology and the Lexicon in Controlled Languages

4.0  Morphology and the Lexicon in Controlled Languages
    4.1 The Lexicon
    4.2 Lexicon Development
    4.3 Terminology Selection and Lexical Disambiguation
    4.4 Corpus Analysis and Lexicon Development
    4.5 Lexical Entry Tagging
    4.6 Lexical Disambiguation

4.1 The Lexicon
A lexicon is a corpus or term bank: a collection of the words, phrases, and terms used in a controlled language.  A controlled language may restrict its terminology to the entries in the lexicon, or may simply refer to the lexicon [Huijsen98, Bernth98, Nyberg98].  For example, the word 'boring' might be an entry in the lexicon with the meaning 'drilling'.  However, if the controlled language prohibits the use of gerunds as adjectives ('The boring machine made a hole for the tunnel very quickly.'), that rule will constrain how the word may be used.  In a highly constrained controlled language, it is possible that only the words in the lexicon would be permitted (I have not seen an example of this).

4.2 Lexicon Development
In many cases, the lexicon is developed using naturally occurring texts [Nyberg98, Barthe96, Barthe98, Zhang98, Almqvist96].
Counting the occurrences of words and terms is one common way of analyzing natural language to establish which terms to include in a controlled language [Barthe98, Nyberg98]. But the terminology base for a lexicon can also be hand-selected according to other criteria determined by its creators [Nyberg96].  The foremost criterion for the design of machine-processable controlled languages is that they be, in fact, machine processable.  This means that one must be able to formalize the language, and that mappings of meanings must be very narrowly defined.  One way to do this is to severely limit the size of the lexicon.  'Computer Processable English' [Grover00] has only 19 verbs.
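
The frequency-based selection described above can be sketched in a few lines. This is only an illustration: the function name candidate_terms, the min_count threshold, and the two sample sentences are assumptions made for the example, not taken from any of the cited projects.

```python
import re
from collections import Counter

def candidate_terms(texts, min_count=2):
    """Return (word, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep alphabetic runs only; real projects would lemmatize.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [(word, n) for word, n in counts.most_common() if n >= min_count]

# Hypothetical sample texts standing in for a corpus of technical documents.
docs = [
    "Remove the filter. Clean the filter housing.",
    "Install the filter. Tighten the housing bolts.",
]
print(candidate_terms(docs))
```

A list like this is only a starting point; as noted above, the creators of a lexicon may still add or remove terms by hand.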

4.3 Terminology Selection and Lexical Disambiguation
Terms may be identified in a lexicon as either "approved" or "unapproved", meaning that the lexicon can tell the user whether a term is allowed or prohibited in the controlled language.  A manufacturer with whom I once worked did not have the words 'defect' or 'reject' in its vocabulary; the correct term was 'reclaim'. In addition, terms may be approved for some meanings but disapproved for others, as in the 'boring' example above.  This is one of the key ways in which semantic disambiguation is performed in a controlled language [Huijsen98, Fuchs96].  If the lexicon limits the term 'boring' to its present and past tense forms ('bore, bores, bored'), and permits it to be used only together with an auxiliary verb, it is confined to mean 'uninterested', not 'drill' or 'bear down upon'.
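
A minimal sketch of an approved/unapproved check follows, using the 'defect', 'reject', and 'reclaim' example above. The dictionary layout, the field names "status" and "use_instead", and the function check_terms are hypothetical choices for illustration; they do not describe any particular controlled language tool.

```python
import re

# Hypothetical lexicon: each term carries a status, and unapproved terms
# name the approved term to use instead.
LEXICON = {
    "reclaim": {"status": "approved"},
    "defect": {"status": "unapproved", "use_instead": "reclaim"},
    "reject": {"status": "unapproved", "use_instead": "reclaim"},
}

def check_terms(sentence):
    """Return (unapproved word, suggested replacement) pairs in order of appearance."""
    findings = []
    for word in re.findall(r"[a-z]+", sentence.lower()):
        entry = LEXICON.get(word)
        if entry and entry["status"] == "unapproved":
            findings.append((word, entry["use_instead"]))
    return findings

print(check_terms("Reject the part if you find a defect."))
```

Words that are not in the lexicon at all are simply ignored here; a stricter language could report them as unknown.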

4.4 Corpus Analysis and Lexicon Development
Here is part of a sample report from such an analysis, performed during the development of the lexicon for Scania Controlled Swedish [Almqvist96].  The analysis attempts to identify the terms in an existing corpus.  It provides very basic information about the number of simple, phrasal, and numerical words in the corpus. 'Types' refers to lemmas. This analysis was performed as a pilot test on 40 documents (984 pages) related to a Scania truck:

 
            Tokens    %    Types    %
Simple      69,666   80    8,819   84
Phrasal      2,503    3      337    3
Numerical   14,348   17    1,335   13
Table 1.  Tokens and Types in the Controlled Swedish Pilot Corpus
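
The distinction between tokens (every occurrence) and types (distinct items) in Table 1 can be illustrated with a small sketch. Everything in it is an assumption made for the example: the phrase list PHRASAL_UNITS, the English sample sentence, and the use of lowercase word forms in place of true lemmas. It is not the procedure used for the Scania corpus.

```python
import re

# Hypothetical multi-word units; a real project would take these from a term list.
PHRASAL_UNITS = {("fuel", "pump"), ("oil", "filter")}

def count_tokens_and_types(text):
    """Return {category: (token count, type count)} for simple, phrasal, numerical."""
    words = re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower())
    found = {"simple": [], "phrasal": [], "numerical": []}
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in PHRASAL_UNITS:
            found["phrasal"].append(" ".join(pair))  # two words count as one item
            i += 2
        elif words[i][0].isdigit():
            found["numerical"].append(words[i])
            i += 1
        else:
            found["simple"].append(words[i])
            i += 1
    # Tokens are all occurrences; types are the distinct forms among them.
    return {cat: (len(items), len(set(items))) for cat, items in found.items()}

sample = ("Remove the fuel pump. Check the oil filter and the fuel pump "
          "after 500 km or 500 hours.")
print(count_tokens_and_types(sample))
```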

One way to develop and define a lexicon is to extract a portion of a publicly or commercially available corpus such as the Penn Treebank, WordNet, or the International Corpus of English (www.ucl.ac.uk/english-usage/ice/index.htm).  These corpora provide tags identifying parts of speech, or other semantic or syntactic information, for each item [Zhang98, Nyberg98]. The steps taken in the development of a lexicon for a controlled Chinese are shown in Figure 3 below.

Figure 3: Lexicon Construction for a Chinese Controlled Language [Zhang98]

4.5 Lexical Entry Tagging
Once the items to be included in the lexicon are determined, they are often further analyzed and tagged.  Tags for lexical items may include whether the items are approved, disapproved, or unknown, as well as part of speech, semantic meaning, case, stem, number, gender, etc.

Here is an example of a tagged entry, from Scania Controlled Swedish:

arbetstakten
LEM=ARBETSTAKT.NN        (the lemma of the word, and how it is appended)
INFL=PATTERN.EXIL        (the inflection of the word in this form)
DIC.STEM=ARBETSTAKT      (the stem, or root, of the word)
NUMBER=SING              (whether the word is singular or plural in this form)
WORD.CAT=NOUN            (part of speech in this form)
GENDER=UTR               (gender of the word)
FORM=DEF                 (whether the word is definite or indefinite)
CASE=BASIC               (the case of the word in this form)
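
An entry in this KEY=VALUE layout is easy to read into a program. The sketch below assumes only the layout shown above (a word form on the first line, one tag per line, an optional explanation in parentheses); the function parse_entry is hypothetical and is not part of the Scania tools.

```python
ENTRY = """\
arbetstakten
LEM=ARBETSTAKT.NN        (the lemma of the word, and how it is appended)
INFL=PATTERN.EXIL
DIC.STEM=ARBETSTAKT
NUMBER=SING
WORD.CAT=NOUN
GENDER=UTR
FORM=DEF
CASE=BASIC
"""

def parse_entry(text):
    """Split a tagged entry into its word form and a dict of tag -> value."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    surface, *tag_lines = lines
    tags = {}
    for line in tag_lines:
        key, _, rest = line.partition("=")
        # Drop any trailing explanation in parentheses.
        tags[key.strip()] = rest.split("(", 1)[0].strip()
    return surface, tags

surface, tags = parse_entry(ENTRY)
print(surface, tags["WORD.CAT"], tags["NUMBER"])
```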

Domain specific lexicons are developed for specific needs and domains [Farrington96: aircraft; Huijsen98: telecommunications, forklifts; Zhang98: Chinese; Nyberg98: industrial equipment]. A controlled language tool could be used to convert an automotive manual authored in the US for use in the UK by performing morphology and spell checking and by identifying certain words to be used in preference to others.  These are two very common tasks performed by controlled language tools.  In this example, consider US automotive and UK automotive as two different controlled languages.  The morphology/spelling component could identify occurrences of the term 'color' (and convert them to 'colour'), and the word 'trunk' could be converted to the word 'boot' for the UK. The tool can change the words automatically, or flag them for the author or editor, offering the preferred word and letting that person accept, modify, or reject the change.  In fact, the LantMark controlled language checker and many others are based on just such a premise, using a machine translation approach to translate free-form English to controlled language English [Knops98].

Markup languages such as HTML and SGML can tag document types for specific domains, leading to pre-processing for technical terms, proper nouns, or other domain specific lexical features [Nyberg96].
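
The US-to-UK conversion described above can be sketched as two lookup tables plus a step that either reports suggestions or applies them. The table contents come from the 'color' and 'trunk' example; the function names suggest_changes and apply_changes are hypothetical, and this is not a description of how LantMark or any other checker is implemented.

```python
import re

# Illustrative mappings from the example above; a real tool would load
# far larger tables.
SPELLING = {"color": "colour"}
PREFERRED_TERMS = {"trunk": "boot"}

def suggest_changes(text):
    """List (position, original, suggestion, reason) without altering the text."""
    suggestions = []
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in SPELLING:
            suggestions.append((match.start(), match.group(), SPELLING[word], "spelling"))
        elif word in PREFERRED_TERMS:
            suggestions.append((match.start(), match.group(), PREFERRED_TERMS[word], "preferred term"))
    return suggestions

def apply_changes(text, accepted):
    """Apply accepted suggestions right to left so earlier positions stay valid.

    Capitalization of the original word is not preserved in this sketch.
    """
    for start, original, replacement, _ in sorted(accepted, reverse=True):
        text = text[:start] + replacement + text[start + len(original):]
    return text

sentence = "Check the color of the trunk lining."
found = suggest_changes(sentence)
print(found)
print(apply_changes(sentence, found))
```

Keeping the suggestion step separate from the apply step mirrors the two modes mentioned above: the tool can apply every suggestion automatically, or show the list to the author and apply only the ones that are accepted.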

4.6 Lexical Disambiguation
As noted above, a lexicon also operates as a disambiguation tool, limiting the meaning, or sense, of certain terms.  This applies to domain specific or customized lexicons like the automotive example above, but also to generalized applications where limiting the sense of a word can make text easier to understand (and to machine translate).  In a medical domain, a domain specific lexicon might limit the meaning of the term “joint” to “bone junctures”, excluding the meaning “collective”.  In a general purpose controlled language, a lexicon could limit the word “right” to its meaning “right-hand side”, forcing the author to use the word “correct” for the other sense.

One outstanding example in the literature comes from the airline industry, where the author asserts that the word 'round' has 40 different meanings:

"Round the edges of the round cap.  If it then turns round and round as it circles round the casing, another round of tests is required" [Farrington96].

A more advanced step in the development of semantic lexicons is the development of terminology class hierarchies and relations.  In a class lexicon, each lexical entry is defined as an instance of various classes and sub-classes; and instances, sub-classes, and classes are related to one another. This type of representation contains a much higher level of semantic identification, defining the relationships between and among different entry items.  Since these relationships are defined, it can be said that knowledge is contained in the structure.  Nasr and Kittredge discuss this kind of approach in the design of a controlled language, using a structure they call "Deep Syntactic Structure", based on "Meaning Text Theory" [Nasr98].  In this system, 'rain' is associated with a function 'Magn', which expresses intensity (Magn(rain) = heavy), and with a function called 'Syn', which relates synonyms.  The associations and relations among words and concepts in their structure work together with a rule set to deconstruct a sentence into concepts.  The concepts are held in a frame, and a new sentence is generated to express the concepts in the frame.

Further, the relationships can be defined in a first order logic system, and how entities act upon one another can also be defined [Sowa00, Pulman96]: bats hit balls and people push doors, but houses don't drive cars. In a related disambiguation strategy, the Attempto Controlled English (ACE) system parses all of an author's input into Prolog and subjects it to a logic proof.  If a sentence is logically acceptable in itself, it is further checked to ensure that it does not contradict the sentence preceding it.  In systems like this, analysis tools can point out to an author concepts (or parts, or ideas, or engine components) that do not belong together, and give the author an opportunity to revise the text. This has many of the features of an ontology, or knowledge base.  Attempts like those in Meaning Text Theory and ACE to encode real-world knowledge into language tools are a relatively new and unexplored area for controlled languages.
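
The idea that class relations can rule out implausible statements can be shown with a toy class hierarchy and a table of verb constraints. The hierarchy, the constraint table, and the function names below are all invented for illustration. The sketch is written in Python rather than Prolog and does not reproduce how ACE or the Meaning Text Theory systems work.

```python
# Hypothetical class hierarchy: each class or instance points to its parent.
PARENT = {
    "bat": "instrument", "instrument": "physical_object",
    "ball": "physical_object", "door": "physical_object",
    "car": "vehicle", "vehicle": "physical_object",
    "house": "building", "building": "physical_object",
    "person": "animate",
    "animate": "entity", "physical_object": "entity",
}

# Hypothetical constraints: verb -> (required subject class, required object class).
VERB_CONSTRAINTS = {
    "hit": ("physical_object", "physical_object"),
    "push": ("animate", "physical_object"),
    "drive": ("animate", "vehicle"),
}

def is_a(word, cls):
    """Walk up the hierarchy from word and report whether cls is reached."""
    while word is not None:
        if word == cls:
            return True
        word = PARENT.get(word)
    return False

def acceptable(subject, verb, obj):
    """Check a subject-verb-object triple against the verb's class constraints."""
    subject_class, object_class = VERB_CONSTRAINTS[verb]
    return is_a(subject, subject_class) and is_a(obj, object_class)

for triple in [("bat", "hit", "ball"), ("person", "push", "door"), ("house", "drive", "car")]:
    print(triple, acceptable(*triple))
```

A checker built along these lines could flag the third triple for the author instead of silently rejecting it.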


Richard Altwarg
Macquarie University Graduate Program in Speech and Language Processing
SLP803 An Introduction to Language Technology

This site last updated November 20, 2000.
Comments and corrections welcome: raltwarg@earthlink.com