The Role of Linguistic Knowledge in Controlled Languages
6.0 Principles of Syntax and Parsing and Their Relationship to Controlled Languages
6.1 Syntax and Parsing
6.2 Coverage
6.3 Coverage Strategies
6.4 Search
6.5 Search Strategies
6.1 Syntax and Parsing
Syntax is analyzed by a parser, using a grammar. The parser analyzes the
syntactic features of a text, such as its constituent noun phrases, verb
phrases, adjectival phrases, determiners, and adverbial phrases. A grammar
may also use non-syntactic features, such as morphological and semantic
information, to perform text analysis.
Strictly speaking, the grammar in most controlled language tools is
a robust grammar [Samuellson00]. That is, it determines whether the input
sentence is grammatical (acting as a recognizer) and performs a structural
analysis on it (acting as a parser), even if the input is not covered by
the grammar (which is what makes it robust). Parsing with such a grammar
can follow a variety of approaches to the mechanical processing of language
input. These include top-down, bottom-up, depth-first, and breadth-first
sequencing of processing steps. The analysis may or may not draw on
additional morphological or lexical information to aid in making parse
determinations.
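The difference between depth-first and breadth-first sequencing can be sketched with a small top-down recognizer. This is a minimal illustration only: the toy grammar, the lexicon, and the function name `recognize` are assumptions made for the example and are not taken from any controlled language tool. The only thing that changes between the two strategies is whether pending states are taken from the agenda as a stack or as a queue.

```python
from collections import deque

# Toy grammar (an assumption for illustration): nonterminal -> right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
# Toy lexicon mapping part-of-speech categories to the words they accept.
LEXICON = {
    "Det": {"the", "a"},
    "N": {"doctor", "symptoms"},
    "V": {"sees", "sleeps"},
}


def recognize(words, strategy="depth"):
    """Top-down recognition; returns (accepted, number of states explored)."""
    # A state is (symbols still to be matched, position in the input).
    agenda = deque([(("S",), 0)])
    explored = 0
    while agenda:
        # Depth-first takes the newest state (stack); breadth-first the oldest (queue).
        symbols, pos = agenda.pop() if strategy == "depth" else agenda.popleft()
        explored += 1
        if not symbols:
            if pos == len(words):
                return True, explored
            continue
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:
            # Top-down prediction: expand the leftmost nonterminal with every rule.
            for rhs in GRAMMAR[first]:
                agenda.append((tuple(rhs) + rest, pos))
        elif pos < len(words) and words[pos] in LEXICON.get(first, ()):
            # Match a part-of-speech category against the next input word.
            agenda.append((rest, pos + 1))
    return False, explored


if __name__ == "__main__":
    sentence = "the doctor sees the symptoms".split()
    for strategy in ("depth", "breadth"):
        print(strategy, recognize(sentence, strategy))
```

Both strategies accept or reject the same inputs; they differ only in the order in which alternatives are tried, and therefore in how many states are explored before an answer is found. The toy grammar deliberately avoids left-recursive rules, which a naive top-down recognizer like this one cannot handle.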
The challenges to effective grammars in a theoretical sense are related to
coverage and search.
Figure 4: General Architecture of Attempto Controlled Language Checker, Showing Parser Module [Fuchs96]
6.2 Coverage
Coverage is an issue because the range of possible inputs is unbounded:
human language draws on large vocabularies, which can be combined in
infinitely many ways. This is further complicated by the fact that inputs
may be technically ungrammatical or incorrect. Such inputs may be analyzed
and interpreted in many different correct ways. When a given input gives
rise to many possible interpretations, this is one type of ambiguity.
6.3 Coverage Strategies
The primary way to address the challenge of coverage is to relax syntactic
constraints when they are not met, so that full coverage is achieved even
if full analysis is not. This relaxation method is commonly seen in
controlled languages. In this way, the grammar still attempts to analyze
parts of inputs, even if it cannot analyze them in full.
In the example below, the construction 'which when seen' is a rather ambiguous adjectival reference back to 'symptoms'.
The disease results in symptoms, which when seen, tell the doctor that care is urgently needed.
If the grammar is unable to resolve the connection between 'symptoms' and 'which when seen', it does not 'cover' this construction. If the example does not fit into the grammar, one way to continue working on the text is simply to ignore the phrase 'which when seen' and continue to parse the rest of the sentence. This way, the remainder can be analyzed.
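The skip-and-continue idea can be sketched as follows. This is a simplified illustration under stated assumptions: `toy_parse` is a hypothetical stand-in for a real parser (it simply declares any text containing a wh-word to be uncovered), and splitting the sentence at commas is an assumed, very crude way of finding segments to retry.

```python
import re

# Assumption: the toy grammar has no rules for clauses containing these words.
WH_WORDS = {"which", "who", "whom", "whose"}


def toy_parse(text):
    """Hypothetical stand-in for a parser: returns an analysis, or None if uncovered."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or any(word in WH_WORDS for word in words):
        return None
    return words  # a real parser would return a syntactic structure here


def robust_parse(sentence, parse=toy_parse):
    """Try a full parse; if that fails, relax to a segment-by-segment analysis."""
    full = parse(sentence)
    if full is not None:
        return {"complete": True, "analyzed": [full], "skipped": []}
    analyzed, skipped = [], []
    for segment in (part.strip() for part in sentence.split(",")):
        if not segment:
            continue
        result = parse(segment)
        if result is None:
            skipped.append(segment)  # uncovered: ignore it and keep going
        else:
            analyzed.append(result)
    return {"complete": False, "analyzed": analyzed, "skipped": skipped}


if __name__ == "__main__":
    example = ("The disease results in symptoms, which when seen, "
               "tell the doctor that care is urgently needed.")
    print(robust_parse(example)["skipped"])
```

Keeping the skipped segments in the result, rather than discarding them silently, means that later stages still know which parts of the input were never analyzed.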
In the context of controlled languages, however, this gives rise to a problem related to the mission of the controlled language: the control of non-conforming input. The controlled language developer must be careful that relaxing the grammar for the purpose of achieving full coverage does not also result in the controlled language accepting all user input. Non-conforming input should be accepted by the grammar only temporarily, and the user should be informed of the non-conforming sentence. Using the example above, if 'wh-words' such as 'which' are not approved for use in adjectival phrases, they must still be included in the grammar so that the tool can identify and correct them. If the grammar cannot process the input at all, it cannot provide data for downstream correction mechanisms to function, which defeats one of the purposes of the tool.
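One way to picture this requirement is a set of rules that recognize a non-approved construction only in order to report it. The sketch below is an assumption-laden illustration: the rule name, the regular expression, and the `Diagnostic` record are all hypothetical, and a real checker would work from a syntactic analysis rather than from surface patterns.

```python
import re
from dataclasses import dataclass


@dataclass
class Diagnostic:
    rule: str   # name of the rule that matched
    text: str   # the non-conforming phrase
    start: int  # character offset of the match in the sentence


# Hypothetical "error rules": constructions that are recognized only so that
# they can be reported to the user and handed to a correction step.
ERROR_RULES = {
    "wh-word in adjectival phrase": re.compile(
        r",\s*(which|who|whom|whose)\b[^,]*,", re.IGNORECASE
    ),
}


def check(sentence):
    """Accept the sentence for analysis, but return a diagnostic for each violation."""
    diagnostics = []
    for name, pattern in ERROR_RULES.items():
        for match in pattern.finditer(sentence):
            phrase = match.group(0).strip(", ")
            diagnostics.append(Diagnostic(name, phrase, match.start()))
    return diagnostics


if __name__ == "__main__":
    example = ("The disease results in symptoms, which when seen, "
               "tell the doctor that care is urgently needed.")
    for diagnostic in check(example):
        print(f"{diagnostic.rule}: '{diagnostic.text}'")
```

The point of the sketch is that the non-approved construction has to be describable by some rule; input that matches no rule at all produces no diagnostic and so gives the correction step nothing to work with.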
6.4 Search
Search is concerned with the fact that more than one analysis
may be correct. Even if a grammar is able to account for all possible inputs,
it would be challenged by the huge number of choices available. A
simple instance of the search problem is demonstrated by the sentence:
Put the block [in the box on the table].
vs. Put [the block in the box] on the table.
This sentence has two possible correct interpretations. This combinatorial choice problem is frequently encountered in the form of noun compounds, a particularly pointed issue in technical documents like those subject to controlled languages.
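The two readings can be counted mechanically. The sketch below is a minimal parse counter over a toy grammar; the grammar, the lexicon, and the function name `count_parses` are assumptions chosen for this illustration (in particular, the rule for 'put' is assumed to require both an object and a location). It counts how many distinct analyses the grammar assigns to a word sequence without building the trees.

```python
from functools import lru_cache

# Toy grammar (an assumption for illustration): 'put' needs an object and a location.
GRAMMAR = {
    "VP": [("V", "NP", "PP")],
    "NP": [("Det", "N"), ("NP", "PP")],
    "PP": [("P", "NP")],
}
# Toy lexicon: one part of speech per word.
LEXICON = {
    "put": "V", "the": "Det", "block": "N", "box": "N",
    "table": "N", "room": "N", "in": "P", "on": "P",
}


def count_parses(words, goal="VP"):
    """Count the distinct analyses of `words` as a `goal` phrase."""
    words = tuple(word.lower() for word in words)

    @lru_cache(maxsize=None)
    def count(symbol, start, end):
        # Number of ways `symbol` can derive words[start:end].
        if symbol not in GRAMMAR:
            return int(end - start == 1 and LEXICON.get(words[start]) == symbol)
        return sum(count_seq(rhs, start, end) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_seq(symbols, start, end):
        # Number of ways a sequence of symbols can derive words[start:end].
        if len(symbols) == 1:
            return count(symbols[0], start, end)
        total = 0
        # Leave at least one word for each of the remaining symbols.
        for mid in range(start + 1, end - len(symbols) + 2):
            left = count(symbols[0], start, mid)
            if left:
                total += left * count_seq(symbols[1:], mid, end)
        return total

    return count(goal, 0, len(words))


if __name__ == "__main__":
    print(count_parses("put the block in the box on the table".split()))
```

Under this toy grammar the sentence receives exactly two analyses, matching the two bracketings shown above. Appending a third prepositional phrase (for example "in the room") raises the count to five, which illustrates how quickly the number of choices grows.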
In a technical document, we might see the noun compound: "front bearing ring groove repair tool". Does this mean:
Another way to control for choice is to use syntactic or semantic constraints already specified in the domain. One extremely effective application of this is limiting sentence length, a constraint commonly used in controlled languages. This constraint alone makes a substantial difference in limiting combinatorial contributions to ambiguity. Other commonly used examples are constraints on allowable verb tenses, constrained word meanings, and constraints on noun compound lengths and meanings. The use of both syntactic and semantic constraints is the critical foundation of controlled languages.
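The effect of a length constraint can be made concrete by counting bracketings. The number of distinct binary bracketings of a sequence of n words is the Catalan number C(n-1), which gives an upper bound on the purely structural readings of a noun compound; a real grammar and domain knowledge would rule many of them out. The function name below is an assumption made for this sketch.

```python
from math import comb


def binary_bracketings(n_words):
    """Number of distinct binary bracketings of n_words items: Catalan(n_words - 1)."""
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)


if __name__ == "__main__":
    compound = "front bearing ring groove repair tool".split()
    # Show how the count grows as the compound gets longer.
    for length in range(2, len(compound) + 1):
        print(length, binary_bracketings(length))
```

For the six-word compound "front bearing ring groove repair tool" this gives 42 possible bracketings, whereas a three-word compound has only 2, which shows why limits on length reduce ambiguity so sharply.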
Richard Altwarg
Macquarie University Graduate Program in Speech and Language Processing
SLP803 An Introduction to Language Technology
This site last updated November 20, 2000.
Comments and corrections welcome: raltwarg@earthlink.com