Chinese Characters and Phonetic Symbols
Defining the phonemes of Mandarin (or any language) requires a set
of symbols which are able to accurately represent these sounds. Mandarin
differs substantially from English in its character-based structure.
While there are relationships between the way in which a character is written
and the way in which it is pronounced, Chinese is not a phonetic writing
system. All words are composed of one or more characters, and many words
are only one character. The majority of characters are a single syllable,
and there are relatively few characters with more than one pronunciation.
Because of this, there is a natural inclination to define the language,
whether textually or acoustically, in terms of characters.
Mandarin does have several alphabetic systems that aim to represent the language phonetically, developed at different times and used in different Mandarin-speaking locales. These include the Wade-Giles system, Roma-tzyh, Mandarin Phonetic Symbols (ZYFH), and Pinyin. Pinyin, developed in the course of mainland Chinese language reforms in the 1950s, is the most widely used alphabetic representation system, and is the one used in this paper. Pinyin operates on a phonetic basis, and is successful in this regard. However, my purpose is phonological (to define the possible set of phones and diphones in Mandarin), and Pinyin is wholly inadequate to the task. My solution to this problem is to also use Worldbet, an ASCII-compatible version of IPA. I give Pinyin examples and representations alongside the Worldbet representations for readers more familiar with Pinyin than with IPA or Worldbet.
The Mandarin Phone Inventory
In some respects, the phonemic structure of Mandarin is quite simple.
Mandarin contains 21 consonants, 5 semi-vowels, 4 diphthong vowels, and
13 monophthong vowels.
Mandarin Consonants
The consonants are generally straightforward and there is common agreement
concerning their definition [Worldbet, Shih96, Hu99]. The consonants
of Mandarin are listed in Table 5, along with the medial and final vowels
which can follow them to form a syllable.
Mandarin Vowels
However, vowels are another story. Perhaps because of the lack
of a universal phonemic symbol system for Mandarin, or the traditional
focus on the character as the smallest sub-unit, or the complexity of vowel-vowel
coarticulation, there is some disagreement over the number of vowels in
Mandarin.
My working definition of Pinyin vowels before pruning calls for 13 single vowels, five semi-vowels, and four diphthong vowels (total 22). As represented in Pinyin, Mandarin's monophthong vowels are highly allophonic: several Pinyin vowel letters, notably a, e, and i, each represent two or more phonemes. The traditional Pinyin scheme allows for only six unique vowel symbols in Mandarin, while Worldbet calls for a total of 22, and [Shih96] calls for a total of 18 for the ATT text-to-speech system. Table 1 illustrates the allophones of the Pinyin monophthong vowels in a scheme generally in agreement with [Shih96, Worldbet, Chao68, Hu99].
Table 1. Pinyin-Worldbet Monophthong Vowels and Allophones
| Pinyin Vowel | Pinyin Allophone | Pinyin Example | Worldbet Vowel | Worldbet Example |
| a | a | ta | A | t[hA |
| a | a | tan | @ | t[h@n |
| e | e | te | 2 | t[h2 |
| e | e | teng | & | t[h&N |
| e | e | tie | E | t[hi:E |
| e | e | ger | &r | k&r |
| i | i | ti | i | t[hi: |
| i | i | ting | I | t[hIN |
| i | i | si | If | sIf |
| i | i | shi | 4r | sr4r |
| o | o | tuo | > | t[h> |
| u | u | tu | u | t[hu |
| ü | ü | xü | y | Cy |
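As a rough illustration of how unevenly the six Pinyin vowel letters map onto Worldbet vowels, the rows of Table 1 can be grouped by Pinyin letter. This is only a minimal Python sketch: the names `TABLE_1` and `allophones_by_pinyin` are hypothetical, and the data are copied directly from the table.

```python
from collections import defaultdict

# Rows of Table 1 as (Pinyin vowel, Pinyin example, Worldbet vowel).
# Copied from the table; the variable and function names are illustrative only.
TABLE_1 = [
    ("a", "ta", "A"), ("a", "tan", "@"),
    ("e", "te", "2"), ("e", "teng", "&"), ("e", "tie", "E"), ("e", "ger", "&r"),
    ("i", "ti", "i"), ("i", "ting", "I"), ("i", "si", "If"), ("i", "shi", "4r"),
    ("o", "tuo", ">"), ("u", "tu", "u"), ("ü", "xü", "y"),
]


def allophones_by_pinyin(rows):
    """Group the Worldbet vowels under the Pinyin letter that spells them."""
    groups = defaultdict(list)
    for pinyin, _example, worldbet in rows:
        groups[pinyin].append(worldbet)
    return dict(groups)


groups = allophones_by_pinyin(TABLE_1)
for pinyin, worldbet in groups.items():
    print(pinyin, len(worldbet), worldbet)
```

Grouping this way makes it easy to see which Pinyin letters are ambiguous and therefore need context (the following final or preceding initial) to select the right Worldbet vowel.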
Mandarin's four diphthong vowels are straightforward, and are shown in Table 2 below:
Table 2. Mandarin Diphthong Vowels
| Pinyin Diphthong Vowel | Pinyin Example | Worldbet Diphthong Vowel | Worldbet Example |
| ai | dai | ai | t[ai |
| ao | dao | aO | t[aO |
| ei | dei | ei | t[ei |
| ou | dou | oU | t[oU |
In addition to these vowels, Mandarin has five semi-vowels:
| Pinyin Semi-vowel | Pinyin Example | Worldbet Semi-Vowel | Worldbet Example |
| l | lu | l | lu |
| r | ru | r+ | r+u |
| i | yu | j | ju |
| ua | tuan | w | t[wAn |
| ua | juan | jw | cCjwn |
The first two of these semi-vowels are clear, and present no redundancies or overlap with other phones. However, the last three are slightly tricky. The /j/ is almost identical to the high monophthong vowel /i:/. The /w/ is almost identical to the monophthong vowel /v/. And the /jw/ is almost identical to /y/ (the umlaut ü in Pinyin). Shih considers these to be on-glides [Shih96]. The /j/ is particularly tricky because it may precede other monophthong and diphthong vowels in Mandarin: /A/, /E/, /aU/, and /oU/. There is a substantial level of coarticulation when this /j/ is produced together with any of these other vowels, and it has been suggested that these coarticulations are unique phones [Hu99]. However, the coarticulatory effects can be reproduced by pairing the full, high vowels together with the adjacent monophthong or diphthong vowels as diphones, and do not require the addition of new phones to the list.
In sum, before pruning, there are 21 consonants, 5 semi-vowels, 4 diphthong vowels, and 13 monophthong vowels in the total possible universe of Mandarin phonemes.
Intra- and Inter- Syllable Structure and Rules
The next step is to see how these combine, both within character syllables,
and between character syllables, in order to further define (and then reduce)
the number of possible combinations. Tables 5, 6,
and 7 show all possible intra-syllabic diphones for Mandarin.
Intra-Syllabic Structure
In Mandarin, there is a clear set of constraints on the phonetic structure
of each character. Traditional Chinese philology defines Mandarin
phonetics in terms of initials, medials, and finals. The correspondence
between onset and rime and Mandarin's initial, medial, and final designations
(from [Li99]) is shown here:
Figure 3. Mandarin Initial, Medial, Final vs. Onset and Rime
Initials may be consonants or vowels, medials are vowels, and finals are vowels or nasals. This is an advantage in defining the minimum set of diphones for Mandarin, because the total number of ways in which consonants and vowels combine within a syllable is somewhat constrained. Based on these constraints, one can define all intra-syllabic phone strings, that is, the total possible universe of diphones (and syllables) in Mandarin.
The possible combinations are:
Initial
Initial-Final
Initial-Medial
Medial-Final
Initials
The only initial phones which occur without a medial or final are vowels,
as shown at the bottom of Table 5. These are
/A/, /2/, /i:/, /ai/, /aO/, and /oU/ (Pinyin a, e, i, ai, ao, ou).
Initial-Final and Initial-Medial
The possible combinations of initial-final and initial-medial include
almost all of the consonants, and many of the vowels. These are shown
in Tables 5 and 6. Note that some of the phones are used quite rarely.
In particular, the vowels /If/, /4r/, and /y/ (Pinyin i, i, ü) do
not occur after plosives or nasals, as seen in Table 5. Table 6 shows
that the /C/-related fricatives and affricates (/C/, /cC/, /cCh/; Pinyin
x, j, q) do not precede any of the diphthong vowels. These limited
occurrences help to further constrain the required diphone set.
Medial-Final
The simple and relatively straightforward medial-final vowel-nasal
finals are shown in Table 7: /@n/, /@N/, /&n/, /&N/, /In/, /IN/, and
/>N/ (Pinyin an, ang, en, eng, in, ing, ong).
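The vowel-nasal finals listed above can be written down as a small lookup, with each Worldbet final split into its vowel-nasal diphone. This is a minimal Python sketch: the names `VOWEL_NASAL_FINALS` and `final_to_diphone` are hypothetical, and the split assumes that every symbol in these seven finals is a single ASCII character, which holds for this list but not for Worldbet in general.

```python
# Pinyin -> Worldbet for the seven vowel-nasal finals, paired in the order
# given in the text. Names are illustrative, not part of any existing tool.
VOWEL_NASAL_FINALS = {
    "an": "@n", "ang": "@N",
    "en": "&n", "eng": "&N",
    "in": "In", "ing": "IN",
    "ong": ">N",
}


def final_to_diphone(worldbet_final):
    """Split a two-symbol Worldbet final into a (vowel, nasal) pair.

    Assumes one ASCII character per symbol, true only for the finals above.
    """
    if len(worldbet_final) != 2:
        raise ValueError("expected exactly two single-character symbols")
    return worldbet_final[0], worldbet_final[1]


diphones = sorted({final_to_diphone(w) for w in VOWEL_NASAL_FINALS.values()})
print(len(diphones), diphones)
```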
All that remains is the inclusion of the medial-final vowel-vowels, which are somewhat more complicated because these vowel-vowel combinations result in a great number of coarticulated sounds, some of which could be mistaken for unique phones.
Table 4. Mandarin Medials + Vowel Finals
| Pinyin | Example | Worldbet | Example |
| ia | xia | jA | hjA |
| ian | xian | j&n | hj&n |
| iao | xiao | jaU | hjaU |
| iu | xiu | joU | hjoU |
| ie | xie | IjE | hIjE |
| io | xiong | i> | hi>N |
| ua | wa | wA | wA |
| uai | wai | wai | waI |
| ui | wei | wei | wei |
| üe | yü | jw | jw |
Inter-Syllabic Structure
Because inter-syllabic transitions do not share any of the linguistic
constraints of the intra-syllabic rules, they are very straightforward.
The possible set of inter-syllabic transitions is simply the combination
of all:
Medials-Initials
Finals-Initials
These are shown in Tables 8-11.
Total Count of Intra-Syllabic and Inter-Syllabic Diphones
The count of all possible intra-syllabic combinations in Tables 5-7 is
171, for 171 diphones. For inter-syllabic combinations, any word can
follow any other, so any initial could follow any final in spontaneous
speech or text. Thus the number of all possible final-initial combinations
is simply the number of finals times the number of initials, or
19 X 31 = 589 diphones to represent all inter-syllabic transitions.
In total, 171 intra-syllabic diphones + 589 inter-syllabic diphones
= 760 possible Mandarin diphones.
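The inter-syllabic count is simply a Cartesian product of finals and initials, which can be checked with a short Python sketch. The labels `F1`..`F19` and `I1`..`I31` are placeholders standing in for the actual finals and initials of Tables 8-11, and the intra-syllabic figure of 171 is taken as given from Tables 5-7.

```python
from itertools import product

N_INTRA = 171  # intra-syllabic diphones counted from Tables 5-7 (taken as given)

# Placeholder labels only: the real finals and initials are listed in the tables.
finals = [f"F{i}" for i in range(1, 20)]      # 19 finals
initials = [f"I{i}" for i in range(1, 32)]    # 31 initials

# Every final may be followed by every initial across a syllable boundary.
inter_syllabic = list(product(finals, initials))

n_inter = len(inter_syllabic)
n_total = N_INTRA + n_inter
print(n_inter, n_total)
```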
Diphone Reduction Strategies
The second major part of defining the smallest set of diphones required
to produce intelligible Mandarin is finding ways in which to reduce the
phone inventory or diphone requirements.
Allophonic Semi-Vowels
One way is to see if there are phones which are very similar to one
another. I mentioned above the similarities between Mandarin's semi-vowels
/j/, /w/ and /jw/ (Pinyin i, ua, üe) and their corresponding high
vowels /i:/, /v/, and /y/ (Pinyin i, w, ü). Because they can be reproduced
almost identically by other phones [Shih96], and because the coarticulation
effects can be reproduced when paired with other vowel phones, these are
the first three sounds I prune from the inventory. Thus, /j/, /w/
and /jw/ are replaced by /i:/, /v/, and /y/. On an intra-syllabic
basis, this reduces the diphone count by 14. When paired with the
19 finals in final-initial combination on an inter-syllabic basis, 3 X
19 = 57 diphones are eliminated. The total number of diphones pruned
from this category is 14 + 57 = 71.
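The substitution and the resulting count can be sketched in Python as follows. The mapping is the one stated above; the function names and the example phone sequence (Worldbet /hjA/ from Table 4) are illustrative assumptions about how a synthesizer front end might apply it, not a description of an existing implementation.

```python
# Semi-vowel -> high vowel substitution, as stated in the text.
SEMIVOWEL_TO_VOWEL = {"j": "i:", "w": "v", "jw": "y"}

N_FINALS = 19       # finals that can precede an initial across syllables
INTRA_PRUNED = 14   # intra-syllabic reduction, taken as given from the text


def replace_semivowels(phones):
    """Replace each pruned semi-vowel with its corresponding high vowel."""
    return [SEMIVOWEL_TO_VOWEL.get(p, p) for p in phones]


def pruned_count(n_semivowels, n_finals, intra_pruned):
    """Intra-syllabic savings plus one final-initial diphone per pairing."""
    return intra_pruned + n_semivowels * n_finals


print(replace_semivowels(["h", "j", "A"]))
print(pruned_count(len(SEMIVOWEL_TO_VOWEL), N_FINALS, INTRA_PRUNED))
```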
Allophonic Affricates
Another set of highly similar sounds which may be considered for diphone
reduction are among the affricates. From an articulatory point of
view, /tsr/ and /cC/ (Pinyin zh and j) and /thsr/ and /cCh/ (Pinyin ch
and q) are different: within each pair the place of articulation differs,
but the manner of articulation is similar. The first pair are both
voiceless affricates, of which /tsr/ is retroflex while /cC/ is palatal.
The second pair are both voiceless aspirated affricates, of which /thsr/
is retroflex while /cCh/ is palatal. Substituting the retroflex versions
for the palatal versions would remove two initial consonants from the
inventory.
There is no doubt that this change will have a substantial effect on the
quality of synthesis; these phonemes can be distinguished by native speakers,
and the combination of retroflex initial-high vowel is in contradiction
to the normal phonotactic rules of Mandarin. However, the phonemes
do exist in complementary distribution, and I will expect listeners to
understand which affricate is used based on context. (Frankly, I
would be more comfortable making this decision with acoustic data to support
it; this is an enhancement which I could undertake at a later date.) These
combine with three vowels in an intra-syllabic fashion for six diphones.
The three phones would combine with the 19 finals on an inter-syllabic
basis for a sum of 57 diphones which can be removed from the set.
The total number of diphones pruned from this category is 6 + 57 = 63.
'Silent' Consonants
All the plosives and affricates (/p/, /ph/, /t[/, /t[h/, /k/, /kh/,
/ts/, /tsh/, /tsr/, /thsr/, /cC/, and /cCh/; Pinyin b, p, d, t, g, k, z,
c, zh, ch, j, q) are voiceless and begin with a section of silent closure
[Shih96], so when paired with any preceding phone, all of these can be
considered to appear as /PHONE/ - /#C/ (e.g., /nn/ - /#p/, /NN/ - /#p/).
Based on this, we can eliminate all the diphones composed of final-initial
plosive and final-initial affricate. This is 19 finals X 6 initial
plosives and 19 finals X 6 initial affricates, a total of 228 diphones.
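The count of diphones removed by this rule can be checked with a minimal Python sketch. The twelve Worldbet symbols are those listed above; `needs_transition_diphone` is a hypothetical helper name, and the figure of 19 finals is taken as given.

```python
# Voiceless plosives and affricates that begin with a silent closure.
PLOSIVES = ["p", "ph", "t[", "t[h", "k", "kh"]
AFFRICATES = ["ts", "tsh", "tsr", "thsr", "cC", "cCh"]
CLOSURE_INITIALS = set(PLOSIVES) | set(AFFRICATES)

N_FINALS = 19  # finals that can precede an initial across a syllable boundary


def needs_transition_diphone(initial):
    """False when the initial starts with silent closure, so the preceding
    final can simply run into silence rather than into a recorded transition."""
    return initial not in CLOSURE_INITIALS


pruned = N_FINALS * len(PLOSIVES) + N_FINALS * len(AFFRICATES)
print(pruned)
```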
Rare Usage
The last set of items I identified are two diphone combinations which
make up two single characters, both of which are archaic and rarely
occur in contemporary usage:
/gei/ (kei): archaic, meaning 'to beat or fight'
/nou/ (nou): archaic, meaning 'plow'
Summary of Diphone Reductions
This series of diphone reductions can be summarized as follows:
Allophonic semi-vowels: 71
Allophonic affricates: 63
'Silent' consonants: 228
Rare usage: 2
Total: 364
Subtracting these reductions from the original total gives 760 - 364 = 396. Thus, I conclude that the minimum diphone set for synthesis of intelligible Mandarin is 396 diphones. (As a point of reference, [Shih96] indicates that the 1987 version of the ATT Mandarin synthesis system used 492 diphones; it did not use vowel-affricate, final-affricate, vowel-plosive, or final-plosive transition diphones.)
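The arithmetic of the summary can be restated as a small ledger. This Python sketch only re-adds the figures reported above; the dictionary keys and variable names are illustrative.

```python
ORIGINAL_TOTAL = 760  # 171 intra-syllabic + 589 inter-syllabic diphones

# Diphones pruned by each reduction strategy, as reported in the summary.
REDUCTIONS = {
    "allophonic semi-vowels": 71,
    "allophonic affricates": 63,
    "silent consonants": 228,
    "rare usage": 2,
}

total_pruned = sum(REDUCTIONS.values())
minimum_set = ORIGINAL_TOTAL - total_pruned
print(total_pruned, minimum_set)
```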
Diphone Definition for Mandarin Speech Synthesis
Richard Altwarg
Macquarie University, Speech and Language Processing
SLP807 Speech Synthesis