3. Defining the Minimum Diphone Set for Synthesis of Intelligible Mandarin
The primary purpose of this paper is to define the minimum diphone set which would result in intelligible Mandarin speech synthesis.  I approach this problem in the following manner:
1) Define all the basic phonemes of Mandarin, classifying them by type
2) Construct a syntagmatic analysis of Mandarin phonemes: a matrix describing all phoneme combinations that could exist in Mandarin
3) Evaluate the matrix for redundant or unnecessary combinations which can be pruned

Chinese Characters and Phonetic Symbols
Defining the phonemes of Mandarin (or any language) requires a set of symbols which are able to accurately represent these sounds.  Mandarin differs substantially from English in its character-based structure.  While there are relationships between the way in which a character is written and the way in which it is pronounced, Chinese is not a phonetic writing system. All words are composed of one or more characters, and many words are only one character. The majority of characters are a single syllable, and there are relatively few characters with more than one pronunciation.  Because of this, there is a natural inclination to define the language, whether textually or acoustically, in terms of characters.

Mandarin does have a number of different alphabetic representations aimed at a phonetic representation of the language, developed at different times and used in different Mandarin-speaking locales.  These include the Wade-Giles system, Roma-tzyh, Mandarin Phonetic Symbols (ZYFH), and Pinyin.  Pinyin, developed in the course of mainland Chinese language reforms in the 1950s, is the most widely used alphabetic representation system, and is the one used in this paper.  Pinyin operates on a phonetic basis, and is successful in this regard.  However, my purpose is phonological -  to define the possible set of phones and diphones in Mandarin - and Pinyin is wholly inadequate to the task. My solution to this problem is to also use Worldbet, an ASCII-compatible version of IPA. I use Pinyin examples and representations alongside the Worldbet representations for those more familiar with Pinyin than IPA or Worldbet.

The Mandarin Phone Inventory
In some respects, the phonemic structure of Mandarin is quite simple.  Mandarin contains 21 consonants, 5 semi-vowels, 4 diphthong vowels, and 13 monophthong vowels.

Mandarin Consonants
The consonants are generally straightforward and there is common agreement concerning their definition [Worldbet, Shih96, Hu99].  The consonants of Mandarin are listed in Table 5, along with the medial and final vowels which can follow them to form a syllable.

Mandarin Vowels
However, vowels are another story.  Perhaps because of the lack of a universal phonemic symbol system for Mandarin, or the traditional focus on the character as the smallest sub-unit, or the complexity of vowel-vowel coarticulation, there is some disagreement over the number of vowels in Mandarin.

My working definition of Pinyin vowels before pruning calls for 13 single vowels, five semi-vowels, and four diphthong vowels (total 22).  As represented in Pinyin, Mandarin's monophthong vowels are highly allophonic.  Nearly every representation of a vowel in Pinyin actually represents two or more phonemes.  The traditional Pinyin scheme only allows for six unique vowel symbols in Mandarin, while Worldbet calls for a total of 22, and [Shih96] call for a total of 18 for the ATT text to speech system.  Table 1 illustrates the allophones of Pinyin monophthong vowels in a scheme generally in agreement with [Shih96, Worldbet, Chao68, Hu99].

Table 1. Pinyin-Worldbet Monophthong Vowels and Allophones
Pinyin Vowel   Pinyin Allophone   Pinyin Example   Worldbet Vowel   Worldbet Example
a              a                  ta               A                t[hA
a              a                  tan              @                t[h@n
e              e                  te               2                t[h2
e              e                  teng             &                t[h&N
e              e                  tie              E                t[hi:E
e              e                  ger              &r               k&r
i              i                  ti               i                t[hi:
i              i                  ting             I                t[hIN
i              i                  si               If               sIf
i              i                  shi              4r               sr4r
o              o                  tuo              >                t[h>
u              u                  tu               u                t[hu
ü              ü                  --               y                Cy
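
The grouping in Table 1 can be made concrete with a small lookup.  The following Python sketch is illustrative only: the tuple layout and the names TABLE_1 and allophones_by_pinyin are assumptions introduced here, and the Worldbet strings are copied from the table as printed.

```python
from collections import defaultdict

# Rows of Table 1 as (Pinyin vowel, Pinyin example, Worldbet vowel).
# None marks the one example that is absent from the table as printed.
TABLE_1 = [
    ("a", "ta", "A"),
    ("a", "tan", "@"),
    ("e", "te", "2"),
    ("e", "teng", "&"),
    ("e", "tie", "E"),
    ("e", "ger", "&r"),
    ("i", "ti", "i"),
    ("i", "ting", "I"),
    ("i", "si", "If"),
    ("i", "shi", "4r"),
    ("o", "tuo", ">"),
    ("u", "tu", "u"),
    ("ü", None, "y"),
]


def allophones_by_pinyin(rows):
    """Group the Worldbet vowels under the Pinyin letter that spells them."""
    grouped = defaultdict(list)
    for pinyin, _example, worldbet in rows:
        grouped[pinyin].append(worldbet)
    return dict(grouped)


ALLOPHONES = allophones_by_pinyin(TABLE_1)

if __name__ == "__main__":
    for pinyin, phones in ALLOPHONES.items():
        print(pinyin, len(phones), phones)
```

Grouped this way, the six Pinyin vowel letters expand to the 13 monophthong vowels counted in this section, with a, e, and i each covering more than one Worldbet vowel.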

Mandarin's four diphthong vowels are straightforward, and are shown in Table 2 below:

Table 2. Mandarin Diphthong Vowels
Pinyin Diphthong Vowel   Pinyin Example   Worldbet Diphthong Vowel   Worldbet Example
ai                       dai              ai                         t[ai
ao                       dao              aO                         t[aO
ei                       dei              ei                         t[ei
ou                       dou              oU                         t[oU

In addition to these vowels, Mandarin has five semi-vowels:

Table 3. Mandarin Semi-vowels
Pinyin Semi-vowel   Pinyin Example   Worldbet Semi-vowel   Worldbet Example
l                   lu               l                     lu
r                   ru               r+                    r+u
i                   yu               j                     ju
ua                  tuan             w                     t[wAn
ua                  juan             jw                    cCjwn

The first two of these semi-vowels are clear, and present no redundancies or overlap with other phones.  However, the last three are slightly tricky.  The /j/ is almost identical to the high monophthong vowel /i:/. The /w/ is almost identical to the monophthong consonant /v/.  And the /jw/ is almost identical to /y/ (the umlaut ü in Pinyin). Shih considers these as on-glides [Shih96]. The /j/ is particularly tricky because it may precede other monophthong and diphthong vowels in Mandarin: /A/, /E/, /aU/, and /oU/.  There is a substantial level of coarticulation when this /j/ is produced together with any of these other vowels, and it has been suggested that these coarticulations are unique phones [Hu99].  However, the coarticulatory effects can be reproduced by pairing the full, high vowels together with the adjacent monophthong or diphthong vowels as diphones, and do not require the addition of new phones to the list.

In sum, before pruning, there are 21 consonants, 5 semi-vowels, 4 diphthong vowels, and 13 monophthong vowels in the total possible universe of Mandarin phonemes.

Intra- and Inter- Syllable Structure and Rules
The next step is to see how these combine, both within character syllables and between character syllables, in order to further define (and then reduce) the number of possible combinations. Tables 5, 6, and 7 show all possible intra-syllabic diphones for Mandarin.

Intra-Syllabic Structure
In Mandarin, there is a clear set of constraints on the phonetic structure of each character.  Traditional Chinese philology defines Mandarin phonetics in terms of initials, medials, and finals. The correspondence between onset and rime and Mandarin's initial, medial, and final designations (from [Li99]) is shown here:

Figure 3. Mandarin Initial, Medial, Final vs. Onset and Rime

Initials may be consonants or vowels, medials are vowels, and finals are vowels or nasals. This is an advantage in defining the minimum set of diphones for Mandarin, because the total number of ways in which consonants and vowels combine within a syllable is somewhat constrained.  Based on these constraints, one can define all intra-syllabic phone strings; the total possible universe of diphones (and syllables) in Mandarin.

The possible combinations are:
Initial
Initial-Final
Initial-Medial
Medial-Final

Initials
The only initial phones which occur without a medial or final are vowels, as shown at the bottom of Table 5. These are /A/, /2/, /i:/, /ai/, /aO/, and /oU/ (Pinyin a, e, i, ai, ao, ou).

Initial-Final and Initial-Medial
The possible combinations of initial-final and initial-medial include almost all of the consonants, and many of the vowels.  These are shown in Tables 5 and 6.  Note that some of the phones are used quite rarely. In particular, the vowels /If/, /4r/, and /y/ (Pinyin i, i, ü) do not occur after plosives or nasals, as seen in Table 5.  Table 6 shows that the /C/-related fricatives/affricates (/C/, /cC/, /cCh/) (Pinyin x, j, q) do not precede any of the diphthong vowels.  These limited occurrences help to further constrain the required diphone set.

Medial-Final
The simple and relatively straightforward medial-final vowel-nasal finals are shown in Table 7:  /@n/, /@N/, /&n/, /&N/, /In/, /IN/, and />N/ (Pinyin an, ang, en, eng, in, ing, ong).

All that remains is the inclusion of the medial-final vowel-vowels, which are somewhat more complicated because these vowel-vowel combinations result in a great number of coarticulated sounds, some of which could be mistaken for unique phones.

Table 4. Mandarin Medials + Vowel Finals
Pinyin   Example   Worldbet   Example
ia       xia       jA         hjA
ian      xian      j&n        hj&n
iao      xiao      jaU        hjaU
iu       xiu       joU        hjoU
ie       xie       IjE        hIjE
io       xiong     i>         hi>N
ua       wa        wA         wA
uai      wai       wai        waI
ui       wei       wei        wei
üe       --        jw         jw

Inter-Syllabic Structure
Because inter-syllabic transitions do not share any of the linguistic constraints of the intra-syllabic rules, they are very straightforward.  The possible set of inter-syllabic transitions is simply the combination of all:
Medials-Initials
Finals-Initials
These are shown in Tables 8-11.

Total Count of Intra-Syllabic and Inter-Syllabic Diphones
Tables 5-7 list 171 possible intra-syllabic combinations, giving 171 intra-syllabic diphones.  For inter-syllable combinations, any word can follow any other; any initial could follow any final in spontaneous speech or text.  Thus the number of all possible final-initial combinations for inter-syllabic combinations is simply the number of finals times the number of initials, or 19 X 31, totaling 589 diphones to represent all inter-syllabic transitions.  171 intra-syllabic diphones + 589 inter-syllabic diphones = 760 total possible Mandarin diphones.
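
The count above can be reproduced mechanically.  The following Python sketch is a check under stated assumptions, not part of the original analysis: the labels F1..F19 and I1..I31 are placeholders standing in for the finals and initials listed in Tables 8-11, and only the totals (171, 19, and 31) are taken from the text.

```python
from itertools import product

INTRA_SYLLABIC = 171  # intra-syllabic diphones counted from Tables 5-7
N_FINALS = 19         # number of finals, as stated in the text
N_INITIALS = 31       # number of initials, as stated in the text

# Placeholder labels only; the real phone symbols are in Tables 8-11.
finals = [f"F{i}" for i in range(1, N_FINALS + 1)]
initials = [f"I{i}" for i in range(1, N_INITIALS + 1)]

# Any initial may follow any final across a syllable boundary,
# so the inter-syllabic diphones are the full cross product.
inter_syllabic = list(product(finals, initials))

total_diphones = INTRA_SYLLABIC + len(inter_syllabic)

if __name__ == "__main__":
    print(len(inter_syllabic), total_diphones)
```

Because the inter-syllabic set is a full cross product, every phone removed from the list of initials removes 19 diphones at once, which is what the reduction strategies below exploit.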

Diphone Reduction Strategies
The second major part of defining the smallest set of diphones required to produce intelligible Mandarin is finding ways in which to reduce the phone inventory or diphone requirements.

Allophonic Semi-Vowels
One way is to see if there are phones which are very similar to one another.  I mentioned above the similarities between Mandarin's semi-vowels  /j/, /w/ and /jw/ (Pinyin i, ua, üe) and their corresponding high vowels /i:/, /v/, and /y/ (Pinyin i, w, ü). Because they can be reproduced almost identically by other phones [Shih96], and because the coarticulation effects can be reproduced when paired with other vowel phones, these are the first three sounds I prune from the inventory.  Thus, /j/, /w/ and /jw/ are replaced by /i:/, /v/, and /y/.  On an intra-syllabic basis, this reduces the diphone count by 14.  When paired with the 19 finals in final-initial combination on an inter-syllabic basis, 3 X 19 = 57 diphones are eliminated.  The total number of diphones pruned from this category is 14 + 57 = 71.
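
As a rough illustration of this substitution, the Python sketch below rewrites a phone sequence before it is cut into diphones.  The names GLIDE_TO_VOWEL, replace_glides, and diphones are hypothetical, and the segmentation of the Table 3 example t[wAn into four phones is an assumption made for the example.

```python
# Semi-vowel -> high vowel substitutions described in the text.
GLIDE_TO_VOWEL = {"j": "i:", "w": "v", "jw": "y"}

N_FINALS = 19          # finals, as stated in the text
INTRA_PRUNED = 14      # intra-syllabic diphones removed, as stated in the text


def replace_glides(phones):
    """Replace each pruned semi-vowel with its high-vowel counterpart."""
    return [GLIDE_TO_VOWEL.get(p, p) for p in phones]


def diphones(phones):
    """Return the adjacent phone pairs of a sequence."""
    return list(zip(phones, phones[1:]))


# Assumed segmentation of the Table 3 example "tuan" (t[wAn).
tuan = ["t[", "w", "A", "n"]
tuan_pruned = replace_glides(tuan)

# Inter-syllabic savings: each pruned phone no longer pairs with any final.
inter_pruned = len(GLIDE_TO_VOWEL) * N_FINALS
total_pruned = INTRA_PRUNED + inter_pruned

if __name__ == "__main__":
    print(diphones(tuan_pruned), total_pruned)
```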

Allophonic Affricates
Another set of highly similar sounds which may be considered for diphone reduction are among the affricates.  From an articulatory point of view, /tsr/ and /cC/ (Pinyin zh and j) and /thsr/ and /cCh/ (Pinyin ch and q) are different; they have different places of articulation, but similar manners of articulation.  The first pair are both voiceless affricates, of which /tsr/ is retroflex while /cC/ is palatal.  The second pair are both voiceless aspirated affricates, of which /thsr/ is retroflex while /cCh/ is palatal.  Substituting the retroflex versions for the palatal versions would remove two initial consonants from the inventory.  There is no doubt that this change will have a substantial effect on the quality of synthesis; these phonemes can be distinguished by native speakers, and the combination of retroflex initial-high vowel is in contradiction to the normal phonotactic rules of Mandarin.  However, the phonemes do exist in complementary distribution, and I expect listeners to understand which affricate is used based on context.  (Frankly, I would be more comfortable making this decision with acoustic data to support it; this is an enhancement which I could undertake at a later date.) These combine with three vowels in an intra-syllabic fashion for six diphones.  The three phones would combine with the 19 finals on an inter-syllabic basis for a sum of 57 diphones which can be removed from the set.  The total number of diphones pruned from this category is 6 + 57 = 63.

'Silent' Consonants
All the plosives and affricates (/p/, /ph/, /t[/, /t[h/, /k/, /kh/, /ts/, /tsh/, /tsr/, /thsr/, /cC/, and /cCh/) (Pinyin b, p, d, t, g, k, z, c, zh, ch, j, q) are voiceless and begin with a section of silent closure [Shih96], so when paired with any other phone, all of these can be considered to appear as /PHONE/ - /#C/ (e.g., /nn/ - /#p/, /NN/ - /#p/).  Based on this, we can eliminate all the diphones composed of final-initial plosive and final-initial affricate.  This is 19 finals X 6 initial plosives and 19 finals X 6 initial affricates, a total of 228 diphones.
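
One way to read this rule is sketched below in Python.  The sketch is an interpretation rather than a confirmed implementation: it assumes that the boundary is synthesized as a transition from the final into silence followed by a unit that starts from the closure, and the names SILENT_ONSET, CLOSURE, and boundary_units are hypothetical.

```python
# Plosives and affricates listed in the text; all begin with silent closure.
SILENT_ONSET = {
    "p", "ph", "t[", "t[h", "k", "kh",             # plosives
    "ts", "tsh", "tsr", "thsr", "cC", "cCh",       # affricates
}

CLOSURE = "#"   # assumed symbol for the silent closure
N_FINALS = 19   # finals, as stated in the text


def boundary_units(final, initial):
    """Return the unit(s) used at a syllable boundary.

    If the next syllable starts with a plosive or affricate, the final
    runs into silence and no dedicated final-initial diphone is needed.
    """
    if initial in SILENT_ONSET:
        return [(final, CLOSURE), (CLOSURE, initial)]
    return [(final, initial)]


# Final-initial diphones made unnecessary by the rule.
pruned = N_FINALS * len(SILENT_ONSET)

if __name__ == "__main__":
    print(boundary_units("nn", "p"), boundary_units("nn", "m"), pruned)
```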

Rare Usage
The last set of items I identified are two diphone combinations which make up two single characters, both of which are archaic and rarely occur in contemporary usage:
/gei/ (kei): archaic, meaning 'to beat or fight'
/nou/ (nou): archaic, meaning 'plow'

Summary of Diphone Reductions
This series of diphone reductions can be summarized as follows:
Allophonic semi-vowels: 71
Allophonic affricates: 63
'Silent' consonants: 228
Rare usage: 2
Total: 364

Subtracting from the original total: 760 - 364 = 396.  Thus, I conclude that the minimum diphone set for synthesis of intelligible Mandarin is 396 diphones.  (As a point of reference, [Shih96] indicate that the 1987 version of the ATT Mandarin synthesis system used 492 diphones; it did not use vowel-affricate, final-affricate, vowel-plosive, or final-plosive transition diphones.)
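
The tally can be checked with a few lines of Python.  This is only an arithmetic check of the figures stated in this section; the dictionary name REDUCTIONS and the variable names are introduced here for illustration.

```python
# Diphones pruned per category, as stated in the summary above.
REDUCTIONS = {
    "allophonic semi-vowels": 71,
    "allophonic affricates": 63,
    "'silent' consonants": 228,
    "rare usage": 2,
}

TOTAL_BEFORE_PRUNING = 760  # 171 intra-syllabic + 589 inter-syllabic

total_pruned = sum(REDUCTIONS.values())
minimum_set = TOTAL_BEFORE_PRUNING - total_pruned

if __name__ == "__main__":
    print(total_pruned, minimum_set)
```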


Diphone Definition for Mandarin Speech Synthesis

Richard Altwarg
Macquarie University, Speech and Language Processing
SLP807 Speech Synthesis