Mandarin is a contour tone language, with five tones:
| Tone | Description | Worldbet Tone Mark | Mean F0 | Example | Meaning |
| 1 | high, level | 55 | 241.5-228.6 | ma1 | mother |
| 2 | rising | 35 | 180.8-201.9 | ma2 | hemp |
| 3 | falling-rising | 213 | 199.3-128.7-241.4 | ma3 | horse |
| 4 | falling | 51 | 242.3-175.4 | ma4 | curse, scold |
| 5 | neutral | 5 | | | |
(from [Li99])
Figure 4. Mandarin Lexical Tones (X axis is frequency) (from [Shih0000])
The Worldbet tone mark indicates the pitch of the tone throughout its duration on a five-level scale, where 1 is the lowest level and 5 the highest; for example, 213 means that the pitch begins at level 2, falls to level 1, then rises to level 3. The first four tones are considered 'lexical' tones because they distinguish words: the meaning (and the written form) of each of the example syllables is different. The fifth, 'neutral' tone is not lexical, but is used to denote syllables which are unstressed or reduced.
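As a small illustration of how a tone mark encodes a contour, the following Python sketch splits a mark such as 213 into its pitch levels and names the movement between them. The function name `describe_contour` is hypothetical, and the sketch assumes every mark is a string of digits from 1 (lowest) to 5 (highest).

```python
def describe_contour(mark):
    """Split a tone mark into pitch levels and the moves between them.

    Assumes `mark` is a string of digits, 1 = lowest level, 5 = highest.
    """
    levels = [int(digit) for digit in mark]
    moves = []
    for current, following in zip(levels, levels[1:]):
        if following > current:
            moves.append("rise")
        elif following < current:
            moves.append("fall")
        else:
            moves.append("level")
    return levels, moves


print(describe_contour("213"))  # tone 3: falls, then rises
print(describe_contour("55"))   # tone 1: stays level
```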
Tone Sandhi
Tones vary based on their context. These changes, and the rules
with which they are associated, are called tone sandhi. The canonical
sandhi of Mandarin relate primarily to Tone 3:
1. Tone 3 + Tone 3: When two tone 3 (213) syllables are adjacent, the first one changes to tone 2 (35). This applies to strings of tone 3 syllables of any length (but may be affected by phrasing):
"would like to buy a good horse"
Ci:@N213 mai213 xaU213 ma213 => Ci:@N35 mai35 xaU35 ma213 (Worldbet)
xiang3 mai3 hao3 ma3 => xiang2 mai2 hao2 ma3 (Pinyin)
2. Tone 3 + Tones 1,2,4: When a third tone syllable is followed by a syllable with a tone other than third tone, it changes to a low-falling tone with the pitch contour 21. [Chao68] calls this the "1/2 third tone":
"a good book"
xaU213 sru55 => xaU21 sru55 (Worldbet)
hao3 shu1 => hao1/2 of 3 shu1 (Pinyin)
3. Tone 2 => Tone 1: In a 3 syllable string, when a tone 2 syllable is preceded by a tone 1 or 2 and followed by any tone other than neutral tone, the 2nd syllable changes to tone 1:
"the third grade"
s@n55 nj&n35 cCi:35 => s@n55 nj&n55 cCi:35 (Worldbet)
san1 nian2 ji2 => san1 nian1 ji2 (Pinyin)
[Chao68] also describes 'morpho-phonemic' tone sandhi, in which the tones of four specific characters change depending on context. These are /chCi:/, /pa/, /i:/ and /pu/, the words for seven, eight, one, and not (Pinyin: qi, ba, yi, bu). Of these, the tone sandhi for seven and eight are 'optional', and I will therefore not explain them further here. In citation form, 'bu' is a tone 4 syllable, and 'yi' is a tone 1 syllable. In connected speech, both appear in tone 2 before a tone 4, and in tone 4 before other tones.
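The three canonical rules and the 'yi'/'bu' pattern described above can be sketched as a small rule table in Python. This is only an illustration under simplifying assumptions, not a complete sandhi model: the rules are applied to citation tones only (they do not feed one another), phrasing is ignored, and the names `apply_sandhi` and `yi_bu_tone` are hypothetical. Surface tones are returned as pitch contours in the notation of the tone table.

```python
# Citation contours taken from the tone table; the neutral tone is left
# as a label because no full contour is given for it.
CITATION = {1: "55", 2: "35", 3: "213", 4: "51", 5: "neutral"}


def apply_sandhi(tones):
    """Return surface contours for a list of citation tone numbers (1-5)."""
    surface = []
    for i, tone in enumerate(tones):
        prev_tone = tones[i - 1] if i > 0 else None
        next_tone = tones[i + 1] if i + 1 < len(tones) else None
        contour = CITATION[tone]
        if tone == 3 and next_tone == 3:
            contour = "35"   # rule 1: tone 3 before tone 3 becomes tone 2
        elif tone == 3 and next_tone in (1, 2, 4):
            contour = "21"   # rule 2: the "1/2 third tone"
        elif tone == 2 and prev_tone in (1, 2) and next_tone in (1, 2, 3, 4):
            contour = "55"   # rule 3: tone 2 becomes tone 1
        surface.append(contour)
    return surface


def yi_bu_tone(word, next_tone=None):
    """Tone of 'yi' or 'bu': citation tone in isolation, otherwise
    tone 2 before a tone 4 syllable and tone 4 before other tones."""
    citation = {"yi": 1, "bu": 4}[word]
    if next_tone is None:
        return citation
    return 2 if next_tone == 4 else 4


print(apply_sandhi([3, 3, 3, 3]))  # xiang3 mai3 hao3 ma3
print(apply_sandhi([3, 1]))        # hao3 shu1
print(apply_sandhi([1, 2, 2]))     # san1 nian2 ji2
print(yi_bu_tone("bu", 4))
```

Applying every rule to the citation tones keeps the sketch short, but it means a tone 2 created by rule 1 never triggers rule 3; a fuller treatment would have to decide on rule ordering and phrase boundaries.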
In actuality, there are many other variations which take place between tones in connected speech. In most cases, the tones of two syllables adjacent to one another appear to adjust their start and end points up or down to meet the tone of the adjoining syllable. Initial steps towards studying and cataloging these effects have begun, but are not yet complete [Shih96, Shih00, WangR96a, WangR00, Liu98]. These variations also have effects upon (and are affected by) aspects of prosody other than tone, such as duration, stress, phrasing, and sentence level pitch contours [Shih96, Shih00, Wang96, WangR00, Liu98].
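One crude way to picture the boundary adjustment described above is to pull the end point of one syllable's F0 track and the start point of the next toward their midpoint. The Python sketch below does only that; the function `smooth_boundary` and its `weight` parameter are hypothetical, and the cited studies do not necessarily model the effect this way.

```python
def smooth_boundary(left, right, weight=0.5):
    """Pull the end of `left` and the start of `right` toward each other.

    `left` and `right` are lists of F0 values in Hz for two adjacent
    syllables. `weight` is an assumed tuning knob: 0 leaves the tracks
    unchanged, 1 makes both boundary points meet at their midpoint.
    """
    midpoint = (left[-1] + right[0]) / 2
    new_left = left[:-1] + [left[-1] + weight * (midpoint - left[-1])]
    new_right = [right[0] + weight * (midpoint - right[0])] + right[1:]
    return new_left, new_right


# Mean F0 values for tone 4 followed by tone 2, taken from the tone table.
print(smooth_boundary([242.3, 175.4], [180.8, 201.9]))
```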
Sentential Tone
Tones are implemented in a Mandarin speech synthesis system by establishing
a baseline fundamental frequency (F0) for all outputs. The baseline
F0 can be set by hand, or estimated from a set of smoothed frequency
measurements taken over a speech corpus. The four lexical tones are then
defined with reference to the baseline F0, and each output is shifted up
or down according to its tonal identity relative to that reference. In
Figure 5, H and L refer to High and Low tone levels. The figure shows
the imposition of lexical tones (the higher curve) on sentence F0 contour
tones (the lower curve). It would seem that writing the rules for
the basic tone sandhi would not be too problematic, but that implementing
the rules for the morpho-phonemic tone sandhi and all the other inter-syllabic
tone variation would be quite daunting.
Figure 5. Fundamental Frequency Tone Integrated with Lexical Tones (from [WangR98b])
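The baseline-plus-offset scheme described above can be sketched in Python as follows. Everything numeric here is an assumption made for illustration: level 3 of the five-level scale is placed on the baseline, each level step is a fixed number of semitones (`step_semitones`), and the sentence baseline declines linearly from `start_hz` to `end_hz`. None of these choices comes from [WangR98b].

```python
def level_to_hz(level, baseline_hz, step_semitones=2.0):
    """Map a tone level (1-5) to Hz, with level 3 sitting on the baseline."""
    return baseline_hz * 2 ** ((level - 3) * step_semitones / 12)


def contour_to_hz(contour, baseline_hz, step_semitones=2.0):
    """Convert a contour such as "213" to F0 targets around a baseline."""
    return [level_to_hz(int(digit), baseline_hz, step_semitones)
            for digit in contour]


def sentence_f0(contours, start_hz=200.0, end_hz=170.0):
    """Impose lexical contours on a linearly declining sentence baseline."""
    count = len(contours)
    targets = []
    for i, contour in enumerate(contours):
        fraction = i / (count - 1) if count > 1 else 0.0
        baseline = start_hz + (end_hz - start_hz) * fraction
        targets.append(contour_to_hz(contour, baseline))
    return targets


# "xiang2 mai2 hao2 ma3" after tone sandhi, as contours.
for syllable_targets in sentence_f0(["35", "35", "35", "213"]):
    print([round(hz, 1) for hz in syllable_targets])
```

Working in semitones rather than raw Hz keeps the tone intervals perceptually constant as the baseline moves, which is one reasonable design choice among several.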
F0 can also be established using more data-driven approaches to tone modeling. These approaches establish a baseline F0 from measurements of a speech corpus, and then derive tonal variations for different words and syllables from the same corpus. This has been done by linear regression of hand-tagged parameters [Black96, Shih00]. The data-driven approaches have performed somewhat poorly, because they lack sufficient data to model new, unseen inputs. This is a data scarcity issue, which may be resolved with the acquisition and application of larger training sets. Since the method is based on actual speech behavior (rather than a set of rules), I would expect it to produce better results with more training data. Perhaps the most likely solution, however, is a hybrid: a data-based language model supplemented by a rule set to address unseen factor combinations.
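The hybrid idea can be sketched in Python as a corpus-derived lookup with a rule-based fallback. Note that this swaps in simple per-context averaging for the linear regression used in [Black96, Shih00], purely to keep the example short. The names `train_means` and `predict_f0`, the choice of (tone, following tone) as the context key, and all F0 values in the usage lines are made up for illustration.

```python
from collections import defaultdict


def train_means(observations):
    """Average observed F0 per context.

    `observations` is an iterable of (context, f0_hz) pairs, where a
    context is assumed here to be a (tone, next_tone) tuple.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for context, f0_hz in observations:
        totals[context][0] += f0_hz
        totals[context][1] += 1
    return {context: total / count
            for context, (total, count) in totals.items()}


def predict_f0(context, means, rule_fallback):
    """Use the corpus mean when the context was seen, else fall back to a rule."""
    if context in means:
        return means[context]
    return rule_fallback(context)


# Toy observations, not real measurements.
means = train_means([((1, 4), 240.0), ((1, 4), 250.0), ((2, 1), 190.0)])


def rule(context):
    # Placeholder rule: a fixed value for any unseen context.
    return 200.0


print(predict_f0((1, 4), means, rule))  # seen context: corpus mean
print(predict_f0((3, 3), means, rule))  # unseen context: rule fallback
```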
Diphone Definition for Mandarin Speech Synthesis
Richard Altwarg
Macquarie University, Speech and Language Processing
SLP807 Speech Synthesis