4. Tone
Tones are an additional set of features that a speech synthesis system must reproduce. They are a critical feature of a tonal language, and are referred to as lexical tone when the same syllable with a different tone has a different meaning. Tonal languages are commonly divided into two types: contour tone languages and register tone languages. Register tone languages are characterized by relatively few (fewer than four) tones, which are relatively level and apply to word units. Contour tone languages are characterized by tones that move in pitch (such as low-rising or high-falling) and usually apply at the syllable level [Clark95]. In most cases tone is defined with reference to F0, the fundamental frequency of the voice in normal speech.

Mandarin is a contour tone language, with five tones:
 
Tone  Description     Worldbet Tone Mark  Mean F0 (Hz)       Example  Meaning
1     high, level     55                  241-5-228.6        ma1      mother
2     rising          35                  180.8-201.9        ma2      hemp
3     falling-rising  213                 199.3-128.7-241.4  ma3      horse
4     falling         51                  242.3-175.4        ma4      curse, scold
5     neutral         5

(from [Li99])


Figure 4. Mandarin Lexical Tones (X  axis is frequency) (from [Shih0000])

The Worldbet tone mark indicates the pitch of the tone throughout its duration, on a five-level scale where 1 is the lowest pitch and 5 the highest; for example, 213 means that the pitch begins at level 2, falls to level 1, then rises to level 3. The first four tones are considered 'lexical' tones because each one marks a lexical difference: the meaning (and the written form) of each of these syllables is different. The fifth, 'neutral' tone is not lexical, but is used for syllables which are unstressed or reduced.
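
The digit notation can be read mechanically. The following Python sketch is illustrative only: it assumes the five-level scale implied by the table above (1 lowest, 5 highest), and the function names `contour_levels` and `contour_shape` are hypothetical, not part of Worldbet or of any synthesis toolkit.

```python
def contour_levels(mark):
    """Turn a tone mark such as '213' into a list of pitch levels [2, 1, 3]."""
    levels = [int(ch) for ch in mark]
    if not levels or any(level < 1 or level > 5 for level in levels):
        raise ValueError(f"tone mark must use digits 1-5: {mark!r}")
    return levels


def contour_shape(mark):
    """Describe each step between successive pitch levels of a tone mark."""
    levels = contour_levels(mark)
    shape = []
    for start, end in zip(levels, levels[1:]):
        if end > start:
            shape.append("rising")
        elif end < start:
            shape.append("falling")
        else:
            shape.append("level")
    return shape


if __name__ == "__main__":
    # Assumed mapping from tone number to tone mark, copied from the table.
    for tone, mark in [("1", "55"), ("2", "35"), ("3", "213"), ("4", "51")]:
        print(tone, mark, contour_shape(mark))
```

Under this reading, tone 3 (213) comes out as a fall followed by a rise, which matches its description in the table.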

Tone Sandhi
Tones vary based on their context.  These changes, and the rules with which they are associated, are called tone sandhi.  The canonical sandhi of Mandarin relate primarily to Tone 3:

1. Tone 3 + Tone 3: When two tone 3 (213) syllables are adjacent, the first one changes to tone 2 (35). This applies to strings of any length of tone 3 syllables (but may be affected by phrasing):
"would like to buy a good horse"
Ci:@N213  mai213 xaU213 ma213   => Ci:@N35 mai35 xaU35 ma213  (Worldbet)
xiang3 mai3 hao3 ma3   =>  xiang2 mai2 hao2 ma3 (Pinyin)
2. Tone 3 + Tones 1,2,4:  When a third tone syllable is followed by a syllable with a tone other than third tone, it changes to a low-falling tone with the pitch contour 21.  [Chao68] calls this the "1/2 third tone":
"a good book"
xaU213 sru55   =>  xaU21 sru55  (Worldbet)
hao3 shu1   =>  hao1/2 of 3  shu1 (Pinyin)
3. Tone 2 => Tone 1: In a three-syllable string, when a tone 2 syllable in second position is preceded by a tone 1 or tone 2 syllable and followed by any tone other than the neutral tone, it changes to tone 1:
  "the third grade"
  s@n55 nj&n35 cCi:35 => s@n55 nj&n55 cCi:35
  san1 nian2 ji2   =>  san1 nian1 ji2
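
The three rules above can be sketched in Python over tone-numbered Pinyin. This is a minimal sketch under several assumptions that are not in the sources cited: every rule is tested against the citation tones rather than against the output of another rule, phrasing is ignored, rule 3 is applied to any window of three syllables, and the label `3h` for the half third tone is a hypothetical notation.

```python
import re

HALF_THIRD = "3h"  # hypothetical label for the "1/2 third tone" (contour 21)


def parse(syllable):
    """Split a tone-numbered Pinyin syllable such as 'hao3' into ('hao', '3')."""
    match = re.fullmatch(r"([a-z]+)([1-5])", syllable)
    if match is None:
        raise ValueError(f"not a tone-numbered syllable: {syllable!r}")
    return match.group(1), match.group(2)


def apply_sandhi(syllables):
    """Apply the three canonical sandhi rules to a list of syllables."""
    parsed = [parse(s) for s in syllables]
    cited = [tone for _, tone in parsed]  # citation tones, never modified
    out = list(cited)
    for i, tone in enumerate(cited):
        prev_tone = cited[i - 1] if i > 0 else None
        next_tone = cited[i + 1] if i + 1 < len(cited) else None
        if tone == "3" and next_tone == "3":
            out[i] = "2"  # rule 1: tone 3 before tone 3
        elif tone == "3" and next_tone in ("1", "2", "4"):
            out[i] = HALF_THIRD  # rule 2: half third tone
        elif (tone == "2" and prev_tone in ("1", "2")
              and next_tone is not None and next_tone != "5"):
            out[i] = "1"  # rule 3: medial tone 2 becomes tone 1
    return [base + tone for (base, _), tone in zip(parsed, out)]


if __name__ == "__main__":
    print(apply_sandhi(["xiang3", "mai3", "hao3", "ma3"]))
    print(apply_sandhi(["hao3", "shu1"]))
    print(apply_sandhi(["san1", "nian2", "ji2"]))
```

Testing against citation tones keeps the rules from feeding one another; a real system would need to decide on rule ordering and on how phrase boundaries break up runs of tone 3 syllables.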
[Chao68] also describes 'morpho-phonemic' tone sandhi: tone changes that affect four specific characters in different contexts. These are /chCi:/, /pa/, /i:/ and /pu/, the words for seven, eight, one, and not (Pinyin: qi, ba, yi, bu). Of these, the tone sandhi for seven and eight are 'optional', and I will therefore refrain from further explanation of them here. In citation form, 'bu' is a tone 4 syllable, and 'yi' is a tone 1 syllable. In connected speech, they both appear in tone 2 before a tone 4, and in tone 4 before other tones.
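
The 'yi' and 'bu' rule as stated above can be sketched as follows. This is only an illustration: the function name `yi_bu_sandhi` is hypothetical, the optional sandhi for 'qi' and 'ba' are left out, and a final 'yi' or 'bu' with no following syllable is assumed to keep its citation tone.

```python
def yi_bu_sandhi(syllables):
    """Apply the 'yi'/'bu' tone change to tone-numbered Pinyin syllables."""
    out = list(syllables)
    # The last syllable has no following tone, so it is never changed.
    for i, syllable in enumerate(syllables[:-1]):
        base, tone = syllable[:-1], syllable[-1]
        if (base, tone) in (("yi", "1"), ("bu", "4")):
            next_tone = syllables[i + 1][-1]
            # Tone 2 before a tone 4, tone 4 before any other tone.
            out[i] = base + ("2" if next_tone == "4" else "4")
    return out


if __name__ == "__main__":
    print(yi_bu_sandhi(["bu4", "shi4"]))   # 'bu' before a tone 4
    print(yi_bu_sandhi(["yi1", "tian1"]))  # 'yi' before a tone 1
```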

In actuality, there are many other variations which take place between tones in connected speech.  In most cases, the tones of two syllables adjacent to one another appear to adjust their start and end points up or down to meet the tone of the adjoining syllable.  Initial steps towards studying and cataloging these effects have begun, but are not yet complete [Shih96, Shih00, WangR96a, WangR00, Liu98]. These variations also have effects upon (and are affected by) aspects of prosody other than tone, such as duration, stress, phrasing, and sentence level pitch contours [Shih96, Shih00, Wang96, WangR00, Liu98].

Sentential Tone
Tones are implemented in a Mandarin speech synthesis system by first establishing a baseline fundamental frequency (F0) for all outputs. The baseline F0 can be set by hand, or derived from a set of smoothed frequency measurements taken over a speech corpus. The four lexical tones are then defined relative to this baseline, and each output is shifted up or down from it according to its tone. In Figure 5, H and L refer to High and Low tone levels. The figure shows the imposition of lexical tones (the higher curve) on sentence F0 contour tones (the lower curve). It would seem that writing the rules for the basic tone sandhi would not be too problematic, but that implementing the rules for the morphophonemic tone sandhi and all the other inter-syllabic tone variation would be quite daunting.


Figure 5. Fundamental Frequency Tone Integrated with Lexical Tones (from [WangR98b])
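
The baseline-plus-offset idea can be sketched in Python. This is not the method of [WangR98b]; it is a minimal sketch in which pitch level 3 is assumed to sit on the sentence baseline, each pitch level is assumed to be a fixed number of semitones apart (`step_semitones`, an arbitrary value), and all function names are hypothetical.

```python
# Tone marks for the four lexical tones, taken from the table of tones.
TONE_CONTOURS = {"1": "55", "2": "35", "3": "213", "4": "51"}


def level_to_hz(level, baseline_hz, step_semitones=2.0):
    """Map a pitch level (1-5) to Hz, with level 3 placed on the baseline."""
    return baseline_hz * 2.0 ** ((level - 3) * step_semitones / 12.0)


def syllable_targets(tone, baseline_hz, step_semitones=2.0):
    """Return the F0 targets in Hz for one syllable's tone contour."""
    return [level_to_hz(int(ch), baseline_hz, step_semitones)
            for ch in TONE_CONTOURS[tone]]


def sentence_targets(tones, baselines_hz):
    """Impose lexical tones on a per-syllable sentence baseline."""
    if len(tones) != len(baselines_hz):
        raise ValueError("need one baseline value per syllable")
    return [syllable_targets(t, b) for t, b in zip(tones, baselines_hz)]


if __name__ == "__main__":
    # Hypothetical falling sentence baseline over four syllables, in Hz.
    baseline = [210.0, 205.0, 200.0, 195.0]
    for targets in sentence_targets(["1", "2", "3", "4"], baseline):
        print([round(hz, 1) for hz in targets])
```

Working in semitones rather than in a fixed number of Hz keeps the tone shapes proportional when the baseline moves, but that is a design choice of this sketch rather than something the sources prescribe.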

F0 can also be established using more data-driven approaches to tone modeling. These approaches establish a baseline F0 from measurements of a speech corpus, and then derive the tonal variations for different words and syllables from the same corpus. This has been done by linear regression of hand-tagged parameters [Black96, Shih00]. The data-driven approaches have performed somewhat poorly, lacking sufficient data to model new, unknown inputs. This is a data scarcity issue, which may be resolved with the acquisition and application of larger training sets. Since this method is based on actual speech behavior (rather than a set of rules), I would expect it to produce better results with more training. Perhaps the most likely solution, however, is a hybrid of a data-based language model supplemented by a rule set to address unseen factor combinations.
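
As a reminder of what a regression-based estimate involves, the following Python sketch fits a straight line by ordinary least squares. It is not the model of [Black96] or [Shih00]: it uses a single predictor (syllable position in the phrase) instead of a set of hand-tagged parameters, the function names are hypothetical, and the data points are synthetic values generated from a known line purely to check the fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(x, slope, intercept):
    """Predict a baseline F0 value (Hz) for a syllable position."""
    return slope * x + intercept


if __name__ == "__main__":
    # Synthetic data, NOT corpus measurements: F0 = 220 - 5 * position.
    positions = [0, 1, 2, 3]
    f0_values = [220.0, 215.0, 210.0, 205.0]
    slope, intercept = fit_line(positions, f0_values)
    print(slope, intercept, predict(4, slope, intercept))
```

A model of this kind can only interpolate within the factor combinations it has seen, which is the data scarcity problem described above.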


Diphone Definition for Mandarin Speech Synthesis

Richard Altwarg
Macquarie University, Speech and Language Processing
SLP807 Speech Synthesis