Vowel Perceptual Space Mapping
Australian English Subjects
This project examines the perception, by native speakers of Australian English, of a large array of synthesised speech tokens. In a typical condition the speech tokens consist of "long" and "short" vowels in an /h_d/ context (although consonantal context is varied for some conditions). The tokens are generated by a specially modified formant synthesiser and are uniformly spaced in two "vowel spaces", which differ only in vowel duration ("long" versus "short"). Listening subjects are asked to identify each token (the details of exactly what they are asked to do vary from condition to condition). From the responses, a contour map is developed for each vowel phoneme that indicates, for each point in the two vowel spaces, the percentage of responses that specified that vowel phoneme. A composite contour map is then produced that displays the responses for all vowel phonemes.
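The percentage calculation behind each contour map can be illustrated with a minimal Python sketch. It is not the project's own analysis code: the function name `identification_percentages`, the `(F1, F2, vowel)` tuple layout and the example responses are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def identification_percentages(responses):
    """Tally responses per token and convert them to percentages.

    `responses` is an iterable of (f1_hz, f2_hz, vowel_label) tuples,
    one tuple per subject response (a hypothetical layout).
    Returns {(f1_hz, f2_hz): {vowel_label: percent_of_responses}}.
    """
    counts = defaultdict(Counter)
    for f1_hz, f2_hz, vowel in responses:
        counts[(f1_hz, f2_hz)][vowel] += 1

    percentages = {}
    for token, counter in counts.items():
        total = sum(counter.values())
        percentages[token] = {
            vowel: 100.0 * n / total for vowel, n in counter.items()
        }
    return percentages


# Invented example: four responses to one token, two to another.
example_responses = [
    (300, 2200, "i:"), (300, 2200, "i:"), (300, 2200, "i:"), (300, 2200, "I"),
    (700, 1200, "a:"), (700, 1200, "a:"),
]
result = identification_percentages(example_responses)
print(result[(300, 2200)])
```

A token would fall inside a vowel's 75%, 50% or 25% contour when its percentage for that vowel reaches the corresponding level; drawing the contours themselves would need a plotting library and is left out of this sketch.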
A typical pair of vowel space contour maps is displayed in figures 1 and 2. These two diagrams display the responses of phonetically trained subjects responding with Australian English vowel phoneme symbols. These particular vowel spaces have tokens located at each grid intersection point within the defined vowel space boundary. The tokens in this example are in an /h_d/ context, with long vowels defined as 300 ms in length and short vowels as 150 ms in length (for some conditions, these durations are varied). The tokens are confined to a frequency range typical of an adult male speaker. Higher formants in this condition model those of a male speaker. The F0 is 160 Hz, which was defined, for the purpose of this project, as being ambiguous with respect to speaker sex.

Figure 1: Perceptual (F1/F2) space of Australian English subjects listening to a synthetic male voice producing SHORT vowels. 75% (dark red fill), 50% (light grey fill) and 25% identification contours are shown. (Mannell, 1988)

Figure 2: Perceptual (F1/F2) space of Australian English subjects listening to a synthetic male voice producing LONG vowels. 75% (dark red fill), 50% (light grey fill) and 25% identification contours are shown. (Mannell, 1988)
This project has examined the effect of 10 independent variables in various combinations. The project commenced in 1988 and data collection was completed in May 2007. About 700 subjects have participated. Table 1 summarises most of the conditions in this project.

| VWL | Frame | Contx | Phon | Len | Hi Fx | F0 | BW | Lo F3 | F4/F5 | Phon. Naive | #Pnts | #Subj | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 01 | M | h_d | phon | 300 | M | 160 | OK | NO | YES | NO | 224 | 20 | 1988 |
| 02 | F | h_d | phon | 300 | F | 160 | OK | NO | YES | NO | 360 | 20 | 1988 |
| 06 | M | h_d | ortho | 300 | M | 160 | OK | YES | YES | YES | 280 | 30 | 89-91 |
| 07 | F | h_d | ortho | 300 | F | 160 | OK | YES | YES | YES | 289 | 20 | 89-91 |
| 08 | M | h_d | ortho | 300 | M | 110 | OK | YES | YES | YES | 280 | 20 | 89-91 |
| 09 | F | h_d | ortho | 300 | F | 220 | OK | YES | YES | YES | 289 | 20 | 89-91 |
| 10 | M | h_d | ortho | 300 | none | 160 | OK | NO | NO | YES | 224 | 18 | 1991 |
| 11 | F | h_d | ortho | 300 | none | 160 | OK | NO | NO | YES | 230 | 18 | 1991 |
| 12 | M | h_d | ortho | 300 | none | 110 | OK | NO | NO | YES | 224 | 20 | 1991 |
| 13 | F | h_d | ortho | 300 | none | 220 | OK | NO | NO | YES | 230 | 19 | 1991 |
| 14 | red.F | h_d | ortho | 300 | F | 160 | OK | NO | YES | YES | 136 | 20 | 90-91 |
| 15 | red.F | h_d | ortho | 300 | F | 220 | OK | NO | YES | YES | 136 | 20 | 90-91 |
| 16 | red.F | h_d | ortho | 300 | F | 110 | OK | NO | YES | YES | 136 | 20 | 1991 |
| 17 | F | h_d | ortho | 300 | none | 110 | OK | NO | NO | YES | 230 | 19 | 1991 |
| 18 | M | h_d | ortho | 300 | none | 220 | OK | NO | NO | YES | 224 | 19 | 1991 |
| 19 | M | h_d | ortho | 300 | M | 110 | 400 | YES | YES | YES | 280 | 19 | 90-91 |
| 20 | M | h_d | ortho | 300 | M | 110 | 800 | YES | YES | YES | 280 | 20 | 1991 |
| 21 | M | h_d | ortho | 300 | M | 220 | OK | NO | YES | YES | 224 | 18 | 1991 |
| 22 | F | h_d | ortho | 300 | F | 110 | OK | NO | YES | YES | 230 | 18 | 1991 |
| 23 | red.M | h_d | ortho | 300 | M | 160 | OK | NO | YES | YES | 112 | 20 | 1991 |
| 24 | M | h_t | ortho | 180 | M | 110 | OK | NO | YES | YES | 224 | 20 | 2001 |
| 25 | M | h_t | ortho | 300 | M | 110 | OK | NO | YES | YES | 224 | 20 | 2001 |
| 26 | M | s_t | ortho | 180 | M | 110 | OK | NO | YES | YES | 224 | 20 | 2001 |
| 28 | M | h_d | phon | 300 | M | 110 | OK | NO | YES | NO | 224 | 20 | 2004 |
| 29 | M | h_d | ortho | 300 | M | 110 | OK | NO | YES | NO | 224 | 20 | 2004 |
| 30 | F | h_d | phon | 300 | F | 160 | OK | NO | YES | NO | 360 | 20 | 2005 |
| 31A | M(+F) | h_d | phon | 300 | M(+F) | 160 | OK | YES | YES | NO | 309 | 20 | 2005 |
| 31B | M(+F) | h_d | phon | 300 | M(+F) | 160 | OK | YES | YES | NO | 300 | 20 | 2006 |
| 32 | F | h_d | ortho | 300 | F | 220 | OK | YES | YES | YES | 289 | 19 | 2006 |
| 33 | M | h_d | ortho | 300 | M | 110 | OK | YES | YES | YES | 224 | 21 | 2006 |
| 34 | F(+M) | h_d | phon | 300 | F(+M) | 220 | OK | YES | YES | NO | 309 | 20 | 2006 |
| 35 | red.F | h_d | ortho | 300 | F | 110 | OK | NO | YES | YES | 136 | 20 | 2006 |
| 36 | red.F | h_d | ortho | 300 | F | 160 | OK | NO | YES | YES | 136 | 20 | 2006 |
| 37 | red.F | h_d | ortho | 300 | F | 220 | OK | NO | YES | YES | 136 | 20 | 2006 |
| 38 | F | h_d | ortho | 300 | F | 220 | OK | YES | YES | NO | 289 | 18 | 2007 |

Table 1: This table summarises the Australian English conditions tested in this project. The conditions have been given codes such as "VWL01" and an abbreviated version of these codes is shown in column 1. A typical condition consisted of 20 subjects, although this number varied between 18 and 30 (see the second column from the right, "#Subj"). The third column from the right ("#Pnts") indicates the number of tokens presented to the subjects in each condition. The remaining columns summarise the treatment of the independent variables in each of the conditions and these are described in the text below. The blue shading of certain cells highlights less common independent variable settings. Missing numbers under the VWL column represent discarded conditions.
| Frame | A "frame" defines the size of the vowel space for each condition. A "male" frame ("M") can be seen in figures 1 and 2, above, and is intended to represent an average male vowel production space (i.e. the range of possible vowel articulations for an average male vocal tract). A "female" frame ("F") similarly attempts to model the range of possible vowel articulations for an average female speaker. This space has a similar shape to the male frame but is significantly larger, with a maximum F2 of 3120 Hz and a maximum F1 of 1080 Hz. A "reduced female" frame ("red.F") is similar in size to a male frame but the token spacing is the same as for the female frame. A "reduced male" frame ("red.M") is reduced in size in the F1 and F2 dimensions relative to the male frame. The "red.F" conditions were repeated in 2006 utilising a better model of female F3. M(+F) and F(+M) represent conditions utilising male or female frames (and Hi Fx), respectively, followed by a number of tokens with the opposite vocal gender specification (to test a point normalisation hypothesis). | |
| Contx | This variable indicates the consonantal context for each condition. The most common context is the often used /h_d/ frame, but so far /h_t/ and /s_t/ frames have also been examined. | |
| Phon | This variable indicates whether the subjects were instructed to provide an orthographic response ("ortho") or a phonetic response ("phon"). Obviously, phonetically untrained subjects responded only orthographically. | |
| Len | This variable indicates vowel length. A value of "300" indicates that the long vowels had a duration of 300 ms and that the short vowels for such a condition had a duration of 150 ms. A value of "180" indicates that the long vowels had a duration of 180 ms and that the short vowels for such a condition had a duration of 90 ms. | |
| Hi Fx | This variable indicates whether formants above F3 were present and whether they had values which modeled a male (lower F4 and F5) or a female (higher F4) vocal tract. As the synthesiser was limited to a 5000 Hz bandwidth, the female F5 was omitted from the model as it is greater than 5000 Hz. | |
| F0 | Three F0 values are used: 110 Hz (male), 160 Hz (neutral), 220 Hz (female) | |
| BW | For most conditions vowel formant bandwidths are calculated using the formula BW=A*(1.+FX/B) where A=50 and B=2000 and FX is the formant frequency. This results, for example, in a bandwidth of 100 Hz for a formant with a frequency of 2000 Hz. For two conditions (VWL19 and VWL20) the bandwidth was fixed to the much wider values of 400 Hz and 800 Hz for all formants at all frequencies. | |
| Lo F3 | F3 was determined for most vowel tokens as a simple function of F2. That is, for each value of F2 there is a single value of F3. This approach is a reasonable model of F3 for all Australian English vowels except for /u:/ which, being a rounded central vowel has a lower F3 than the other central vowels, which are all unrounded. For some conditions, this is modeled by the provision of a second partial long vowel plane which is distinguished from the main plane by its consistently lower F3 values. | |
| F4/F5 | This variable indicates whether F4 and F5 are present. When not present they are modeled with a low gain and a broad bandwidth. | |
| Phon. Naive | Some conditions use phonetically trained subjects whilst other conditions use untrained subjects. A phonetically trained subject is defined minimally as a person who has at least completed our second year phonetics course (or equivalent) and who obtained a good result in the transcription assessments. | |
| Forced Choice | Originally there was an additional independent variable, "forced choice", but regardless of the instructions most participants left no blank responses, or at most a very small number, so there was effectively no distinction between forced choice and non-forced choice conditions. | |
| Year | The year(s) the data for this particular condition was collected. The time span of this project permits controlled examination of change in monophthong vowel perception in Australian English over a 19 year period. |
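The bandwidth rule given under "BW" above can be written out as a short Python sketch. The formula and the constants A=50 and B=2000 come from the text; the function name `formant_bandwidth` and the `fixed_bw_hz` parameter (standing in for the fixed-bandwidth conditions VWL19 and VWL20) are assumptions introduced here for illustration and are not part of the synthesiser.

```python
def formant_bandwidth(fx_hz, a=50.0, b=2000.0, fixed_bw_hz=None):
    """Return a formant bandwidth in Hz.

    Default rule from the text: BW = A * (1 + FX / B), with A=50, B=2000.
    If `fixed_bw_hz` is given (hypothetical parameter), that value is
    returned for every formant, as in conditions VWL19 (400 Hz) and
    VWL20 (800 Hz).
    """
    if fixed_bw_hz is not None:
        return float(fixed_bw_hz)
    return a * (1.0 + fx_hz / b)


bw_at_2000 = formant_bandwidth(2000.0)                    # worked example from the text
bw_at_500 = formant_bandwidth(500.0)                      # a lower formant
bw_vwl19 = formant_bandwidth(500.0, fixed_bw_hz=400.0)    # fixed-bandwidth condition
print(bw_at_2000, bw_at_500, bw_vwl19)
```

Under the default rule, bandwidth grows linearly with formant frequency, from 50 Hz at 0 Hz to 100 Hz at 2000 Hz, so the fixed 400 Hz and 800 Hz settings are far wider than anything the formula produces within the synthesiser's 5000 Hz range.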
Non-native Speakers of English and Native Speakers of Other Dialects of English
Non-native speakers of English (135 subjects), native speakers of other dialects of English (34 subjects) and native speakers of Australian English (30 female and 3 male) were asked to perform a task identical to condition VWL06. They were then asked to record a list of /h_d/ tokens as elicited by orthographic prompts. Individual subject vowel spaces are being examined, as well as composite vowel spaces for subjects with the same L1. This latter analysis is only being done on L1 groups with a reasonable number of subjects, such as Cantonese, Japanese and Korean, as well as British and US English, and separately for male and female Australian English speakers. Perceptual centroids for each vowel are being compared to the equivalent produced vowel (following normalisation to an equivalent male vowel, where necessary). Data was also collected on L2 English speakers' experience with English.
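The text does not say how a perceptual centroid is calculated. One plausible reading, sketched below in Python, is a response-weighted mean of the F1 and F2 values of the tokens identified as a given vowel. The function name `perceptual_centroid`, the use of raw Hz rather than an auditory scale, and the example counts are all assumptions.

```python
def perceptual_centroid(token_counts):
    """Response-weighted mean (F1, F2) for one vowel.

    `token_counts` maps (f1_hz, f2_hz) to the number of responses that
    named the vowel at that token (a hypothetical layout). Weighting in
    raw Hz is an assumption; an auditory scale could be used instead.
    """
    total = sum(token_counts.values())
    if total == 0:
        raise ValueError("no responses for this vowel")
    f1_mean = sum(f1 * n for (f1, _f2), n in token_counts.items()) / total
    f2_mean = sum(f2 * n for (_f1, f2), n in token_counts.items()) / total
    return f1_mean, f2_mean


# Invented counts: three responses at one token, one at a neighbouring token.
centroid = perceptual_centroid({(300, 2200): 3, (400, 2000): 1})
print(centroid)
```

A centroid of this kind could then be set beside the F1 and F2 measured from the same subject's recorded /h_d/ productions, after whatever normalisation is applied to those productions.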
Research Questions
The main research questions are:
- To what extent do each of vowel frame size, fundamental frequency and higher formant frequencies contribute to vowel perceptual normalisation?
- Which model of normalisation best accounts for the differences between the male and female vowel spaces?
- To what extent and in what way do lexical access effects (lexicality, lexical frequency, neighbourhood density) influence these vowel perceptual patterns? That is, how do the processes of lexical access and normalisation interact?
- What evidence is there for changes in vowel perceptual patterns over the 19 years of this project, and are these changes related to observed changes in the production of Australian English vowels over the same period?
- What differences can be found in the perceptual patterns of phonetically trained and phonetically naive subjects?
- What differences can be found in the perceptual patterns of phonetically trained subjects when they respond with orthographic (and therefore lexical) responses and when they respond with vowel phoneme symbols?
- What is the interaction between L1 and vowel perceptual patterns for learners of English?
- What is the interaction between the extent of English experience and vowel perceptual patterns for learners of English?
- To what extent do perceived vowel centroids relate to (normalised) productions of the same vowels for speakers of diverse L1 or other dialects of English?
Project Status
A very large amount of data has been collected from more than 800 subjects and this has all undergone initial data entry, processing and analysis. All planned data collection for this project is now complete. So far, only preliminary results have been presented at conferences. Several papers are currently being prepared for submission to refereed journals.
Numerous follow-up experiments can be anticipated. The synthesiser will be modified to permit it to produce a larger range of tokens (i.e. more consonantal contexts), which will probably open the way to further experiments. These modifications could expand its applicability to other dialects of English and perhaps even to other languages.
Relevant Papers
Bernard J.R., & Mannell R.H., (1988) "A study of /h_d/ words in Australian English", Working Papers, 1986, Speech Hearing and Language Research Centre, Macquarie University.
Mannell R.H., (1988) "Perceptual space of male and female Australian English vowels", Proceedings of the Second Australian International Conference on Speech Science and Technology, Sydney, Nov. 1988. pp 22-27
Mannell R.H. (1995), "Perceptual mapping and vowel normalisation", Proceedings of the XIIIth International Congress of Phonetic Sciences, Stockholm, Sweden, August 13-19, 1995
Mannell R.H. (2001), "The influence of lexical access on the perception of vowel phoneme boundaries in formant space", 13th Australian Language and Speech Conference, Sydney: Australia (Abstract)
Mannell, R.H. (2004), "Perceptual vowel space for Australian English vowels: 1988 and 2004", Proceedings of 10th Australian International Conference on Speech Science and Technology, Sydney, Australia, pp 221-226.
Mannell, R. (2006), "Perception and modelling of vowels and vocal gender in synthetic speech", Proceedings of the 15th Australian Language and Speech Conference, Sydney, Australia, published in: Australian Journal of Psychology, Vol. 58, Supplement 2006, p. 9. (abstract only)

