Table 1. The relationship between vocal variables and their marking functions
[Table body not recovered. Surviving cell labels: "Informative and communicative"; "Relation to language"; "under potential muscular control, therefore learnable and imitatable".]
We can initially consider speech production from the point of view of the different muscle systems which make up the whole vocal apparatus. The muscle systems exploited in speaking are almost all anatomically interconnected (Laver 1975), so that no muscular action takes place without affecting the activity of many other parts of the vocal apparatus. Each muscular action has to be cooperatively facilitated by all the muscle systems that could potentially counteract the desired effect of its execution. Speaking thus requires the most complex and skilful collaboration between the different muscle systems, whose cooperative actions all have to be precisely and intricately coordinated in time. It is not at all surprising, therefore, that in learning to control such a complex apparatus sufficiently to be able to produce auditorily acceptable imitations of speech patterns heard in one's social environment, speakers should nevertheless develop idiosyncrasies of pronunciation that serve to individuate them within their own social group.
The notion of an isolable muscle system is itself something of a fiction. But if we accept the fiction as analytically convenient, then there are seven basic muscle systems whose contributions to speech can be distinguished. These are: the
The acoustic correlates of features of auditory quality are essentially spectral in nature, and include such aspects as formant frequencies and amplitudes, and the frequency and amplitude of aperiodic noise in the spectrum. The acoustic correlates of dynamic auditory features include fundamental frequency as the correlate of pitch, intensity as the correlate of loudness, and duration as the correlate of length. It should be noted, however, that the allocation of fundamental frequency and intensity to the acoustic realization of dynamic auditory features is not always completely valid: pitch 'jitter' and loudness 'shimmer' (that is, aperiodic cycle-to-cycle variability of fundamental frequency or intensity around the mean value) are both heard as contributing to auditory quality, giving a 'rough', 'harsh' auditory texture.
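The cycle-to-cycle variability described above can be made concrete with a small numerical sketch. It is an illustration only, under stated assumptions: the function name `local_perturbation` and the measurement values are hypothetical, and the measure used (mean absolute difference between consecutive cycles, divided by the mean value) is one common formulation of 'local' jitter and shimmer, not a procedure taken from the text. It is applied here to period durations for jitter and to peak amplitudes for shimmer.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute difference between consecutive cycles, relative to the mean.

    Applied to glottal period durations this gives a jitter-like measure;
    applied to per-cycle peak amplitudes it gives a shimmer-like measure.
    """
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical per-cycle measurements (illustrative, not real data).
periods_ms = [8.0, 8.2, 7.9, 8.1, 8.0]            # duration of each glottal cycle
peak_amplitudes = [0.50, 0.52, 0.49, 0.51, 0.50]  # peak amplitude of each cycle

jitter = local_perturbation(periods_ms)
shimmer = local_perturbation(peak_amplitudes)

print(f"jitter:  {jitter:.2%}")
print(f"shimmer: {shimmer:.2%}")
```

A perfectly periodic signal yields zero on both measures. Larger values correspond to the aperiodic variability that the text describes as contributing a 'rough' or 'harsh' auditory texture.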
Auditorily, all speech is made up of sounds describable in terms of quality, pitch, loudness and length. All markers in speech thus depend on these variables for their phonetic realization, and the discussion that follows is an attempt to explain the phonetic basis of different types of speaker-characteristics.
There are three different facets of vocal performance to be considered. Each of these facets is subject to a different time-perspective. Firstly, there is the facet of vocal performance that represents the speaker's permanent or quasi-permanent voice, by which he is recognizable even when his consonants and vowels are unintelligible, for example, when heard speaking on the other side of a closed door. The other two facets are tone of voice and the phonetic realizations of linguistic units. The time-perspective of tone of voice is usually medium-term, and that of linguistic articulations very short-term.
Because voice features are by definition long-term, they lie quite outside any possibility of signalling linguistic meaning, so it is appropriate to refer to such voice features as extralinguistic. Since they are not normally consciously manipulated by the speaker, voice features are informative but not communicative. The medium-term features that make up tone of voice, and which have the function of signalling affective information, have a rather closer resemblance in some ways to the short-term use of the vocal apparatus for signalling linguistic meaning, and such features are therefore often referred to as paralinguistic. They are paralinguistic in the sense that they form a communicative code subject to cultural convention for its interpretation; paralinguistic features are not fully linguistic in the sense that they lack the possibility of signalling meaning through sequential arrangement into structures, which is a criterial property of linguistic communication.
Neither extralinguistic nor paralinguistic features are irrelevant to directly linguistic interests, since they constitute a background against which the linguistic articulations can achieve their perceptual prominence. Strictly, each of the three types of vocal feature, extralinguistic, paralinguistic and linguistic, acts as a perceptual ground for the figures of the other two types.
Each of these categories of vocal behaviour will now be discussed in more phonetic detail. A summary of the relationship between these vocal variables and their marking functions is given in table 1.
2.1. Extralinguistic voice features
Long-term speaker-characterizing voice features are of two different sorts. One type of voice feature arises from anatomical differences between speakers. The second type is the product of the way in which the individual speaker habitually 'sets' his vocal apparatus for speaking. Unlike this second type, which will be discussed in a moment, the first type of feature is by definition outside any possibility of control by the speaker. It includes anatomical influences on aspects of voice quality and of voice dynamics.
Anatomical influences on voice quality are due to factors such as basic vocal tract length, dimensions of lips, tongue, nasal cavity, pharynx and jaw, dental characteristics, and geometry of laryngeal structures (Abercrombie 1967: 92). These anatomical factors impose limits on the range of spectral effects (in terms of formant frequency and amplitude ranges, and on the distribution of aperiodic noise through the spectrum) that the speaker can potentially control acoustically.
Anatomical influences on voice dynamics are due to factors such as the dimensions and mass of the vocal folds, and respiratory volume. These influence pitch and loudness ranges, by imposing limits on the ranges of fundamental frequency and amplitude that the speaker can produce.
Listeners' judgments of physical attributes, based on the product of such anatomically derived features, are amongst the most accurate conclusions drawn. This is precisely because they are based on invariant, involuntary aspects of a speaker's vocal performance. Physique, age and sex are all judged with fair accuracy, and interesting information about a speaker's medical condition is also sometimes accurately inferred.
Physique and height are probably judged accurately because of the good correlation that seems to exist between these factors and the dimensions of the speaker's vocal apparatus. A tall, well-built man will tend to have a long vocal tract and large vocal folds. His voice quality will reflect the length of his vocal tract by having correspondingly low ranges of formant frequencies, and his voice dynamic features will indicate the dimensions and mass of his vocal folds by a correspondingly low range of fundamental frequency. His large respiratory volume will be reflected in a powerful loudness range. If we then hear such a voice over the telephone, we normally have a confident expectation that the speaker will turn out to be a large, strong male. In general, our expectations are fulfilled, within a reasonable margin of error. Bonaventura (1935) gave subjects pictures and voices to match, and found that fair accuracy was achieved: in terms of Kretschmerian body-types (Kretschmer 1925), judgments of pyknic types were most accurate, accuracy was less for leptosome types, and least for athletic types. Moses (1940, 1941) gives general support to this, and Fay & Middleton (1940a) report a more detailed finding: they found that in judging body-types from voices transmitted over a public address system, the results were 22 per cent above chance for pyknic types, 20 per cent for leptosomes, but only 1 per cent above chance for athletic types. Lass, Beverly, Nicosia & Simpson (1978) report that listeners typically judge weight to within 3-4 lbs (though overestimating the weight of males and underestimating that of females), and that they judge height to within 1.5 inches (though underestimating the height of both males and females).
There is one class of voices where the general correlation does not apply, but where listeners nevertheless seem to be able to reach successful conclusions about the physical attributes. That is where the formant ranges of the voice are radically discrepant with the fundamental frequency, as in particular types of dwarfism (Vuorenkoski, Tjernlund & Perheentupa 1972; Weinberg & Zlatin 1970). In these cases, the dimensions of the vocal folds are smaller than their general correlation with vocal tract length would lead one to expect.
Exceptions to the general rule of our ability as listeners to attach a particular size and physique to a given voice are sufficiently rare to take us aback when they occur.
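The link between vocal tract length and formant ranges can be illustrated with a standard textbook idealization that is not part of the original discussion: the vocal tract treated as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of a quarter wavelength. The sketch below rests on that assumption. The function name `tube_resonances`, the speed-of-sound constant and the two tract lengths are illustrative choices, and a real vocal tract is neither uniform nor fixed in shape.

```python
# Approximate speed of sound in warm, moist air, in cm/s (an assumed round figure).
SPEED_OF_SOUND_CM_PER_S = 35000.0


def tube_resonances(length_cm, count=3):
    """Resonances of a uniform tube closed at one end and open at the other.

    Uses the quarter-wavelength relation F_n = (2n - 1) * c / (4 * L).
    """
    return [
        (2 * n - 1) * SPEED_OF_SOUND_CM_PER_S / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


# Hypothetical tract lengths chosen only for illustration.
longer_tract = tube_resonances(17.5)
shorter_tract = tube_resonances(14.0)

print(longer_tract)   # [500.0, 1500.0, 2500.0]
print(shorter_tract)  # [625.0, 1875.0, 3125.0]
```

Under this idealization every resonance scales inversely with tube length, which is consistent with the observation that a longer vocal tract is heard as having correspondingly lower ranges of formant frequencies.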
Age is judged accurately (Dordain, Chevrie-Muller & Grémy 1967; Hollien & Shipp 1972; Mysak 1959; Ptacek, Sander, Maloney & Roe Jackson 1966; Shipp & Hollien 1969). Voice quality features probably play their part in marking this characteristic, but voice dynamic features are likely to be the primary cues. Age is marked by pitch in both males and females: Hollien & Shipp (1972) show a progressive lowering of mean pitch with age for males from 20 up to 40, then a rise from age 60 through the 80s. Mysak (1959) also showed this rise in mean pitch from the 50s upwards. Dordain et al. (1967) report a drop in mean pitch for older women, but a rise with extreme age. Ptacek et al. (1966) also report a reduced pitch range with extreme age.
Features of auditory quality can signal aspects of the age of a speaker. These include the quality associated with the 'breaking' voice of puberty, and the quality of extreme old age. Vocal indications of puberty, referred to in clinical literature as 'vocal mutation', often include whispery voice. Luchsinger & Arnold (1965: 132) write that 'In addition to the lowering of the average speaking pitch, the voice is frequently husky during mutation, or it may sound weak.' The senescent voice of extreme old age derives from a complex of endocrinal, anatomical and physiological changes. The mucal fluid supply often becomes disturbed, either greatly increasing or decreasing, tissues become increasingly less elastic, and cartilages become calcified and ossified (Fyfe & Naylor 1958; Luchsinger & Arnold 1965; Meader & Muyskens 1962; Terracol & Azemar 1949). Meader & Muyskens (1962: 77) comment that 'Since the rigidity of tissue is one determination of its resonating qualities, the gradual deposition of lime in ... cartilages (replacing them by bone) helps to explain the shrill voice and thin voice (deficient in harmonics) of age.' Because muscles atrophy, the glottis of old speakers often has a bowed appearance (Luchsinger & Arnold 1965: 136; Tarneaud 1941); this means that, to achieve phonation, greater effort has to be exerted to bring the vocal folds together, and a rather harsh voice is often the result. When this is combined with inefficient phonation because of an excess of mucus, the type of voice that results is a harsh whispery voice, as suggested by the following comment from Luchsinger & Arnold (1965: 136): 'Tracheal and laryngeal mucous secretions are increased, sometimes on an allergic basis. Together with a tendency to chronic bronchitis, this over-secretion of mucus produces the hacking, coughing, throat-clearing, or "moist" hoarseness of the old man.'
In old age, fatty tissue can build up in the ventricles in the sides of the upper larynx (Ferreri 1959), and the ventricular folds above the ventricles can shrink towards the sides of the larynx, giving a wider entrance to the ventricles (Luchsinger & Arnold 1965: 136). All these factors can contribute significantly to the fine detail of the auditory quality of the phonation being produced. [...]