We are interested in sound as a medium that may be used for communication. Particular forms of audible communication, such as music and speech, may be highly specialized to their purposes and to the acoustic resources available to them for generating sound. This chapter is concerned with the very general structural qualities of sound that are present in essentially all uses of sound for communication. Each particular form of audible communication may exploit these general structural qualities in very different ways.
By rough analogy to visual communication, notice that almost all visual scenes may be described in terms of structural concepts such as region, edge, texture, color, brightness. Written communication in English exploits the shapes of regions with contrasting brightness, and the edges of those regions, to provide recognizable alphabetic characters of the Roman alphabet. Architectural drawings exploit edges in a radically different way. Perspective pictures draw on texture and color in yet other ways to communicate layouts of physical objects. In this chapter we seek an intuitive understanding of structural qualities of sound roughly at the level of region, edge, texture, color, brightness in vision.
The key receptive structure in the ear is the cochlea, a spiral-shaped tube containing lots of little hairs that vibrate with the surrounding lymphatic fluid. From our point of view, each hair is a physical realization of a rotor. Somehow (the how is still the topic of some debate) each hair is tuned to a narrow range of frequencies, and stimulates an assigned nerve ending proportionally to the amount of excitation it receives within its frequency range. So, the human ear is roughly a frequency analyzer, passing on a spectral presentation of sound at each instant to the brain for further analysis.
I call a complex of sound that is presented to a listener an ``audible scene.'' Many audible scenes decompose naturally into the sum of several components that are perceived as units, vaguely analogous to contiguous regions in a visual scene. The decomposition is often ambiguous, and sometimes there is no sensible decomposition, but the notion of a perceived contiguous piece of sound is likely to be useful whenever it applies. I call such an intuitive unit in an audible scene ``a sound.'' In well-articulated musical pieces, a single note by a single instrument is a sound. In speech, the notion is more ambiguous, but perhaps a phoneme or segment of a phoneme may be understood as a sound.
Automated analysis of audible scenes into individual sounds is extremely difficult, because it must resolve all of the ambiguities that arise. Synthesis by adding up individual sounds to create audible scenes is much more tractable, since the instructions for synthesizing a given scene can specify an interpretation explicitly. A synthesis method based on adding individual sounds together might be very useful even if it doesn't guarantee that every object described as ``a sound'' by the system is perceived as a single sound--as long as there is a good heuristic correlation between description and perception, the method can succeed.
The precise way in which the ear and brain decompose an audible scene into individual sounds is not understood. The spatial location of sound sources, as detected by the stereo effects of pairs of ears and by the asymmetric distortion induced by the funny shapes of our heads and external ears, certainly plays an important part. We will ignore spatial location, not because it isn't important, but because for the purposes of synthesis, it can probably be separated from monaural qualities. To synthesize an audible scene, we may describe sounds, then describe where each sound is placed, and these two parts of our description may be essentially independent. For analysis, they are probably tangled together inextricably.
Ignoring location, the qualities that make a particular complex vibration sensible to regard as an individual sound probably have to do with the frequency components of that vibration. We prefer to group frequency components together perceptually when their beginnings, and to a lesser extent their endings, are nearly simultaneous. Also, we prefer to group frequencies that are very close to being integer multiples of some audible frequency, which may or may not be present itself--stated another way, we prefer to associate frequencies whose ratios are very close to rational numbers with small integer numerators and denominators. Finally, we prefer to group frequency components perceptually when the variations in their frequencies and amplitudes are similar (e.g., vibrato helps a sound cohere perceptually). These qualitative observations are very far from providing a useful basis for analysis, but they may serve as heuristic guides in considering synthesis techniques.
Pitch is the quality of a sound that leads us to consider it ``higher'' or ``lower'' than another sound. Some sounds, such as engine noises and drum beats, yield only a vague sense of high or low pitch. Other sounds, such as notes of bird songs and of melodic musical instruments, yield a fairly precise sense of pitch that can be measured numerically, with most listeners agreeing that the measurement is correct.
At first approximation, the pitch of a sound is the logarithm of its frequency. The human ear detects frequencies from about 20 Hertz (cycles per second) to about 20,000 Hertz. But the change from 20 to 21 Hertz is perceptually much greater than the change from 19,999 to 20,000. Multiplying, rather than incrementing, a frequency is perceived as adding to the pitch. The logarithm of frequency matches this multiplicative quality of perception, since $\log(rf) = \log r + \log f$.
For example, on a piano keyboard, the interval (difference in pitch) called an octave is heard as the result of moving up 12 half steps (the interval from one key to the next higher--usually one is white and the other black), but it is essentially a multiplication of the frequency by 2. The interval called a perfect fifth, heard as moving up 7 half steps, multiplies frequency by approximately $3/2$ (more precisely, by $2^{7/12} \approx 1.4983$). The tempered half step multiplies frequencies by $2^{1/12} \approx 1.0595$. The 20-20,000 Hz range of frequency perception spans about 10 musical octaves (the piano keyboard spans about 7 octaves). Some people hear signals with frequencies well above 20,000 Hz, but most people as they age lose sensitivity to high frequencies.
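As a small sketch of this arithmetic (the A4 = 440 Hz tuning reference is an assumption for illustration, not part of the text above):

```python
import math

# Equal-tempered pitch arithmetic: moving up one half step multiplies
# frequency by 2**(1/12).  A4 = 440 Hz is an assumed tuning reference.

HALF_STEP = 2 ** (1 / 12)

def shift(freq_hz, half_steps):
    """Frequency reached by moving the given number of half steps."""
    return freq_hz * HALF_STEP ** half_steps

a4 = 440.0
print(shift(a4, 12))        # octave: 880.0 Hz (up to rounding)
print(shift(a4, 7))         # fifth: about 659.26 Hz
print(shift(a4, 7) / a4)    # about 1.4983, close to the pure fifth 3/2
```

Twelve half steps compose to an exact doubling, while seven half steps land near, but not exactly on, the rational ratio $3/2$.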
But, it's not that simple. Perception of pitch is affected by loudness (loud sounds tend to sound higher in pitch than soft sounds of the same frequency), and there may be many other small but significant influences on perceived pitch. My hunch is that most of these should not affect the structure of a general-purpose model of sound, but rather should be viewed as fine points to be applied outside of the model, when polishing a sound definition to its final form, only when great precision is truly required. For most purposes, lots of perceptual subtleties are best ignored.
One major complication in pitch perception probably will affect the structure of good digital models of sound. Although pitch is essentially the logarithm of frequency, perception of pitch is tied more closely to the relation between a number of component frequencies in a sound, rather than to the frequency of one particular component. Specifically, when a sound is nearly harmonic--when most of the frequency components of a sound are nearly integer multiples of another audible frequency $f_0$, called the fundamental pitch of that sound--we tend to hear a pitch given by $\log f_0$. The frequency $f_0$ itself need not be present! This seems spooky at first, but it is probably a very sensible adaptation of aural perception to the fact that some components of a sound may be filtered out or masked by noise. Perception of the ``missing fundamental'' is roughly analogous to the visual perception of an entire object, even though parts of it are hidden behind other objects.
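For exactly harmonic components with integer frequencies, the implied fundamental is just the greatest common divisor of the component frequencies. A minimal sketch (the function name is my own; real perception works on components that are only nearly harmonic):

```python
from math import gcd
from functools import reduce

# Sketch of the "missing fundamental": for exactly harmonic components
# with integer frequencies, the implied fundamental is the greatest
# common divisor of the component frequencies.

def implied_fundamental(component_freqs_hz):
    return reduce(gcd, component_freqs_hz)

# Components at 400, 600, 800, 1000 Hz: the 200 Hz fundamental is
# absent, yet it determines the perceived pitch.
print(implied_fundamental([400, 600, 800, 1000]))  # 200
```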
The perception of pitch intervals is also a bit more complicated than merely subtracting one pitch from another. When we perceive the pitch interval between two nearly harmonic sounds $A$, with fundamental frequency $f$, and $B$, with fundamental $g$, we seem to overlay their component frequencies. If the component of $A$ with approximate frequency $mf$ is close enough to the component of $B$ with approximate frequency $ng$ ($m$ and $n$ are integers), this influences us toward perceiving an interval determined by the ratio $m/n$, rather than $g/f$. Each pair of components that overlays closely enough influences the perceived interval, and it is hard to characterize the way in which these influences add up. But, there are plenty of sounds that are nearly enough harmonic to have a musical effect, but far enough from perfect integer ratios to confuse the perception of intervals. Piano notes, for example, deviate from harmonic sound, and the comparison of pitch between them and nearly perfect harmonic sounds, such as the sounds of most orchestral instruments, is quite tricky.
The precision of pitch perception is roughly constant within audible frequency limits. Since pitch is the logarithm of frequency, this means that frequency precision is much better for lower frequencies and poorer for higher frequencies. Section 2.3.6 discusses the perception of time for sound, which has a variation of precision inverse to the variation of frequency precision.
Psychologists measure the resolution of perception by just noticeable differences (jnd). Jnd measurements depend a lot on the precise form and context in which stimulations are presented. Between about 1,000 and 8,000 Hz, we notice changes in frequencies with ratios around $2^{1/350} \approx 1.002$ or $2^{1/200} \approx 1.0035$, which is roughly 200-350 steps per octave, or $1/29$-$1/17$ of a half step. Outside of this range, discrimination is poorer, but the jnd stays mostly below $2^{1/60} \approx 1.012$, which gives more than 60 steps per octave, or something smaller than $1/5$ of a half step. Discrimination of frequencies played in sequence is poorer--typically around 90 steps per octave or about $1/8$ of a half step. Musicians sometimes measure pitch with the cent, which is $1/100$ of a half step, or the savart, which is about $1/25$ of a half step. So, the structure of pitch perception is given roughly by a scale ranging over about 10 octaves with between 900 and 3,600 distinguishable steps, each around 3-7 cents or 1-2 savarts.
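The conversions among frequency ratio, steps per octave, and cents used above are simple to compute. A minimal sketch (function names are my own):

```python
import math

# Conversions among a frequency ratio, steps per octave, and cents
# (1 cent = 1/100 of a tempered half step, so 1200 cents per octave).

def ratio_to_cents(r):
    return 1200 * math.log2(r)

def steps_per_octave(r):
    return 1200 / ratio_to_cents(r)

# A jnd ratio of about 1.003 is a step of about 5 cents, roughly 230
# distinguishable steps per octave.
print(round(ratio_to_cents(1.003), 2))   # 5.19
print(round(steps_per_octave(1.003)))    # 231
```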
Loudness of a simple helical signal is roughly the logarithm of its power (the rate at which it delivers energy). Notice that when two helical signals have the same amplitude, the one with higher frequency also has higher power, because it moves faster. The exact power in a signal depends on the precise physical interpretation of the signal, but in general the power in a helical signal with amplitude $A$ and frequency $f$ is proportional to some polynomial in $A$ and $f$, and at least as big as $A^2 f^2$, so perceived loudness is roughly proportional to $\log(Af)$. But, perceived loudness varies according to the sensitivity of the ear at the given frequency, so signals at frequencies near the limits of audible frequencies seem softer than signals of equal power near the center.
When a number of frequencies are present in a sound, it might seem that the perceived loudness should be roughly proportional to the logarithm of the sum of all power within audible frequency limits. This seems sensible, but it's wrong. The perception of loudness in complex sounds is influenced by the critical bands of human sound perception--frequency bands containing a spread of frequencies roughly spanning a musical minor third, so the highest frequency in a band is roughly $2^{3/12} \approx 1.19$ times the lowest. These bands are not discrete, rather they overlap continuously across the range of audible frequencies, varying in width proportionally to the center frequency.
Two helical signals within a critical band tend to add their powers, so that the perceived loudness within a critical band is close to the logarithm of the total power in the band. But, two helical signals whose frequencies differ by more than a critical band tend to add their perceived loudness after the individual loudnesses are taken as the logarithm of power. Since $\log P_1 + \log P_2 > \log(P_1 + P_2)$ whenever both powers are well above the reference level, the same amount of sonic power sounds louder when spread over a larger frequency range. The precise computation of loudness from power spectrum is quite complicated because of the overlapping of critical bands.
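A crude numerical sketch of this asymmetry, following the simplified model just described (it is not a calibrated psychoacoustic formula; the power values and the $10^{-12}$ watts-per-square-meter reference are assumptions):

```python
import math

# Crude sketch of the critical-band asymmetry: powers add before the
# logarithm within a band; "loudnesses" add after it across bands.

REF = 1e-12  # conventional 0 dB reference, in watts per square meter

def db(power):
    return 10 * math.log10(power / REF)

p = 1e-6  # power of each of two equal sources, watts per square meter

one_source = db(p)              # 60 dB
same_band = db(p + p)           # powers add first: barely louder
separate_bands = db(p) + db(p)  # model: loudnesses add, far bigger jump

print(round(one_source))      # 60
print(round(same_band))       # 63
print(round(separate_bands))  # 120
```

Doubling the power within one band adds only about 3 dB, while the same power landing in a distant band roughly doubles the modeled loudness, matching the violins-and-cellos observation below.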
I doubt that the critical band concept will have an impact on the lowest levels of sound modeling, but it clearly has a profound effect on perception, and therefore on the construction of highly polished sounds. Composers and orchestra directors are familiar with the impact of critical bands. If one doubles the number of violins playing the same line, the audience hears a modest increase in loudness, but not a doubling. If one adds an equal number of cellos playing one octave below the violins, the audience hears a much greater increase in loudness, perhaps a doubling of loudness.
Perceived loudness is also affected by duration in a peculiar way. The ear tends to approximate power by summing energy for about $1/4$ of a second. So, a burst of sound $1/8$ second long sounds approximately as loud as a burst $1/4$ second long, at $1/2$ of the power. But, sustaining a sound beyond $1/4$ second produces a sensation of a longer sound at the same loudness.
Loudness is usually measured in decibels (dB). To add confusion, there are several different decibel scales in common use. Strictly speaking, a decibel is $1/10$ of a bel (B), and a bel is a generic logarithmic unit, having nothing special to do with loudness. When one measurement comes out 10 times as big as another, the measurements are said to differ by 1 bel, or 10 decibels. Loudness decibels are sometimes associated with the power level, typically measured in watts per square meter, and sometimes with change in pressure, typically measured in bars (1 bar is approximately standard atmospheric pressure). Since power is proportional to the square of pressure, a bel of pressure ratio corresponds to two bels of power ratio; in practice, pressure decibels are computed as $20 \log_{10}$ of the pressure ratio, rather than $10 \log_{10}$, precisely so that the two decibel scales agree.
Just like temperature scales, loudness scales have an arbitrary 0 point. Since $\log 0$ is not defined, we can't use 0 energy to register the decibel loudness scale. Also, the intuitive feeling of no sound is associated with a low level of ambient noise, rather than absolute 0 sound energy. The closest experience to 0 sound energy seems to be achieved by standing very quietly in an anechoic chamber. That experience is very weird, and gives an impression of an active suppression of sound, rather than a mere absence of sound. A typical choice for 0 dB is about $10^{-12}$ watts per square meter, or about $2 \times 10^{-10}$ bars, which is about the softest perceptible sound. On this scale, we detect sounds as faint as -4 dB at 4,000 Hz, and about 25 dB at 12,000 Hz. There is no clear upper limit to loudness. Around 100 dB sound becomes uncomfortably loud. Around 140 dB it becomes painful. Eventually, extreme variations in pressure are sensed as shocks, rather than as sound, and they can be destructive.
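The two decibel conventions can be sketched as follows, using the conventional threshold-of-hearing references of $10^{-12}$ watts per square meter for power and $2 \times 10^{-10}$ bars (20 micropascals) for pressure:

```python
import math

# The two decibel conventions, referenced to the threshold of hearing.

def power_db(p_watts_per_m2, ref=1e-12):
    return 10 * math.log10(p_watts_per_m2 / ref)

def pressure_db(p_bars, ref=2e-10):
    # The factor of 20 (rather than 10) compensates for power being
    # proportional to the square of pressure, so the scales agree.
    return 20 * math.log10(p_bars / ref)

# Doubling power adds about 3 dB; doubling pressure quadruples power,
# which comes out as about 6 dB on both scales.
print(round(power_db(2e-12), 1))     # 3.0
print(round(pressure_db(4e-10), 1))  # 6.0
```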
Jnd of loudness varies a lot depending on the type of signal. Something between $1/2$ dB and 1 dB is probably a sensible practical range for loudness discrimination. Loudness memory is also much fuzzier than pitch memory, so loudness is a tricky parameter to use for carrying information.
Two sounds of the same pitch and loudness may have recognizably different qualities: for instance the sounds of string instruments vs. reed instruments in the orchestra. These distinguishing qualities of sound are called timbre, and are sometimes compared to visible color. Compared to pitch and loudness, timbre is not at all well defined. It clearly has a lot to do with the relative strengths of different frequency components of a sound, called the partials. But, it is also affected seriously by some aspects of the time development of partials--particularly but not exclusively by the increase in amplitude of partials at the beginning of a sound, called the attack in music. Different partials of a musical sound typically increase at very different rates, and these differences are crucial to the identification of a sound with a particular instrument. For example, the sounds of brass instruments are recognized partly by the quicker development of lower frequencies than higher frequencies.
At first approximation it seems that two sounds of different pitch will have the same perceived timbre when the spectral content of one looks just like the other, but shifted in frequency. For example a sound with a component of amplitude 1 at 100 Hz, amplitude 0.5 at 200 Hz, and amplitude 0.25 at 300 Hz might be expected to be qualitatively similar to one with amplitude 1 at 250 Hz, 0.5 at 500 Hz, and 0.25 at 750 Hz. In this sort of case, the second sound might be produced by recording the first one on tape, then playing it back with the tape moving faster. The famous singing chipmunks demonstrate the fallacy in this expectation--they do not sound at all like their creator singing higher.
A more accurate notion of timbre must take into account the fact that sound perception has adapted to the way that many sound producers, including the human voice and most musical instruments, create their sounds by a two-stage process. First, there is some sort of vibrating structure, such as the vocal cords, violin string, or oboe reed, which may follow the shifted partials model fairly well. But, the sound coming from this first vibrating structure filters through another resonating structure, such as the human head, the body of the violin, the body of the oboe, which scales the amplitudes of partials according to its responsiveness at different frequencies. The responsiveness of the second structure does not take a frequency shift when the incoming pitch changes, so it changes the relative strengths of partials depending on their absolute frequencies, and not just their ratios to the given pitch. This filtering structure is sometimes called a formant filter, because it may often be characterized by a small number of resonant frequency bands, called formants. Human sound perception seems to have adapted to recognizing the constancy of formant filters when they are stimulated by a variety of incoming sounds at different pitches. This is vaguely analogous to the tendency of human visual perception to perceive the reflective properties of a given pigment as its color, even under radically different illuminations that may change the actual spectrum reaching the eye quite severely.
The perception of timbre is orders of magnitude more subtle than pitch, and has never been characterized with precision. Some acoustical scholars believe that the word ``timbre'' is simply a convenient label for those qualities of sound that we cannot describe or analyze satisfactorily, much in the way that ``intelligence'' sometimes seems to be used as a pleasant label for those aspects of human behavior that we want to admire, but cannot explain. My hunch is that timbre is susceptible to a much better analysis than has been achieved so far, but not necessarily to a complete analysis. Timbre perception certainly has a lot to do with the relative amplitudes of partials, but is also affected crucially by the initiation of a sound, the relation of amplitudes of partials to their absolute frequencies (rather than just the ratios with fundamental frequencies), and probably to a lot of other things that nobody has thought of yet.
The perception of pitch and timbre interact in a number of well-known ways, including the impact of inharmonic partials on perception of pitch intervals discussed in Section 2.3.1. There is also a subtle structural interaction between pitch and timbre, since they both depend on frequency, and they impose two different but interdependent structures on the frequency domain.
Think of the set of perceptible frequencies, and consider the structure that comes from the association of nearby pitches. This structure is the linear scale discussed in Section 2.3.1. Say that the melodic distance between two frequencies $f$ and $g$ is the pitch difference measured in half steps (mathematically, $12 \log_2(g/f)$). But, the natural preference for integer multiples imposes a different notion of nearness: two frequencies are nearby harmonically if their ratio is a rational number with small numerator and denominator. We might define the harmonic distance between $f$ and $g$ as the total number of bits in the numerator and denominator when $g/f$ is given as a fraction in lowest terms ($\log_2(mn)$ where $g/f = m/n$, and $m$, $n$ are integers with no common divisors). The octave interval has a melodic distance of 12, but a harmonic distance of $\log_2(2 \cdot 1) = 1$, which is the shortest distance besides a unison. The perfect 5th has a melodic distance of 7, but a harmonic distance of $\log_2(3 \cdot 2) = \log_2 6$, which is about $2.585$. The perfect 4th has a melodic distance of 5, but a harmonic distance of $\log_2(4 \cdot 3) = \log_2 12$, which is about $3.585$. The tempered half step is not rational, so it has no sensible harmonic distance. But, the frequency ratio of $18/17$ is slightly less than a tempered half step, so it has melodic distance slightly less than 1, but harmonic distance of $\log_2(18 \cdot 17) = \log_2 306$, which is about 8.2573878.
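These two distances are easy to compute. A minimal sketch (function names are my own; exact rational ratios are assumed as inputs to the harmonic distance):

```python
from fractions import Fraction
import math

# Melodic distance: pitch difference in half steps.
# Harmonic distance: total bits in numerator and denominator,
# i.e. log2(m * n) for a frequency ratio m/n in lowest terms.

def melodic_distance(f, g):
    return abs(12 * math.log2(g / f))

def harmonic_distance(ratio):
    r = Fraction(ratio)
    return math.log2(r.numerator * r.denominator)

print(harmonic_distance(Fraction(2, 1)))             # octave: 1.0
print(round(harmonic_distance(Fraction(3, 2)), 3))   # 5th: log2 6 = 2.585
print(round(harmonic_distance(Fraction(18, 17)), 4)) # 8.2574
print(round(melodic_distance(440, 880), 1))          # octave: 12.0
```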
The structure of harmonic distance by itself is already more complicated than the simple straight-line scale of melodic distance. Notice that a perfect 5th followed by a perfect 4th spans an octave, but the sum of the harmonic distances of a 5th and a 4th is $\log_2 6 + \log_2 12 = \log_2 72$, which is about $6.169925$--much larger than the octave's harmonic distance of 1. So, three different frequencies may form a triangle in harmonic distance, instead of lying on a single line. What's worse is that the two notions of distance interact. If $f$ and $g$ are nearby harmonically, and $g$ and $h$ are nearby melodically, then we often hear a close harmonic connection between $f$ and $h$. This is why tempered scales, in which only the octaves have perfect rational ratios, still produce sensible chords. The interaction between two distance measures with radically different structure does not correspond to any widely studied mathematical system. I drew a whimsical picture of this interaction at the head of the class Web page.
Although abstract physics recognizes time as a single one-dimensional continuum (at least for any single observer), different intervals of time may be perceived as if they are in completely different dimensions, depending on the lengths of the intervals and the sorts of perceptible changes that occur during them. For example, in visual perception, changes in electromagnetic flux on a scale of quadrillionths of a second are not perceived as time at all, but rather determine the frequency of light, and thereby contribute to the perception of color. Changes on a scale of tenths of a second or longer are generally perceived as temporal events involving changes in visual qualities, including color. The huge gap between the electromagnetic time scale and the event-sequence time scale makes it easy to classify particular changes unambiguously into one class or the other.
Sound perception seems to have at least three time scales that are perceived quite differently--a sonic scale, on which changes are heard as frequency; an event-sequence scale, on which changes are heard as distinct events; and a transitional scale in between--and they all overlap to make things more complicated.
Even the sonic and event-sequence time scales overlap for sound, with the transitional scale in between and overlapping both. This makes the understanding of time developments in sound quite subtle in some cases. In particular, the boundaries are sensitive to frequency. For low-frequency components of sound, the boundaries of the scales move toward longer time intervals, and for high-frequency components they move toward shorter intervals. It takes about two full periods (rotations) of a helix to recognize the frequency. So, changes in a helical component of sound can only be detected when they are not too short compared to the time of a complete period.
The inverse relation between frequency precision, which is best for low frequencies, and time precision, which is best for high frequencies, is striking. It is not an accident, but comes from fundamental physical limitations, which limit the product of time and frequency precision, so that when one improves, the other gets proportionately worse. The same mathematical form produces the Heisenberg uncertainty principle in quantum mechanics.
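Stated quantitatively (with $\Delta t$ and $\Delta f$ taken as the standard-deviation spreads of a signal in time and in frequency--a formulation not spelled out in the text above, but the standard one), the limitation is

```latex
\Delta t \,\Delta f \;\ge\; \frac{1}{4\pi},
```

with equality only for signals with Gaussian envelopes. The Heisenberg uncertainty principle is the same inequality applied to position and momentum.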
In many cases, we regard certain qualities of a sound as the ones we wish to hear, and other qualities as unfortunate errors. Such distinctions are inherently not objective--they arise from our desires and intentions rather than from the inherent qualities of the sounds. Sometimes, the qualities of a sound that we wish to hear are those that we or our collaborators control, and the unfortunate errors are introduced by circumstances outside our control. For example, when we listen to recorded music, we may wish to hear the qualities introduced by the musicians, and we may regard other qualities introduced by the recording and playback mechanisms as unfortunate errors. We often refer to the sound that we wish to listen to as a signal, and the other qualities generically as noise. But the word ``noise'' is also used in other more or less technical senses--in particular it sometimes refers to a particular way of describing such error as discussed in Section 2.3.7 below.
Depending on the relationship between signal and noise, the ``noise'' may be understood as a separate sound added onto the signal, or it may be understood as a variation on the signal, called ``distortion.'' So, one sort of noise is called ``noise,'' and another sort of noise is called ``distortion.'' To confuse the terminology further, phenomena that typically produce noise and distortion as unfortunate errors obscuring a desired sound are sometimes produced deliberately, but we still use the words ``noise'' and ``distortion'' to describe them.
In many cases, we describe a presented sound $p$ as a desired signal $s$ and an unfortunate noise $n$ added together: $p = s + n$.
Whenever there is a well-defined sound signal $s$ that we wish to hear, but we are presented instead with a different sound $p$, we may in principle regard $p$ as the sum of $s$ and a noise $n = p - s$. But, when $s$ and $n$ are not very independent, the analysis may not be very helpful for understanding the way that we hear $p$. For an extreme example, if some unfortunate error completely erases our desired sound, we would have $p = 0$ and $n = -s$. In other cases, errors in the presentation of a sound may cause systematic changes in the frequency components of the sound. Whenever the additive form of signal plus noise is not helpful, we fall back on a very general form $p = D(s)$, where $D$ is an arbitrary function transforming signals to signals.
Distortion is too general a concept for useful analysis, so we notice some common special forms of distortion.
Linear stationary distortion is the important case of distortion $D$ that preserves additive components ($D(s_1 + s_2) = D(s_1) + D(s_2)$), and ignores the specific time at which the sound occurs (delaying the input merely delays the output by the same amount).
Nonlinear distortion may be as complicated as you please. But when the distorting process has no memory, and only operates on the current moment in time, its behavior is simplified. In this case, the distortion function $D$, operating on a whole signal to produce another signal, is completely determined by a simpler function $d$ from complex numbers to complex numbers: $D(s)(t) = d(s(t))$.
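Here is a sketch of such a memoryless distortion (real-valued samples and a hypothetical cubic $d$, chosen for illustration), showing how a pointwise nonlinearity adds energy at integer multiples of the input frequency:

```python
import math

# A memoryless nonlinear distortion: the whole-signal operator D is
# determined by applying a simple function d sample by sample.

def d(x):
    return x - 0.3 * x ** 3          # mild cubic nonlinearity

def distort(signal):                 # D: whole signal -> whole signal
    return [d(x) for x in signal]

# Feed in a pure cosine at 5 cycles per window.  Since
# cos^3 t = (3 cos t + cos 3t) / 4, the output contains a new
# component at 15 cycles: harmonic distortion.
n = 1024
pure = [math.cos(2 * math.pi * 5 * k / n) for k in range(n)]
out = distort(pure)

def component(signal, cycles):
    # Naive Fourier projection onto cos(2 pi cycles k / m).
    m = len(signal)
    return 2 / m * sum(x * math.cos(2 * math.pi * cycles * k / m)
                       for k, x in enumerate(signal))

print(round(component(out, 5), 3))   # 0.775 (fundamental, slightly reduced)
print(round(component(out, 15), 3))  # -0.075 (new 3rd harmonic)
```

The cubic term turns $\cos t$ into $0.775 \cos t - 0.075 \cos 3t$, exactly the kind of harmonic distortion discussed next.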
Most of the interesting distortion in electric guitar amplifiers is harmonic distortion. This is an interesting case of a sound quality that was initially thought of as error, but later used deliberately as a musical device.
Other forms of distortion may be arbitrarily complicated. They are often called cross-modulation distortion because the complications often present themselves as interactions between different frequency components. I am not aware of any useful and totally general theory of distortion. The Wiener and Volterra theories of nonlinear systems provide an interesting generalization of linear stationary filters to nonlinear filters, analogous to the generalization of linear functions to nonlinear functions using Taylor series.