In Chapter 1 we modeled sound as a function from a real value representing time to a complex value representing a two-dimensional state of some vibrating system. In order to deal with sound digitally, we must somehow reduce such a signal to a finite number of symbols from a discrete set of possibilities, such as the bits 0 and 1 that current digital computers use to represent all information. The digitization of sound is normally achieved by two independent steps, each of which has consequences for the fidelity with which sound is reproduced digitally.
The usual first logical step in digitizing sound is to approximate the continuous domain of real values representing time by a discrete set of equally spaced values. Let $s$ be a positive real number, called the sampling rate (the number of samples to take in a unit of time). $\mathbf{Z}$ represents the set of all positive and negative integers, and $T_s$ represents the domain of discrete time with sampling rate $s$.

$s \in \mathbf{R}$, $s > 0$   (3.1)

$\mathbf{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$   (3.2)

$T_s = \{\, i/s : i \in \mathbf{Z} \,\}$   (3.3)
Given a continuous sound signal $u : \mathbf{R} \to \mathbf{C}$, the obvious and natural choice for a discrete sound signal $\bar{u} : T_s \to \mathbf{C}$ to represent it is the restriction of $u$ to the discrete time domain, where

$\bar{u}(t) = u(t)$ for $t \in T_s$   (3.4)
Whenever two continuous sound signals $u$ and $v$ agree on every point in $T_s$ ($u(t) = v(t)$ for all $t \in T_s$), then they have the same representation as a discrete sound signal, and we say that $u$ and $v$ are aliases. The famous problem of wagon wheels appearing to roll backwards in old movies is an example of aliasing in a sampled video signal. The forward-rolling wagon wheel and the backward-rolling wagon wheel are aliases at the sampling rate imposed by the frames on the movie film. When we generate a real physical sound from a discrete signal $\bar{u}$, a listener can only hear one of the infinitely many continuous sound signals that it might represent. This is undesirable in the case where we discretize one continuous sound, and the listener hears a different one.
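The animated demonstrations in this book use Scilab; as a quick numerical check of the same idea, here is a small Python sketch (the sampling rate, frequencies, and sample count are all arbitrary choices) showing that two sines whose frequencies differ by an integer multiple of the sampling rate produce exactly the same samples:

```python
import numpy as np

s = 8000.0                      # sampling rate (samples per unit time), chosen arbitrarily
n = np.arange(64)               # sample indices
t = n / s                       # discrete time points in T_s

f = 440.0                       # any frequency
g = f + 3 * s                   # differs from f by an integer multiple of s

x = np.sin(2 * np.pi * f * t)   # samples of the first sine
y = np.sin(2 * np.pi * g * t)   # samples of its alias

# The two continuous signals are very different, but their samples agree.
print(np.allclose(x, y))        # True
```

Any frequency shift by a multiple of $s$ leaves every sample untouched, so the two continuous signals are indistinguishable in discrete form.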
Aliasing occurs between arbitrarily complicated signals, but we can understand its impact best by concentrating on pure helical complex-valued signals, and pure sinusoidal real-valued signals. Two continuous helical signals are aliases if and only if their amplitudes and phases are exactly the same, and their frequencies are the same modulo the sampling rate $s$. Don't worry if the mathematical discussion is hard to understand. The animated Scilab demonstrations are much easier to follow. For any integer $k$, and any sample point $t = n/s$ in $T_s$:

$A e^{2\pi (f + ks) t i + \phi i} = A e^{2\pi f t i + \phi i}\, e^{2\pi k n i}$   (3.5)

$= A e^{2\pi f t i + \phi i}$, since $e^{2\pi k n i} = 1$ for all integers $k$ and $n$   (3.6)
For real-valued signals at frequencies that are exact multiples of half the sampling rate, there is even confusion about the amplitude and phase. In the case of odd multiples of half the sampling rate, the samples are all equal in magnitude, and alternating in sign. The amplitude of the samples depends on the phase at which the samples are taken, which is the same for each half wave. At frequency $(2m+1)s/2$ and sample point $t = n/s$:

$A \sin(2\pi \frac{(2m+1)s}{2} \frac{n}{s} + \phi) = A \sin((2m+1)\pi n + \phi)$   (3.7)

$= (-1)^{(2m+1)n} A \sin(\phi)$   (3.8)

$= (-1)^{n} A \sin(\phi)$   (3.9)
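A small Python sketch (sampling rate, amplitude, and phase are arbitrary choices) confirming this alternating-sign behavior: sampling a sine right at half the sampling rate yields samples whose magnitude is set by the sampling phase, not by the true amplitude alone.

```python
import numpy as np

s = 8000.0                        # sampling rate, chosen arbitrarily
n = np.arange(32)
t = n / s

A, phi = 0.7, 0.3                 # arbitrary amplitude and phase
x = A * np.sin(2 * np.pi * (s / 2) * t + phi)   # sine right at half the sampling rate

# Samples are equal in magnitude and alternate in sign; the magnitude is
# A*sin(phi), which tangles amplitude and phase together.
expected = (-1.0) ** n * A * np.sin(phi)
print(np.allclose(x, expected))   # True
```

Setting `phi = 0` makes every sample zero: the same discrete signal as silence.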
Having computed a discrete sound signal $\bar{u}$, we need to render it through a loudspeaker or similar controllable vibrating device in order to hear the sound. At some point in the rendering process, $\bar{u}$ is converted to a continuous signal $v$. It is natural to choose one of the infinitely many continuous sound signals that $\bar{u}$ might represent. In particular, it is natural to create $v$ by interpolating values between the ones given by $\bar{u}$ in such a way as to make the resulting sound signal as smooth as possible, according to some appropriate definition of smoothness. The interpolating is normally done, not by a digital computation, but by the analog machinery (usually electronic) controlled by the computation. All sorts of smoothing tend to depress higher frequencies. In Chapter 4 we find that when the sampling rate is high enough, and all significant frequency content is far below the Nyquist limit, the precise nature of the interpolation is relatively unimportant. If we try to play back signals with significant content near the Nyquist limit (within about 100 Hz), we are likely to hear the beating between those components and their aliases.
The sorts of analog devices commonly used for sound production typically interpolate so that the final result is close to a sum of sinusoidal signals of the lowest possible frequencies. So, if the frequency $f$ is substantially smaller than half the sampling rate, the discrete real sinusoidal signal with samples $A \sin(2\pi f t + \phi)$ for $t \in T_s$ is normally rendered as something very close to the continuous signal $A \sin(2\pi f t + \phi)$.
Not all aliases of a helical or sinusoidal signal are helical or sinusoidal themselves. For real-valued signals, the frequencies near half the sampling rate ($s/2$) alias to signals that are amplitude modulations of a carrier with frequency $s/2$ (see Figure ???). In many cases, these amplitude-modulated signals represent the way that a rendered signal is likely to be heard. For complex-valued helical signals, the problem of non-helical aliases for helical signals seems to be less important, but I know very little about the rendering of complex-valued signals. It is interesting to note that we seem to need two real numbers per period to represent a given frequency, whether those two reals are separate samples or whether they are bundled into a single complex sample. But, we really need only the sign of the imaginary component, along with the entire value of the real component of a complex sample, to resolve the ambiguity between the helical frequencies $f$ and $-f$, although the full value of both real and imaginary components is required to get full information about amplitude and phase, and about multiple frequency components.
If we know in advance that we have a pure sinusoidal real-valued signal with frequency strictly less than $s/2$ (half the sampling rate), then we can determine the frequency, amplitude, and phase of that signal from its discrete form. For a signal with frequency right at $s/2$, the amplitude and phase get tangled together, and we can only determine a relation between them. In Chapter 4 we see that we can even determine the frequencies, amplitudes, and phases of all components of a sum of sines with frequencies all strictly less than $s/2$. The value $s/2$ is called the Nyquist limit, in honor of Harry Nyquist, who demonstrated its theoretical significance.
Superficially, the limitations of sampling appear to be characterized very simply by the Nyquist limit. But, the simple view does not apply with any precision in practice. Notice that the theoretical ability to infer sinusoidal signals from their discrete forms works not only for frequencies in $[0, s/2)$, but also for frequencies in any interval of length at most $s/2$, leaving out exact multiples of $s/2$. For example, if we know in advance that a signal is the sum of pure sines with frequencies in the range $[s/2, s)$, but none of them has frequency precisely $s/2$, then we may recover all information about frequency, amplitude, and phase, just as well as we can for frequencies in $[0, s/2)$. The mere mathematical ability to infer these values is not enough: we need to render discrete signals so that they sound ``correct.''
Suppose we are given the sampled version $\bar{u}$ of a continuous signal $u$ that is the sum of sinusoidal components with frequencies in $[0, s/2)$. In order to play back $\bar{u}$, we need to interpolate smoothly between the samples so that frequencies greater than or equal to $s/2$ are completely suppressed. We must do this filtering with analog equipment. But, we can't build filters that cut off precisely at $s/2$, and the sharper we make the cutoff, the more we introduce other undesirable inaccuracies--especially delays and changes in phase that depend on frequency.
Even if we could render discrete signals perfectly as continuous signals with frequencies below $s/2$, we wouldn't want to do so. Suppose we have a signal consisting of a single sine wave, with frequency well below $s/2$, but with the amplitude and/or frequency modulated (i.e., varying) slightly. In Chapter 4 we see that such a signal has pure sinusoidal components at almost every frequency, including those above $s/2$, and a rendering mechanism that eliminated those frequencies entirely would not give a precise presentation of our modulated signal.
To get good practical results in rendering sampled sound, we need to stay a substantial distance from the Nyquist limit. Because young humans hear pitches up to about 20,000 Hz, the standard high-fidelity sample rate (particularly for audio CDs) is 44,100 Hz. The precise value was chosen to be a multiple of the frame rate for standard video signals, but the idea is that the 22,050 Hz Nyquist limit for this sample rate should be high enough to accommodate signals up to 20,000 Hz. Some audiophiles argue that the 10% margin is not enough, and press for something above 50,000 Hz. Even though most of us do not hear frequencies above 20,000 Hz as pitches, it's possible that components at much higher frequencies affect our perception of the modulation of lower-frequency signals.
To get an appreciation of the practical complexity of the aliasing problem, study the Scilab animations of sampled sines. Look particularly closely at the sampled signal when the frequency gets close to the Nyquist limit. For small enough $\epsilon$, the discretization of a continuous signal at frequency $s/2 - \epsilon$ looks to the eye (and sounds to the ear) like a signal with frequency precisely $s/2$, whose amplitude is modulated at a frequency of $\epsilon$. In Chapter 4 we will see that such an amplitude-modulated signal results from adding together two pure sinusoidal signals at frequencies $s/2 - \epsilon$ and $s/2 + \epsilon$--the effect is called beating. Notice that $s/2 - \epsilon$ and $s/2 + \epsilon$ are aliases, so in some sense, whenever one of them is present in the discrete signal, the other is there as well.
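The near-Nyquist behavior can also be checked numerically. This Python sketch (the rate and the offset are arbitrary choices) shows that the samples of a sine just below half the sampling rate factor exactly into an alternating-sign carrier, which is what a sampled sine at precisely $s/2$ looks like, times a slowly varying envelope:

```python
import numpy as np

s = 8000.0                        # sampling rate, chosen arbitrarily
eps = 50.0                        # small offset below the Nyquist limit s/2
n = np.arange(400)
t = n / s

x = np.sin(2 * np.pi * (s / 2 - eps) * t)

# The samples trace a carrier at exactly s/2 (the alternating sign (-1)^n)
# whose amplitude is modulated at the much lower frequency eps.
carrier = (-1.0) ** n
envelope = -np.sin(2 * np.pi * eps * t)
print(np.allclose(x, carrier * envelope))   # True
```

The factorization follows from $\sin(\pi n - x) = -(-1)^n \sin(x)$, so the eye and ear perceive a Nyquist-frequency tone beating at frequency $\epsilon$.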
Aliasing occurs any time a continuous signal is converted to a discrete signal by sampling. It is natural to think of the case where a physical continuous signal is read by an electronic sampler, but this is not really the most important source of aliasing problems. The engineers who design and build samplers are pretty smart, and they have had plenty of time to worry about aliasing and find ways to prevent its harmful consequences. The most troublesome cases of aliasing arise when a continuous mathematical model of a sound signal is converted to a calculation of samples. The original continuous model may only exist in the mind of a person who is designing sound--it need not be present as a data structure in a computer, or in any other realization in an artificial medium. Even when there is an explicit representation of the continuous sound signal available as a data structure, the problem of avoiding aliasing in software is far more complex, due to the variety of conceptual sources for continuous signals. Flexible sound-processing software has largely failed to prevent the introduction of harmful aliasing in sampled signals.
The only cure for the harmful consequences of aliasing is prevention. Once a continuous sound signal has been replaced by a sampled discrete representation, and the continuous signal is no longer available for inspection, there is no way to determine which of the infinitely many possible continuous signals was truly intended. In order to prevent one continuous sound signal $u$, converted to the discrete signal $\bar{u}$, from being rendered continuously as some alias $v$ that sounds quite different, we must sample only signals that will be rendered accurately. With the usual ``smooth'' rendering techniques, a sampled complex-valued signal produces frequencies in the range $(-s/2, s/2)$, and a sampled real-valued signal produces frequencies in the range $[0, s/2)$. To avoid harmful aliasing, all higher frequencies must be filtered out from the continuous signal before sampling. For this reason, sampling converters have analog filters that eliminate high frequencies before sampling. Even though digital filters have many advantages, they cannot completely replace analog anti-aliasing filters, because they can only be applied after some sort of sampling, and at that point aliasing has already happened. Digital filters are used to avoid additional aliasing when a signal is converted from a higher to a lower sample rate. To avoid the aliasing of frequencies near $s/2$ with an amplitude-modulated signal (and presumably there are similar problems for complex-valued signals at frequencies near $\pm s/2$), continuous signals should in fact be filtered to an even smaller frequency interval, but it is not clear precisely how much smaller it needs to be.
When we say, ``digital signal,'' there is a strong tendency to assume that this must mean the result of sampling the instantaneous value of a continuous signal at regular time intervals. In fact, there are other ways to determine a discrete signal from a continuous one.
Instead of storing a value of the continuous signal, we could store the interval of values (that is, the maximum and minimum) over the interval surrounding a discrete sample time. In this style of discretization, each value represents the behavior of the signal through an interval of time, rather than the value of the signal at a given moment. In a discrete interval representation, no zero-crossings are lost, as they may be in a sampled representation. A large component above the Nyquist limit will produce a large interval, instead of an alias for a lower frequency component. The interval idea is a lot like the blurring that movie editors now employ to eliminate the backwards rotation of wagon wheels. We can learn something by thinking about discrete interval representations as a thought experiment, but it is unlikely that they will find a practical use.
Instead of storing the value of the continuous signal at a particular time, we could store its average value over the interval around a discrete sample time, or a weighted average. With average-value representations, components with frequencies above the Nyquist limit average to something near zero, so their influence is reduced quite a bit. In fact, analog-to-digital converters perform some sort of weighted averaging, partly because it is physically impossible to take a perfect instantaneous measurement of the signal strength. In Chapter 7 we see that averaging or weighted averaging over an interval is just a sort of low-pass filtering that suppresses high-frequency components.
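The suppression of above-Nyquist components by averaging is easy to simulate. In this Python sketch (the rate, frequency, and oversampling factor are arbitrary choices), a finely resolved signal stands in for the continuous one; instantaneous samples of an above-Nyquist tone alias at full strength, while interval averages are greatly attenuated:

```python
import numpy as np

s = 1000.0                 # sampling rate, chosen arbitrarily
f = 950.0                  # a component well above the Nyquist limit s/2
over = 200                 # fine subdivisions per sampling interval

# a finely resolved stand-in for the continuous signal
tt = np.arange(100 * over) / (s * over)
u = np.sin(2 * np.pi * f * tt)

point = u[::over]                          # instantaneous samples: full-strength alias
avg = u.reshape(-1, over).mean(axis=1)     # average over each sampling interval

def rms(x):
    return np.sqrt(np.mean(x ** 2))

print(rms(point), rms(avg))   # the averaged version is far weaker
```

Here the instantaneous samples alias to a full-amplitude 50 Hz tone, while the averaged samples keep only a few percent of that strength.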
Even a finite time segment from a discrete sound signal is an infinite object if the sample values are complex or real numbers. In order to get a completely finite digital representation of a sound signal, we also approximate the continuous range of real or complex numbers by a discrete subset. Since the consequences of this quantization of the domain of values are largely independent of the consequences of discretizing the time domain, we consider signals from the continuous time domain to a discrete subset of the complex or real numbers.
A subset $S$ of the complex numbers is discrete if we may draw a circle around each point in $S$, so that each circle contains only one point of $S$. If $S$ contains only real numbers, then it is also a discrete subset of the reals. While the discretization of time seems to make sense only with a constant interval between points, there are a number of different popular ways to quantize the real or complex values. For the domain of real numbers, the two basic ideas are linear and logarithmic quantization.
Given a real number $q > 0$, $L_q$ represents the linear quantization of the real domain with quantum interval $q$.

$L_q = \{\, iq : i \in \mathbf{Z} \,\}$   (3.10)
Given two real numbers $b > 1$ and $m > 0$, $G_{b,m}$ represents the logarithmic quantization of the real domain with base $b$ and minimum nonzero value $m$.

$G_{b,m} = \{ 0 \} \cup \{\, c m b^i : i \in \mathbf{Z}$, $i \geq 0$, and $c = \pm 1 \,\}$   (3.11)
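Here is a Python sketch of rounding into both sorts of discrete value sets. The function names, the threshold for rounding small values to 0, and the choice to round in the exponent for the logarithmic case are my own conventions for illustration, not prescribed by the definitions:

```python
import numpy as np

def round_linear(x, q):
    """Round x to the nearest point of L_q = { i*q : i an integer }."""
    return q * np.round(np.asarray(x, dtype=float) / q)

def round_log(x, b, m):
    """Round x into G_{b,m} = {0} union { +/- m*b**i : i >= 0 }, rounding in
    the exponent (one reasonable convention among several)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    big = np.abs(x) >= m / 2           # values that do not round to 0
    i = np.maximum(0, np.round(np.log(np.abs(x[big]) / m) / np.log(b)))
    out[big] = np.sign(x[big]) * m * b ** i
    return out

print(round_linear([0.04, -0.26], 0.1))    # 0.04 rounds to 0, -0.26 to -0.3
print(round_log([0.003, 0.9], 2.0, 0.01))  # 0.003 rounds to 0, 0.9 to 0.64
```

Notice how the logarithmic grid spends its values near zero: small inputs land on finely spaced points, while large inputs are rounded coarsely.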
The usual way to quantize the complex domain is to pick a quantization of the real domain, and then apply it to the real and imaginary components of complex numbers. So, we can define

$L_q^{\mathbf{C}} = \{\, x + yi : x, y \in L_q \,\}$   (3.12)

$G_{b,m}^{\mathbf{C}} = \{\, x + yi : x, y \in G_{b,m} \,\}$   (3.13)
Given an exact real signal $u$, we typically quantize it by rounding it to the closest value in a linear discrete domain:

$Q_{L_q}(u)(t) = \mathrm{round}_{L_q}(u(t))$, where $\mathrm{round}_{L_q}(x)$ is the element of $L_q$ closest to $x$   (3.17)

Quantization into $G_{b,m}$, $L_q^{\mathbf{C}}$, and $G_{b,m}^{\mathbf{C}}$ works analogously: round to the closest value of the discrete set, rounding the real and imaginary components of a complex value separately.
For each sort of quantization $Q$, the difference between a signal and its quantization is called quantization error.

$\mathrm{err}_Q(u)(t) = u(t) - Q(u)(t)$   (3.24)
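For linear quantization, the error signal is easy to compute and to bound. A Python sketch (quantum interval, rate, and test tone are arbitrary choices):

```python
import numpy as np

q = 0.05                                  # quantum interval, chosen arbitrarily
t = np.arange(1000) / 8000.0
u = 0.8 * np.sin(2 * np.pi * 440.0 * t)   # an exact signal

uq = q * np.round(u / q)                  # linear quantization by rounding
err = u - uq                              # the quantization error signal

# Rounding to the nearest multiple of q can never miss by more than q/2.
print(np.max(np.abs(err)) <= q / 2)       # True
```

The error is bounded by half the quantum interval, but it is strongly correlated with the signal, which is why it does not sound like ordinary independent noise.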
Although quantization error is mostly heard as an added noise, if the signal value is so small that it rounds to 0, the quantization error is heard as a distortion cancelling the signal entirely. If we listen to sounds that alternate silence (or something that rounds to silence) with audible segments, the audible quantization noise appears to turn on and off with the sound. This turns out to be much more annoying to perception than a constant soft noise.
With logarithmic rounding, a low amplitude high-frequency component may appear and disappear as the instantaneous value produced by larger amplitude low-frequency components moves the signal through different sized quantization intervals. Because the same amplitude signal has more power at higher frequencies, the low amplitude high-frequency component may be distinctly audible, while the larger amplitude low-frequency components may be less audible, or even inaudible. In this case, the effect of quantization is heard as a strange cross-modulation distortion, turning the high-frequency component on and off. Because of the addition of different components in a signal, logarithmic rounding is not an efficient way to produce high fidelity sound, although it could be useful for representing a single helical component.
Both the buzzy near-harmonic quality of quantization noise, and the turning on and off with the signal, make quantization noise more annoying than random noise. (Notice that the word ``noise'' has two different senses here: undesired sound vs. random sound spread across a range of frequencies). So, we usually cover up quantization noise with constant white noise, called dither. Dithered sampled signals can sound very good to the ear. But dithering reduces the effective dynamic range of a recording--the useful dynamic range goes between the loudness of the dither and the loudest portion of the signal, rather than between silence and the loudest portion. Marketing descriptions of CD technology often ignore this reduction of the effective dynamic range.
When should we add dither to a signal: before or after quantizing it? Dither is most effective at the lowest amplitude when added before quantization. After quantization, the amplitude of dither must be at least as big as the quantization interval to have any effect, and must go higher than that (about twice the quantization interval, I think, but this needs to be checked) in order to sound uniform. Before quantization, the smallest possible amplitude of dither has some effect on the rounding of values very near the middles of quantization intervals. Dither with amplitude approximately the same as the quantization interval already sounds satisfyingly uniform when added before quantization. Surprisingly, dither added before quantization can even reveal a previously inaudible component. A sine wave with amplitude just less than half the quantization interval rounds to 0, so it is inaudible. With dither, it rounds to a signal that varies randomly between 0 and small positive and negative values. The density of positive vs. negative values traces the value of the sine wave, and we can actually hear it along with the white noise of the dither.
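The revealing effect of dither added before quantization can be simulated directly. In this Python sketch (the rate, the tone, the uniform dither distribution, and all amplitudes are arbitrary choices), a sine with amplitude below half the quantum interval rounds to pure silence, yet survives when dither is added before rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
q = 0.1                                       # quantum interval, chosen arbitrarily
t = np.arange(20000) / 8000.0
u = 0.4 * q * np.sin(2 * np.pi * 100.0 * t)   # amplitude below q/2: rounds to silence

def quantize(x):
    return q * np.round(x / q)

silent = quantize(u)                          # every sample rounds to 0
dith = quantize(u + rng.uniform(-q / 2, q / 2, u.shape))  # dither added *before* rounding

print(np.all(silent == 0))                    # True: the undithered version is inaudible
# The dithered version jumps randomly among 0 and +/-q, but the density of
# positive vs. negative values traces the sine, so it is correlated with u.
print(np.corrcoef(dith, u)[0, 1])             # clearly positive
```

The previously inaudible component is now present in the quantized output, riding along with the white noise of the dither.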
I think there is some interesting interaction, but I haven't figured it out yet. If a signal varies fast enough to cross a quantization interval from one sample to the next, most of the detail of the quantization error is lost. In principle, the higher frequency part aliases down to lower frequencies. I'm not sure what sort of final impact we get on the sound.
Normal digital recording and playback uses real-valued signals instead of complex-valued signals. For computations to analyze, synthesize, and manipulate sound, we may use real or complex values as we wish. But most software for these purposes works with only real values. Because final playback only requires real values, the omission of imaginary components has no direct impact on what we hear. But some qualities of a signal may be easier to recognize and manipulate with a properly constructed complex-valued signal. For spectra, normal software uses complex values to represent the phase of spectral components.
A sensibly constructed complex-valued signal is often called an analytic signal, or a signal in quadrature. The word ``analytic'' appears to be arbitrary. ``In quadrature'' refers to the fact that sinusoidal components of the imaginary part of a complex-valued signal should be shifted in phase from the real part by $\pi/2$, which is one quarter of a circle. The same word is used in astronomy to refer to planets that are one quarter orbit apart. The word ``quadrature'' is also used sometimes for the operation of definite integration in calculus, which happens to shift the phases of sine waves by $\pi/2$. As far as I can tell, that is a coincidence.
This is the easy part. We just take the real part of the signal. Given a complex-valued signal $w$, the corresponding real-valued signal $u$ is

$u(t) = \Re(w(t))$   (3.25)
This conversion depends on precisely what relationship we decide should hold between the real and imaginary parts of a signal. If the two parts represent typical physical measurements, such as displacement and velocity, then one component should be the derivative of the other.

$w(t) = u(t) + u'(t) i$   (3.26)
The method of adding in the derivative of a signal as the imaginary part is appealing because the imaginary part depends only on the local character of the real part. But, if we do this operation on a sampled signal, the usual approximations of the derivative by finite differences are a bit suspect. Notice that

$\frac{\sin(2\pi f(t + 1/s)) - \sin(2\pi f(t - 1/s))}{2/s} = s \sin(2\pi f/s) \cos(2\pi ft)$   (3.27)

$\frac{d}{dt} \sin(2\pi ft) = 2\pi f \cos(2\pi ft)$   (3.28)

so the centered finite difference scales the true derivative by the factor $s \sin(2\pi f/s) / (2\pi f)$, which is close to 1 only when $f$ is well below the Nyquist limit, and which vanishes entirely at $f = s/2$.
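This frequency-dependent shrinkage is easy to measure. A Python sketch (rate and test frequencies are arbitrary choices) comparing the RMS strength of the centered difference against the true derivative:

```python
import numpy as np

s = 8000.0                 # sampling rate, chosen arbitrarily
t = np.arange(1, 200) / s  # some interior sample times

def diff_factor(f):
    """RMS ratio of the centered finite difference of sin(2*pi*f*t), taken on
    the sample grid, to the true derivative 2*pi*f*cos(2*pi*f*t)."""
    u = lambda x: np.sin(2 * np.pi * f * x)
    fd = (u(t + 1 / s) - u(t - 1 / s)) / (2 / s)      # centered difference
    true = 2 * np.pi * f * np.cos(2 * np.pi * f * t)  # exact derivative
    return np.sqrt(np.mean(fd ** 2) / np.mean(true ** 2))

print(diff_factor(100.0))   # very close to 1: accurate well below the Nyquist limit
print(diff_factor(3500.0))  # far below 1: the difference badly underestimates
print(diff_factor(3999.9))  # nearly 0: at s/2 the centered difference vanishes
```

So the finite-difference imaginary part systematically understates the amplitude of high-frequency components.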
Another natural constraint on the analytic signal is that it contains no negative frequency content. Using the Fourier transform, we can easily define such a signal:

$\hat{w}(f) = \begin{cases} 2\hat{u}(f) & f > 0 \\ \hat{u}(0) & f = 0 \\ 0 & f < 0 \end{cases}$   (3.29)

Equivalently, in the time domain, the imaginary part is given by the Hilbert transform $H$:

$w(t) = u(t) + H(u)(t)\, i$   (3.30)

$H(u)(t) = \frac{1}{\pi}\, \mathrm{p.v.} \int_{-\infty}^{\infty} \frac{u(\tau)}{t - \tau}\, d\tau$   (3.31)
Using the Hilbert transform to generate the imaginary part, we get an analytic signal composed of circular helixes. But $H(u)(t)$ depends on values of $u$ at times rather distant from $t$, since the hyperbola $1/(t - \tau)$ decays rather slowly. So time structure in a signal is somewhat blurred by the Hilbert transform.
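In discrete form, the no-negative-frequencies construction is usually approximated with the FFT, which is essentially what `scipy.signal.hilbert` does. Here is a minimal Python sketch for an even-length real signal (the signal length and test frequency are arbitrary choices):

```python
import numpy as np

def analytic(u):
    """Approximate the analytic signal of a real signal u by zeroing the
    negative-frequency half of its discrete Fourier transform."""
    N = len(u)
    U = np.fft.fft(u)
    H = np.zeros(N)
    H[0] = 1.0                # keep the zero-frequency term as is
    H[1:N // 2] = 2.0         # double the positive frequencies
    H[N // 2] = 1.0           # Nyquist bin (N even) is kept as is
    return np.fft.ifft(H * U)

n = np.arange(1024)
x = np.cos(2 * np.pi * 37 * n / 1024)   # a pure cosine, an exact number of cycles
w = analytic(x)

# The real part is the original cosine; the imaginary part is the quarter-cycle
# shifted sine, so w traces a circular helix.
print(np.allclose(w.real, x))
print(np.allclose(w.imag, np.sin(2 * np.pi * 37 * n / 1024)))
```

For a pure cosine the construction is exact; for signals with sharp time structure, the slow decay of the Hilbert kernel shows up as ringing spread across the block.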
When a signal is most naturally understood as a sum of modulated helixes, the Hilbert transform is not quite the right thing. For example, consider a helix modulated by a Gaussian bell curve: $u(t) = e^{-t^2} e^{2\pi fti}$. The Fourier transform gives a shifted bell curve: $\hat{u}(\nu) = \sqrt{\pi}\, e^{-\pi^2 (\nu - f)^2}$. Even though the frequency $f$ is positive, $u$ has negative-frequency content, since the shifted bell curve is nonzero (although very small) at every negative frequency. The Hilbert transform eliminates all negative-frequency content, so it will produce slightly peculiar approximations to modulated helixes.
I'm pretty sure that nobody has devised a perfect reconstruction of complex valued modulated helixes from real valued modulated sine waves. I even doubt that there is a clear definition of what this means, since a given signal can be represented in more than one way as a sum of modulated helixes.