Next: 4. The Frequency Spectrum Up: Digital Sound Modeling lecture Previous: 2. Perceptual Foundations of   Contents


3. Digital Sampled Sound

In Chapter 1 we modeled sound as a function $ \sigma$ from a real value $ t$ representing time to a complex value $ \sigma(t)$ representing a two-dimensional state of some vibrating system. In order to deal with sound digitally, we must somehow reduce such a signal to a finite number of symbols from a discrete set of possibilities, such as the bits 0 and $ 1$ that current digital computers use to represent all information. The digitization of sound is normally achieved by two independent steps, each of which has consequences for the fidelity with which sound is produced digitally.


3.1 Discrete time

3.1.0.0.1 Sound signals in the discrete time domain.

The usual first logical step in digitizing sound is to approximate the continuous domain of real values representing time by a discrete set of equally spaced values. Let $ S$ be a positive real number, called the sampling rate (the number of samples taken in a unit of time). $ \mathcal{N}$ represents the set of all integers. $ \mathcal{T}_S$ represents the domain of discrete time with sampling rate $ S$.

$\displaystyle \mathcal{N}$ $\displaystyle =$ $\displaystyle \{\ldots,-2,-1,0,1,2,3,\ldots\}$ (3.1)
$\displaystyle \mathcal{T}_S$ $\displaystyle =$ $\displaystyle \{k/S:\;k\in\mathcal{N}\}$ (3.2)

Every finitely represented sound spans some finite interval $ \{t_{\min},\ldots,t_{\max}\}$ rather than the infinite domain $ \mathcal{T}_S$, but we may only listen to a finite time-span of sound in a lifetime anyway, so the discretization of time is much more important than the limitation to a finite interval. A sound signal in the discrete time domain with sampling rate $ S$ is a complex-valued function $ \sigma$ on $ \mathcal{T}_S$. When discussing the domain $ \mathcal{T}_S$, we write the members of the domain as $ \ldots,t_{-2},t_{-1},t_0,t_1,t_2,t_3,\ldots$, where
$\displaystyle t_k$ $\displaystyle =$ $\displaystyle k/S$ (3.3)

Books and articles that deal only with sampled signals often measure time in units of $ 1/S$, so that $ t_{i}=i$.

3.1.0.0.2 Converting from continuous to discrete.

Given a continuous sound signal $ \sigma$, the obvious and natural choice for a discrete sound signal to represent it is $ \mathord{\mathcal{D}}_S(\sigma)$ where

$\displaystyle \mathord{\mathcal{D}}_S(\sigma)(t)$ $\displaystyle =$ $\displaystyle \sigma(t)$ for $\displaystyle t\in\mathcal{T}_S$ (3.4)

$ \mathord{\mathcal{D}}_S(\sigma)$ is just $ \sigma$ restricted to the domain $ \mathcal{T}_S$. Such a representation is inherently ambiguous--there are infinitely many different continuous sound signals represented by the same discrete sound signal. The confusion resulting from this ambiguity is called aliasing.
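As a concrete sketch, $ \mathcal{D}_S$ amounts to evaluating $ \sigma$ at the sample times $ t_k=k/S$. In Python (the function and constant names are mine, not from the text):

```python
import math

def discretize(sigma, S, k_min, k_max):
    """D_S(sigma): restrict a continuous signal to the sample
    times t_k = k/S, over a finite range of indices k."""
    return [sigma(k / S) for k in range(k_min, k_max + 1)]

# One second of a 440 Hz sine at sampling rate S = 8000.
S = 8000.0
sigma = lambda t: math.sin(2 * math.pi * 440.0 * t)
samples = discretize(sigma, S, 0, int(S) - 1)
```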

3.1.0.0.3 Aliasing.

Whenever two continuous sound signals $ \sigma_1$ and $ \sigma_2$ agree on every point in $ \mathcal{T}_S$ ( $ \sigma_1(t)=\sigma_2(t)$ for all $ t\in\mathcal{T}_S$), then they have the same representation $ \sigma_d=\mathord{\mathcal{D}}_S(\sigma_1)=\mathord{\mathcal{D}}_S(\sigma_2)$ as a discrete sound signal, and we say that $ \sigma_1$ and $ \sigma_2$ are aliases. The famous problem of wagon wheels appearing to roll backwards in old movies is an example of aliasing in a sampled video signal. The forward-rolling wagon wheel and the backward-rolling wagon wheel are aliases at the sampling rate imposed by the frames on the movie film. When we generate a real physical sound from $ \sigma_d$, a listener can only hear one of the infinitely many continuous sound signals that it might represent. This is undesirable in the case where we discretize one continuous sound, and the listener hears a different one.

Aliasing occurs between arbitrarily complicated signals, but we can understand its impact best by concentrating on pure helical complex-valued signals, and pure sinusoidal real-valued signals. Two continuous helical signals are aliases if and only if their amplitudes and phases are exactly the same, and their frequencies are the same $ \pmod{S}$. Don't worry if the mathematical discussion is hard to understand. The animated Scilab demonstrations are much easier to follow.

\begin{displaymath}\begin{array}{c} \text{For }R_1,R_2>0\text{, }P_1,P_2\in[0,2\pi)\text{, the helical signals }R_1e^{\mathord{\mbox{\boldmath\scriptsize$i$}}(P_1+2\pi F_1t)}\text{ and }R_2e^{\mathord{\mbox{\boldmath\scriptsize$i$}}(P_2+2\pi F_2t)}\\ \text{are aliases at sampling rate }S\\[2ex] \text{ if and only if }\\[2ex] R_1=R_2\text{ and }P_1=P_2\text{ and }F_1=F_2\pmod{S} \end{array}\end{displaymath} (3.5)

For real-valued signals, there is even more aliasing. Frequencies $ F_1$ and $ F_2$ may be aliased when $ F_1=-F_2\pmod{S}$ and the phases satisfy the reflection $ P_1+P_2=\pi$, because $ \sin\theta=\sin(\pi-\theta)$.

\begin{displaymath}\begin{array}{c} \text{For }R_1,R_2>0\text{, }P_1,P_2\in[0,2\pi)\text{, the sinusoidal signals }R_1\sin(P_1+2\pi F_1t)\text{ and }R_2\sin(P_2+2\pi F_2t)\\ \text{are aliases at sampling rate }S\\[2ex] \text{ if }\\[2ex] R_1=R_2\text{ and }P_1+P_2=\pi\text{ and }F_1=-F_2\pmod{S} \end{array}\end{displaymath} (3.6)
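This condition can be spot-checked numerically. The following Python sketch (sampling rate, amplitude, and phases are arbitrary choices of mine, not from the text) builds two sinusoids with $ F_1=-F_2\pmod{S}$ and reflected phases, and confirms that they agree at every sample time while disagreeing in between:

```python
import math

S = 1000.0          # sampling rate
R, P1, F1 = 0.8, 0.3, 110.0
F2 = S - F1         # so F1 = -F2 (mod S)
P2 = math.pi - P1   # reflected phases, as in Equation 3.6

s1 = lambda t: R * math.sin(P1 + 2 * math.pi * F1 * t)
s2 = lambda t: R * math.sin(P2 + 2 * math.pi * F2 * t)

# The two continuous signals disagree between sample times...
assert abs(s1(0.0001) - s2(0.0001)) > 0.1
# ...but agree exactly at every sample time t_k = k/S: aliases.
for k in range(200):
    assert abs(s1(k / S) - s2(k / S)) < 1e-9
```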

For real-valued signals at frequencies that are exact multiples of half the sampling rate, there is even confusion about the amplitude and phase. In the case of odd multiples of half the sampling rate, the samples are all equal in magnitude and alternating in sign. The common magnitude of the samples depends on the phase at which the samples are taken, which is the same within each half wave.

\begin{displaymath}\begin{array}{c} \text{For }R_1,R_2>0\text{, }P_1,P_2\in[0,2\pi)\text{, and odd integers }m_1,m_2\text{,}\\ R_1\sin(P_1+2\pi(m_1S/2)t)\text{ and }R_2\sin(P_2+2\pi(m_2S/2)t)\text{ are aliases at sampling rate }S\\[2ex] \text{ if and only if }\\[2ex] R_1\sin(P_1)=R_2\sin(P_2) \end{array}\end{displaymath} (3.7)

In the special case where $ P_1=P_2=0$, all samples are 0, and the amplitudes could have any values. For multiples of the sampling rate, all samples have the same value, and again there is a tradeoff between amplitude and phase.

\begin{displaymath}\begin{array}{c} \text{For }R_1,R_2>0\text{, }P_1,P_2\in[0,2\pi)\text{, and integers }m_1,m_2\text{,}\\ R_1\sin(P_1+2\pi m_1St)\text{ and }R_2\sin(P_2+2\pi m_2St)\text{ are aliases at sampling rate }S\\[2ex] \text{ if and only if }\\[2ex] R_1\sin(P_1)=R_2\sin(P_2) \end{array}\end{displaymath} (3.8)
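Both special cases can be verified directly in Python (the constants are my own illustrative choices): at an odd multiple of $ S/2$ the samples alternate in sign with magnitude $ R\sin(P)$, and at a multiple of $ S$ every sample equals $ R\sin(P)$:

```python
import math

S = 1000.0
R, P = 1.0, 0.4

# Frequency at an odd multiple of S/2: samples alternate in sign,
# with common magnitude R*sin(P)  (Equation 3.7).
f_half = 3 * S / 2
for k in range(50):
    sample = R * math.sin(P + 2 * math.pi * f_half * k / S)
    assert abs(sample - R * math.sin(P) * (-1) ** k) < 1e-9

# Frequency at a multiple of S: every sample is R*sin(P)  (Equation 3.8).
f_full = 2 * S
for k in range(50):
    sample = R * math.sin(P + 2 * math.pi * f_full * k / S)
    assert abs(sample - R * math.sin(P)) < 1e-9
```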

Finally, for completeness, notice that an odd multiple of half the sampling rate aliases with a multiple of the sampling rate precisely when the phases are both 0, so that all samples have the value 0.

\begin{displaymath}\begin{array}{c} \text{For }R_1,R_2>0\text{, }P_1,P_2\in[0,2\pi)\text{, an odd integer }m_1\text{, and an integer }m_2\text{,}\\ R_1\sin(P_1+2\pi(m_1S/2)t)\text{ and }R_2\sin(P_2+2\pi m_2St)\text{ are aliases at sampling rate }S\\[2ex] \text{ if and only if }\\[2ex] P_1=P_2=0 \end{array}\end{displaymath} (3.9)

3.1.0.0.4 Rendering a discrete sound signal as continuous sound.

Having computed a discrete sound signal $ \sigma_d$, we need to render it through a loudspeaker or similar controllable vibrating device in order to hear the sound. At some point in the rendering process, $ \sigma_d$ is converted to a continuous signal $ \sigma_c$. It is natural to choose one of the infinitely many continuous sound signals that $ \sigma_d$ might represent. In particular, it is natural to create $ \sigma_c$ by interpolating values between the ones given by $ \sigma_d$ in such a way as to make the resulting sound signal as smooth as possible, according to some appropriate definition of smoothness. The interpolating is normally done, not by a digital computation, but by the analog machinery (usually electronic) controlled by the computation. All sorts of smoothing tend to depress higher frequencies. In Chapter 4 we find that when the sampling rate is high enough, and all significant frequency content is far below the Nyquist limit, the precise nature of the interpolation is relatively unimportant. If we try to play back signals with significant content near the Nyquist limit (within about 100 Hz), we are likely to hear the beating between those components and their aliases.

The sorts of analog devices commonly used for sound production typically interpolate so that the final result is close to a sum of sinusoidal signals of the lowest possible frequencies. So, if $ F\bmod S$ is substantially smaller than half the sampling rate, the discrete real sinusoidal signal $ \sigma_d(t)=R\sin(P+2\pi Ft)$ is normally rendered as something very close to the continuous signal $ \sigma_c(t)=R\sin(P+2\pi(F\bmod S)t)$.

Not all aliases of a helical or sinusoidal signal are helical or sinusoidal themselves. For real-valued signals, the frequencies near half the sampling rate ($ S/2$) alias to signals that are amplitude modulations of a carrier with frequency $ S$ (see Figure ???). In many cases, these amplitude-modulated signals represent the way that a rendered signal is likely to be heard. For complex-valued helical signals, the problem of non-helical aliases seems to be less important, but I know very little about the rendering of complex-valued signals. It is interesting to note that we seem to need two real numbers per period to represent a given frequency, whether those two reals are separate samples or whether they are bundled into a single complex sample. But, we really need only the sign of the imaginary component, along with the entire value of the real component of a complex sample, to resolve the ambiguity between helical frequencies $ F_1=-F_2\pmod{S}$, although the full value of both real and imaginary components is required to get full information about amplitude and phase, and about multiple frequency components.

3.1.0.0.5 The Nyquist limit.

If we know in advance that we have a pure sinusoidal real-valued signal with frequency strictly less than $ S/2$ (half the sampling rate), then we can determine the frequency, amplitude, and phase of that signal from its discrete form. For a signal with frequency right at $ S/2$, the amplitude and phase get tangled together, and we can only determine a relation between them. In Chapter 4 we see that we can even determine the frequencies, amplitudes, and phases of all components of a sum of sines with frequencies all strictly less than $ S/2$. $ S/2$ is called the Nyquist limit, in honor of Harry Nyquist, who demonstrated its theoretical significance.

Superficially, the limitations of sampling appear to be characterized very simply by the Nyquist limit. But, the simple view does not apply with any precision in practice. Notice that the theoretical ability to infer sinusoidal signals from their discrete forms works not only for frequencies in $ [0,S/2)$, but also for frequencies in any interval of length less than $ S/2$, leaving out exact multiples of $ S/2$. For example, if we know in advance that a signal is the sum of pure sines with frequencies in the range $ (3.1S,3.6S]$, but none of them has frequency precisely $ 3.5S$, then we may recover all information about frequency, amplitude, and phase, just as well as we can for frequencies in $ [0,S/2)$. The mere mathematical ability to infer these values is not enough: we need to render discrete signals so that they sound ``correct.''

Suppose we are given the sampled version $ \mathord{\mathcal{D}}_S(\sigma)$ of a continuous signal $ \sigma$ that is the sum of sinusoidal components with frequencies in $ [0,S/2)$. In order to play back $ \sigma$, we need to interpolate smoothly between the samples so that frequencies greater than or equal to $ S/2$ are completely suppressed. We must do this filtering with analog equipment. But, we can't build filters that cut off precisely at $ S/2$, and the sharper we make the cutoff, the more we introduce other undesirable inaccuracies--especially delays and changes in phase that depend on frequency.

Even if we could render discrete signals perfectly as continuous signals with frequencies below $ S/2$, we wouldn't want to do so. Suppose we have a signal consisting of a single sine wave, with frequency well below $ S/2$, but with the amplitude and/or frequency modulated (i.e., varying) slightly. In Chapter 4 we see that such a signal has pure sinusoidal components at almost every frequency, including those above $ S/2$, and a rendering mechanism that eliminated those frequencies entirely would not give a precise presentation of our modulated signal.

To get good practical results in rendering sampled sound, we need to stay a substantial distance from the Nyquist limit. Because young humans hear pitches up to about 20,000 Hz, the standard high-fidelity sample rate (particularly for audio CDs) is 44,100 Hz. The precise value was chosen to be a multiple of the frame rate for standard video signals, but the idea is that the 22,050 Hz Nyquist limit for this sample rate should be high enough to accommodate signals up to 20,000 Hz. Some audiophiles argue that the 10% margin is not enough, and press for something above 50,000 Hz. Even though most of us do not hear frequencies above 20,000 Hz as pitches, it is possible that components at much higher frequencies affect our perception of the modulation of lower-frequency signals.

To get an appreciation of the practical complexity of the aliasing problem, study the Scilab animations of sampled sines. Look particularly closely at the sampled signal when the frequency gets close to the Nyquist limit. For small enough $ F$, the discretization of a continuous signal $ \sin(2\pi(S/2-F)t)$ looks to the eye (and sounds to the ear) like a signal with frequency precisely $ S/2$, whose amplitude is modulated at a frequency of $ 2F$. In Chapter 4 we will see that such an amplitude-modulated signal results from adding together two pure sinusoidal signals at frequencies $ S/2-F$ and $ S/2+F$--the effect is called beating. Notice that $ \sin(2\pi(S/2-F)t)$ and $ -\sin(2\pi(S/2+F)t)$ are aliases, so in some sense, whenever one of them is present in the discrete signal, the other is there as well.
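A small Python check (constants mine) of the near-Nyquist behavior just described: each sample of $ \sin(2\pi(S/2-F)t)$ is an alternating sign times a slowly varying envelope, which the eye and ear read as an $ S/2$ carrier amplitude-modulated at rate $ 2F$:

```python
import math

S = 1000.0
F = 5.0    # small offset below the Nyquist limit S/2

# Each sample of sin(2*pi*(S/2 - F)t) equals (-1)^(k+1) times the
# slowly varying envelope sin(2*pi*F*t_k): a full-rate alternation
# whose amplitude swells and shrinks at rate 2F (beating).
for k in range(400):
    t_k = k / S
    sample = math.sin(2 * math.pi * (S / 2 - F) * t_k)
    envelope = (-1) ** (k + 1) * math.sin(2 * math.pi * F * t_k)
    assert abs(sample - envelope) < 1e-9
```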

3.1.0.0.6 Causes of and cures for aliasing.

Aliasing occurs any time a continuous signal is converted to a discrete signal by sampling. It is natural to think of the case where a physical continuous signal is read by an electronic sampler, but this is not really the most important source of aliasing problems. The engineers who design and build samplers are pretty smart, and they have had plenty of time to worry about aliasing and find ways to prevent its harmful consequences. The most troublesome cases of aliasing arise when a continuous mathematical model of a sound signal is converted to a calculation of samples. The original continuous model may only exist in the mind of a person who is designing sound--it need not be present as a data structure in a computer, or in any other realization in an artificial medium. Even when there is an explicit representation of the continuous sound signal available as a data structure, the problem of avoiding aliasing in software is far more complex, due to the variety of conceptual sources for continuous signals. Flexible sound-processing software has largely failed to prevent the introduction of harmful aliasing in sampled signals.

The only cure for the harmful consequences of aliasing is prevention. Once a continuous sound signal has been replaced by a sampled discrete representation, and the continuous signal is no longer available for inspection, there is no way to determine which of the infinitely many possible continuous signals was truly intended. In order to prevent one continuous sound signal $ \sigma_1$, converted to the discrete signal $ \sigma_d=\mathcal{D}_S(\sigma_1)$, from being rendered continuously as some alias $ \sigma_2$ that sounds quite different, we must sample only signals that will be rendered accurately. With the usual ``smooth'' rendering techniques, a sampled complex-valued signal produces frequencies in the range $ [0,S)$, and a sampled real-valued signal produces frequencies in the range $ [0,S/2)$. To avoid harmful aliasing, all higher frequencies must be filtered out from the continuous signal before sampling. For this reason, sampling converters have analog filters that eliminate high frequencies before sampling. Even though digital filters have many advantages, they cannot completely replace analog anti-aliasing filters, because they can only be applied after some sort of sampling, and at that point aliasing has already happened. Digital filters are used to avoid additional aliasing when a signal is converted from a higher to a lower sample rate. To avoid the aliasing of frequencies near $ S/2$ with an amplitude-modulated signal (and presumably there are similar problems for complex-valued signals at frequencies near $ S$), continuous signals should in fact be filtered to an even smaller frequency interval, but it is not clear precisely how much smaller it needs to be.

3.1.0.0.7 Discrete signal values other than samples.

When we say, ``digital signal,'' there is a strong tendency to assume that this must mean the result of sampling the instantaneous value of a continuous signal at regular time intervals. In fact, there are other ways to determine a discrete signal from a continuous one.

Instead of storing a value of the continuous signal, we could store the interval of values (that is, the maximum and minimum) over the interval surrounding a discrete sample time. In this style of discretization, each value represents the behavior of the signal through an interval of time, rather than the value of the signal at a given moment. In a discrete interval representation, no zero-crossings are lost, as they may be in a sampled representation. A large component above the Nyquist limit will produce a large interval, instead of an alias for a lower frequency component. The interval idea is a lot like the blurring that movie editors now employ to eliminate the backwards rotation of wagon wheels. We can learn something by thinking about discrete interval representations as a thought experiment, but it is unlikely that they will find a practical use.

Instead of storing the value of the continuous signal at a particular time, we could store its average value over the interval around a discrete sample time, or a weighted average. With average-value representations, components with frequencies above the Nyquist limit average to something near zero, so their influence is reduced quite a bit. In fact, analog-to-digital converters perform some sort of weighted averaging, partly because it is physically impossible to take a perfect instantaneous measurement of the signal strength. In Chapter 7 we see that averaging or weighted averaging over an interval is just a sort of low-pass filtering that suppresses high-frequency components.
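The suppression of above-Nyquist components by averaging can be sketched in Python (the helper name and constants are mine; a real converter's weighting would differ):

```python
import math

def averaged_sample(sigma, t, width, n=1000):
    """Approximate the mean of sigma over [t - width/2, t + width/2]
    by a midpoint Riemann sum -- a crude model of an averaging converter."""
    dt = width / n
    total = sum(sigma(t - width / 2 + (j + 0.5) * dt) for j in range(n))
    return total / n

S = 1000.0
high = lambda t: math.sin(2 * math.pi * 4000.0 * t)  # far above Nyquist (500 Hz)

t0 = 1.0 / 16000.0   # a time where the 4 kHz sine is at its peak
# Instantaneous sampling reports a large value...
assert abs(high(t0) - 1.0) < 1e-9
# ...while averaging over one sample interval nearly cancels it.
assert abs(averaged_sample(high, t0, 1.0 / S)) < 0.1
```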

3.2 Quantized Vibration State

Even a finite time segment from a discrete sound signal is an infinite object if the sample values are complex or real numbers. In order to get a completely finite digital representation of a sound signal, we also approximate the continuous range of real or complex numbers by a discrete subset. Since the consequences of this quantization of the domain of values are largely independent of the consequences of discretizing the time domain, we consider signals from the continuous time domain to a discrete subset of the complex or real numbers.

3.2.0.0.1 Discrete sets of complex or real values.

A subset $ \mathcal{V}$ of the complex numbers is discrete if we may draw a circle around each point in $ \mathcal{V}$, so that each circle contains only one point in $ \mathcal{V}$. If $ \mathcal{V}$ contains only real numbers, then it is also a discrete subset of the reals. While the discretization of time seems to make sense only with a constant interval between points, there are a number of different popular ways to quantize the real or complex values $ \sigma(t)$. For the domain of real numbers, the two basic ideas are linear and logarithmic quantization.

Given a real number $ Q>0$, $ \mathcal{V}_Q$ represents the linear quantization of the real domain with quantum interval $ Q$.

$\displaystyle \mathcal{V}_Q$ $\displaystyle =$ $\displaystyle \{kQ:\;k\in\mathcal{N}\}$ (3.10)

Yes, this is mathematically the same thing as the discrete time domain $ \mathcal{T}_{1/Q}$, but we think of it as having a different physical dimension. Just as in the case of the discrete time domain, a finite digital representation requires that we limit ourselves to a finite interval within $ \mathcal{V}_Q$, but it is the use of a discrete subset of values, rather than the limitation to a finite interval, that has the most interesting consequences for sound modeling.

Given two real numbers $ B>1$ and $ M>0$, $ \mathcal{L}_{B,M}$ represents the logarithmic quantization of the real domain with base $ B$ and minimum nonzero value $ M$.

$\displaystyle \mathcal{L}_{B,M}$ $\displaystyle =$ $\displaystyle \{0\}\cup\{B^k,-B^k:\;k\in\mathcal{N}$ and $\displaystyle B^k\geq M\}$ (3.11)

Notice that the minimum value $ M$ is required to make the domain discrete, even if the 0 value is omitted. $ \mathcal{L}_{B,M}$ is an idealized abstraction of several different essentially logarithmic quantizations, such as ``mu-law'' encoding, but it does not represent them precisely. The crucial quality of logarithmic domains is that the interval between points goes up exponentially with the magnitude of the points: the $ \mathcal{L}_{B,M}$s are the mathematically simplest sort of domains with that crucial quality. Floating-point domains are a funny hybrid of linear and logarithmic: they consist of finite segments of different linear domains pieced together so that the progression over larger segments is essentially logarithmic.

The usual way to quantize the complex domain is to pick a quantization of the real domain, and then apply it to the real and imaginary components of complex numbers. So, we can define

$\displaystyle \mathcal{V}^2_Q$ $\displaystyle =$ $\displaystyle \{x+\mathord{\mbox{\boldmath$i$}}y:\;x,y\in\mathcal{V}_Q\}$ (3.12)
$\displaystyle \mathcal{L}^2_{B,M}$ $\displaystyle =$ $\displaystyle \{x+\mathord{\mbox{\boldmath$i$}}y:\;x,y\in\mathcal{L}_{B,M}\}$ (3.13)

There are many other ways in principle to select a discrete set of points in the complex plane. From some points of view, it makes sense to quantize the polar form of a complex number. In that case, the most natural thing to do with angles is to divide the circle into an integral number of pie slices:
$\displaystyle \mathcal{A}_k$ $\displaystyle =$ $\displaystyle \{0,2\mathord{\mbox{\boldmath$\pi$}}/k,4\mathord{\mbox{\boldmath$\pi$}}/k,\dots,2(k-1)\mathord{\mbox{\boldmath$\pi$}}/k\}$ (3.14)

Then, we get a quantized polar form, using either linear or logarithmic quantization for the magnitude.
$\displaystyle \mathcal{V}_Q\mathcal{A}_k$ $\displaystyle =$ $\displaystyle \{re^{\mathord{\mbox{\boldmath\scriptsize$i$}}p}:\;r\in\mathcal{V}_Q\text{ and }r\geq 0\text{ and }p\in\mathcal{A}_k\}$ (3.15)
$\displaystyle \mathcal{L}_{B,M}\mathcal{A}_k$ $\displaystyle =$ $\displaystyle \{re^{\mathord{\mbox{\boldmath\scriptsize$i$}}p}:\;r\in\mathcal{L}_{B,M}\text{ and }r\geq 0\text{ and }p\in\mathcal{A}_k\}$ (3.16)

Mathematically, it seems more consistent to quantize the magnitude logarithmically, since the angle is a logarithmic component of a complex number. The logarithmic and polar quantizations appear very attractive for quantization of a pure helical signal, since our perception of loudness is logarithmic in the magnitude. But we see below that they do not behave very well as quantizations of a signal with many components.

3.2.0.0.2 Rounding to discrete values.

Given an exact real signal $ s$, we typically quantize it by rounding it to the closest value in a linear discrete domain:

$\displaystyle \mathcal{RV}_Q(s)$ $\displaystyle = Q\lfloor s/Q+1/2\rfloor$   $\displaystyle \in\mathcal{V}_Q$ (3.17)

We can round an angle similarly:

$\displaystyle \mathcal{RA}_k(a)$ $\displaystyle = (2\mathord{\mbox{\boldmath$\pi$}}/k)\lfloor ak/(2\mathord{\mbox{\boldmath$\pi$}})+1/2\rfloor\bmod 2\mathord{\mbox{\boldmath$\pi$}}$   $\displaystyle \in\mathcal{A}_k$ (3.18)

We round an exact complex signal by rounding the real and imaginary parts, or the magnitude and angle:

$\displaystyle \mathcal{RV}^2_Q(\sigma)$ $\displaystyle = \mathcal{RV}_Q(\Re(\sigma))+\mathord{\mbox{\boldmath$i$}}\mathcal{RV}_Q(\Im(\sigma))$   $\displaystyle \in\mathcal{V}^2_Q$ (3.19)
$\displaystyle \mathcal{RV}_Q\mathcal{A}_k(\sigma)$ $\displaystyle = \mathcal{RV}_Q(\vert\sigma\vert)e^{\mathord{\mbox{\boldmath\scriptsize$i$}}\mathcal{RA}_k(\arg(\sigma))}$   $\displaystyle \in\mathcal{V}_Q\mathcal{A}_k$ (3.20)
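Equations 3.17, 3.18, and 3.20 translate directly into Python (the function names are my own):

```python
import math

def round_linear(s, Q):
    """R V_Q: round s to the nearest multiple of Q (Equation 3.17)."""
    return Q * math.floor(s / Q + 0.5)

def round_angle(a, k):
    """R A_k: round angle a to the nearest of k pie slices (Equation 3.18)."""
    step = 2 * math.pi / k
    return (step * math.floor(a / step + 0.5)) % (2 * math.pi)

def round_polar(z, Q, k):
    """R V_Q A_k: round magnitude linearly, angle to k slices (Equation 3.20)."""
    r = round_linear(abs(z), Q)
    p = round_angle(math.atan2(z.imag, z.real) % (2 * math.pi), k)
    return r * complex(math.cos(p), math.sin(p))
```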

In rounding a real signal to a logarithmic discrete domain, it's not clear whether we should round the value itself or its logarithm. In practice it will not make a lot of difference. I define rounding to the nearest value just for something definite to discuss.

$\displaystyle \mathcal{RL}_{B,M}(s) = \left\{ \begin{array}{ll} -B^k & \text{if }M\leq (B^{k-1}+B^k)/2\text{ and }-(B^k+B^{k+1})/2< s\leq -(B^{k-1}+B^k)/2\\ 0 & \text{if }\vert s\vert<(B^{k-1}+B^k)/2\text{ for every }k\text{ with }B^k\geq M\\ B^k & \text{if }M\leq (B^{k-1}+B^k)/2\leq s<(B^k+B^{k+1})/2 \end{array} \right.$ (3.21)

$\displaystyle \mathcal{RL}^2_{B,M}(\sigma)$ $\displaystyle = \mathcal{RL}_{B,M}(\Re(\sigma))+\mathord{\mbox{\boldmath$i$}}\mathcal{RL}_{B,M}(\Im(\sigma))$   $\displaystyle \in\mathcal{L}^2_{B,M}$ (3.22)
$\displaystyle \mathcal{RL}_{B,M}\mathcal{A}_k(\sigma)$ $\displaystyle = \mathcal{RL}_{B,M}(\vert\sigma\vert)e^{\mathord{\mbox{\boldmath\scriptsize$i$}}\mathcal{RA}_k(\arg(\sigma))}$   $\displaystyle \in\mathcal{L}_{B,M}\mathcal{A}_k$ (3.23)
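A Python sketch of nearest-value logarithmic rounding in the spirit of Equation 3.21 (the function name and the handling of the boundary near the minimum value $ M$ are my own choices):

```python
import math

def round_log(s, B, M):
    """R L_{B,M}: round s to the nearest member of {0} union {+/-B^k : B^k >= M}.
    Rounds the value itself, not its logarithm."""
    if s == 0.0:
        return 0.0
    mag, sign = abs(s), (1.0 if s > 0 else -1.0)
    k_min = math.ceil(math.log(M, B))        # smallest allowed exponent
    k = max(math.floor(math.log(mag, B)), k_min - 1)
    # candidate quanta: 0 (only reachable below the minimum) and nearby powers
    candidates = [0.0] if k < k_min else []
    candidates += [B ** j for j in (k, k + 1) if j >= k_min]
    return sign * min(candidates, key=lambda v: abs(mag - v))
```

With base 2 and minimum 1 the allowed magnitudes are 0, 1, 2, 4, ..., so 3.1 rounds up to 4 while 0.4 rounds down to 0.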

3.2.0.0.3 Quantization error as noise.

For each sort of quantization $ \mathcal{Q}$, the difference between a signal and its quantization is called quantization error.

$\displaystyle \mathcal{EQ}(\sigma)=\mathcal{RQ}(\sigma)-\sigma$ (3.24)

Quantization error in a signal is mostly heard as a sort of noise added on to the desired signal. The error is typically a sawtoothish function. Since the error function has one sawtooth each time the signal crosses a quantization boundary, the frequency of the sawtooths is proportional to the derivative $ \sigma'$ of the signal. The shape of each sawtooth is determined by the shape of the signal at that time. Typically, the sawtooth functions from quantization error sound rather buzzy, and approximately but not exactly harmonic. The amplitude of the sawtooths is the width of the quantization intervals (except for a zero signal, which has no quantization error). For linear quantization methods, the amplitude is essentially constant, except for the special case of a signal so small that it rounds down to 0.
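A quick Python check (constants mine) of the two behaviors just described: the linear quantization error never exceeds half a quantum interval, and a signal that stays below half a quantum rounds to 0 entirely:

```python
import math

Q = 0.1                                  # quantum interval
quant = lambda v: Q * math.floor(v / Q + 0.5)
err = lambda v: quant(v) - v             # quantization error, Equation 3.24

# The error is a sawtoothish function whose peak-to-peak amplitude is
# the quantum interval Q, so it never exceeds Q/2 in magnitude.
sine = lambda t: 0.73 * math.sin(2 * math.pi * 5.0 * t)
assert all(abs(err(sine(j / 1000.0))) <= Q / 2 + 1e-12 for j in range(1000))

# A signal smaller than Q/2 rounds down to 0: the "error" cancels it.
small = lambda t: 0.04 * math.sin(2 * math.pi * 5.0 * t)
assert all(quant(small(j / 1000.0)) == 0.0 for j in range(1000))
```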

Although quantization error is mostly heard as an added noise, if $ \sigma$ is so small that it rounds to 0, the quantization error is heard as a distortion cancelling the signal entirely. If we listen to sounds that alternate silence (or something that rounds to silence) with audible segments, the audible quantization noise appears to turn on and off with the sound. This turns out to be much more annoying to perception than a constant soft noise.

With logarithmic rounding, a low amplitude high-frequency component may appear and disappear as the instantaneous value produced by larger amplitude low-frequency components moves the signal through different sized quantization intervals. Because the same amplitude signal has more power at higher frequencies, the low amplitude high-frequency component may be distinctly audible, while the larger amplitude low-frequency components may be less audible, or even inaudible. In this case, the effect of quantization is heard as a strange cross-modulation distortion, turning the high-frequency component on and off. Because of the addition of different components in a signal, logarithmic rounding is not an efficient way to produce high fidelity sound, although it could be useful for representing a single helical component.

3.2.0.0.4 Masking quantization error with dither.

Both the buzzy near-harmonic quality of quantization noise, and the turning on and off with the signal, make quantization noise more annoying than random noise. (Notice that the word ``noise'' has two different senses here: undesired sound vs. random sound spread across a range of frequencies). So, we usually cover up quantization noise with constant white noise, called dither. Dithered sampled signals can sound very good to the ear. But dithering reduces the effective dynamic range of a recording--the useful dynamic range goes between the loudness of the dither and the loudest portion of the signal, rather than between silence and the loudest portion. Marketing descriptions of CD technology often ignore this reduction of the effective dynamic range.

When should we add dither to a signal: before or after quantizing it? Dither is most effective at the lowest amplitude when added before quantization. After quantization, the amplitude of dither must be at least as big as the quantization interval to have any effect, and must go higher than that (about twice the quantization interval, I think, but this needs to be checked) in order to sound uniform. Before quantization, the smallest possible amplitude of dither has some effect on the rounding of values very near the middles of quantization intervals. Dither with amplitude approximately the same as the quantization interval already sounds satisfyingly uniform when added before quantization. Surprisingly, dither added before quantization can even reveal a previously inaudible component. A sine wave with amplitude just less than half the quantization interval rounds to 0, so it is inaudible. With dither, it rounds to a signal that varies randomly between 0 and small positive and negative values. The density of positive vs. negative values traces the value of the sine wave, and we can actually hear it along with the white noise of the dither.
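The effect of adding dither before quantization can be simulated in Python (the quantum, rates, and the uniform dither distribution are illustrative choices of mine): a sine below half a quantum rounds to silence undithered, but with dither the quantized output correlates positively with the sine, so the sine is audible through the noise:

```python
import math
import random

random.seed(0)
Q = 0.1
S = 8000.0
quant = lambda v: Q * math.floor(v / Q + 0.5)
sine = lambda t: 0.04 * math.sin(2 * math.pi * 50.0 * t)  # below Q/2

# Without dither, every sample of the sub-quantum sine rounds to 0.
assert all(quant(sine(k / S)) == 0.0 for k in range(8000))

# With dither added before quantization, the density of +Q vs -Q outputs
# traces the sine: the correlation of output with the sine is positive.
corr = 0.0
for k in range(8000):
    t = k / S
    dithered = quant(sine(t) + random.uniform(-Q / 2, Q / 2))
    corr += dithered * sine(t)
assert corr > 0.0
```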

3.2.0.0.5 Interaction of quantization with sampling.

I think there is some interesting interaction, but I haven't figured it out yet. If a signal varies fast enough to cross a quantization interval from one sample to the next, most of the detail of the quantization error is lost. In principle, the higher frequency part aliases down to lower frequencies. I'm not sure what sort of final impact we get on the sound.

3.2.1 Real vs. Complex Signals

Normal digital recording and playback uses real-valued signals instead of complex-valued signals. For computations that analyze, synthesize, and manipulate sound, we may use real or complex values as we wish, but most software for these purposes works with only real values. Because final playback only requires real values, the omission of imaginary components has no direct impact on what we hear. But some qualities of a signal may be easier to recognize and manipulate with a properly constructed complex-valued signal. For spectra, normal software uses complex values to represent the phase of spectral components.

A sensibly constructed complex-valued signal is often called an analytic signal, or a signal in quadrature. The word ``analytic'' appears to be arbitrary. ``In quadrature'' refers to the fact that sinusoidal components of the imaginary part of a complex-valued signal should be shifted in phase from the real part by $ \pi/2$, which is one quarter of a circle. The same word is used in astronomy to refer to planets that are one quarter orbit apart. The word ``quadrature'' is also used sometimes for the operation of definite integration in calculus, which happens to shift the phases of sine waves by $ \pi/2$. As far as I can tell, that is a coincidence.

3.2.1.0.1 Converting a complex signal to real.

This is the easy part. We just take the real part of the signal.

$\displaystyle s=\Re(\sigma)$ (3.25)

In principle, we could take the imaginary part, or any linear combination of real and imaginary. These are all projections, vaguely analogous to (but much simpler than) the projections of three-dimensional scenes onto two-dimensional graphic displays. Typically, we have constructed a complex signal with the expectation that the real component will be played to our loudspeakers in the end.

3.2.1.0.2 Converting a real signal to complex.

This conversion depends on precisely what relationship we decide should hold between the real and imaginary parts of a signal. If the two parts represent typical physical measurements, such as displacement and velocity, then one component should be the derivative of the other.

$\displaystyle \sigma=s+\mathord{\mbox{\boldmath$i$}}s'$ (3.26)

This method for producing an analytic signal yields elliptical helixes. If $ s=\sin(2\pi Ft)$, then the corresponding $ \sigma$ is an elliptical helix, whose aspect ratio depends on the frequency $ F$. If $ s$ is an amplitude-modulated sine, then $ \sigma$ is a tilted elliptical helix, where the tilt depends on the modulation.
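A Python check of the elliptical shape (constants mine): with $ \sigma=s+\mathord{\mbox{\boldmath$i$}}s'$ and $ s=\sin(2\pi Ft)$, every point of $ \sigma$ lies on an ellipse whose vertical semi-axis $ 2\pi F$ grows with the frequency:

```python
import math

F = 2.0
s = lambda t: math.sin(2 * math.pi * F * t)
ds = lambda t: 2 * math.pi * F * math.cos(2 * math.pi * F * t)  # s'

# sigma = s + i*s' traces an ellipse: real amplitude 1, imaginary
# amplitude 2*pi*F, so the aspect ratio depends on F.
for j in range(100):
    z = complex(s(j / 100.0), ds(j / 100.0))
    # the point lies on the ellipse x^2 + (y/(2*pi*F))^2 = 1
    assert abs(z.real ** 2 + (z.imag / (2 * math.pi * F)) ** 2 - 1.0) < 1e-9
```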

The method of adding in the derivative of a signal as the imaginary part is appealing because the imaginary part depends only on the local character of the real part. But, if we do this operation on a sampled signal, the usual approximations of the derivative by finite differences are a bit suspect. Notice that

$\displaystyle \mathord\mathcal{F}(\sigma')(f)$ $\displaystyle =$ $\displaystyle \mathord{\mbox{\boldmath$i$}}f\mathord\mathcal{F}(\sigma)(f)$ (3.27)
$\displaystyle \mathord\mathcal{F}([t]\sigma(t+1)-\sigma(t-1))(f)$ $\displaystyle =$ $\displaystyle 2\mathord{\mbox{\boldmath$i$}}\sin(f)\mathord\mathcal{F}(\sigma)(f)$ (3.28)

So the sonic properties of a finite difference can be quite different from the sonic properties of a derivative.

Another natural constraint on the analytic signal is that it contains no negative frequency content. Using the Fourier transform, we can easily define such a signal:

$\displaystyle \sigma$ $\displaystyle =$ $\displaystyle \mathord\mathcal{F}^{-1}(2H\mathord\mathcal{F}(s))$ (3.29)

The doubled Heaviside function $ 2H$ cancels all of the negative portion of the spectrum $ \mathord\mathcal{F}(s)$, and doubles the size of the positive portion to keep the total integral the same. Using Equation 4.90 and the fact that $ \mathord\mathcal{F}^{-1}(H)$ is a hyperbola in the imaginary direction plus a Dirac function in the real direction, this is the same thing as
$\displaystyle \sigma$ $\displaystyle =$ $\displaystyle s + \mathord{\mbox{\boldmath$i$}}([t]1/(\mathord{\mbox{\boldmath$\pi$}}t))\ast s$ (3.30)

The convolution with a hyperbola, used to define the imaginary part of this analytic signal, is an important operation for other purposes, and it is called the Hilbert transform.
$\displaystyle \mathord\mathcal{H}(\sigma)$ $\displaystyle =$ $\displaystyle ([t]1/(\mathord{\mbox{\boldmath$\pi$}}t))\ast\sigma$ (3.31)

Using the Hilbert transform to generate the imaginary part, we get an analytic signal composed of circular helixes. But $ \mathord\mathcal{H}(\sigma)(T)$ depends on values of $ \sigma$ at times rather distant from $ T$, since the hyperbola decays rather slowly. So time structure in a signal is somewhat blurred by the Hilbert transform.
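A discrete sketch of the construction in Equation 3.29, using a naive DFT in pure Python (the helper names are mine; keeping the DC and Nyquist bins undoubled is the usual discrete convention, which the text does not specify): doubling the positive-frequency bins and zeroing the negative ones turns a sampled cosine into a sampled circular helix.

```python
import cmath
import math

def dft(x):
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / N) for k in range(N))
            for j in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[j] * cmath.exp(2j * math.pi * j * k / N) for j in range(N)) / N
            for k in range(N)]

def analytic(x):
    """Discrete analogue of Equation 3.29: apply the 2H multiplier to the
    spectrum -- double positive frequencies, zero negative frequencies."""
    N = len(x)
    X = dft(x)
    H = ([X[0]] + [2 * X[j] for j in range(1, N // 2)]
         + [X[N // 2]] + [0.0] * (N // 2 - 1))
    return idft(H)

N = 64
x = [math.cos(2 * math.pi * 3 * k / N) for k in range(N)]
z = analytic(x)
# Real part reproduces the cosine; imaginary part is its quadrature sine,
# so z samples the circular helix e^(i*2*pi*3*k/N).
for k in range(N):
    assert abs(z[k].real - math.cos(2 * math.pi * 3 * k / N)) < 1e-6
    assert abs(z[k].imag - math.sin(2 * math.pi * 3 * k / N)) < 1e-6
```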

When a signal is most naturally understood as a sum of modulated helixes, the Hilbert transform is not quite the right thing. For example, consider a helix modulated by a Gaussian bell curve: $ \sigma(t)=Z(t)e^{\mathord{\mbox{\boldmath\scriptsize $i$}}2\mathord{\mbox{\boldmath $\pi$}}Ft}$. The Fourier transform gives a shifted bell curve: $ \mathord\mathcal{F}(\sigma)=[f]Z(f-F)$. Even though the frequency $ F$ is positive, $ \mathord\mathcal{F}(\sigma)$ has negative-frequency content. The analytic-signal construction based on the Hilbert transform eliminates all negative-frequency content, so it will produce slightly peculiar approximations to modulated helixes.

I'm pretty sure that nobody has devised a perfect reconstruction of complex valued modulated helixes from real valued modulated sine waves. I even doubt that there is a clear definition of what this means, since a given signal can be represented in more than one way as a sum of modulated helixes.

3.3 Other Ways to Digitize a Sound Signal

3.4 Direct Manipulation of Digital Sampled Sound


Mike O'Donnell 2004-05-13