Logarithmic Number Systems for Audio
Since hearing is approximately logarithmic, it makes sense to represent sound samples in a logarithmic or semi-logarithmic number format. Floating-point numbers in a computer are partially logarithmic (the exponent part), and one can even use an entirely logarithmic fixed-point number system. The $\mu$-law amplitude-encoding format is linear at small amplitudes and becomes logarithmic at large amplitudes. This section discusses these formats.
Floating-Point Numbers
Floating-point numbers consist of an ``exponent,'' ``significand,'' and ``sign bit.'' For a negative number, we may set the sign bit of the floating-point word and negate the number to be encoded, leaving only nonnegative numbers to be considered. Zero is represented by all zeros, so now we need only consider positive numbers.
The basic idea of floating-point encoding of a binary number is to normalize the number by shifting the bits either left or right until the shifted result lies between 1/2 and 1. (A left-shift by one place in a binary word corresponds to multiplying by 2, while a right-shift by one place corresponds to dividing by 2.) The number of bit-positions shifted to normalize the number can be recorded as a signed integer. The negative of this integer (i.e., the shift required to recover the original number) is defined as the exponent of the floating-point encoding. The normalized number between 1/2 and 1 is called the significand, so called because it holds all the ``significant bits'' of the number.
Floating-point notation is exactly analogous to ``scientific notation'' for decimal numbers, e.g., $1.2345\times 10^{-9}$; the number of significant digits, 5 in this example, is determined by counting digits in the ``significand'' $1.2345$, while the ``order of magnitude'' is determined by the power of 10 (-9 in this case). In floating-point numbers, the significand is stored in fractional two's-complement binary format, and the exponent is stored as a binary integer.
Since the significand lies in the interval $[1/2, 1)$, its most significant bit is always a 1, so it is not actually stored in the computer word, giving one more significant bit of precision.
Let's now restate the above a little more precisely. Let $x>0$ denote a number to be encoded in floating-point, and let $\tilde{x} = x\cdot 2^{-E}$ denote the normalized value obtained by shifting $x$ either $E$ bits to the right (if $E>0$), or $|E|$ bits to the left (if $E<0$). Then we have $1/2 \le \tilde{x} < 1$, and $x = \tilde{x}\cdot 2^{E}$. The significand $M$ of the floating-point representation for $x$ is defined as the binary encoding of $\tilde{x}$. It is often the case that $\tilde{x}$ requires more bits than are available for exact encoding. Therefore, the significand is typically rounded (or truncated) to the value closest to $\tilde{x}$. Given $N_M$ bits for the significand, the encoding of $\tilde{x}$ can be computed by multiplying it by $2^{N_M}$ (left-shifting it $N_M$ bits), rounding to the nearest integer (or truncating toward minus infinity, as implemented by the floor() function), and encoding the $N_M$-bit result as a binary (signed) integer.
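The normalization and rounding steps just described can be sketched in Python. The standard-library function math.frexp happens to use the same convention as this section, returning a significand in $[1/2, 1)$ together with an integer exponent. The helper names encode_significand and decode_significand and the parameter n_bits are illustrative assumptions, not part of any standard, and the sketch handles positive inputs only.

```python
import math

def encode_significand(x, n_bits):
    """Split positive x into (integer significand code, exponent).

    x is normalized to x_tilde in [1/2, 1) with x = x_tilde * 2**exponent,
    then x_tilde is scaled by 2**n_bits and rounded to an integer.
    """
    if x <= 0:
        raise ValueError("only positive numbers are handled in this sketch")
    x_tilde, exponent = math.frexp(x)    # 0.5 <= x_tilde < 1
    code = round(x_tilde * 2**n_bits)    # left-shift n_bits, then round
    if code == 2**n_bits:                # rounding carried up to 1.0
        code //= 2                       # renormalize the significand to 1/2
        exponent += 1
    return code, exponent

def decode_significand(code, exponent, n_bits):
    """Recover the quantized value from the integer code and exponent."""
    return code / 2**n_bits * 2.0**exponent
```

For example, encode_significand(10.0, 8) normalizes 10 to 0.625 with exponent 4 and stores the integer 160. Because the code always lies in the range from 2**(n_bits-1) up to 2**n_bits - 1, its top bit is always 1, which is why that bit can be left out of the stored word.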
As a final practical note, exponents in floating-point formats may have a bias. That is, instead of storing the exponent $E$ as a binary integer, you may find a binary encoding of $E+B$, where $B$ is the bias.
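These days, floating-point formats generally follow the IEEE standards set out for them. A single-precision floating-point word is 32 bits (four bytes) long, consisting of 1 sign bit, 8 exponent bits, and 23 significand bits, normally laid out as
S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
where S denotes the sign bit, E an exponent bit, and M a significand bit.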
A double-precision floating-point word is 64 bits (eight bytes) long, consisting of 1 sign bit, 11 exponent bits, and 52 significand bits. In the Intel Pentium processor, there is also an extended precision format, used for intermediate results, which is 80 bits (ten bytes) containing 1 sign bit, 15 exponent bits, and 64 significand bits. In Intel processors, the exponent bias is 127 for single-precision floating-point, 1023 for double-precision, and 16383 for extended-precision. The single and double precision formats have a ``hidden'' significand bit, while the extended precision format does not. Thus, the most significant significand bit is always set in extended precision.
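As a rough illustration of the single-precision layout, the following Python sketch pulls the three fields out of a 32-bit word using the standard struct module. The function names are hypothetical. Note that the IEEE convention normalizes the significand to lie between 1 and 2 rather than between 1/2 and 1, so the exponent it stores is one less than the exponent in the description earlier in this section. Only normalized numbers are handled; zero, denormals, infinities, and NaNs are ignored here.

```python
import struct

def float32_fields(value):
    """Return (sign, biased_exponent, fraction) of value as IEEE single precision."""
    (word,) = struct.unpack(">I", struct.pack(">f", value))  # raw 32-bit pattern
    sign = word >> 31                       # 1 sign bit
    biased_exponent = (word >> 23) & 0xFF   # 8 exponent bits, stored with bias 127
    fraction = word & 0x7FFFFF              # 23 stored significand bits
    return sign, biased_exponent, fraction

def float32_from_fields(sign, biased_exponent, fraction):
    """Rebuild a normalized value, restoring the hidden leading 1."""
    significand = 1 + fraction / 2**23      # IEEE convention: 1 <= significand < 2
    return (-1)**sign * significand * 2.0**(biased_exponent - 127)
```

For instance, float32_fields(1.0) returns (0, 127, 0): the true exponent 0 is stored as 127, and the fraction bits are all zero because the leading 1 is hidden.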
The MPEG-4 audio compression standard (which supports compression using music synthesis algorithms) specifies that the numerical calculations in any MPEG-4 audio decoder should be at least as accurate as 32-bit single-precision floating point.
Logarithmic Fixed-Point Numbers
In some situations it makes sense to use logarithmic fixed-point. This number format can be regarded as a floating-point format consisting of an exponent and no explicit significand. However, the exponent is not interpreted as an integer as it is in floating point. Instead, it has a fractional part which is a true mantissa. (The integer part is then the ``characteristic'' of the logarithm.) In other words, a logarithmic fixed-point number is a binary encoding of the log-base-2 of the signal-sample magnitude. The sign bit is of course separate.
An example 16-bit logarithmic fixed-point number format suitable for digital audio consists of one sign bit, a 5-bit characteristic, and a 10-bit mantissa:
S CCCCC MMMMMMMMMM
where S denotes the sign bit, C a characteristic bit, and M a mantissa bit.
A nice property of logarithmic fixed-point numbers is that multiplies simply become additions and divisions become subtractions. The hard elementary operations are now addition and subtraction, and these are normally done using table lookups to keep them simple.
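The arithmetic can be sketched in Python as follows. This is a minimal illustration under stated assumptions: magnitudes are stored as the base-2 logarithm scaled by $2^{10}$ (a 10-bit mantissa), the code is an unbounded signed Python integer rather than a packed 5-bit characteristic, and the sign bit is left out (for a product, the signs would simply be XORed). All function names are hypothetical. The addition uses the identity $\log_2(a+b) = \log_2 a + \log_2(1 + 2^{-d})$ with $d = \log_2 a - \log_2 b \ge 0$; a table indexed by $d$ could hold the correction term, which is computed directly here.

```python
import math

FRAC_BITS = 10  # mantissa (fractional) bits of the stored base-2 logarithm

def log_encode(x):
    """Encode magnitude x > 0 as round(log2(x) * 2**FRAC_BITS)."""
    return round(math.log2(x) * 2**FRAC_BITS)

def log_decode(code):
    """Recover the magnitude 2**(code / 2**FRAC_BITS)."""
    return 2.0 ** (code / 2**FRAC_BITS)

def log_multiply(code_a, code_b):
    """Multiplying magnitudes adds their logarithms."""
    return code_a + code_b

def log_divide(code_a, code_b):
    """Dividing magnitudes subtracts their logarithms."""
    return code_a - code_b

def log_add(code_a, code_b):
    """Add two magnitudes using the correction term log2(1 + 2**-d)."""
    hi, lo = max(code_a, code_b), min(code_a, code_b)
    d = (hi - lo) / 2**FRAC_BITS                    # log2 ratio, d >= 0
    correction = round(math.log2(1 + 2.0**-d) * 2**FRAC_BITS)  # table candidate
    return hi + correction
```

With these definitions, log_decode(log_multiply(log_encode(0.5), log_encode(0.25))) gives 0.125 using one integer addition, while log_add needs the extra correction term.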
One ``catch'' when working with logarithmic fixed-point numbers is that you can't let ``dc'' build up. A wandering dc component will cause the quantization to be coarse even for low-level ``ac'' signals. It's a good idea to make sure dc is always filtered out in logarithmic fixed-point.
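The effect can be illustrated numerically. In a logarithmic format, adjacent quantization levels differ by a fixed ratio, so the absolute step size grows in proportion to the sample magnitude. The sketch below assumes a 10-bit mantissa as in the example format above; the function names and the signal levels are made up for illustration.

```python
import math

FRAC_BITS = 10  # fractional bits of the stored base-2 logarithm

def log_quantize(x):
    """Quantize a nonzero sample to the nearest logarithmic fixed-point level."""
    code = round(math.log2(abs(x)) * 2**FRAC_BITS)
    return math.copysign(2.0 ** (code / 2**FRAC_BITS), x)

def step_size_near(amplitude):
    """Approximate spacing between adjacent quantization levels at amplitude."""
    return amplitude * (2.0 ** (1 / 2**FRAC_BITS) - 1)

ac_level = 1e-4   # a low-level "ac" signal amplitude (illustrative)
dc_offset = 0.5   # a large, unwanted "dc" component (illustrative)

step_without_dc = step_size_near(ac_level)   # roughly 6.8e-8
step_with_dc = step_size_near(dc_offset)     # roughly 3.4e-4

# With dc present, the ac excursion is smaller than one step and is lost.
lost = log_quantize(dc_offset + ac_level) == log_quantize(dc_offset)
```

In this example the step size near the dc level exceeds the ac amplitude itself, so the quantized output does not change at all when the small signal is added.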
Mu-Law Coding
Digital telephone CODECs have historically used (for land-line switching networks) a simple 8-bit format called $\mu$-law (or simply ``mu-law'') that compresses large amplitudes in a manner loosely corresponding to human loudness perception.
Given an input sample represented in some internal format, such as a short, it is converted to 8-bit mu-law format by the formula [58]
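For orientation, the continuous mu-law compression curve commonly given in textbooks is $F(x) = \mathrm{sgn}(x)\,\ln(1+\mu|x|)/\ln(1+\mu)$ for $|x|\le 1$, with $\mu = 255$ in the 8-bit telephone format. The Python sketch below applies that curve followed by a plain uniform quantizer. It is a simplified illustration under those assumptions, not the bit-exact segment-and-mantissa encoding used by telephone CODECs, and it takes samples already scaled to $[-1, 1]$ rather than integer shorts. The function names are hypothetical.

```python
import math

MU = 255.0  # value commonly used for 8-bit mu-law (assumed here)

def mu_compress(x):
    """Map x in [-1, 1] to [-1, 1] with the continuous mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def mu_encode_8bit(x):
    """Compress, then quantize uniformly to an integer in [-127, 127]."""
    return round(mu_compress(x) * 127)

def mu_decode_8bit(code):
    """Undo the uniform quantization, then expand."""
    return mu_expand(code / 127)
```

A small sample such as 0.001 would round to zero under linear 8-bit quantization, but it maps to a nonzero code after compression. The price is a coarser absolute step at large amplitudes, where the relative error stays at a few percent.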
As we all know from talking on the telephone, mu-law sounds really quite good for voice, at least as far as intelligibility is concerned. However, because the telephone bandwidth is only around 3 kHz (nominally 200-3200 Hz), there is very little ``bass'' and no ``highs'' in the spectrum above 4 kHz. This works out fine for intelligibility of voice because the first three formants (envelope peaks) in typical speech spectra occur in this range, and also because the differences in spectral shape (particularly at high frequencies) between consonants such as ``sss'', ``shshsh'', ``fff'', ``ththth'', etc., are sufficiently preserved in this range. As a result of the narrow bandwidth provided for speech, it is sampled at only 8 kHz in standard CODEC chips.
For ``wideband audio'', we like to see sampling rates at least as high as 44.1 kHz, and the latest systems are moving to 96 kHz (mainly because oversampling simplifies signal processing requirements in various areas, not because we can actually hear anything above 20 kHz). In addition, we like the low end to extend at least down to 20 Hz or so. (The lowest note on a normally tuned bass guitar is E1 = 41.2 Hz. The lowest note on a grand piano is A0 = 27.5 Hz.)