DSPRelated.com
Forums

Audio CODECs

Started by Don Y February 10, 2012
Hi,

I'm looking for some pointers concerning the design of lossless
audio (plus "silence") codecs.

I want to deploy these on either end of a packet switched
network (coder at server, decoder at client).  I.e., they
are intended primarily for communication bandwidth reduction.
Push content into coder, pass over network, extract content
via decoder, *consume* (and discard).  I.e., the system makes
the network look like a long "virtual wire".


The decoder needs to be *fast*.  Ideally, suitable for on-the-fly
operation (i.e., having to "expand" an entire frame "in place"
is less desirable than being able to expand it AS CONSUMED).

[[[Note:  I am targeting general purpose MCU's, not DSP's!]]]

Smaller frame sizes are better than larger ones (they require
fewer resources to hold in the client WHILE CONSUMING).  And,
bigger frames mean bigger packets, which mean supporting fragmentation
and reassembly in the protocol stack, etc. -- or, alternatively,
a data stream more sensitive to dropped fragments.

(Some/much) content can be encoded a priori (e.g., as in
a media server) so the cost of coding or transcoding can
be considerably higher than decoding.  OTOH, it shouldn't
be so much higher that it precludes any "real-time" use.

The type of source material shouldn't have a dramatic effect on
the efficiency or cost of either coder or decoder (speech, music,
etc. -- don't worry about "white/pink/chartreuse/etc noise")


From some observations of existing CODECs (open and proprietary):

All try to encapsulate a variety of different source formats:
bits per sample, samples per second, seek points, tags, etc.

All try to apply different compression strategies which are
then encoded in the data stream.

Most seem to treat the source material as discrete "sessions"
(song 1, song 2, etc.) instead of an endless *stream* of content.

It appears that most compression gains come from exploiting
the reduced bandwidth of the difference channel.  This only
works if you have two (related) source channels -- i.e., the
compression is less remarkable for mono sources.
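As a concrete illustration of where that difference-channel gain comes from, here is a minimal sketch of the integer mid/side transform used (in some form) by several lossless codecs. The function names are mine, not from any particular codec; for well-correlated left/right channels the side signal stays near zero, so its residuals code into far fewer bits.

```python
def ms_encode(left, right):
    """Convert L/R sample lists to mid/side.  Integer-exact: the LSB
    dropped from (l + r) is recoverable from the parity of (l - r)."""
    mid = [(l + r) >> 1 for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Invert the transform losslessly.  (l + r) and (l - r) always
    share parity, so the missing LSB comes from the side channel."""
    left = [m + ((s + (s & 1)) >> 1) for m, s in zip(mid, side)]
    right = [l - s for l, s in zip(left, side)]
    return left, right
```

For a mono source there is no side channel to shrink, which is why the gain is less remarkable there.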

Coder efficiencies (costs) tend to vary greatly.  Often the
added "expense" results in very little additional gain in
compression (which can't be determined a priori).  While this
isn't significant for "batch" applications (encode, then store
for later distribution), it can be a deal breaker for "live"
content.


So...

In this sort of dedicated application, many of the "features"
of these CODECs are superfluous or redundant.  E.g., you can
probably fix the sample rate and compensate for variations
in source materials in the encoder (this makes it simpler for
the decoder to blindly reproduce that content without concern
for the actual sample rate of the original source).  Ditto
for sample sizes.

But, I'm not sure if you can as easily discard the adaptive
coding (decoding) strategies without knowing more about the
actual signal you will be encountering.

Do certain models/predictors/encodings tend to solve most
of the coding problem -- with the others present to cover
special cases?  E.g., without a difference channel, it
doesn't seem that RLE for the residual would be of much
use (?)
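For what it's worth, the one mono case where RLE on the residual clearly pays off is digital silence (or constant DC): the predictor tracks it perfectly, leaving long runs of zero residuals. A hypothetical sketch (names are illustrative, not from any real codec):

```python
def rle_zeros(residuals):
    """Collapse runs of zero residuals into (0, run_length) pairs and
    pass nonzero residuals through unchanged."""
    out, run = [], 0
    for r in residuals:
        if r == 0:
            run += 1
            continue
        if run:                     # flush any pending zero run
            out.append((0, run))
            run = 0
        out.append(r)
    if run:                         # trailing run of zeros
        out.append((0, run))
    return out
```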

Anything else I've missed as shortcuts to reduce the
complexity of the coder/decoder?  Any risks that these
shortcuts might have lurking inside them?  Pathological
cases that could (realistically) be encountered?

Any suggestions as to developing/acquiring a versatile
test suite with which to gauge performance?  Or, just
pick some of the things that are *likely* to pass down
the wire?

I've implemented this with a couple of different "open"
CODECs and am now trying to determine whether any changes
are worth attempting to improve on these criteria.

Thx!
--don

[Apologies if I don't reply quickly.  I'm rearranging machines
here so my news server is a mess -- and likely to get worse
before it "recovers".  I may try reading news via google just
to keep abreast of anything posted, here. ]

Don Y wrote:

> I'm looking for some pointers concerning the design of lossless
> audio (plus "silence") codecs.
The design is trivial: backwards adaptive predictor followed by conventional Huffman coder.
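The structure described here can be sketched in a few lines. This toy version substitutes a fixed first-order predictor for the adaptive one, and Rice codes for the Huffman stage (a pairing real lossless codecs such as Shorten and FLAC actually use); it emits a bitstring rather than packed bytes:

```python
def zigzag(n):
    """Map signed residuals to non-negative ints: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) if n >= 0 else (-n << 1) - 1

def rice_encode(value, k):
    """Rice-code a non-negative int: unary quotient, '0' terminator,
    then k binary remainder bits."""
    out = "1" * (value >> k) + "0"
    if k:
        out += format(value & ((1 << k) - 1), f"0{k}b")
    return out

def encode(samples, k=1):
    """Predict each sample from its predecessor and Rice-code the
    zig-zag-mapped prediction error."""
    bits, prev = [], 0
    for s in samples:
        bits.append(rice_encode(zigzag(s - prev), k))
        prev = s
    return "".join(bits)
```

A decoder inverts this one sample at a time, which suits the expand-as-consumed requirement; but the output length varies with the input, which is exactly the fixed-bandwidth objection raised below.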
> I want to deploy these on either end of a packet switched
> network (coder at server, decoder at client).  I.e., they
> are intended primarily for communication bandwidth reduction.
Lossless audio compressor is hardly useful in this scenario, as it does not guarantee a fixed bandwidth.
> The decoder needs to be *fast*.
Then omit backward adaptation. Transmit forward prediction coefficients over the channel.
> [[[Note: I am targeting general purpose MCU's, not DSP's!]]]
Like what, for example?
> Smaller frame sizes are better than larger ones (requires
> less resources to hold in the client WHILE CONSUMING).  And,
> bigger frames mean bigger packets mean supporting fragmentation
> and reassembly in the protocol stack, etc.  Alternatively,
> makes the data stream more sensitive to dropped fragments.
That's irrelevant. Large buffers will be needed to cope with the variable data rate.
> The type of source material shouldn't have a dramatic effect on
> the efficiency or cost of either coder or decoder (speech, music,
> etc. -- don't worry about "white/pink/chartreuse/etc noise")
The compression ratio is going to be only about 50% or so. Then why bother with compression at all?
> From some observations of existing CODECS (open and proprietary):
>
> All try to encapsulate a variety of different source formats:
> bits per sample, samples per second, seek points, tags, etc.
>
> All try to apply different compression strategies which are
> then encoded in the data stream.
The codecs are LOSSLESS. Once the session is started, you can't drop any data.
> Most seem to treat the source material as discrete "sessions"
> (song 1, song 2, etc.) instead of an endless *stream* of content.
So you will have to parse the uninterruptible stream from the very beginning. If you lose a packet, you are lost.
> So...
[...]

So.

I am tired of your bleat. Stop here and do anything useful other than
spewing the internet with nonsense.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
On Feb 10, 11:34 am, Vladimir Vassilevsky <nos...@nowhere.com> wrote:

> I am tired of your bleat.
Ha ha. To me, it seems that you live for this kind of "bleat".
Vladimir Vassilevsky <nospam@nowhere.com> wrote:

(snip)
> The design is trivial: backwards adaptive predictor followed by
> conventional Huffman coder.
>> I want to deploy these on either end of a packet switched
>> network (coder at server, decoder at client).  I.e., they
>> are intended primarily for communication bandwidth reduction.
> Lossless audio compressor is hardly useful in this scenario, as it does
> not guarantee a fixed bandwidth.
(snip)
>> All try to apply different compression strategies which are
>> then encoded in the data stream.
> The codecs are LOSSLESS. Once the session is started, you
> can't drop any data.
Not very useful in the real world, though.  Even back to the
beginning of CDs, there is error concealment when error correction
fails.

There used to be stories (maybe still are) of testing CD players
with a CD with a black wedge on it.  The wedge blocks the light for
an ever increasing length of time each revolution.  You can then
listen, and see at what point the error correction fails, and how
well the concealment sounds.  Things like that used to be in
reviews for CD players, but maybe not anymore.

With VoIP, you never know what will happen on the net, in terms of
delayed or lost packets.  Some kind of concealment is needed.

-- glen
Hi Vladimir,

On 2/10/2012 9:34 AM, Vladimir Vassilevsky wrote:
> Don Y wrote:
>
>> I'm looking for some pointers concerning the design of lossless
>> audio (plus "silence") codecs.
>
> The design is trivial: backwards adaptive predictor followed by
> conventional Huffman coder.
If it were "trivial", one-size-fits-all, someone would have designed, patented, and commercialized it -- and retired to some sunny beach to drink "tuna" coladas and watch cute young things frolic in the surf. The fact that there are so many different CODECs testifies to the non-triviality of the task.
>> I want to deploy these on either end of a packet switched
>> network (coder at server, decoder at client).  I.e., they
>> are intended primarily for communication bandwidth reduction.
>
> Lossless audio compressor is hardly useful in this scenario, as it does
> not guarantee a fixed bandwidth.
It doesn't *have* to guarantee a fixed bandwidth. It just has to nominally afford some reduction in required bandwidth to offset the implementation cost.
>> The decoder needs to be *fast*.
>
> Then omit backward adaptation. Transmit forward prediction coefficients
> over the channel.
>
>> [[[Note: I am targeting general purpose MCU's, not DSP's!]]]
>
> Like what, for example?
Like whatever the implementor decides is appropriate! If you can't design hardware, you might port it to a PC platform. If you've got a DSP in an existing product, port it there. Etc. Tying an implementation to a particular hardware platform is "premature optimization". Figure out what needs to be done (with your best effort) and use that to determine the minimum requirements for any hosting platform.
>> Smaller frame sizes are better than larger ones (requires
>> less resources to hold in the client WHILE CONSUMING).  And,
>> bigger frames mean bigger packets mean supporting fragmentation
>> and reassembly in the protocol stack, etc.  Alternatively,
>> makes the data stream more sensitive to dropped fragments.
>
> That's irrelevant.
> Large buffers will be needed to cope with the variable data rate.
No.  The model that you choose for the coder (and thus, decoder)
determines the resources that will be needed.

Silly example:  You're going to be passing (pure) sine waves down
the wire.  I can encode that as 4 values: amplitude, frequency,
phase and duration.  The decoder can take those four values and
reconstruct an equivalent sine wave with almost *0* resources at
its disposal.

This is why the knowledge of folks with first-hand experience is
worthwhile.  What models work best in which circumstances, etc.
An encoder for speech is not going to be as effective encoding
"music" (speech has lots of silence).  OTOH, it might work well
encoding a (single) *singer*.
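That toy sine-wave decoder might look like this (parameter names and the sample rate are mine, purely for illustration); note it is a generator, i.e., it expands samples AS CONSUMED with no frame buffer:

```python
import math

def decode_tone(amplitude, freq_hz, phase, duration_s, rate=8000):
    """Regenerate a pure tone from the 4 transmitted values, yielding
    one sample at a time with essentially zero decoder state."""
    for n in range(int(duration_s * rate)):
        yield amplitude * math.sin(2 * math.pi * freq_hz * n / rate + phase)
```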
>> The type of source material shouldn't have a dramatic effect on
>> the efficiency or cost of either coder or decoder (speech, music,
>> etc. -- don't worry about "white/pink/chartreuse/etc noise")
>
> The compression ratio is going to be only about 50% or so.
> Then why bother about compression at all?
Because compressing is cheaper than running another wire or upgrading the communications fabric to handle higher bandwidths. Save a dollar on the processor and spend thousands on more cable?
>> From some observations of existing CODECS (open and proprietary):
>>
>> All try to encapsulate a variety of different source formats:
>> bits per sample, samples per second, seek points, tags, etc.
>>
>> All try to apply different compression strategies which are
>> then encoded in the data stream.
>
> The codecs are LOSSLESS. Once the session is started, you can't drop any
> data.
That doesn't directly depend on the strategy used in doing the compression. Rather, that depends on how the protocol handles errors/dropouts. OTOH, a CODEC that has to handle music *and* speech might choose to use different strategies to represent that data based on its knowledge/examination of the data stream.
>> Most seem to treat the source material as discrete "sessions"
>> (song 1, song 2, etc.) instead of an endless *stream* of content.
>
> So you will have to parse the uninterruptible stream from the very
> beginning. If you lose a packet, you are lost.
No. Only if the coder stores no "absolute state" in the data stream. This doesn't seem to be the case for the codecs that I've examined. It would be a burden to implementors for that reason as well as making the stream "unseekable" (or, only seekable at elevated expense) -- your decoder would have to process all of the "skipped over" content in order to accurately track state.
>> So...
>
> [...]
>
> So.
>
> I am tired of your bleat. Stop here and do anything useful other than
> spewing the internet with nonsense.
Add me to your kill file. Or, discipline yourself not to open any posts with my name on them. I don't try to "elude" filters. I always post from the same IP, use the same news service, etc. I'm sure someone like you should be able to figure out how to rid yourself of these "unpleasant distractions". If not, ask one of the kids in the neighborhood to show you how...
Hi Glen,

On 2/10/2012 12:36 PM, glen herrmannsfeldt wrote:
> Vladimir Vassilevsky <nospam@nowhere.com> wrote:
>
>> The design is trivial: backwards adaptive predictor followed by
>> conventional Huffman coder.
>
>>> I want to deploy these on either end of a packet switched
>>> network (coder at server, decoder at client).  I.e., they
>>> are intended primarily for communication bandwidth reduction.
>
>> Lossless audio compressor is hardly useful in this scenario, as it does
>> not guarantee a fixed bandwidth.
>
>>> All try to apply different compression strategies which are
>>> then encoded in the data stream.
>
>> The codecs are LOSSLESS. Once the session is started, you
>> can't drop any data.
>
> Not very useful in the real world, though.  Even back to the
> beginning of CDs, there is error concealment when error
> correction fails.
I'm not even asking for the decoder to *fix* those errors, dropouts,
etc.  As long as it can INDICATE where errors exist so that I can
take my own remedial action.

But, the point of my statement ("different compression strategies")
was to elicit comments as to why certain strategies/encodings are
preferable to others AND IN WHICH CIRCUMSTANCES.  I.e., why isn't a
single strategy employed?  Or, why only <some_number>?
> There used to be stories (maybe still are) of testing CD players
> with a CD with a black wedge on it.  The wedge blocks the light
> for an ever increasing length of time each revolution.  You can
> then listen, and see at what point the error correction fails,
> and how well the concealment sounds.  Things like that used to
> be in reviews for CD players, but maybe not anymore.
>
> With VoIP, you never know what will happen on the net, in terms
> of delayed or lost packets.  Some kind of concealment is needed.
Yes.  Though you can do things to minimize the "discomfort" of
those errors -- within reason.  E.g., reproducing a signal with
periodic dropouts at a "high" frequency (e.g., so the signal sounds
to be gated off, often) is probably more annoying than dropping the
connection.  Or, muting until signal delivery is stable enough for
"more practical" use.

If the decoder can indicate problems, "something" can be done to
resolve them in a manner that is appropriate to the application.