In my current project, I need to match two audio streams that run on different clocks. Both claim a 48 kHz sample rate, but one has a small unknown error (for example, it might actually produce 47959.4 samples per second). Simply dropping or duplicating samples to match the rates would obviously be audible, and interpolation is difficult because the error is unknown.
In real time, I have an interrupt occurring every 1 millisecond. The good stream always gives me 48 samples; the stream with the error gives me 48, or occasionally 47 or 49, samples at unpredictable times. Without buffering too much, what would be a good way to solve this?
Does anyone have any insight? Thank you very much.
This is a common problem when you present and accept data streams in different clock domains; it is called clock domain alignment. The way to do it is to have the input stream enter a buffer at its nominal rate and have an output stream pull samples from the buffer at the same nominal rate to feed an arbitrary interpolator. Suppose the input rate is actually faster than its nominal rate: then the pulls at the nominal output rate are not fast enough to keep the buffer from overflowing.
We define a control error as the distance (in address space) between the data-filled boundary and the center of the buffer. If the input and output rates are the same, the data-filled boundary can be placed at the buffer center. If the input rate exceeds the output rate, the boundary shifts to the right of the midpoint (towards overflow). That distance is the control variable for the arbitrary interpolator, which adjusts the pull rate to match the input rate while still delivering the interpolated output rate.
This process was used and described in the attached paper, converting a nominal 48 kHz rate to a nominal 293 kHz. The interpolator ratio was initialized to 293/48, expecting the output rate from the buffer to be 48 kHz. The buffer boundary shifted to the right when the input clock was higher than 48 kHz, so the pull rate from the buffer had to be increased to match the input rate, and the interpolation ratio of the interpolator had to be reduced to permit the higher pull rate from the buffer. It worked like a charm. I have done this for many systems with a wireless link between platforms; one had a 44.1 kHz input clock and a 48.0 kHz output clock (CD input, MP3 output). Any ratio can be supported by this technique.
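To make the feedback loop concrete, here is a minimal Python sketch of the buffer-fill control idea, applied to the 1 ms / 48 kHz numbers from the question. It is not the implementation from the paper: it models the buffer as a fractional fill level and only tracks the ratio that a resampler would be driven with, without doing any interpolation. The function name `simulate`, the proportional-plus-integral form of the loop, the gains `kp` and `ki`, and the buffer center of 96 samples are all assumptions chosen for illustration.

```python
def simulate(true_in_rate=47959.4, out_rate=48000.0, ticks=20000,
             center=96.0, kp=2e-4, ki=2e-7):
    """Idealized buffer-fill control loop (hypothetical gains and sizes).

    ratio = input samples consumed per output sample produced.
    Returns the final ratio and the final buffer fill.
    """
    in_per_tick = true_in_rate / 1000.0    # samples arriving per 1 ms tick
    out_per_tick = out_rate / 1000.0       # samples we must emit per tick
    fill = center                          # start with the buffer half full
    integ = 0.0                            # integral term of the controller
    ratio = 1.0                            # start at the nominal ratio
    for _ in range(ticks):
        fill += in_per_tick                # input side writes into the buffer
        fill -= out_per_tick * ratio       # resampler pulls from the buffer
        err = fill - center                # control error: distance from center
        integ += ki * err
        ratio = 1.0 + kp * err + integ     # pull faster when the buffer fills up
    return ratio, fill


ratio, fill = simulate()
```

With these assumed gains the ratio settles near 47959.4 / 48000 and the fill returns to the buffer center; in a real system the ratio would drive the interpolator and the gains would be tuned against jitter and latency requirements.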
Fred's response is spot-on; I'll throw in some additional things I've seen that might be a problem in some cases. If one (or more) of the sources is coming over a network, there will probably be much higher jitter than when you have two pieces of equipment next to each other (what to do about dropouts in that scenario is a separate topic).
In general, the higher the jitter, the longer the buffer you want so that you don't starve or overflow it. This works against any latency requirement and, depending on what you need to do, you might end up with different buffer sizes on each stream.
If one or more of the streams isn't well defined (i.e. might come in at a number of different sample rates) the ASRC (Asynchronous Sample Rate Converter) software can be improved by having two modes, coarse lock and fine lock. The coarse lock makes a rough estimate of sample rate and starts the interpolator at that value, sets the buffer to half full, unmutes the output (gracefully! zero crossings are your friend), and then goes to its normal operating mode.
(google Audio ASRC to find more info about HW and SW out there)
Does your system need to operate if one or more input streams go away? You have to pick one as the master to compare against to get the long term ratios and if that input stops what will your system do?
You want to manage the changes to the ASRC's interpolation ratio carefully: if you change it too quickly or too often you will get distortion. Some sort of PID controller is not unreasonable, again with a "loss of lock" check to deal with incoming stream changes. Ideally you get the ratio exactly right and the source doesn't drift much, i.e. you could go seconds or longer without needing to adjust.
The fine lock state might be averaging across 10 seconds or more to get accurate average rates as there's always some sampling uncertainty.
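A small Python sketch of why the averaging window matters, assuming the 1 ms interrupt and the 47959.4 Hz example rate from the question. The helper names `counts_per_tick` and `estimate_rate` are made up for this illustration; the first just fakes whole-sample deliveries per interrupt, the second is the plain "count samples, divide by elapsed time" estimate.

```python
def counts_per_tick(rate_hz, ticks, tick_s=0.001):
    """Whole-sample deliveries per interrupt for a source running at rate_hz."""
    counts, acc = [], 0.0
    for _ in range(ticks):
        acc += rate_hz * tick_s    # fractional samples accumulated so far
        n = int(acc)               # only whole samples can be delivered
        acc -= n
        counts.append(n)
    return counts


def estimate_rate(counts, tick_s=0.001):
    """Average sample rate over the window covered by counts."""
    return sum(counts) / (len(counts) * tick_s)


short_counts = counts_per_tick(47959.4, 10)       # 10 ms window
long_counts = counts_per_tick(47959.4, 10000)     # 10 s window
short_est = estimate_rate(short_counts)
long_est = estimate_rate(long_counts)
```

The count is only known to within about one sample, so the 10 ms estimate can be off by up to roughly 100 Hz, while the 10 s estimate is good to roughly 0.1 Hz. That is the sampling uncertainty the long average is there to beat down.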
Going the other way, you want to make sure your controller doesn't hunt up and down quickly around a value. Again this comes back to the coarse/fine lock concept: in fine lock you allow more buffer fullness variation, on the assumption that it averages out and that this is better than continually futzing with the interpolator and risking distortion (we were shooting for < -120 dB THD+N; if this is some non-quality audio case then you can get away with a lot of cheats).
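One way to picture "allow more fullness variation and don't keep touching the ratio" is a deadband plus a limit on how far the ratio may move per update. The Python sketch below is only an illustration of that idea, not a description of any particular ASRC; the function name `update_ratio` and the values for `deadband`, `gain` and `max_step` are hypothetical.

```python
def update_ratio(ratio, fill_err, deadband=8.0, gain=1e-6, max_step=2e-6):
    """Return the new interpolation ratio for a given buffer fill error.

    fill_err is (current fill - target fill) in samples. All defaults are
    assumed values for illustration, not tuned numbers.
    """
    if abs(fill_err) <= deadband:
        return ratio                      # inside the deadband: leave it alone
    # Only the part of the error outside the deadband drives the correction.
    excess = fill_err - deadband if fill_err > 0 else fill_err + deadband
    # Limit the step so the ratio never jumps abruptly.
    step = max(-max_step, min(max_step, gain * excess))
    return ratio + step


unchanged = update_ratio(1.0, 3.0)        # small error: no adjustment
nudged = update_ratio(1.0, 9.0)           # just outside the deadband
clamped = update_ratio(1.0, -100.0)       # large error: step is limited
```

A loss-of-lock check would sit on top of this: if the error stays far outside the deadband for too long, drop back to coarse lock rather than slewing the ratio slowly.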
Speaking of cheats, I have observed/measured Windows 7 handling of merging multiple audio streams/their ASRC quality. To say it's awful would be giving them way too much credit. Would be OK for a voice call.
A resampling polyphase filter will do that job, but if the error is unknown how will you know what correction to apply?
If the source with the timing error has an *average* rate which is equal to that of the non-error source, then buffering the error source as well as the non-error source by the same buffer depth is the most reliable (and simplest!) approach to matching the two streams. However, if the error source has an average rate which is *not equal* to that of the non-error source, then the occasional "hiccup" in the audio is inevitable when buffering alone is used.
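To put numbers on the "not equal" case, here is the arithmetic for the example figure in the question (47959.4 Hz against a nominal 48000 Hz), written as a few lines of Python. The 48-sample buffer depth is an assumption for illustration (1 ms of audio).

```python
nominal_rate = 48000.0
actual_rate = 47959.4                     # example figure from the question

slip_per_second = nominal_rate - actual_rate       # about 40.6 samples/s
seconds_per_slip = 1.0 / slip_per_second           # about 24.6 ms per sample
error_ppm = slip_per_second / nominal_rate * 1e6   # about 846 ppm

buffer_depth = 48.0                       # assumed: 1 ms of audio
seconds_to_underrun = buffer_depth / slip_per_second   # about 1.18 s
```

So with plain buffering, that stream falls one sample behind roughly every 25 ms, and a 1 ms buffer would run dry in a little over a second. A deeper buffer only delays the hiccup; it does not remove it.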
fred harris' response is spot on, as stated.
The control signal he mentions can be run into a PLL to smooth the process significantly, and the phase difference derived in the PLL can be used to drive a Farrow interpolator.
Depending on the difference between the rates, a wrap of the phase difference will occur as the sample difference in the Farrow interpolator approaches either +0.5 or -0.5, and an extra sample will be pulled from or dumped into the input buffer that feeds the Farrow interpolator delay line.
This is the technique used for re-timing (timing recovery) in modems, to resample the received signal at the transmit rate.
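A minimal Python sketch of the interpolator side, assuming a cubic Lagrange polynomial in Farrow form (the post does not say which polynomial is used). For simplicity the fractional phase `mu` here runs over [0, 1) and wraps at 1, rather than the centered ±0.5 convention described above; the mechanism is the same. The names `farrow_cubic` and `resample` are made up, and the ratio is held fixed, whereas in the real system it would come from the PLL or control loop.

```python
def farrow_cubic(xm1, x0, x1, x2, mu):
    """Cubic Lagrange interpolation between x0 and x1, Farrow structure.

    xm1, x0, x1, x2 are four consecutive input samples; 0 <= mu < 1.
    """
    c0 = x0
    c1 = -xm1 / 3.0 - x0 / 2.0 + x1 - x2 / 6.0
    c2 = xm1 / 2.0 - x0 + x1 / 2.0
    c3 = -xm1 / 6.0 + x0 / 2.0 - x1 / 2.0 + x2 / 6.0
    return ((c3 * mu + c2) * mu + c1) * mu + c0    # Horner evaluation in mu


def resample(x, ratio):
    """Resample x, advancing `ratio` input samples per output sample."""
    out = []
    i, mu = 1, 0.0                 # integer read index and fractional phase
    while i + 2 < len(x):
        out.append(farrow_cubic(x[i - 1], x[i], x[i + 1], x[i + 2], mu))
        mu += ratio                # advance the phase accumulator
        step = int(mu)             # whole samples to shift in on a wrap
        i += step                  # 0, 1 or 2 new inputs enter the delay line
        mu -= step
    return out


ramp = [float(n) for n in range(200)]
ratio = 47959.4 / 48000.0
y = resample(ramp, ratio)
```

On a ramp input the output is just the read position, `1 + k * ratio`, which makes the behaviour easy to check. With a ratio slightly below 1, most outputs shift in one new input sample and every so often none is shifted in, which is the wrap described above.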