
Beginner's question on undersampling and aliasing

Started by prad June 6, 2007
Eric Jacobsen wrote:
> On Thu, 07 Jun 2007 17:39:05 -0500, "prad"
> <pradeep.fernando@gmail.com> wrote:
(snip)
>> Actually the data samples are produced by a code. But the code can
>> exactly produce the ith data sample that I need out of the entire 10^13
>> data points. So storage is not an issue. My limitation is that the C code
>> for the FFT takes the data input as an array of samples. And the largest
>> array size is 2^32. Since the data is not periodic, I need equidistant
>> samples picked out from the entire 10^13 data points.
> Bleah. Sounds like you want every Nth sample. Not good.
Maybe not so bad. If you really want a large FFT it is possible to make that up out of smaller FFTs. That is the whole idea behind the FFT algorithm. To do that, though, the data comes in an unusual order. If you can produce data in that order it should be easy to do very large FFTs. That is especially true if you only want the low frequency end of the FFT. (snip)
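To make that concrete, here is a minimal sketch (names and data layout are assumptions, nothing here is from the thread) of the recombination step for a transform of length N = N1*N2: the input is treated as N2 interleaved subsequences x[n1*N2 + n2], each run through an ordinary length-N1 FFT (FFTW would do), and then only the K lowest bins of the full-length transform are assembled from those sub-results.

#include <math.h>
#include <complex.h>

/* Y[n2][r] is assumed to hold bin r of the length-N1 FFT of subsequence
 * x[n1*N2 + n2].  Writes bins 0..K-1 of the full length-N transform into
 * X[]; assumes K <= N1. */
void combine_low_bins(double complex *const *Y, long long N1, long long N2,
                      long long K, double complex *X)
{
    long long N = N1 * N2;
    for (long long k = 0; k < K; k++) {
        double complex acc = 0.0;
        for (long long n2 = 0; n2 < N2; n2++) {
            /* twiddle factor exp(-j*2*pi*k*n2/N) */
            double complex tw = cexp(-I * 2.0 * M_PI * (double)(k * n2) / (double)N);
            acc += tw * Y[n2][k % N1];   /* k % N1 == k here, since K <= N1 */
        }
        X[k] = acc;
    }
}

Only the sub-FFT bins 0..K-1 are ever touched, which is what makes the low-frequency-only case cheap.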
> Might you have access to any hardware accelerators? Like a board
> with an FPGA on it connected to the computer? At a 400MHz clock
> rate, which is practical these days, you could run the entire 10^13
> set through a hardware filter in an FPGA in 69 hours (about three
> days) assuming one clock per sample. Keeping the buffers fed and
> getting the data out in a timely fashion might be a challenge, but
> this sort of thing might be worth looking into.
How did you get 69 hours? 1e13/4e8 is 25000 seconds, for about 6.9 hours. It would seem easy to make a low pass filter in a reasonable sized FPGA that could run at that rate.

-- glen
On 7 Jun, 20:57, "prad" <pradeep.ferna...@gmail.com> wrote:
> > On 7 Jun, 02:59, "prad" <pradeep.ferna...@gmail.com> wrote:
> >> Hi all,
> >>
> >> I am a newbie to DSP. I have a huge number of data samples that I want
> >> to perform FFT on.
> >
> > Why do you want to compute the DFT? What do you hope
> > to achieve?
> >
> >> But due to the enormous amount of the data samples, I'd
> >> like to use only 1M samples and perform FFT. I know that this is severe
> >> undersampling and that it does not satisfy the Nyquist rate.
> >
> > Then you are in deep trouble. If the data are aliased, you
> > face a question of the form
> >
> > "Given x+y = 1, find x and y."
> >
> > There exists one true answer, but there is no way you can
> > find it, let alone tell it from all the other combinations
> > of x and y which satisfy x+y=1.
> >
> >> But all I
> >> want is to prevent aliasing so that I can get correct information about
> >> the low frequency content in the data. I cannot use an analog anti-alias
> >> filter as the data is inherently digitized.
> >
> > What do you mean? If you sample an analog process, then you
> > have no choice but to use an analog anti-alias filter.
> > If you have data from some discrete-time process, aliasing
> > is not an issue.
> >
> >> Is there any time-domain
> >> method to help with my problem?
> >
> > As far as filtering the data is concerned, only the analog
> > anti-alias filter. Provided, of course, you are actually
> > sampling an analog process.
> >
> >> I read about a half-band digital filter
> >> that eliminates the frequencies higher than half the sampling frequency.
> >> Can this filter help my problem as I read a lot of articles that the high
> >> frequencies have to be eliminated before digitization?
> >
> > Doesn't apply here. Analog anti-alias filtering is the way
> > to go.
> >
> >> If so, can someone
> >> point me to some link that discusses the design of such a filter?
> >
> > I am sure someone can. However, it won't help you.
> > Did I mention that analog anti-alias filters are useful?
> >
> > Rune
>
> Dear all,
> Thanks for all your replies but I am still a bit confused. Let me try
> rephrasing my question so that everyone gets a better picture of what I am
> trying to do. I have a large number (order of 10^13) of discretized data
> samples.
OK, 10^13 samples, each of (at least) two bytes; we are talking about some 20 TeraBytes of data here. Just the magnitude of the data shuffling means that you are into specialists' work. Even the most trivial task becomes huge; just forget everything you thought you knew. Find a copy of Knuth's "The Art of Computer Programming", the volume on sorting and searching huge amounts of data.
> I have to obtain the frequency spectrum of this data so that I
> can predict minima in the discretized data.
You haven't said how you measured your data, but forget about this. One 1M DFT won't help; you still represent only one ten-millionth of the data samples. The slightest incoherency in the process which generated the data will throw your estimate off.
> I am using the FFTW C library
> to obtain the frequency spectrum. But the maximum size of my data array is
> limited to 2^32 in C.
Yep. If you use a 32-bit operating system.
> So I am picking a smaller subset of 1M samples from
> the original set of 10^13 data points.
>
> Jerry is right about the fact that what I am looking for is subsampling
> or decimation without running into aliasing problems. But due to the large
> amount of data, I cannot perform decimation on the entire data set. Also,
> I'd like to point out that my data is inherently discretized and so using
> an analog anti-aliasing filter to remove all the high frequency components
> that I am not interested in is NOT an option for me.
If the data are already aliased (no anti-aliasing filter was used while sampling the data) then you are the proud owner of several TeraBytes worth of rubbish. Using an anti-alias filter when sampling is such a basic piece of knowledge that not doing so ought to be liable for criminal prosecution.
> So some scheme that
> resembles what John mentioned is of great interest to me. But I am not
> sure if that will work. An ideal scheme for me would be like the following:
>
> 1. Collect N=1M equidistant samples out of the n=10^13 data points.
> 2. Filter out the high frequency components present in the samples
>    using some kind of LPF, i.e.
>    For each sample "s_i" in the 1M samples
>       i. Collect some constant number (say 10) of samples before and
>          after the sample "s_i"
>      ii. Perform a Digital Low Pass Filtering operation on these
>          21 samples.
>     iii. Pick one of these 21 samples (say the median of the
>          21 samples) to replace "s_i"
>
> Will the above procedure eliminate or at least drastically reduce
> aliasing (assuming of course that the original 10^13 data points were
> sampled without aliasing)? Can I somehow tell which is the lowest
> frequency that was significantly filtered out (so that I know which of the
> lower frequencies contain aliases)?
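For what it's worth, a literal reading of the scheme above might look like the sketch below. The names are placeholders (sample_at() stands in for the on-demand data generator), and a plain 21-point average is used as the "digital LPF" of step 2.ii rather than the median. As a later reply points out, a filter this short cannot band-limit the data anywhere near enough for a roughly 10^7:1 decimation, so this only shows the mechanics, not a cure for aliasing.

extern double sample_at(long long i);   /* assumed on-demand sample generator */

#define HALF_WIN 10                     /* 10 samples either side of s_i */

/* One possible "digital LPF" for step 2.ii: the mean of the 21 samples
 * centred on index centre. */
static double local_lowpass(long long centre)
{
    double acc = 0.0;
    for (int j = -HALF_WIN; j <= HALF_WIN; j++)
        acc += sample_at(centre + j);
    return acc / (2 * HALF_WIN + 1);
}

/* Steps 1 and 2: fill out[0..n_out-1] with locally filtered values taken at
 * equidistant positions spread over n_total points (stride ~ 10^7 here). */
void pick_filtered(double *out, long long n_out, long long n_total)
{
    long long stride = n_total / n_out;
    for (long long m = 0; m < n_out; m++)
        out[m] = local_lowpass(m * stride + HALF_WIN);  /* offset keeps indices >= 0 */
}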
Decimation is a standard process in DSP. Not all implementations rely on using the FFT. Try another version of decimation and be prepared to spend a lot of time working out the trivial logistics of shuffling data around. Rune

glen herrmannsfeldt wrote:
> prad wrote:
>
>> Are you implying that I first pick 1M samples, then LPF the samples and
>> use them? Will that work? Or are you saying that I have to LPF the entire
>> digitized data and then do decimation? The second method is not feasible
>> for me as the data set is too large and will require an enormous amount of
>> time. Please advise.
>
> The exact answer depends much on the ratio of sampling rate to the
> frequency you are interested in. Most likely it is possible to come
> up with a relatively fast filter to get what you want.
>
> Consider the moving average filter. You will find that DSP people
> don't like it very much because its properties are not very good,
> but it is easy to describe and fast to implement.
>
> You might, for example, do a 1000 point moving average where the
> first output value is the mean of the first 1000 input samples,
> the second is the mean of 2 through 1001, etc.
What good is a 1000 point moving average filter going to be? If he is downsampling from 10^13 samples to 10^6 samples to prevent aliasing, he's going to need a low pass filter that has millions and millions of coefficients.

Don't know what the actual problem is, but let's suppose the 10^13 data points are a recording of all the pop songs that have ever been played on the radio. After filtering and decimation the OP would have about one sample per song. If you are going to use a moving average filter, then each output sample would be something like the average of all the samples of one song (which in most songs would probably be close to zero).

It's hard to imagine why, if the data contains only important info at such a low frequency, it is being sampled at such a high sample rate in the first place.

-jim
> This can be done
> very fast with a circular buffer of length 1000 to update the
> sum appropriately. It might be that a moving average filter
> followed by a decimation by 10 (output only every 10th point)
> would be a good initial filter, and reduce your problem by
> a factor of ten. Then another filter could finish off the job.
>
> I have not done the calculations to show that the above is
> actually a good way to proceed, but only offer it to show
> that there are fast ways to filter large amounts of data,
> and that in some cases they might be useful. Also, that
> cascaded filters might be faster than doing the whole
> thing in one step.
>
> -- glen
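A minimal sketch of the running-sum version of that idea, with the length (1000) and the first-stage decimation factor (10) taken from glen's description and next_sample() standing in for whatever produces the data:

#define MA_LEN 1000     /* moving-average length */
#define DECIM  10       /* first-stage decimation factor */

extern double next_sample(void);        /* assumed streaming data source */

/* Produce n_out first-stage outputs: each is the mean of the last MA_LEN
 * inputs, and only every DECIM-th mean is kept. */
void moving_average_decimate(double *out, long long n_out)
{
    double buf[MA_LEN] = {0};           /* circular buffer of the last MA_LEN inputs */
    double sum = 0.0;
    long long idx = 0, n_in = 0, produced = 0;

    while (produced < n_out) {
        double x = next_sample();
        sum += x - buf[idx];            /* add newest, drop oldest */
        buf[idx] = x;
        idx = (idx + 1) % MA_LEN;
        n_in++;
        if (n_in >= MA_LEN && n_in % DECIM == 0)
            out[produced++] = sum / MA_LEN;
    }
}

As glen notes, a second and sharper filter stage would still be needed on the output of this first stage.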
On Fri, 08 Jun 2007 01:24:59 -0800, glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:

> Eric Jacobsen wrote:
>> On Thu, 07 Jun 2007 17:39:05 -0500, "prad"
>> <pradeep.fernando@gmail.com> wrote:
>
> (snip)
>
>>> Actually the data samples are produced by a code. But the code can
>>> exactly produce the ith data sample that I need out of the entire 10^13
>>> data points. So storage is not an issue. My limitation is that the C code
>>> for the FFT takes the data input as an array of samples. And the largest
>>> array size is 2^32. Since the data is not periodic, I need equidistant
>>> samples picked out from the entire 10^13 data points.
>
>> Bleah. Sounds like you want every Nth sample. Not good.
>
> Maybe not so bad. If you really want a large FFT it is possible
> to make that up out of smaller FFTs. That is the whole idea behind
> the FFT algorithm. To do that, though, the data comes in an unusual
> order. If you can produce data in that order it should be easy to do
> very large FFTs. That is especially true if you only want the low
> frequency end of the FFT.
That's a very interesting thought. Some judicious pruning of the FFT computations to get the fraction of the spectrum of interest.
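A related option, if only a handful of low-frequency bins are actually wanted, is to skip the FFT entirely and evaluate each bin with the Goertzel recurrence in a single streaming pass. A minimal sketch, assuming (as prad described) a callback that can produce the i-th sample on demand; the function and callback names here are invented for illustration:

#include <math.h>
#include <complex.h>

/* Compute bin k of a nominal length-n DFT with the Goertzel recurrence,
 * pulling samples one at a time so the full record is never stored. */
double complex goertzel_bin(double (*sample)(long long i),
                            long long n, long long k)
{
    double w  = 2.0 * M_PI * (double)k / (double)n;
    double c  = 2.0 * cos(w);
    double s1 = 0.0, s2 = 0.0;          /* state: s[i-1], s[i-2] */

    for (long long i = 0; i < n; i++) {
        double s0 = sample(i) + c * s1 - s2;
        s2 = s1;
        s1 = s0;
    }
    /* X[k] = e^{jw}*s[n-1] - s[n-2] */
    return cexp(I * w) * s1 - s2;
}

Each bin costs one pass over the data, so this only pays off when the number of bins of interest is very small.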
>> Might you have access to any hardware accelerators? Like a board
>> with an FPGA on it connected to the computer? At a 400MHz clock
>> rate, which is practical these days, you could run the entire 10^13
>> set through a hardware filter in an FPGA in 69 hours (about three
>> days) assuming one clock per sample. Keeping the buffers fed and
>> getting the data out in a timely fashion might be a challenge, but
>> this sort of thing might be worth looking into.
>
> How did you get 69 hours? 1e13/4e8 is 25000, for about 6.9 hours.
> It would seem easy to make a low pass filter in a reasonable
> sized FPGA that could run at that rate.
>
> -- glen
Yeah, musta mis-typed an exponent or something. You're right, it's an order of magnitude better than I thought it'd be. So it must be a good idea!! ;)

The buffering could be an issue, but even if the filter has to stop and wait occasionally to load/unload the buffers, it'd still be an excellent way to accelerate the computation.

Eric Jacobsen
Minister of Algorithms
Abineau Communications
http://www.ericjacobsen.org
On 7 Jun, 22:47, "prad" <pradeep.ferna...@gmail.com> wrote:

> One of my major problems is that the data sets I have are huge, in fact
> in terms of memory complexity, they are O(n!). 10^13 is one of the smallest
> data sets I have.
What the heck are you doing?? Only ten years ago, the 3D/4C seismic data surveys broke the 1 TB limit; if you want to get into tens of TB you have to get into 4D/4C seismic surveys. Only the really big boys do that sort of thing.

A couple of days ago the marine survey company I work for issued a press release about their activities. In one year, the company recorded some 18 TB of data during survey operations. These are the combined surveys from 12 ships, each of 2000 to 3000 tons (about 90 m long), each running a 24/7 survey operation for 4 months per year. And you get that sort of data apparently without even trying...

Something is seriously weird here. I just don't believe your experiment design is a good one; it is fundamentally flawed. No well-designed experiment, particularly one designed by a DSP newbie, ends up with that sort of data. Period.

Why don't you get into details about how these data come about and what you try to do with them?

Rune

Rune Allnor wrote:

> On 7 Jun, 22:47, "prad" <pradeep.ferna...@gmail.com> wrote:
>
>> One of my major problems is that the data sets I have are huge, in fact
>> in terms of memory complexity, they are O(n!). 10^13 is one of the smallest
>> data sets I have.
>
> What the heck are you doing??
Don't you know? This is homework. A stupident is generating a huge data set by software and then trying to make use of that data. Numerical nonsense instead of thinking of better approaches, as usual.

VLV
I am sorry that you think that. This is for a research subject and not
homework. I am not sharing details about the data as it would give away
the novel modeling I am trying to do. Thanks to all those who gave useful
information and helped me. 

VLV: Please think before you send a comment like that. 



Pradeep.

On 9 Jun, 02:39, "prad" <pradeep.ferna...@gmail.com> wrote:
> I am sorry that you think that.
Don't top post when commenting on previous posts.
> This is for a research subject and not
> homework. I am not sharing details about the data as it would give away
> the novel modeling I am trying to do.
Those who have followed my posts here for a couple of years would know that I sometimes make a point of saying that I am an engineer, not a researcher, despite my misfortune of having obtained a PhD degree. The difference is that the engineer knows what he is doing whereas the researcher [*] does not.

Your project is flawed. There is no need whatsoever to come up with those sorts of data in any but the biggest survey projects. There is the vanishing (though still non-zero) chance that you really are onto something, but even so, your timing is wrong. These days, the sheer logistics of what you try to do is out of reach for anyone but the largest computer centres. That will change with time.

When I started playing with computers 20 years ago, it was specialist's work to handle more than 64 kilobytes of data at any one time. When I first played with certain sonar data processing ideas some 10 years ago, anything beyond 1 MByte was, for practical purposes, outside my reach. These days my computer's RAM is the limitation (it has only 512 MBytes) and I plan my programs for use with, say, 10 GBytes of data once I can get my hands on that sort of computer. It will be affordable by the time I finish my program.

So times change, and maybe you or I will look up this thread in five or ten years' time and smile at the old days when a mere 20 TB of data represented an insurmountable obstacle. However, at the time of writing, June 2007, 20 TB of data *is* an insurmountable obstacle. If you end up with the need to process that sort of data, there is such a huge discrepancy between what you want and what is within your abilities to do that you might as well do something else in the meantime, waiting for the required technology to become available to you.

If you do this as part of an employment, circulate your resume. Whoever assigned you to this task has no idea what he or she is doing.
> VLV: Please think before you send a comment like that.
Vladimir is, as is his habit, perfectly on the mark.

Rune

[*] In Norwegian, "researcher" translates to "forsker", "one who does research", whereas "scientist" translates to "vitenskapsmann", which means "one who knows the sciences." The difference might be subtle, but anyone can embark on some sort of research, while insight into the sciences usually has to be obtained through decades of dedicated studies. Needless to say, the world is full of researchers, with hardly a scientist alive, worldwide.
jim wrote:
(snip)

> What good is a 1000 point moving average filter going to be? If he is
> downsampling from 10^13 samples to 10^6 samples to prevent aliasing he's
> going to need a low pass filter that has millions and millions of
> coefficients.
Maybe it should be more than 1000, maybe less. I haven't seen any numbers related to the frequency range of interest.
> It's hard to imagine why, if the data contains only important info at
> such a low frequency, it is being sampled at such a high sample rate in
> the first place.
It seems that the data is generated, and not from a natural source. It might be the output of a linear congruential generator or a linear feedback shift register, for example. It might be a 43-bit LFSR with one-bit samples. Note that the number of bits per sample still hasn't been mentioned.

I believe with either a linear congruential generator or an LFSR you can rewrite it to generate every Nth point of the original series.

-- glen
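To illustrate that last point for the linear congruential case (the names and the modulus-2^64 arithmetic below are assumptions for the sake of the example, not anything prad has revealed): the generator's affine update x <- a*x + c can be composed with itself by repeated squaring, so the map that jumps N steps is built in O(log N) operations, after which every N-th raw sample costs one multiply-add.

#include <stdint.h>

/* Build the map x -> A*x + C (all arithmetic mod 2^64 via unsigned
 * wraparound) that advances the generator x -> a*x + c by n steps. */
static void lcg_skip(uint64_t a, uint64_t c, uint64_t n,
                     uint64_t *A, uint64_t *C)
{
    uint64_t accA = 1, accC = 0;        /* identity map */
    while (n > 0) {
        if (n & 1) {                    /* fold the current power into the result */
            accA = a * accA;
            accC = a * accC + c;
        }
        c = (a + 1) * c;                /* square the map: (a,c) -> (a*a, c*(a+1)) */
        a = a * a;
        n >>= 1;
    }
    *A = accA;
    *C = accC;
}

/* With (A,C) from lcg_skip(a, c, N, &A, &C), each call advances the state
 * by N positions of the original sequence in one step. */
static uint64_t lcg_nth(uint64_t *x, uint64_t A, uint64_t C)
{
    *x = A * (*x) + C;
    return *x;
}

An LFSR can be jumped the same way, by raising its state-update matrix to the N-th power over GF(2).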
Rune Allnor wrote:

(snip)

> Why don't you get into details about how these data
> come about and what you try to do with them?
He seems to want to keep the data source proprietary, in which case I believe our answers should also be. Now, if he wants to pay for answers, that is different.

-- glen