# Beginner's question on undersampling and aliasing

Started June 6, 2007
```Eric Jacobsen wrote:
> On Thu, 07 Jun 2007 17:39:05 -0500, "prad"

(snip)

>>    Actually the data samples are produced by a code. But the code can
>>exactly produce the ith data sample that I need out of the entire 10^13
>>data points. So storage is not an issue. My limitation is that the C code
>>for the FFT takes the data input as an array of samples. And the largest
>>array size is 2^32. Since the data is not periodic, I need equidistant
>>samples picked out from the entire 10^13 data points.

> Bleah. Sounds like you want every Nth sample.   Not good.

Maybe not so bad.  If you really want a large FFT it is possible
to make that up out of smaller FFTs.  That is the whole idea behind
the FFT algorithm.  To do that, though, the data comes in an unusual
order.  If you can produce data in that order it should be easy to do
very large FFTs.  That is especially true if you only want the low
frequency end of the FFT.

(snip)

> Might you have access to any hardware accelerators?   Like a board
> with an FPGA on it connected to the computer?   At a 400MHz clock
> rate, which is practical these days, you could run the entire 10^13
> set through a hardware filter in an FPGA in 69 hours (about three
> days) assuming one clock per sample.   Keeping the buffers fed and
> getting the data out in a timely fashion might be a challenge, but
> this sort of thing might be worth looking into.

How did you get 69 hours?  1e13/4e8 is 25000, for about 6.9 hours.
It would seem easy to make a low pass filter in a reasonable
sized FPGA that could run at that rate.

-- glen

```
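Glen's suggestion of assembling a large FFT from smaller ones is the standard decimation-in-time identity: a length-N transform is built from the transforms of the even- and odd-indexed subsequences, which is exactly why the data has to arrive in an unusual, strided order. A minimal sketch with illustrative names (the O(N^2) `dft` is only a stand-in for whatever computes the sub-transforms):

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used here only as a sub-transform and reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def fft_from_halves(x):
    """One decimation-in-time step: the length-N DFT assembled from the
    DFTs of the even- and odd-indexed subsequences via
        X[k]       = E[k] + W^k * O[k]
        X[k + N/2] = E[k] - W^k * O[k]."""
    N = len(x)
    E = dft(x[0::2])          # DFT of samples 0, 2, 4, ...
    O = dft(x[1::2])          # DFT of samples 1, 3, 5, ...
    X = [0j] * N
    for k in range(N // 2):
        w = cmath.exp(-2j * cmath.pi * k / N)
        X[k] = E[k] + w * O[k]
        X[k + N // 2] = E[k] - w * O[k]
    return X
```

Applied recursively this is the FFT itself; and if only the low-frequency end is wanted, the final combination loop can be evaluated just for those bins (FFT pruning), which is the point glen makes above.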
```On 7 Jun, 20:57, "prad" <pradeep.ferna...@gmail.com> wrote:
> >> Hi all,
>
> >>    I am a newbie to DSP. I have a huge number of data samples that I
> >> want to perform FFT on.
>
> >Why do you want to compute the DFT? What do you hope
> >to achieve?
>
> >> But due to the enormous amount of the data samples, I'd
> >> like to use only 1M samples and perform FFT. I know that this is
> >> severe undersampling and that it does not satisfy the Nyquist rate.
>
> >Then you are in deep trouble. If the data are aliased, you
> >face a question of the form
>
> >"Given x+y = 1, find x and y."
>
> >There exists one true answer, but there is no way you can
> >find it, let alone tell it from all the other combinations
> >of x and y which satisfy x+y=1.
>
> >> But all I
> >> want is to prevent aliasing so that I can get correct information
> >> about the low frequency content in the data. I cannot use an analog
> >> anti-alias filter as the data is inherently digitized.
>
> >What do you mean? If you sample an analog process, then you
> >have no choice but to use an analog anti-alias filter.
> >If you have data from some discrete-time process, aliasing
> >is not an issue.
>
> >> Is there any time-domain
> >> method to help with my problem?
>
> >As far as filtering the data is concerned, only the analog
> >anti-alias filter. Provided, of course, you are actually
> >sampling an analog process.
>
> >> that eliminates the frequencies higher than half the sampling
> >> frequency.
> >> Can this filter help my problem as I read a lot of articles that the
> >> high frequencies have to be eliminated before digitization?
>
> >Doesn't apply here. Analog anti-alias filtering is the way
> >to go.
>
> >> If so, can someone
> >> point me to some link that discusses the design of such a filter?
>
> >Did I mention that analog anti-alias filters are useful?
>
> >Rune
>
> Dear all,
> Thanks for all your replies but I am still a bit confused. Let me try
> rephrasing my question so that everyone gets a better picture of what I
> am trying to do. I have a large number (order of 10^13) of discretized
> data samples.

OK, 10^13 samples, each of (at least) two bytes; we are
talking about some 20 TeraBytes of data here.

Just the magnitude of data shuffling means that you are
into specialist's work. Even the most trivial task becomes
huge; just forget everything you thought you knew.

Find a copy of Knuth's "The art of computer programming",
the volume on sorting and searching.

> I have to obtain the frequency spectrum of this data so that I
> can predict minima in the discretized data.

You haven't said how you measured your data, but you propose
to keep only one 10-millionth of the data samples. The slightest
incoherency in the process which generated the data will show
up as artifacts in the spectrum.

> I am using the FFTW C library
> to obtain the frequency spectrum. But the maximum size of my data array
> is limited to 2^32 in C.

Yep. If you use a 32-bit operating system.

> So I am picking a smaller subset of 1M samples from
> the original set of 10^13 data points.
>
> Jerry is right about the fact that what I am looking for is subsampling
> or decimation without running into aliasing problems. But due to the
> large amount of data, I cannot perform decimation on the entire data set.
> Also, I'd like to point out that my data is inherently discretized and so
> using an analog anti-aliasing filter to remove all the high frequency
> components that I am not interested in is NOT an option for me.

If the data are already aliased (no anti-aliasing filter
was used while sampling the data) then you are the proud
owner of several TeraBytes worth of rubbish. Using an
anti-alias filter when sampling is such a basic piece
of knowledge that not doing so ought to be liable for
criminal prosecution.

> So some scheme that
> resembles what John mentioned is of great interest to me. But I am not
> sure if that will work. An ideal scheme for me would be like the
> following:
>
> 1. Collect N=1M equidistant samples out of the n=10^13 data points.
> 2. Filter out the high frequency components present in the samples
> using some kind of LPF, i.e.
> For each sample "s_i" in the 1M samples
> i. Collect some constant number (say 10) samples before and
> after the sample "s_i"
> ii. Perform a Digital Low Pass Filtering operation on these
> 21 samples.
> iii. Pick one of these 21 samples (say the median of the
> 21 samples) to replace "s_i"
>
> Will the above procedure eliminate or at least drastically reduce
> aliasing (assuming of course that the original 10^13 data points were
> sampled without aliasing)? Can I somehow tell which is the lowest
> frequency that was significantly filtered out (so that I know which
> of the lower frequencies contain aliases)?

Decimation is a standard process in DSP. Not all implementations
rely on using the FFT. Try another version of decimation and
be prepared to spend a lot of time working out the trivial
logistics of shuffling data around.

Rune

```
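The scheme prad proposes above (equidistant picks, then a short local filter around each pick) can be sketched as follows. Everything here is hypothetical: `sample_at(i)` stands in for the OP's generating code, and the mean replaces the proposed median, since a median is a nonlinear operation rather than a low-pass filter. As the replies point out, a 21-point window cannot come close to the stopband rejection a 10^7:1 decimation requires, so this only illustrates the mechanics, not a fix for aliasing.

```python
def local_decimate(sample_at, n_total, n_out, half=10):
    """Step 1: pick n_out equidistant sample positions out of n_total.
    Step 2: for each position, generate the 2*half + 1 surrounding
    samples and replace the centre sample with their mean (a crude
    moving-average low-pass over just that neighbourhood)."""
    step = n_total // n_out
    out = []
    for j in range(n_out):
        centre = j * step + half
        window = [sample_at(centre + d) for d in range(-half, half + 1)]
        out.append(sum(window) / len(window))
    return out
```

Because only 21 generated samples per output are touched, the cost is O(n_out), not O(n_total), which is the whole appeal of the scheme to the OP.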
```
glen herrmannsfeldt wrote:
>
>
> >   Are you implying that I first pick 1M samples, then LPF the samples and
> > use them? Will that work? Or are you saying that I have to LPF the entire
> > digitized data and then do decimation? The second method is not feasible
> > for me as the data set is too large and will require an enormous amount of
>
> The exact answer depends much on the ratio of sampling rate to the
> frequency you are interested in.  Most likely it is possible to come
> up with a relatively fast filter to get what you want.
>
> Consider the moving average filter.  You will find that DSP people
> don't like it very much because its properties are not very good,
> but it is easy to describe and fast to implement.
>
> You might, for example, do a 1000 point moving average where the
> first output value is the mean of the first 1000 input samples,
> the second is the mean of 2 through 1001, etc.

What good is a 1000 point moving average filter going to be? If he is
downsampling from 10^13 samples to 10^6 samples to prevent aliasing he's
going to need a low pass filter that has millions and millions of
coefficients.

Don't know what the actual problem is, but let's suppose the 10^13 data
points are a recording of all the pop songs that have ever been played
on the radio. After filtering and decimation the OP would have about one
sample per song. If you are going to use a moving average filter then
each output sample would be something like the average of all the
samples of one song (which probably in most songs would be close to
zero).
It's hard to imagine why, if the data contains only important info at
such a low frequency, it is being sampled at such a high sample rate in
the first place.

-jim

>This can be done
> very fast with a circular buffer of length 1000 to update the
> sum appropriately.  It might be that a moving average filter
> followed by a decimation by 10, (output only every 10th point)
> would be a good initial filter, and reduce your problem by
> a factor of ten.  Then another filter could finish off the job.
>
> I have not done the calculations to show that the above is
> actually a good way to proceed, but only offer it to show
> that there are fast ways to filter large amounts of data,
> and that in some cases they might be useful.  Also, that
> cascaded filters might be faster than doing the whole
> thing in one step.
>
> -- glen

```
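Glen's circular-buffer moving average with decimation, as quoted above, can be sketched like this (a minimal streamed version; `width` and `keep_every` are placeholder values, not numbers from the thread):

```python
from collections import deque

def moving_average_decimate(samples, width=1000, keep_every=10):
    """Moving average kept as a running sum over a circular buffer
    (O(1) work per input sample), emitting only every keep_every-th
    filtered value -- the first stage of glen's cascaded-filter idea."""
    buf = deque()
    total = 0.0
    out = []
    for n, x in enumerate(samples):
        buf.append(x)
        total += x
        if len(buf) > width:
            total -= buf.popleft()   # drop the sample leaving the window
        if len(buf) == width and (n - width + 1) % keep_every == 0:
            out.append(total / width)
    return out
```

The running sum is what makes this fast: each input costs one add, one subtract, and an occasional divide, regardless of `width`, which is why a cascade of such stages can be cheaper than one giant FIR filter.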
```On Fri, 08 Jun 2007 01:24:59 -0800, glen herrmannsfeldt
<gah@ugcs.caltech.edu> wrote:

>Eric Jacobsen wrote:
>> On Thu, 07 Jun 2007 17:39:05 -0500, "prad"
>
>(snip)
>
>>>    Actually the data samples are produced by a code. But the code can
>>>exactly produce the ith data sample that I need out of the entire 10^13
>>>data points. So storage is not an issue. My limitation is that the C code
>>>for the FFT takes the data input as an array of samples. And the largest
>>>array size is 2^32. Since the data is not periodic, I need equidistant
>>>samples picked out from the entire 10^13 data points.
>
>> Bleah. Sounds like you want every Nth sample.   Not good.
>
>Maybe not so bad.  If you really want a large FFT it is possible
>to make that up out of smaller FFTs.  That is the whole idea behind
>the FFT algorithm.  To do that, though, the data comes in an unusual
>order.  If you can produce data in that order it should be easy to do
>very large FFTs.  That is especially true if you only want the low
>frequency end of the FFT.

That's a very interesting thought.   Some judicious pruning of the FFT
computations to get the fraction of the spectrum of interest.

>> Might you have access to any hardware accelerators?   Like a board
>> with an FPGA on it connected to the computer?   At a 400MHz clock
>> rate, which is practical these days, you could run the entire 10^13
>> set through a hardware filter in an FPGA in 69 hours (about three
>> days) assuming one clock per sample.   Keeping the buffers fed and
>> getting the data out in a timely fashion might be a challenge, but
>> this sort of thing might be worth looking into.
>
>How did you get 69 hours?  1e13/4e8 is 25000, for about 6.9 hours.
>It would seem easy to make a low pass filter in a reasonable
>sized FPGA that could run at that rate.
>
>-- glen

Yeah, musta mis-typed an exponent or something.   You're right, it's
an order of magnitude better than I thought it'd be.

So it must be a good idea!!   ;)

The buffering could be an issue, but even if the filter has to stop
and wait for data now and then, it could still be an excellent way to
accelerate computation.

Eric Jacobsen
Minister of Algorithms
Abineau Communications
http://www.ericjacobsen.org
```
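A quick sanity check of the throughput arithmetic the two posts above settle (numbers assumed from the thread: 10^13 samples, a 400 MHz clock, one sample per clock):

```python
samples = 1e13
clock_hz = 400e6                 # one sample per FPGA clock, as suggested
seconds = samples / clock_hz     # 25,000 seconds
hours = seconds / 3600.0         # about 6.9 hours, not 69
```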
```On 7 Jun, 22:47, "prad" <pradeep.ferna...@gmail.com> wrote:

>    One of my major problems is that the data sets I have are huge, in fact
> in terms of memory complexity, they are O(n!). 10^13 is one of the smallest
> data sets I have.

What the heck are you doing??

Only ten years ago, the 3D/4C seismic data surveys
broke the 1 TB limit; if you want to get into tens of
TBs you have to get into 4D/4C seismic surveys. Only
the really big boys do that sort of thing.

A couple of days ago the marine survey company I work
for issued a press release about their activities.
In one year, the company recorded some 18 TB of data
during survey operations. These are the combined surveys
from 12 ships, each of 2000 to 3000 tons (about 90 m long),
each running a 24/7 survey operation for 4 months per year.

And you get that sort of data apparently without even
trying...

Something is seriously weird here. I just don't
believe your experiment design is a good one; it is
fundamentally flawed. No well-designed experiment,
particularly one designed by a DSP newbie, ends up
with that sort of data. Period.

Why don't you get into details about how these data
come about and what you try to do with them?

Rune

```
```
Rune Allnor wrote:

>
>
>>   One of my major problems is that the data sets I have are huge, in fact
>>in terms of memory complexity, they are O(n!). 10^13 is one of the smallest
>>data sets I have.
>
>
> What the heck are you doing??

Don't you know? This is homework. A stupident is generating a huge data
set by software and then trying to make use of that data. Numerical nonsense
instead of thinking of the better approaches, as usual.

VLV
```
```I am sorry that you think that. This is for a research subject and not
homework. I am not sharing details about the data as it would give away
the novel modeling I am trying to do. Thanks to all those who gave useful
information and helped me.

VLV: Please think before you send a comment like that.

>
>
>Rune Allnor wrote:
>
>>
>>
>>>   One of my major problems is that the data sets I have are huge, in fact
>>>in terms of memory complexity, they are O(n!). 10^13 is one of the smallest
>>>data sets I have.
>>
>>
>> What the heck are you doing??
>
>Don't you know? This is homework. A stupident is generating a huge data
>by software and then trying to make use of that data. Numerical nonsense
>instead of thinking of the better approaches, as usual.
>
>VLV
>
```
```On 9 Jun, 02:39, "prad" <pradeep.ferna...@gmail.com> wrote:
> I am sorry that you think that.

Don't top post when commenting on previous posts.

> This is for a research subject and not
> homework. I am not sharing details about the data as it would give away
> the novel modeling I am trying to do.

Those who have followed my posts here for a couple of years
would know that I sometimes make a point of saying that I
am an engineer, not a researcher, despite my misfortune
of having obtained a PhD degree. The difference is that
the engineer knows what he is doing whereas the researcher [*]
does not.

Your project is flawed. There is no need whatsoever to
come up with those sorts of data in any but the biggest
survey projects.

There is the vanishing (though still non-zero) chance that
you really are into something, but even so, your timing is
wrong. These days, the sheer logistics of what you try to do
is out of reach for anyone but the largest computer centres.
That will change with time. When I started playing with
computers 20 years ago, it was specialist's work to handle
more than 64 kilobytes of data at any one time. When I first
played with certain sonar data processing ideas some 10 years
ago, anything beyond 1MByte was, for practical purposes,
outside my reach. These days my computer's RAM is the
limitation (it has only 512 MBytes) and I plan my programs
for use with, say, 10 GBytes of data once I can get my hands
on that sort of computer. It will be affordable by the
time I finish my program.

So times change, and maybe you or I will look up this
thread in five or ten years time and smile at the old
days when a mere 20 TB of data represented an insurmountable
obstacle.

However, at the time of writing, June 2007, 20 TB of data
*is* an insurmountable obstacle. If you end up with the
need to process that sort of data, there is such a huge
discrepancy between what you want and what is within
your abilities to do, that you might as well do something
else in the mean time, waiting for the required technology
to become available to you.

If you do this as part of an employment, circulate your
resume. Whoever assigned you to this task has no idea
what he or she is doing.

> VLV: Please think before you send a comment like that.

Vladimir is, as is his habit, perfectly on the mark.

Rune

[*] In Norwegian, "researcher" is translated to "forsker"
"one who does research", whereas "scientist" is translated
to "vitenskapsmann" which means "one who knows the
sciences." The difference might be subtle, but anyone
can embark on some sort of research, while insight into
the sciences requires more, usually obtained through
decades of dedicated studies. Needless to say, the
world is full of researchers, with hardly one
scientist alive, world wide.

```
```jim wrote:
(snip)

> What good is a 1000 point moving average filter going to be? If he is
> downsampling from 10^13 samples to 10^6 samples to prevent aliasing he's
> going to need a low pass filter that has millions and millions of
> coefficients.

Maybe it should be more than 1000, maybe less.  I haven't seen any
numbers related to the frequency range of interest.

>  	It's hard to imagine why, if the data contains only important info at
> such a low frequency, it is being sampled at such a high sample rate in
> the first place.

It seems that it is generated, and not from a natural source.  It might
be the output of a linear congruential or linear feedback shift
register, for example.  It might be a 43 bit LFSR, and one bit samples.
Note that the number of bits per sample still hasn't been mentioned.

I believe with either linear congruential or LFSR you can rewrite it
to generate every Nth point of the original series.

-- glen

```
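Glen's remark that a linear congruential generator "can be rewritten to generate every Nth point" follows from composing the affine update x &rarr; (a&middot;x + c) mod m with itself: the composition of N such steps is again an LCG. A sketch under that assumption (parameters are illustrative, not the OP's):

```python
def lcg_stride(a, c, m, stride):
    """Compose the LCG step x -> (a*x + c) % m with itself `stride`
    times. The result (A, C) is itself an LCG that jumps straight
    from x_n to x_{n+stride}: composing one more base step onto
    x -> A*x + C gives A' = a*A, C' = a*C + c."""
    A, C = 1, 0
    for _ in range(stride):
        A, C = (A * a) % m, (a * C + c) % m
    return A, C

def lcg_sequence(a, c, m, seed, count):
    """Plain LCG sequence, used to cross-check the strided generator."""
    out, x = [], seed
    for _ in range(count):
        x = (a * x + c) % m
        out.append(x)
    return out
```

With (A, C) in hand, the decimated stream of every Nth sample can be generated directly, without ever producing the skipped samples -- which is exactly the property that would let the OP's code emit equidistant samples cheaply.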
```Rune Allnor wrote:

(snip)

> Why don't you get into details about how these data
> come about and what you try to do with them?

He seems to want to keep the data source proprietary,
in which case I believe our answers also should be.

Now, if he wants to pay for answers, that is different.

-- glen

```