DSPRelated.com
Forums

Fast histogram algorithm

Started by Jaime Andres Aranguren Cardona November 2, 2004
Hi,

Could somebody please point me to papers, tutorials, websites, etc.
about fast algorithms for histogram calculation?

I need to classify 120000 values, distributed in 6400 classes (or
bins). The particular thing here is that the classes are not uniformly
spaced. The step is not calculated like
(MaxValue-MinValue)/NumberOfClasses. I am given the edges of every bin
(or class), instead.

So, the way I am doing it now is by scanning every sample over all of
the classes (bins) to check which of them the sample belongs to. This
kind of algorithm is rather simplistic, but it works. The BIG
drawback is that it takes far too long to execute: around 16 minutes
on an ADSP-21065L, coded in C (without optimization), an execution
time which I cannot tolerate.

Before going for optimization flags on the compiler, or doing assembly
coding, I'd like to have your suggestions and advice for a more
efficient algorithm to do it.

Regards,

JaaC
I'm not sure if this is enough info to help you out. As I read it, you 
have a bunch of data, and each particular datum (sample) needs to be 
classified into the particular class (bin) it belongs to, where 
each class (bin) can span several (sample?) values. Each class may have 
a different span of samples that it encompasses. Is that correct?

Can't you just create a mapping function that directly creates the 
class number from the actual sample value?

Rather than checking the 6400 classes' boundaries with an if 
statement inside a loop, try to set up a mapping function that takes 
the sample value directly to a quantized, integer bin number...

I don't know if this applies (or is possible) in your case, but from 
what you say it should be feasible...
-- 
Stephan M. Bernsee
http://www.dspdimension.com

jaime.aranguren@ieee.org (Jaime Andres Aranguren Cardona) writes:

> Hi,
>
> Could somebody please point me to papers, tutorials, websites, etc
> about fast algorithm for histogram calculation?
> [snip]
Hi Jaime,

An obvious thing to try is table-driven mapping, but this isn't practical
if you have floating point data, and even with integer the table is big.
I'm thinking with a 21065 you're talking float.

What about sorting the data first using a fast sort algorithm?
-- 
%  Randy Yates                  % "Maybe one day I'll feel her cold embrace,
%% Fuquay-Varina, NC            %  and kiss her interface,
%%% 919-577-9882                %  til then, I'll leave her alone."
%%%% <yates@ieee.org>           % 'Yours Truly, 2095', *Time*, ELO
http://home.earthlink.net/~yatescr
"Jaime Andres Aranguren Cardona" <jaime.aranguren@ieee.org> wrote in message
news:14a86f87.0411020421.3bc955b@posting.google.com...
> Hi,
>
> Could somebody please point me to papers, tutorials, websites, etc
> about fast algorithm for histogram calculation?
> [snip]
Hi Jaime,

You've probably thought of this already, but would a fast sort help? I
know there are a number of resources for performing a fast sort; then you
can just identify the difference in the index where you cross your next
bin boundary.

Best of Luck - Mike
Randy Yates wrote:

> jaime.aranguren@ieee.org (Jaime Andres Aranguren Cardona) writes:
>
>> Hi,
>>
>> Could somebody please point me to papers, tutorials, websites, etc
>> about fast algorithm for histogram calculation?
>> [snip]
>
> Hi Jaime,
>
> An obvious thing to try is table-driven mapping, but this isn't practical
> if you have floating point data, and even with integer the table is big.
> I'm thinking with a 21065 you're talking float.
>
> What about sorting the data first using a fast sort algorithm?
And if you can't sort all 120000 at once, bring it in in chunks small
enough to sort.

-- 
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
Hi Stephan,

> I'm not sure if this is enough info to help you out. As I read it, you
> have a bunch of data, and each particular datum (sample) needs to be
> classified into the particular class (bin) it belongs to, where
> each class (bin) can span several (sample?) values. Each class may have
> a different span of samples that it encompasses. Is that correct?
Absolutely!
> Can't you just create a mapping function that directly creates the
> class number from the actual sample value?
>
> Rather than checking for the 6400 classes' boundaries by an if
> statement inside a loop try to set up a mapping function that creates
> the bin number by taking the sample value through a mapping function,
> quantized to an integer bin number...
>
> I don't know if this applies (or is possible) in your case, but from
> what you say it should be feasible...
Could be... will think about it.

Regards,

JaaC
> Hi Jaime,
>
> An obvious thing to try is table-driven mapping, but this isn't practical
> if you have floating point data, and even with integer the table is big.
> I'm thinking with a 21065 you're talking float.
Hi, Randy

Yes, using float with the '065L. One concern with table mapping is that
the range (in other words, the maximum and minimum values) of the 120000
samples is not the same for every block of 120000 samples... So, a
different table for every block of data?

Any further ideas are very welcome.
> What about sorting the data first using a fast sort algorithm?
Yes, I've been thinking about it. I'm also thinking of the best way to
take advantage of the sorted version of the data buffer.

The first approach to come to my mind, maybe the simplest one, would be
to use a "while" loop instead of a "for" loop... This way, the
classification of the lower-valued samples would certainly take a short
time, and as the values become higher, it would take somewhat longer,
but definitely less than it takes with the current approach, in which
the scanning is performed even after the data sample has already found
the bin (or class) it belongs to.

Any additional ideas, please? And thanks a lot in advance,

JaaC
> > Hi Jaime,
> >
> > An obvious thing to try is table-driven mapping, but this isn't practical
> > if you have floating point data, and even with integer the table is big.
> > I'm thinking with a 21065 you're talking float.
> >
> > What about sorting the data first using a fast sort algorithm?
>
> And if you can't sort all 120000 at once, bring it in in chunks small
> enough to sort.
Hi Tim,

Will take your suggestion into account.

Thanks,

JaaC
> Hi Jaime,
> You've probably thought of this already but would a fast sort help? I
> know there are a number of resources for performing fast sort, then you can
> just identify the difference in the index where you cross your next bin
> boundary?
> Best of Luck - Mike
Hi Mike,

Let me check if I fully understood your suggestion:

1. Sort the input data.
2. Start to scan the sorted data to check up to which index (in the
sorted data buffer) the values are less than the first upper limit of
the classes; that tells me how many samples belong to the first class
(or bin).
3. Repeat until all the data is scanned. But I wouldn't need to start
over again from the beginning, just continue from the next
"unclassified" (but sorted) sample, checking up to which index the
samples belong to the class (bin).

Is this what you meant? Any additional ideas?

Regards and TIA,

JaaC
"Jaime Andres Aranguren Cardona" <jaime.aranguren@ieee.org> wrote in message
news:14a86f87.0411030342.4e81ad30@posting.google.com...
> > Hi Jaime, > > You've probably thought of this already but would a fast sort help? >snip>
> Hi Mike,
>
> Let me check if I fully understood your suggestion:
>
> 1. Sort the input data.
> [snip]
>
> Is this what you meant? Any additional idea?
Yes - that's as far as I went. Using a binary tree sort (or quick sort),
anything you can get your hands on easily must be faster than the 16
minutes you were quoting, and I got the impression you were looking for
a quick, low-effort way to improve your histogram generation times
rather than something that was as fast as you could get.

You can pull free source code in C from several web sources, and you
should not need to worry about compile optimisation to get comfortably
into the 'a few seconds' range if you've got enough programme space and
memory.

Best of Luck - Mike