Forums

fast sliding max algorithm in SHARC assembly.

Started by robert bristow-johnson April 13, 2015
so this is sorta a specific question (and better suited to comp.dsp than 
dsp.stackexchange.com):

there is this O(log2(N)) complexity max (or min) algorithm perhaps 
originally from:

Brookes: "Algorithms for Max and Min Filters with Improved Worst-Case 
Performance"  IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS�II: ANALOG AND 
DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 9, SEPTEMBER 2000

i have a very short C program for it (actually it's a min algorithm, but 
it's the same thing) at the bottom.  there is an exclusive-or operation 
done on the index value which is necessary instead of incrementing.  in 
the old Mot 56K i was able to do this modulo2 address mode at every pair 
of samples in a buffer that started at an even address.  the 56K 
*infers* its base address from the index address, but the SHARC makes 
you *load* a base address (which would be the same as the index but with 
the LSB set to zero which is another bunch of operations), so it's not 
as efficient.

so, it appears that an assembly language version using the DAG registers 
judiciously won't be any better than the C program below.  but in that C 
program, there will be pipelining overhead transferring an ALU register, 
where the index manipulation is done, to a DAG register where it will be 
used in table lookup.

Q1:  anyone else use this algorithm?

Q2:  have you any insight in how to efficiently pipeline this in SHARC 
assembly?  not that i want specific code (but it wouldn't hurt).  i know 
about the Fn = MAX(Fx, Fy) instruction.  it's more about the index 
manipulation.  it's like there will be 9 or 10 instructions per loop and 
even if that is multiplying ceil(log2(N)) instead of N, if N is like 16, 
it gets a lot more expensive than, say a decaying envelope which is a 
max() and a multiply.

sorry, by SHARC chops have long been atrophied.


-- 

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."



#define A_REALLY_LARGE_NUMBER 1.0e20

typedef struct
     {
     unsigned long filter_length;          // array_size/2 < 
filter_length <= array_size
     unsigned long array_size;             // must be power of 2 for 
this simple implementation
     unsigned long input_index;            // the actual sample 
placement is at (array_size + input_index);
     float* big_array_base;                // the big array is malloc() 
separately and is actually twice array_size;
     } search_tree_array_data;


void initSearchArray(unsigned long filter_length, 
search_tree_array_data* array_data)
     {
     array_data->filter_length = filter_length;

     array_data->array_size = 1;
     filter_length--;
     while (filter_length > 0)
         {
         array_data->array_size <<= 1;
         filter_length >>= 1;
         }
                                             // array_size is a power of 
2 such that
                                             // filter_length <= 
array_size < 2*filter_length
     array_data->input_index = 0;

     array_data->big_array_base = 
(float*)malloc(sizeof(float)*2*array_data->array_size);        // dunno 
what to do if malloc() fails.

     for (unsigned long n=0; n<2*array_data->array_size; n++)
         {
         array_data->big_array_base[n] = A_REALLY_LARGE_NUMBER; 
// init array.
         } 
// array_base[0] is never used.
     }

/*
  *   findMinSample(new_sample, &array_data) will place new_sample into 
the circular
  *   buffer in the latter half of the array pointed to by 
array_data.big_array_base .
  *   it will then compare the value in new_sample to its "sibling" 
value, takes the
  *   minimum of the two and then pops up one generation to the parent 
node where
  *   this parent also has a sibling and repeats the process.  since the 
other parent
  *   nodes already have the min value of the two child nodes, when 
getting to the
  *   top-level parent node, this node will have the minimum value of 
all the samples
  *   in the big_array.  the number of iterations of this loop is 
ceil(log2(filter_length)).
  */

float findMinSample(float value, search_tree_array_data* array_data)
     {
     register float* big_array = array_data->big_array_base;

     register unsigned long index = array_data->array_size + 
array_data->input_index;        // our main buffer is in the latter half 
of the big array.

     while (index > 1UL)
         {
         big_array[index] = value;

         register float sibling_value = big_array[index ^ 1UL]; 
// the upper bits of the sibling address are the same.

         if (value > sibling_value)
             {
             value = sibling_value;                        // use 
minimum of the two values
             }

         index >>= 1;                                      // parent 
address is index/2 (drop remainder or "sibling bit")
         }

     array_data->input_index++;
     if (array_data->input_index >= array_data->filter_length)
         {
         array_data->input_index = 0;
         }

     return value;
     }


float getInput(void);
void putOutput(float);

int main(int argc, const char* argv[])
     {
     search_tree_array_data array_data;
     unsigned long filter_length = 2048;                    // 
filter_length need not be a power of 2

     initSearchArray(filter_length, &array_data);

     while (1)                                              // loop forever
         {
         float input_sample = getInput();
         float output_sample = findMinSample(input_sample, &array_data);
         putOutput(output_sample);
         }

     return 0;
     }

On 4/13/15 10:32 PM, robert bristow-johnson wrote:
> > so this is sorta a specific question (and better suited to comp.dsp than > dsp.stackexchange.com): > > there is this O(log2(N)) complexity max (or min) algorithm perhaps > originally from: > > Brookes: "Algorithms for Max and Min Filters with Improved Worst-Case > Performance" IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS&#2013266071;II: ANALOG AND > DIGITAL SIGNAL PROCESSING, VOL. 47, NO. 9, SEPTEMBER 2000 > > i have a very short C program for it (actually it's a min algorithm, but > it's the same thing) at the bottom. there is an exclusive-or operation > done on the index value which is necessary instead of incrementing. in > the old Mot 56K i was able to do this modulo2 address mode at every pair > of samples in a buffer that started at an even address. the 56K *infers* > its base address from the index address, but the SHARC makes you *load* > a base address (which would be the same as the index but with the LSB > set to zero which is another bunch of operations), so it's not as > efficient.
just to spark your memory: the SHARC uses 4 registers (index I, modifier M, length L, base B) to do circular addressing and it's more "general", you can place any size circular buffer and base it at any address: B <= I < B + L whereas the 56K uses 3 registers (index R, modifier N, length M+1) to do circular addressing. the base or bottom of the circular buffer is inferred from this formula: base = 2^(ceil(log2(M+1)) * floor( R / 2^(ceil(log2(M+1)) ) so the base is at an integer multiple of a power of 2 that is at least as large as the length of the buffer, that's how the 56K infers that information from two of the existing three registers. i could use this to advantage to get to the "sibling" value in that maxtree algorithm that my C code depicts by setting M+1 to 2 and either incrementing or decrementing (didn't matter). so if one value was at an odd address, the sibling was at an even address where the upper bits were all identical. in the C code we do that with the XOR operation on the LSB.
> > so, it appears that an assembly language version using the DAG registers > judiciously won't be any better than the C program below. but in that C > program, there will be pipelining overhead transferring an ALU register, > where the index manipulation is done, to a DAG register where it will be > used in table lookup. > > Q1: anyone else use this algorithm? > > Q2: have you any insight in how to efficiently pipeline this in SHARC > assembly? not that i want specific code (but it wouldn't hurt). i know > about the Fn = MAX(Fx, Fy) instruction. it's more about the index > manipulation. it's like there will be 9 or 10 instructions per loop and > even if that is multiplying ceil(log2(N)) instead of N, if N is like 16,
i meant of log2(N) = 16 or N might be as big as 64K. so there could still be a good 150 instructions to do this...
> it gets a lot more expensive than, say a decaying envelope which is a > max() and a multiply.
like a capacitor-filtered rectifier. simple and cheap.
> > sorry, my SHARC chops have long been atrophied.
but my lamb chops are right on the money. much easier chopping up a lamb than a shark. lambs don't put up much of a fuss. still hoping that someone has a wonderful insight that i do not. -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."