On Tue, 06 Mar 2012 10:58:45 -0800, wicore wrote:>> The ADSP21xx DSP instruction set is very versatile, because they made >> the hardware looping explicit; this meant that you could do more than >> just a chain of MAC instructions way fast (most notably for me, you >> could do >> >> y = a0 + a1 * x + a2 * x^2 + a3 * x^3 + ... >> >> at two clocks per term -- one to square x, and one to do the MAC). >> >> > >> Tim Wescott, Communications, Control, Circuits & >> Softwarehttp://www.wescottdesign.com > > Maybe not the best example. > One usually recommends evaluating polynomials with Horner's rule: > > y = a0 + (a1 + (a2 + a3*x) * x) * x > > It will need MPY and ADD not MPY and MAC.One doesn't if one understands the ADSP21xx instruction set. -- My liberal friends think I'm a conservative kook. My conservative friends think I'm a liberal kook. Why am I not happy that they have found common ground? Tim Wescott, Communications, Control, Circuits & Software http://www.wescottdesign.com
Texts on generic DSP architecture designs?
Started by ●March 2, 2012
Reply by ●March 6, 20122012-03-06
Reply by ●March 6, 20122012-03-06
Tim Wescott <tim@seemywebsite.com> wrote: (snip, someone wrote)>>> >But each register in the ADSP21xx has a different set of overlapping >>> >special purposes, so writing code for it was very much a >>> >puzzle-solving exercise: you would find yourself painted into a >>> corner and have to go back half a dozen instructions, change a >>> choice of register, then try again.(snip)>> I have developed a lot of embedded systems compilers (28) and on many of >> them it was for new instruction sets or substantially revised >> instruction sets. The reason silicon vendors have so little information >> on their instruction set usage is they literally don't know how the >> instruction set can be best used in the context of full applications.(snip)> In the case of the ADSP21xx processors, I could see how the register > usage came about -- they were clearly saving precious silicon, and > keeping their logic paths short, by doing what they did.> As difficult as it was to program those things by hand, I couldn't > imagine that writing a truly optimizing compiler for the thing would be > anything but a nightmare,With dynamic-programming algorithms and the appropriate weights, it is pretty easy to write a code generator that will select the optimum set of instructions.> if you could even express the intended functionality in a > GP language like C. The C virtual machine just doesn't address > the needs of fixed-point DSP computations.Well, yes, C doesn't make it so easy. But if you can write it in C, it isn't so hard to get the compiler to generate the best code for the C you wrote.> And, indeed, the C compiler that we used for the 21xx basically > worked by choosing a subset of the registers that could be > treated as general-purpose registers, and simply not using any > of the "DSP-ish" features of the part. So we did all of the > general purpose stuff in C (probably very inefficiently), > and the actual DSP happened in less than 100 lines of > assembly code whose data structures were set up in C, and > which were called from C routines.That is probably the best way. It always was and likely always will be. -- glen
Reply by ●March 6, 20122012-03-06
> > One doesn't if one understands the ADSP21xx instruction set. > > -- > My liberal friends think I'm a conservative kook. > My conservative friends think I'm a liberal kook. > Why am I not happy that they have found common ground? > > Tim Wescott, Communications, Control, Circuits & Softwarehttp://www.wescottdesign.comHehe, reminds me of the saying - "An empty barrel makes the most noise". -- rgds
Reply by ●March 6, 20122012-03-06
On Tue, 06 Mar 2012 20:38:26 +0000, glen herrmannsfeldt wrote:> Tim Wescott <tim@seemywebsite.com> wrote: > > (snip, someone wrote) >>>> >But each register in the ADSP21xx has a different set of overlapping >>>> >special purposes, so writing code for it was very much a >>>> >puzzle-solving exercise: you would find yourself painted into a >>>> corner and have to go back half a dozen instructions, change a choice >>>> of register, then try again. > > (snip) >>> I have developed a lot of embedded systems compilers (28) and on many >>> of them it was for new instruction sets or substantially revised >>> instruction sets. The reason silicon vendors have so little >>> information on their instruction set usage is they literally don't >>> know how the instruction set can be best used in the context of full >>> applications. > > (snip) >> In the case of the ADSP21xx processors, I could see how the register >> usage came about -- they were clearly saving precious silicon, and >> keeping their logic paths short, by doing what they did. > >> As difficult as it was to program those things by hand, I couldn't >> imagine that writing a truly optimizing compiler for the thing would be >> anything but a nightmare, > > With dynamic-programming algorithms and the appropriate weights, it is > pretty easy to write a code generator that will select the optimum set > of instructions.You really need to try writing some algorithms in 21xx assembly. I'm not saying it's impossible -- just that there's a lot of back-tracking, and that decisions made at instruction N can have a direct effect at instruction N+5, and repercussions for tens of instructions on.>> if you could even express the intended functionality in a GP language >> like C. The C virtual machine just doesn't address the needs of >> fixed-point DSP computations. > > Well, yes, C doesn't make it so easy. > > But if you can write it in C, it isn't so hard to get the compiler to > generate the best code for the C you wrote.While I don't think the problem of getting an optimizing compiler for this instruction set is trivial, the worse problem is that the core actions that one takes with a DSP part just don't fit into the C virtual machine -- at least not when you're doing the job with fixed-point arithmetic. Trying to capture a MAC instruction in C, with an extended-precision accumulator, would be hard enough in itself; having a compiler that could recognize what you're doing and realize that it could be mapped to hardware is even harder yet.>> And, indeed, the C compiler that we used for the 21xx basically worked >> by choosing a subset of the registers that could be treated as >> general-purpose registers, and simply not using any of the "DSP-ish" >> features of the part. So we did all of the general purpose stuff in C >> (probably very inefficiently), and the actual DSP happened in less than >> 100 lines of assembly code whose data structures were set up in C, and >> which were called from C routines. > > That is probably the best way. It always was and likely always will be.I could envision being able to do better with a floating-point DSP -- but for fixed point, I think there'll always be library code written in assembly. -- My liberal friends think I'm a conservative kook. My conservative friends think I'm a liberal kook. Why am I not happy that they have found common ground? Tim Wescott, Communications, Control, Circuits & Software http://www.wescottdesign.com
Reply by ●March 6, 20122012-03-06
On 3/6/12 1:58 PM, wicore wrote:>> >> The ADSP21xx DSP instruction set is very versatile, because they made the >> hardware looping explicit; this meant that you could do more than just a >> chain of MAC instructions way fast (most notably for me, you could do >> >> y = a0 + a1 * x + a2 * x^2 + a3 * x^3 + ... >> >> at two clocks per term -- one to square x, and one to do the MAC). >> > > Maybe not the best example. > One usually recommends evaluating polynomials with Horner's rule: > > y = a0 + (a1 + (a2 + a3*x) * x) * x > > It will need MPY and ADD not MPY and MAC. >back in my 56K daze, i came to the conclusion that Horner's rule was not useful for a fixed-point DSP with a double-wide accumulator. in either case, you get quantization error because you just cannot allow the word width to increase each time the order or x^n increases. with the right pipelining, you can do an N-th order finite power series with 2*N operations + overhead. my wishlist for DSP designs were that the binary point be bumped to the right a couple of bits so that the numbers naturally lived in the range of, say, -8 to just under +8. the only DSP that got this right, that i know of, are the ADI "sigma" DSPs. i also strongly disliked having different word widths for opcodes, addresses, and data. it should be all the same, probably 32 bits. it means that, if you're committed to single-word instructions (and 1 clock per instruction), that not all immediate operands are possible, but you can fix it by putting a constant somewhere in memory and loading it with a different addressing mode. -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
Reply by ●March 6, 20122012-03-06
Tim Wescott wrote:> On Tue, 06 Mar 2012 20:38:26 +0000, glen herrmannsfeldt wrote: > > > Tim Wescott <tim@seemywebsite.com> wrote: > > > > (snip, someone wrote) > >>>> >But each register in the ADSP21xx has a different set of overlapping > >>>> >special purposes, so writing code for it was very much a > >>>> >puzzle-solving exercise: you would find yourself painted into a > >>>> corner and have to go back half a dozen instructions, change a choice > >>>> of register, then try again. > > > > (snip) > >>> I have developed a lot of embedded systems compilers (28) and on many > >>> of them it was for new instruction sets or substantially revised > >>> instruction sets. The reason silicon vendors have so little > >>> information on their instruction set usage is they literally don't > >>> know how the instruction set can be best used in the context of full > >>> applications. > > > > (snip) > >> In the case of the ADSP21xx processors, I could see how the register > >> usage came about -- they were clearly saving precious silicon, and > >> keeping their logic paths short, by doing what they did. > > > >> As difficult as it was to program those things by hand, I couldn't > >> imagine that writing a truly optimizing compiler for the thing would be > >> anything but a nightmare, > > > > With dynamic-programming algorithms and the appropriate weights, it is > > pretty easy to write a code generator that will select the optimum set > > of instructions. > > You really need to try writing some algorithms in 21xx assembly. I'm not > saying it's impossible -- just that there's a lot of back-tracking, and > that decisions made at instruction N can have a direct effect at > instruction N+5, and repercussions for tens of instructions on. > > >> if you could even express the intended functionality in a GP language > >> like C. The C virtual machine just doesn't address the needs of > >> fixed-point DSP computations. > > > > Well, yes, C doesn't make it so easy. > > > > But if you can write it in C, it isn't so hard to get the compiler to > > generate the best code for the C you wrote. > > While I don't think the problem of getting an optimizing compiler for > this instruction set is trivial, the worse problem is that the core > actions that one takes with a DSP part just don't fit into the C virtual > machine -- at least not when you're doing the job with fixed-point > arithmetic. > > Trying to capture a MAC instruction in C, with an extended-precision > accumulator, would be hard enough in itself; having a compiler that could > recognize what you're doing and realize that it could be mapped to > hardware is even harder yet. > > >> And, indeed, the C compiler that we used for the 21xx basically worked > >> by choosing a subset of the registers that could be treated as > >> general-purpose registers, and simply not using any of the "DSP-ish" > >> features of the part. So we did all of the general purpose stuff in C > >> (probably very inefficiently), and the actual DSP happened in less than > >> 100 lines of assembly code whose data structures were set up in C, and > >> which were called from C routines. > > > > That is probably the best way. It always was and likely always will be. > > I could envision being able to do better with a floating-point DSP -- but > for fixed point, I think there'll always be library code written in > assembly.Compiler tool technology has come a long way. A lot of what is happening now in a compiler is not significantly different from well planned hand written code except the implementation is done for every compile. Our compilers do a strategy pass that makes some high level choices on the code generation at a lower level, we use an expert systems to make implementation choices and map source to the instruction set and data structures on the processor. Embedded systems have always had some special purpose instructions. When you think of a mac as just another special purpose instruction then compiler implementation becomes easier. One important test is to make sure that the compiler actually knows about every instruction and every addressing mode that the processor has. Many processors are designed swap silicon for instruction restrictions assuming the code generation tools are smart enough to prevent combinations that are not allowed. w.. -- Walter Banks Byte Craft Limited http://www.bytecraft.com
Reply by ●March 6, 20122012-03-06
On Tue, 06 Mar 2012 18:36:26 -0500, Walter Banks wrote:> Tim Wescott wrote: > >> On Tue, 06 Mar 2012 20:38:26 +0000, glen herrmannsfeldt wrote: >> >> > Tim Wescott <tim@seemywebsite.com> wrote: >> > >> > (snip, someone wrote) >> >>>> >But each register in the ADSP21xx has a different set of >> >>>> >overlapping special purposes, so writing code for it was very >> >>>> >much a puzzle-solving exercise: you would find yourself painted >> >>>> >into a >> >>>> corner and have to go back half a dozen instructions, change a >> >>>> choice of register, then try again. >> > >> > (snip) >> >>> I have developed a lot of embedded systems compilers (28) and on >> >>> many of them it was for new instruction sets or substantially >> >>> revised instruction sets. The reason silicon vendors have so little >> >>> information on their instruction set usage is they literally don't >> >>> know how the instruction set can be best used in the context of >> >>> full applications. >> > >> > (snip) >> >> In the case of the ADSP21xx processors, I could see how the register >> >> usage came about -- they were clearly saving precious silicon, and >> >> keeping their logic paths short, by doing what they did. >> > >> >> As difficult as it was to program those things by hand, I couldn't >> >> imagine that writing a truly optimizing compiler for the thing would >> >> be anything but a nightmare, >> > >> > With dynamic-programming algorithms and the appropriate weights, it >> > is pretty easy to write a code generator that will select the optimum >> > set of instructions. >> >> You really need to try writing some algorithms in 21xx assembly. I'm >> not saying it's impossible -- just that there's a lot of back-tracking, >> and that decisions made at instruction N can have a direct effect at >> instruction N+5, and repercussions for tens of instructions on. >> >> >> if you could even express the intended functionality in a GP >> >> language like C. The C virtual machine just doesn't address the >> >> needs of fixed-point DSP computations. >> > >> > Well, yes, C doesn't make it so easy. >> > >> > But if you can write it in C, it isn't so hard to get the compiler to >> > generate the best code for the C you wrote. >> >> While I don't think the problem of getting an optimizing compiler for >> this instruction set is trivial, the worse problem is that the core >> actions that one takes with a DSP part just don't fit into the C >> virtual machine -- at least not when you're doing the job with >> fixed-point arithmetic. >> >> Trying to capture a MAC instruction in C, with an extended-precision >> accumulator, would be hard enough in itself; having a compiler that >> could recognize what you're doing and realize that it could be mapped >> to hardware is even harder yet. >> >> >> And, indeed, the C compiler that we used for the 21xx basically >> >> worked by choosing a subset of the registers that could be treated >> >> as general-purpose registers, and simply not using any of the >> >> "DSP-ish" features of the part. So we did all of the general >> >> purpose stuff in C (probably very inefficiently), and the actual DSP >> >> happened in less than 100 lines of assembly code whose data >> >> structures were set up in C, and which were called from C routines. >> > >> > That is probably the best way. It always was and likely always will >> > be. >> >> I could envision being able to do better with a floating-point DSP -- >> but for fixed point, I think there'll always be library code written in >> assembly. > > Compiler tool technology has come a long way. A lot of what is happening > now in a compiler is not significantly different from well planned hand > written code except the implementation is done for every compile. Our > compilers do a strategy pass that makes some high level choices on the > code generation at a lower level, we use an expert systems to make > implementation choices and map source to the instruction set and data > structures on the processor. > > Embedded systems have always had some special purpose instructions. When > you think of a mac as just another special purpose instruction then > compiler implementation becomes easier. One important test is to make > sure that the compiler actually knows about every instruction and every > addressing mode that the processor has. > > Many processors are designed swap silicon for instruction restrictions > assuming the code generation tools are smart enough to prevent > combinations that are not allowed.So, can you give an example of C code that adequately expresses the concept of doing the dot product of two vectors of 16-bit signed numbers into a 40 or 48 bit extended accumulator, possibly with shifting, so that the optimizer even has a chance of knowing the intent of the coder well enough to identify it as an opportunity for doing a classic fixed-point DSP one-cycle-per-tap vector multiply? Followed, of course, by the appropriate conditional saturation of said extended accumulator? And while you're at it, can you show how to write this code so that it'll work both in the case of the TI C2812 processor, which does a shift as an inherent part of the MAC (via either a field or a special register setting - I can't remember which), and the case of the ADSP21xx processor, which does the MAC "straight", but then does a single-cycle barrel shift-and-saturate when you extract the data from the MAC? Forget making the compiler recognize it -- can you make it happen in C so that _I_ can recognize it? -- My liberal friends think I'm a conservative kook. My conservative friends think I'm a liberal kook. Why am I not happy that they have found common ground? Tim Wescott, Communications, Control, Circuits & Software http://www.wescottdesign.com
Reply by ●March 6, 20122012-03-06
Tim Wescott <tim@seemywebsite.com> wrote: (big snip)> So, can you give an example of C code that adequately expresses the > concept of doing the dot product of two vectors of 16-bit signed numbers > into a 40 or 48 bit extended accumulator, possibly with shifting, so that > the optimizer even has a chance of knowing the intent of the coder well > enough to identify it as an opportunity for doing a classic fixed-point > DSP one-cycle-per-tap vector multiply?Many C compilers will recognize a cast for the operand of a multiply, such as: int32 x,y; int64 z; z=x*(int64)y; and generate a single 32x32 --> 64 multiply instruction. It is reasonably, though not 100% sure, that one could recognize something like: int i; int16 x[100],y[100]; int48 sum; sum=0; for(i=0;i<100;i++) sum += x[i]*(int48)y[i]; (or possibly casting to int32, but doing a 48 bit sum)> Followed, of course, by the appropriate conditional saturation > of said extended accumulator?Yes, saturation arithmetic isn't (yet) part of C.> And while you're at it, can you show how to write this code so > that it'll work both in the case of the TI C2812 processor, > which does a shift as an inherent part of the MAC (via either a > field or a special register setting - I can't remember which), > and the case of the ADSP21xx processor, which does the MAC > "straight", but then does a single-cycle barrel shift-and-saturate > when you extract the data from the MAC?Dynamic programming should be able to find this if the processor can do it and the code generator knows about it. The gcc code generator isn't so easy to figure out, but the LCC generator, as described in the book about LCC isn't so bad. It is a combinatorial optimization problem, which is one thing (besides DSP) that computers are pretty good at doing. The code generator is given a set of instruction sequences, each with a weight, that can be put together to do what needs to be done. All it needs to do, then, is find the combination of such that, when put together, minimizes the sum of the weights. If there is a template for z += x*y, then it will be used when needed. If not, separate x*y and += templates will be used. Special purpose registers comlicated things a little bit, but not so much that it can't be done.> Forget making the compiler recognize it -- can you make it > happen in C so that _I_ can recognize it?-- glen
Reply by ●March 7, 20122012-03-07
On Wed, 07 Mar 2012 03:30:57 +0000, glen herrmannsfeldt wrote:> Tim Wescott <tim@seemywebsite.com> wrote: > > (big snip) > >> So, can you give an example of C code that adequately expresses the >> concept of doing the dot product of two vectors of 16-bit signed >> numbers into a 40 or 48 bit extended accumulator, possibly with >> shifting, so that the optimizer even has a chance of knowing the intent >> of the coder well enough to identify it as an opportunity for doing a >> classic fixed-point DSP one-cycle-per-tap vector multiply? > > Many C compilers will recognize a cast for the operand of a multiply, > such as: > > int32 x,y; > int64 z; > z=x*(int64)y; > > and generate a single 32x32 --> 64 multiply instruction. > > It is reasonably, though not 100% sure, that one could recognize > something like: > > int i; > int16 x[100],y[100]; > int48 sum; > > sum=0; > for(i=0;i<100;i++) sum += x[i]*(int48)y[i]; > > (or possibly casting to int32, but doing a 48 bit sum) > >> Followed, of course, by the appropriate conditional saturation of said >> extended accumulator? > > Yes, saturation arithmetic isn't (yet) part of C. > >> And while you're at it, can you show how to write this code so that >> it'll work both in the case of the TI C2812 processor, which does a >> shift as an inherent part of the MAC (via either a field or a special >> register setting - I can't remember which), and the case of the >> ADSP21xx processor, which does the MAC "straight", but then does a >> single-cycle barrel shift-and-saturate when you extract the data from >> the MAC? > > Dynamic programming should be able to find this if the processor can do > it and the code generator knows about it. The gcc code generator isn't > so easy to figure out, but the LCC generator, as described in the book > about LCC isn't so bad. > > It is a combinatorial optimization problem, which is one thing (besides > DSP) that computers are pretty good at doing. > > The code generator is given a set of instruction sequences, each with a > weight, that can be put together to do what needs to be done. All it > needs to do, then, is find the combination of such that, when put > together, minimizes the sum of the weights. > > If there is a template for z += x*y, then it will be used when needed. > If not, separate x*y and += templates will be used. > > Special purpose registers comlicated things a little bit, but not so > much that it can't be done. > >> Forget making the compiler recognize it -- can you make it happen in C >> so that _I_ can recognize it?I suppose if one took your sum and followed it by a shift, and a saturate, and a cast back to 16 bits, one could hope that the compiler would recognize it and optimize it. There would be subtleties in getting the C to exactly match the behavior of the machine code -- I think it'd take the same level of knowledge and be less time-consuming to just write the assembly for the thing than to try to cook up some C and successfully surround it with enough comments that future developers would understand that it is read-only. -- My liberal friends think I'm a conservative kook. My conservative friends think I'm a liberal kook. Why am I not happy that they have found common ground? Tim Wescott, Communications, Control, Circuits & Software http://www.wescottdesign.com
Reply by ●March 7, 20122012-03-07
Tim Wescott wrote:> Forget making the compiler recognize it -- can you make it happen in C so > that _I_ can recognize it?Good point. At the application implementation level a choice for every application needs to be made on every C program. Is the source a description of algorithm that is being implemented or is it the description of of the implementation of the algorithm? A compiler can be made to recognize C source and meaning and map the source on to the executable code on a processor. The "C meaning" part is a problem in implementation of a compiler. In our case for application portability we try to be consistent producing the same execution results for compilers targeting the same application area's. Essentially we have made a choice that C is an algorithm descriptive language for our customers and the implementation details are buried in the compiler. This choice also allows access to the low level details of the processor by using short 1 or 2 operator C statements. Short statements with the functionally of individual processor instructions essentially all produce one and two instruction outcomes or are combined with nearest neighbours. In our tools this coding style is preferable to embedded assembler for detailed implementation because the compiler can still make context choices of data flow and control flow. The issues are more language representation than compiler technology. The current state of compiler technology can do that very well with a selection of compiler goals, code size, execution speed or minimal power consumption. Walter..






