DSPRelated.com
Forums

Thoughts on the C/Assembly Debate

Started by Randy Yates July 28, 2007
There may come a day when a compiled language can consistently beat a
good assembly programmer, but that language will never be C. It will
be a new language that has a radically new, high-level way of specifying the
algorithm to the compiler, and that has complete control over the generated
code, including data structures.

To make my point, take a simple example. 

Consider a fictitious processor whose architecture has an indirect
addressing mode. That is, an address can be loaded into a register, and
that register can then be used to generate the address for data used in
other instructions.

Now let's also posit that the processor can post-increment-by-one the
address registers used for indirect mode for free, but post-increment-by-two
(or by other, larger values) is not free - i.e., it costs more
cycles. 

Now let's say the algorithm to be coded is to sum the real and imaginary
parts of a complex array. 

Our programmer, being ignorant of the processor architecture details, 
writes the following code:

typedef struct
{
  int16_t real;
  int16_t imag;
} COMPLEX_T;

#define N 1024

COMPLEX_T data[N];

COMPLEX_T MyAlgorithm(void)
{
  uint32_t n;
  COMPLEX_T sum;

  sum.real = 0;
  sum.imag = 0;
  for (n = 0; n < N; n++)
  {
    sum.real += data[n].real;
    sum.imag += data[n].imag;
  }

  return sum;
}

Well, this is not optimized! Never can be, because C can't restructure
the "data[]" array to put the real and imaginary parts into individual,
contiguous arrays.
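For concreteness, here is a sketch of the manual restructuring being
discussed - a structure-of-arrays layout with the real and imaginary parts
in separate, contiguous arrays. The names (data_real, data_imag,
MyAlgorithmSoA) are invented for illustration:

```c
#include <stdint.h>

#define N 1024

/* Structure-of-arrays layout: each component is its own contiguous
   int16_t array, so every access is a stride-one walk -- exactly what
   the fictitious free post-increment-by-one mode wants. */
int16_t data_real[N];   /* all real parts, contiguous */
int16_t data_imag[N];   /* all imaginary parts, contiguous */

typedef struct
{
  int16_t real;
  int16_t imag;
} COMPLEX_T;

COMPLEX_T MyAlgorithmSoA(void)
{
  COMPLEX_T sum = { 0, 0 };
  uint32_t n;

  for (n = 0; n < N; n++)
    sum.real += data_real[n];   /* unit stride */
  for (n = 0; n < N; n++)
    sum.imag += data_imag[n];   /* unit stride */

  return sum;
}
```

On the fictitious processor, each of the two loops then walks memory with
stride one, so the free post-increment-by-one applies throughout.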

Also note that if the programmer were to realize this is a problem and
manually reorganize the data structure, then the resulting code
becomes SPECIFIC TO THE PROCESSOR. Targeting the code for another
processor using this new data structure may not be optimal.

Also note that requiring the programmer to know the processor architecture
defeats, at least in part, the point of a high-level-language since time
is spent learning things that ideally would be handled by the compiler.

Note that this is absolutely irrefutable. The C language (and every other
computer language that I know of) is not capable of optimizing this BECAUSE
IT DOESN'T HAVE CONTROL OVER DATA ALIGNMENT AND STRUCTURE. 
-- 
%  Randy Yates                  % "She tells me that she likes me very much,
%% Fuquay-Varina, NC            %     but when I try to touch, she makes it
%%% 919-577-9882                %                            all too clear."
%%%% <yates@ieee.org>           %        'Yours Truly, 2095', *Time*, ELO  
http://home.earthlink.net/~yatescr
Check Intel's marketing for its C/C++ compiler.
I'm sure in there somewhere is how it detects
SOA/AOS requirements and reorders data, and
vectors it all using SSE2/3/4 for best effect.
I'm almost as sure that MS/VS will do the same.
All this even without #pragma hints.  Since so
few will ever know asm well enough to beat a
current compiler, guru code (potentially) being
better than compiler code is pretty much moot.
Besides, this is more about knowing how to order
data, not a compiler/assembler thing.  Had to yawn

RY- [Sat, 28 Jul 2007 10:18:04 -0400]:
 >Now let's say the algorithm to be coded is to sum the real and imaginary
 >parts of a complex array. 
 >typedef struct
 >{
 >  int16_t real;
 >  int16_t imag;
 >} COMPLEX_T;
 >#define N 1024;
 >COMPLEX_T data[N];
 >Well, this is not optimized! Never can be, because C can't restructure
 >the "data[]" array to put the real and imaginary parts into individual,
 >continuous arrays.

-- 
 40th Floor - Software  @  http://40th.com/
  iplay.40th.com   iPlay advanced audio player
Randy Yates wrote:
> Also note that if the programmer were to realize this is a problem and
> manually reorganize the data structure, then the resulting code
> becomes SPECIFIC TO THE PROCESSOR. Targeting the code for another
> processor using this new data structure may not be optimal.
>
> Also note that requiring the programmer to know the processor architecture
> defeats, at least in part, the point of a high-level-language since time
> is spent learning things that ideally would be handled by the compiler.
>
> Note that this is absolutely irrefutable. The C language (and every other
> computer language that I know of) is not capable of optimizing this BECAUSE
> IT DOESN'T HAVE CONTROL OVER DATA ALIGNMENT AND STRUCTURE.
Randy, I think that this debate can actually be thought of in a slightly
more general way. That is, rather than being the "C vs assembly debate,"
I think you could think of this as the "optimization vs portability
debate." This would encompass not only C vs assembly, but also things like
drivers vs "integrated" code, or an OS vs simple scheduling (ISRs).

In general you can get more optimized code by keeping things simpler. I
think that the simplest and most optimized you can get is keeping
everything in assembly and handling everything through interrupts and a
background loop. However, as complexity continues to grow we need to have
things like C code, OSes, drivers, etc., or else we'd be forever re-coding
the same things.

I think that it is not an easy task to maintain a healthy equilibrium
between portability and optimization. At one extreme, even the simplest of
tasks goes through 10 layers of software just to read a register. At the
other extreme, you might have assembly code that needs to be completely
rewritten in order to move to a different processor. I think maintaining
this equilibrium is one of the toughest challenges facing programmers,
particularly in the embedded world where cycles and memory are at a
premium.

Brad
Hi there,

[excessive Newsgroups line shortened]

Randy Yates wrote:
> Now let's also posit that the processor can post-increment-by-one the
> address registers used for indirect mode for free, but post-increment
> by-two (or other higher values) is not free - i.e., it cost more
> cycles.
[snip]
> Our programmer, being ignorant of the processor architecture details,
> writes the following code:
[snip]
> Well, this is not optimized! Never can be, because C can't restructure
> the "data[]" array to put the real and imaginary parts into individual,
> continuous arrays.
But the programmer can (and, if that's the bottleneck, will) do that.
With C, he still has the advantage that he can take the final code and
run it unmodified on his fast, well-equipped PC. So he can make huge
test suites of vectors to sum up, and let them run automatically and
quickly. I'm doing that all day long.

Sure, I could use a simulator for my target code. Ours does a mere 10k
instructions per second. But I've got other things to do than wait a
quarter of an hour until it has finished clearing the .bss, let alone
decoded my 3 MB MPEG file.
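A minimal sketch of the kind of host-side check described here - the
summation routine from the original post, compiled unmodified on a PC and
driven with a known vector. The test vector and the selftest() helper are
invented for illustration:

```c
#include <stdint.h>

#define N 1024

typedef struct
{
  int16_t real;
  int16_t imag;
} COMPLEX_T;

COMPLEX_T data[N];

/* The routine under test, exactly as in the original post. */
COMPLEX_T MyAlgorithm(void)
{
  COMPLEX_T sum = { 0, 0 };
  uint32_t n;

  for (n = 0; n < N; n++)
  {
    sum.real += data[n].real;
    sum.imag += data[n].imag;
  }
  return sum;
}

/* Host-side check: fill a known vector and compare against the
   closed-form answer.  Returns 0 on pass, nonzero on fail. */
int selftest(void)
{
  int i;
  for (i = 0; i < N; i++)
  {
    data[i].real = 1;
    data[i].imag = -1;
  }
  COMPLEX_T s = MyAlgorithm();
  return (s.real == (int16_t)N && s.imag == (int16_t)-N) ? 0 : 1;
}
```

The same source then runs on the target; only the harness around it
changes.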
> Also note that if the programmer were to realize this is a problem and
> manually reorganize the data structure, then the resulting code
> becomes SPECIFIC TO THE PROCESSOR. Targeting the code for another
> processor using this new data structure may not be optimal.
It may not be optimal for the new processor, but at least it's correct
in the first place. If it's not fast enough, you can still (or: again)
modify it for the new one. But if it is fast enough, you don't have to
touch it. With assembly, you'd have to rewrite and test it from scratch.

De facto, most code is a little processor-specific anyway - assumptions
about endianness, sizeof(int), or even small things such as a
floating-point multiply being faster than a divide, or a 'short'
variable being faster (chip with 16-bit bus) or slower (x86) than an
'int' variable.
> Also note that requiring the programmer to know the processor architecture
> defeats, at least in part, the point of a high-level-language since time
> is spent learning things that ideally would be handled by the compiler.
Actually, every high-level language programmer *should* know some basics
about the architecture he's targeting.

Stefan
Randy Yates wrote:
> Now let's also posit that the processor can post-increment-by-one the
> address registers used for indirect mode for free, but post-increment
> by-two (or other higher values) is not free - i.e., it cost more
> cycles.
> typedef struct
> {
>   int16_t real;
>   int16_t imag;
> } COMPLEX_T;
> COMPLEX_T MyAlgorithm(void)
> {
>   uint32_t n;
>   COMPLEX_T sum;
>
>   sum.real = 0;
>   sum.imag = 0;
>   for (n = 0; n < N; n++)
>   {
>     sum.real += data[n].real;
>     sum.imag += data[n].imag;
>   }
>
>   return sum;
> }
>
> Well, this is not optimized! Never can be, because C can't restructure
> the "data[]" array to put the real and imaginary parts into individual,
> continuous arrays.
I think a clever C compiler could optimize the code to use your cheap
increment-by-one instruction. Since n is a local variable with no
pointers to it, and since it is never written inside the loop, the
compiler is free to notice that you are really just accessing memory in
a totally sequential way and transform your loop to this equivalent:

int16_t *p;
uint32_t n, two_N;

p = &( data[0].real );
two_N = 2 * N;
n = 0;
while (n < two_N)
{
  sum.real += p[n++];
  sum.imag += p[n++];
}

Note that it would be pretty bad coding style for a human programmer to
write this in C, but since the compiler is in control of the
implementation details (like the way structs are padded and aligned)
it's perfectly fine for it to make transformations like this. The only
negative I can see is that it requires one more register than the
version that would use parallel arrays (because you have to keep two
running totals instead of one). That should not be a big limitation in
most cases.

I'm not saying compilers actually do this (although they may). I'm just
disagreeing with your claim that the generated code cannot ever be
optimized to use the post-increment operator. C doesn't *need* to
restructure the array to optimize the code!

Also, as someone else pointed out, you are blurring the differences
between architecture-specific code and portable code with the
differences between C and assembly. It is possible (and sometimes
beneficial) to write C code with a specific platform's performance in
mind. (And if you want, you can even write that code so that it will
still run on other architectures.)

In the end, what it comes down to is a compromise between programmers'
time and performance of the program. Since assembly gives you absolute
control over the instructions that are run, you have more flexibility to
squeeze out a tiny bit of extra performance in cases where the compiler
doesn't do the right thing.
Compilers these days do the right thing a LOT of the time, so this would
be rare, but I can't deny that it must happen now and then that the
compiler generates code that a very knowledgeable person could improve
upon. The question is whether it's worth the time necessary to do it.
You generally have to have a very serious need for performance for it to
be worth it.

- Logan
hel@40th.com wrote:
> Check Intel's marketing for its C/C++ compiler.
> I'm sure in there somewhere is how it detects
> SOA/AOS requirements and reorders data, and
> vectors it all using SSE2/3/4 for best effect.
> I'm almost as sure that MS/VS will do the same.
If you checked the marketing, you might have seen that Intel C++ implements complex data type auto-vectorization for SSE3, but requires C99 syntax for that. Since you are cross-posting to c.l.f, you might consider that some people would prefer to use Fortran rather than less optimal asm hacks. Sorry if you are put off by flashbacks to the days when people called Fortran functions from COBOL rather than do asm. Beats me why people would rather use asm than embed C99 in C++ or C89. Don't know whether you consider the <?intrin.h> stuff to be assembly.
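For reference, a sketch of the C99 complex syntax being referred to:
with <complex.h> the summation is written over a native complex type,
giving the compiler the whole-type view it can use for (e.g.) SSE3
auto-vectorization. Whether a given compiler actually vectorizes this
depends on the compiler and flags, and csum is an invented name:

```c
#include <complex.h>

/* Sum an array of C99 single-precision complex values.  The complex
   add is expressed directly, rather than as two interleaved int16_t
   accumulations the compiler must reverse-engineer. */
float complex csum(const float complex *x, int n)
{
  float complex acc = 0.0f;
  int i;
  for (i = 0; i < n; i++)
    acc += x[i];   /* one complex addition per element */
  return acc;
}
```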
On Jul 29, 2:18 am, Randy Yates <ya...@ieee.org> wrote:
> [original post quoted in full - snipped]
I have a feeling that data-flow is the answer. Check out the latest
LabView. You must think differently. Sequential code will die.
gyansorova@gmail.com wrote:
> On Jul 29, 2:18 am, Randy Yates <ya...@ieee.org> wrote:
>> [original post quoted in full - snipped]
>
> I have a feeling that data-flow is the answer. Check out the latest
> LabView. You must think differently. Sequential code will die.
Until processors routinely execute n instructions in parallel, all code
will be executed sequentially. You might as well write it that way.

Jerry
-- 
Engineering is the art of making what you want from things you can get.
On Jul 28, 3:18 pm, Randy Yates <ya...@ieee.org> wrote:

> There may come a day when a compiled language can consistently beat a
> good assembly programmer, but that language will never be C. It will
> be a new language that has a radically new, high level way of specifying the
> algorithm to the compiler, and that has complete control over the generated
> code, including data structures.
One mistake that people often make is comparing, for example, the
efficiency of C code vs. assembler code for the same problem. This is
not the right comparison. You have to compare C vs. assembler code
assuming that two equally competent programmers spend the same amount of
time writing the code.

I can write assembler code that is faster than equivalent C code. But if
it takes me one day to write the C code, it takes me a week to write the
assembler code. In that week, I could have written much faster C code.
In two months, I could write assembler code that is faster still; but in
the same two months, I could also write faster C code.
> [worked example from the original post snipped]
You're on a loser here. A decent compiler will easily figure out that
data[n].real, data[n].imag, data[n+1].real etc. are all at consecutive
addresses. So it transforms this to

int16_t *p = &data[0].real;
for (n = 0; n < N; n++)
{
  sum.real += *p++;
  sum.imag += *p++;
}
Jerry Avins wrote:
> Until processors routinely execute n instructions in parallel, all
> code will be executed sequentially. You might as well write it that way.
n had been 1 for decades. For most new hardware, it's now 2, and it's
already starting to change to 4.

- Logan