Hello experts:

I am working on porting some code from 62x to 64x processors, and there are a few things on which I would like your opinion. Most of these pertain to the SIMD instructions available for c64x.

1. In a thread of emails regarding the correct set of steps to write optimized code, the following steps were listed by an expert:
* Natural C code: textbook implementation
* Optimized C code: C code with advanced loop-level optimizations and pragmas
* Intrinsic C code: C code with intrinsics
* Serial assembly code: linear sequence of assembly instructions
* Partitioned serial assembly code: code with .1's and .2's to guide the optimizer
* Hand code: if needed.
My question with c64x porting is: are there ways to write natural or optimized C (with no intrinsics) that make the compiler use packed instructions? If there aren't any, is it good to start with intrinsic C code as the first step of development? [Assuming the application performs better using packed instructions]

2. I found an interesting thing when trying to load unaligned double words: the iteration interval (ii) was always higher if I specified the unaligned double-word load explicitly, compared to specifying two unaligned word loads. Checking the list file (*.lst) showed that both listings had an LDNDW!
Related question: one of the compiler feedback messages (in the higher ii case) is "Inserted to break DPG cycle". What might this mean? I figured out from the Web that DPG is a Precedence Graph, but I don't understand the context, and this one is not documented in spru187 (Optimizing C Compiler User's Guide).

3. Are there any recommendations on the usage of unaligned loads and aligned loads?

4. I am curious to know if there are any general guidelines on using packed instructions, particularly cases where the compiler used packed instructions just by looking at the processor specification, and apps where packed instructions fared worse than unpacked.
Thanks in advance for sharing your views and ideas,
Regards
ka
c64x software pipeline
Started by ●August 27, 2003
Reply by ●August 27, 2003
My few cents on this....
Using #pragmas to align the data and to pass on useful information about a
loop, such as its minimum trip count, can indeed help the compiler select
packed data processing. Looking at what the compiler gives you back, you can
always switch to intrinsics to see if there is any further improvement. In
addition, the true power of the C64x comes out with data types such as unsigned
char, unsigned short, etc.
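
As a rough illustration of the hints mentioned above, here is a minimal sketch of a plain C loop annotated for the TI compiler. The function name, the alignment of the arrays, and the trip-count limits are assumptions made for this example and are not taken from the thread. `MUST_ITERATE` and `_nassert` are TI compiler features; they are guarded so the same file still builds with other compilers.

```c
#include <assert.h>

/* Dot product of two 16-bit arrays, written as plain C with no intrinsics.
 * Assumptions for this sketch: n is at least 8 and a multiple of 4, and
 * (on the TI build) both arrays are 8-byte aligned. */
int dot_product(const short *restrict a, const short *restrict b, int n)
{
    int sum = 0;
    int i;

    /* Check the trip-count promise that is given to the compiler below. */
    assert(n >= 8 && n % 4 == 0);

#ifdef _TMS320C6X
    /* TI-only hints: alignment of the inputs and loop trip count
     * (minimum 8, no stated maximum, multiple of 4). */
    _nassert(((int)a & 7) == 0);
    _nassert(((int)b & 7) == 0);
    #pragma MUST_ITERATE(8, , 4)
#endif
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

Whether the compiler then picks packed loads and multiplies for such a loop depends on the compiler version and options, so the software-pipelining feedback in the assembly output is still the thing to check.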
http://cs-tr.cs.rice.edu/Dienst/Repository/2.0/Body/ncstrl.rice_cs/TR02-410/postscript
The paper talks about a scheduling algorithm implementation on the C6200;
might give some pointers as to why the DPG cycle was broken....
To answer #3 of your questions, look at page 6-40 of the Programmer's
Guide, "When to use Non-aligned memory accesses"; it's a trade-off between
bandwidth and the amount of vectorization you require.
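
To make the aligned versus non-aligned choice concrete, here is a small sketch of a 32-bit load from an address that may not be word aligned. The helper names are made up for this example. On a C64x build it uses the TI intrinsic `_mem4_const`; elsewhere it falls back to `memcpy` as a portable stand-in. As far as I know, the C64x can issue fewer non-aligned accesses per cycle than aligned ones, which is the bandwidth side of the trade-off.

```c
#include <assert.h>
#include <string.h>

/* Load 32 bits from an address that is not necessarily 4-byte aligned. */
unsigned load32_unaligned(const unsigned char *p)
{
#ifdef _TMS320C6400
    return _mem4_const(p);       /* non-aligned word load on C64x */
#else
    unsigned v;
    assert(sizeof v == 4);       /* this sketch assumes a 32-bit unsigned */
    memcpy(&v, p, sizeof v);     /* portable stand-in for the intrinsic */
    return v;
#endif
}

/* Sum the four bytes packed in a 32-bit word (independent of byte order). */
unsigned sum4_bytes(unsigned w)
{
    return (w & 0xFFu) + ((w >> 8) & 0xFFu) +
           ((w >> 16) & 0xFFu) + ((w >> 24) & 0xFFu);
}
```

If the data can be guaranteed aligned, the aligned counterpart (`_amem4_const` on the TI compiler) is the natural replacement inside the same helper.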
Packed instructions are generally used in efficient & optimized
implementations of multimedia algorithms. An implementation benefits
from such instructions if the underlying algorithm is suited for packed
data processing.
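
For readers new to packed data processing, the following sketch shows what one packed operation does: two independent 16-bit additions carried out inside a single 32-bit word. The function names are hypothetical. On a TI build it maps to the `_add2` intrinsic; the fallback branch spells out the same arithmetic in portable C.

```c
#include <assert.h>

/* Add two pairs of 16-bit values held in 32-bit words.
 * No carry passes from the low half into the high half. */
unsigned add2_packed(unsigned a, unsigned b)
{
#ifdef _TMS320C6X
    return (unsigned)_add2((int)a, (int)b);
#else
    unsigned lo = (a + b) & 0x0000FFFFu;
    unsigned hi = (((a >> 16) + (b >> 16)) & 0x0000FFFFu) << 16;
    return hi | lo;
#endif
}

/* Small self-check: the low halves wrap around without touching the high halves. */
void add2_self_check(void)
{
    assert(add2_packed(0x0001FFFFu, 0x00010001u) == 0x00020000u);
}
```

An algorithm suits this style when it performs the same operation on many small elements (8-bit or 16-bit) that sit next to each other in memory.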
cheers,
indrajit
Anand K <a...@yahoo.com> wrote: Hello experts:
Reply by ●August 27, 2003
ka,
I will try to answer #1 and combine it with some of my personal
philosophy.
If you have the luxury... After I have designed my code [and possibly
performed some prototyping and proof of concept], I get it working in C. I
preserve and update the C code to run comparison tests in the event that I need
to add a significant amount of asm code [which equals a great opportunity for
errors]. I always think that I know where the 'bottlenecks' or
'code bloat' are located, but I try to benchmark/profile my code
[ideally a debug, max speed and min size build - looks like the CCS 2.20
profiler does this for me] for an objective opinion. So often the
80/20 rule [or some slight variation] will hold true - there are obviously
exceptions. By doing this, I can now assess my situation.
Is my "problem" code size or speed? [It's usually some of each.]
Observing the code generated for a debug and an optimized build will give
some idea of what/how the compiler optimizes - although it seems to have
the ability to generate code that I do not recognize due to its optimization
techniques.
I take the "tallest pole or two" and start addressing it, check my results
and continue the process. If I get into a situation that requires severe
optimization, I always consider the product life cycle - how is someone going to
test and maintain this code??
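
The habit described above of keeping the plain C version for comparison tests can be sketched as follows. All names are hypothetical, and the "optimized" routine here is just a hand-unrolled loop standing in for whatever intrinsic or asm version is actually written.

```c
#include <assert.h>

/* Reference version: kept unchanged as the known-good answer. */
int sum_abs_ref(const short *x, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += (x[i] < 0) ? -x[i] : x[i];
    return sum;
}

/* Stand-in for the tuned version: unrolled by two, so n must be even. */
int sum_abs_opt(const short *x, int n)
{
    int sum0 = 0, sum1 = 0;
    int i;
    assert(n % 2 == 0);
    for (i = 0; i < n; i += 2) {
        sum0 += (x[i] < 0) ? -x[i] : x[i];
        sum1 += (x[i + 1] < 0) ? -x[i + 1] : x[i + 1];
    }
    return sum0 + sum1;
}

/* Returns 1 when both versions give the same result for this input. */
int versions_agree(const short *x, int n)
{
    return sum_abs_ref(x, n) == sum_abs_opt(x, n);
}
```

Running `versions_agree` over a set of test vectors after every optimization step gives whoever maintains the code later a quick way to catch regressions.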
okay, I am getting off of my soapbox...
#2. A couple of useful documents are:
Code Coverage and Multi-event Profiler User's Guide (Rev. A), spru624
Using Code Coverage & Multi-event Profiler for Robustness & Efficiency Analysis, spra868
CCS 2.20 has a fairly nice tool for profiling/tuning your code.
#3. My preference is to ignore alignment [let the compiler do it] until I
have correctly working code.
#4. I do not know a general answer to this... It is my belief [but
since I have never been "cramped for memory" in my limited c64x experience, I do
not know] that the compiler will use the best instruction sequence that it can,
regardless of instruction type [as long as you tell the compiler that it is a
c64] :-)
good luck,
mikedunn