I think a prefetch instruction should be supported in C64x+

Started by ●January 3, 2009

Hi, all
In a DSP, you can get cache misses even when you access memory
sequentially. DMA can transfer data from external memory to internal
memory, but sometimes the access pattern is not easy to handle with DMA.
I think it would help if a prefetch instruction existed.
What do you think?

Jogging
Reply by ●January 4, 2009

<joggingsong@gmail.com> wrote in message
news:3cf9508e-6524-47fb-9209-066fdc980f90@v13g2000vbb.googlegroups.com...
> Hi, all
> In a DSP, you can get cache misses even when you access memory
> sequentially. DMA can transfer data from external memory to internal
> memory, but sometimes the access pattern is not easy to handle with
> DMA. I think it would help if a prefetch instruction existed.
> What do you think?

I think that you don't have a clue and are just throwing smart words
around without understanding their meaning. RTFM.

VLV
Reply by ●January 4, 2009

On Jan 4, 6:35 pm, "Vladimir Vassilevsky" <antispam_bo...@hotmail.com>
wrote:
> I think that you don't have a clue and are just throwing smart words
> around without understanding their meaning. RTFM.

Anyway, thanks.

Oh, for example: I scan an image in raster order. When the memory is not
in cache, a cache line fill is needed. If a prefetch instruction existed,
I could work on the current cache-line-sized block of memory and, at the
same time, prefetch the next cache-line-sized block into the cache.

Maybe this example could use DMA to transfer the data, but in image
processing it is sometimes difficult to set up the transfers with DMA.

Jogging
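Jogging's raster-scan idea can be approximated in software even without a
prefetch instruction. Below is a minimal sketch, with assumptions: the
function name is made up, and the 64-byte line size is a placeholder (L1D
line sizes differ across C6x devices). While summing the bytes of the
current cache line, it touches one byte of the next line so that line's
fill can overlap the remaining work. Whether the miss really overlaps
depends on the memory system (see Vladimir's objection downthread), so
benchmark before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed L1D line size; 32 or 64 bytes depending on the device. */
enum { CACHE_LINE = 64 };

/* volatile sink so the compiler cannot delete the touch-loads */
static volatile uint8_t sink;

/* Hypothetical helper: sum one image row, touching the next cache line
   ahead of use so its fill overlaps the adds on the current line. */
uint32_t sum_row(const uint8_t *row, size_t width)
{
    uint32_t acc = 0;
    size_t i;

    for (i = 0; i < width; i++) {
        if (i % CACHE_LINE == 0 && i + CACHE_LINE < width)
            sink = row[i + CACHE_LINE];  /* touch the next line early */
        acc += row[i];                   /* work on the current line */
    }
    return acc;
}
```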
Reply by ●January 4, 2009

joggingsong@gmail.com wrote:
> I think it would help if a prefetch instruction existed.
> What do you think?

That would be great! I miss a real prefetch instruction. I've been hit by
the same problem, e.g. accessing chunks of data so small that DMA is not
worth it, in a very predictable pattern. I know early on which addresses
I'm going to access in the future. Real "background" prefetching could
give me quite a significant speed boost.

If you only write to the memory that you want to have prefetched, you can
work around it a little: do a memory read on the address. This will
trigger a cache miss, but it will also move the data into the L1D, and
your subsequent accesses will be a *lot* faster. If you never read from
the chunk of memory, the data will never end up in the L1D, because L1D
is read-allocate. You can get around a 25% speed-up that way.

The least intrusive way to prefetch via loads goes like this:

  void dostuff (int *data, int n)
  {
    int dummy = 0;
    volatile int dummy_writer;
    int i;

    for (i = 0; i < n; i++)
    {
      dummy += data[i];   /* pulls the line into L1D */
      /* ... write something to data[i] ... */
    }
    dummy_writer = dummy; /* one stack write; keeps the loads alive */
  }

This uses the fewest possible resources. The write to dummy_writer
becomes one memory access on the stack and does not get optimized out by
CCS. Six of the eight execution pipelines can do addition, so the
instruction scheduler still has a lot of freedom.

If you know that you access fewer than 128 bytes in a row most of the
time, it might be a good idea to set the L2 cache to freeze, or to
twiddle the MAR register bits (the latter recommended, as it's more
multithread-friendly). One L2 cache line is 128 bytes long, but even if
you only write 4 bytes, the L2 cache will slurp an entire cache line from
external memory, and that takes a while.

You will gain a lot if you just read a 32-byte cache line into the L1D
via the read-allocate trick and bypass the L2. But these are all
rule-of-thumb methods. You have to benchmark your code and measure
whether L1D preloading works for you (after all, it will kick other
stuff out of the cache; it may do more harm than good).

Hope it helps,

Nils Pipenbrinck
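One way to make the read trick reusable is to factor it out, as in the
sketch below. This is an assumption-laden illustration, not TI code:
preload() and fill_words() are hypothetical names, and the 32-byte L1D
line size is taken from the post above (line sizes differ across C6x
devices). The volatile store plays the same role as dummy_writer: it
keeps the compiler from deleting the otherwise dead loads.

```c
#include <stddef.h>

/* Hypothetical helper: read one int per (assumed) 32-byte L1D line so
   the whole buffer lands in cache before the write pass. */
static void preload(const int *data, size_t n_words)
{
    volatile int dummy_writer;
    int dummy = 0;
    size_t i;

    for (i = 0; i < n_words; i += 32 / sizeof(int))
        dummy += data[i];      /* one load per cache line */
    dummy_writer = dummy;      /* single stack write, not optimized out */
    (void)dummy_writer;
}

/* Example write pass: with the lines already in L1D, each store hits
   instead of taking a write-miss stall. */
void fill_words(int *data, size_t n_words, int value)
{
    size_t i;

    preload(data, n_words);
    for (i = 0; i < n_words; i++)
        data[i] = value;
}
```

As the post says, this is a rule-of-thumb method: measure it, because the
preload also evicts whatever was in those cache lines before.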
Reply by ●January 4, 2009

joggingsong@gmail.com wrote:
> Oh, for example: I scan an image in raster order. When the memory is
> not in cache, a cache line fill is needed. If a prefetch instruction
> existed, I could work on the current cache-line-sized block of memory
> and, at the same time, prefetch the next cache-line-sized block into
> the cache.

You cannot fetch one cache line and use another cache line at the same
moment in time. Doing that simultaneously would require a significant
complication of the cache logic. Since the C64x doesn't have the means
for that, a prefetch instruction would be useless.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by ●January 4, 2009

Nils wrote:
> Do a memory read on the address. This will trigger a cache miss, but
> it will also move the data into the L1D, and your subsequent accesses
> will be a *lot* faster.

But at first the CPU will stall until the read operation is complete,
i.e. until the cache line has been fetched into L1D. So where is the
point?

> If you never read from the chunk of memory, the data will never end up
> in the L1D, because it's read-allocate.
>
> You can get around a 25% speed-up that way.

How?

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by ●January 4, 2009

Vladimir Vassilevsky wrote:
> But at first the CPU will stall until the read operation is complete,
> i.e. until the cache line has been fetched into L1D. So where is the
> point?

You get a cache-line fetch stall once. That's true. In fact, at the
start of an array you get two cache-line fetches: one for the L1D and
one for the L2.

However, after the cache-line fetches the data is in L1D, so subsequent
writes will hit the L1D and thus run at 1 cycle/access. If the data
weren't in the L1D, each write would take around 10 cycles: 1 cycle to
execute and 9 cycles of stall. Nothing will execute in parallel while
the DSP is waiting for the memory to be written. Even multi-cycle
multiplications will stall.

Assume that you write 32 bytes, cache-aligned, in 32-bit accesses. With
pre-loading you get L2-miss + L1-miss + 8 cycles. Without pre-loading
you get L2-miss + 80 cycles. That's 72 cycles saved for the cost of an
L1D cache-line fetch (which I measured to be roughly 21 cycles).

The inner loop takes some cycles as well. In my use cases (vector
graphic rendering) a typical inner-loop timing is around 8 cycles. Let's
take this and put all the numbers together.

Real-world example for 8 iterations (one cache line):

With preloading:

  cycles = 21 (L1D miss) + 8 (inner loop) * 8 (iterations) = 85

Without preloading:

  cycles = 21 (L1D miss) + (8 (inner loop) + 9 (stall)) * 8 (iterations) = 157

That's about a 46% improvement for this case. The L2 cache miss has been
left out of the estimates because it happens in both cases.

Of course, all these timings depend on the available memory bandwidth,
the speed and latency of the external RAM, etc. The numbers I've used
here were measured on a DM6446 eval board with DDR2 and video-out
enabled, i.e. a typical example.

Hope it's a bit clearer now.

Nils
Reply by ●January 4, 2009

Ah, I messed it up.

Correction: the "without preload" case does not take an L1D miss, so the
estimates are in fact:

With preloading:

  cycles = 21 (L1D miss) + 8 (inner loop) * 8 (iterations) = 85

Without preloading:

  cycles = (8 (inner loop) + 9 (stall)) * 8 (iterations) = 136

That's roughly a 37% improvement for this case.

Sorry,

Nils
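For readers who want to plug in their own loop timings, the estimate can
be written as a two-line model. This is just the arithmetic, not measured
code; the function names are made up here, and the constants fed in are
Nils's DM6446 figures from upthread (21-cycle L1D line fill, 8-cycle
inner loop, 9-cycle write-miss stall, 8 iterations per 32-byte line).

```c
/* With preloading: pay one L1D line fill, then every access hits. */
int cycles_with_preload(int l1d_fill, int inner_loop, int iterations)
{
    return l1d_fill + inner_loop * iterations;
}

/* Without preloading: every write-miss stalls the pipeline. */
int cycles_without_preload(int inner_loop, int stall, int iterations)
{
    return (inner_loop + stall) * iterations;
}
```

Swapping in your own measured stall and loop timings shows quickly
whether the trick is worth it for a given device and memory setup.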
Reply by ●January 4, 2009

Nils wrote:
> You get a cache-line fetch stall once. That's true. In fact, at the
> start of an array you get two cache-line fetches: one for the L1D and
> one for the L2.
> [...]
> If the data weren't in the L1D, each write would take around 10
> cycles.

So there is a gain only due to the fact that cache lines are allocated
on reads, not on writes. But then the initial cache miss will result in
an even longer stall, because the dirty cache line has to be flushed
before the new one is fetched.

> Assume that you write 32 bytes, cache-aligned, in 32-bit accesses.
> With pre-loading you get L2-miss + L1-miss + 8 cycles.

SDRAM miss + L2 miss + L1 miss + 8 cycles to flush, plus about the same
number of cycles to fetch.

> Without pre-loading you get L2-miss + 80 cycles. That's 72 cycles
> saved for the cost of an L1D cache-line fetch (which I measured to be
> roughly 21 cycles).

This is not so obvious. It could be a gain or a loss, depending on the
particular access pattern.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
Reply by ●January 4, 2009

Nils wrote:
> Correction: the "without preload" case does not take an L1D miss, so
> the estimates are in fact:
> [...]
> That's roughly a 37% improvement for this case.

There you go :) Is the questionable 37% improvement worth black magic
with preloads? Would this optimization still be useful if the program is
modified, linked at different addresses, etc.? The whole purpose of a
cache is to liberate the programmer from manipulating the fast and the
slow memories by hand.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com






