Randy Yates wrote:
> glen herrmannsfeldt wrote:
>> Another thing that goes along with pipelining is register renaming. Register move instructions then execute in zero time, as the result goes directly to or from the appropriate register.
>
> Ya' learn something new everyday around here. I'd heard of "register renaming" but never knew this was why it was used. Thanks glen!

Well, that is just the simple case. More complicated cases come with out-of-order execution, when a register might be used temporarily for something else.

    C=A*B
    F=D+E

    LE  0,A    load A
    ME  0,B    multiply by B
    STE 0,C    store in C
    LE  0,D    load D
    AE  0,E    add E
    STE 0,F    store in F

A pipelined machine can overlap the two operations or execute them out of order, but register 0 is used in both cases. The register renaming logic figures out that, because of the second load, there is no dependence between the two operations. A different hardware register is used for the add and the multiply.

-- glen
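The renaming glen describes can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the tuple encoding of the six instructions, the `kind` labels, and the rule "every write to register 0 gets a fresh physical register" are illustrative choices, not a description of any particular machine's renaming hardware.

```python
from itertools import count

# (mnemonic, architectural register, memory operand, kind)
# kind is an illustrative label: "load" overwrites the register,
# "rmw" reads then writes it, "store" only reads it.
PROGRAM = [
    ("LE", 0, "A", "load"),
    ("ME", 0, "B", "rmw"),
    ("STE", 0, "C", "store"),
    ("LE", 0, "D", "load"),
    ("AE", 0, "E", "rmw"),
    ("STE", 0, "F", "store"),
]


def rename(program):
    """Give every register write a fresh physical register (toy model)."""
    fresh = count()   # supplies physical register numbers 0, 1, 2, ...
    mapping = {}      # architectural register -> current physical register
    renamed = []
    for mnemonic, reg, mem, kind in program:
        # A load does not read the old value, so it has no register source.
        src = None if kind == "load" else mapping[reg]
        dst = None
        if kind in ("load", "rmw"):
            dst = next(fresh)
            mapping[reg] = dst
        renamed.append((mnemonic, src, dst, mem))
    return renamed


def physical_registers(instructions):
    """Collect every physical register read or written by the instructions."""
    regs = set()
    for _, src, dst, _ in instructions:
        regs.update(r for r in (src, dst) if r is not None)
    return regs


renamed = rename(PROGRAM)
multiply_chain = physical_registers(renamed[:3])
add_chain = physical_registers(renamed[3:])

for instruction in renamed:
    print(instruction)
print("independent:", multiply_chain.isdisjoint(add_chain))
```

After renaming, the multiply chain and the add chain touch disjoint sets of physical registers, which is what lets the hardware overlap or reorder them even though both were written against register 0.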
1 instruction / nanosecond plus a nanosecond for on-chip cache access,
Started by ●July 6, 2006
Reply by ●July 13, 2006
Reply by ●July 13, 2006
anon wrote:
> no single instruction runs faster, therefore, < insert IBM marketing here > ...

If you are only interested in executing a single instruction, why do you want it to execute so quickly?
Reply by ●July 13, 2006
Wilco Dijkstra wrote:
> > significantly less because of the tiny zero-operand opcodes that run so much faster than memory without pipelines. For some things we need fewer instructions than other processors and for other things we need more, but our opcodes are so tiny that our code is often among the smallest. We know what code works.
>
> Sounds unlikely. A quick glance at the instruction set shows that it doesn't support common operations such as multiplies and shifts.

It has shifts and multiply steps. So a 4x4 multiply can be faster than an 8x8 or 18x18.

> A shift by 16 bits would need 16 instructions (80 bits) and take 16 cycles.

Yes, or a loop with a few shifts. It is one of the places where we would need more cycles and bits. In other places we need fewer. And we have focused on things that happen more than 1% of the time in our code, and have not optimized low-frequency things more than we need to. The code is denser than what most people are used to. And we profile the code. People are often amazed when told that profiled code like the VLSI CAD software doesn't use floating point or even have very many multiplies. So our ideas of how code can be factored, and the assumptions that we make, are often different from the assumptions that other people use.

What's that? That's the VLSI CAD system, the OS, the Pentium compiler, the command line, the source editor, the dozen public utilities, layout tools, design rule checks, different views, descriptions of CAD process parameters, descriptions of components, registers, memory, cores, pads, complete chip description, various views, GDS, netlist, target chip compiler, target chip ROM sources, software simulator, hardware simulator with virtual 600 trace oscilloscope, complete source for all of that and more, and half of it is the documentation. That takes a third of a floppy. The code is denser than what most people are used to.
But when they see the CAD software it helps give people the idea that we are talking about code that is more dense than usual.

> You could use subroutines of course, but even then you need at least 15 bits for the call and it takes even more cycles.

No, we can do them with fewer bits with paged calls; anyway, our internal address space is smaller than 15 bits. And we do it in one memory cycle, which could be different in different kinds of memory. But you are right that subroutine calls and returns take more than 1ns, because a memory cycle will take longer. The mix of opcodes will bring total throughput down below 1G. 1G is the rate of the stack opcodes, register opcodes, and ALU opcodes, but not memory opcodes. Still, it is unusual to get a max of 1G in a tiny $1 core in scalable arrays with fast I/O and parallel processing hardware and software on the chip. So maybe the idea that not all opcodes run at 1G makes it more believable. The web pages do say 1G max.

> Other trivial things like addition (eg. x = y + z) take at least 10 instructions (80 bits) compared to just one 16 or 32-bit instruction on a typical RISC. It doesn't look very good...

It's a 5-bit opcode in our set.

> >> So even if it did reach GHz speeds (which it can't if it isn't pipelined),
> >
> > Pipelining is a form of parallelism, but staggered. It takes a lot of transistors and gets little return.
>
> It doesn't need a lot of transistors as long as you don't pipeline like the P4. A 5-stage pipeline typically provides a 4 fold speedup so it is more than worth it.

We want GHz speed without pipelines so that we can have five or ten or a hundred times as many processors at the same cost. If we didn't believe it was possible, then like you we would consider pipelines.

> > Best case gives several times speedup and worst case is many times slowdown.
>
> That's rubbish. Pipelining doesn't cause a slowdown. At worst you don't get as much of a speedup as you expected, but you do still get a speedup.

There was a wonderful article by Dr. Koopman in Embedded Systems Design a few years ago that showed how, even with the relatively shallow pipelines on older Intel chips, pipeline stalls and cache misses would, a low percentage of the time, cause a critical interrupt-driven routine to run one hundred times slower than its average execution time. He pointed out that the only solution was to pay for a processor that was 100 times faster than it needed to be on average in order to guarantee that average performance. You may say it is rubbish. For realtime embedded computing, which is the subject of this thread, pipelines and caches can have catastrophic effects. So if you need to guarantee 10MHz performance you might have to pay for tens of millions of transistors and a maximum execution rate of 1GHz. We would prefer to reduce costs by a few orders of magnitude for those real-time problems. But I can concede the point to you and say I was thinking of the biggest chips with the deepest pipelines and you must have been thinking of something in between. It doesn't matter much, because we win so much on interrupts.

> > Interrupts are notorious for instance in causing catastrophic pipeline stalls and cache misses. It makes chips hot and expensive and harder to program, and programs get bigger, and it reduces production yield.
>
> Again, this is rubbish. Modern CPUs have interrupt latencies that are tens of nano seconds, even with caches (eg. 50 ns on a 400 MHz Cortex-R4). Interrupts have a cost, but you can't live without them.

Ok, to simplify things I won't argue with your numbers and say that you have a processor that with interrupts can run the code associated with an external event in tens of nanoseconds.
(Assuming we are not yet talking about a bunch of asynchronous events and a bunch of associated interrupts.) They are essential if you have a single processor.... but .... My point was that the reason we don't have interrupts is that we have more processors, so we get the code associated with external events to run tens of picoseconds after an event, not nanoseconds. My point was that nano to pico is a thousand-to-one ratio, and that's why we prefer not to use interrupts.

I would also add that we can respond to multiple asynchronous external events in a few picoseconds each; we can begin the service of each of a dozen simultaneous events within picoseconds. With a single processor and interrupts, the average service time for those same dozen events would be half the total. If each ISR took 100ns when no other hit, then if they all hit at once the average response time would be 600ns and, what is more important, the worst case would be 1.2us. And again, I am saying that in the worst case we can launch our code for those same dozen events in a few picoseconds because we don't use interrupts; we have multiple parallel processors with event and message wake-up circuits. These make parallel processing easy, and they make realtime response to real-world events very fast, much faster than interrupts that require memory access.

> > The point here is getting up to a 1GHz instruction execution at the minimal cost in transistors by not using pipelining or caches, making a class of programs smaller, faster, and simpler as odd as that might sound.
>
> Odd doesn't quite cut it. My first thought was it was a joke :-)

I can tell from the tone of your comments that it seems like preposterous nonsense to you, but perhaps it might sound more real after you have read these responses. Many people have that reaction.
Some are people who have been using Forth on some other processor and might see a 10,000 times speedup per node over the processor that they are using now. That is because it is not just our clock rate: our primitives take 1 cycle, not a hundred like those on some virtual Forth machine implementations, and many of their subroutines in memory disappear because we have those as our 5-bit opcodes.

Another group of people who have trouble with our idea are those who are used to all the chips that use register machine design, pipelining, and on-chip caches to get speeds like 1G, and who haven't seen little non-pipelined, non-cached chips doing it before.

It takes a while for some people to realize what combination of things we have. Instead of 50 or 100 cycles to execute a Forth primitive that is executed more than 1% of the time in Forth programs, we do that in one cycle. Then on top of that we run the clock 10x, 100x, or 1000x faster than chips similar in size or price, with clever and unusual full custom VLSI designs.

People have often questioned if it was real, or if there was real value in this architecture and the new features that were developed for it. When Intel, AMD, Fujitsu, Casio, and Sony purchased technology licenses, we thought that added credibility for people who said it all sounded like a joke to them.

We added special hardware for 2ns remote procedure calls, added a ROM BIOS with OS-type code and routines to assist parallel programs, and gave each node its own high-speed RAM and ROM. We have fast on-chip memory, but because of our tiny 5-bit opcodes and ultra-fast register stacks we can pack up to four opcodes in a word and execute them four times as fast as memory can run. So with about the number of transistors of a 1970s micro, we can run our code about 10x faster than many other architectures can with a lot more transistors, and sometimes as fast as chips with thousands of times more transistors, which are usually there to do other things like FP.
And sometimes, namely on the things we intend to do, we beat those deeply pipelined and multilayer-cached chips by very large margins. People don't usually say it is impossible until we mention what process we are using. It would be one thing if we said we aimed for 1G in 60 nanometers, or 90 nanometers, or maybe even .13u, but when we say .18u is cheaper and has lower leakage current people say, but wait....

> I'm not a circuit designer, but it's obvious to me that 1GHz is wildly optimistic, even in an advanced process. You would have less than 20 levels of logic for instruction fetch, decode, operand fetch, execute and writeback. That's impossible.

In the nineties we made the F21 and i21. They were made in .8u and .65u. Intel used that technology a few years earlier. With pipelines and big on-chip cache they were able to get a maximum of 64MHz. The CPU in F21 ran at 500 MIPS, with 2ns opcode execution in .8u. It could sustain 400 MIPS in internal ROM and 220 in cheap external RAM. On-chip composite and RGB video I/O coprocessor, on-chip 40MSPS analog I/O coprocessor, on-chip high speed network router, ROM, parallel port, RTC, echo timer, memory controller, CPU with 2ns Forth opcodes, .8u, 20mW, a $1-$2 chip in the old days when those fabs were about to leave for the third world.

We could do that for two reasons: architecture, and custom VLSI on the cutting edge. In fact beyond the edge, according to the most expensive CAD tools, which said it couldn't happen.

Sony sent some engineers one time to visit us; they were curious. Their engineers told us 500 MIPS in .8u was impossible even if we used pipelines. They said, 50. They were willing to believe 50, but they warned us that we were totally wrong because what we thought was possible was technically impossible. I said, how do you explain that? They said, what? I said, you see that guy there? He is surfing the Internet, he is doing email. He is doing this on a box running our software on our chip.
You say it is impossible, so how do you explain that it exists? I said, count the chips! You see our chip and some memory, about $10 BOM. I said, you may think that it is impossible, but that doesn't explain what you are actually seeing.

I took a 68HC05 out of an old mouse, dropped my chip in, threw in some code, connected it up to a monitor, ran a short demo, and asked people what they had seen. A PC running Windows was the typical answer. Then I would point out that there was no PC, just a mouse with a cheap mouse-type chip and some memory running some code to create a desktop on the monitor and let me point, click, launch and run applications with a GUI and event-driven OS. Sometimes people would want me to pry off the lid to show them that it really was a little tiny VLSI chip doing this off batteries inside of a mouse that they had mistaken for a PC running Windows. Then I would say, how much code do you think is inside of that mouse that is imitating a PC? They were often surprised that it was about 1 kilobyte. Our code is pretty dense.

> > The idea is you pay for transistors. You can get many millions to execute a billion pipelined instructions per second on one processor with parallel pipelines, or you can have a large number of non-pipelined processors running at that same speed, perhaps thousands of them.
>
> Or you could have a small number of simple but fast CPUs. I agree that using many simpler CPUs uses transistors more efficiently and so can be cheaper, but there is a cost: software. So it remains cheaper overall to go for fewer faster CPUs.

Although I do work with and on CAD software a fair amount of the time, I am really a software person, and the parallel software is the part I like.

Thanks for the comments.
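The single-CPU figures quoted in this post (a dozen simultaneous events, 100ns per ISR, roughly 600ns average and 1.2us worst case) can be checked with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: all events arrive at time zero, the ISRs run back to back in arrival order, and interrupt entry cost and priorities are ignored. The function name is made up for illustration.

```python
def sequential_service(n_events, isr_ns):
    """Start and finish times (ns) when one CPU services n simultaneous
    events back to back, with no entry overhead and no priorities."""
    starts = [i * isr_ns for i in range(n_events)]
    finishes = [start + isr_ns for start in starts]
    return starts, finishes


starts, finishes = sequential_service(12, 100)

mean_start = sum(starts) / len(starts)        # average wait before an ISR begins
mean_finish = sum(finishes) / len(finishes)   # average time until an ISR completes
worst_finish = max(finishes)                  # time until the last ISR completes

print(mean_start, mean_finish, worst_finish)  # 550.0 650.0 1200
```

Depending on whether response time is measured to the start or to the end of each routine, the average comes out at 550ns or 650ns, which brackets the 600ns quoted above, and the last routine completes at 1200ns, matching the 1.2us worst case.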
Reply by ●July 13, 2006
cpu16x1832@wmconnect.com writes:
> Pipelining DOES increase THE RATE at which INSTRUCTION STREAMS complete execution, however, NO SINGLE INSTRUCTION RUNS FASTER!
>
> My example urls, at the top of my request for information post, reference parallel processors and may execute DOZENS of INSTRUCTION STREAMS SIMULTANEOUSLY!!!
>
> DOZENS of GHz (in parallel) is SIGNIFICANTLY GREATER than 1 GHz stuck inside an ancient pipeline architecture.

In my experience, the prolific use of capitals is an indicator that the quality of discourse is about to plunge to new depths.
Reply by ●July 14, 2006
Oh, my, anon has now started cross posting this to three groups. I think anon is probably maw, ie. m. in the wilderness, ma washburn, and has switched id again to stop google from censoring his posts. I should probably know better than to comment on any threads that he starts, but I will give this one more try.

Wilco Dijkstra wrote about concerns that with 32 opcodes we would sometimes have to use some more than once or put them in a loop, like having to do four multiply step opcodes to do a 4x4 multiply or 18 to do an 18x18 multiply. We optimized almost everything that happens more than 1% of the time into one-cycle 5-bit opcodes that pack into words so that we can execute them several times faster than memory access can run, without caches or pipelines. And we have profiled applications and know where the bottlenecks really are; we have real code.

> Other trivial things like addition (eg. x = y + z) take at least 10 instructions (80 bits) compared to just one 16 or 32-bit instruction on a typical RISC. It doesn't look very good...

We do that in one 5-bit opcode. We think 5 bits looks better than 32.

> It doesn't need a lot of transistors as long as you don't pipeline like the P4. A 5-stage pipeline typically provides a 4 fold speedup so it is more than worth it.

A 5-stage pipeline gets a 4x speedup, so it is almost a break even. We want to use the same number of transistors to build 5 processors that run just as fast, and many times faster for real-time event processing, because we remove the bottleneck of one processor having to process the events sequentially.

> > Best case gives several times speedup and worst case is many times slowdown.
>
> That's rubbish. Pipelining doesn't cause a slowdown. At worst you don't get as much of a speedup as you expected, but you do still get a speedup.

I could provide references to how pipeline stalls and cache misses can easily cause an ISR to take 100 times as long as the average execution speed, but in a very infrequent and hard to debug way. The fact that it limits 4GHz processors to microsecond or millisecond ISRs is not one I need to argue. I will let you assume that pipelining does not cause these kinds of problems, because I have a trump card in my hand. So call that point rubbish if you wish and I will just move on.

> > Interrupts are notorious for instance in causing catastrophic pipeline stalls and cache misses. It makes chips hot and expensive and harder to program, and programs get bigger, and it reduces production yield.
>
> Again, this is rubbish. Modern CPUs have interrupt latencies that are tens of nano seconds, even with caches (eg. 50 ns on a 400 MHz Cortex-R4). Interrupts have a cost, but you can't live without them.

Ok, so a latency of say 50ns to flush the pipe, save some state, and load the ISR from memory. Pretty impressive. And yes, with a single CPU interrupts are essential, as you say!

But we are talking parallel processing, where we have a CPU queue up the ISR code and go into a sleep state to process an event. When the external event on a pin or a message arrives, the processor wakes up and begins processing the event in less than 50ps, picoseconds.

50ps vs 50ns would seem to prove my point that interrupts on a single CPU are a bottleneck, but that's just the start. If ten of these events happen simultaneously we can launch the code for each of these events just as fast. We are still below 50ps latency.

If that pipelined and interrupted single CPU design gets ten events happening simultaneously, the average latency expands from 50ns to 1/2 times the total time to process all the routines, up to a maximum of all the time required to process all the other routines.
If the ten ISRs each take only 100ns, the worst case latency has now gone up to about 1.5us. We are still talking 50ps latency. If the ten ISRs take 1us each, the worst case latency has now gone to 10us while we remain below 50ps.

> > The point here is getting up to a 1GHz instruction execution at the minimal cost in transistors by not using pipelining or caches, making a class of programs smaller, faster, and simpler as odd as that might sound.
>
> Odd doesn't quite cut it. My first thought was it was a joke :-)

People had the same reaction to the previous generation of chips running too fast in .8u. Too small, too fast, too cheap to be understood by people who are thinking register machine designs with pipelines and caches. Many have thought it a joke. ;-)

People who have Forth programs that need subroutines in memory for their Forth primitives are used to the idea that primitives take 100 cycles on a slow clock on a cheap chip and require more than a dozen bytes of memory each. That these could be replaced with a 5-bit one-cycle opcode, running on a clock that we run 1000 times faster because of the other architectural innovations, has been hard for them to understand, and all some have done is make jokes that it was too good to be true.

But given that Intel, AMD, Sony, Fujitsu, Casio and others have paid a lot of money to license technology developed in the process of doing this, and have put this technology into products that people don't think are a joke, the joke may have been on the people who have called it a joke. ;-)

> I'm not a circuit designer, but it's obvious to me that 1GHz is wildly optimistic, even in an advanced process. You would have less than 20 levels of logic for instruction fetch, decode, operand fetch, execute and writeback. That's impossible.

Here is another impossible idea to consider.
Register machines may

    fetch a RISC opcode
    decode the alu operation bits and register select bits
    address the selected registers and gate io through the alu
    launch the alu operation and wait
    latch the output in the appropriate register
    and repeat

Because of the inherently sequential nature of those steps, one solution is to have pipelined execution units doing the stages in parallel, out of phase, to operate as you have described.

What we do is:

    fetch a word with up to 4 opcodes
    execute all opcodes while decoding 5 bits
    latch the output in the appropriate register

and we may be able to execute several opcodes before we have to make another memory operation, so we can start instruction prefetch, fetching the next group of opcodes while we are executing the last group we fetched.

And our stack registers are much faster than addressable registers, by design and definition.

So without pipelining we can get higher execution rates because we have fewer sequential steps in our opcodes internally. Yes, we have to do a dozen things in sequence inside of each opcode, so yes, each of these has less than 100ps for a complete opcode to execute in 1ns without pipelining. And no, we didn't go to 90 nanometers to do it.

> > The idea is you pay for transistors. You can get many millions to execute a billion pipelined instructions per second on one processor with parallel pipelines, or you can have a large number of non-pipelined processors running at that same speed, perhaps thousands of them.
>
> Or you could have a small number of simple but fast CPUs. I agree that using many simpler CPUs uses transistors more efficiently and so can be cheaper, but there is a cost: software. So it remains cheaper overall to go for fewer faster CPUs.

The reason that interrupts are not essential to us is that we have parallel processors.
Instead of one 20 million transistor CPU we can have a thousand 20 thousand transistor micro-computers with local ROM and RAM, capable of running code for many external events with picosecond latency, from the same silicon.

When we say an OS, a custom VLSI CAD system with a dozen utility programs, a complete chip design, and an equal amount of documentation fits on a third of a floppy disk, people say that's impossible, it sounds like a joke, that takes gigabytes of code. Not everyone understands that we talk about code that is more dense than what other people talk about. Real code.

And we have state of the art custom VLSI design techniques, have developed important new circuit designs, and have a novel architecture and an unusual way of coding to get very compact and fast code. It sounds like a joke to many people. And some refuse to see it any other way.

Some are frightened by the whole concept, because programming a sea of embedded processors in parallel is something that they don't understand. I am a software person, and the programming of the parallel sea of processors is what I like.

Maybe I have been able to explain to you why 1ns 5-bit opcodes without cache or pipelines on a $1 core interest us, and why they don't need interrupts, because they run in parallel. Our service routine latency is measured in picoseconds on our cheap chips, not micro or milli as it is with bigger, more expensive and power hungry chips that have a single CPU and need many memory accesses to service multiple interrupts.

Thanks for the questions. I try to avoid maw's threads, and now that this one is being cross posted I think it is time for me to say goodbye, and thanks.
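The "word with up to 4 opcodes" idea from this post can be illustrated with a small packing sketch. Everything concrete here is an assumption made for the example: a 20-bit word (4 slots of 5 bits), slots filled from the most significant end, opcode 0 used as padding, and arbitrary opcode numbers. The post does not give the actual word layout or encodings.

```python
OPCODE_BITS = 5
SLOTS_PER_WORD = 4                    # assumed: 4 slots x 5 bits = 20-bit word
OPCODE_MASK = (1 << OPCODE_BITS) - 1  # 0b11111, so 32 possible opcodes


def pack(opcodes):
    """Pack up to four 5-bit opcodes into one word, first opcode highest."""
    if len(opcodes) > SLOTS_PER_WORD:
        raise ValueError("too many opcodes for one word")
    word = 0
    for slot in range(SLOTS_PER_WORD):
        # Unused trailing slots are padded with opcode 0 (assumed filler).
        opcode = opcodes[slot] if slot < len(opcodes) else 0
        if not 0 <= opcode <= OPCODE_MASK:
            raise ValueError("opcode does not fit in 5 bits")
        word = (word << OPCODE_BITS) | opcode
    return word


def unpack(word):
    """Recover the four opcode slots from a packed word."""
    return [
        (word >> (OPCODE_BITS * (SLOTS_PER_WORD - 1 - slot))) & OPCODE_MASK
        for slot in range(SLOTS_PER_WORD)
    ]


word = pack([3, 17, 31, 8])           # hypothetical opcode numbers
print(word, unpack(word))
```

The point of the layout is that a single memory fetch delivers several opcodes, so the execution unit can work through the slots while the next word is being fetched, as the post describes.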
Reply by ●July 14, 2006
fox@ultratechnology.com wrote:
> Oh, my, anon has now started cross posting this to three groups. I think anon is probably maw, ie. m. in the wilderness, ma washburn, and has switched id again to stop google from censoring his posts. I should probably know better than to comment on any threads that he starts, but I will give this one more try.

Google doesn't censor with nearly as much force as IBM. (A joke?)

You ( http://groups.google.com/groups/search?q=SMP+OR+MPP&start=0&scoring=d&enc_author=0RVSxxcAAABwrwIXGmzirMeG17NErG6d6LAj6ElDBqtfTMuG4Ts3HA&hl=en&num=100&filter=0 ) and I ( http://groups.google.com/groups?q=VLIW+OR+SMP+OR+MPP+OR+STACK&start=0&scoring=d&enc_author=UMMf9RgAAAB028iMl1-EGchWGfA9OZiJwdEhmbh6wtHbY0AV513Zgw&num=100&filter=0& ) have both enjoyed working with SMP MPP FORTH design theory (VLIW SMP MPP FORTH for me).

You know the story: ten years ago I tried to tell Washington people about the formula, and about a month later just received a "personally not interested" reply from Bill. I am not sure what the hold up is with industry.

I told Washington that if Washington would ask the Department of Defense to issue a Request For Information or a Request For Proposal to the software/hardware industry, maybe through a program like DARPA, MAYBE IBM Defense would (in past) happily pay for owing work that should (have) bought them the entire foreseeable future of computer technology (silicon fabrication chips, bio-chemical, optical gates or otherwise, you name it, simply letting simply) from a very basic constraint formula:

VLIW SMP MPP FORTH (a Variable Length Instruction Word Symmetric Multi Processor Multiple Parallel Processor extended stack machine (FORTH))

How can we get TV, radio or newspapers to admit the simplistic truth, where IBM (software), Intel (hardware), and MoneySoft (sales) own the public sales? Maybe forget Washington for help.

Regards,
anon
http://mywebpage.netscape.com/mawcowboy/homepage.html
Reply by ●July 14, 2006
<fox@ultratechnology.com> wrote in message news:1152896935.192641.160500@s13g2000cwa.googlegroups.com...
>> Other trivial things like addition (eg. x = y + z) take at least 10 instructions (80 bits) compared to just one 16 or 32-bit instruction on a typical RISC. It doesn't look very good...
>
> We do that in one 5-bit opcode. We think 5 bits looks better than 32.

No, it only takes 1 instruction if both operands happen to be in the top 2 stack slots and the result stays on the TOS. Variables need to be stored, however. Global variables are stored in memory, never on a stack. Function local variables can't be stored on the stack because:

A. hardware stacks are typically tiny (like 8 or 16 entries), so only suitable for expression evaluation within a (few) basic block(s)
B. with a fixed size stack you need to reserve some entries for calls to other functions (or interrupts)
C. without exchange instructions you can only access the top of the stack, so variables below that are inaccessible
D. stacks are great for evaluating expression trees, but don't deal well with more complex DAGs that result from optimization

The bottom line is you have to store most variables in memory. Doing that takes a large number of instructions with a 5-bit instruction set, especially if you need to emulate stack and frame pointers etc (necessary if you want functions to be reentrant).

>> It doesn't need a lot of transistors as long as you don't pipeline like the P4. A 5-stage pipeline typically provides a 4 fold speedup so it is more than worth it.
>
> A 5-stage pipeline gets 4x speedup so it is almost a break even.

How do you mean "almost break even" - a 4x speedup means 4 times as fast. You don't get the ideal 5x speedup, but that's life.

> We want to use the same number of transistors to build 5 processors that run just as fast, ...

The overhead of pipelining is fairly small, in the order of 10-20%, so you can't trade off 1 pipelined processor for 5 unpipelined ones.
With on-chip memory the overhead is much lower, as the core is only a tiny proportion of the total area. So you could trade off 20 pipelined CPUs for maybe 21 unpipelined ones. But since they are 4 times slower, I don't see the point.

> I would provide references to how pipeline stalls and cache misses can easily cause an ISR to take 100 times as long as the average execution speed but in a very infrequent and hard to debug way.

It is definitely not due to pipeline stalls. Prescott has the longest pipeline at 31 stages. Flushing that pipe takes what, 31 cycles? Compare that to a cache miss, which can take hundreds of cycles (and you can have many misses). Desktop CPUs and OSes are just not designed for realtime interrupt processing, and it shows. On the other hand, many embedded CPUs are designed to be good at hard realtime work. And even with caches, TLBs etc, some can manage latencies in the 10's of nanoseconds.

> But we are talking parallel processing, where we have a CPU queue up the ISR code and go into a sleep state to process an event. When the external event on a pin, or message arrives the processor wakes up and begins processing the event in less than 50ps, picoseconds.

You're not helping your case by making up ridiculous numbers - 50ps is 2 gate delays on a 180nm process, maybe 3 at 90nm. If you're blocked on an event then yes, you can deal with it quicker. But what you call an "interrupt" is what everybody else calls a stall. Real interrupts are complex as they interrupt the CPU while it is busy, so it needs to save state etc. Stalling is much simpler, and current CPUs can react to stalls in 250ps.

> 50ps vs 50ns would seem to prove my point that Interrupts on a single CPU is a bottleneck but that's just the start.

Where do you get the idea that a 50ns interrupt latency is a bottleneck? Wasting a few cycles on each interrupt is not important; even at a high rate of 1 million interrupts per second the overhead would be 5% at worst.
On typical systems you rarely get over 10K interrupts/s; the highest spec I've ever seen was 20K/s, and I once did 25K/s for PC sampling.

> If that pipelined and interrupted single CPU design gets ten events happening simultaneously the average latency expands from 50ns to 1/2 times the total of time to process all the routines to a maximum of all the time required to process all the other routines.

No, it doesn't. Interrupts are typically prioritised, and high priority interrupts can interrupt another interrupt that is being processed. Ie. the interrupt latency for high priority interrupts doesn't change. Lower priority (non realtime, eg. key pressed) interrupts may have to wait of course; that is why they are called "low priority".

>> Odd doesn't quite cut it. My first thought was it was a joke :-)
>
> People had the same reaction to the previous generation of chips running too fast in .8u. Too small, too fast, too cheap to be understood by people who are thinking register machine designs with pipelines and caches. Many have thought it a joke. ;-)

There are many hoaxers on the internet making similar claims. I have to say your claims are more outrageous than most... Maybe you're not a hoaxer, but you're not trying very hard to sound believable. You come across as having little understanding of how modern CPUs work and yet you claim you can do better. Published papers describing this stuff would make it a lot more plausible.

I did a quick search for the F21, and the best I could find was a 5 page datasheet (looking distinctly amateurish). It was immediately obvious the CPU is much slower than you claimed, as it is totally memory limited (how surprising).
It isn't clear to me whether they ever got a chip fully working.

> People who have Forth programs that need subroutines in memory for their Forth primitives and where were they are used to the idea that primitives take 100 cycles on a slow clock on a cheap chip and require more than a dozen bytes of memory each and could be replaced with a 5-bit one-cycle opcode running on a clock that we run 1000 times faster because of the other architectural innovations it has been hard for them to understand and all some have done is make jokes that it was too good to be true.

Let's forget about terahertz CPUs for one moment. Can you point me to a Forth chip for sale that is only as fast as current CPUs?

> What we do is.
>
> fetch a word with up to 4 opcodes
> execute all opcodes while decoding 5-bits
> latch the output in the appropriate register
>
> and we may be able to execute several opcodes before we have to make another memory operation so we can start instruction prefetch fetching the next group of opcodes while we are executing the last group we fetched.

That sounds like a 2 stage pipeline, so a memory access is not part of the claimed speed, which makes it more plausible.

> And our stack registers are much faster than addressable registers by design and definition.

Sure, but the register read/write overhead is hidden by pipelining in a traditional CPU, so it doesn't limit the speed. You still need to have an ALU, and you may know that RISC cycle time is based on the speed of a single cycle ALU. I guess your ALU may be a bit simpler than a conventional one, but not enough to be much faster. Addition takes log(N) time.

> So without pipelining we can get higher execution rates because we have fewer sequential steps to our opcodes internally. Yes, we have to do a dozen things in sequence inside of each opcode, so yes, they have less than 100ps each for these for a complete opcode to execute in 1ns without pipelining. And no we didn't go to 90 nanometers to do it.

If you really do 10 steps in 1 ns, then you would be able to improve speed using pipelining, easily doubling it.

> When we say an OS, a custom VLSI CAD system with a dozen utility programs, a complete chip design, and an equal amount of documentation fits on a third of floppy disk people say that's impossible, it sounds like a joke, that takes gigabytes of code. Not everyone understands that we talk about code that is more dense than what other people talk about. Real code.

Something really simplistic might be feasible. But nothing that is equivalent in features, functionality and fluff to a modern OS. For example, the backdrop image I use doesn't fit on a floppy.

> And we have state of the art custom VLSI design techniques and have developed important new circuit designs and have a novel architecture and unusual way of coding to get very compact and fast code. It sounds like a joke to many people. And some refuse to see it any other way.

And I tend to agree. Extraordinary claims require extraordinary evidence. Are there patents or published papers on those VLSI techniques? Do you have a C compiler so it is possible to run benchmarks to verify those size and performance claims? If not, do you think it is odd most people believe it is a joke?

> Some are frightened by the whole concept that programming a sea of embedded processors in parallel is something that they don't understand. I am a software person and the programming of the parallel sea of processors is what I like.

If that is true you can make a fortune solving the parallel programming problems. The rest of the world has been wrestling with this for the last 50 years or so and made little or no progress. Most programmers seem to have difficulty enough writing defect free software that runs on one CPU...

Wilco
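The overhead estimate in this post (50ns per interrupt at 1 million interrupts per second costing about 5% of CPU time) is simple rate-times-cost arithmetic. A minimal sketch, assuming the 50ns figure is the whole per-interrupt cost and that interrupts never overlap; the function name is made up for illustration:

```python
NS_PER_SECOND = 1_000_000_000


def interrupt_overhead(rate_per_second, cost_ns):
    """Fraction of CPU time consumed by interrupt entry cost alone."""
    return rate_per_second * cost_ns / NS_PER_SECOND


# Rates mentioned in the post: 1M/s (high), then 25K/s, 20K/s and 10K/s.
for rate in (1_000_000, 25_000, 20_000, 10_000):
    print(f"{rate:>9} interrupts/s -> {interrupt_overhead(rate, 50):.3%} of CPU time")
```

At the 10K to 25K interrupts per second cited as typical, the same 50ns cost works out to roughly 0.05% to 0.125% of CPU time.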






