DSPRelated.com
Forums

OT: Ariane 5 Launcher Failure

Started by Randy Yates September 1, 2015
On 9/2/2015 8:47 PM, glen herrmannsfeldt wrote:
> Eric Jacobsen <eric.jacobsen@ieee.org> wrote:
>> On Wed, 02 Sep 2015 16:11:01 -0400, robert bristow-johnson
> (snip on ADA)
>>> i was told that ADA was meant to "become all things to all men." (a biblical reference for those who might not recognize it.)
>> I never heard it described that way, but it was originally developed to force a bit more discipline in many error-prone areas in order to increase code reliability, traceability, readability, etc., etc. It has since fallen out of favor a bit after a few decades of demonstrating that bad coders will write crap no matter what language you hand them.
> But its descendant, VHDL, is still with us.
> Personally, I like Verilog more, but after learning a few rules, writing structural Verilog in VHDL isn't that hard. There are a few convenient operations that were added to VHDL so recently that most parsers don't know them yet.
You mean VHDL 2008? If your tools don't know VHDL 2008, at least a major subset, you need a new tool vendor. -- Rick
rickman <gnuarm@gmail.com> wrote:
> On 9/1/2015 6:18 PM, Randy Yates wrote:
>> rickman <gnuarm@gmail.com> writes:
(snip)
>> Yet another could be the designers' decision to allow this to generate an exception at all and not test for it and take other non-exceptional action. That is essentially my argument.
(snip)
>> Let me ask a question: what if the alignment algorithm designer had used ONLY a 16-bit integer for the horizontal bias? Then, AT DESIGN TIME, the algorithm designer would have been forced to consider out-of-range input and choose the action more intelligently. For example, instead of shutting the software down, they could have saturated the value. Granted, this could have been done with the double value as well, but the point is that the designer is FORCED to consider the case if you are thinking with an integer frame-of-mind.
> You may be oversimplifying the algorithm. We don't know the details, and it might not be suitable for 16-bit integers. I would think if it were, they would have used 16-bit integers.
Since floating point is rarely done in 16 bits (though IEEE seems to have a format if you want one), why always the comparison to 16-bit integers? Yes, fixed point, but at 32 or 48 or 64 bits.
>> If saturation had been used, then we wouldn't be talking about exceptions in this report, as it would never have happened.
> Nothing you say here changes the fact that the problem was not due to the use of floating point. If the algorithm required calculations at higher resolution than 16 bits, a higher-resolution integer result would still need to be truncated in some manner to fit the 16-bit integer receiving the data. This would still cause the same failure if done in the same way under the same conditions.
Maybe. But if done in fixed point the required test might have been done, where it rarely is in the case of floating point. There have been some good examples of errors made in floating point posted to comp.lang.fortran, but I don't remember them well enough to post. I believe one has to do with using a specific very large (well, not all that large) value where a NaN might have been used, and then later allowing this value into actual calculations. -- glen
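The sentinel-value failure mode glen half-remembers from comp.lang.fortran is easy to demonstrate. A minimal sketch (the sentinel value and function names are invented for illustration, not taken from that thread) of how a "missing data" magic number leaks silently into arithmetic, where a NaN would at least stay visible:

```c
#include <assert.h>
#include <math.h>

/* Invented sentinel for "no data" -- the kind of magic number that,
   unlike a NaN, looks like an ordinary (merely huge) value to the FPU. */
#define MISSING 9.99e30

static double mean3(double a, double b, double c)
{
    return (a + b + c) / 3.0;   /* no check for the sentinel */
}
```

With the sentinel, `mean3(1.0, 2.0, MISSING)` quietly returns roughly 3.3e30, a plausible-looking garbage number that can feed later calculations; with `NAN` in its place, the result is NaN and announces itself downstream.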
On 9/2/2015 8:33 PM, Steve Pope wrote:
> In article <ms7r1k$tqe$1@dont-email.me>, rickman <gnuarm@gmail.com> wrote:
>> On 9/2/2015 5:14 PM, Steve Pope wrote:
>>> I'm not a rocket scientist, but off the top of my head I will say you don't want *any* code to be able to enter an exception handler during launch. Before launch; once in orbit; sure. Maybe after you've fired your last stage and are approaching orbit. But in the early part of a launch you need to go with what you have, not throw an exception.
>> I was working on a system using Transputers many years ago. They had hardware range checking which would throw an exception if a register over/underflowed during a math operation.
> Almost all CPUs can do this.
Yes, this is why Transputers were used... not this feature alone, but because they had a bit of a fail-safe intent to the design with the comms, etc.
>> I asked why this was a good thing and it was pointed out that flying without knowing you had bad data would be a *very* bad thing. Better to shut down and let another system take over. In this case the other system had already shut down for the same reason since they were running identical code. The assumption here was that a fault of this sort would be hardware related and so unlikely that the redundant system would also be faulty. Shutting down the "bad" system was necessary to allow the other system to take over, I assume. Shutting down the second processor is clearly not a good idea, lol.
>> What would the rocket have done if the exception did not shut down the processor?
> As someone suggested above, it could have been coded to saturate the value into the integer type, and depending on details this could have resulted in the system continuing to operate.
The key part here is "depending on details"... The real problem was elsewhere and so the fix was too.
> (But one other important design task they clearly did not do is regress the code against all possible input conditions... that would have caught the unexpected exception.)
Bingo!
>> Would it have flown properly? Or would it have possibly flown far enough off course to cause other problems like crashing on people? I expect they had other means of preventing that.
> The range control officer can always prevent that -- I'm pretty sure the destruct command is not processed by software.
>
> Steve
-- Rick
On 9/2/2015 8:18 PM, Steve Pope wrote:
> Eric Jacobsen <eric.jacobsen@ieee.org> wrote:
>> On Wed, 2 Sep 2015 21:14:18 +0000 (UTC), spope33@speedymail.org (Steve
>>> I'm not a rocket scientist, but off the top of my head I will say you don't want *any* code to be able to enter an exception handler during launch. Before launch; once in orbit; sure. Maybe after you've fired your last stage and are approaching orbit. But in the early part of a launch you need to go with what you have, not throw an exception.
>> Margaret Hamilton disagrees. The very beginning of "engineered" software included exception handling that essentially saved the first Apollo Moon landing:
>>
>> https://boingboing.net/2015/05/07/photo-celebrates-unsung-nasa-s.html
> To me, the above link does not describe exception handling. Certainly not arithmetic exceptions. It seems to mostly describe operating system design.
I can't read the article currently, but my understanding about the landing issue was that there were some error reports because of a loss of sync in something that was never tested out of sync on the ground. The error reports used enough CPU time that they caused other events to miss their deadlines which caused further reports, etc. I got this from a former program manager who was teaching a course in program management. But that was some time ago and I may be confusing details. -- Rick
Randy Yates <yates@digitalsignallabs.com> wrote:

(big snip)

> Although I don't understand why ANY conversion to an integer (e.g., whether from a double or from a wider integer) in such an application wouldn't be carefully analyzed. But it seems like when folks throw floats into their algorithms, they stop being careful.
Yes. This happens way too often, most of the time where it doesn't matter so much. A favorite early programming problem has always been C-to-F or F-to-C temperature conversion tables. If one isn't careful, it is easy to get wrong in fixed point, so the usual solution is to go to floating point. A better solution is to do it correctly in fixed point. -- glen
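glen's example is worth making concrete. The trap in doing the C/F conversion in fixed point is that C integer division truncates toward zero, so rounding has to be handled explicitly. A minimal sketch (helper names are my own) of doing it correctly:

```c
#include <assert.h>

/* Round-to-nearest integer division, halves rounded away from zero.
   Plain "/" truncates toward zero, which is where naive fixed-point
   temperature tables go wrong. */
static int div_round(int num, int den)
{
    return (num >= 0) ? (num + den / 2) / den
                      : (num - den / 2) / den;
}

/* Exact rational forms: F = (9C + 160)/5 and C = 5(F - 32)/9 */
static int c_to_f(int c) { return div_round(9 * c + 160, 5); }
static int f_to_c(int f) { return div_round(5 * (f - 32), 9); }
```

For example, 37 C is really 98.6 F; `c_to_f(37)` correctly rounds to 99, where truncating division would give 98.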
rickman <gnuarm@gmail.com> wrote:

(snip)
> I was working on a system using Transputers many years ago. They had hardware range checking which would throw an exception if a register over/underflowed during a math operation. I asked why this was a good thing and it was pointed out that flying without knowing you had bad data would be a *very* bad thing. Better to shut down and let another system take over. In this case the other system had already shut down for the same reason since they were running identical code. The assumption here was that a fault of this sort would be hardware related and so unlikely that the redundant system would also be faulty. Shutting down the "bad" system was necessary to allow the other system to take over, I assume. Shutting down the second processor is clearly not a good idea, lol.
Reminds me of the story about why it took so long to discover the arctic ozone hole. There were satellites collecting the data, but the analysis had a test to ignore out-of-range data. (This isn't a fixed/float problem, it happens either way.) They just ignored data that was out of the expected range. No flag raised for someone to look at. Well, it is a long time since I read this one, but it is again the result of not thinking about what the data might do. -- glen
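The ozone-hole lesson generalizes: if an analysis must screen out-of-range data, it should at least count what it throws away so a human can notice a pattern. A hypothetical sketch (names invented; nothing to do with the actual satellite processing code):

```c
#include <assert.h>
#include <stddef.h>

/* Screen readings against the expected physical range, but COUNT the
   rejects instead of silently dropping them, so a run of "impossible"
   values raises a flag for someone to look at. */
typedef struct {
    double lo, hi;     /* expected physical range */
    size_t accepted;
    size_t rejected;   /* kept as a statistic, not vanished */
} screen_t;

static int screen_sample(screen_t *s, double x)
{
    if (x < s->lo || x > s->hi) {
        s->rejected++;
        return 0;      /* excluded from analysis, but still tallied */
    }
    s->accepted++;
    return 1;
}
```

After a pass over the data, the caller can inspect the ratio `rejected / (accepted + rejected)` and alert an operator when too much of the record was "out of range" to be believable as mere noise.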
On 9/2/2015 8:55 PM, glen herrmannsfeldt wrote:
> rickman <gnuarm@gmail.com> wrote:
>> On 9/1/2015 6:18 PM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
> (snip)
>>> Yet another could be the designers' decision to allow this to generate an exception at all and not test for it and take other non-exceptional action. That is essentially my argument.
> (snip)
>>> Let me ask a question: what if the alignment algorithm designer had used ONLY a 16-bit integer for the horizontal bias. Then, AT DESIGN TIME, the algorithm designer would have been forced to consider out-of-range input and choose the action more intelligently. For example, instead of shutting the software down, they could have saturated the value. Granted this could have been done with the double value as well, but the point is that designer is FORCED to consider the case if you are thinking with an integer frame-of-mind.
>> You may be over simplifying the algorithm. We don't know the details and it might not be suitable to 16 bit integers. I would think if it were, they would have used 16 bit integers.
> Since floating point is rarely done in 16 bits (though IEEE seems to have a format if you want one), why always the comparison to 16 bit integers?
> Yes fixed point, but 32 or 48 or 64 bits.
Yes, most likely the algorithm would need a larger integer word. But the result was being converted to a 16-bit integer. So if the algorithm *could* have been done in 16-bit integers, I'm sure they would have done so. Otherwise you are still left with a conversion with a possible overflow, and the same crashed rocket.
>>> If a saturation had been used, then we wouldn't be talking about exceptions in this report as it would have never happened.
>> Nothing you say here changes the fact that the problem was not due to the use of floating point. If the algorithm required calculations at higher resolution than 16 bits a higher resolution integer result would still need to be truncated in some manner to fit the 16 bit integer receiving the data. This would still cause the same failure if done in the same way under the same conditions.
> Maybe. But if done in fixed point the required test might have been done, where it rarely is in the case of floating point.
What test? What would be done if the result was too large for a 16-bit integer? Shut down the processor, I suspect... Remember, the result was supposed to be small enough to fit the 16-bit variable. This problem is clearly not due to the use of floating point. It has to do with bad input data, a data-type conversion, and a processor shutdown in the event of an error in an algorithm that was not crucial to the mission (because the design assumed the cause would be hardware, not software). Getting rid of the floating point does not fix any of this. -- Rick
rickman wrote:
> On 9/2/2015 8:18 PM, Steve Pope wrote:
>> Eric Jacobsen <eric.jacobsen@ieee.org> wrote:
>>> On Wed, 2 Sep 2015 21:14:18 +0000 (UTC), spope33@speedymail.org (Steve
>>>> I'm not a rocket scientist, but off the top of my head I will say you don't want *any* code to be able to enter an exception handler during launch. Before launch; once in orbit; sure. Maybe after you've fired your last stage and are approaching orbit. But in the early part of a launch you need to go with what you have, not throw an exception.
>>> Margaret Hamilton disagrees. The very beginning of "engineered" software included exception handling that essentially saved the first Apollo Moon landing:
>>>
>>> https://boingboing.net/2015/05/07/photo-celebrates-unsung-nasa-s.html
>> To me, the above link does not describe exception handling. Certainly not arithmetic exceptions. It seems to mostly describe operating system design.
> I can't read the article currently, but my understanding about the landing issue was that there were some error reports because of a loss of sync in something that was never tested out of sync on the ground.
There was an undiagnosed problem with the sequence/checklist used. This caused a radar to be turned on when it did not need to be. This drained the CPU, and the alarms were timeouts for nonessential processes blowing their CPU budget. Since somebody (Hal Laning) on the MIT team had made the task switcher on the Apollo guidance computer priority-driven, it was a non-problem.
> The error reports used enough CPU time that they caused other events to miss their deadlines which caused further reports, etc. I got this from a former program manager who was teaching a course in program management. But that was some time ago and I may be confusing details.
Everybody *NEEDS NEEDS NEEDS* to see this documentary series: http://www.dailymotion.com/video/xxxiev_moon-machines-2008-part-1-the-saturn-v-rocket_tech Part 3 is the one about the computer and they discuss Hal about 15-17 minutes in. -- Les Cargill
glen herrmannsfeldt  <gah@ugcs.caltech.edu> wrote:

> Reminds me of the story about why it took so long to discover the arctic ozone hole. There were satellites collecting the data, but the analysis had a test to ignore out-of-range data. (This isn't a fixed/float problem, it happens either way.)
> They just ignored data that was out of the expected range. No flag raised for someone to look at. Well, it is a long time since I read this one, but it is again the result of not thinking about what the data might do.
Similarly, seismograph logs generally delete signals that are not seismic in origin. Well, it turns out some of those non-seismic signals are candidates for dark-matter particles passing through the earth (which have the property that they move much faster than the speed of sound). By throwing away all this out-of-range data, scientists are left with much less to work from. Steve
robert bristow-johnson wrote:
> On 9/2/15 12:24 AM, Randy Yates wrote:
>> Les Cargill <lcargill99@comcast.com> writes:
>>> Randy Yates wrote:
>>>> spope33@speedymail.org (Steve Pope) writes:
>>>>> Randy Yates <yates@digitalsignallabs.com> wrote:
>>>>>> I find it almost laughable (if it weren't for the expense and danger such a failure had or potentially had) that the root cause was a conversion from float to integer! It supports a "feeling" I've had for a long time that coding in float is dangerous for just such reasons.
>>>>> This puts you with Von Neumann.
>>>> I'm not sure if that's a compliment or a criticism...
>>>>> Floats and doubles are not dangerous. They can store integers within a certain range, just like any other format.
>>>> Yes, but when humans use them, they start being sloppy!
>>> Oh no no no! You cannot trust them. Although really - integer saturation is just as dangerous and probably more common.
>>>> And if you are not sloppy, you might as well use integers/fixed-point (for many many things).
> to paraphrase (or as intimated in) https://xkcd.com/163/ "Different tasks call for different [types]"
> there are many places where i'm convinced a floating-point (or mixed) environment is most appropriate. but, it's worse than "sloppy" to make a moving-average or CIC filter using floats. you gotta make sure that what you add to the accumulator is exactly subtracted from it later. can't do that with float.
I've never had that particular problem. But I tend to use "faded" or exponential moving averages. And you can de-saturate classical moving averages by dividing the sum and count by 2 now and again.
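A "faded" average of the kind Les mentions is easy to keep exact in integers. A minimal sketch (names are mine; it assumes alpha = 1/2^K, 16-bit samples, and an arithmetic right shift, which all mainstream compilers provide) of a leaky-integrator average that sidesteps the exact-subtraction problem robert describes for float moving averages:

```c
#include <assert.h>
#include <stdint.h>

/* One-pole "faded" average in pure integers, alpha = 1/2^K.
   The accumulator carries K extra fraction bits, so every update is
   exact -- no float round-off, and nothing to subtract back out later
   as in a classical moving average or CIC comb. */
#define K 4  /* alpha = 1/16 */

typedef struct { int32_t acc; } ema_t;

static void ema_init(ema_t *e, int16_t x0)
{
    e->acc = (int32_t)x0 * (1 << K);       /* seed with K fraction bits */
}

static int16_t ema_update(ema_t *e, int16_t x)
{
    e->acc += (int32_t)x - (e->acc >> K);  /* leaky integrator */
    return (int16_t)(e->acc >> K);         /* current average */
}
```

Fed a constant input, the average settles exactly on that constant instead of hovering near it, which is the practical payoff of an exact accumulator.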
>>>>> Programmers however are dangerous.
>>>> This sounds a lot like the anti-gun-control sentiment..
>>> "Floating point doesn't kill rockets... programmers kill rockets... "
> woot!!
>> Exactly!!! True enough...
> (i disagree with the original anti-gun-control canard, but i love the comparison.)
-- Les Cargill