DSPRelated.com

OT: Ariane 5 Launcher Failure

Started by Randy Yates September 1, 2015
On Tue, 01 Sep 2015 23:53:33 +0200, Sebastian Doht wrote:

> Am 01.09.2015 um 21:01 schrieb gyansorova@gmail.com: >> On Tuesday, September 1, 2015 at 11:36:05 PM UTC+12, Randy Yates wrote: >>> Folks, >>> >>> I've been in a LinkedIn discussion in which the following analysis an >>> Ariane 5 failure is documented: >>> >>> http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html >>> >>> I'm just reposting it here since I find it fascinating, and I bet >>> there are a few folks here (Tim, you come to mind especially) who >>> might have a few things to say about it. >>> >>> I find it almost laughable (if it weren't for the expense and danger >>> such a failure had or potentially had) that the root cause was a >>> conversion from float to integer! It supports a "feeling" I've had for >>> a long time that coding in float is dangerous for just such reasons. >>> -- >>> Randy Yates Digital Signal Labs http://www.digitalsignallabs.com >> >> I thought they used ADA for such things >> >> > As far as I recall they used Ada but turned all range checks of Ada off > which makes the usage of Ada as an argument for increased safety quite > meaningless. > > "A fool with a tool is still a fool"
The article said something about overflowing an integer value and popping an exception. Which sounds more like they DID hit a range check, and lost a rocket ship because of it. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com
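For readers following along in C rather than Ada, here is a rough sketch of the kind of range check Tim is describing: a narrowing conversion that refuses out-of-range input instead of silently producing garbage. This is not the Ariane code; `i16_result` and `checked_to_i16` are made-up names for illustration. In plain C, casting an out-of-range `double` to `int16_t` is undefined behavior, so the check has to come before the cast.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the converted value plus a flag that says
 * whether the conversion was legal. */
typedef struct {
    bool    ok;
    int16_t value;
} i16_result;

/* Hypothetical checked conversion. NaN fails both comparisons, so it is
 * rejected along with everything outside [INT16_MIN, INT16_MAX]. */
i16_result checked_to_i16(double x)
{
    i16_result r = { false, 0 };
    if (x >= (double)INT16_MIN && x <= (double)INT16_MAX) {
        r.ok = true;
        r.value = (int16_t)x;   /* in range, so the truncating cast is defined */
    }
    return r;
}
```

The check only detects the problem. What the caller does when `ok` is false (shut down, saturate, flag the data as invalid) is a separate decision, and that is the part the rest of the thread argues about.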
On Tue, 01 Sep 2015 12:01:51 -0700, gyansorova wrote:

> On Tuesday, September 1, 2015 at 11:36:05 PM UTC+12, Randy Yates wrote: >> Folks, >> >> I've been in a LinkedIn discussion in which the following analysis an >> Ariane 5 failure is documented: >> >> http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html >> >> I'm just reposting it here since I find it fascinating, and I bet there >> are a few folks here (Tim, you come to mind especially) who might have >> a few things to say about it. >> >> I find it almost laughable (if it weren't for the expense and danger >> such a failure had or potentially had) that the root cause was a >> conversion from float to integer! It supports a "feeling" I've had for >> a long time that coding in float is dangerous for just such reasons. >> -- >> Randy Yates Digital Signal Labs http://www.digitalsignallabs.com > > I thought they used ADA for such things
You remind me of the people I was working with the one time that I debugged ADA code. We were a C house. They were a bunch of ADA people with the attitude "if there's a bug, and there's C code, then the bug is in the C code". The bug was in their (ADA) code, where they improperly used an ADA feature that's not present in C. AND, it took a C programmer who'd never written a line of ADA to find the bug (and rub their noses in it until they opened their eyes and LOOKED). Bad code is bad code. There is no magic language that'll enforce error-free software. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com
Tim Wescott  <seemywebsite@myfooter.really> wrote:

>You remind me of the people I was working with the one time that I >debugged ADA code.
>We were a C house. They were a bunch of ADA people with the attitude "if >there's a bug, and there's C code, then the bug is in the C code".
>The bug was in their (ADA) code, where they improperly used an ADA >feature that's not present in C. AND, it took a C programmer who'd never >written a line of ADA to find the bug (and rub their noses in it until >they opened their eyes and LOOKED).
Classic. Steve
On 9/1/2015 6:18 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes: > >> On 9/1/2015 7:36 AM, Randy Yates wrote: >>> Folks, >>> >>> I've been in a LinkedIn discussion in which the following analysis an >>> Ariane 5 failure is documented: >>> >>> http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html >>> >>> I'm just reposting it here since I find it fascinating, and I bet there >>> are a few folks here (Tim, you come to mind especially) who might have a >>> few things to say about it. >>> >>> I find it almost laughable (if it weren't for the expense and danger >>> such a failure had or potentially had) that the root cause was a >>> conversion from float to integer! It supports a "feeling" I've had for a >>> long time that coding in float is dangerous for just such reasons. >> >> I find this conclusion to show an immense lack of understanding of the >> cause of the failure. Did we read the same report? >> >> The use of integers for the variable that was the float would not have >> mitigated the accident. If you had used an N bit integer the same >> conversion to a 16 bit integer would have resulted in the same >> overflow and conversion error. >> >> The two primary causes of the accident were allowing the software for >> alignment of the strap-down inertial platform to continue to run after >> liftoff when it received invalid inputs which resulted in the out of >> range problem and the decision to shut down the processor on this >> error based on the assumption that the software was not faulty but >> rather the hardware was, which was an erroneous assumption in this >> case. > > I think there are several places one could lay the "cause" (perhaps > "root cause" was too extreme of a term). I certainly won't argue that > one would be the decision to leave the calibration running after it was > no longer required. That just doesn't make sense. 
> > Another could be the generic exception-handling specification that all > exceptions were catastrophic and should result in the processor being > shut down. > > Yet another could be the designers' decision to allow this to generate > an exception at all and not test for it and take other non-exceptional > action. That is essentially my argument.
If you read the full report they had to make some tradeoffs in the interest of performance. Now that I have thought about this a bit, I understand their reasoning for the shutdown. They were working with the premise that the software would be adequately vetted and such errors should not exist. Obviously this premise was not correct in the end.
> Let me ask a question: what if the alignment algorithm designer had used > ONLY a 16-bit integer for the horizontal bias. Then, AT DESIGN TIME, the > algorithm designer would have been forced to consider out-of-range input > and choose the action more intelligently. For example, instead of > shutting the software down, they could have saturated the value. Granted > this could have been done with the double value as well, but the point > is that designer is FORCED to consider the case if you are thinking with > an integer frame-of-mind.
You may be oversimplifying the algorithm. We don't know the details and it might not be suitable for 16-bit integers. I would think if it were, they would have used 16-bit integers.
> If a saturation had been used, then we wouldn't be talking about > exceptions in this report as it would have never happened.
Nothing you say here changes the fact that the problem was not due to the use of floating point. If the algorithm required calculations at higher resolution than 16 bits, a higher-resolution integer result would still need to be truncated in some manner to fit the 16-bit integer receiving the data. This would still cause the same failure if done in the same way under the same conditions. -- Rick
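A minimal C sketch of the two positions in this exchange, assuming a 32-bit integer stands in for the "higher resolution" intermediate. The function names are hypothetical and nothing here is taken from the actual flight software. It shows the saturation Randy suggests, and also Rick's point: the clamp is needed whether the wide value is a `double` or an integer, because the narrowing step is the same in both cases.

```c
#include <stdint.h>

/* Hypothetical: clamp a wide integer into the 16-bit range. */
int16_t saturate_i32_to_i16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Hypothetical: the same clamp for a floating-point source.
 * Mapping NaN to 0 is an arbitrary policy chosen for this sketch. */
int16_t saturate_double_to_i16(double x)
{
    if (x != x) return 0;                          /* NaN */
    if (x >= (double)INT16_MAX) return INT16_MAX;
    if (x <= (double)INT16_MIN) return INT16_MIN;
    return (int16_t)x;                             /* truncates toward zero */
}
```

For example, `saturate_i32_to_i16(40000)` and `saturate_double_to_i16(40000.0)` both pin at `INT16_MAX`. Whether a pinned value would have been acceptable to the downstream consumer is a system-level question this sketch does not answer.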
On 9/1/2015 6:24 PM, Randy Yates wrote:
> rickman <gnuarm@gmail.com> writes: > >> On 9/1/2015 10:59 AM, Tim Wescott wrote: >>> >>> Now, if I'm going to bring MY prejudices to bear on this, it was because >>> the systems engineering team was of the opinion that embedded software is >>> Black Magic, or they considered that it doesn't really have value because >>> it doesn't show up as a line item on the bill of materials. >> >> Prejudice is exactly the right word. > > Call it what you want - if a different approach had been made, as I > outlined in a post just a few minutes ago, the Europeans would be > millions of dollars and a missile launch up.
Lol. That is quite a stretch... -- Rick
Tim Wescott <seemywebsite@myfooter.really> writes:

> On Tue, 01 Sep 2015 18:33:36 -0400, Randy Yates wrote: > >> Tim Wescott <tim@seemywebsite.com> writes: >> >>> On Tue, 01 Sep 2015 07:36:01 -0400, Randy Yates wrote: >>> >>>> Folks, >>>> >>>> I've been in a LinkedIn discussion in which the following analysis an >>>> Ariane 5 failure is documented: >>>> >>>> http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html >>>> >>>> I'm just reposting it here since I find it fascinating, and I bet >>>> there are a few folks here (Tim, you come to mind especially) who >>>> might have a few things to say about it. >>>> >>>> I find it almost laughable (if it weren't for the expense and danger >>>> such a failure had or potentially had) that the root cause was a >>>> conversion from float to integer! It supports a "feeling" I've had for >>>> a long time that coding in float is dangerous for just such reasons. >>> >>> Well, I don't see that as the biggest error, or even one that, given >>> the nature of the root problem, would have saved the thing if it was >>> corrected. >> >> Why not? If the BH conversion was protected as other variables, or an >> integer was used that saturated, there would have been no exception >> generated and thus no crash (due to this bug). > > I think that saying that the problem was that they used floating point is > like saying "he didn't apply the brakes early enough" about a guy who > went driving on wet roads with bald tires.
Modify that analogy to say, "he got drunk and didn't apply the brakes early enough" about a guy who went driving on wet roads with bald tires, and I think it actually becomes quite applicable. What was more the problem: that he was driving drunk, or that his tires were bald? You could blame either one.
> Yes, it's _a_ correct interpretation of the evidence. But I don't think > it's the _most useful_ interpretation.
Interpretations aside, the fact is the rocket would not have crashed had this (integer) problem been avoided (barring a lot of other things that are pretty obvious, like an O-ring busting). If that doesn't make the error an issue, I don't know what does. -- Randy Yates Digital Signal Labs http://www.digitalsignallabs.com
rickman <gnuarm@gmail.com> writes:

> On 9/1/2015 6:24 PM, Randy Yates wrote: >> rickman <gnuarm@gmail.com> writes: >> >>> On 9/1/2015 10:59 AM, Tim Wescott wrote: >>>> >>>> Now, if I'm going to bring MY prejudices to bear on this, it was because >>>> the systems engineering team was of the opinion that embedded software is >>>> Black Magic, or they considered that it doesn't really have value because >>>> it doesn't show up as a line item on the bill of materials. >>> >>> Prejudice is exactly the right word. >> >> Call it what you want - if a different approach had been made, as I >> outlined in a post just a few minutes ago, the Europeans would be >> millions of dollars and a missile launch up. > > Lol. That is quite a stretch...
How so? If the integer overflow issue had been dealt with at design time, none of the rest of the bad decisions (e.g., leaving the calibration code running after launch) would have mattered. -- Randy Yates Digital Signal Labs http://www.digitalsignallabs.com
Randy Yates wrote:
> Folks, > > I've been in a LinkedIn discussion in which the following analysis an > Ariane 5 failure is documented: > > http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html > > I'm just reposting it here since I find it fascinating, and I bet there > are a few folks here (Tim, you come to mind especially) who might have a > few things to say about it. > > I find it almost laughable (if it weren't for the expense and danger > such a failure had or potentially had) that the root cause was a > conversion from float to integer! It supports a "feeling" I've had for a > long time that coding in float is dangerous for just such reasons. >
It was a complex failure. Floats are perfectly safe but it takes a lot of filtering - pun intended - to make them so. I probably went ... 20 years not using them, though. It just didn't come up, plus all the math was money when there was math, so BCD was used. -- Les Cargill
spope33@speedymail.org (Steve Pope) wrote:
> Randy Yates <yates@digitalsignallabs.com> wrote: > >> I find it almost laughable (if it weren't for the expense and danger >> such a failure had or potentially had) that the root cause was a >> conversion from float to integer! It supports a "feeling" I've had for a >> long time that coding in float is dangerous for just such reasons. > > This puts you with Von Neumann. > > Floats and doubles are not dangerous. They can store integers within a > certain range, just like any other format. > > Programmers however are dangerous. >
Very.
> Steve >
-- Les Cargill
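Steve's remark that floats and doubles "can store integers within a certain range" can be checked directly. The sketch below assumes `double` is IEEE 754 binary64 (true on most current platforms, but an assumption all the same), in which case every integer up to 2^53 in magnitude is stored exactly. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does n survive a trip through double unchanged?
 * Assumes |n| <= 2^62 so the cast back to int64_t is always defined. */
bool survives_double_round_trip(int64_t n)
{
    volatile double d = (double)n;   /* volatile: force a real double store */
    return (int64_t)d == n;
}
```

Under that assumption, 32767 and 9007199254740992 (2^53) survive the round trip, while 9007199254740993 (2^53 + 1) does not, because it rounds to its even neighbor.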
Randy Yates wrote:
> spope33@speedymail.org (Steve Pope) writes: > >> Randy Yates <yates@digitalsignallabs.com> wrote: >> >>> I find it almost laughable (if it weren't for the expense and danger >>> such a failure had or potentially had) that the root cause was a >>> conversion from float to integer! It supports a "feeling" I've had for a >>> long time that coding in float is dangerous for just such reasons. >> >> This puts you with Von Neumann. > > I'm not sure if that's a compliment or a criticism... > >> Floats and doubles are not dangerous. They can store integers within a >> certain range, just like any other format. > > Yes, but when humans use them, they start being sloppy!
Oh no no no! You cannot trust them. Although really - integer saturation is just as dangerous and probably more common.
> And if you are > not sloppy, you might as well use integers/fixed-point (for many many > things). > >> Programmers however are dangerous. > > This sounds a lot like the anti-gun-control sentiment.. >
"Floating point doesn't kill rockets... programmers kill rockets... " -- Les Cargill