Airbus A320 – Intense Solar Radiation May Corrupt Data Critical for Flight
Key topics
Airbus issued a precautionary update regarding potential data corruption in A320 aircraft due to intense solar radiation, which could impact critical flight data. The update was met with minimal discussion on Hacker News. The original title was modified to better reflect the content.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 2 minutes after posting
- Peak period: 31 comments in the 12-15h window
- Avg / period: 12.3 comments
Key moments
- 01 Story posted: Nov 28, 2025 at 4:40 PM EST (about 1 month ago)
- 02 First comment: Nov 28, 2025 at 4:42 PM EST, 2 minutes after posting
- 03 Peak activity: 31 comments in 12-15h, the hottest window of the conversation
- 04 Latest activity: Nov 30, 2025 at 7:11 PM EST (about 1 month ago)
> At least 15 passengers were injured and taken to the hospital after a sudden drop in altitude on the flight from Mexico was forced to make an emergency landing in Florida, US aviation officials said at the time.
> The Thursday flight from Cancun was headed to Newark, New Jersey, when the altitude dropped, leading to the diversion to Tampa International Airport, the US Federal Aviation Administration said in a statement.
> Pilots reported “a flight control issue” and described injuries including a possible “laceration in the head,” according to air traffic audio recorded by LiveATC.net.
> Medical personnel met the passengers and crew on the ground at the airport. Between 15 and 20 people were taken to hospitals with non-life-threatening injuries, said Vivian Shedd, a spokesperson for Tampa Fire Rescue.
> Pablo Rojas, a Miami-based attorney who specialises in aviation law, said a “flight control issue” indicated that the aircraft wasn't responding to the pilots.
https://www.stuff.co.nz/travel/360903363/what-happened-fligh...
I’m surprised passengers are allowed to unbuckle for so much of each flight. You can get injured while buckled in, but that seems less common.
Only aviation professionals or recovering flight phobics like me who have watched every episode of Air Crash Investigation will take proactive safety measures of their own accord. To normies it's all just a pointless hassle.
Not just ignoring flight crew advice and common sense to generally stay buckled, for maybe a minor gain in comfort and convenience, but unbuckling even when the seat belt sign is on and common sense says being buckled in is the smart move. On my most recent flight I heard quite a few people unbuckling their seat belts while the plane was still rolling down the runway after landing. You couldn't wait 5 more minutes until the plane is at the gate?
Also: people clapping the second the back wheels touch down on landing is particularly hilarious to me, because it implies an acknowledgement of the precariousness of flying, but a complete ignorance of the fact that you're just entering the second most dangerous 30 seconds of the entire flight.
https://www.swpc.noaa.gov/noaa-scales-explanation
https://kauai.ccmc.gsfc.nasa.gov/CMEscoreboard/prediction/de...
The European Union Aviation Safety Agency [2] instruction describes the characteristics of the incident but not the date.
[1] https://www.theguardian.com/business/2025/nov/28/airbus-issu...
[2] https://ad.easa.europa.eu/ad/2025-0268-E
https://docs.oracle.com/cd/E19095-01/sf4810.srvr/816-5053-10...
https://en.wikipedia.org/wiki/Cosmic_ray
A hardware fix is the ultimate solution, but it might be possible to paper over the issue with software.
Mind you whatever came out of that project is rolling on the street today.
I still design this into many of the things I work on, especially if I’m working close to the metal on controller systems. At some point it becomes ridiculous / impossible, but I’m often thinking about how a system would handle memory corruption, bit flips, invalid sensor data, etc. These days, somebody should design a triple-redundant microcontroller that runs quorum on the GPIO at the hardware level. It could be a $0.30 part instead of a $0.10 one, but I would specify it just about everywhere. Adding $3 to BOM cost to categorically eliminate an entire class of failure would be ramrodded by legal into just about every medical device, PLC, critical automotive system, etc., one would think. Seems like a good gambit for a RISC-V startup, but what do I know.
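For illustration, a minimal sketch in C of the 2-of-3 quorum such a part might implement at the pin level; the `vote3` / `lane_disagreement` names and the idea of exposing a disagreement mask are assumptions for this sketch, not any real part's interface:

```c
#include <stdint.h>

/* 2-of-3 majority vote across three redundant GPIO samples.
 * Each output bit is set iff at least two of the three lanes agree it is set. */
static inline uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (b & c) | (a & c);
}

/* Flag the bits where any lane diverged from the majority, so the fault
 * can be logged or the offending lane taken offline. */
static inline uint32_t lane_disagreement(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t m = vote3(a, b, c);
    return (a ^ m) | (b ^ m) | (c ^ m);
}
```

The point of doing this in hardware rather than firmware is that a single upset in any one lane never reaches the pins, and the disagreement mask gives you a cheap health signal for free.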
One glider instructor talked about taking a stick with him in case of a panicking student, so he could hit them hard enough that they would let go of the controls.
There's a detailed breakdown here: https://admiralcloudberg.medium.com/the-long-way-down-the-cr...
I don't believe there was any issue identified with the software of the plane.
https://forums.raspberrypi.com/viewtopic.php?t=99167
https://forums.raspberrypi.com/viewtopic.php?f=28&t=99042
https://www.raspberrypi.com/news/xenon-death-flash-a-free-ph...
https://www.youtube.com/watch?v=wyptwlzRqaI
And of course you can block the type of radiation that caused problems for the rpi with a good piece of paper.
For manned spaceflight, NASA ups N from 3 to 5.
Other mitigations include completely disabling all CPU caches (with a big performance hit), and continuously refreshing the ECC in background.
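As a rough idea of what the background ECC refresh can look like when done in software (many systems have a hardware scrubber instead), here is a hedged sketch; it assumes a memory-mapped ECC region where a plain read returns corrected data, and the `scrub_step` name, the chunking scheme, and the absence of locking are all simplifications:

```c
#include <stdint.h>
#include <stddef.h>

/* Walk a RAM region in small increments from a periodic task, reading each
 * word and writing it back.  The read lets the ECC logic correct any single
 * bit error; the write-back stores the corrected value before a second,
 * uncorrectable error can accumulate in the same word. */
void scrub_step(volatile uint32_t *base, size_t words, size_t *cursor, size_t chunk)
{
    for (size_t i = 0; i < chunk; i++) {
        size_t idx = (*cursor + i) % words;
        uint32_t v = base[idx];   /* ECC-corrected read */
        base[idx] = v;            /* write back the corrected value */
    }
    *cursor = (*cursor + chunk) % words;
}
```

In a real system the read-modify-write would need to be made atomic with respect to other users of that memory, which is one reason dedicated scrubbing hardware is preferred.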
Eg. I could understand if each subsystem had its own actuators and they were designed so any 3 could aerodynamically override the other 2, but I don't think that's how it works in practice.
They do not. You just make the voting circuit much more reliable than the computing blocks.
As an example, the computing blocks could be CMOS, but the voting circuit made from discrete components, which are simply too large to be sensitive to particle strikes.
Unfortunately, discrete components are sensitive to total accumulated exposure (more so than nm-scale transistors), because their larger area gathers more events and suffers from diffusion.
Another example from the aviation world: many planes still have a mechanical connection from the control column to the control surfaces, because a mechanical connection is considered ideally reliable. Unfortunately, at least one catastrophe happened because one pilot blocked his column and the other could not overcome the blockage.
BTW, weird fact: modern planes don't have a rod physically connected to the engine, because the engine has its own computer, which emulates the behavior of an old piston-engine carburetor. On Boeing aircraft the thrust lever has an electronic actuator, so it is automatically driven to the position corresponding to the actual engine mode, but Airbus levers don't have such an actuator.
What I want to say is that big planes especially (and planes overall) are a weird mix of very conservative inherited mechanisms and new technologies.
It's interesting to me that triple-voting wasn't necessary on the older (rad-hard) processors. Every foundry in the world is steering toward CPUs with smaller and smaller feature sizes, because they are faster and consume less power, but the (very small) market for space-based processors wants large feature sizes. Because those aren't available anymore, TMR is the work-around.
In other cases all of the subsystems implement the comparison logic and "vote themselves out" if their outputs diverge from the others. A lot of aircraft control systems are structured more as primary/secondary/backup where there is a defined order of reversion in case of disagreement, rather than voting between equals.
But, more generally, it is very hard to eliminate all possible single points of failure in complex control systems, and there are many cases of previously unknown failure points appearing years or decades into service. Any sort of multi-drop shared data bus is very vulnerable to common failures, and this is a big part of the switch to ethernet-derived switched avionics systems (e.g. AFDX) from older multi-drop serial busses.
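A toy sketch of that primary/secondary/backup reversion idea, as opposed to voting between equals; the `channel_t` layout, the single `healthy` flag, and the pitch-command field are invented purely for illustration:

```c
#include <stdbool.h>

typedef struct {
    bool   healthy;    /* result of this channel's own self-test / cross-check */
    double pitch_cmd;  /* command this channel wants to issue */
} channel_t;

/* Defined order of reversion: always follow the highest-priority channel
 * that still reports healthy, rather than voting between equals.
 * ch[0] = primary, ch[1] = secondary, ch[2] = backup, ... */
double select_command(const channel_t ch[], int n, int *active_out)
{
    for (int i = 0; i < n; i++) {
        if (ch[i].healthy) {
            *active_out = i;
            return ch[i].pitch_cmd;
        }
    }
    *active_out = -1;   /* all channels failed: caller reverts to direct/mechanical backup */
    return 0.0;
}
```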
You can see much more data in the report.
Why would you assume they're not? I don't know about aircraft specifically, but there's plenty of hardware that uses components older than that. Microchip still makes 8051 clones 45 years after the 8051 was released.
There's a reason the airlines and manufacturers hem and haw about new models until the economics overwhelmingly make it worthwhile, and even then it can still be a shitshow. The MCAS issue is case in point of how introducing new tech can cause unexpected issues (made worse by Boeing's internal culture).
The 787 Dreamliner is also a good example of how hard it is. By all accounts it is a success, but it had some serious teething problems and there are still some concerns about the long-term wear and tear of the composite materials (though a lot of its problems weren't necessarily the application of new tech, but Boeing's simultaneous desire to overcomplicate the manufacturing pipeline via outsourcing and spreading out manufacturing).
They didn't say the design was brand new.
Because getting a new one certified is extremely expensive. And designing an aircraft with a new type certificate is unpopular with the airlines. Since pilots are locked into a single type at a time, a mixed fleet is less efficient.
Having a pilot switch type is very expensive, in the 50-100k per pilot range. And it comes with operational restrictions: you can't pair a newly trained (on type) captain with a newly trained first officer, so you need to manage all of this.
Significant internal hardware changes might indeed require a new/updated type certificate, but they generally wouldn't mean that pilots need to re-qualify or get a new type rating.
But to do that you'll still have to prove that the changes don't change any of the aircraft characteristics. And that's not just the normal handling but also any failure modes. Which is an expensive thing to do, so Airbus would normally not do this unless there is a strong reason to do it.
The crew is also trained on a lot of knowledge about the systems behind the interface, so they can figure out what might be wrong in case of problems. That doesn't include the software architecture itself, but it does include a lot of information on how redundancy between the systems works and what happens when one system's output is invalid. For example, how the fail-over logic works in case of a flight control computer failure, or how it responds to losing certain inputs, and how that affects automation capabilities: no autoland when X fails, no autopilot and degradation to alternate control law when Y fails, further degradation if X and Z fail at the same time. Sometimes this is also per "side", since not all computers are connected to all sensors.
2. Bigger changes than this are made all the time under the same type certificate. Many planes went from steam gauges to glass cockpits. The A320 added a new fuel tank with transfer valves, transfer logic, and new failure modes, and has completely changed control laws over the life of the type, etc.
[1] Honeywell actually bought full license et al from AMD and operates a fabless team that ensures they have stock and if necessary updates the chip.
Guessing that using previously certified stuff is an advantage
Wasn't the philosophy back then to run multiple independent (and often even designed and manufactured by different teams) computers and run a quorum algorithm at a very high level?
Maybe ECC was seen as redundant in that model?
Jeez, it would drive me _up the wall_. Let's say I could somewhat justify the security concerns, but this seems like it severely hampers the ability to design the system. And it seems like a safety concern.
Sometimes the solution is obvious, such that if you ask three engineers to solve it you’ll get three copies of the same solution, whereas that might not happen if they’re able to communicate.
I’m sure they knew what they were doing, but I wonder how they avoided that scenario.
Redundancy is a tool for reducing the probability of encountering statistical errors, which come from things like SEUs.
Dissimilarity is a tool for reducing the “probability” of encountering non-statistical errors — aka defects, bugs — but it’s a bit of a category error to discuss the probability of a non-probabilistic event; either the bug exists or it does not, at best you can talk about the state coverage that corresponds to its observability, but we don’t sample state space uniformly.
There has been a trend in the past few decades, somewhat informed by NASA studies, to favor redundancy as the (only, effective) tool for mitigating statistical errors, but to lean against heavy use of dissimilarity for software development in particular. This is because of a belief that (a) independent software teams implement the same bugs anyway and (b) an hour spent on duplication is better spent on testing. But at the absolute highest level of safety, where development hours are a relatively low cost compared to verification hours, I know it’s still used; and I don’t know how the hardware folks’ philosophy has evolved.
Providing errors are independent, it's better to have three subsystems with 99% reliability in a voting arrangement than one system with 99.9% reliability.
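The arithmetic behind that claim, assuming independent failures and a 2-of-3 voter that fails only when at least two units fail (the program below is just a worked example, not from the thread):

```c
#include <stdio.h>

int main(void)
{
    double p = 0.01;                       /* per-unit failure probability (99% reliable) */
    /* A 2-of-3 voted system fails when two or three units fail. */
    double p_tmr    = 3 * p * p * (1 - p) + p * p * p;
    double p_single = 0.001;               /* one 99.9%-reliable unit */

    printf("TMR failure probability:   %.6f\n", p_tmr);     /* ~0.000298 */
    printf("Single-unit failure prob.: %.6f\n", p_single);  /*  0.001000 */
    return 0;
}
```

With independent faults, the voted arrangement of "worse" parts comes out roughly three times more reliable than the single "better" one; the catch is the independence assumption, which is exactly what the dissimilarity discussion above is about.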
Otherwise, I can easily see teams doing parallel construction of the same techniques. So many developments seem to happen like this, due to everyone being primed by the same socio-technical environment...
It’s essentially a very intentional trade-off between groupthink and the wisdom of crowds, but it lands on a very different point on that scale than most other systems.
Arguably the track record of Airbus’s fly-by-wire does them some justice for that decision.
ECC memory usage in the past was heavily correlated with, well, way lower quality of hardware from chips to assembly, electromagnetic interference from unexpected sources, or even customer/field technician errors. Remember that an early-1980s single-user workstation might require an extensive check-and-fix cycle just from being moved around.
An aircraft component would eliminate all the major parts of that, through more thorough self-testing, careful sealed design, selection of high-grade parts, etc.
The possibility of space radiation causing considerable issues came up as fully digital fly-by-wire became more common in civilian usage, and has led over time to retrofitting with EDAC, but radiation-triggered SEU was deemed a low enough risk given the design of the system.
This does not match my experience (although, admittedly, I've been in the field only a couple decades -- the hardware under discussion predates that). The problem with SEU-induced bit flips is not that errors happen, but that errors with unbounded behavior happen -- consider a bit flip in the program counter, especially in an architecture with variable sized instructions. This drives requirements around error detection, not correction -- but the three main tools here are lockstep processor cores, parity on small memories, and SECDED on large memories. SECDED ECC here is important both because it can catch double errors that happen close together in time, and because memory scrubbing with single error correction allows multiple errors spaced in time to be considered separately. At the system level, the key insight is that detectable failures of a single ECU have to be handled anyway, because of non-transient statistical failures -- connector failures, tin whiskers, etc. The goal, then, is to convert substantially all failures to detectable failures, and then have defined failure behavior (often fail-silent). This leads to dual-dual redundancy architectures and similar, instead of triplex; each channel consists of two units that cross-check each other, and downstream units can assume that commands received from either channel are either correct or absent.
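A schematic sketch of that fail-silent, cross-checking channel idea; the tolerance-based comparison and the `channel_output_t` shape are illustrative assumptions (real systems may compare bit-exact outputs or use more elaborate monitors):

```c
#include <stdbool.h>
#include <math.h>

/* One channel = two lanes computing the same command independently.
 * If the lanes disagree beyond a tolerance, the channel goes silent;
 * downstream consumers treat its command as absent, never as wrong. */
typedef struct {
    bool   valid;
    double cmd;
} channel_output_t;

channel_output_t cross_check(double lane_a, double lane_b, double tol)
{
    channel_output_t out = { .valid = false, .cmd = 0.0 };
    if (fabs(lane_a - lane_b) <= tol) {
        out.valid = true;
        out.cmd   = 0.5 * (lane_a + lane_b);
    }
    return out;   /* invalid => fail-silent: no command is better than a bad one */
}
```

This is the "convert substantially all failures to detectable failures" idea in miniature: the consumer never has to decide whether a received command is trustworthy, only whether one is present.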
An under-appreciated point is also that the devices in question used to be rebooted pretty often, which triggered self-test routines in addition to the run-time tests - something that didn't catch anything in the case of the A330 in 2008, but was a factor in risk assessments missing certain things with the 787 some years later (and the newer A380/A350 recently).
"There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]"
This is most likely what they will address.
Difference between it and ECC?
All of the value of your comment comes from the first sentence and the last two.
What you're doing here is half the job: consulting an LLM and sharing the output without verifying whether it is true. You're then saying 'okay everyone else, finish my job for me, specifically the hard part of it (the verification), while I did the easy part (asking a magic 8 ball)'.
From this perspective, your comment is disrespectful of others by asking them to finish your job, and of negative value because it could be totally hallucinated and false, and you didn't care enough about others to find out.
- EDAC is a term that encompasses anything used to detect and correct errors. While this almost always involves redundancy of some sort, _how_ it is done is unspecified.
- The term ECC used stand-alone refers specifically to adding redundancy to data in the form of an error-correcting code. But it is not a single algorithm - there are many ECC / FEC codes, from Hamming codes used on small chunks of data such as data stored in RAM, to block codes like Reed-Solomon more commonly used on file storage data.
- The term ECC memory could really just mean "EDAC" memory, but in practice, error correcting codes are _the_ way you'd do this from a cost perspective, so it works out. I don't think most systems would do triple redundancy on just the RAM -- at that point you'd run an independent microcontroller with the RAM to get higher-level TMR.
https://www.sciencedirect.com/science/article/abs/pii/S01419...
It’s confusing because EDAC and ECC seem to mean the same thing, but ECC is a term primarily used for memory integrity, whereas EDAC is a system-level concept.
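As a concrete (toy) instance of such a code, here is a Hamming(7,4) encoder/decoder that corrects any single flipped bit in a 7-bit codeword; real ECC memory uses wider SECDED codes, and the function names here are made up for the example:

```c
#include <stdint.h>
#include <stdio.h>

/* Toy Hamming(7,4): encode a 4-bit nibble into 7 bits; bit i of the codeword
 * is "position i+1" in the classic description (positions 1,2,4 are parity). */
static uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d1 = (nibble >> 0) & 1, d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;

    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4,5,6,7 */

    return (uint8_t)(p1 << 0 | p2 << 1 | d1 << 2 | p3 << 3 |
                     d2 << 4 | d3 << 5 | d4 << 6);
}

/* Returns the corrected nibble; *corrected_pos is 0 if no error was seen,
 * otherwise the 1-based bit position that was flipped and repaired. */
static uint8_t hamming74_decode(uint8_t cw, int *corrected_pos)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++) b[i] = (cw >> (i - 1)) & 1;

    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    int syndrome = s1 | (s2 << 1) | (s3 << 2);

    if (syndrome != 0) b[syndrome] ^= 1;   /* flip the bad bit back */
    *corrected_pos = syndrome;

    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}

int main(void)
{
    int pos;
    uint8_t cw = hamming74_encode(0xB);    /* data nibble 1011 */
    cw ^= 1 << 4;                          /* simulate a bit flip at position 5 */
    uint8_t out = hamming74_decode(cw, &pos);
    printf("recovered 0x%X, corrected bit position %d\n", out, pos); /* 0xB, 5 */
    return 0;
}
```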
I turn the page on the excuse sheet. "SOLAR FLARES" stares out at me. I'd better read up on that..."
How is it possible that this wouldn't impact upon flight schedules?
But as I hear, air carriers can buy planes in different configurations; for example, Emirates or Lufthansa always buy planes with all features included, but a small Asian airline might buy a limited configuration (even without some safety indicators).
So for Emirates or Lufthansa the fix will take one empty flight to the home airport, but a small airline will need to fly to some large maintenance base (or to the factory base) and wait in a queue there (you can find images online of Boeing's factory base with lots of grounded 737 MAXes from a few years ago).
So for Emirates or Lufthansa there will be minimal impact on flights (just like replacing a bus), but for small airlines things could be much worse.
So here’s everything you need to know about ELAC.
The ELAC System in the Airbus A320: The Brains Behind Pitch and Roll Control https://x.com/Turbinetraveler/status/1994498724513345637
"We take proactive measures, whereas our competitor only takes action after multiple fatal crashes!"
As far as I'm concerned it has not helped with their marketing.
It actually inspires a lot of confidence in people who can at least think economically, if not technically:
Grounding thousands of planes is very expensive (passengers get cash for that in at least the EU, and sometimes more than the ticket cost!), so doing it both shows that it’s probably a serious issue and it’s being taken seriously.
With that out of the way, being expensive does not preclude shoddy work. At the end of the day, the only difference between "they are so concerned about security that they are willing to lose millions[1]" and "their process must be so bad that they have no other choice but to lose millions before their death trap cost them ten times that" is how good your previous perception of their airplanes is.
I think that, had this exact same issue happened to Boeing, we would be having a very different conversation. As the current top comment suggests, it would probably be less "these things happen" and more "they cheaped out on the ECC".
[1] Disclaimer: I have no idea who loses money in this scenario, if it's also Airbus or if it's exclusively the airlines who bought them.
So the immediate cost to Airbus of grounding the fleet is quite low, whilst the downside of not grounding the fleet (risk of incident, lawsuits, reputation, etc.) could be substantial.
It sounds like the fix is fairly quick, so probably not as expensive as the MAX multi-month groundings.
I doubt anyone is going to sue. Repairs etc are a part of life when owning aircraft. So as long as Airbus makes this happen fast and smooth they’re probably ok
Airbus/Thales's fix in this case appears to add more error checking, and to restart the misbehaving component. https://bea.aero/fileadmin/user_upload/BEA2024-0404-BEA2025-...
("une supervision interne du composant à l’origine de la défaillance ; - un mécanisme de redémarrage automatique de ce composant dès lors que la défaillance est détectée)
Curious what a software change might have done in terms of resiliency. Maybe an incorrect memory setting, or some code path that isn't calculating things redundantly?
https://avherald.com/h?article=52f1ffc3&opt=0
"This identified vulnerability could lead in the worst case scenario to an uncommanded elevator movement that may result in exceeding the aircraft structural capability."
The actual bug was unsafe code somewhere else in the application corrupting the memory. The application worked fine, but the log message strings were being slightly corrupted. Just a random letter here and there being something it shouldn't be.
The question really should have been, if this was truly cosmic interference, why only this service and why was the problem appearing more than once over multiple versions of the application?
Cosmic rays are a great excuse for problems you don't yet understand. But in reality they are extremely rare, and it's like 99% a memory corruption bug caused by application code.
I won’t blame cosmic rays; more likely it was dying RAM. The NAS now runs ECC memory.
I jest, but, once upon a time I worked with an infallible developer. When my projects crashed and burned, I would assume that it was my lack of competence and take that as my starting point. However, my colleague would assume that it was a stray neutrino that had flipped a bit to trigger the failure, even if it was a reproducible error.
He would then work backwards from 93 million miles away to blame the client, blame the linux kernel, blame the device drivers and finally, once all of that and the 'three letter agencies' were eliminated, perhaps consider the problem was between his keyboard and his chair.
In all fairness, he was a genius, and, regarding the A320 situation, he would have been spot on!
If a radiation event caused some bit-flip, how would you realize that's what triggered an error? Or maybe the FDR does record when certain things go wrong? I'm thinking like, voting errors of the main flight computers?
Anyway, would be very interested to know!
"Had the same problem with low power CMOS 3 transistor memory cells used in implantable defibrillators in the 1990s. Needed software detection and correction upgrade for implanted devices, and radiation hardening for new devices. Issue was confirmed to be caused by solar radiation by flying devices between Sydney and Buenos Aires over the south pole multiple times, accumulating a statistically significant different error rate to control sample in Sydney."
The cause could have also been an extra check introduced in one of the routines - which backfired in this particular failure scenario.
Trapping for every known instance can be tricky and difficult. When things go wrong they tend to really go wrong.