Sunday, April 28, 2024

Exceeding System Design Scale Limits: Why the 737 MAX Is Failing


(I’m flying on a 737-8 MAX today. I’m sitting 2 rows in front of the infamous door plug in row 26L. I reflect on the 737 MAX and its problems.)

An important engineering concept is “system design scale.” When designing a system, how far can you scale it up before the architecture is no longer suitable for performance, reliability, or economics?


In software, I generally design for 10x, maybe 100x scale up/out. So, if you design for, say, 100 CPU cores, can you push the system to 1,000 cores? If you look at how Google or Facebook has scaled, you will see complete rethinking/redesigns of their technology stacks as they have grown/scaled up. For example, I once wrote about how Facebook has dramatically and aggressively scaled up image serving: from buying a third-party proprietary solution (NetApp), to homegrown, to Haystack, to whatever they are doing now.
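As a rough sketch of that 10x rule of thumb (the function names and the default limit are my own illustration, not a standard formula), you can think of design scale as a simple ratio check against the original design point:

```python
def scale_factor(design_point: float, target: float) -> float:
    """How many times larger the target is than the original design point."""
    if design_point <= 0:
        raise ValueError("design_point must be positive")
    return target / design_point


def within_design_scale(design_point: float, target: float,
                        max_factor: float = 10.0) -> bool:
    """True if the target stays inside the assumed design scale limit.

    max_factor is an assumption: 10x by default, 100x if you are optimistic.
    """
    return scale_factor(design_point, target) <= max_factor


# Designed for 100 cores, pushed to 1,000: right at the assumed 10x limit.
print(within_design_scale(100, 1_000))
# Pushed to 100,000 cores: well past even an optimistic 100x limit.
print(within_design_scale(100, 100_000, max_factor=100.0))
```

Real systems do not fail at a crisp threshold, of course. The point is only that the limit is set by the original design point, not by how badly you want the bigger system.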


Google scaled from a machine at Stanford, to using many machines in the CS department, to off-the-shelf racks, to custom racks, to completely rewiring its data centers, with everything from TPUs to dynamic optical interconnects, to support AI. I hear their Advanced Development work is even more extraordinary. Go Hank Levy? 😀


This is a long, roundabout way to get to my point: what’s with the Boeing 737 MAX? Sure, the usual criticism is warranted: Boeing has lost its way as it transformed from an engineering culture to a business culture.


But I think (without talking to anyone at Boeing for direct evidence) that there is an issue with system design scale. The latest incarnation of the 737 had a design architecture. When Boeing wanted to address the business threat of Airbus, it decided to scale up the 737 by stretching that design into bigger planes: the MAX. The scale-up went past the “system design scale” limit.


Now, I don’t know what the limits of scale for airplanes are. Certainly not 10x to 100x like software; that might mean 737s with, say, 20,000 seats. Maybe to address the Airbus threat, it was only a bigger engine or a few more rows. But it was past the design scale limit.


Bigger engines, more rows, new range limits, and probably hundreds of (small) design changes were needed to meet the “stretch” goal. Instead of thinking systemically from first principles to build a plane, Boeing made a bunch of tweaks to the original design. In software parlance, “design” (and implementation) modifications became “hacks” to make the system work. Coherent architectural principles were tossed aside to make it work. The system is held together by “gum and baling wire.”


MCAS and the door plug were artifacts/symptoms of the problem. Procedural and manufacturing hacks followed, in addition to product deficiencies: for example, in pilot training and certification, QA oversight controls, and assembly (bolts on doors). It’s the classic “putting your finger in a dike to stop leaks” problem. Problems will continue to surface because the MAX was implemented beyond the design scale limits.


So what can be done? Maybe Boeing can continue to hack away and plug all the problems, hopefully without catastrophic failures (crashes). Perhaps a new design with a different design scale is in the works. However, I don’t think Boeing can back away from the MAX and wait for a new plane. But if the MAX planes prove unsafe, would the FAA ground them? It seems like Boeing must stay the course, continue to hack away, and work with regulators to keep the MAX in the air.


I hope Boeing finds its footing and emerges as a better, stronger, and safer company. Ironically, maybe it’s Marketing and PR that will be needed to save the company, even if it fixes its engineering and manufacturing issues. I’m rooting for them. Boeing has been a pillar of success in Seattle. It has been a key contributor to what makes Seattle great today. It set a culture and built an ecosystem that has grounded the greater Seattle area for nearly a century. I personally owe much to Boeing: my dad worked there for 30+ years. Go Boeing!


Some side observations: 

“Move Fast and Break Things”

Facebook/Meta has been criticized for its ethos of “Move fast and break things.” It’s an important value of many startups, not just Facebook. It works when you are the small underdog and not deploying mission-critical apps (e.g. where lives are at stake). But not when you are big, as Facebook is now. In the early days, no one died when posting, “I had a great hamburger for lunch.” If I squint, I could say Boeing was “moving fast and breaking things” with the MAX. This was not appropriate: failure (“breaking things”) put lives at risk, and Boeing is a large company.


787 Dreamliner Problems Were Different

Problems with the 787 are a different issue. The 787 was designed to be built using a loosely coupled distributed system where subcomponents could be built by independent manufacturers. Final assembly (integration and system engineering) would be done by Boeing. Boeing would have to write specifications for the independent suppliers to implement. This was a complex process with supply chain, vendor/partner management, manufacturing, and integration issues.


These design principles are often used in software. Loose coupling, strong specifications, separating interface from implementation, interoperability of different subcomponents, etc. have been used to great success. The Internet was built on these principles. This did not work for the 787 Dreamliner. Boeing had to pull back from many of its partners and bring things back “in-house.” In general, what works for software might not work for many complex physical world systems.
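For readers who don’t write software, here is a minimal sketch of what “separating interface from implementation” looks like in code. Everything in it (the `Spec`, `Part`, and `Supplier` names, the weight check) is hypothetical and made up for illustration; it is not how Boeing or any real supplier system works.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Spec:
    """Hypothetical specification the integrator hands to every supplier."""
    part_name: str
    max_weight_kg: float


@dataclass(frozen=True)
class Part:
    name: str
    weight_kg: float
    maker: str


class Supplier(Protocol):
    """The interface: anyone who can build a part from a spec."""
    def build(self, spec: Spec) -> Part: ...


class SupplierA:
    def build(self, spec: Spec) -> Part:
        # Comes in under the weight limit.
        return Part(spec.part_name, spec.max_weight_kg * 0.9, "A")


class SupplierB:
    def build(self, spec: Spec) -> Part:
        # Comes in overweight, so it violates the spec.
        return Part(spec.part_name, spec.max_weight_kg * 1.2, "B")


def integrate(spec: Spec, supplier: Supplier) -> Part:
    """Final assembly: accept a part only if it meets the written spec."""
    part = supplier.build(spec)
    if part.name != spec.part_name or part.weight_kg > spec.max_weight_kg:
        raise ValueError(f"part from supplier {part.maker} does not meet spec")
    return part


door_spec = Spec("door", max_weight_kg=100.0)
print(integrate(door_spec, SupplierA()).maker)
```

The integrator never looks inside a supplier; it only trusts the spec and the check at integration time. In software that check is cheap and can run on every build. My guess is that in the physical world it is far more expensive, which is part of why the approach is harder to carry over.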


Afternote: I just read this article. Apparently, the 737 is also plagued by the distributed, outsourced production system. And Boeing is trying to reel it in.