Failure of Complex Systems and PDSA

[!NOTE]

This article has not undergone in-depth study and reflection and may be significantly modified in the future.

Exceptions in Complex Systems

Every time a major product failure occurs, people's first reaction is always surprisingly consistent: find the person who made the mistake or the part that failed. We eagerly search for a clear "root cause" because it gives us an illusory sense of control—if we just fix this point, everything will get back on track.

In some cases, this simple way of thinking is not wrong; simplicity means speed. When staff are under pressure to resolve a customer complaint and contain the damage, quickly naming a "root cause" and declaring it fixed is a workable response. We often feel satisfied or even smug about "solving problems," but we must clearly realize that this "solution" addresses the people involved rather than the system as a whole.

Whether it's energy networks, companies, equipment, or software systems, these things are complex in design or in practice, and no one can easily understand how they currently work. They are in fact full of minor faults, but enough redundancy in the design allows them to function normally.

At a certain moment, several minor faults suddenly team up, a scheduled task becomes impossible to complete, and an accident follows. We then have to address the accident and apply a not-so-ugly patch based on a superficial understanding, convincing the person who discovered the problem that the issue has been resolved. I once accompanied a classmate to deal with an administrative director at a school. Although my classmate and I were cursing her for meddling, she said something I found very philosophical: "Every action leaves a trace; everything you do has consequences." Similarly, our eagerness to apply a patch can introduce more subtle minor faults into the entire system.

In engineering management, people often obsess over finding superficial "root causes" and neglect the true underlying issues. This is because those who actually do the work owe "an explanation" to those above them, and such an explanation usually has to pin responsibility on a specific person or thing. That makes the explanation easier to accept, but it ultimately obscures the systemic root causes.

This mindset is rooted in the early second stage of quality control, the "statistical quality control stage." It is a legacy of the "Ford assembly line" industrial era, which focused heavily on disassembly and standardization and assumed strong, direct cause-and-effect relationships. Compared to the first stage, the "quality inspection stage," this was certainly a significant methodological improvement. In the current era, however, the complexity of products, engineering, society, and organizations has sharply increased, forming complex systems with many variables that influence each other, non-linear relationships, and real-time changes. It has become very difficult for humans to grasp the causal relationships at each link, yet our thinking patterns remain stuck in the primitive, instinctive handling of simple, linear relationships, which leaves a significant cognitive gap.

[^Quality management's three stages]: 1. Quality inspection stage: Before the 18th century, products typically came from workshops, where quality assurance relied on the skills and experience of manual operators, with final checks performed by skilled workers. This inertia continued until the early 20th century and amounted to picking defective products out of finished goods as after-the-fact checks. 2. Statistical quality control stage: Mainly uses statistical methods and the control charts proposed by Shewhart to identify defects in a process promptly and correct them. 3. Total quality management stage: The TQC paper from 1956 proposed that quality issues arising during the production process account for only 20% and introduced the idea of total quality management, which fully considers market research, design, production, and service.

Focus on the System

If a coffee shop's quality fluctuates and a customer complaint arises one day, the manager's first reaction is always to "put out the fire." They hold an emergency meeting, quickly find the on-duty staff member, accuse them of not adjusting the coffee machine properly, and then impose a fine and compensate the customer.

Is that enough? The complaint is handled very promptly, but why does the same issue keep occurring? Because nothing in the coffee shop's system has changed: it still relies on the skill of individual staff to produce coffee.

In our time, there are almost no such coffee shops left. Think about it: in chain-brand coffee shops, isn't the product flavor almost consistent across all locations? Of course, we know that such flavors are not necessarily excellent, perhaps not as good as what that occasionally error-prone staff member can make. But that is the positioning of chain coffee shops: they produce only products of this quality and naturally serve only customers who are satisfied with it.

Deming[^1] categorizes all quality issues into two types.

The first type is called "controllable failures" (in Deming's own English terminology, variation from "special causes"). This is like your computer suddenly blue-screening. It is an abnormal, sudden disruption with an identifiable cause: it could be user error, a hardware failure, or a driver crash. For such problems, you must act immediately to find the cause, fix it, and ensure it doesn't happen again. It's like putting out a fire; immediate execution is required.

But more common and more troublesome is the second type of problem, which Deming calls "occasional failures" (in his English terminology, variation from "common causes"). This is more like your computer's overall speed fluctuating. It is not caused by a single, clear failure but is an inherent part of the system. It could be that your operating system is a bit bloated, too many programs are running in the background, or there is insufficient hard drive space... Countless small, random factors work together to create this overall, hard-to-describe "lag." This is the "background noise" of the system; it is always present.

Because staff are human, the quality issues caused by coffee shop staff are evidently "occasional failures." Chain coffee shop managers wisely optimize the system instead: they lower their target market, establish a stable and complete coffee bean supply system, and minimize the complexity of staff operations.

Deming's path is to first extinguish all those suddenly igniting "controllable failures." A set of standards (to be explained later) is established to judge scientifically which signals are truly abnormal. Once all "controllable failures" are eliminated, the system enters a "stable state." At this point, there are still problems and fluctuations, but these are all normal noise.

At this moment, the truly important improvements are just beginning; the root causes of all subsequent problems are no longer a specific person or thing but the entire system itself. Managers need to be smarter and more prudent in improving the system and continuously repeat the process of reflection and improvement.

How to Determine if the System Has Entered a "Stable State"

I do not yet understand some of the mathematical methods and indicators involved, so this section is incomplete.
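One standard tool for this judgment is the Shewhart control chart mentioned in the footnote on the three stages of quality management. The sketch below is not taken from Deming's book or from my own notes; it is a minimal illustration of the common individuals (XmR) chart, which sets limits at the mean plus or minus 2.66 times the average moving range. The extraction-time data and the function names are hypothetical.

```python
from statistics import mean


def individuals_chart_limits(samples):
    """Return (lower, center, upper) limits of an individuals (XmR) chart."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    center = mean(samples)
    # Moving range: absolute difference between consecutive measurements.
    moving_ranges = [abs(b - a) for a, b in zip(samples, samples[1:])]
    mr_bar = mean(moving_ranges)
    # 2.66 = 3 / 1.128, where 1.128 is the d2 constant for subgroups of size 2.
    half_width = 2.66 * mr_bar
    return center - half_width, center, center + half_width


def out_of_control_points(samples, limits):
    """Return the indices of samples that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(samples) if x < lower or x > upper]


# Hypothetical baseline: espresso extraction time in seconds, one value per cup.
baseline = [27.0, 28.0, 26.0, 27.0, 29.0, 27.0, 26.0, 28.0, 27.0, 28.0]
limits = individuals_chart_limits(baseline)

# New measurements are judged against the limits computed from the baseline.
new_cups = [27.0, 26.0, 35.0, 28.0]
signals = out_of_control_points(new_cups, limits)
print(limits)
print(signals)
```

With this data, only the 35-second cup (index 2) falls outside the limits and would be investigated as a "controllable failure"; the remaining fluctuation is treated as normal noise. A single rule like this is only the simplest test, and real control charts usually add further run rules.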

PDCA and PDSA

First, the concept of the Deming cycle involves repeatedly going through several stages to achieve system optimization.

PDCA is what is widely recognized as the "Deming cycle" today. It stands for Plan-Do-Check-Act, which means planning, executing, evaluating, and improving. Deming himself, however, clearly stated that he never proposed it; the name may stem from a misinterpretation.

PDSA is the original "Deming cycle." It stands for Plan-Do-Study-Act, which means planning, executing, studying, and improving.

Repeating these four stages achieves a stepwise improvement of the system.
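The four stages can be sketched as a loop. The following is only a minimal illustration, not Deming's own formulation: the metric (complaints per 10,000 cups, lower is better), the candidate changes, their predictions, and the `EFFECTS` table that stands in for a real-world trial are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CycleRecord:
    change: str
    predicted: int   # what we expected the metric to become (Plan)
    observed: int    # what the trial actually produced (Do)
    adopted: bool    # decision taken after studying the result (Act)


def run_pdsa(baseline: int,
             plans: list[tuple[str, int]],
             trial: Callable[[str, int], int]) -> tuple[int, list[CycleRecord]]:
    """Run one PDSA cycle per planned change and return the final metric."""
    current = baseline
    history = []
    for change, predicted in plans:          # Plan: a change plus a prediction
        observed = trial(change, current)    # Do: try the change on a small scale
        adopted = observed < current         # Study: did the metric improve?
        if adopted:                          # Act: keep the change or abandon it
            current = observed
        history.append(CycleRecord(change, predicted, observed, adopted))
    return current, history


# Hypothetical effect of each change on complaints per 10,000 cups.
EFFECTS = {"fixed grinder setting": -8, "extra tasting step": 3, "pre-weighed doses": -5}


def simulated_trial(change: str, current: int) -> int:
    return current + EFFECTS[change]


plans = [("fixed grinder setting", 40), ("extra tasting step", 38), ("pre-weighed doses", 36)]
final_level, history = run_pdsa(50, plans, simulated_trial)
for record in history:
    print(record)
print(final_level)
```

Recording the prediction next to the observation is deliberate: the gap between the two is what the Study stage is meant to examine, and it is the main difference from merely checking whether a task was completed.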

Modern methodologies mention the concept of "big cycles encompassing small cycles," and some pretentiously expand individual stages, such as extending C to 4C: Check, Communicate, Clean, Control. My view, however, is that methodologies should not be overly detailed; excessive detail ultimately amounts to having no methodology.

I believe that overemphasizing cycle-driven approaches can stifle the system's capacity for innovation, which can be fatal at certain stages of its life. This method also lowers the system's ceiling, because remarkable breakthroughs and innovations naturally bring more minor faults. Therefore, I believe this quality optimization approach is only applicable in a "stable state" and should be treated as a tool for execution.

Moreover, Deming expresses similar views in [Deming's New Economics (2nd Edition) | yono's document](https://data.yono233.cn/ 书籍 / 戴明的新经济观(原书第 2 版)=THE NEW ECONOMICS FOR INDUSTRYK,GOVERNMENT,EDUCATION SECOND EDITION_13726844.pdf), stating "do not be superstitious about methods, but adapt to local conditions." This book is Deming's final work, and I strongly recommend downloading and reading it. It also discusses the futility of performance rankings, the commitment of everyone to optimizing the system, and respecting employees rather than rewarding them materially. Like PDCA/PDSA, these rather idealistic views are best understood as the thoughts of a master.

Additionally, Deming's famous Fourteen Points are worth searching for and studying further, though they are not particularly relevant to us common folks.

Reflection

My biggest takeaway is to stop blaming mistakes on a specific point; when problems occur, it is actually a flaw in the system's design. So there is no need to be anxious or self-blaming; these are issues for those above us.

This article is synchronized and updated by Mix Space to xLog. The original link is https://www.yono233.cn/posts/white/25_6_24_FailSysPDSA

Footnotes

  1. An American quality management master who laid a solid foundation for quality management in Japanese enterprises.
