Part 2: Why Are Mechanical Engineers Important in Achieving Functional Safety?
Functional safety consists of automatic protection systems that are part of the overarching safety of a system or a piece of equipment. Although the scope of the governing standard, ISO 26262, implies a deference toward electrical engineering, mechanical engineers have a critical role to play. To be most effective, mechanical engineers must understand several key principles.
This is the second article in a 2-part series focused on the role that mechanical engineers play in quantifying and achieving functional safety:
- Part 1 is entitled “Principles for Mechanical Engineers to Understand Related to Functional Safety”. It outlines the transition of the automotive industry from electromechanical to mechatronic vehicles, the various types of faults, and random hardware failures.
- Part 2 is entitled “Why Are Mechanical Engineers Important in Achieving Functional Safety?” It focuses on the tie-in for mechanical engineers, and lessons learned.
The tie-in for mechanical engineers
In Part 1 of this series, we reviewed the evolution of the industry from electromechanical vehicles to mechatronic vehicles, and some of the complexity this new technology brought about. Yet ISO 26262 repeatedly states that its scope is electrical and/or electronic (E/E) systems installed in series production passenger vehicles. This recurring use of the words “electrical” and “electronic” could be problematic if one were to assume that these words alone defined the scope of the standard. At this point, mechanical engineers might be tempted to think they are off the hook, but that temptation must be resisted. Context plays a major role, and once it is taken into consideration, the mechanical correlations become evident.
Electronic components are still physical devices, regardless of their electronic function. All these electronic parts… capacitors, resistors, processors, FPGAs… they're all physically mounted somewhere, and they are impacted by temperature and vibration. The failure rate of each hardware part depends on the physics of the part itself, rated parameters, design-oriented parameters, derated parameters, and the vehicle mission profile.
If you are building these electronics and placing them in a harsh environment with high temperature stresses and frequent exposure to vibration, the probability that each part fails goes up, and the reliability of each part goes down. Because the system depends on all of its parts, the probability that the whole system fails increases as well. So, where you mount your cameras, your light detection and ranging (LiDAR) units, and your other sensors is a key consideration. Selecting a mounting location must account not only for protecting the electronics from the environment, but also for the mounting hardware itself and the material of the enclosure. The location must also be chosen to minimize interference among sensors and to maximize each sensor's effective range and detection capability, which improves safety.
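To make the system-level point concrete, here is a minimal Python sketch, assuming the parts fail independently and that any single part failure fails the board (a series system). The part count and per-part probabilities are hypothetical placeholders, not data for any real sensor.

```python
def system_failure_probability(part_failure_probs):
    """Probability that at least one part fails.

    Assumes independent parts in a series system: the system survives
    only if every part survives.
    """
    survival = 1.0
    for p in part_failure_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        survival *= 1.0 - p
    return 1.0 - survival


# Hypothetical board with 50 parts; the "harsh" case simply triples each
# part's failure probability to stand in for added heat and vibration.
benign = [0.001] * 50
harsh = [0.003] * 50

print(f"benign mounting: {system_failure_probability(benign):.3f}")
print(f"harsh mounting:  {system_failure_probability(harsh):.3f}")
```

With these made-up numbers, the board-level probability rises from about 4.9% to about 13.9%: a modest per-part increase compounds across every part on the board.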
As the engineer, I am seeking temperature profiles and vibration profiles so I can estimate the Failure In Time (FIT) rate, that is, the number of failures expected per one billion (10^9) device-hours. There's a table that lays out how to determine the failure rates for individual parts; see Table F.1 – “Possible combinations of sources of target values and failure rates to produce consistent failure rates for use in calculations” in ISO 26262.
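Under a constant (exponential) failure-rate assumption, FIT values can be handled with simple arithmetic: part rates in series add up to a board rate, and a rate converts to a failure probability over a given number of operating hours. The minimal Python sketch below shows this; the part list, FIT values, and the 8,000-hour figure are hypothetical.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10**9 device-hours


def fit_to_per_hour(fit: float) -> float:
    return fit / HOURS_PER_FIT_UNIT


def total_fit(part_fits) -> float:
    # With constant failure rates, the rates of parts in series simply add.
    return sum(part_fits)


def failure_probability(fit: float, operating_hours: float) -> float:
    """P(at least one failure within operating_hours), exponential model."""
    return 1.0 - math.exp(-fit_to_per_hour(fit) * operating_hours)


# Hypothetical radar board; the per-part FIT values are placeholders, not handbook data.
parts = {"processor": 20.0, "memory": 5.0, "capacitors": 3.0, "connector": 2.0}
board_fit = total_fit(parts.values())
print(board_fit)                                       # 30.0
print(f"{failure_probability(board_fit, 8_000):.2e}")  # 2.40e-04
```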
In addition to providing the mathematical formulas for the calculations that need to be performed, this table also identifies the accepted data sources for the failure rates of hardware parts. You can choose from a standard database, military handbooks, statistics, and expert judgement… yes, they devised a mathematical formula for human judgment!
What about field data? Collecting field data requires a good service tracking system that records the number of field returns and enables engineers to conduct root cause analysis. Because the entire circuit board, rather than a particular part on it, is the replaceable item, it is hard to collect field data for failure rates at the hardware part level, and the resources to do so economically are typically not available.
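To illustrate what field returns can and cannot tell you, here is a minimal Python sketch, assuming a constant failure rate and a fleet whose returns are counted per replaceable board. The fleet size, operating hours, and return count are hypothetical.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # FIT counts failures per 10**9 device-hours


def field_point_estimate_fit(failures, units, hours_per_unit):
    """Observed failures divided by accumulated device-hours, in FIT."""
    device_hours = units * hours_per_unit
    return failures / device_hours * HOURS_PER_FIT_UNIT


def zero_failure_upper_fit(units, hours_per_unit, confidence=0.90):
    """One-sided upper confidence bound on the rate when no failures were seen.

    Assumes a constant (exponential) failure rate, for which the bound is
    -ln(1 - confidence) / device_hours.
    """
    device_hours = units * hours_per_unit
    return -math.log(1.0 - confidence) / device_hours * HOURS_PER_FIT_UNIT


# Hypothetical fleet: 10,000 radar boards, 2,000 operating hours each.
print(field_point_estimate_fit(failures=6, units=10_000, hours_per_unit=2_000))
print(round(zero_failure_upper_fit(units=10_000, hours_per_unit=2_000), 1))
```

Because returns are logged per board, both results (about 300 FIT observed, and about 115 FIT as the 90% bound for a fleet with no returns) describe the board as a whole, not any individual part on it.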
Let’s examine a typical piece of electronics… the radar in an advanced driver-assistance system (ADAS). It contains a circuit card, and there are two primary considerations for that card: temperature and vibration. What range of temperatures will a given electronic component on that card be exposed to on the vehicle over time? And how much vibration (both frequency and amplitude) must it tolerate while continuing to function properly?
It depends on the use case. For example, during the COVID year, my car spent most of its time parked in my garage. This is a temperate space that consistently experiences moderate temperatures and humidity, and the vehicle itself saw little use over the pandemic year. In comparison, a commercial vehicle is going to be driven across the country under challenging conditions almost continuously, with little to no respite. Even if both vehicles are equipped with the same radar system, these are two very different use cases for the same component. The usage of the vehicle must be defined, and the packaging must be designed around that use case.
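A minimal sketch of why the use case must be defined: the same radar, with the same assumed FIT rate, accumulates very different operating hours in the two vehicles. The hours per year, the 10-year service life, and the FIT value are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    operating_hours_per_year: float
    service_years: float

    @property
    def lifetime_hours(self) -> float:
        return self.operating_hours_per_year * self.service_years


def lifetime_failure_probability(fit: float, use_case: UseCase) -> float:
    """P(failure over the service life), constant failure rate assumed."""
    return 1.0 - math.exp(-fit * 1e-9 * use_case.lifetime_hours)


RADAR_FIT = 30.0  # hypothetical placeholder for the same radar in both vehicles

garaged = UseCase("mostly garaged private car", operating_hours_per_year=200, service_years=10)
long_haul = UseCase("long-haul commercial vehicle", operating_hours_per_year=4_000, service_years=10)

for uc in (garaged, long_haul):
    print(f"{uc.name}: {uc.lifetime_hours:,.0f} h, "
          f"P(fail) = {lifetime_failure_probability(RADAR_FIT, uc):.1e}")
```

This models exposure time only. In practice the commercial vehicle's hotter, harsher environment would also raise the FIT rate itself, widening the gap further.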
Generally, when electronics are operated in hotter environments, they tend to last for shorter periods of time. With any electronics, you want them to operate at reasonably cool temperatures. In some past instances, this analysis has led to installing additional fans to aid cooling. If electronic components are kept cool, they tend to keep running for extremely long periods of time before they fail. So, someone must apply expert judgement and analyze whether additional cooling is needed.
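One widely used way to quantify “hotter means shorter-lived” is the Arrhenius model. The minimal Python sketch below assumes a single thermally activated failure mechanism with a hypothetical activation energy of 0.7 eV and a 40 °C reference temperature; the activation energies for real parts, and the exact model, come from the chosen reliability data source.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(t_ref_c: float, t_use_c: float, ea_ev: float = 0.7) -> float:
    """How many times faster a temperature-driven failure mechanism proceeds
    at t_use_c than at t_ref_c (both in degrees Celsius).

    ea_ev is the activation energy of the mechanism; 0.7 eV is only an
    illustrative default, not a value for any specific part.
    """
    t_ref_k = t_ref_c + 273.15
    t_use_k = t_use_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref_k - 1.0 / t_use_k))


for t in (25, 40, 55, 70, 85):
    print(f"{t:>3} °C: x{arrhenius_acceleration_factor(40, t):.2f}")
```

Under these assumptions, a part held at 85 °C ages roughly 26 times faster than at 40 °C, which is the kind of result that can justify added cooling.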
Vibration is the other primary concern. Vibration can cause stresses that crack or break connections and components. You can put these electronics in a box, you can mount them on shock absorbers, or you can mount them directly to the engine. Wherever you consider mounting them, you must take the big picture into consideration. That component might fit in a given spot, but how much vibration is it going to experience there? What is going to happen to the internal elements of the parts, the soldered connections, the joints, and the results of every manufacturing process? What is going to happen when you shake and bake that system over time?
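Vibration damage is commonly estimated by combining an S-N (stress versus cycles-to-failure) curve with Miner's rule of linear cumulative damage. The Python sketch below assumes a Basquin-type curve N = C·S^(-b) and a hypothetical per-hour vibration spectrum; C, b, and the stress amplitudes are placeholders, not properties of any real solder joint or component lead.

```python
def cycles_to_failure(stress_amplitude: float, c: float, b: float) -> float:
    """Basquin-type S-N curve: N = C * S**(-b)."""
    return c * stress_amplitude ** (-b)


def miner_damage(load_spectrum, c: float, b: float) -> float:
    """Miner's rule: sum of (applied cycles / cycles to failure) per stress level.

    Failure is predicted when the accumulated damage reaches 1.0.
    """
    return sum(n / cycles_to_failure(s, c, b) for s, n in load_spectrum)


# Hypothetical spectrum for one hour of driving: (stress amplitude, cycles).
hourly_spectrum = [(10.0, 3_000), (20.0, 300), (40.0, 10)]
C, B = 1e12, 4.0  # placeholder S-N curve parameters

damage_per_hour = miner_damage(hourly_spectrum, C, B)
print(f"damage per hour: {damage_per_hour:.3e}")
print(f"predicted life: {1.0 / damage_per_hour:,.0f} h")
```

In this made-up spectrum, the ten high-amplitude cycles contribute about as much damage as the 3,000 low-amplitude ones, so a mounting choice that reduces rare severe shocks can matter as much as one that reduces everyday vibration.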
Something we haven't seen at all yet in autonomous driving, and in many cases in ADAS, is how failures develop over time. We haven't examined an ADAS system after 10 years’ worth of continuous use. Once that milestone is reached, we might find that the blind spot detector or the adaptive cruise control gradually stopped working at some point. We just don't have enough data yet. And there are other mechanical considerations too, such as exposure to rain or salt.
The mechanical engineering analyses (the finite element analysis, the thermal analysis, the vibration analysis) matter here. The more accurate they are, the more realistic the mission profile and stress levels will be, and the more closely they will reflect what each hardware part actually experiences in the design. This will help to calculate a failure rate that is closer to reality. In turn, this feeds the three hardware metrics that we have discussed, which are evaluated for the target Automotive Safety Integrity Level (ASIL).
All three metric calculations require FIT rates. Even though failure rates can be obtained from different sources, the rates that are closest to reality have to be calculated by combining an industry-recognized failure rate model recommended by ISO 26262 with close-to-reality mission profiles and design-dependent stress levels.
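A minimal sketch of combining a failure rate model with a mission profile: a reference FIT, quoted at one temperature, is scaled by the fraction of operating time spent at each temperature. Purely for illustration, this assumes temperature is the only stress and that an Arrhenius factor with a 0.7 eV activation energy governs it; recognized industry failure rate models use their own, more detailed formulas and also account for other stresses. The reference FIT and both profiles are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(t_ref_c, t_use_c, ea_ev):
    """Arrhenius acceleration factor between two temperatures in °C."""
    t_ref_k, t_use_k = t_ref_c + 273.15, t_use_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref_k - 1.0 / t_use_k))


def mission_adjusted_fit(reference_fit, t_ref_c, profile, ea_ev=0.7):
    """Weight a reference FIT by the time spent at each temperature.

    profile: list of (temperature_c, fraction_of_operating_time); fractions must sum to 1.
    """
    if not math.isclose(sum(f for _, f in profile), 1.0):
        raise ValueError("time fractions must sum to 1")
    return reference_fit * sum(f * arrhenius_af(t_ref_c, t, ea_ev) for t, f in profile)


REFERENCE_FIT = 20.0  # hypothetical handbook-style value quoted at 40 °C
mild = [(25.0, 0.7), (40.0, 0.3)]
harsh = [(40.0, 0.3), (65.0, 0.5), (85.0, 0.2)]

print(f"mild profile:  {mission_adjusted_fit(REFERENCE_FIT, 40.0, mild):.1f} FIT")
print(f"harsh profile: {mission_adjusted_fit(REFERENCE_FIT, 40.0, harsh):.1f} FIT")
```

The same part yields a FIT below its reference value in the mild profile and several times above it in the harsh one, which is why a generic published rate alone can be misleading.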
And so, mechanical engineers have a real job to do here. Their job is to protect the electronics from their environment and to determine the proper mission profile, which then feeds the analysis, which in turn leads to three different metrics: the single-point fault metric (SPFM), the latent-fault metric (LFM), and the Probabilistic Metric for random Hardware Failures (PMHF).
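The three metrics can be sketched in a few lines of Python. This assumes each safety-related hardware element's FIT has already been split into single-point, residual, and latent multiple-point portions (the output of a safety analysis such as an FMEDA), and it simplifies PMHF to the single-point and residual contributions only. The values in TARGETS are the commonly cited ISO 26262-5 target values for ASIL B, C, and D; confirm them against the edition of the standard you are working to. Element names and FIT values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HardwareElement:
    name: str
    total_fit: float         # total failure rate of the safety-related element
    single_point_fit: float  # lambda_SPF: faults that directly violate the safety goal
    residual_fit: float      # lambda_RF: faults not covered by a safety mechanism
    latent_mpf_fit: float    # lambda_MPF,L: latent multiple-point faults


def spfm(elements):
    total = sum(e.total_fit for e in elements)
    return 1.0 - sum(e.single_point_fit + e.residual_fit for e in elements) / total


def lfm(elements):
    remaining = sum(e.total_fit - e.single_point_fit - e.residual_fit for e in elements)
    return 1.0 - sum(e.latent_mpf_fit for e in elements) / remaining


def pmhf_first_order(elements):
    """Simplified PMHF in FIT: single-point plus residual contributions only.

    Dual-point contributions are deliberately ignored in this sketch.
    """
    return sum(e.single_point_fit + e.residual_fit for e in elements)


# Commonly cited target values: (SPFM at least, LFM at least, PMHF below, in FIT).
TARGETS = {"B": (0.90, 0.60, 100.0), "C": (0.97, 0.80, 100.0), "D": (0.99, 0.90, 10.0)}


def meets_targets(elements, asil):
    spfm_min, lfm_min, pmhf_max = TARGETS[asil]
    return {
        "SPFM": spfm(elements) >= spfm_min,
        "LFM": lfm(elements) >= lfm_min,
        "PMHF": pmhf_first_order(elements) < pmhf_max,
    }


# Hypothetical radar board; all FIT values are placeholders.
board = [
    HardwareElement("processor", 20.0, 0.0, 0.10, 0.5),
    HardwareElement("memory", 5.0, 0.0, 0.05, 0.2),
    HardwareElement("power supply", 3.0, 0.0, 0.06, 0.3),
    HardwareElement("connector", 2.0, 0.02, 0.0, 0.0),
]

print(f"SPFM = {spfm(board):.4f}, LFM = {lfm(board):.4f}, "
      f"PMHF ~ {pmhf_first_order(board):.2f} FIT")
print(meets_targets(board, "D"))
```

Every input here is a FIT value, so any optimism in the underlying failure rates flows straight into all three metrics.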
Lessons learned for mechanical engineering practices
There are several lessons that have been learned, and missteps that should be avoided:
- Relying only on generic published failure rates, without considering design- and usage-specific stress profiles, is overly optimistic. This eventually leads to a failure on the vehicle, which is then traced to a heat problem that causes random failures, which in turn forces the redesign of an entire module and the recall of those vehicles.
For example, a major manufacturer recently experienced a failure in its infotainment system, specifically the screen. When the screen started to fail, the system could no longer display the rear-view camera image, and the driver lost the ability to adjust the defrost and climate control settings. The loss of the screen was a single-point fault, and it led to the loss of defrost and climate control capability, which violated the safety goal because the driver could no longer adjust those settings.
In this instance, it appears that the temperature profile at that load was not fully accounted for, and that derating was not properly applied in the design. The manufacturer is now replacing the affected parts with parts that have longer lifespans. The shorter lifespan appears to have been caused by the processor running at too high a load for too long.
Typically, load targets are defined for processors. For example, in aerospace, the flight control computer should operate at only 25% load on average. It should not operate at 100% load, because it can fail from the heat it generates at that load and from the electrical stress of running at its maximum allowed load. It is like an athlete: you do not want them running at full speed 100% of the time, because you will wear them out prematurely. Likewise, even though these components are solid state, they can still wear out.
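The load and derating ideas above reduce to simple ratio checks. This Python sketch assumes you already have applied-stress values from the design analysis and rated values from part datasheets; the derating limits, the sampled loads, and the 25% average-load target (borrowed from the aerospace example) are illustrative assumptions, not requirements from ISO 26262.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DeratingCheck:
    parameter: str
    applied: float       # stress seen in the design (from thermal or electrical analysis)
    rated: float         # manufacturer's rated maximum for the same parameter
    max_fraction: float  # derating limit: largest allowed applied/rated ratio

    @property
    def stress_ratio(self) -> float:
        return self.applied / self.rated

    @property
    def passes(self) -> bool:
        return self.stress_ratio <= self.max_fraction


def average_load_ok(load_samples, target=0.25):
    """Compare average processor utilization (0..1) with a load target."""
    return mean(load_samples) <= target


# Hypothetical checks; limits are illustrative, not from any derating standard.
checks = [
    DeratingCheck("capacitor voltage (V)", applied=12.0, rated=25.0, max_fraction=0.5),
    DeratingCheck("resistor power (W)", applied=0.20, rated=0.25, max_fraction=0.6),
]
for c in checks:
    print(f"{c.parameter}: {c.stress_ratio:.0%} of rating -> {'OK' if c.passes else 'REVIEW'}")

processor_load = [0.20, 0.30, 0.95, 0.95]  # hypothetical utilization samples
print("processor average load OK:", average_load_ok(processor_load))
```

Flagged items would then feed back into the thermal analysis, since a part running closer to its rating also runs hotter.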
- Mechanical engineers are not off the hook for functional safety. Mechanical engineers can and should contribute to the manufacturing process to improve product reliability. The assembly processes and soldering used, the coatings, space claim and maintenance procedures, all those mechanical elements… they all gain importance.
Adding to the complexity, many controllers are now designed to be utilized in only a few model-year designs. Then the technology is improved, the design is altered, and the further manufacture of the old design is eclipsed by the new design. The result is fewer spare parts available, and components that were never designed to be swapped out.
That's a fairly new consideration in the automotive realm. We change our brakes; we change the oil. The designs and maintenance processes are relatively consistent, and have been for decades. The parts and consumables are readily available, and they are engineered to be accessible and designed to be changed out at regular intervals. However, some electronic components, such as screens or controllers, are not designed to be replaced at regular intervals, if at all.
Compounding matters, now we have fully electronic vehicles. They must be designed so they can be maintained over the intended lifespan of the vehicle, typically 10 years. And to keep the vehicle cost-effective to own, it needs to be able to be maintained by someone other than the original manufacturer. All of these considerations fall squarely under the domain of mechanical engineering, and underscore the importance of weaving functional safety into these new designs as the requirements for hardware lifespans evolve.
- Using humans as your failsafe doesn’t work with a fully autonomous vehicle system. In the case of the infotainment system failure mentioned above, the manufacturer told the National Highway Traffic Safety Administration (NHTSA) that drivers who experience this problem can perform a shoulder check and use the mirrors. In other words, manual intervention became their safety mechanism… if all else fails, default to the human. Well, that's a problem in a fully autonomous system. Once autonomy becomes the norm, people tend to lose skills, and they quickly forget how to do the things they no longer have to do. If the human is the failsafe but can no longer perform that function, the failsafe is lost. The human behavior aspect is massive.
Ideally, any failure that could impact functional safety should be detected and mitigated by the system design, and human intervention should be used only as a last resort. In fully autonomous applications, humans let their guard down because they are faithfully relying on the machine to do the work for them; after all, that is what a fully autonomous system is designed for. If human intervention must be used as a last resort to avoid an accident when a failure occurs, mechanical design engineers need to consider human factors: warning device placement, warning messages, the font size and location of failsafe instructions, visual warning flash frequency, and audible warning sound levels that will “wake up” the occupant(s) in the vehicle in time to take failsafe mitigation action and avoid a safety goal violation.
A high degree of quality is required for functionally safe designs. Rather than relying on the human operators to be the failsafe, functionally safe engineering can relieve that burden, and mechanical engineers can help.
- To avoid a safety goal violation, the vehicle needs to be brought to a safe state within the Fault Tolerant Time Interval (FTTI). A leading autonomous vehicle company was involved in an accident during testing. The autonomous vehicle had a human backup pilot for last-resort intervention in case a failure occurred. A pedestrian entered the path of the vehicle, and the system failed to detect and mitigate the hazard. Interior camera footage of the accident showed that the backup pilot was focused on a cellphone during the test, behavior that should be expected in real-life scenarios. The pilot did not look away from the cellphone until the moment the vehicle struck the pedestrian. Evidently, even if there was a warning, it did not “wake up” the backup pilot in time to take action to avoid the accident.
If human intervention is the last resort for avoiding a safety goal violation, the Fault Tolerant Time Interval must cover both fault detection and fault mitigation, including the time it takes to “wake up” the human occupants, for them to overcome the “shock,” and for them to follow instructions or instinct and act to avoid the accident. Achieving this requires electronics and mechanical design engineers to work together.
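A minimal sketch of that timing budget: every step from fault detection to completed mitigation, including any human steps, must fit inside the Fault Tolerant Time Interval. The field names and all timing values are hypothetical; real FTTI values come from the hazard analysis, and human alert and response times would come from human factors studies.

```python
from dataclasses import dataclass


@dataclass
class FttiBudget:
    ftti_ms: float                  # Fault Tolerant Time Interval for the hazard
    detection_ms: float             # time for the system to detect the fault
    system_reaction_ms: float       # time for automated mitigation to act
    human_alert_ms: float = 0.0     # time for a warning to "wake up" the occupant
    human_response_ms: float = 0.0  # time to overcome the shock and act

    def used_ms(self) -> float:
        return (self.detection_ms + self.system_reaction_ms
                + self.human_alert_ms + self.human_response_ms)

    def margin_ms(self) -> float:
        return self.ftti_ms - self.used_ms()

    def fits(self) -> bool:
        return self.margin_ms() >= 0.0


# Hypothetical timings, for illustration only.
automated = FttiBudget(ftti_ms=2_000, detection_ms=100, system_reaction_ms=400)
human_fallback = FttiBudget(ftti_ms=2_000, detection_ms=100, system_reaction_ms=0,
                            human_alert_ms=1_500, human_response_ms=1_000)

for label, b in (("automated", automated), ("human fallback", human_fallback)):
    print(f"{label}: used {b.used_ms():.0f} ms, margin {b.margin_ms():.0f} ms, fits={b.fits()}")
```

With these placeholder numbers the automated path leaves margin, while the human fallback overruns the interval, which is exactly the gap that warning design and human factors engineering must close.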
Mechanical engineering practices, properly applied, would have prevented these incidents. Temperature and vibration analyses tied into a proper safety analysis, together with human factors engineering and close cooperation between electronics and mechanical design engineers, would have avoided these recalls and incidents, and many others that have recently been in the news. These examples illustrate the important role that mechanical engineers play.
When using standards to perform useful work, one must resist the temptation to focus on individual specialties alone and instead also digest the scope and intent of the entire standard, cover to cover. Engineering is a team profession. Only by reading beyond the scope statement and comprehending the overarching context will individuals fully understand their role on the greater team. This knowledge is powerful. The strengths and ingenuity of the whole engineering team must be leveraged to maximum effectiveness in order to advance and achieve true functional safety.