Component failures in complex systems are often expensive. The loss of operation time is compounded by the costs of emergency repairs, excess labor, and compensation to aggrieved customers. Prognostic health management presents a viable option when the failure onset is observable and the mitigation plan actionable.
As data-driven approaches gain favor, success has been measured in many ways, from basic outcomes (i.e., whether the costs justify the prognostic) to more nuanced detection tests.
Prognostic models, likewise, run the gamut from purely physics-based to statistically inferred. Preserving some physics has merit as that is the source of justification for removing a fully functioning component. However, the method for evaluating competing strategies and optimizing for performance has been inconsistent.
One common approach relies on the binary classifier construct, which compares two prediction states (alert or no alert) with two actual states (failure or no failure). A model alert is a positive; true positives are followed by actual failures and false positives are not. False negatives are when failures occur without any alert, and true negatives complete the table, indicating no alert and no failure.
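As a minimal sketch of the binary classifier table described above, the four cells can be tallied from paired alert and failure outcomes. The function name `confusion_counts` and the choice of one alert/failure pair per component (or observation period) are assumptions made for illustration; they are not prescribed by the construct itself.

```python
from collections import Counter

def confusion_counts(alerted, failed):
    """Tally the four cells of the binary classifier table.

    alerted, failed: equal-length sequences of booleans, one pair per
    component or observation period (an assumed unit of analysis).
    """
    if len(alerted) != len(failed):
        raise ValueError("alerted and failed must have the same length")
    counts = Counter()
    for a, f in zip(alerted, failed):
        if a and f:
            counts["TP"] += 1   # alert followed by a failure
        elif a:
            counts["FP"] += 1   # alert with no failure
        elif f:
            counts["FN"] += 1   # failure with no alert
        else:
            counts["TN"] += 1   # no alert, no failure
    return {k: counts[k] for k in ("TP", "FP", "FN", "TN")}

# Hypothetical outcomes for six components
alerted = [True, True, False, False, True, False]
failed = [True, False, True, False, True, False]
table = confusion_counts(alerted, failed)
print(table)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```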
Derivatives of the binary classifier include concepts like precision, i.e. the fraction of alerts that are true positives, and recall, the fraction of failure events that are preceded by an alert. Precision indicates whether an alert can be trusted, and recall indicates how many failures it can catch. Other analyses recognize that the underlying sensor signal is continuous, so the alerts will change along with the threshold. For instance, a more extreme threshold will result in fewer alerts and therefore typically higher precision at the cost of some recall. Tradeoff studies of this kind gave rise to the receiver operating characteristic (ROC) curve, which plots the true positive rate (recall) against the false positive rate as the threshold varies.
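The threshold tradeoff can be illustrated with a short sketch that sweeps a threshold over a continuous score and recomputes precision and recall each time. The scores, outcomes, and the function name `precision_recall` are hypothetical, and the rule "alert when the score meets or exceeds the threshold" is an assumed convention.

```python
def precision_recall(scores, failed, threshold):
    """Alert when score >= threshold, then compute precision and recall."""
    tp = fp = fn = 0
    for s, f in zip(scores, failed):
        alert = s >= threshold
        if alert and f:
            tp += 1
        elif alert:
            fp += 1
        elif f:
            fn += 1
    # Undefined ratios (no alerts, or no failures) are reported as NaN
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Hypothetical sensor-derived scores and whether a failure followed
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
failed = [True, True, False, True, False, False, True, False]

for thr in (0.25, 0.5, 0.75):
    p, r = precision_recall(scores, failed, thr)
    print(f"threshold={thr:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.50  recall=0.75
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50
```

In this toy data set, raising the threshold removes false positives first and then begins to drop true positives, which is the precision-for-recall exchange described above.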
A few ambiguities persist when we apply the binary classifier construct to continuous signals. First, there is no time axis. When does an alert transition from prescriptive to low-value or nuisance? Second, there is no consideration of the nascent information contained in the underlying continuous signal. Instead, the signal is reduced to alerts via a discrimination threshold.
Fundamentally, prognostic health management is the detection of precursors. Failures that can be prognosticated are necessarily a result of wear-out modes. Whether the wear-out is detectable and trackable is a system observability issue. Observability is a concept rooted in signal processing and controls: a system is considered observable if its internal state can be estimated using only the sensor information.
In a prognostic application, sensor signals intended to detect wear will also contain some amount of noise. In this case, noise is anything that is not the wear-out mode. It encompasses everything from random variations of the signal to situations where the detection is intermittent or inconsistent. Hence, processing the raw sensor signal to maximize the wear-out precursors and minimize noise will provide an overall benefit to the detection before thresholds are applied.
The proposed solution is a filter tuned to maximize detection of the wear-out mode. The evaluation of the filter is crucial, because that is also the evaluation of the entire prognostic. The problem statement transforms from a binary classifier to a discrete event detection using a continuous signal. Now, we can incorporate the time dimension and require a minimum lead time between a prognostic alert and the event.
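One way to picture the minimum lead time requirement is to label each event by whether an alert preceded it early enough to act on. The sketch below is illustrative only: the function `classify_events`, the labels, and in particular the look-back `horizon` (how far before an event an alert is still attributed to it) are assumptions that do not come from the text.

```python
def classify_events(alert_times, event_times, min_lead, horizon):
    """Label each event as 'timely', 'late', or 'missed'.

    timely: at least one alert falls between `horizon` and `min_lead`
            time units before the event.
    late:   alerts precede the event, but all by less than `min_lead`.
    missed: no alert within `horizon` before the event.
    """
    labels = []
    for e in event_times:
        # Lead times of all alerts attributed to this event
        leads = [e - a for a in alert_times if 0 <= e - a <= horizon]
        if any(lead >= min_lead for lead in leads):
            labels.append("timely")
        elif leads:
            labels.append("late")
        else:
            labels.append("missed")
    return labels

# Hypothetical alert and event times in arbitrary time units
alert_times = [12, 47, 95]
event_times = [20, 50, 80]
labels = classify_events(alert_times, event_times, min_lead=5, horizon=15)
print(labels)  # ['timely', 'late', 'missed']
```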
Filter evaluation is fundamentally performance evaluation for the prognostic detection. First, we aggregate the filtered values over a prescribed lead interval of n samples before each event. The lead traces are then averaged so that there is one characteristic averaged behavior before an event. In this characteristic trace, we can consider the value at some critical actionable time, tac, before the event, after which there is insufficient time to act on the alert. The filtered signal value at this critical time should be anomalous, i.e. it should be far from its mean value. Further, the filtered value in the interval preceding tac should transition from near-average to anomalous.
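The aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: events are given as sample indices, tac is expressed as a number of samples before the event, and "far from its mean value" is quantified as a z-score against the mean and standard deviation of the whole filtered signal. The function names and the synthetic ramp data are hypothetical.

```python
from statistics import mean, pstdev

def characteristic_trace(signal, event_indices, n):
    """Average the n samples preceding each event into one trace.

    Position 0 of the trace is n samples before the event and position
    n - 1 is the sample just before it. Events with fewer than n prior
    samples are skipped.
    """
    traces = [signal[e - n:e] for e in event_indices if e >= n]
    if not traces:
        raise ValueError("no event has a full lead interval")
    return [mean(column) for column in zip(*traces)]

def trace_zscores(trace, signal):
    """Express each trace value as a z-score relative to the full signal."""
    mu, sigma = mean(signal), pstdev(signal)
    return [(v - mu) / sigma for v in trace]

# Synthetic filtered signal: flat baseline with a ramp before each event
signal = [0.0] * 60
events = [20, 50]
for e in events:
    for k in range(1, 6):
        signal[e - k] = 6.0 - k  # values 1..5 over the 5 samples before e

n, tac = 8, 2  # lead interval length and critical time, both in samples
trace = characteristic_trace(signal, events, n)
z = trace_zscores(trace, signal)

print(trace)       # [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(z[n - tac])  # anomaly score at tac samples before the event
print(z[0])        # score at the start of the lead interval
```

Here the score at the start of the lead interval stays close to zero while the score at tac is well above it, which is the near-average to anomalous transition that the evaluation looks for.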
The signal value at tac and the filtered signal behavior up to that point each provide an independent evaluation metric. Together they frame the prognostic detection problem as it should be stated: a continuous signal detecting a discrete event, rather than a binary classifier.
A strong anomaly in the signal that precedes events on an aggregated basis is the alternative performance metric. If only a subset of events shows an anomaly, that means the detection failure mode is unique to those events, and the performance can be evaluated accordingly.
Thresholding is the final step, once the detection is optimized. The threshold need not be ambiguous at this step. The aggregated trace will indicate clearly which threshold will provide the most value.