Reliability modelling and analysis of a multi-state element based on a dynamic Bayesian network

This paper presents a quantitative reliability modelling and analysis method for multi-state elements based on a combination of the Markov process and a dynamic Bayesian network (DBN), taking perfect repair, imperfect repair and condition-based maintenance (CBM) into consideration. The Markov models of elements without repair and under CBM are established, and an absorbing set is introduced to determine the reliability of the repairable element. According to the state-transition relations determined by the Markov process, a DBN model is built. Its parameters for series and parallel systems, namely the conditional probability tables, can be calculated by referring to the conditional degradation probabilities. The power of a control unit in a failure model is then used as an example: a dynamic fault tree (DFT) is translated into a Bayesian network model and subsequently extended to a DBN. The results show the state probabilities of an element and the system without repair, with perfect and imperfect repair, under CBM and with an absorbing set; the curves are plotted from the differential equations and verified. Through forward inference, the reliability value of the control unit is determined under the different repair modes. Finally, the weak nodes of the control unit are identified.


Introduction
The reliability of a system or an element is defined as the ability to perform its required functions under specific operating conditions for a specified period of time [1]. Traditional analysis methods, such as fault tree analysis (FTA), the binary decision diagram (BDD) and failure modes and effects analysis (FMEA), assume that there are only two states in the system, normal and failure, and that the events in the system are independent of each other. However, in real-world systems, in addition to perfect functionality and complete failure, an element may have several intermediate states; it is then considered a multi-state element (MSE). A system consisting of MSEs is called a multi-state system (MSS). In addition, as redundant design and dynamic logic gates are introduced, systems become more complex and sophisticated, and traditional analysis methods no longer apply. Thus, new methods are required to assess the reliability parameters from the perspective of multiple states or multiple stages, to decrease the downtime probability and degradation of complex systems [2,3].
To determine the dynamic characteristic parameters of MSEs or single-MSE systems, many multi-state models have been established based on Markov processes in the domains of engineering, medicine and economics [4][5][6]. Markov processes are widely used because the number of failures in arbitrary time intervals can be described as a Poisson process, and the corresponding times to failure and repair are assumed to obey an exponential distribution. Anatoly et al. [4] built a multi-state Markov model to predict the short-term reliability of a coal power generating unit. Viewing the disease process as a multi-state progression, Malcolm et al. [5] performed a meta-analysis to determine the parameters of the treatment effects in multi-state Markov models. Similarly, Azza & Adel [6] extended the Markov-switching model to build a four-state indicator to detect inflexions and deterioration. When the transition densities of MSEs between states do not obey exponential distributions, modified Markov models are applied to describe the degradation and maintenance process of MSEs, including perfect repair, minimal repair and imperfect repair [7][8][9].
To obtain the reliability parameters of an MSS, Helge & Luigi [10] applied a Bayesian network (BN) in the reliability analysis community, and discussed its relevant ongoing research for practitioners. The BN was developed on the basis of probability and graph theory, and it is advantageous for performing a forward or predictive analysis and backward or diagnostic analysis, and for expressing uncertain causal relations [11,12]; the BN is widely used in system reliability assessment [13][14][15], human reliability analysis [16,17], fusing uncertain information [18,19] and operational risk assessment [20,21]. The BN can describe any MSE or MSS with a single node, which simplifies the state-transition in the stochastic process. In addition, all causal relationships can be denoted by conditional probability distributions. For deterministic logic relations, the conditional probability tables (CPTs) can be obtained through static or dynamic logic gates. In other cases, the CPTs can be obtained by consulting experts or referring to recorded failure data.
Methods such as FTA, BDD and FMEA are static tools used to direct the reliability improvement of the system or its elements at the beginning or at a specific time. A dynamic fault tree (DFT) is developed on the basis of a Markov process and is a useful tool to expand and upgrade the existing models to further improve the reliability and reduce system unavailability [22][23][24]. Because of the state explosion problem in Markov processes and the difficulty in obtaining a minimal cut sequence set, the DFT application is limited in complex systems with many dynamic logic gates. By introducing relevant temporal dependencies between representations, a BN is expanded into a dynamic Bayesian network (DBN), which overcomes the shortcomings of a DFT [21,25]. Compared with a DFT, a DBN is more suitable for monitoring and predicting the change of random variables and representing states of the system or its elements at any time. Daniele et al. [26] reported a DBN framework inside a system or among systems to evaluate cascading effects in a power grid. Shubharthi et al. [20] mapped a DFT into a DBN to perform a dynamic operational risk assessment and illustrated the methodological capability. Esmaeil et al. [27] developed a DBN model for an accident scenario and the risk associated with natural gas stations and indicated the failure of a regulator system.
In a reliability analysis, repair is a non-negligible factor. Fan et al. [28] introduced an algorithm based on a DBN for a repairable model to evaluate the reliability and security of complex systems. Cai & Liu [29,30] developed a reliability model of subsea blowout preventers to perform a common cause failure analysis based on a DBN. To improve the benefit of combined maintenance, Wang et al. [31] established a stochastic deterioration model for multi-element systems under condition-based maintenance (CBM). For equipment inaccessible to humans, repairs after a failure, including perfect repairs and imperfect repairs, are adopted. For equipment under monitoring, CBM is better: when degradation or a failure occurs, maintenance measures can be adopted immediately.
This paper is structured as follows: §2 presents the reliability model of an MSE based on Markov processes; §3 illustrates the method to develop a DBN of MSEs; §4 illustrates a control unit as an example; the results and discussion are considered in §5; and §6 summarizes this paper. Throughout, $R_i(t)$ denotes the reliability function of an element with a performance rate higher than $g_i$.

Reliability modelling of a multi-state element
An element has $k$ different states corresponding to its performance rates, denoted by the set $g = \{g_1, g_2, \ldots, g_k\}$, with $g_{i+1} > g_i$ for any $i$. Herein, $g_k$ represents the perfect functionality state of the element, and $g_1$ represents the complete failure state. An intermediate value $g_i$ ($1 < i < k$) denotes a state of degradation. At any time, the performance rate $G(t)$ of an element is a random variable taking a value from $g$, i.e. $G(t) \in g$. Assume that $p(t) = \{p_1(t), p_2(t), \ldots, p_k(t)\}$ is the set of probabilities associated with the different states of the element at any time $t$. Because the states in $g$ form a complete group of mutually exclusive events, $\sum_{i=1}^{k} p_i(t) = 1$ for any $t$, $0 \le t \le T$. Assume that the desired level of performance $W(t)$ takes discrete values from a set $w = \{w_1, w_2, \ldots, w_m\}$. The acceptability function $F(G(t), W(t))$ expresses the desired relationship between the performance and the demand. States with $F(G(t), W(t)) \ge 0$ are acceptable, and states with $F(G(t), W(t)) < 0$ are unacceptable and are defined as failures. The MSEs are divided into two groups: non-repairable elements and repairable elements.

Modelling of non-repairable elements
The case where an MSE can enter the subset of unacceptable states only once usually refers to a non-repairable deteriorating element. The element acceptability depends on the relation between the element performance and the desired demand. An MSE has two kinds of failures, minor failures and major failures, which can occur at any time. Minor failures cause an element transition from state $i$ to the adjacent state $i - 1$, while major failures cause an element transition from state $i$ to a state $j$ with $j < i - 1$. Assume that the sojourn time in any state is exponentially distributed. The state-transition diagram is presented in figure 1. The corresponding differential equations for the state probabilities of the Markov process are written as follows:
$$\begin{aligned}
\frac{dp_k(t)}{dt} &= -p_k(t) \sum_{e=1}^{k-1} \lambda_{k,e},\\
\frac{dp_i(t)}{dt} &= \sum_{e=i+1}^{k} \lambda_{e,i}\, p_e(t) - p_i(t) \sum_{e=1}^{i-1} \lambda_{i,e}, \quad i = 2, \ldots, k-1,\\
\frac{dp_1(t)}{dt} &= \sum_{e=2}^{k} \lambda_{e,1}\, p_e(t),
\end{aligned} \tag{2.1}$$
where $\lambda_{e,i}$ represents the degradation intensity from state $e$ to state $i$. From state $k$ there are $k - 1$ transitions to the states $e$ ($1 \le e \le k - 1$) with the intensities $\lambda_{k,e}$, and there are no transitions back to state $k$. For each state $i$ ($2 \le i \le k - 1$), there are transitions to this state from upper states and transitions from this state to lower states. There are no transitions from state 1, which means that it is an absorbing state for non-repairable MSEs.
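At the very beginning, an element is in the best state $k$ with the maximal performance rate $g_k$. Therefore, the initial conditions are
$$p_k(0) = 1, \quad p_{k-1}(0) = p_{k-2}(0) = \cdots = p_1(0) = 0. \tag{2.2}$$
If the demand is $g_i < w \le g_{i+1}$, $i = 1, 2, \ldots, k - 1$, the acceptable states are $i + 1, \ldots, k$, and the reliability function is denoted as
$$R_i(t) = \sum_{e=i+1}^{k} p_e(t). \tag{2.3}$$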
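The non-repairable model described above can be explored numerically. The following Python sketch integrates the state equations with a classical fourth-order Runge-Kutta scheme and evaluates the reliability function for a chosen demand level. The intensities in `LAMBDA`, the step size and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Dict, List, Tuple

Rates = Dict[Tuple[int, int], float]  # (from_state, to_state) -> intensity

# Hypothetical degradation intensities (per week) for a k = 4 element.
# These numbers are assumptions chosen for illustration only.
LAMBDA: Rates = {
    (4, 3): 0.004, (4, 2): 0.002, (4, 1): 0.001,
    (3, 2): 0.003, (3, 1): 0.001,
    (2, 1): 0.005,
}


def derivative(p: List[float], lam: Rates) -> List[float]:
    """Right-hand side of the state equations; p[s - 1] holds p_s(t)."""
    dp = [0.0] * len(p)
    for (src, dst), rate in lam.items():
        flow = rate * p[src - 1]
        dp[src - 1] -= flow  # probability leaving state src
        dp[dst - 1] += flow  # probability entering state dst
    return dp


def rk4_step(p: List[float], lam: Rates, h: float) -> List[float]:
    """One classical Runge-Kutta step of size h."""
    k1 = derivative(p, lam)
    k2 = derivative([x + 0.5 * h * d for x, d in zip(p, k1)], lam)
    k3 = derivative([x + 0.5 * h * d for x, d in zip(p, k2)], lam)
    k4 = derivative([x + h * d for x, d in zip(p, k3)], lam)
    return [x + h / 6.0 * (a + 2 * b + 2 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]


def state_probabilities(k: int, lam: Rates, t_end: float, h: float = 0.1) -> List[float]:
    """Integrate from the initial condition p_k(0) = 1 up to t_end."""
    p = [0.0] * (k - 1) + [1.0]
    for _ in range(int(round(t_end / h))):
        p = rk4_step(p, lam, h)
    return p


def reliability(p: List[float], i: int) -> float:
    """R_i(t): total probability of the states i + 1, ..., k."""
    return sum(p[i:])


p_100 = state_probabilities(4, LAMBDA, t_end=100.0)
print("state probabilities at week 100:", p_100)
print("R_2(100) =", reliability(p_100, 2))
```

Because the best state has no inflow, its probability has the closed form $p_k(t) = \exp(-t \sum_e \lambda_{k,e})$, which gives a convenient check on the numerical integration.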

Modelling of repairable elements
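For repairable elements, the transitions between the subsets of acceptable states and unacceptable states can occur at any time. Similar to failures, repairs can be divided into two groups: minor repairs and major repairs. Minor repairs return an element from state $j$ to the adjacent state $j + 1$ with the intensity $\mu_{j,j+1}$, while major repairs return an element from state $j$ to a state $i$ with $j + 1 < i$ and the intensity $\mu_{j,i}$. For the repairable MSE with minor and major failures and repairs shown in figure 2, the differential equations for the state probabilities are written as follows:
$$\begin{aligned}
\frac{dp_k(t)}{dt} &= \sum_{e=1}^{k-1} \mu_{e,k}\, p_e(t) - p_k(t) \sum_{e=1}^{k-1} \lambda_{k,e},\\
\frac{dp_i(t)}{dt} &= \sum_{e=i+1}^{k} \lambda_{e,i}\, p_e(t) + \sum_{e=1}^{i-1} \mu_{e,i}\, p_e(t) - p_i(t) \left( \sum_{e=1}^{i-1} \lambda_{i,e} + \sum_{e=i+1}^{k} \mu_{i,e} \right), \quad 1 < i < k,\\
\frac{dp_1(t)}{dt} &= \sum_{e=2}^{k} \lambda_{e,1}\, p_e(t) - p_1(t) \sum_{e=2}^{k} \mu_{1,e},
\end{aligned}$$
where $\mu_{i,e}$ represents the repair intensity from state $i$ to state $e$. The initial conditions are the same as those in equation (2.2).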
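As a worked special case (an illustration added here, not part of the original derivation), consider the smallest repairable element with $k = 2$, a single failure intensity $\lambda_{2,1}$ and a single repair intensity $\mu_{1,2}$. Using $p_1(t) + p_2(t) = 1$ and the initial condition $p_2(0) = 1$, the state equations can be solved in closed form:

```latex
\begin{aligned}
\frac{dp_2(t)}{dt} &= \mu_{1,2}\, p_1(t) - \lambda_{2,1}\, p_2(t)
                    = \mu_{1,2} - (\lambda_{2,1} + \mu_{1,2})\, p_2(t),\\
p_2(t) &= \frac{\mu_{1,2}}{\lambda_{2,1} + \mu_{1,2}}
        + \frac{\lambda_{2,1}}{\lambda_{2,1} + \mu_{1,2}}\,
          e^{-(\lambda_{2,1} + \mu_{1,2})\, t},\\
\lim_{t \to \infty} p_2(t) &= \frac{\mu_{1,2}}{\lambda_{2,1} + \mu_{1,2}}.
\end{aligned}
```

With repair, the probability of the working state therefore settles at a non-zero steady-state value instead of decaying to zero, which is why a separate absorbing-state model is needed to obtain a reliability function.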
To determine the reliability function for repairable MSEs, the probability of the element entering the set of unacceptable states for the first time must be obtained. To find the reliability function $R_i(t)$ for a constant demand $w$ ($g_i < w \le g_{i+1}$), another Markov model is established, as shown in figure 3. All states lower than the demand $w$ are merged into a single absorbing state, denoted as state 0. All repairs from this state back to acceptable states are forbidden, i.e. all the transition intensities $\mu_{0,m}$ for $m = i + 1, \ldots, k$ are set to zero. In addition, the transition intensity $\lambda_{m,0}$ from any acceptable state $m$ to state 0 is equal to the sum of the intensities of the transitions from state $m$ to all the unacceptable states, denoted as
$$\lambda_{m,0} = \sum_{e=1}^{i} \lambda_{m,e}, \quad m = i + 1, \ldots, k.$$
The differential equations to determine the reliability of the repairable element are then written for the states $0, i + 1, \ldots, k$ of this model, with the initial conditions of equation (2.2), and the reliability function is obtained by equation (2.3).
Figure 3. State-transition diagram for a repairable element under a constant demand.
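The construction of the absorbing-state model can be sketched in Python as follows. The sketch merges the unacceptable states into state 0, drops all transitions out of them, integrates the reduced model with a Runge-Kutta scheme and returns the reliability as one minus the probability of state 0. The intensities in `LAMBDA` and `MU` and the function names are hypothetical and serve only as an illustration.

```python
from math import exp
from typing import Dict, Tuple

Rates = Dict[Tuple[int, int], float]  # (from_state, to_state) -> intensity

# Hypothetical failure and repair intensities (per week) for k = 4.
LAMBDA: Rates = {(4, 3): 0.004, (4, 2): 0.002, (4, 1): 0.001,
                 (3, 2): 0.003, (3, 1): 0.001, (2, 1): 0.005}
MU: Rates = {(1, 4): 0.05, (2, 3): 0.03, (3, 4): 0.02}


def collapse_to_absorbing(i: int, lam: Rates, mu: Rates) -> Rates:
    """Merge states 1..i into absorbing state 0 and drop repairs out of it."""
    reduced: Rates = {}
    for (src, dst), rate in list(lam.items()) + list(mu.items()):
        if src <= i:
            continue  # transitions out of unacceptable states are removed
        target = dst if dst > i else 0  # failures below the demand go to state 0
        reduced[(src, target)] = reduced.get((src, target), 0.0) + rate
    return reduced


def reliability(k: int, i: int, lam: Rates, mu: Rates,
                t_end: float, h: float = 0.05) -> float:
    """R_i(t_end) = 1 - p_0(t_end) for the demand g_i < w <= g_{i+1}."""
    rates = collapse_to_absorbing(i, lam, mu)
    states = [0] + list(range(i + 1, k + 1))
    p = {s: 0.0 for s in states}
    p[k] = 1.0  # the element starts in the best state

    def deriv(q: Dict[int, float]) -> Dict[int, float]:
        d = {s: 0.0 for s in states}
        for (src, dst), rate in rates.items():
            flow = rate * q[src]
            d[src] -= flow
            d[dst] += flow
        return d

    for _ in range(int(round(t_end / h))):  # classical Runge-Kutta steps
        k1 = deriv(p)
        k2 = deriv({s: p[s] + 0.5 * h * k1[s] for s in states})
        k3 = deriv({s: p[s] + 0.5 * h * k2[s] for s in states})
        k4 = deriv({s: p[s] + h * k3[s] for s in states})
        p = {s: p[s] + h / 6.0 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s])
             for s in states}
    return 1.0 - p[0]


reduced = collapse_to_absorbing(2, LAMBDA, MU)
print("reduced intensities for i = 2:", reduced)
print("R_2(100) =", reliability(4, 2, LAMBDA, MU, 100.0))
print("R_3(100) =", reliability(4, 3, LAMBDA, MU, 100.0))
```

For $i = k - 1$ only the best state is acceptable, so the result reduces to a single exponential decay with the total outflow intensity of state $k$, whatever the repair intensities are.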
When $t \to \infty$, the element enters state 0 with final state probabilities given by
$$p_0(\infty) = 1, \quad p_m(\infty) = 0, \quad m = i + 1, \ldots, k.$$

Dynamic Bayesian network modelling for a multi-state element

Dynamic Bayesian network model
A DBN is an extension of the static BN obtained by introducing the temporal evolution of variables. The DBN is represented as a pair $(B_1, B_\rightarrow)$, where $B_1$ is the initial BN that defines the prior $P(X_1)$, and $B_\rightarrow$ comprises the BNs that include multiple copies of the time slices. The transition probability $P(X_t \mid X_{t-1})$ between two adjacent slices is
$$P(X_t \mid X_{t-1}) = \prod_{i=1}^{N} P\big(X_t^i \mid \mathrm{pa}(X_t^i)\big).$$
Here, $X_t^i$ denotes the $i$th node at time slice $t$, $\mathrm{pa}(X_t^i)$ denotes its parent nodes, and $N$ is the number of nodes in a slice. There are two assumptions in a DBN: the system is first-order Markov and it is time-homogeneous. Therefore, the edges between the nodes in a DBN lie within the same slice or between two adjacent slices, and the parameters of the conditional probability distributions do not change as time progresses. By unrolling $T$ time slices, the joint probability distribution is obtained by
$$P(X_{1:T}) = \prod_{t=1}^{T} \prod_{i=1}^{N} P\big(X_t^i \mid \mathrm{pa}(X_t^i)\big).$$
As shown in figure 4, the series and parallel systems are extended from time slice $t = 1$ to $t = 2$. In the series system shown in figure 4a, the nodes A and B at time $t = 1$ are each extended to time $t = 2$ with an inter-slice arc. There is no intra-slice arc between nodes A and B, so they are independent of each other. The parent nodes A and B have four states, namely the perfect, useful, pseudo-fault and fault states. The child node C has two states, namely the normal and fault states. Having the same structure but different CPTs, the parallel system shown in figure 4b has a higher reliability value than the series system at time $t = 1$ and $t = 2$.
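The unrolling of a time-homogeneous, first-order DBN can be illustrated with a small Python sketch. Each parent node is propagated from one slice to the next with the same inter-slice CPT, and the fault probability of the child node C is then computed for a series and a parallel structure. The CPT values are hypothetical, and the rule that only the fault state of a parent counts as a failure of that parent is an assumption made for this sketch.

```python
from typing import List

STATES = ["perfect", "useful", "pseudo-fault", "fault"]

# Hypothetical inter-slice CPT: CPT[r][c] = P(X_t = STATES[c] | X_{t-1} = STATES[r]).
# The same table is reused for every slice (time-homogeneous assumption).
CPT = [
    [0.990, 0.006, 0.003, 0.001],
    [0.000, 0.992, 0.006, 0.002],
    [0.000, 0.000, 0.995, 0.005],
    [0.000, 0.000, 0.000, 1.000],
]


def step(prior: List[float], cpt: List[List[float]]) -> List[float]:
    """Marginal of a node in slice t given its marginal in slice t - 1."""
    n = len(prior)
    return [sum(prior[r] * cpt[r][c] for r in range(n)) for c in range(n)]


def unroll(prior: List[float], cpt: List[List[float]], slices: int) -> List[List[float]]:
    """Marginals for slices 1..slices, starting from the prior of slice 1."""
    history = [prior]
    for _ in range(slices - 1):
        history.append(step(history[-1], cpt))
    return history


def series_fault(pa: List[float], pb: List[float]) -> float:
    """Child C fails if A or B is in the fault state (A and B independent)."""
    return 1.0 - (1.0 - pa[-1]) * (1.0 - pb[-1])


def parallel_fault(pa: List[float], pb: List[float]) -> float:
    """Child C fails only if both A and B are in the fault state."""
    return pa[-1] * pb[-1]


history = unroll([1.0, 0.0, 0.0, 0.0], CPT, slices=3)
for t, marginal in enumerate(history, start=1):
    print(f"slice {t}: series fault = {series_fault(marginal, marginal):.6f}, "
          f"parallel fault = {parallel_fault(marginal, marginal):.8f}")
```

In this sketch both parents share one CPT for brevity; distinct elements would normally have their own tables.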

Dynamic Bayesian network modelling for a multi-state element
For a multi-state degraded element, four assumptions are made: (1) the element has many levels of degradation, ranging from perfect functioning to complete failure; (2) the element may fail randomly at any time from its operational states, through minor failures and major failures; (3) all state-transition rates are constant, i.e. the sojourn times obey the exponential distribution; (4) the current state of an element is observable through some testing parameters.
Every parent node in a DBN has four states, i.e. perfect, useful, pseudo-fault and fault. The perfect state refers to perfect functioning, and the fault state refers to a complete failure. The useful state and the pseudo-fault state represent the first and second degraded element states, respectively. At the beginning, each parent node in a DBN is in the perfect state. As time elapses, the node will either move to the useful state or the pseudo-fault state, or proceed to the fault state. For equipment that is not accessible for humans or inspection, maintenance measures can only be performed after a failure. When a non-repairable element reaches the fault state, a replacement is needed. When this happens to a repairable element, a repair is needed. The element can either return to the perfect state, which is viewed as a perfect repair, or return only to the first or second degraded state, which is viewed as an imperfect repair. For equipment that is observable and accessible, CBM is suitable: if a state degradation occurs, the maintenance measure can be performed immediately, and the element will return to the perfect state or the useful state. The state-transition diagram for an MSE is shown in figure 5. Compared with the perfect repair and imperfect repair, CBM will make the element recover from the pseudo-fault state to the perfect state or the useful state, or recover from the useful state to the perfect state. The failure rates and repair rates between the states of an element are given in a simplified form above the state-transition arcs.
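One way to turn the repair modes described above into inter-slice CPTs is sketched below in Python. The sketch uses the first-order approximation $P \approx I + Q\,\Delta t$, which is reasonable only when every intensity multiplied by the slice length is much smaller than one; this approximation, the intensities and the names are assumptions for illustration and not necessarily the construction used by the authors. The CBM entry lists only the recoveries from the degraded states named in the text.

```python
from typing import Dict, List, Tuple

Rates = Dict[Tuple[int, int], float]  # (from_state, to_state) -> intensity

# State numbering: 4 = perfect, 3 = useful, 2 = pseudo-fault, 1 = fault.
ORDER = [4, 3, 2, 1]

# Hypothetical failure intensities (per week).
FAILURES: Rates = {(4, 3): 0.004, (4, 2): 0.002, (4, 1): 0.001,
                   (3, 2): 0.003, (3, 1): 0.001, (2, 1): 0.005}

# Hypothetical repair intensities for each maintenance mode.
REPAIRS: Dict[str, Rates] = {
    "without repair": {},
    "perfect repair": {(1, 4): 0.05},                  # fault -> perfect
    "imperfect repair": {(1, 3): 0.03, (1, 2): 0.02},  # fault -> degraded state
    "CBM": {(3, 4): 0.05, (2, 4): 0.03, (2, 3): 0.02}, # degraded -> better state
}


def inter_slice_cpt(mode: str, dt: float = 1.0) -> List[List[float]]:
    """First-order CPT; rows and columns follow ORDER (perfect ... fault)."""
    rates = {**FAILURES, **REPAIRS[mode]}
    cpt = [[0.0] * len(ORDER) for _ in ORDER]
    for r, src in enumerate(ORDER):
        for c, dst in enumerate(ORDER):
            if src != dst:
                cpt[r][c] = rates.get((src, dst), 0.0) * dt
        cpt[r][r] = 1.0 - sum(cpt[r])  # remaining mass stays in the same state
    return cpt


for mode in REPAIRS:
    print(mode)
    for row in inter_slice_cpt(mode):
        print("   ", [round(x, 4) for x in row])
```

Without repair, the fault row is a unit vector, so the fault state is absorbing; each repair mode adds non-zero entries to the left of the diagonal.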

Conditional probability table
If there are $n$ parent nodes in a BN, and each parent node has $m$ states, then $m^n$ independent parameters are needed to determine the CPTs. This is a non-deterministic polynomial (NP) problem when the traditional OR-gate and AND-gate constructs are introduced for the series and parallel systems. Assume that there are $n$ parent nodes $X_1, X_2, \ldots, X_n$ for node $Y$, and that the degradation probability of node $j$ is $f_j$; then the unreliability for an OR-gate can be calculated as
$$F_{\mathrm{OR}} = 1 - \prod_{j=1}^{n} (1 - f_j).$$
Similarly, the unreliability for an AND-gate can be expressed as
$$F_{\mathrm{AND}} = \prod_{j=1}^{n} f_j.$$
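The two gate formulas and the parameter count can be written directly in Python. This is a minimal sketch that assumes independent parent nodes; the function names and the numerical values are illustrative.

```python
from math import prod
from typing import Sequence


def or_gate_unreliability(f: Sequence[float]) -> float:
    """Series (OR-gate) structure: fails unless every parent survives."""
    return 1.0 - prod(1.0 - fj for fj in f)


def and_gate_unreliability(f: Sequence[float]) -> float:
    """Parallel (AND-gate) structure: fails only if every parent fails."""
    return prod(f)


def cpt_parameter_count(m: int, n: int) -> int:
    """Number of parent-state combinations for n parents with m states each."""
    return m ** n


degradation = [0.1, 0.2]  # hypothetical degradation probabilities f_1, f_2
print("OR-gate unreliability: ", or_gate_unreliability(degradation))
print("AND-gate unreliability:", and_gate_unreliability(degradation))
print("combinations for 2 four-state parents:", cpt_parameter_count(4, 2))
```

The count grows exponentially with the number of parents, which is why closed-form gate expressions are attractive compared with eliciting a full table.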

Case study

Dynamic fault tree modelling for a control unit
A control unit from a vibrator consisting of many electric and mechanical elements is complex and has different kinds of failure modes. In operating conditions, this control unit suffers from various environmental stresses and degrades gradually. A DFT model of the control unit is built for the case of its power in failure model, as shown in figure 6. The top event, power in failure model, is caused by three intermediate events: sys1, sys2 and sys3. Event sys1 contains an AND-gate with elements E1 and E2. Event sys2 contains an OR-gate with elements E3, E4, E5 and E6. In addition, event sys3 contains a hot spare gate with elements E7 and E8.
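The gate structure described above can be evaluated at a single time instant with a few lines of Python. Two points in this sketch are assumptions: the text does not state the type of the top gate, so an OR-gate over sys1, sys2 and sys3 is assumed here, and the hot spare gate is treated as failing only when both E7 and E8 have failed, which holds for independent elements because a hot spare runs alongside the primary. The fault probabilities are hypothetical.

```python
from math import prod
from typing import Dict, Iterable


def and_gate(probs: Iterable[float]) -> float:
    """Output fails only if all independent inputs have failed."""
    return prod(probs)


def or_gate(probs: Iterable[float]) -> float:
    """Output fails if at least one independent input has failed."""
    return 1.0 - prod(1.0 - p for p in probs)


def top_event_probability(fault: Dict[str, float]) -> float:
    """Probability of the top event from element fault probabilities."""
    sys1 = and_gate([fault["E1"], fault["E2"]])
    sys2 = or_gate([fault[e] for e in ("E3", "E4", "E5", "E6")])
    # Hot spare: the gate output fails once both the primary and the spare failed.
    sys3 = and_gate([fault["E7"], fault["E8"]])
    # ASSUMPTION: the top event is an OR over the three intermediate events.
    return or_gate([sys1, sys2, sys3])


# Hypothetical fault probabilities of E1..E8 at one time instant.
fault_probs = {f"E{j}": 0.1 for j in range(1, 9)}
print("top event probability:", top_event_probability(fault_probs))
```

With equal element probabilities the OR-gate branch sys2 dominates the result, which is consistent with series-connected elements being the weak part of such a structure.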

Dynamic Bayesian network modelling for a control unit
By referring to the recorded data and consulting the domain experts, the failure rates, repair rates and degradation probabilities of the elements in the control unit are obtained and are shown in table 8. Figure 7 depicts a DBN model of the control unit that was built using the algorithm to convert static and dynamic logic gates in the (dynamic) fault tree into a DBN. With the parameters provided in

Model validation and reliability evaluation
Degradation, including minor failures and major failures, can occur at any time. Let us take element E1 in the control unit as an example. To obtain the state probabilities for the Markov process in figure 9a, the differential equations in equation (5.1) are established according to equation (2.1):
$$\begin{aligned}
\frac{dp_4(t)}{dt} &= -(\lambda_{4,3} + \lambda_{4,2} + \lambda_{4,1})\, p_4(t),\\
\frac{dp_3(t)}{dt} &= \lambda_{4,3}\, p_4(t) - (\lambda_{3,2} + \lambda_{3,1})\, p_3(t),\\
\frac{dp_2(t)}{dt} &= \lambda_{4,2}\, p_4(t) + \lambda_{3,2}\, p_3(t) - \lambda_{2,1}\, p_2(t),\\
\frac{dp_1(t)}{dt} &= \lambda_{4,1}\, p_4(t) + \lambda_{3,1}\, p_3(t) + \lambda_{2,1}\, p_2(t).
\end{aligned} \tag{5.1}$$
The state probability curves are drawn in figure 10a. As time increases, the probability of the perfect state drops from 1 to approximately 0 in approximately 1000 weeks. Although the probabilities of the useful state and the pseudo-fault state increase for a period, the fault state gradually captures the greatest proportion. In figure 11, the DBN model for element E1 at different time slices is described by a relatively simple representation, with a node at time slice $t_0$ and a node at time slice $t$. The repair mode 'without repair' can be denoted by using the corresponding transition densities. The state probability curves obtained from the DBN overlapped completely with the curves determined by the Markov process, which verifies the accuracy of our model.
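The kind of cross-check described above, DBN propagation against the Markov solution, can be reproduced in a few lines of Python. The sketch builds the generator matrix of the four-state model without repair, computes a one-week inter-slice CPT as a truncated Taylor series of the matrix exponential, propagates the DBN slice by slice and compares the result with closed-form solutions of the first two state equations. The intensities are hypothetical, and the use of the matrix exponential for the CPT is an assumption of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]

# Hypothetical degradation intensities (per week) for element E1.
LAMBDA = {(4, 3): 0.004, (4, 2): 0.002, (4, 1): 0.001,
          (3, 2): 0.003, (3, 1): 0.001, (2, 1): 0.005}
ORDER = [4, 3, 2, 1]  # perfect, useful, pseudo-fault, fault


def generator() -> Matrix:
    """Generator matrix Q with rows and columns ordered as in ORDER."""
    q = [[0.0] * 4 for _ in ORDER]
    for r, src in enumerate(ORDER):
        for c, dst in enumerate(ORDER):
            if src != dst:
                q[r][c] = LAMBDA.get((src, dst), 0.0)
        q[r][r] = -sum(q[r])  # diagonal is minus the total outflow
    return q


def matmul(a: Matrix, b: Matrix) -> Matrix:
    n = len(a)
    return [[sum(a[i][m] * b[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def expm(q: Matrix, dt: float, terms: int = 20) -> Matrix:
    """exp(Q * dt) by a truncated Taylor series (adequate for small Q * dt)."""
    n = len(q)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for j in range(1, terms + 1):
        term = matmul(term, [[x * dt / j for x in row] for row in q])
        result = [[result[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return result


def dbn_probabilities(weeks: int) -> List[float]:
    """Propagate the DBN through one-week slices from the perfect state."""
    cpt = expm(generator(), 1.0)
    p = [1.0, 0.0, 0.0, 0.0]
    for _ in range(weeks):
        p = [sum(p[r] * cpt[r][c] for r in range(4)) for c in range(4)]
    return p


t = 200
p_dbn = dbn_probabilities(t)
a = sum(LAMBDA[(4, e)] for e in (3, 2, 1))  # total outflow of the perfect state
b = LAMBDA[(3, 2)] + LAMBDA[(3, 1)]         # total outflow of the useful state
p4_exact = math.exp(-a * t)
p3_exact = LAMBDA[(4, 3)] / (a - b) * (math.exp(-b * t) - math.exp(-a * t))
print("DBN:   ", p_dbn[:2])
print("Markov:", [p4_exact, p3_exact])
```

Agreement here is expected by construction, because the CPT is the exact one-step transition matrix of the same Markov process; a first-order CPT would instead show a small discretization error.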
By referring to the DBN, the reliability values of the control unit under different repair modes can be obtained, as shown in figure 12. For a control unit without the absorbing set, CBM maintains a high reliability level of approximately 0.9947 at approximately week 1000. Compared with the imperfect repair, the perfect repair has a relatively higher reliability value. However, the imperfect repair does not affect the performance of the control unit significantly, and in practice the perfect repair will not always be attainable. For a control unit with the absorbing set, the reliability value will continue to decrease until a replacement occurs. Compared with the absorbing set {pseudo-fault, fault}, the reliability value of the control unit without repair is higher. The reliability curve of the control unit with the absorbing set {fault} more suitably reflects the degradation of elements. At approximately the 200th week, the control unit reliability remains above 0.8, and a replacement of badly degraded elements will improve its reliability significantly.
The universal generating function (UGF), another widely used reliability analysis method for an MSS, has been applied to verify the DBN model of the control unit. More details regarding the UGF are available elsewhere [12,33,34]. On the basis of the Markov processes, the performance distributions of all the elements in the control unit can be determined in polynomial form. By constructing the overall model of the control unit considering its logic gates, the performance distributions of the entire MSS under the desired demand performance level are obtained through like-terms collection and a recursive procedure; these distributions overlapped completely with the results in figure 12. Compared with the UGF, the application of a DBN requires considerably less calculation and provides a more impressive result.
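The like-terms collection used by the UGF method can be illustrated with a short Python sketch. Each element is represented by a polynomial stored as a mapping from performance rate to probability, and two elements are combined by applying a structure operator to every pair of terms. The performance rates, probabilities, the choice of `min` and `max` as structure operators and the acceptability rule (performance not below demand) are assumptions made for this illustration.

```python
from typing import Callable, Dict

UGF = Dict[float, float]  # performance rate -> probability


def compose(u1: UGF, u2: UGF, op: Callable[[float, float], float]) -> UGF:
    """Combine two u-functions term by term and collect like terms."""
    out: UGF = {}
    for g1, p1 in u1.items():
        for g2, p2 in u2.items():
            g = op(g1, g2)
            out[g] = out.get(g, 0.0) + p1 * p2
    return out


def reliability(u: UGF, demand: float) -> float:
    """Probability that the performance rate is not below the demand."""
    return sum(p for g, p in u.items() if g >= demand)


# Hypothetical performance distribution of one element at a fixed time.
element: UGF = {0.0: 0.1, 50.0: 0.3, 100.0: 0.6}

series = compose(element, element, min)    # weakest element limits a series pair
parallel = compose(element, element, max)  # best element serves a redundant pair

print("series  :", series, "R(50) =", reliability(series, 50.0))
print("parallel:", parallel, "R(50) =", reliability(parallel, 50.0))
```

A larger system is handled recursively by composing the result with the next element or subsystem, which is where the amount of calculation grows relative to a DBN query.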

Importance analysis of the control unit
The relative weights of the elements in the control unit, computed by using mutual information, reflect their contribution to the system performance, as shown in figure 13. For the control unit without repair, with imperfect repair, with perfect repair or with an absorbing set {fault}, the nodes E3, E4, E5 and E6 contribute appreciably to the top event. Among them, node E4 holds the greatest relative weight because it has a relatively higher failure rate. For the control unit under CBM, the repair occurs whenever a failure or degradation occurs. To maintain a stable level of high reliability, every element in the system is important. Because the failure rate of node E3 is the lowest among the eight elements, its relative weight is lower than that of the others.
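Mutual information between an element node and the top event can be computed from the element's state distribution and the conditional distribution of the top event given that state. The Python sketch below shows the calculation in bits; the distributions are hypothetical, and how the paper normalizes the values into relative weights is not specified in this section, so no normalization is attempted.

```python
import math
from typing import List


def mutual_information(prior: List[float], cpt: List[List[float]]) -> float:
    """I(X; Y) in bits for P(X) = prior and P(Y = y | X = x) = cpt[x][y]."""
    n_y = len(cpt[0])
    p_y = [sum(prior[x] * cpt[x][y] for x in range(len(prior)))
           for y in range(n_y)]
    mi = 0.0
    for x, p_x in enumerate(prior):
        for y in range(n_y):
            joint = p_x * cpt[x][y]
            if joint > 0.0:  # terms with zero joint probability contribute nothing
                mi += joint * math.log2(joint / (p_x * p_y[y]))
    return mi


# Hypothetical element distribution over (perfect, useful, pseudo-fault, fault)
# and hypothetical P(top event = (normal, fault) | element state).
element_prior = [0.70, 0.15, 0.10, 0.05]
top_given_element = [
    [0.99, 0.01],
    [0.95, 0.05],
    [0.80, 0.20],
    [0.10, 0.90],
]
print("I(element; top) =", mutual_information(element_prior, top_given_element), "bits")
```

An element whose state leaves the top-event distribution unchanged has zero mutual information, whereas an element that fully determines a balanced binary top event reaches one bit.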

Conclusion
In this paper, a method of modelling an MSS using the Markov process and a DBN is proposed, taking perfect repair, imperfect repair and CBM into account. The reliability parameter can be obtained by fusing the same parameters of elements with multi-states, and it can be predicted easily from the