Dependable System: Dependability

Saturday, December 15, 2012

Dual-graph Model for Error Propagation Analysis of Mechatronic Systems

The electronic version of my PhD thesis is available free of charge.

You can also purchase a hard copy on Amazon, on the publisher's web page, or on Google Books ;)

Fast abstract:

Error propagation analysis is an important part of a system development process. This thesis addresses a probabilistic description of the spreading of data errors through a mechatronic system. An error propagation model for these types of systems must use a high abstraction layer that allows the proper mapping of the mutual interaction of heterogeneous system components such as software, hardware, and physical parts.

A literature overview reveals that the most appropriate existing error propagation model is based on a Markovian representation of control flow. However, despite its strong probabilistic background, this model has a significant disadvantage: it implies that data errors always propagate through the control flow. This assumption limits the applicability of the model for systems in which components can be triggered in arbitrary order and the data flow is non-sequential.

A motivational example, discussed in this thesis, shows that control and data flows must be considered separately for an accurate description of an error propagation process. For this reason, a new concept of system analysis is introduced. The central idea is a synchronous examination of two directed graphs: a control flow graph and a data flow graph. The structures of these graphs can be derived systematically during system development. Knowledge of the operational profile and of the properties of individual system components allows the definition of additional parameters of the error propagation model.

A discrete-time Markov chain is applied to model fault activation, error propagation, and error detection during operation of the system. A state graph of this Markov chain can be generated automatically from the discussed dual-graph representation. A specific approach to the computation of this Markov chain makes it possible to obtain the probabilities of all erroneous and error-free system execution scenarios. This information plays a valuable role in the development of dependable systems. For instance, it can help to define an effective testing strategy, to perform accurate reliability estimation, and to speed up error detection and fault localization.
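To give a feeling for what "computing the probabilities of execution scenarios" can mean, here is a minimal sketch of a tiny absorbing discrete-time Markov chain. It is not the model from the thesis: the five states, their names, and all transition probabilities are assumptions invented for illustration. The sketch simply pushes the probability mass through the chain until almost all of it has reached one of the three end states.

```python
# Hypothetical 5-state DTMC. States and probabilities are illustrative
# assumptions, not values taken from the thesis.
STATES = ["exec_ok", "exec_err", "end_ok", "end_detected", "end_failure"]
TRANSIENT = ["exec_ok", "exec_err"]

# P[src][dst] = one-step transition probability; every row sums to 1.
P = {
    "exec_ok":      {"exec_ok": 0.90, "exec_err": 0.02, "end_ok": 0.08},
    "exec_err":     {"exec_err": 0.50, "end_detected": 0.30, "end_failure": 0.20},
    "end_ok":       {"end_ok": 1.0},         # absorbing: error-free termination
    "end_detected": {"end_detected": 1.0},   # absorbing: error was detected
    "end_failure":  {"end_failure": 1.0},    # absorbing: error escaped
}


def step(dist):
    """Advance the state distribution by one time step."""
    nxt = {s: 0.0 for s in STATES}
    for src, mass in dist.items():
        for dst, p in P[src].items():
            nxt[dst] += mass * p
    return nxt


def absorption_probabilities(start, tol=1e-12, max_steps=100_000):
    """Iterate until the mass left in transient states drops below tol."""
    dist = {s: 0.0 for s in STATES}
    dist[start] = 1.0
    for _ in range(max_steps):
        dist = step(dist)
        if sum(dist[s] for s in TRANSIENT) < tol:
            break
    return dist


result = absorption_probabilities("exec_ok")
for state in ("end_ok", "end_detected", "end_failure"):
    print(f"{state}: {result[state]:.4f}")
```

With these made-up numbers the chain ends error-free with probability 0.8, with a detected error with probability 0.12, and with an undetected failure with probability 0.08. For larger chains one would normally solve the corresponding linear system instead of iterating.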

This thesis contains a comprehensive description of a mathematical framework of the new dual-graph error propagation model, several methods for error propagation analysis, and a case study that demonstrates key features of the application of the presented error propagation model to a typical mechatronic system. A numerical evaluation of the mechatronic system in question proves applicability of the introduced concept.

Thursday, May 24, 2012

Flip happens :(

Hello dear DS readers,

Let me start this post with a sad story about the recent Russian space mission Phobos-Grunt.

"16 February 2012—The failure of Russia’s ambitious Phobos-Grunt sample-return probe has been shrouded in confusion and mystery, from the first inklings that something had gone wrong after its 9 November launch all the way to inconsistent reports of where it fell to Earth on 15 January." You can find more detailed information here. Image by Michael Carroll.

According to the official report of Roscosmos, the most likely cause of this failure was an SRAM fault caused by "a local influence of heavy charged particles", aka galactic cosmic rays.

This is a particular case of a well-known hardware fault, the so-called "bit-flip".

A negative environmental impact such as rising heat, falling voltage, or cosmic radiation (as in the case of Phobos-Grunt) can corrupt a part of the system's memory. This can result in one or several bit-flips, as shown in the figure. A bit-flip may change the application state, for instance, the value of a critical variable. Later, during the execution of some software function, this erroneous value can be read and propagated further as a system error. Such an error may lead to various unintended consequences. Similar hardware failures can happen not only in memory but also in the CPU or on a bus.
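A small sketch makes the effect of a single bit-flip concrete. The helper names (flip_bit, flip_float_bit) and the variable speed_limit are made up for this illustration; the flip is simulated with XOR, since we obviously cannot call a cosmic ray from Python.

```python
import struct


def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit inverted (a simulated bit-flip)."""
    return value ^ (1 << bit)


def flip_float_bit(x: float, bit: int) -> float:
    """Invert one bit in the 64-bit IEEE 754 representation of x."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out


speed_limit = 100                      # hypothetical critical variable
corrupted = flip_bit(speed_limit, 6)   # bit 6 has weight 64

print(f"{speed_limit:08b} -> {corrupted:08b} ({speed_limit} -> {corrupted})")
print(flip_float_bit(1.0, 63))         # sign bit flipped
print(flip_float_bit(1.0, 62))         # highest exponent bit flipped
```

One flipped bit turns 100 into 36, turns 1.0 into -1.0 when it hits the sign bit, and turns 1.0 into infinity when it hits the highest exponent bit. How bad the damage is depends entirely on which bit is hit.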

A number of research projects address this problem. Roughly speaking, all of them can be classified into two groups: hardware-based and software-based. Heat- and radiation-protected hardware and memory/CPU/cache redundancy are typical hardware-based solutions. However, these approaches usually have a number of disadvantages such as high cost, limited markets, and extremely low performance.

The second group contains software-based approaches to bit-flip detection and masking. In my opinion, these solutions are much more advanced and interesting, and they fit the scope of the DS blog better. In the next post I plan to give an overview of existing methods and even tools to cope with bit-flips.
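As a taste of what "detection and masking" in software can look like, here is a minimal sketch of one well-known idea: keep three copies of a critical value and take a bitwise majority vote on every read. This is only an illustration under my own assumptions (the class TripleInt and the function vote are hypothetical names), not necessarily one of the methods covered in the follow-up post.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote: each result bit is the value held by >= 2 copies."""
    return (a & b) | (a & c) | (b & c)


class TripleInt:
    """Hypothetical triple-redundant integer variable."""

    def __init__(self, value: int):
        self.copies = [value, value, value]

    def read(self):
        """Return (voted value, whether a mismatch was detected)."""
        a, b, c = self.copies
        voted = vote(a, b, c)
        detected = not (a == b == c)
        self.copies = [voted, voted, voted]  # repair the corrupted copy
        return voted, detected


limit = TripleInt(100)
limit.copies[1] ^= 1 << 6        # simulate a bit-flip in one copy (100 -> 36)

first_read = limit.read()        # flip is detected and masked
second_read = limit.read()       # copies were repaired by the first read
print(first_read, second_read)
```

The vote masks any flip that hits only one of the three copies at a given bit position; two flips at the same position in different copies would win the vote and go unnoticed. The price is triple the memory and extra work on every read.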

Here, as a teaser, I want to share the following fantastic video created by my colleagues from the German R&D company Silistra.

Sunday, April 24, 2011

What Language Do We Speak?

Hello dear readers,

Here is my very first post on the Dependable System (DS) blog. It is devoted to the basic terminology of this research area. I've decided not to waste your time with useless discussion and to refer directly to a publication by one of the pioneers of dependability, Jean-Claude Laprie:

A. Avizienis, J.-C. Laprie and B. Randell: Fundamental Concepts of Dependability. Research Report No 1145, LAAS-CNRS, April 2001

This post gives a short summary of this article.

"Fundamental Concepts of Dependability" outlines the results of nearly 20 years of activity in this domain and related ones. The concepts and the taxonomy it introduces will be used in my further posts. The next figure (taken from the article) shows the so-called "dependability tree", which gives some intuition of what it is all about:

Dependability is a system characteristic like functionality, performance, or cost. The formal definition is as follows: "Dependability of a (computing) system is the ability to deliver service that can justifiably be trusted".

So, according to "the dependability tree", we can describe it from three points of view: attributes, means, and threats.

Attributes - a kind of sub-characteristics of dependability:

Availability: readiness for correct service
Reliability: continuity of correct service
Safety: absence of catastrophic consequences on the user(s) and the environment
Confidentiality: absence of unauthorized disclosure of information
Integrity: absence of improper system state alterations
Maintainability: ability to undergo repairs and modifications

Means - goals of dependability analysis:

Fault prevention: how to prevent the occurrence or introduction of faults;
Fault tolerance: how to deliver correct service in the presence of faults;
Fault removal: how to reduce the number or severity of faults;
Fault forecasting: how to estimate the present number, the future incidence, and the likely consequences of faults.

Threats - classification of threats:

A fault is a defect in the system that can be activated and become the cause of an error (a broken wire, an electrical short, a bug in a program).
An error is an incorrect internal state of the system, or a discrepancy between the intended behavior of a system and its actual behavior inside the system boundary.
A failure is an instance in time when a system displays behavior that is contrary to its specification.

I want to say a little more about the threats, because this concept is very interesting but not so obvious. Faults, errors, and failures are linked in the chain shown in the next figure:

Once a fault is activated, an error occurs. Examples of fault activation are the execution of a line of code with a bug, an attempt to send a signal via a corrupted connector, or the execution of a broken hardware part. An error may act in the same way as a fault: it can create further error conditions. An invalid state generated by the error may lead to another error or to a failure. It is important to note that failures are defined with respect to the system boundary. They are basically errors that have propagated out of the system and have become observable: if an error propagates outside the system boundary, a failure is said to occur.
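The chain is easier to see on a toy example. In the sketch below everything is hypothetical: buggy_total contains a deliberate bug (the fault), its wrong return value is the error, and the boolean returned by alarm_service is the service delivered at the system boundary. The function expected_alarm plays the role of the specification.

```python
def buggy_total(readings):
    """FAULT: the slice drops the last reading (an off-by-one bug)."""
    return sum(readings[:-1])


def alarm_service(readings, limit):
    """System boundary: the returned boolean is the delivered service."""
    total = buggy_total(readings)   # internal state; a wrong total is an ERROR
    return total > limit


def expected_alarm(readings, limit):
    """Specification of the service, used here only for comparison."""
    return sum(readings) > limit


cases = [
    ([5, 5, 0], 8),    # last reading is 0: total is still correct, no error
    ([5, 5, 1], 20),   # total is wrong (error), but the output is still correct
    ([5, 5, 1], 10),   # the wrong total changes the output: failure
]

for readings, limit in cases:
    error = buggy_total(readings) != sum(readings)
    failure = alarm_service(readings, limit) != expected_alarm(readings, limit)
    print(readings, limit, "error:", error, "failure:", failure)
```

The same fault is present in all three runs. In the first one it produces no error, in the second the error stays inside the system boundary, and only in the third does it become an observable failure.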

So, the general idea of this post: take a look at the papers of J.-C. Laprie to get the basic definitions and classification.