Reliable Software for Real-Time Mission Critical Systems
Chris Turiano
Mesa State College
Grand Junction, CO 81501

Some Quick Definitions
• Mission critical: items (data, software, etc.) determined to be vital to operational readiness or mission effectiveness in terms of both content and timeliness; they must be absolutely accurate and available on demand
• Failure: when a system does not conform to its specified behavior
• Fault: the internal or algorithmic cause of a failure
• Real-time system: a system that produces a time-constrained output into the physical world based on input from the same physical world
• Hard real-time system: every missed deadline is a system failure
• Soft real-time system: the system is tolerant of missed deadlines
• Reliability: a quantitative measure of how well a system conforms to its specified behavior

System Design
• System design (planning) is crucial to reliability
• Incorporate
  – High-end design specifications
    • Choices: hardware, software, IDE, OS
  – Fault prevention
    • Testing
    • Reviews, program analysis tools
  – Fault tolerance
    • Error handling
    • Error recovery

Design Specification
• Hardware
  – Mil-spec vs. cost
  – Single expensive unit vs. redundant cheap units
• Software
  – Language
    • Data abstraction
    • Modularity
    • Strongly typed
  – Strict choice of timing requirements
• Integrated Development Environments
  – Help prevent typographical errors and reduce complexity
• Operating System
  – Support for concurrency
  – Top-level error detection/recovery

Fault Prevention
• Begins during design specification
  – Appropriate specifications
• Fault avoidance techniques help prevent the introduction of faults into the system
  – Well-established techniques
    • Strongly typed languages
    • Data abstraction and modularity
  – Tools
    • Design reviews
    • Program-verification software
• Fault removal
  – Testing, testing, testing: the best method to improve reliability
  – Exhaustive system testing is impossible
    • Difficulty proving correctness
    • Adequate realism
    • Incorrect early assumptions
  – Unit testing vs. system testing

Fault Tolerance
• 3 levels
  – Full fault tolerance
    • All functionality in the presence of faults (perhaps for a limited time)
  – Graceful degradation
    • Less functionality until repair of the fault
  – Fail safe
    • Records a prior safe state so it can return to that state after shutdown
    • Allows regression
      – Data recovery
      – System restart may restore functionality
• Add-on components
  – To detect, handle and recover from faults
  – Redundancy vs. complexity
    • Balance redundancy against the new faults it can introduce and against physical restrictions
    • Separate the support components from the potentially faulty components

Fault Tolerance (cont)
• Dynamic redundancy
  – Error detection system: the "exception" block in Ada95
  – 4 stages
    • Error detection
    • Error diagnosis
    • Error recovery
    • Fault treatment
• Static redundancy
  – Masking errors by comparing the output of multiple components
  – An average or a majority-rule algorithm selects the correct value
  – n-version programming

Error Detection

    with Ada.Text_IO;
    with Ada.Exceptions; use Ada.Exceptions;

    package body SomeThing is
       procedure SomeProc (param : out Integer) is
          ProcException : exception;
       begin
          <the internal workings of SomeProc>
       exception
          when ProcException =>
             -- the locally declared exception gets a specific handler
             ProcExceptionHandler (param);
          when anError : others =>
             -- anything else is reported
             Ada.Text_IO.Put_Line (Exception_Information (anError));
       end SomeProc;
    end SomeThing;

  – Levels of abstraction
  – Propagation
  – Error detection occurs everywhere, but diagnosis/recovery occur where the specific error is best handled

Error Detection Shortcomings
• Stack-corrupting errors
  – No pointers remain to the error handling code nor to the parent
• Infinite loop
  – The error never fires and the infinite loop continues
• Necessitates other mechanisms
• Timeliness of interaction
  – Detecting deadline errors
  – Handles infinite loop and stack corruption errors
• RTOS scheduling
  – Inherent part of the OS
• Watchdog Scheduler
  – Analogy

Diagnosis and Recovery
• Error diagnosis
  – Preventing the spread of erroneous data
  – Examining components where the error could have spread
• Object-oriented design naturally limits the interactivity between components
  – Access from outside the object is well defined
  – Object hides its data
  – All access to shared resources is indivisible
• Error recovery
  – Roll participating threads to a known point
    • Forward error recovery: predetermined checkpoints are written into the code
    • Backward error recovery: a "good" state is saved as threads progress
  – Infinite loop
    • Threads enter an atomic action and each creates a recovery point; then an error occurs, rolling all threads back to the same state they just left

A Little Real-Time System in Ada95
Imagine a satellite with an array of solar cells that extends or retracts based on the amperage of the battery it recharges.
• User's keyboard controller: a task accepting input to alter states
• Amperage Sensor: a task which senses amps
• System Controller: a task which decides to extend or retract the array
• Watchdog Scheduler: a task which handles problems with the timeliness of extension or retraction of the array
• Motor Controller: a task which extends or retracts the array using a protected object

Ada 95's Task

    TASK SensorTask IS
       PRAGMA Priority(SensorPriority);
    END SensorTask;

    TASK BODY SensorTask IS
       T      : Ada.Real_Time.Time;
       MyWait : Ada.Real_Time.Time_Span := SensorPeriod;
       Sensor : CurrentSensor.Object;
       Span   : Time_Span := Milliseconds(5000);
       eid    : EventID;  -- handle of the registered watchdog event
    BEGIN
       theWindows.WindowInitSensor;
       T := Ada.Real_Time.Clock + MyWait;
       LOOP
          -- Register a deadline with the watchdog before reading the sensor
          eid := TheWatchDogScheduler.AddTimerEvent(
                    TimerEvent.MakeEvent(newPid    => Current_Task,
                                         newID     => SensorID,
                                         newSpan   => Span,
                                         newErrHan => ErrorHandler.Object(TheSensorErrorHandler)));
          TheSensor.SetAmpsRandom;
          -- The reading finished in time, so cancel the deadline
          TheWatchDogScheduler.RemoveTimerEvent(eid);
          DELAY UNTIL T;
          T := T + MyWait;
       END LOOP;
    EXCEPTION
       WHEN AnError : OTHERS =>
          OutPoint := (x => 1, y => 7);
          theWindows.WriteString(3, OutPoint, Exception_Information(AnError));
    END SensorTask;

Ada 95's Design History
Built-in Fault Prevention
• Built-in mechanisms to aid reliability
  – Very strongly typed
  – Modular and object-oriented
  – Exception mechanism
    • Raise <exception name>
  – Delays
  – Tasks and protected objects
    • Readers/writer (analogy)
    • Entry
      – Queues
      – Barriers/Guards
    • Accept
    • Select…
      – When <condition>
      – Or…Delay/Abort/Terminate
      – Else…
    • Requeue
    • Priority

Ada 95's Error Detection

    PROCEDURE InitWindow (newWindow : IN OUT Window.Object) IS
    BEGIN
       <statements of procedure removed due to space limitations>
    EXCEPTION
       WHEN WindowOutofBoundsX =>
          Movecursor(2, 23);
          Put_Line(Item => "The window width is too large: recalling InitWindow with smaller box.");
          -- Shrink the width by one and retry
          newWindow.deltaX := newWindow.deltaX - 1;
          InitWindow(newWindow => newWindow);
       WHEN WindowOutofBoundsY =>
          Movecursor(2, 23);
          Put_Line(Item => "The window height is too large: recalling InitWindow with smaller box.");
          -- Shrink the height by one and retry
          newWindow.deltaY := newWindow.deltaY - 1;
          InitWindow(newWindow => newWindow);
       WHEN Another : OTHERS =>
          Movecursor(2, 23);
          Put_Line(Item => Exception_Information(Another));
    END InitWindow;

Ada 95's Error Recovery
Watchdog Scheduler

    TASK BODY WatchDog IS
    BEGIN
       LOOP
          TheWatchDogScheduler.GetTopItem(TE, Exists);
          IF Exists THEN
             MyTime := Clock;
             -- Run the recovery procedure once the event's deadline has passed
             IF TE.ExpTime <= MyTime THEN
                CASE TE.ID IS
                   WHEN 1 =>
                      TheWatchDogScheduler.SolarArrayErrorRecoveryProcedure(TE);
                   WHEN 2 =>
                      TheWatchDogScheduler.SystemControllerErrorRecoveryProcedure(TE);
                   WHEN OTHERS =>
                      TheWatchDogScheduler.CurrentSensorRecoveryProcedure(TE);
                END CASE;
             END IF;
          END IF;
          DELAY 0.1;
       END LOOP;
    END WatchDog;

Ada 95's Error Recovery (cont)
• Select…
  – Requeue/Delay
  – Abort/Terminate
• Exception handling

Wrap-up
• Reliability begins at the design specification stage
  – Choices: hardware, programming languages, timing requirements
  – Incorporating
    • Correctness proofs
    • Design reviews
    • Extensive testing
  – Planning
    • Fault prevention
    • Fault tolerance
      – Error handling
      – Error recovery
• Use time-tested methodologies
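
Supplementary sketch: static redundancy by majority rule
The static-redundancy slide mentions masking errors with a majority-rule algorithm over the outputs of multiple components. A minimal sketch of that idea in Ada 95 follows. It is not taken from the original slides: Majority_Vote_Demo, Reading_Array, Majority and No_Majority are hypothetical names, and the sample readings are made up for illustration.

```ada
with Ada.Text_IO;

procedure Majority_Vote_Demo is

   --  All names below are hypothetical; they do not come from the talk.
   type Reading_Array is array (Positive range <>) of Integer;

   No_Majority : exception;

   --  Returns the value reported by more than half of the redundant
   --  components; raises No_Majority when no such value exists.
   function Majority (Readings : Reading_Array) return Integer is
      Count : Natural;
   begin
      for I in Readings'Range loop
         --  Count how many components agree with component I.
         Count := 0;
         for J in Readings'Range loop
            if Readings (J) = Readings (I) then
               Count := Count + 1;
            end if;
         end loop;
         if 2 * Count > Readings'Length then
            return Readings (I);
         end if;
      end loop;
      raise No_Majority;
   end Majority;

begin
   --  Three versions computed the same quantity; one of them is faulty,
   --  and the vote masks it.
   Ada.Text_IO.Put_Line (Integer'Image (Majority ((12, 12, 47))));

   --  All versions disagree, so the vote cannot mask the error.
   Ada.Text_IO.Put_Line (Integer'Image (Majority ((1, 2, 3))));
exception
   when No_Majority =>
      Ada.Text_IO.Put_Line ("No majority: the error cannot be masked");
end Majority_Vote_Demo;
```

The voter is kept separate from the components it compares, in line with the advice to separate support components from potentially faulty ones. Raising an exception when no majority exists hands the problem to the dynamic-redundancy path (detection, diagnosis, recovery) rather than guessing a value.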