System Continuity Model
• Consider your operation as a “system” of interacting components that produce services:
– An urban system: city operations, agencies, services, etc.
– An enterprise system: businesses, business units, functions, etc.
– A societal system: people, social groups, activities, etc.
• Consider the kind of response models required to address various kinds of unplanned events:
– Proactive or preventative responses
– Reactive and recovery responses
– Combinations of these
• Challenge: embed these responses into your system model
– Within services, systems, personnel, information, processes, etc.
– Understanding of threats & risks endangering your system
– Collaborative preparedness & response
– Self-healing to disruption where possible
– Loss of life and property damage is prevented or minimized
Event Mechanics
• Unexpected events WILL ALWAYS occur
– Fires, floods, storms
– Security breaches
– Insolvency
– Mishaps
• Disruptions are mainly due to human error
– Mistakes & oversights
– Unpreparedness
– Improper response
• Human error is the ultimate root cause of any disruption!
Event Model
[Figure: event timeline running from t-1 to t+N. Labels: causal events, event, impact, response options; event is monitored, captured & analyzed; event is reported; response; event is evaluated; further response if necessary.]
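The stages named on the Event Model timeline can be sketched as a simple event record that only moves forward in time. This is a minimal illustration, not the presenter's model: the strict ordering of the stages, the class names, and the sample event are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Stage names follow the labels on the Event Model slide;
    # their strict ordering here is an assumption.
    MONITORED = 1
    CAPTURED_AND_ANALYZED = 2
    REPORTED = 3
    RESPONSE = 4
    EVALUATED = 5
    FURTHER_RESPONSE = 6


@dataclass
class EventRecord:
    name: str  # hypothetical event label
    history: list = field(default_factory=list)

    def advance(self, stage: Stage) -> None:
        """Record a stage, refusing to move backwards in the timeline."""
        if self.history and stage.value <= self.history[-1].value:
            raise ValueError(
                f"{stage.name} cannot follow {self.history[-1].name}"
            )
        self.history.append(stage)

    def awaiting_follow_up_decision(self) -> bool:
        """True once the event was evaluated and no follow-up is logged yet."""
        return bool(self.history) and self.history[-1] is Stage.EVALUATED


event = EventRecord("server room flood")
for stage in (Stage.MONITORED, Stage.CAPTURED_AND_ANALYZED, Stage.REPORTED,
              Stage.RESPONSE, Stage.EVALUATED):
    event.advance(stage)

print([s.name for s in event.history])
print(event.awaiting_follow_up_decision())
```

Keeping the history as an ordered list makes it easy to see, after the fact, which stage an event reached and whether a further response was ever recorded.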
Event Mechanics
[Figure: event mechanics. Labels: loss, detection, containment, event(s)/hidden effects, failover, slowdown/outage, recovery/repair, contingency, resumption.]
System Continuity Paradigm
[Figure: system continuity paradigm. Stages shown: detection, event, classification, deterrence, containment, notification, response, remediation, recovery, preparedness.]
Continuum of Recovery Activities
[Figure: continuum of recovery activities following an event. Axes: time, scale. Labels: failover; recovery of functions A, B, C, and D; operation recovery; resumption.]
Faults or Failures
• Faults lead to disruption, individually or collectively
– Can be a single event or condition, or a set of them
• Usually a single cause is not enough to create a disaster:
– It almost always takes multiple failures & mistakes to reach fruition
• Usually a combination of unanticipated factors:
– Multiple chains of failure seem improbable as they get complex
– As a system gets more complex and tightly coupled, more combinations of faults can lead to failure
– Key: little leaks can foretell bigger problems
• Examples:
– Failure begins when one weak point begins linking with others
– An accident occurring after a precursor incident
• They will ALWAYS happen
– The key is to prevent disruption
– Taking action in early stages breaks the chain of causation
– Barriers can trap a disturbance and keep it from leading to a disaster
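Two of the points above lend themselves to a small worked calculation: a disaster usually needs several barriers to fail together, and the number of possible fault combinations grows with system size. The sketch below assumes the barriers fail independently, which real systems often violate, and all probabilities and component counts are hypothetical.

```python
import math


def chain_probability(barrier_failure_probs):
    """Probability that every barrier in a chain fails, assuming independence."""
    return math.prod(barrier_failure_probs)


def fault_combinations(n_components, k_faults):
    """Number of distinct ways k faults can coincide among n components."""
    return math.comb(n_components, k_faults)


# Hypothetical figures: three barriers, each failing 10% of the time.
three_barriers = chain_probability([0.1, 0.1, 0.1])
# Adding a fourth barrier of the same quality.
four_barriers = chain_probability([0.1, 0.1, 0.1, 0.1])
print(three_barriers, four_barriers)

# Pairs of coinciding faults grow quickly with system size.
print(fault_combinations(10, 2), fault_combinations(100, 2))
```

Under these assumptions each added barrier cuts the chance of a complete chain of failure by a factor of ten, while a tenfold increase in components raises the number of possible fault pairs from 45 to 4950.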
Types of Faults
• Single Points of Failure (SPOFs)
• Blind spots
• Trojan effects
– Flawed system/process experiencing unusual circumstances
• Errors or mistakes
– Oversights or neglect
– Tendency to jump to conclusions in crisis situations:
• Developing a theory and sticking to it (right or wrong)
• Misjudgment
• Keeping a narrow focus
– Types:
• Simplex, self-healing, intermittent, rolling
[Figure: simplex error, self-healing error, intermittent error, rolling error]
Single Points of Failure
-Serial Path-
[Figure: serial path with a single point of failure causing an outage]
• An isolated element that, upon failure, will disrupt service
• “Weakest link”
• Serial paths can appear at any level:
– logical
– physical
– processes
– tasks
• If you cannot recover a process while in productive service, add redundancy
Single Points of Failure
-Redundant Path-
[Figure: redundant path; labels: outage, failover]
• A redundant solution must:
– Eliminate single point of failure
– Have no single point of failure
– Have an adequate failover process
– Provide equivalent level of service
– Be diverse, where possible
• Beware of false redundancy!
[Figure label: redundant, diverse parallel path]
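The difference between a serial path and a redundant path can be illustrated with a rough availability calculation, along with a check for false redundancy. This sketch assumes elements fail independently and that failover always works, and the element names and availability figures are hypothetical.

```python
import math


def serial_availability(availabilities):
    """A serial path is up only when every element is up."""
    return math.prod(availabilities)


def parallel_availability(path_availabilities):
    """Redundant paths are down only when every path is down."""
    return 1.0 - math.prod(1.0 - a for a in path_availabilities)


def shared_elements(path_a, path_b):
    """Elements common to both paths; any hit means the redundancy is false."""
    return set(path_a) & set(path_b)


# Hypothetical element names and availability figures.
primary = {"router-1": 0.99, "link-a": 0.98, "server-1": 0.99}
backup = {"router-2": 0.99, "link-b": 0.98, "server-2": 0.99}

single = serial_availability(primary.values())
redundant = parallel_availability(
    [single, serial_availability(backup.values())]
)
print(f"single path: {single:.4f}, redundant paths: {redundant:.4f}")

# A backup that reuses link-a is not truly diverse.
false_backup = ["router-2", "link-a", "server-2"]
print(shared_elements(primary, false_backup))
```

The shared-element check is the simplest form of a diversity test: if any element appears in both paths, that element remains a single point of failure no matter how the availability arithmetic looks.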
Blind Spots
• Problems can happen in any system that has a blind spot
– You can’t see what’s going on or detect a problem
• Show up in various forms, such as
– A system whose behavior is hidden from