Disaster Recovery Policy
The RosettaHealth Disaster Recovery Policy establishes procedures to recover RosettaHealth following a disruption resulting from a disaster. This Disaster Recovery Policy is maintained by the RosettaHealth Security Officer and CTO.
The following objectives have been established for this plan:
- Maximize the effectiveness of contingency operations through an established plan that consists of the following phases:
  - Notification/Activation phase to detect and assess damage and to activate the plan;
  - Recovery phase to restore temporary IT operations and recover damage done to the original system;
  - Reconstitution phase to restore IT system processing capabilities to normal operations.
- Identify the activities, resources, and procedures needed to carry out RosettaHealth processing requirements during prolonged interruptions to normal operations.
- Identify and define the impact of interruptions to RosettaHealth systems.
- Assign responsibilities to designated personnel and provide guidance for recovering RosettaHealth during prolonged periods of interruption to normal operations.
- Ensure coordination with other RosettaHealth staff who will participate in the contingency planning strategies.
- Ensure coordination with external points of contact and vendors who will participate in the contingency planning strategies.
This RosettaHealth Disaster Recovery Policy has been developed as required under the Office of Management and Budget (OMB) Circular A-130, Management of Federal Information Resources, Appendix III, November 2000, and the Health Insurance Portability and Accountability Act (HIPAA) Final Security Rule, Section §164.308(a)(7), which requires the establishment and implementation of procedures for responding to events that damage systems containing electronic protected health information.
Examples of the types of disasters that would initiate this plan include natural disasters, political disturbances, man-made disasters, external human threats, and internal malicious activities.
RosettaHealth has defined two categories of systems from a disaster recovery perspective:
- Critical Systems. These systems host application servers and database servers or are required for the functioning of systems that host application servers and database servers. These systems, if unavailable, affect the integrity of data and must be restored, or have a process begun to restore them, immediately upon becoming unavailable.
- Non-critical Systems. These are all systems not considered critical by the definition above. These systems, while they may affect some secondary capabilities of the platform, do not prevent Critical Systems from functioning and being accessed appropriately. These systems are restored at a lower priority than Critical Systems.
Applicable Standards
Applicable Standards from the HITRUST Common Security Framework
- 12.c - Developing and Implementing Continuity Plans Including Information Security
Applicable Standards from the HIPAA Security Rule
- 164.308(a)(7)(i) - Contingency Plan
Line of Succession
The following order of succession is established to ensure that decision-making authority for the RosettaHealth Contingency Plan is uninterrupted. The Chief Technology Officer (CTO) is responsible for ensuring the safety of personnel and the execution of procedures documented within this RosettaHealth Contingency Plan. If the CTO is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the CEO shall function as that authority. Should the contingency plan need to be initiated, use the contact list below.
- Kevin Puscas, CTO: (301) 919-2978, kevin.puscas@rosettahealth.com
- Buff Colchagoff, CEO: (202) 345-0298, buff.colchagoff@rosettahealth.com
- Zach Hill, Operations Manager: (301) 518-5597, zach.hill@rosettahealth.com
Responsibilities
The RosettaHealth Tech Team is responsible for coordinating with ClearDATA in the recovery of the RosettaHealth production environment in AWS, including AWS services, network services, and all EC2 servers.
The RosettaHealth Tech Team is directly responsible for assuring all RosettaHealth Platform components are working. It is also responsible for testing redeployments and assessing damage to the environment.
Testing and Maintenance
An effective disaster response and recovery plan for the HealthBus platform is highly dependent on coordination between the RosettaHealth technical team and ClearDATA. Because the platform is built as a High Availability system of systems with heavy reliance on AWS services, this coordination is exercised on a continual basis as part of normal operations. In addition, the HealthBus platform is continually evolving and changing; the platform and its components are updated on a regular basis, often in coordination with ClearDATA. These technical activities are the same ones that would be called upon in a disaster response scenario.
Disaster Recovery Scenarios
HealthBus is built to be highly resilient, with components distributed across multiple data centers operated by AWS and Rackspace. This provides a level of high availability and resiliency to the production environment. However, there remains the possibility, however remote, that the loss of a data center or a platform-wide issue could cause the loss of critical platform capability.
Disaster Recovery Procedures
Notification and Activation Phase
This phase addresses the initial actions taken to detect and assess damage inflicted by a disruption to RosettaHealth. Based on the assessment of the event, the Recovery Phase may be activated by the Operations Manager.
The notification sequence is listed below:
- The first responder is to notify the Operations Manager. All known information must be relayed to the CTO/CEO.
- The Operations Manager is to notify team members and direct them to complete the assessment procedures to determine the extent of the service interruption and the estimated recovery time (a health-check sketch follows this list).
- The Operations Manager determines whether the event is adversely impacting multiple customers, either directly or indirectly, and begins the Recovery Phase.
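As a hedged illustration of the assessment step, a small script like the following could probe platform health endpoints to help size the interruption; the endpoint URLs are hypothetical placeholders, and the authoritative assessment remains the team procedure above:

```python
"""Hypothetical assessment aid: probe platform health endpoints to help estimate
the extent of a service interruption. Endpoint URLs are illustrative assumptions;
real component URLs live in operational runbooks."""
import urllib.error
import urllib.request

ENDPOINTS = {
    "portal": "https://portal.example.rosettahealth.net/health",  # hypothetical
    "api": "https://api.example.rosettahealth.net/health",        # hypothetical
}

def check(name: str, url: str, timeout: float = 5.0) -> str:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"{name}: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:   # reachable, but non-2xx response
        return f"{name}: HTTP {exc.code}"
    except OSError as exc:                  # DNS failure, timeout, connection refused
        return f"{name}: UNREACHABLE ({exc})"

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(check(name, url))
```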
Recovery Phase
This section provides procedures for recovering HealthBus operations. The goal is to restore the RosettaHealth Platform to an acceptable production state.
The tasks outlined below define the Action Plans for each scenario.
Scenario 1 - Platform-wide issue:
- Contact Partners and Customers affected.
- Compile a list of impacted platform components. These should include:
  - EC2 Instances
    - spring-boot components
    - HISP Components (James, tomcat, Dovecot)
    - mirth interface instances
    - nginx and haproxy web servers
    - php portal applications
    - StrongSwan VPN appliances
    - SFTP services
    - iptables
    - EC2 OS
  - AWS Services
    - Aurora MySQL Database
    - Athena Database reporting engine
    - Lambda functions
    - Step Functions
    - S3 storage
    - SFTP Services
    - MongoDB
    - AWS Networking (VPC, ALB, NLB, API Gateway, Security Rules)
- Determine if roll-back will address the impacted components.
| Component | Roll-back action |
|---|---|
| spring-boot components | Change the service to the last known good working version (multiple past versions should still be on the EC2). |
| HISP Components (James, tomcat, Dovecot, DNS) | Roll back the EC2 instance to the last known working version. |
| mirth interface instances | Roll back the EC2 instance to the last known working version. Note: this may require re-installing the last known working version of the spring-boot apps that also run on those EC2 servers. |
| nginx and haproxy web servers | Roll back the EC2 instance to the last known working version. |
| php portal applications | Roll back the EC2 instance to the last known working version. |
| StrongSwan VPN appliances | Roll back the EC2 instance to the last known working version. |
| SFTP services | Roll back the EC2 instance to the last known working version. |
| EC2 OS | Roll back the EC2 instance to the last known working version of the root EBS volume. |
| Lambda functions | Change the production alias to the previous version (a sketch follows the special considerations below). |
Special Considerations:
- Roll-back of EC2 instances will require working with the ClearDATA team to coordinate the roll-back and restoration.
- Depending on the instance, roll-back may require dismounting the /data volume and remounting the latest /data volume.
- For each rolled-back component, confirm by watching the live logs that the component is working as expected.
- Visually verify logging, security, monitoring and alerting functionality for components.
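As referenced in the roll-back table above, a minimal sketch of moving a Lambda production alias back one published version; the alias name "production", the region, and the example function name are assumptions, and the sketch presumes the alias points at a numbered published version (EC2/EBS roll-backs are coordinated with ClearDATA and are not shown here):

```python
"""Sketch of the Lambda row in the roll-back table: repoint the production alias
at the previous published version. Alias name, region, and function name are
assumptions."""
import boto3

lam = boto3.client("lambda", region_name="us-east-1")  # region is an assumption

def rollback_alias(function_name: str, alias: str = "production") -> str:
    """Point the alias at the version immediately before the one it uses now."""
    current = lam.get_alias(FunctionName=function_name, Name=alias)
    version = current["FunctionVersion"]
    if not version.isdigit() or int(version) <= 1:
        raise RuntimeError(f"cannot roll back alias {alias!r} from version {version!r}")
    previous = str(int(version) - 1)  # assumes sequentially published versions
    lam.update_alias(FunctionName=function_name, Name=alias, FunctionVersion=previous)
    return previous

# Example (hypothetical function name):
# rollback_alias("healthbus-router")
```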
Scenario 2 - AWS Data Center / Services interruption:
- Contact Partners and Customers affected.
- Compile a list of impacted platform components. These should include:
  - AWS Services
    - Aurora MySQL Database
    - Athena Database reporting engine
    - Lambda functions
    - Step Functions
    - S3 storage
    - SFTP Services
    - MongoDB
    - AWS Networking (VPC, ALB, NLB, API Gateway, Security Rules)
- Determine dependent platform components and functions that are impacted.
- For AWS Services, coordinate with ClearDATA for the status of issues from AWS. Recovery will depend on this evaluation.
- Once new AWS services have been established, verify that the RosettaHealth Platform components are running on the appropriate EC2 instances (a verification sketch follows this list).
- Test logging, security, monitoring and alerting functionality.
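A minimal verification sketch for the EC2 step above, assuming boto3 access to the production account: it flags instances whose AWS status checks are not healthy. Deeper component checks (services, live logs) still follow the roll-back and monitoring steps; the region is an assumption.

```python
"""Sketch: confirm EC2 instances report healthy status checks after AWS services
are restored. Region is an assumption; this complements, not replaces, component-
level checks."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def unhealthy_instances() -> list[str]:
    """Return instance IDs whose system or instance status checks are not 'ok'."""
    bad = []
    paginator = ec2.get_paginator("describe_instance_status")
    for page in paginator.paginate(IncludeAllInstances=True):
        for status in page["InstanceStatuses"]:
            system_ok = status["SystemStatus"]["Status"] == "ok"
            instance_ok = status["InstanceStatus"]["Status"] == "ok"
            if not (system_ok and instance_ok):
                bad.append(status["InstanceId"])
    return bad

if __name__ == "__main__":
    print("Instances needing attention:", unhealthy_instances() or "none")
```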
Scenario 3 - Security Incident / Virus or Malware Detected:
- Compile a list of impacted platform components. These should include:
  - AWS Services
    - Aurora MySQL Database
    - Athena Database reporting engine
    - Lambda functions
    - Step Functions
    - S3 storage
    - SFTP Services
    - MongoDB
    - AWS Networking (VPC, ALB, NLB, API Gateway, Security Rules)
  - Employee Workstations
  - RosettaHealth Business Support Services (Google Workspace, FreshDesk, 1Password, Slack)
- Determine dependent platform components and functions that are impacted.
- Isolate and quarantine any components that may have been compromised (an isolation sketch follows this list).
- Identify the last good timepoint before the incident as a recovery target.
- Contact Partners and Customers affected.
- Eradicate the threat/vulnerability back to the good timepoint.
- Restore functionality of the component.
- Test logging, security, monitoring and alerting functionality.
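A hypothetical isolation sketch for the quarantine step above: move a compromised EC2 instance into a restrictive "quarantine" security group and snapshot its volumes for forensics. The quarantine security group ID, instance ID, and region are assumptions, and actual isolation is coordinated with ClearDATA.

```python
"""Sketch: quarantine a compromised EC2 instance and preserve its EBS volumes for
forensics. IDs and region are illustrative assumptions."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def quarantine_instance(instance_id: str, quarantine_sg_id: str) -> None:
    # Replace all security groups with the deny-by-default quarantine group.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])

    # Snapshot attached EBS volumes so the pre-eradication state is preserved.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )
    for vol in volumes["Volumes"]:
        ec2.create_snapshot(
            VolumeId=vol["VolumeId"],
            Description=f"forensic snapshot of {instance_id} ({vol['VolumeId']})",
        )

# Example (hypothetical IDs):
# quarantine_instance("i-0abc123example0001", "sg-0quarantine000001")
```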
Post-Recovery Phase
This section discusses activities to be performed after recovery of platform capabilities has been confirmed by the RosettaHealth technical team.
- Contact customers impacted.
- Perform a traffic impact analysis (a query sketch follows this list).
  - Determine what traffic for customers may have been impacted.
- Document a Root-Cause Analysis.
  - Document the timeline of the event, cause, actions taken, and any short-term mitigations.
- Determine any mitigation/preventive actions that can be taken to prevent future events or make the platform more resilient.
- Update the Platform Risk Assessment as needed.
- Perform an external vulnerability scan.
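An illustrative traffic-impact query sketch, assuming message traffic is queryable through the Athena reporting engine listed among the platform components; the database, table, and column names, the incident window, and the S3 output location are hypothetical and would come from the platform's actual reporting configuration.

```python
"""Sketch: ask Athena which customers had traffic during the incident window.
Database, table, columns, window, and output bucket are hypothetical."""
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption

QUERY = """
SELECT customer_id, COUNT(*) AS messages
FROM traffic_log            -- hypothetical table
WHERE event_time BETWEEN timestamp '2024-01-01 00:00:00'
                     AND timestamp '2024-01-01 06:00:00'  -- incident window
GROUP BY customer_id
ORDER BY messages DESC
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "healthbus_reporting"},        # hypothetical
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Athena query started:", response["QueryExecutionId"])
```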
Recovery Point Objective (RPO) and Recovery Time Objective (RTO)
Snapshots of the EBS volumes for the EC2 instances are taken every 24 hours, which establishes our minimum RPO. In the case of a disaster, recovery time is dependent on AWS recovery of the impacted services. However, from a business continuity and customer service perspective, the RTO for the HealthBus platform is a maximum of 30 days.
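A sketch of an RPO spot-check, assuming the 24-hour EBS snapshot cadence above: it flags volumes whose most recent snapshot is older than 24 hours. The ownership filter and region are assumptions about the AWS account layout.

```python
"""Sketch: flag EBS volumes whose latest snapshot is older than the 24-hour RPO.
Ownership filter and region are assumptions."""
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption
RPO = timedelta(hours=24)

def stale_volumes() -> list[str]:
    """Return volume IDs with no snapshot newer than the 24-hour RPO."""
    now = datetime.now(timezone.utc)
    latest: dict[str, datetime] = {}
    paginator = ec2.get_paginator("describe_snapshots")
    for page in paginator.paginate(OwnerIds=["self"]):
        for snap in page["Snapshots"]:
            vol, start = snap["VolumeId"], snap["StartTime"]
            if vol not in latest or start > latest[vol]:
                latest[vol] = start
    return [vol for vol, start in latest.items() if now - start > RPO]

if __name__ == "__main__":
    print("Volumes outside the 24-hour RPO:", stale_volumes() or "none")
```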
Disaster Recovery Testing
Annually, we will conduct a table-top exercise involving coordination and collaboration between RosettaHealth and ClearDATA to address the scenarios outlined above.