“Crack!” The loud sound startled me as I sat in front of my home PC. My anxiety increased when I looked down and saw that the 2 TB external drive on which I had stored 50,000 family photos (from the past 20 years) had dropped 18 inches from its precarious perch atop my Dell tower PC onto my wooden floor, the direct result of my foot getting tangled in the USB cord. I picked up the now-hopeless drive, heard clicking noises, and noticed the case had partly cracked open. I had only myself to blame, on not one but two counts: the placement of the drive, and mapping only my C: drive to my new external backup service. I never got around to mapping my external drives the way I had done previously with my old backup service.
Like the proverbial shoemaker’s son who has no shoes, I, a former CIO and evangelist for disaster recovery (DR), was wholly embarrassed by my own ineptitude. It was a failure on multiple levels. But it did get me to reflect on my long focus on disaster recovery and downtime mitigation, a focus that had dominated much of my CIO career.
Disaster preparedness, always a key pillar of IT strategy and architecture, took on a new level of definition and criticality when paper charts gave way to the EMR. No longer was downtime just a matter of lost revenue or productivity: lives were at real risk. Eventually, a goal of 99.9% uptime (often referred to as “three nines”) became the target for vendors and health systems as they examined the resiliency of their infrastructure, platforms, and software. While not nearly as resilient as a hard-line phone system at 99.99% (“four nines”), it was nonetheless a step in the right direction. Lessons learned in high-reliability venues, e.g., naval aviation and nuclear power, needed to be applied rigorously in healthcare. Movement toward remote hosting or cloud-based provisioning of the EMR provided an alternative to the potentially expensive investment in Tier 4 redundant data centers by each organization. Complementary read-only downtime systems became part of the expected disaster recovery solution and were essential to the investment in core electronic records.
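The practical difference between three and four nines is easy to quantify. A minimal sketch of the arithmetic (a generic calculation, not drawn from the original piece):

```python
# Allowed downtime per year = (1 - availability) x hours in a year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def downtime_hours_per_year(availability: float) -> float:
    """Return the annual downtime budget (in hours) for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR

# "Three nines" (99.9%) allows roughly 8.76 hours of downtime per year;
# "four nines" (99.99%) allows only about 0.88 hours (roughly 53 minutes).
print(round(downtime_hours_per_year(0.999), 2))
print(round(downtime_hours_per_year(0.9999), 2))
```

In other words, each added nine cuts the annual downtime budget by a factor of ten, which is why the jump from three to four nines is so costly to engineer.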
Having redundant Tier 4 infrastructure available (through an EMR vendor, another third party, or your own efforts) is a major step in the right direction. However, it is only part of the broader approach needed. Indeed, based on my experience with major EMR outage events at two former academic medical centers, and that of others with significant EMR downtime, those outages were often attributable to the human side of change management; they had very little to do with infrastructure failures. Errors made in software changes, upgrades, or even the apparently simplest configuration adjustments have taken down our EMRs for many hours, with consequent impact on care delivery and patient safety. While it is always easy to blame an individual for a mistake, it is the lack of high-reliability processes and training that contributes greatly to this issue.
My most recent employer was an organization at the forefront of applying principles of high reliability and error prevention to patient safety, a 10-year journey. We recognized that our efforts would be limited if the key vendor systems we relied on lacked the same commitment to error prevention that we applied to patient safety. With that in mind, our executive team offered to transfer our learning and experience to the EMR vendor. We took the “Crew Management”, “Checklist”, and “BEEP” (Behavior Expectations for Error Prevention) tools that we used in the OR and throughout our clinical enterprise and applied them with the vendor’s technical staff responsible for the aforementioned software changes. Over time this improved their performance and reliability and, thus, ours as well.
Having accomplished much of that “EMR hardening” in my prior CIO roles, there remained the question of how to manage DR for the rest of the applications: the 100 to 300 non-life-critical apps still essential to running the enterprise. To address that need in a comprehensive and rational way, we applied a tier-based (Tier 1 through Tier 4) approach of setting reasonable recovery time objectives (RTOs) and recovery point objectives (RPOs). We engaged the various system stakeholders in a “refereed” fashion. That meant saying no: we were not going to spend $1 million to give a fixed-asset system a Tier 1 level when Tier 3 or 4 was more than adequate. As we worked to determine what was reasonable and affordable for a given business risk, we eventually sorted those 100-plus other systems into their best-fit tiers and created a heat map that plotted existing and needed infrastructure (e.g., processors and storage) against the agreed-upon RPOs and RTOs.
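The tiering exercise above can be sketched in code. This is a hypothetical illustration only; the app names, thresholds, and tier boundaries are my own assumptions, not the criteria the organization actually used:

```python
from dataclasses import dataclass

@dataclass
class App:
    name: str
    rto_hours: float  # how quickly the app must be restored (recovery time objective)
    rpo_hours: float  # how much data loss is tolerable (recovery point objective)

def assign_tier(app: App) -> int:
    """Map an app to a DR tier; tighter objectives mean a lower (and costlier) tier.

    Thresholds here are illustrative, not the author's actual criteria.
    """
    worst = max(app.rto_hours, app.rpo_hours)
    if worst <= 1:
        return 1   # near-continuous availability (e.g., the EMR itself)
    elif worst <= 8:
        return 2   # restore within a shift
    elif worst <= 72:
        return 3   # restore within a few days
    else:
        return 4   # e.g., a fixed-asset system that can wait

apps = [
    App("EMR", rto_hours=0.5, rpo_hours=0.25),
    App("fixed-asset system", rto_hours=168, rpo_hours=24),
]
for app in apps:
    print(f"{app.name} -> Tier {assign_tier(app)}")
```

A classification like this is what makes the “refereed” conversation tractable: stakeholders negotiate the RTO/RPO numbers, and the tier (and its cost) follows mechanically rather than by lobbying.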
Oh, and what happened with my damaged drive? It was, unfortunately, unrecoverable. Thankfully, after many long and stressful weeks of effort, I was able to reconstruct those photos from multiple other sources (old drives, thumb drives, and memory cards). Two new external drives, exact copies, now sit on the floor in a secure case. My external backup service now maps to and backs up all my drives every week. As an additional step, I plan to move a copy of that data into the cloud and let my cloud vendor worry about it.