ERICA A cloud orchestration meta-framework for secure health data analytics Tim Churches SW Sydney Clinical School & Centre for Big Data Research in Health UNSW Medicine
Why we need secure platforms for health data analysis
Overcoming the GP data drought • No systematically reported national data on the size and structure of general practices in Australia • Network analysis of 21 years of Medicare claims shows: • general practices have increased in size • continuity of care and patient loyalty have remained stable • greater sharing of patients by GPs is associated with greater patient loyalty • This new approach allows continuous monitoring of the characteristics of Australian general practices Tran B, Straka P, Falster MO, Douglas KA, Britz T, Jorm LR. Overcoming the data drought: exploring general practice in Australia by network analysis of big data. Med J Aust 2018; 209(2):68-73
Re-operation after breast conserving surgery • Linked hospital inpatient and death data for NSW • Primary unilateral or bilateral BCS • 90-day reoperation (re-excision or mastectomy) • 29% overall re-operation • 17% BCS • 12% mastectomy • ↑ BCS over time, ↓ mastectomy over time • Significant variation by hospital van Leeuwen MT, Falster MO, Vajdic CM, Crowe PJ, Lujic S, Klaes E, Jorm L, Sedrakyan A. Reoperation after breast-conserving surgery for cancer in Australia: statewide cohort study of linked hospital data. BMJ Open 2018, vol. 8, pp. e020858.
Sydney Morning Herald
Why we need secure platforms for health data analysis • Current research at UNSW Medicine (CBDRH, SPHCM, SWSCS, MRIs) for health services research, clinical epidemiology and ML research • whole-of-NSW-Health administrative data (hospital admissions, ED visits, cancer registry, death certificates) linked at person level over a 15-year span • linked MBS-PBS data (not the retracted dataset!) • EMR and cancer information system data from specific hospitals • DVA linked data • 25% subset of NPS MedicineWise data • All of these are de-identified • But all of these are potentially re-identifiable • Must be kept safe! • Security requirements that exceed “…data will be stored on a password-protected file server…”
ERICA: key features • Provides up to 256 secure remote-access analysis project spaces per instance • “Enclave” model: each project space is completely self-contained and disconnected from other projects, from the internet, and from the users’ desktops • Provides an invigilated gateway for data coming in and research results going out, with complete audit trail • Uses Amazon Web Services (AWS) commercial cloud computing • Leverages the features and scalability of AWS • Different OS and workspace configurations • High performance computing • Multiple storage and pricing options
ERICA: key features • Is institution-based • Governed and managed by a host institution and its policies and procedures • Multiple instances (‘clones’), governed by different host institutions, can be established anywhere that AWS operates; currently: • UNSW • Australian Institute of Health and Welfare (AIHW SRAE) • NSW Government Data Analytics Centre • A code-driven ‘orchestration framework’ • Testable and tested for correct behaviour • System administrators do not manually configure resources • Project space configuration is point-and-click • Minimises human error • Accredited by eHealth NSW under their Privacy and Security Assessment Framework (PSAF) to hold fully-identified NSW Health data
Typical ERICA virtual workstation
ERICA virtual workstations • Most current users use Windows 7 or Windows 10 workstations • Linux workstations available • Software can be pre-installed in workstation images (up to 100) • e.g. MS Office, SAS, SPSS, Stata, R, Python, TensorFlow etc. • System administrators can define additional HPC resources via templates, restricted to specific project spaces, e.g. • Linux compute server with multiple high-end GPU cards • Apache Spark cluster with many nodes • End-users in the project space can start and stop these on demand, and are warned if resources are left running!
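The "warned if left running" behaviour can be illustrated with a minimal sketch. This is not ERICA's actual code: the 8-hour threshold, the resource names and the function `resources_needing_warning` are all hypothetical, chosen only to show the idea of flagging on-demand HPC resources that have been running for a long time.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; the slides do not say when warnings are issued.
WARN_AFTER = timedelta(hours=8)


def resources_needing_warning(running_since, now, warn_after=WARN_AFTER):
    """Return the sorted names of resources running for at least warn_after.

    running_since maps a resource name to the datetime it was started.
    """
    return sorted(
        name
        for name, started in running_since.items()
        if now - started >= warn_after
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    running = {
        "gpu-server": now - timedelta(hours=10),    # hypothetical resource
        "spark-cluster": now - timedelta(hours=1),  # hypothetical resource
    }
    for name in resources_needing_warning(running, now):
        print(f"Warning: {name} is still running")
```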
‘Five safes’ framework 1. Safe Projects Is this use of the data appropriate? 2. Safe People Can the researchers be trusted to use it appropriately? 3. Safe Data Is there a disclosure risk in the data itself? 4. Safe Settings Does the access facility limit unauthorised use? 5. Safe Outputs Are the statistical results non-disclosive? Desai T, et al. Five Safes: designing data access for research. Economics working paper series 1601. Bristol: University of the West of England, 2016
ERICA: Safe projects • Policies set by host institution • UNSW ERICA • Projects must have data custodian and ethics approvals • Projects must therefore meet NHMRC guidelines for human research • ERICA must be named as data storage and analysis facility on HREC applications
ERICA: Safe people • Roles defined in ERICA code and assigned to individuals according to policies of host institution • System Administrator • Project Chief Investigator • Project Controller • Project Manager • Project Researcher • Online training module for researchers • With an exam that must be passed…
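As an illustration of roles being "defined in code" and assigned per project, here is a minimal sketch. The role names come from the slide; the data structure and the helpers `assign_role` and `has_role` are assumptions for illustration only and do not describe how ERICA stores role assignments.

```python
from enum import Enum


class Role(Enum):
    """The five roles listed on the slide."""
    SYSTEM_ADMINISTRATOR = "System Administrator"
    PROJECT_CHIEF_INVESTIGATOR = "Project Chief Investigator"
    PROJECT_CONTROLLER = "Project Controller"
    PROJECT_MANAGER = "Project Manager"
    PROJECT_RESEARCHER = "Project Researcher"


def assign_role(assignments, account, project, role):
    """Record that an account holds a role in one project space (hypothetical helper)."""
    assignments.setdefault((account, project), set()).add(role)


def has_role(assignments, account, project, role):
    """Check whether an account holds a role in a given project space."""
    return role in assignments.get((account, project), set())


if __name__ == "__main__":
    assignments = {}
    # "alice" and "project-a" are made-up names for the example.
    assign_role(assignments, "alice", "project-a", Role.PROJECT_CONTROLLER)
    print(has_role(assignments, "alice", "project-a", Role.PROJECT_CONTROLLER))
    print(has_role(assignments, "alice", "project-b", Role.PROJECT_CONTROLLER))
```

Keying assignments by (account, project) reflects the point that a role held in one project space confers nothing in another.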
ERICA: Safe data • Designed for research using sensitive microdata • The datasets, variables, level of detail and any suppression or perturbation are governed by host institution’s policies and data provider policies • UNSW ERICA: governed according to data custodian and ethics approvals • Project Controller checks and approves all inbound files • Role can be assigned to data custodian nominee (e.g. AIHW staff member) or research team member • Data custodians can upload encrypted data themselves through eHub or large file ingress facility • By carefully attending to the other four “Safes”, ERICA and similar secure analysis platforms dramatically reduce the level of anonymisation which data providers and data custodians need to do • Data anonymisation is the enemy of quality research and effective ML model development
ERICA: Safe settings - threat model • Basic premise: researchers are honest-but-sloppy • Ignorant of IT security • Reliant on institutional IT security • Driven by convenience • Designed to protect against • Innocent acts-of-omission by researchers • Acts-of-carelessness by researchers • Malicious acts by non-users (i.e. external hackers) • But not necessarily malicious acts-of-commission by researchers • e.g. Filming the screen as they scroll through data
ERICA: Safe settings – identity and authentication • Authentication and authorisation use a Microsoft Active Directory instance specific to ERICA • ERICA user accounts are assigned to one or more roles (e.g. Project Controller, Project Researcher) for each project space • At all external access points, users authenticate themselves using a single set of login credentials (account name and password) plus a mandatory multi-factor authentication code (using smartphone) • External access points can be further restricted to specific IP address ranges or source networks, or client-side digital certificates can be used to restrict access to specific devices (e.g. specific laptop or desktop computers) • e.g. UNSW Medicine ERICA instance is accessible only from the UNSW internal network, behind the main UNSW firewall, so it has no Internet-facing interface
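Restricting an access point to specific IP address ranges amounts to an allow-list check, sketched below with Python's standard `ipaddress` module. The network ranges and the function `is_source_allowed` are hypothetical examples; in practice such restrictions would normally be enforced in network configuration rather than application code, and the slides do not say how ERICA implements them.

```python
import ipaddress

# Hypothetical allow-list; real ranges would come from the host institution.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.10.0/24"),
]


def is_source_allowed(source_ip, allowed_networks=ALLOWED_NETWORKS):
    """Return True if source_ip falls inside any allowed network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed_networks)


if __name__ == "__main__":
    for candidate in ("10.1.2.3", "8.8.8.8"):
        verdict = "allowed" if is_source_allowed(candidate) else "blocked"
        print(candidate, verdict)
```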
Logging into ERICA AWS Desktop client
ERICA: Safe settings – movement of data • All research data held in ERICA are encrypted both at rest and in transit • AWS key management and encryption services are used to strongly encrypt all EBS and S3 data stores used by ERICA • Secure protocols, including HTTPS (TLS v1.3), LDAPS, scp and encrypted SMB/CIFS are used for all communications and data movement • Users can only import or export data via a controlled gateway mechanism known as the Hub • All other file or data ingress and egress mechanisms, including clipboard, email, messenger services, printing services and internet access, are blocked by two independent and redundant layers in the system network architecture • Project workspaces are isolated from each other, and no data can be transferred between them (except via the Hub)
Importing and exporting
ERICA: Safe settings – logging and audit • All data movements inbound to and outbound from ERICA are fully logged and subject to full-copy audit trails • An activity trail displays the time, project and the action that a particular user has taken within the system regarding data movement • A checksum of the imported/exported file is maintained and logged to ensure the file has not been modified during ingress or egress • All logging is aggregated into AWS CloudWatch, which provides a single unalterable, digitally signed and timestamped source of information for auditing purposes • Key security event logs include those generated by: border routing devices, network and application firewalls, intrusion detection, anti-virus and malicious code protection services, internet-connected services • Automated log analysis and notification using industry standard tools is currently being implemented
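The checksum step can be sketched as follows. The slides do not name the hash algorithm or the log format, so SHA-256, the field names in `audit_entry`, and the helper names are all assumptions made for illustration: a digest is recorded when a file moves, and recomputing it later shows whether the file changed.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, destination_path):
    """Return True if both copies of the file have the same digest."""
    return sha256_of_file(source_path) == sha256_of_file(destination_path)


def audit_entry(user, project, action, path):
    """Build a hypothetical audit record for one data movement."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "project": project,
        "action": action,
        "sha256": sha256_of_file(path),
    }
```

Reading in chunks keeps memory use flat for large research datasets; comparing the logged digest with a freshly computed one detects modification during ingress or egress.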