Best Practices for Timely and Trusted Data Acquisition, Curation and Coordination in Microscope Environment
Klara Nahrstedt, University of Illinois at Urbana-Champaign
Joint work with Phuong Nguyen, Steve Konstanty, Todd Nicholson, Roy Campbell, Indy Gupta, Tim Spila, Michael Chan, Kenton McHenry, Tommy O’Brien, Aaron Schwartz-Duval
Project funded by NSF ACI DIBBs grant.
Outline
• Motivation
• Problem Description and Challenges
• 4CeeD Approach
• Lessons Learned So Far
• Best Practices So Far
Motivation
• Consideration of National Academy studies: 20 years from discovery of new materials to fabrication of next-generation devices
• Need for REAL TIME and TRUSTED Capture, Curation, Correlation, and Coordination of materials-to-devices digital data before full archiving and publishing
Image source: http://www.build-electronic-circuits.com/integrated-circuit/
Current State of Data Collection at Microscopes
• The current situation for experimental data involves manual processes for data capture and storage, leading to poor documentation of results
• Data transfer is often done via “sneaker-net” techniques using flash drives or email
• “Best” results and images are kept, but what is “best” is determined by a narrow, specific scientific objective. “Imperfect” data is often discarded or not available for others to review.
Effects of Current State
• Measurements on multiple instruments for a new material may not be well correlated due to a lack of mechanisms to encode the linkages between measurements.
• Novel device prototypes can be difficult to reproduce due to a lack of proper capture of the “recipes” used.
• In addition, previous experiments in the deposition systems may affect subsequent experiments.
• Curation of system information can greatly improve the reproducibility and understanding of results.
Steps towards Problem Solution
• Determine Physical Environments for Data Collection
• Understand Physical and Digital Processes that are going on during material and semiconductor fabrication research
• Determine Instruments for Investigation
• Determine Cyber-and-Data Infrastructure for Real-time and Trusted Data Collection from Instruments
• Design and Develop Distributed Data Collection Tool
• Identify Test Users, Test Tool and Extract Feedback
• Feed Feedback to Distributed Data Collection Tool
Materials and Semiconductor Fabrication Cyber-Physical Environments
Micro-Nano Technology Laboratory (MNTL)
• Growth and characterization of photonics, microelectronics, nanotechnology and biotechnology
Materials Research Laboratory (MRL)
• Research in condensed matter physics, materials chemistry and materials science
• Facilities for nanostructural and nanochemical analysis
Microscope Data: Development Process (Example)
[Figure: fabrication process flow. Steps include lithography, diffusion, SiO2 and SiNx deposition, plasma etching, oxidation, metallization, SiNx removal, and device characterization. Each step is paired with characterization instruments: profilometry, SIMS, optical microscopy, ellipsometry, SEM, SPA.]
Collected Data from Microscope (Oxidation Step) An example of the result from an experiment at MNTL
Challenges for Real-Time and Trusted Data Collection
• Understanding user requirements for data curation
• Development of policies for protecting data during research project and making data available after research project is completed
• Creating a system that is able to handle many different types of work processes
• Ability to read and display images and data from many different sources, many of which are proprietary
• Networking challenges for collecting data
  * Networked Microscopes
Our Approach 4CeeD: Timely and Trusted Capture, Curation, Correlation, Coordination and Distribution
4CeeD Approach: Cyber-infrastructure
[Figure: cyber-infrastructure layers. Client: Curators. Edge Computing. Cloud: Coordinator.]
[Figure: 4CeeD architecture. At MRL and MNTL, users upload DM3 files, images, metadata, and text to Curators hosted on cloudlets. Curators perform bulk data transfer (via API) to the Coordinator, which processes, coordinates, and correlates data from multiple sources. Users view, edit, and share data via the Webapp.]
4CeeD Curator
4CeeD Curator Goals at Microscope
Enable researchers to have a Digital Logbook System
• Data is organized by researcher and by sample name
• Recipes are collected and related to the deposition equipment used
• Analytical data is collected as it is created and contains metadata needed to reproduce measurements
4CeeD Curator – Input Data Collection
• Create or select a collection
• Create or select a dataset
• Upload files
• Optional: choose a template and enter metadata
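The collection, dataset, and file steps on this slide can be pictured as a small data model. The following is a minimal sketch in Python, not the 4CeeD or Clowder schema: the class names (`Collection`, `Dataset`, `DataFile`), the `upload` helper, and the template keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataFile:
    """One uploaded file (e.g. an image) with optional metadata."""
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Dataset:
    """A group of files, e.g. all measurements of one sample."""
    name: str
    files: List[DataFile] = field(default_factory=list)

    def upload(self, name: str, template: Optional[List[str]] = None,
               **metadata: str) -> DataFile:
        """Add a file; if a template is given, require all of its keys."""
        if template is not None:
            missing = [key for key in template if key not in metadata]
            if missing:
                raise ValueError(f"missing metadata fields: {missing}")
        data_file = DataFile(name=name, metadata=dict(metadata))
        self.files.append(data_file)
        return data_file


@dataclass
class Collection:
    """Top-level container, e.g. one researcher's logbook."""
    name: str
    datasets: Dict[str, Dataset] = field(default_factory=dict)

    def dataset(self, name: str) -> Dataset:
        """Create the dataset on first use, otherwise select it."""
        return self.datasets.setdefault(name, Dataset(name=name))


# Hypothetical template: metadata keys an SEM upload should carry.
SEM_TEMPLATE = ["instrument", "accelerating_voltage_kV"]

logbook = Collection(name="researcher-a")      # create or select collection
sample = logbook.dataset("sample-01")          # create or select dataset
sample.upload("oxidation.tif", template=SEM_TEMPLATE,
              instrument="SEM", accelerating_voltage_kV="5")
```

Making the template optional mirrors the slide's last step: an upload without a template is accepted as-is, while an upload with one is rejected if required metadata is missing.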
4CeeD Curator (Modified Clowder) Architecture
[Figure: client side: web browser and custom clients. Server side: a load balancer (nginx) in front of Clowder Webapp instances (Scala/Play), backed by multimedia search (Versus), text search (Elasticsearch), and data/metadata storage (MongoDB). An event bus (RabbitMQ) connects to Extractor 1 (Java), Extractor 2 (Python), and external software.]
4CeeD Coordinator
Data Infrastructure’s Challenges (1)
Heterogeneity of the types of job and input data
[Figure: two example workflows, a TEM data processing workflow and an SEM data processing workflow, built from tasks such as DM3 parsing, extract metadata, extract structured information, analyze image, classify, and index.]
• How to model complex interactions between jobs’ tasks?
Data Infrastructure’s Challenges (2)
Changing workload
• Static resource allocation and rule-based provisioning are not suitable
Flexible provisioning
• QoS-based, cost-based provisioning
Coordinator Data Processing Flow
• The Coordinator models each job's tasks as a workflow over incoming data
• A data processing job is abstracted as a workflow to support flexibility & applicability
[Figure: example of data processing workflow with nodes Start, Extract metadata, Analyze image, Classify sentiment, Index, End.]
[Figure: Coordinator architecture.
Front-end: Coordinator’s front-end with database / file system.
Control plane: job invoker, resource manager, broker(s), and a job-type table:

  Job type   From    To
  1          Start   A
  1          A       B
  1          B       C
  1          C       End
  ...        ...     ...

Compute plane: topics A, B, C (Extract, Classify, Index in the TEM data processing workflow). Each topic has its own consumers (A’s, B’s, C’s), which subscribe to their topic and publish to the next.]
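The job-type table (Job type, From, To) on this slide is enough to decide which task runs next. A minimal sketch of that lookup in Python follows. The function names `build_routes`, `next_tasks`, and `task_order` are hypothetical, and the rows are copied from the slide; this is an illustration of the idea, not the Coordinator's code.

```python
from typing import Dict, List, Tuple

# (job_type, from_task, to_task) rows, mirroring the table on the slide.
WORKFLOW_ROWS: List[Tuple[int, str, str]] = [
    (1, "Start", "A"),
    (1, "A", "B"),
    (1, "B", "C"),
    (1, "C", "End"),
]

Routes = Dict[int, Dict[str, List[str]]]


def build_routes(rows: List[Tuple[int, str, str]]) -> Routes:
    """Index rows as routes[job_type][from_task] -> list of next tasks."""
    routes: Routes = {}
    for job_type, src, dst in rows:
        routes.setdefault(job_type, {}).setdefault(src, []).append(dst)
    return routes


def next_tasks(routes: Routes, job_type: int, task: str) -> List[str]:
    """Tasks to trigger once `task` has finished for this job type."""
    return routes.get(job_type, {}).get(task, [])


def task_order(routes: Routes, job_type: int) -> List[str]:
    """Breadth-first list of tasks reachable from Start, excluding End."""
    order: List[str] = []
    frontier, seen = ["Start"], {"Start"}
    while frontier:
        task = frontier.pop(0)
        for nxt in next_tasks(routes, job_type, task):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
                if nxt != "End":
                    order.append(nxt)
    return order


routes = build_routes(WORKFLOW_ROWS)
print(next_tasks(routes, 1, "A"))   # tasks that follow A for job type 1
print(task_order(routes, 1))        # all tasks of job type 1, in order
```

Storing the workflow as rows rather than as code means a new job type only needs new rows. Because a task may list several successors, branching workflows fit the same table.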
4CeeD Pub-Sub Subsystem
New publish/subscribe-based system to support executing heterogeneous workflows
• Leverages the flexibility of the asynchronous message passing mechanism of a pub/sub system (Apache Kafka)
• However:
  • Out-of-the-box pub/sub systems do not support executing workflows
  • Resource management is done manually by the user
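To illustrate what "executing workflows" on top of pub/sub might involve, here is a minimal sketch in Python. It assumes a thin routing layer that republishes each task's output to the next topic. `InMemoryBroker`, `ROUTES`, `HANDLERS`, and the message fields are hypothetical stand-ins. The sketch uses in-memory queues and does not call Kafka or any other real broker API.

```python
from collections import deque
from typing import Callable, Dict, List, Optional

# Hypothetical routing table: which topic(s) receive a task's output.
ROUTES: Dict[str, List[str]] = {
    "Start": ["extract"],
    "extract": ["classify"],
    "classify": ["index"],
    "index": ["End"],
}

# Hypothetical task handlers; each consumes a message and returns a new one.
HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "extract": lambda msg: {**msg, "metadata": {"format": "dm3"}},
    "classify": lambda msg: {**msg, "label": "TEM"},
    "index": lambda msg: {**msg, "indexed": True},
}


class InMemoryBroker:
    """Stand-in for a pub/sub broker: one FIFO queue per topic."""

    def __init__(self) -> None:
        self.topics: Dict[str, deque] = {}

    def publish(self, topic: str, message: dict) -> None:
        self.topics.setdefault(topic, deque()).append(message)

    def poll(self, topic: str) -> Optional[dict]:
        queue = self.topics.get(topic)
        return queue.popleft() if queue else None


def run_workflow(broker: InMemoryBroker, job: dict) -> List[dict]:
    """Drive one job from Start to End by republishing each task's output."""
    for topic in ROUTES["Start"]:
        broker.publish(topic, job)
    finished: List[dict] = []
    progressed = True
    while progressed:                      # stop once every queue is empty
        progressed = False
        for topic, handler in HANDLERS.items():
            message = broker.poll(topic)
            if message is None:
                continue
            progressed = True
            result = handler(message)
            for nxt in ROUTES[topic]:
                if nxt == "End":
                    finished.append(result)
                else:
                    broker.publish(nxt, result)
    return finished


results = run_workflow(InMemoryBroker(), {"file": "sample.dm3"})
```

In this sketch the routing table is the only thing that knows about the workflow. Each consumer simply reads from one topic and hands its result back, which is what lets heterogeneous workflows share the same consumers.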
4CeeD Resource Management
[Figure: the resource manager comprises a resource monitor, a resource scheduler, and a resource allocator, which sizes the pools of A’s, B’s, and C’s consumers.]
Inputs shown in the figure:
- Job request rates
- Average response time
- Topics’ message queues and consumers statistics
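One way to picture how monitored request rates and queue statistics could turn into consumer counts is the sketch below. It is an assumption for illustration only: it applies a simple utilization rule and is not the 4CeeD scheduler. `TopicStats`, `consumers_needed`, `allocate`, the 0.7 target utilization, and the 60-second drain window are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class TopicStats:
    """Monitor readings for one topic (all fields are illustrative)."""
    arrival_rate: float   # messages per second entering the topic
    service_time: float   # mean seconds one consumer spends per message
    backlog: int = 0      # messages currently waiting in the queue


def consumers_needed(stats: TopicStats, target_utilization: float = 0.7,
                     drain_seconds: float = 60.0) -> int:
    """Consumers required to keep up with arrivals and drain the backlog.

    Assumption: a simple utilization rule, not the 4CeeD algorithm.
    """
    # Work that must be processed per second: new arrivals plus the
    # backlog spread over the drain window.
    required_rate = stats.arrival_rate + stats.backlog / drain_seconds
    # A consumer kept at the target utilization handles this many msgs/s.
    per_consumer_rate = target_utilization / stats.service_time
    return max(1, math.ceil(required_rate / per_consumer_rate))


def allocate(monitor: Dict[str, TopicStats]) -> Dict[str, int]:
    """Map each topic to the number of consumers it should run."""
    return {topic: consumers_needed(stats) for topic, stats in monitor.items()}


# Illustrative monitor snapshot for the three topics in the figure.
snapshot = {
    "A": TopicStats(arrival_rate=2.0, service_time=0.5),
    "B": TopicStats(arrival_rate=2.0, service_time=0.1),
    "C": TopicStats(arrival_rate=2.0, service_time=1.0, backlog=60),
}
plan = allocate(snapshot)
print(plan)  # {'A': 2, 'B': 1, 'C': 5}
```

Sizing each topic independently lets a slow task such as C scale out without over-provisioning the fast ones, in contrast to a static allocation.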
4CeeD Coordinator System Implementation
Front-end
- Modified; leverages Clowder’s Webapp & APIs
Control plane
- Resource managers & other control plane programs implemented in Python
Compute plane
- RabbitMQ as message queue
- Consumers implemented as Docker containers
- Kubernetes is used for container orchestration
Evaluation Case study: Executing scientific workflows
Efficient real-time resource provisioning
[Figure: results for two consumer configurations, m = (1, 1, 1, 1) and m = (2, 2, 1, 4).]
Our proposed approach efficiently provisions resources to cope with bursty workload
Lessons Learned from 4CeeD So Far
• Huge inefficiencies exist at microscopes: (1) users spend time deciding which data to delete; (2) users spend time on data conversions to view data, instead of on data collection
• There are security concerns: (1) users want to keep data secure and private until published; (2) instruments run on old, unpatched Windows software
• Metadata related to data is lost: (1) some metadata is not properly extracted from images; (2) some metadata is not even captured
• Not all current cloud solutions are suitable for backend storage and processing of microscope data
Best Practices from 4CeeD So Far
• Talk to users and introduce them to data tools
  • CyberFab 2016 Workshop, May 24, 2016, Urbana
• Consider cloud solutions and do not reinvent everything
• Develop open frameworks to enable integration
• Integrate with other tools towards a sustainable tool suite
• Talk and collaborate with other developers of data infrastructures