Rapid Prototyping of Composable Concurrent Workflows using Typed Templates

Albert Schimpf
wiki.scraper.server1.link
Technische Universität Kaiserslautern (TUK), Kyoto University
4 February 2020

Problem Boundary

Task is ...
  - Resource-intensive (proxies, I/O bound)
  - Resumable, long-running
  - Flexible stream of fresh data
  - Easily modifiable structure

Applications
  - IoT devices (e.g. fridge monitor [1])
  - Archiving/monitoring websites (e.g. archiving news threads [2])
  - Stream processing (extract/transform data [3])

[1] https://git.server1.link/scraper/scraper/wikis/examples/fridge
[2] https://git.server1.link/scraper/jobs/hackernews-json
[3] https://git.server1.link/scraper/jobs/extract-vcards

Informal Use Case Specification

BEGIN:VCARD .... END:VCARD

Data: entire HDD
Matches can be processed independently
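
The use case above can be sketched in a few lines of Python, assuming the raw data fits into a bytes buffer (a real disk would have to be read in chunks). The names `find_vcards`, `summarize`, and `process_all` are hypothetical and not part of the presented framework; the per-match work shown here (extracting the `FN:` line) is only a stand-in for whatever processing a match needs.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Non-greedy match between the two vCard markers; DOTALL lets "." span newlines.
VCARD = re.compile(rb"BEGIN:VCARD.*?END:VCARD", re.DOTALL)


def find_vcards(raw: bytes) -> list[bytes]:
    """Return every vCard-looking byte span found in the raw data."""
    return VCARD.findall(raw)


def summarize(card: bytes) -> str:
    """Hypothetical per-match work: return the FN (full name) value, if any."""
    for line in card.splitlines():
        if line.startswith(b"FN:"):
            return line[3:].decode("utf-8", errors="replace").strip()
    return "<unnamed>"


def process_all(raw: bytes, workers: int = 4) -> list[str]:
    """Process matches independently of each other; map() keeps input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, find_vcards(raw)))
```

Because no match depends on another, the work can be handed to a pool without any coordination between tasks, which is the property the slide points out.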

Possible Approaches

One program for each task (Java)
  - too much effort, fragile
  - code duplication
  - explicit concurrency handling
      * Java Stream interface not suited for task

Reuse functionality, abstract and share code
  - modifications of sub-routines affected other programs
  - mixed control-flow and data-flow hard to reason about
  - language focused on control-flow less suited for data-flow problems

Use tools that OS provides (e.g. pipes)
  - grep -aoz "BEGIN:VCARD.*?END:VCARD" /dev/sda1 | ...
  - Sequential specification unwieldy

Functional Nodes

Functional nodes, what about...
  - ... connecting them (graph structure)?
  - ... how data is passed around (API)?
  - ... concurrent access?
  - ... configuration?
  - ... complex control-flow, data-parallelism?

Use specifications instead of programming in Java (DSL)

Requirements

Reusable & adaptable nodes
Separation of business logic and program logic
Quasi-static graph-like specification
Reliability
  - Guarantee concurrent access and processing of data at any time without crashing
Robustness
  - Errors only happen during initialization of the specification
  - After initialization, errors are guaranteed to be of business-logic nature
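
One way to read the robustness requirement is that the whole specification is checked once, before any data flows. The sketch below illustrates that idea only; `NodeSpec`, `SpecificationError`, and `initialize` are hypothetical names, and the two checks shown (unique names, resolvable forward targets) are assumptions about what such an initialization might verify, not the framework's actual rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical node entry: a name plus an optional forward target."""
    name: str
    forward: Optional[str] = None


class SpecificationError(Exception):
    """Raised while the specification is initialized, before any data flows."""


def initialize(specs: list[NodeSpec]) -> dict[str, NodeSpec]:
    """Validate the whole graph once and return a lookup table for the runtime."""
    graph: dict[str, NodeSpec] = {}
    for spec in specs:
        if spec.name in graph:
            raise SpecificationError(f"duplicate node name: {spec.name}")
        graph[spec.name] = spec
    # Second pass: every forward target must name a node that exists.
    for spec in specs:
        if spec.forward is not None and spec.forward not in graph:
            raise SpecificationError(
                f"{spec.name} forwards to unknown node {spec.forward}"
            )
    return graph
```

With checks like these done up front, a failure at run time can no longer be caused by a dangling arrow in the graph, which leaves only business-logic errors.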

Flows - Arrows & Nodes

Simple Graph
Nodes ...
  - ... implement single unit of work
  - ... forward data to another node
No forward target denotes end with some result data
Where is the process, how is work done?

Flows (implicit)
Initial (empty) flow map F_i accepted by first node
Result data for input F_i is F_1

Dependent & Dispatched Flows
dispatch node creates a new flow
new flow is independent
F_2 does not depend on F_1
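
The flow model can be approximated with a small interpreter. This is a minimal sketch under stated assumptions: a flow map is a plain dict, `Node`, `run_flow`, and `dispatch_to` are hypothetical names, and the dispatched flow runs sequentially here for simplicity, whereas the slides describe it as an independent flow handled by the system.

```python
from typing import Callable, Optional

FlowMap = dict  # assumption: a flow map is a plain key/value mapping


class Node:
    """Hypothetical node: one unit of work plus an optional forward target."""

    def __init__(self, work: Callable[[FlowMap], FlowMap],
                 forward: Optional[str] = None):
        self.work = work
        self.forward = forward


def run_flow(graph: dict, start: str, flow: Optional[FlowMap] = None) -> FlowMap:
    """Push a flow map through the graph until a node has no forward target."""
    current = dict(flow or {})        # initial flow map F_i, empty by default
    name: Optional[str] = start
    while name is not None:
        node = graph[name]
        current = node.work(current)  # each node returns the updated map
        name = node.forward
    return current                    # result data of this flow


def dispatch_to(graph: dict, target: str, results: list) -> Callable[[FlowMap], FlowMap]:
    """Build the work function of a dispatch node that starts a flow at `target`."""
    def work(flow: FlowMap) -> FlowMap:
        # run_flow copies the map, so the new flow cannot change the caller's data.
        results.append(run_flow(graph, target, flow))
        return flow                   # the dispatching flow continues unchanged
    return work
```

In this sketch the result of the dispatching flow (F_1) is returned without any data from the dispatched flow (F_2), which mirrors the statement that F_2 does not depend on F_1.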

Nodes - Typed Configuration

    - type: RegexNode
      regex: "(BEGIN:VCARD.*?END:VCARD)"
      content: "{output-chunk}"
      groups:
        content: 1
      collect: false
      streamTarget: vcard-match

Every key of the configuration is typed (implementation, program logic)
Access to flow map content via templates (e.g. {output-chunk})
Format of configuration is not important (declarative, either JSON or YAML for now)
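
To make the two ideas on this slide concrete, the following sketch checks the configuration keys against expected types and resolves a `{key}` template against a flow map. It is an illustration, not the framework's implementation: `SCHEMA`, `RegexNodeConfig`, `load_config`, and `render` are hypothetical names, the expected types are guesses based on the values in the example, and `groups` is assumed to be a mapping from output key to capture-group index.

```python
import re
from dataclasses import dataclass

# Assumed types for the RegexNode keys, inferred from the example values.
SCHEMA = {"regex": str, "content": str, "groups": dict,
          "collect": bool, "streamTarget": str}

TEMPLATE = re.compile(r"\{([A-Za-z0-9_-]+)\}")


@dataclass(frozen=True)
class RegexNodeConfig:
    """Hypothetical typed view of the RegexNode configuration."""
    regex: str
    content: str
    groups: dict
    collect: bool
    streamTarget: str


def load_config(raw: dict) -> RegexNodeConfig:
    """Check key names and value types once, when the specification is loaded."""
    values = {k: v for k, v in raw.items() if k != "type"}
    if set(values) != set(SCHEMA):
        raise TypeError(f"unexpected or missing keys: {set(values) ^ set(SCHEMA)}")
    for key, expected in SCHEMA.items():
        if not isinstance(values[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return RegexNodeConfig(**values)


def render(template: str, flow: dict) -> str:
    """Replace each {key} with the flow-map value; a missing key raises KeyError."""
    return TEMPLATE.sub(lambda m: str(flow[m.group(1)]), template)
```

For example, `render("{output-chunk}", {"output-chunk": "abc"})` evaluates to `"abc"`, while a configuration with `collect: "no"` is rejected by `load_config` before any flow is started.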

Flows

With arrows (dependent, dispatched, multiple), all types of data-parallelism are possible
  - Fork, ForkJoin
  - Map, MapJoin
  - IfThenElse
  - Pipe
  - Retry
  - Application-specific flow routing
Concurrency is handled by the system with predefined semantics
Concurrency can be visualized in a graph
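
Two of the listed patterns, ForkJoin and Retry, can be sketched with the Python standard library. The function names `fork_join` and `retry`, the merge strategy (branch results stored under the branch name), and the retry policy (fixed number of attempts, no delay) are all assumptions made for illustration; the slides do not specify the semantics the framework uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fork_join(flow: dict, branches: dict[str, Callable[[dict], object]],
              workers: int = 4) -> dict:
    """Run each branch on its own copy of the flow map, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {name: pool.submit(branch, dict(flow))
                   for name, branch in branches.items()}
        # result() blocks until the branch is done: this is the join step.
        joined = {name: future.result() for name, future in futures.items()}
    return {**flow, **joined}


def retry(work: Callable[[dict], dict], flow: dict, attempts: int = 3) -> dict:
    """Re-run a unit of work on failure; re-raise after the last attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return work(dict(flow))  # every attempt starts from a fresh copy
        except Exception:
            if attempt == attempts:
                raise
    raise AssertionError("unreachable")
```

Giving every branch and every attempt its own copy of the flow map keeps the units of work from interfering with each other, so the caller does not need explicit locking.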

Evaluation

Flow graph model with flows, nodes, and arrows
Operational semantics with type safety (Master's thesis result)
Java framework based on operational semantics

Evaluation - Recover VCards

Workflow
  - Time: 9 minutes 34 seconds
  - Max memory: 2.5 GB
  - Average memory usage: 1.5%

Pipes
  - sudo grep -a -A 20 BEGIN:VCARD /dev/sda | wc -l
  - Time: 9 minutes 3 seconds
  - Max memory: 1 GB
  - Average memory usage: 4%