Automating User-Centered Design of Data-Intensive Processes
Research Project Report (RPR)
Vasileios Theodorou
26-05-2015
Host University / Home University
Coadvisor: Dr. Maik Thiele
Supervisor: Prof. Alberto Abelló
Supervisor: Prof. Wolfgang Lehner
RPR - Doctoral Colloquium, IT4BI-DC (eBISS 2015)

Example - Two Alternative Flows
Conceptual model of flow: “Details about suppliers in Europe sorted on revenue”
• ETL Flow A
• ETL Flow B

Measures from experiments

Characteristic      Measure                        ETL Flow A           ETL Flow B
Performance         Process cycle time             10.4 sec             18.9 sec
Performance         Throughput                     52,906 tuples/sec    29,179 tuples/sec
Data quality        % of correct tuples            91.5%                100%
Data quality        % of non-null tuples           90.3%                95.2%
Understandability   # of precedence dependencies   20                   40
Manageability       Length of longest path         9 steps              23 steps

EXECUTION
• TPC-H with s.f.=1
• Executed on Pentaho Data Integration (Kettle)
• Data quality improved; Performance, Understandability and Manageability reduced
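To make the trade-off between the two flows easier to read, the relative change of each measure from Flow A to Flow B can be computed directly from the reported values. The sketch below only uses the figures from the slide; the dictionary keys and the helper name `relative_change` are illustrative and not part of the original work.

```python
# Measures copied from the slide as (Flow A, Flow B) pairs.
# Key names and the helper below are hypothetical, chosen for illustration.
measures = {
    "process_cycle_time_sec": (10.4, 18.9),
    "throughput_tuples_per_sec": (52906, 29179),
    "correct_tuples_pct": (91.5, 100.0),
    "non_null_tuples_pct": (90.3, 95.2),
    "precedence_dependencies": (20, 40),
    "longest_path_steps": (9, 23),
}


def relative_change(a, b):
    """Relative change from Flow A to Flow B, in percent of Flow A's value."""
    return (b - a) / a * 100.0


changes = {name: relative_change(a, b) for name, (a, b) in measures.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Whether a positive change is an improvement depends on the measure: higher is better for throughput and the data quality percentages, lower is better for cycle time and the two structural measures.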

Agenda
APPROACH
• Conceptual model reflecting user requirements
• User requirements-driven flow redesign
• Automatic “quality” pattern integration
• Configurable testing
CHALLENGES AND DISCUSSION
• Relate patterns to utility
• Assess pattern significance, model accuracy & completeness
• Future plan

ETL Quality Attributes
Paper: Quality Measures for ETL Processes (DaWaK ’14)
TRADE-OFFS
• It’s not only about performance!
• Improving some quality attributes can affect others positively or negatively

ETL Quality Attributes
Paper: Quality Measures for ETL Processes (DaWaK ’14)
CONTRIBUTION
• Define a set of ETL process quality characteristics AND the relationships between them
• Provide quantitative measures for each characteristic, backed by literature!
METHODOLOGY
• SLR for quality attributes specific to data intensive processes
• Collection from literature of (proven) metrics for monitoring and quantitatively evaluating ETL processes
INVITED JOURNAL EXTENSION
• Special Issue of Journal CCPE 2015 (under minor revision)
• Introduce and apply goal modeling “stepping” on defined models
• Showcase evaluation of use case ETLs using proposed measures

User requirements driving flow redesign
Paper: A Framework for User-Centered Declarative ETL (DOLAP ’14)
[Figure: traditional approach; labels: Business User, requirements, IT, ETL Process, DB1, DB2, DW]
TRADITIONAL APPROACH PROBLEMS
• Expensive process
• Hard to map requirements to implementation
• IT optimizes only for performance
• Need more dynamicity (Big Data, data scope…)
INSPIRATION
• Model-driven approach
• ETL process as a business process
• Agile BI, Self-service BI
APPROACH
• User at the center of the iterative process
• Functional and non-functional requirements are analyzed at the same time using automatic Pattern management

User requirements driving flow redesign
Paper: A Framework for User-Centered Declarative ETL (DOLAP ’14)
• High-level representation for Business Users
• Translation to low-level models for IT and vice versa

Automated Process Redesign (POIESIS)
Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
AUTOMATIC GENERATION OF ALTERNATIVE PHYSICAL ETL FLOWS
• Alternative designs: Same functionality (constant data schemata), different flow components and permutations
• Policies and patterns
• Measures estimation for evaluation

Logical Modeling & FCPs
Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
LOGICAL MODELLING OF ETL FLOWS
• Each operator is a node in a DAG structure
• Flow Component Patterns represented in the same logical model
• Each (combination of) pattern application(s) produces a new ETL flow
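A minimal sketch of the DAG view of an ETL flow, assuming operators are plain string identifiers and edges are precedence dependencies. The class `EtlFlow`, its methods, and the operator names are hypothetical; they are not the POIESIS data model. The two structural measures computed here mirror the names used in the experiment slide, and counting operators (rather than edges) on the longest path is an assumption.

```python
from collections import defaultdict


class EtlFlow:
    """An ETL flow as a DAG: nodes are operators, edges are precedence dependencies.

    Hypothetical structure for illustration; the graph is assumed to be acyclic.
    """

    def __init__(self):
        self.successors = defaultdict(list)
        self.operators = set()

    def add_edge(self, src, dst):
        """Record that operator `src` must run before operator `dst`."""
        self.operators.update((src, dst))
        self.successors[src].append(dst)

    def edge_count(self):
        """Number of precedence dependencies (edges) in the flow."""
        return sum(len(targets) for targets in self.successors.values())

    def longest_path_length(self):
        """Number of operators on the longest path, via memoised depth-first search."""
        memo = {}

        def depth(node):
            if node not in memo:
                children = self.successors.get(node, [])
                memo[node] = 1 + max((depth(c) for c in children), default=0)
            return memo[node]

        return max((depth(op) for op in self.operators), default=0)


# Hypothetical flow for "suppliers in Europe sorted on revenue".
flow = EtlFlow()
flow.add_edge("extract_supplier", "filter_europe")
flow.add_edge("filter_europe", "join_nation")
flow.add_edge("extract_nation", "join_nation")
flow.add_edge("join_nation", "sort_revenue")
flow.add_edge("sort_revenue", "load")

print(flow.edge_count(), flow.longest_path_length())
```

Keeping the flow as an explicit graph means that a pattern application can be expressed as a graph rewrite that yields a new `EtlFlow`, leaving the original untouched for comparison.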

Flow Component Patterns (FCPs)
Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
Component Types
[Figure: component types; labels: Flow Component, Atomic ETL Step, Sequence Component, Crossflow Component, Pruning, Optimization]
Application Point:
• Edge
• Node
• Complete Graph
Application Properties:
• Applicability based on rules
• Fitness based on heuristics
FCP example: Crosscheck Data Sources
[Figure: Crossflow Component with Sequence Components SC1 and SC2 and Atomic ETL Steps: Extract from Alternative Data Source (E), Project Attributes of Interest (P), Join on Specific Keys (J), Compare Attributes (C), Project out Added Attributes (P); label A1]
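The idea of applying a pattern at an application point, gated by an applicability rule and ranked by a fitness heuristic, can be sketched as follows for the edge case. Everything here is an assumption made for illustration: the `Pattern` structure, the step names and their order in `CROSSCHECK` (read off the slide's figure labels), and the example rule and heuristic. It is not the POIESIS implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical pattern: a small subgraph with one entry and one exit step."""
    name: str
    internal_edges: tuple  # edges between the pattern's own steps
    entry: str             # step that receives the original flow
    exit: str              # step that hands the flow back


# Step names and ordering are an assumed reading of the Crosscheck example.
CROSSCHECK = Pattern(
    name="crosscheck",
    internal_edges=(
        ("extract_alt_source", "project_attrs_of_interest"),
        ("project_attrs_of_interest", "join_on_keys"),
        ("join_on_keys", "compare_attrs"),
        ("compare_attrs", "project_out_added_attrs"),
    ),
    entry="join_on_keys",
    exit="project_out_added_attrs",
)


def apply_on_edge(edges, edge, pattern):
    """Return a new edge list in which `edge` is rerouted through the pattern."""
    u, v = edge
    tag = f"{pattern.name}@{u}->{v}:"  # keeps step names unique per application
    renamed = [(tag + a, tag + b) for a, b in pattern.internal_edges]
    kept = [e for e in edges if e != edge]
    return kept + [(u, tag + pattern.entry)] + renamed + [(tag + pattern.exit, v)]


def alternatives(edges, pattern, is_applicable, fitness):
    """One alternative flow per applicable edge, highest fitness first."""
    points = sorted((e for e in edges if is_applicable(e)), key=fitness, reverse=True)
    return [apply_on_edge(edges, e, pattern) for e in points]


# Hypothetical rule: crosscheck only right after an extraction step.
flow_edges = [("extract_supplier", "filter_europe"), ("filter_europe", "load")]
alts = alternatives(
    flow_edges,
    CROSSCHECK,
    is_applicable=lambda e: e[0].startswith("extract"),
    fitness=lambda e: 1.0,
)
print(len(alts), len(alts[0]))
```

The input edge list is never modified, so each pattern application (or combination of applications) yields a separate candidate flow.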

Example Visualization
Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
MULTIDIMENSIONAL ANALYSIS
• Pareto frontier
• Each point represents an ETL flow
• Metrics (compound and detailed) compared to initial flow
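A Pareto frontier over alternative flows can be computed with a simple pairwise dominance check. The sketch below assumes every measure has been normalised so that higher is better; the flow names and scores are made up for illustration and are not results from the tool.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every measure and strictly
    better on at least one (higher is better for all measures)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(flows):
    """Keep only the flows that no other flow dominates."""
    return {
        name: scores
        for name, scores in flows.items()
        if not any(
            dominates(other, scores)
            for other_name, other in flows.items()
            if other_name != name
        )
    }


# Hypothetical (performance, data quality) scores in [0, 1].
flows = {
    "initial": (0.9, 0.6),
    "alt_1": (0.5, 1.0),
    "alt_2": (0.4, 0.9),
    "alt_3": (0.9, 0.7),
}

frontier = pareto_frontier(flows)
print(sorted(frontier))
```

The pairwise check is quadratic in the number of flows, which is acceptable for a few hundred alternatives; a sort-based sweep would be the usual choice for larger sets.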

Quality-aware testing
Paper: Bijoux: Data Generator for Evaluating ETL Process Quality (DOLAP ’14)
APPROACH
• An automatic, semantic-aware framework for generating testing workloads for evaluating quality of ETL processes
• Using a taxonomy of ETL operations and their semantics, create synthetic datasets to test flows
• Configurable properties (e.g., selectivity, distribution) to emphasize characteristics of specific flow parts
INVITED JOURNAL EXTENSION
• Information Systems, Elsevier 2015 (under review)
• Highlight workflow perspective and analyze properties like flow coverage
• Propose architecture and showcase updated implementation that scales
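As an illustration of a configurable property, the sketch below generates synthetic rows so that a filter of the form `revenue > threshold` passes an exact, requested fraction of them. The function name, parameters, column names, and value ranges are all assumptions for this example; this is not the Bijoux algorithm, which derives constraints from the semantics of the whole flow.

```python
import random


def generate_rows(n, selectivity, threshold=1000.0, null_rate=0.0, seed=0):
    """Generate `n` rows so that round(n * selectivity) satisfy revenue > threshold.

    Hypothetical generator: `null_rate` is the probability that `name` is None.
    """
    rng = random.Random(seed)  # seeded for reproducible test workloads
    n_pass = round(n * selectivity)
    rows = []
    for i in range(n):
        if i < n_pass:
            revenue = threshold + rng.uniform(1.0, 1000.0)  # passes the filter
        else:
            revenue = threshold - rng.uniform(0.0, 1000.0)  # fails the filter
        name = None if rng.random() < null_rate else f"supplier_{i}"
        rows.append({"name": name, "revenue": revenue})
    rng.shuffle(rows)  # avoid passing rows clustering at the start
    return rows


rows = generate_rows(1000, selectivity=0.25, null_rate=0.1)
passing = sum(r["revenue"] > 1000.0 for r in rows)
print(passing / len(rows))
```

Fixing the pass count rather than sampling each row independently keeps the achieved selectivity exact, which makes runs of different flows on the same workload directly comparable.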

Execution on the Cloud
ELASTICITY FOR RESPONSIVENESS
• Hundreds of flows executed very fast
• Load balancing based on pre-evaluation
[Figure: architecture; Master Web app (Load Balancer, Pre-evaluator, Policy Manager, Monitor, Measures Collector) exchanging JSON (nd, edg, m1, m2) with Slave Web apps running Pentaho DI (Kettle) on EC2 instance 1, EC2 instance 2, ..., EC2 instance n]
OPEN RESEARCH QUESTIONS
• Do instances share state? Common input data?
• Can results be generalized for platform dependent executions?
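The slide does not state which balancing policy is used. As one possible reading of "load balancing based on pre-evaluation", the sketch below assigns flows to slave instances greedily, largest estimated cost first, always picking the currently least-loaded slave (the classic longest-processing-time heuristic). The function name, the cost values, and the choice of heuristic are assumptions for illustration.

```python
import heapq


def balance(estimated_cost, n_slaves):
    """Assign each flow to the least-loaded slave, largest estimated cost first.

    `estimated_cost` maps a flow id to its pre-evaluated cost (hypothetical units).
    Returns a dict from slave index to the list of flow ids assigned to it.
    """
    heap = [(0.0, slave) for slave in range(n_slaves)]  # (current load, slave)
    heapq.heapify(heap)
    assignment = {slave: [] for slave in range(n_slaves)}
    by_cost = sorted(estimated_cost.items(), key=lambda kv: kv[1], reverse=True)
    for flow_id, cost in by_cost:
        load, slave = heapq.heappop(heap)
        assignment[slave].append(flow_id)
        heapq.heappush(heap, (load + cost, slave))
    return assignment


# Hypothetical pre-evaluated costs for five alternative flows.
costs = {"f1": 8.0, "f2": 7.0, "f3": 6.0, "f4": 5.0, "f5": 4.0}
assignment = balance(costs, n_slaves=2)
print(assignment)
```

This heuristic needs only the cost estimates, so it fits a design in which the master pre-evaluates flows before dispatching them; it does not address the open questions about shared state or common input data.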

Decomposition to Structural Patterns
QUALITY EVALUATION OF ETL FLOWS
• Different design choices lead to a large number of alternative ETL flows
• Need for fine-grained cost models
• Repository of patterns to increase reusability of models
[Figure: example ETL flow graph with operators O1 to O13, edges e1 to e13 and {T}/{F} labels, decomposed into patterns P1 to P4]
PATTERN-BASED DECOMPOSITION OF ETL FLOWS
• Classify structural patterns & identify on each flow
• Derive utility as a function of the patterns that each flow contains
• Adaptive model: Knowledge Base enrichment, flow evaluation improvement

Challenges
RELATE STRUCTURAL PATTERNS TO QUALITY MEASURES
• When and where is a quality pattern worth considering?
• Knowledge Base including pattern applications with detailed (measured) quality tradeoffs
• Also rules about pattern combinations
MODEL-THEORETIC PROPERTIES
• Accuracy, completeness
• How to evaluate significance of models?

Future Plan
BPM '16, ER '16, EDBT '16
JOURNALS
• DSS ’16: Using statistical methods to examine model-theoretic properties of ETL utility characteristics
• IJDWM ’16: ETL utility characteristics modelling and results from empirical study