
Automating User-Centered Design of Data-Intensive Processes



  1. Automating User-Centered Design of Data-Intensive Processes
     Research Project Report (RPR)
     Vasileios Theodorou, 26-05-2015
     Host University, Home University
     Coadvisor: Dr. Maik Thiele
     Supervisors: Prof. Alberto Abelló, Prof. Wolfgang Lehner
     RPR - Doctoral Colloquium, IT4BI-DC (eBISS 2015)

  2. Example - Two Alternative Flows
     Conceptual model of flow: “Details about suppliers in Europe sorted on revenue”
     • ETL Flow A
     • ETL Flow B

  3. Measures from experiments

     Quality attribute | Measure                      | ETL Flow A        | ETL Flow B
     Performance       | Process cycle time           | 10.4 sec          | 18.9 sec
     Performance       | Throughput                   | 52,906 tuples/sec | 29,179 tuples/sec
     Data quality      | % of correct tuples          | 91.5%             | 100%
     Data quality      | % of non-null tuples         | 90.3%             | 95.2%
     Understandability | # of precedence dependencies | 20                | 40
     Manageability     | Length of longest path       | 9 steps           | 23 steps

     EXECUTION
     • TPC-H with scale factor (s.f.) = 1
     • Executed on Pentaho Data Integration (Kettle)
     • From Flow A to Flow B, data quality improved; performance, understandability and manageability were reduced
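
The trade-off between the two flows can be made concrete by computing the relative change of each measure. The sketch below only reuses the numbers reported on this slide; the dictionary layout and the helper name `relative_change` are illustrative assumptions, not part of the original work.

```python
# Measures copied from the slide: (ETL Flow A, ETL Flow B).
# The data layout and helper name are hypothetical, chosen for illustration.
measures = {
    "process cycle time (sec)": (10.4, 18.9),
    "throughput (tuples/sec)": (52906, 29179),
    "% of correct tuples": (91.5, 100.0),
    "% of non-null tuples": (90.3, 95.2),
    "# of precedence dependencies": (20, 40),
    "length of longest path (steps)": (9, 23),
}


def relative_change(a, b):
    """Change from Flow A to Flow B, expressed as a fraction of Flow A's value."""
    return (b - a) / a


changes = {name: relative_change(a, b) for name, (a, b) in measures.items()}

for name, change in changes.items():
    # Positive means the raw value grew; whether that is "better" depends on the measure.
    print(f"{name}: {change:+.1%}")
```

Note that the sign alone does not say whether a measure improved: a higher cycle time is worse, while a higher percentage of correct tuples is better.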

  4. Agenda
     APPROACH
     • Conceptual model reflecting user requirements
     • User requirements-driven flow redesign
     • Automatic “quality” pattern integration
     • Configurable testing
     CHALLENGES AND DISCUSSION
     • Relate patterns to utility
     • Assess pattern significance, model accuracy & completeness
     • Future plan

  5. ETL Quality Attributes
     Paper: Quality Measures for ETL Processes (DaWaK ’14)
     TRADE-OFFS
     • It’s not only about performance!
     • Improving some quality attributes can affect others positively or negatively

  6. ETL Quality Attributes
     Paper: Quality Measures for ETL Processes (DaWaK ’14)
     CONTRIBUTION
     • Define a set of ETL process quality characteristics AND the relationships between them
     • Provide quantitative measures for each characteristic, backed by literature!
     METHODOLOGY
     • Systematic literature review (SLR) for quality attributes specific to data-intensive processes
     • Collection from literature of (proven) metrics for monitoring and quantitatively evaluating ETL processes
     INVITED JOURNAL EXTENSION
     • Special Issue of Journal CCPE 2015 (under minor revision)
     • Introduce and apply goal modeling “stepping” on defined models
     • Showcase evaluation of use case ETLs using proposed measures

  7. User requirements driving flow redesign
     Paper: A Framework for User-Centered Declarative ETL (DOLAP ’14)
     TRADITIONAL APPROACH PROBLEMS
     • Expensive process
     • Hard to map requirements to implementation
     • IT optimizes only for performance
     • Need more dynamicity (Big Data, data scope…)
     INSPIRATION
     • Model-driven approach
     • ETL process as a business process
     • Agile BI, Self-service BI
     APPROACH
     • User at the center of the iterative process
     • Functional and non-functional requirements are analyzed at the same time using automatic Pattern management
     [Diagram labels: Business User, requirements, IT, ETL Process, DB1, DB2, DW]

  8. User requirements driving flow redesign
     Paper: A Framework for User-Centered Declarative ETL (DOLAP ’14)
     • High level representation for Business Users
     • Translation to low level models for IT and vice versa

  9. Automated Process Redesign (POIESIS)
     Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
     AUTOMATIC GENERATION OF ALTERNATIVE PHYSICAL ETL FLOWS
     • Alternative designs: Same functionality (constant data schemata), different flow components - permutations
     • Policies and patterns
     • Measures estimation for evaluation

  10. Logical Modeling & FCPs
     Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
     LOGICAL MODELLING OF ETL FLOWS
     • Each operator is a node in a DAG structure
     • Flow Component Patterns represented in the same logical model
     • Each (combination of) pattern application(s) produces a new ETL flow
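
A minimal sketch of the DAG view of a flow, written in Python with the standard-library `graphlib` module (Python 3.9+). The operator names, the dictionary layout, and both helper functions are assumptions made for illustration. The sketch also assumes that "precedence dependencies" means the number of edges and that the longest path is counted in operators; the slides do not define either measure precisely.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical logical flow: each operator maps to the set of operators it depends on.
flow = {
    "extract_supplier": set(),
    "extract_nation": set(),
    "join": {"extract_supplier", "extract_nation"},
    "filter_europe": {"join"},
    "sort_revenue": {"filter_europe"},
    "load": {"sort_revenue"},
}


def precedence_dependencies(preds):
    """Count the edges of the DAG (assumed reading of '# of precedence dependencies')."""
    return sum(len(p) for p in preds.values())


def longest_path_length(preds):
    """Number of operators on the longest path; raises graphlib.CycleError if not a DAG."""
    depth = {}
    # static_order() yields every operator after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        depth[node] = 1 + max((depth[p] for p in preds[node]), default=0)
    return max(depth.values())


n_dependencies = precedence_dependencies(flow)
n_longest = longest_path_length(flow)
print(n_dependencies, n_longest)
```

Keeping the flow as a plain predecessor map means a pattern application can be expressed as a function from one such map to another, which matches the idea that each application produces a new flow.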

  11. Flow Component Patterns (FCPs)
     Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
     Component Types:
     • Atomic ETL Step
     • Sequence Component
     • Crossflow Component
     • Flow Component
     Application Point:
     • Edge
     • Node
     • Complete Graph
     Application Properties:
     • Applicability based on rules → Pruning
     • Fitness based on heuristics → Optimization
     FCP example: Crosscheck Data Sources (a Crossflow Component)
     • Extract from Alternative Data Source (E)
     • Project Attributes of Interest (P)
     • Join on Specific Keys (J)
     • Compare Attributes (C)
     • Project out Added Attributes (P)
     [Diagram labels: SC1 and SC2 (Sequence Components), A1 (Atomic ETL Step)]
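
The idea of an application point with a rule-based applicability check can be sketched as follows. This is not the POIESIS implementation: the `EdgePattern` class, the `alternatives` function, the example pattern, and the edge-set representation of a flow are all hypothetical, and only the "Edge" application point is covered.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Edge = Tuple[str, str]
Flow = FrozenSet[Edge]  # assumed representation: a flow is a set of directed edges


@dataclass(frozen=True)
class EdgePattern:
    """Hypothetical pattern that is applied on an edge of the flow."""
    name: str
    step: str                                 # operator inserted on the edge
    applicable: Callable[[Flow, Edge], bool]  # rule used to prune invalid application points

    def apply(self, flow: Flow, edge: Edge) -> Flow:
        """Return a new flow with the pattern's step spliced into the given edge."""
        src, dst = edge
        new_node = f"{self.step}@{src}->{dst}"
        return (flow - {edge}) | {(src, new_node), (new_node, dst)}


def alternatives(flow: Flow, patterns: List[EdgePattern]):
    """Enumerate one alternative flow per valid (pattern, edge) pair."""
    result = []
    for pattern in patterns:
        for edge in sorted(flow):
            if pattern.applicable(flow, edge):
                result.append((pattern.name, edge, pattern.apply(flow, edge)))
    return result


flow = frozenset({("extract", "filter"), ("filter", "load")})
# Illustrative rule: null filtering is only considered right after an extraction.
filter_nulls = EdgePattern(
    name="filter_nulls",
    step="drop_null_rows",
    applicable=lambda f, e: e[0].startswith("extract"),
)
alts = alternatives(flow, [filter_nulls])
```

A fitness heuristic would then rank the surviving candidates; it is omitted here because the slides do not describe how fitness is computed.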

  12. Example Visualization
     Demo Paper: POIESIS: a Tool for Quality-aware ETL Process Redesign (EDBT ’15)
     MULTIDIMENSIONAL ANALYSIS
     • Pareto frontier
     • Each point represents an ETL flow
     • Metrics (compound and detailed) compared to initial flow
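
A Pareto frontier over alternative flows keeps only the flows that no other flow beats on every metric at once. The sketch below assumes every metric has been normalized so that higher is better; the flow names and scores are invented for illustration and are not results from the presentation.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(flows):
    """Keep the flows that are not dominated by any other flow."""
    return {
        name: scores
        for name, scores in flows.items()
        if not any(dominates(other, scores) for other in flows.values())
    }


# Hypothetical (performance, data quality) scores, higher is better.
flows = {
    "initial": (0.9, 0.6),
    "alt_1": (0.5, 1.0),
    "alt_2": (0.4, 0.8),  # worse than alt_1 on both metrics
    "alt_3": (0.7, 0.8),
}
frontier = pareto_frontier(flows)
print(sorted(frontier))
```

The pairwise check is quadratic in the number of flows, which is acceptable for a few hundred alternatives.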

  13. Quality-aware testing
     Paper: Bijoux: Data Generator for Evaluating ETL Process Quality (DOLAP ’14)
     APPROACH
     • An automatic, semantic-aware framework for generating testing workloads for evaluating quality of ETL processes
     • Using a taxonomy of ETL operations and their semantics, create synthetic datasets to test flows
     • Configurable properties (e.g., selectivity, distribution) to emphasize characteristics of specific flow parts
     INVITED JOURNAL EXTENSION
     • Information Systems, Elsevier 2015 (under review)
     • Highlight workflow perspective and analyze properties like flow coverage
     • Propose architecture and showcase updated implementation that scales
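
To illustrate what a configurable selectivity means for generated test data, the sketch below builds rows so that a chosen fraction of them passes a filter operator. It is not Bijoux's algorithm: the function name, the `revenue > threshold` predicate, and the row layout are assumptions made for this example.

```python
import random


def generate_rows(n, selectivity, threshold=100, seed=0):
    """Generate n rows so that round(n * selectivity) of them satisfy revenue > threshold."""
    rng = random.Random(seed)  # seeded for reproducible test workloads
    n_pass = round(n * selectivity)
    # Rows that pass the filter: revenue strictly above the threshold.
    rows = [{"id": i, "revenue": threshold + rng.randint(1, 1000)} for i in range(n_pass)]
    # Rows that fail the filter: revenue at or below the threshold.
    rows += [{"id": i, "revenue": threshold - rng.randint(0, threshold)} for i in range(n_pass, n)]
    rng.shuffle(rows)  # avoid an artificial ordering of passing and failing rows
    return rows


rows = generate_rows(1000, selectivity=0.25)
passed = [r for r in rows if r["revenue"] > 100]
print(len(passed) / len(rows))
```

Fixing the selectivity of one operator this way makes it possible to stress the part of the flow that follows it with a known data volume.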

  14. Execution on the Cloud
     ELASTICITY FOR RESPONSIVENESS
     • Hundreds of flows executed very fast
     • Load balancing based on pre-evaluation
     OPEN RESEARCH QUESTIONS
     • Do instances share state? Common input data?
     • Can results be generalized for platform dependent executions?
     [Architecture diagram: a Master Web app (Load Balancer, Pre-evaluator, Policy Manager, Monitor, Measures Collector) exchanges JSON messages (fields nd, edg, m1, m2) with EC2 instances 1 to n, each running a Slave Web app and Pentaho DI (Kettle)]
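
The slides do not say how the load balancer uses the pre-evaluation results. One plausible reading, sketched below, is a greedy longest-processing-time-first assignment: flows are sorted by estimated cost and each one goes to the currently least-loaded instance. The function name, the cost estimates, and the instance numbering are assumptions.

```python
import heapq


def assign_flows(estimated_cost, n_instances):
    """Greedy assignment: most expensive flow first, always to the least-loaded instance."""
    heap = [(0.0, i) for i in range(n_instances)]  # (accumulated load, instance id)
    heapq.heapify(heap)
    assignment = {i: [] for i in range(n_instances)}
    for flow, cost in sorted(estimated_cost.items(), key=lambda kv: -kv[1]):
        load, instance = heapq.heappop(heap)
        assignment[instance].append(flow)
        heapq.heappush(heap, (load + cost, instance))
    return assignment


# Hypothetical pre-evaluated cost estimates (seconds) for four alternative flows.
costs = {"flow_a": 10.4, "flow_b": 18.9, "flow_c": 7.0, "flow_d": 3.5}
plan = assign_flows(costs, n_instances=2)
print(plan)
```

This heuristic ignores the open questions listed on the slide, such as shared state or common input data between instances.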

  15. Decomposition to Structural Patterns
     QUALITY EVALUATION OF ETL FLOWS
     • Different design choices → large number of alternative ETL flows
     • Need for fine-grained cost models
     • Repository of patterns to increase reusability of models
     [Figure: example ETL flow graph (operators O1-O13, edges e1-e13, edge labels {T}/{F}) partitioned into patterns P1-P4]
     PATTERN-BASED DECOMPOSITION OF ETL FLOWS
     • Classify structural patterns & identify on each flow
     • Derive utility as a function of the patterns that each flow contains
     • Adaptive model: Knowledge Base enrichment, Flow evaluation improvement

  16. Challenges
     RELATE STRUCTURAL PATTERNS TO QUALITY MEASURES
     • When and where is a quality pattern worth considering?
     • Knowledge Base including pattern applications – detailed (measured) quality tradeoffs
     • Also rules about pattern combinations
     MODEL-THEORETIC PROPERTIES
     • Accuracy, completeness
     • How to evaluate significance of models?

  17. Future Plan
     BPM ’16, ER ’16, EDBT ’16
     JOURNALS
     • DSS ’16: Using statistical methods to examine model-theoretic properties of ETL utility characteristics
     • IJDWM ’16: ETL utility characteristics modelling and results from empirical study
