data science in the wild
play

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses - PowerPoint PPT Presentation

Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the Wild, Spring 2019 1 Data Engineering Extract Transform Load & Clean Sources Data Warehouse Data Science in the Wild, Spring 2019 2


  1. Data Science in the Wild Lecture 12: Memory-Based Data Warehouses Eran Toch Data Science in the Wild, Spring 2019 � 1

  2. Data Engineering Extract Transform Load & Clean Sources Data Warehouse Data Science in the Wild, Spring 2019 � 2

  3. Agenda 1.What are Data Warehouse? 2.Data warehouse architecture 3.The design process 4.Transaction design 5.Periodic snapshot 6.Accumulative transactions Data Science in the Wild, Spring 2019 � 3

  4. <1> What are Data Warehouse? Data Science in the Wild, Spring 2019 � 4

  5. Data Warehouse: A definition Inventory Suppliers • A Data Warehouse is a central Sales HR repository of integrated data from one or more disparate sources • Data warehouses don’t aim to solve a single problem Customers Finance • Instead, they provide the infrastructure for an organizational data science process Stores Pharmacy Data Science in the Wild, Spring 2019 � 5

  6. Data Warehouse: Basic Architecture • It stores current and historical data in one single place • Data marts represent repositories for specific subjects (sales, orders, website navigation) Reporting Machine Data ETL Learning Warehouse User Staging Operational Operationaliza Systems tion Data Marts Data Science in the Wild, Spring 2019 � 6

  7. Technologies for Data Warehouse • RDBMS - Relational Database Management Systems • Hadoop • Hadoop / Spark • And many other variations Data Science in the Wild, Spring 2019 � 7

  8. Spark-based Architecture Data Science in the Wild, Spring 2019 � 8

  9. Data Science in the Wild, Spring 2019 � 9

  10. Comparison Operational Databases Data Warehouses Process Oriented Subject Oriented Add, Modify, Remove single rows Bulk load, rarely modify, never remove Online human / sensors entry ETL jobs Queries for small sets of rows with all Scan large sets for aggregates their details Using trained models Training models Data Science in the Wild, Spring 2019 � 10

  11. <2> Data Warehouse Architectures Data Science in the Wild, Spring 2019 � 11

  12. Designing the Data Warehouse • Dimensional model Date Dimension Product Dimension Sale Transaction Facts of a business Date Key (PK) Product Key (PK) Date key (FK) process: Date Attributes Product Attributes Product key (FK) (TBD) (TBD) Store Key (FK) • The facts we Promotion Key (FK) want to analyze POS Transaction Number Sales Quantity • The dimensions Sales Dollar Amount Store Dimension Promotion Dimension we analyze the Cost Dollar Amount Store Key (PK) Product Key (PK) facts Gross Profit Margin Store Attributes Promotion Attributes (TBD) (TBD) Data Science in the Wild, Spring 2019 � 12

  13. Two Super Architectures Star Schema Snowflake https://www.guru99.com/star-snowflake-data-warehousing.html Data Science in the Wild, Spring 2019 � 13

  14. Star Schema • Fact table contains a key and measure • Every dimension in a star schema is represented with the only one-dimension table • The dimension table is joined to the fact table • Dimension tables are not joined to each other • The dimension tables are not normalized • Main advantages • Queries are simpler • Optimizes number of joins Data Science in the Wild, Spring 2019 � 14

  15. Snowflake • The dimension tables are normalized which splits data into additional tables • Main advantages: • Optimizes storage • Easier to understand • Easier to engineer the dimensions (adding, removing etc) Data Science in the Wild, Spring 2019 � 15

  16. <3> The Design Process Data Science in the Wild, Spring 2019 � 16

  17. 4 Step Design Process 1. Identify the business process 2. Identify the facts 3. Declare the grain 4. Choose the dimensions Data Science in the Wild, Spring 2019 � 17

  18. Case Study: A Retail Sales Operation • ~2000 Stores • ~$75.17B yearly revenue • Typical 80K individual products (SKU’s) • In a store • In any given moment • ~10 departments • Food, medicine, cosmetics, nature, kids... Data Science in the Wild, Spring 2019 � 18

  19. Step 2: Identify the Facts • Examples • Identifying the numeric facts for analysis • Sales quantity • Facts are determined by answering the • Sales dollar amount question, "What are we measuring?" • Cost dollar amount • All candidate facts in a design must be true to • Gross profit margin the grain defined in step 2 • Facts that clearly belong to a different grain must be in a separate fact table Data Science in the Wild, Spring 2019 � 19

  20. Step 3: Find the Grain • Declaring the grain means specifying exactly what an individual fact table row represents. • The grain conveys the level of detail associated with the fact table measurements. The small golden • It provides the answer to the question, disk is a piece of "How do you describe a single row in pure gold the fact table?" weighing one troy grain. Data Science in the Wild, Spring 2019 � 20

  21. Examples • Total sale for each customer • A record for an individual line item on a customer's retail sales ticket as measured by a scanner device • A record for each item Data Science in the Wild, Spring 2019 � 21

  22. Grain Design • What are the options? POS Sale Transaction Facts • For example, summary of sales per day per POS Transaction Number store... (what’s the problem?) Sales Quantity • The grain of data should support Sales Dollar Amount • The ability to drill down • The ability to support independent dimensions Price reduction • Selected option: • Individual line item on a POS Data Science in the Wild, Spring 2019 � 22

  23. Good Grains • How should we judge a design? • Preferably you should develop dimensional models for the most atomic information captured by a business process • Atomic data is the most detailed information collected; such data cannot be subdivided further Data Science in the Wild, Spring 2019 � 23

  24. Step 4: Choose Dimensions • Dimensions can be designed by thinking about: • "How do business describe the data that results from the business process?" • We want a robust set of dimensions representing all possible descriptions that take on single values in the context of each measurement Data Science in the Wild, Spring 2019 � 24

  25. Summary 1. Identify the business process 2. Identify the facts 3. Declare the grain 4. Choose the dimensions Data Science in the Wild, Spring 2019 � 25

  26. Grain Design Pattern • Transactions • e.g., POS transaction, financial transaction, medical action • Periodic snapshot • Inventory state in a given day, closing stock price • Accumulative transactions • Order management, hospitalization process • Other types… Data Science in the Wild, Spring 2019 � 26

  27. <4> Transaction design Data Science in the Wild, Spring 2019 � 27

  28. Case Study Data • Point of Sale (POS): • Individual product purchase (as scanned at the POS) • Supply: • the purchase price of each product • Promotions • Temporary price reductions • Ads • Inserts • Coupons Data Science in the Wild, Spring 2019 � 28

  29. Preliminary Star Schema Date Dimension Product Dimension POS Sale Transaction Facts Date Key (PK) Product Key (PK) Date key (FK) Date Attributes Product Attributes Product key (FK) (TBD) (TBD) Store Key (FK) Promotion Key (FK) POS Transaction Number Sales Quantity Sales Dollar Amount Store Dimension Promotion Dimension Cost Dollar Amount Store Key (PK) Product Key (PK) Gross Profit Margin Store Attributes Promotion Attributes (TBD) (TBD) Data Science in the Wild, Spring 2019 � 29

  30. Date Dimension • Almost always exists in DW systems. • Can be built and populated in advance • Why is this important? • Each row represents one day (or one time unit) • 10 years = 3,650 rows Data Science in the Wild, Spring 2019 � 30

  31. Date Dimension Date Dimension Date Key (PK) Date Day of Week Day num in Epoch Product Dimension POS Sale Transaction Facts Week num in Epoch Date key (FK) Month num in Epoch Promotion Dimension Day num in calendar month Product key (FK) Day num in calendar year Store Key (FK) Calendar week num in year Store Dimension Promotion Key (FK) Day num in fiscal year ... POS Transaction Number Fiscal Week Sales Quantity Fiscal Week number in year Fiscal Month Sales Dollar Amount Fiscal Quarter Cost Dollar Amount Fiscal Year Holiday Indicator Gross Profit Margin Weekday Indicator Selling Season Major Event ... Data Science in the Wild, Spring 2019 � 31

  32. Date Example • There are many date attributes not supported by the SparkSQL date function, including fiscal periods, seasons, holidays, and weekends. • Rather than attempting to determine these non-standard calendar calculations in a query, we should look them up in a date dimension table Data Science in the Wild, Spring 2019 � 32

  33. Advantages of Date Dimension • Indexing integer-based key is faster than date key • Simplifies calculations • Simpler comparisons • Month vs. Month • Year vs. Year • Day of week, day of month • Special events Data Science in the Wild, Spring 2019 � 33

  34. Product Dimension • 60K current products -> 150K distinct products (SKUs) • Contains hierarchy: • Department • Category • Brand • SKU Data Science in the Wild, Spring 2019 � 34

Recommend


More recommend