data warehousing and olap
play

Data Warehousing and OLAP Large retailer Several databases: - PDF document

Motivation Data Warehousing and OLAP Large retailer Several databases: inventory, personnel, sales etc. High volume of updates Management requirements INFO 330 Efficient support for decision making Comprehensive view


  1. Motivation Data Warehousing and OLAP • Large retailer • Several databases: inventory, personnel, sales etc. • High volume of updates • Management requirements INFO 330 • Efficient support for decision making • Comprehensive view of all aspects of an enterprise • Trends, summaries, analysis of historical data • Information from several departments Slides courtesy of Mirek Riedewald • Why not using operational systems? Motivation (contd.) Outline • Integrate data from diverse sources • Overview of data warehousing • Common schema • Semantic mismatches (currency, naming, normalization, databases structure) • Clean data (missing values, inconsistencies) • Accumulate historical data • Not relevant for operational databases • Efficient analysis • Complex queries versus frequent updates Terminology From OLTP to the Data Warehouse • OLTP (Online Transaction Processing) • Traditionally, database systems stored data relevant to current business processes • DSS (Decision Support System) • Old data was archived or purged • DW (Data Warehouse) • A database stores the current snapshot of the • OLAP (Online Analytical Processing) business: • Current customers with current addresses • Current inventory • Current orders • Current account balance 1

  2. The Data Warehouse OLTP Architecture • The data warehouse is a historical collection of all relevant data for analysis purposes Clients • Examples: OLTP DBMSs • Current customers versus all customers Cash • Current orders versus history of all orders Register • Current inventory versus history of all shipments Product Purchase • Thus the data warehouse stores information that might be useless for the operational part of a business Inventory Update DW Architecture Building a Data Warehouse • Data warehouse is a collection of data marts Data Warehouse Information Sources OLAP Servers Server • Data marts contain one dimensional star MOLAP OLTP schema that captures one business aspect Analysis DBMSs • Notes: • It is crucial to centralize the logical definition and Query/Reporting format of dimensions and facts (political challenge; assign a dimension authority to each dimension). Extract Other Data Everything else is a distributed effort throughout the Clean Sources company (technical challenge) Data Mining Transform Data Marts • Each data mart will have its own fact table, but Aggregate dimension tables are duplicated over several data Load marts Update ROLAP OLTP Versus Data Warehousing Three Complementary Trends • Data Warehousing: Consolidate data from OLTP Data Warehouse many sources in one large repository Typical user Clerical worker Management • Loading, periodic synchronization of replicas System usage Regular business Analysis • Semantic integration Workload Read/Write Read only • OLAP: Types of queries Predefined Ad-hoc • Complex SQL queries and views Unit of interaction Transaction Query • Queries based on spreadsheet-style operations and Level of isolation High Low “multidimensional” view of data required • Interactive and “online” queries No of records accessed < 100 > 1,000,000 • Data Mining: Exploratory search for interesting No of concurrent users Thousands Hundreds trends and anomalies. Focus Data in and out Information out 2

  3. EXTERNAL DATA SOURCES Data Warehousing Warehousing Issues • Integrated data spanning • Semantic Integration: When getting data from EXTRACT long time periods, often multiple sources, must eliminate mismatches, TRANSFORM augmented with LOAD e.g., different currencies, schemas REFRESH summary information • Heterogeneous Sources: Must access data from • Several gigabytes to a variety of source formats and repositories terabytes common • Replication capabilities can be exploited here DATA Metadata • Interactive response WAREHOUSE • Load, Refresh, Purge: Must load data, Repository times expected for periodically refresh it, and purge too-old data complex queries; ad-hoc SUPPORTS • Metadata Management: Must keep track of updates uncommon source, loading time, and other information for all data in the warehouse DATA OLAP MINING Outline Dimensional Data Modeling • Recall: The relational model • Overview of data warehousing • Dimensional Modeling The dimensional data model: • Relational model with two different types of attributes and tables • Attribute level: Facts (numerical, additive, dependent) versus dimensions (descriptive, independent) • Table level: Fact tables (large tables with facts and foreign keys to dimensions) versus dimension tables (small tables with dimensions) Dimensional Modeling (contd.) OLTP versus Data Warehouse OLTP Data warehouse • Fact (attribute): • Dimension (attribute): Measures performance Specifies a fact • Regular relational • Dimensional model schema • Fact table in BCNF of a business • Example dimensions: • Normalized • Dimension tables not • Example facts: normalized: few updates, • Product, customer • Updates overwrite mostly queries data, sales person, previous values: One • Sales, budget, profit, • Updates add new version: store instance of a customer inventory Several instances of the with a unique • Example dimension • Example fact table: same customer (with customerID different data, e.g., table: • Transactions (timekey, • Queries return address) information about the storekey, pkey, promkey, • Sales(productid, • Queries return aggregate current state of affairs ckey, units, price) storeid, …) information about historical facts 3

  4. Example: Dimensional Data Modeling Another View: Star Schema ckey cid name byear state Customers: Transactions Time Store Dimension Table (timekey, Time: storekey, timekey Day Month Year Dim. Table pkey, promkey, ckey, Transactions: ckey timekey pkey #units $price units, Fact Table price) pkey pid pname price category Customers Products: Products Promotions Dim. Table Fact versus Dimension Tables Grain • Fact tables are usually very large; they • The grain defines the level of resolution of can grow to several hundred GB and TB a single record in the fact table. • Dimension tables are usually smaller • Example fact tables: (although can grow large, e.g., Customers • Transactions (timekey, storekey, pkey, table), but they have many fields promkey, ckey, units, price); grain is individual item • Queries over fact tables usually involve • Transactions (timekey, storekey, ckey, units, many records price); grain is one market basket Typical Queries Example Query • SQL: • “Break down sales by year and category for the last two years; show only categories with more SELECT D1.d1, …, Dk.dk, agg1(F.f1) FROM Dimension D1, …, than $1M in sales.” Dimension Dk, Fact F • SQL: WHERE D1.key = F.key1 AND … AND SELECT T.year, P.category, SUM(X.units * X.price) Dk.keyk = F.keyk AND FROM Time T, Products P, Transactions X otherPredicates WHERE T.year = 1999 OR T.year = 2000 GROUP BY D1.d1, …, Dk.dk GROUP BY T.year, P.category HAVING groupPredicates HAVING SUM(X.units * X.price) > 1000000 • This query is called a “Star Join”. 4

  5. Outline Online Analytical Processing (OLAP) • Ad hoc complex queries • Overview of data warehousing • Simple, but intuitive and powerful query • Dimensional Modeling interface • Online Analytical Processing • Spreadsheet influenced analysis process • Specialized query operators for multidimensional analysis • Roll-up and drill-down • Slice and dice • Pivoting Visual Intuition: Cube Multidimensional Data Analysis roll-up to category Data warehouse: Customer Data Mart roll-up to state Transactions(ckey, timekey, pkey, units, price) SH SF Customers(ckey, cid, name, byear, city, state, country) LA Time(tkey, day, month, quarter, year) Product1 20 Product Product2 30 Products(pkey, pname, price, pid, category, industry) Product3 20 Product4 15 Hierarchies on dimensions: Product5 10 Year Industry Product6 50 Country roll-up to week M T W Th F S S Quarter Category Time State Month Week 50 Units of Product6 sold on Monday in LA Product City Day Multidimensional Data Analysis Corresponding Query in SQL NY CA WI • SELECT SUM(units) FROM Transactions T, Products P, Customers C Industry1 $1000 $2000 $1000 WHERE T.pkey = P.pkey AND T.ckey = C.ckey Industry2 $500 $1000 $500 AND C.country = “USA” GROUP BY P.industry, C.state Industry3 $3000 $3000 $3000 • We think that Industry3 in CA is interesting. Year Year Industry Industry Country=“USA” Country=“USA” Quarter Quarter Category Category State State Month Week Month Week Product Product City Day City Day 5

Recommend


More recommend