Overview Multidimensional Databases • Cubes: Dimensions, Facts, Measures • OLAP queries • Relational design • Redundancy • Strengths and weaknesses of the multidimensional model Original slides were written by Torben Bach Pedersen Aalborg University 2007 - DWML course 2 Why not ER model? The multidimensional model • One purpose • ER model: many purposes � Flexible � Data analysis � General • Better at that purpose • All types of data are “equal”, no difference between: � Less flexible � What is important � Not suited for OLTP systems � What just describes the important • ER models are large • More built in “meaning” � 50-1000 entities/relations � What is important � Hard to get an overview � What describes the important • ER models implemented in RDBMSes � What we want to optimize � Normalized databases spread information � easy for querying � When analyzing data, the information must be integrated again Aalborg University 2007 - DWML course 3 Aalborg University 2007 - DWML course 4
The multidimensional model Cube Example • Data is divided into: � Facts � Dimensions • Facts are the important entity: a sale � Facts have measures that can be aggregated: sales price • Dimensions describe facts � A sale has the dimensions Product, Store and Time • Goal for dimensional modeling: Dimension � Surround facts with as much context (dimensions) as possible values � But you should not try to model all relationships in the data (unlike ER modeling!) Cell (aggregated measure) Aalborg University 2007 - DWML course 5 Aalborg University 2007 - DWML course 6 Cubes Dimensions • A “cube” may have many dimensions! • Dimensions: core of multidimensional databases � Theoretically no limit for the number of dimensions • Dimensions are used for � Typical cubes have 4-12 dimensions � Selection of data • But only 2-3 dimensions can be viewed at a time � Grouping of data at the right level of detail � Dimensionality reduced by queries via projection/aggregation • Dimensions consist of dimension values • A cube consists of cells � Product dimension values: “milk”, “cream”, … � A given combination of dimension values � Time dimension values: “1/1/2001”, “2/1/2001”,… � empty cell = no data for this combination • Dimension values may have an ordering � sparse cube: few non-empty cells � Used for comparing cube data across values � dense cube: many non-empty cells � Especially used for Time dimension � Cubes become sparse at high dimensionality Aalborg University 2007 - DWML course 7 Aalborg University 2007 - DWML course 8
Dimensions Dimension Example • Dimensions have hierarchies with levels � Typically 3-5 levels (of detail) � Dimension values are organized in a tree structure ◆ Product : Product � Type � Category ◆ Store : Store � Area � City � County ◆ Time : Day � Month � Quarter � Year � Dimensions have a bottom level and a top level (ALL) • Levels may have attributes � Simple, non-hierarchical information � Day has Workday as attribute • Dimensions should contain much information � Time dimensions may contain holiday, season, events,… Schema Instance � Good dimensions have 50-100 or more attributes/levels Aalborg University 2007 - DWML course 9 Aalborg University 2007 - DWML course 10 Dimension Example (cont’) • Not necessarily total order • Why we need hierarchy in dimension values? • Can be partial order � Hint: Compare the following schemas from the user’s view • Why each dimension should contain many attributes? � Hint: What happens if the user cannot find holiday, season, T T events,… from the Time dimension? Category Year Week T Type Month Category T Day Product Product Type Product Schema Time Schema Product Product Schema A Product Schema B Aalborg University 2007 - DWML course 11 Aalborg University 2007 - DWML course 12
Facts Types Of Facts • Event fact (transaction) • Facts represent the subject of the desired analysis � A fact for every business event (sale) � The “important” in the business that should be analyzed • “ Fact-less ” facts • A fact is most often identified via its dimension values � A fact per event (customer contact) � A fact is a non-empty cell � No numerical measures � Some models give facts an explicit identity � An event has happened for a given dimension value combination • Generally a fact should • Snapshot fact � Be attached to exactly one dimension value in each dimension � A fact for every dimension combination at given time intervals � Only be attached to dimension values in the bottom levels � Captures current status (inventory) • Cumulative snapshot facts � A fact for every dimension combination at given time intervals � Captures cumulative status up to now (sales in year to date) Aalborg University 2007 - DWML course 13 Aalborg University 2007 - DWML course 14 Granularity Measures • Granularity of facts is important � What does a single fact mean? • Measures represent the fact property that the users � Level of detail want to study and optimize � Given by combination of bottom levels � Example: total sales price � Example: “total sales per store per day per product” • A measure has two components • Important for number of facts � Numerical value : (sales price) � Scalability � Aggregation formula (SUM): used for aggregating/combining • Often the granularity is a single business transaction a number of measure values into one � Example: sale � Measure value determined by dimension value combination � Sometimes the data is aggregated ( total sales per store per day � Measure value is meaningful for all aggregation levels per product) • Most multidimensional models have measures � Might be necessary due to scalability • Generally, transaction detail can be handled � Except perhaps huge clickstreams etc. Aalborg University 2007 - DWML course 15 Aalborg University 2007 - DWML course 16
Types Of Measures Schema Documentation Measure type Property Example Occurence Additive Can be aggregated Often occur in • No well-defined standard sales price over all dimensions event facts • Our own notation Semi-additive Cannot be Often occur in Seen to the right inventory � aggregated over snapshot facts T level corresponds to ALL � some dimensions - • Modeling and OLAP tools typically time have their own notation Non-additive Cannot be Occur in all average aggregated over types of facts sales price any dimensions Aalborg University 2007 - DWML course 17 Aalborg University 2007 - DWML course 18 (Relational) OLAP Queries OLAP Queries • Two kinds of queries Slice/Dice : Starting level selection, (City, Year, Product) � Navigation queries examine one dimension Year=2000 ◆ SELECT DISTINCT l FROM d [WHERE p] � Aggregation queries summarize fact data ◆ SELECT d1.l1,d2.l2,SUM(f.m) FROM d1,d2,f WHERE f.dk1=d1.dk1 AND f.dk2=d2.dk2 [AND p] Roll-up: get Drilll-down: GROUP BY d1.l1,d2.l2 overview more detail • Fast, interactive analysis of large amounts of data • Spreadsheet on a cube All Time Aalborg University 2007 - DWML course 19 Aalborg University 2007 - DWML course 20
ROLAP MOLAP • Relational OLAP • Multidimensional OLAP • Data stored in relational tables • Data stored in special multidimensional data structures � Star (or snowflake) schemas used for modeling • Pros � SQL used for querying � Less storage use (“foreign keys” not stored) • Pros � Faster query response times � Leverages investments in relational technology • Cons � Scalable (billions of facts) � Up till now not so good scalability (changing) � Flexible, designs easier to change � Less flexible, e.g., cube must be re-computed when design � New, performance enhancing techniques adapted from MOLAP changes ◆ Indices, materialized views, special treatment of star schemas � Does not reuse an existing investment (but often bundled with RDBMS) • Cons � Not as open technology � Storage use (often 3-4 times MOLAP) � Response times Aalborg University 2007 - DWML course 21 Aalborg University 2007 - DWML course 22 HOLAP Relational Implementation • Hybrid OLAP • The cube is often implemented in an RDBMS • Detail data stored in relational tables (ROLAP) • Fact table stores facts � One column for each measure • Aggregates stored in multidimensional structures (MOLAP) � One column for each dimension (foreign key to dimension table) • Pros � Dimensions keys make up composite primary key � Scalable (as ROLAP) • Dimension table stores dimension � Fast (as MOLAP) � Integer key column (surrogate keys) • Cons � Don’t use production keys/codes! Why? � Complexity • Goal for dimensional modeling: surround the facts with as much context (dimensions) as we can • Granularity of the fact table is important � What does one fact table row represent? � Important for the size of the fact table � Often corresponding to a single business transaction (sale) � But it can be aggregated (sales per product per day per store) Aalborg University 2007 - DWML course 23 Aalborg University 2007 - DWML course 24
Recommend
More recommend