w arehousing the most common form of information in
play

W arehousing The most common form of information in - PDF document

W arehousing The most common form of information in tegration: cop y sources in to a single DB and try to k eep it up-to-date. Usual metho d: p erio dic reconstruction of the w arehouse, p erhaps o v


  1. W arehousing The most common form of information � in tegration: cop y sources in to a single DB and try to k eep it up-to-date. Usual metho d: p erio dic reconstruction of the � w arehouse, p erhaps o v ernigh t. 1

  2. OL TP V ersus OLAP Most database op erations are of a t yp e called � ( OL TP ). on-line tr ansaction pr o c essing ✦ Short, simple queries and frequen t up dates in v olving one or a small n um b er of tuples. ✦ Examples: answ ering queries from a W eb in terface, recording sales at cash-registers, selling airline tic k ets. 2

  3. Of increasing imp ortance are op erations of the � ( OLAP ) t yp e. on-line analytic pr o c essing ✦ F ew, but v ery complex and time- consuming queries (can run for hours). ✦ Up dates are infrequen t, and/or the answ er to the query is not dep enden t on ha ving an absolutely up-to-date database. ✦ Example: Amazon analyzes purc hases b y all its customers to come up with an individual screen with pro ducts of lik ely in terest to the customer. ✦ Example: Analysts at W al-Mart lo ok for items with increasing sales at stores in some region. Common arc hitecture: Lo cal databases, sa y � one p er branc h store, handle OL TP , while a w arehouse in tegrating information from all branc hes handles OLAP . The most complex OLAP queries are often � referred to as mining . data 3

  4. Star Sc hemas Commonly , the data at a w arehouse is of t w o t yp es: 1. : V ery large, accum ulation of facts F act Data suc h as sales. ✦ Often \insert-only"; once there, a tuple remains. 2. : Smaller, generally static, Dimension Data information ab out the en tities in v olv ed in the facts. 4

  5. Example Supp ose w e w an ted to record ev ery sale of b eer at all bars: the bar, the b eer, the drink er who b ough t the b eer, the da y and time, the price c harged. F act data is in a relation with sc hema: � Sales(bar, beer, drinker, day, time, price) Dimension data could include a relation for � bars, one for b eers, and one for drink ers. Bars(bar, addr, lic) Beers(beer, manf) Drinkers(drinker, addr, phone) 5

  6. Tw o Approac hes to Building W arehouses 1. (Relational OLAP): relational R OLAP database system tuned for star sc hemas, e.g. using sp ecial index structures suc h as: ✦ \Bitmap indexes" (for eac h k ey of a dimension table, e.g., bar name, a bit- v ector telling whic h tuples of the fact table ha v e that v alue). ✦ = answ ers to general Materialize d views queries from whic h more sp eci�c queries can b e answ ered with less w ork than if w e had to w ork from the ra w data. 2. (Multidi mensional OLAP): A MOLAP sp eciali zed mo del based on a \cub e" view of data. 6

  7. R OLAP T ypical queries b egin with a complete \star join," for example: SELECT * FROM Sales, Bars, Beers, Drinkers WHERE Sales.bar = Bars.bar AND Sales.beer = Beers.beer AND Sales.drinker = Drinkers.drinker; T ypical OLAP query will: � 1. Do all or part of the star join. 2. Filter in teresting tuples based on fact and/or dimension data. 3. Group b y one or more dimensions. 4. Aggregate the result. Example: \F or eac h bar in P alo Alto, �nd � the total sale of eac h b eer man ufactured b y Anheuser-Busc h." 7

  8. P erformance Issues If the fact table is large, queries will tak e � m uc h to o long. Materializ ed views can help. � Example F or the question ab out bars in P alo Alto and b eers b y Anheuser-Busc h, w e w ould b e aided b y the materialized view: CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS SELECT bar, addr, beer, manf, SUM(price) AS sales FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers GROUP BY bar, addr, beer, manf; 8

  9. MOLAP Based on \data cub e": k eys of dimension tables form axes of the cub e. Example: for our running example, w e migh t � ha v e four dimensions: bar, b eer, drink er, and time. Dep enden t attributes (price of the sale in our � example) app ear at the p oin ts of the cub e. But the cub e also includes aggregations (sums, � t ypicall y) along the margins. ✦ Example: in our 4-dimensional cub e, w e w ould ha v e the sum o v er eac h bar, eac h b eer, eac h drink er, and eac h time instan t (p erhaps group b y da y). ✦ W e w ould also ha v e aggregations b y all subsets of the dimensions, e.g., b y eac h bar and b eer, or b y eac h b eer, drink er, and da y . 9

  10. Slicing and Dicing = select a v alue along one dimension, Slic e � e.g., a particular bar. = the same thing along another Dic e � dimension, e.g., a particular b eer. Drill-Do wn and Roll-Up = \de-aggregate" = break an Dril l-down � aggregate in to its constituen ts. ✦ Example: ha ving determined that Jo e's Bar in P alo Alto is selling v ery few Anheuser-Busc h b eers, break do wn his sales b y the particular b eer. = aggregate along one dimension. R ol l-up � ✦ Example: giv en a table of ho w m uc h Budw eiser eac h drink er consumes at eac h bar, roll it up in to a table of amoun t consumed b y eac h drink er. 10

  11. P erformance As with R OLAP , materialized views can help. Data-cub es in vite materialized views that are � aggregations in one or more dimensions. Dimensions need not b e aggregated � completely . Rather, grouping b y attributes of the dimension table is p ossible. ✦ Example: a materialized view migh t aggregate b y drink er completely , b y b eer not at all, b y time according to the da y , and b y bar only according to the cit y of the bar. ✦ Example: time is a really in teresting dimension, since there are natural groupings, suc h as w eeks and mon ths, that are not commensurate. 11

  12. Data Mining Large-scale queries designed to extract patterns from data. Big example: \asso ciation-rule s" or \frequen t � itemsets." Mark et-Bask et Data An imp ortan t source of data for asso ciation rules is askets . market b As a customer passes through the c hec k out, � w e learn what items they buy together, e.g., ham burger and k etc h up. Giv es us data with sc hema Baskets(bid, � item) . Mark eters w ould lik e to kno w what items � p eople buy together. ✦ Example: if p eople tend to buy ham burger and k etc h up together, put them near eac h other, with p otato c hips b et w een. ✦ Example: run a sale on ham burger and raise the price of k etc h up. 12

  13. Simplest Problem: Find the F requen t P airs of Items Giv en a s , w e could ask: supp ort thr eshold Find the pairs of items that app ear together in � at least bask ets. s SELECT b1.item, b2.item FROM Baskets b1, Baskets b2 WHERE b1.bid = b2.bid AND b1.item < b2.item GROUP BY b1.item, b2.item HAVING COUNT(*) >= s; 13

  14. A-Priori T ric k Ab o v e query is prohibitiv ely exp ensiv e for � large data. uses the fact that a pair A-priori algorithm � ( i; ) cannot ha v e supp ort unless and j s i j b oth ha v e supp ort b y themselv es. s More e�cien t implemen tation uses an � in termediate relation Baskets1 . INSERT INTO Baskets1(bid, item) SELECT * FROM Baskets WHERE item IN ( SELECT item FROM Baskets GROUP BY item HAVING COUNT(*) >= s ); Then run the query for pairs on Baskets1 � instead of Baskets . 14

Recommend


More recommend