"We should forget about small efficiencies, Database Optimization say about 97% of the time: premature optimization is the root of all evil“ Indexes Non-natural keys - Donald Knuth, 1974 Denormalization 1 2 Quiz! Quiz! How costly is this operation (naive solution)? Can you think of a way to make it faster? course per weekday hour room SELECT * TDA356 2 VR Monday 13:15 FROM Lectures TDA356 2 VR Thursday 08:00 WHERE course = ’TDA357’ TDA356 4 HB1 Tuesday 08:00 AND period = 3; n TDA356 4 HB1 Friday 13:15 If rows were stored sorted according to the values TIN090 1 HC1 Wednesday 08:00 course and period, we could get all rows with the TIN090 1 HA3 Thursday 13:15 given values faster (O(log n) for tree structure). SELECT * Storing rows sorted is expensive, but we can use Go through all n rows, compare FROM Lectures an index that given values of these attributes with the values for course and WHERE course = ’TDA357’ points out all sought rows (an index could be a period = 2 n comparisons hash map, giving O(1) complexity to lookups). AND period = 3; 3 4 1
Index Quiz! • When relations are large, scanning all Asymptotic complexity (O(x) notation) is rows to find matching tuples becomes very misleading here. Why? expensive. The asymptotic complexity works for data structures in main • An index on an attribute A of a relation is a memory. But when working with stored persistent data, the running time of the data structure, once in main memory, is data structure that makes it efficient to find negligible compared to the time it takes to read data from those tuples that have a fixed value for disk. What really matters to get fast lookups in a database is to minimize the number of disk blocks accessed (could use attribute A. asymptotic complexity over disk block accessing though). – Example: a hash table gives amortized O(1) Indexes help here too though. If a relation is stored over a number of disk blocks, knowing in which of these to look is lookups. helpful. 5 6 Disk and main memory Typical costs Disk • Some (over-simplified) typical costs of disk accessing for database operations on a relation stored over n blocks: Main memory – Query the full relation: n (disk operations) read() input() Program – Query with the help of index: k, where k is the x = number of blocks pointed to (1 for key). y = write() output() – Access index: 1 – Insert new value: 2 (one read, one write) – Update index: 2 (one read, one write) Cheap! Costly! 7 8 2
Quiz! Example: SELECT * FROM Lectures How costly is this operation? WHERE course = ’TDA357’ AND period = 3; SELECT * Lectures: n disk blocks FROM Lectures, Courses Assume Lectures is stored in n disk blocks. With no Courses: m disk blocks WHERE course = code; index to help the lookup, we must look at all rows, which means looking in all n disk blocks for a total No index: Index on code in Courses: Go through all n blocks in Lectures, Go through all n blocks in Lectures, cost of n . compare the value for course from compare the value for course from each row with the values for code in each row with the index. Since With an index, we find that there are 2 rows with the all rows of Courses, stored in all m course is a key, each value will exist correct values for the course and period attributes. at most once, so the cost is 2 * n + 1 blocks. The total cost is thus n * m These are stored in two different blocks, so the total accessed disk blocks (1 for fetching accessed disk blocks. the index once). cost is 3 (2 blocks + reading index). 9 10 CREATE INDEX Important properties • Most DBMS support the statement • Indexes are separate data stored by itself. CREATE INDEX index name � Can be created ON table ( attributes ); � on newly created relations � on existing relations – Example: - will take a long time on large relations. CREATE INDEX courseIndex � Can be dropped without deleting any table data. ON Courses (code); • SQL statements do not have to be – Statement not in the SQL standard, but most changed DBMS support it anyway. – Primary keys are given indexes implicitly (by – a DBMS automatically uses any indexes. the SQL standard). 11 12 3
Quiz! Rule of thumb Why don’t we have indexes on all (combinations • Mostly queries on tables – use indexes for of) attributes for faster lookups? key attributes. – Indexes require disk space. • Mostly updates – be careful with indexes! – Modifications of tables are more expensive. • Need to update both table and index. – Not always useful • The table is very small. • We don’t perform lookups over it (Note: lookups � queries). – Using an index costs extra disk block accesses. 13 14 Quiz! Example: Suppose that the Lectures relation is stored in 20 disk blocks, and that we typically perform three operations on this table: Assume we have an index on Lectures for (course, – insert new lectures (Ins) – list all lectures of a particular course (Q1) period, weekday) which is the key. How costly – list all lectures in a given room (Q2) are these queries? Lectures: n disk blocks SELECT * SELECT * Let’s assume that in an average week there are: FROM Lectures FROM Lectures – 2 lectures for each course, and WHERE course = ’TDA357’ WHERE weekday = ’Monday’ – 10 lectures in each room. AND period = 3; AND room = ’VR’; Let’s also assume that A multi-attribute index is typically organized hierarchically. First the rows are indexed according to the first attribute, then according to – each course has lectures stored in 2 blocks, and the second within each group, and so on. – each room has lectures stored in 7 (some lectures are Thus the left query costs at most k + 1 where k is the number of stored in the same block). rows matching the values. The right query can’t use the index, and thus costs n , where n is the size of the relation in disk blocks. 15 16 4
Insert new lectures (Ins) Lectures example: blocks Costs List all lectures of a particular course (Q1) List all lectures in a given room (Q2) Case A Case B Case C Case D Index on Index on No index (course, period, weekday) room Both indexes Ins 2 4 4 6 Q1 20 3 20 3 Q2 20 20 8 8 The amortized cost depends on the proportion of operations of each kind. Ins Q1 Q2 Case A Case B Case C Case D 0.2 0.4 0.4 16.4 10 12 5.6 0.8 0.1 0.1 5.6 5.5 6 5.9 Index on Index on 0.1 0.6 0.3 18.2 8.2 14.8 4.8 (course, period, weekday) room 17 18 Real world Dense index on sequential file • The examples given here are very simplified! In KBB056 KBB056 KC Monday 08 reality, many more factors matter: KMB017 KMB017 MVH12 Tuesday 08 – Data layout on disk, storage schemes TDA357 KMB017 MVH12 Wednesday 15 – Size of disk blocks TMS145 TDA357 HA4 Monday 10 UMF012 – Size of main memory TDA357 HB1 Thursday 10 UMF018 TMS145 KC Friday 08 – Disk latency, bus speed, … UMF012 MVF23 Friday 13 • Indexes can be arbitrarily large! UMF012 MVF23 Monday 13 – Not uncommon for index to be larger than the data UMF018 MVF23 Tuesday 10 set. – Different index schemes also matter. 19 20 5
Sparse index on sequential file Multi-level indexes index data KBB056 KBB056 KC Monday 08 block 0 block 0 TDA357 KMB017 MVH12 Tuesday 08 UMF012 KMB017 MVH12 Wednesday 15 index data TDA357 HA4 Monday 10 block 1 block 1 TDA357 HB1 Thursday 10 TMS145 KC Friday 08 UMF012 MVF23 Friday 13 Outer index Inner index UMF012 MVF23 Monday 13 UMF018 MVF23 Tuesday 10 21 22 Secondary index on room name Quiz! • Indexes are incredibly useful (although they are HA4 KBB056 KC Monday 08 not part of the SQL standard). HB1 KMB017 MVH12 Tuesday 08 • Doing it wrong is costly. KC KMB017 MVH12 Wednesday 15 • Requires knowledge about the internals of a MVF23 TDA357 HA4 Monday 10 DBMS. MVH12 TDA357 HB1 Thursday 10 – How is data stored? How large is a block? TMS145 KC Friday 08 • A DBMS should be able to decide better than UMF012 MVF23 Friday 13 the user what indexes are needed, from usage analysis. UMF012 MVF23 Monday 13 UMF018 MVF23 Tuesday 10 So why don’t they?? 23 24 6
Summary – indexes Non-natural keys • Indexes make certain lookups and joins more • A natural key is a key consisting of efficient. attributes in the domain model. – Disk block access matters. • In some cases, no suitable natural key – Multi-attribute indexes exists. • CREATE INDEX • Usage analysis – No suitably unique natural candidate key. – What are the expected operations? – Natural candidate key ”too large”. – How much do they cost? – Natural candidate key ”not stable”. � (cost of operation)x(proportion of operations of that kind) – … 25 26 Quiz! Artificial key Listed is a personal number and a car registration • Extra attribute added to a table with the number, both Swedish, as well as a post ID and purpose of being the key. a comment ID from Wordpress. Can you tell – Does not exist in ”reality” which is which? – Can be verified for correctness – Can be distinguished from artificial keys on 1253 AKW965 other tables in database. • Examples: 861218-9324 4126 – Personal numbers, car registration numbers, course codes, etc. 27 28 7
Recommend
More recommend