Data Structure and Storage Storage CS386, Introduction to Database Systems Jay Urbain Credits: Database Management Systems, Raghu Ramkrishnan and Johannes Gehrke Data Management, Richard Watson
Data Structures � Disks (Durable storage) are a bottleneck � The goal is to minimize disk access � Disks are slow compared to main memory � Appropriate data structures can reduce disk accesses � Appropriate data structures can reduce disk accesses
Storage and Indexing Abstraction of data in a DBMS: � Collection of records or a file. � For MySQL MyISAM =>1 table = file. � Each file consists of one or more pages . � Each file consists of one or more pages . � Files and Access Layers organizes data to support fast access to desired records. � Understanding how records are organized is essential to using DBMS effectively.
File Organization File Organization – methods of arranging records in a file when a file is stored on disk. � Each file organization makes certain operations efficient, but others expensive.
File Organization Example: File of employee records(age, name, salary) Query: Retrieve records in-order of increasing age Options for organization??? Options for organization???
File Organization Example: File of employee records(age, name, salary) Query: Retrieve records in-order of increasing age Options: Options: � Sorting file by age is a good file organization, but sort order is expensive to maintain. � You are often interested in supporting more than one operation on a give collection of records. � Retrieve all employees making > $5000 would require scan of entire file!
Database access Record Page Page returned returned read File Disk DBMS DBMS manager manager manager manager Record Page Read request request page command
Data on External Storage � Data must persist across program executions. � Page – unit of information read/written to disk. Usually 4K to 8K. Configurable. Maps to block. � Cost of (disk) page I/O dominates cost of typical DB � Cost of (disk) page I/O dominates cost of typical DB operations. � DBMS are carefully optimized to minimize this cost. � Disks most important storage mechanism.
Data on External Storage � Each record in a file has a unique identifier - record ID . � Record ID ( rid ) uniquely identifies disk address of page containing record. � Buffer Manager (BM) – reads/writes data between memory and disk for persistent storage. and disk for persistent storage. � Files and access layer (FL) - asks BM to fetch pages with rid. BM fetches from disk if not in memory. � Disk Space Manager (DSM) – manages space (pages) on disk. � When FL needs additional space to hold new records in a file, it asks the DSM to allocate an additional page . � Also informs DSM when a page is no longer needed. � DSM tracks and attempts to reuse discarded pages .
Disks � Data stored on concentric tracks on a surface � A disk drive can have multiple surfaces � Disk access time � Spin up time � Rotational delay - Waiting for the physical storage location of the data to appear under the read/write head data to appear under the read/write head � Seek time - Moving the read/write head to the track on which the storage location can be found. Spindle RPM Average latency (ms) 4200 7.14 5400 5.55 7200 4.17 10000 3 15000 2
File Organizations and Indexing � File of records abstraction implemented by FL . � A relation (table) typically stored as a file of records. � Also supports scan. � FL stores records in a file in a collection of disk pages , FL stores records in a file in a collection of disk pages , keeps track of pages allocated to each file, and space within pages allocated to file (insert/delete).
Database Index � A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. � Indexes can be created using one or more columns of a database table. � The disk space required to store the index is typically less than that required by the table, yielding the possibility to store indices in memory for a table whose data is too large to store in memory.
Database Index � In a relational database , an index is a copy of one part of a table. � Some databases extend the power of indexing by allowing indices to be created on functions or conditional indices to be created on functions or conditional expressions. � Indices may be defined as unique or non-unique. � A unique index acts as a constraint on the table by preventing duplicate entries in the index and thus the backing table. � Choosing effective table organization and a good collection of indexes is critical for improving performance.
Indexing � An index is a small file that has data for one or more fields of a file � Indexes reduce disk accesses
File Organizations and Indexing � Heap File – Simplest file structure is unordered file. Records in a heap file are stored in random order across the pages of the file. FL must track pages allocated to files. � Index – data structure that organizes data records on disk to optimize certain kinds of retrieval operations. � Provides efficient retrieval for records satisfying search conditions on the search key fields of the index. � Can create additional indexes on a given collection of data records, each with a different search key.
File Organizations and Indexing � Example: employee records(age, name, salary) � Can retrieve records in a file organized as an index on employee age versus sorting the file by age. � Can create auxiliary index file based on salary, to speed queries involving salary. involving salary. � Data entry refers to the records stored in an index file. � A data entry with search key value k contains enough information to locate (one or more) data records with search key k . � Search an index to find the desired data entries, then use index data to obtain record data (if needed).
File Organizations and Indexing Main alternatives for what to store as a data entry in an index: A data entry k* is an actual data record (with search key 1. value k ). Indexed file organization. A data entry is a <k, rid> pair - rid is the record id of a data 2. record with search key value k . record with search key value k . A data entry is a <k, rid-list> pair, where rid-list is a list of 3. record ids of data records with search key value k . � Option 1 avoids storing data records multiple times. � Options 2 and 3 contain data entries that point to data records and are independent of file organization. � Option 3 has more efficient space utilization. � Option 2 allows entries of variable length.
Types of Indexes � Clustered – use search key of a clustered file, the rids in qualifying data entries point to a contiguous collection of records, need to only retrieve a few data pages. � Un-clustered – each qualifying data entry could contain a rid that points to a distinct data page, leading to many rid that points to a distinct data page, leading to many data page I/Os.
Non-Clustered Indexes � The data is present in random order, but the logical ordering is specified by the index. � Data rows may be randomly spread throughout the table. � The non-clustered index tree contains the index keys in sorted order. � The leaf level of the index containing the pointer to the page and the row number in the data page.
Non-Clustered Indexes (cont.) � The physical order of the rows is not the same as the index order. � Typically created on column used in JOIN, WHERE, and ORDER BY clauses. � Good for tables whose values (and structure) may be modified frequently. � There can be more than one non-clustered index on a database table. � Often the default index (SQL Server).
Clustered Indexes Clustered - File is organized so that the ordering of data records is the same (or close) to the ordering of data entries in some index. Otherwise unclustered . � Option 1 is clustered by definition. � Options 2 or 3 can be a clustered index only if the data records are sorted on the search key field. � In practice (1) is clustered, (2) and (3) are unclustered.
Clustered Indexes � Clustering alters the data block into a certain distinct order to match the index, resulting in the row data being stored in order. � Therefore, only one clustered index can be created on a given database table. given database table. � Clustered indices can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index (or when a range of items is selected).
Clustered Indexes � Since the physical records are in this sort order on disk, the next row item in the sequence is immediately before or after the last one, and so fewer data block reads are required. � They are known as "index organized tables" � They are known as "index organized tables" under Oracle database. � In Oracle database, multiple tables can be joined into a cluster. � Default PK index for MyISAM. � A cluster can be keyed with a B-Tree index or a hash table.
Recommend
More recommend