File Organisation Part - II Dr. V. V. Subrahmanyam Associate Professor, SOCIS, IGNOU
Heap File Organisation • The simplest file structure is an unordered file or heap file. • The data in the pages of a heap file is not ordered. • Every record in the file has a unique rid (record id) and every page in a file is of the same size.
Contd… • Records are inserted at the end of the file in the order in which they arrive. • Once a data block is full, the next record is stored in a new block. This new block need not be the very next block. • The DBMS can select any free block in storage to hold the new records.
Contd… • It is similar to the pile file in the sequential method, but here data blocks are not selected sequentially. • They can be any data blocks in storage. • It is the responsibility of the DBMS to store the records and manage them.
Supported Operations on Heap Files • Create • Destroy • Insert a record with a given rid • Delete a record with a given rid • Get a record with a given rid • Scan all records in the file
Two alternative ways • Linked list of pages • Directory of pages Note: In each of these alternatives, pages must hold two pointers (which are page ids) for file-level bookkeeping, in addition to the data.
Linked List of Pages • One possibility is to maintain a heap file as a doubly linked list of pages. • The DBMS can remember where the first page is located by maintaining a table containing pairs of Heap_file_name and Page_1_address. • The first page of the file is known as the header page.
Contd… • An important task is to maintain information about empty slots created by deleting a record from the heap file. • This task has 2 distinct parts: – How to keep track of free space within a page? – How to keep track of pages that have some free space? The second part can be addressed by maintaining 2 doubly linked lists: (i) one for pages with free space and (ii) one for full pages.
Contd… • If a new page is required, it is obtained by making a request to the disk space manager and then added to the list of pages in the file. • If a page is deleted from the heap file, it is removed from the list and the disk space manager is told to deallocate it.
Heap File Organisation with Doubly Linked Lists [Figure: the Header Page heads two linked lists, a linked list of pages with free space (Free Page, Free Page, Free Page) and a linked list of full pages (Data Page 1, Data Page 2, …, Data Page N).]
Disadvantage • If records are of variable length, virtually all pages in a file will be on the free list, since almost every page is likely to have at least a few free bytes. To insert a typical record, we must retrieve and examine several pages on the free list before we find one with enough free space. • This is overcome in the directory-based heap file organisation.
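The insertion cost described above can be illustrated with a minimal Python sketch. It is not taken from the slides: the names `Page`, `LinkedListHeapFile`, `PAGE_SIZE` and `pages_examined` are hypothetical, plain Python lists stand in for the two doubly linked lists, and deletion is left out for brevity.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 100  # hypothetical page capacity in bytes (assumption)


@dataclass
class Page:
    page_id: int
    records: dict = field(default_factory=dict)  # slot number -> record bytes
    used: int = 0                                # bytes occupied on this page

    def free_space(self) -> int:
        return PAGE_SIZE - self.used


class LinkedListHeapFile:
    """Heap file keeping every page on a free-space list or a full list."""

    def __init__(self):
        self.free_pages = []     # stands in for the list of pages with free space
        self.full_pages = []     # stands in for the list of full pages
        self.next_page_id = 0
        self.pages_examined = 0  # counts page reads during free-list walks

    def _new_page(self) -> Page:
        # Models a request to the disk space manager for a fresh page.
        page = Page(self.next_page_id)
        self.next_page_id += 1
        self.free_pages.append(page)
        return page

    def insert(self, record: bytes):
        """Store the record and return its rid as (page_id, slot)."""
        if len(record) > PAGE_SIZE:
            raise ValueError("record larger than a page")
        target = None
        for page in self.free_pages:  # walk the free list page by page
            self.pages_examined += 1
            if page.free_space() >= len(record):
                target = page
                break
        if target is None:
            target = self._new_page()
        slot = len(target.records)
        target.records[slot] = record
        target.used += len(record)
        if target.free_space() == 0:  # page became full: move it to the full list
            self.free_pages.remove(target)
            self.full_pages.append(target)
        return (target.page_id, slot)
```

With variable-length records, most pages keep a few unusable free bytes and stay on the free list, so `pages_examined` grows with every insert that has to skip them.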
Directory of Pages • An alternative technique is to maintain a directory of pages. • The DBMS must remember where the first directory page of each heap file is located. • The directory is itself a collection of pages. • Each directory entry identifies a page in the heap file.
Contd… • As the heap file grows or shrinks, the number of entries in the directory grows or shrinks correspondingly. • Free space can be managed by maintaining a bit per entry, indicating whether the corresponding page has any free space, or a count per entry, indicating the amount of free space on the page.
Heap File Organisation with a Directory [Figure: a Directory, starting at the Header Page, whose entries point to Data Page 1, Data Page 2, …, Data Page N.]
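A minimal Python sketch of the count-per-entry variant follows. The class name `DirectoryHeapFile`, the constant `PAGE_SIZE` and the counter `data_pages_read` are hypothetical, and the directory is modelled as a single in-memory list rather than a collection of directory pages.

```python
PAGE_SIZE = 100  # hypothetical page capacity in bytes (assumption)


class DirectoryHeapFile:
    """Heap file whose directory keeps a free-space count per data page."""

    def __init__(self):
        self.directory = []       # entry i = free bytes left on data page i
        self.pages = []           # data page i = list of records
        self.data_pages_read = 0  # data pages actually fetched by insert()

    def insert(self, record: bytes):
        """Store the record and return its rid as (page_id, slot)."""
        need = len(record)
        if need > PAGE_SIZE:
            raise ValueError("record larger than a page")
        # Consult only the directory entries; no data page is read here.
        for page_id, free in enumerate(self.directory):
            if free >= need:
                break
        else:
            # No existing page fits: add a new page and a new directory entry.
            page_id = len(self.pages)
            self.pages.append([])
            self.directory.append(PAGE_SIZE)
        self.data_pages_read += 1  # only the chosen page is fetched
        self.pages[page_id].append(record)
        self.directory[page_id] -= need
        return (page_id, len(self.pages[page_id]) - 1)
```

Because the free-space counts live in the directory, each insert fetches exactly one data page, whereas the linked-list scheme may have to read several pages from the free list first.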
Multikey File Organisation • Allows records to be accessed by more than one key field. • The ability to search on many keys is enabled by building multiple index files “on top of” the data file. • The physical DB consists of one or more data files and many index files, and each data file contains either one or several record types.
Two Approaches • Multilist file organisation • Inverted file organisation
Contd… • Both approaches have an index for each secondary key. • An index entry for each distinct value of the secondary key. • The index may be tabular or tree-structured. • The entries in an index may or may not be sorted. • The pointers to data records may be direct or indirect.
Contd.. • The indexes differ in that: – An entry in an inverted index has a pointer to each data record with that value. – An entry in a multilist index has a pointer to the first data record with that value.
Contd… • Inverted index may have variable-length entries whereas a multilist index has fixed length entries.
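The difference between the two kinds of index can be sketched in Python as follows. The sample records and the function names are hypothetical. Record positions stand in for record pointers, and the `next_ptr` list models the link field that a multilist organisation would normally keep inside each data record.

```python
# Hypothetical sample data file; "city" plays the role of the secondary key.
records = [
    {"id": 0, "name": "Asha", "city": "Delhi"},
    {"id": 1, "name": "Ravi", "city": "Pune"},
    {"id": 2, "name": "Meena", "city": "Delhi"},
    {"id": 3, "name": "John", "city": "Delhi"},
]


def build_inverted_index(records, key):
    """Inverted index: each entry points to EVERY record with that value."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[key], []).append(pos)
    return index


def build_multilist_index(records, key):
    """Multilist index: each entry points to the FIRST record with that value.

    Returns (index, next_ptr), where next_ptr[pos] is the position of the
    next record with the same value, or None at the end of the chain.
    """
    index = {}
    next_ptr = [None] * len(records)
    last_seen = {}  # value -> position of the latest record in its chain
    for pos, rec in enumerate(records):
        value = rec[key]
        if value in last_seen:
            next_ptr[last_seen[value]] = pos  # extend the existing chain
        else:
            index[value] = pos                # first record with this value
        last_seen[value] = pos
    return index, next_ptr


def multilist_lookup(index, next_ptr, value):
    """Follow the chain from the index entry through the linked records."""
    result = []
    pos = index.get(value)
    while pos is not None:
        result.append(pos)
        pos = next_ptr[pos]
    return result
```

Here `build_inverted_index(records, "city")` gives entries of different lengths (three pointers for "Delhi", one for "Pune"), while every multilist entry holds a single pointer. This is why the first has variable-length entries and the second fixed-length ones.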
Hash / Direct File Organisation • A hash function is used to calculate the address of the block in which a record is stored. • The hash function can be any simple or complex mathematical function. • The hash function is applied to some columns/attributes, either key or non-key columns, to get the block address. • Hence each record is stored at a seemingly random location, irrespective of the order in which the records arrive.
Contd… • This method is also known as Direct or Random file organisation. • If the hash function is applied to a key column, that column is called the hash key; if it is applied to a non-key column, that column is called the hash column.
Contd… • When a record has to be retrieved, the address is generated from the hash key column and the whole record is retrieved directly from that address. There is no need to traverse the whole file. • Similarly, when a new record has to be inserted, the address is generated from the hash key and the record is inserted directly. The same is the case with update and delete.
Advantages • Records need not be sorted after any transaction, so the effort of sorting is avoided. • Since the block address is given by the hash function, accessing any record is very fast. Updating or deleting a record is equally quick. • This method can handle multiple transactions concurrently: the storage location of each record is independent of the others, so multiple records can be accessed at the same time. • It is suitable for online transaction systems such as online banking and ticket booking systems.
Disadvantages • Since the records are stored at scattered addresses, storage is not used efficiently. • This method is not suitable for searching a range of data. Because each record is stored at an unrelated address, a range of key values does not map to a range of addresses, so a range search is inefficient. • Searching for records with an exact name or value is efficient. A partial-match search, such as student names starting with ‘B’, is not efficient, because it does not supply the exact name needed to compute the address.
• This method is efficient only when the search is done on the hash column. Otherwise, the hash function cannot find the correct address of the data. • If multiple hash columns are used together to generate the address, say the name and phone number of a person, then searching for a record using the phone number or the name alone will not give correct results. • If the hash columns are frequently updated, the data block address changes accordingly, because each update generates a new address. • The hardware and software required for memory management are costlier in this case.
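Retrieval, insertion and deletion by hash key can be sketched in Python as below. This is only an illustration under stated assumptions: the names `HashFile`, `NUM_BLOCKS`, `block_address` and the `roll_no` field are hypothetical, the hash function is a simple modulo, and records that hash to the same block simply share it (overflow handling is not modelled).

```python
NUM_BLOCKS = 8  # hypothetical number of blocks in the file (assumption)


def block_address(hash_key: int) -> int:
    """A simple hash function: map the key to a block number."""
    return hash_key % NUM_BLOCKS


class HashFile:
    def __init__(self):
        self.blocks = [[] for _ in range(NUM_BLOCKS)]

    def insert(self, record: dict) -> int:
        """Place the record in the block computed from its hash key."""
        addr = block_address(record["roll_no"])
        self.blocks[addr].append(record)
        return addr

    def get(self, roll_no: int):
        """Go straight to the computed block; no other block is read."""
        for record in self.blocks[block_address(roll_no)]:
            if record["roll_no"] == roll_no:
                return record
        return None

    def delete(self, roll_no: int) -> bool:
        """Remove the record from its computed block, if it is present."""
        block = self.blocks[block_address(roll_no)]
        for i, record in enumerate(block):
            if record["roll_no"] == roll_no:
                del block[i]
                return True
        return False
```

For example, a record with `roll_no` 17 goes to block 17 % 8 = 1, and `get(17)` inspects only block 1 rather than the whole file.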
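The search limitations listed above can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the address is computed from name and phone number together using `zlib.crc32`, and the names `address`, `insert`, `find_exact` and `find_by_name_prefix` as well as the sample values are hypothetical.

```python
import zlib

NUM_BLOCKS = 8  # hypothetical number of blocks (assumption)
blocks = [[] for _ in range(NUM_BLOCKS)]


def address(name: str, phone: str) -> int:
    """Hash both columns together to get the block address."""
    return zlib.crc32(f"{name}|{phone}".encode()) % NUM_BLOCKS


def insert(name: str, phone: str) -> None:
    blocks[address(name, phone)].append((name, phone))


def find_exact(name: str, phone: str):
    """Exact search on both hash columns: exactly one block is read."""
    block = blocks[address(name, phone)]
    return [rec for rec in block if rec == (name, phone)], 1


def find_by_name_prefix(prefix: str):
    """Partial search: no address can be computed, so every block is read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        blocks_read += 1
        matches.extend(rec for rec in block if rec[0].startswith(prefix))
    return matches, blocks_read
```

An exact search reads one block, while a search for names starting with ‘B’ degenerates into a scan of all `NUM_BLOCKS` blocks. A range search or a search on the phone number alone behaves the same way.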