Hashing
Reference: Chapters 11, 12
FALL 2004 CENG 351

Motivation
• The primary goal is to locate the desired record in a single disk access.
– Sequential search: O(N)
– B+ trees: O(log_k N)
– Hashing: O(1)
• In hashing, the key of a record is transformed into an address and the record is stored at that address.
• Hash-based indexes are best for equality selections. They cannot support range searches.
• Static and dynamic hashing techniques exist.

Hash-based Index
• Data entries are kept in buckets (an abstract term).
• Each bucket is a collection of one primary page and zero or more overflow pages.
• Given a search key value, k, we can find the bucket where the data entry k* is stored as follows:
– Use a hash function, denoted by h.
– The value of h(k) is the address of the desired bucket.
• h(k) should distribute the search key values uniformly over the collection of buckets.

Design Factors
• Bucket size: the number of records that can be held at the same address.
• Loading factor: the ratio of the number of records put in the file to the total capacity of the buckets.
• Hash function: should evenly distribute the keys among the addresses.
• Overflow resolution technique.

Hash Functions
• Key mod N:
– N is the size of the table; better if it is prime.
• Folding:
– e.g. 123|456|789: add the parts and take the mod.
• Truncation:
– e.g. map 123456789 to a table of 1000 addresses by picking 3 digits of the key.
• Squaring:
– Square the key and then truncate.
• Radix conversion:
– e.g. treat 1 2 3 4 as a base-11 number; truncate if necessary.

Static Hashing
• Primary area: the number of primary pages is fixed; they are allocated sequentially and never de-allocated (say M buckets).
– A simple hash function: h(k) = f(k) mod M
• Overflow area: disjoint from the primary area. It keeps buckets which hold records whose key maps to a full bucket.
– Adding the address of an overflow bucket to a primary area bucket is called chaining.
• A collision does not cause a problem as long as there is still room in the mapped bucket. Overflow occurs during insertion when a record is hashed to a bucket that is already full.

Example
• Assume f(k) = k. Let M = 5. So, h(k) = k mod 5.
• Bucket factor = 3 records.

Primary area (bucket: records):
0: 35 60
1: 6 46
2: 12 57 62
3: 33
4: 44
Overflow (chained from bucket 2): 17

Load Factor (Packing density)
• To limit the amount of overflow we allocate more space to the primary area than we need (i.e. the primary area will be, say, 70% full).
• Load factor = (# of records in the file) / (# of spaces in primary area)
  => Lf = n / (M * Bkfr)
  where n is the number of records, M the number of primary buckets, and Bkfr the bucket factor.

Effects of Lf and Bkfr
• Performance can be enhanced by the choice of bucket size and load factor.
• In general, a smaller load factor means
– less overflow and a faster fetch time;
– but more wasted space.
• A larger Bkfr means
– less overflow in general,
– but slower fetch.

Insertion and Deletion
• Insertion: new records are inserted at the end of the chain.
• Deletion: two ways are possible:
1. Mark the record as deleted.
2. Consolidate sparse buckets when deleting records.
– In the 2nd approach:
• When a record is deleted, fill its place with the last record in the chain of the current bucket.
• Deallocate the last bucket when it becomes empty.

Problem of Static Hashing
• The main problem with static hashing is that the number of buckets is fixed:
– Long overflow chains can develop and degrade performance.
– On the other hand, if a file shrinks greatly, a lot of bucket space will be wasted.
• Other hashing techniques allow the hash index to grow and shrink dynamically. These include:
– linear hashing
– extendible hashing

Linear Hashing
• It maintains a constant load factor.
• Thus it avoids reorganization.
• It does so by incrementally adding new buckets to the primary area.
• In linear hashing, the last bits of the hash number are used for placing the records.

Example
• Buckets are selected by the last 3 bits, e.g. 34: 100010, 28: 011100, 13: 001101, 21: 010101.
• Lf = 15/24 = 63%

Bucket: records
000: 8 16 32
001: 17 25
010: 34 50
011: 11 27
100: 28 12
101: 5
110: 14
111: 55 15

• Insert: 13, 21, 37

Insertion of records
• To expand the table: split an existing bucket denoted by k digits into two buckets using the last k+1 digits.
• e.g. bucket 000 is split into 0000 and 1000.

Expanding the table
After inserting 13, 21, 37 and splitting bucket 000 (bucket: records):
0000: 16 32
001: 17 25   <- boundary value
010: 34 50
011: 11 27
100: 28 12
101: 5 13 21, overflow: 37
110: 14
111: 55 15
1000: 8

After two more splits and the insertion of 26 (k = 3):
0000: 16 32
0001: 17
0010: 34 50
011: 11 27   <- boundary value
100: 28 12
101: 5 13 21, overflow: 37
110: 14
111: 55 15
1000: 8
1001: 25
1010: 26
• Hash # 1000: uses last 4 digits
• Hash # 1101: uses last 3 digits

Fetching a record
• Calculate the hash function.
• Look at the last k digits.
– If the value is less than the boundary value, the record is in the bucket labeled with the last k+1 digits.
– Otherwise it is in the bucket labeled with the last k digits.
• Follow overflow chains as with static hashing.

Insertion
• Search for the correct bucket into which to place the new record.
• If the bucket is full, allocate a new overflow bucket.
• If there are now Lf * Bkfr more records than needed for the given Lf,
– Add one more bucket to the primary area.
– Distribute the records from the bucket chain at the boundary value between the original bucket and the new primary area bucket.
– Add 1 to the boundary value.

Deletion
• Read in a chain of records.
• Replace the deleted record with the last record in the chain.
– If the last overflow bucket becomes empty, deallocate it.
• When the number of records is Lf * Bkfr less than the number needed for Lf, contract the primary area by one bucket.
Compressing the table is the exact opposite of expanding it:
• Keep track of the total number of records in the file and of the buckets in the primary area.
• When we have Lf * Bkfr fewer records than needed, consolidate the last bucket with the bucket that shares the same last k digits.

Extendible Hashing
[Figure: a bucket address table with hash prefix i; its entries point to data buckets bucket1, bucket2, bucket3, each labeled with the length of its common hash prefix i1, i2, i3.]
• The hash function returns b bits.
• Only the first i bits (the prefix) are used to hash the item.
• There are 2^i entries in the bucket address table.
• Let i_j be the length of the common hash prefix for data bucket j; then 2^(i - i_j) entries in the bucket address table point to j.

Splitting a bucket: Case 1
• Splitting (Case 1: i_j = i)
– Only one entry in the bucket address table points to data bucket j.
– i++; split data bucket j into j and z; i_j = i_z = i; rehash all items previously in j.
[Figure: the bucket address table doubles from entries 00, 01, 10, 11 to entries 000 through 111.]

Splitting: Case 2
• Splitting (Case 2: i_j < i)
– More than one entry in the bucket address table points to data bucket j.
– Split data bucket j into j and z; i_j = i_z = i_j + 1; adjust the pointers that previously pointed to j so that they point to j or z; rehash all items previously in j.
[Figure: a bucket address table with entries 000 through 111 before and after the split; the table size does not change.]

Example
• Suppose the hash function is h(x) = x mod 8 and each bucket can hold at most two records. Show the extendible hash structure after inserting 1, 4, 5, 7, 8, 2, 20.

Key: h(x) in binary
1: 001
4: 100
5: 101
7: 111
8: 000
2: 010
20: 100

Structure after each step (table entry -> bucket contents, with i_j in parentheses):
After 1, 4 (i = 0): single bucket [1 4] (0)
After 5 (i = 1): 0 -> [1] (1); 1 -> [4 5] (1)
After 7, 8 (i = 2): 00, 01 -> [1 8] (1); 10 -> [4 5] (2); 11 -> [7] (2)

Example (continued)
After 2 (i = 2): 00 -> [1 8] (2); 01 -> [2] (2); 10 -> [4 5] (2); 11 -> [7] (2)
After 20 (i = 3): 000, 001 -> [1 8] (2); 010, 011 -> [2] (2); 100 -> [4 20] (3); 101 -> [5] (3); 110, 111 -> [7] (2)

Comments on Extendible Hashing
• If the directory fits in memory, an equality search is answered with one disk access.
– A typical example: a 100 MB file with 100 bytes/entry and a page size of 4 KB contains 1,000,000 records (as data entries) but only about 25,000 directory elements, so chances are high that the directory will fit in memory.
• If the distribution of hash values is skewed (e.g., a large number of search key values all hash to the same bucket), the directory can grow large.
– But this kind of skew can be avoided with a well-tuned hash function.
• Delete: if removal of a data entry makes a bucket empty, the bucket can be merged with its "buddy" bucket. If each directory element points to the same bucket as its split image, the directory can be halved.

Summary
• Hash-based indexes: best for equality searches, cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it.
– A directory keeps track of the buckets and doubles periodically.
– The directory can get large with skewed data; additional I/O is needed if it does not fit in main memory.
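
The extendible hashing example from the slides (h(x) = x mod 8, at most two records per bucket, inserting 1, 4, 5, 7, 8, 2, 20) can be traced with a short Python sketch. This is only an illustration under stated assumptions, not the course's reference implementation: the names `Bucket`, `ExtendibleHash`, `lookup`, `insert` and `dump` are hypothetical, the table is addressed by the first i bits of the b-bit hash value as on the slides, and raising an error once all b bits are used is an assumption made here, since the slides do not cover that case.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # i_j: length of the common hash prefix of this bucket
        self.keys = []


class ExtendibleHash:
    def __init__(self, bits=3, capacity=2):
        self.bits = bits           # b: number of bits the hash function returns
        self.capacity = capacity   # maximum number of records per bucket
        self.depth = 0             # i: number of prefix bits currently in use
        self.table = [Bucket(0)]   # bucket address table with 2**i entries

    def _hash(self, key):
        return key % (1 << self.bits)            # h(x) = x mod 2**b

    def _slot(self, key):
        # Table index = first i bits of the b-bit hash value.
        return self._hash(key) >> (self.bits - self.depth)

    def lookup(self, key):
        return key in self.table[self._slot(key)].keys

    def insert(self, key):
        while True:
            bucket = self.table[self._slot(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            if bucket.depth == self.bits:
                # Assumption: no overflow pages in this sketch.
                raise OverflowError("all b bits are already in use")
            if bucket.depth == self.depth:
                # Case 1 (i_j = i): double the table, each entry is duplicated.
                self.table = [b for b in self.table for _ in (0, 1)]
                self.depth += 1
            # Now i_j < i (case 2): split the bucket on one more prefix bit.
            self._split(bucket)

    def _split(self, old):
        old.depth += 1
        new = Bucket(old.depth)
        shift = self.depth - old.depth
        # Entries that pointed to the old bucket and have the new prefix bit
        # set are redirected to the new bucket.
        for slot, b in enumerate(self.table):
            if b is old and (slot >> shift) & 1:
                self.table[slot] = new
        # Rehash all items previously stored in the old bucket.
        pending, old.keys = old.keys, []
        for k in pending:
            self.table[self._slot(k)].keys.append(k)

    def dump(self):
        # (table entry in binary, i_j, bucket contents) for every table entry.
        return [(format(s, "b").zfill(self.depth), b.depth, list(b.keys))
                for s, b in enumerate(self.table)]


eh = ExtendibleHash(bits=3, capacity=2)
for key in (1, 4, 5, 7, 8, 2, 20):
    eh.insert(key)
for entry in eh.dump():
    print(entry)
```

Run on the insertion sequence above, the sketch ends with i = 3 and the buckets [1 8], [2], [4 20], [5] and [7], with entries 000/001, 010/011 and 110/111 each sharing one bucket. Duplicating every table entry on doubling keeps the old pointers valid, so only the bucket that overflowed has to be rehashed.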