Hashing

Reference: Chapters 11, 12
FALL 2004
CENG 351
1
Motivation
• The primary goal is to locate the desired record in
a single disk access.
– Sequential search: O(N)
– B+ trees: O(log_k N)
– Hashing: O(1)
• In hashing, the key of a record is transformed into
an address and the record is stored at that address.
• Hash-based indexes are best for equality selections.
They cannot support range searches.
• Static and dynamic hashing techniques exist.
Hash-based Index
• Data entries are kept in buckets (an abstract term)
• Each bucket is a collection of one primary page and
zero or more overflow pages.
• Given a search key value, k, we can find the bucket
where the data entry k* is stored as follows:
– Use a hash function, denoted by h
– The value of h(k) is the address for the desired bucket.
– h(k) should distribute the search key values uniformly over
the collection of buckets.
Design Factors
• Bucket size: the number of records that can
be held at the same address.
• Loading factor: the ratio of the number of
records put in the file to the total capacity of
the buckets.
• Hash function: should evenly distribute the
keys among the addresses.
• Overflow resolution technique.
Hash Functions
• Key mod N:
– N is the size of the table, better if it is prime.
• Folding:
– e.g. 123|456|789: add them and take mod.
• Truncation:
– e.g. 123456789 map to a table of 1000 addresses by
picking 3 digits of the key.
• Squaring:
– Square the key and then truncate
• Radix conversion:
– e.g. treat the digits 1 2 3 4 as a base-11 number, truncate if necessary.
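The five schemes above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the chunk size of 3 for folding, the choice of keeping the last digits when truncating, and the use of mod as the final truncation step for radix conversion are all assumptions made for the example.

```python
def mod_hash(key, n):
    """Key mod N; n is the table size (better if it is prime)."""
    return key % n

def folding_hash(key, n, chunk=3):
    """Cut the decimal digits into chunks, add the chunks, take mod n."""
    digits = str(key)
    parts = [int(digits[i:i + chunk]) for i in range(0, len(digits), chunk)]
    return sum(parts) % n

def truncation_hash(key, digits=3):
    """Pick `digits` digits of the key (assumption: the last ones)."""
    return key % 10 ** digits

def squaring_hash(key, digits=3):
    """Square the key, then truncate (assumption: keep the last digits)."""
    return (key * key) % 10 ** digits

def radix_hash(key, n, base=11):
    """Read the decimal digits of the key as a number in `base`, then mod n."""
    value = 0
    for d in str(key):
        value = value * base + int(d)
    return value % n
```

For the slide's folding example, 123 + 456 + 789 = 1368, so a table of 1000 addresses gives address 368.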
Static Hashing
• Primary Area: # primary pages fixed, allocated
sequentially, never de-allocated; (say M buckets).
– A simple hash function: h(k) = f(k) mod M

• Overflow area: disjoint from the primary area. It
keeps buckets which hold records whose key maps
to a full bucket.
– Adding the address of an overflow bucket to a primary
area bucket is called chaining.
• Collision does not cause a problem as long as there
is still room in the mapped bucket. Overflow occurs
during insertion when a record is hashed to the
bucket that is already full.
Example
• Assume f(k) = k. Let M = 5. So, h(k) = k mod 5
• Bucket factor = 3 records.
Bucket   Primary area   Overflow
0        35, 60
1        6, 46
2        12, 57, 62     17
3        33
4        44
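The example can be replayed with a small Python sketch. The class name `StaticHashFile` and the representation of each overflow chain as a single list are assumptions made for illustration; a real file would chain fixed-size overflow buckets on disk.

```python
class StaticHashFile:
    """Static hashing with h(k) = k mod M and one overflow chain per bucket."""

    def __init__(self, m=5, bkfr=3):
        self.m = m                                  # number of primary buckets
        self.bkfr = bkfr                            # records per primary bucket
        self.primary = [[] for _ in range(m)]
        self.overflow = [[] for _ in range(m)]      # overflow chain of each bucket

    def h(self, key):
        return key % self.m

    def insert(self, key):
        b = self.h(key)
        if len(self.primary[b]) < self.bkfr:
            self.primary[b].append(key)             # collision, but still room
        else:
            self.overflow[b].append(key)            # overflow: bucket is full

    def search(self, key):
        b = self.h(key)
        return key in self.primary[b] or key in self.overflow[b]


f = StaticHashFile(m=5, bkfr=3)
for key in [35, 60, 6, 46, 12, 57, 62, 33, 44, 17]:
    f.insert(key)
```

Keys 12, 57 and 62 fill bucket 2, so 17 (17 mod 5 = 2) ends up in the overflow area.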
Load Factor (Packing density)
• To limit the amount of overflow we allocate more
space to the primary area than we need (i.e. the
primary area will be, say, 70% full)
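• Load Factor = (# of records in the file) / (# of spaces in primary area), i.e.
$$Lf = \frac{n}{M \cdot Bkfr}$$
where n is the number of records, M the number of primary buckets and Bkfr the bucket factor.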
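As a quick check of the formula, the following sketch computes the load factor and, going the other way, the number of primary buckets needed to stay at or below a target load factor. The function names are hypothetical and chosen only for this example.

```python
import math

def load_factor(n, m, bkfr):
    """Lf = n / (M * Bkfr)."""
    return n / (m * bkfr)

def primary_buckets_needed(n, bkfr, target_lf):
    """Smallest M such that n / (M * Bkfr) <= target_lf."""
    return math.ceil(n / (bkfr * target_lf))
```

For instance, 10 records in 5 buckets of 3 records give Lf = 10/15, about 67%; keeping 10 records at 70% or below with Bkfr = 3 also needs 5 primary buckets.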
Effects of Lf and Bkfr
• Performance can be enhanced by the choice
of bucket size and load factor.
• In general, a smaller load factor means
– less overflow and a faster fetch time;
– but more wasted space.
• A larger Bkfr means
– less overflow in general,
– but slower fetch.
Insertion and Deletion
• Insertion: New records are inserted at the end
of the chain.
• Deletion: Two ways are possible:
1. Mark the record as deleted.
2. Consolidate sparse buckets when deleting
records.
– In the 2nd approach:
• When a record is deleted, fill its place with the last
record in the chain of the current bucket.
• Deallocate the last bucket when it becomes empty.
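The second deletion approach can be sketched as below. The representation is an assumption made for illustration: a chain is a Python list of buckets, where `chain[0]` is the primary page and the remaining entries are overflow buckets, and empty overflow buckets are always deallocated immediately.

```python
def delete_from_chain(chain, key):
    """Delete `key` by moving the last record of the chain into its place.

    Returns True if the key was found and removed, False otherwise.
    """
    for bucket in chain:
        if key in bucket:
            pos = bucket.index(key)
            last = chain[-1]
            moved = last.pop()                  # last record in the chain
            # If the deleted record was itself the last record, popping it
            # already removed it; otherwise fill the hole with `moved`.
            if not (bucket is last and pos == len(last)):
                bucket[pos] = moved
            # Deallocate the last overflow bucket once it is empty
            # (the primary page, chain[0], is never deallocated).
            if not last and len(chain) > 1:
                chain.pop()
            return True
    return False
```

With the chain `[[12, 57, 62], [17]]`, deleting 57 moves 17 into its place and frees the overflow bucket, leaving `[[12, 17, 62]]`.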
Problem of Static Hashing
• The main problem with static hashing: the number of
buckets is fixed:
– Long overflow chains can develop and degrade performance.
– On the other hand, if a file shrinks greatly, a lot of bucket
space will be wasted.
• Other hashing techniques allow the hash index to grow and
shrink dynamically. These include:
– linear hashing
– extendible hashing
Linear Hashing
• It maintains a constant load factor.
• Thus avoids reorganization.
• It does so by incrementally adding new
buckets to the primary area.
• In linear hashing the last bits in the hash
number are used for placing the records.
Example
• Buckets are addressed by the last 3 bits of the hash number.
• Lf = 15/24 = 63%
Bucket   Records
000      8, 16, 32
001      17, 25
010      34, 50
011      11, 27
100      28, 12
101      5
110      14
111      55, 15
• e.g. 34: 100010, 28: 011100, 13: 001101, 21: 010101
• Insert: 13, 21, 37
Insertion of records
• To expand the table: split an existing bucket
denoted by k digits into two buckets using
the last k+1 digits.
• e.g. bucket 000 is split into buckets 0000 and 1000.
Expanding the table
• After inserting 13, 21 and 37 and splitting bucket 000 (boundary value: 001):
Bucket   Records       Overflow
0000     16, 32
001      17, 25
010      34, 50
011      11, 27
100      28, 12
101      5, 13, 21     37
110      14
111      55, 15
1000     8
• A later state, with buckets 000, 001 and 010 already split (k = 3, boundary value: 011):
Bucket   Records       Overflow
0000     16, 32
0001     17
0010     34, 50
011      11, 27
100      28, 12
101      5, 13, 21     37
110      14
111      55, 15
1000     8
1001     25
1010     26
• Hash # 1000: uses last 4 digits
• Hash # 1101: uses last 3 digits
Fetching a record
• Calculate the hash function.
• Look at the last k digits.
– If it’s less than the boundary value, the location
is in the bucket labeled with the last k+1 digits.
– Otherwise it is in the bucket labeled with the
last k digits.
• Follow overflow chains as with static
hashing.
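The address calculation above fits in a few lines of Python. This is a sketch under the assumption that hash numbers are non-negative integers and that "digits" means binary digits; the function name `bucket_address` is hypothetical.

```python
def bucket_address(hash_value, k, boundary):
    """Return the primary bucket number for a hash number in linear hashing."""
    low_k = hash_value % (1 << k)               # last k binary digits
    if low_k < boundary:
        # Bucket already split: use the last k+1 digits.
        return hash_value % (1 << (k + 1))
    return low_k                                # not split yet: last k digits
```

With k = 3 and boundary value 011, hash number 1000 maps to bucket 1000 (last 4 digits) while hash number 1101 maps to bucket 101 (last 3 digits).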
Insertion
• Search for the correct bucket into which to place
the new record.
• If the bucket is full, allocate a new overflow
bucket.
• If the file now holds Lf*Bkfr more records than the
primary area needs for the given Lf,
– Add one more bucket to the primary area.
– Distribute the records from the bucket chain at the
boundary value between the original area and the new
primary area buckets
– Add 1 to the boundary value.
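The insertion steps can be sketched as a small in-memory class. Several points are assumptions made for illustration and are not stated on the slides: each bucket is modelled as one Python list holding the primary page together with its overflow chain, the hash number is the key itself, and when the boundary value reaches 2^k the sketch increments k and resets the boundary to 0 (the usual way to start the next round of splits).

```python
class LinearHashFile:
    def __init__(self, k=3, bkfr=3, lf=0.625):
        self.k = k                      # digits used by unsplit buckets
        self.boundary = 0               # next bucket to split
        self.bkfr = bkfr
        self.lf = lf                    # target load factor
        self.n = 0                      # records in the file
        self.buckets = [[] for _ in range(2 ** k)]

    def address(self, h):
        low_k = h % (2 ** self.k)
        if low_k < self.boundary:       # already split: use k+1 digits
            return h % (2 ** (self.k + 1))
        return low_k

    def insert(self, key):
        self.buckets[self.address(key)].append(key)
        self.n += 1
        # Expand once the file holds Lf*Bkfr more records than the
        # current primary area needs for the target load factor.
        needed = self.lf * self.bkfr * len(self.buckets)
        if self.n >= needed + self.lf * self.bkfr:
            self.split()

    def split(self):
        old = self.boundary
        records = self.buckets[old]
        self.buckets[old] = []
        self.buckets.append([])         # new bucket number: old + 2**k
        self.boundary += 1              # add 1 to the boundary value
        for r in records:               # redistribute using k+1 digits
            self.buckets[self.address(r)].append(r)
        if self.boundary == 2 ** self.k:    # assumption: start a new round
            self.k += 1
            self.boundary = 0


f = LinearHashFile(k=3, bkfr=3, lf=0.625)
for key in [8, 16, 32, 17, 25, 34, 50, 11, 27, 28, 12, 5, 14, 55, 15, 13, 21, 37]:
    f.insert(key)
```

With a target load factor of 0.625 (the 15/24 of the example), loading the 15 initial records and then 13, 21 and 37 triggers exactly one split: bucket 000 becomes 0000 (16, 32) and 1000 (8), and the boundary value moves to 001.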
Deletion
• Read in a chain of records.
• Replace the deleted record with the last record in the chain.
– If the last overflow bucket becomes empty, deallocate it.
• When the number of records is Lf * Bkfr less than the number
needed for Lf, contract the primary area by one bucket.
Compressing the table is the exact opposite of expanding it:
• Keep track of the total # of records in the file and the # of buckets in the
primary area.
• When we have Lf * Bkfr fewer records than needed,
consolidate the last bucket with the bucket which shares the
same last k digits.
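Contraction can be sketched as the mirror image of a split. As before, this is an illustration under assumptions: buckets are Python lists (primary page plus overflow chain), the function names are hypothetical, and when the boundary value is 0 the sketch first steps back to the previous round (k - 1 digits, boundary 2^(k-1)).

```python
def should_contract(n, m, bkfr, lf):
    """True when the file holds Lf*Bkfr fewer records than M buckets need."""
    return n <= lf * bkfr * (m - 1)

def contract(buckets, k, boundary):
    """Merge the last primary bucket into the bucket sharing its last k digits.

    Returns the new (k, boundary). Assumes more than one primary bucket.
    """
    if boundary == 0:                   # step back to the previous round
        k -= 1
        boundary = 2 ** k
    boundary -= 1                       # subtract 1 from the boundary value
    last = buckets.pop()                # bucket number boundary + 2**k
    buckets[boundary].extend(last)      # same last k digits
    return k, boundary
```

Applied to the nine-bucket state of the expansion example (k = 3, boundary 001), `contract` merges bucket 1000 back into bucket 000 and returns k = 3 with boundary 000.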
Extendible Hashing
[Figure: a bucket address table indexed by a hash prefix of length i; its entries point to data buckets bucket1, bucket2, bucket3, and each data bucket j records i_j, the length of its common hash prefix.]
• Hash function returns b bits
• Only the prefix i bits are used to hash the item
• There are $2^i$ entries in the bucket address table
• Let $i_j$ be the length of the common hash prefix for data bucket j; there
are $2^{i - i_j}$ entries in the bucket address table that point to j
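The two quantities above can be computed directly. In this sketch the function names are hypothetical, and the hash value is assumed to be a non-negative integer of exactly b bits.

```python
def directory_index(hash_value, b, i):
    """Index into the bucket address table: the first i of the b hash bits."""
    return hash_value >> (b - i)

def entries_pointing_to(i, i_j):
    """Number of table entries that point to a bucket with prefix length i_j."""
    return 2 ** (i - i_j)
```

For b = 3 and i = 2, the hash value 100 selects entry 10; with i = 3, a bucket whose common prefix has length 2 is pointed to by 2 entries.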
Splitting a bucket: Case 1
• Splitting (Case 1: $i_j = i$)
– Only one entry in the bucket address table points to data
bucket j
– i++; split data bucket j into j and z; $i_j = i_z = i$; rehash all items
previously in j
[Figure: the bucket address table doubles from 4 entries (00 to 11, i = 2) to 8 entries (000 to 111, i = 3) when a bucket with $i_j = 2$ is split.]
Splitting: Case 2
• Splitting (Case 2: $i_j < i$)
– More than one entry in the bucket address table points to
data bucket j
– Split data bucket j into j and z; $i_j = i_z = i_j + 1$; adjust the
pointers that previously pointed to j so that they point to j or z; rehash all items
previously in j
[Figure: the bucket address table keeps its 8 entries (000 to 111, i = 3); the bucket with $i_j = 1$ is split into two buckets with $i_j = 2$.]
Example
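• Suppose the hash function is h(x) = x mod 8 and each
bucket can hold at most two records. Show the extendible
hash structure after inserting 1, 4, 5, 7, 8, 2, 20.
Key   h(x) in binary
1     001
4     100
5     101
7     111
8     000
2     010
20    100
• After inserting 1, 4: i = 0
– single bucket (i_j = 0): 1, 4
• After inserting 5: i = 1
– entry 0: bucket (i_j = 1): 1
– entry 1: bucket (i_j = 1): 4, 5
• After inserting 7, 8: i = 2
– entries 00, 01: bucket (i_j = 1): 1, 8
– entry 10: bucket (i_j = 2): 4, 5
– entry 11: bucket (i_j = 2): 7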
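Example (continued)
• Inserting 1, 4, 5, 7, 8, 2, 20
• After inserting 2: i = 2
– entry 00: bucket (i_j = 2): 1, 8
– entry 01: bucket (i_j = 2): 2
– entry 10: bucket (i_j = 2): 4, 5
– entry 11: bucket (i_j = 2): 7
• After inserting 20: i = 3
– entries 000, 001: bucket (i_j = 2): 1, 8
– entries 010, 011: bucket (i_j = 2): 2
– entry 100: bucket (i_j = 3): 4, 20
– entry 101: bucket (i_j = 3): 5
– entries 110, 111: bucket (i_j = 2): 7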
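The example can be reproduced with a compact sketch of extendible hashing that covers both splitting cases. It is an in-memory illustration under assumptions: the class and method names are hypothetical, the table is a Python list of references to `Bucket` objects, and a bucket that is still full when its prefix length reaches b raises an error instead of using overflow pages.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth              # i_j: length of the common hash prefix
        self.keys = []


class ExtendibleHash:
    def __init__(self, b=3, capacity=2):
        self.b = b                      # hash function returns b bits
        self.capacity = capacity        # records per bucket
        self.i = 0                      # prefix length used by the table
        self.table = [Bucket(0)]        # bucket address table, 2**i entries

    def _index(self, key):
        h = key % (2 ** self.b)         # h(x) = x mod 2^b
        return h >> (self.b - self.i)   # first i bits of the hash

    def insert(self, key):
        while True:
            bucket = self.table[self._index(key)]
            if len(bucket.keys) < self.capacity:
                bucket.keys.append(key)
                return
            if bucket.depth == self.b:
                raise OverflowError("bucket full and no hash bits left")
            self._split(bucket)

    def _split(self, bucket):
        if bucket.depth == self.i:      # Case 1: double the table, i++
            self.table = [e for e in self.table for _ in (0, 1)]
            self.i += 1
        bucket.depth += 1               # Case 2 (also runs after doubling)
        new = Bucket(bucket.depth)
        for idx, entry in enumerate(self.table):
            # Entries whose new prefix bit is 1 now point to the new bucket.
            if entry is bucket and (idx >> (self.i - bucket.depth)) & 1:
                self.table[idx] = new
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:              # rehash all items previously in j
            self.table[self._index(k)].keys.append(k)


eh = ExtendibleHash(b=3, capacity=2)
for key in [1, 4, 5, 7, 8, 2, 20]:
    eh.insert(key)
```

After the seven insertions `eh.i` is 3, and the table entries 000 to 111 point to buckets holding (1, 8), (1, 8), (2), (2), (4, 20), (5), (7), (7), matching the final structure of the example.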
Comments on Extendible Hashing
• If directory fits in memory, equality search answered
with one disk access.
– A typical example: a 100MB file with 100 bytes/entry and
a page size of 4K contains 1,000,000 records (as data
entries) but only about 25,000 directory elements
⇒ chances are high that the directory will fit in memory.
• If the distribution of hash values is skewed (e.g., a
large number of search key values all are hashed to
the same bucket ), directory can grow large.
– But this kind of skew can be avoided with a well-tuned
hashing function
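The figures in the typical example follow from simple arithmetic, sketched below. The sketch assumes that "100MB" means 10^8 bytes, that "4K" means 4096 bytes, that every bucket is one full page, and that there is one directory element per bucket; none of these details are spelled out on the slide.

```python
FILE_BYTES = 100 * 10**6        # "100MB", read as 10^8 bytes (assumption)
ENTRY_BYTES = 100               # bytes per data entry
PAGE_BYTES = 4096               # "4K" page (assumption: 4096 bytes)

data_entries = FILE_BYTES // ENTRY_BYTES            # records in the file
entries_per_page = PAGE_BYTES // ENTRY_BYTES        # whole entries per page
directory_elements = data_entries // entries_per_page   # one per full page
```

This gives 1,000,000 data entries, 40 entries per page and 25,000 directory elements.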
Comments on Extendible Hashing
• Delete: If removal of a data entry makes a
bucket empty, the bucket can be merged with its “buddy”
bucket. If each directory element points to the
same bucket as its split image, the directory can be
halved.
Summary
• Hash-based indexes: best for equality searches,
cannot support range searches.
• Static Hashing can lead to long overflow chains.
• Extendible Hashing avoids overflow pages by
splitting a full bucket when a new data entry is to
be added to it.
– Directory to keep track of buckets, doubles periodically.
– Can get large with skewed data; additional I/O if this
does not fit in main memory.