Reuse-based Online Models for Caches
RATHIJIT SEN
DAVID A. WOOD
ACM SIGMETRICS 2013 @ CMU, Pittsburgh, PA
6/20/2013
The Problem
 Caches: power vs performance
 Reconfigurable caches
 e.g., IvyBridge
[Figure: multicore chip with 8 cores and 8 LLC banks; an LLC miss triggers a fetch from DRAM]
 The Problem:
Which configuration to select?
e.g., to get the best energy-efficiency?
Cache Performance Prediction
 We propose a framework
h = (r · B) · φ
 h: hit ratio
 r: reuse-distance distribution (novel hardware support)
 B: stochastic Binomial matrix
 φ: hit function (LRU, PLRU, RANDOM, NMRU)
 Case study:
Energy-Delay Product (EDP) within 7% of minimum
Agenda
 The Problem
 Framework
 Locality (r)
 Matrix transformations (B)
 Hit functions (φ)
 h = (r · B) · φ
 Hardware support
 Case Study
Cache Overview
 Limited storage
 Sets of (usually 64-byte) blocks
 #blocks/set = associativity (#ways)
 Set Index + Address tags identify data
[Figure: cache drawn as a grid of blocks (b), Sets (S) by Associativity (A); the address selects a set, and a tag match yields a hit (Y) or a miss (N)]
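The set/tag lookup described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: 64-byte blocks, power-of-two sizes, and a set index taken directly from the low-order block-address bits. The function names are hypothetical and do not come from the talk.

```python
def split_address(addr, num_sets, block_bytes=64):
    """Return (tag, set_index, block_offset) for a byte address.

    Assumes num_sets and block_bytes are powers of two and that the set
    index is simply the low-order bits of the block address.
    """
    offset = addr % block_bytes
    block = addr // block_bytes
    return block // num_sets, block % num_sets, offset


def capacity_bytes(num_sets, associativity, block_bytes=64):
    """Total data capacity: sets x ways x block size."""
    return num_sets * associativity * block_bytes


tag, set_index, offset = split_address(0x12345, num_sets=1024)
two_mb = capacity_bytes(num_sets=2**13, associativity=4)
```

With these assumptions, a 2MB 4-way cache has 2^13 sets, and halving the associativity at fixed capacity doubles the number of sets.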
Workload Variation
 Last-Level Cache (LLC)
[Figure: LLC misses per 1000 instructions (y-axis, 0 to 30) versus cache size (x-axis: 2MB, 4MB, 8MB, 16MB, 32MB). Labeled curves: swim; mgrid; apache; zeus; oltp; jbb; equake, gafort, wupwise; fma3d; ammp, blackscholes, bodytrack, fluidanimate, freqmine, swaptions]
Bad configurations hurt!
 Maximum EDP (energy-delay product) relative to the minimum
[Figure: max. EDP relative to min. EDP (y-axis, 1 to 3.5); annotations: 27% worse, 218% worse]
Problem Summary
 Reconfigurable caches
[Figure: cache drawn as a grid of blocks (b) with variable Associativity (A) and Sets (S)]
 Multiple replacement policies
 Goal: Online miss-ratio prediction
Indexing Assumption
 Mapping of unique addresses to cache sets
 Assumption: independent, uniform [Smith, 1978]
 Unique accesses as Bernoulli trials
 (Partial) Hashing
 POWER4, POWER5, POWER6, Xeon
 Simple XOR-based function [similar to Cypher, 2008]
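To make the hashing bullet concrete, here is a small Python sketch of one possible XOR-based index function. It is a hypothetical XOR-fold chosen for illustration, not the function used in the talk or in any of the processors listed. It shows why hashing brings a strided address stream closer to the uniform-mapping assumption.

```python
from collections import Counter


def xor_fold_index(block_addr, set_bits):
    """Hypothetical XOR-folded index: low bits XOR the next-higher bits."""
    mask = (1 << set_bits) - 1
    return (block_addr ^ (block_addr >> set_bits)) & mask


SET_BITS = 4
S = 1 << SET_BITS

# A stride of exactly S blocks is the worst case for plain modulo indexing.
strided = [i * S for i in range(S)]
modulo_sets = {b % S for b in strided}
hashed_sets = {xor_fold_index(b, SET_BITS) for b in strided}

# A contiguous range stays perfectly balanced under the XOR fold.
spread = Counter(xor_fold_index(b, SET_BITS) for b in range(8 * S))
```

In this example the strided blocks all collide in one set under modulo indexing but occupy 16 distinct sets under the XOR fold.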
Temporal Locality Metrics
 Unique Reuse Distance (URD)
 #unique intervening addresses
 x y z z y x : URD(x)=2
 Equals Stack Distance [Mattson, 1970] – 1
 Large cache → large distances to track
 r: histogram with r_i = P(URD=i); how large must it be?
 Absolute Reuse Distance (ARD)
 #intervening addresses
 x y z z y x : ARD(x)=4
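The two metrics can be checked on the slide's own example trace with a short Python sketch. It is a straightforward software reference, not the hardware mechanism, and the function name is illustrative.

```python
def reuse_distances(trace):
    """Return a list of (address, URD, ARD), one entry per reuse in the trace."""
    last_pos = {}
    result = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            between = trace[last_pos[addr] + 1:pos]
            # URD counts distinct intervening addresses, ARD counts all of them.
            result.append((addr, len(set(between)), len(between)))
        last_pos[addr] = pos
    return result


distances = reuse_distances("x y z z y x".split())
```

For the trace x y z z y x this reports URD(x)=2 and ARD(x)=4, matching the slide, along with the shorter reuses of z and y.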
Per-set Locality, r(S)
 r(S): per-set reuse-distance distribution, r_i = P(URD=i)
 r(S) is “compressed” as S (#sets) increases
 Less of the tail is important
[Figure: a reuse x … x viewed with #sets = S and with #sets = S' > S]
[Figure: probability and cumulative probability of per-set URD (unique reuse distance, x-axis 0 to 32) for S = 2^10, 2^11, 2^12, 2^13, 2^14]
Estimating per-set locality
 Generalized stochastic Binomial matrices [Strum, 1977]
 r(S) = r(1) · B(1 – 1/S, 1/S)
 r: row vector with r_i = P(URD=i)
 B: lower-triangular matrix; entry (i, k) = P(k successes in i trials), i.e., P(k of i to the same set)
 Composition:
r(S') = r(S) · B(1 – S/S', S/S')
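A minimal Python sketch of this transformation follows. It assumes the uniform, independent set mapping stated earlier in the talk, and it uses a small made-up r(1) whose tail fits entirely in the vector. The helper names are illustrative. It builds the Binomial matrix row by row and checks numerically that rescaling in two steps agrees with rescaling in one.

```python
from math import comb


def binomial_matrix(n, p):
    """n x n lower-triangular matrix with B[i][k] = P(k successes in i trials)."""
    return [[comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
             for k in range(n)]
            for i in range(n)]


def vec_mat(r, B):
    """Row vector times matrix."""
    return [sum(r[i] * B[i][k] for i in range(len(r))) for k in range(len(B[0]))]


def rescale(r, s_from, s_to):
    """URD distribution for s_to sets, given the one for s_from sets."""
    return vec_mat(r, binomial_matrix(len(r), s_from / s_to))


r1 = [0.1, 0.2, 0.3, 0.25, 0.15]          # hypothetical r(1), sums to 1
direct = rescale(r1, 1, 4)                 # one step: 1 set to 4 sets
two_step = rescale(rescale(r1, 1, 2), 2, 4)
```

Because each row of B sums to 1, the result is still a probability distribution. Its mass shifts toward small distances, which is the "compression" that makes the tail matter less as the number of sets grows.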
Computation reuse & speedup
 “Shorter” tail → smaller matrices
[Figure: r(1) is transformed into r(2^10), r(2^11), r(2^12), r(2^13), r(2^14) (labels: Poisson approximation; Size?); the step from r(1) to r(2^10) is marked “Now: compute; Later: hardware support”]
Size of r(2^10)?
 Prediction with r(2^10) limited to URD < n
[Figure: miss ratio (y-axis, 0 to 0.3) for 2-, 4-, 8-, 16-, and 32-way caches at 2MB, 4MB, 8MB, 16MB, and 32MB; series: n=32, n=64, n=128, n=256, n=512, Actual]
Hit Function, φ
[Figure: a reuse x … x with intervening accesses that are not x]
 φ_k: P(x will hit | URD(x)=k)
 Monotonically decreasing model
 Intuition: larger URD → same or larger eviction probability
φ_0 = 1
φ_k ≤ φ_{k-1}
φ_∞ = 0
Hit Function, φ
 Example: A=8
[Figure: hit probability (y-axis, 0 to 1) versus unique reuse distance (x-axis, 0 to 32) for LRU, PLRU, NMRU, and RANDOM]
Formulating φ
 φ(LRU): step-function
 (r · B) · φ(LRU) ≡ [Smith, 1978], [Hill & Smith, 1989]
 φ(PLRU):
 Assumes that, on average, traffic is evenly divided between subtrees
 φ(RANDOM):
 Estimates #intervening misses using ARD
 φ(NMRU): similar to φ(RANDOM) except φ_1 = 1
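Of the four hit functions, only the LRU one is fully specified on the slide, so the sketch below covers just that case. It assumes an A-way LRU set hits exactly when fewer than A unique blocks intervene. It also checks the model constraints φ_0 = 1 and φ_k ≤ φ_{k-1}. The helper names are illustrative.

```python
def phi_lru(associativity, length):
    """LRU step function: phi_k = 1 for k < A, and 0 otherwise."""
    return [1.0 if k < associativity else 0.0 for k in range(length)]


def is_valid_hit_function(phi):
    """Check phi_0 = 1 and that phi never increases with the reuse distance."""
    return phi[0] == 1.0 and all(b <= a for a, b in zip(phi, phi[1:]))


phi8 = phi_lru(associativity=8, length=33)   # A=8, URD 0..32 as in the plot
```

The PLRU, RANDOM, and NMRU hit functions decay more gradually and need the additional modeling described in the paper, so they are not reproduced here.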
Prediction Accuracy
 LRU, PLRU(A=2), NMRU(A=2): exact per-set model
 Others: approximate per-set model
[Figure: cumulative probability (y-axis, 0 to 1) of the miss-ratio error abs((predicted-actual)/actual) (x-axis, 0% to 6%) for LRU, PLRU, RANDOM, and NMRU]
Overheads
 r' = r · B : 6 → 80 μsec
 Binomial → Poisson approximation for each row of B
 h = (r · B) · φ : 20 → 30 μsec
 Average over 24 configurations
 B applied 8 times
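The Binomial-to-Poisson step can be illustrated with a short Python comparison for one row of B. The row index (512 trials), success probability (1/1024), and truncation width below are illustrative choices, not values taken from the talk.

```python
from math import comb, exp, factorial


def binomial_row(i, p, width):
    """Exact row i of B: P(k successes in i trials) for k = 0..width-1."""
    return [comb(i, k) * p**k * (1 - p)**(i - k) if k <= i else 0.0
            for k in range(width)]


def poisson_row(i, p, width):
    """Poisson approximation of the same row, with mean i * p."""
    lam = i * p
    return [exp(-lam) * lam**k / factorial(k) for k in range(width)]


exact = binomial_row(512, 1 / 1024, 33)
approx = poisson_row(512, 1 / 1024, 33)
total_abs_error = sum(abs(a - b) for a, b in zip(exact, approx))
```

Le Cam's inequality bounds the summed error by 2·i·p², which is below 0.001 for this row. The approximation is therefore tightest when the per-set probability p is small.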
Computation reuse & speedup
 “Shorter” tail → smaller matrices
[Figure: r(1) is transformed into r(2^10), r(2^11), r(2^12), r(2^13), r(2^14) (label: Poisson approximation); r(2^10) has Size=512, and obtaining it is now marked “hardware support” rather than “compute”]
Insights
x y z z y x : URD(x)=2
 Unique: “remember” addresses
 Only cardinality, not full addresses
 Bloom filter for compact (approximate) representation
 r(2^10) is seen by any set of a cache with S=2^10
 Filter address stream
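Hardware Support for estimating r(2^10)
[Figure: block diagram. Components: set filter; reference address register; 1024-bit Bloom filter with 2 hash fns; 9-bit counter; 512-entry histogram array; control logic. Flow, as far as the labels allow: at the start of a sample the reference address is loaded and the counter is reset; each filtered access that is unique (not a Bloom filter hit) is remembered (inserted) and increments the counter; when an access matches the reference address, the counter is read, the corresponding histogram entry is incremented, and the sample ends]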
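A software model of one sampling pass might look like the Python sketch below. It is a rough sketch under stated assumptions rather than the hardware design: addresses are integer block addresses, the set filter is a plain modulo, and both Bloom filter hash functions are illustrative stand-ins for whatever the hardware uses.

```python
class BloomFilter:
    """Tiny Bloom filter with two illustrative hash functions."""

    def __init__(self, bits=1024):
        self.bits = bits
        self.array = 0  # bit vector stored in a single Python int

    def _positions(self, addr):
        yield addr % self.bits
        yield ((addr * 2654435761) >> 7) % self.bits

    def insert_if_new(self, addr):
        """Set addr's bits; return True if any of them was previously clear."""
        new = False
        for pos in self._positions(addr):
            if not (self.array >> pos) & 1:
                new = True
                self.array |= 1 << pos
        return new


def sample_urd(stream, reference, num_sets, max_count=511):
    """Count (approximately) unique same-set addresses until `reference` recurs.

    Returns the counter value that would index the histogram, or None if
    the reference address is not reused within `stream`.
    """
    bloom, count = BloomFilter(), 0
    for addr in stream:
        if addr % num_sets != reference % num_sets:
            continue  # set filter: ignore accesses that map to other sets
        if addr == reference:
            return count
        if bloom.insert_if_new(addr) and count < max_count:
            count += 1  # saturates like a 9-bit counter
    return None
```

Repeating this over many sampled reference addresses and incrementing `histogram[count]` each time would yield an estimate of r for the filtered set count. Bloom filter false positives can only make the counted distance smaller.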
Agenda
 The Problem
 Framework
 Locality (r)
 Matrix transformations (B)
 Hit functions (φ)
 h = (r · B) · φ
 Hardware support
 Case Study + way counters
LRU Way Counters [Suh, et al. 2002]
 One counter per logical way (stack position)
 Determining logical position is hard
 not totally (re-)ordered with every access
 heuristics, e.g., for PLRU [Kedzierski, et al. 2010]
 Other Limitations
 Inclusion property
 Fixed #sets
 S' = S : special case of reuse framework
 S' ≠ S ? Use B
 provided enough of the tail of r(S) is available
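For comparison, true-LRU way counters can be modeled in a few lines of Python. The sketch assumes integer block addresses and modulo set indexing, and the function name is illustrative. Each hit increments the counter of the stack position where the block was found, so the hits of a cache with fewer ways and the same number of sets are a prefix sum of the counters.

```python
def lru_way_counters(trace, num_sets, ways):
    """Return (counters, misses); counters[p] counts hits at LRU stack position p."""
    stacks = [[] for _ in range(num_sets)]   # most recently used block first
    counters = [0] * ways
    misses = 0
    for addr in trace:
        stack = stacks[addr % num_sets]
        if addr in stack:
            counters[stack.index(addr)] += 1
            stack.remove(addr)
        else:
            misses += 1
            if len(stack) == ways:
                stack.pop()                   # evict the least recently used block
        stack.insert(0, addr)
    return counters, misses


counters, misses = lru_way_counters([1, 2, 3, 3, 2, 1], num_sets=1, ways=4)
hits_with_2_ways = sum(counters[:2])
```

In this model the stack position equals the per-set URD of the access, so the counters are a histogram of per-set URDs for one fixed number of sets. That is the sense in which way counters are a special case of the reuse framework.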
Min. EDP configuration
 EDP within 7% of minimum
 Reuse models outperform PLRU way counters in most cases
[Figure: EDP relative to min. EDP (y-axis, 1 to 1.08) for the Reuse Model and PLRU Way Counters]
Summary
 The Problem:
Online miss-rate estimation for reconfigurable caches
 We propose a framework
h = (r · B) · φ
 h: hit ratio
 r: reuse-distance distribution (novel hardware support)
 B: stochastic Binomial matrix
 φ: hit function (LRU, PLRU, RANDOM, NMRU)
 Case study: EDP within 7% of minimum
 Future work: More policies, applications/case studies
Also in the paper
 r: lossy summarization of the address trace
 Estimation for ARD
 Optimizations for LRU
 Conditions for PLRU eviction
 More details on models & evaluation
Reuse-based Online Models for Caches
Questions?
Example LLC performance
 OLTP (TPC-C + IBM DB2)
[Figure: miss ratio (y-axis, 0 to 0.4) for 2-, 4-, 8-, 16-, and 32-way caches at 2MB, 4MB, 8MB, 16MB, and 32MB; series: RANDOM, NMRU, PLRU, LRU]
Estimating cache performance
 Hit ratio = hits/access
≈ ∑_i P(URD=i) · P(hit | URD=i)
= r · φ, where r_i = P(URD=i) and φ_i = P(hit | URD=i)
 Miss ratio = misses/access
= 1 – hit ratio
 Miss rate = misses/instruction
= miss ratio × accesses/instruction
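These three definitions translate directly into Python. The vectors and the accesses-per-instruction figure below are made-up illustrative values, and the function names are hypothetical.

```python
def hit_ratio(r, phi):
    """Dot product of r_i = P(URD=i) and phi_i = P(hit | URD=i)."""
    return sum(ri * pi for ri, pi in zip(r, phi))


def miss_ratio(r, phi):
    return 1.0 - hit_ratio(r, phi)


def misses_per_kilo_instruction(r, phi, accesses_per_instruction):
    """Miss rate scaled to misses per 1000 instructions."""
    return 1000.0 * miss_ratio(r, phi) * accesses_per_instruction


r = [0.5, 0.25, 0.125, 0.125]    # hypothetical per-set URD distribution
phi = [1.0, 1.0, 0.5, 0.0]       # hypothetical hit function
h = hit_ratio(r, phi)
mpki = misses_per_kilo_instruction(r, phi, accesses_per_instruction=0.02)
```

If r sums to less than 1, the missing mass corresponds to accesses that are never reused (infinite URD). Those contribute no hits because φ_∞ = 0, so they are counted as misses automatically.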
URD vs ARD
[Figure: reuse interval x z0 {z0}* z1 {z0,z1}* z2 {z0,z1,z2}* z3 … zk-1 {z0,z1,z2,...,zk-1}* x; d_k labels the length of the interval when URD=k]
Approximation: d_k = d_{k-1} + 1 / ∑_{i=k}^{∞} r_i