Architectural Synthesis and
Exploration using
Term Rewriting Systems
Arvind
James C. Hoe
Laboratory for Computer Science
Massachusetts Institute of Technology
http://www.csg.lcs.mit.edu
Outline

Introduction

Term Rewriting Systems (TRS) as a Hardware
Description Language

Hardware Synthesis from Term Rewriting Systems

Results
Arvind, MIT Lab for Computer Science
NTT, January 12, 2000, Slide 2
Internet/Communication Space

Rapidly changing functionality and performance
requirements necessitate rapid hardware development
- ATM, frame-relay, Gigabit Ethernet, packet-over-SONET protocols
- voice-over-IP, video, streaming data,
QoS issues dominant
- merger of LAN and WAN infrastructures

Currently addressed by
- General-purpose or Embedded processors + ASICs
- Network processors (emerging)
ASIC development time and cost are the limiting factors in
product release
Current ASIC Design Flow
[Figure: design flow from Informal Architectural Spec through High-level C Simulators and RTL Implementation to Synthesis/Optimization, Fab, and ASICs]
Manual Steps
- Verification nightmare
- Labor Intensive
- Time Consuming
- Error Prone
Time pressure means:
little architecture exploration & high technology risk
Our New Design Technology

Reduces time to market
- Faster design capture
- Same specification for simulation, verification and
synthesis
- Rapid feedback ⇒ architectural exploration

Enables rapid development of a large variety of chips
with related designs
⇒ complex systems-on-a-chip

Reduces manpower requirement
Makes designing hardware as commonplace as
writing software
State-Centric Descriptions
Schematics
[Schematic: registers a and b with clock enables (ce); signals dMod,a, dFlip,a, dFlip,b, pMod, and pFlip; a subtractor (-), a comparator (<), and an =0 test]

Hardware description languages
always @ (posedge Clk) begin
  if (a >= b) begin
    a <= a - b;
    b <= b;
  end else begin
    a <= b;
    b <= a;
  end
end
what does it describe?
Operation-Centric Descriptions
Euclid’s Algorithm
Gcd(a, b) if b≠0 → Gcd(b, Rem(a, b))    (Rule1)
Gcd(a, 0) → a                            (Rule2)
Rem(a, b) if a<b → a                     (Rule3)
Rem(a, b) if a≥b → Rem(a-b, b)           (Rule4)
Execution:
Gcd(2,4) →R1 Gcd(4,Rem(2,4)) →R3 Gcd(4,2) →R1 Gcd(2,Rem(4,2))
→R4 Gcd(2,Rem(2,2)) →R4 Gcd(2,Rem(0,2)) →R3 Gcd(2,0) →R2 2
Hardware description?
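
The four Gcd/Rem rules on this slide can be run as a tiny rewriting engine. The following is a minimal Python sketch, not the authors' tooling: the tuple encoding of terms, the function names `step` and `normalize`, and the strategy of reducing arguments before applying a rule at the root are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple, Union

# A term is an int, or a tuple ("Gcd", x, y) / ("Rem", x, y) whose
# arguments may themselves be terms. (Encoding assumed for this sketch.)
Term = Union[int, tuple]

def step(t: Term) -> Optional[Tuple[str, Term]]:
    """Apply one rule; return (rule name, new term), or None for a normal form."""
    if isinstance(t, int):
        return None  # integers cannot be rewritten further
    op, x, y = t
    # Guards compare numbers, so reduce non-integer arguments first.
    if not isinstance(x, int):
        name, nx = step(x)
        return name, (op, nx, y)
    if not isinstance(y, int):
        name, ny = step(y)
        return name, (op, x, ny)
    if op == "Gcd":
        # Rule1: Gcd(a, b) if b != 0 -> Gcd(b, Rem(a, b));  Rule2: Gcd(a, 0) -> a
        return ("R1", ("Gcd", y, ("Rem", x, y))) if y != 0 else ("R2", x)
    if op == "Rem":
        # Rule3: Rem(a, b) if a < b -> a;  Rule4: Rem(a, b) if a >= b -> Rem(a-b, b)
        return ("R3", x) if x < y else ("R4", ("Rem", x - y, y))
    raise ValueError(f"unknown operator {op!r}")

def normalize(t: Term) -> Tuple[Term, List[str]]:
    """Rewrite until no rule applies; also return the sequence of rules fired."""
    fired = []
    while (r := step(t)) is not None:
        name, t = r
        fired.append(name)
    return t, fired

result, fired = normalize(("Gcd", 2, 4))
print(result, fired)
```

For the starting term Gcd(2,4), the rules fire in the same order as the execution trace on the slide.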
Operation-Centric Description: MIPS
MIPS Microprocessor Manual
ADD rd, rs, rt
GPR[rd] ← GPR[rs] + GPR[rt]
PC ← PC + 4
TRS as a
Hardware Description Language
Term Rewriting System
TRS ≡ <A, R>
- A: a set of terms (hierarchically organized state elements)
- R: a set of rewriting rules (state transitions)
System ≡ Structure + Behavior
An operation-centric view of the world
TRS Execution Semantics
Given a set of rules and an initial term s
While ( some rules are applicable to s )
{
  - choose an applicable rule (non-deterministic)
  - apply the rule atomically to s
}
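
The loop above can be sketched directly in Python. This is an illustrative model under stated assumptions: a rule is represented as a (name, guard, action) triple, the non-deterministic choice is modeled with a seeded random generator, and the two example rules `Mod` and `Flip` over a state `(a, b)` are hypothetical guards and actions chosen to mirror the subtract-or-swap behavior of the Verilog fragment shown earlier in the deck.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, int]  # (a, b)
Rule = Tuple[str, Callable[[State], bool], Callable[[State], State]]

# Hypothetical rule set: subtract when a >= b (and b is non-zero), else swap.
RULES: List[Rule] = [
    ("Mod",  lambda s: s[0] >= s[1] and s[1] != 0, lambda s: (s[0] - s[1], s[1])),
    ("Flip", lambda s: s[0] < s[1],                lambda s: (s[1], s[0])),
]

def run(rules: List[Rule], s: State, rng: random.Random) -> Tuple[State, List[str]]:
    """Fire one applicable rule at a time until no rule applies."""
    fired = []
    while True:
        applicable = [r for r in rules if r[1](s)]
        if not applicable:
            return s, fired              # s is a terminal state
        name, _guard, action = rng.choice(applicable)  # non-deterministic choice
        s = action(s)                    # the whole state is replaced atomically
        fired.append(name)

final, fired = run(RULES, (2, 4), random.Random(0))
print(final, fired)
```

In this particular rule set the two guards are mutually exclusive, so the random choice never has more than one candidate and the result does not depend on the seed. With overlapping guards, different seeds could produce different (but each individually valid) rule sequences.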
Architectural Description
[Figure: processor block diagram with PC (+1 incrementer), PROG, RF, BF, ALU, Iport, and Oport]
AX Architectural Description
Type SYS   = Sys( PROC, IPORT, OPORT )
Type PROC  = Proc( PC, RF, PROG, BF )
Type PC    = Bit[16]
Type RF    = Array[RNAME] VAL
Type RNAME = Reg0 || Reg1 || Reg2 || . . .
Type VAL   = Bit[16]
Type PROG  = Array[PC] INST
Type BF    = Fifo INST_D
Type IPORT = Iport VAL
Type OPORT = Oport VAL
Abstract Datatypes
[Figure: block diagram with PC (+1), PROG, RF, BF, ALU, Iport, Oport]
AX Instruction Set
Type INST = Loadi (RD, VAL)
         || Loadpc (RD)
         || Add (RD, R1, R2)
         || Sub (RD, R1, R2)
         || ...
         || Bz (RA,RC)
         || MovToO (R1)
         || MovFromI (RD)
Decoded instructions
Type INST_D = Addd (RD, V1, V2) || ...
RD, RA, etc. are RNAME’s. V1, V2, etc. are values
AX Processor Model: Fetch Rules
Fetch Add Rule
Proc( pc, rf, prog, bf )
  if r1 ∉ target(bf) ∧ r2 ∉ target(bf)
  where Add(r, r1, r2) = prog[pc]
→ Proc( pc+1, rf, prog, enq(bf, Addd(r, rf[r1], rf[r2])) )
[Figure: block diagram with PC (+1), PROG, RF, BF, ALU, Iport, Oport]
AX Processor Model: Execute Rules
“Fetch Add”
Proc( pc, rf, prog, bf )
  if r1 ∉ target(bf) ∧ r2 ∉ target(bf)
  where Add(r, r1, r2) = prog[pc]
→ Proc( pc+1, rf, prog, enq(bf, Addd(r, rf[r1], rf[r2])) )

“Execute Add”
Proc( pc, rf, prog, bf )
  where Addd(r, v1, v2) = first(bf)
→ Proc( pc, rf[r:=v1+v2], prog, deq(bf) )
[Figure: block diagram with PC (+1), PROG, RF, BF, ALU, Iport, Oport]
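
The fetch and execute rules for Add can be modeled as two pure functions on the processor state. This is a rough Python sketch, assuming a dict for RF, tuples for PROG and BF, and a `None` return to signal that a rule does not apply; the bounds check on `pc` and the 16-bit wraparound (taken from the `Bit[16]` type declarations) are choices made for this illustration, and the function names are hypothetical.

```python
from collections import namedtuple

Add = namedtuple("Add", "rd r1 r2")          # instruction as stored in PROG
Addd = namedtuple("Addd", "rd v1 v2")        # decoded instruction held in BF
Proc = namedtuple("Proc", "pc rf prog bf")   # rf: dict, prog: tuple, bf: tuple (FIFO)

def targets(bf):
    """Destination registers of all instructions waiting in the buffer."""
    return {inst.rd for inst in bf}

def fetch_add(p):
    """Fetch Add rule: decode prog[pc] into BF unless a source is still pending."""
    if p.pc >= len(p.prog) or not isinstance(p.prog[p.pc], Add):
        return None
    r, r1, r2 = p.prog[p.pc]
    if r1 in targets(p.bf) or r2 in targets(p.bf):
        return None                          # guard fails: read-after-write hazard
    decoded = Addd(r, p.rf[r1], p.rf[r2])
    return p._replace(pc=(p.pc + 1) & 0xFFFF, bf=p.bf + (decoded,))

def execute_add(p):
    """Execute Add rule: retire the oldest buffered instruction into RF."""
    if not p.bf or not isinstance(p.bf[0], Addd):
        return None
    r, v1, v2 = p.bf[0]
    return p._replace(rf={**p.rf, r: (v1 + v2) & 0xFFFF}, bf=p.bf[1:])

prog = (Add("r2", "r0", "r1"), Add("r3", "r2", "r2"))
p0 = Proc(pc=0, rf={"r0": 1, "r1": 2, "r2": 0, "r3": 0}, prog=prog, bf=())
p1 = fetch_add(p0)       # first Add enters the buffer
stalled = fetch_add(p1)  # second Add reads r2, which is still pending
p2 = execute_add(p1)     # r2 is written, buffer drains
p3 = execute_add(fetch_add(p2))
print(stalled, p3.rf)
```

The second fetch is rejected while `r2` is a pending target, which is the stall condition expressed by the rule's guard.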
TRS as an HDL

Clean, expressive, precise and concise
- speculative & superscalar microarchitectures
[IEEE Micro, June ’99]
- memory models & cache coherence protocols
[ISCA99, ICS99]

Supports parallel and non-deterministic specifications

The correctness of a TRS can be verified against a
reference TRS specification

Some pipelining can be done automatically as a source-to-source transformation on TRS’s

Superscalar versions of TRS’s can be derived
mechanically from pipelined TRS’s.
Synthesis from TRS’s
From TRS to Synchronous FSM
[Figure: synchronous FSM; inputs I and current state S feed the transition logic, which produces outputs O and the next state S“Next”]
Extract state elements (registers) from the type declaration
Extract state transition logic from the rules
Rule: As a State Transformer

Proc( pc, rf, prog, bf ) where Bzd(va, 0) = first(bf)
→ Proc( va, rf, prog, clear(bf) )
[Figure: the rule as a state transformer; the current state (PC, RF, PROG, BF) feeds π, which produces the enable signal, and δ, which produces the next state values (PC’, RF’, PROG’, BF’)]
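
Viewed as a state transformer, the branch rule on this slide splits into an enable predicate and a next-state function. The sketch below models that split in Python; the names `pi`, `delta`, and `clock`, and the tuple-based encoding of the state, are assumptions for illustration rather than the compiler's actual output.

```python
from collections import namedtuple

Bzd = namedtuple("Bzd", "va vc")             # decoded branch: target value, condition value
Proc = namedtuple("Proc", "pc rf prog bf")   # bf is a tuple used as a FIFO

def pi(p: Proc) -> bool:
    """Enable: the oldest buffered instruction is a Bzd whose condition is 0."""
    return bool(p.bf) and isinstance(p.bf[0], Bzd) and p.bf[0].vc == 0

def delta(p: Proc) -> Proc:
    """Next state values; only meaningful when pi(p) is true."""
    return Proc(pc=p.bf[0].va, rf=p.rf, prog=p.prog, bf=())  # clear(bf)

def clock(p: Proc) -> Proc:
    """One clock edge: latch the next state values only if the rule is enabled."""
    return delta(p) if pi(p) else p

taken = Proc(pc=5, rf={}, prog=(), bf=(Bzd(9, 0), Bzd(1, 1)))
not_taken = Proc(pc=5, rf={}, prog=(), bf=(Bzd(9, 3),))
print(clock(taken), clock(not_taken))
```

Both functions are purely combinational in this model: they read the current state and never modify it, which is what allows the state update to happen in a single step at the clock edge.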
Reference Implementation
Synchronous state elements
[Figure: state-element primitives; register R (D, LE, Q), array A (write port WA, WD, WE; read ports RA1-RA3, RD1-RD3), FIFO F (ED, EE, DE, CE, first, _full, _empty)]
Single transition per clock cycle
Scheduler
[Figure: the scheduler takes the enable signals π1, π2, ..., πn and produces the firing signals φ1, φ2, ..., φn]
1. φi ⇒ πi
2. π1 ∨ π2 ∨ .... ∨ πn ⇒ φ1 ∨ φ2 ∨ .... ∨ φn
3. One-rule-at-a-time ⇒ at most one φi is true
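
One simple scheduler that meets the three conditions on this slide is a fixed-priority encoder over the enable signals. The Python sketch below is an assumed example of such a scheduler, not necessarily the one the synthesis tool generates; `priority_scheduler` and `satisfies_conditions` are hypothetical names, with `pi` standing for the enable signals and `phi` for the firing signals.

```python
from itertools import product
from typing import List, Sequence

def priority_scheduler(pi: Sequence[bool]) -> List[bool]:
    """Fire the lowest-indexed enabled rule and suppress all others."""
    phi, granted = [], False
    for enabled in pi:
        phi.append(enabled and not granted)
        granted = granted or enabled
    return phi

def satisfies_conditions(pi: Sequence[bool], phi: Sequence[bool]) -> bool:
    c1 = all(p or not f for p, f in zip(pi, phi))  # 1. phi_i implies pi_i
    c2 = (not any(pi)) or any(phi)                 # 2. some pi implies some phi
    c3 = sum(phi) <= 1                             # 3. at most one phi_i is true
    return c1 and c2 and c3

# Exhaustive check over every enable pattern for n = 4 rules.
all_ok = all(
    satisfies_conditions(pi, priority_scheduler(pi))
    for pi in product([False, True], repeat=4)
)
print(priority_scheduler([False, True, True]), all_ok)
```

A fixed priority is the simplest choice but can starve low-priority rules; any other arbitration policy is equally valid as long as the same three conditions hold.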
Combining Logic from Multiple Rules
[Figure: latch enables from different rules (φ0, φ1, ..., φn) are ORed to form the latch enable of PC; a selector (sel) chooses among the next state values from different rules (δ0,PC, δ1,PC, ..., δn,PC) to form the PC’ next state value]
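
The merging of per-rule logic for a single register can be sketched as an OR for the latch enable and a selector for the next state value. The Python below is a behavioral model under the assumption that the scheduler asserts at most one firing signal per cycle; `combine` and `clock_pc` are hypothetical names, and using the firing signals themselves as the select input is an assumption of this sketch.

```python
from typing import Optional, Sequence, Tuple

def combine(phi: Sequence[bool], delta_pc: Sequence[int]) -> Tuple[bool, Optional[int]]:
    """Merge per-rule signals into (latch_enable, next value) for the PC register."""
    assert sum(phi) <= 1, "scheduler must fire at most one rule per cycle"
    latch_enable = any(phi)                        # OR of the firing signals
    # Selector: pass through the next state value of the rule that fires.
    next_pc = next((d for f, d in zip(phi, delta_pc) if f), None)
    return latch_enable, next_pc

def clock_pc(pc: int, phi: Sequence[bool], delta_pc: Sequence[int]) -> int:
    """One clock edge: PC changes only when its latch enable is asserted."""
    latch_enable, next_pc = combine(phi, delta_pc)
    return next_pc if latch_enable else pc

print(clock_pc(7, [False, True, False], [8, 3, 0]))
print(clock_pc(7, [False, False, False], [8, 3, 0]))
```

When no rule fires, the register simply holds its value, so the selector output is a don't-care in that cycle.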
Performance Considerations

Concurrent Execution
- Statically determine which transitions can be safely
executed concurrently
- Generate a scheduler and update logic that allows as
many concurrent transitions as possible
Caution: Concurrent firing of two rules can violate one-transition-at-a-time semantics if, for example, firing of
one rule disables the other
Conflict-free rules
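
The slide does not define conflict-freedom formally. As a rough illustration, the sketch below assumes one plausible formalization: two rules are treated as conflict-free if, in every state where both are enabled, neither disables the other and both firing orders reach the same state. The checker enumerates a small state space exhaustively; the rules and the name `conflict_free` are hypothetical, and a real compiler would have to decide this statically rather than by enumeration.

```python
from itertools import product

def conflict_free(rule_a, rule_b, states) -> bool:
    """Check the assumed conflict-free condition over an explicit list of states."""
    (pa, da), (pb, db) = rule_a, rule_b
    for s in states:
        if pa(s) and pb(s):
            if not (pb(da(s)) and pa(db(s))):
                return False   # firing one rule disables the other
            if db(da(s)) != da(db(s)):
                return False   # the two firing orders disagree
    return True

STATES = list(product(range(4), repeat=2))  # small (x, y) state space

# Each rule is a (guard, action) pair over the state (x, y).
inc_x = (lambda s: s[0] < 3,  lambda s: (s[0] + 1, s[1]))
inc_y = (lambda s: s[1] < 3,  lambda s: (s[0], s[1] + 1))
dec_x = (lambda s: s[0] > 0,  lambda s: (s[0] - 1, s[1]))
drain = (lambda s: s[0] == 1, lambda s: (0, s[1]))

print(conflict_free(inc_x, inc_y, STATES))  # rules touch disjoint state
print(conflict_free(dec_x, drain, STATES))  # dec_x at x == 1 disables drain
```

Rules whose guards are never true at the same time pass this check trivially, since there is no state in which they could fire together.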
Quality of Synthesis
TRAC Synthesis Flow
[Figure: TRAC synthesis flow; Design SPEC, Transform, Compile; outputs C (for C Sim) and RTL (for RTL Sim and Synopsys); targets Std Cell, Gate Array, FPGA]
Performance: TRS vs. Verilog
32-bit MIPS Integer Core

              CBA tc6a                        LSI 10K
              Area (cells)   Clock            Area (gates)   Clock
TRS           9521           10ns (100MHz)    30756          19.48ns (51MHz)
Verilog RTL   8960           11.4ns (88MHz)   29483          23.79ns (42MHz)

TRS 1 day
Verilog 1 month
Dan Rosenband & James Hoe
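
The clock frequencies and relative areas on this slide follow from simple arithmetic on the reported numbers. The short Python sketch below recomputes them; the dictionary layout and the helper names `mhz` and `area_ratio` are assumptions for illustration, while the area and clock-period values are copied from the slide.

```python
# (area, clock period in ns) per library and design style, as reported on the slide.
RESULTS = {
    "CBA tc6a": {"TRS": (9521, 10.0),   "Verilog RTL": (8960, 11.4)},
    "LSI 10K":  {"TRS": (30756, 19.48), "Verilog RTL": (29483, 23.79)},
}

def mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns

def area_ratio(library: str) -> float:
    """Area of the TRS-derived design relative to the hand-written Verilog RTL."""
    trs_area, _ = RESULTS[library]["TRS"]
    rtl_area, _ = RESULTS[library]["Verilog RTL"]
    return trs_area / rtl_area

for library, designs in RESULTS.items():
    for style, (area, period) in designs.items():
        print(f"{library:9} {style:12} {area:6d} {mhz(period):6.1f} MHz")
    print(f"{library:9} TRS/RTL area ratio: {area_ratio(library):.3f}")
```

In both libraries the TRS-derived design is within a few percent of the hand-written RTL in area while running at a shorter clock period.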
Architectural Derivatives
[Figure: datapath variants built from PC (+1), PROG, BF, RF, ALU, MIN, and MOUT: Non-pipelined, 2-stage, 3-stage]
Other Dimensions:
Superscalar, Custom Instructions,
Number of Registers, Word Size ...
Derivatives and Feedback

Derivatives of a 32-bit 4-GPR embedded RISC processor

Synopsys RTL Analyzer reports GTECH area and gate
delays (no wiring or load model)

               simple   2-stage        3-stage       3-stage,2-way
Delay          30+X     max(18+X,25)   max(6+X,25)   max(8+X,31)
Delay(X=20)    50       38             26            31
Area           4334     5753           6378          9492

unit area=1 NAND, unit delay=1 NAND
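
The delay formulas for the four derivatives can be evaluated for any value of X to see how the designs trade off. The following Python sketch encodes the formulas and areas from this slide; the dictionary layout and the function name `delays` are assumptions for illustration, and X is simply the free parameter of the formulas, since the slide does not say what it stands for.

```python
# Delay formulas (in NAND gate delays) and areas (in NAND gates) from the slide.
DELAY = {
    "simple":        lambda x: 30 + x,
    "2-stage":       lambda x: max(18 + x, 25),
    "3-stage":       lambda x: max(6 + x, 25),
    "3-stage,2-way": lambda x: max(8 + x, 31),
}
AREA = {"simple": 4334, "2-stage": 5753, "3-stage": 6378, "3-stage,2-way": 9492}

def delays(x: int) -> dict:
    """Evaluate every design's delay formula at the given X."""
    return {name: formula(x) for name, formula in DELAY.items()}

for name, delay in delays(20).items():
    print(f"{name:14} delay={delay:3d} area={AREA[name]}")
```

At X=20 the results match the Delay(X=20) row of the slide. The `max` terms show where each pipelined design stops benefiting from a smaller X: below X=19 the 3-stage design is limited by its fixed 25-unit stage.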
Application: ASPN Chips
[Figure: performance vs. flexibility plot positioning ASIC, ASPN, NP, and GP]
Application-Specific Programmable Network (ASPN)
Chips are based on a core architecture and a set of
domain-specific building blocks
TRAC allows rapid customization of ASPN designs
with ASIC-like performance for evolving needs and for
different vertical markets within the communication
space