• Code is iteratively generated and timed until the optimal case is found. We try:
  - Differing NBs (blocking factors)
  - Breaking false dependencies
  - M, N, and K loop unrolling
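
The generate-and-time loop above can be illustrated with a minimal sketch. It is not the actual search code: the function names (`blocked_matmul`, `search_best_nb`), the candidate NB values, and the pure-Python kernel are all assumptions made for illustration. It shows only the idea of timing each candidate blocking factor and keeping the fastest.

```python
import time


def blocked_matmul(a, b, n, nb):
    """Multiply two n x n matrices (lists of lists) using nb x nb blocks."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                # Work on one block at a time so its data stays "hot".
                for i in range(ii, min(ii + nb, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + nb, n)):
                        aik = row_a[k]
                        row_b = b[k]
                        for j in range(jj, min(jj + nb, n)):
                            row_c[j] += aik * row_b[j]
    return c


def search_best_nb(n, candidates, repeats=3):
    """Time each candidate NB and return (fastest NB, all timings).

    Hypothetical helper: the real search also varies unrolling and
    other parameters, which are left out here.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, nb)
            best = min(best, time.perf_counter() - start)
        timings[nb] = best  # keep the best of several runs to reduce noise
    return min(timings, key=timings.get), timings
```

Calling `search_best_nb(48, [4, 8, 16, 24])` returns the fastest candidate along with the per-candidate timings. Which NB wins depends on the machine, and that dependence is why the search is empirical.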
• On-chip multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
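
Two of the goals listed above, register reuse and loop overhead minimization, can be sketched with a 2x2 unrolled kernel. This is an illustrative assumption, not the generated on-chip multiply: the name `kernel_2x2` is hypothetical, and Python local variables stand in for the machine registers that a compiled kernel would use.

```python
def kernel_2x2(a, b, c, m, n, k):
    """Compute C += A @ B for an m x k matrix A and a k x n matrix B.

    Hypothetical sketch: m and n must be even, because the M and N
    loops are unrolled by a factor of 2.
    """
    for i in range(0, m, 2):
        for j in range(0, n, 2):
            # Four independent accumulators: each loaded element of A and B
            # is used twice, and the separate sums do not depend on one another.
            c00 = c01 = c10 = c11 = 0.0
            for p in range(k):
                a0 = a[i][p]
                a1 = a[i + 1][p]
                b0 = b[p][j]
                b1 = b[p][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            # Write the 2x2 tile back to C once, after the K loop.
            c[i][j] += c00
            c[i][j + 1] += c01
            c[i + 1][j] += c10
            c[i + 1][j + 1] += c11
    return c
```

Each pass through the inner loop performs four multiply-adds for four loads, compared with one multiply-add for two loads in a plain triple loop, and the loop bookkeeping is shared across the four updates.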
• Takes 30 minutes to an hour to run.