Code generation strategy
Code is generated and timed iteratively until the fastest variant is found. We try:
- Different blocking factors (NB)
- Breaking false dependencies
- M, N and K loop unrolling
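The generate-and-time loop above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual generator: it assumes NB denotes the block size of a blocked matrix multiply, it varies NB as a run-time parameter instead of emitting new source for each variant, and the names DIM, blocked_matmul, time_variant and search_best_nb are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

#define DIM 240  /* hypothetical problem size, divisible by every candidate NB */

double A[DIM * DIM], B[DIM * DIM], C[DIM * DIM];

/* Blocked C += A * B on row-major DIM x DIM matrices, block size nb. */
void blocked_matmul(int nb)
{
    assert(nb > 0 && DIM % nb == 0);
    for (int ii = 0; ii < DIM; ii += nb)
        for (int jj = 0; jj < DIM; jj += nb)
            for (int kk = 0; kk < DIM; kk += nb)
                for (int i = ii; i < ii + nb; i++)
                    for (int j = jj; j < jj + nb; j++) {
                        double sum = C[i * DIM + j];
                        for (int k = kk; k < kk + nb; k++)
                            sum += A[i * DIM + k] * B[k * DIM + j];
                        C[i * DIM + j] = sum;
                    }
}

/* Time a single run of one variant, in CPU seconds. */
double time_variant(int nb)
{
    memset(C, 0, sizeof C);
    clock_t start = clock();
    blocked_matmul(nb);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Try every candidate NB and keep the one with the lowest measured time. */
int search_best_nb(const int *candidates, size_t count)
{
    assert(count > 0);
    for (int i = 0; i < DIM * DIM; i++) {
        A[i] = 1.0;
        B[i] = 2.0;
    }
    int best_nb = candidates[0];
    double best_time = time_variant(best_nb);
    for (size_t c = 1; c < count; c++) {
        double t = time_variant(candidates[c]);
        if (t < best_time) {
            best_time = t;
            best_nb = candidates[c];
        }
    }
    return best_nb;
}
```

Calling search_best_nb with a list such as {8, 16, 24, 40, 60} returns whichever block size ran fastest on the current machine. A real search would probably repeat each timing several times and extend the same loop to the other parameters, such as the unrolling depths.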
The on-chip multiply is optimized for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
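As a rough illustration of some of these goals, the sketch below shows a hand-written 2x2 register-blocked kernel. It is a hypothetical example, not generated output: the name kernel_2x2, the row-major layout and the unrolling depth of 2 are assumptions. Each loaded operand is used twice (register reuse), the K loop is unrolled by two (less loop overhead), and the four accumulators are independent of one another, so the FP unit can overlap their updates.

```c
#include <assert.h>

/* Hypothetical kernel: C (2 x 2) += A (2 x k) * B (k x 2).
   All matrices are row-major with leading dimensions lda, ldb, ldc.
   This sketch requires k to be even. */
void kernel_2x2(int k, const double *a, int lda,
                const double *b, int ldb,
                double *c, int ldc)
{
    assert(k >= 0 && k % 2 == 0);

    /* Four independent accumulators, intended to stay in registers. */
    double c00 = c[0],   c01 = c[1];
    double c10 = c[ldc], c11 = c[ldc + 1];

    for (int p = 0; p < k; p += 2) {
        /* First unrolled step: load each operand once, use it twice. */
        double a0 = a[p],       a1 = a[lda + p];
        double b0 = b[p * ldb], b1 = b[p * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;

        /* Second unrolled step, for column p + 1 of A and row p + 1 of B. */
        a0 = a[p + 1];          a1 = a[lda + p + 1];
        b0 = b[(p + 1) * ldb];  b1 = b[(p + 1) * ldb + 1];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
    }

    /* Write the tile back once, after the whole K loop. */
    c[0]   = c00;  c[1]       = c01;
    c[ldc] = c10;  c[ldc + 1] = c11;
}
```

The tile size and unrolling depth here are fixed by hand. In the strategy described above they would be among the parameters that the timing search varies.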