



#### Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Xiaoxia Wu<sup>†</sup>, Jian Li<sup>‡</sup>, Lixin Zhang<sup>‡</sup>,

Evan Speight<sup>‡</sup>, Ram Rajamony<sup>‡</sup>, Yuan Xie<sup>†</sup>

<sup>†</sup> Pennsylvania State University

<sup>‡</sup> IBM Austin Research Laboratory

<u>Acknowledgement</u>: Elmootazbellah (Mootaz) Elnozahy, Hung Le, Balaram Sinharoy, William (Bill) J. Starke, and Chung-Lung Kevin Shum

- Motivation and Introduction
- Methodology
- Level based Hybrid Cache Architecture
- Region based Hybrid Cache Architecture
- 3D Hybrid Cache Stacking
- Conclusion

## Introduction



#### **Traditional SRAM-based Cache Architecture**

- Limited size with CMP: cache-core balance
- Leakage power
- More cache levels: design overhead, coherence
- Non-Uniform Cache Architecture (wire delay)

Improve cache power-performance with Emerging Memory Technologies, under the same chip area/footprint

- Embedded DRAM
- Magnetic RAM
- Phase Change RAM
- Three-dimensional space

## **Different Memory Technologies**



# Comparisons

|                 | SRAM            | eDRAM           | MRAM                          | PRAM                               |
|-----------------|-----------------|-----------------|-------------------------------|------------------------------------|
| Density (ratio) | Low (1)         | High (4)        | High (4)                      | High (16)                          |
| Dynamic Power   | Low             | Medium          | Low for read; High for write  | Medium for read; High for write    |
| Leakage Power   | High            | Medium          | Low                           | Low                                |
| Speed           | Very Fast       | Fast            | Fast for read; Slow for write | Slow for read; Very slow for write |
| Non-volatility  | No              | No              | Yes                           | Yes                                |
| Scalability     | Yes             | Yes             | Yes                           | Yes                                |
| Endurance       | 10<sup>16</sup> | 10<sup>16</sup> | >10<sup>15</sup>              | 10<sup>8</sup>                     |

Callouts on the slide: reduce cache miss rate, hit latency, low leakage power, high dynamic power.

## Motivation




## **Evaluation Methodology**




## **Evaluation Setup**

| Cache       | Density (ratio) | Latency (cycles)     | Dynamic energy (nJ)    | Static power (W) |
|-------------|-----------------|----------------------|------------------------|------------------|
| SRAM (1MB)  | 1               | 8                    | 0.388                  | 1.36             |
| eDRAM (4MB) | 4               | 24                   | 0.72                   | 0.4              |
| MRAM (4MB)  | 4               | Read: 20; write: 60  | Read: 0.4; write: 2.3  | 0.15             |
| PRAM (16MB) | 16              | Read: 40; write: 200 | Read: 0.8; write: 1.5  | 0.3              |
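To make the trade-offs in the table easier to compare, here is a minimal Python sketch that encodes the rows and derives a mean access latency, a mean dynamic energy, and the static power per MB. The `CacheTech` type, the helper names, and the `write_fraction` parameter are illustrative assumptions, not part of the original evaluation. The sketch also assumes that SRAM and eDRAM reads and writes cost the same, since the table gives a single value for each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheTech:
    """One row of the evaluation table (hypothetical container type)."""
    name: str
    capacity_mb: int
    read_latency: int       # cycles
    write_latency: int      # cycles
    read_energy_nj: float
    write_energy_nj: float
    static_power_w: float


# Values copied from the table; SRAM/eDRAM read == write is an assumption.
TECHS = {
    "SRAM": CacheTech("SRAM", 1, 8, 8, 0.388, 0.388, 1.36),
    "eDRAM": CacheTech("eDRAM", 4, 24, 24, 0.72, 0.72, 0.4),
    "MRAM": CacheTech("MRAM", 4, 20, 60, 0.4, 2.3, 0.15),
    "PRAM": CacheTech("PRAM", 16, 40, 200, 0.8, 1.5, 0.3),
}


def mean_latency(tech: CacheTech, write_fraction: float) -> float:
    """Access latency in cycles, weighted by the share of writes."""
    return (1 - write_fraction) * tech.read_latency + write_fraction * tech.write_latency


def mean_energy_nj(tech: CacheTech, write_fraction: float) -> float:
    """Dynamic energy per access in nJ, weighted by the share of writes."""
    return (1 - write_fraction) * tech.read_energy_nj + write_fraction * tech.write_energy_nj


def static_power_per_mb(tech: CacheTech) -> float:
    """Static power normalized by capacity (W per MB)."""
    return tech.static_power_w / tech.capacity_mb


if __name__ == "__main__":
    for tech in TECHS.values():
        # write_fraction=0.25 is an arbitrary illustrative mix, not a measured value.
        print(tech.name, mean_latency(tech, 0.25), static_power_per_mb(tech))
```

The per-MB view shows why the denser technologies are attractive at the same footprint: their static power is spread over 4x or 16x the capacity, while the write-weighted latency shows what that capacity costs on write-heavy access mixes.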

| Component | Configuration                               |
|-----------|---------------------------------------------|
| Processor | 8-way issue out-of-order, 8-core, 4GHz      |
| L1        | 32KB DL1, 32KB IL1, 128B, 4-way, 1 R/W port |
| L2/L3/L4  | See corresponding design cases              |
| Memory    | 400 cycles latency                          |

- Benchmarks: SPECint2006, SPECjbb, NAS, BioPerf, PARSEC
- Simulator: SystemSim full system simulator


#### **LHCA: Performance and Power**








#### **RHCA Hardware Support**



#### Hardware support

- Saturating counters in the slow region, sticky bits in the fast region, and a swap buffer
- Minimum hardware support: a 1-bit sticky bit in the fast region
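The slide names the hardware pieces but not the exact policy that connects them. The Python sketch below shows one plausible way they could interact within a single cache set. It is an assumption-laden illustration, not the authors' design: the threshold `COUNTER_MAX`, the `FastLine`/`SlowLine` types, the `access_set` function, and the rule of clearing sticky bits after a refused swap are all hypothetical.

```python
from dataclasses import dataclass

COUNTER_MAX = 3  # hypothetical saturation threshold (e.g. a 2-bit counter)


@dataclass
class FastLine:
    tag: int
    sticky: bool = False  # set on a hit; protects the line from being swapped out


@dataclass
class SlowLine:
    tag: int
    counter: int = 0      # saturating hit counter


def access_set(fast: list, slow: list, tag: int) -> str:
    """Look up `tag` in one set; returns 'fast', 'slow', 'swap', or 'miss'."""
    for f in fast:
        if f.tag == tag:
            f.sticky = True
            return "fast"
    for s in slow:
        if s.tag == tag:
            s.counter = min(s.counter + 1, COUNTER_MAX)
            if s.counter < COUNTER_MAX:
                return "slow"
            # The slow line is hot: look for an unprotected fast line to swap with.
            for f in fast:
                if not f.sticky:
                    # The tuple swap stands in for the hardware swap buffer.
                    f.tag, s.tag = s.tag, f.tag
                    f.sticky = True
                    s.counter = 0
                    return "swap"
            # Every fast line is protected: clear the sticky bits (assumed aging
            # rule) so that a later attempt can succeed.
            for f in fast:
                f.sticky = False
            return "slow"
    return "miss"
```

For example, with `fast = [FastLine(1)]` and `slow = [SlowLine(2)]`, three consecutive accesses to tag 2 return `"slow"`, `"slow"`, `"swap"`, after which tag 2 hits in the fast region.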

## **RHCA Configuration**



- DNUCA policy: finer grained; moves a line to a closer bank on each hit; bank-based, same size
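The "move a line to a closer bank on each hit" rule can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each bank holds a single line, index 0 is the bank closest to the core, and the promoted line trades places with the occupant of the next-closer bank. The function name `dnuca_hit` is hypothetical.

```python
from typing import List, Optional


def dnuca_hit(banks: List[str], tag: str) -> Optional[int]:
    """Promote `tag` one bank closer on a hit.

    `banks[0]` is the bank closest to the core. Returns the index of the bank
    where the line was found (before promotion), or None on a miss.
    """
    for i, resident in enumerate(banks):
        if resident == tag:
            if i > 0:
                # Swap with the occupant of the next-closer bank (assumed policy).
                banks[i - 1], banks[i] = banks[i], banks[i - 1]
            return i
    return None
```

Starting from `banks = ["A", "B", "C"]`, two hits on `"C"` move it from the farthest bank to the closest one, one bank per hit.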




## **3DHCA-configuration**



- 3DHCA-C (3D LHCA): 256KB L2 SRAM, 4MB L3 eDRAM, 32MB L4 PRAM
- 3DHCA-D (3D RHCA): 32MB L2 with fast, middle, and slow regions
  - Data in the slow region can be moved to the fast and middle regions
- 3DHCA-E (LHCA+RHCA): 4MB L2 with fast and slow regions, 32MB L3 PRAM
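As a rough aid for comparing the three stacked configurations, the sketch below records them as plain Python data and totals the capacity below L1. The `CONFIGS` layout and the helper names are hypothetical, and the sizes written with "M" on the slide are read as megabytes. The slide does not give the split between regions inside a level, so only per-level totals are modeled.

```python
KB = 1 / 1024  # one kilobyte expressed in MB

# (level, technology or region layout, capacity in MB), taken from the list above.
CONFIGS = {
    "3DHCA-C": [("L2", "SRAM", 256 * KB), ("L3", "eDRAM", 4.0), ("L4", "PRAM", 32.0)],
    "3DHCA-D": [("L2", "fast + middle + slow regions", 32.0)],
    "3DHCA-E": [("L2", "fast + slow regions", 4.0), ("L3", "PRAM", 32.0)],
}


def total_capacity_mb(name: str) -> float:
    """Sum the capacity of every level in the named configuration."""
    return sum(size for _, _, size in CONFIGS[name])


def num_levels(name: str) -> int:
    """Number of cache levels below L1 in the named configuration."""
    return len(CONFIGS[name])


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, num_levels(name), total_capacity_mb(name))
```

The totals are close across the three options, so the configurations differ mainly in how many levels a request traverses rather than in how much data they hold.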



# Conclusion

- Hybrid cache architecture is a promising way to improve cache power-performance under the same chip area/footprint
- RHCA and LHCA achieve better power-performance than the SRAM-based design
- RHCA outperforms LHCA with minimal hardware support
- 3DHCA achieves better performance than LHCA and RHCA while still maintaining lower power than the 2D SRAM baseline