

#### A. F. J. Levi

# The University of Southern California University Park, DRB 118 Los Angeles, California 90089-1111

### http://www.usc.edu/dept/engineering/eleceng/Adv\_Network\_Tech/

| voice | (213) 740 -7318 |
|-------|-----------------|
| email | alevi@usc.edu   |

Presented at the Optoelectronics Industry Development Association workshop on "Opto-electronics on CMOS", May 12, 1999, Hilton Santa Fe, Santa Fe, New Mexico. OIDA voice (202)-785-4426, fax (202)-785-4428

# IO interconnection and packaging in a large system

UNIVERSITY OF SOUTHERN CALIFORNIA

Large system IO as a function of gates per chip follows Rent's rule: $(IO/k)^{1.8}$ = number of gates
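As a rough illustration of the scaling, the rule can be inverted to give $IO = k \cdot \text{gates}^{1/1.8}$. The sketch below assumes this inversion; the slide does not give the constant `k`, so the value used here is a hypothetical placeholder, and the function names are illustrative only.

```python
def gates_from_io(io: float, k: float, exponent: float = 1.8) -> float:
    """Rent's rule as written on the slide: (IO / k) ** 1.8 = number of gates."""
    return (io / k) ** exponent


def io_from_gates(gates: float, k: float, exponent: float = 1.8) -> float:
    """Inverse form: IO count needed for a given number of gates."""
    return k * gates ** (1.0 / exponent)


K = 1.0  # hypothetical value; k is not stated in the source

for gates in (1e4, 1e5, 1e6, 1e7):
    print(f"{gates:>12,.0f} gates -> {io_from_gates(gates, K):8.0f} IO")
```

Under this form, each tenfold increase in gate count multiplies the IO requirement by $10^{1/1.8}$, a factor of about 3.6, whatever the value of `k`.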











#### Synchronous interconnection price comparison


#### <u>Assumptions - price per line:</u>





## **Example: SGI Origin switch-based system architecture**

Electrical interconnect:

Node-to-node access

- 44 signal pins per direction
- up to 5 m electrical cable

0.73 GByte/s peak per direction



Figure: Example IO subsystem block diagram (Node 0 with Proc A and Proc B, graphics, bridge)



**Memory access**

- 0.78 GByte/s peak total
- 0.78 GByte/s sustained total

#### Latency

| Access                     | Latency (ns) |
|----------------------------|--------------|
| Pin to pin hub             | 41           |
| Local memory               | 310          |
| 4P remote memory           | 540          |
| 8P average remote memory   | 707          |
| 16P average remote memory  | 726          |
| 32P average remote memory  | 773          |
| 64P average remote memory  | 867          |
| 128P average remote memory | 945          |

Node-to-node 16 kB page block transfer  $< 30 \,\mu s$ 
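The page-transfer bound implies a minimum effective bandwidth, which can be compared with the 0.73 GByte/s peak link rate quoted above. This is a minimal sketch that assumes 1 kB = 1024 bytes and treats the 30 μs figure as the full transfer time. The constant and function names are illustrative.

```python
PAGE_BYTES = 16 * 1024        # 16 kB page, assuming 1 kB = 1024 bytes
TRANSFER_TIME_S = 30e-6       # upper bound on transfer time, from the slide
LINK_PEAK_GBYTE_PER_S = 0.73  # peak node-to-node rate per direction, from the slide


def effective_gbyte_per_s(n_bytes: float, seconds: float) -> float:
    """Average throughput in GByte/s (1 GByte = 1e9 bytes)."""
    return n_bytes / seconds / 1e9


bw = effective_gbyte_per_s(PAGE_BYTES, TRANSFER_TIME_S)
print(f"effective bandwidth > {bw:.2f} GByte/s")
print(f"fraction of link peak > {bw / LINK_PEAK_GBYTE_PER_S:.2f}")
```

Because the transfer time is an upper bound, the computed bandwidth (about 0.55 GByte/s) is a lower bound on what the link sustains for page transfers.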



### **HIPPI-6400** interconnect



| Dimension  | mm        | inches    |
|------------|-----------|-----------|
| B1         | 96.28 Max | 3.80 Max  |
| B2         | 43.18     | 1.70      |
| B3         | 58.67 Max | 2.31 Max  |
| B4         | 50.80     | 2.00      |
| B5         | 25.40     | 1.00      |
| B6         | 10.92     | 0.43      |
| B7         | 12.70     | 0.50      |
| B8         | 19.05 Max | 0.750 Max |
| B9         | 10.77     | 0.42      |

Copper interconnect:

- Poor form factor (< 1 GByte/s/inch)
- Limited bandwidth (< 1 Gb/s)
- Limited distance (< 50 m)

HIPPI-6400 electrical connector

- 2.31"x0.75" edge dimension
- 2 row, 100 pin connector (23x2x2=92 signals)
- 0.664" diameter cable up to 50 m
- 6" bend radius for cable
- 2 Byte data 500 Mb/s per direction, 4b/5b coding
- 0.8 GByte/s peak bandwidth per direction, 1.6 GByte/s bisection bandwidth (bisection bandwidth density 0.7 GByte/s/inch or 2.2 Gb/s/cm)
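The bandwidth figures in the list above can be reproduced from the connector parameters. The sketch below assumes that 500 Mb/s is the coded line rate on each of the 16 data lines, so 4b/5b coding leaves 4/5 of it as payload, and that the density is taken along the 2.31" connector edge. Variable names are illustrative.

```python
DATA_LINES = 16            # 2 Byte wide data path per direction
LINE_RATE_BPS = 500e6      # 500 Mb/s per line (assumed to be the coded rate)
CODING_EFFICIENCY = 4 / 5  # 4b/5b coding
EDGE_INCHES = 2.31         # long edge of the connector
CM_PER_INCH = 2.54

# Payload bandwidth per direction, in GByte/s.
peak_gbyte_per_s = DATA_LINES * LINE_RATE_BPS * CODING_EFFICIENCY / 8 / 1e9

# Both directions together give the bisection bandwidth.
bisection_gbyte_per_s = 2 * peak_gbyte_per_s

# Bisection bandwidth per unit of connector edge length.
density_gbyte_per_inch = bisection_gbyte_per_s / EDGE_INCHES
density_gbit_per_cm = density_gbyte_per_inch * 8 / CM_PER_INCH

print(f"peak per direction: {peak_gbyte_per_s:.1f} GByte/s")
print(f"bisection:          {bisection_gbyte_per_s:.1f} GByte/s")
print(f"density:            {density_gbyte_per_inch:.1f} GByte/s/inch, "
      f"{density_gbit_per_cm:.1f} Gb/s/cm")
```

The results round to the 0.8 GByte/s, 1.6 GByte/s, 0.7 GByte/s/inch, and 2.2 Gb/s/cm values listed on the slide.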



### **High-performance opto-electronic interface to CMOS**

Potential power savings using all-optical signal processing



"A 55 Gb/s/cm data bandwidth density interface in 0.5 μm CMOS ...", B. Madhavan and A. F. J. Levi, Electron. Lett. **34**, 1846-1847 (1998)




#### USC electrical and optical system test between two Pentium hosts



- Electrical test fixture for LA chip requires 40 coaxial cables
- HP POLO-2 module and LA chip requires 2 ribbon fibers



# Multi-GByte/s data rate per linear cm



HP PONI Tx-module

HP PONI module + USC interface IC supports more than 4x the capacity of the IBM ATM switch in a 1 cm form factor



IBM 8265 Nways ATM switch 12.8 Gb/s capacity, ~ 1 m back-plane

- 12-wide fiber ribbon, 2.5 Gb/s per fiber signaling, and MTP connector
- BGA surface mount to PCB




- Alpha 21264, 9.6M transistors, 600 MHz, 2.4 BIPS, 64b, 45 W, internal V<sub>DD</sub> = 2.4 V, 14.4 x 14.5 = 208 mm<sup>2</sup>, four-metal 0.35 μm CMOS, 499-pin PGA package
- 11 W (25%) power for clock distribution
- 8 kB Icache, 8 kB Dcache
- 96 kB L2 cache



Source: http://infopad.EECS.Berkeley.EDU/HotChips8/1.2/1.2.10.html



Alpha CPU clock skew in a 1.7M-transistor, 30 W, 200 MHz, 400 MIPS, 64b chip: $V_{DD} = 3.3$ V, 16.8 x 13.9 mm<sup>2</sup>, three-metal 0.75 μm CMOS, 431-pin PGA package. 12 W for clock distribution.

Source: Digital Technical Journal **4**, 1 (1992)
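For comparison, the share of chip power spent on clock distribution can be computed directly from the two sets of figures quoted on these slides (11 W of 45 W, and 12 W of 30 W). This is a small arithmetic sketch; the dictionary labels are illustrative shorthand, not names from the source.

```python
def clock_power_fraction(clock_w: float, total_w: float) -> float:
    """Fraction of total chip power used for clock distribution."""
    return clock_w / total_w


# (clock distribution power in W, total chip power in W), from the slides
chips = {
    "600 MHz Alpha, 0.35 um CMOS": (11.0, 45.0),
    "200 MHz Alpha, 0.75 um CMOS": (12.0, 30.0),
}

for name, (clock_w, total_w) in chips.items():
    share = clock_power_fraction(clock_w, total_w)
    print(f"{name}: {share:.0%} of power in clock distribution")
```

In both cases clock distribution alone takes roughly a quarter or more of the power budget.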



### **Microprocessor - DRAM performance gap**

- Average CPU clock rate doubles every 18 months
- Main memory data transfer speeds increase 10% every 18 months
- Consequently, conventional interconnects cannot deliver performance that matches the improvement in CPU clock rate
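The two growth rates above compound into a widening gap. The sketch below assumes both trends hold exactly, with each 18-month period multiplying CPU clock rate by 2 and memory transfer speed by 1.10. The function name and the chosen time points are illustrative.

```python
def performance_gap(months: float,
                    period_months: float = 18.0,
                    cpu_factor: float = 2.0,
                    memory_factor: float = 1.10) -> float:
    """Ratio of CPU clock growth to memory transfer-speed growth after `months`."""
    periods = months / period_months
    return (cpu_factor / memory_factor) ** periods


for years in (1.5, 3, 6, 9):
    gap = performance_gap(12 * years)
    print(f"after {years:>4} years: CPU/memory gap grows by x{gap:.1f}")
```

Under these assumptions the gap grows by a factor of about 1.8 every 18 months, which is more than tenfold after six years.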



- 1997 - Jan. Intel Pentium MMX (150 - 233 MHz)
- 1997 - 2Q AMD K6 MMX (233 - 300 MHz)
- 1997 - 2Q Intel Pentium II (233 - 300 MHz)
- 1998 - Intel Deschutes (333 - 450 MHz)
- 1999 - Intel Pentium III (450 - 550 MHz)
- 1999 - Intel Merced

| Processor        | Clock (MHz)        | SPECint95 | SPECfp95 |
|------------------|--------------------|-----------|----------|
| Pentium II       | 266                | 10.8      | 6.9      |
| DEC Alpha        | 266                | 7.9       | 11.8     |
| Pentium II       | 300                | 11.6      | 7.2      |
| DEC Alpha        | 333                | 9.8       | 12.5     |
| Pentium II       | 450                | 18        | 13.3     |
| Pentium III      | 450                | 18.7      | 13.7     |
| DEC Alpha        | 500                | 15.0      | 20.4     |
| Pentium III      | 500                | 20.6      | 14.7     |



Source: Hennessy and Patterson, "Computer architecture", Morgan Kaufmann (1996)



- 2.5M transistors, 160 MHz, 2.1 MIPS, 32b, 0.5 W, internal $V_{DD} = 1.6$ V, $7.8 \times 6.4 = 50\ \text{mm}^2$, three-metal 0.35 μm CMOS, 144-pin QFP package
- To minimize pin power and support a high-speed internal core, 50% of chip area is devoted to a 16 kB Dcache and a 16 kB Icache
- 90% of the transistors are devoted to Dcache and Icache
- The pad ring occupies 33% of chip area and the processor core fills the remaining 17% of chip area.



Source: IEEE J. Solid-State Circuits **31**, 1703 (1996)

| Block         | Power dissipation |
|---------------|-------------------|
| Icache        | 27%               |
| Ibox          | 18%               |
| Dcache        | 16%               |
| Clock         | 10%               |
| IMMU          | 9%                |
| Ebox          | 8%                |
| DMMU          | 8%                |
| Write buffer  | 2%                |
| Bus interface | 2%                |
| PLL           | <1%               |







f = 10 GHz

- Total 8, 32-bit instructions / f
- Total 2, 64-bit data / f
- Power = 100 W

Scalar:

- 800 Gb/s communication channel
- < 4 W total micro-photonic power
- **162 W electrical Rambus power consumption**

Vector:

- 4.8 Tb/s communication channel
- < 19 W total micro-photonic power
- 7.8 kW electrical Rambus power consumption



### The Rambus approach

# Physical organization of a Rambus-based system






- Reduce high-speed communication power (x40 less than Rambus)
- Lower electrical noise in system (x10 less dI/dt)
- Backplane interconnects longer than 0.1 m can maintain bisection bandwidth (multiple Tb/s)
- WDM *all-optical* functionality can give ps node routing latency (useful for < 1 m), deadlock protection, and collisionless adaptive routing

Figure label: backplane interconnect, > 0.1 m


# Future MCM-L packaging for direct CMOS to optical I/O

Combines CMOS VLSI function with dense high-speed optical I/O

- Standard-cell opto-electronic CMOS interface
- Separate optical and electrical thermal management using integrated heat spreader
- Signal rate > 2.5 Gb/s per I/O





Ultra-low node latency using emerging functional micro-photonic technology. WDM all-optical routing using active micro-resonators has the potential for ps node latency and deadlock protection.



Figure label: integrated with electrical laminate



# **Increasing functionality of micro-photonic devices**

- Challenge: Rapid switching of optical power from one high-Q resonator to another.
  - $U(t) = U_0\, e^{-\omega_0 t / Q}$, where $U_0 = U(t = 0)$
- **Functionality beyond point-to-point interconnect: Example** 
  - System Area Network MAC state machine



Transfer of optical power between resonators



- (i) Spoil Q to rapidly transfer optical power
- (ii) Transient dynamics of cavity formation
- (iii) Typical values: $\lambda_0 = 1.5\ \mu\mathrm{m}$, $\omega_0 = 10^{15}\ \mathrm{rad\,s^{-1}}$, $Q \sim 10^3 - 10^4$, and $\tau_{ph} \sim 1 - 10\ \mathrm{ps}$
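The quoted photon lifetimes follow from the exponential energy decay given earlier on this slide, if $\tau_{ph}$ is taken as its 1/e time, $\tau_{ph} = Q/\omega_0$. The sketch below makes that assumption and uses the slide's typical values. The function names are illustrative.

```python
import math

OMEGA_0 = 1e15  # optical angular frequency in rad/s, typical value from the slide


def photon_lifetime_ps(q: float, omega_0: float = OMEGA_0) -> float:
    """1/e decay time of stored energy, tau_ph = Q / omega_0, in picoseconds."""
    return q / omega_0 * 1e12


def stored_energy_fraction(t_ps: float, q: float, omega_0: float = OMEGA_0) -> float:
    """U(t) / U(0) = exp(-omega_0 * t / Q) for a resonator left to decay."""
    return math.exp(-omega_0 * t_ps * 1e-12 / q)


for q in (1e3, 1e4):
    print(f"Q = {q:.0e}: tau_ph = {photon_lifetime_ps(q):.0f} ps, "
          f"energy left after 1 ps = {stored_energy_fraction(1.0, q):.2f}")
```

This gives 1 ps for $Q = 10^3$ and 10 ps for $Q = 10^4$, matching the range on the slide. It also shows why spoiling Q (lowering it) speeds up the transfer of optical power out of a resonator.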




# New system design optimization and functionality

- any distance for a given bandwidth
- less heat because of distributed nature of system enabled by optics
- reduce I/O count
- lower power than copper
- low EMI
- low cost
- ps node latency, deadlock protection, adaptive routing
- **Getting technology from here to there** 
  - Adoption helped by one-stop *technology shopping* for
    - standard packaging and board-level integration
    - standard CMOS library cells
    - standard OSA footprint
    - proven reliability as good or better than copper
    - demonstration systems and applications



- **Optoelectronics in CMOS-based systems** 
  - need data-com to keep component cost low
  - need availability
  - need standards that *help* the designer
    - complete design support
      - library cells, evaluation boards, mechanical, system testing, software
  - need compelling system demonstrations

new architectures, new functions, higher performance, reduced cost
 integration with software

# **DARPA**

- Focus
  - Integrated optoelectronic / CMOS inside systems
  - **WDM** microphotonic functionality *inside* systems
- Support one-stop technology shopping