CIFAR-10 Binary CNN Processor on Chip

  • mixed-signal binary convolutional neural network
    • 3.8 uJ/classification (forward pass)
    • 86% accuracy
  • BinaryNet {+1 -1}
    • multiplication to XNOR
    • weight stationary
    • data-parallel (all multiplies in parallel (?))
    • input reuse
    • wide vector sum as energy bottleneck
  • 28 nm CMOS
  • 328 kB on chip SRAM
  • 237 frame/s
  • 0.9 mW at 0.6 V, which at 237 frame/s means 3.8 uJ/classification
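With weights and activations constrained to {+1, -1}, each multiply reduces to an XNOR on the {0, 1} bit encodings and a dot product to a popcount. A minimal sketch of that reduction (encoding +1 as bit 1 and -1 as bit 0; names are illustrative, not from the paper):

```python
import random

def binary_dot(a_bits: int, w_bits: int, n: int) -> int:
    """Dot product of two {+1,-1} vectors packed as n-bit integers.

    Encoding: bit 1 represents +1, bit 0 represents -1.
    (+1)*(+1) = (-1)*(-1) = +1, so the elementwise product is XNOR.
    """
    xnor = ~(a_bits ^ w_bits) & ((1 << n) - 1)  # 1 wherever signs agree
    ones = bin(xnor).count("1")                  # popcount
    return 2 * ones - n                          # (+1)*ones + (-1)*(n - ones)

# Check against the naive signed computation.
n = 16
a = [random.choice([+1, -1]) for _ in range(n)]
w = [random.choice([+1, -1]) for _ in range(n)]
pack = lambda v: sum(1 << i for i, x in enumerate(v) if x == +1)
assert binary_dot(pack(a), pack(w), n) == sum(x * y for x, y in zip(a, w))
```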

Intro

  • problem: DNNs must perform millions to billions of MACs per inference

  • weight stationary
  • compute-in-memory (CIM) (?)
  • CMOS-inspired, hardware specialization

  • output image pixels are binarized
  • always uses 2x2 filters, with 256 channels and 256 filters per layer
    • low fan-out de-multiplexers

No Hidden Fully Connected Layers

  • BinaryNet requires 1.67 MB of weight memory
    • 558 kB for 6 CNN layers
    • 1.13 MB for 3 FC layers
  • this design uses only 261.5 kB
    • 256 kB for 8 CNN layers
    • 5.5 kB for 1 FC layer
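The 256 kB CNN figure is consistent with eight layers of 2x2 filters at 256 input channels and 256 filters each, one bit per weight. A back-of-the-envelope check (layer dimensions assumed from the bullets elsewhere in these notes):

```python
# 2x2 filter window, 256 input channels, 256 filters, 1 bit per weight
bits_per_layer = 2 * 2 * 256 * 256
kb_per_layer = bits_per_layer / 8 / 1024  # bits -> bytes -> kB
total_kb = 8 * kb_per_layer               # 8 CNN layers

print(kb_per_layer, total_kb)  # 32.0 kB per layer, 256.0 kB total
```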

Filter Computation

  • since weights and activations are {+1, -1} and the activation function is sign(), batch norm simplifies to a per-channel threshold comparison
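Because the output of batch norm is immediately passed through sign(), the normalization never has to be computed at inference: sign(gamma*(x - mu)/sigma + beta) equals a comparison of the pre-activation x against a precomputed per-channel threshold (with the comparison direction flipped when gamma is negative). A sketch of the equivalence (parameter values are arbitrary):

```python
def bn_sign(x, gamma, beta, mu, sigma):
    """Reference: batch norm followed by sign()."""
    y = gamma * (x - mu) / sigma + beta
    return 1 if y >= 0 else -1

def threshold_sign(x, gamma, beta, mu, sigma):
    """Equivalent thresholded form: no scaling or division at inference."""
    t = mu - beta * sigma / gamma  # per-channel threshold, precomputed offline
    if gamma >= 0:
        return 1 if x >= t else -1
    return 1 if x <= t else -1     # gamma < 0 flips the inequality

# The two forms agree for positive and negative gamma.
for x in range(-10, 11):
    assert bn_sign(x, 0.5, 1.2, 3.0, 2.0) == threshold_sign(x, 0.5, 1.2, 3.0, 2.0)
    assert bn_sign(x, -0.7, 0.3, -1.0, 1.5) == threshold_sign(x, -0.7, 0.3, -1.0, 1.5)
```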

Top Level Architecture

  • each input pixel is quantized to 7 bits
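Only the first layer sees multi-bit inputs; everything downstream is binary. A hypothetical 7-bit quantizer for a normalized pixel value (illustrative only; the chip's actual input encoding may differ):

```python
def quantize7(x: float) -> int:
    """Map a normalized pixel value in [0, 1] to a 7-bit code (0..127).

    Illustrative sketch, not the chip's actual encoding.
    """
    x = min(max(x, 0.0), 1.0)  # clamp out-of-range values
    return round(x * 127)

assert quantize7(0.0) == 0 and quantize7(1.0) == 127
```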

Comparison

Summary

  • BinaryNet with XNOR operations
  • network architecture designed to work well with CMOS hardware
    • low weight memory
    • memory cost is amortized: weight stationary, data parallel, input reuse
  • energy-efficient switched-capacitor (SC) neuron?