Domain Specific Architectures | Sahithyan's Notes

Custom hardware designs optimized for a specific class of applications. Flexibility is traded for higher performance and energy efficiency in a given domain.

General-purpose CPUs reached performance limits despite Moore’s Law. Even more performance improvements can be achieved using dedicated hardware designs for the given domain. However, designing a new silicon chip costs a significant amount of time and money.

Examples:

GPUs
Neural / Tensor Processing Units (NPUs, TPUs)
Crypto co-processors
Image / Vision accelerators

Design Guidelines

Use on-chip memories to reduce data movement.
Invest in arithmetic units or on-chip memory rather than control logic.
Choose parallelism that best fits the domain (SIMD, systolic array, etc.).
Use the most suitable data precision.
Support a domain-specific programming language or API (TensorFlow, Halide).

Deep Neural Network Architectures

Multi-Layer Perceptron (MLP)

Each neuron computes a weighted sum and activation. Number of operations per layer = 2 × input_dim × output_dim.

Convolutional Neural Network (CNN)

Layers extract hierarchical features (edges → corners → shapes → objects).
High computation and weight reuse.
Core operations: matrix–matrix multiply, convolution, ReLU, sigmoid, tanh.

Recurrent Neural Network (RNN)

Used for sequence data; recurrent connections across time steps.
Many matrix–vector multiplies per cell; operations/weight ≈ 2.

Tensor Processing Unit (TPU)

Google’s ASIC for neural networks.
256×256 8-bit matrix-multiply unit.
Uses large on-chip memory and acts as a PCIe co-processor.
Instruction set includes:
- Read/Write host memory
- Read weights
- Matrix multiply / convolution
- Activation functions

Design Approach

Dedicated memories (24 MiB buffer).
Many arithmetic units (250× vs CPU).
Exploits 2D SIMD parallelism.
8-bit integers.
Programmed via TensorFlow.

Microsoft Catapult

FPGA-based accelerator for CNNs and search ranking.
Connected in a 6×8 torus network.
Uses Verilog RTL; 3926 ALUs, 5 MiB memory.
Parallelism types:
- 2D SIMD for CNN
- MISD for search ranking.
Uses both 8-bit integers and 64-bit floating point.

Intel Crest

DNN training accelerator.
16-bit fixed-point arithmetic.
Works on 32×32 matrix blocks.
Uses SRAM + high-bandwidth HBM2 memory.

Google Pixel Visual Core

Image processing and vision accelerator.
Performs stencil operations for imaging.
Uses Halide (DSL) → compiled to virtual and physical ISAs.
Optimized for energy and 2D access patterns.
Each core: 128+64 MiB memory, 16×16 2D array processing elements, 2D SIMD & VLIW.
Uses 8- and 16-bit integers.

Common DSA Principles

Principle	Example
Dedicated memory	TPU (unified buffer), Visual Core SRAM
Domain-matched parallelism	2D SIMD, systolic arrays
Reduced data precision	8- or 16-bit fixed-point
Domain language	TensorFlow, Halide
Arithmetic-focused design	Many ALUs vs small control

Fallacies and Pitfalls

Designing custom chips costs >$100M.
Performance counters added too late.
Wrong DNN task selection can waste effort.
Metrics like IPS may not capture full performance.
Ignoring past architectural lessons leads to repeat mistakes.