Carnegie Mellon University

Accelerating Sparse Matrix Kernels with Co-Optimized Architecture

thesis
posted on 2018-12-01, authored by Fazle Sadi
This dissertation presents an architecture to accelerate sparse matrix linear algebra, which is among the most important numerical methods for numerous science and engineering domains, such as graph analytics, in the current big-data-driven era. Sparse matrix operations, especially on large unstructured matrices with very few nonzeros, lack guaranteed temporal and/or spatial locality, making them inherently unsuitable for cache-based general purpose commercial off-the-shelf (COTS) architectures. This lack of locality causes data-dependent, high-latency memory accesses and eventually exhausts the limited load buffers of COTS architectures. As a result, sparse kernels run at a small fraction of peak machine speed and achieve poor off-chip bandwidth saturation despite finely tuned software libraries/frameworks and optimized code. The poor utilization of compute and memory resources on COTS architectures indicates significant room for improvement using existing technology. However, a computation paradigm that does not depend on data locality is essential if sparse operations are to reach the same level of performance that BLAS-like standards on COTS hardware have delivered for dense matrix linear algebra.
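
To make the locality problem concrete, here is a minimal sparse matrix-dense vector multiplication (SpMV) in the standard compressed sparse row (CSR) format; this is an illustrative sketch, not code from the dissertation. The gather x[col[k]] is data dependent, so on a large unstructured matrix it touches essentially random cache lines of the dense vector, which is exactly the access pattern that defeats cache-based COTS machines.

```python
import numpy as np

def spmv_csr(row_ptr, col, val, x):
    """y = A @ x with A stored in CSR arrays (row_ptr, col, val).

    The access x[col[k]] is data dependent: its address is known only
    after col[k] has been loaded, and for unstructured matrices it jumps
    across the vector with little spatial or temporal locality.
    """
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += val[k] * x[col[k]]   # irregular, latency-bound gather
        y[i] = acc
    return y

# tiny example: 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]]
row_ptr = [0, 2, 3, 5]
col     = [0, 2, 1, 0, 2]
val     = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(row_ptr, col, val, np.ones(3)))  # [3. 3. 9.]
```
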
This work develops an algorithm/hardware co-optimized architecture that provides a data-locality-independent platform for sparse matrix operations. The proposed architecture is founded on a fundamental principle of trading streaming bandwidth and compute to avoid high-latency accesses. This principle stems from the counterintuitive insight that the minimally required irregular memory accesses for sparse operations generally incur high traffic transferred at slow speed, whereas more regular accesses can reduce overall traffic and speed up transfer through better use of block-level data. This work finds that a scalable, high-performance, and parallelizable multi-way merge network, which is absent in the current literature, is the core hardware primitive required to build the proposed architecture. Through both algorithmic and circuit-level techniques, this work develops a novel multi-way merge hardware primitive that effectively eliminates high-latency accesses for sparse operations. It also demonstrates methodologies that avoid a strong dependence on fast random-access on-chip memory for scaling, which is a major limiting factor of current custom hardware solutions when handling very large problems.
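
For illustration, the following software sketch (my own simplification, not the dissertation's hardware design) shows what a multi-way merge with accumulation computes: k index-sorted streams of (index, value) pairs are merged into one sorted stream, with values at equal indices summed. Because the inputs are read sequentially and the output is produced sequentially, the reduction needs no high-latency random accesses; a hardware merge network targets the analogous operation at streaming bandwidth.

```python
import heapq

def multiway_merge_accumulate(streams):
    """Merge k index-sorted streams of (index, value) pairs,
    summing values that share the same index.

    All traffic is sequential: each input stream is consumed in order
    and the merged output is emitted in order, so no scatter/gather
    into a random-access buffer is required.
    """
    out = []
    for idx, v in heapq.merge(*streams):     # k-way merge by index
        if out and out[-1][0] == idx:
            out[-1] = (idx, out[-1][1] + v)  # accumulate duplicate indices
        else:
            out.append((idx, v))
    return out

# e.g. sorted partial results from three partitions
s0 = [(0, 1.0), (4, 2.0)]
s1 = [(0, 3.0), (2, 1.5)]
s2 = [(2, 0.5), (4, 1.0)]
print(multiway_merge_accumulate([s0, s1, s2]))
# [(0, 4.0), (2, 2.0), (4, 3.0)]
```
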
Using a common custom platform, this work presents implementations of Sparse Matrix dense Vector multiplication (SpMV), iterative SpMV, and Sparse General Matrix-Matrix multiplication (SpGEMM), which are core kernels for a broad range of graph analytics applications. A number of architecture- and circuit-level optimization techniques are developed for reducing off-chip traffic and improving computation throughput so as to saturate the extreme off-chip streaming bandwidth provided by state-of-the-art 3D stacking technology. Our proposed custom hardware is demonstrated on ASIC (fabricated in 16nm FinFET) and FPGA platforms and evaluated against state-of-the-art COTS and custom hardware solutions. Experimental results show more than an order of magnitude improvement over current custom hardware solutions, and more than two orders of magnitude improvement over COTS architectures, in both performance and energy efficiency. This work is intended to be accessed through a software stack provided by GraphBLAS-like [1] standards, so that the broadest possible audience can utilize this architecture through a well-defined and concise set of matrix-based graph operations.
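
As a hint of how these kernels map to graph analytics in the GraphBLAS-style view of graph algorithms as matrix operations, the sketch below computes PageRank by power iteration, where each iteration is one SpMV over the graph's transition matrix. This is an illustrative example only; the example graph, damping factor, and helper name are my own and are not taken from the dissertation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(adj, damping=0.85, iters=50):
    """PageRank by power iteration: each step is a single SpMV.

    adj[i, j] = 1 if the graph has an edge i -> j (a dense array here
    only to keep the example tiny; real graphs stay sparse throughout).
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1.0                 # guard against dangling nodes
    P = csr_matrix((adj / out_deg[:, None]).T)  # column-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = damping * (P @ r) + (1.0 - damping) / n   # one iterative-SpMV step
    return r

# 4-node example: 0->1, 0->2, 1->2, 2->0, 3->2
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj))
```
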

History

Date

2018-12-01

Degree Type

  • Dissertation

Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Larry Pileggi, Franz Franchetti
