Research Statement

My research focuses on the interface between hardware and software because the best solutions to hard computer engineering problems use both. High-performance, secure, and low-power systems rely on software that appropriately uses well-designed hardware.

Hardware-based Performance and Power Estimation

Modern processors are constrained by power and thermal limits. This "dark silicon" problem means that only a fraction of a chip's cores can run at full speed at any point in time. Similarly, software must deal with energy restrictions in mobile devices. In both cases, it is imperative that the hardware be configured to best run the application at the appropriate power point.

Appropriately reconfiguring a chip to match the application it is running requires the ability to monitor that application and estimate how hardware changes will affect it. My papers at USENIX ATC 2014, MICRO 2014, and HPCA 2014 all focus on mechanisms for estimating how the power and performance of applications change as CPUs and GPUs are reconfigured.
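To illustrate the flavor of these estimators, the C sketch below shows a simple frequency-scaling model of the kind often used as a baseline: it assumes runtime can be split into compute time, which scales with clock frequency, and memory-stall time, which does not. The function name and the two-component split are illustrative assumptions for exposition, not the specific models from the papers above.

    /* Illustrative sketch of a simple DVFS performance estimator:
     * split measured runtime into compute-bound time (which scales
     * with frequency) and memory-stall time (which stays roughly
     * constant), then predict runtime at a different clock speed.
     * Real predictors are considerably more sophisticated. */
    double predict_runtime(double compute_s, double memory_s,
                           double f_now_ghz, double f_new_ghz)
    {
        /* Compute time shrinks as the clock rises;
         * memory-stall time is unaffected by the core clock. */
        return compute_s * (f_now_ghz / f_new_ghz) + memory_s;
    }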

As my paper at ModSim 2013 points out, these techniques are also useful for building high-level estimates of system performance. The TOP-PIM paper that AMD Research presented at HPDC 2014 is one example of this approach in action.

Sparse BLAS Primitives

Basic Linear Algebra Subprograms (BLAS) are an important set of functions that are broadly used in the computational sciences. These primitives, such as matrix-matrix multiplication, can be composed to build much more complicated applications. Because these functions are well defined, it is possible to write highly optimized versions that accelerate many applications at once. AMD's Core Math Library, Intel's Math Kernel Library, and Nvidia's cuBLAS are examples of such highly optimized BLAS libraries.
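To make these primitives concrete, here is a deliberately naive C sketch of dense matrix-matrix multiplication, the operation behind BLAS's GEMM routine. Optimized libraries implement this same contract but replace the simple loops with blocked, vectorized, and multithreaded code.

    /* Naive dense matrix-matrix multiply: C = A * B.
     * A is m x k, B is k x n, C is m x n, all row-major.
     * Optimized BLAS libraries compute the same result with
     * cache blocking, vectorization, and threading. */
    void naive_gemm(int m, int n, int k,
                    const double *A, const double *B, double *C)
    {
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int p = 0; p < k; p++)
                    sum += A[i * k + p] * B[p * n + j];
                C[i * n + j] = sum;
            }
        }
    }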

Sparse BLAS functions work on sparse data – vectors or matrices whose entries are mostly zero. Such functions are of growing importance in modern high-performance computing, because sparse structures compactly represent many large real-world problems, such as graphs and the systems of equations that arise in physical simulations. As such, some of my recent work focuses on building optimized sparse BLAS functions for modern and future hardware designs.

My work at SC14, for instance, described a new method for performing sparse matrix-vector multiplication, which we called CSR-Adaptive. This work, done in collaboration with Mayank Daga, presented a new way to perform this important algorithm on GPUs without changing the underlying data structure, a limitation of most previous approaches. We also have a follow-on paper at HiPC 2015 that describes methods of further accelerating this algorithm. The guiding principle for these works was to pay attention to the GPU's memory system and to structure the software algorithm to use it well.
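For context, the C sketch below shows a minimal serial sparse matrix-vector multiplication over the standard compressed sparse row (CSR) format. This is only an illustrative baseline, not the CSR-Adaptive algorithm itself; the contribution of the papers above is to parallelize and load-balance this computation on a GPU while leaving the CSR arrays unchanged.

    /* Serial sparse matrix-vector multiply, y = A*x, over CSR.
     * row_ptr[i]..row_ptr[i+1] delimit row i's entries in
     * col_idx[] and vals[]. CSR-Adaptive keeps this layout
     * intact and instead reorganizes how GPU threads share
     * the work across rows of very different lengths. */
    void csr_spmv(int num_rows,
                  const int *row_ptr, const int *col_idx,
                  const double *vals, const double *x, double *y)
    {
        for (int i = 0; i < num_rows; i++) {
            double sum = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
                sum += vals[j] * x[col_idx[j]];
            y[i] = sum;
        }
    }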

The algorithms described in these papers are now part of the open source clSPARSE library, which AMD helped develop. In addition, an independently reimplemented version of CSR-Adaptive is available in the ViennaCL library.

Hardware for Low-overhead Software Analysis

Software bugs cost money and can cost lives. Worse still are security flaws that, besides causing economic losses, are now bought and sold on black markets by criminal organizations and government agencies and used as weapons against people both foreign and domestic.

Computer architects must play a role in mitigating this problem. While current processors provide performance-tuning facilities such as performance counters, few analogous features exist to help with correctness. This project, which was the focus of my Ph.D. dissertation, contended that future architectures would be defined not just by their performance, but also by their debugging and correctness-checking capabilities.

Dynamic software tests benefit from checking a program under numerous input conditions, but their runtime slowdowns are far too high for widespread adoption. Hardware can make this approach to software checking tenable. The Testudo project looked at ways to let individual users each analyze a small portion of a program's dynamic dataflows. My work at MICRO-41 detailed simple hardware additions that enable this, while my work at CGO 2011 showed how to do it with no additional hardware. I also wrote a position paper for PLAS 2011 that argues for the general concept of sampling dynamic analyses.
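The idea behind sampling, sketched below in hypothetical C, is that each instrumented event pays for the expensive analysis only with small probability, so any individual user sees little overhead while a large population of users collectively covers most of the program's dynamic behavior. All names in this sketch are illustrative, not from the systems above.

    #include <stdlib.h>

    #define SAMPLE_ONE_IN 1000  /* check ~0.1% of events */

    /* Placeholder for a heavyweight analysis, e.g. a dataflow
     * or taint-propagation check on the accessed address. */
    static void expensive_check(const void *addr)
    {
        (void)addr;
    }

    /* Called at each instrumented memory event; runs the full
     * check only on a sampled subset, keeping overhead low. */
    static inline void maybe_check(const void *addr)
    {
        if (rand() % SAMPLE_ONE_IN == 0)
            expensive_check(addr);
    }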

I also looked at using hardware to accelerate dynamic analyses. Dynamic data race detectors, for instance, search for data races: shared-memory accesses that are not properly synchronized. My work at ISCA 2011, which was done while I was an intern at Intel, looked at ways to repurpose hardware performance counters to help find data races with much lower runtime overhead. Similarly, my work at ASPLOS 2012 described how numerous software analyses could utilize a hardware system that allows a virtually unlimited number of fine-grained data watchpoints.
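As a concrete example of the kind of bug these detectors hunt, the short C program below contains a classic data race: two threads increment a shared counter without any synchronization, so increments can be lost. A dynamic race detector flags the two conflicting, unsynchronized accesses to the counter.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;  /* shared, unprotected */

    /* Each thread performs a non-atomic read-modify-write on
     * the shared counter, so concurrent increments can be lost. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;  /* racy access */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Often prints less than 2000000 because of the race. */
        printf("counter = %ld\n", counter);
        return 0;
    }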