Updated June 25, 2024

Work Experience
  • Advanced Micro Devices, Inc. – Fellow
    August 2012 - Present
    • Software architect for AMD’s Instinct GPUs, handling the design of HW, FW, and SW interactions
      • Co-designed HW/SW mechanisms for kernel dispatch, GPU cache coherence, virtual memory optimization, performance monitoring, DMA engines, and RAS
      • Drove requirements gathering from SW teams to create dozens of new HW architectural features
      • Led debug, workaround development, and customer communication for multiple post-silicon issues
      • Created and led internal and customer training on AMD accelerators, including performance optimization and deep dives on microarchitecture, coherence, and memory management
    • Performance engineer responsible for optimizing SW, HW, and FW for GPU compute solutions
      • Architected and implemented multiple GPGPU software features, including HIP Cooperative Groups
    • Designed, implemented, and published leading GPGPU algorithms for HPC math libraries, including:
      • Sparse matrix-vector multiplication algorithm that is up to 36% faster than the previous state of the art
      • Sparse triangular solve algorithm that is 34% faster than industrial competition
    • Awarded 24 US patents; 11 patent submissions pending; 25 conference and 7 workshop publications
  • University of Michigan – Graduate Student Research Assistant
    May 2007 - August 2012
    • Identified methods of distributing software analyses across many users to reduce slowdowns
    • Managed graduate and undergraduate students through development of prototype systems
  • University of Michigan – Graduate Student Instructor
    January 2012 - April 2012
    • Led discussions and evaluated projects for a graduate-level parallel computer architecture course
  • Kelly Services / Intel Corp. – Research Contractor
    May 2010 - October 2010
    • Researched HW & SW approaches for improving the speed of the Intel Inspector XE data race detector
  • International Business Machines Corp. – Speed Team Intern
    May 2008 - August 2008
    • Designed and built an InfiniBand verification suite that caught multiple bugs in IBM PowerVM firmware
  • University of Illinois – Teaching Assistant
    January 2005 - August 2006
    • Taught discussion sections and graded for undergraduate computer architecture and digital logic courses
Education
  • University of Michigan, Ann Arbor
    Ph.D., Computer Science and Engineering
    May 2012
    Advisor: Prof. Todd Austin
    Dissertation Topic: Hardware Mechanisms for Distributed Dynamic Software Analysis
  • University of Michigan, Ann Arbor
    M.S.E. Computer Science and Engineering
    May 2008
    Concentration: Hardware Systems
    GPA: 7.73/9.0 (3.79/4.0)
  • University of Illinois at Urbana-Champaign
    B.S. Computer Engineering with Honors
    May 2006
    Minor: International Engineering – Japanese
    GPA: 3.71/4.0
Selected Publications
  • Raghavendra Pradyumna Pothukuchi, Joseph L. Greathouse, Karthik Rao, Christopher Erb, Leonardo Piga, Petros Voulgaris, Josep Torrellas, "Tangram: Integrated Control of Heterogeneous Computers," in the Proceedings of the 52nd IEEE/ACM International Symposium on Microarchitecture (MICRO-52), October, 2019
  • Arkaprava Basu, Joseph L. Greathouse, Guru Venkataramani, Ján Veselý, "Interference from GPU System Service Requests," in the Proceedings of the 2018 IEEE International Symposium on Workload Characterization (IISWC), September, 2018 – Nominated for Best Paper
  • Vignesh Adhinarayanan, Indrani Paul, Joseph L. Greathouse, Wei Huang, Ashutosh Pattnaik, Wu-chun Feng, "Measuring and Modeling On-Chip Interconnect Power on Real Hardware," in the Proceedings of the 2016 IEEE International Symposium on Workload Characterization (IISWC), September, 2016 – Awarded Best Paper
  • Gene Wu, Joseph L. Greathouse, Alexander Lyashevsky, Nuwan Jayasena, Derek Chiou, "GPGPU Performance and Power Estimation Using Machine Learning," in the Proceedings of the 21st IEEE Symposium on High Performance Computer Architecture (HPCA), February, 2015
  • Joseph L. Greathouse, Mayank Daga, "Efficient Sparse Matrix-Vector Multiplication on GPUs using the CSR Storage Format," in the Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November, 2014
  • Bo Su, Joseph L. Greathouse, Junli Gu, Michael Boyer, Li Shen, Zhiying Wang, "Implementing a Leading Loads Performance Predictor on Commodity Processors," in the Proceedings of the 2014 USENIX Annual Technical Conference (USENIX ATC), June, 2014
  • Joseph L. Greathouse, Zhiqiang Ma, Matthew I. Frank, Ramesh Peri, Todd Austin, "Demand-Driven Software Race Detection using Hardware Performance Counters," in the Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA 2011), June, 2011
Computer Languages and Software Qualifications
  • Programming Languages
    C, C++, HIP, CUDA, OpenCL, Python; x86 assembly; AMD GCN, CDNA, and RDNA assembly
  • Software Systems
    Linux kernel, multiple AMD-internal simulation, firmware, and analysis tools
Software Projects
  • AMD Matrix Instruction Calculator
    Released tool to interactively detail the instructions for AMD GPUs' Matrix Cores and AI Accelerators
    Available at https://github.com/ROCm/amd_matrix_instruction_calculator
  • AMD Research Instruction Based Sampling Toolkit
    Released driver to allow easy access to IBS, AMD’s low-level CPU performance monitoring hardware
    Available at https://github.com/jlgreathouse/AMD_IBS_Toolkit
  • clARMOR – An OpenCL Kernel Buffer Overflow Detector
    Researched and productized an OpenCL kernel buffer overflow detector
    Archived at https://github.com/ROCm/clARMOR
  • clSPARSE – GPU Accelerated Sparse Linear Algebra
    Transferred research on sparse linear algebra algorithms to open source vendor-optimized library
    Available at https://github.com/clMathLibraries/clSPARSE
    Algorithms later used as the computational basis for rocSPARSE library
  • High-Level Performance and Power Simulator
    Created CPU and GPU power and performance models based on scaling real HW measurements
  • Demand-Driven Dynamic Data Race Detection
    Utilized hardware performance counters to dynamically observe shared memory accesses
    Integrated this into Intel Inspector XE race detector, yielding large speedups when little sharing occurs
Relevant Coursework
  • Computer Architecture
  • Parallel Computer Architecture
  • Microarchitecture
  • Enterprise Systems
  • Advanced Operating Systems
  • Advanced Compilers
  • Electronic Circuits
  • IC Device Theory and Fabrication
Awards and Honors
  • Awards at Advanced Micro Devices, Inc.
    • AMD Q1 2024 Next 5% Award for work on AMD Instinct MI300 execution
    • AMD Q3 2022 Next 5% Award for work on breaking the exaflop barrier
    • AMD Q1 2020 Next 5% Award for work on AMD's Frontier supercomputer design win
    • AMD Executive Spotlight Award: Q4 2019, Q4 2020, Q2 2021, Q2 2023 (2x)
    • AMD DCGPU Spotlight Award: Q2 2020, Q1 2021, Q3 2021, Q2 2022, Q4 2022, Q2 2023, Q4 2023 (2x)
    • AMD Research Spotlight Award: Q2 2017
  • Academic Awards and Honors
    • IISWC 2016 Best Paper Award
    • CGO 2011 Best Student Presentation Award
    • Nomination for Best Paper at IISWC 2018
    • Nomination for Best Paper at HPDC 2014
    • 2011 University of Michigan CSE Graduate Student Honors Competition 1st Place
    • University of Michigan EECS Departmental Fellowship, 2006-2007