Papers
2025
ETS: Efficient Tree Search for Inference-Time Scaling
Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
SuperNoVA: Algorithm-Hardware Co-Design for Resource-Aware SLAM
Seah Kim, Roger Hsiao, Borivoje Nikolic, James Demmel, Yakun Sophia Shao
Design Space Exploration of Embedded SoC Architectures for Real-Time Optimal Control
Kris Shengjun Dong, Dima Nikiforov, Widyadewi Soedarmadji, Minh Nguyen, Christopher Fletcher, Yakun Sophia Shao
Virgo: Cluster-level Matrix Unit Integration in GPUs for Scalability and Energy Efficiency
Hansung Kim, Ruohan Richard Yan, Joshua You, Tieliang Vamber Yang, Yakun Sophia Shao
LLM-Aided Compilation for Tensor Accelerators
Charles Hong, Sahil Bhatia, Altan Haan, Shengjun Kris Dong, Dima Nikiforov, Alvin Cheung, Yakun Sophia Shao
Stellar: An Automated Design Framework for Dense and Sparse Spatial Accelerators
Hasan Nazim Genc, Hansung Kim, Prashanth Ganesh, Yakun Sophia Shao
Design Approach for Die-to-Die Interfaces to Enable Energy-Efficient Chiplet Systems
Vikram Jain, Wei Tang, Zuoguo Wu, Viansa Schmulbach, Sophia Shao, Zhengya Zhang, Borivoje Nikolic
FireAxe: Partitioned FPGA-Accelerated Simulation of Large-Scale RTL Designs. ISCA 2024: 501-515
Joonho Whangbo, Edwin Lim, Chengyi Lux Zhang, Kevin Anderson, Abraham Gonzalez, Raghav Gupta
AuRORA: A Full-Stack Solution for Scalable and Virtualized Accelerator Integration
Seah Kim, Jerry Zhao, Krste Asanović, Borivoje Nikolić, Yakun Sophia Shao
Instruction Scheduling in the Saturn Vector Unit
Jerry Zhao, Daniel Grubb, Miles Rusch, Tianrui Wei, Kevin Anderson, Borivoje Nikolic, Krste Asanovic
A CVNN-Aided Anti-Interference Channel Estimation for Massive MIMO Systems
NeCTAr and RASoC: Tale of Two Class SoCs for Language Model Inference and Robotics in Intel 16
Viansa Schmulbach, Jason Kim, Ethan Gao, Nikhil Jha, Ethan Wu, Oliver Yu
An Efficient Sparse Kernel Generator for O(3)-Equivariant Deep Networks
Vivek Bharadwaj, Austin Glover, Aydin Buluc, James Demmel
Fast multiplication of random dense matrices with sparse matrices
Tianyu Liang, Riley Murray, Aydın Buluç, James Demmel
Non-smooth Bayesian optimization in tuning scientific applications
Hengrui Luo, Younghyun Cho, James Weldon Demmel, Igor Kozachenko, Xiaoye S. Li, Yang Liu
Verified Code Transpilation with LLMs
Sahil Bhatia, Jie Qiu, Niranjan Hasabnis, Sanjit A. Seshia, Alvin Cheung
Controlled Preemption: Amplifying Side-Channel Attacks from Userspace
Yongye Zhu, Boru Chen, Zirui Neil Zhao, Christopher W. Fletcher
H-Houdini: Scalable Invariant Learning
Sushant Dinesh, Yongye Zhu, Christopher W. Fletcher
GoFetch: Breaking Constant-Time Cryptographic Implementations Using Data Memory-Dependent Prefetchers
Boru Chen (University of Illinois Urbana-Champaign), Yingchen Wang (University of Texas at Austin), Pradyumna Shome (Georgia Institute of Technology), Christopher Fletcher (University of California, Berkeley), David Kohlbrenner (University of Washington), Riccardo Paccagnella (Carnegie Mellon University), Daniel Genkin (Georgia Institute of Technology)
ConjunCT: Learning Inductive Invariants to Prove Unbounded Instruction Safety Against Microarchitectural Timing Attacks
Sushant Dinesh; Madhusudan Parthasarathy; Christopher W. Fletcher
FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design
Nandeeka Nayak, Xinrui Wu, Toluwanimi O. Odemuyiwa, Michael Pellauer, Joel S. Emer, Christopher W. Fletcher
TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators (Abstract)
Nandeeka Nayak, Toluwanimi O. Odemuyiwa, Shubham Ugare, Christopher W. Fletcher, Michael Pellauer, Joel S. Emer
Sparsity-Aware Communication for Distributed Graph Neural Network Training
Ujjaini Mukhopadhyay, Alok Tripathy, Oguz Selvitopi, Katherine Yelick, Aydin Buluc
2024
Distributed-Memory Randomized Algorithms for Sparse Tensor CP Decomposition
Vivek Bharadwaj, Osman Asif Malik, Riley Murray, Aydin Buluç, James Demmel
RDMA-Based Algorithms for Sparse Matrix Multiplication on GPUs
Benjamin Brock, Aydin Buluç, Katherine A. Yelick
DiffuseLoco: Real-Time Legged Locomotion Control with Diffusion from Offline Datasets
Xiaoyu Huang, Yufeng Chi, Ruofeng Wang, Zhongyu Li, Xue Bin Peng, Yakun Sophia Shao, Borivoje Nikolic, Koushil Sreenath
Zoomie: A Software-like Debugging Tool for FPGAs
Tianrui Wei, Kevin Laeufer, Katie Lim, Jerry Zhao, Koushik Sen, Jonathan Balkind, Krste Asanovic
RTL-Repair: Fast Symbolic Repair of Hardware Design Code
Kevin Laeufer, Brandon Fajardo, Abhik Ahuja, Vighnesh Iyer, Borivoje Nikolić, Koushik Sen
Tenspiler: A Verified Lifting-Based Compiler for Tensor Operations
Jie Qiu, Colin Cai, Sahil Bhatia, Niranjan Hasabnis, Sanjit A. Seshia, Alvin Cheung
Next-Generation Domain-Specific Accelerators: From Hardware to System
Yakun Sophia Shao
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
2023
DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators
Charles Hong, Qijing Huang, Grace Dinh, Mahesh Subedar, Yakun Sophia Shao
AuRORA: Virtualized Accelerator Orchestration for Multi-Tenant Workloads
Seah Kim, Jerry Zhao, Krste Asanovic, Borivoje Nikolic, Yakun Sophia Shao
Distributed Matrix-Based Sampling for Graph Neural Network Training
Alok Tripathy, Katherine Yelick, Aydin Buluc
CholeskyQR with Randomization and Pivoting for Tall Matrices (CQRRPT)
Maksim Melnichenko, Oleg Balabanov, Riley Murray, James Demmel, Michael W. Mahoney, Piotr Luszczek
Scalable Evidential K-Nearest Neighbor Classification on Big Data
James Demmel, Yang You
SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao
Fast multiplication of random dense matrices with fixed sparse matrices
Tianyu Liang, Riley Murray, Aydın Buluç, James Demmel
Surrogate-based Autotuning for Randomized Sketching Algorithms in Regression Problems
Younghyun Cho, James W. Demmel, Michał Dereziński, Haoyun Li, Hengrui Luo, Michael W. Mahoney, Riley J. Murray
Code Transpilation for Hardware Accelerators
Yuto Nishida, Sahil Bhatia, Shadaj Laddad, Hasan Genc, Yakun Sophia Shao, Alvin Cheung
Towards Auto-Generated Data Systems
Alvin Cheung, Maaz Bin Safeer Ahmad, Brandon Haynes, Chanwut Kittivorawong, Shadaj Laddad, Xiaoxuan Liu, Chenglong Wang, Cong Yan
Harnessing the Crowd for Autotuning High-Performance Computing Applications
Younghyun Cho, James W. Demmel, Jacob King, Xiaoye S. Li, Yang Liu, Hengrui Luo
Building Code Transpilers for Domain-Specific Languages Using Program Synthesis (Experience Paper)
Sahil Bhatia, Sumer Kohli, Sanjit A. Seshia, Alvin Cheung
RoSÉ: A Hardware-Software Co-Simulation Infrastructure Enabling Pre-Silicon Full-Stack Robotics SoC Evaluation
Dima Nikiforov, Shengjun Kris Dong, Chengyi Lux Zhang, Seah Kim, Borivoje Nikolic, Yakun Sophia Shao
CDPU: Co-designing Compression and Decompression Processing Units for Hyperscale Systems
Sagar Karandikar, Aniruddha N. Udipi, Junsun Choi, Joonho Whangbo, Jerry Zhao, Svilen Kanev, Edwin Lim, Jyrki Alakuijala, Vrishab Madduri, Yakun Sophia Shao, Borivoje Nikolic, Krste Asanovic, Parthasarathy Ranganathan
Profiling Hyperscale Big Data Processing
Abraham Gonzalez, Aasheesh Kolli, Samira Manabi Khan, Sihang Liu, Vidushi Dadu, Sagar Karandikar, Jichuan Chang, Krste Asanovic, Parthasarathy Ranganathan
Nearly Optimal Block-Jacobi Preconditioning
James Demmel
An Improved Analysis and Unified Perspective on Deterministic and Randomized Low-Rank Matrix Approximation
James Demmel, Laura Grigori
Silicon Process Technology Constraints for Standardized Vertical Die-to-Die Interconnects
Harrison Liew, Farhana Sheikh, David Kehlet, Borivoje Nikolić
Guest Editorial Introduction to the Special Issue on the 2022 Symposium on VLSI Circuits
Borivoje Nikolic, Mototsugu Hamada
Simulator Independent Coverage for RTL Hardware Languages
Kevin Laeufer, Vighnesh Iyer, David Biancolin, Jonathan Bachrach
ADELT: Transpilation Between Deep Learning Frameworks
Linyuan Gong, Jiayi Wang, Alvin Cheung
Full Stack Optimization of Transformer Inference: A Survey
Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami
Randomized Numerical Linear Algebra: A Perspective on the Field With an Eye to Software
Riley Murray, James Demmel, Michael W. Mahoney, N. Benjamin Erichson, Maksim Melnichenko, Osman Asif Malik, Laura Grigori, Piotr Luszczek, Michał Dereziński, Miles E. Lopes, Tianyu Liang, Hengrui Luo, Jack Dongarra
Fast Exact Leverage Score Sampling from Khatri-Rao Products with Applications to Tensor Decomposition
Vivek Bharadwaj, Osman Asif Malik, Riley Murray, Laura Grigori, Aydin Buluc, James Demmel
2022
Cerberus: A Formal Approach to Secure and Efficient Enclave Memory Sharing
Dayeol Lee, Kevin Cheang, Alexander Thomas, Catherine Lu, Pranav Gaddamadugu, Anjo Vahldiek-Oberwagner, Mona Vij, Dawn Song, Sanjit A. Seshia, Krste Asanović
ML for Analog Design: Good Progress, but More to Do
Borivoje Nikolić
Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly
Giulia Guidi, Gabriel Raulet, Daniel Rokhsar, Leonid Oliker, Katherine Yelick, Aydin Buluç
Hammer: A Modular and Reusable Physical Design Flow Tool (Invited)
Harrison Liew, Daniel Grubb, John Wright, Colin Schmidt, Nayiri Krzysztofowicz, Adam M. Izraelevitz, Edward Wang, Krste Asanovic, Jonathan Bachrach, Borivoje Nikolic
Distributed-Memory Sparse Kernels for Machine Learning
Vivek Bharadwaj, Aydın Buluç, James Demmel
Hybrid Models for Mixed Variables in Bayesian Optimization
Hengrui Luo, Younghyun Cho, James W. Demmel, Xiaoye S. Li, Yang Liu
Learning A Continuous and Reconstructible Latent Space for Hardware Accelerator Design
Qijing Huang, Charles Hong, John Wawrzynek, Mahesh Subedar, Yakun Sophia Shao
2021
BELLA: Berkeley Efficient Long-Read to Long-Read Aligner and Overlapper
Giulia Guidi, Marquita Ellis, Daniel Rokhsar, Katherine Yelick, Aydın Buluç
Gemmini: Enabling Systematic Deep-Learning Architecture Evaluation via Full-Stack Integration
Hasan Genc, Seah Kim, Alon Amid, Ameer Haj-Ali, Vighnesh Iyer, Pranav Prakash, Jerry Zhao, Daniel Grubb, Harrison Liew, Howard Mao, Albert J. Ou, Colin Schmidt, Samuel Steffl, John Charles Wright, Ion Stoica, Jonathan Ragan-Kelley, Krste Asanovic, Borivoje Nikolic, Yakun Sophia Shao
Verifying RISC-V Physical Memory Protection
Kevin Cheang, Cameron Rasmussen, Dayeol Lee, David W. Kohlbrenner, Krste Asanovic, Sanjit A. Seshia
A 16mm² 106.1 GOPS/W Heterogeneous RISC-V Multi-Core Multi-Accelerator SoC in Low-Power 22nm FinFET
Abraham Gonzalez, Jerry Zhao, Ben Korpan, Hasan Genc, Colin Schmidt, John Charles Wright, Ayan Biswas, Alon Amid, Farhana Sheikh, Anton Sorokin, Sirisha Kale, Mani Yalamanchi, Ramya Yarlagadda, Mark Flannigan, Larry Abramowitz, Elad Alon, Yakun Sophia Shao, Krste Asanovic, Borivoje Nikolic
Automated Design of Analog Circuits Using Reinforcement Learning
Keertana Settaluri, Zhaokai Liu, Rishubh Khurana, Arash Mirhaj, Rajeev Jain, Borivoje Nikolic
A Hardware Accelerator for Protocol Buffers
Sagar Karandikar, Chris Leary, Chris Kennelly, Jerry Zhao, Dinesh Parimi, Borivoje Nikolic, Krste Asanovic, Parthasarathy Ranganathan
Scaling Generalized N-Body Problems, A Case Study from Genomics
Marquita Ellis, Aydin Buluc, Katherine Yelick
A PACTful Agenda for Cloud Programming Research (Invited Talk)
Alvin Cheung
An Automated and Process-Portable Generator for Phase-Locked Loop
Zhongkai Wang, Minsoo Choi, Eric Chang, John Charles Wright, Woorham Bae, Sijun Du, Zhaokai Liu, Nathan Narevsky, Colin Schmidt, Ayan Biswas, Borivoje Nikolic, Elad Alon
Accessible, FPGA Resource-Optimized Simulation of Multiclock Systems in FireSim
David Biancolin, Albert Magyar, Sagar Karandikar, Alon Amid, Borivoje Nikolic, Jonathan Bachrach, Krste Asanovic
MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural Networks
Seah Kim, Hasan Genc, Vadim Vadimovich Nikiforov, Krste Asanovic, Borivoje Nikolic, Yakun Sophia Shao
CoSA: Scheduling by Constrained Optimization for Spatial Accelerators
Qijing Huang, Minwoo Kang, Grace Dinh, Thomas Norell, Aravind Kalaiah, James Demmel, John Wawrzynek, Yakun Sophia Shao
Vertically Integrated Computing Labs Using Open-Source Hardware Generators and Cloud-Hosted FPGAs
Alon Amid, Albert J. Ou, Krste Asanovic, Yakun Sophia Shao, Borivoje Nikolic
COBRA: A Framework for Evaluating Compositions of Hardware Branch Predictors
Jerry Zhao, Abraham Gonzalez, Alon Amid, Sagar Karandikar, Krste Asanovic
Memory-Efficient Hardware Performance Counters with Approximate-Counting Algorithms
Jingyi Xu, Sehoon Kim, Borivoje Nikolic
4.3 An Eight-Core 1.44GHz RISC-V Vector Machine in 16nm FinFET
Colin Schmidt, John Charles Wright, Zhongkai Wang, Eric Chang, Albert J. Ou, Woo-Rham Bae, Sean Huang, Anita Flynn, Brian C. Richards, Krste Asanovic, Elad Alon
GPTune: Multitask Learning for Autotuning Exascale Applications
Yang Liu, Wissam M. Sid-Lakhdar, Osni Marques, Xinran Zhu, Chang Meng, James W. Demmel, Xiaoye S. Li