Publications
- [SOSP ‘24] Perseus: Reducing Energy Bloat in Large Model Training (43/248 = 17.3%)
Jae-Won Chung, Yile Gu, Insu Jang, Luoxi Meng, Nikhil Bansal, and Mosharaf Chowdhury
- [SOSP ‘23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (43/229 = 18.8%)
Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury
Paper Slides Bibtex
- [SOSP ‘21] LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism (54/348 = 15.5%) Best Paper Award!
Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, Marco Canini, Dejan Kostić, Youngjin Kwon, Simon Peter, and Emmett Witchel
Paper Bibtex
- [ASPLOS ‘19] Heterogeneous Isolated Execution for Commodity GPUs (74/350 = 21.1%)
Insu Jang, Adrian Tang, Taehoon Kim, Simha Sethumadhavan, and Jaehyuk Huh
Paper Slides Bibtex
Experience
- Fault Tolerant Systems for Distributed ML Training
PyTorch DeepSpeed
The recent trend of growing model sizes forces the use of distributed training across many computing devices. As more devices are used, the probability of failure increases, yet there was no efficient fault-tolerance algorithm for such complicated parallelism. I led the Oobleck project, which supports fast failure recovery in distributed training with hybrid parallelism.
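The recovery idea can be illustrated with a small sketch. This is not Oobleck's actual code; the template table and node names are hypothetical, and the real system handles stage-to-layer mapping, state transfer, and batch redistribution as well. The sketch only shows the core notion of falling back to a precomputed pipeline template when nodes are lost:

```python
# Illustrative sketch (not Oobleck's implementation): recover from a node
# failure by re-instantiating pipelines from precomputed templates.

# Hypothetical templates: node count -> pipeline layouts (nodes per stage).
PIPELINE_TEMPLATES = {
    4: [[1, 1, 1, 1]],          # one 4-stage pipeline
    3: [[1, 1, 1]],             # one 3-stage pipeline
    6: [[1, 1, 1], [1, 1, 1]],  # two 3-stage pipelines
}

def reconfigure(alive_nodes):
    """Pick the largest template that fits the surviving nodes."""
    for size in sorted(PIPELINE_TEMPLATES, reverse=True):
        if size <= len(alive_nodes):
            layout = PIPELINE_TEMPLATES[size]
            # Assign surviving nodes to stages; leftover nodes stay idle.
            it = iter(alive_nodes)
            return [[next(it) for _ in pipe] for pipe in layout]
    raise RuntimeError("not enough nodes to continue training")

nodes = ["n0", "n1", "n2", "n3"]
nodes.remove("n2")             # simulate a node failure
print(reconfigure(nodes))      # [['n0', 'n1', 'n3']]
```

Because the templates are computed ahead of time, recovery is a table lookup plus reassignment rather than a full replanning step, which is what makes the restart fast.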
- Reimplementing Hyperloop
RDMA Infiniband
As part of the LineFS research project, we had to measure Hyperloop's performance; however, its implementation was not open sourced. I built a simulated Hyperloop for comparison. I was not able to fully implement it due to a lack of RDMA functionality; later, RedN introduced the ENABLE verb, which makes a full Hyperloop implementation possible.
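The role the ENABLE verb plays can be modeled in plain Python. This is a conceptual simulation, not real RDMA code: the `QP`, `post`, and `enable` names are invented for illustration. The point it shows is that work requests pre-posted on one queue pair stay dormant until a completion on another queue pair enables them, so a chain of operations can run NIC-to-NIC without CPU involvement:

```python
# Conceptual simulation of RDMA WAIT/ENABLE-style chaining (no real RDMA).
from collections import deque

class QP:
    """Toy queue pair: holds pre-posted work requests (WRs)."""
    def __init__(self, name):
        self.name = name
        self.wrs = deque()   # pre-posted work requests
        self.enabled = 0     # how many WRs are currently allowed to run

    def post(self, wr):
        self.wrs.append(wr)

    def poll(self, log):
        # Execute only as many WRs as have been enabled.
        while self.enabled and self.wrs:
            self.enabled -= 1
            self.wrs.popleft()(log)

def enable(target, count=1):
    """A WR that, when executed, unblocks `count` WRs on `target`."""
    def wr(log):
        target.enabled += count
        target.poll(log)
    return wr

log = []
qp_write = QP("write")
qp_write.post(lambda log: log.append("WRITE payload to replica"))
qp_ctrl = QP("ctrl")
qp_ctrl.post(enable(qp_write))  # chained: ctrl completion enables the write
qp_ctrl.enabled = 1             # kick off the chain
qp_ctrl.poll(log)
print(log)                      # ['WRITE payload to replica']
```

Without an enable-style verb, the host CPU would have to observe the first completion and post the next work request itself, which is exactly the gap that made a faithful Hyperloop reimplementation impossible at the time.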
- Implementing Heterogeneous Trusted Execution Environment
Intel SGX GPU PCIe architecture
HIX extends the protection scope of hardware-based trusted execution environments (TEEs) to heterogeneous computing devices. It builds on two insights: Intel SGX protects data through address translation (TLB entries are not inserted for unauthorized accesses), and modern high-performance device access goes through memory-mapped I/O (MMIO). We therefore extended the protection mechanism to cover MMIO. Only a trusted process called the GPU enclave can access the GPU, and other trusted processes can use the GPU service only through the GPU enclave via encrypted communication.
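The access-control idea can be sketched as a toy model. This is not HIX's implementation: the MMIO address range, class names, and fault type are all hypothetical, and the real mechanism operates in hardware and the page-table walk rather than in software. The sketch only captures the rule that the MMIO mapping to the GPU exists solely in the GPU enclave's translation state, so every other process faults on access:

```python
# Toy model (not HIX itself) of MMIO protection via address translation:
# only the GPU enclave holds a mapping to the GPU's MMIO window.
GPU_MMIO = range(0xF000, 0xF100)   # hypothetical MMIO address window

class AccessFault(Exception):
    pass

class Process:
    def __init__(self, name, is_gpu_enclave=False):
        self.name = name
        # Stand-in for the page table/TLB: the MMIO mapping is installed
        # only for the GPU enclave.
        self.mappings = set(GPU_MMIO) if is_gpu_enclave else set()

    def mmio_write(self, addr, value):
        if addr not in self.mappings:
            # Unauthorized access: no translation exists, so it faults.
            raise AccessFault(f"{self.name}: unmapped MMIO {hex(addr)}")
        return f"{self.name} wrote {value} to {hex(addr)}"

gpu_enclave = Process("gpu_enclave", is_gpu_enclave=True)
print(gpu_enclave.mmio_write(0xF010, 42))   # allowed

try:
    Process("untrusted").mmio_write(0xF010, 42)
except AccessFault as e:
    print("blocked:", e)                    # any other process faults
```

Untrusted code never sees a valid translation for the device window, so the GPU is reachable only through the GPU enclave's encrypted channel.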