Publications

  1. [SOSP ‘23] Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (43/229 = 18.8%)
    Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowdhury
    Paper Slides Bibtex
  2. [SOSP ‘21] LineFS: Efficient SmartNIC Offload of a Distributed File System with Pipeline Parallelism (54/348 = 15.5%) Best Paper Award!
    Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, Marco Canini, Dejan Kostić, Youngjin Kwon, Simon Peter, and Emmett Witchel
    Paper Bibtex
  3. [ASPLOS ‘19] Heterogeneous Isolated Execution for Commodity GPUs (74/350 = 21.1%)
    Insu Jang, Adrian Tang, Taehoon Kim, Simha Sethumadhavan, and Jaehyuk Huh
    Paper Slides Bibtex

Experience

  • Fault Tolerant Systems for Distributed ML Training
    PyTorch DeepSpeed
    The trend of ever-growing model sizes forces training to be distributed across many computing devices. As more devices are used, the probability of a failure rises, yet there was no efficient fault-tolerance mechanism for such complex parallelism. I led the Oobleck project, which supports fast failure recovery in distributed training with hybrid parallelism; a minimal sketch of the pipeline-template idea is shown below.
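
    A minimal sketch of the pipeline-template idea, assuming a simple coin-change-style planner; this is not Oobleck's actual code, and `cover_nodes` and the template sizes are hypothetical:

    ```python
    # Hypothetical sketch (not Oobleck's implementation). Pipeline templates
    # are precomputed for several node counts; after a failure, the surviving
    # nodes are re-covered by a combination of templates so training resumes
    # without restarting from scratch.

    def cover_nodes(available: int, template_sizes: list[int]) -> list[int] | None:
        """Return template node counts summing to `available`, or None if impossible."""
        # Coin-change-style DP: best[n] holds one combination that sums to n.
        best: dict[int, list[int]] = {0: []}
        for n in range(1, available + 1):
            for size in template_sizes:
                if n - size in best:
                    best[n] = best[n - size] + [size]
                    break
        return best.get(available)

    # e.g. 9 surviving nodes with 8-, 5-, and 4-node templates
    # -> one 4-node pipeline plus one 5-node pipeline
    print(cover_nodes(9, [8, 5, 4]))
    ```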

  • Reimplementing Hyperloop
    RDMA InfiniBand
    As part of the LineFS research project, we needed to measure Hyperloop's performance, but its implementation was not open source, so I built a simulated Hyperloop for comparison. I could not implement it fully because the required RDMA functionality was missing; later, RedN introduced the ENABLE verb, which makes a full Hyperloop implementation possible. A toy illustration of such verb chaining is shown below.
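
    A toy, event-driven simulation of ENABLE-style verb chaining, using made-up `WR`/`QP` types; this is not the LineFS or RedN code, only an illustration of why the verb matters: one work request can release pre-posted, disabled work requests on another queue, so a chain of operations runs without CPU involvement.

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class WR:
        op: str                      # "WRITE" or "ENABLE" (toy operation names)
        target: "QP | None" = None   # queue whose head WR an ENABLE releases
        enabled: bool = False        # disabled WRs wait until explicitly enabled

    @dataclass
    class QP:
        name: str
        wrs: list = field(default_factory=list)

    def run(qps: list) -> None:
        # Execute only enabled head work requests; an ENABLE releases the
        # head WR of its target queue, chaining execution with no host CPU.
        progress = True
        while progress:
            progress = False
            for qp in qps:
                while qp.wrs and qp.wrs[0].enabled:
                    wr = qp.wrs.pop(0)
                    print(f"{qp.name}: executed {wr.op}")
                    if wr.op == "ENABLE" and wr.target and wr.target.wrs:
                        wr.target.wrs[0].enabled = True
                    progress = True

    # qp_b's WRITE is pre-posted but disabled; qp_a's ENABLE releases it.
    qp_b = QP("qp_b", [WR("WRITE")])
    qp_a = QP("qp_a", [WR("WRITE", enabled=True), WR("ENABLE", target=qp_b, enabled=True)])
    run([qp_a, qp_b])
    ```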

  • Implementing Heterogeneous Trusted Execution Environment
    Intel SGX GPU PCIe architecture
    HIX extends the protection scope of a hardware-based trusted execution environment (TEE) to heterogeneous computing devices. It builds on two insights: Intel SGX protects data by maneuvering address translation (TLB entries are not inserted for unauthorized accesses), and modern high-performance devices are accessed through memory-mapped I/O (MMIO). We therefore extended the protection mechanism to MMIO: only a trusted process called the GPU enclave can access the GPU, and other trusted processes can use the GPU service only through the GPU enclave via encrypted communication. A toy model of the translation-based check is shown below.
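
    A toy model of the translation-based check, using hypothetical `MMU`/`map_page` names; this is not the HIX implementation, only an illustration of enforcing MMIO protection at address-translation time:

    ```python
    # Toy model (not HIX). A mapping onto a GPU MMIO page is created only
    # for the GPU enclave, so any other process faults instead of ever
    # reaching the device registers.

    GPU_MMIO_PAGES = {0xF000, 0xF001}   # hypothetical physical page numbers

    class MMU:
        def __init__(self, gpu_enclave_pid: int):
            self.gpu_enclave_pid = gpu_enclave_pid
            self.tlb: dict[tuple[int, int], int] = {}   # (pid, vpn) -> ppn

        def map_page(self, pid: int, vpn: int, ppn: int) -> None:
            # SGX-style insight: refuse to install a translation for an
            # unauthorized access, so protection rides on the MMU/TLB path.
            if ppn in GPU_MMIO_PAGES and pid != self.gpu_enclave_pid:
                raise PermissionError(f"pid {pid} may not map MMIO page {ppn:#x}")
            self.tlb[(pid, vpn)] = ppn

    mmu = MMU(gpu_enclave_pid=42)
    mmu.map_page(42, 0x10, 0xF000)       # GPU enclave: translation installed
    try:
        mmu.map_page(7, 0x10, 0xF000)    # any other process: rejected
    except PermissionError as e:
        print(e)
    ```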