Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
CoRR (2024)
Abstract
Transformer-based large language model (LLM) inference serving is now the
backbone of many cloud services. LLM inference consists of a prefill phase and
a decode phase. However, existing LLM deployment practices often overlook the
distinct characteristics of these phases, leading to significant interference.
To mitigate interference, our insight is to carefully schedule and group
inference requests based on their characteristics. We realize this idea in
TetriInfer through three pillars. First, it partitions prompts into fixed-size
chunks so that the accelerator always runs close to its computation-saturated
limit. Second, it disaggregates prefill and decode instances so each can run
independently. Finally, it uses a smart two-level scheduling algorithm
augmented with predicted resource usage to avoid decode scheduling hotspots.
Results show that TetriInfer improves time-to-first-token (TTFT), job
completion time (JCT), and inference efficiency in terms of performance per
dollar by a large margin, e.g., it uses 38% less resources while lowering
average TTFT and average JCT by 97% and 47%, respectively.
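The abstract describes fixed-size prompt chunking but carries no code. Below is a minimal Python sketch of the idea, not TetriInfer's implementation; the chunk_prompt helper and the 512-token chunk size are illustrative assumptions, since the paper chooses its chunk size specifically to keep the accelerator compute-saturated.

    from typing import List


    def chunk_prompt(token_ids: List[int], chunk_size: int = 512) -> List[List[int]]:
        """Split a prompt's token IDs into fixed-size chunks.

        Feeding prefill in uniform chunks keeps batch shapes regular, so the
        accelerator runs near its compute-saturated limit regardless of how
        long individual prompts are. The final chunk may be short; a real
        scheduler could pad it or pack it with tokens from the next prompt.
        """
        return [token_ids[i:i + chunk_size]
                for i in range(0, len(token_ids), chunk_size)]


    # Example: a 1300-token prompt yields chunks of 512, 512, and 276 tokens.
    print([len(c) for c in chunk_prompt(list(range(1300)))])  # [512, 512, 276]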
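The second and third pillars, prefill/decode disaggregation and two-level scheduling driven by predicted resource usage, can likewise be sketched. The Python below assumes a hypothetical predict_decode_length stub in place of the paper's learned predictor, and the least-predicted-load dispatch policy is one plausible reading of how decode hotspots are avoided, not the paper's exact algorithm.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class DecodeInstance:
        name: str
        predicted_load: int = 0  # sum of predicted decode lengths of in-flight requests


    def predict_decode_length(prompt_tokens: List[int]) -> int:
        # Stub: the paper predicts per-request resource usage with a model;
        # this crude heuristic only keeps the sketch runnable.
        return max(32, len(prompt_tokens) // 2)


    def dispatch(prompt_tokens: List[int], pool: List[DecodeInstance]) -> DecodeInstance:
        # After prefill completes on a separate prefill instance, route the
        # request to the decode instance with the lowest predicted load so
        # no single instance becomes a scheduling hotspot.
        target = min(pool, key=lambda inst: inst.predicted_load)
        target.predicted_load += predict_decode_length(prompt_tokens)
        return target


    # Example: three requests spread across two disaggregated decode instances.
    pool = [DecodeInstance("decode-0"), DecodeInstance("decode-1")]
    for n in (1300, 200, 800):
        chosen = dispatch(list(range(n)), pool)
        print(chosen.name, chosen.predicted_load)

Keeping prefill and decode on separate instance pools means a long, compute-heavy prefill never stalls token-by-token decoding on the same device, which is the interference the abstract targets.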