PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design
CoRR (2024)
Abstract
Retrieval-augmented generation (RAG) can enhance the generation quality of
large language models (LLMs) by incorporating external token databases.
However, retrievals from large databases can constitute a substantial portion
of the overall generation time, particularly when retrievals are periodically
performed to align the retrieved content with the latest states of generation.
In this paper, we introduce PipeRAG, a novel algorithm-system co-design
approach to reduce generation latency and enhance generation quality. PipeRAG
integrates (1) pipeline parallelism to enable concurrent retrieval and
generation processes, (2) flexible retrieval intervals to maximize the
efficiency of pipeline parallelism, and (3) a performance model to
automatically balance retrieval quality and latency based on the generation
states and underlying hardware. Our evaluation shows that, by combining the
three aforementioned methods, PipeRAG achieves up to 2.6× speedup in
end-to-end generation latency while improving generation quality. These
promising results showcase the effectiveness of co-designing algorithms with
underlying systems, paving the way for the adoption of PipeRAG in future RAG
systems.
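
The pipeline-parallelism idea in point (1) can be illustrated with a minimal sketch. This is not the authors' implementation: `retrieve`, `generate_chunk`, and `pipelined_generate` are hypothetical stand-ins, and the sketch assumes the retrieval for the next chunk is issued from the tokens available before the current chunk is generated (a slightly stale query), so that the search can run while generation proceeds.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve(query_tokens):
    """Hypothetical stand-in for a database search; returns a fake document id."""
    return "doc-for-" + "-".join(query_tokens[-2:])


def generate_chunk(context, retrieved, count, start):
    """Hypothetical stand-in for the LLM; emits `count` placeholder tokens."""
    return [f"t{start + i}" for i in range(count)]


def pipelined_generate(prompt_tokens, total_tokens, interval):
    """Overlap the retrieval for chunk k+1 with the generation of chunk k."""
    tokens = list(prompt_tokens)
    used = []  # retrieval result consumed by each chunk, kept for inspection
    produced = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The first retrieval has nothing to overlap with.
        pending = pool.submit(retrieve, list(tokens))
        while produced < total_tokens:
            retrieved = pending.result()  # wait for the prefetched result
            used.append(retrieved)
            count = min(interval, total_tokens - produced)
            if produced + count < total_tokens:
                # Prefetch for the next chunk using the current (stale) tokens,
                # so the search runs concurrently with generation below.
                pending = pool.submit(retrieve, list(tokens))
            tokens.extend(generate_chunk(tokens, retrieved, count, produced))
            produced += count
    return tokens, used


if __name__ == "__main__":
    out_tokens, out_used = pipelined_generate(["a", "b"], 5, 2)
    print(out_tokens)
    print(out_used)
```

In this sketch the `interval` argument plays the role of the retrieval interval from point (2); choosing it so that one retrieval takes about as long as generating one chunk is the kind of trade-off the abstract assigns to the performance model in point (3).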