Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
arXiv (2023)
Abstract
Diagnosis in histopathology requires a global analysis of whole slide images
(WSIs), requiring pathologists to compound evidence from different WSI
patches. The gigapixel scale of WSIs poses a challenge for histopathology
multi-modal models. Training multi-modal models for histopathology requires
instruction tuning datasets, which currently contain information for individual
image patches, without a spatial grounding of the concepts within each patch
and without a wider view of the WSI. Therefore, they lack sufficient diagnostic
capacity for histopathology. To bridge this gap, we introduce Quilt-Instruct, a
large-scale dataset of 107,131 histopathology-specific instruction
question/answer pairs, grounded within diagnostically relevant image patches
that make up the WSI. Our dataset is collected by leveraging educational
histopathology videos from YouTube; spatial localization of the narrations is
obtained by automatically extracting the narrators' cursor positions.
Quilt-Instruct supports contextual reasoning by extracting diagnosis and
supporting facts from the entire WSI. Using Quilt-Instruct, we train
Quilt-LLaVA, which can reason beyond the given single image patch, enabling
diagnostic reasoning across patches. To evaluate Quilt-LLaVA, we propose a
comprehensive evaluation dataset created from 985 images and 1283
human-generated question-answer pairs. We also thoroughly evaluate Quilt-LLaVA
using public histopathology datasets, where Quilt-LLaVA significantly
outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and
closed set VQA. Our code, data, and model are publicly accessible at
quilt-llava.github.io.
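
The abstract states that narrations are spatially localized by automatically
extracting the narrators' cursor positions from the videos. As a rough
illustration of that idea (not the paper's actual pipeline), the Python sketch
below estimates a cursor track by frame differencing with OpenCV: since the
slide view is largely static between consecutive frames, a single small moving
blob is a plausible cursor candidate. The function name
`extract_cursor_track` and the threshold values are illustrative assumptions.

```python
# Hypothetical sketch of cursor localization via frame differencing.
# The real Quilt-Instruct extraction is described in the paper; this only
# illustrates the general idea of recovering a narrator's cursor position.
import cv2

def extract_cursor_track(video_path, diff_thresh=25, max_blob_area=400):
    """Return a list of (timestamp_sec, (x, y)) cursor estimates.

    Assumes the slide itself is mostly static between consecutive frames,
    so a lone small moving blob is likely the narrator's cursor.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unknown
    prev = None
    track = []
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diff = cv2.absdiff(gray, prev)
            _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                           cv2.CHAIN_APPROX_SIMPLE)
            # Keep only small blobs: a cursor is tiny compared with
            # whole-slide panning or zooming motion.
            small = [c for c in contours
                     if cv2.contourArea(c) < max_blob_area]
            if len(small) == 1:  # unambiguous single moving blob
                m = cv2.moments(small[0])
                if m["m00"] > 0:
                    x, y = m["m10"] / m["m00"], m["m01"] / m["m00"]
                    track.append((frame_idx / fps, (int(x), int(y))))
        prev = gray
        frame_idx += 1
    cap.release()
    return track
```

The timestamps returned by such a tracker could then be aligned with the
narration transcript, so each spoken description is grounded at a cursor
position within the frame; the alignment step itself is omitted here.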