ICCV, pp. 386-397, 2017.
These advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN and Fully Convolutional Network frameworks for object detection and semantic segmentation, respectively.
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
- The vision community has rapidly improved object detection and semantic segmentation results over a short period of time.
- Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task, including the heavily-engineered entries from the 2016 competition winner.
- In parallel to predicting the class and box offset, Mask R-CNN outputs a binary mask for each RoI.
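The decoupling this implies can be sketched as follows: the mask loss applies a per-pixel sigmoid and takes the binary cross-entropy only on the channel of the ground-truth class, so masks do not compete across classes. A minimal NumPy illustration (function and argument names are ours, not the authors' code):

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """Per-RoI mask loss sketch: per-pixel sigmoid + binary cross-entropy,
    computed only on the mask channel of the ground-truth class.

    mask_logits: (K, m, m) array, one m x m mask of logits per class.
    gt_class:    index of the ground-truth class for this RoI.
    gt_mask:     (m, m) binary ground-truth mask.
    """
    logits = mask_logits[gt_class]        # only this class channel contributes
    p = 1.0 / (1.0 + np.exp(-logits))     # per-pixel sigmoid
    eps = 1e-12
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                     # average over the m*m pixels
```

Because the loss touches a single channel per RoI, class prediction and mask prediction are decoupled, which the paper identifies as key to good instance segmentation results.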
- We differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition and mask prediction that is applied separately to each RoI.
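The structural point is that the backbone runs once per image while the head runs once per RoI. A toy sketch of that control flow (stand-in computations and names of our own; no real network is implemented here):

```python
import numpy as np

def backbone(image):
    """Feature extraction over the entire image (e.g., ResNet-FPN).
    Stand-in: a strided slice acting as a low-resolution feature map."""
    return image[::4, ::4]

def head(roi_feat):
    """Per-RoI head for recognition and mask prediction.
    Stand-in outputs: a scalar score and a fixed-size binary mask."""
    score = float(roi_feat.mean())
    mask = roi_feat > roi_feat.mean()
    return score, mask

def detect(image, rois):
    feat = backbone(image)          # computed once per image
    results = []
    for y1, x1, y2, x2 in rois:     # head applied separately to each RoI
        roi_feat = feat[y1:y2, x1:x2]
        results.append(head(roi_feat))
    return results
```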
- We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1.
- Mask R-CNN with ResNet-101-FPN backbone outperforms FCIS+++, which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM).
- Mask branch ablation (ResNet-50-FPN, Table 2e): a fully convolutional network (FCN) outperforms multi-layer perceptrons (MLP, fully-connected) for mask prediction (AP75: 35.3 for the FCN vs. 32.8 and 32.6 for the MLP variants).
- For keypoint detection that requires finer alignment, RoIAlign shows large gains even with FPN (Table 6).
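RoIAlign's core idea is to avoid the coordinate quantization of RoIPool: bin centers are kept as continuous coordinates and the feature map is bilinearly interpolated at those exact points. A simplified NumPy sketch (one sample per bin; the paper averages four samples per bin, and the helper names are ours):

```python
import numpy as np

def bilinear(feat, y, x):
    """Sample feature map `feat` (H, W) at a continuous point (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, feat.shape[0] - 1), min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=2):
    """Divide the un-quantized RoI into out_size x out_size bins and
    bilinearly sample each bin center -- no rounding anywhere."""
    y1, x1, y2, x2 = roi
    bh, bw = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bh    # continuous bin center
            cx = x1 + (j + 0.5) * bw
            out[i, j] = bilinear(feat, cy, cx)
    return out
```

Because no coordinate is ever rounded, the extracted features stay pixel-aligned with the input, which is why the gains are largest for tasks needing fine alignment such as keypoints.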
- We compare Mask R-CNN to the state-of-the-art COCO bounding-box object detection in Table 3.
- Even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference.
- Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI, the winner of the COCO 2016 Detection Challenge.
- Using ResNeXt-101-FPN, Mask R-CNN further improves results, with a margin of 3.0 points box AP over the best previous single-model entry.
- This gain of Mask R-CNN on box detection comes solely from the benefits of multi-task training.
- Inference: We train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN.
- Experiments on Human Pose Estimation: We evaluate the person keypoint AP (APkp) using ResNet-50-FPN.
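For keypoints, the paper models each keypoint location as a one-hot m x m binary mask trained with a softmax over the m^2 locations. A minimal encode/decode sketch (function names are ours):

```python
import numpy as np

def encode_keypoint(y, x, m):
    """One-hot m x m target: a single foreground pixel at the keypoint."""
    target = np.zeros((m, m))
    target[y, x] = 1.0
    return target

def decode_keypoint(logits):
    """Softmax over all m*m locations, then take the most likely pixel."""
    flat = logits.reshape(-1)
    probs = np.exp(flat - flat.max())
    probs /= probs.sum()
    idx = int(probs.argmax())
    m = logits.shape[0]
    return idx // m, idx % m
```

The softmax over spatial locations encodes the constraint that exactly one pixel is the keypoint, in contrast to the independent per-pixel sigmoids used for segmentation masks.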
- Table 4 shows that our result (62.7 APkp) is 0.9 points higher than the COCO 2016 keypoint detection winner, which uses a multi-stage processing pipeline.
- Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks.
- Though this ResNet-50-FPN backbone has finer strides (e.g., 4 pixels on the finest level), RoIAlign still shows significant improvement over RoIPool and increases APkp by 4.4 points.
- Given the effectiveness of Mask R-CNN for extracting object bounding boxes, masks, and keypoints, we expect it to be an effective framework for other instance-level tasks.