Mask R-CNN

ICCV, pp. 386-397, 2017.

These advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN and Fully Convolutional Network (FCN) frameworks for object detection and semantic segmentation, respectively.


We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in...



  • The vision community has rapidly improved object detection and semantic segmentation results over a short period of time.
  • Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task [23], including the heavily engineered entries from the 2016 competition winner.
  • In parallel to predicting the class and box offset, Mask R-CNN outputs a binary mask for each RoI.
  • We differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition and mask prediction that is applied separately to each RoI.
  • We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1.
  • Mask R-CNN with ResNet-101-FPN backbone outperforms FCIS+++ [21], which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM) [30].
  • Table 2(e), Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction.
  • For keypoint detection that requires finer alignment, RoIAlign shows large gains even with FPN (Table 6).
  • We compare Mask R-CNN to the state-of-the-art COCO bounding-box object detection in Table 3.
  • Even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference.
  • Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of GRMI [17], the winner of the COCO 2016 Detection Challenge.
  • Using ResNeXt-101-FPN, Mask R-CNN further improves results, with a margin of 3.0 points box AP over the best previous single model entry from [31].
  • This gap of Mask R-CNN on box detection is due solely to the benefits of multi-task training.
  • Inference: We train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN [29].
  • Experiments on Human Pose Estimation: We evaluate the person keypoint AP (APkp) using ResNet-50-FPN.
  • Table 4 shows that our result (62.7 APkp) is 0.9 points higher than the COCO 2016 keypoint detection winner [4] that uses a multi-stage processing pipeline.
  • Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks.
  • Though this ResNet-50-FPN backbone has finer strides (e.g., 4 pixels on the finest level), RoIAlign still shows significant improvement over RoIPool and increases APkp by 4.4 points.
  • Given the effectiveness of Mask R-CNN for extracting object bounding boxes, masks, and keypoints, we expect it to be an effective framework for other instance-level tasks.
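
The highlights above credit RoIAlign with large gains over RoIPool wherever fine alignment matters, but they do not spell out the operation. The following is a minimal sketch, assuming the commonly described formulation: the RoI boundaries and bins are kept as floating-point values (no rounding to the feature grid), a few regularly spaced points are sampled per bin with bilinear interpolation, and the samples are averaged. The function names `bilinear_sample` and `roi_align`, the `out_size` and `samples` parameters, and the convention that feature values sit at integer coordinates are all illustrative assumptions, not the authors' implementation.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map (list of rows) at a real-valued (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so the four neighbours stay inside the map (assumed border rule).
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool a RoI into an out_size x out_size grid without quantization.

    roi = (y_start, x_start, y_end, x_end) in feature-map coordinates,
    kept as floats. Each output bin averages samples x samples points.
    """
    y_start, x_start, y_end, x_end = roi
    bin_h = (y_end - y_start) / out_size
    bin_w = (x_end - x_start) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    y = y_start + (i + (sy + 0.5) / samples) * bin_h
                    x = x_start + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out


# A ramp feature map whose value equals its column index.
ramp = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(ramp, (0.0, 0.5, 2.0, 2.5))
```

On the ramp input, the half-pixel offset of the RoI is carried through to the pooled values instead of being rounded away, which is the misalignment that quantizing the RoI to the feature grid (as in RoIPool) would introduce.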