Use Mask-RCNN to do Object Segmentation

Jenny Ching
5 min read · Apr 6, 2020

For the past two weeks, I’ve been working on computer vision tasks to extract objects from an image or a video frame in order to:

  1. Understand what types of objects appear in the image
  2. Understand where each object is located in the image and how large it is

With those two aspects understood, we can use them as inputs, plus an extra layer, to train a classifier that maps images to our custom taxonomy. Instead of retraining every time we want to update or swap out the taxonomy, we can simply change that extra text layer describing the objects segmented from the images and map it to a new taxonomy.

Using Mask-RCNN

To achieve this, I’ve been searching for papers that come with code implementations that can be plugged into production easily. The most promising algorithm I found is Mask-RCNN, published by Facebook AI Research.

Their official implementation is Detectron2, which comes with multiple capabilities. But we currently prefer tensorflow, so I instead used the tensorflow implementation written by Waleed Abdulla, along with his pretrained mask-RCNN COCO model (trained on 80 classes in total) in hierarchical data format (HDF5). The author’s version was written in tensorflow 1.3, which I forked and upgraded to the latest tensorflow version, 2.1.0 (github link).

You can easily test the implementation by cloning and running this repo: https://github.com/jklife3/maskrcnn-impl, following the README:

git clone https://github.com/jklife3/maskrcnn-impl.git
cd maskrcnn-impl
mkdir images

Add the sample.jpg you want to run under the images/ folder, then:

pip3 install -r requirements.txt
python3 instance_segmentation.py

The code will return results in the following format:

[{'class_name': 'person', 'offset': (0.29248046875, 0.22509765625), 'size_percentage': 0.17861270904541016},  {'class_name': 'chair', 'offset': (0.59912109375, 0.083984375), 'size_percentage': 0.03147125244140625}]

Here class_name is the name of the object detected in the image, offset is the location (center point) of the object within the image, and size_percentage is the fraction of the entire image that the object occupies.
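To make these fields concrete, here is a minimal sketch of how offset and size_percentage could be derived from a boolean instance mask. This is illustrative, not the repo’s actual code; summarize_mask is a name I made up:

```python
import numpy as np

def summarize_mask(mask, class_name):
    """Summarize a boolean instance mask as a center offset and area fraction.

    `mask` is a 2-D boolean array the same size as the image; True marks
    pixels belonging to the detected object.
    """
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    # Center of the mask, normalized to [0, 1] relative to the image size.
    offset = (xs.mean() / w, ys.mean() / h)
    # Fraction of the whole image covered by the object.
    size_percentage = mask.sum() / (h * w)
    return {"class_name": class_name,
            "offset": offset,
            "size_percentage": size_percentage}

# Toy example: a 4x4 image where the object covers the top-left 2x2 block.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
print(summarize_mask(mask, "person"))
# → {'class_name': 'person', 'offset': (0.125, 0.125), 'size_percentage': 0.25}
```

The offsets in the sample output above are fractions of image width and height, which is why they stay between 0 and 1 regardless of image resolution.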

See result of a sample image:

My colleague’s wedding last summer in New York City

Be aware: I tried to replace the keras package with tensorflow.keras but failed, due to differences in functionality between keras.engine and tensorflow.python.keras.engine that translate into tensor shape errors when passing through custom keras layers. Thus, I didn’t touch the keras parts other than upgrading the version.

Origin: R-CNN

Before Mask-RCNN, there were R-CNN, Fast R-CNN, and Faster R-CNN. R-CNN uses Selective Search, which first generates all possible segments based on image color and texture, then uses a greedy algorithm to consolidate similar ones. The approach is intuitive but costly.

Advancement: Fast R-CNN

Fast R-CNN, as an improvement, feeds the input image to the CNN to generate a convolutional feature map only once per image, instead of feeding every proposed region, of which there can be up to 2,000. From the convolutional feature map, it identifies the region proposals, warps them into squares, and uses a RoI pooling layer to reshape them into a fixed size so that they can be fed into a fully connected layer. From the RoI feature vector, a softmax layer predicts the class of the proposed region along with the offset values for the bounding box.
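The RoI pooling step described above can be sketched as follows. This is a naive, single-channel version for illustration only, not the actual Fast R-CNN implementation:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Naive RoI max pooling: crop a region from a feature map and
    max-pool it into a fixed-size output grid, whatever the region's size.

    feature_map: 2-D array (a single channel, for simplicity)
    roi: (y0, x0, y1, x1) in feature-map coordinates
    """
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    out_h, out_w = output_size
    # Split the region into an out_h x out_w grid and take the max of each cell.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            pooled[i, j] = cell.max()
    return pooled

fm = np.arange(36, dtype=float).reshape(6, 6)
print(roi_max_pool(fm, (0, 0, 4, 4)))  # a 4x4 region pooled down to 2x2
```

Whatever the proposal’s size, the output is always the same fixed grid, which is what lets variable-size regions feed into a fixed-size fully connected layer.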

Fast R-CNN

More Advancement: Faster R-CNN

R-CNN and Fast R-CNN still use selective search to find region proposals, which is a slow and time-consuming process. It is also not a learned process, so it makes sense to replace it with an object detection network that learns the region proposals end-to-end.

Similar to Fast R-CNN, the image is provided as input to a convolutional network, which produces a convolutional feature map. Instead of running the selective search algorithm on the feature map to identify the region proposals, a separate network predicts them. The predicted region proposals are then reshaped using a RoI pooling layer, which is then used to classify the image within each proposed region and predict the offset values for the bounding boxes.

Faster R-CNN

Mask R-CNN based on Faster R-CNN

The Mask R-CNN algorithm builds on the Faster R-CNN architecture with two major contributions:

  1. Replacing the ROI Pooling module with a more accurate ROI Align module
  2. Inserting an additional branch out of the ROI Align module

This additional branch accepts the output of the ROI Align and then feeds it into two CONV layers.

The output of the CONV layers is the mask itself.
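A rough tf.keras sketch of such a mask branch follows. The layer sizes and names here are illustrative, not the exact Mask R-CNN head:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_mask_branch(pool_size=14, depth=256, num_classes=80):
    """Sketch of a mask branch: RoI Align output -> CONV layers -> per-class masks.

    Layer sizes are illustrative; the real head's configuration may differ.
    """
    roi_features = tf.keras.Input(shape=(pool_size, pool_size, depth))
    # Two CONV layers over the RoI Align output.
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(roi_features)
    x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    # Upsample, then predict one sigmoid mask per class.
    x = layers.Conv2DTranspose(256, 2, strides=2, activation="relu")(x)
    masks = layers.Conv2D(num_classes, 1, activation="sigmoid")(x)
    return tf.keras.Model(roi_features, masks)

model = build_mask_branch()
print(model.output_shape)  # (None, 28, 28, 80): one mask per class, per RoI
```

The key design point is that the branch predicts a small mask for every class and every RoI; the final mask is selected afterward using the classification branch’s predicted label.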

We can visualize the Mask R-CNN architecture in the following figure:

As we know, the Faster R-CNN/Mask R-CNN architectures leverage a Region Proposal Network (RPN) to generate regions of an image that potentially contain an object.

Each of these regions is ranked based on their “objectness score” (i.e., how likely it is that a given region could potentially contain an object) and then the top N most confident objectness regions are kept.

In the original Faster R-CNN publication, Ren et al. set N=2,000, but in practice we can use a smaller N, such as 300, and still obtain good results.

Each of the selected ROIs goes through three parallel branches of the network:

  1. Label prediction
  2. Bounding box prediction
  3. Mask prediction

During prediction, each of the 300 ROIs goes through non-maximum suppression, and the top 100 detection boxes are kept, resulting in a 4D tensor of 100 x L x 15 x 15, where L is the number of class labels in the dataset and 15 x 15 is the size of each of the L masks.
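The non-maximum suppression step can be sketched as a greedy loop: repeatedly keep the highest-scoring box and drop any remaining box that overlaps it too much. This is a simplified illustration, not the library’s implementation:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (y0, x0, y1, x1)."""
    y0, x0 = max(a[0], b[0]), max(a[1], b[1])
    y1, x1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5, top_k=100):
    """Greedy non-maximum suppression, keeping at most top_k boxes."""
    order = list(np.argsort(scores)[::-1])  # indices, best score first
    keep = []
    while order and len(keep) < top_k:
        best = order.pop(0)
        keep.append(best)
        # Drop any remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the near-duplicate second box is suppressed
```

In practice libraries ship vectorized versions of this (e.g. tf.image.non_max_suppression), but the greedy logic is the same.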

The Mask R-CNN we’re using here was trained on the COCO dataset, which has L=80 classes, so the resulting volume from the mask module of the Mask R-CNN is 100 x 80 x 15 x 15.
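To get one final mask per detection out of that volume, you slice out each detection’s mask for its predicted class and binarize it. A sketch with random stand-in data:

```python
import numpy as np

# Stand-in for the mask module's output: one 15x15 mask per class
# for each of the 100 kept detections (random data for illustration).
num_det, num_classes, mh, mw = 100, 80, 15, 15
mask_volume = np.random.rand(num_det, num_classes, mh, mw)
# Stand-in for the classification branch's predicted label per detection.
pred_classes = np.random.randint(0, num_classes, size=num_det)

# For each detection i, take the mask for its predicted class and
# binarize it with a 0.5 threshold.
masks = mask_volume[np.arange(num_det), pred_classes] > 0.5
print(masks.shape)  # (100, 15, 15): one binary mask per detection
```

The small 15 x 15 masks are then resized to each detection’s bounding box to produce the full-resolution segmentation you see in the sample output above.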

Conclusion

In this post, we’ve covered how to run the Mask R-CNN tensorflow implementation to get object segmentations with object type, location, and size information. We’ve also traced the improvements from intuitively merging region proposals in R-CNN to end-to-end trained pipelines like Faster R-CNN.
