Mask RCNN (Faster RCNN + segmentation)

Hongma 2022. 2. 6. 00:41

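There are countless reviews of the Mask R-CNN paper online, so instead of reviewing the paper, this post reviews the model by walking through the code.

For an explanation of the paper itself, please look elsewhere.

Code used in this post: nhm0819/mask_rcnn_prac (github.com)

- It follows the PyTorch tutorial and is based on the Mask R-CNN model in torchvision.

(Download the dataset linked in the README and step through the code in a debugger.)

* The code snippets in this post are pseudocode from which I have removed a large part of the original.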

Summary

First, Mask R-CNN is so similar to Faster R-CNN that it is fair to say only a segmentation capability was added.

The overall flow is as follows.

Source: Getting Started with Mask R-CNN for Instance Segmentation - MATLAB & Simulink - MathWorks Korea

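- The input image first passes through a backbone network made of CNN layers.

- The backbone's output is N feature maps, taken from specific stages along the way (= Feature Pyramid Network).

- Those N feature maps pass through the RPN (Region Proposal Network), which produces proposals for where objects are likely to be.

- Next, the features and the proposals go through the ROI (Region Of Interest) heads together, which output each object's class along with a box and a segmentation mask per class.

- Over the whole process, a loss is computed in the RPN and another in the ROI heads, and both are backpropagated (in the tutorial's training loop they are summed into one loss before backward is called).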

There are three main stages.

1. Image --> Backbone = Features

2. Features --> RPN = Proposals (the stage that decides whether an object exists in each region)

3. Features + Proposals --> ROI = Class, Box, Segmentation

# Forward pass of Mask R-CNN (abridged)
def forward(self, images, targets=None):

    # remember the input sizes so detections can be mapped back in step 4
    original_image_sizes = [img.shape[-2:] for img in images]

    # 0. preprocessing
    images, targets = self.transform(images, targets)

    # 1. Backbone (Images --> Backbone = Features)
    features = self.backbone(images.tensors)

    # 2. RPN (Features --> RPN = Proposals)
    proposals, proposal_losses = self.rpn(images, features, targets)

    # 3. ROI (Features + Proposals --> ROI) -> Class, Boxes, Segmentations
    detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)

    # 4. postprocessing
    detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

    # Loss (RPN, ROI)
    losses = {}
    losses.update(detector_losses)
    losses.update(proposal_losses)

    # training returns the losses, inference returns the detections
    return losses if self.training else detections

First, run tv-training-code.py in debug mode with breakpoints set.

If you trace the model declaration, you arrive at:

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

This code uses a Mask R-CNN model with a ResNet-50 backbone.

0. Preprocessing

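Preprocessing consists mainly of Normalize and Resize.

The normalization is ordinary per-channel normalization.

The resize is called shortest-edge resize, and it is also described in the paper.

The image is scaled, keeping its aspect ratio, so that the shorter side becomes 800 (default), unless that would make the longer side exceed 1333 (default), in which case the longer side becomes 1333.

After that, the image is enlarged again so that its height and width are multiples of 32. (The empty area is filled with 0.)

This is because the backbone downsamples by 1/2 five times (2^5 = 32).

(When the image is resized, the boxes and masks are resized along with it.)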
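The resize and padding arithmetic described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `resized_shape` and `padded_shape` are hypothetical, and rounding the resized size down is my assumption, not something taken from the torchvision source.

```python
import math
from fractions import Fraction


def resized_shape(height, width, min_size=800, max_size=1333):
    """Shortest-edge resize: scale so the short side reaches min_size,
    unless that would push the long side past max_size."""
    scale = min(Fraction(min_size, min(height, width)),
                Fraction(max_size, max(height, width)))
    # Assumption: fractional pixel sizes are rounded down.
    return math.floor(height * scale), math.floor(width * scale)


def padded_shape(height, width, size_divisible=32):
    """Round each side up to the next multiple of size_divisible."""
    def round_up(value):
        return -(-value // size_divisible) * size_divisible
    return round_up(height), round_up(width)


h, w = resized_shape(600, 800)
print((h, w), "->", padded_shape(h, w))  # (800, 1066) -> (800, 1088)
```

A 600 x 800 image is scaled by 4/3 (the short side reaches 800 before the long side reaches 1333), and the width is then padded from 1066 up to 1088. For a very wide image such as 1000 x 3000, the 1333 limit on the long side is what decides the scale.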

# generalized_rcnn.py - line 77
images, targets = self.transform(images, targets)

# transform.py - resize so the short side is 800 and the long side is at most 1333
image, target = _resize_image_and_masks(image, size, float(self.max_size), target, self.fixed_size)

# transform.py - zero-pad each image so its height and width are multiples of 32
images = self.batch_images(images, size_divisible=self.size_divisible)


def batch_images(self, images, size_divisible=32):
    if torchvision._is_tracing():
        # batch_images() does not export well to ONNX
        # call _onnx_batch_images() instead
        return self._onnx_batch_images(images, size_divisible)

    max_size = self.max_by_axis([list(img.shape) for img in images])
    stride = float(size_divisible)
    max_size = list(max_size)
    max_size[1] = int(math.ceil(float(max_size[1]) / stride) * stride)
    max_size[2] = int(math.ceil(float(max_size[2]) / stride) * stride)

    batch_shape = [len(images)] + max_size
    batched_imgs = images[0].new_full(batch_shape, 0)
    for img, pad_img in zip(images, batched_imgs):
        pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)

    return batched_imgs

1. Backbone

The backbone is not what sets Mask R-CNN apart, so we will only look at what output it produces.

        # 0. preprocessing
        images, targets = self.transform(images, targets)

        # 1. backbone
        features = self.backbone(images.tensors)
		
        # 2. rpn
        proposals, proposal_losses = self.rpn(images, features, targets)
        
        # 3. roi
        detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
        
        # 4. postprocess
        detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

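As the code above shows, the output of the backbone is features.

features is a collection of feature maps, each with shape [N, C, H, W].

By default, a total of 5 feature maps are extracted.

Of course, the feature map sizes vary somewhat with the size of the input image.

(Entries 0 to 3 each pass through one more conv layer, while the last one is simply the result of a pooling step, but let's not worry about that for now.)

The important point for now is that 5 (default) feature maps are extracted!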
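To get a feel for the five sizes, here is a small sketch that derives them from a padded input of 800 x 1088. The strides (4, 8, 16, 32, 64) and the round-up behaviour are my assumptions about a ResNet-50 + FPN backbone, and `feature_map_sizes` is a hypothetical helper, not a torchvision function.

```python
def feature_map_sizes(height, width, strides=(4, 8, 16, 32, 64)):
    """Spatial size of each feature map for a padded input of height x width."""
    # Assumption: a level rounds up when its stride does not divide the size.
    return [(-(-height // s), -(-width // s)) for s in strides]


for level, size in enumerate(feature_map_sizes(800, 1088)):
    print(level, size)
```

Under these assumptions the sizes come out as 200 x 272, 100 x 136, 50 x 68, 25 x 34 and 13 x 17, which are the sizes used in the anchor count in the RPN section.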

2. RPN

The RPN takes the features as input and produces objectness and pred_bbox_deltas.

It also generates anchors matched to each feature map. (The anchor_generator is omitted here.)

objectness, pred_bbox_deltas = self.head(features)
anchors = self.anchor_generator(images, features)

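import torch.nn as nn
import torch.nn.functional as F

# self.head
class RPNHead(nn.Module):
    def __init__(self, in_channels, num_anchors):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)
        # one objectness logit and four box deltas per anchor
        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)
        self.bbox_pred = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1, stride=1)

    def forward(self, x):
        logits = []
        bbox_reg = []
        # the same head is applied to every feature level
        for feature in x:
            t = F.relu(self.conv(feature))
            logits.append(self.cls_logits(t))
            bbox_reg.append(self.bbox_pred(t))
        return logits, bbox_reg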

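Here objectness has 3 channels and the bbox output has 12 channels.

This is because the default number of anchors per location is 3. (Anchors with 3 aspect ratios = 1:1, 0.5:1, 2:1)

In other words, there are 3 anchors at every pixel of a feature map.

objectness is, as the name says, a value for whether an object is present or not.

Because the presence of an object is judged for every anchor, each level's objectness has shape [N, 3, H, W] for a feature map of size H x W.

pred_bbox_deltas has 3x4=12 channels, since a box has 4 coordinates.

Therefore the number of anchors per image is 3*(200*272 + 100*136 + 50*68 + 25*34 + 13*17) = 217413.

To make the subsequent computation easier, everything is flattened and concatenated into the shapes (217413, 1) and (217413, 4).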
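The anchor count above is easy to verify. A minimal check, where `count_anchors` is a hypothetical helper written just for this post:

```python
def count_anchors(feature_sizes, anchors_per_location=3):
    """Total anchors for one image: every feature-map pixel carries the same number."""
    return anchors_per_location * sum(h * w for h, w in feature_sizes)


feature_sizes = [(200, 272), (100, 136), (50, 68), (25, 34), (13, 17)]
print(count_anchors(feature_sizes))  # 217413
```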

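There is one more important point.

The value produced is pred_bbox_deltas, whereas the value we need is pred_bbox.

This is because pred_bbox_deltas does not hold box coordinates; as the name says, it holds how much the coordinates must be adjusted relative to each anchor.

So, to get the actual box coordinates, one more step is needed to convert pred_bbox_deltas into pred_bbox.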

    def decode_single(self, rel_codes, boxes):

        # boxes (x1, y1, x2, y2) --> widths, heights, ctr_x, ctr_y
        widths = boxes[:, 2] - boxes[:, 0]
        heights = boxes[:, 3] - boxes[:, 1]
        ctr_x = boxes[:, 0] + 0.5 * widths
        ctr_y = boxes[:, 1] + 0.5 * heights

        # rel_codes --> dx, dy, dw, dh
        wx, wy, ww, wh = self.weights
        dx = rel_codes[:, 0::4] / wx
        dy = rel_codes[:, 1::4] / wy
        dw = rel_codes[:, 2::4] / ww
        dh = rel_codes[:, 3::4] / wh

        # shift the anchor center and rescale the anchor size
        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]
        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]
        pred_w = torch.exp(dw) * widths[:, None]
        pred_h = torch.exp(dh) * heights[:, None]

        # center/size --> x1, y1, x2, y2
        pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype, device=pred_w.device) * pred_w
        pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype, device=pred_h.device) * pred_h
        pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)
        return pred_boxes

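Here rel_codes = pred_bbox_deltas and boxes = anchors.

dw and dh are passed through exp and scale the anchor's width and height, while dx and dy are simply multiplied by the anchor's size and added to its center.

The important thing is that, one way or another, this produces x1, y1, x2, y2 coordinates.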
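A worked example on a single anchor may make the decoding concrete. This is a plain-Python sketch of the same arithmetic, assuming all box-coder weights are 1.0; `decode_box` is a hypothetical name used only here.

```python
import math


def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an anchor given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = anchor
    dx, dy, dw, dh = deltas
    width, height = x2 - x1, y2 - y1
    ctr_x, ctr_y = x1 + 0.5 * width, y1 + 0.5 * height

    # dx, dy shift the center in units of the anchor size;
    # dw, dh rescale the size through exp.
    pred_ctr_x = dx * width + ctr_x
    pred_ctr_y = dy * height + ctr_y
    pred_w = math.exp(dw) * width
    pred_h = math.exp(dh) * height

    return (pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
            pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h)


anchor = (0.0, 0.0, 10.0, 20.0)
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))            # the anchor itself
print(decode_box(anchor, (0.1, 0.2, math.log(2.0), 0.0)))  # shifted and twice as wide
```

All-zero deltas return the anchor unchanged. The second call moves the center from (5, 10) to (6, 14) and doubles the width, giving a box close to (-4, 4, 16, 24).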

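And that is not the end..

For each feature level, only the 2000 (default) anchors with the highest objectness logits are kept,

a sigmoid is applied to objectness, ( = scores)

the coordinates are clipped to the image size,

very small boxes are removed,

NMS (Non-Maximum Suppression) is applied with a threshold of 0.7 (default),

and the top 2000 (default) of what remains are selected. (With the batch of 2 images used here, the final output is 2 sets of boxes.)

The boxes produced this way are the proposals. (Judging by the name, something like "boxes proposed as likely to contain an object"..)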
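The NMS step in the list above can be sketched as a greedy loop in plain Python. This is a simplified illustration, not the batched, per-level implementation the library uses; `iou` and `nms` are hypothetical helpers.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.7):
    """Visit boxes by descending score and drop any box that overlaps
    an already kept box by more than iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores, iou_thresh=0.7))  # [0, 1, 2]
print(nms(boxes, scores, iou_thresh=0.5))  # [0, 2]
```

The first two boxes overlap with an IoU of about 0.68, so they both survive at the default threshold of 0.7 but the lower-scoring one is removed at 0.5.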

    def filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level):

        # select top_n boxes independently per level before applying nms
        top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)

        objectness = objectness[batch_idx, top_n_idx]
        levels = levels[batch_idx, top_n_idx]
        proposals = proposals[batch_idx, top_n_idx]

        objectness_prob = torch.sigmoid(objectness)

        final_boxes = []
        final_scores = []
        for boxes, scores, lvl, img_shape in zip(proposals, objectness_prob, levels, image_shapes):
            boxes = box_ops.clip_boxes_to_image(boxes, img_shape)

            # remove small boxes
            keep = box_ops.remove_small_boxes(boxes, self.min_size)
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]

            # remove low scoring boxes
            # use >= for Backwards compatibility
            keep = torch.where(scores >= self.score_thresh)[0]
            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]

            # non-maximum suppression, independently done per level
            keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh)

            # keep only topk scoring predictions
            keep = keep[:self.post_nms_top_n()]
            boxes, scores = boxes[keep], scores[keep]

            final_boxes.append(boxes)
            final_scores.append(scores)
        return final_boxes, final_scores

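Having produced the boxes and scores for objectness, it is now time to compute the loss...

But before the loss, there is yet more to do...

(The part above and this part are more about computational efficiency and accuracy than about the concept of Mask R-CNN, so feel free to move through them quickly.)

The IoU between the target boxes and the anchor boxes is computed.

For each anchor, the closest ground truth (the one with the highest IoU) is chosen as its target.

Then, to avoid the inefficiency of using every anchor in the loss, the anchors are labeled by IoU as follows.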

iou < 0.3 --> background = 0

0.3 <= iou < 0.7 --> between = -1 (ambiguous values are discarded)

0.7 <= iou --> object = 1
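The labeling rule above can be written down directly. A minimal sketch, assuming the IoU values are already computed; `label_anchors` is a hypothetical helper, and it leaves out any extra matching rules the library may apply on top of the two thresholds.

```python
def label_anchors(iou_matrix, low=0.3, high=0.7):
    """iou_matrix[g][a] is the IoU between ground-truth box g and anchor a.
    Returns (labels, matched_gt) with one entry per anchor."""
    labels, matched_gt = [], []
    for a in range(len(iou_matrix[0])):
        column = [row[a] for row in iou_matrix]
        best_gt = max(range(len(column)), key=column.__getitem__)
        best_iou = column[best_gt]
        if best_iou < low:
            labels.append(0)     # background
        elif best_iou < high:
            labels.append(-1)    # between: ignored in the loss
        else:
            labels.append(1)     # object
        matched_gt.append(best_gt)
    return labels, matched_gt


iou_matrix = [
    [0.10, 0.50, 0.80, 0.20],  # ground truth 0 vs. anchors 0..3
    [0.05, 0.20, 0.30, 0.75],  # ground truth 1 vs. anchors 0..3
]
print(label_anchors(iou_matrix))  # ([0, -1, 1, 1], [0, 0, 0, 1])
```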

Earlier, pred_bbox was built from pred_bbox_deltas and the anchors;

this time, target_bbox_deltas is built from the targets and the anchors.

In the code it is named regression_targets rather than target_bbox_deltas.

The paper also refers to this as regression.
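Encoding is the inverse of the decoding step shown earlier: given an anchor and its matched ground-truth box, it yields the deltas the network should predict. A plain-Python sketch, assuming all box-coder weights are 1.0; `encode_box` is a hypothetical name.

```python
import math


def encode_box(gt_box, anchor):
    """Return (dx, dy, dw, dh) that would turn the anchor into gt_box.
    Both boxes are (x1, y1, x2, y2) with positive width and height."""
    a_w, a_h = anchor[2] - anchor[0], anchor[3] - anchor[1]
    a_cx, a_cy = anchor[0] + 0.5 * a_w, anchor[1] + 0.5 * a_h
    g_w, g_h = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    g_cx, g_cy = gt_box[0] + 0.5 * g_w, gt_box[1] + 0.5 * g_h

    dx = (g_cx - a_cx) / a_w      # center shift in units of the anchor width
    dy = (g_cy - a_cy) / a_h      # center shift in units of the anchor height
    dw = math.log(g_w / a_w)      # log of the width ratio
    dh = math.log(g_h / a_h)      # log of the height ratio
    return dx, dy, dw, dh


print(encode_box((-4.0, 4.0, 16.0, 24.0), (0.0, 0.0, 10.0, 20.0)))
```

For this pair the result is approximately (0.1, 0.2, log 2, 0): the ground truth is shifted by a tenth of the anchor width, a fifth of its height, and is twice as wide.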

# pick the closest gt_box for each anchor
# label each anchor as background, between, or object by its iou
labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)

# regression_targets = target_bbox_deltas
regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)

# Loss
loss_objectness, loss_rpn_box_reg = self.compute_loss(
    objectness, pred_bbox_deltas, labels, regression_targets)

When the loss is finally computed, only the positive and the negative (background) samples are used.

box_loss = F.smooth_l1_loss(
    pred_bbox_deltas[sampled_pos_inds],
    regression_targets[sampled_pos_inds],
    beta=1 / 9,
    reduction='sum',
) / (sampled_inds.numel())


objectness_loss = F.binary_cross_entropy_with_logits(
    objectness[sampled_inds], labels[sampled_inds]
)
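For reference, the smooth L1 loss used for the box regression behaves as follows for a single difference with beta = 1/9. This is a scalar sketch of the standard formula, and `smooth_l1` is a hypothetical helper rather than the library call itself.

```python
def smooth_l1(diff, beta=1.0 / 9):
    """Quadratic for |diff| < beta, linear beyond it."""
    abs_diff = abs(diff)
    if abs_diff < beta:
        return 0.5 * abs_diff ** 2 / beta
    return abs_diff - 0.5 * beta


for diff in (0.0, 0.05, 1.0):
    print(diff, smooth_l1(diff))
```

Small errors (below 1/9) are penalized quadratically, while larger errors grow only linearly, which keeps badly matched boxes from dominating the gradient.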

That's half of it done..

3. ROI

The ROI computation needs two inputs: features and proposals (the boxes produced by the RPN).

1) First, during training, sampling is done again, just as above.

2) Then ROI Align,

3) embedding

4) class and box extraction

5) mask extraction

6) Loss

class RoIHeads(nn.Module):
    def forward(self,
                features,      # type: Dict[str, Tensor]
                proposals,     # type: List[Tensor]
                image_shapes,  # type: List[Tuple[int, int]]
                targets=None   # type: Optional[List[Dict[str, Tensor]]]
                ):

        # 1. sampling (default : 512)
        if self.training:
            proposals, matched_idxs, labels, regression_targets = self.select_training_samples(proposals, targets)
        
        # 2. roi align
        box_features = self.box_roi_pool(features, proposals, image_shapes)
        
        # 3. embedding
        #       CNN feature --> Embedding
        # (1024, 256, 7, 7) --> (1024, 1024)
        box_features = self.box_head(box_features)
        
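        # 4. extract class and box values
        class_logits, box_regression = self.box_predictor(box_features)
        
        # 5. extract masks
        if self.has_mask():
            # at inference time, masks are predicted for the detected boxes
            # (`result` is built from the box predictions; omitted in this excerpt)
            mask_proposals = [p["boxes"] for p in result]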
            
            if self.training:
                # during training, only focus on positive boxes
                num_images = len(proposals)
                mask_proposals = []
                pos_matched_idxs = []
                for img_id in range(num_images):
                    pos = torch.where(labels[img_id] > 0)[0]
                    mask_proposals.append(proposals[img_id][pos])
                    pos_matched_idxs.append(matched_idxs[img_id][pos])

            mask_features = self.mask_roi_pool(features, mask_proposals, image_shapes)
            mask_features = self.mask_head(mask_features)
            mask_logits = self.mask_predictor(mask_features)
            
        # 6. Loss
        if self.training:
            
            # cls, box loss
            loss_classifier, loss_box_reg = fastrcnn_loss(
                class_logits, box_regression, labels, regression_targets)
            losses = {
                "loss_classifier": loss_classifier,
                "loss_box_reg": loss_box_reg
            }
            
            # mask loss
            gt_masks = [t["masks"] for t in targets]
            gt_labels = [t["labels"] for t in targets]
            rcnn_loss_mask = maskrcnn_loss(
                mask_logits, mask_proposals,
                gt_masks, gt_labels, pos_matched_idxs)
            loss_mask = {
                "loss_mask": rcnn_loss_mask
            }

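1) The sampling step is similar to the one above, so I will skip it. (The proposals are reduced from 2000 to 512 (default) per image.)

2) ROI Align is one of the important differences between Mask R-CNN and Faster R-CNN.

In code it is not very hard, because torchvision implements it as a function.

However, there are 5 feature maps coming out of the FPN, each with a different size, so ROI Align is run per feature level, which makes the code a little long.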

class MultiScaleRoIAlign(nn.Module):
    def forward(
        self,
        x: Dict[str, Tensor],
        boxes: List[Tensor],
        image_shapes: List[Tuple[int, int]],
    ) -> Tensor:

        ### x = features
        ### boxes = proposals

        # keep only the feature levels this pooler was configured to use
        x_filtered = []
        for k, v in x.items():
            if k in self.featmap_names:
                x_filtered.append(v)
        num_levels = len(x_filtered)
        rois = self.convert_to_roi_format(boxes)
        if self.scales is None:
            self.setup_scales(x_filtered, image_shapes)

        scales = self.scales
        mapper = self.map_levels
        # assign every box to one feature level
        levels = mapper(boxes)

        num_rois = len(rois)
        num_channels = x_filtered[0].shape[1]
        dtype, device = x_filtered[0].dtype, x_filtered[0].device

        # result default shape = [512*2, 256, 7, 7]
        result = torch.zeros(
            (num_rois, num_channels,) + self.output_size,
            dtype=dtype,
            device=device,
        )

        for level, (per_level_feature, scale) in enumerate(zip(x_filtered, scales)):
            idx_in_level = torch.where(levels == level)[0]
            rois_per_level = rois[idx_in_level]

            # ROI Align on the boxes assigned to this level
            result_idx_in_level = roi_align(
                per_level_feature, rois_per_level,
                output_size=self.output_size,
                spatial_scale=scale, sampling_ratio=self.sampling_ratio)

            result[idx_in_level] = result_idx_in_level.to(result.dtype)

        return result

The shape of result is [1024, 256, 7, 7].

1024 = 512 * 2, that is, 2 images * 512 samples.

The default output size is 7 x 7.

The channel count was already set to 256 by the backbone.

This code maps each proposal to a feature level (chosen from the size of the box) and runs ROI Align level by level.
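The mapping from a box to a feature level can be sketched with the heuristic from the FPN paper, where larger boxes go to coarser levels. The constants below (canonical scale 224, canonical level 4, levels 2 to 5) are the defaults I am assuming, and `map_box_to_level` is a hypothetical stand-in for the mapper, not its actual code.

```python
import math


def map_box_to_level(box, k_min=2, k_max=5, canonical_scale=224,
                     canonical_level=4, eps=1e-6):
    """Return the 0-based index of the feature level used for this box.
    The box is (x1, y1, x2, y2) and must have positive area."""
    x1, y1, x2, y2 = box
    scale = math.sqrt((x2 - x1) * (y2 - y1))
    target = math.floor(canonical_level + math.log2(scale / canonical_scale) + eps)
    target = min(max(target, k_min), k_max)  # clamp to the available levels
    return target - k_min


for side in (32, 112, 224, 800):
    print(side, map_box_to_level((0, 0, side, side)))
```

Under these assumptions a 224 x 224 box lands on index 2, halving the side moves it one level finer, and very small or very large boxes are clamped to the first or last level.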

3) The embedding step simply flattens the CNN features and compresses them to 1024 dimensions, so I will skip it as well.

4) Compute the class and the box.

class has as many channels as the number of classes + 1 (background).

box has 4 times that many channels.

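5) Mask extraction is done separately, in an independent network, as in the code above.

The procedure is the same as for extracting cls and box.

However, the mask_logits that come out of the final mask_predictor have a (28, 28) spatial shape (default).

In this run there are 2 classes and 72 positive boxes, so mask_logits has shape (72, 2, 28, 28).

You can think of it as one mask image being output per class.

(The predicted outputs are logits, so after a sigmoid they can be thought of as grayscale images with values between 0 and 1.)

Let's look at the details in the loss.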

6) Loss

The box and class losses are the same as above, so I will skip them and go straight to the mask loss.

def maskrcnn_loss(mask_logits, proposals, gt_masks, gt_labels, mask_matched_idxs):
    # type: (Tensor, List[Tensor], List[Tensor], List[Tensor], List[Tensor]) -> Tensor
    """
    Args:
        proposals (list[BoxList])
        mask_logits (Tensor)
        targets (list[BoxList])

    Return:
        mask_loss (Tensor): scalar tensor containing the loss
    """

    discretization_size = mask_logits.shape[-1]
    labels = [gt_label[idxs] for gt_label, idxs in zip(gt_labels, mask_matched_idxs)]
    mask_targets = [
        project_masks_on_boxes(m, p, i, discretization_size)
        for m, p, i in zip(gt_masks, proposals, mask_matched_idxs)
    ]

    labels = torch.cat(labels, dim=0)
    mask_targets = torch.cat(mask_targets, dim=0)

    # torch.mean (in binary_cross_entropy_with_logits) doesn't
    # accept empty tensors, so handle it separately
    if mask_targets.numel() == 0:
        return mask_logits.sum() * 0

    mask_loss = F.binary_cross_entropy_with_logits(
        mask_logits[torch.arange(labels.shape[0], device=labels.device), labels], mask_targets
    )
    return mask_loss

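The code above calls a function named project_masks_on_boxes, with discretization_size = 28.

It cuts out only the box region of the mask image and turns it into a (28, 28) image. ROI Align is used for the interpolation.

(gt_mask = the mask image at full image size)

Pictured simply, it works like this.

Suppose there is a class called heart. The matched ground-truth mask is cropped to the box (in the code, the positive proposal box).

The result is an image of the rectangle around the heart.

That crop is then resized to (28, 28).

Doing this for every positive box gives targets of shape (72, 28, 28). On the prediction side, mask_logits of shape (72, 2, 28, 28) is indexed with each box's class label, so only the channel of the matched class is compared and the shapes agree.

These images become the target.

With them, the mask loss can be computed.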
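The key detail in maskrcnn_loss is that only the channel belonging to each box's own class enters the loss. The sketch below reproduces that selection with plain lists and a per-pixel binary cross entropy on logits; `bce_with_logits` and `mask_loss` are hypothetical helpers, and the masks are flattened to short pixel lists to keep the example small.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross entropy for one logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mask_loss(mask_logits, labels, mask_targets):
    """mask_logits[b][c] and mask_targets[b] are flat lists of pixels.
    Only the channel of each box's own class contributes to the loss."""
    total, count = 0.0, 0
    for box_logits, label, target in zip(mask_logits, labels, mask_targets):
        for logit, pixel in zip(box_logits[label], target):
            total += bce_with_logits(logit, pixel)
            count += 1
    return total / count if count else 0.0


# one positive box, two classes, two pixels per mask
mask_logits = [[[5.0, 5.0], [0.0, 0.0]]]
labels = [1]
mask_targets = [[1.0, 0.0]]
print(mask_loss(mask_logits, labels, mask_targets))
```

Because the box is labeled as class 1, the large logits in channel 0 are ignored, and the loss is the mean over the two pixels of channel 1 (log 2 in this case, since both logits are 0).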

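* When the target is coordinates (polygon / RLE) rather than a mask image.

In datasets such as COCO, the segmentation value is stored as lists of polygon point coordinates (and as RLE for crowd regions).

The goal is still the same: turn it into a (28, 28) image as above. Only the procedure differs.

Get the box coordinates from the segmentation points, convert polygon or RLE --> mask, and produce the (28, 28) image.

The pycocotools package implements functions for conversions such as polygon to RLE and RLE to mask.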
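To show what an RLE to mask conversion does, here is a plain-Python decoder for the uncompressed COCO-style layout. I am assuming the usual convention: `counts` holds alternating run lengths starting with a run of zeros, and pixels are stored column by column. `decode_uncompressed_rle` is a hypothetical helper for illustration; in practice the pycocotools functions should be used.

```python
def decode_uncompressed_rle(rle):
    """Decode {'size': [height, width], 'counts': [...]} into a 2D list of 0/1."""
    height, width = rle["size"]
    flat = []
    value = 0  # runs alternate between 0 and 1, starting with zeros
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("run lengths do not cover the mask")
    # pixels are stored column by column, so index as col * height + row
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


mask = decode_uncompressed_rle({"size": [2, 3], "counts": [1, 2, 3]})
for row in mask:
    print(row)
```

For this 2 x 3 example the runs are one 0, two 1s and three 0s, read down the columns, which gives the rows [0, 1, 0] and [1, 0, 0].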

4. Postprocess

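Omitted.

Wrapping up

Once you understand everything up to backpropagating the RPN loss and the ROI loss, you can say you understand the overall flow of Mask R-CNN training.

Once you know the training process, the inference process is quick to pick up.

One of the strengths of Mask R-CNN noted in the paper is that the mask branch can easily be converted to other forms.

A typical conversion is keypoint detection. (It is implemented in the code as well.)

It is a somewhat old technique from 2017, but methods based on Faster R-CNN seem unavoidable when studying object detection.

Writing this post let me review it again, which was a good learning experience for me too.

This tutorial code is meant for study; for real use, you can easily use the model through detectron or mmdetection.

I'll stop here.

Next time, I'll post about training Mask R-CNN with detectron2.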