| 1 | +# 🤿 DeepLabv3+ |
| 2 | +[\[Example\]](#example) · [\[Architecture\]](#architecture) · [\[Training Hyperparameters\]](#training-hyperparameters) · [\[Attribution\]](#attribution) · [\[API Reference\]](#api-reference) |
| 3 | + |
| 4 | +[DeepLabv3+](https://arxiv.org/abs/1802.02611) is an architecture designed for semantic segmentation, i.e., per-pixel classification. DeepLabv3+ takes in a feature map from a backbone architecture (e.g., ResNet-101), then outputs a classification for each pixel in the input image. Our implementation is a simple wrapper around [torchvision’s ResNet](https://pytorch.org/vision/stable/models.html#id10) for the backbone and [mmsegmentation’s DeepLabv3+](https://github.com/open-mmlab/mmsegmentation/tree/master/configs/deeplabv3plus) for the head. |
| 5 | + |
| 6 | +## Example |
| 7 | + |
| 8 | +<!--pytest-codeblocks:skip--> |
| 9 | +```python |
| 10 | +from composer.models import ComposerDeepLabV3 |
| 11 | + |
| 12 | +model = ComposerDeepLabV3(num_classes=150, |
| 13 | +                          backbone_arch="resnet101", |
| 14 | +                          is_backbone_pretrained=True, |
| 15 | +                          backbone_url="https://download.pytorch.org/models/resnet101-cd907fc2.pth", |
| 16 | +                          sync_bn=False |
| 17 | +) |
| 18 | +``` |
| 19 | + |
| 20 | +## Architecture |
| 21 | + |
| 22 | +Based on [Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation](https://arxiv.org/abs/1802.02611). |
| 23 | + |
| 24 | +<div align=center> |
| 25 | +<img src="https://storage.googleapis.com/docs.mosaicml.com/images/models/deeplabv3_v2.png" alt="deeplabv3plus" width="650"> |
| 26 | +</div> |
| 27 | + |
| 28 | + |
| 29 | +- **Backbone network**: converts the input image into a feature map. |
| 30 | +    * Usually ResNet-101, with the strided convolutions in stages 3 and 4 replaced by dilated convolutions. |
| 31 | +    * The 3x3 convolutions in stages 3 and 4 use dilations of 2 and 4, respectively, to compensate for the receptive field lost when the strides are removed. |
| 32 | +    * The average pooling and classification layers are not used. |
| 33 | +- **Spatial Pyramid Pooling**: extracts multi-resolution features from the stage 4 backbone feature map. |
| 34 | +    * The backbone feature map is processed by four parallel convolution layers with dilations {1, 12, 24, 36} and kernel sizes {1x1, 3x3, 3x3, 3x3}. |
| 35 | +    * In parallel to the convolutions, the backbone feature map is global average pooled, then bilinearly upsampled to the same spatial dimensions as the feature map. |
| 36 | +    * The outputs of the convolutions and the global average pool are concatenated, then processed by a 1x1 convolution. |
| 37 | +    * The 3x3 convolutions are implemented as depthwise separable convolutions to reduce memory and computation cost. |
| 38 | +- **Decoder**: converts the output of spatial pyramid pooling (SPP) into class predictions with the same spatial dimensions as the input image. |
| 39 | +    * The SPP output is bilinearly upsampled to the same spatial dimensions as the output of the first stage of the backbone network. |
| 40 | +    * A 1x1 convolution is applied to the first-stage activations, and the result is concatenated with the upsampled SPP output. |
| 41 | +    * The concatenation is processed by a 3x3 convolution with dropout, followed by a classification layer. |
| 42 | +    * The predictions are bilinearly upsampled to the same resolution as the input image. |
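The dilation choices in the list above can be checked with a little arithmetic. The sketch below assumes the usual formula for the spatial extent of a dilated kernel, `k + (k - 1) * (d - 1)`, and the standard ResNet layout (a stem with total stride 4, then strides of 1, 2, 2, 2 in stages 1 to 4). The function names are illustrative and are not part of Composer or mmsegmentation.

```python
from typing import Sequence


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by one dilated (atrous) convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def output_stride(stem_stride: int, stage_strides: Sequence[int]) -> int:
    """Total downsampling factor between the input image and the final feature map."""
    total = stem_stride
    for stride in stage_strides:
        total *= stride
    return total


# Assumed standard ResNet: stem stride 4, stage strides (1, 2, 2, 2).
standard = output_stride(4, [1, 2, 2, 2])
# Stages 3 and 4 keep stride 1 once their strided convolutions are dilated instead.
dilated = output_stride(4, [1, 2, 1, 1])

# Extent of the 3x3 backbone kernels (dilations 2 and 4) and the 3x3 pooling branches.
backbone_extents = {d: effective_kernel_size(3, d) for d in (2, 4)}
aspp_extents = {d: effective_kernel_size(3, d) for d in (12, 24, 36)}

print("output stride:", standard, "->", dilated)
print("backbone 3x3 extents:", backbone_extents)
print("pyramid 3x3 extents:", aspp_extents)
```

Under these assumptions the dilated backbone produces a feature map at 1/8 of the input resolution instead of 1/32, while each dilated 3x3 kernel still spans a much larger window than an ordinary 3x3 kernel.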
| 43 | + |
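The pooling and decoder steps described above can be sketched in plain PyTorch. This is not the mmsegmentation head that the implementation wraps: the class names `TinyASPP` and `TinyDecoder` are hypothetical, the channel widths and dropout rate are arbitrary, ordinary 3x3 convolutions stand in for the depthwise separable ones, and normalization and activation layers are omitted to keep the shape flow easy to follow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyASPP(nn.Module):
    """Parallel dilated convolutions plus a global-average-pool branch (simplified)."""

    def __init__(self, in_ch, out_ch=256, dilations=(1, 12, 24, 36)):
        super().__init__()
        self.branches = nn.ModuleList()
        for d in dilations:
            kernel = 1 if d == 1 else 3   # 1x1 for dilation 1, 3x3 otherwise
            pad = 0 if d == 1 else d      # padding == dilation keeps H and W unchanged
            self.branches.append(nn.Conv2d(in_ch, out_ch, kernel, padding=pad, dilation=d))
        # Fuses the conv branches (out_ch each) and the pooled input (in_ch channels).
        self.fuse = nn.Conv2d(out_ch * len(dilations) + in_ch, out_ch, 1)

    def forward(self, x):
        pooled = F.adaptive_avg_pool2d(x, 1)
        pooled = F.interpolate(pooled, size=x.shape[-2:], mode="bilinear", align_corners=False)
        outs = [branch(x) for branch in self.branches] + [pooled]
        return self.fuse(torch.cat(outs, dim=1))


class TinyDecoder(nn.Module):
    """Merges the pyramid output with first-stage features and classifies each pixel."""

    def __init__(self, low_ch, aspp_ch=256, low_proj_ch=48, num_classes=150, p_drop=0.1):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, low_proj_ch, 1)
        self.refine = nn.Conv2d(aspp_ch + low_proj_ch, aspp_ch, 3, padding=1)
        self.dropout = nn.Dropout2d(p_drop)
        self.classifier = nn.Conv2d(aspp_ch, num_classes, 1)

    def forward(self, aspp_out, low_feat, image_size):
        up = F.interpolate(aspp_out, size=low_feat.shape[-2:], mode="bilinear", align_corners=False)
        x = torch.cat([up, self.low_proj(low_feat)], dim=1)
        x = self.dropout(self.refine(x))
        logits = self.classifier(x)
        return F.interpolate(logits, size=image_size, mode="bilinear", align_corners=False)


# Random stand-ins for backbone features of a 64x64 image (small channel counts for speed).
image_size = (64, 64)
low_feat = torch.randn(1, 32, 16, 16)   # first-stage features, stride 4
high_feat = torch.randn(1, 64, 8, 8)    # stage 4 features, stride 8

aspp = TinyASPP(in_ch=64, out_ch=16).eval()
decoder = TinyDecoder(low_ch=32, aspp_ch=16, low_proj_ch=8, num_classes=150).eval()
with torch.no_grad():
    logits = decoder(aspp(high_feat), low_feat, image_size)
pred = logits.argmax(dim=1)             # one class index per pixel

print(tuple(logits.shape), tuple(pred.shape))
```

The logits have one channel per class at the input resolution, and taking the argmax over the class dimension gives the per-pixel class map.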
| 44 | +## Training Hyperparameters |
| 45 | + |
| 46 | +We tested two sets of hyperparameters for DeepLabv3+ trained on the ADE20k dataset. |
| 47 | + |
| 48 | +### Typical ADE20k Model Hyperparameters |
| 49 | + |
| 50 | +```yaml |
| 51 | +model: |
| 52 | +  deeplabv3: |
| 53 | +    initializers: |
| 54 | +      - kaiming_normal |
| 55 | +      - bn_ones |
| 56 | +    num_classes: 150 |
| 57 | +    backbone_arch: resnet101 |
| 58 | +    is_backbone_pretrained: true |
| 59 | +    use_plus: true |
| 60 | +    sync_bn: true |
| 61 | +optimizer: |
| 62 | +  sgd: |
| 63 | +    lr: 0.01 |
| 64 | +    momentum: 0.9 |
| 65 | +    weight_decay: 5.0e-4 |
| 66 | +    dampening: 0 |
| 67 | +    nesterov: false |
| 68 | +schedulers: |
| 69 | +  - polynomial: |
| 70 | +      alpha_f: 0.01 |
| 71 | +      power: 0.9 |
| 72 | +max_duration: 127ep |
| 73 | +train_batch_size: 16 |
| 74 | +precision: amp |
| 75 | +``` |
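To make `alpha_f` and `power` in the config above concrete, the sketch below assumes the polynomial scheduler multiplies the base learning rate by `alpha_f + (1 - alpha_f) * (1 - tau) ** power`, where `tau` is the fraction of training completed. That formula is an assumption about the scheduler and should be checked against Composer's scheduler documentation; `polynomial_lr` is an illustrative name, not a library function.

```python
def polynomial_lr(base_lr, frac_done, alpha_f=0.01, power=0.9):
    """Learning rate after `frac_done` (0.0 to 1.0) of training, under the assumed formula."""
    multiplier = alpha_f + (1.0 - alpha_f) * (1.0 - frac_done) ** power
    return base_lr * multiplier


# Values from the config above: lr=0.01, alpha_f=0.01, power=0.9.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{frac:.2f} of training -> lr = {polynomial_lr(0.01, frac):.6f}")
```

Under this assumption the learning rate starts at the base value of 0.01 and decays to `alpha_f` times that value by the end of training.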
| 76 | + |
| 77 | +| Model | mIoU | Time-to-Train on 8xA100 | |
| 78 | +| --- | --- | --- | |
| 79 | +| ResNet101-DeepLabv3+ | 44.17 +/- 0.17 | 6.385 hr | |
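For readers unfamiliar with the mIoU column, here is a minimal pure-Python sketch of mean intersection-over-union on a toy example. It assumes predictions and labels are flat lists of class indices and that classes absent from both are skipped; the evaluation actually used for the table (for example, how unlabeled pixels are handled) may differ, and `mean_iou` is an illustrative name.

```python
def mean_iou(pred, target, num_classes):
    """Average of per-class intersection / union over classes that appear at least once."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Six "pixels" and three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

# Per-class IoU: class 0 -> 1/3, class 1 -> 2/3, class 2 -> 1/2.
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.3f}")
```

The table reports this quantity as a percentage.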
| 80 | + |
| 81 | +### Composer ADE20k Model Hyperparameters |
| 82 | + |
| 83 | +```yaml |
| 84 | +model: |
| 85 | +  deeplabv3: |
| 86 | +    initializers: |
| 87 | +      - kaiming_normal |
| 88 | +      - bn_ones |
| 89 | +    num_classes: 150 |
| 90 | +    backbone_arch: resnet101 |
| 91 | +    is_backbone_pretrained: true |
| 92 | +    use_plus: true |
| 93 | +    sync_bn: true |
| 94 | +    # New PyTorch pretrained weights |
| 95 | +    backbone_url: https://download.pytorch.org/models/resnet101-cd907fc2.pth |
| 96 | +optimizer: |
| 97 | +  decoupled_sgdw: |
| 98 | +    lr: 0.01 |
| 99 | +    momentum: 0.9 |
| 100 | +    weight_decay: 2.0e-5 |
| 101 | +    dampening: 0 |
| 102 | +    nesterov: false |
| 103 | +schedulers: |
| 104 | +  - cosine_decay: |
| 105 | +      t_max: 1dur |
| 106 | +max_duration: 128ep |
| 107 | +train_batch_size: 32 |
| 108 | +precision: amp |
| 109 | +``` |
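The `cosine_decay` entry in the config above can be illustrated with a short sketch. It assumes the scheduler multiplies the base learning rate by `alpha_f + (1 - alpha_f) * 0.5 * (1 + cos(pi * tau))`, with `tau` the fraction of `t_max` elapsed and a final multiplier `alpha_f` of 0 (the config does not set one). Both the formula and the default are assumptions to be checked against Composer's scheduler documentation; `cosine_lr` is an illustrative name.

```python
import math


def cosine_lr(base_lr, frac_done, alpha_f=0.0):
    """Learning rate after `frac_done` (0.0 to 1.0) of training, under the assumed formula."""
    multiplier = alpha_f + (1.0 - alpha_f) * 0.5 * (1.0 + math.cos(math.pi * frac_done))
    return base_lr * multiplier


# lr=0.01 from the config above; t_max of 1dur means the decay spans the whole run.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{frac:.2f} of training -> lr = {cosine_lr(0.01, frac):.6f}")
```

Under these assumptions the learning rate falls slowly at first, reaches half its base value at the midpoint, and approaches zero at the end of training.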
| 110 | + |
| 111 | +| Model | mIoU | Time-to-Train on 8xA100 | |
| 112 | +| --- | --- | --- | |
| 113 | +| ResNet101-DeepLabv3+ | 45.764 +/- 0.29 | 4.67 hr | |
| 114 | + |
| 115 | +Improvements over the typical hyperparameters: |
| 116 | + |
| 117 | +- New PyTorch pretrained weights |
| 118 | +- Cosine decay learning rate schedule |
| 119 | +- Decoupled weight decay |
| 120 | +- Batch size increased to 32 |
| 121 | +- Weight decay decreased to 2e-5 |
| 122 | + |
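The difference between ordinary and decoupled weight decay can be shown with a single simplified update on one scalar weight. The sketch ignores momentum and assumes the decoupled variant shrinks the weight directly by `(lr / initial_lr) * weight_decay`, independent of the gradient. Whether this matches Composer's `decoupled_sgdw` optimizer exactly is an assumption to verify against the library; both function names are illustrative.

```python
def coupled_sgd_step(w, grad, lr, weight_decay):
    """Ordinary SGD: the decay term is added to the gradient, so it is scaled by lr."""
    return w - lr * (grad + weight_decay * w)


def decoupled_sgd_step(w, grad, lr, weight_decay, initial_lr):
    """Decoupled variant (assumed form): decay is applied to the weight directly."""
    decay = (lr / initial_lr) * weight_decay
    return w * (1.0 - decay) - lr * grad


# With a zero gradient, only the decay term moves the weight.
w = 1.0
coupled = coupled_sgd_step(w, grad=0.0, lr=0.01, weight_decay=5e-4)
decoupled = decoupled_sgd_step(w, grad=0.0, lr=0.01, weight_decay=2e-5, initial_lr=0.01)

print(f"coupled shrink per step:   {w - coupled:.2e}")
print(f"decoupled shrink per step: {w - decoupled:.2e}")
```

Under these assumptions the two `weight_decay` values are not on the same scale: the coupled value is multiplied by the learning rate before it touches the weight, while the decoupled value is not.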
| 123 | +## Attribution |
| 124 | + |
| 125 | +[Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation](https://arxiv.org/abs/1802.02611) by Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, Hartwig Adam |
| 126 | + |
| 127 | +[OpenMMLab Semantic Segmentation Toolbox and Benchmark](https://github.com/open-mmlab/mmsegmentation) |
| 128 | + |
| 129 | +[How to Train State-Of-The-Art Models Using TorchVision’s Latest Primitives](https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/) by Vasilis Vryniotis |
| 130 | + |
| 131 | +## API Reference |
| 132 | + |
| 133 | +```{eval-rst} |
| 134 | +.. autoclass:: composer.models.deeplabv3.ComposerDeepLabV3 |
| 135 | + :noindex: |
| 136 | +``` |